
Productizing Uncertainty: How We Taught Our AI to Say “I Don’t Know”

Your LLM doesn’t know what it doesn’t know. So we built an AI system at Float that fixes that.

April 22, 2026


We built an AI system at Float that automatically categorizes accounting transactions (GL codes, tax codes, the tedious stuff that eats hours of bookkeeper time every month).

The core insight that makes it work in production isn’t the LLM. It’s knowing when to shut up 🤐

Ask an LLM to classify a transaction and it will always give you an answer. Every time. With complete confidence. It doesn’t matter if the expense could legitimately fall under three different accounts; you’ll get a clean, authoritative-sounding response. Even when you ask it to rate its own uncertainty, the score it reports isn’t reliably calibrated.

In accounting, that’s not a feature. That’s a liability. A wrong journal entry has to be found, investigated, and corrected. And usually this burden falls on the same person the automation was supposed to help in the first place.

So we don’t let the LLM decide what to automate. We built a separate confidence layer that makes that call instead.

One model per customer

Here’s a wrinkle: every customer has a unique chart of accounts and a unique set of tax codes. There’s no universal label set, and no two businesses share the same one. It’s like a fingerprint. This means we don’t have one classification problem; we have one per customer, each with its own label space and its own patterns.

The architecture

The LLM looks at a transaction (merchant, amount, card name, even unstructured receipt data) alongside the customer’s chart of accounts and generates its best prediction. This is what LLMs are genuinely great at: understanding messy, real-world context and mapping it to a plausible category. But if that’s all our system did, we would get sub-par accuracy. The kind of accuracy our users can’t work with.
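To make the per-customer label space concrete, here is a minimal sketch of how a transaction and a chart of accounts might be assembled into a prompt, with the model’s answer checked against that customer’s own codes. The `Transaction` fields, the prompt wording, the sample chart, and the `build_prompt` and `parse_prediction` helpers are all illustrative assumptions rather than Float’s actual implementation, and the LLM call itself is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    # Hypothetical fields, loosely following the inputs named in the post.
    merchant: str
    amount: float
    card_name: str
    receipt_text: str = ""


def build_prompt(txn: Transaction, chart_of_accounts: dict[str, str]) -> str:
    """Render one transaction plus this customer's GL codes as a prompt."""
    accounts = "\n".join(
        f"{code}: {name}" for code, name in sorted(chart_of_accounts.items())
    )
    return (
        "Pick the single best GL code for this transaction.\n"
        f"Merchant: {txn.merchant}\n"
        f"Amount: {txn.amount:.2f}\n"
        f"Card: {txn.card_name}\n"
        f"Receipt: {txn.receipt_text or '(none)'}\n"
        f"Allowed GL codes:\n{accounts}\n"
        "Answer with the code only."
    )


def parse_prediction(raw_answer: str, chart_of_accounts: dict[str, str]) -> Optional[str]:
    """Return the predicted code, or None if it is not in this customer's chart."""
    code = raw_answer.strip()
    return code if code in chart_of_accounts else None


# A made-up chart of accounts for one customer.
chart = {"6100": "Software Subscriptions", "6200": "Travel", "6300": "Meals"}
txn = Transaction("Acme Cloud", 49.0, "Engineering card")
prompt = build_prompt(txn, chart)
```

Because the allowed codes are passed in per call, the same code path serves every customer; only the label set changes.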

The instinct here is to prompt-engineer your way to better accuracy, or fine-tune. If the LLM gets it wrong sometimes, make the LLM better. But that misses the real problem: it’s not that the LLM is wrong too often; it’s that we can’t tell when it’s wrong. Even if you pull logprobs or ask the model to rate its confidence, those signals aren’t reliably calibrated to real-world correctness, so we treat confidence as a separate modelling problem.

And that’s where classical ML comes in.

A per-customer confidence model — a logistic regression trained on the customer’s own historical data — sits on top of the LLM and asks a simpler question: is the LLM likely to be right here? It learns where the LLM tends to nail it and where it tends to struggle. High confidence → automate. Low confidence → route to the bookkeeper.
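As a rough sketch of that idea, the snippet below fits a tiny logistic regression whose target is “was the LLM’s prediction kept by the bookkeeper?”. The two features and the sample history are hypothetical (the post does not say which signals the real model uses), and the gradient-descent trainer is written out in plain Python only to keep the example self-contained.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_confidence_model(features, was_llm_correct, lr=0.5, epochs=2000):
    """Fit logistic regression weights (bias stored last) by batch gradient descent."""
    n_features = len(features[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(features, was_llm_correct):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])
            err = p - y
            for j, xi in enumerate(x):
                grad[j] += err * xi
            grad[-1] += err
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad)]
    return weights


def confidence(weights, x) -> float:
    """Estimated probability that the LLM's prediction is correct."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + weights[-1])


# Hypothetical features per past transaction:
#   [LLM prediction matches the code this merchant usually gets (0/1),
#    fraction of this merchant's history coded to the predicted account]
history_x = [[1, 0.9], [1, 0.8], [1, 1.0], [0, 0.1],
             [0, 0.3], [1, 0.4], [0, 0.0], [1, 0.95]]
# Label: did the bookkeeper keep the LLM's prediction (1) or change it (0)?
history_y = [1, 1, 1, 0, 0, 0, 0, 1]

weights = train_confidence_model(history_x, history_y)
familiar = confidence(weights, [1, 0.9])    # merchant with a consistent history
unfamiliar = confidence(weights, [0, 0.1])  # prediction that breaks the pattern
```

The model never predicts a GL code itself. It only scores the LLM’s prediction, which is why one small model per customer can sit on top of a shared LLM.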

This is called selective abstention. The threshold is calibrated so the system maintains ≥90% precision. When it acts, it aims to be correct at least 9 times out of 10. When it’s uncertain, it gets out of the way.
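One way to calibrate such a threshold, sketched here under assumptions, is to rank held-out reviewed transactions by confidence score and take the lowest cutoff at which everything at or above it still meets the precision target. The `pick_threshold` and `decide` helpers and the validation numbers are made up for illustration.

```python
def pick_threshold(scores, was_correct, target_precision=0.90):
    """Lowest score cutoff whose automated set meets the precision target.

    Returns None when no cutoff qualifies, which means abstain on everything.
    """
    ranked = sorted(zip(scores, was_correct), key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for i, (score, ok) in enumerate(ranked, start=1):
        correct += ok
        # Evaluate only at the end of a run of tied scores, since a cutoff
        # cannot separate transactions that share the same score.
        is_last_of_tie = i == len(ranked) or ranked[i][0] != score
        if is_last_of_tie and correct / i >= target_precision:
            best = score
    return best


def decide(score, threshold):
    """Automate only at or above the threshold; otherwise hand off."""
    if threshold is not None and score >= threshold:
        return "automate"
    return "route_to_bookkeeper"


# Made-up validation set: confidence scores and whether the LLM was right.
val_scores = [0.99, 0.97, 0.95, 0.93, 0.90, 0.88, 0.85, 0.80, 0.70, 0.60, 0.40, 0.30]
val_correct = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

threshold = pick_threshold(val_scores, val_correct)
```

With this sample data the cutoff lands at 0.85: the seven transactions at or above it are all correct, while extending the gate to 0.80 or lower drops precision below 90%.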

Importantly, the per-customer model is small and cheap to maintain. It retrains quickly on the customer’s own historical data.

And for cold start with new customers, the system naturally defaults to abstaining: early on it automates only the most obvious cases, then widens the gate as bookkeeper-reviewed transactions build up.
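The post does not say how this cold-start behaviour is achieved. One way a gate can end up conservative with little data, shown here purely as an assumption, is to require a lower confidence bound on precision (rather than the raw observed precision) to clear the target. The sketch uses the Wilson score interval; `may_automate` and its defaults are hypothetical.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.645) -> float:
    """Lower end of the Wilson score interval for a proportion.

    z = 1.645 corresponds to a one-sided 95% bound.
    """
    if total == 0:
        return 0.0
    p = correct / total
    denom = 1.0 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def may_automate(correct: int, total: int, target_precision: float = 0.90) -> bool:
    """Open the gate only when the evidence, not just the ratio, supports the target."""
    return wilson_lower_bound(correct, total) >= target_precision
```

Under these assumptions, 10 correct predictions out of 10 reviewed give a lower bound of about 0.79, so the system still abstains; 30 out of 30 give about 0.92, and the gate opens. No special-case code for new customers is needed.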

Why this matters

The transactions the system doesn’t touch are genuinely harder: ambiguous vendors, unusual line items, new account structures. That’s exactly the point. AI handles the predictable volume; bookkeepers focus their expertise where it actually matters.

The temptation in AI product development is to chase automation rate. To automate absolutely everything, show bigger numbers. But every mistake the AI makes becomes a human’s job to find and fix, which often costs more time than doing it manually in the first place.
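A back-of-the-envelope calculation shows why. Every number below is a hypothetical assumption chosen for illustration (one minute to code a transaction by hand, twelve minutes to find and correct a wrong entry), not a measurement from Float.

```python
def minutes_saved_per_100(coverage, precision, manual_min=1.0, fix_min=12.0):
    """Net bookkeeper minutes saved per 100 transactions versus coding all by hand."""
    automated = 100 * coverage
    saved = automated * manual_min                    # manual coding that no longer happens
    cleanup = automated * (1 - precision) * fix_min   # finding and correcting wrong entries
    return saved - cleanup


# Automate nearly everything at modest precision...
aggressive = minutes_saved_per_100(coverage=0.95, precision=0.85)
# ...versus automate less, but be right when acting.
selective = minutes_saved_per_100(coverage=0.60, precision=0.97)
```

With these assumed costs, the aggressive setting loses about 76 minutes per 100 transactions while the selective one saves about 38. More generally, in this simple model automation only pays off when precision exceeds 1 - manual_min / fix_min, which is roughly 91.7% here.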

That’s the product vision behind Autocoder: eliminate manual input from transaction coding entirely. But we think the only viable path there is precision first. If the AI still needs constant verification, you haven’t actually removed work; you’ve just moved it downstream. So we optimize for earning trust upfront, automating only when we’re highly confident, and letting the system gracefully abstain the rest of the time.

The system’s ability to say “I don’t know” is as important as its ability to give the right answer.

The broader pattern

LLM for reasoning. Classical ML for calibration. This generalizes well beyond accounting.

If you’re deploying an LLM for classification where errors are costly, you probably want a confidence model that’s separate from the generation model. LLMs are excellent at understanding context and generating plausible answers. They’re poor at quantifying their own uncertainty. Use each for what it’s actually good at.

LLM for reasoning.

Classical ML for confidence.

Humans for edge cases.


We’re hiring engineers who find this kind of problem interesting. If building ML systems that have to be right (not just impressive) sounds like your thing, check out career openings at Float.


Written by

Alex Roy
