Engineering

Productizing Uncertainty: How We Taught Our AI to Say “I Don’t Know”

Your LLM doesn’t know what it doesn’t know. So we built an AI system at Float that fixes that.

April 22, 2026


Your LLM doesn’t know what it doesn’t know.

Here’s how we fixed that.

We built an AI system at Float that automatically categorizes accounting transactions (GL codes, tax codes, the tedious stuff that eats hours of bookkeeper time every month).

The core insight that makes it work in production isn’t the LLM. It’s knowing when to shut up 🤐

Ask an LLM to classify a transaction and it will always give you an answer. Every time. With complete confidence. It doesn’t matter if the expense could legitimately fall under three different accounts; you’ll get a clean, authoritative-sounding response. Even when you ask it to rate its own uncertainty, the score it gives you isn’t reliably calibrated.

In accounting, that’s not a feature. That’s a liability. A wrong journal entry has to be found, investigated, and corrected. And usually this burden falls on the same person the automation was supposed to help in the first place.

So we don’t let the LLM decide what to automate. We built a separate confidence layer that makes that call instead.

One model per customer

Here’s a wrinkle: every customer has a unique chart of accounts and a unique set of tax codes. There’s no universal label set, and no two businesses share the same one. It’s like a fingerprint. What this means is we don’t have one classification problem; we have one per customer, each with its own label space and its own patterns.

The architecture

The LLM looks at a transaction (merchant, amount, card name, even unstructured receipt data) alongside the customer’s chart of accounts and generates its best prediction. This is what LLMs are genuinely great at: understanding messy, real-world context and mapping it to a plausible category. But if that’s all our system did, we would get sub-par accuracy. The kind of accuracy our users can’t work with.
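As a rough sketch of that first step (all names here are illustrative, not Float’s actual code), the LLM sees the transaction’s fields alongside the customer’s own chart of accounts, and its answer only counts if it names a real account from that chart:

```python
def build_prompt(txn, chart_of_accounts):
    """Assemble the context the LLM sees for one transaction.

    txn: dict with merchant / amount / card fields, plus optional
    unstructured receipt text.
    chart_of_accounts: this customer's own GL account names.
    """
    lines = [
        "Classify this transaction into exactly one account.",
        f"Merchant: {txn['merchant']}",
        f"Amount: {txn['amount']}",
        f"Card: {txn['card']}",
    ]
    if txn.get("receipt"):
        lines.append(f"Receipt text: {txn['receipt']}")
    lines.append("Valid accounts: " + ", ".join(chart_of_accounts))
    return "\n".join(lines)


def parse_prediction(raw_answer, chart_of_accounts):
    """Accept the LLM's answer only if it names an account that
    actually exists in this customer's chart; otherwise signal
    failure so the transaction falls through to a human."""
    answer = raw_answer.strip()
    return answer if answer in chart_of_accounts else None
```

Constraining the output to the customer’s label space is what makes the per-customer framing concrete: the same merchant can legitimately map to different accounts at different businesses.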

The instinct here is to prompt-engineer your way to better accuracy, or fine-tune. If the LLM gets it wrong sometimes, make the LLM better. But that misses the real problem: it’s not that the LLM is wrong too often; it’s that we can’t tell when it’s wrong. Even if you pull logprobs or ask the model to rate its confidence, those signals aren’t reliably calibrated to real-world correctness, so we treat confidence as a separate modelling problem.

And that’s where classical ML comes in.

A per-customer confidence model sits on top of the LLM and asks a simpler question: is the LLM likely to be right here? It’s trained on the customer’s own historical data, learning where the LLM tends to nail it and where it tends to struggle. High confidence → automate. Low confidence → route to the bookkeeper.
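To make the shape of that concrete, here is a deliberately toy stand-in for the confidence layer: it estimates P(LLM is right) from how often the LLM’s past predictions for a merchant matched the bookkeeper’s final coding, smoothed toward a prior when history is thin. A production model would be a proper classifier over much richer features; the point is only that it learns from this customer’s own review history, separately from the LLM.

```python
from collections import defaultdict


class ConfidenceModel:
    """Toy per-customer confidence model (illustrative, not Float's):
    score = smoothed historical agreement rate between the LLM's
    prediction and the bookkeeper's final answer, per merchant."""

    def __init__(self, prior=0.5, prior_weight=2.0):
        self.hits = defaultdict(float)    # times the LLM was right
        self.total = defaultdict(float)   # times we checked
        self.prior = prior                # fallback before any history
        self.prior_weight = prior_weight  # how strongly to trust the prior

    def update(self, merchant, llm_was_correct):
        """Record one bookkeeper-reviewed outcome."""
        self.hits[merchant] += 1.0 if llm_was_correct else 0.0
        self.total[merchant] += 1.0

    def score(self, merchant):
        """Smoothed agreement rate: pulls toward the prior when this
        customer has little history for the merchant."""
        h, n = self.hits[merchant], self.total[merchant]
        return (h + self.prior * self.prior_weight) / (n + self.prior_weight)
```

The key property is that the model answers a yes/no question about the LLM’s reliability, not the classification question itself, so it stays small and retrains cheaply.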

This is called selective abstention. The threshold is calibrated so the system maintains ≥90% precision. When it acts, it aims to be correct at least 9 times out of 10. When it’s uncertain, it gets out of the way.
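Calibrating that threshold is a simple search over held-out data: pick the lowest confidence cutoff whose automated subset still hits the precision target, so you automate as much volume as possible without breaking the guarantee. A minimal sketch (assuming validation pairs of confidence score and whether the LLM was actually right):

```python
def calibrate_threshold(scored_validation, target_precision=0.90):
    """Pick the lowest confidence threshold whose automated subset
    meets the target precision on held-out data.

    scored_validation: list of (confidence, llm_was_correct) pairs.
    Returns a threshold, or None if no cutoff reaches the target.
    """
    # Try each observed confidence as a candidate cutoff, lowest
    # first, to maximize how much gets automated.
    for cutoff in sorted({conf for conf, _ in scored_validation}):
        automated = [ok for conf, ok in scored_validation if conf >= cutoff]
        if automated and sum(automated) / len(automated) >= target_precision:
            return cutoff
    return None
```

Returning `None` when no cutoff reaches the target is the abstention guarantee in miniature: if the system can’t be precise enough, it doesn’t automate at all.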

Importantly, the per-customer model is small and cheap to maintain. It retrains quickly on the customer’s own historical data.

And for cold start with new customers, the system naturally defaults to abstaining: early on it automates only the most obvious cases, then widens the gate as bookkeeper-reviewed transactions build up.
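One way to picture that widening gate (a hypothetical illustration; the 500-transaction knee is an arbitrary choice, not Float’s number) is a threshold that starts near certainty and relaxes toward the calibrated cutoff as reviewed history accumulates:

```python
def effective_threshold(base_threshold, n_reviewed, full_trust_at=500):
    """Cold-start gate: with little reviewed history, demand
    near-perfect confidence; relax linearly toward the calibrated
    base threshold as bookkeeper-reviewed transactions build up."""
    trust = min(n_reviewed / full_trust_at, 1.0)
    return base_threshold + (1.0 - base_threshold) * (1.0 - trust)
```

At zero history the gate sits at 1.0, so only the most obvious cases clear it; by `full_trust_at` reviewed transactions it has converged to the calibrated threshold.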

Why this matters

The transactions the system doesn’t touch are genuinely harder: ambiguous vendors, unusual line items, new account structures. That’s exactly the point. AI handles the predictable volume; bookkeepers focus their expertise where it actually matters.

The temptation in AI product development is to chase automation rate. To automate absolutely everything and show bigger numbers. But every mistake the AI makes becomes a human’s job to find and fix, which often costs more time than doing it manually in the first place.

That’s the product vision behind Autocoder: eliminate manual input from transaction coding entirely. But we think the only viable path there is precision first. If the AI still needs constant verification, you haven’t actually removed work, you’ve just moved it downstream. So we optimize for earning trust upfront, automating only when we’re highly confident, and letting the system gracefully abstain the rest of the time.

The system’s ability to say “I don’t know” is as important as its ability to give the right answer.

The broader pattern

LLM for reasoning. Classical ML for calibration. This generalizes well beyond accounting.

If you’re deploying an LLM for classification where errors are costly, you probably want a confidence model that’s separate from the generation model. LLMs are excellent at understanding context and generating plausible answers. They’re poor at quantifying their own uncertainty. Use each for what it’s actually good at.

LLM for reasoning.

Classical ML for confidence.

Humans for edge cases.


We’re hiring engineers who find this kind of problem interesting. If building ML systems that have to be right (not just impressive) sounds like your thing, check out career openings at Float.


Written by

Alex Roy
