AI agents in production: 4 mistakes we see (and have made) every time

Eight months ago, one night in September 2025, an agent we'd put into production on an internal prototype did a very interesting thing. It sent a confirmation email to a test client, read the reply, interpreted "Okay, confirm!" as "execute everything", and deleted an entire configuration table.

It wasn't a real client. It wasn't a production table. But it was close enough to the real thing that the next morning we sat around a table and made a very honest list of everything we'd done wrong.

That list grew over the next seven months. Meanwhile we put Tuken into production — the events and parking management platform — and we started building BP (the business plan assistant, launching in the coming months). Each of these products added a chapter to that list.

Today we pass it on. Four mistakes. They're not the sexiest. They're the most frequent, the least talked about, and the ones that really make the difference between an agent that runs well and one that produces silent disasters.

Mistake #1 — Giving it too many powers because "we'll manage later"

With the first agent you put into production, whatever framework you use, you're tempted to configure broad permissions: read everywhere, write almost everywhere, access to three or four external APIs, the ability to send emails, the ability to call payment tools in a "test mode" that's actually the same as production.

The logic sounds reasonable: "Let's see what it can do, then we'll restrict."

You never see. You never restrict.

What actually happens is that after two weeks you're used to seeing it work with those powers, you've built flows that take them for granted, and at that point restricting them means breaking them. So you don't restrict. And a month later the agent has access to half the system, nobody remembers exactly which half, and the only person who knew has resigned.

The principle we apply now, on all agents, on Tuken as on BP:

The agent starts with so few permissions that it cannot do its job. Yes, you read that right: it must fail the first time you try it. Then you grant one permission, try again, and leave it on only if it's really needed. Each new permission has an expiration date written down clearly: if on that date nobody remembers why the agent has it, you take it away.
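A minimal sketch of the idea, assuming permissions are plain strings checked before every tool call. The class, the field names and the "config:read" naming scheme are hypothetical, chosen for illustration; they are not taken from any particular framework.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Grant:
    permission: str  # e.g. "config:read" (hypothetical naming scheme)
    reason: str      # why the agent needs it, written at grant time
    expires: date    # last day on which the grant is honoured


class AgentPermissions:
    """Deny-by-default permission set: anything not granted is refused."""

    def __init__(self) -> None:
        self._grants: dict[str, Grant] = {}

    def grant(self, permission: str, reason: str, expires: date) -> None:
        # Refuse grants that nobody bothered to justify.
        if not reason.strip():
            raise ValueError("a grant needs a written reason")
        self._grants[permission] = Grant(permission, reason, expires)

    def is_allowed(self, permission: str, today: date) -> bool:
        g = self._grants.get(permission)
        return g is not None and today <= g.expires

    def expired(self, today: date) -> list[Grant]:
        """Grants up for review: renew with a fresh reason, or remove."""
        return [g for g in self._grants.values() if today > g.expires]


perms = AgentPermissions()
# A fresh agent can do nothing, so its first run is expected to fail.
first_try = perms.is_allowed("config:read", date(2026, 1, 10))
perms.grant("config:read", "needs the pricing table to answer", date(2026, 3, 1))
second_try = perms.is_allowed("config:read", date(2026, 1, 10))
```

Here first_try is False and second_try is True. An expired grant is refused automatically, and expired() gives you the list to go through at review time.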

It's annoying. It costs you an extra day of work per agent. It saves you from a number of disasters that's hard to quantify because (such is the beauty of avoided incidents) they never happen.

Mistake #2 — Nobody knows what changed in the prompt the day before yesterday

Prompts are not code. You know it. I know it. We've all said it to each other.

But then this happens. A colleague changes a line in an agent's system prompt because "it was answering a specific client's question badly". It works better for that client. It works worse for the other five. Nobody notices for three days. By the time we notice, nobody remembers exactly what changed, when, or why.

The subtlest bug of all time is the bug you don't know exists, and that's exactly what unversioned prompts produce. Meanwhile, other things have moved too: the model behind the scenes has changed (the vendor updates it without telling you), the context has changed (new documents indexed, old ones removed), a temperature parameter has been touched. Prompt, model, context, parameters: four things move at once. Metrics worsen in obscure ways.
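One way to make those moving parts visible is to log a fingerprint of the full configuration next to every interaction. What follows is a sketch under assumptions: the RunConfig fields and the helper names are hypothetical, and how you obtain a model identifier or a context snapshot id depends on your vendor and your indexing pipeline.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    prompt_version: str    # e.g. a git tag or commit hash of the prompt
    model_id: str          # the exact model identifier reported by the vendor
    context_snapshot: str  # identifier of the indexed document set
    temperature: float


def fingerprint(cfg: RunConfig) -> str:
    """Short stable hash of the config; it changes if any field changes."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def changed_fields(old: RunConfig, new: RunConfig) -> list[str]:
    """Names of the fields that differ between two configs."""
    a, b = asdict(old), asdict(new)
    return [name for name in a if a[name] != b[name]]


# Illustrative values only.
monday = RunConfig("prompt-v12", "model-x", "index-001", 0.2)
thursday = RunConfig("prompt-v13", "model-x", "index-002", 0.2)
diff = changed_fields(monday, thursday)
```

When a metric drops, comparing the fingerprints before and after tells you whether anything moved, and changed_fields tells you what. In this example diff is ["prompt_version", "context_snapshot"].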

What we do now:

Every prompt lives in a git repo, with a changelog, a version, and a commit message explaining why it was changed. Modifications go through a pull request like any other code, even when "it's just a comma". When we release a new version, an automated battery of tests (a set of questions and expected answers) runs and tells us immediately if something regressed.
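A test battery of this kind can start very small. The sketch below assumes the agent can be called as a function from question to answer, and that "expected answer" means "contains these substrings". Both are simplifications: substring checks are crude, and the fake_agent stand-in and the sample cases are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    question: str
    must_contain: list[str]  # substrings the answer is expected to include


def run_battery(agent: Callable[[str], str], cases: list[Case]) -> list[str]:
    """Return the questions whose answers miss an expected substring."""
    failures = []
    for case in cases:
        answer = agent(case.question).lower()
        if not all(s.lower() in answer for s in case.must_contain):
            failures.append(case.question)
    return failures


# Stand-in for the real agent call (hypothetical): a canned lookup.
def fake_agent(question: str) -> str:
    canned = {"How long is the free trial?": "The free trial lasts 14 days."}
    return canned.get(question, "I don't know.")


CASES = [
    Case("How long is the free trial?", ["14 days"]),
    Case("Can I cancel a booking?", ["cancel"]),
]

failures = run_battery(fake_agent, CASES)
```

Here failures contains only "Can I cancel a booking?". Wired into CI, a non-empty list blocks the pull request, so a regression is caught at review time rather than three days later.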

It's a practice the community calls prompt-ops or, more seriously, LLM-ops. It's what changed our lives on BP, where a single wrong prompt can produce a financial plan that looks real but contains made-up numbers. You can't afford that there.

Mistake #3 — Paying for inference, not measuring retries

This is the subtlest one, because it disguises itself well.

When you put an agent into production, the first metric you watch is cost per API call. Dashboards show it by default, it's easy to graph, it's what everyone talks about. So you monitor it. How much does OpenAI cost us per month, how much Anthropic, how much Mistral.

That number, alone, lies.

The economic truth of an agent in production is: how many times does a real user have to retry an action before getting what they wanted? Call it cost-of-retry, call it abandonment rate, call it what you want. That number contains almost everything.

Real example, from an iteration on Tuken. An assistant agent helped an event manager configure pricing packages. Inference cost: super low, a fraction of a cent per request. Real cost: the manager had to ask the same thing on average 4 times before getting the right configuration. The real cost was the manager's time (a lot) and their trust in the tool (which was collapsing). Inference was irrelevant.

The remedy wasn't changing the model or writing longer prompts. It was measuring retries. Once they were measured, product decisions aligned themselves: less free text where it didn't work, more guided choices where they were needed, a "let me talk to a human" fallback where the agent wasn't reliable.

The KPI: every critical interaction must answer "did this user get where they wanted to go?". If you don't know, you're paying a cost you don't see.
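Measuring retries can be sketched as follows, assuming each attempt is logged with the task the user was pursuing and whether it gave them what they wanted. The Attempt record and its field names are hypothetical, and deciding what counts as "succeeded" (an explicit confirmation, the user no longer rephrasing) is a product decision this code does not make for you.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    user: str
    task: str        # what the user was trying to do
    succeeded: bool  # did this attempt give them what they wanted?


def attempts_per_success(log: list[Attempt]) -> dict[str, float]:
    """Average number of attempts spent per successful outcome, by task.

    A task with no success at all maps to infinity: pure abandonment.
    """
    tries: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for attempt in log:
        tries[attempt.task] += 1
        if attempt.succeeded:
            wins[attempt.task] += 1
    return {
        task: (tries[task] / wins[task] if wins[task] else float("inf"))
        for task in tries
    }


# Invented log: one manager needs four tries to configure a package.
log = [
    Attempt("manager-1", "configure_pricing", False),
    Attempt("manager-1", "configure_pricing", False),
    Attempt("manager-1", "configure_pricing", False),
    Attempt("manager-1", "configure_pricing", True),
    Attempt("manager-2", "export_report", False),
]
report = attempts_per_success(log)
```

In this invented log, report["configure_pricing"] is 4.0 and report["export_report"] is infinite. Graphed per task over time, this is the number that shows where free text is failing.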

Mistake #4 — Measuring the agent, not the product

We come to the worst one, because it's the most seductive.

When you work on an AI agent, you want to measure it. Right. You build dashboards that tell you: median latency, tokens consumed, tool call accuracy, model success rate on a test battery, percentage of fluent responses. They're good numbers. They give you the feeling of being in control.

And they're almost always the wrong measure.

The agent is not the product. It's inside the product. What matters is not "the model answered well". What matters is "the user completed the flow they came to do". They're two different things, and the second is much harder to measure, so we often don't. We measure the first, congratulate ourselves, and then one day notice users aren't renewing and we don't know why.

On Tuken we fell into exactly this trap for the first six weeks. We had beautiful dashboards on the agent: accuracy, latency, intent distribution. All green. Meanwhile, the metric that really told us whether the tool was useful, "did the manager publish the event after the conversation with the agent?", was one we looked at once a week.

Now we look at it before all others. Agent metrics are secondary to product metrics. If the agent is a Nobel-prize model but users don't finish the flow, the agent needs redoing. If the agent is mediocre but users complete the flow, the agent is fine.

Translated into an operational rule: define the product metric before you start building the agent. Not after. Not in parallel. Before. If you can't define a clear product metric, you probably don't need an AI agent — you need a tougher conversation with your team about what you're really building.
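To make "define the product metric" concrete, here is a sketch of a metric like the one above (did the manager publish after talking to the agent?). It assumes a flat event log; the event kinds, the field names and the time window are hypothetical choices, not a description of how Tuken computes it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    user: str
    kind: str     # "conversation_ended" or "event_published" (hypothetical names)
    at: datetime


def completion_rate(events: list[Event], window: timedelta) -> float:
    """Share of conversations followed by a publish from the same user
    within `window` of the conversation ending."""
    publishes = [e for e in events if e.kind == "event_published"]
    conversations = [e for e in events if e.kind == "conversation_ended"]
    if not conversations:
        return 0.0
    completed = sum(
        any(p.user == c.user and c.at <= p.at <= c.at + window for p in publishes)
        for c in conversations
    )
    return completed / len(conversations)


# Invented data: two conversations, only one followed by a publish.
events = [
    Event("manager-1", "conversation_ended", datetime(2026, 3, 2, 10, 0)),
    Event("manager-1", "event_published", datetime(2026, 3, 2, 10, 30)),
    Event("manager-2", "conversation_ended", datetime(2026, 3, 2, 11, 0)),
]
rate = completion_rate(events, window=timedelta(days=1))
```

With this invented data rate is 0.5. The point of writing it down before building the agent is that the two event kinds have to exist in your logs from day one.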

The checklist, distilled

For those in a hurry. Four questions to ask yourself before putting an agent into production.

Permissions. Does the agent start at minimum? Does each permission have an expiration?
Versioning. Is every prompt in git? Does every modification go through a PR? Is there a test battery running on each release?
Cost-of-retry. Do we know how many times a real user has to retry? Do we measure it? Do we graph it?
Product KPIs. The metric that counts is the user completing the flow, not the agent answering well. Have we defined it before starting?

If the answer to a single one of these four questions is "not yet", you're making the most expensive mistake of 2026: keeping in production a system that seems to work but is costing you in ways you don't see.

A small preview

Mistake number two, the unversioned prompt, is only half the story. The other half is that a prompt, even a perfect one, is worth little if the context you pass to it is wrong. What you put in the context, where it comes from, how fresh it is, with what permissions: that's what we call context engineering, and it's the real discipline of 2026.

We'll talk about it on June 11 in a dedicated piece. For now, let's stop here.

On Monday, June 1, we'll go back to the startup pillar instead, with a piece that's been brewing for a while: Founder mode is dead. Long live founder mode. Two years after Paul Graham's famous essay, what's left of it in the Italian startups that survived 2025?

See you Monday.

Got an AI agent that works but you don't quite trust yet? Our artificial intelligence for business service includes operational audits of agents already in production. Half a day, operational output, no PowerPoint slides. Let's talk.