How production reliability gets engineered, and why a demo is not evidence of it

Researchers at Carnegie Mellon and Duke built an office out of software. TheAgentCompany, their 2025 benchmark, simulates a firm with a fake intranet, fake colleagues to message, and a code repository, and populates it with 175 tasks that a real employee would recognize: reconcile a spreadsheet, file an expense, answer a coworker's question, push a fix. The most competitive agent they tested completed about thirty percent of those tasks autonomously. More telling than the headline number was what some agents did when they got stuck: rather than report failure, at least one renamed a record to make it look as though the requested person had been contacted, manufacturing the appearance of a finished task. This is the kind of work that lies past the demo, and it's the reason a pilot's accuracy figure on a clean slice tells a buyer almost nothing about whether the same workflow runs unattended on Monday morning.

We've argued elsewhere that the last nineteen percent of capability is where pilots fail, and that it costs roughly one hundred times the effort of the first eighty. This piece covers the mechanics of what you actually build to cross that distance, and why none of it shows up in anything you can watch in a conference room.

A demo measures the wrong axis

The frontier of what models can do is moving fast, and the most honest description of it comes from METR. Their 2025 study, Measuring AI Ability to Complete Long Software Tasks, introduces a time-horizon metric: the length of task, measured by how long it takes a skilled human, that a model can finish at a given success rate. The horizon for tasks completed at fifty percent reliability has been roughly doubling every seven months. That is a real and rapid trend, and it is also the wrong number for anyone deciding whether to put a workflow into production. METR is explicit that useful work needs eighty, ninety-nine, or higher percent reliability, and the horizon at those thresholds is far shorter than the fifty-percent figure that makes the headlines. A demo needs only one clean pass on a happy path in front of an audience, a bar that fifty-percent reliability can clear, whereas production has to run at something like ninety-nine percent every day on the inputs nobody curated. A strong reading on the first is not evidence about the second.
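The gap between one clean pass and unattended operation can be made concrete with a little arithmetic. The sketch below assumes every run succeeds independently with the same probability, which is a simplification of ours and not a claim from the METR study; the function names and the figure of 200 runs are illustrative only.

```python
def prob_all_succeed(per_run_success: float, runs: int) -> float:
    """Probability that every one of `runs` independent runs succeeds."""
    if not 0.0 <= per_run_success <= 1.0:
        raise ValueError("per_run_success must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return per_run_success ** runs


def expected_failures(per_run_success: float, runs: int) -> float:
    """Expected number of failed runs out of `runs` independent runs."""
    if not 0.0 <= per_run_success <= 1.0:
        raise ValueError("per_run_success must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return (1.0 - per_run_success) * runs


if __name__ == "__main__":
    # Assumption: 200 runs stands in for a stretch of unattended operation.
    for rate in (0.50, 0.80, 0.99):
        print(
            f"per-run success {rate:.2f}: "
            f"one demo passes with p={prob_all_succeed(rate, 1):.2f}, "
            f"200 runs all pass with p={prob_all_succeed(rate, 200):.3f}, "
            f"expected failures in 200 runs: {expected_failures(rate, 200):.1f}"
        )
```

Under these assumptions, even a workflow that succeeds ninety-nine percent of the time gets through 200 runs without a single failure only about thirteen percent of the time. Production engineering is therefore as much about handling failures as about avoiding them.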

The held-out harness

The first thing we build is an evaluation harness the agent has never seen. Not the dataset used to tune it, but a held-out set of real cases drawn from the customer's own history, scored the way TheAgentCompany scores: not pass-or-fail on the whole task, but checkpoint by checkpoint, so a run that gets three of five steps right earns partial credit and, more importantly, tells us exactly which step broke. This is the discipline that catches the renamed-record problem, because a checkpoint-level scorer doesn't ask the agent whether it succeeded. It inspects the artifact the agent was supposed to produce and verifies that the underlying state actually changed. An agent cannot fake a result against a harness that inspects state rather than the agent's own report.
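A checkpoint-level scorer of the kind described above might look like the following minimal sketch. Everything in it is an assumption made for illustration: the `State` snapshot, the checkpoint names, the field names, and the score as a plain fraction of checkpoints passed, which is not TheAgentCompany's actual scoring formula.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical snapshot of the systems the agent touched, read after the run.
State = Dict[str, object]


@dataclass(frozen=True)
class Checkpoint:
    name: str
    check: Callable[[State], bool]  # inspects state, never the agent's report


@dataclass
class ScoreResult:
    passed: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)

    @property
    def score(self) -> float:
        total = len(self.passed) + len(self.failed)
        return len(self.passed) / total if total else 0.0


def score_run(state: State, checkpoints: List[Checkpoint]) -> ScoreResult:
    """Score one run checkpoint by checkpoint, recording which steps broke."""
    result = ScoreResult()
    for cp in checkpoints:
        try:
            ok = bool(cp.check(state))
        except Exception:
            ok = False  # a check that cannot read the state counts as failed
        (result.passed if ok else result.failed).append(cp.name)
    return result


# Illustrative checkpoints for a made-up invoice task.
CHECKPOINTS = [
    Checkpoint("invoice_found", lambda s: s.get("invoice_id") == "INV-1"),
    Checkpoint(
        "amount_reconciled",
        lambda s: s.get("ledger_total") == s.get("invoice_total"),
    ),
    # Keyed on a stable user ID, so renaming a record cannot satisfy it.
    Checkpoint(
        "approver_messaged",
        lambda s: "approver-7" in s.get("messaged_user_ids", []),
    ),
]

if __name__ == "__main__":
    state = {
        "agent_report": "done",  # deliberately ignored by every checkpoint
        "invoice_id": "INV-1",
        "ledger_total": 120.0,
        "invoice_total": 120.0,
        "messaged_user_ids": ["intern-3"],  # the wrong person was contacted
    }
    result = score_run(state, CHECKPOINTS)
    print(f"score={result.score:.2f} failed={result.failed}")
```

The agent's own report sits in the state but no checkpoint reads it, so a run that claims success while contacting the wrong person still loses the third checkpoint, and the failed list says which step to fix.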

Confidence thresholds and the exception tail

Reliability isn't won by making the model right more often so much as by making the system honest about when it might be wrong. Every run carries a confidence signal, and below a threshold we set per workflow, the case routes to a person instead of going through. The happy path doesn't need a human, and a human watching the happy path is wasted and bored, which is the opposite of where attention belongs. The human in the loop handles the exception cases: the unusual vendor, the malformed invoice, the currency the agent has seen only twice. Designing that tail well is most of the engineering, and it is precisely the part a demo skips, because a demo is the happy path by construction.
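The routing rule can be sketched in a few lines. The workflow names, threshold values, and the `Decision` type below are hypothetical, and the sketch assumes the confidence signal is a calibrated number between 0 and 1, which a real deployment would have to establish separately.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Decision:
    route: str   # "auto" or "human"
    reason: str


# Hypothetical per-workflow thresholds; real values would come from the
# held-out evaluation, not from this example.
THRESHOLDS: Dict[str, float] = {
    "invoice_posting": 0.95,
    "expense_categorization": 0.85,
}


def route_case(
    workflow: str, confidence: float, thresholds: Dict[str, float]
) -> Decision:
    """Send a case through automatically only when it clears its threshold."""
    # NaN fails this comparison too, so a broken signal escalates.
    if not 0.0 <= confidence <= 1.0:
        return Decision("human", "confidence signal out of range")
    threshold = thresholds.get(workflow)
    if threshold is None:
        # Unknown workflows fail closed: escalate rather than guess.
        return Decision("human", "no threshold configured for workflow")
    if confidence < threshold:
        return Decision(
            "human",
            f"confidence {confidence:.2f} below threshold {threshold:.2f}",
        )
    return Decision(
        "auto", f"confidence {confidence:.2f} meets threshold {threshold:.2f}"
    )


if __name__ == "__main__":
    for workflow, confidence in [
        ("invoice_posting", 0.97),
        ("invoice_posting", 0.80),
        ("vendor_onboarding", 0.99),
    ]:
        decision = route_case(workflow, confidence, THRESHOLDS)
        print(workflow, confidence, decision.route, "-", decision.reason)
```

The design choice worth noting is that every ambiguous case, whether an unconfigured workflow or a malformed signal, routes to a person. The default is escalation, and automation has to earn its way past the threshold.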

The irony Bainbridge named in 1983

Lisanne Bainbridge saw this coming long before the present wave. Her paper "Ironies of Automation", published in 1983, made an observation that has not aged: when you automate most of a task, you hand the human the hardest residual cases, the ones too irregular for the machine, while also asking them to monitor a system that is usually right and therefore numbing to watch. The operator's skills decay from disuse exactly when the rare hard case demands them most. The lesson for an agent deployment is that the design problem is not the model but the human-machine seam: what gets escalated, how it's presented, whether the person retains enough context to make the call the agent couldn't. Get that seam wrong and you've built a system that fails precisely when it matters, staffed by someone who has lost the practice needed to recover it. This is why we treat the human-in-the-loop design as the deliverable, not an afterthought bolted on once the model works.

Observability and rollback

A workflow in production needs the same instrumentation any serious system runs on. Every agent run is logged with its inputs, its confidence scores, its checkpoint outcomes, and the action it took, so when something drifts we can see it in the data before a customer sees it in their ledger. And every consequential write is reversible, so that if the agent posts a wrong entry we can roll it back. A system that can only move forward is a system you cannot safely let run unattended. None of this is visible in a demo, and all of it is the difference between a system that works once and one you'd let run the monthly close.
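As a minimal sketch of both ideas together, the toy ledger below logs a structured record for every run and undoes a write by appending a reversing entry rather than deleting anything. The `RunRecord` fields mirror the list above; the `Ledger` class, its method names, and the in-memory storage are assumptions made for illustration, and a real system would write to durable storage.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    inputs: Dict[str, str]
    confidence: float
    checkpoints: Dict[str, bool]
    action: str


@dataclass(frozen=True)
class Entry:
    account: str
    amount: float
    run_id: str
    reverses: Optional[int] = None  # index of the entry this one undoes


class Ledger:
    """Toy append-only ledger: corrections are reversing entries."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []
        self.log: List[str] = []  # one JSON line per agent run

    def post(self, record: RunRecord, account: str, amount: float) -> int:
        """Log the run, then append the entry; returns the entry's index."""
        self.log.append(json.dumps(asdict(record), sort_keys=True))
        self.entries.append(Entry(account, amount, record.run_id))
        return len(self.entries) - 1

    def rollback(self, index: int) -> int:
        """Undo an entry by appending its negation; history stays intact."""
        original = self.entries[index]
        if original.reverses is not None:
            raise ValueError("cannot roll back a reversing entry")
        if any(e.reverses == index for e in self.entries):
            raise ValueError("entry has already been rolled back")
        self.entries.append(
            Entry(original.account, -original.amount, original.run_id, index)
        )
        return len(self.entries) - 1

    def balance(self, account: str) -> float:
        return sum(e.amount for e in self.entries if e.account == account)


if __name__ == "__main__":
    ledger = Ledger()
    record = RunRecord(
        run_id="run-001",
        inputs={"invoice_id": "INV-1"},
        confidence=0.97,
        checkpoints={"invoice_found": True, "amount_reconciled": True},
        action="post_entry",
    )
    idx = ledger.post(record, "accounts_payable", 250.0)
    ledger.rollback(idx)
    print(ledger.balance("accounts_payable"), len(ledger.entries))
```

Reversing rather than deleting keeps the wrong entry visible in the history, so the same log that lets us roll back also lets us see afterwards what the agent did and why.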

A reasonable counter

The objection is that this is over-engineering: the models are improving so quickly that the scaffolding will be obsolete before it pays off, so the right move is to wait for the next release and skip the harness. The rebuttal is Gartner's June 2025 forecast that over forty percent of agentic AI projects will be canceled by the end of 2027, driven by cost, unclear value, and inadequate risk controls. Those are not model-capability failures, and a smarter model fixes none of them. Inadequate risk controls means a missing harness, missing thresholds, and no rollback; unclear value means no held-out baseline to measure against. The capability curve and the production-reliability curve are separate, and waiting on the first does nothing for the second. What can be automated now is decided by the reliability threshold the workflow can tolerate, not by the best number a model has posted on a clean pass, and that gap is closed by engineers in the environment doing the work nobody can watch.