
Samuel Mirpuri · Evaluating AI delivery

When not to automate, and what an honest engagement does when the savings don't show up

A clear-eyed account of the workflows that make bad automation candidates, why first principles predict it, and what a vendor actually does when the projected hours never appear.

Most writing about workflow automation is relentlessly optimistic, which is why the most useful thing a vendor can say is rarely said out loud: some of the work in front of us should not be automated, and we can usually tell which before we build anything. The base rates make this unavoidable rather than contrarian. EY found that thirty to fifty percent of initial robotic-process-automation projects fail; Gartner projects that more than forty percent of agentic AI projects will be canceled by the end of 2027; and the MIT NANDA study, reported by Fortune, put the share of generative-AI pilots that deliver no measurable return at ninety-five percent. Those numbers are not three separate failures of engineering. They are, in large part, one repeated failure of selection, the failure to ask whether a given workflow should be automated at all before spending the effort to automate it.

The four kinds of bad candidate

The workflows that resist automation share a recognizable structure, and once you've seen them across enough engagements the pattern stops being a judgment call and becomes close to a calculation. The first kind is the task dominated by judgment and exceptions rather than repetition: a credit decision on a borderline customer, a pricing concession negotiated case by case, a dispute where the right answer depends on facts that only a person knows rather than facts in any record. You can automate the data-gathering that surrounds such a decision, but the decision itself is the work, and a system that handles eighty percent of the cases while routing the genuinely hard twenty percent back to a person hasn't removed the person. It has just changed what they spend their day on.

The second kind is the low-volume task where the engineering never pays back. A workflow that runs forty times a month can be miserable for the person who runs it and still be a poor automation target, because the effort to make an agent reliable across the long tail of how those forty cases differ costs more than the hours it returns. The third kind is the process that changes faster than any system can track, where the rules shift quarterly with a new payer, a new regulator, or a new product line, and the maintenance burden of keeping an agent current consumes the savings the agent was built to deliver. The fourth kind is the decision that genuinely requires accountability a model cannot hold: a sign-off that a named human has to own because a regulator, an auditor, or a court will eventually ask who decided, and "the system did" is not an answer anyone can stand behind.
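The low-volume case comes down to arithmetic you can do before anyone writes a line of agent code. Here is a minimal sketch in Python; the function name and every figure in it (minutes saved per run, build hours, maintenance hours) are invented for illustration, not drawn from any engagement.

```python
from typing import Optional


def payback_months(
    runs_per_month: float,
    minutes_saved_per_run: float,
    build_hours: float,
    maintenance_hours_per_month: float,
) -> Optional[float]:
    """Months until the hours returned cover the build, or None if they never do.

    All inputs are hypothetical estimates from a pre-build diagnostic.
    """
    hours_returned = runs_per_month * minutes_saved_per_run / 60
    net_per_month = hours_returned - maintenance_hours_per_month
    if net_per_month <= 0:
        # Maintenance eats everything the automation returns.
        return None
    return build_hours / net_per_month


# Forty runs a month, thirty minutes saved on each, a 300-hour build,
# and eight hours a month keeping the agent current (all made-up numbers).
print(payback_months(40, 30, 300, 8))
```

With those assumed inputs the task returns twenty hours a month, twelve after maintenance, and the build takes a little over two years to pay back. Raise the maintenance to twenty hours a month and the function returns None, which is the third kind of bad candidate expressed in the same formula.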

Hammer's warning, read literally

There's an older principle underneath all four, and it's the one flowscope already writes from. Michael Hammer's argument in "Reengineering Work: Don't Automate, Obliterate" was that laying technology over a broken process yields a faster broken process, that the work has to be redesigned before it's mechanized, not reproduced in software exactly as it grew up. The failure-mode reading of Hammer is simply the corollary: if a process is broken, and you automate it anyway because automation is what the engagement was sold to deliver, you've built a faster version of something nobody should have been doing in the first place, and you've spent real money making it harder to fix, because now there's a system to unwind on top of the habit.

Bainbridge's irony, which has only gotten worse

The second principle is older still and almost never cited in this market, which is a shame, because it predicts the exact way well-built automations disappoint. Lisanne Bainbridge's 1983 paper "Ironies of Automation" observed that the more of a task you automate, the more critical and the more skill-degraded the residual human role becomes. The operator is left with the hardest cases, the ones the automation couldn't handle, but is now out of practice on exactly those cases, because the routine work that kept their judgment sharp has been taken away. An agent that handles ninety percent of an accounts-payable queue leaves a human responsible for the hardest ten percent while steadily reducing the fluency that made them good at that ten percent. This is not an argument against automation; it's an argument that the residual role has to be designed with as much care as the automated part, and a workflow where the residual role can't be made safe is a workflow you should think hard about before touching.

What an honest engagement does when the savings don't appear

The part vendors avoid naming is what happens after a candidate clears every filter and still goes wrong. Suppose the diagnostic was run, a candidate was chosen, the projected savings were modeled in hours against the real volumes, and then partway through it becomes clear the savings won't appear, because the exception rate is higher than the sampling suggested, or the upstream data is dirtier than anyone admitted, or the rules turn out to change more often than they hold. An honest vendor stops at the point the numbers stop supporting the work, rather than at the end after the build is delivered and the invoice is sent, and they say so in the operator's own terms: these are the hours we expected to return, here is why they won't return, here is what we'd have to spend to chase the remainder, and it isn't worth it. The diagnostic that should have caught it is the same baseline you'd build anyway: the volumes, the exception distribution, the true cost of the residual human role. The discipline is using that baseline to kill a bad candidate before the build rather than to rationalize one after it. This is also why the gap between eighty percent and ninety-nine percent reliability is a selection problem before it's an engineering problem: a workflow that can't clear the last stretch to production-grade reliability is one the diagnostic should have flagged, and chasing it anyway is how a thirty-to-fifty-percent failure rate gets made.
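The baseline described above can be written down as a small calculation, which makes the stopping point harder to argue around. What follows is a sketch under stated assumptions: the field names, the fifty-percent stopping floor, and all of the figures are hypothetical, and a real diagnostic would carry more terms than these.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Baseline:
    """Hypothetical diagnostic baseline; field names are illustrative."""
    cases_per_month: int
    minutes_per_case: float        # manual handling time today
    exception_rate: float          # share of cases routed back to a person
    minutes_per_exception: float   # handling time for a routed-back case
    upkeep_hours_per_month: float  # effort to keep the agent current


def net_hours_returned(b: Baseline) -> float:
    """Monthly hours returned after the residual human role and upkeep."""
    manual_hours = b.cases_per_month * b.minutes_per_case / 60
    residual_hours = (
        b.cases_per_month * b.exception_rate * b.minutes_per_exception / 60
    )
    return manual_hours - residual_hours - b.upkeep_hours_per_month


def should_stop(projected: Baseline, observed: Baseline, floor: float = 0.5) -> bool:
    """True when observed savings fall below `floor` times the projection.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    return net_hours_returned(observed) < floor * net_hours_returned(projected)


# Made-up numbers: sampling suggested 10% exceptions, the build revealed 35%.
projected = Baseline(1200, 10, 0.10, 20, 20)
observed = replace(projected, exception_rate=0.35)
print(should_stop(projected, observed))
```

In this invented case the projection is roughly 140 hours a month and the observed figure is roughly 40, so the check says stop. The point of the sketch is less the threshold than the fact that the same structure serves both conversations: the projection before the build and the admission partway through it.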

Why surfacing a bad candidate early is the aligned move

A reasonable counter is that any vendor who talks itself out of work is forgoing revenue, and that the incentive will always be to build something rather than admit there's nothing worth building. That's true under the old model, where the deliverable is billed hours and a project that ships is a project that pays regardless of whether it returns anything. It stops being true the moment the engagement is structured so the vendor is paid against the outcome rather than the activity, because then a bad candidate built anyway is a loss the vendor absorbs, and the cheapest possible mistake is the one caught before the build starts. The honesty isn't a virtue added on top of the business. It's what the business does when its own interest depends on the savings actually appearing, which is the only version of an AI engagement worth signing, and the version a skeptical operator can forward to a colleague with confidence.

Common questions

Which workflows are bad candidates for automation?
Tasks dominated by judgment and exceptions rather than repetition, low-volume tasks where the engineering never pays back, processes that change faster than a system can track, and decisions that genuinely require accountability a model cannot hold. Automating a broken process produces a faster broken process, so the redesign question comes before the automation question.

What happens in an honest engagement when the projected savings do not materialize?
The diagnostic should catch a bad candidate before anything is built. When it does not and the savings do not appear, the honest move is to stop at the point the numbers stop supporting the work and to say so plainly, rather than to keep building. The base rates make this discipline necessary: EY found that 30 to 50 percent of initial robotic-process-automation projects fail, and Gartner projects that more than 40 percent of agentic-AI projects will be canceled by the end of 2027.