When an operator asks us whether we can automate a given workflow, the honest answer carries a date stamp, because the answer changes every few months, and a vendor who gives you a perennial yes is selling you the answer they wish were true rather than the one the measurements support. The clearest measurement we have of the present is METR's 2025 study, Measuring AI Ability to Complete Long Tasks, which doesn't ask whether models are getting smarter in the abstract but instead measures something an operator can actually use: the length of task a frontier model can complete, where length is the time a skilled human needs to do the same work. The headline finding is that the task length models handle at fifty percent reliability has been doubling roughly every seven months since 2019, and the report notes the recent rate may be faster, closer to every four months across 2024 and 2025. Dated and read carefully, that curve is the most useful thing an operator has for deciding what to automate this quarter and what to wait on.
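To make the doubling arithmetic concrete, here is a minimal sketch of what a seven-month versus a four-month doubling time implies. The function names and the 60-minute starting horizon are our own placeholders for illustration, not figures from the METR report, and the projection assumes the doubling rate holds steady, which is exactly the assumption a date stamp exists to guard against.

```python
import math


def projected_horizon(current_minutes: float, months_ahead: float,
                      doubling_months: float) -> float:
    """Horizon after `months_ahead`, assuming a constant doubling time."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_until(target_minutes: float, current_minutes: float,
                 doubling_months: float) -> float:
    """Months until the horizon reaches `target_minutes` at a constant doubling time."""
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    start = 60.0  # hypothetical current horizon in minutes, not a METR figure
    for doubling in (7.0, 4.0):
        in_a_year = projected_horizon(start, 12.0, doubling)
        wait = months_until(480.0, start, doubling)
        print(f"doubling every {doubling:.0f} months: "
              f"{in_a_year:.0f} min in a year, {wait:.0f} months to an 8-hour task")
```

The two doubling times diverge quickly, which is why a capability claim quoted without its month is hard to act on.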
Two numbers that get conflated
The first discipline the curve forces is separating two quantities that AI conversations routinely smear together: the length of task a model can attempt, and the reliability with which it finishes one. METR reports its horizon at fifty percent reliability, which means the model completes a task of that length about half the time. That is a perfectly good measure of capability and a terrible threshold to put into production, because a process that books the wrong amount to the wrong account half the time is worse than no automation at all. The same study reports that the horizon at eighty percent reliability is meaningfully shorter than at fifty percent, which is the quantitative version of something every operator already knows from experience. The gap between "the model can do this sometimes" and "the model does this every time" is enormous, and it is exactly the gap covered in our piece on the 80-to-99% problem. A model whose fifty-percent horizon now reaches a multi-hour task may still have an eighty-percent horizon that is only a fraction of that length, and the eighty-percent figure is the one that determines whether the work can run.
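One way to see why the two horizons differ is to model success probability as a logistic function of log task length. The sketch below does that under stated assumptions: the logistic form is a modeling choice for this illustration, and the slope `beta` and the 120-minute fifty-percent horizon are hypothetical values, not numbers taken from the study.

```python
import math


def success_probability(task_minutes: float, h50_minutes: float, beta: float) -> float:
    """Modeled chance of completing a task of the given human-time length.

    Assumes a logistic curve in log2(task length), centered on the 50% horizon.
    """
    x = beta * (math.log2(h50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-x))


def horizon_at(reliability: float, h50_minutes: float, beta: float) -> float:
    """Task length at which the modeled success probability equals `reliability`."""
    logit = math.log(reliability / (1.0 - reliability))
    return h50_minutes * 2 ** (-logit / beta)


if __name__ == "__main__":
    h50 = 120.0  # hypothetical 50% horizon in minutes
    beta = 0.6   # hypothetical slope; a steeper slope narrows the gap
    for r in (0.5, 0.8, 0.95, 0.99):
        print(f"{r:.0%} reliability -> {horizon_at(r, h50, beta):.1f} min")
```

Under any slope of this kind, each step up in required reliability cuts the usable task length by a multiplicative factor, so the production horizon shrinks much faster than the headline one suggests.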
The curve is real, and it is steepening
The trajectory holds up under independent presentation. AI Digest's writeup, A new Moore's Law for AI agents, fits the same data and reports an R-squared of 0.83 for the relationship between task length and success rate, which is a tight fit for a behavioral phenomenon, and it confirms the 2024-to-2025 acceleration toward roughly four months per doubling. There is also independent evidence about where the absolute level sits. Xu, Neubig and colleagues built TheAgentCompany, a simulated firm of realistic knowledge-work tasks, and even the strongest agent they tested cleared only about a third of the work without help, with its success rate dropping off sharply as tasks ran longer and grew more complex. So the curve is climbing fast, and the absolute level on long, complex, multi-step work is still low. Both facts are true at once, and a workflow-selection rule has to hold both.
What the curve says is automatable now
Translate this into a rule an operator can act on. A workflow is in range today when it is short-horizon, when its individual decisions are the kind a skilled person makes in minutes rather than hours, and when the reliability you need can be met with a human positioned on the exception tail rather than on every case. A surprising amount of everyday work fits this shape better than it first appears, because the daily reality of an AP queue or an order-entry desk or a month-end reconciliation is a long sequence of short, bounded decisions, and almost never one unbroken hours-long judgment. Each line item, each match, each coding decision is a short-horizon task, and the eighty-percent horizon already reaches many of them. Where a person reviews the small fraction the agent flags as uncertain, you get production reliability out of an imperfect model, which is the whole design of our work on the system of action and the substance of how production reliability actually gets engineered.
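The exception-tail design described above can be sketched in a few lines. Everything here is illustrative: the `Decision` record, the confidence scores, the 0.8 threshold, and the sample invoices are hypothetical, and the sketch assumes the agent emits a usable confidence signal and that a reviewer fixes whatever reaches them.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    item_id: str
    confidence: float    # hypothetical agent confidence score in [0, 1]
    agent_correct: bool  # ground truth, known only in a labeled evaluation set


def raw_accuracy(decisions: list[Decision]) -> float:
    """Accuracy if every agent decision is accepted unreviewed."""
    return sum(d.agent_correct for d in decisions) / len(decisions)


def review_fraction(decisions: list[Decision], threshold: float) -> float:
    """Share of items routed to a human because confidence is below threshold."""
    return sum(d.confidence < threshold for d in decisions) / len(decisions)


def end_to_end_accuracy(decisions: list[Decision], threshold: float,
                        reviewer_accuracy: float = 1.0) -> float:
    """Accuracy when low-confidence items go to a reviewer instead of straight through."""
    auto_correct = sum(d.agent_correct for d in decisions if d.confidence >= threshold)
    reviewed = sum(d.confidence < threshold for d in decisions)
    return (auto_correct + reviewer_accuracy * reviewed) / len(decisions)


# Hypothetical labeled sample: ten invoice-coding decisions.
SAMPLE = [
    Decision("inv-001", 0.98, True), Decision("inv-002", 0.95, True),
    Decision("inv-003", 0.91, True), Decision("inv-004", 0.62, False),
    Decision("inv-005", 0.88, True), Decision("inv-006", 0.55, True),
    Decision("inv-007", 0.97, True), Decision("inv-008", 0.40, False),
    Decision("inv-009", 0.93, False),  # confident but wrong: the tail cannot catch this
    Decision("inv-010", 0.90, True),
]

if __name__ == "__main__":
    print("raw accuracy:", raw_accuracy(SAMPLE))
    print("review fraction at 0.8:", review_fraction(SAMPLE, 0.8))
    print("end-to-end accuracy at 0.8:", end_to_end_accuracy(SAMPLE, 0.8))
```

In this made-up sample, reviewing three items in ten lifts accuracy from 70 to 90 percent, and the one remaining error is the confident-but-wrong case. That residual is what determines whether the threshold, and the workflow, is viable.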
Where it is not automatable yet
The rule cuts the other way just as cleanly. A workflow is out of range when a single unattended decision must be right the first time over a long horizon, with no checkpoint where a human can catch a wrong turn before it compounds. A model attempting a six-hour judgment as one undivided task is operating where even the fifty-percent horizon barely reaches, never mind the eighty-percent one, and there is no exception tail to lean on when the failure is the whole task rather than a flagged subset of cases. Some of these workflows can be redesigned into shorter, checkpointed pieces that come back into range; some genuinely cannot, and the honest move is to say so rather than to promise a pilot that the curve says won't reach production this year.
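The compounding problem is simple arithmetic, sketched here under loud assumptions: each step succeeds independently with the same probability, a checkpoint always detects a failed segment, and a failed segment is rerun from its start until it passes. Real workflows violate all three to some degree, so treat the numbers as directional rather than as a forecast.

```python
def unattended_success(step_reliability: float, steps: int) -> float:
    """Chance an unattended chain finishes with no wrong step (independence assumed)."""
    return step_reliability ** steps


def expected_steps_with_checkpoints(step_reliability: float, steps: int,
                                    segment_length: int) -> float:
    """Expected step executions when each segment is retried until it passes.

    Assumes a perfect checkpoint at the end of every segment, so a failure
    costs one segment of rework instead of the whole task.
    """
    if not 0.0 < step_reliability <= 1.0:
        raise ValueError("step_reliability must be in (0, 1]")
    total = 0.0
    remaining = steps
    while remaining > 0:
        k = min(segment_length, remaining)
        # Attempts per segment are geometric with success chance p**k.
        total += k / (step_reliability ** k)
        remaining -= k
    return total


if __name__ == "__main__":
    p, n = 0.99, 100  # hypothetical per-step reliability and chain length
    print(f"unattended success over {n} steps: {unattended_success(p, n):.3f}")
    print(f"expected steps, checkpoint every 10: "
          f"{expected_steps_with_checkpoints(p, n, 10):.1f}")
    print(f"expected steps, single checkpoint at the end: "
          f"{expected_steps_with_checkpoints(p, n, n):.1f}")
```

With these placeholder values, a chain of one hundred steps at 99 percent per-step reliability finishes clean only about 37 percent of the time, while checkpointing every ten steps keeps the expected rework to roughly a tenth of the task. That is the arithmetic behind redesigning a long workflow into shorter pieces, and it offers nothing for the workflows that cannot be divided.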
A reasonable counter, and a date stamp
A reasonable counter is that this is too conservative, that frontier agents already string together long autonomous runs and the curve is about to make the whole distinction obsolete. There's something to it as a direction of travel, and the four-month doubling is the reason we date every capability claim rather than fix it. But the present level is what governs a deployment that ships this quarter, and the cost of betting on next year's horizon arriving on schedule is visible in Gartner's forecast that over forty percent of agentic AI projects will be canceled by the end of 2027, driven in part by capability claims that outran what the systems could reliably do. So here is the claim with its date attached. As of mid-2026, the workflows worth automating are the short-horizon, exception-tailed ones where eighty-percent reliability plus a human on the tail clears the bar, and a vendor who quotes you a capability without telling you which month they're standing in is selling you hope rather than a measurement. Ask us again next quarter; the reading will have moved, and we'll tell you which way.