
Samuel Mirpuri · Evaluating AI delivery

How to baseline a process well enough to stand behind a result

An outcome-aligned engagement is only as honest as its baseline, and most baselines are self-reported and wrong. Here is the measurement machinery that makes a before-and-after comparison defensible.

Ask a CFO how long the month-end close takes and you will get a confident answer that turns out, once you watch the close happen, to be off by days in one direction and to omit two reconciliations and a re-open entirely. The number she gave you was the close as designed, the version the controller carries in memory and on the calendar invite. The close as run is a different process, with handoffs that wait overnight, a journal entry that gets reversed twice, and a spreadsheet that one person maintains and nobody else can open. An engagement that promises to improve the close, and that ties its commitment to a result, inherits whichever of those two processes you measured. If you measured the designed one, your improvement is partly fictional before you have written a line of agent code, because some of the gap you will later claim to have closed was never real.

This is the part of an outcome-aligned engagement that gets the least attention and carries the most weight. The commercial structure of an aligned engagement only means something if the baseline it measures against is honest, and most baselines are neither measured nor honest. They are interview-derived, rounded up, and quietly self-serving on both sides: the buyer wants the before-state to look painful enough to justify the project, and a less careful vendor wants it to look painful enough that almost any after-state reads as a win. A baseline built that way is not a measurement; it is a negotiation, and it cannot underwrite a claim.

Measure the process that runs, not the one that's written

The reason interviews produce bad baselines is the same reason authored SOPs rot: the document and the recollection both describe the intended process, and the intended process is not the one that consumes the hours. An SOP is written once, at a moment of relative calm, by someone reasoning about how the work should go. The real work accretes exceptions. The vendor who pays a credit-memo customer differently, the entity whose intercompany eliminations never net to zero on the first pass, the one approver who is always traveling during close week: none of these appear in the SOP, and all of them are in the cycle time. A baseline that starts from the document or the interview starts from a process that, in the strict sense, does not exist.

Observation gets at the other process, the one with the exceptions in it. When you meter the actual sequence of actions, the keystrokes and the application switches and the waits, you are measuring the work as performed rather than as remembered, and the tacit knowledge that never made it into any document shows up as time. This is the same argument that justifies observing the work during discovery rather than interviewing about it, applied now to the narrower job of fixing a number you will later have to defend. The baseline must reflect the real process for exactly the reason the redesign must: both fail if they are built on the fiction.

The three metrics that have to be defined the same way every time

A defensible baseline rests on three measures, and each is only useful if its definition is frozen before measurement starts. Cycle time is elapsed time from the trigger to the completed output, including the waits, which is where most of the duration actually lives and where self-reports most reliably undercount. Touch count is the number of discrete human interactions the work requires, not the number of steps in the SOP but the number of times a person has to pick the item up, because that is what predicts both labor and error rate. Cost per unit is total resource consumed divided by units processed, defined per invoice or per close or per ticket, so that volume changes do not masquerade as efficiency changes.
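
The three definitions above can be sketched against an observed event log. This is a minimal illustration, not a prescribed implementation: the `Event` record, its `kind` and `actor` labels, and the sample invoice are all hypothetical, and a real capture would carry far more detail.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    item_id: str   # e.g. an invoice number
    kind: str      # "trigger", "touch", or "complete" (hypothetical labels)
    actor: str     # "human" or "system" (hypothetical labels)
    at: datetime


def cycle_time_hours(events):
    """Elapsed hours from the trigger to the completed output, waits included."""
    start = min(e.at for e in events if e.kind == "trigger")
    end = max(e.at for e in events if e.kind == "complete")
    return (end - start).total_seconds() / 3600


def touch_count(events):
    """Number of times a person picked the item up, not the number of SOP steps."""
    return sum(1 for e in events if e.kind == "touch" and e.actor == "human")


def cost_per_unit(total_cost, units_processed):
    """Total resource consumed divided by units processed."""
    if units_processed <= 0:
        raise ValueError("units_processed must be positive")
    return total_cost / units_processed


# One invented invoice: two human touches with an overnight wait between them.
invoice = [
    Event("INV-1", "trigger", "system", datetime(2024, 3, 4, 9, 0)),
    Event("INV-1", "touch", "human", datetime(2024, 3, 4, 9, 30)),
    Event("INV-1", "touch", "human", datetime(2024, 3, 5, 14, 0)),
    Event("INV-1", "complete", "system", datetime(2024, 3, 5, 14, 30)),
]

print(cycle_time_hours(invoice))   # 29.5
print(touch_count(invoice))        # 2
print(cost_per_unit(1200.0, 300))  # 4.0
```

In this made-up case the hands-on work is a small part of the 29.5 hours; most of the elapsed time is the overnight wait, which is the portion a self-report tends to leave out.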

The trap in all three is that an organization will define them differently in March than in September unless something holds the definition steady. Whether cycle time starts when the invoice arrives or when it's entered, and whether a re-touch after a correction counts as one touch or two, are exactly the questions that drift between readings if no one pins them down. This is the discipline that APQC's Open Standards Benchmarking and Process Classification Framework exist to enforce: a taxonomy that pins each process and its metrics to one definition so that a number from one period, or one organization, means the same thing as a number from another. Borrowing that rigor is what lets you say a before figure and an after figure are the same measurement taken twice, rather than two different measurements with the same name.
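
To make the drift concrete, here is a small sketch in which the same invoice yields different numbers under two definitions. The `MetricDefinition` class and its field names are assumptions made for illustration; they are not taken from APQC's framework, which defines its own taxonomy.

```python
from dataclasses import FrozenInstanceError, dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: the definition cannot be edited after it is agreed
class MetricDefinition:
    version: str
    cycle_starts_at: str             # "arrived" or "entered"
    retouch_counts_separately: bool  # does a re-touch after a correction add a touch?


def cycle_hours(timestamps, definition):
    """Elapsed hours from whichever start event the definition names."""
    start = timestamps[definition.cycle_starts_at]
    return (timestamps["completed"] - start).total_seconds() / 3600


def touches(first_touches, retouches, definition):
    """Touch count under the definition's rule for re-touches."""
    extra = retouches if definition.retouch_counts_separately else 0
    return first_touches + extra


# One invented invoice, measured under two definitions.
invoice = {
    "arrived": datetime(2024, 3, 1, 9, 0),
    "entered": datetime(2024, 3, 2, 9, 0),
    "completed": datetime(2024, 3, 3, 9, 0),
}

march = MetricDefinition("v1", cycle_starts_at="arrived", retouch_counts_separately=True)
drifted = MetricDefinition("v2", cycle_starts_at="entered", retouch_counts_separately=False)

print(cycle_hours(invoice, march), cycle_hours(invoice, drifted))  # 48.0 24.0
print(touches(3, 1, march), touches(3, 1, drifted))                # 4 3

# Attempting to change an agreed definition in place fails loudly.
try:
    march.cycle_starts_at = "entered"
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Nothing about the invoice changed between the two readings, yet the cycle time halved and a touch disappeared. Recording the definition version alongside every reading is one way to make that kind of drift visible.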

Meter it the same way before and after

The comparison is only valid if the instrument doesn't change between the two readings. A baseline captured by stopwatch and recollection, then compared against an after-state captured by automated logging, will show an improvement that is partly an artifact of the more honest instrument, and the larger an organization's tolerance for that sloppiness, the more its vendors will exploit it. The way to avoid it is to meter the process with the same instrumentation before and after, observed at the level of actual behavior in both readings, so that whatever the measurement misses, it misses symmetrically. This is the concrete instance of how flowscope works: the same capture that produces the before-state is the capture that produces the after-state, which means the comparison is internally consistent even where it is imperfect.
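
One way to enforce the same-instrument rule is to have the comparison itself refuse mismatched readings. The sketch below is an assumption-laden illustration, not a description of any particular tool: the `Reading` fields, the instrument names, and the figures are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    label: str                 # "before" or "after"
    instrument: str            # how the reading was captured (hypothetical names)
    definition_version: str    # which frozen metric definition was in force
    median_cycle_hours: float


def percent_change(before, after):
    """Percent change in cycle time, allowed only for like-for-like readings."""
    if before.instrument != after.instrument:
        raise ValueError("readings were captured with different instruments")
    if before.definition_version != after.definition_version:
        raise ValueError("readings use different metric definitions")
    delta = after.median_cycle_hours - before.median_cycle_hours
    return delta / before.median_cycle_hours * 100


before = Reading("before", "observed-capture", "v1", 40.0)
after = Reading("after", "observed-capture", "v1", 30.0)
print(percent_change(before, after))  # -25.0

# A baseline taken by recollection cannot be compared with a logged after-state.
recalled = Reading("before", "interview", "v1", 40.0)
try:
    percent_change(recalled, after)
    comparable = True
except ValueError:
    comparable = False
```

The guard does not make either reading more accurate. It only ensures that whatever the instrument misses, it misses in both readings.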

Don't let the redesign take credit the market handed you

The subtler failure is attribution. Suppose AP cycle time drops thirty percent over the engagement. Some of that is the redesigned process and the agent doing the entry, and some of it might be a quieter quarter, a vendor consolidation that happened anyway, or a volume dip that made every queue shorter. An honest baseline carries the discipline to separate these, holding the comparison to like-for-like volume and mix, and resisting the temptation to count an external improvement in demand as engineering credit. The cleanest version meters a control: a slice of the same work left on the old process, so that the difference between the changed and unchanged slices isolates the redesign from everything else moving in the background. Where a true control isn't feasible, at minimum the volume and mix get normalized, and the parts of the gain that can't be cleanly attributed get reported as unattributed rather than claimed.
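
The control comparison described above is, in statistical terms, a simple difference-in-differences, and the fallback is a form of mix standardization. The sketch below shows both under stated assumptions: the function names and all figures are invented, and the control only isolates the redesign if the two slices would otherwise have moved together.

```python
def attribute_change(treated_before, treated_after, control_before, control_after):
    """Split an observed change into background movement and redesign effect.

    Assumes the control slice shows what would have happened to the treated
    slice without the redesign (the usual difference-in-differences assumption).
    """
    total = treated_after - treated_before
    background = control_after - control_before   # moved without the redesign
    redesign = total - background                  # what is left to claim
    return {"total": total, "background": background, "redesign": redesign}


def mix_normalized_mean(per_category_mean, reference_mix):
    """Average per-category results using the baseline period's mix as weights."""
    total_units = sum(reference_mix.values())
    weighted = sum(per_category_mean[c] * n for c, n in reference_mix.items())
    return weighted / total_units


# Invented figures: AP cycle time in days for the changed and unchanged slices.
result = attribute_change(
    treated_before=10.0, treated_after=7.0,
    control_before=10.0, control_after=9.0,
)
print(result)  # {'total': -3.0, 'background': -1.0, 'redesign': -2.0}

# Fallback when no control exists: hold the mix of work fixed at baseline levels.
baseline_mix = {"po_invoice": 60, "non_po_invoice": 40}
after_means = {"po_invoice": 2.0, "non_po_invoice": 8.0}
normalized = mix_normalized_mean(after_means, baseline_mix)  # about 4.4 days
```

In this example the headline drop is three days out of ten, but one of those days also appeared in the slice left on the old process, so only two are attributable to the redesign. Without a control, the normalized figure removes the effect of a shifting mix but cannot separate the redesign from other background changes, which is why the remainder is reported as unattributed.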

A reasonable counter is that all of this is needless ceremony, that customers feel whether the close got easier and the precise baseline is a formality nobody reads. There's some truth in it for processes small enough to feel. But the engagements where outcomes are worth committing to are the ones large enough that intuition is unreliable, where a thirty-percent feeling and a twelve-percent reality are indistinguishable from the inside, and where the difference is the whole point. The discipline isn't there to impress the customer; it's there because the unbillable-hour logic of tying a vendor's standing to a result only works if the result is measured by someone willing to be proven wrong. A vendor that meters the baseline transparently, with frozen definitions and honest attribution, is underwriting its own claim and accepting the downside if the number doesn't move. One that hands you a baseline from an interview and asks you to trust the after-state is doing the opposite, and the asymmetry tells you most of what you need to know.

Common questions

Why can't we just use the cycle time our controller already reports for the month-end close?

Self-reported numbers describe the process as designed, which is the version someone carries in memory or on a calendar invite, not the process as it actually runs. The close that runs has handoffs that wait overnight, journal entries that get reversed, and steps that never made it into any document, and all of those exceptions live in the real cycle time. A baseline built from interviews starts from a process that does not exist, so any improvement measured against it is partly fictional before any work is done.

What metrics make a before-and-after process comparison defensible?

A defensible baseline rests on three measures, each frozen to one definition before measurement starts. Cycle time is elapsed time from trigger to completed output including the waits, where most of the duration actually lives. Touch count is the number of discrete human interactions the work requires, and cost per unit is total resource consumed divided by units processed, defined per invoice or per close or per ticket so that volume changes do not look like efficiency changes. The discipline of pinning each definition steady is what APQC's Open Standards Benchmarking and Process Classification Framework exist to enforce.

How do you make sure a reported improvement came from the redesign and not from external factors like a slower quarter?

When a number like cycle time drops, some of the change can come from the redesigned process and some from outside factors such as a quieter quarter, a vendor consolidation, or a volume dip that shortened every queue. The cleanest way to separate these is to meter a control, a slice of the same work left on the old process, so the difference between the changed and unchanged slices isolates the redesign. Where a true control is not feasible, volume and mix get normalized at minimum, and any portion of the gain that cannot be cleanly attributed is reported as unattributed rather than claimed.