Samuel Mirpuri · Evaluating AI delivery

How to tell an AI delivery vendor that ships from one that demos

A buyer's procedure for separating vendors that put working software on your data from vendors that sell decks, built on the base rates that explain why most enterprise AI pilots never reach production.

Every vendor's slide deck says the same five things: they understand your industry, they have a proprietary methodology, they start with a discovery phase, they will be a partner to you over the long haul, and they will deliver outcomes. The words are interchangeable across firms, which means they tell you nothing. The only way to separate a vendor that will put working software into your systems from one that will leave you a beautifully formatted assessment is to ignore what the deck claims and test for the things a deck-seller structurally cannot produce. The base rate makes the test worth running. MIT's NANDA initiative, in its mid-2025 study led by Aditya Challapally, found that about ninety-five percent of enterprise generative-AI pilots delivered no measurable impact on the profit-and-loss statement, and it traced the failures not to weak models but to integration and what the authors call the organizational learning gap. You are buying against a ninety-five-percent failure rate, so the burden is on the vendor to prove they sit in the surviving five percent. The procedure below is how you make them prove it.

Discovery duration tells you what they sell

The first signal comes before any work begins, in the discovery phase. A vendor who proposes months of stakeholder workshops, current-state mapping, and a future-state roadmap before anything runs against your data is telling you, whether they mean to or not, that their deliverable is the analysis itself. We have written at length about why the discovery phase does not need to cost three months, and the procurement consequence is direct: a long, deck-heavy discovery is the cost structure of a firm that bills for analysis. If the analysis were a means to a shipped agent rather than the product itself, the firm would compress it to reach the build sooner. Ask how long discovery runs and what physically exists at the end of it. If the honest answer is a slide deck and a prioritized backlog, you have found a consultancy that sells AI analysis as the product.

Ask for a working artifact on your own data, in days

The single most discriminating question you can ask is whether the vendor will build something that runs against your own data, in your own environment, within a handful of days, before you sign the full engagement. The forward-deployed engineering model shows why this works: Palantir's practice, as Diogo Silva Santos documents in his analysis of the model, is to put a working capability on the customer's actual data within one to five days rather than to present a slide projecting what could be built. The test works precisely because it cannot be faked with positioning. A vendor who genuinely ships can hand you a thing that handles a thin slice of your real workload by the end of the week. A vendor who sells assessments will counter that responsible work requires more upfront scoping, which is true of the assessment and untrue of the artifact. Demand the artifact, on your data, early, and watch which kind of objection comes back.

Test for substance behind the label

The vocabulary problem has gotten worse since the agent became the thing every vendor sells. Gartner, in its June 2025 prediction that more than forty percent of agentic-AI projects will be canceled by the end of 2027, named the practice driving much of the noise: agent washing, the rebranding of chatbots, robotic process automation, and ordinary assistants as autonomous agents. Gartner's analysts put the count of vendors offering something genuinely agentic at roughly one hundred and thirty out of the thousands claiming the label, which is to say the word on the slide tells you almost nothing and you have to test the substance underneath it. The way to test it is to ask what the system does when it encounters a case it was not built for, because a rebranded chatbot has no answer and a real agent has a designed one. Gartner's own counsel is that the durable value comes from rethinking the workflow rather than bolting an agent onto a legacy process, which is the same conclusion that follows from watching where the pilots fail.
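One way to picture what a "designed answer" looks like is a routing step that decides, before the agent acts, whether a case is inside the scope it was built for. The sketch below is a minimal illustration, not any vendor's implementation: the intent names, the `CONFIDENCE_FLOOR` value, and the `route` function are all hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass

# Hypothetical intents the agent was built to handle (illustrative only).
SUPPORTED_INTENTS = {"refund_status", "invoice_copy"}

# Assumed cutoff for this sketch; a real threshold would come from evaluation.
CONFIDENCE_FLOOR = 0.80


@dataclass
class Decision:
    action: str  # "handle" or "escalate"
    reason: str


def route(intent: str, confidence: float) -> Decision:
    """Decide whether the agent acts or hands the case to a person."""
    if intent not in SUPPORTED_INTENTS:
        # A case the system was not built for: escalate rather than guess.
        return Decision("escalate", f"unsupported intent: {intent}")
    if confidence < CONFIDENCE_FLOOR:
        # In scope, but the classification is too uncertain to act on.
        return Decision("escalate", f"low confidence: {confidence:.2f}")
    return Decision("handle", "within designed scope")
```

A vendor with a real agent should be able to show you the equivalent of this branch in their own system, along with who receives the escalated cases. A rebranded chatbot typically has only the last line.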

The four production questions

Once a vendor clears the artifact test, the remaining questions are about what happens after go-live, because that is where the ninety-five percent actually fail. Ask who owns the running system in production, by name and by team, after the engagement closes, since a vendor whose model ends at handoff is selling you a pilot you will operate alone. Ask for the evaluation method and the accuracy threshold the system must clear before it touches a live transaction, because a vendor who cannot state the number does not have one. Ask what the rollback plan is when the agent makes a wrong call. Ask specifically how exceptions are handled and by whom, since the difference between a demo and a deployment is almost entirely the unglamorous work of the missing nineteen percent: the edge cases, the human-in-the-loop fallbacks, and the instrumentation that nobody saw in the demo and that determines whether the thing keeps working across a real month. A vendor with crisp answers to these four has thought about production, whereas a vendor who treats them as implementation details to be worked out later has told you they have not.
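The evaluation question can be made concrete with a release gate: a check that compares measured accuracy on a labeled evaluation set against an agreed threshold and refuses to promote the system if it falls short. The following is a minimal sketch under stated assumptions. `EvalCase`, `accuracy`, and `release_gate` are hypothetical names, exact-match scoring stands in for whatever metric the workflow actually needs, and the threshold itself is something you and the vendor would have to agree on.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    expected: str   # the labeled correct outcome
    predicted: str  # what the system produced


def accuracy(cases: list[EvalCase]) -> float:
    """Fraction of cases where the prediction exactly matches the label."""
    if not cases:
        raise ValueError("evaluation set is empty")
    correct = sum(1 for case in cases if case.predicted == case.expected)
    return correct / len(cases)


def release_gate(cases: list[EvalCase], threshold: float) -> bool:
    """Return True only if measured accuracy meets the agreed threshold."""
    return accuracy(cases) >= threshold
```

The useful part for a buyer is less the code than what it forces into the open: a labeled evaluation set built from your own data, a stated metric, and a number written down before go-live. A vendor who has these can show them to you; a vendor who does not will describe them in the future tense.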

A reasonable counter, answered

A reasonable counter is that this procedure favors a particular kind of vendor and risks screening out a genuinely capable firm whose strength is strategy rather than software, the firm that helps you choose the right problem before anyone builds anything. There is something to this, and choosing the wrong workflow does kill engagements that perfect engineering could not have saved. But the strategic firm and the delivery firm are answering different questions, and the operator buying against a ninety-five-percent failure rate is not short on strategy decks. The constraint is almost never knowing which workflow to fix; it is getting one fixed in a way that holds. The procedure does not screen out good strategy. It screens out vendors who have nothing but strategy and have learned to call it AI, which, given the base rate, is exactly the population you most need to screen out. The operators who get burned are rarely the ones who tested too hard. They are the ones who, faced with five identical decks, picked on rapport and discovered eighteen months later that nothing had shipped. That is the outcome this procedure exists to prevent, and the reason it is worth the awkwardness of asking a vendor to prove, in days and on your own data, the one thing the deck can only assert.

Common questions

How can I tell an AI vendor that will ship working software from one that will leave a deck?
Ask for a working artifact that runs against your own data, in your own environment, within a handful of days, before you sign the full engagement. A vendor who ships can produce one; a vendor who sells assessments will argue that responsible work requires more upfront scoping. Discovery duration is the tell: months of workshops before anything runs means the deliverable is the analysis itself.
What questions separate a real agent from a rebranded chatbot?
Ask what the system does when it encounters a case it was not built for, who owns the running system in production after go-live, what the evaluation method and the accuracy threshold are before it touches a live transaction, and what the rollback plan is when it makes a wrong call. Gartner calls the practice of rebranding chatbots and robotic process automation as agentic AI "agent washing", and estimates that only about one hundred and thirty vendors offer something genuinely agentic, so the label on the slide has to be tested.
Why does the base rate matter when choosing a vendor?
MIT's NANDA study found that about ninety-five percent of enterprise generative-AI pilots delivered no measurable impact on the profit-and-loss statement, with the failures traced to integration and the organizational learning gap rather than to weak models. Buying against that base rate means the burden is on the vendor to prove they sit in the surviving share, which is what the artifact test and the production questions are for.