AI Won’t Replace Your Experimentation Programme. It Will Expose Its Weaknesses.

sjarvie, 30 Apr 2026

TLDR: The conversation about AI in experimentation is mostly about the wrong thing. AI is not replacing sound experimental judgement. What it is doing is removing the friction that was previously masking the consequences of poor programme design. If your programme has structural weaknesses, and most do, AI will make them impossible to ignore.


There is a specific kind of article doing the rounds right now. The title is some variation of “Will AI replace your experimentation programme?” and the argument lands in one of two places: either yes, AI is coming for the whole function, or no, human judgement is irreplaceable and everyone can relax.

Both miss what is actually happening.

The more honest question is not whether AI will replace experimentation work. It is whether your programme is built well enough to benefit from what AI can actually do. And for most programmes, the answer is not yet, because the structural weaknesses that will cause problems with AI were already causing problems without it.

AI is not creating new failure modes. It is accelerating the existing ones.

Every AI tool making meaningful inroads into experimentation (hypothesis scoring, pattern recognition across test libraries, analysis summarisation) depends on the quality of what your programme has been producing. If the foundations are weak, AI does not shore them up. It moves faster through a broken process and generates outputs with more confidence.

Three weaknesses come up consistently across the programmes I have worked on, and in the research that the AI debate has inadvertently surfaced. None of them are new. All of them get worse when you introduce AI.

Poorly formed hypotheses cannot teach a model anything

The most visible weakness, and the one that connects most directly to what AI tools need to function, is hypothesis quality.

CROmetrics built a tool called ImpactLens that scores new experiment ideas against historical patterns and predicts their probability of winning. The limitation buried in that promise: it learns from your historical experiments. If your past hypotheses were vague, circular or missing the reasoning that explains why an intervention should work, that is what the model learns from. Predicting outcomes from a weak evidence base is not a shortcut to better experiments. It is a faster path to the same ones.

The problem with most hypotheses is not that they lack data. Teams have learned to attach a metric. The problem is that they conflate observation and reasoning. “We have observed users drop off at checkout, so we think addressing checkout will improve checkout” is circular. It names the problem twice. What is missing is the specific belief about why the drop-off is happening, written down clearly enough to be tested and challenged over time.

A complete hypothesis has five parts. Most teams write two or three. I wrote about what those five parts look like and why the ones teams skip are the most important. The structure matters not just for rigour in individual tests but because it is the only way to accumulate genuine programme learning over time. AI cannot extract signal from half-formed thinking.

Running more variants is not a substitute for making a decision

Most teams running multiple variants on the same test are not doing it because it is statistically sound. They are doing it because they cannot agree on which idea is better and do not want to make the call.

Before AI, build effort did that work for them. Developing two fully built variants instead of one took real time and resource, and at some point someone had to make a prioritisation call. It was uncomfortable, but the constraint forced it. Teams had to commit to a direction, which meant they had to form a view on which idea was more likely to work and why.

AI removes that friction. When generating variants is cheap, the tradeoff disappears. There is no longer a forcing function that requires the team to decide. So they do not decide. They test everything and wait for the data to tell them what they should have been able to reason through before the experiment ran.

The problem is that this does not get better over time. A programme that outsources prioritisation decisions to test results never develops the skill of forming strong, informed bets. It accumulates more data and less judgement. And as GrowthBook documents, it also accumulates worse data: more variants mean less traffic per variant, which means noisier results, which means the “winners” that emerge are more likely to be statistical noise than genuine signal. The inability to prioritise produces tests that are less trustworthy, which makes the next prioritisation call even harder to make with confidence.
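To put rough numbers on that traffic cost, here is a minimal sketch in Python. It assumes an even traffic split, comparisons that are independent of each other, no true effect in any variant and no multiple-comparison correction. The function names and the 60,000-visitor figure are illustrative assumptions and do not come from GrowthBook.

```python
def traffic_per_arm(total_visitors: int, variants: int) -> int:
    """Visitors each arm receives when traffic is split evenly
    across the control plus the given number of variants."""
    arms = variants + 1  # the control counts as an arm
    return total_visitors // arms


def chance_of_false_winner(variants: int, alpha: float = 0.05) -> float:
    """Probability that at least one variant looks significant by chance,
    assuming independent comparisons at level alpha and no real effect.
    This is a simplification: comparisons against a shared control
    are correlated in practice."""
    return 1 - (1 - alpha) ** variants


# Illustrative figures only: 60,000 visitors available for the test.
for variants in (1, 2, 4):
    visitors = traffic_per_arm(60_000, variants)
    risk = chance_of_false_winner(variants)
    print(f"{variants} variant(s): {visitors} visitors per arm, "
          f"{risk:.2%} chance of a false winner")
```

Under these assumptions, going from one variant to four cuts each arm from 30,000 visitors to 12,000 while the chance of at least one false winner rises from 5% to roughly 19%. The exact figures depend on the statistical method in use, but the direction does not.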

Traffic and time were always the real costs of running an experiment. Build effort was just the visible one. Most teams were never accounting for the others, and the build constraint masked that gap. AI removes the visible cost and leaves the hidden ones fully exposed. Teams that were never doing the cost accounting will now run more experiments with the same missing foundation, and wonder why they are getting noisier results.

Learning that does not get documented does not exist

The third weakness is the one that matters most for what AI can actually do for a programme over time, and the one that receives the least attention.

Experimentation creates learning. Most programmes do not capture it in a form that is usable. Test results get filed. But the belief that was being tested, the reason the intervention was chosen and whether that belief held up under the result are never documented in any structured way. The programme accumulates a record of what happened and no record of what it understood.

AI tools that look across experiment libraries to surface insights have nothing to work with in that situation. You cannot run a meta-analysis across a list of “if/then” statements. You need the documented belief and the documented outcome to see whether a particular theory of user behaviour is consistently supported or consistently wrong. Without that, each test is an isolated event rather than a contribution to a growing body of knowledge about how your customers make decisions.

This is where the AI debate becomes accidentally useful. The question of what AI needs to function well inside your programme is the same as the question of what your programme needs to function well without AI: clearly documented hypotheses, careful test design, and a system for accumulating learning rather than just accumulating results.

What good foundations actually look like

The three weaknesses above have specific fixes. None of them require new tools.

On hypothesis quality: A well-formed hypothesis documents what you observed with data, not opinion. It then separates that observation from the belief about why it is happening. “We have observed that mobile checkout abandonment is 74% at the payment step” is an observation. “We believe this is caused by the keyboard obscuring the CVC field on smaller screens, creating friction at the moment of highest intent” is a belief. They are different things and they need to live in different parts of the hypothesis. When an AI tool scans your hypothesis library looking for patterns, a library full of documented beliefs gives it something to learn from. A library of “if we change X, conversion will improve” gives it nothing.
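One way to keep observation and belief in separate places is to give each its own field in whatever system stores your hypotheses. The sketch below is an assumption about how that might look in Python, not the five-part structure referred to earlier: it models only the elements this article names (observation, belief, intervention, metric), and the class and method names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Hypothesis:
    observation: str   # what the data shows, stated without interpretation
    belief: str        # the proposed cause of that observation
    intervention: str  # the change being tested against that cause
    metric: str        # the measure expected to move

    def missing_parts(self) -> list[str]:
        """Return the names of any fields that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# A hypothesis that names the problem but never says why it happens.
vague = Hypothesis(
    observation="Users drop off at checkout",
    belief="",
    intervention="Redesign checkout",
    metric="Checkout conversion",
)

# The example from the paragraph above, with observation and belief kept apart.
specific = Hypothesis(
    observation="Mobile checkout abandonment is 74% at the payment step",
    belief="The keyboard obscures the CVC field on smaller screens",
    intervention="Keep the CVC field visible while the keyboard is open",
    metric="Mobile payment-step completion",
)

print(vague.missing_parts())     # ['belief']
print(specific.missing_parts())  # []
```

A blank-field check cannot judge whether a belief is any good. It only makes a missing belief visible at the point the hypothesis is written, which is where the gap is cheapest to fix.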

On prioritisation: Before a test is built, the team should be able to articulate why this idea over another. Not because the build is costly but because the reasoning is the point. Which idea has stronger evidence behind it? Which belief is more directly testable with the traffic available? Which intervention is most precisely targeted at the cause identified in the observation? These are questions that require a view, and forming that view is where the programme’s judgement develops. A team that can answer them before a test runs is a team that learns faster from the results, whether those results are wins or losses.

On learning documentation: After each test, the programme should record whether the belief held up, not just whether the metric moved. If a test loses, the useful question is whether the belief was wrong, the execution was flawed, or the sample was too thin to tell. Documented answers to that question, accumulated across dozens of tests, create a body of knowledge about how your customers make decisions. That is what a meta-analysis actually draws from. And it is what AI tools need to surface anything beyond surface-level pattern matching.
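As a minimal sketch of what that record could look like, the Python below logs a verdict on the belief for each test and tallies verdicts by theme. The verdict categories follow the questions in the paragraph above. The `belief_theme` tag, the names and the sample entries are all hypothetical and invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    BELIEF_SUPPORTED = "belief supported"
    BELIEF_WRONG = "belief wrong"
    EXECUTION_FLAWED = "execution flawed"
    SAMPLE_TOO_THIN = "sample too thin"


@dataclass(frozen=True)
class Learning:
    test_name: str
    belief_theme: str  # hypothetical tag grouping tests that share a theory
    verdict: Verdict


def verdicts_by_theme(log: list[Learning]) -> dict[str, Counter]:
    """Count how often each verdict occurs for each belief theme."""
    summary: dict[str, Counter] = {}
    for entry in log:
        summary.setdefault(entry.belief_theme, Counter())[entry.verdict] += 1
    return summary


# Invented entries, for illustration only.
log = [
    Learning("CVC field above keyboard", "form friction", Verdict.BELIEF_SUPPORTED),
    Learning("Shorter address form", "form friction", Verdict.BELIEF_SUPPORTED),
    Learning("Trust badges at payment", "payment anxiety", Verdict.BELIEF_WRONG),
    Learning("Security copy at payment", "payment anxiety", Verdict.SAMPLE_TOO_THIN),
]

for theme, counts in verdicts_by_theme(log).items():
    print(theme, {verdict.value: n for verdict, n in counts.items()})
```

Even a tally this simple answers a question a list of wins and losses cannot: which theories of customer behaviour keep holding up, and which ones have never been given a fair test.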

None of this is complicated. It is just disciplined. The programmes that will get genuine value from AI are the ones that have been doing this work consistently, because they will have something worth accelerating.

The programmes that benefit are the ones that were already ready

The HBR piece from August 2025 that documented how few generative AI investments had produced meaningful returns was not about experimentation specifically. But it was describing the same dynamic: activity without structure produces noise, and AI produces it faster.

The version of AI in experimentation that is genuinely useful is real. Teams with sound hypothesis frameworks, a discipline around prioritisation and a consistent approach to documenting learning will move faster, surface patterns more quickly and make better calls with less effort. AI will do real work for them.

The version being sold more often is that AI makes programme maturity less necessary. It does not. If you cannot look back across the last twelve months and say clearly what your programme now knows that it did not know before, that is the problem to fix first. AI will not fix it for you. It will just make the gap harder to ignore.

If you want to build a programme that produces learning before you add more tools to it, start with the hypothesis.

Use the assessor at stormjarvie.com.au/hypothesis to check whether your current hypotheses have all five parts. Or book a call if you want to look at the programme as a whole.
