
Your Experimentation Hypothesis Has Five Parts. Most Teams Write Two.

TLDR: The “if we / then we” hypothesis most teams write is not a hypothesis. It is a guess with a metric attached. A complete hypothesis has five parts, and the ones you skip are exactly the ones that protect your programme from post-result rationalisation and ensure you accumulate real learning over time.


The industry has normalised writing half a hypothesis

Most experimentation practitioners know a hypothesis should have more than “if we change the button colour, then conversions will go up.” That version is embarrassing. Teams learn quickly that the format needs a “because” attached. Data-driven. Evidence-based. The works.

So they arrive at something like: “We hypothesise that if we simplify the checkout form, then checkout completion will increase by 8%, because our data shows users are dropping off at that step.”

That is better. It has a number. It has data. It reads like something a serious team would write.

It is still not a complete hypothesis.

Adding data does not mean you have explained your thinking

The problem with the data-driven version is not the format. It is what is missing. The “because” in most hypotheses restates the observation rather than explaining the reasoning. “Users are dropping off at checkout, so we think addressing checkout will improve checkout” is circular. You have named the problem twice.

Two things are missing: an explicit account of what you actually observed, and a separate, honest articulation of why you think that observation is happening. These are not the same thing, and conflating them is where programmes start to lose their rigour.

A complete hypothesis has five parts:

Hypothesis

We have observed [objective observation]
Which we believe [subjective interpretation of why that observation is happening]
If we [specific intervention]
Then we will see [measurable change in specific metrics]
Because [the specific choices or behavioural principles you are applying]

The first two parts are the ones most teams skip entirely. They are also the ones that do the most work.
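One way to keep a team honest about all five parts is to store each hypothesis as a structured record rather than a free-text paragraph, so a skipped part is visibly blank. A minimal sketch in Python; the field names and the example values (drawn from the checkout scenario above) are my own, not a standard:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    observed: str      # objective, data-grounded observation
    belief: str        # subjective interpretation of why it is happening
    intervention: str  # the specific change being made
    expected: str      # measurable change in specific metrics
    because: str       # the principles linking the intervention to the belief

checkout_test = Hypothesis(
    observed="Mobile checkout abandonment is 74% at the payment step",
    belief="Multi-field forms on small screens create cognitive load before payment",
    intervention="Simplify the checkout form",
    expected="Checkout completion increases by 8%",
    because="Reducing cognitive load at the point of friction",
)
```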

The parts you skip are doing the most work

“We have observed” should be a data statement, not an opinion dressed as one. “We have observed that mobile checkout abandonment is 74% at the payment step” is an observation. “We have observed that the form seems too long” is a feeling. They are distinguishable, and the distinction matters.

It is also worth being clear that “we have observed” can carry more than one data point. A single quantitative metric tells you something is happening. Session recordings showing users pausing at the same field, survey responses flagging confusion, and a 74% abandonment rate together build a much stronger case. Qual and quant are not competing sources. They are answering different questions about the same problem, and both belong here. The richer your evidence base, the more confident you can be in the belief you form next, and the more defensible that belief is when the results come in.

“Which we believe” is where you show your working. It forces you to commit to a specific mechanism. “Users lose patience with multi-field forms on small screens because the cognitive load of switching between fields on a mobile keyboard creates friction before they reach the payment step” is reasoning. “Users are abandoning because of the form” is circular. It just restates the observation in slightly different words.

Committing to a belief also gives you something to challenge over time. If you run three tests against the same belief using different interventions and different principles and still see no improvement, that is a signal to question the belief itself, not just the execution. Maybe the cause you identified in step two is wrong. A documented belief is what makes that pattern visible. Without it, you are just accumulating losses with no way to understand why.
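A documented belief also makes this check mechanical rather than a matter of memory. As a hypothetical sketch, assuming each result is logged against the belief it was testing:

```python
from collections import defaultdict

def stalled_beliefs(results, min_tests=3):
    """Return beliefs tested at least `min_tests` times without a single win.

    `results` is assumed to be an iterable of (belief, won) pairs, e.g.
    ("multi-field forms create cognitive load on mobile", False).
    """
    outcomes = defaultdict(list)
    for belief, won in results:
        outcomes[belief].append(won)
    return [
        belief
        for belief, wins in outcomes.items()
        if len(wins) >= min_tests and not any(wins)
    ]
```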

The “because” at the end should close the loop between your intervention and your belief. It should explain why this specific change addresses the specific cause you identified in step two. If you are simplifying a CTA and your “because” references social proof principles, something has gone wrong in the logic.

What this looks like in practice

Here is what all five parts look like when they are working together:

Hypothesis

We have observed that customers who are ineligible for an offer but are exposed to CTAs that lock them into offer messaging across the site are 25% more likely to abandon at the promo code step, with 40% higher promo code error rates than eligible customers.
Which we believe is caused by the onsite messaging providing false security that they will be eligible for the offer, because ineligible customers are not engaging with the terms and conditions.
If we only show offer messaging once eligibility has been confirmed rather than upfront,
Then we will see no material change in conversion for eligible customers and a minimum 4% increase in conversion for ineligible customers,
Because we applied the principles of personalisation and clarity to remove false expectations for customers who cannot redeem the offer.

Notice what this makes visible.

The observation is specific and dual-metric. Abandonment rate and error rate together build the case that the problem is real and worth solving.

The belief identifies a precise mechanism (false security from messaging, not the promo code step itself).

The intervention is targeted at the cause, not the symptom.

The expected outcome is segmented by eligibility, because the change is not expected to affect both groups equally.

And the “because” names the principles applied, which means the next time personalisation or expectation-setting is used as a lever, the programme has a prior result to reason against.

Skipping these parts creates two specific problems for your programme

The first is that you open the door to HARKing: hypothesising after the results are known. The practice, documented in Kerr’s 1998 paper in Personality and Social Psychology Review, means presenting a post-hoc interpretation as if it were your original reasoning. Without a written, committed account of what you observed and why you believed the intervention would work, there is nothing holding you to your original thinking when the results come in.

I have seen this happen in client work. A test was built on the belief that customers did not like the order of content on a page because it was too dense. When results came back and only products from a certain category showed uplift, the hypothesis was quietly rewritten. The new story was that the intervention worked because content related to that category had been moved higher on the page. The numbers stayed the same. The interpretation changed. And with no documented original reasoning, the programme recorded a win that taught it nothing.

The second problem is the loss of accumulated knowledge across a programme over time. If you are not documenting the specific behavioural principles you are applying and the causal reasoning behind each test, you cannot identify patterns. You cannot tell whether a particular principle consistently produces wins, or whether your interpretation of a certain problem type keeps failing regardless of the intervention. That kind of pattern recognition is what separates a mature programme from a list of results in a spreadsheet.
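This is the kind of pattern recognition that documented principles make cheap. A sketch, under the same assumption as above that each test is logged in a structured form:

```python
from collections import Counter

def win_rate_by_principle(results):
    """Compute win rate per behavioural principle.

    `results` is assumed to be an iterable of (principle, won) pairs
    taken from the "because" clause of each documented hypothesis.
    """
    runs, wins = Counter(), Counter()
    for principle, won in results:
        runs[principle] += 1
        wins[principle] += int(won)
    # e.g. {"personalisation": 0.6, "social proof": 0.2}
    return {principle: wins[principle] / runs[principle] for principle in runs}
```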

This connects directly to learning rate, which I have written about before. Win rate tells you how often you won. Learning rate tells you what you actually learned when you did not.

Small tests are not exempt from this

I hear two objections to the full five-part format. One is that it is too much overhead. The other is that it is only worth the effort for large, complex tests.

Both miss the point.

If you cannot clearly articulate what you observed and how you are interpreting the cause and the solution, the stakes of the test are irrelevant. A small test built on vague reasoning teaches you nothing when it fails. A small test built on a clear observation and a specific causal belief teaches you something regardless of the result. Things do not have to be large to matter. What matters is whether the team genuinely understands the problem it is trying to solve.

Spaghetti testing (throwing ideas at a page and measuring what sticks) can produce wins. It does not produce programmes.

You can assess an experimentation hypothesis at two levels, and both matter

The first is structural. Are all five parts present? Is the metric connected to the touchpoint being changed? If you are simplifying a CTA and measuring revenue per session rather than add-to-cart rate, that is a metric mismatch worth flagging before the test runs. Structural problems are fast to spot. The checklist is short.
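Because the structural check is a short checklist, it is also the easy one to automate. A minimal sketch; the part names and the metric-to-touchpoint mapping are illustrative assumptions, not a standard:

```python
REQUIRED_PARTS = ("observed", "belief", "intervention", "expected", "because")

# Illustrative only: which metrics plausibly belong to which touchpoint.
METRICS_BY_TOUCHPOINT = {
    "cta": {"click-through rate", "add-to-cart rate"},
    "checkout": {"checkout completion", "payment step abandonment"},
}

def structural_check(hypothesis, touchpoint, metric):
    """Return a list of structural problems; an empty list means a pass."""
    problems = [
        f"missing part: {part}"
        for part in REQUIRED_PARTS
        if not hypothesis.get(part, "").strip()
    ]
    known_metrics = METRICS_BY_TOUCHPOINT.get(touchpoint, set())
    if known_metrics and metric not in known_metrics:
        problems.append(f"metric mismatch: {metric!r} is not a {touchpoint!r} metric")
    return problems
```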

The second level is quality. Is the observation grounded in data or opinion? Does the “which we believe” offer a specific mechanism, or does it restate the problem? Does the “because” close the loop back to the belief, or does it introduce a new idea that was never part of the original reasoning? These questions require reading comprehension, not pattern matching. A hypothesis can pass the structural check and still be hollow.

Both checks are worth doing before a single user sees your variant.

Want to put your hypothesis to the test?

I've built a hypothesis assessor that uses the structural and quality checks above to help you write stronger hypotheses, and generates belief and approach classifications to support meta-analysis of your programme. Give it a try today.

Has this post got you thinking?

If you are building an experimentation programme and want a second opinion on how your hypotheses are structured, I am available for a focused session.


Book a free 30-minute conversation: no obligations, just a chance to talk through your challenge and see if I can help.
