A 50% Win Rate Sounds Great. Here’s Why It Should Worry You.

The danger of chasing win rates in experimentation

TLDR: Win rate is a vanity metric. Mandating a high one doesn’t improve your programme. It encourages teams to lower statistical thresholds, pick safe tests and optimise for metrics that look good in a report but don’t move the business. The programmes that generate real revenue measure learning rate, not win rate.

Someone, somewhere, decided that a good experimentation programme wins half its tests. It’s a tidy number, it fits in a slide, and it sounds like a reasonable standard to hold a team to.

The problem is that it’s completely disconnected from how experimentation actually works.

The largest known analysis of online controlled experiments puts real win rates at around 10%. Google, Bing, Airbnb and Booking.com, organisations running some of the most sophisticated programmes in the world, all report something in that range. A 50% win rate mandate doesn’t mean your team got better at testing. It means the definition of winning got quietly adjusted until the number looked right.

That’s not experimentation. That’s scoreboard management.

The three ways teams game a win rate target

When a team is told to hit a number, they will hit the number. That’s not cynicism, it’s basic incentive design. The question is what they’ll change to get there. In practice it comes down to three things, and none of them are good.

They lower the statistical gates.

Statistical significance is the threshold for deciding whether a result is likely real or likely noise. The industry standard is 95%. At that level, a change with no real effect will still be declared a winner roughly one time in twenty. Drop to 80% and that becomes one in five.

Lowering the threshold is the fastest way to produce more “winners.” It’s also the fastest way to fill a roadmap with changes that don’t actually work. Teams roll out variants that showed early promise, see no movement in the business metrics that matter, and then wonder why the conversion rate at the end of the year looks identical to where it started. The answer is usually that they were implementing noise.
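To put rough numbers on that, here is a minimal sketch of how the mix of winners changes as the threshold drops. The inputs are illustrative assumptions, not measured figures: that 10% of tested ideas have a real effect, that a real effect is detected 80% of the time, and that detection rate stays fixed when the threshold moves (in practice it would rise a little). The function name `win_breakdown` is made up for this example.

```python
def win_breakdown(alpha, true_effect_rate=0.10, power=0.80):
    """Return (win_rate, share_of_winners_that_are_noise).

    alpha: chance a variant with no real effect is declared a winner.
    true_effect_rate: ASSUMED share of tested ideas that really work.
    power: ASSUMED chance that a real effect is detected.
    """
    true_wins = true_effect_rate * power          # real effects that get detected
    false_wins = (1 - true_effect_rate) * alpha   # no-effect variants that pass the gate
    win_rate = true_wins + false_wins
    return win_rate, false_wins / win_rate


for confidence in (0.95, 0.80):
    win_rate, noise_share = win_breakdown(alpha=1 - confidence)
    print(f"{confidence:.0%} threshold: win rate {win_rate:.1%}, "
          f"{noise_share:.0%} of winners are noise")
```

Under those assumptions, moving from 95% to 80% roughly doubles the reported win rate, from about 12.5% to about 26%, while the share of winners that are noise climbs from about a third to more than two thirds. The scoreboard improves and the roadmap gets worse.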

They only test things they already know will work.

Strategic experimentation is uncomfortable. A bold hypothesis about a fundamentally different user journey, a new pricing model or a checkout redesign might take weeks to run, require significant development effort and carry a real chance of losing. If your team is accountable for a win rate, none of that is worth the risk.

Instead they test button colours. They test headline copy. They test whether the image on the left performs better than the image on the right. These tests are fast, cheap and safe. They are also largely inconsequential. You cannot optimise your way to a transformational outcome by rearranging furniture.

A mature programme whose win rate drops is often doing something right. It means the team is pushing into harder, more valuable territory where the answers aren’t obvious. Win rate tends to decrease as programmes mature for exactly this reason. Treating that as a failure is a fundamental misread of the data.

They choose metrics that move easily, not metrics that matter.

This one is the most insidious because it’s the hardest to see from the outside.

If a team needs to show a winning test, they will find a metric that moves. Click-through rate is easier to shift than conversion rate. Conversion rate is easier to shift than revenue per user. Revenue per user is easier to shift than customer lifetime value. The further along that chain you go, from clicks towards lifetime value, the harder the test is to win and the more it actually matters to the business.

The result is a programme full of tests that technically won on their stated metric, but produced no discernible impact on the numbers the business actually cares about. Uber learned this the hard way. An optimisation that showed a clear lift in ride completion rates later revealed a sharp drop in customer satisfaction and repeat usage. The primary metric looked great. The business metric told a different story.

Optimising a proxy while the real needle sits still is one of the most expensive things an experimentation programme can do, because it consumes the team’s time and the organisation’s trust while delivering nothing of lasting value.

What a healthy programme actually measures

The metric worth caring about isn’t win rate. It’s learning rate.

Learning rate isn’t a number you pull from a dashboard. It’s the proportion of your inconclusive and losing tests where the team can actually answer the question: why didn’t this work? Not “the variant lost” as a full stop, but a genuine read of the qualitative and quantitative signals that explains what happened and changes what you do next.

That might mean session recordings that show users ignoring the variant entirely. It might mean a segment cut that reveals the test won for new users and lost for returning ones, which tells you something worth acting on. It might mean a hypothesis that gets retired because the data consistently shows the underlying assumption was wrong.

A test that loses and produces that kind of output is more valuable than a test that wins on a lowered threshold and gets shipped into the void. The first one sharpens the programme. The second one just fills the roadmap.
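Deciding whether a test has been genuinely explained is a human judgement, not a query. Once that judgement is recorded, though, the tally itself is simple. The sketch below assumes a hypothetical test log in which each entry carries an `explained` flag set by the team after review; the record structure, field names and sample entries are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExperimentRecord:
    name: str
    outcome: str     # "win", "loss" or "inconclusive"
    explained: bool  # team documented WHY it turned out this way


def learning_rate(records):
    """Share of losing and inconclusive tests with a documented explanation."""
    non_wins = [r for r in records if r.outcome in ("loss", "inconclusive")]
    if not non_wins:
        return None  # nothing to learn from yet
    return sum(r.explained for r in non_wins) / len(non_wins)


# Hypothetical log entries, for illustration only.
records = [
    ExperimentRecord("new-pricing-page", "win", True),
    ExperimentRecord("checkout-redesign", "loss", True),
    ExperimentRecord("hero-image-swap", "loss", False),
    ExperimentRecord("onboarding-flow", "inconclusive", True),
]

rate = learning_rate(records)
print(f"Learning rate: {rate:.0%}")
```

Wins are deliberately left out of the denominator, matching the definition above: the measure is about what the team extracts from tests that did not win.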

The programmes I’ve seen generate real, compounding returns treat null and negative results as part of the output, not as failures to be hidden from leadership. They run harder tests, hold higher statistical bars and accept that a lower win rate is the cost of doing serious work.

That’s a harder story to tell upward. But it’s the right one.

The mandate worth setting instead

If you’re responsible for an experimentation programme and you’re being asked to report on win rate, it’s worth pushing back on the framing before you start moving thresholds to satisfy it.

The questions that actually indicate programme health are:

  • What percentage of experiments are testing a hypothesis grounded in user research or data rather than gut feel?
  • What is the average measured impact of winning tests on a primary business metric, not a proxy?
  • What did losing and inconclusive tests teach the team, and how did that change what came next?
  • Are the statistical gates consistent and documented, or are they adjusted test by test?
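
Two of those questions, the first and the last, can be checked mechanically if the test log records them. The sketch below is one way to do it, assuming a hypothetical log where each experiment notes the confidence threshold it was judged against and whether its hypothesis came from research or data. The field names, the 95% documented gate and the sample entries are assumptions for illustration.

```python
DOCUMENTED_CONFIDENCE = 0.95  # ASSUMED programme-wide gate


def programme_health(experiments, documented=DOCUMENTED_CONFIDENCE):
    """Summarise hypothesis grounding and threshold consistency."""
    # Tests judged against anything other than the documented gate.
    off_gate = [e["name"] for e in experiments if e["confidence"] != documented]
    # Share of tests whose hypothesis came from research or data.
    backed = sum(e["research_backed"] for e in experiments) / len(experiments)
    return {"research_backed_share": backed, "off_gate_tests": off_gate}


# Hypothetical log entries, for illustration only.
experiments = [
    {"name": "checkout-redesign", "confidence": 0.95, "research_backed": True},
    {"name": "button-colour", "confidence": 0.80, "research_backed": False},
    {"name": "pricing-tiers", "confidence": 0.95, "research_backed": True},
    {"name": "headline-copy", "confidence": 0.95, "research_backed": False},
]

print(programme_health(experiments))
```

Any name appearing in `off_gate_tests` is a prompt for a conversation about why that test was held to a different bar, not proof of gaming on its own.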

None of those fit as neatly on a slide as “50% win rate.” But they’re the difference between a programme that looks good and one that actually moves the business.

Want a stronger experimentation programme?

If you’re building or rebuilding an experimentation programme and want to know what good looks like, I can help.
