Countless articles argue that it’s the smaller incremental improvements that form the overall growth of the company. However, even with many small tests yielding 2 or 3% improvements each, this kind of strategy can be the death of your company.

In product management, split testing is an excellent example of why the small improvements may not be in your interest. At the early stage of your company, chances are that you don’t have millions of website visitors. With the plethora of tools available for split testing, your company size is not an excuse for not running a proper test with great transparency on the results and the statistical strength of such results.

It may be easier to think of ideas that will make a small improvement than to make big step changes. The justification with smaller uplifts is that a) they still contribute meaningfully to the bottom line – particularly as they compound, and b) they are faster.

The flip side, obviously, is that the smaller the uplift, the more successful tests you’ll have to run to reach a meaningful cumulative uplift. Particularly at the early stage, when you start from a low base, you may not hit your business goals if your testing only contributes a single-digit uplift to your conversion rate.


A tale of two strategies

If you are aiming at a modest (when coming from a low base) uplift of, say, 5%, the time it takes you to get to any kind of statistical significance on your numbers will be quite long. To make a meaningful contribution to business growth from your tests, you’ll have to make bolder bets on bigger uplifts – but not for the reasons you might think.

Imagine choosing between two testing strategies, A and B, where you are constantly starting a new test once you’ve reached significance around the results and where:

Strategy A has

  • 50% chance that the performance of the new variation is better than that of the old one (i.e. a successful test as the new variation has improved conversion rates) and
  • Each successful test yields a 5% uplift in conversion rate

Strategy B has

  • 20% chance that the performance of the new variation is better than that of the old one and
  • Each successful test yields a 10% uplift in conversion rate

The table below outlines the marginal and cumulative uplift from the two strategies for the first ten tests:

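| Test # | A: marginal uplift | A: cumulative uplift | B: marginal uplift | B: cumulative uplift |
|--------|--------------------|----------------------|--------------------|----------------------|
| 1      |                    | +0%                  |                    | +0%                  |
| 2      | +5%                | +5%                  |                    | +0%                  |
| 3      |                    | +5%                  |                    | +0%                  |
| 4      | +5%                | +10%                 |                    | +0%                  |
| 5      |                    | +10%                 | +10%               | +10%                 |
| 6      | +5%                | +16%                 |                    | +10%                 |
| 7      |                    | +16%                 |                    | +10%                 |
| 8      | +5%                | +22%                 |                    | +10%                 |
| 9      |                    | +22%                 |                    | +10%                 |
| 10     | +5%                | +28%                 | +10%               | +21%                 |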
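To see how the cumulative figures compound, here is a minimal Python sketch. It assumes the wins land exactly on schedule (every second test for Strategy A, every fifth for Strategy B), as in the table above. The function name and the win schedules are illustrative choices, not part of any tool.

```python
def cumulative_uplift(uplift_per_win, winning_tests, n_tests):
    """Return the compounded cumulative uplift after each test."""
    factor = 1.0
    history = []
    for test in range(1, n_tests + 1):
        if test in winning_tests:
            # A winning test multiplies the conversion rate by (1 + uplift).
            factor *= 1 + uplift_per_win
        history.append(factor - 1)
    return history


# Assumed win schedules: every 2nd test for A (50%), every 5th for B (20%).
strategy_a = cumulative_uplift(0.05, {2, 4, 6, 8, 10}, 10)
strategy_b = cumulative_uplift(0.10, {5, 10}, 10)

for test, (a, b) in enumerate(zip(strategy_a, strategy_b), start=1):
    print(f"Test {test:>2}  A: {a:+.0%}  B: {b:+.0%}")
```

Because the uplifts multiply rather than add, five wins of 5% come to about 28% rather than 25%, and two wins of 10% come to 21% rather than 20%.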

In this case, it would be easy to jump to the conclusion that Strategy A is superior, and from a purely arithmetic point of view, that would be correct. On closer inspection, however, it isn’t.

That conclusion only holds if we assume that every test takes the same amount of time under both strategies, and this is the false assumption! This is where statistics end up playing a larger strategic role: the table above doesn’t show the full picture.

The inverse correlation between uplift and sample size

If you flip a fair coin (i.e. a normal coin) twice, it wouldn’t be unusual for both flips to land with the same side up (in fact, there is a 50% chance of that!). However, just two flips would be a bad predictor of future outcomes. Anyone who has flipped a coin more than a few times knows that, over many flips, heads and tails come up in a roughly equal split.

What holds for coins holds for any other test: the more we repeat it, the closer we get to the “real” distribution of outcomes (the population mean, in statistical terms).
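A quick simulation makes the point concrete. This is only a sketch: the helper name, the seed and the flip counts are arbitrary choices for illustration.

```python
import random


def share_of_heads(n_flips, rng):
    """Flip a fair coin n_flips times and return the observed share of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


rng = random.Random(42)  # fixed seed so the run is repeatable
for n_flips in (2, 10, 100, 10_000, 100_000):
    print(f"{n_flips:>7} flips: {share_of_heads(n_flips, rng):.3f} heads")
```

With only a handful of flips the observed share can land far from 0.5. As the number of flips grows, it settles closer and closer to the true value.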

The key insight here is that to finalize a test, we need statistical significance. As a quick refresher, statistical significance tells us whether the difference we observe between two sets of outcomes (i.e. the split between heads and tails or, in our case, conversion rates) is large enough that it is unlikely to have arisen in our sample by chance alone.

Fortunately, in a split test, we don’t need to know exactly what the “real” conversion rates are, but just whether the conversion rate of one is higher than that of the other. Therefore, confidence levels are a convenient tool.

Depending on your risk appetite, you can choose various levels of confidence, but most often product people use either 95% or 99% confidence. I won’t dwell on the statistical calculations in this post, but if you have a bit more information, it is possible to calculate the minimum sample size (the number of data points you will need to conclude anything given the chosen confidence level).

Further, if we fix the desired level of confidence and make some assumptions about the conversion rates, we can reverse the calculation so that the result is the expected number of observations we will need, given our assumptions (as in the case above where we make assumptions about the conversion rate uplift).
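For the curious, here is one way such a calculation could look in Python. This is a sketch under stated assumptions, not necessarily the calculation behind the numbers in this post: it uses a common normal approximation for comparing two proportions, considers only the confidence level (no statistical power term), and assumes a 50/50 traffic split. The function name and parameters are made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(baseline, relative_uplift, confidence=0.99):
    """Approximate total observations (both arms) needed to detect the uplift.

    Assumption: two-sided normal approximation without a power term.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_uplift)
    # Two-sided critical value, e.g. about 2.58 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    per_arm = z ** 2 * variance / (p2 - p1) ** 2
    return 2 * ceil(per_arm)


# Illustration with an assumed 10% baseline conversion rate.
print(total_sample_size(0.10, 0.05))  # roughly 98,000 observations
print(total_sample_size(0.10, 0.10))  # roughly 25,000 observations
```

Note the squared difference in the denominator: halving the uplift you want to detect roughly quadruples the number of observations you need.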

Back to the example: Assuming a 10% baseline conversion rate before the testing, let’s look at the n needed (i.e. the number of data points or observations) with a 50/50 split of traffic between the control and the variant – the table will give some interesting insights.

In the table below, Marginal n is the number of observations needed to conclude on the “winner” of the test at 99% confidence, whereas Total n is the total number of observations needed for that test and all the preceding ones.

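| Test # | A: marginal uplift | A: cumulative uplift | A: total n | A: marginal n | B: marginal uplift | B: cumulative uplift | B: total n | B: marginal n |
|--------|--------------------|----------------------|------------|---------------|--------------------|----------------------|------------|---------------|
| 1      |                    | –                    | 98k        | 98k           |                    | –                    | 25k        | 25k           |
| 2      | 5%                 | 5%                   | 195k       | 98k           |                    | –                    | 50k        | 25k           |
| 3      |                    | 5%                   | 288k       | 92k           |                    | –                    | 75k        | 25k           |
| 4      | 5%                 | 10%                  | 381k       | 92k           |                    | –                    | 100k       | 25k           |
| 5      |                    | 10%                  | 468k       | 87k           | 10%                | 10%                  | 125k       | 25k           |
| 6      | 5%                 | 16%                  | 555k       | 87k           |                    | 10%                  | 147k       | 22k           |
| 7      |                    | 16%                  | 638k       | 83k           |                    | 10%                  | 170k       | 22k           |
| 8      | 5%                 | 22%                  | 721k       | 83k           |                    | 10%                  | 192k       | 22k           |
| 9      |                    | 22%                  | 799k       | 78k           |                    | 10%                  | 214k       | 22k           |
| 10     | 5%                 | 28%                  | 878k       | 78k           | 10%                | 21%                  | 237k       | 22k           |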

You’ll note that the marginal n, the sample size needed to get to significance, is much higher when we are trying to detect a smaller change (e.g. compare the two values of n required for the first test, where the conditions are identical except for the expected conversion rate uplift).

In other words: when you are trying to detect a smaller uplift, you’ll need more time to conclude anything. In practice, this means that getting above a 20% uplift will need three times as much data, and thus take three times as long, with Strategy A as with Strategy B (n of 721k vs. 237k)!

This is a remarkable difference. If you have 10,000 visitors per week, 100 weeks of testing will, theoretically, get you either a 28% uplift with Strategy A or a 456% uplift with Strategy B! Of course, extending the testing that far into the future may not be possible. In the end, there are limits to how high your conversion rate can be. However, at that stage, with a much higher baseline conversion rate, your required n is much smaller, so you will likely find it hard to keep up with developing new tests!
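The 100-week comparison can be sketched as a small simulation. The assumptions are loud ones: the same normal-approximation sample size as a stand-in for the real calculation (99% confidence, no power term, 50/50 split), a 10% starting conversion rate, and wins that arrive exactly on schedule (every second test for A, every fifth for B). All names here are hypothetical.

```python
from math import ceil
from statistics import NormalDist

Z_99 = NormalDist().inv_cdf(0.995)  # two-sided 99% confidence


def total_n(p1, uplift):
    """Approximate observations (both arms) needed to detect the uplift."""
    p2 = p1 * (1 + uplift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return 2 * ceil(Z_99 ** 2 * variance / (p2 - p1) ** 2)


def run_tests(budget, uplift, win_every, baseline=0.10):
    """Run back-to-back tests until the visitor budget is used up."""
    rate, used, tests = baseline, 0, 0
    while rate * (1 + uplift) < 1:  # a conversion rate cannot reach 100%
        needed = total_n(rate, uplift)
        if used + needed > budget:
            break
        used += needed
        tests += 1
        if tests % win_every == 0:  # assumed deterministic win schedule
            rate *= 1 + uplift
    return tests, rate / baseline - 1


budget = 10_000 * 100  # 10,000 visitors per week for 100 weeks
for name, uplift, win_every in (("A", 0.05, 2), ("B", 0.10, 5)):
    tests, total_uplift = run_tests(budget, uplift, win_every)
    print(f"Strategy {name}: {tests} tests, {total_uplift:+.0%} uplift")
```

Under these assumptions the sketch lands on roughly +28% for Strategy A and +456% for Strategy B. Strategy B pulls ahead because each of its tests finishes faster, and keeps getting faster as the baseline conversion rate climbs.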

Conclusion

In conclusion, there is a strong statistical argument for making bolder tests, particularly at the early stage of your company. Try to change your product radically: run the tests that will either make the product significantly better or fail completely.

Don’t be afraid to fail. Even if Strategy B only had a 10% success rate, you’d still be better off (you’d have a 61% uplift after 100 weeks, instead of 28%)!

This only highlights the conversion rate benefits of an improved testing methodology; it doesn’t even take the side effects into account. With increasing conversion rates, you might be able to unlock additional marketing channels and opportunities to scale further, thus driving more traffic to your site with every incremental improvement. This is a cocktail that could lead to a period of exponential growth (until you hit your next plateau, but that’s for a separate post!).
