A/B testing when the stakes are high and the numbers are low

A/B test everything…

Anywhere you look for startup advice, you’ll find people saying you have to A/B test everything, stick with what works, and then A/B test some more. That is, of course, good advice; seeing how people actually react to your messaging or UX is usually more valuable than your guesswork.

…But what happens when your sample sizes are low and/or the stakes are high?

When the stakes are high

Imagine this scenario: you suspect the messaging on your website is costing you sign-ups, and you want to A/B test it against what you think is new, improved messaging. Since you’re determined to get a robust sample that minimises the chances of false negatives or false positives, you commit to running the test until you hit statistical significance (if any of that sentence doesn’t gel for you, I’d suggest this great post from Optimizely on sample sizes and statistical significance).

If the conversion rate with your current messaging is 10% (one in ten people who visit your site click sign up, for example), and you want to be able to detect an improvement in conversion rate of 20% or more with standard confidence (i.e. if the conversion rate goes up to 12% or more), you need a sample of 2,863 visitors for each variation – over 5,700 in total! (I’ll talk in a second about what happens if it would take you two months to get 5,700 hits on your website.)
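
If you’d like to sanity-check figures like these yourself, the textbook two-proportion power calculation fits in a few lines of Python. This is a sketch of the standard formula, not the exact maths behind the Optimizely calculator (which makes its own choices about power and one- versus two-sided tests), so it lands in the same ballpark as 2,863 rather than on it:

```python
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion z-test
    (textbook approximation; real calculators make their own assumptions)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for 95% confidence
    z_power = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    top = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_power * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return top / (p2 - p1) ** 2

# 10% baseline, minimum detectable lift of 20% (i.e. 10% -> 12%)
print(round(sample_size_per_variation(0.10, 0.20)))  # ~3,841 under these assumptions
```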

Imagine that your current messaging is costing you signups – today you get 10% of visitors to your homepage to sign up, but your A/B test will reveal that with new messaging you can get 15% to sign up. It’s great that your A/B test means you now know this with a good degree of confidence. But that confidence has come at a cost.

In order to prove that your new messaging could deliver a 15% sign-up rate, you directed 2,863 visitors to the old, inferior messaging. They signed up at a 10% rate – so those 2,863 visitors got you ~286 users. Had they received the new messaging, you’d have gained an extra 143 customers.
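
Spelled out with the numbers from the example (a rough back-of-the-envelope, assuming the full 2,863-visitor control group):

```python
control_visitors = 2863
old_rate, new_rate = 0.10, 0.15

signups_with_old = control_visitors * old_rate      # ~286 signups
signups_with_new = control_visitors * new_rate      # ~429 signups
print(round(signups_with_new - signups_with_old))   # ~143 signups forgone to run the control
```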

Here’s another example: say you wanted to test whether sending new users a ‘welcome’ email had any effect on whether they used your app. In order to A/B test this, you’d have to be willing to not send anything to 50% of your signups for a prolonged period of time. When every customer is valuable to you, and you’re desperate to engage them for feedback and validation, that would feel like a big sacrifice.

If your business can afford to ‘lose’ or miss out on customers, you must be doing a few things right. But for many businesses, it would be a big deal and might serve as an argument to ‘go with the gut’ rather than A/B test.

No doubt there are some people whose ‘gut’ gets the same results as an A/B test. But I suspect they’re few and far between. For the rest of us mere mortals, I see three potential solutions for A/B testing when the stakes are high (you can use more than one solution at once):

  1. A/B test the questions you can’t solve by debating them with your team and/or speaking to customers. If everyone on the team thinks the current messaging stinks, and so do your power users and your investors, maybe you should be willing to make the change without running an A/B test first.
  2. Allocate close to 100% of your sample to the B variation if you’re very confident it will succeed, but be willing to change quickly. If you know your conversion rate was 10%, switch almost all of your traffic to your new messaging and note your conversion rate. The weakness here is that you’re not comparing like for like – maybe the traffic that gave you a 10% conversion rate was different to the traffic that is going to your new messaging. This is particularly worrying when your new messaging accompanies, say, a new marketing campaign that brings a different sort of visitor to your site.
  3. Be willing to detect with confidence only a large change from your baseline. In the example above, we wanted to be able to detect a 20% or larger change in the signup rate, and to do that we needed ~5,700 visitors. If you were willing to detect only improvements of 100% or more, you’d need just 250 visitors. The downside here is obvious: with a smaller sample size, you may conclude you don’t have evidence that B is better, simply because it’s not better by enough (see the sketch just after this list).
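
To see how dramatically the required sample shrinks as the minimum detectable effect grows, here is a rough sketch using Cohen’s h, a textbook effect size for comparing two proportions. The figures won’t match the calculator quoted above exactly (different power and sidedness assumptions), but the trend is the point:

```python
from statistics import NormalDist
from math import asin, sqrt

z = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)  # 95% confidence, 80% power
baseline = 0.10
for lift in (0.20, 0.50, 1.00):
    target = baseline * (1 + lift)
    h = 2 * asin(sqrt(target)) - 2 * asin(sqrt(baseline))  # Cohen's h effect size
    n = 2 * z ** 2 / h ** 2                                # visitors needed per variation
    print(f"{lift:.0%} minimum detectable lift -> ~{n:,.0f} visitors per variation")

# Roughly: 20% lift -> ~3,800; 50% lift -> ~680; 100% lift -> ~200
```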

When the numbers are low

As you would’ve picked up from above, the lower a sample size, the harder it is to draw statistically significant conclusions from it. This issue rears its head in situations where you have low traffic to your website or app, a small number of users (or a small number of users in the segment you’re testing), or are doing something at a small scale (e.g. offering live demos of a product to high value potential users).

Two sales people armed with the same collateral, trading terms and product are, in their way, a living A/B test. If you’ve sent them out into the world to do live demos of your product with high value leads, they might get through 20 or 30 meetings a week.

If you’re a startup that’s trying to move fast, it’s going to take weeks you don’t have before you can detect small differences in performance with statistical significance.

In this situation, I think there are two solutions:

  1. Always be aware of whether your results are statistically significant. This tool is a useful starting point if you don’t want to or can’t do the calculations yourself. While you might not be able to change the fact that your results aren’t significant, it’s useful to know what results would be significant. In the sales example above, if one sales person converted 5 out of 20 leads in a week, the other would need to convert 10 to have results that were statistically different at the 95% confidence level (i.e. you could be 95% confident that the difference is not due to chance alone). If you plug numbers into a calculator like the one above, you’ll develop a bit of a spidey sense for statistical significance, which you can use as a rule of thumb.
  2. Be willing to work with a lot less confidence than the scientists. Typically, in statistics a 95% confidence level is used. This is partially convention, and partially because most experiments have very large sample sizes, which make statistical significance less of an issue. But you’re probably not trying to figure out if your cancer vaccine works. In other words, you can afford to (and probably have to) operate with far less confidence. Taking the example above, a conversion rate of 10 in 20 versus 5 in 20 is statistically significant with 95% confidence. 7 in 20 versus 5 in 20 is not statistically significant, but does offer 76% confidence (see the sketch below this list). If you’re pressed for time and committed to iteration, that might have to be enough.
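
If you want to eyeball numbers like these without reaching for a web calculator, a rough one-sided two-proportion z-test does the job. This is a sketch under simple assumptions (a normal approximation with unpooled standard error and a one-sided test; the tool mentioned above may compute things slightly differently), but it reproduces the figures in this list closely enough to build that spidey sense:

```python
from statistics import NormalDist
from math import sqrt

def one_sided_confidence(conversions_a, n_a, conversions_b, n_b):
    """Rough one-sided two-proportion z-test (unpooled standard error):
    how confident can you be that B's true rate beats A's?"""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_b - p_a) / se
    return NormalDist().cdf(z)

print(f"{one_sided_confidence(5, 20, 10, 20):.0%}")  # ~95%: just clears the conventional bar
print(f"{one_sided_confidence(5, 20, 7, 20):.0%}")   # ~76%: not 'significant', still informative
```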

You can’t always be perfectly ‘data-driven’ but you can be as data-driven as possible

Running a startup is all about trade-offs. A/B testing in search of being ‘data-driven’ is an area where you have to make sure you don’t let perfect be the enemy of the good. Sure, you might not be able to run the perfect experiment.

But if you’re clued in to the basics of how statistical significance and sample size work, you can at least steer your experiments toward more meaningful outcomes. Ultimately, this is a game that rewards good intuition and good use of data, and there’s no substitute for either.

Steve Hind is a Strategist at Qwilr

Steve Hind is a strategist. He blogs (sometimes) at The Hindsite Blog and has economics and law degrees from the University of Sydney. He has won the World Universities Debating Championship, and the World Schools Debating Championship, as both a student and as coach of the Australian team.