Experimentation requires precision and a thorough understanding of the process. When a test isn't executed correctly, the fallout can be significant: your proposed changes may not even be visible to the variation group, the wrong conversion event may be selected in the experiment setup, or the two groups' experiences may differ in ways you didn't intend. Such oversights either force you to restart the experiment, wasting time and resources, or, worse, lead you to make decisions based on misleading results. A poorly designed test can be more harmful than not testing at all. To keep your experimentation on track and its insights reliable, it's essential to adhere to certain best practices. Here, we'll walk you through the critical aspects of sound experiment design.
Randomization is central to ensuring that observed differences between groups arise from the changes being tested, rather than other factors. It’s the heart of making fair comparisons. Yet, achieving proper randomization isn't always straightforward. Custom solutions can inadvertently introduce biases, muddying the waters of your results. That's why we advocate for specialized A/B testing tools—they not only facilitate a bias-free setup but also make data analysis easier.
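To make this concrete, here is a minimal sketch of the deterministic, hash-based bucketing that dedicated tools typically rely on. This is our own illustration in Python; the function name and salt scheme are hypothetical, not any particular tool's API.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign a user to a variant.

    Hashing the user ID together with an experiment-specific salt gives
    each user a stable bucket for this experiment while keeping
    assignments independent across concurrent experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always lands in the same group for a given test.
print(assign_variant("user-42", "checkout-redesign"))
```

Because the assignment is deterministic, a returning visitor always sees the same experience, and salting the hash with the experiment name keeps group membership uncorrelated across tests running at the same time.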
Make sure the sample size is big enough
Before running an experiment, calculate the number of participants you'll need to detect a meaningful difference between groups. This keeps you from drawing conclusions prematurely the moment a testing tool first reports statistical significance.
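As a sketch of what that calculation looks like, here is a pre-test power analysis in Python with statsmodels. The inputs are placeholders, a 5% baseline conversion rate and a one-percentage-point minimum detectable effect, so substitute your own numbers.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Placeholder inputs: substitute your own baseline and the smallest
# lift you would act on.
baseline = 0.05   # current conversion rate (5%)
mde = 0.01        # minimum detectable effect (+1 percentage point)

effect_size = proportion_effectsize(baseline + mde, baseline)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,    # accepted false-positive rate
    power=0.8,     # 80% chance of detecting a true effect of this size
    ratio=1.0,     # equally sized control and variation groups
)
print(f"Visitors needed per group: {n_per_group:.0f}")
```

With these placeholder numbers the requirement comes out to roughly 4,000 visitors per group; chasing smaller lifts pushes that figure up quickly.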
Statistical significance versus validity
Statistical significance tells you how unlikely it is that an observed difference arose from chance alone, but that signal is only trustworthy if your sample size is large enough.
Once a testing tool reports 95% statistical significance (or higher), that result means little if your sample is too small. Reaching significance is not a stopping rule for a test.
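To see why, consider this small simulation, a sketch in Python with NumPy and statsmodels using illustrative traffic numbers. It runs A/A tests, where both groups see the identical experience, "peeks" at the results after every 1,000 visitors, and stops at the first significant reading.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
p_true = 0.05      # both groups convert at 5%: there is no real difference
max_n = 20_000     # visitors per group if the test runs to completion
peeks = range(1_000, max_n + 1, 1_000)  # check after every 1,000 visitors
runs = 1_000
stopped_early = 0

for _ in range(runs):
    a = rng.random(max_n) < p_true   # simulated conversions, group A
    b = rng.random(max_n) < p_true   # simulated conversions, group B
    for n in peeks:
        _, p_value = proportions_ztest([a[:n].sum(), b[:n].sum()], [n, n])
        if p_value < 0.05:           # a tool would report "95% significance"
            stopped_early += 1
            break

print(f"A/A tests declared 'significant' at some peek: {stopped_early / runs:.0%}")
```

Even though there is no real difference to find, a substantial share of these A/A tests hits "significance" at some peek, well above the nominal 5%. That is exactly why you commit to a sample size up front instead of stopping the moment the needle crosses the line.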
Run tests for longer
Even for high-traffic websites, racking up lots of visitors quickly doesn't capture the whole story. A test that runs only briefly can miss patterns like weekday versus weekend behavior, shifts in traffic sources, or seasonal effects.
A good test needs two things: enough visitors and enough time. In practice this usually means running for 2 to 4 weeks, so you capture a complete, accurate picture rather than a quick glimpse.
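Here is a back-of-envelope sketch of turning a required sample size into a run time; the traffic and sample-size figures are placeholders, not recommendations.

```python
import math

# Placeholder inputs: substitute your own figures.
n_per_group = 4_000     # e.g. from a power calculation like the one above
groups = 2              # control plus one variation
daily_visitors = 500    # eligible traffic entering the experiment per day

days_needed = math.ceil(n_per_group * groups / daily_visitors)
# Round up to whole weeks so every weekday/weekend cycle is covered,
# and keep a two-week floor even when traffic would fill the sample sooner.
weeks_needed = max(math.ceil(days_needed / 7), 2)
print(f"Run for at least {weeks_needed} weeks ({days_needed} days of traffic)")
```

With these numbers the test needs 16 days of traffic, so you'd run it for three full weeks.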
Also, as far as possible, run tests during periods when no other major changes are happening that could influence user behavior.
Statistical Confidence: Is 95% the Golden Number?