A/B testing: a complete guide
How to design, run and interpret tests that reliably improve your conversions
A/B testing is the primary conversion rate optimisation (CRO) tool for validating hypotheses with real data. It involves showing two versions of a page or element to randomly assigned user groups and measuring which one performs better.
But a poorly designed A/B test can be worse than not testing at all: it generates false positives, wrong decisions and a false sense of certainty. This guide covers the entire process: from hypothesis formulation to correct result interpretation.
How to formulate testable hypotheses
An A/B test without a hypothesis is an experiment without direction. The hypothesis should follow the structure: "If we change [variable], we expect [measurable outcome] because [data-based reason]." The reason is critical: without it, you learn nothing from the test, whether it wins or loses.
Hypotheses come from prior research: a heatmap showing nobody reaches the CTA, a recording where users cannot find the form, or an analytics insight revealing 70% drop-off at a checkout step. Without research, hypotheses are guesses.
- Structure: "If [change], then [outcome] because [evidence]"
- Based on qualitative or quantitative data, not opinions
- One variable per test to isolate the effect
- Measurable outcome with a primary metric and guardrail metrics
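One lightweight way to enforce this structure is to document each hypothesis as a small data object before the test is built. The sketch below is a hypothetical Python example; the field names and values are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class TestHypothesis:
    change: str                # "If we change [variable]..."
    expected_outcome: str      # "...we expect [measurable outcome]..."
    evidence: str              # "...because [data-based reason]"
    primary_metric: str
    guardrail_metrics: list[str] = field(default_factory=list)

# Hypothetical example tied to a heatmap finding
h = TestHypothesis(
    change="Move the checkout CTA above the fold on mobile",
    expected_outcome="More clicks on the CTA",
    evidence="Heatmaps show most mobile users never scroll to the current CTA",
    primary_metric="cta_click_rate",
    guardrail_metrics=["checkout_completion_rate", "refund_rate"],
)
```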
Calculating the required sample size
The sample size is the number of visits each variant needs before the result can be trusted. It depends on your current conversion rate, the minimum detectable effect (the smallest improvement you want to be able to detect), the significance level (usually 95%) and the statistical power (usually 80%).
A sample size calculator (Optimizely, Evan Miller, AB Test Guide) tells you how many visits you need per variant. If the answer is 50,000 and your site gets 2,000 weekly visits, the test will take 25 weeks, which is likely not viable. Adjust the minimum detectable effect or find intermediate metrics with higher volume.
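The same calculation those calculators perform can be reproduced in a few lines. The sketch below uses statsmodels and assumes hypothetical inputs: a 3% baseline conversion rate, a 10% relative minimum detectable effect, 95% significance and 80% power.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03                      # hypothetical current conversion rate
mde_relative = 0.10                  # smallest relative lift worth detecting
target = baseline * (1 + mde_relative)

# Cohen's h effect size for two proportions, then solve for visits per variant
effect = proportion_effectsize(target, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Visits needed per variant: {round(n_per_variant):,}")
```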
Statistical significance: what it is and what it is not
Statistical significance at the usual 95% threshold means that, if there were no real difference between the variants, you would see a result this extreme less than 5% of the time. It does not mean the winner is correct with 95% certainty, nor that there is a 95% probability the effect is real.
The most serious error is stopping a test before reaching the required sample size because "it already looks like a winner." Results fluctuate enormously in the first hours and days. A prematurely stopped test has a false positive probability far higher than the nominal 5%.
- Do not stop the test before reaching the calculated sample size
- A test at 90% significance is not the same as one at 95%
- Consider using sequential testing if you need faster decisions
- Look at the confidence interval, not just the p-value
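To make the last point concrete, here is a minimal sketch with hypothetical numbers, using statsmodels, that reports both the p-value and the confidence interval for the difference between variants once the planned sample size has been reached.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

# Hypothetical results: [control, variant]
conversions = np.array([480, 540])
visitors = np.array([16000, 16000])

# Two-proportion z-test plus a confidence interval for the lift (variant - control)
stat, p_value = proportions_ztest(conversions, visitors)
low, high = confint_proportions_2indep(
    conversions[1], visitors[1], conversions[0], visitors[0],
    method="wald", compare="diff",
)
print(f"p-value: {p_value:.4f}")
print(f"95% CI for the lift: [{low:.4%}, {high:.4%}]")
```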
A/B testing tools
Optimizely and VWO are the reference platforms for A/B testing, with visual editors that let you create variants without code. Google Optimize was discontinued in 2023; GA4 has no native testing feature, but it integrates with platforms such as Optimizely and VWO so test data flows into your analytics.
For technical teams, tools like LaunchDarkly, Statsig or GrowthBook offer feature flags with integrated A/B testing, ideal for testing product changes beyond interface tweaks. The choice depends on your tech stack and the volume of tests you plan to run.
- Optimizely: the enterprise reference, visual editor + server-side
- VWO: powerful alternative with integrated heatmaps and recordings
- GrowthBook: open source, ideal for technical teams
- Statsig: feature flags + experimentation with advanced statistical analysis
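Under the hood, feature-flag tools like these typically assign variants with deterministic hashing, so the same user always gets the same experience. The snippet below is a generic illustration of that idea, not any specific vendor's SDK.

```python
import hashlib

def assign_variant(user_id: str, experiment_key: str,
                   variants=("control", "treatment")) -> str:
    """Deterministic bucketing: the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF      # uniform value in [0, 1]
    index = int(bucket * len(variants))
    return variants[min(index, len(variants) - 1)]

print(assign_variant("user-42", "checkout-cta-test"))
```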
Common A/B testing mistakes
Stopping the test early is the most frequent and most dangerous mistake. The second is lacking a clear hypothesis: testing for the sake of testing generates shallow learnings and wastes valuable traffic.
Other mistakes include: changing multiple variables at once (you cannot tell what caused the result), ignoring seasonality (a test that starts on Black Friday and ends afterwards is unreliable) and not segmenting results (the overall winning variant may be the loser on mobile).
- Stopping prematurely due to promising initial results
- Changing multiple variables in a single variant
- Not accounting for seasonality or external events
- Not segmenting results by device, source or audience
- Declaring a winner without sufficient statistical significance
- Not documenting learnings from losing tests
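Segmenting results does not require anything sophisticated. A minimal sketch with pandas and illustrative dummy data could look like this, comparing conversion rates per variant overall and broken down by device.

```python
import pandas as pd

# Illustrative per-visit log: variant, device and whether the visit converted
df = pd.DataFrame({
    "variant":   ["A", "A", "B", "B", "A", "B"] * 1000,
    "device":    ["mobile", "desktop"] * 3000,
    "converted": [0, 1, 1, 0, 0, 1] * 1000,
})

overall = df.groupby("variant")["converted"].mean()
by_device = df.groupby(["variant", "device"])["converted"].mean().unstack()
print(overall, by_device, sep="\n\n")
```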
Beyond A/B: multivariate and bandits
Multivariate testing (MVT) tests multiple combinations of changes simultaneously: for example, 3 headlines × 2 CTAs = 6 variants. It requires much more traffic but identifies the optimal combination of elements.
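A quick way to see how the variant count grows is to enumerate the combinations. The sketch below uses illustrative copy, not real test content.

```python
from itertools import product

headlines = ["Headline 1", "Headline 2", "Headline 3"]
ctas = ["Start free trial", "Book a demo"]

# Every combination becomes one cell to test: 3 x 2 = 6 variants
for i, (headline, cta) in enumerate(product(headlines, ctas), start=1):
    print(f"Variant {i}: {headline!r} + {cta!r}")
```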
Multi-armed bandits allocate more traffic to the winning variant as data accumulates, maximising conversions during the test instead of waiting until the end. They are useful when the opportunity cost of showing the losing variant is high (high-volume ecommerce).
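A common bandit approach is Beta-Bernoulli Thompson sampling. The simulation below is a minimal sketch with made-up conversion rates, just to show how traffic gradually shifts towards the stronger variant.

```python
import random

# Beta-Bernoulli Thompson sampling over two variants (rates are made up)
successes = {"A": 0, "B": 0}
failures = {"A": 0, "B": 0}
true_rates = {"A": 0.030, "B": 0.036}   # unknown in a real test; simulated here

for _ in range(20000):
    # Draw a plausible conversion rate for each variant from its posterior,
    # then serve the variant with the highest draw
    samples = {v: random.betavariate(successes[v] + 1, failures[v] + 1)
               for v in successes}
    chosen = max(samples, key=samples.get)
    if random.random() < true_rates[chosen]:
        successes[chosen] += 1
    else:
        failures[chosen] += 1

print("Traffic served:", {v: successes[v] + failures[v] for v in successes})
```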
Key Takeaways
- Every test hypothesis should be grounded in prior research data
- Calculate sample size beforehand and do not stop the test before reaching it
- Statistical significance does not guarantee correctness, only that the result would be unlikely if there were no real difference
- Document both winning and losing tests to accumulate learnings
- Segment results: an overall winner may be a loser in a key segment
Want a testing programme that delivers results?
We design and run A/B testing programmes with grounded hypotheses, rigorous analysis and actionable learnings.