A/B testing: a complete guide
How to design, run and interpret tests that reliably improve your conversions
A/B testing is the primary conversion rate optimisation (CRO) tool for validating hypotheses with real data. It involves showing two versions of a page or element to randomly assigned user groups and measuring which one performs better.
But a poorly designed A/B test can be worse than not testing at all: it generates false positives, wrong decisions and a false sense of certainty. This guide covers the entire process: from hypothesis formulation to correct result interpretation.
How to formulate testable hypotheses
An A/B test without a hypothesis is an experiment without direction. The hypothesis should follow the structure: "If we change [variable], we expect [measurable outcome] because [data-based reason]." The reason is critical: without it, you learn nothing from the test, whether it wins or loses.
Hypotheses come from prior research: a heatmap showing nobody reaches the CTA, a recording where users cannot find the form, or an analytics insight revealing 70% drop-off at a checkout step. Without research, hypotheses are guesses.
- Structure: "If [change], then [outcome] because [evidence]"
- Based on qualitative or quantitative data, not opinions
- One variable per test to isolate the effect
- Measurable outcome with a primary metric and guardrail metrics
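One lightweight way to enforce this structure is to document each hypothesis as a small data object before the test is built. The sketch below is a hypothetical Python example; the field names and values are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class TestHypothesis:
    change: str                # "If we change [variable]..."
    expected_outcome: str      # "...we expect [measurable outcome]..."
    evidence: str              # "...because [data-based reason]"
    primary_metric: str
    guardrail_metrics: list[str] = field(default_factory=list)

# Hypothetical example tied to a heatmap finding
h = TestHypothesis(
    change="Move the checkout CTA above the fold on mobile",
    expected_outcome="More clicks on the CTA",
    evidence="Heatmaps show most mobile users never scroll to the current CTA",
    primary_metric="cta_click_rate",
    guardrail_metrics=["checkout_completion_rate", "refund_rate"],
)
```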
Calculating the required sample size
The sample size is the number of visits each variant needs before the result can be trusted. It depends on your current conversion rate, the minimum detectable effect (the smallest improvement you want to be able to detect), the significance level (usually 95%) and the statistical power (usually 80%).
A sample size calculator (Optimizely, Evan Miller, AB Test Guide) tells you how many visits you need per variant. If the answer is 50,000 and your site gets 2,000 weekly visits, the test will take 25 weeks, which is likely not viable. Adjust the minimum detectable effect or find intermediate metrics with higher volume.
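The same calculation those calculators perform can be reproduced in a few lines. The sketch below uses statsmodels and assumes hypothetical inputs: a 3% baseline conversion rate, a 10% relative minimum detectable effect, 95% significance and 80% power.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03                      # hypothetical current conversion rate
mde_relative = 0.10                  # smallest relative lift worth detecting
target = baseline * (1 + mde_relative)

# Cohen's h effect size for two proportions, then solve for visits per variant
effect = proportion_effectsize(target, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Visits needed per variant: {round(n_per_variant):,}")
```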
Statistical significance: what it is and what it is not
Statistical significance at the usual 95% threshold means that, if there were no real difference between the variants, you would see a result this extreme less than 5% of the time. It does not mean the winner is correct with 95% certainty, nor that there is a 95% probability the effect is real.
The most serious error is stopping a test before reaching the required sample size because "it already looks like a winner." Results fluctuate enormously in the first hours and days. A prematurely stopped test has a false positive probability far higher than the nominal 5%.
- Do not stop the test before reaching the calculated sample size
- A test at 90% significance is not the same as one at 95%
- Consider using sequential testing if you need faster decisions
- Look at the confidence interval, not just the p-value
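To make the last point concrete, here is a minimal sketch with hypothetical numbers, using statsmodels, that reports both the p-value and the confidence interval for the difference between variants once the planned sample size has been reached.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

# Hypothetical results: [control, variant]
conversions = np.array([480, 540])
visitors = np.array([16000, 16000])

# Two-proportion z-test plus a confidence interval for the lift (variant - control)
stat, p_value = proportions_ztest(conversions, visitors)
low, high = confint_proportions_2indep(
    conversions[1], visitors[1], conversions[0], visitors[0],
    method="wald", compare="diff",
)
print(f"p-value: {p_value:.4f}")
print(f"95% CI for the lift: [{low:.4%}, {high:.4%}]")
```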
A/B testing tools
Optimizely and VWO are the reference platforms for A/B testing, with visual editors that let you create variants without code. Google Optimize was discontinued in 2023; GA4 has no native testing feature, but it integrates with platforms such as Optimizely and VWO so test data flows into your analytics.
For technical teams, tools like LaunchDarkly, Statsig or GrowthBook offer feature flags with integrated A/B testing, ideal for testing product changes beyond interface tweaks. The choice depends on your tech stack and the volume of tests you plan to run.
- Optimizely: the enterprise reference, visual editor + server-side
- VWO: powerful alternative with integrated heatmaps and recordings
- GrowthBook: open source, ideal for technical teams
- Statsig: feature flags + experimentation with advanced statistical analysis
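Under the hood, feature-flag tools like these typically assign variants with deterministic hashing, so the same user always gets the same experience. The snippet below is a generic illustration of that idea, not any specific vendor's SDK.

```python
import hashlib

def assign_variant(user_id: str, experiment_key: str,
                   variants=("control", "treatment")) -> str:
    """Deterministic bucketing: the same user always lands in the same variant."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF      # uniform value in [0, 1]
    index = int(bucket * len(variants))
    return variants[min(index, len(variants) - 1)]

print(assign_variant("user-42", "checkout-cta-test"))
```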
Common A/B testing mistakes
Stopping the test early is the most frequent and most dangerous mistake. The second is lacking a clear hypothesis: testing for the sake of testing generates shallow learnings and wastes valuable traffic.
Other mistakes include: changing multiple variables at once (you cannot tell what caused the result), ignoring seasonality (a test that starts on Black Friday and ends afterwards is unreliable) and not segmenting results (the overall winning variant may be the loser on mobile).
- Stopping prematurely due to promising initial results
- Changing multiple variables in a single variant
- Not accounting for seasonality or external events
- Not segmenting results by device, source or audience
- Declaring a winner without sufficient statistical significance
- Not documenting learnings from losing tests
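Segmenting results does not require anything sophisticated. A minimal sketch with pandas and illustrative dummy data could look like this, comparing conversion rates per variant overall and broken down by device.

```python
import pandas as pd

# Illustrative per-visit log: variant, device and whether the visit converted
df = pd.DataFrame({
    "variant":   ["A", "A", "B", "B", "A", "B"] * 1000,
    "device":    ["mobile", "desktop"] * 3000,
    "converted": [0, 1, 1, 0, 0, 1] * 1000,
})

overall = df.groupby("variant")["converted"].mean()
by_device = df.groupby(["variant", "device"])["converted"].mean().unstack()
print(overall, by_device, sep="\n\n")
```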
Beyond A/B: multivariate and bandits
Multivariate testing (MVT) tests multiple combinations of changes simultaneously: for example, 3 headlines × 2 CTAs = 6 variants. It requires much more traffic but identifies the optimal combination of elements.
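A quick way to see how the variant count grows is to enumerate the combinations. The sketch below uses illustrative copy, not real test content.

```python
from itertools import product

headlines = ["Headline 1", "Headline 2", "Headline 3"]
ctas = ["Start free trial", "Book a demo"]

# Every combination becomes one cell to test: 3 x 2 = 6 variants
for i, (headline, cta) in enumerate(product(headlines, ctas), start=1):
    print(f"Variant {i}: {headline!r} + {cta!r}")
```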
Multi-armed bandits allocate more traffic to the winning variant as data accumulates, maximising conversions during the test instead of waiting until the end. They are useful when the opportunity cost of showing the losing variant is high (high-volume ecommerce).
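A common bandit approach is Beta-Bernoulli Thompson sampling. The simulation below is a minimal sketch with made-up conversion rates, just to show how traffic gradually shifts towards the stronger variant.

```python
import random

# Beta-Bernoulli Thompson sampling over two variants (rates are made up)
successes = {"A": 0, "B": 0}
failures = {"A": 0, "B": 0}
true_rates = {"A": 0.030, "B": 0.036}   # unknown in a real test; simulated here

for _ in range(20000):
    # Draw a plausible conversion rate for each variant from its posterior,
    # then serve the variant with the highest draw
    samples = {v: random.betavariate(successes[v] + 1, failures[v] + 1)
               for v in successes}
    chosen = max(samples, key=samples.get)
    if random.random() < true_rates[chosen]:
        successes[chosen] += 1
    else:
        failures[chosen] += 1

print("Traffic served:", {v: successes[v] + failures[v] for v in successes})
```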
Key Takeaways
- Every test hypothesis should be grounded in prior research data
- Calculate sample size beforehand and do not stop the test before reaching it
- Statistical significance does not guarantee correctness, only that the result would be unlikely if there were no real difference
- Document both winning and losing tests to accumulate learnings
- Segment results: an overall winner may be a loser in a key segment
Want a testing programme that delivers results?
We design and run A/B testing programmes with grounded hypotheses, rigorous analysis and actionable learnings.