A/B Test Statistical Significance: What It Really Means (And When to Stop a Test)
By GenAlpha Tools Editorial
Statistical significance is the most misunderstood concept in conversion optimization. Most CRO dashboards display "95% confidence" the moment it appears, and most teams declare a winner the same day. That habit is responsible for a large share of the "winning" experiments that fail to replicate when shipped to 100% of traffic.
What Significance Actually Tells You
A p-value of 0.05 means: if there were no real difference between A and B, there's a 5% chance you'd see a result at least this extreme by random luck. It does not mean "B is 95% likely to win." That's the most common misinterpretation.
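To make that concrete, here is a minimal sketch of the pooled two-proportion z-test that most dashboards run under the hood. The conversion counts are hypothetical, chosen only to illustrate the calculation:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * norm.sf(abs(z))                         # two-sided tail area

# Hypothetical counts: 400/10,000 conversions (A) vs 460/10,000 (B)
p = two_proportion_p_value(400, 10_000, 460, 10_000)   # ≈ 0.037
```

A p of ≈0.037 clears the 0.05 bar, but per the definition above, all it says is that a gap this large would arise by luck under "no difference" less than 5% of the time.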
The Two Errors You’re Trading Off
- Type I (false positive): declaring a winner when none exists. Controlled by your significance level (α, usually 0.05).
- Type II (false negative): missing a real winner. Controlled by statistical power (1−β, usually 0.80).
Why Peeking Is Deadly
Checking a fixed-horizon test daily and stopping the moment p < 0.05 inflates your false positive rate from the nominal 5% to 20% or more. To peek safely you need methods designed for it, such as sequential testing (e.g. mSPRT) or Bayesian decision rules; platforms such as Optimizely and Statsig now ship these by default.
How to Plan a Test Properly
- Decide your minimum detectable effect (MDE) — the smallest lift worth shipping. Often 2–5% relative.
- Calculate the required sample size using a power calculator. Need: baseline conversion rate, MDE, α=0.05, power=0.80.
- Run the test for at least one full business cycle (typically 14 days, covering two full weekly cycles) and until the required sample size is reached, whichever is longer.
- Don’t peek for decisions until then.
- Use our A/B Test Significance Calculator to compute results.
Worked Example
Baseline checkout conversion: 4.0%. You want to detect a 5% relative lift (so 4.2%) at 95% confidence and 80% power. Required sample size: roughly 154,000 visitors per variant. At 5,000 daily visits per variant, that's about a 31-day test, comfortably past the 14-day minimum, so weekly seasonality is covered. Small lifts on low baseline rates are expensive: halving the MDE roughly quadruples the required sample.
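The power calculation can be sketched with the standard normal-approximation formula for two proportions (unpooled variance; different calculators use slightly different approximations, so expect small discrepancies):

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test,
    using the unpooled normal-approximation formula."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)                 # rate under the MDE
    z_alpha = norm.ppf(1 - alpha / 2)                  # ≈ 1.96 for alpha=0.05
    z_beta = norm.ppf(power)                           # ≈ 0.84 for power=0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

n = sample_size_per_variant(0.04, 0.05)   # ≈ 154,000 per variant
```

Note that the denominator is the squared absolute difference (0.002 here), which is why a small relative MDE on a low baseline rate drives the sample size up so sharply.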
When You Can Stop Early
- You’re using a Bayesian or sequential testing tool that adjusts for peeking.
- The variant is dramatically harming key metrics (kill switch — different rule).
- You’ve collected the planned sample and covered a full weekly cycle.
Sanity-Check the Result
Before shipping, run an SRM (Sample Ratio Mismatch) check — if your 50/50 split is actually 49.2/50.8 with high traffic, your test is broken and the result is invalid. Also segment by device and traffic source; “winners” that only work on desktop iOS Safari are a red flag.
FAQs
Is 90% confidence ever okay? For low-risk UI tweaks, yes. For pricing, checkout, or strategic changes, stick to 95% or higher.
Why did my “winner” lose in production? Usually one of: peeking, novelty effect, segment-specific lift, or insufficient power producing an inflated effect size.
Bayesian or frequentist? For most teams Bayesian is more intuitive (“probability B beats A: 97%”) and handles peeking naturally. Pick one and stick with it.
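The Bayesian "probability B beats A" number can be sketched with the textbook Beta-Binomial model: put independent Beta(1, 1) priors on each conversion rate, then estimate P(rate_B > rate_A) by Monte Carlo. The conversion counts here are hypothetical:

```python
import numpy as np

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, sims=200_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1, 1) priors on each arm's conversion rate."""
    rng = np.random.default_rng(seed)
    # Posteriors: Beta(1 + conversions, 1 + non-conversions)
    a = rng.beta(1 + conv_a, 1 + n_a - conv_a, sims)
    b = rng.beta(1 + conv_b, 1 + n_b - conv_b, sims)
    return float((b > a).mean())

# Hypothetical counts: 400/10,000 (A) vs 460/10,000 (B)
p_beats = prob_b_beats_a(400, 10_000, 460, 10_000)
```

The output reads directly as "probability B beats A," which is the statement stakeholders usually think a p-value is making. Production Bayesian platforms add explicit decision rules (e.g. expected-loss thresholds) on top of this posterior; the sketch above is only the core comparison.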