A/B Test Statistical Significance: What It Really Means (And When to Stop a Test)

Statistical significance is the most misunderstood concept in conversion optimization. Most CRO dashboards flash “95% confidence” the moment the threshold is crossed, and most teams declare a winner that same day. That habit accounts for a large share of the “winning” experiments that fail to replicate once shipped to 100% of traffic.

What Significance Actually Tells You

A p-value of 0.05 means: if there were no real difference between A and B, there’s a 5% chance you’d see a result at least this extreme by random luck alone. It does not mean “B is 95% likely to win.” That’s the most common misinterpretation.
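
Here’s that definition in code, as a minimal sketch using statsmodels’ two-proportion z-test (the conversion counts are made-up illustration numbers):

```python
# Two-proportion z-test: the p-value answers "if A and B truly convert at
# the same rate, how often would a gap at least this large appear by chance?"
from statsmodels.stats.proportion import proportions_ztest

conversions = [620, 680]        # conversions in A and B (made-up)
visitors = [15_000, 15_000]     # visitors per variant (made-up)

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
# p of roughly 0.09 here means: under "no real difference", a gap this large
# shows up about 9% of the time -- NOT "B has a 91% chance of being better".
```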

The Two Errors You’re Trading Off

Every decision rule trades off two mistakes. A Type I error (false positive) is shipping a “winner” when there is no real difference; α = 0.05 caps that risk at 5% per test. A Type II error (false negative) is missing a lift that is really there; power = 0.80 means you accept a 20% chance of that for effects at your MDE. Tightening one without adding traffic loosens the other, which is why both values are fixed before the test starts.

Why Peeking Is Deadly

Checking a fixed-horizon test daily and stopping the moment p < 0.05 inflates your false positive rate from the nominal 5% to 20% or more. To peek safely you need sequential testing methods (e.g. mSPRT or a Bayesian decision rule); platforms such as Optimizely and Statsig now ship these by default.
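
You can verify the inflation yourself. Below is a small simulation (traffic numbers are made-up) of an A/A test, where no real difference exists by construction, peeked at daily:

```python
# A/A test: both variants convert at the same true rate, so any "winner"
# is a false positive. We peek daily and stop at the first |z| > 1.96.
import numpy as np

rng = np.random.default_rng(0)
DAYS, DAILY_N, RATE, TRIALS = 14, 1000, 0.04, 2000

false_positives = 0
for _ in range(TRIALS):
    a = rng.binomial(DAILY_N, RATE, DAYS).cumsum()   # cumulative conversions, A
    b = rng.binomial(DAILY_N, RATE, DAYS).cumsum()   # cumulative conversions, B
    n = DAILY_N * np.arange(1, DAYS + 1)             # cumulative visitors per variant
    for day in range(DAYS):
        p_pool = (a[day] + b[day]) / (2 * n[day])
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n[day])
        z = (a[day] / n[day] - b[day] / n[day]) / se
        if abs(z) > 1.96:        # "significant!" -- stop the test and ship
            false_positives += 1
            break

print(f"false positive rate with daily peeking: {false_positives / TRIALS:.1%}")
# A single look on day 14 would give ~5%; daily peeking lands near 20%.
```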

How to Plan a Test Properly

  1. Decide your minimum detectable effect (MDE) — the smallest lift worth shipping. Often 2–5% relative.
  2. Calculate the required sample size using a power calculator. Need: baseline conversion rate, MDE, α=0.05, power=0.80.
  3. Run the test for at least one full business cycle (typically 14 days) and until sample size is reached — whichever is longer.
  4. Don’t peek for decisions until then.
  5. Use our A/B Test Significance Calculator to compute results.

Worked Example

Baseline checkout conversion: 4.0%. You want to detect a 5% relative lift (4.0% → 4.2%) at 95% confidence and 80% power. Required sample size: ~154,000 visitors per variant. At 5,000 daily visits per variant that’s about a 31-day test, which already clears the 14-day weekly-seasonality minimum. (Small effects on small baselines are expensive: halving the MDE roughly quadruples the required sample.)
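
That figure comes from the standard two-sample power calculation; here is a quick sketch reproducing it with statsmodels:

```python
# Sample size for: baseline 4.0%, 5% relative MDE, alpha 0.05, power 0.80.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.040
variant = baseline * 1.05            # 5% relative lift -> 4.2%

effect = abs(proportion_effectsize(baseline, variant))   # Cohen's h
n = NormalIndPower().solve_power(effect_size=effect,
                                 alpha=0.05, power=0.80,
                                 alternative="two-sided")
print(f"~{n:,.0f} visitors per variant")                  # ~154,000
print(f"~{n / 5000:.0f} days at 5,000 daily visits per variant")  # ~31
```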

When You Can Stop Early

Only when you planned for it. Sequential methods (mSPRT, group-sequential boundaries, or a Bayesian rule with a pre-registered decision threshold) are designed to be monitored continuously, so you can stop the moment their criterion is met. On a fixed-horizon test, stopping early because the numbers look good is just peeking by another name.

Sanity-Check the Result

Before shipping, run an SRM (Sample Ratio Mismatch) check: if your 50/50 split actually came out 49.2/50.8 at high traffic, your assignment is broken and the result is invalid. Also segment by device and traffic source; a “winner” that only wins in one narrow segment, say iOS Safari, is a red flag.
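
The SRM check itself is a chi-square goodness-of-fit test against the intended split; a minimal sketch with hypothetical counts matching the 49.2/50.8 example above:

```python
# SRM check: with a 50/50 split, is the observed traffic imbalance
# explainable by chance? A tiny p-value means the split itself is broken.
from scipy.stats import chisquare

visitors_a, visitors_b = 98_400, 101_600   # hypothetical observed counts
total = visitors_a + visitors_b
stat, p = chisquare([visitors_a, visitors_b], f_exp=[total / 2, total / 2])

print(f"chi-square p = {p:.2e}")
if p < 0.001:   # a common SRM alarm threshold
    print("Sample ratio mismatch -- investigate before trusting results")
```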

FAQs

Is 90% confidence ever okay? For low-risk UI tweaks, yes. For pricing, checkout, or strategic changes, stick to 95% or higher.

Why did my “winner” lose in production? Usually one of four things: peeking, a novelty effect, a segment-specific lift, or an underpowered test that inflated the effect size (the winner’s curse).

Bayesian or frequentist? For most teams Bayesian is more intuitive (“probability B beats A: 97%”) and handles peeking naturally. Pick one and stick with it.
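
If you go Bayesian, that headline number is cheap to compute. A minimal sketch with flat Beta(1, 1) priors and the same made-up counts as the first example:

```python
# "Probability B beats A" under a simple Beta-Binomial model.
import numpy as np

rng = np.random.default_rng(42)
a_conv, a_n = 620, 15_000    # variant A (made-up counts)
b_conv, b_n = 680, 15_000    # variant B (made-up counts)

# Draw from each variant's posterior over its true conversion rate
samples_a = rng.beta(1 + a_conv, 1 + a_n - a_conv, 100_000)
samples_b = rng.beta(1 + b_conv, 1 + b_n - b_conv, 100_000)

print(f"P(B beats A) = {(samples_b > samples_a).mean():.1%}")
# Roughly 95% for these counts -- a direct answer to the question
# the p-value is so often mistaken for.
```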