In Part 1 we covered the foundations of A/B testing: what it is, why it matters, and how to design experiments with proper user segmentation and traffic allocation. Now we turn to the statistical machinery that makes A/B testing rigorous — how to determine sample sizes, choose the right metrics, and decide how long to run your experiment.
This is Part 2 of a 3-part series on A/B Testing:
- Part 1: Foundations & Experiment Design
- Part 2: Statistical Framework & Metrics (You are here)
- Part 3: Execution & Decision-Making
Statistical Considerations
Sample Size Determination
When A/B testing models, determining the right sample size (500 users, 1,000 users, etc.) is crucial for reliable results. The required sample size depends on several factors:
- Minimum Detectable Effect (MDE): the smallest improvement in the variant model over the control model that we want to be able to detect. For example, in classification we may want to detect a 1-2% improvement in accuracy; in ranking, a 3-5% improvement in ranking metrics.
- We frame the test with two competing hypotheses:
- Null Hypothesis: No difference exists between the two models being tested.
- Alternate Hypothesis: A difference exists between the two models being tested.
- Significance Level (\( \alpha \)):
- The significance level (\( \alpha \)) is the probability of a false positive (Type I error): rejecting the null hypothesis (and accepting the alternate hypothesis) when the null hypothesis is actually true.
- With \( \alpha = 0.05 \), there is a 5% chance that we will conclude the models are different when they are actually the same.
- In practice, we conduct a statistical test by computing a p-value: the probability of observing a difference at least as large as the one we measured, assuming the null hypothesis is true. We then compare the p-value against the cutoff \( \alpha \) (see the sketch after this list):
- If p-value < \( \alpha \): We reject the null hypothesis. We have sufficient evidence to say that the observed difference is probably not due to random chance.
- If p-value >= \( \alpha \): We fail to reject the null hypothesis. We do NOT have sufficient evidence to conclude there’s a real difference.
- Statistical Power (\(1 - \beta \)):
- \( \beta \) is the probability of a false negative (Type II error): the null hypothesis is actually false, but our test fails to reject it.
- \( 1 - \beta \) is the probability of correctly detecting a true effect. In other words, it is the probability that our A/B test will detect a statistically significant difference between the models when such a difference truly exists.
- A common choice is \( \beta = 0.2 \), which gives power \( 1 - \beta = 1 - 0.2 = 0.8 \).
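To make this decision rule concrete, here is a minimal sketch of a two-proportion z-test with statsmodels; the click and impression counts are made-up numbers for illustration, not results from a real experiment.

```python
# A minimal sketch of the p-value decision rule, using a two-proportion
# z-test on hypothetical click-through counts (numbers are illustrative).
from statsmodels.stats.proportion import proportions_ztest

clicks = [520, 570]            # conversions in control and variant
impressions = [10_000, 10_000] # users exposed to each model
alpha = 0.05

# Two-sided test of H0: the two conversion rates are equal.
z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis.")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis.")
```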
Putting It Together: The Trade-offs
These statistical parameters are interconnected. With a fixed sample size:
- Lowering \( \alpha \) (stricter significance) reduces false positives but also reduces power, meaning you may miss real improvements.
- Increasing power (\(1 - \beta \)) requires either a larger sample or a willingness to accept a higher false positive rate.
- A smaller MDE (detecting subtler differences) demands a larger sample size.
In practice, the standard starting point is \( \alpha = 0.05 \) and power = 0.80. From there, the MDE and your baseline metric variance determine the required sample size. Many online calculators and libraries (e.g., Python’s statsmodels.stats.power) can compute this for you.
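For a proportion metric such as click-through rate, a minimal sketch with statsmodels looks like the following; the baseline rate and MDE are illustrative assumptions you would replace with your own values.

```python
# A minimal sketch of a sample-size calculation for a proportion metric
# (e.g., click-through rate). Baseline rate and MDE are assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10  # assumed control conversion rate
mde = 0.01            # absolute improvement to detect (10% -> 11%)
alpha = 0.05          # significance level (Type I error rate)
power = 0.80          # 1 - beta, with beta = 0.2

# Convert the absolute difference into Cohen's h effect size.
effect_size = proportion_effectsize(baseline_rate + mde, baseline_rate)

# Required sample size per group for a two-sided, two-sample z-test.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power,
    ratio=1.0, alternative='two-sided')
print(f"Required users per variant: {int(round(n_per_group))}")
```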
Choosing Metrics
Selecting the right metrics is one of the most critical decisions in any A/B test. The wrong metric can lead you to ship a model that improves a number on a dashboard but degrades the actual user experience.
Primary vs Secondary Metrics
Every A/B test should have a single primary metric — the one metric that determines success or failure. This is sometimes called the Overall Evaluation Criterion (OEC). Having one clear primary metric prevents cherry-picking favorable results after the experiment ends.
Secondary metrics are additional measurements you track alongside the primary metric. They serve two purposes:
- Guardrail metrics ensure the new model doesn’t degrade something important (e.g., latency, error rate, revenue) even if the primary metric improves.
- Explanatory metrics help you understand why the primary metric moved. For instance, if click-through rate improved, was it because more users clicked, or because the same users clicked more often?
Business KPIs vs Model Metrics
When evaluating ML models in production, it is essential to track both business-level outcomes and model-level performance.
Business KPIs capture the real-world impact of your model:
- Revenue Metrics
- Average order value
- Revenue per user
- Customer acquisition cost
- Lifetime value
- User Behavior Metrics
- Retention rates
- Engagement levels (clicks, likes, shares)
- Conversion funnel metrics
- Time spent on platform
Model Performance Metrics capture how well the model performs its technical task:
- Classification Metrics
- Accuracy, Precision, Recall, F1 Score
- AUC-ROC and AUC-PR
- Calibration (predicted probabilities match observed rates)
- Ranking Metrics
- NDCG (Normalized Discounted Cumulative Gain)
- Mean Average Precision (MAP)
- Mean Reciprocal Rank (MRR)
A model can improve on technical metrics while hurting business KPIs (or vice versa). For example, a recommendation model with a higher NDCG might surface more relevant but less diverse content, leading to lower long-term engagement. Always ensure your primary A/B test metric aligns with what matters most to the business.
Common Metrics in ML Model Evaluation
Accuracy, Precision, Recall
These are the foundational classification metrics. Accuracy measures overall correctness, but can be misleading with imbalanced classes. Precision (of the items predicted positive, how many truly are?) and Recall (of the truly positive items, how many did we find?) offer a more nuanced view. In A/B tests, you typically track these on live traffic rather than a held-out test set, giving a true picture of production performance.
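As a quick illustration of how accuracy can mask weak precision and recall on imbalanced data, here is a minimal sketch with scikit-learn; the labels and predictions are synthetic.

```python
# A minimal sketch: accuracy looks strong while precision and recall do not.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0] * 18 + [1, 1]        # imbalanced: 18 negatives, 2 positives
y_pred = [0] * 17 + [1] + [1, 0]  # one false positive, one missed positive

print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.90 despite the misses
print("Precision:", precision_score(y_true, y_pred))  # of predicted positives, how many are real?
print("Recall   :", recall_score(y_true, y_pred))     # of real positives, how many were found?
print("F1       :", f1_score(y_true, y_pred))
```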
Response Time and Latency
A faster model isn’t always better, but a model that is too slow will hurt user experience regardless of its accuracy. Key latency metrics to track:
- Response time distribution (p50, p95, p99) — p99 matters most for the tail of the user experience (see the sketch after this list)
- Resource utilization (CPU, memory, GPU)
- Error rates and types (timeouts, exceptions, fallback triggers)
- Cache hit rates (if applicable)
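Here is a minimal sketch of summarizing request latency into p50/p95/p99 with NumPy; the latencies are synthetic stand-ins for values you would pull from your request logs.

```python
# A minimal sketch of per-variant latency percentiles. The latencies are
# synthetic; in practice they come from production request logs.
import numpy as np

rng = np.random.default_rng(42)
latencies_ms = rng.lognormal(mean=4.0, sigma=0.5, size=10_000)  # synthetic latencies in ms

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```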
User Engagement Metrics
These capture how users actually interact with your model’s outputs:
- Session duration
- Actions per session
- Return visit rate
- Feature adoption rate
Engagement metrics are powerful because they reflect user satisfaction more directly than model performance metrics. However, they can be noisy and take longer to stabilize.
Business Impact Metrics
Ultimately, models exist to serve business goals:
- Revenue per user
- Customer satisfaction scores (NPS, CSAT)
- User retention rates
- Market share changes
These are often lagging indicators — they take weeks or months to manifest. For shorter experiments, use leading indicators (engagement, conversion) that have historically correlated with these long-term outcomes.
Duration and Timing
How Long to Run the Test
A common mistake is stopping a test as soon as the p-value crosses 0.05. This is known as peeking and it inflates your false positive rate well beyond the stated \( \alpha \). Instead, determine the required duration before starting the experiment.
The minimum duration depends on:
- Required sample size — calculated from your MDE, \( \alpha \), and power (discussed above).
- Daily traffic volume — divide the required sample size by the number of daily eligible users to get the minimum number of days (a quick sketch follows this list).
- Full weekly cycle — always run for at least one full week (ideally two) to capture day-of-week effects. User behavior on Monday often differs significantly from Saturday.
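As a rough sketch of that arithmetic (the sample size and traffic numbers are illustrative assumptions):

```python
# A minimal sketch of turning required sample size into a minimum duration.
import math

required_per_variant = 14_000  # from the power calculation (assumed)
num_variants = 2               # control + one variant
daily_eligible_users = 5_000   # users entering the experiment per day (assumed)

total_required = required_per_variant * num_variants
min_days = math.ceil(total_required / daily_eligible_users)

# Round up to at least one full week to capture day-of-week effects.
duration_days = max(min_days, 7)
print(f"Run the test for at least {duration_days} days")
```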
Dealing with Seasonality
Seasonal patterns can confound your results. Strategies to handle this:
- Run control and variant simultaneously (this is fundamental to A/B testing and already handles most seasonality).
- Avoid launching during anomalous periods such as major holidays, sales events, or product launches — unless you specifically want to test performance during those periods.
- Extend duration if your test spans a seasonal boundary (e.g., starting a week before a holiday) so you capture the full cycle.
When to Stop Early
While premature stopping is dangerous, there are valid reasons to end a test early:
- Clear harm: If the variant is causing errors, crashes, or significant revenue loss, stop immediately. Guardrail metrics should trigger automatic alerts.
- Sequential testing methods: Techniques like group sequential designs or always-valid p-values allow valid early stopping with controlled error rates. These are more statistically sophisticated but increasingly common in industry.
- Practical ceiling: If you’ve already collected 10x the required sample size and results are nowhere near significance, the true effect is likely too small to matter.