In Part 1 we covered experiment design fundamentals, and in Part 2 we explored the statistical framework and metric selection. In this final part, we tackle the practical realities of running experiments — the pitfalls that can invalidate your results, the infrastructure needed to run experiments reliably, and the decision-making framework for acting on your findings.

Common Pitfalls and Solutions

Even well-designed experiments can produce misleading results if you’re not aware of these common traps.

Simpson’s Paradox

Simpson’s Paradox occurs when a trend that appears in aggregated data reverses when the data is broken down by subgroups. For example, Model B might appear worse overall, but actually outperform Model A in every individual user segment — the reversal happens because of an uneven distribution of users across segments.

Solution: Always analyze results across key segments (device type, geography, user tenure, etc.) in addition to the overall population. If segment-level results contradict the aggregate, investigate the distribution of users across segments before drawing conclusions.
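
As a minimal, hypothetical illustration (the counts below are invented for the example), Model B converts better on both mobile and desktop yet loses in aggregate, because most of its traffic lands in the low-converting desktop segment:

```python
# Hypothetical counts: (conversions, users) per model and segment.
segments = {
    "mobile":  {"A": (200, 1000), "B": (50, 200)},
    "desktop": {"A": (10, 200),   "B": (80, 1000)},
}

for model in ("A", "B"):
    for segment, counts in segments.items():
        conv, users = counts[model]
        print(f"Model {model} {segment:8s}: {conv / users:.1%}")
    total_conv = sum(c[model][0] for c in segments.values())
    total_users = sum(c[model][1] for c in segments.values())
    print(f"Model {model} overall : {total_conv / total_users:.1%}\n")
```

Model B wins each segment (25% vs 20% on mobile, 8% vs 5% on desktop) but loses overall (10.8% vs 17.5%). The segment-level view and the aggregate tell opposite stories, which is exactly the pattern the segment-level check is meant to catch.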

Multiple Testing Problem

When you test multiple metrics or run many simultaneous experiments, the probability of at least one false positive increases rapidly. With 20 independent metrics at \( \alpha = 0.05 \), you have a 64% chance of at least one false positive.

Solution:

  • Designate a single primary metric before the experiment starts.
  • Apply corrections for multiple comparisons (Bonferroni, Benjamini-Hochberg) when evaluating secondary metrics (a sketch follows this list).
  • Be skeptical of surprising wins on metrics you weren’t specifically testing.
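
For secondary metrics, a Benjamini-Hochberg correction controls the false discovery rate. A minimal sketch in plain Python, assuming `p_values` is a list of per-metric p-values (the function name is illustrative):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: which hypotheses to reject under BH-FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis whose rank is at most max_k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject

# The 64% figure above: P(at least one false positive) = 1 - (1 - 0.05) ** 20 ≈ 0.64
```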

Selection Bias

Selection bias occurs when the users in your control and variant groups differ systematically in ways that affect the outcome. This can happen through flawed randomization, self-selection, or survivorship effects (e.g., only analyzing users who completed an action, ignoring those who dropped off).

Solution:

  • Verify randomization quality by checking that pre-experiment characteristics (demographics, activity levels) are balanced across groups (see the sketch after this list).
  • Use intent-to-treat analysis: include all users who were assigned to a group, not just those who engaged.
  • Run A/A tests (same model in both groups) periodically to validate your experimentation infrastructure.
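
As a sketch of the first bullet, here is a balance check on a single pre-experiment covariate (e.g., sessions per user in the weeks before assignment). The function name and threshold are illustrative, and scipy is assumed to be available:

```python
from scipy.stats import ttest_ind

def check_covariate_balance(control_pre, variant_pre, alpha=0.01):
    """Welch's t-test on a pre-experiment covariate for the two groups.

    A very small p-value suggests the groups differed before the experiment
    even started, i.e., the randomization (or the logging) is suspect.
    """
    _, p_value = ttest_ind(control_pre, variant_pre, equal_var=False)
    return p_value >= alpha, p_value
```

In practice you would run this for several covariates and apply the same multiple-comparison caution discussed above.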

Novelty Effects

When users encounter something new, their behavior changes temporarily — either positively (excitement, curiosity) or negatively (confusion, resistance). These effects fade over time, meaning short experiments may overestimate or underestimate the true long-term impact.

Solution:

  • Run experiments long enough for novelty to wear off (typically 2-4 weeks).
  • Segment results by new vs returning users.
  • Track metrics over time within the experiment to see if the effect is stabilizing or decaying.

Sample Ratio Mismatch (SRM)

SRM occurs when the observed proportion of users in each group differs from the configured split by more than chance can explain. If you configured a 50/50 split but observe 51.5/48.5 across hundreds of thousands of users, something is wrong. SRM is a strong signal that your experiment has a technical bug that could invalidate the results.

Solution:

  • Check for SRM at the start of every experiment analysis, before looking at metrics (a chi-square check is sketched after this list).
  • Common causes include bot filtering that affects groups differently, redirects that drop users, and triggering conditions that correlate with assignment.
  • If SRM is detected, do not trust the experiment results. Debug the root cause first.
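
A sample ratio check is a one-line chi-square goodness-of-fit test. A minimal sketch, assuming scipy is available and a configured 50/50 split (function name and threshold are illustrative):

```python
from scipy.stats import chisquare

def check_srm(control_users, variant_users, expected_split=(0.5, 0.5), threshold=0.001):
    """Flag SRM when the observed split is implausible under the configured split."""
    total = control_users + variant_users
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p_value = chisquare([control_users, variant_users], f_exp=expected)
    return p_value < threshold, p_value

# Example: a "50/50" experiment that came back 51.5/48.5 over 200,000 users.
srm_detected, p = check_srm(103_000, 97_000)
print(srm_detected, p)  # True, p far below the threshold -- do not trust the results
```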

Network Effects

In systems where users interact with each other (social networks, marketplaces, messaging), treating users as independent units breaks down. A user in the control group might be influenced by changes experienced by their connections in the variant group.

Solution:

  • Use cluster-based randomization: randomize at the level of communities, regions, or social clusters rather than individual users (see the sketch after this list).
  • For marketplaces, consider randomizing by geographic region or time slot.
  • Be aware that standard confidence intervals may underestimate uncertainty when network effects are present.
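
A sketch of cluster-based assignment: hash the cluster identifier instead of the user identifier, so everyone in the same community, region, or marketplace side gets the same variant. Names and the 50/50 split are illustrative:

```python
import hashlib

def assign_by_cluster(cluster_id: str, experiment_salt: str) -> str:
    """Every user in the same cluster shares an assignment, containing spillover."""
    digest = hashlib.sha256(f"{cluster_id}:{experiment_salt}".encode()).hexdigest()
    return "variant" if int(digest, 16) % 100 < 50 else "control"

# All users mapped to the same region receive the same model.
print(assign_by_cluster("region:us-west", "marketplace_ranker_test"))
```

Note that the effective sample size is the number of clusters, not the number of users, so the analysis must account for within-cluster correlation.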

Implementation Best Practices

Setting Up Experiment Infrastructure

A reliable experimentation platform needs several core components:

  1. Randomization service: A centralized service that assigns users to experiments and persists assignments. Hash-based randomization (hashing user ID + experiment ID) is simple and deterministic, ensuring consistent assignment without storing state.

  2. Configuration management: A system to define experiments (name, traffic split, start/end dates, targeting criteria) without deploying code. This lets you launch, pause, and end experiments quickly.

  3. Feature flags: Decouple deployment from release. Deploy variant code behind feature flags so you can enable it for specific experiment groups without a full rollout.

  4. Experiment isolation: Ensure experiments don’t interfere with each other. If a user is in multiple experiments, the assignment for one should be independent of the other. Use orthogonal randomization (different hash salts per experiment) to achieve this.

Randomization Techniques

The quality of your randomization determines the validity of your experiment:

  • Hash-based assignment: Compute hash(user_id + experiment_salt) % 100 to get a bucket in [0, 99]. Assign buckets to control/variant. This is deterministic, consistent, and requires no storage (see the sketch after this list).
  • Stratified randomization: First divide users into strata (e.g., by country, platform), then randomize within each stratum. This ensures balance on known confounders.
  • Pre-experiment validation: Before launching, verify that the randomization produces balanced groups on key covariates. An A/A test is the gold standard for this.
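
Putting the first bullet into code: a deterministic, stateless bucketing function. Using a different salt per experiment makes assignments across experiments effectively independent, which is the orthogonal randomization mentioned earlier. Function and salt names are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str,
                   split=(("control", 50), ("variant", 50))) -> str:
    """Map a user to a bucket in [0, 99], then to a named variant."""
    digest = hashlib.sha256(f"{user_id}:{experiment_salt}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    upper = 0
    for name, share in split:
        upper += share
        if bucket < upper:
            return name
    return split[-1][0]  # guard against splits that don't sum to 100

# Same user, different experiments: independent assignments, no stored state.
print(assign_variant("user_42", "ranker_v2"))
print(assign_variant("user_42", "ui_refresh"))
```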

Data Collection and Logging

Robust logging is the backbone of trustworthy experiments:

  • Log the assignment event (which user was assigned to which variant) separately from outcome events. This ensures you can always reconstruct the experiment even if downstream logging fails (an example payload follows this list).
  • Include experiment metadata (experiment ID, variant ID, timestamp) in all relevant event logs.
  • Log at the most granular level possible. You can always aggregate later, but you can’t disaggregate data you never collected.
  • Implement data quality checks: monitor for missing fields, duplicate events, and unexpected values. Automated checks should run daily during an experiment.
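
As a sketch of the first bullet, an assignment event might look like the following. Field names are illustrative, not a prescribed schema:

```python
import json
import time
import uuid

# Logged once, at the moment of bucketing, independently of outcome events.
assignment_event = {
    "event_type": "experiment_assignment",
    "event_id": str(uuid.uuid4()),
    "timestamp_ms": int(time.time() * 1000),
    "user_id": "user_42",
    "experiment_id": "ranker_v2",
    "variant_id": "variant",
}
print(json.dumps(assignment_event))
```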

Monitoring and Alerting

Don’t wait until the experiment ends to look at data:

  • Set up real-time dashboards showing key metrics for each variant, updated at least daily.
  • Configure guardrail alerts that trigger if error rates spike, latency degrades beyond a threshold, or revenue drops significantly. These should be able to automatically disable the variant if thresholds are breached (a sketch follows this list).
  • Monitor sample ratio continuously to catch SRM early.
  • Track system health metrics (memory, CPU, response codes) per variant to catch infrastructure issues that might affect only one group.
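
A guardrail check can be as simple as comparing a handful of variant metrics against fixed ceilings and against control. The thresholds below are assumptions, not recommendations:

```python
# Illustrative guardrail configuration.
GUARDRAILS = {
    "error_rate": 0.02,           # absolute ceiling
    "p99_latency_ms": 800,        # absolute ceiling
    "revenue_ratio_floor": 0.98,  # variant revenue/user vs control, minimum
}

def violated_guardrails(variant, control):
    """Return the names of any breached guardrails; an empty list means healthy."""
    breaches = []
    if variant["error_rate"] > GUARDRAILS["error_rate"]:
        breaches.append("error_rate")
    if variant["p99_latency_ms"] > GUARDRAILS["p99_latency_ms"]:
        breaches.append("p99_latency_ms")
    if variant["revenue_per_user"] / control["revenue_per_user"] < GUARDRAILS["revenue_ratio_floor"]:
        breaches.append("revenue_per_user")
    return breaches
```

An alerting job would run this on fresh data and, if anything is breached, page the owner or disable the variant automatically.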

Making the Final Decision

Decision Frameworks

When the experiment concludes, you need a structured approach to interpret results and decide on next steps. A simple framework (a code sketch follows the list):

  1. Check validity: Verify no SRM, confirm the experiment ran for the planned duration, and check that guardrail metrics were not violated.
  2. Evaluate the primary metric: Is it statistically significant? Is the effect size practically meaningful?
  3. Review secondary metrics: Do guardrail metrics show any degradation? Do explanatory metrics tell a coherent story?
  4. Consider the full picture: Does the result make sense given what you know about the model change? Are there segments where the result differs?
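
The same checklist as a code sketch; the field names and the 0.05 threshold are illustrative and would come from your own experiment design:

```python
def recommend(result):
    """Turn an experiment readout into a recommendation, following the checklist."""
    # Step 1: validity -- never interpret metrics from a broken experiment.
    if result["srm_detected"] or not result["ran_planned_duration"]:
        return "invalid: debug the setup and rerun"
    # Step 2: primary metric -- statistical and practical significance.
    significant = result["primary_p_value"] < 0.05
    meaningful = result["primary_lift"] >= result["min_practical_lift"]
    # Step 3: guardrails.
    clean = not result["guardrail_violations"]
    if significant and meaningful and clean:
        return "win: proceed to gradual rollout"
    if significant and result["primary_lift"] < 0:
        return "loss: do not ship, investigate and iterate"
    if not clean:
        return "mixed: quantify the trade-off and decide deliberately"
    return "neutral: no detectable difference at this sample size"
```

Step 4, considering the full picture, stays a human judgment call and is deliberately not encoded.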

Interpreting Results

Results fall into several categories, each calling for a different action:

  • Clear win (significant improvement on primary metric, no guardrail violations): Proceed to rollout.
  • Clear loss (significant degradation on primary metric): Do not ship. Investigate why and iterate.
  • Neutral result (no significant difference): The variant is likely neither better nor worse. Consider whether the MDE was appropriate — a neutral result doesn’t mean “no difference,” it means “no detectable difference at this sample size.” You may choose to ship if the variant has other benefits (code simplicity, latency improvement) without measurable downside.
  • Mixed result (primary metric improves but a guardrail metric degrades, or vice versa): Requires judgment. Quantify the trade-off and make a deliberate decision. Document the reasoning.

Rollout Strategies

Once you’ve decided to ship, don’t flip from 10% to 100% overnight:

  1. Gradual ramp: Increase traffic to the variant incrementally (e.g., 10% → 25% → 50% → 100%) over days or weeks. Monitor metrics at each stage. This limits the blast radius if something goes wrong at scale (a sketch of a ramp schedule follows this list).
  2. Staged rollout by segment: Roll out to less critical or more resilient segments first (e.g., internal users, then a single region, then globally).
  3. Rollback plan: Always have a clear, tested rollback procedure. Know exactly how to revert to the control model and how long the rollback takes.
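
A ramp can be expressed as a small configuration plus a gate, as in this sketch. Stage sizes and hold times are assumptions, not recommendations:

```python
# Illustrative ramp schedule: traffic share and minimum days at each stage.
RAMP_STAGES = [10, 25, 50, 100]
MIN_HOLD_DAYS = {10: 3, 25: 3, 50: 4, 100: None}

def next_traffic_pct(current_pct, days_at_stage, guardrail_breaches):
    """Advance one stage only when the current stage is healthy and has aged enough."""
    if guardrail_breaches:
        return 0  # roll back to the control model
    hold = MIN_HOLD_DAYS[current_pct]
    if hold is not None and days_at_stage >= hold:
        idx = RAMP_STAGES.index(current_pct)
        return RAMP_STAGES[min(idx + 1, len(RAMP_STAGES) - 1)]
    return current_pct  # hold at the current stage
```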

Post-Deployment Monitoring

The experiment doesn’t end at rollout:

  • Continue monitoring metrics for at least 1-2 weeks after full deployment. Some effects (e.g., long-term retention changes) only manifest over time.
  • Compare against experiment predictions: Does the metric improvement at 100% traffic match what you observed during the experiment? A significant discrepancy suggests interference effects or other confounds.
  • Watch for delayed impacts: Revenue, retention, and satisfaction metrics can shift weeks after a change. Set calendar reminders to review these.
  • Document everything: Record the experiment hypothesis, design, results, decision, and post-deployment outcomes. This institutional memory is invaluable for future experiments.

Useful Resources

  • What is A/B testing?
  • Trustworthy Online Controlled Experiments (Kohavi, Tang, Xu)
  • Designing and Deploying Online Experiments (Microsoft)