From Setup to Analysis

Once your experiment groups are established and running, the focus shifts to measurement. Evaluating system-level performance requires different analytical approaches than individual campaign testing, with particular attention to cumulative effects and temporal patterns. This section covers common analytical approaches that have proven useful, though the specific methods will depend on your metrics, user behavior patterns, and business context.

Things to Look For

When examining your results, several patterns deserve extra attention:
  • Group Divergence: Watch for groups gradually drifting apart over time rather than immediate separation. System-level effects often compound, making end-period performance more informative than early results.
  • Pre-experiment Trends: Check whether groups showed similar patterns before the experiment began. Consistent historical behavior strengthens confidence that post-experiment differences are real (a quick check is sketched after this list).
  • Within-Group Outliers: Identify users with extreme behaviors in each group. A single high-value customer can distort averages. Consider whether your conclusions change if outliers are handled differently.
  • Temporal Consistency: Look for sustained differences rather than one-time spikes. A single good week doesn’t prove system superiority.
  • Cross-Group Patterns: If all groups show similar temporal patterns (everyone dips during holidays), these are likely external factors rather than treatment effects.
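
As a concrete starting point, the sketch below computes weekly group means before and after launch to check pre-experiment parity and to watch for gradual divergence. It assumes a hypothetical pandas DataFrame df with one row per user-week and columns user_id, group, week, and revenue, with group labels "treatment" and "control" and a placeholder launch date; adapt the names to your own data.

```python
# Minimal sketch: weekly group means before and after launch, used to check
# pre-experiment parity and to watch for gradual divergence.
# Assumes a hypothetical DataFrame `df` with one row per user-week and
# columns: user_id, group ("treatment"/"control"), week (datetime), revenue.
import pandas as pd

LAUNCH_DATE = pd.Timestamp("2024-01-15")  # placeholder experiment start

weekly = (
    df.groupby(["group", "week"], as_index=False)["revenue"]
      .mean()
      .sort_values("week")
)

pre = weekly[weekly["week"] < LAUNCH_DATE]
post = weekly[weekly["week"] >= LAUNCH_DATE]

# Pre-period parity: group means should be close before launch.
print(pre.groupby("group")["revenue"].mean())

# Divergence over time: gap between treatment and control in each post week.
gap = (
    post.pivot(index="week", columns="group", values="revenue")
        .assign(diff=lambda t: t["treatment"] - t["control"])
)
print(gap["diff"])
```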

Visual Analysis

Start with clear visualizations:
  • Time series showing weekly or daily averages by group
  • Confidence intervals or standard errors around estimates
  • Cumulative metrics to show total impact over time
  • Distribution plots to understand the full range of user behaviors, not just averages
Visualizations often reveal patterns that statistics might miss—sudden changes, gradual trends, or unusual volatility.
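
A minimal plotting sketch along these lines, assuming the same hypothetical user-week DataFrame df as above (columns group, week, revenue), draws one line per group with approximate 95% confidence bands:

```python
# Minimal sketch: weekly time series per group with ~95% confidence bands.
# Assumes a hypothetical DataFrame `df` with columns: group, week, revenue.
import matplotlib.pyplot as plt

weekly_stats = (
    df.groupby(["group", "week"])["revenue"]
      .agg(["mean", "sem"])
      .reset_index()
)

fig, ax = plt.subplots(figsize=(9, 4))
for name, g in weekly_stats.groupby("group"):
    ax.plot(g["week"], g["mean"], label=name)
    ax.fill_between(
        g["week"],
        g["mean"] - 1.96 * g["sem"],
        g["mean"] + 1.96 * g["sem"],
        alpha=0.2,
    )
ax.set_xlabel("week")
ax.set_ylabel("mean revenue per user")
ax.legend()
plt.show()
```

Note that standard errors of weekly means understate uncertainty when the same users appear in many weeks, so treat the bands as a visual guide rather than a formal test.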

Statistical Approaches

The right approach varies across use cases, but here is a non-exhaustive list of methods we often find useful when analyzing an experiment whose impact may build slowly over time:
  • Difference-in-Differences (DiD): Compares the change in treatment groups against the change in control. This accounts for temporal trends affecting all users and isolates the treatment effect (sketched below).
  • CUPED (Controlled-experiment Using Pre-Experiment Data): Uses pre-experiment behavior to reduce variance in estimates. Users who were highly active before the experiment tend to remain active; CUPED leverages this predictability to detect smaller effects (also sketched below).
  • Regression Analysis: Model outcomes at the user level, including treatment assignment, pre-experiment behavior, and time effects. This allows you to control for confounding factors and estimate treatment effects more precisely.
  • Synthetic Controls: Fit time-series models on pre-test data, then compare observed post-test results to predicted values. Useful when you have rich historical data and want to account for complex temporal patterns.
The choice of method depends on your data characteristics and the questions you’re asking. Consider consulting with data scientists familiar with causal inference if these methods are new to your team.
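
For illustration, here is a minimal difference-in-differences sketch using statsmodels. It assumes a hypothetical user-week DataFrame df with columns revenue, user_id, treated (1 for users in a treatment group), and post (1 for weeks after launch); it is a sketch of the general technique, not a drop-in analysis.

```python
# Minimal difference-in-differences sketch using an OLS interaction model.
# Assumes a hypothetical user-week DataFrame `df` with columns:
# revenue, user_id, treated (0/1), post (0/1).
import statsmodels.formula.api as smf

did = smf.ols("revenue ~ treated + post + treated:post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["user_id"]}
)
# The treated:post coefficient is the DiD estimate of the treatment effect.
print(did.summary().tables[1])
```

Clustering standard errors by user accounts for repeated observations of the same user across weeks.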
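
Similarly, a minimal CUPED sketch, assuming a hypothetical user-level DataFrame users with columns group ("treatment"/"control"), revenue_pre (pre-experiment), and revenue_post (during the experiment):

```python
# Minimal CUPED sketch: adjust the experiment-period metric using each user's
# pre-experiment value of the same metric, then compare groups.
# Assumes a hypothetical user-level DataFrame `users` with columns:
# group ("treatment"/"control"), revenue_pre, revenue_post.
import numpy as np
from scipy import stats

y = users["revenue_post"].to_numpy()
x = users["revenue_pre"].to_numpy()

# theta is the slope of post-period revenue on pre-period revenue; the
# adjustment removes the variance in y that pre-period behavior predicts.
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
users["revenue_cuped"] = y - theta * (x - x.mean())

treat = users.loc[users["group"] == "treatment", "revenue_cuped"]
ctrl = users.loc[users["group"] == "control", "revenue_cuped"]
print(stats.ttest_ind(treat, ctrl, equal_var=False))
```

The adjusted metric has the same expected mean as the raw one but lower variance when pre-period behavior is predictive, which is what lets smaller effects reach significance.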

Special Considerations

  • Heavy-Tailed Distributions: User behavior often follows power laws, with most users doing little and a few doing a lot. Consider winsorizing extreme values or using median-based metrics alongside means (winsorizing is sketched after this list).
  • Zero-Inflation: Many metrics are mostly zeros (purchases, for example). Standard statistical models may perform poorly. Zero-inflated models or analyzing “probability of action” separately from “amount given action” can help.
  • Multiple Testing: If you examine many metrics or time periods, some will show significance by chance. Consider corrections for multiple comparisons, or better yet, stick to your pre-registered primary metric.
  • Sample Size Imbalances: If one group is much smaller than others, it may show more variance simply due to size. Weight your analyses appropriately or ensure adequate sample sizes in all groups.
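
As an illustration of the first three points, the sketch below winsorizes a heavy-tailed metric, splits purchase rate from purchase amount, and applies a Benjamini-Hochberg correction. It assumes a hypothetical user-level DataFrame users with columns group and revenue; the winsorization limit, the example p-values, and the alpha level are placeholders, not recommendations.

```python
# Minimal sketch of three adjustments: winsorization, zero-inflation
# decomposition, and a multiple-testing correction.
# Assumes a hypothetical user-level DataFrame `users` with columns:
# group, revenue.
from scipy.stats.mstats import winsorize
from statsmodels.stats.multitest import multipletests

# Heavy tails: cap the top 1% of revenue before comparing means.
users["revenue_w"] = winsorize(users["revenue"].to_numpy(), limits=(0, 0.01))

# Zero-inflation: split "probability of purchase" from "amount given purchase".
summary = users.groupby("group").agg(
    purchase_rate=("revenue", lambda s: (s > 0).mean()),
    mean_given_purchase=("revenue", lambda s: s[s > 0].mean()),
)
print(summary)

# Multiple testing: Benjamini-Hochberg correction across secondary metrics.
# `secondary_p_values` is a hypothetical list of raw p-values.
secondary_p_values = [0.01, 0.04, 0.20, 0.03]
rejected, p_adjusted, _, _ = multipletests(
    secondary_p_values, alpha=0.05, method="fdr_bh"
)
print(rejected, p_adjusted)
```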
Your test is done! Hopefully it executed without any major mistakes. A clean experiment is a lot of work, and surprises are inevitable. Our final section lists common errors we’ve encountered while helping businesses set up the kind of system-level experiments we’ve described.