Double the Winners = Double the Winnings?

Suppose an organization runs Aampe alongside their business-as-usual (BAU) messaging in a 50/50 split. When Aampe shows strong results, there’s a natural expectation that switching the other half to Aampe should double the overall business impact. That expectation makes sense, and sometimes that’s exactly what we see. Other times, however, things work out differently. Here are some potential reasons why a full rollout may fail to replicate the early results of the experiment.

Cold Start for Personalization

Aampe agents learn individual behavior. Each new user begins in a learning phase where the agent tests timing, channels, and content combinations to discover what works. Performance is naturally lower during exploration. Users with Aampe likely went through this for several weeks before seeing results.

It’s possible that the newly added BAU users will see the same success as the original Aampe group; they simply need more time for the agents to learn.
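
As a rough illustration of why performance dips during exploration, here is a minimal epsilon-greedy sketch; it is not Aampe’s actual learning algorithm, and the send-time options and response rates are invented for the example. The agent spends its early messages trying options at random, so the observed response rate starts low and climbs as it converges on the best option.

```python
import random

# Hypothetical response rates for three send-time options.
# Purely illustrative numbers, not real Aampe data.
TRUE_RATES = {"morning": 0.02, "afternoon": 0.05, "evening": 0.10}

def simulate_user(rng, n_messages=200, eps_start=1.0, eps_end=0.05):
    """Epsilon-greedy agent for a single user: explore heavily at first,
    then exploit whichever option has responded best so far."""
    counts = {k: 0 for k in TRUE_RATES}
    wins = {k: 0 for k in TRUE_RATES}
    rewards = []
    for i in range(n_messages):
        eps = eps_start + (eps_end - eps_start) * i / n_messages  # decaying exploration
        if rng.random() < eps:
            arm = rng.choice(list(TRUE_RATES))  # explore: try a random option
        else:
            # exploit: pick the option with the best observed response rate
            arm = max(wins, key=lambda k: wins[k] / max(counts[k], 1))
        reward = 1 if rng.random() < TRUE_RATES[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
        rewards.append(reward)
    return rewards

rng = random.Random(7)
n_users = 1000
totals = [0] * 200
for _ in range(n_users):
    for i, r in enumerate(simulate_user(rng)):
        totals[i] += r

early = sum(totals[:50]) / (50 * n_users)   # learning phase
late = sum(totals[-50:]) / (50 * n_users)   # after learning
print(f"avg response rate, first 50 messages: {early:.3f}")
print(f"avg response rate, last 50 messages:  {late:.3f}")
```

In a 50/50 test, the Aampe half has already paid this learning cost; newly migrated BAU users start at the top of the same curve.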

Different Starting Points for BAU Users

Users under BAU are accustomed to a specific cadence, tone, and product focus. When these users move to Aampe, the agents must work harder to overcome those ingrained patterns. Some users will respond to new and different content; others may not budge. A long history of BAU interactions can make these users systematically different from the users Aampe worked with during the test.

Spillover Effects

In split tests, there’s a risk that the treatment group influences the control group in some way. A common scenario involves a limited resource: treatment users respond to an incentive quickly enough that the product is no longer available to the control group. This suppresses the control group’s results and exaggerates the measured difference between the two groups. Once both groups receive the same experience, that inflated difference disappears, and the overall impact looks smaller than the test suggested.
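
To make that concrete, here is a hedged sketch with invented numbers: a shared, limited stock, treatment users who act first because a well-timed message prompts an immediate purchase, and control users who arrive later to find less stock remaining. The purchase-intent rates and stock level are assumptions chosen only to show the shape of the effect.

```python
import random

def simulate(n_treatment, n_control, stock, p_treat=0.10, p_control=0.05, seed=0):
    """Hypothetical market with a shared, limited stock. Treatment users buy
    first (their message prompts an immediate response); control users arrive
    later. The intent rates p_treat and p_control are made up for illustration."""
    rng = random.Random(seed)
    sales = {"T": 0, "C": 0}
    # Treatment users attempt to buy first...
    for _ in range(n_treatment):
        if stock and rng.random() < p_treat:
            sales["T"] += 1
            stock -= 1
    # ...control users arrive later and find less stock remaining.
    for _ in range(n_control):
        if stock and rng.random() < p_control:
            sales["C"] += 1
            stock -= 1
    t_rate = sales["T"] / n_treatment if n_treatment else 0.0
    c_rate = sales["C"] / n_control if n_control else 0.0
    return t_rate, c_rate

# 50/50 split test: treatment buys up most of the stock, starving control
# and inflating the measured gap between the two groups.
t, c = simulate(n_treatment=5000, n_control=5000, stock=600)
print(f"split test   -> treatment {t:.1%}, control {c:.1%}, gap {t - c:.1%}")

# Full rollout: with everyone treated, total sales are capped by the same
# stock, so the overall rate lands well below the split-test treatment rate.
full, _ = simulate(n_treatment=10000, n_control=0, stock=600)
print(f"full rollout -> overall   {full:.1%}")
```

The split test looks strong partly because the control group was starved of inventory; at full rollout, the same constraint caps the total, and the gap the test promised never materializes.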

Seasonality and Timing

Any test occurs during specific conditions: a particular season, a set of active promotions, certain market dynamics. While the environment is the same for both groups, the relative impact of Aampe may vary under a different season, different promotions, etc.

Regression to the Mean

Smaller test groups produce noisier estimates, so they are more likely to land well above (or below) long-run averages due to chance, user composition, timing, or other factors. This is true of any test.

Full rollouts with larger groups, however, settle toward the true average, which may differ from the short-term results of a small test group.
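
One way to see this is to simulate many test groups, small and large, that all share the same underlying conversion rate and compare how far their observed rates stray from it. The 5% rate, group sizes, and group counts below are arbitrary choices for illustration.

```python
import random
import statistics

TRUE_RATE = 0.05  # hypothetical long-run conversion rate shared by every user

def observed_rates(group_size, n_groups=300, seed=1):
    """Observed conversion rate for many independent test groups of one size."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < TRUE_RATE for _ in range(group_size)) / group_size
        for _ in range(n_groups)
    ]

for size in (500, 10_000):
    rates = observed_rates(size)
    print(f"group size {size:>6}: "
          f"std dev of observed rate {statistics.pstdev(rates):.4f}, "
          f"luckiest group {max(rates):.2%}")

# Small groups scatter widely around the 5% truth, so a "winning" small test
# can land well above it; a rollout to everyone settles back near 5%.
```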

Moving Forward

Full adoption marks the beginning of a new optimization phase. Some factors, such as the cold start, will improve on their own with time. Others are worth examining to gauge how likely they were to have influenced the results. That research can inform how you work with the agents going forward, as well as how you design future experiments.