Comparing Systems

When 1:1 personalization makes traditional message-level testing impossible, we need to evaluate the entire system instead.
  • This is true when comparing Aampe’s agentic personalization against non-personalized customer experiences.
  • This is also true when comparing different agentically-personalized systems, such as when A/B testing various Aampe configurations.
What follows is an example framework that has worked well with many Aampe customers. Your specific setup will vary based on your individual circumstances.

Group Structure

When comparing agentic personalization against a non-agentic system, it helps to create three user groups:
  • No-Message Group: Receives no marketing messages. This establishes your true baseline behavior—what happens without any messaging influence. While it may feel counterintuitive to leave out some users, this group reveals which behaviors are truly driven by messages.
  • Business-as-Usual Group: Continues receiving existing campaign messages through current tools. This group provides the benchmark for what you’re currently achieving.
  • Aampe Group: Receives messages exclusively from Aampe’s agents. As much as possible, no business-as-usual marketing messages should influence this group. This isolation allows you to measure what the AI system can achieve when given full control.
The size of each group is your choice, but there is a tradeoff between group size and the time required to observe differences with confidence: smaller groups need longer tests. Aampe has tools available to help you visualize these tradeoffs and decide on your group sizes.

The size of the Aampe group deserves separate attention. Because agents can learn from each other, agents in a small Aampe group have less to learn from than agents in a large one, so a small group may understate what the system can achieve. To get the most accurate comparison of Aampe and a traditional system, it makes sense to have a larger Aampe group.
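To make the size tradeoff concrete, here is a rough sketch of the smallest difference in a conversion rate that two groups of given sizes could detect. It assumes a two-sided z-test on proportions, a hypothetical 5% baseline rate, and a hypothetical 100,000 users split 10/45/45. The function name and every number are illustrative assumptions, not Aampe tooling.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_a, n_b, alpha=0.05, power=0.8):
    """Approximate smallest absolute difference in a conversion rate that a
    two-sided z-test could detect between groups of size n_a and n_b."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between two proportions,
    # approximated with the baseline rate for both groups.
    std_err = sqrt(baseline_rate * (1 - baseline_rate) * (1 / n_a + 1 / n_b))
    return (z_alpha + z_power) * std_err


# Hypothetical population and split (assumptions for illustration only).
total_users = 100_000
split = {"no_message": 0.10, "business_as_usual": 0.45, "aampe": 0.45}
sizes = {group: round(total_users * share) for group, share in split.items()}

aampe_vs_bau = minimum_detectable_effect(
    0.05, sizes["aampe"], sizes["business_as_usual"]
)
aampe_vs_none = minimum_detectable_effect(
    0.05, sizes["aampe"], sizes["no_message"]
)
print(f"Aampe vs business-as-usual: {aampe_vs_bau:.4f}")
print(f"Aampe vs no-message:        {aampe_vs_none:.4f}")
```

The comparison against the smaller no-message group can only detect a larger difference, which is the cost of keeping that group small. Running the test longer is one way to compensate.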

User Group Assignment

Users should be randomly assigned to groups in a way that is easy to extend to new users who join during the experiment. The assignment must be permanent and consistent throughout your system. For this example we’ll randomize over user IDs.
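CASE
  -- HASH is dialect-specific. ABS guards against warehouses whose hash
  -- function returns negative values, which would all match the first branch.
  WHEN MOD(ABS(HASH(user_id)), 100) < 10 THEN 'no_message'        -- buckets 0-9
  WHEN MOD(ABS(HASH(user_id)), 100) < 55 THEN 'business_as_usual' -- buckets 10-54
  ELSE 'aampe'                                                    -- buckets 55-99
END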
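The same bucketing idea can be reproduced outside the warehouse, for example in a service that must assign new users at signup. The sketch below is one hypothetical way to do that. It uses SHA-256 rather than the warehouse hash function, so its buckets will not match the SQL expression; pick one implementation as the source of truth and export its assignments to the other tools. The function name and salt are assumptions, and the 10/45/45 thresholds simply mirror the example.

```python
import hashlib


def assign_group(user_id: str, salt: str = "system-comparison-v1") -> str:
    """Deterministically map a user ID to one of the three groups.

    The salt (a hypothetical value) ties the assignment to this experiment,
    so a later experiment can reshuffle users by using a different salt.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0-99
    if bucket < 10:
        return "no_message"
    if bucket < 55:
        return "business_as_usual"
    return "aampe"


# The same ID always lands in the same group, including IDs created later.
print(assign_group("user-12345"))
print(assign_group("user-12345") == assign_group("user-12345"))
```

Because the group is a pure function of the user ID, no lookup table is needed to keep the assignment permanent, and new users are covered automatically.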
Include Everyone: New users need to be assigned to a group upon entering the system. If new users all land in the same group, the results of the experiment will be biased.
Apply Everywhere: The group assignments must be consistent across all relevant tools:
  • Data warehouse (for analysis)
  • Existing marketing tools (for excluding the Aampe and no-message groups)
  • Aampe (for excluding the business-as-usual and no-message groups)
  • Any other messaging systems
Look out for Spillovers: Ensure the experience of one group does not affect the experiences of other groups. If spillovers are likely in your setup, Aampe can help you determine an alternative randomization strategy.
  • Example: Business-as-usual sends coupons for a product. The product has limited supply and runs out of stock. The coupon recipients buy more of the product than usual, while the no-message group buys less of the product because it’s out of stock.
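One way to catch the assignment problems described in this section, such as new users all landing in one group, is a sample ratio mismatch check: compare the observed group counts with the intended split. The sketch below assumes exactly three groups, so the chi-square test has two degrees of freedom and its p-value has the closed form exp(-chi2 / 2). The function name, the strict 0.001 threshold and the counts are illustrative assumptions.

```python
from math import exp


def sample_ratio_mismatch(observed, expected_share, alpha=0.001):
    """Chi-square goodness-of-fit check of observed group counts against
    the intended split. Valid as written only for exactly three groups."""
    if len(expected_share) != 3 or set(observed) != set(expected_share):
        raise ValueError("this sketch expects the same three groups in both inputs")
    total = sum(observed.values())
    chi2 = sum(
        (observed[g] - total * share) ** 2 / (total * share)
        for g, share in expected_share.items()
    )
    # With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    p_value = exp(-chi2 / 2)
    return chi2, p_value, p_value < alpha


intended = {"no_message": 0.10, "business_as_usual": 0.45, "aampe": 0.45}

# Hypothetical counts where late signups were all routed to the Aampe group.
counts = {"no_message": 1000, "business_as_usual": 4500, "aampe": 5500}
chi2, p_value, mismatch = sample_ratio_mismatch(counts, intended)
print(f"chi2={chi2:.1f}, p={p_value:.2e}, mismatch={mismatch}")
```

A flagged mismatch does not say where the problem is, only that the groups no longer reflect the intended split and the assignment pipeline needs a closer look.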

Content Requirements

For the AI to learn effectively, it needs sufficient material to work with:
  • Diverse Labels: Create content across multiple distinct label categories representing different value propositions. If you only provide discount messages, the agents will only learn discount preferences.
  • Balanced Channels: If business-as-usual messages are sent over several channels, ensure the same channels are available to the agentic personalization group.
  • Comprehensive Audiences: If the goal is to learn whether agentic personalization performs better or worse than another approach, the Aampe audiences should be as diverse as the business-as-usual audiences.
Fortunately, Aampe provides the infrastructure to easily manage both content and audiences.

Message Delivery

To ensure the test is running as expected, it is key to have visibility into what messages are delivered and when.
  • The primary concern is that business-as-usual messages are sent only to the business-as-usual group, and Aampe messages only to the Aampe group.
There may be times during the test when all users must be messaged, such as major sales or critical announcements. When this is unavoidable, message all groups equally.
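A delivery audit along these lines can be sketched as follows. It assumes you can export a delivery log with a user ID, a sending system and a campaign name per message; those field names, the campaign names and the function itself are hypothetical. Campaigns that were deliberately sent to everyone are passed in as exemptions.

```python
# Which sending system is allowed to message each group.
ALLOWED_SOURCES = {
    "no_message": set(),
    "business_as_usual": {"business_as_usual"},
    "aampe": {"aampe"},
}


def find_delivery_violations(deliveries, assignments, exempt_campaigns=frozenset()):
    """Return deliveries that reached a user whose group should not have
    received them. Users missing from the assignments are flagged too."""
    violations = []
    for delivery in deliveries:
        if delivery["campaign"] in exempt_campaigns:
            continue  # announcement intentionally sent to all groups
        group = assignments.get(delivery["user_id"])
        if group is None or delivery["source"] not in ALLOWED_SOURCES[group]:
            violations.append({**delivery, "group": group})
    return violations


# Hypothetical data for illustration.
assignments = {"u1": "no_message", "u2": "business_as_usual", "u3": "aampe"}
deliveries = [
    {"user_id": "u2", "source": "business_as_usual", "campaign": "weekly_promo"},
    {"user_id": "u3", "source": "business_as_usual", "campaign": "weekly_promo"},
    {"user_id": "u1", "source": "business_as_usual", "campaign": "service_notice"},
    {"user_id": "u1", "source": "aampe", "campaign": "agent_message"},
    {"user_id": "u4", "source": "aampe", "campaign": "agent_message"},
]
violations = find_delivery_violations(
    deliveries, assignments, exempt_campaigns={"service_notice"}
)
for violation in violations:
    print(violation)
```

Running a check like this on a schedule during the test surfaces leaks early, while they can still be fixed or the affected users excluded from analysis.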

Pre-Test Planning

Before launching your experiment, establish your measurement framework:
  • Primary KPI: Select your primary metric before starting. This prevents post-hoc rationalization and keeps analysis focused.
  • Statistical Power: Understand how long you’ll need to run the test given your metric’s natural variance and the effect size you hope to detect. Higher variance metrics and smaller expected effects require longer test periods.
  • Burn-in Period: Plan to exclude an initial period from primary analysis. Early results often reflect adjustment periods rather than steady-state performance—users adapting to no messages, agents still learning preferences, etc.
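As an illustration of the burn-in idea, the sketch below computes a primary KPI per group while ignoring activity from the first days of the test. The row layout (one record per user per day), the 14-day burn-in and the KPI definition (share of user-days with a conversion) are assumptions chosen for the example; your own metric and burn-in length should be fixed before the test starts.

```python
from collections import defaultdict
from datetime import date, timedelta


def conversion_rate_by_group(rows, test_start, burn_in_days=14):
    """Share of user-days with a conversion per group, excluding the
    burn-in period at the start of the test."""
    cutoff = test_start + timedelta(days=burn_in_days)
    totals = defaultdict(lambda: [0, 0])  # group -> [conversions, user_days]
    for row in rows:
        if row["day"] < cutoff:
            continue  # still inside the burn-in period
        totals[row["group"]][0] += int(row["converted"])
        totals[row["group"]][1] += 1
    return {group: conv / days for group, (conv, days) in totals.items()}


# Hypothetical rows for illustration.
start = date(2024, 1, 1)
rows = [
    {"group": "aampe", "day": date(2024, 1, 3), "converted": True},   # burn-in
    {"group": "aampe", "day": date(2024, 1, 20), "converted": True},
    {"group": "aampe", "day": date(2024, 1, 21), "converted": False},
    {"group": "no_message", "day": date(2024, 1, 20), "converted": False},
    {"group": "no_message", "day": date(2024, 1, 22), "converted": False},
]
rates = conversion_rate_by_group(rows, start)
print(rates)
```

Keeping the burn-in cutoff as an explicit parameter makes it easy to report results with and without the excluded period, which helps show that the choice did not drive the conclusion.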
How do we compare the groups once we’ve completed the experiment? Keep reading for thoughts on analyzing the experiment results.