A Fundamental Difference

Traditional CRM systems operate through campaigns: fixed messages sent to defined audiences at specific times. Each campaign typically has its own random holdout group, and attribution is straightforward because all treated users receive the same message simultaneously. Agentic AI systems like Aampe operate differently. Instead of campaigns, they create personalized message sequences for each user. This level of personalization is great for customers, but creates difficulties for traditional measurement techniques. Path Dependency and Its Implications In true 1:1 personalized systems like Aampe, actions become path-dependent. Each message influences the messages that follow depending on how it performs. This creates several measurement challenges for an individual message:
  • Random Assignment: Recipients and non-recipients of a given message have been chosen strategically, not randomly.
  • Cumulative Impact: Traditional A/B tests measure the immediate response to a message. Personalized sequences build impact over time as each message learns and improves from prior messages.
  • Treatment Timing: Measuring impact is difficult when some users receive a message on a Tuesday morning, while other users receive the same message on a Saturday evening.
  • Learning Phases: Aampe agents learn in real-time from watching users respond to different messages. It can take several customer interactions to discover strong patterns. Messages at the start of a sequence are not typically as impactful as messages later on.

Implications for Testing

Standard campaign-level holdouts lose meaning when users receive different messages at different times. But 1:1 personalization is exactly that, different messages at different times. So rather than evaluate individual messages, we need to step back and evaluate the message-generation process. Instead of asking, “does message A beat message B?” we start asking, “does an adaptive, learning system outperform static rules?”

This requires longer test periods, different group structures, and metrics that capture cumulative rather than immediate effects.
What does this look like in practice? Continue reading for an example.