A/B Testing, Experiment Design & Evaluation
JULIEN KERVIZIC
What is an A/B test?
[Diagram: users are split between group A (Test) and group B (Control), 50% each; percentages may vary.]
How are users assigned to a group?
id      Last digit   Group
11069   9            0
12017   7            0
17761   1            1
19730   0            1
9515    5            0
7253    3            1
677     7            0
18798   8            0
Simple heuristic
Last digit < 5 → group 1
Otherwise → group 0
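A minimal sketch of this heuristic in Python (the function name is illustrative; ids are assumed to be integers):

```python
def assign_group(user_id: int) -> int:
    # Simple heuristic from above: last digit < 5 -> group 1, otherwise group 0
    return 1 if user_id % 10 < 5 else 0

assert assign_group(11069) == 0  # last digit 9
assert assign_group(17761) == 1  # last digit 1
```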
From which population segment do you measure results?
[Diagram: nested populations, Treated ⊂ Exposed ⊂ Allocated]
Allocated
Contains all users that have been assigned to a
group. Usually used to measure the overall impact of
launching a feature on the business.
Exposed
All users having performed an action that should
have caused them to see the feature being tested.
Usually used to measure the impact of the specific
feature on the population concerned.
Treated
Users who have done the desired action from the feature. This
segmentation is used to understand the impact on metrics of
going through with the specific action.
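To make the distinction concrete, here is a small illustrative classifier over an event log (the event names and schema are assumptions, not from the deck):

```python
def classify(user_id, events):
    """Return the deepest segment a user falls into: treated ⊂ exposed ⊂ allocated."""
    actions = {e["action"] for e in events if e["user_id"] == user_id}
    if "add_to_cart" in actions:         # completed the desired action
        return "treated"
    if "clicked_buy_button" in actions:  # should have seen the feature
        return "exposed"
    return "allocated"                   # assigned to a group, nothing more

events = [{"user_id": "u1", "action": "clicked_buy_button"}]
print(classify("u1", events))  # -> "exposed"
```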
Example
Allocated
All visitors on the site
Exposed
All visitors having clicked the buy
button on the SUB
Treated
Added one of the bundles to cart.
Metrics
Plan which key metrics to track for optimization purposes.
Make sure there is logging implemented to track secondary user actions.
Example:
Primary metrics:
• Add to cart
• Quantity Added to Cart
• Quantity Purchased
• AOV (Average Order Value)
Secondary metrics:
• Clicks on go to beer page
• Go to cart
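A sketch of the kind of structured event logging this requires, so that both primary and secondary metrics can be computed afterwards (the schema and event names are assumptions, not the deck's actual pipeline):

```python
import json
import time

def log_event(user_id, group, action, value=None):
    """Emit one structured event per user action (add_to_cart, go_to_cart, ...)."""
    event = {"ts": time.time(), "user_id": user_id, "group": group,
             "action": action, "value": value}
    print(json.dumps(event))  # stand-in for a real event pipeline

log_event("user-1", 1, "add_to_cart", value=2)  # quantity added to cart
```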
Significance
Significance helps you understand how likely you are to make a wrong judgment call:
• At 95% confidence, you have a 1 in 20 chance of being wrong for a given metric
• At 90% confidence, you have a 1 in 10 chance of being wrong for a given metric
Significance is established from:
• Sample Size
• Effect Size
The higher the sample size and the (relative) size of the effect, the easier it is to
establish statistical significance.
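As an illustration, a minimal two-proportion z-test on a conversion-style metric (the counts below are hypothetical; scipy is assumed available):

```python
import math
from scipy.stats import norm

def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test for a conversion-style metric."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided
    return z, p_value

# Hypothetical: 10,000 users per group, 500 vs 560 add-to-carts
z, p = two_proportion_ztest(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p-value = {p:.3f}")  # significant at 95% confidence if p < 0.05
```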
Sample size & Effect Size
• Relative effect size: reaching significance over the allocated population is
normally harder to establish, because the impact is diluted across users
who never saw the feature, which lowers the relative effect
• Sample size & time: the smaller the (relative) effect you need to detect,
the larger the sample required, and therefore the longer the experiment
needs to run
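A sketch of how required sample size follows from the effect you want to detect, using statsmodels' power utilities (the baseline and uplift numbers are hypothetical):

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

# Hypothetical: detect a lift from a 10% to an 11% conversion rate
effect = proportion_effectsize(0.10, 0.11)

# Users needed per group at 95% confidence (alpha=0.05) and 80% power
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0
)
print(f"~{n_per_group:,.0f} users per group")
```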
Experiment designs and Holdouts
Some effects can only be analyzed over the long run;
for example, uplift in Torps sold needs to be balanced between:
• Ordering Frequency
• Torps/Order
Other effects change over time due to novelty effects or learning curves.
In these cases it is important to define a holdout,
i.e. a sample of the population that will not be exposed to your
treatment over an extended period of time.
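A minimal sketch of a stable holdout assignment, assuming string user ids and a 5% holdout (the function name and threshold are illustrative):

```python
import hashlib

def in_holdout(user_id: str, holdout_pct: float = 0.05) -> bool:
    """Deterministically keep a fixed slice of users out of all treatments."""
    # Hash the id so the assignment is stable across sessions and time
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < holdout_pct * 10_000

print(in_holdout("user-12345"))
```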
What are A/B tests not good for?
Handling network effects:
Example: Mentionme, a referral program
where the referrer and referee could be
split across test and control groups.
Novelty effects:
Prompts and CTAs tend to exhibit
novelty effects; if their performance is
not measured over the long term
using a holdout, wrong attribution
and/or customer fatigue can result.
Limitations & issues in our current setup
• Only able to track short-term effects
• Users could be exposed to different tests depending on their device, or
after resetting their cookies, and be “split-counted”
• Experiments are not evaluated across communication channels (e.g.
Email and Website)
• No clear population distinction between allocated, exposed, and treated
groups
What should be done to solve this?
• More aggressively push for account login, and set up experiments
specifically for registered users
• This would require giving users an incentive to log in, such as specific
functionality, e.g. specific deals for registered users
• Ensure that logging exists to classify users at the different steps of the
funnel
• Find a way to use the same experimental setup on Lightspeed and
SFMC for the allocation of users to groups
What kind of questions can this help us solve?
• What is the cannibalization effect of offering Best Before Torps to our
existing customers?
• What impact does increasing average order size have on Frequency and
total Torps purchased?
• Do people who have seen but not purchased discounted Torps delay
their purchases in expectation of a good deal?
Other challenges in Experimental design
• What if your test or control group becomes polluted, i.e. your group sizes
are not balanced when you expected them to be? (See the sketch below.)
• What if a bug in your test brought a negative experience to users?
• How should experiments be handled when they should be stacked vs.
independent?
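For the group-balance question above, a common check is a sample ratio mismatch (SRM) test: a chi-square test of the observed group sizes against the configured split (the 50/50 split and the counts below are assumptions):

```python
from scipy.stats import chisquare

observed = [50_421, 49_302]   # users actually seen in test / control
expected_share = [0.5, 0.5]   # the split the experiment was configured with
total = sum(observed)
stat, p = chisquare(observed, f_exp=[s * total for s in expected_share])
print(f"p-value = {p:.4f}")   # a very small p-value signals a polluted split
```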