Detecting Incorrectly Implemented Experiments
Michael Lindon
Staff Statistician
Optimizely
Outline
Challenges of Experimentation
Sample Ratio Mismatch (SRM)
Testing for SRMs
Sequential Testing
SSRM Project
Challenges of Online Experimentation
“Perhaps the most common type of metric interpretation
pitfall is when the observed metric change is not due to
the expected behavior of the new feature, but due
to a bug introduced when implementing the feature.”
P. Dmitriev et al. (Analysis and Experimentation Team, Microsoft), "A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments," KDD 2017
Case Study [1]
Z. Zhao et al. (Yahoo Inc.), "Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation," DSAA 2016
Intention:
● User ID is assigned a Test ID
● Test ID labels whether a user receives treatment or control
● Traffic splitter consistently exposes users to the correct variant (treatment or control)
● A consistent experience is necessary in order to measure the long-term effect of the treatment
Observation:
● 4% of User IDs lacked a valid Test ID
● Some users interacted with components of both control AND treatment
● Treatment group experience was a mixture of treatment and control
Consequences:
● Likely underestimates the treatment effect
Case Study [2]
Intention:
● Increasing the number of carousel items increases user engagement
Observation:
● User engagement negatively affected!?
● Users were engaged for so long that they were accidentally classified (algorithmically) as bots and removed from the analysis
Consequences:
● Incorrect data processing logic, intended to remove non-human visitors from the analysis, removed human visitors from the analysis
● Metric change caused by a bug, not by the treatment effect
A. Fabijan et al., "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments," KDD 2019
Case Study [3]
Intention:
● New protocol for delivering push notifications
● Expected increase in the reliability of message delivery
Observation:
● Significant improvements in totally unrelated call metrics
● Fraction of successfully connected calls increased
● Treatment affected the telemetry loss rate
Consequences:
● Increase in metrics not caused by the treatment effect
● Caused by a side effect of the treatment: improving the telemetry loss rate
● Biased telemetry
P. Dmitriev et al. (Analysis and Experimentation Team, Microsoft), "A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments," KDD 2017
The Sample Ratio Mismatch
“One of the most useful indicators of a variety of data
quality issues is a Sample Ratio Mismatch (SRM) – the
situation when the observed sample ratio in the
experiment is different from the expected.”
A. Fabijan et al. (Microsoft, Booking.com, Outreach.io), "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments," KDD 2019
Intended Traffic Allocation
Observed Traffic Allocation
Observations Don’t Match Expectations
● Intended Allocation Probabilities:
○ 0.50, 0.30, 0.20
● Empirical Allocation Probabilities:
○ 0.45, 0.28, 0.27
● Why do we observe a different traffic distribution than intended?
● Strong signal of an incorrect implementation
● When this departure from the intended allocation probabilities is statistically significant, a sample ratio mismatch (SRM) is said to be present (see the test sketch below)
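A minimal sketch of how such a fixed-horizon check can be run with scipy's chi-squared goodness-of-fit test. The visitor counts below are hypothetical, chosen only to roughly match the empirical 0.45/0.28/0.27 split quoted above; only the intended 0.50/0.30/0.20 allocation comes from the slide.

from scipy.stats import chisquare

observed_counts = [45000, 28000, 27000]               # hypothetical visitor counts per variation
intended_probs = [0.50, 0.30, 0.20]                   # intended allocation probabilities
total = sum(observed_counts)
expected_counts = [p * total for p in intended_probs]

result = chisquare(f_obs=observed_counts, f_exp=expected_counts)
print(f"chi2 = {result.statistic:.1f}, p-value = {result.pvalue:.3g}")
# A tiny p-value means the observed split is inconsistent with the intended
# allocation, i.e. a sample ratio mismatch.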
“...within the last year we
identified that approximately
6% of experiments at
Microsoft exhibit an SRM.”
A. Fabijan et al., "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments," KDD 2019
Testing for Sample Ratio Mismatches
Example 1: Using a Non-Sequential Test
n_treatment = 821588
n_control = 815482
p = [0.5, 0.5]
Binomial Test p-value: 1.8e-06
Outcome: Entire Experiment Lost
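The quoted p-value can be reproduced with SciPy's exact binomial test; a small sketch (scipy.stats.binomtest requires SciPy 1.7 or later):

from scipy.stats import binomtest

n_treatment = 821588
n_control = 815482
n_total = n_treatment + n_control

# Two-sided test of whether the treatment count is consistent with a 50/50 split.
result = binomtest(n_treatment, n=n_total, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.2g}")  # approximately 1.8e-06, far below alpha = 0.05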
Example 1: Using the Sequential SRM Test
SSRM: Null rejected after 417150 visitors (at alpha=0.05)
Savings: Detected the SRM in about ¼ of the time of the original experiment
Outcome: Prevented 1219920 visitors from entering a faulty experiment
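A quick back-of-the-envelope check of the savings quoted above, using only the numbers from the two Example 1 slides:

n_total = 821588 + 815482       # visitors in the completed fixed-horizon experiment
n_at_detection = 417150         # visitors observed when the sequential test rejected the null

print(n_at_detection / n_total)    # about 0.25: the SRM was flagged in roughly a quarter of the traffic
print(n_total - n_at_detection)    # 1219920 visitors spared from the faulty experiment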
Why Can't I Just Use the Chi-Squared Test Sequentially?
Sequential Testing
Data Simulated Under the Null, p = 0.5
At the end of the experiment, the null is rejected if x/n >= 0.531 or x/n <= 0.469.
In this case, x/n = 0.504, so the null is not rejected (correct).
What does the rejection region look like for all n?
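Before looking at all n, the single fixed-horizon boundary above can be computed directly: the one-degree-of-freedom chi-squared test is equivalent to a two-sided z-test, so the null p = 0.5 is rejected when |x/n - 0.5| exceeds z_{alpha/2} * 0.5 / sqrt(n). The sample size n = 1000 below is an assumption; it roughly reproduces the 0.531 and 0.469 thresholds quoted above.

from scipy.stats import norm

alpha = 0.05
n = 1000  # assumed experiment length for this slide's example
half_width = norm.ppf(1 - alpha / 2) * 0.5 / n**0.5
print(f"reject if x/n >= {0.5 + half_width:.3f} or x/n <= {0.5 - half_width:.3f}")
# reject if x/n >= 0.531 or x/n <= 0.469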
Rejection Regions for Chi2(alpha=0.05, p=0.5) Test
Null hypothesis incorrectly rejected at n=26
Repeated usage of the Chi2 test resulted in a False Positive
Would a Likelihood Ratio Test have been any different?
Null hypothesis incorrectly rejected at n=37
Repeated usage of the likelihood ratio test resulted in a False Positive
Comparison of Critical Regions between Chi2 and SSRM
At any point in time, the rejection region of the SSRM test is smaller than that of the Chi2 test.
This allows the SSRM test to be used after every datapoint without increasing the false positive rate.
Would the SSRM Test have been any different?
Null hypothesis not rejected (correct)
Repeated usage of the SSRM test did not result in a false positive
Was it Just Bad Luck?
Chi2 Simulation Study (np.random.seed(0))
Number of False Positives Much Higher Than Expected
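A sketch of the kind of simulation behind this slide: simulate balanced traffic under the null, apply the chi-squared test after every new visitor, and count how many experiments are (falsely) rejected at least once. The number of replications and the experiment length are assumptions, not the talk's exact settings.

import numpy as np
from scipy.stats import chisquare

np.random.seed(0)
alpha = 0.05
n_experiments = 200
n_visitors = 1000

false_positives = 0
for _ in range(n_experiments):
    assignments = np.random.binomial(1, 0.5, size=n_visitors)  # data simulated under the null, p = 0.5
    x = np.cumsum(assignments)                                  # running treatment count
    n = np.arange(1, n_visitors + 1)
    for xi, ni in zip(x[9:], n[9:]):                            # start testing after 10 visitors
        if chisquare([xi, ni - xi], f_exp=[ni / 2, ni / 2]).pvalue < alpha:
            false_positives += 1
            break

# Far more than the nominal 5% of null experiments end up rejected at some point.
print(f"false positive rate: {false_positives / n_experiments:.2f}")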
LRT Simulation Study (np.random.seed(0))
Number of False Positives Much Higher Than Expected
SSRM Simulation Study (np.random.seed(0))
Number of False Positives As Expected
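For intuition, here is a hedged sketch of one anytime-valid sequential SRM test for the two-variation case: a Beta(1,1)-mixture sequential probability ratio test. It belongs to the same family of mixture tests as SSRM, but it is not necessarily the exact algorithm implemented in the optimizely/ssrm package. By Ville's inequality, rejecting when the mixture likelihood ratio exceeds 1/alpha keeps the false positive rate below alpha no matter how often the data are checked.

import numpy as np
from scipy.special import betaln

def mixture_lr(x, n, p0=0.5, a=1.0, b=1.0):
    """Mixture likelihood ratio after observing x treatment visitors out of n."""
    log_marginal = betaln(x + a, n - x + b) - betaln(a, b)   # beta-binomial marginal (binomial coefficient cancels)
    log_null = x * np.log(p0) + (n - x) * np.log(1 - p0)     # likelihood under the null allocation
    return np.exp(log_marginal - log_null)

np.random.seed(0)
alpha = 0.05
assignments = np.random.binomial(1, 0.5, size=10000)         # data simulated under the null
x, rejected_at = 0, None
for n, a_n in enumerate(assignments, start=1):
    x += a_n
    if mixture_lr(x, n) >= 1 / alpha:                        # safe to check after every single visitor
        rejected_at = n
        break
print(rejected_at)  # stays None with probability at least 1 - alpha under the null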
What About Detecting SRMs?
SSRM Simulation Study: null_probability = 0.5, true_probability = 0.6
Almost all SRMs were detected near the beginning of the experiment
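And a matching detection (power) sketch for this slide's setting, with traffic intended to be 50/50 but actually split 60/40 (null_probability = 0.5, true_probability = 0.6). It reuses the Beta(1,1)-mixture test sketched above; the experiment length and seed are assumptions.

import numpy as np
from scipy.special import betaln

np.random.seed(0)
alpha, p0 = 0.05, 0.5
assignments = np.random.binomial(1, 0.6, size=10000)   # true allocation probability is 0.6, not 0.5
x = np.cumsum(assignments)
n = np.arange(1, assignments.size + 1)

# Log mixture likelihood ratio at every sample size n.
log_lr = (betaln(x + 1, n - x + 1) - betaln(1, 1)) - (x * np.log(p0) + (n - x) * np.log(1 - p0))
hits = np.nonzero(log_lr >= np.log(1 / alpha))[0]
print(hits[0] + 1 if hits.size else None)  # typically a few hundred visitors: the SRM is caught early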
The SSRM Project
github.com/optimizely/ssrm
GitHub Repo Contains Jupyter Notebook Tutorials
Available on PyPI
optimize.ly/dev-community
Thank you!
Join us on Slack for Q&A
@michaelslindon
michaellindon


Editor's Notes

  • #37 Remember to join our developer Slack community, where you can keep the progressive delivery and experimentation discussion going. Thanks so much for joining us today; we look forward to continuing the conversation.