Detecting Incorrectly Implemented Experiments
Michael Lindon
Staff Statistician
Optimizely
Outline
Challenges of Experimentation
Sample Ratio Mismatch (SRM)
Testing for SRMs
Sequential Testing
SSRM Project
Challenges of Online Experimentation
“Perhaps the most common type of metric interpretation
pitfall is when the observed metric change is not due to
the expected behavior of the new feature, but due
to a bug introduced when implementing the feature.”
P. Dmitriev et al. (Analysis and Experimentation Team, Microsoft), "A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments," KDD 2017
Case Study [1]
Z. Zhao et al. (Yahoo Inc.), "Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation," DSAA 2016
Intention:
● User ID is assigned a Test ID
● Test ID labels whether a user receives treatment or control
● Traffic splitter consistently exposes users to the correct variant (treatment or control)
● A consistent experience is necessary in order to measure the long-term effect of the treatment
Observation:
● 4% of User IDs lacked a valid Test ID
● Some users interacted with components of both control AND treatment
● Treatment group experience was a mixture of treatment and control
Consequences:
● Likely underestimates the treatment effect
Case Study [2]
Intention:
● Increasing the number of carousel items increases user engagement
Observation:
● User engagement negatively affected!?
● Users were engaged for so long that they were accidentally classified (algorithmically) as bots and removed from the analysis
Consequences:
● Incorrect data processing logic, intended to remove non-human visitors from the analysis, removed human visitors from the analysis
● Metric change caused by a bug, not by the treatment effect
A. Fabijan et al., "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments," KDD 2019
Case Study [3]
Intention:
● New protocol for delivering push notifications
● Expected increase in the reliability of message delivery
Observation:
● Significant improvements in totally unrelated call metrics
● Fraction of successfully connected calls increased
● Treatment affected the telemetry loss rate
Consequences:
● Increase in metrics not caused by the treatment effect
● Caused by a side effect of the treatment: improving the telemetry loss rate
● Biased telemetry
P. Dmitriev et al. (Analysis and Experimentation Team, Microsoft), "A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments," KDD 2017
The Sample Ratio Mismatch
“One of the most useful indicators of a variety of data
quality issues is a Sample Ratio Mismatch (SRM) – the
situation when the observed sample ratio in the
experiment is different from the expected.”
A. Fabijan et al. (Microsoft, Booking.com, Outreach.io), "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments," KDD 2019
Intended Traffic Allocation
Observed Traffic Allocation
Observations Don’t Match Expectations
● Intended Allocation Probabilities:
○ 0.50, 0.30, 0.20
● Empirical Allocation Probabilities:
○ 0.45, 0.28, 0.27
● Why do we observe a different traffic distribution than intended?
● Strong signal of an incorrect implementation
● When this departure from the intended allocation probabilities is statistically significant, a sample ratio mismatch (SRM) is said to be present (see the test sketch below)
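A minimal sketch of how such a fixed-horizon check can be run with scipy's chi-squared goodness-of-fit test. The visitor counts below are hypothetical, chosen only to roughly match the empirical 0.45/0.28/0.27 split quoted above; only the intended 0.50/0.30/0.20 allocation comes from the slide.

from scipy.stats import chisquare

observed_counts = [45000, 28000, 27000]               # hypothetical visitor counts per variation
intended_probs = [0.50, 0.30, 0.20]                   # intended allocation probabilities
total = sum(observed_counts)
expected_counts = [p * total for p in intended_probs]

result = chisquare(f_obs=observed_counts, f_exp=expected_counts)
print(f"chi2 = {result.statistic:.1f}, p-value = {result.pvalue:.3g}")
# A tiny p-value means the observed split is inconsistent with the intended
# allocation, i.e. a sample ratio mismatch.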
“...within the last year we
identified that approximately
6% of experiments at
Microsoft exhibit an SRM.”
A. Fabijan et al., "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments," KDD 2019
Testing for Sample Ratio Mismatches
Example 1: Using a Non-Sequential Test
n_treatment = 821588
n_control = 815482
p = [0.5, 0.5]
Binomial Test p-value: 1.8e-06
Outcome: Entire Experiment Lost
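The quoted p-value can be reproduced with SciPy's exact binomial test; a small sketch (scipy.stats.binomtest requires SciPy 1.7 or later):

from scipy.stats import binomtest

n_treatment = 821588
n_control = 815482
n_total = n_treatment + n_control

# Two-sided test of whether the treatment count is consistent with a 50/50 split.
result = binomtest(n_treatment, n=n_total, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.2g}")  # approximately 1.8e-06, far below alpha = 0.05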
Example 1: Using the Sequential SRM Test
SSRM: Null rejected after 417150 visitors (at alpha=0.05)
Savings: Detected the SRM in about ¼ of the time of the original experiment
Outcome: Prevented 1219920 visitors from entering a faulty experiment
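A quick back-of-the-envelope check of the savings quoted above, using only the numbers from the two Example 1 slides:

n_total = 821588 + 815482       # visitors in the completed fixed-horizon experiment
n_at_detection = 417150         # visitors observed when the sequential test rejected the null

print(n_at_detection / n_total)    # about 0.25: the SRM was flagged in roughly a quarter of the traffic
print(n_total - n_at_detection)    # 1219920 visitors spared from the faulty experiment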
Why Can't I Just Use the Chi-Squared Test Sequentially?
Sequential Testing
Data Simulated Under the Null, p = 0.5
At the end of the experiment, the null is rejected if x/n >= 0.531 or x/n <= 0.469.
In this case, x/n = 0.504, so the null is not rejected (correct).
What does the rejection region look like for all n?
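Before looking at all n, the single fixed-horizon boundary above can be computed directly: the one-degree-of-freedom chi-squared test is equivalent to a two-sided z-test, so the null p = 0.5 is rejected when |x/n - 0.5| exceeds z_{alpha/2} * 0.5 / sqrt(n). The sample size n = 1000 below is an assumption; it roughly reproduces the 0.531 and 0.469 thresholds quoted above.

from scipy.stats import norm

alpha = 0.05
n = 1000  # assumed experiment length for this slide's example
half_width = norm.ppf(1 - alpha / 2) * 0.5 / n**0.5
print(f"reject if x/n >= {0.5 + half_width:.3f} or x/n <= {0.5 - half_width:.3f}")
# reject if x/n >= 0.531 or x/n <= 0.469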
Rejection Regions for Chi2(alpha=0.05, p=0.5) Test
Null hypothesis incorrectly rejected at n=26
Repeated usage of the Chi2 test resulted in a False Positive
Would a Likelihood Ratio Test have been any different?
Null hypothesis incorrectly rejected at n=37
Repeated usage of the likelihood ratio test resulted in a False Positive
Comparison of Critical Regions between Chi2 and SSRM
At any point in time, the rejection region of the SSRM test is smaller than that of the Chi2 test.
This allows the SSRM test to be used after every datapoint without increasing the false positive rate.
Would the SSRM Test have been any different?
Null hypothesis not rejected (correct)
Repeated usage of the SSRM test did not result in a false positive
Was it Just Bad Luck?
Chi2 Simulation Study (np.random.seed(0))
Number of False Positives Much Higher Than Expected
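A sketch of the kind of simulation behind this slide: simulate balanced traffic under the null, apply the chi-squared test after every new visitor, and count how many experiments are (falsely) rejected at least once. The number of replications and the experiment length are assumptions, not the talk's exact settings.

import numpy as np
from scipy.stats import chisquare

np.random.seed(0)
alpha = 0.05
n_experiments = 200
n_visitors = 1000

false_positives = 0
for _ in range(n_experiments):
    assignments = np.random.binomial(1, 0.5, size=n_visitors)  # data simulated under the null, p = 0.5
    x = np.cumsum(assignments)                                  # running treatment count
    n = np.arange(1, n_visitors + 1)
    for xi, ni in zip(x[9:], n[9:]):                            # start testing after 10 visitors
        if chisquare([xi, ni - xi], f_exp=[ni / 2, ni / 2]).pvalue < alpha:
            false_positives += 1
            break

# Far more than the nominal 5% of null experiments end up rejected at some point.
print(f"false positive rate: {false_positives / n_experiments:.2f}")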
LRT Simulation Study (np.random.seed(0))
Number of False Positives Much Higher Than Expected
SSRM Simulation Study (np.random.seed(0))
Number of False Positives As Expected
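For intuition, here is a hedged sketch of one anytime-valid sequential SRM test for the two-variation case: a Beta(1,1)-mixture sequential probability ratio test. It belongs to the same family of mixture tests as SSRM, but it is not necessarily the exact algorithm implemented in the optimizely/ssrm package. By Ville's inequality, rejecting when the mixture likelihood ratio exceeds 1/alpha keeps the false positive rate below alpha no matter how often the data are checked.

import numpy as np
from scipy.special import betaln

def mixture_lr(x, n, p0=0.5, a=1.0, b=1.0):
    """Mixture likelihood ratio after observing x treatment visitors out of n."""
    log_marginal = betaln(x + a, n - x + b) - betaln(a, b)   # beta-binomial marginal (binomial coefficient cancels)
    log_null = x * np.log(p0) + (n - x) * np.log(1 - p0)     # likelihood under the null allocation
    return np.exp(log_marginal - log_null)

np.random.seed(0)
alpha = 0.05
assignments = np.random.binomial(1, 0.5, size=10000)         # data simulated under the null
x, rejected_at = 0, None
for n, a_n in enumerate(assignments, start=1):
    x += a_n
    if mixture_lr(x, n) >= 1 / alpha:                        # safe to check after every single visitor
        rejected_at = n
        break
print(rejected_at)  # stays None with probability at least 1 - alpha under the null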
What About Detecting SRMs?
SSRM Simulation Study: null_probability = 0.5, true_probability = 0.6
Almost all SRMs were detected near the beginning of the experiment
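And a matching detection (power) sketch for this slide's setting, with traffic intended to be 50/50 but actually split 60/40 (null_probability = 0.5, true_probability = 0.6). It reuses the Beta(1,1)-mixture test sketched above; the experiment length and seed are assumptions.

import numpy as np
from scipy.special import betaln

np.random.seed(0)
alpha, p0 = 0.05, 0.5
assignments = np.random.binomial(1, 0.6, size=10000)   # true allocation probability is 0.6, not 0.5
x = np.cumsum(assignments)
n = np.arange(1, assignments.size + 1)

# Log mixture likelihood ratio at every sample size n.
log_lr = (betaln(x + 1, n - x + 1) - betaln(1, 1)) - (x * np.log(p0) + (n - x) * np.log(1 - p0))
hits = np.nonzero(log_lr >= np.log(1 / alpha))[0]
print(hits[0] + 1 if hits.size else None)  # typically a few hundred visitors: the SRM is caught early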
The SSRM Project
github.com/optimizely/ssrm
GitHub Repo Contains Jupyter Notebook Tutorials
Available on PyPI
optimize.ly/dev-community
Thank you!
Join us on Slack for Q&A
@michaelslindon
michaellindon


Editor's Notes

  • #37 Remember to join our developer Slack community, where you can keep the progressive delivery and experimentation discussion going. Thanks so much for joining us today; we look forward to continuing the conversation.