Has the outcome of an experiment ever gone so strongly against your intuition that you doubted whether it was implemented correctly? This is a real possibility, as correctly implementing an online experiment is fraught with data quality challenges. A considerable amount of engineering and data processing happens before the data reaches the analysis, leaving plenty of room for bugs and biases to creep in. One of the strongest signals of an incorrect implementation is a sample ratio mismatch (SRM): when the number of users assigned to each variation differs significantly from what is expected under the intended random allocation.
This talk will:
- Demo a novel SRM test that lets experimentation teams rapidly detect implementation bugs, even while the experiment is running, so developers can quickly fix the underlying issue.
- Provide an introduction to both the mathematics of the SRM test and the newly open-sourced library, which developers can immediately integrate into their platform.
4. “Perhaps the most common type of metric interpretation pitfall is when the observed metric change is not due to the expected behavior of the new feature, but due to a bug introduced when implementing the feature.”
P. Dmitriev et al. (Microsoft Analysis and Experimentation Team), “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments”, KDD 2017.
5. Case Study [1]
Z. Zhao et al. (Yahoo Inc.), “Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation”, DSAA 2016.
Intention:
● Each User ID is assigned a Test ID
● The Test ID labels whether the user receives treatment or control
● The traffic splitter consistently exposes users to the correct variant (treatment or control)
● This consistency is necessary to provide a stable experience and to measure the long-term effect of the treatment
Observation:
● 4% of User IDs lacked a valid Test ID
● Some users interacted with components of both control AND treatment
● The treatment group's experience was a mixture of treatment and control
Consequences:
● The treatment effect was likely underestimated
6. Case Study [2]
A. Fabijan et al., “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments”, KDD 2019.
Intention:
● Increasing the number of carousel items increases user engagement
Observation:
● User engagement was negatively affected!?
● Users were engaged for so long that they were accidentally classified (algorithmically) as bots and removed from the analysis
Consequences:
● Incorrect data processing logic, intended to remove non-human visitors from the analysis, removed human visitors instead
● The metric change was caused by a bug, not by a treatment effect
7. Case Study [3]
P. Dmitriev et al. (Microsoft Analysis and Experimentation Team), “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments”, KDD 2017.
Intention:
● A new protocol for delivering push notifications
● Expected to increase the reliability of message delivery
Observation:
● Significant improvements in completely unrelated call metrics
● The fraction of successfully connected calls increased
● The treatment affected the telemetry loss rate
Consequences:
● The increase in the metrics was not caused by a treatment effect
● It was caused by a side effect of the treatment: an improved telemetry loss rate
● Biased telemetry
9. “One of the most useful indicators of a variety of data quality issues is a Sample Ratio Mismatch (SRM) – the situation when the observed sample ratio in the experiment is different from the expected.”
A. Fabijan et al. (Microsoft, Booking.com, Outreach.io), “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments”, KDD 2019.
12. Observations Don't Match Expectations
● Intended allocation probabilities: 0.50, 0.30, 0.20
● Empirical allocation probabilities: 0.45, 0.28, 0.27
● Why do we observe a different traffic distribution than intended?
● This is a strong signal of an incorrect implementation
● When this departure from the intended allocation probabilities is statistically significant, a sample ratio mismatch (SRM) is said to be present (a worked check follows below)
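For a fixed-horizon check, a departure like this can be tested with a chi-squared goodness-of-fit test. Below is a minimal sketch in Python using scipy; the visitor counts are hypothetical, chosen only to match the empirical proportions above.

# A minimal sketch of a fixed-horizon SRM check for the three-way split
# above. The visitor counts are hypothetical, matching the empirical
# proportions 0.45, 0.28, 0.27.
from scipy.stats import chisquare

observed = [45_000, 28_000, 27_000]       # visitors per variation (hypothetical)
intended = [0.50, 0.30, 0.20]             # intended allocation probabilities
total = sum(observed)
expected = [p * total for p in intended]  # counts expected under the intended split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p-value = {p_value:.3g}")
# A tiny p-value flags an SRM: the observed split is very unlikely
# under the intended allocation.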
13. “...within the last year we identified that approximately 6% of experiments at Microsoft exhibit an SRM.”
A. Fabijan et al., “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments”, KDD 2019.
15. Example 1: Using a Non-Sequential Test
n_treatment = 821588
n_control = 815482
p = [0.5, 0.5]
Binomial test:
p-value: 1.8e-06
Outcome:
Entire experiment lost
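This slide's numbers can be reproduced with scipy's exact binomial test; a minimal sketch:

# Reproducing the fixed-horizon binomial SRM test from this slide.
# Under an intended 50/50 split, the imbalance below is wildly
# unlikely to arise by chance.
from scipy.stats import binomtest

n_treatment = 821_588
n_control = 815_482
n = n_treatment + n_control

result = binomtest(n_treatment, n=n, p=0.5)  # two-sided by default
print(f"p-value = {result.pvalue:.2g}")      # ~1.8e-06, matching the slide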
16. Example 1: Using the Sequential SRM Test
SSRM:
Null rejected after 417150 visitors (at alpha = 0.05)
Savings:
Detected the SRM in ¼ of the time of the original experiment
Outcome:
Prevented 1219920 visitors from entering a faulty experiment
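The open-sourced library's actual API is not shown on this slide, so the following is only a minimal self-contained sketch of the idea behind a sequential SRM test for a 50/50 split: a Beta(1, 1)-mixture likelihood ratio monitored after every visitor. By Ville's inequality, rejecting whenever the mixture likelihood ratio exceeds 1/alpha keeps the false positive rate at or below alpha under continuous monitoring.

# A sketch of a mixture-SPRT-style sequential SRM test for a
# two-variant split. This illustrates the idea only; the real
# library's interface differs.
import math

def log_mixture_lr(x: int, n: int, p0: float = 0.5,
                   a: float = 1.0, b: float = 1.0) -> float:
    """Log likelihood ratio of a Beta(a, b) mixture over the split
    probability against the null of an exact p0 split, after
    observing x treatment assignments out of n visitors."""
    log_mixture = (math.lgamma(a + x) + math.lgamma(b + n - x)
                   - math.lgamma(a + b + n)
                   - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))
    log_null = x * math.log(p0) + (n - x) * math.log(1 - p0)
    return log_mixture - log_null

def monitor(assignments, alpha: float = 0.05):
    """Feed in 0/1 assignments one at a time; return the visitor count
    at which the null (a fair split) is first rejected, else None."""
    x = n = 0
    threshold = math.log(1 / alpha)  # Ville's inequality: reject at LR >= 1/alpha
    for a_i in assignments:
        n += 1
        x += a_i
        if log_mixture_lr(x, n) >= threshold:
            return n  # SRM detected; stop and fix the implementation
    return None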
17. Why Can't I Just Use the Chi-Squared Test Sequentially?
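A quick hypothetical simulation makes the problem concrete: re-running a fixed-horizon test after every visitor rejects a perfectly fair 50/50 split far more often than the nominal 5%.

# Simulate continuous "peeking" with a fixed-horizon test on a
# genuinely fair 50/50 split. The numbers here are illustrative,
# not from the talk.
import numpy as np

rng = np.random.default_rng(0)
n_experiments, n_visitors = 1_000, 1_000
false_positives = 0
for _ in range(n_experiments):
    assignments = rng.integers(0, 2, size=n_visitors)  # fair 50/50 split
    x = np.cumsum(assignments)
    n = np.arange(1, n_visitors + 1)
    z = (x - n / 2) / np.sqrt(n / 4)  # normal-approximation test statistic
    # Peek after every visitor (from n = 30 on) at alpha = 0.05
    if np.any(np.abs(z[29:]) > 1.96):
        false_positives += 1          # rejected a true null
print(f"false positive rate: {false_positives / n_experiments:.0%}")
# Typically far above the nominal 5% -- repeated peeking breaks the guarantee.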
20. At the end of the experiment, the null is rejected if x/n >= 0.531 or x/n <= 0.469.
In this case, x/n = 0.504, so the null is not rejected (correct).
What does the rejection region look like for all n?
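The thresholds above are consistent with a fixed-horizon test at alpha = 0.05 and n = 1000 visitors (the sample size is an assumption inferred from the numbers): under the null, x/n is approximately normal with mean 0.5 and standard deviation sqrt(0.25 / n).

# Recovering the 0.531 / 0.469 thresholds, assuming n = 1000.
import math

n, z = 1_000, 1.96                    # two-sided critical value at alpha = 0.05
half_width = z * math.sqrt(0.25 / n)  # ~0.031
print(f"reject if x/n >= {0.5 + half_width:.3f} or x/n <= {0.5 - half_width:.3f}")
# -> reject if x/n >= 0.531 or x/n <= 0.469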
23. Would a Likelihood Ratio Test Have Been Any Different?
Null hypothesis incorrectly rejected at n = 37.
Repeated usage of the likelihood ratio test resulted in a false positive.
24. Comparison of Critical Regions between Chi2 and SSRM
At any point in time, the rejection region for SSRM is smaller than that of the Chi2 test.
This allows SSRM to be used after every data point without increasing the false positive rate.
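A self-contained sketch comparing the two boundaries, using the same hypothetical Beta(1, 1)-mixture test as the earlier sketch rather than the library's own implementation: the sequential boundary sits further from 0.5 at every n, and that extra conservatism is what buys the ability to test after every data point.

# Compare fixed-horizon and sequential rejection boundaries for a
# 50/50 split at a few sample sizes.
import math

def log_mixture_lr(x, n, p0=0.5):
    # Beta(1, 1) mixture likelihood ratio against an exact p0 split;
    # log B(1 + x, 1 + n - x), since B(1, 1) = 1
    log_mix = (math.lgamma(1 + x) + math.lgamma(1 + n - x)
               - math.lgamma(2 + n))
    return log_mix - (x * math.log(p0) + (n - x) * math.log(1 - p0))

def seq_boundary(n, alpha=0.05):
    """Smallest x/n the sequential test rejects at sample size n."""
    threshold = math.log(1 / alpha)
    for x in range(n // 2, n + 1):
        if log_mixture_lr(x, n) >= threshold:
            return x / n
    return 1.0

for n in (100, 1_000, 10_000):
    fixed = 0.5 + 1.96 * math.sqrt(0.25 / n)  # fixed-horizon boundary
    print(f"n={n:>6}: fixed-horizon {fixed:.4f}  sequential {seq_boundary(n):.4f}")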
25. Would the SSRM Test Have Been Any Different?
Null hypothesis not rejected (correct).
Repeated usage of the SSRM test did not result in a false positive.
Remember to join our developer Slack community, where you can keep the progressive delivery and experimentation discussion going.
Thanks so much for joining us today, we look forward to continuing the conversation.