Detecting Incorrectly Implemented Experiments
Michael Lindon
Staff Statistician
Optimizely
Outline
● Challenges of Experimentation
● Sample Ratio Mismatch (SRM)
● Testing for SRMs
● Sequential Testing
● SSRM Project
Challenges of Online Experimentation
“Perhaps the most common type of metric interpretation
pitfall is when the observed metric change is not due to
the expected behavior of the new feature, but due
to a bug introduced when implementing the feature.”
P. Dmitriev et al. / Analysis and Experimentation Team / Microsoft
A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments
KDD 2017
Case Study [1]
Z. Zhao et al. / Yahoo Inc.
Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation
DSAA 2016
Intention:
● Each User ID is assigned a Test ID
● The Test ID labels whether the user receives treatment or control
● The traffic splitter consistently exposes users to the correct variant (treatment or control)
● A consistent experience is necessary in order to measure the long-term effect of the treatment
Observation:
● 4% of User IDs lacked a valid Test ID
● Some users interacted with components of both control AND treatment
● The treatment group experience was a mixture of treatment and control
Consequences:
● The treatment effect is likely underestimated
Case Study [2]
A. Fabijan et al.
Diagnosing Sample Ratio Mismatch in Online Controlled Experiments
KDD 2019
Intention:
● Increasing the number of carousel items increases user engagement
Observation:
● User engagement negatively affected!?
● Users were engaged for so long that they were accidentally classified (algorithmically) as bots and removed from the analysis
Consequences:
● Incorrect data processing logic, intended to remove non-human visitors from the analysis, removed human visitors from the analysis
● Metric change caused by a bug, not by a treatment effect
Case Study [3]
P. Dmitriev et al. / Analysis and Experimentation Team / Microsoft
A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments
KDD 2017
Intention:
● New protocol for delivering push notifications
● Expected increase in the reliability of message delivery
Observation:
● Significant improvements in totally unrelated call metrics
● Fraction of successfully connected calls increased
● Treatment affected the telemetry loss rate
Consequences:
● Increase in metrics not caused by a treatment effect
● Caused by a side effect of the treatment, which improved the telemetry loss rate
● Biased telemetry
The Sample Ratio Mismatch
“One of the most useful indicators of a variety of data
quality issues is a Sample Ratio Mismatch (SRM) – the
situation when the observed sample ratio in the
experiment is different from the expected.”
A. Fabijan et al. / Microsoft / Booking.com / Outreach.io
Diagnosing Sample Ratio Mismatch in Online Controlled Experiments
KDD 2019
Intended Traffic Allocation
Observed Traffic Allocation
Observations Don’t Match Expectations
● Intended Allocation Probabilities:
○ 0.50, 0.30, 0.20
● Empirical Allocation Probabilities:
○ 0.45, 0.28, 0.27
● Why do we observe a different traffic distribution than intended?
● Strong signal of an incorrect implementation
● When this departure from the intended allocation probabilities is statistically significant, a sample ratio mismatch (SRM) is said to be present
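A minimal sketch of how the departure above could be checked for statistical significance with a chi-squared goodness-of-fit test. The slide gives only proportions, so the total of 10,000 visitors is a hypothetical assumption, as are the function names. With three variants the statistic has 2 degrees of freedom, for which the chi-squared tail probability is exactly exp(-x/2), so the standard library suffices.

```python
from math import exp


def chi2_statistic(observed, expected_probs):
    """Pearson chi-squared goodness-of-fit statistic for observed counts."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, expected_probs))


def srm_p_value_three_variants(observed, expected_probs):
    """p-value for exactly three variants (2 degrees of freedom).

    For 2 degrees of freedom the chi-squared survival function is exp(-x / 2).
    """
    if len(observed) != 3 or len(expected_probs) != 3:
        raise ValueError("this sketch only handles three variants")
    return exp(-chi2_statistic(observed, expected_probs) / 2.0)


intended = [0.50, 0.30, 0.20]     # intended allocation from the slide
observed = [4500, 2800, 2700]     # 0.45, 0.28, 0.27 of a hypothetical 10,000 visitors

statistic = chi2_statistic(observed, intended)
p_value = srm_p_value_three_variants(observed, intended)
print(f"chi2 = {statistic:.2f}, p-value = {p_value:.3g}")
```

At this assumed sample size the p-value is far below any conventional significance level, so the mismatch would be flagged as an SRM. With fewer visitors the same proportions could be consistent with chance.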
“...within the last year we
identified that approximately
6% of experiments at
Microsoft exhibit an SRM.”
A. Fabijan et al.
Diagnosing Sample Ratio Mismatch in Online Controlled Experiments
KDD 2019
Testing for Sample Ratio Mismatches
Example 1: Using a Non-Sequential Test
n_treatment = 821588
n_control = 815482
p = [0.5, 0.5]
Binomial Test:
p-value: 1.8e-06
Outcome:
Entire Experiment Lost
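The p-value for these counts can be checked with a short sketch. The slide names a binomial test; the code below swaps in the normal approximation to the two-sided binomial test (without continuity correction), which is very accurate at this sample size and needs only the standard library. The function name is hypothetical.

```python
from math import erfc, sqrt


def two_sided_binomial_p_value(x, n, p0=0.5):
    """Normal approximation to the two-sided binomial test of H0: p = p0."""
    z = (x - n * p0) / sqrt(n * p0 * (1 - p0))
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return erfc(abs(z) / sqrt(2))


n_treatment = 821588
n_control = 815482
p_value = two_sided_binomial_p_value(n_treatment, n_treatment + n_control)
print(f"p-value = {p_value:.1e}")
```

The result is of the same order as the value on the slide: a split of 50.2% to 49.8% looks harmless, yet with over 1.6 million visitors it is very unlikely under a true 50/50 allocation.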
Example 1: Using the Sequential SRM Test
SSRM:
Null rejected after 417150 visitors (at alpha=0.05)
Savings:
Detected the SRM in ¼ of the time of the original experiment
Outcome:
Prevented 1219920 visitors from entering a faulty experiment
Why Can’t I Just Use The Chi Squared Test Sequentially?
Sequential Testing
Data Simulated Under Null p=0.5
At the end of the experiment, the null is rejected if x/n >= 0.531 or x/n <= 0.469.
In this case, x/n = 0.504, so the null is not rejected (correct).
What does the rejection region look like for all n?
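A sketch of where such thresholds come from, using the normal approximation to a fixed-sample two-sided test at alpha=0.05. The slides do not state the sample size; n = 1000 is an assumption, chosen because it is the value for which this approximation reproduces the thresholds 0.469 and 0.531. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def fixed_n_rejection_bounds(n, p0=0.5, alpha=0.05):
    """Return (lower, upper): reject H0 if x/n <= lower or x/n >= upper."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p0 * (1 - p0) / n)
    return p0 - half_width, p0 + half_width


lower, upper = fixed_n_rejection_bounds(1000)  # n = 1000 is an assumed sample size
print(f"n=1000: reject if x/n <= {lower:.3f} or x/n >= {upper:.3f}")

# The bounds narrow towards p0 as n grows, which is the shape of the
# rejection region when it is drawn for every n.
for n in (10, 100, 1000, 10000):
    lo, hi = fixed_n_rejection_bounds(n)
    print(f"n={n:>5}: [{lo:.3f}, {hi:.3f}]")
```

Each pair of bounds is valid only if the test is performed once, at that n. Checking the data against the bounds at every n is what leads to the inflated false positive rate discussed next.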
Rejection Regions for Chi2(alpha=0.05, p=0.5) Test
Null hypothesis incorrectly rejected at n=26
Repeated usage of the Chi2 test resulted in a False Positive
Would a Likelihood Ratio Test have been any different?
Null hypothesis incorrectly rejected at n=37
Repeated usage of the likelihood ratio test resulted in a False Positive
Comparison of Critical Regions between Chi2 and SSRM
At any point in time, the rejection region for SSRM is smaller than that of the Chi2 test.
This allows SSRM to be used after every datapoint without increasing the false positive rate.
Would the SSRM Test have been any different?
Null hypothesis not rejected (correct)
Repeated usage of the SSRM test did not result in a false positive
Was it Just Bad Luck?
Chi2 Simulation Study (np.random.seed(0))
Number of False Positives Much Higher Than Expected
LRT Simulation Study (np.random.seed(0))
Number of False Positives Much Higher Than Expected
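A rough sketch of this kind of simulation study, not the code behind the slides. It simulates experiments under the null p=0.5 and applies a Chi2 test (1 degree of freedom, alpha=0.05) after every visitor. The number of experiments, the 500-visitor horizon, the minimum sample size of 10 before testing starts, and the use of Python's `random` module instead of NumPy are all assumptions, so the numbers will not match the slides' figures.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
# 0.95 quantile of the chi-squared distribution with 1 degree of freedom
CRITICAL = NormalDist().inv_cdf(1 - ALPHA / 2) ** 2


def ever_rejects(rng, n_max, p0=0.5, min_n=10):
    """Simulate one null experiment, testing after every visitor from min_n on."""
    x = 0  # visitors assigned to treatment so far
    for n in range(1, n_max + 1):
        x += rng.random() < p0
        if n >= min_n:
            stat = (x - n * p0) ** 2 / (n * p0 * (1 - p0))
            if stat >= CRITICAL:
                return True  # a false positive, since the null is true
    return False


rng = random.Random(0)
n_experiments = 400
false_positives = sum(ever_rejects(rng, n_max=500) for _ in range(n_experiments))
false_positive_rate = false_positives / n_experiments
print(f"false positive rate with repeated testing: {false_positive_rate:.2f}")
```

A single test at the end of each experiment would reject about 5% of the time. Stopping at the first rejection along the way gives the test many chances to cross the threshold, so the realized rate is several times higher.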
SSRM Simulation Study (np.random.seed(0))
Number of False Positives As Expected
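The slides do not spell out how SSRM is constructed. One standard way to obtain a test that can be applied after every datapoint is a mixture (Bayes factor) likelihood ratio that rejects once it reaches 1/alpha, which keeps the false positive rate at or below alpha by Ville's inequality. The sketch below uses that generic construction with a uniform Beta(1, 1) prior for a two-variant experiment. It is an assumption for illustration and is not taken from the ssrm package; all names and simulation sizes are hypothetical.

```python
import random
from math import lgamma, log


def log_bayes_factor(x, n, p0=0.5):
    """log Bayes factor of a uniform Beta(1, 1) alternative against H0: p = p0."""
    # Marginal likelihood of the sequence under the uniform prior:
    # x! (n - x)! / (n + 1)!
    log_marginal = lgamma(x + 1) + lgamma(n - x + 1) - lgamma(n + 2)
    log_null = x * log(p0) + (n - x) * log(1 - p0)
    return log_marginal - log_null


def ever_rejects(rng, n_max, true_p, p0=0.5, alpha=0.05):
    """Test after every visitor; reject when the Bayes factor reaches 1 / alpha."""
    threshold = log(1 / alpha)
    x = 0
    for n in range(1, n_max + 1):
        x += rng.random() < true_p
        if log_bayes_factor(x, n, p0) >= threshold:
            return True
    return False


rng = random.Random(0)
n_experiments = 1000
false_positives = sum(
    ever_rejects(rng, n_max=500, true_p=0.5) for _ in range(n_experiments)
)
false_positive_rate = false_positives / n_experiments
print(f"false positive rate with sequential test: {false_positive_rate:.3f}")
```

Up to simulation noise, the rate should come out at or below alpha even though the test is evaluated after every visitor. The price is a more conservative rejection region at any single point in time.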
What About Detecting SRMs?
SSRM Simulation Study: null_probability = 0.5, true_probability = 0.6
Almost all SRMs were detected near the beginning of the experiment
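A sketch of a comparable detection study, using the slide's null_probability = 0.5 and true_probability = 0.6. It again uses a generic mixture likelihood ratio test with a uniform Beta(1, 1) prior as a stand-in for a sequential SRM test. That construction, the 2000-visitor horizon, the number of experiments, and all names are assumptions and are not taken from the ssrm package.

```python
import random
from math import lgamma, log
from statistics import median


def log_bayes_factor(x, n, p0):
    """log Bayes factor of a uniform Beta(1, 1) alternative against H0: p = p0."""
    log_marginal = lgamma(x + 1) + lgamma(n - x + 1) - lgamma(n + 2)
    return log_marginal - (x * log(p0) + (n - x) * log(1 - p0))


def stopping_time(rng, n_max, true_p, p0, alpha=0.05):
    """Return the visitor count at which the null is rejected, or None."""
    threshold = log(1 / alpha)
    x = 0
    for n in range(1, n_max + 1):
        x += rng.random() < true_p
        if log_bayes_factor(x, n, p0) >= threshold:
            return n
    return None


null_probability = 0.5
true_probability = 0.6
n_max = 2000          # hypothetical experiment length
n_experiments = 200   # hypothetical number of simulated experiments

rng = random.Random(0)
times = [
    stopping_time(rng, n_max, true_probability, null_probability)
    for _ in range(n_experiments)
]
detected = [t for t in times if t is not None]
detected_fraction = len(detected) / n_experiments
median_time = median(detected)
print(f"detected: {detected_fraction:.0%}, median stopping time: {median_time}")
```

With a mismatch this large, the test typically stops after a small fraction of the planned visitors, which is the behaviour the slide describes.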
The SSRM Project
github.com/optimizely/ssrm
GitHub Repo Contains Jupyter Notebook Tutorials
Available on PyPI
optimize.ly/dev-community
Thank you!
Join us on Slack for Q&A
@michaelslindon
michaellindon

Detect incorrectly implemented experiments

  • 1. Detecting Incorrectly Implemented Experiments Michael Lindon Staff Statistician Optimizely
  • 2. Challenges of Experimentation Sample Ratio Mismatch (SRM) Testing for SRMs Sequential Testing SSRM Project Outline
  • 4. “Perhaps the most common type of metric interpretation pitfall is when the observed metric change is not due to the expected behavior of the new feature, but due to a bug introduced when implementing the feature.” P, Dmitriev et Al./Analysis and Experimentation Team/Microsoft A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments KDD 2017
  • 5. Case Study [1]: Z. Zhao et al., Yahoo Inc. “Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation”, DSAA 2016
    Intention:
    ● Each User ID is assigned a Test ID
    ● The Test ID labels whether the user receives treatment or control
    ● The traffic splitter consistently exposes users to the correct variant (treatment or control)
    ● A consistent experience is necessary in order to measure the long-term effect of the treatment
    Observation:
    ● 4% of User IDs lacked a valid Test ID
    ● Some users interacted with components of both control AND treatment
    ● The treatment group experience was a mixture of treatment and control
    Consequences:
    ● The treatment effect is likely underestimated
  • 6. Case Study [2]: A. Fabijan et al. “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments”, KDD 2019
    Intention:
    ● Increasing the number of carousel items increases user engagement
    Observation:
    ● User engagement negatively affected!?
    ● Users were engaged for so long that they were accidentally classified (algorithmically) as bots and removed from the analysis
    Consequences:
    ● Incorrect data processing logic, intended to remove non-human visitors from the analysis, removed human visitors from the analysis
    ● Metric change caused by a bug, not a treatment effect
  • 7. Case Study [3]: P. Dmitriev et al., Analysis and Experimentation Team, Microsoft. “A Dirty Dozen: Twelve Common Metric Interpretation Pitfalls in Online Controlled Experiments”
    Intention:
    ● New protocol for delivering push notifications
    ● Expected increase in reliability of message delivery
    Observation:
    ● Significant improvements in totally unrelated call metrics
    ● Fraction of successfully connected calls increased
    ● Treatment affected the telemetry loss rate
    Consequences:
    ● Increase in metrics not caused by a treatment effect
    ● Caused by a side effect of the treatment, which improved the telemetry loss rate
    ● Biased telemetry
  • 9. “One of the most useful indicators of a variety of data quality issues is a Sample Ratio Mismatch (SRM) – the situation when the observed sample ratio in the experiment is different from the expected.” A. Fabijan et al., Microsoft / Booking.com / Outreach.io. “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments”, KDD 2019
  • 12. Observations Don’t Match Expectations
    ● Intended allocation probabilities: 0.50, 0.30, 0.20
    ● Empirical allocation probabilities: 0.45, 0.28, 0.27
    ● Why do we observe a different traffic distribution than intended?
    ● Strong signal of an incorrect implementation
    ● When this departure from the intended allocation probabilities is statistically significant, a sample ratio mismatch (SRM) is said to be present
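Whether a departure like 0.45 / 0.28 / 0.27 from 0.50 / 0.30 / 0.20 is statistically significant depends on how many visitors were observed, which the slide does not state. The following is a minimal sketch of a fixed-sample chi-squared goodness-of-fit check, assuming a hypothetical total of 1000 visitors; the function name and the counts are illustrative, not taken from the talk.

```python
import math

def srm_chi2_three_arms(counts, intended):
    """Pearson chi-squared goodness-of-fit test for a three-arm allocation.

    Three arms give 2 degrees of freedom, and the chi-squared survival
    function with 2 degrees of freedom is exactly exp(-x / 2), so no
    statistics library is needed for this special case.
    """
    if len(counts) != 3 or len(intended) != 3:
        raise ValueError("this sketch only handles three arms")
    n = sum(counts)
    stat = sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(counts, intended))
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Hypothetical: 1000 visitors split in the empirical proportions from the slide.
intended = [0.50, 0.30, 0.20]
counts = [450, 280, 270]
stat, p_value = srm_chi2_three_arms(counts, intended)
print(f"chi2 = {stat:.2f}, p-value = {p_value:.1e}")
```

With fewer visitors the same proportions would give weaker evidence, which is why an SRM is defined through significance rather than through the raw difference in proportions.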
  • 13. “...within the last year we identified that approximately 6% of experiments at Microsoft exhibit an SRM.” A. Fabijan et al. “Diagnosing Sample Ratio Mismatch in Online Controlled Experiments”, KDD 2019
  • 15. Example 1: Using a Non Sequential Test. n_treatment = 821588, n_control = 815482, p = [0.5, 0.5]. Binomial Test p-value: 1.8e-06. Outcome: Entire Experiment Lost
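The p-value on this slide can be sanity-checked from the two counts alone. The sketch below uses the normal approximation to the binomial rather than an exact binomial test (an assumption made to stay within the standard library; at this sample size the two should agree closely), and the function name is illustrative.

```python
import math

def two_sided_binomial_p_value(x, n, p0=0.5):
    """Two-sided p-value for x successes in n trials (normal approximation)."""
    mean = n * p0
    sd = math.sqrt(n * p0 * (1.0 - p0))
    z = (x - mean) / sd
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2.0))

n_treatment = 821588
n_control = 815482
n_total = n_treatment + n_control

p_value = two_sided_binomial_p_value(n_treatment, n_total)
print(f"p-value = {p_value:.1e}")
```

The result should land close to the 1.8e-06 reported on the slide, even though the observed treatment share is only about 0.502.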
  • 16. Example 1: Using the Sequential SRM Test. SSRM: null rejected after 417150 visitors (at alpha=0.05). Savings: detected the SRM in ¼ of the time of the original experiment. Outcome: prevented 1219920 visitors from entering a faulty experiment
  • 17. Why Can’t I Just Use The Chi Squared Test Sequentially?
  • 19. Data Simulated Under Null p=0.5
  • 20. At the end of the experiment, the null is rejected if x/n >= 0.531 or x/n <= 0.469. In this case, x/n = 0.504, so the null is not rejected (correct). What does the rejection region look like for all n?
  • 21. Rejection Regions for Chi2(alpha=0.05, p=0.5) Test
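For two arms the chi-squared statistic reduces to a squared z-score, so the rejection boundaries on x/n can be written down for any n. This is a minimal sketch under that observation; the slides do not state the final sample size, and n = 1000 is an assumption that happens to reproduce the 0.469 and 0.531 thresholds quoted earlier. The function name is illustrative.

```python
import math
from statistics import NormalDist

def chi2_rejection_bounds(n, p0=0.5, alpha=0.05):
    """Return (lower, upper) bounds on x/n beyond which the fixed-n test rejects.

    With two arms, chi2 = (x - n*p0)**2 / (n*p0*(1-p0)) = z**2, so the test
    rejects when |x/n - p0| >= z_crit * sqrt(p0 * (1 - p0) / n).
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z_crit * math.sqrt(p0 * (1.0 - p0) / n)
    return p0 - half_width, p0 + half_width

for n in (10, 100, 1000, 10000):
    lower, upper = chi2_rejection_bounds(n)
    print(f"n={n:>5}: reject if x/n <= {lower:.3f} or x/n >= {upper:.3f}")
```

The acceptance band narrows like 1/sqrt(n), so a sample path that is checked at every n has many chances to wander outside it.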
  • 22. Null hypothesis incorrectly rejected at n=26. Repeated usage of the Chi2 test resulted in a false positive
  • 23. Would a Likelihood Ratio Test have been any different? Null hypothesis incorrectly rejected at n=37. Repeated usage of the likelihood ratio test resulted in a false positive
  • 24. Comparison of Critical Regions between Chi2 and SSRM. At any point in time, the rejection region for SSRM is smaller than that of the Chi2 test. This allows SSRM to be used after every datapoint without increasing the false positive rate
  • 25. Would the SSRM Test have been any different? Null hypothesis not rejected (correct). Repeated usage of the SSRM test did not result in a false positive
  • 26. Was it Just Bad Luck?
  • 27. Chi2 Simulation Study (np.random.seed(0)): Number of False Positives Much Higher Than Expected
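The inflation described on this slide can be reproduced in outline with a small simulation. The sketch below is not the study from the talk: it uses Python's standard `random` module instead of NumPy, and the number of experiments, the horizon of 1000 visitors, and the `min_n` burn-in are all illustrative assumptions.

```python
import random
from statistics import NormalDist

# Critical value of chi-squared with 1 degree of freedom at alpha = 0.05,
# obtained as the square of the two-sided normal critical value.
CHI2_CRIT = NormalDist().inv_cdf(0.975) ** 2

def rejects_when_peeking(n_max, rng, p0=0.5, min_n=10):
    """Simulate one experiment under the null, testing after every visitor.

    Returns True if the chi-squared test rejects at any look from min_n on.
    """
    x = 0
    for n in range(1, n_max + 1):
        if rng.random() < p0:
            x += 1
        if n < min_n:
            continue
        stat = (x - n * p0) ** 2 / (n * p0 * (1.0 - p0))
        if stat >= CHI2_CRIT:
            return True
    return False

rng = random.Random(0)
n_experiments, n_max = 200, 1000
false_positives = sum(rejects_when_peeking(n_max, rng) for _ in range(n_experiments))
rate = false_positives / n_experiments
print(f"false positive rate with peeking: {rate:.2f} (nominal 0.05)")
```

Every simulated experiment has a correct 50/50 split, so any rejection is a false positive; the rate is far above the nominal level because each look is another chance to cross the boundary.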
  • 28. LRT Simulation Study (np.random.seed(0)): Number of False Positives Much Higher Than Expected
  • 29. SSRM Simulation Study (np.random.seed(0)): Number of False Positives As Expected
  • 31. SSRM Simulation Study: null_probability = 0.5, true_probability = 0.6. Almost all SRMs were detected near the beginning of the experiment
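The slides do not spell out how the SSRM statistic is computed, so the following is only a sketch of one standard way to build a test that stays valid under continuous monitoring: a mixture likelihood ratio (Bayes factor) with a Beta prior on the allocation probability, rejecting once it reaches 1/alpha. Whether this matches the SSRM implementation is an assumption; the Beta(1, 1) prior, the function names, and the simulation sizes are illustrative.

```python
import math
import random

def log_beta(a, b):
    """Log of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_bayes_factor(x, n, p0=0.5, a=1.0, b=1.0):
    """Log mixture likelihood ratio: Beta(a, b) mixture over p versus p = p0."""
    log_marginal = log_beta(a + x, b + n - x) - log_beta(a, b)
    log_null = x * math.log(p0) + (n - x) * math.log(1.0 - p0)
    return log_marginal - log_null

def stopping_time(true_p, n_max, rng, p0=0.5, alpha=0.05):
    """Return the first n at which the test rejects, or None if it never does."""
    threshold = math.log(1.0 / alpha)
    x = 0
    for n in range(1, n_max + 1):
        if rng.random() < true_p:
            x += 1
        if log_bayes_factor(x, n, p0) >= threshold:
            return n
    return None

rng = random.Random(0)
n_experiments, n_max = 200, 1000

# Null scenario: true probability equals the intended 0.5.
null_stops = [stopping_time(0.5, n_max, rng) for _ in range(n_experiments)]
false_positive_rate = sum(s is not None for s in null_stops) / n_experiments

# SRM scenario: intended 0.5, true 0.6 (the setting named on the slide).
alt_stops = [stopping_time(0.6, n_max, rng) for _ in range(n_experiments)]
detected = sorted(s for s in alt_stops if s is not None)
detection_rate = len(detected) / n_experiments
median_stop = detected[len(detected) // 2] if detected else None

print(f"false positive rate: {false_positive_rate:.3f}")
print(f"detection rate: {detection_rate:.3f}, median stopping time: {median_stop}")
```

Under the null the likelihood ratio is a nonnegative martingale starting at 1, so by Ville's inequality the chance that it ever reaches 1/alpha is at most alpha, no matter how often it is checked. That is the property that makes testing after every datapoint safe in this construction.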
  • 34. GitHub Repo Contains Jupyter Notebook Tutorials
  • 36. Thank you! Join us on Slack for Q&A: optimize.ly/dev-community. @michaelslindon, michaellindon

Editor's Notes

  1. Remember to join our developer Slack community, where you can keep the progressive delivery and experimentation discussion going. Thanks so much for joining us today; we look forward to continuing the conversation.