SlideShare a Scribd company logo
1 of 35
Download to read offline
Always Valid Inference
Bringing Sequential Analysis to A/B Testing
Ramesh Johari | Leo Pekelis | David Walsh
Stanford University / Optimizely
ramesh.johari@stanford.edu
25 May 2016
1 / 29
A/B testing
▶ Visitors randomized to
Variation A (control) or B
(treatment).
▶ qA, qB: true conversion
rates in each group
▶ Hypothesis test:
H0 : θ = 0 (Null)
H1 : θ ̸= 0 (Alternative)
where θ = qA − qB.
2 / 29
How it works
Fixed horizon testing:
▶ The user must set sample size N in advance.
▶ After each new visitor, the interface computes a p-value:
pn = P(data at least as "extreme" as current sample
|H0).
▶ Decision rule: Reject H0 if p-value pN ≤ α.
3 / 29
How it works
Fixed horizon testing:
▶ The user must set sample size N in advance.
▶ After each new visitor, the interface computes a p-value:
pn = P(data at least as "extreme" as current sample
|H0).
▶ Decision rule: Reject H0 if p-value pN ≤ α.
Remarks:
▶ This bounds Type I error (false positive probability) at
level α.
▶ α trades off false positives and false negatives.
3 / 29
Continuous Monitoring
4 / 29
Continuous monitoring
In practice:
Technology makes it convenient to
continuously monitor tests!
E.g., results matrix:
5 / 29
The problem with peeking
Unfortunately this dramatically inflates Type I errors!
Example: A sample path from an A/A test:
6 / 29
The problem with peeking
With arbitrarily large horizon, Type I error is guaranteed.
Even on finite horizons, Type I errors are highly inflated:
7 / 29
Always Valid Statistics
8 / 29
Always valid p-values
A desired approach is one which decouples
a user's interaction model from
providing statistically valid results.
Initial goal:
▶ A user should be able to look at
their results whenever they want.
▶ The p-value at that time should
give valid type I error control.
9 / 29
Always valid p-values
Definition
A (fixed-horizon) p-value process is a (data-dependent)
sequence pn such that for all n and all x ∈ [0, 1]:
P0(pn ≤ x) ≤ x.
10 / 29
Always valid p-values
Definition
A (fixed-horizon) p-value process is a (data-dependent)
sequence pn such that for all n and all x ∈ [0, 1]:
P0(pn ≤ x) ≤ x.
Definition
A p-value process is always valid if for any data-dependent
stopping time T and all x ∈ [0, 1]:
P0(pT ≤ x) ≤ x.
10 / 29
Always valid p-values
Definition
A (fixed-horizon) p-value process is a (data-dependent)
sequence pn such that for all n and all x ∈ [0, 1]:
P0(pn ≤ x) ≤ x.
Definition
A p-value process is always valid if for any data-dependent
stopping time T and all x ∈ [0, 1]:
P0(pT ≤ x) ≤ x.
Allows user to favorably bias the choice of T based on the
data that is seen.
10 / 29
Constructing Always Valid p-values
11 / 29
Sequential tests
Definition
A sequential test {Tα} is a data-dependent rule for stopping
the test and rejecting the null that:
1. stops the test later when α is lower; and
2. stops with probability ≤ α when the null is true:
P0(Tα < ∞) ≤ α.
12 / 29
Constructing always valid p-values
Theorem
Given a sequential test, define the p-value pn to be:
the smallest α such that the α-level test
would have stopped by observation n.
Then the resulting p-value process is always valid.
13 / 29
Duality
Note that if a user stops the first time that the always valid
p-value drops below α, then the stopping time is Tα.
This natural stopping policy thus implements the original
sequential test.
We now ask: What family of sequential tests should we use?
14 / 29
Power and Run Length
15 / 29
High power and fast detection
To choose a particular class of always valid p-values, we have
to return to what the user is trying to do.
Here is one way to view it:
1. We choose a method of computing p-values.
2. The user observes p-values, and wants to detect whether
a nonzero effect exists (reject H0).
3. The user determines when they want to stop the test, up
to a maximum run time of N.
16 / 29
High power and fast detection
To choose a particular class of always valid p-values, we have
to return to what the user is trying to do.
Here is one way to view it:
1. We choose a method of computing p-values.
2. The user observes p-values, and wants to detect whether
a nonzero effect exists (reject H0).
3. The user determines when they want to stop the test, up
to a maximum run time of N.
So what the user wants is:
1. High power. In other words, high Pθ( reject the null ) for
nonzero θ.
2. Fast detection. As quickly as possible (up to time N).
16 / 29
Special case: N(θ, 1) data
For simplicity: assume data generated from N(θ, 1)
distribution, where θ is unknown.
Consider testing:
H0 : θ = 0
H1 : θ ̸= 0
Notation: Let Ln(θ, θ0; x, n) be LR of θ against θ0 = 0,
given n observations with sample mean x.
17 / 29
The mSPRT
Let H ∼ N(0, σ2
), and consider:
Ln =
∫
Ln(θ, θ0; Xn, n)dH(θ).
Define:
Tα = inf
{
n : Ln ≥
1
α
}
.
This test has high power.
Proposition (Robbins and Siegmund)
{Tα} is a sequential test of power one: the mixture sequential
probability ratio test (mSPRT).
18 / 29
Why it works
Under the null:
▶ Typical fluctuations of sample mean are of size 1/
√
n, so
any decision rule of the form:
Reject if |sample mean| > k/
√
n
is bound to eventually reject. This is what fixed horizon
testing (e.g., z-test) will do.
19 / 29
Why it works
Under the null:
▶ Typical fluctuations of sample mean are of size 1/
√
n, so
any decision rule of the form:
Reject if |sample mean| > k/
√
n
is bound to eventually reject. This is what fixed horizon
testing (e.g., z-test) will do.
▶ In fact, by law of the iterated logarithm, the boundary√
2 log log n/
√
n is crossed infinitely often.
19 / 29
Why it works
Under the null:
▶ Typical fluctuations of sample mean are of size 1/
√
n, so
any decision rule of the form:
Reject if |sample mean| > k/
√
n
is bound to eventually reject. This is what fixed horizon
testing (e.g., z-test) will do.
▶ In fact, by law of the iterated logarithm, the boundary√
2 log log n/
√
n is crossed infinitely often.
▶ mSPRT leads to boundary of the form C
√
log n/
√
n:
▶ Goes to zero (so power one);
▶ But slowly enough (so Type I error can be controlled).
19 / 29
Why it works
Graphical illustration:
20 / 29
Fast detection
In a Bayesian analysis, we show that in the limit of α → 0:
The optimal choice of mixing distribution in the mSPRT
involves (roughly) matching the mixing variance σ2
to the
prior variance of the effect size θ.
21 / 29
Run length
22 / 29
No free lunch?
What do we give up in return for continuous monitoring?
▶ If the effect size is known in advance,
should only be better!
▶ In practice, we don't know the effect size in advance.
The test we designed does not assume knowledge of the
effect size.
We compare our test to a fixed horizon test,
using data from Optimizely.
23 / 29
Run lengths on Optimizely
Our results show robustness to not knowing the effect size:
24 / 29
Run lengths: Interpretation
Our results show robustness to not knowing the effect size.
Intuition:
▶ Detecting an effect of size ∆ takes
a run length proportional to 1/∆2
▶ So the penalty for guessing wrong
about δ is very high!
▶ An MDE that is 2x too small =⇒
run length that is 4x too long
25 / 29
Run lengths: Theory
Suppose that effect size is drawn from a normal distribution.
In an appropriate scaling regime where α → 0 and N → ∞,
we show that mSPRT at level α truncated to Θ(N) gives
similar power as fixed horizon test of length N, but with
detection time that is o(N).
26 / 29
Conclusions
27 / 29
Experimentation in the Internet age
Rapid innovation in
information & communication technology
has democratized the scientific method.
Our goal: "adapt" statistical methodology to
act in partnership with the user.
Additional results:
▶ Extension to multiple testing (Benjamini-Hochberg)
▶ Extension to adaptive allocation (bandits)
▶ Extension to confidence intervals
28 / 29
Optimizely Stats Engine
▶ The ideas presented in this talk were released to all of
Optimizely's customers on January 20, 2015
▶ Provides both always valid p-values
and multiple testing corrections
29 / 29

More Related Content

Similar to Always Valid Inference (Ramesh Johari, Stanford)

Monte Carlo Berkeley.pptx
Monte Carlo Berkeley.pptxMonte Carlo Berkeley.pptx
Monte Carlo Berkeley.pptxHaibinSu2
 
Final Exam ReviewChapter 10Know the three ideas of s.docx
Final Exam ReviewChapter 10Know the three ideas of s.docxFinal Exam ReviewChapter 10Know the three ideas of s.docx
Final Exam ReviewChapter 10Know the three ideas of s.docxlmelaine
 
hypothesisTestPPT.pptx
hypothesisTestPPT.pptxhypothesisTestPPT.pptx
hypothesisTestPPT.pptxdangwalakash07
 
Explaining the Basics of Mean Field Variational Approximation for Statisticians
Explaining the Basics of Mean Field Variational Approximation for StatisticiansExplaining the Basics of Mean Field Variational Approximation for Statisticians
Explaining the Basics of Mean Field Variational Approximation for StatisticiansWayne Lee
 
Optimization Methods in Finance
Optimization Methods in FinanceOptimization Methods in Finance
Optimization Methods in Financethilankm
 
Po 1 10 t distribution-df
Po 1 10  t distribution-dfPo 1 10  t distribution-df
Po 1 10 t distribution-dfshyna10
 
Sequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionSequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionJeremyHeng10
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learningmooopan
 
Categorical data analysis full lecture note PPT.pptx
Categorical data analysis full lecture note  PPT.pptxCategorical data analysis full lecture note  PPT.pptx
Categorical data analysis full lecture note PPT.pptxMinilikDerseh1
 
Research methodology - Estimation Theory & Hypothesis Testing, Techniques of ...
Research methodology - Estimation Theory & Hypothesis Testing, Techniques of ...Research methodology - Estimation Theory & Hypothesis Testing, Techniques of ...
Research methodology - Estimation Theory & Hypothesis Testing, Techniques of ...The Stockker
 
Firefly exact MCMC for Big Data
Firefly exact MCMC for Big DataFirefly exact MCMC for Big Data
Firefly exact MCMC for Big DataGianvito Siciliano
 
Converting Measurement Systems From Attribute
Converting Measurement Systems From AttributeConverting Measurement Systems From Attribute
Converting Measurement Systems From Attributejdavidgreen007
 

Similar to Always Valid Inference (Ramesh Johari, Stanford) (20)

Monte Carlo Berkeley.pptx
Monte Carlo Berkeley.pptxMonte Carlo Berkeley.pptx
Monte Carlo Berkeley.pptx
 
Chapter10 Revised
Chapter10 RevisedChapter10 Revised
Chapter10 Revised
 
Chapter10 Revised
Chapter10 RevisedChapter10 Revised
Chapter10 Revised
 
Chapter10 Revised
Chapter10 RevisedChapter10 Revised
Chapter10 Revised
 
L18_D.pdf
L18_D.pdfL18_D.pdf
L18_D.pdf
 
Final Exam ReviewChapter 10Know the three ideas of s.docx
Final Exam ReviewChapter 10Know the three ideas of s.docxFinal Exam ReviewChapter 10Know the three ideas of s.docx
Final Exam ReviewChapter 10Know the three ideas of s.docx
 
hypothesisTestPPT.pptx
hypothesisTestPPT.pptxhypothesisTestPPT.pptx
hypothesisTestPPT.pptx
 
Explaining the Basics of Mean Field Variational Approximation for Statisticians
Explaining the Basics of Mean Field Variational Approximation for StatisticiansExplaining the Basics of Mean Field Variational Approximation for Statisticians
Explaining the Basics of Mean Field Variational Approximation for Statisticians
 
Optimization Methods in Finance
Optimization Methods in FinanceOptimization Methods in Finance
Optimization Methods in Finance
 
Hypothesis
HypothesisHypothesis
Hypothesis
 
Po 1 10 t distribution-df
Po 1 10  t distribution-dfPo 1 10  t distribution-df
Po 1 10 t distribution-df
 
Sequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmissionSequential Monte Carlo algorithms for agent-based models of disease transmission
Sequential Monte Carlo algorithms for agent-based models of disease transmission
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learning
 
ML MODULE 4.pdf
ML MODULE 4.pdfML MODULE 4.pdf
ML MODULE 4.pdf
 
Categorical data analysis full lecture note PPT.pptx
Categorical data analysis full lecture note  PPT.pptxCategorical data analysis full lecture note  PPT.pptx
Categorical data analysis full lecture note PPT.pptx
 
Research methodology - Estimation Theory & Hypothesis Testing, Techniques of ...
Research methodology - Estimation Theory & Hypothesis Testing, Techniques of ...Research methodology - Estimation Theory & Hypothesis Testing, Techniques of ...
Research methodology - Estimation Theory & Hypothesis Testing, Techniques of ...
 
Firefly exact MCMC for Big Data
Firefly exact MCMC for Big DataFirefly exact MCMC for Big Data
Firefly exact MCMC for Big Data
 
Two Proportions
Two Proportions  Two Proportions
Two Proportions
 
Converting Measurement Systems From Attribute
Converting Measurement Systems From AttributeConverting Measurement Systems From Attribute
Converting Measurement Systems From Attribute
 
Statistics
StatisticsStatistics
Statistics
 

More from Hakka Labs

DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchHakka Labs
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceHakka Labs
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataHakka Labs
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartHakka Labs
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleHakka Labs
 
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataHakka Labs
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale Hakka Labs
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQHakka Labs
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...Hakka Labs
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...Hakka Labs
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestHakka Labs
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringHakka Labs
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresHakka Labs
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkHakka Labs
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesHakka Labs
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityHakka Labs
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...Hakka Labs
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInHakka Labs
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopHakka Labs
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...Hakka Labs
 

More from Hakka Labs (20)

DataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series searchDataEngConf SF16 - High cardinality time series search
DataEngConf SF16 - High cardinality time series search
 
DataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data ScienceDataEngConf SF16 - Data Asserts: Defensive Data Science
DataEngConf SF16 - Data Asserts: Defensive Data Science
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
DataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at InstacartDataEngConf SF16 - Recommendations at Instacart
DataEngConf SF16 - Recommendations at Instacart
 
DataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scaleDataEngConf SF16 - Running simulations at scale
DataEngConf SF16 - Running simulations at scale
 
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor DataDataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
DataEngConf SF16 - Deriving Meaning from Wearable Sensor Data
 
DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale DataEngConf SF16 - Collecting and Moving Data at Scale
DataEngConf SF16 - Collecting and Moving Data at Scale
 
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQDataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
DataEngConf SF16 - BYOMQ: Why We [re]Built IronMQ
 
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
DataEngConf SF16 - Unifying Real Time and Historical Analytics with the Lambd...
 
DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...DataEngConf SF16 - Three lessons learned from building a production machine l...
DataEngConf SF16 - Three lessons learned from building a production machine l...
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
DataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineeringDataEngConf SF16 - Bridging the gap between data science and data engineering
DataEngConf SF16 - Bridging the gap between data science and data engineering
 
DataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data StructuresDataEngConf SF16 - Multi-temporal Data Structures
DataEngConf SF16 - Multi-temporal Data Structures
 
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using SparkDataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
DataEngConf SF16 - Entity Resolution in Data Pipelines Using Spark
 
DataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with OurselvesDataEngConf SF16 - Beginning with Ourselves
DataEngConf SF16 - Beginning with Ourselves
 
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High DeliverabilityDataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
DataEngConf SF16 - Routing Billions of Analytics Events with High Deliverability
 
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
DataEngConf SF16 - Tales from the other side - What a hiring manager wish you...
 
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedInDataEngConf SF16 - Methods for Content Relevance at LinkedIn
DataEngConf SF16 - Methods for Content Relevance at LinkedIn
 
DataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL WorkshopDataEngConf SF16 - Spark SQL Workshop
DataEngConf SF16 - Spark SQL Workshop
 
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
 

Recently uploaded

UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 

Recently uploaded (20)

UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 

Always Valid Inference (Ramesh Johari, Stanford)

  • 1. Always Valid Inference Bringing Sequential Analysis to A/B Testing Ramesh Johari | Leo Pekelis | David Walsh Stanford University / Optimizely ramesh.johari@stanford.edu 25 May 2016 1 / 29
  • 2. A/B testing ▶ Visitors randomized to Variation A (control) or B (treatment). ▶ qA, qB: true conversion rates in each group ▶ Hypothesis test: H0 : θ = 0 (Null) H1 : θ ̸= 0 (Alternative) where θ = qA − qB. 2 / 29
  • 3. How it works Fixed horizon testing: ▶ The user must set sample size N in advance. ▶ After each new visitor, the interface computes a p-value: pn = P(data at least as "extreme" as current sample |H0). ▶ Decision rule: Reject H0 if p-value pN ≤ α. 3 / 29
  • 4. How it works Fixed horizon testing: ▶ The user must set sample size N in advance. ▶ After each new visitor, the interface computes a p-value: pn = P(data at least as "extreme" as current sample |H0). ▶ Decision rule: Reject H0 if p-value pN ≤ α. Remarks: ▶ This bounds Type I error (false positive probability) at level α. ▶ α trades off false positives and false negatives. 3 / 29
  • 6. Continuous monitoring In practice: Technology makes it convenient to continuously monitor tests! E.g., results matrix: 5 / 29
  • 7. The problem with peeking Unfortunately this dramatically inflates Type I errors! Example: A sample path from an A/A test: 6 / 29
  • 8. The problem with peeking With arbitrarily large horizon, Type I error is guaranteed. Even on finite horizons, Type I errors are highly inflated: 7 / 29
  • 10. Always valid p-values A desired approach is one which decouples a user's interaction model from providing statistically valid results. Initial goal: ▶ A user should be able to look at their results whenever they want. ▶ The p-value at that time should give valid type I error control. 9 / 29
  • 11. Always valid p-values Definition A (fixed-horizon) p-value process is a (data-dependent) sequence pn such that for all n and all x ∈ [0, 1]: P0(pn ≤ x) ≤ x. 10 / 29
  • 12. Always valid p-values Definition A (fixed-horizon) p-value process is a (data-dependent) sequence pn such that for all n and all x ∈ [0, 1]: P0(pn ≤ x) ≤ x. Definition A p-value process is always valid if for any data-dependent stopping time T and all x ∈ [0, 1]: P0(pT ≤ x) ≤ x. 10 / 29
  • 13. Always valid p-values Definition A (fixed-horizon) p-value process is a (data-dependent) sequence pn such that for all n and all x ∈ [0, 1]: P0(pn ≤ x) ≤ x. Definition A p-value process is always valid if for any data-dependent stopping time T and all x ∈ [0, 1]: P0(pT ≤ x) ≤ x. Allows user to favorably bias the choice of T based on the data that is seen. 10 / 29
  • 14. Constructing Always Valid p-values 11 / 29
  • 15. Sequential tests Definition A sequential test {Tα} is a data-dependent rule for stopping the test and rejecting the null that: 1. stops the test later when α is lower; and 2. stops with probability ≤ α when the null is true: P0(Tα < ∞) ≤ α. 12 / 29
  • 16. Constructing always valid p-values Theorem Given a sequential test, define the p-value pn to be: the smallest α such that the α-level test would have stopped by observation n. Then the resulting p-value process is always valid. 13 / 29
  • 17. Duality Note that if a user stops the first time that the always valid p-value drops below α, then the stopping time is Tα. This natural stopping policy thus implements the original sequential test. We now ask: What family of sequential tests should we use? 14 / 29
  • 18. Power and Run Length 15 / 29
  • 19. High power and fast detection To choose a particular class of always valid p-values, we have to return to what the user is trying to do. Here is one way to view it: 1. We choose a method of computing p-values. 2. The user observes p-values, and wants to detect whether a nonzero effect exists (reject H0). 3. The user determines when they want to stop the test, up to a maximum run time of N. 16 / 29
  • 20. High power and fast detection To choose a particular class of always valid p-values, we have to return to what the user is trying to do. Here is one way to view it: 1. We choose a method of computing p-values. 2. The user observes p-values, and wants to detect whether a nonzero effect exists (reject H0). 3. The user determines when they want to stop the test, up to a maximum run time of N. So what the user wants is: 1. High power. In other words, high Pθ( reject the null ) for nonzero θ. 2. Fast detection. As quickly as possible (up to time N). 16 / 29
  • 21. Special case: N(θ, 1) data For simplicity: assume data generated from N(θ, 1) distribution, where θ is unknown. Consider testing: H0 : θ = 0 H1 : θ ̸= 0 Notation: Let Ln(θ, θ0; x, n) be LR of θ against θ0 = 0, given n observations with sample mean x. 17 / 29
  • 22. The mSPRT Let H ∼ N(0, σ2 ), and consider: Ln = ∫ Ln(θ, θ0; Xn, n)dH(θ). Define: Tα = inf { n : Ln ≥ 1 α } . This test has high power. Proposition (Robbins and Siegmund) {Tα} is a sequential test of power one: the mixture sequential probability ratio test (mSPRT). 18 / 29
  • 23. Why it works Under the null: ▶ Typical fluctuations of sample mean are of size 1/ √ n, so any decision rule of the form: Reject if |sample mean| > k/ √ n is bound to eventually reject. This is what fixed horizon testing (e.g., z-test) will do. 19 / 29
  • 24. Why it works Under the null: ▶ Typical fluctuations of sample mean are of size 1/ √ n, so any decision rule of the form: Reject if |sample mean| > k/ √ n is bound to eventually reject. This is what fixed horizon testing (e.g., z-test) will do. ▶ In fact, by law of the iterated logarithm, the boundary√ 2 log log n/ √ n is crossed infinitely often. 19 / 29
  • 25. Why it works Under the null: ▶ Typical fluctuations of sample mean are of size 1/ √ n, so any decision rule of the form: Reject if |sample mean| > k/ √ n is bound to eventually reject. This is what fixed horizon testing (e.g., z-test) will do. ▶ In fact, by law of the iterated logarithm, the boundary√ 2 log log n/ √ n is crossed infinitely often. ▶ mSPRT leads to boundary of the form C √ log n/ √ n: ▶ Goes to zero (so power one); ▶ But slowly enough (so Type I error can be controlled). 19 / 29
  • 26. Why it works Graphical illustration: 20 / 29
  • 27. Fast detection In a Bayesian analysis, we show that in the limit of α → 0: The optimal choice of mixing distribution in the mSPRT involves (roughly) matching the mixing variance σ2 to the prior variance of the effect size θ. 21 / 29
  • 29. No free lunch? What do we give up in return for continuous monitoring? ▶ If the effect size is known in advance, should only be better! ▶ In practice, we don't know the effect size in advance. The test we designed does not assume knowledge of the effect size. We compare our test to a fixed horizon test, using data from Optimizely. 23 / 29
  • 30. Run lengths on Optimizely Our results show robustness to not knowing the effect size: 24 / 29
  • 31. Run lengths: Interpretation Our results show robustness to not knowing the effect size. Intuition: ▶ Detecting an effect of size ∆ takes a run length proportional to 1/∆2 ▶ So the penalty for guessing wrong about δ is very high! ▶ An MDE that is 2x too small =⇒ run length that is 4x too long 25 / 29
  • 32. Run lengths: Theory Suppose that effect size is drawn from a normal distribution. In an appropriate scaling regime where α → 0 and N → ∞, we show that mSPRT at level α truncated to Θ(N) gives similar power as fixed horizon test of length N, but with detection time that is o(N). 26 / 29
  • 34. Experimentation in the Internet age Rapid innovation in information & communication technology has democratized the scientific method. Our goal: "adapt" statistical methodology to act in partnership with the user. Additional results: ▶ Extension to multiple testing (Benjamini-Hochberg) ▶ Extension to adaptive allocation (bandits) ▶ Extension to confidence intervals 28 / 29
  • 35. Optimizely Stats Engine ▶ The ideas presented in this talk were released to all of Optimizely's customers on January 20, 2015 ▶ Provides both always valid p-values and multiple testing corrections 29 / 29