Ramesh Johari on bringing sequential analysis to A/B testing, with examples from his work at Optimizely.
These slides are from a talk given at the SF Data Engineering meetup. http://www.meetup.com/SF-Data-Engineering/events/231047195/
Always Valid Inference (Ramesh Johari, Stanford)
1. Always Valid Inference
Bringing Sequential Analysis to A/B Testing
Ramesh Johari | Leo Pekelis | David Walsh
Stanford University / Optimizely
ramesh.johari@stanford.edu
25 May 2016
1 / 29
2. A/B testing
▶ Visitors randomized to Variation A (control) or B (treatment).
▶ qA, qB: true conversion rates in each group
▶ Hypothesis test:
H0 : θ = 0 (Null)
H1 : θ ≠ 0 (Alternative)
where θ = qA − qB.
2 / 29
4. How it works
Fixed horizon testing:
▶ The user must set sample size N in advance.
▶ After each new visitor, the interface computes a p-value:
pn = P(data at least as "extreme" as current sample | H0).
▶ Decision rule: Reject H0 if p-value pN ≤ α.
Remarks:
▶ This bounds Type I error (false positive probability) at level α.
▶ α trades off false positives and false negatives.
3 / 29
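As a rough illustration of the fixed-horizon p-value described above, here is a minimal sketch. The slides do not say which statistic the interface uses, so the pooled two-proportion z-test below is an assumption, and the function name `two_proportion_p_value` is hypothetical.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: qA - qB = 0 (pooled z-test; an assumed choice)."""
    q_a = conv_a / n_a
    q_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis qA = qB.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # No variation in the data at all: nothing to reject.
        return 1.0
    z = (q_a - q_b) / se
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical counts: 100/1000 conversions in A, 150/1000 in B.
    p = two_proportion_p_value(100, 1000, 150, 1000)
    print("reject H0 at alpha = 0.05:", p <= 0.05)
```

The guarantee on the slide only applies when this p-value is read once, at the pre-committed sample size N.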
7. The problem with peeking
Unfortunately, peeking (stopping the test the first time the p-value drops below α) dramatically inflates Type I errors!
Example: A sample path from an A/A test:
6 / 29
8. The problem with peeking
With an arbitrarily long horizon, a Type I error is guaranteed.
Even on finite horizons, Type I errors are highly inflated:
7 / 29
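The inflation can be checked with a small simulation. This is a sketch under stated assumptions, not the experiment behind the slide: data are N(0, 1) with known variance (so the null is true), the test is a two-sided z-test at the usual 1.96 threshold, and the user peeks after every observation. The function name and parameters are hypothetical.

```python
import numpy as np


def peeking_false_positive_rates(n_max=1000, n_sims=2000, z_crit=1.96, seed=0):
    """Return (fixed-horizon rate, peeking rate) of false rejections under H0."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated A/A-style experiment with no true effect.
    x = rng.standard_normal((n_sims, n_max))
    n = np.arange(1, n_max + 1)
    # z-statistic after each observation: sum / sqrt(n), since the variance is 1.
    z = np.cumsum(x, axis=1) / np.sqrt(n)
    # Fixed horizon: look only once, at n_max.
    fixed = float(np.mean(np.abs(z[:, -1]) > z_crit))
    # Peeking: reject if the threshold is crossed at any look.
    peeking = float(np.mean((np.abs(z) > z_crit).any(axis=1)))
    return fixed, peeking


if __name__ == "__main__":
    fixed, peeking = peeking_false_positive_rates()
    print(f"fixed horizon: {fixed:.3f}, peeking at every observation: {peeking:.3f}")
```

The fixed-horizon rate should land near the nominal level, while the peeking rate grows with the number of looks.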
10. Always valid p-values
The desired approach decouples the user's interaction model from the statistical validity of the results.
Initial goal:
▶ A user should be able to look at their results whenever they want.
▶ The p-value at that time should give valid Type I error control.
9 / 29
13. Always valid p-values
Definition
A (fixed-horizon) p-value process is a (data-dependent)
sequence pn such that for all n and all x ∈ [0, 1]:
P0(pn ≤ x) ≤ x,
where P0 denotes probability under the null hypothesis H0.
Definition
A p-value process is always valid if for any data-dependent
stopping time T and all x ∈ [0, 1]:
P0(pT ≤ x) ≤ x.
This allows the user to favorably bias the choice of T based on the data seen so far.
10 / 29
15. Sequential tests
Definition
A sequential test {Tα} is a data-dependent rule for stopping
the test and rejecting the null that:
1. stops the test later when α is lower; and
2. stops with probability ≤ α when the null is true:
P0(Tα < ∞) ≤ α.
12 / 29
16. Constructing always valid p-values
Theorem
Given a sequential test, define the p-value pn to be:
the smallest α such that the α-level test
would have stopped by observation n.
Then the resulting p-value process is always valid.
13 / 29
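The construction in the theorem can be written out as follows. This is a sketch of the standard argument rather than the proof from the talk; it assumes the stopping time T is finite and uses only the two properties from the definition of a sequential test (T_α is later for lower α, and P0(T_α < ∞) ≤ α).

```latex
\begin{align*}
p_n &= \inf\{\alpha : T_\alpha \le n\}
  && \text{(with } p_n = 1 \text{ if no test has stopped by } n\text{)} \\
\{p_T \le x\} &\subseteq \{T_{x+\varepsilon} \le T\}
  \subseteq \{T_{x+\varepsilon} < \infty\}
  && \text{for every } \varepsilon > 0 \\
P_0(p_T \le x) &\le P_0(T_{x+\varepsilon} < \infty) \le x + \varepsilon
  && \text{and letting } \varepsilon \to 0 \text{ gives } P_0(p_T \le x) \le x.
\end{align*}
```

Note that p_n defined this way can only decrease as n grows, since any test that has stopped by observation n has also stopped by observation n + 1.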
17. Duality
Note that if a user stops the first time that the always valid
p-value drops below α, then the stopping time is Tα.
This natural stopping policy thus implements the original
sequential test.
We now ask: What family of sequential tests should we use?
14 / 29
20. High power and fast detection
To choose a particular class of always valid p-values, we have
to return to what the user is trying to do.
Here is one way to view it:
1. We choose a method of computing p-values.
2. The user observes p-values, and wants to detect whether
a nonzero effect exists (reject H0).
3. The user determines when they want to stop the test, up
to a maximum run time of N.
So what the user wants is:
1. High power. In other words, high Pθ(reject the null) for nonzero θ.
2. Fast detection. Rejection should come as quickly as possible (up to time N).
16 / 29
21. Special case: N(θ, 1) data
For simplicity: assume data generated from N(θ, 1)
distribution, where θ is unknown.
Consider testing:
H0 : θ = 0
H1 : θ ≠ 0
Notation: Let Ln(θ, θ0; x, n) be the likelihood ratio (LR) of θ against θ0 = 0,
given n observations with sample mean x.
17 / 29
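For this special case the likelihood ratio has a simple closed form. The worked step below is a standard calculation, not taken from the slides; φ denotes the standard normal density and x̄ the sample mean of x_1, …, x_n.

```latex
\begin{align*}
L_n(\theta, 0; \bar{x}, n)
  &= \prod_{i=1}^{n} \frac{\phi(x_i - \theta)}{\phi(x_i)}
   = \exp\!\Big(\theta \sum_{i=1}^{n} x_i - \frac{n\theta^2}{2}\Big)
   = \exp\!\Big(n\big(\theta \bar{x} - \tfrac{1}{2}\theta^2\big)\Big).
\end{align*}
```

The data enter only through n and x̄, which is why the notation lists just those two quantities.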
22. The mSPRT
Let H ∼ N(0, σ²), and consider:
Ln = ∫ Ln(θ, θ0; Xn, n) dH(θ).
Define:
Tα = inf{ n : Ln ≥ 1/α }.
This test has high power.
Proposition (Robbins and Siegmund)
{Tα} is a sequential test of power one: the mixture sequential
probability ratio test (mSPRT).
18 / 29
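For N(θ, 1) data and a normal mixing distribution the integral can be done in closed form, which makes the mSPRT and its always valid p-values cheap to compute. The sketch below is illustrative, not Optimizely's implementation: it assumes unit-variance normal observations, θ0 = 0, and H = N(0, σ²), in which case the mixture likelihood ratio after n observations with sample mean x̄ equals (1 + nσ²)^(-1/2) · exp(n²σ²x̄² / (2(1 + nσ²))). The function names are hypothetical.

```python
import numpy as np


def msprt_log_lr(x, sigma2=1.0):
    """Log mixture likelihood ratio after each observation (assumes N(theta, 1) data)."""
    x = np.asarray(x, dtype=float)
    n = np.arange(1, x.size + 1)
    xbar = np.cumsum(x) / n
    # Closed form of the integral of exp(n(theta*xbar - theta^2/2)) against N(0, sigma2).
    return -0.5 * np.log1p(n * sigma2) + (n**2 * sigma2 * xbar**2) / (
        2.0 * (1.0 + n * sigma2)
    )


def always_valid_p_values(x, sigma2=1.0):
    """p_n = smallest alpha whose test has stopped by n = min(1, 1 / max_{m<=n} LR_m)."""
    running_max = np.maximum.accumulate(msprt_log_lr(x, sigma2))
    return np.minimum(1.0, np.exp(-running_max))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data with a true effect of 0.2.
    p = always_valid_p_values(rng.normal(0.2, 1.0, size=2000))
    crossed = np.nonzero(p <= 0.05)[0]
    print("first n with p <= 0.05:", crossed[0] + 1 if crossed.size else None)
```

Working in the log domain avoids overflow when the likelihood ratio becomes very large, and the running maximum is what makes the p-value sequence nonincreasing.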
25. Why it works
Under the null:
▶ Typical fluctuations of sample mean are of size 1/√n, so any decision rule of the form:
Reject if |sample mean| > k/√n
is bound to eventually reject. This is what fixed horizon testing (e.g., z-test) will do.
▶ In fact, by law of the iterated logarithm, the boundary √(2 log log n)/√n is crossed infinitely often.
▶ mSPRT leads to boundary of the form C√(log n)/√n:
▶ Goes to zero (so power one);
▶ But slowly enough (so Type I error can be controlled).
19 / 29
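The two boundaries can be compared numerically. This is a sketch under assumptions not stated on the slide: N(θ, 1) data, a normal mixing distribution N(0, σ²), and the resulting closed-form mixture likelihood ratio, whose rejection rule "ratio ≥ 1/α" can be solved for |sample mean|. The function names are hypothetical.

```python
import numpy as np


def fixed_horizon_boundary(n, z_crit=1.96):
    """Reject if |sample mean| exceeds k / sqrt(n), with k = z_crit."""
    return z_crit / np.sqrt(np.asarray(n, dtype=float))


def msprt_boundary(n, alpha=0.05, sigma2=1.0):
    """|sample mean| at which the normal-mixture likelihood ratio reaches 1/alpha."""
    n = np.asarray(n, dtype=float)
    # log(1/alpha) plus the log of the normalizing factor sqrt(1 + n*sigma2).
    log_threshold = np.log(1.0 / alpha) + 0.5 * np.log1p(n * sigma2)
    return np.sqrt(2.0 * (1.0 + n * sigma2) * log_threshold / (n**2 * sigma2))


if __name__ == "__main__":
    for n in (10, 100, 1000, 10_000, 100_000):
        print(n, float(fixed_horizon_boundary(n)), float(msprt_boundary(n)))
```

Both boundaries shrink to zero, but the mSPRT boundary shrinks more slowly by a factor that grows like √(log n), which is the extra room that keeps the Type I error controlled under continuous monitoring.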
27. Fast detection
In a Bayesian analysis, we show that in the limit of α → 0:
The optimal choice of mixing distribution in the mSPRT
involves (roughly) matching the mixing variance σ² to the
prior variance of the effect size θ.
21 / 29
29. No free lunch?
What do we give up in return for continuous monitoring?
▶ If the effect size is known in advance, a fixed horizon test should only be better!
▶ In practice, we don't know the effect size in advance.
The test we designed does not assume knowledge of the
effect size.
We compare our test to a fixed horizon test,
using data from Optimizely.
23 / 29
30. Run lengths on Optimizely
Our results show robustness to not knowing the effect size:
24 / 29
31. Run lengths: Interpretation
Our results show robustness to not knowing the effect size.
Intuition:
▶ Detecting an effect of size Δ takes a run length proportional to 1/Δ²
▶ So the penalty for guessing wrong about Δ is very high!
▶ An MDE (minimum detectable effect) that is 2x too small ⇒ run length that is 4x too long
25 / 29
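The 1/Δ² scaling can be seen in the textbook sample-size formula for a fixed-horizon test. The sketch below assumes unit-variance normal data and a two-sided z-test, and uses the common approximation n ≈ ((z_{1−α/2} + z_{power}) / Δ)²; it is not the calculation behind the Optimizely results, and the function name is hypothetical.

```python
from statistics import NormalDist


def fixed_horizon_run_length(delta, alpha=0.05, power=0.8):
    """Approximate observations needed to detect effect `delta` (assumes N(theta, 1) data)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return ((z_alpha + z_beta) / delta) ** 2


if __name__ == "__main__":
    n_right = fixed_horizon_run_length(0.10)
    # Guessing an MDE that is 2x too small.
    n_wrong = fixed_horizon_run_length(0.05)
    print("run length ratio:", n_wrong / n_right)
```

Because Δ enters only through 1/Δ², halving the assumed effect size quadruples the planned run length regardless of the chosen α and power.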
32. Run lengths: Theory
Suppose that effect size is drawn from a normal distribution.
In an appropriate scaling regime where α → 0 and N → ∞,
we show that the mSPRT at level α, truncated at Θ(N) observations, gives
power similar to that of a fixed horizon test of length N, but with
detection time that is o(N).
26 / 29
34. Experimentation in the Internet age
Rapid innovation in
information & communication technology
has democratized the scientific method.
Our goal: "adapt" statistical methodology to
act in partnership with the user.
Additional results:
▶ Extension to multiple testing (Benjamini-Hochberg)
▶ Extension to adaptive allocation (bandits)
▶ Extension to confidence intervals
28 / 29
35. Optimizely Stats Engine
▶ The ideas presented in this talk were released to all of
Optimizely's customers on January 20, 2015
▶ Provides both always valid p-values
and multiple testing corrections
29 / 29