- A/B testing involves randomized controlled experiments comparing a treatment group to a control group. However, there are various sources of variability beyond just the treatment that must be accounted for.
- Good experiment design aims to minimize bias and convert it to random noise through randomization. The role of statistics is to quantify the magnitude of the treatment effect compared to the noise.
- Classical hypothesis testing approaches the problem as "assuming no difference and seeing if the data contradicts that". However, concerns with this approach include overreliance on p-values and not addressing multiple testing.
- Bayesian approaches consider the probability of there being a difference given the data, but require specifying a prior probability, which is challenging. Alternatives like multi-armed bandits are raised as an open question.
2. Roadmap
• What is A/B testing?
• Good experiments and the role of statistics
• Similar to proof by contradiction
• “Tests”
• Big data meets classic asymptotics
• Complaints with classical hypothesis testing
• Alternatives?
3. What is A/B Testing
• An industry term for a controlled, randomized experiment between treatment/control groups.
• An age-old problem… especially with humans.
4. What most people know:
Gather samples → Assign treatments → Apply treatments → Measure outcome → Compare
[Diagram: group A vs. group B]
5. What most people know:
The only difference is in the treatment!
[Diagram: group A vs. group B]
7. Confounding:
• If there are sources of variability in addition to the treatment effect, how can we identify/isolate the effect of the treatment?
8. 3 Types of Variability:
• Controlled variability
  • Systematic and desired
  • i.e. our treatment
• Bias
  • Systematic but not desired
  • Anything that can confound our study
• Noise
  • Random and not desired
  • Won’t confound the study, but makes it hard to make a decision.
9. How do we categorize each?
[Diagram: A vs. B pipeline, with arrows marking variability from samples/inputs, variability from the treatment/function, and variability from measurement]
13. Reality:
Think about what you want to measure and how!
Minimize the noise level/variability in the metric.
[Diagram: group A vs. group B]
14. A good experiment in general:
- Good design and implementation should be used to avoid bias.
- For unavoidable biases, use randomization to turn them into noise.
- Good planning to minimize noise in data.
15. How do we deal with noise?
- Bread and butter of statisticians!
- Quantify the magnitude of the treatment
- Quantify the magnitude of the noise
- Just compare… most of the time (a minimal sketch follows below).
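To make "just compare" concrete, here is a minimal Python sketch with made-up metric samples `a` and `b` (not data from the talk): the treatment magnitude is the difference in means, the noise magnitude is the standard error of that difference, and comparing is taking their ratio.

```python
# Minimal sketch: quantify the treatment, quantify the noise, compare.
# `a` and `b` are hypothetical metric samples, not data from the talk.
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(10.5, 3.0, size=500)   # treatment group metric
b = rng.normal(10.0, 3.0, size=500)   # control group metric

effect = a.mean() - b.mean()          # magnitude of the treatment
# Standard error of the difference in means = magnitude of the noise:
noise = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

print(effect / noise)  # a large ratio -> the effect stands out from the noise
```

This ratio is exactly the t-statistic that the standard t-test (coming up later) turns into a p-value.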
17-18. Formalizing the Comparison
Similar to proof by contradiction:
- You assume the difference is by chance (noise).
- See how the data contradicts the assumption.
- If the surprise surpasses a threshold, we reject the assumption.
- …nothing is “100%”
19-20. Difference due to chance?
Red -> treatment; Black -> control

ID        Group      PV
Person 1  treatment  39
Person 2  treatment  209
Person 3  treatment  31
Person 4  control    98
Person 5  treatment  9
Person 6  control    151

Let’s measure the difference in means!
Treatment mean = 72; control mean = 124.5
Diff = 72 - 124.5 = -52.5
…so what?
21. Difference due to chance?
Red -> treatment; Black -> control
[Table: the same six PV values (39, 209, 31, 98, 9, 151), with the treatment/control labels shuffled across Persons 1-6]
If there were no difference from the treatment, shuffling the treatment labels would emulate the randomization of the samples.
22. Difference due to chance?
Red -> treatment; Black -> control
One shuffle: treatment = {209, 31, 98, 151}, mean 122.25; control = {39, 9}, mean 24.
Diff = 122.25 - 24 = 98.25
23. Difference due to chance?
Red -> treatment; Black -> control
Another shuffle: treatment = {39, 209, 31, 151}, mean 107.5; control = {98, 9}, mean 53.5.
Diff = 107.5 - 53.5 = 54
25. Difference due to chance?
Our original diff: -52.5
46.5% of the permutations yielded a difference at least as large in magnitude as our original.
Are you surprised by the initial results?
27. “Tests”
Congratulations!
- You just learned the permutation test!
- The 46.5% is the p-value under the permutation test.
Problems:
- Permuting the labels can be computationally costly.
  - Not possible before computers!
- Statistical theory says there are many tests out there.
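As a concrete illustration, here is a minimal Python sketch of the permutation test on the six PV values from the earlier slides (treatment assignment reconstructed from the slide means). With only 6 people we can enumerate all 15 relabelings exactly; at scale you would sample random shuffles instead.

```python
# Minimal sketch of the permutation test from the slides.
from itertools import combinations

pv = [39, 209, 31, 98, 9, 151]   # page views for Persons 1-6
treated = {0, 1, 2, 4}           # Persons 1, 2, 3, 5 got the treatment

def mean_diff(treated_idx):
    treat = [pv[i] for i in treated_idx]
    ctrl = [pv[i] for i in range(len(pv)) if i not in treated_idx]
    return sum(treat) / len(treat) - sum(ctrl) / len(ctrl)

observed = mean_diff(treated)    # -52.5, as on the slide

# Enumerate every way to relabel 4 of the 6 people as "treatment".
diffs = [mean_diff(set(c)) for c in combinations(range(len(pv)), len(treated))]

# p-value: fraction of relabelings at least as extreme as what we saw.
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
print(observed, p_value)         # -52.5, 7/15 ≈ 0.467 (the slides' ~46.5%)
```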
28. Standard t-test:
1) Calculate delta: Δ = mean_treatment - mean_control
2) Assume Δ follows a Normal distribution, then calculate the p-value.
   [Plot: Normal density centered at 0; p-value = sum of the red tail areas beyond ±Δ]
3) If p-value < 0.05, then we reject the assumption that there is no difference between treatment and control.
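A minimal sketch of steps 1-3 using SciPy. The samples are made up, and this uses Welch's variant (`equal_var=False`), a common default that relaxes the equal-variance assumption of the plain t-test.

```python
# Sketch of the standard t-test workflow on hypothetical samples.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=10.5, scale=3.0, size=1000)   # treatment metric (made up)
b = rng.normal(loc=10.0, scale=3.0, size=1000)   # control metric (made up)

delta = a.mean() - b.mean()                      # step 1: the observed difference
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # step 2: Welch's t-test

# Step 3: compare the two-sided p-value to the 0.05 threshold.
reject = p_value < 0.05
print(delta, p_value, reject)
```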
29-31. Big Data meets Classic Stats
Wait, our metrics may not be Normal!
• We care about the “mean of the metric” and not the actual metric distribution.
• Central Limit Theorem: the “mean of the metric” will be Normal if the sample size is LARGE!
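A quick simulation sketch of the point: draw a heavily skewed metric (an exponential, chosen here purely for illustration) and watch the distribution of its sample mean lose its skew, as the CLT predicts.

```python
# CLT sketch: the metric is skewed, but its sample mean is nearly Normal.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5000, 2000                    # sample size, number of experiments

raw = rng.exponential(scale=1.0, size=n)  # one sample of a skewed metric
means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

def skew(x):
    # Sample skewness; 0 for a symmetric (e.g. Normal) distribution.
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

print(skew(raw))    # ~2 for an exponential: very skewed
print(skew(means))  # close to 0: the sample mean looks Normal
```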
32. Assumptions with the t-test
- Normality of %delta
  - Guaranteed with large sample sizes
- Independent samples
- Not too many 0’s
That’s IT!!!
- Easy to automate.
- Simple and general.
33-36. What are “Tests”?
• Statistical tests are just procedures that depend on data to make a decision.
• Engineerify: statistical tests are functions that take in data and treatments, and return a boolean.
Guarantees:
• By comparing the p-value to a 5% threshold, we control
  P( Test says difference exists | In reality NO difference ) <= 5%
• By setting the power of the test to be 80%, we control
  P( Test says difference exists | In reality difference exists ) >= 80%
  • Increasing this often requires more data (see the sample-size sketch below).
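To make "more data" concrete, here is a sketch of the textbook normal-approximation sample-size formula for a two-sided, two-sample test at 5% significance and 80% power. The effect size is a hypothetical choice, expressed in standard deviations of the metric.

```python
# Sample-size sketch: how much data does a given power guarantee require?
from scipy.stats import norm

alpha, power = 0.05, 0.80
effect_size = 0.1        # assumed difference, in units of the metric's std dev

z_alpha = norm.ppf(1 - alpha / 2)   # ≈ 1.96
z_power = norm.ppf(power)           # ≈ 0.84

# Per-group sample size for a two-sided two-sample test:
n_per_group = 2 * ((z_alpha + z_power) / effect_size) ** 2
print(round(n_per_group))  # ≈ 1570 per group: small effects need a lot of data
```

Halving the effect size quadruples the required sample size, which is why detecting tiny differences is so expensive.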
42-44. Meaning:
- Most appropriate for repeated decision making
  - E.g. spammer or not
- Not seeing a difference could mean
  - There is no difference, or
  - Not enough power
- Seeing a difference could mean
  - There is a difference, or
  - We got unlucky/lucky
- Your specific test is either impactful or not. (100% or 0%)
Not what most people want to hear…
45-48. Complaints with Hypothesis Testing
• People get really stuck on p-values and tests.
  • Confusing, boring, and formulaic.
• Statistical significance != scientific significance.
  • You could detect a .000001 difference; so what?
• Multiple hypothesis testing (see the sketch below)
  • A 5% false-positive rate is 1 in 20. Quite high!
  • http://xkcd.com/882/
  • Most published results are still false (Ioannidis 2005).
• What is it answering?
  • Nothing specific about your particular test… the probabilities are over repeated trials.
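A small sketch of the multiple-testing arithmetic behind the xkcd strip, assuming 20 independent tests where no real effect exists; the Bonferroni correction shown is the simplest standard fix.

```python
# Why 20 tests at the 5% level is dangerous, plus the simplest fix.
m, alpha = 20, 0.05

# Chance that at least one of the 20 null tests comes back "significant":
p_any_false_positive = 1 - (1 - alpha) ** m
print(p_any_false_positive)        # ≈ 0.64 -- the "green jelly beans" problem

# Bonferroni correction: test each hypothesis at alpha / m instead.
bonferroni_alpha = alpha / m       # 0.0025
print(1 - (1 - bonferroni_alpha) ** m)  # ≈ 0.049, back under control
```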
49-50. Abuse: the Prosecutor’s Fallacy
Both children of a British mother died within a short period of time. The mother was convicted of murder because the p-value was low:
“If she was innocent, the chance of both children dying is low.”
p-value = P( two deaths | innocent )
In fact, we should be looking at P( innocent | two deaths ).
This is the prosecutor’s fallacy.
52. Example: the baseline matters!
[Diagram: within All Mothers, both the Guilty and the Innocent subsets contain a small “two deaths” region]
The p-value can be small, but the baseline can be huge.
53. Any Alternatives?
P( innocent | two deaths ) is what we want… but does it make sense?
Bayesian methodology:
P( difference exists | data )
This requires knowing P( difference exists ), i.e. the prior.
- Philosophical debate: “What is a probability?”
- Easy to cheat the numbers (see the sketch below)
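For flavor, here is a minimal sketch of a Bayesian A/B comparison on conversion data: Beta posteriors under a flat prior, and P(treatment beats control | data) estimated by Monte Carlo. All counts and the flat prior are assumptions for illustration; the prior is exactly the "easy to cheat" knob the slide warns about.

```python
# Bayesian sketch: P(difference exists | data) via Beta posteriors.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conversions / visitors (not numbers from the talk):
conv_a, n_a = 120, 1000   # treatment
conv_b, n_b = 100, 1000   # control

# Beta(1, 1) flat prior; posterior is Beta(conversions + 1, misses + 1).
post_a = rng.beta(conv_a + 1, n_a - conv_a + 1, size=100_000)
post_b = rng.beta(conv_b + 1, n_b - conv_b + 1, size=100_000)

# Posterior probability that treatment beats control:
print((post_a > post_b).mean())   # ≈ 0.92 under these made-up counts
```

Note that a different prior would move this number, which is the "cheating" concern in a nutshell.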
54. Questions?
- How to deal with multiple hypothesis testing?
- What are we doing in the company?
- Rumor has it that “multi-armed bandit > A/B testing”?