Crash Course in
A/B testing
A statistical perspective
Wayne Tai Lee
Roadmap
• What is A/B testing?
• Good experiments and the role of statistics
• Similar to proof by contradiction
• “Tests”
• Big data meets classic asymptotics
• Complaints with classical hypothesis testing
• Alternatives?
What is A/B Testing?
• An industry term for a controlled, randomized experiment comparing treatment and control groups.
• An age-old problem, especially with humans.
What most people know:
1) Gather samples
2) Assign treatments
3) Apply treatments
4) Measure outcome
5) Compare A vs. B
What most people know:
The only difference between A and B is the treatment!
Reality:
• Variability from samples/inputs
• Variability from treatment/function
• Variability from measurement
A ?????? B
How do we account for all that?
Confounding:
• If there is variability beyond the treatment effect, how can we identify and isolate the effect of the treatment?
3 Types of Variability:
• Controlled variability
  • Systematic and desired
  • i.e. our treatment
• Bias
  • Systematic but not desired
  • Anything that can confound our study
• Noise
  • Random error but not desired
  • Won’t confound the study but makes it hard to make a decision.
How do we categorize each?
• Variability from samples/inputs
• Variability from treatment/function
• Variability from measurement
Reality:
Good instrumentation!
Reality:
Randomize assignment!
Convert bias to noise
Your population can be skewed or biased, but that only restricts the generalizability of the results.
Reality:
Think about what you want to measure and how!
Minimize the noise level/variability in the metric.
A good experiment in general:
- Use good design and implementation to avoid bias.
- For unavoidable biases, use randomization to turn them into noise.
- Plan well to minimize noise in the data.
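A minimal sketch of the randomization step, assuming units are identified by simple labels. The function name `assign_treatments` and its parameters are hypothetical, chosen only for illustration; the point is that a random shuffle, not a human or a system rule, decides who gets the treatment.

```python
import random

def assign_treatments(unit_ids, n_treatment, seed=None):
    """Randomly pick n_treatment units for treatment; the rest are control."""
    rng = random.Random(seed)          # seeded so the assignment is reproducible
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)              # random order removes any systematic pattern
    treated = set(shuffled[:n_treatment])
    return {uid: ("treatment" if uid in treated else "control") for uid in unit_ids}

# Hypothetical example: six people, three of them treated.
assignment = assign_treatments([f"Person {i}" for i in range(1, 7)], n_treatment=3, seed=0)
print(assignment)
```

Because the split is random, any hidden trait of the units is equally likely to land in either group, which is what turns a potential bias into noise.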
How do we deal with noise?
- Bread and butter of statisticians!
- Quantify the magnitude of the treatment effect
- Quantify the magnitude of the noise
- Just compare… most of the time
Formalizing the Comparison
Similar to proof by contradiction
- Assume the difference is due to chance (noise).
- See how strongly the data contradict that assumption.
- If the surprise surpasses a threshold, we reject the assumption.
- …but nothing is “100%”.
Difference due to chance?
Red -> treatment; Black -> control

ID          PV
Person 1    39
Person 2    209
Person 3    31
Person 4    98
Person 5    9
Person 6    151

Let’s measure the difference in means!
Group means: 72 (Persons 1, 2, 3 and 5) and 124.5 (Persons 4 and 6); this 4/2 split is the only one that reproduces both means.
Diff = 72 – 124.5 = -52.5
….so what?
Difference due to chance?
If there were no difference from the treatment, shuffling the treatment labels emulates the randomization of the samples.

Shuffle 1 (two-person group: Persons 1 and 5, i.e. 39 and 9): Diff = 122.25 – 24 = 98.25
Shuffle 2 (two-person group: Persons 4 and 5, i.e. 98 and 9): Diff = 107.5 – 53.5 = 54
Difference due to chance?
50000 repeats later…
[Figure: distribution of the shuffled differences, with our original -52.5 marked]
46.5% of the permutations yielded a difference at least as large in magnitude as our original sample.
Are you surprised by the initial results?
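The shuffling procedure above can be sketched in a few lines. This version assumes Persons 4 and 6 form the two-person group, since that is the only split reproducing the reported means of 72 and 124.5. With only six people there are just 15 possible relabelings, so the sketch enumerates them all instead of drawing 50000 random shuffles.

```python
from itertools import combinations

# Page views (PV) from the slide, keyed by person number.
pv = {1: 39, 2: 209, 3: 31, 4: 98, 5: 9, 6: 151}

def mean(xs):
    return sum(xs) / len(xs)

def diff_in_means(two_group):
    """Mean of the four-person group minus mean of the two-person group."""
    small = [pv[i] for i in two_group]
    large = [v for i, v in pv.items() if i not in two_group]
    return mean(large) - mean(small)

# Assumption: Persons 4 and 6 are the two-person group in the original data.
observed = diff_in_means((4, 6))

# Every way of choosing the two-person group: C(6, 2) = 15 relabelings.
all_diffs = [diff_in_means(pair) for pair in combinations(pv, 2)]

# Share of relabelings at least as extreme (in magnitude) as the original.
p_value = sum(abs(d) >= abs(observed) for d in all_diffs) / len(all_diffs)

print(observed)   # -52.5
print(p_value)    # 7/15, about 0.467
```

The exact answer, 7 out of 15, agrees with the roughly 46.5% obtained from random shuffling; the small gap is simulation noise.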
“Tests”
Congratulations!
- You just learned the permutation test!
- The 46.5% is the p-value under the permutation test.

Problems:
- Permuting the labels can be computationally costly.
- Not possible before computers!
- Statistical theory says there are many tests out there.
Standard t-test:
1) Calculate delta = mean_treatment – mean_control
2) Assume delta follows a Normal distribution centered at 0, then calculate the p-value.
   [Figure: Normal curve centered at 0; p-value = sum of the red tail areas beyond -|delta| and +|delta|]
3) If p-value < 0.05, then we reject the assumption that there is no difference between treatment and control.
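A rough sketch of the three steps, using only the Python standard library. It follows the slide in using a Normal distribution (a textbook t-test would use Student's t), and it scales delta by an unequal-variance standard error, which is an assumption of this sketch. The function name `normal_approx_test` is hypothetical. Six observations are far too few for the approximation to be trustworthy; the data are reused only to show the mechanics.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def normal_approx_test(treatment, control, alpha=0.05):
    """Two-sided test of 'no difference in means' using a Normal approximation."""
    # Step 1: delta = mean_treatment - mean_control.
    delta = mean(treatment) - mean(control)
    # Standard error of delta, allowing unequal variances in the two groups.
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    z = delta / se
    # Step 2: two-sided p-value, the area in both tails beyond |z|.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 3: reject 'no difference' if the p-value is below the threshold.
    return delta, p_value, p_value < alpha

# Assumption: the four-person group is the treatment group.
delta, p_value, reject = normal_approx_test([39, 209, 31, 9], [98, 151])
print(delta, round(p_value, 2), reject)
```

On this tiny sample the approximation and the permutation test disagree on the exact p-value, but both are far above 0.05, so neither rejects.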

Big data meets classic Stats
Wait, our metrics may not be Normal!
We care about the “mean of the metric” and not the actual metric distribution.

Central Limit Theorem:
The “mean of the metric” will be approximately Normal if the sample size is LARGE!
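A small simulation can make the Central Limit Theorem claim concrete. This sketch assumes a heavily skewed metric, Exponential(1), chosen only for illustration, and checks that means of 100 draws behave the way a Normal distribution would predict.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is reproducible

def sample_mean(n):
    """Mean of n draws from a skewed Exponential(1) metric."""
    return mean(rng.expovariate(1.0) for _ in range(n))

n, repeats = 100, 2000
means = [sample_mean(n) for _ in range(repeats)]

# CLT prediction: the means center on 1 with spread about 1 / sqrt(n) = 0.1.
center, spread = mean(means), stdev(means)
# For a Normal shape, about 95% of the means fall within 1.96 spreads of the center.
within = sum(abs(m - center) <= 1.96 * spread for m in means) / repeats
print(center, spread, within)
```

The individual draws are nowhere near Normal, yet the share of means inside the 1.96-spread band should come out close to 95%.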
Assumptions with t-test
- Normality of delta
  - Guaranteed (approximately) with large sample sizes
- Independent samples
- Not too many 0’s
That’s IT!!!
- Easy to automate.
- Simple and general.
What are “Tests”?
• Statistical tests are just procedures that depend on data to make a decision.
• Engineerify: statistical tests are functions that take in data and treatment labels, and return a boolean.
Guarantees:
• By comparing the p-value to a 5% threshold, we control
  P( Test says difference exists | In reality NO difference ) <= 5%
• By setting the power of the test to be 80%, we control
  P( Test says difference exists | In reality difference exists ) >= 80%
  • Increasing this often requires more data
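The "function that returns a boolean" view, and the two guarantees, can be checked by simulation. In this sketch `ab_test` and `rejection_rate` are hypothetical names, the test is a Normal approximation, and the data are simulated Normal noise with an assumed effect size; none of these choices come from the slides.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def ab_test(values, is_treatment, alpha=0.05):
    """Return True if the test says a difference exists (Normal approximation)."""
    treat = [v for v, t in zip(values, is_treatment) if t]
    ctrl = [v for v, t in zip(values, is_treatment) if not t]
    se = sqrt(variance(treat) / len(treat) + variance(ctrl) / len(ctrl))
    z = (mean(treat) - mean(ctrl)) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def rejection_rate(effect, n_per_group=200, repeats=1000, seed=0):
    """Share of simulated experiments in which ab_test returns True."""
    rng = random.Random(seed)
    labels = [True] * n_per_group + [False] * n_per_group
    hits = 0
    for _ in range(repeats):
        # Treatment units get `effect` added on top of unit-variance noise.
        values = [rng.gauss(effect if t else 0.0, 1.0) for t in labels]
        hits += ab_test(values, labels)
    return hits / repeats

false_positive_rate = rejection_rate(effect=0.0)   # no real difference: near 5%
power = rejection_rate(effect=0.3)                 # real difference: much higher
print(false_positive_rate, power)
```

Rerunning with a smaller `n_per_group` leaves the false positive rate near 5% but lowers the power, which is the sense in which more power requires more data.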

Meaning:
All treatments are, in reality, either useless (no difference) or impactful (difference exists).
Guarantees through conventional thresholds:

Reality                  Test says "No difference"   Test says "Difference exists"   Jargon
Useless treatments       >=95%                       <=5%                            Significance level
Impactful treatments     <=20%                       >=80%                           Power
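To see how the significance level and the power together drive the amount of data needed, here is a sketch of the usual Normal-approximation sample-size formula for comparing two means, n per group = 2 * ((z_alpha + z_power) * sigma / delta)^2. The function name and the inputs `delta` (smallest difference worth detecting) and `sigma` (standard deviation of the metric) are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample Normal test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 5%
    z_power = NormalDist().inv_cdf(power)           # about 0.84 for power = 80%
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)

print(samples_per_group(delta=1.0, sigma=1.0))   # 16
print(samples_per_group(delta=0.1, sigma=1.0))   # 1570
```

Detecting a difference ten times smaller takes roughly a hundred times more data, which is why small effects need big samples.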
Meaning:
- Most appropriate over repeated decision making
  - E.g. spammer or not
- Not seeing a difference could mean
  - There is no difference
  - Not enough power
- Seeing a difference could mean
  - There is a difference
  - Got unlucky/lucky
- Your specific test is either impactful or not (100% or 0%).
Not what most people want to hear…
Complaints with Hypothesis Testing
• People get really stuck on p-values and tests.
  • Confusing, boring, and formulaic.
• Statistical significance != Scientific significance
  • You could detect a .000001 difference, so what?
• Multiple hypothesis testing
  • 5% false positive is 1 out of 20. Quite high!
  • http://xkcd.com/882/
  • Most published results are still false (Ioannidis 2005)
• What is it answering?
  • Nothing specific about your test… probabilities are over repeated trials.
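The multiple-testing complaint is easy to quantify. This sketch assumes the tests are independent and that none of the treatments has a real effect; the function name is hypothetical.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests with no real effects."""
    return 1 - (1 - alpha) ** n_tests

# With 20 useless treatments we expect 20 * 0.05 = 1 false positive on average,
# and the chance of seeing at least one is well above one half.
print(round(prob_any_false_positive(1), 3))    # 0.05
print(round(prob_any_false_positive(20), 3))   # 0.642
```

So running 20 tests at the 5% level and reporting whichever one "wins" is more likely than not to report noise.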
Abuse: Prosecutor Fallacy
Both children of a British mother died within a short period of time. The mother was convicted of murder because the p-value was low: if she were innocent, the chance of both children dying would be low.

p-value = P( two deaths | innocent )
In fact, we should be looking at P( innocent | two deaths ).

This is the prosecutor’s fallacy.
Example: base line matters!
[Figure: all mothers split into guilty mothers and innocent mothers, each group containing a “two deaths” subset]
The p-value can be small, but the base line can be huge: innocent mothers vastly outnumber guilty ones, so even a small P( two deaths | innocent ) can account for most two-death cases.
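Bayes' rule shows how the base line enters. The numbers below are hypothetical, picked only to illustrate the effect, and are not figures from any real case; the function name is also an assumption.

```python
def p_innocent_given_two_deaths(p_guilty, p_deaths_if_innocent, p_deaths_if_guilty):
    """Bayes' rule: P(innocent | two deaths)."""
    p_innocent = 1 - p_guilty
    innocent_and_deaths = p_innocent * p_deaths_if_innocent
    guilty_and_deaths = p_guilty * p_deaths_if_guilty
    return innocent_and_deaths / (innocent_and_deaths + guilty_and_deaths)

# Hypothetical inputs for illustration only.
posterior = p_innocent_given_two_deaths(
    p_guilty=1e-7,               # assumed base rate of guilty mothers
    p_deaths_if_innocent=1e-6,   # the small "p-value"
    p_deaths_if_guilty=1.0,      # assumed: guilt always produces two deaths
)
print(round(posterior, 3))   # 0.909
```

Even with a one-in-a-million p-value, innocence is the far more likely explanation under these assumed numbers, because guilty mothers are rarer still.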
Any Alternatives?
P( innocent | two deaths ) is what we want… but does it make sense?
Bayesian methodology:
P( difference exists | data )
This requires knowing P( difference exists ), i.e. the prior
- Philosophical debate: “What is a probability?”
- Easy to cheat the numbers
- How to deal with multiple hypothesis testing?
- What are we doing in the company?
- Rumor has it that “Multi-armed bandit > A/B testing”?

Questions?

Crash Course in A/B testing

  • 1. Crash Course in A/B testing A statistical perspective Wayne Tai Lee
  • 2. Roadmap • What is A/B testing? • Good experiments and the role of statistics • Similar to proof by contradiction • “Tests” • Big data meets classic asymptotics • Complaints with classical hypothesis testing • Alternatives?
  • 3. What is A/B Testing • An industry term for a controlled, randomized experiment between treatment and control groups. • An age-old problem, especially with humans
  • 4. What most people know: Gather samples → Assign treatments → Apply treatments → Measure outcome → Compare. A ? B
  • 5. What most people know: Only difference is in the treatment! A ? B
  • 6. Reality: Variability from samples/inputs, variability from treatment/function, and variability from measurement. A ?????? B How do we account for all that?
  • 7. Confounding: • If there are sources of variability in addition to the treatment effect, how can we identify and isolate the effect of the treatment?
  • 8. 3 Types of Variability: • Controlled variability • Systematic and desired • i.e. our treatment • Bias • Systematic but not desired • Anything that can confound our study • Noise • Random error but not desired • Won’t confound the study but makes it hard to make a decision.
  • 9. How do we categorize each? Variability from samples/inputs, variability from treatment/function, variability from measurement. A ?????? B
  • 12. Reality: Randomize assignment! Convert bias to noise. A ?????? B Your population can be skewed or biased, but that only restricts the generalizability of the results.
  • 13. Reality: Think about what you want to measure and how! Minimize the noise level/variability in the metric. A ? B
  • 14. A good experiment in general: - Use good design and implementation to avoid bias. - For unavoidable biases, use randomization to turn them into noise. - Plan carefully to minimize noise in the data.
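The randomization step can be sketched in a few lines. This is only an illustration: the function name `assign_groups`, the integer unit IDs, and the 50/50 split are assumptions, not something the slides specify.

```python
import random


def assign_groups(unit_ids, seed=None):
    """Randomly split units into two near-equal groups, A and B.

    Hypothetical helper: because the split ignores every property of the
    units, any systematic difference between units is spread across both
    groups as noise instead of lining up with the treatment.
    """
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"A": shuffled[:half], "B": shuffled[half:]}


groups = assign_groups(range(1, 11), seed=42)
print(groups)
```

Fixing the seed is a design choice for auditability; in a live system the assignment would more likely be derived from a hash of the user ID, which is outside the scope of this sketch.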
  • 15. How do we deal with noise? - Bread and butter of statisticians! - Quantify the magnitude of the treatment effect - Quantify the magnitude of the noise - Just compare the two… most of the time
  • 18. Formalizing the Comparison: similar to proof by contradiction - Assume the difference is due to chance (noise) - See how strongly the data contradict that assumption - If the surprise surpasses a threshold, reject the assumption - Nothing is ever “100%” certain
  • 20. Difference due to chance? Red -> treatment; Black -> control. PV by ID: Person 1 = 39, Person 2 = 209, Person 3 = 31, Person 4 = 98, Person 5 = 9, Person 6 = 151. Let’s measure the difference in means! Mean of Persons 1, 2, 3 and 5 = 72; mean of Persons 4 and 6 = 124.5. Diff = 72 – 124.5 = -52.5. So what?
  • 21. Difference due to chance? Red -> treatment; Black -> control. The same table (Person 1 = 39, Person 2 = 209, Person 3 = 31, Person 4 = 98, Person 5 = 9, Person 6 = 151) is shown twice, once with the original labels and once with the labels shuffled. If there were no difference from the treatment, shuffling the treatment labels emulates the randomization of the samples.
  • 22. Difference due to chance? One shuffle of the labels: Diff = 122.25 – 24 = 98.25
  • 23. Difference due to chance? Another shuffle of the labels: Diff = 107.5 – 53.5 = 54
  • 24. Difference due to chance? 50000 repeats later… [Histogram of the shuffled differences, with our original -52.5 marked]
  • 25. Difference due to chance? 46.5% of the permutations yielded a difference at least as large in magnitude as our original -52.5. Are you surprised by the initial result?
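The shuffling procedure can be reproduced on the six values from the slides. The sketch below rests on one assumption: the color coding did not survive in the transcript, so the four-person group (Persons 1, 2, 3, 5) is taken to be the treatment group, which is the only split consistent with the reported means 72 and 124.5. Instead of 50000 random shuffles it enumerates all 15 distinct ways to pick four of six people, which gives the exact proportion 7/15, about 46.7%, in line with the 46.5% quoted above.

```python
from itertools import combinations

# PV values for Persons 1..6, taken from the slides.
pv = [39, 209, 31, 98, 9, 151]

# ASSUMPTION: indices of the treatment group (Persons 1, 2, 3, 5).
treated = (0, 1, 2, 4)


def diff_in_means(treated_idx):
    """Mean of the labelled-treatment group minus mean of the rest."""
    t = [pv[i] for i in treated_idx]
    c = [pv[i] for i in range(len(pv)) if i not in treated_idx]
    return sum(t) / len(t) - sum(c) / len(c)


observed = diff_in_means(treated)

# Every possible relabelling that keeps the group sizes (4 vs 2).
all_diffs = [
    diff_in_means(idx)
    for idx in combinations(range(len(pv)), len(treated))
]

# Two-sided p-value: share of relabellings at least as extreme as observed.
p_value = sum(abs(d) >= abs(observed) for d in all_diffs) / len(all_diffs)

print(observed, len(all_diffs), round(p_value, 4))
```

Exhaustive enumeration only works for tiny samples; with realistic sample sizes the number of relabellings explodes, which is why random shuffles are used in practice.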
  • 27. “Tests” Congratulations! - You just learned the permutation test! - The 46.5% is the p-value under the permutation test. Problems: - Permuting the labels can be computationally costly. - Not possible before computers! - Statistical theory says there are many tests out there.
  • 28. “Tests”: Standard t-test: 1) Calculate delta = mean_treatment – mean_control. 2) Assume delta follows a Normal distribution centered at 0, then calculate the p-value. p-value = sum of the red tail areas. 3) If p-value < 0.05, reject the assumption that there is no difference between treatment and control.
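The three steps can be sketched with the Python standard library. Note the swap: this uses a large-sample Normal (z) approximation with an unpooled standard error rather than Student's t distribution, matching the slide's "assume delta is Normal" wording. The function name `two_sample_z_test` and the toy data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z_test(treatment, control, alpha=0.05):
    """Return (delta, two-sided p-value, reject?) under a Normal approximation."""
    # Step 1: delta = mean_treatment - mean_control
    delta = mean(treatment) - mean(control)
    # Noise level of delta: unpooled standard error of the difference in means.
    se = sqrt(variance(treatment) / len(treatment)
              + variance(control) / len(control))
    # Step 2: treat delta / se as standard Normal if there is no difference;
    # the p-value is the area in both tails beyond |z|.
    z = delta / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 3: compare with the threshold.
    return delta, p_value, p_value < alpha


# Toy data, invented for illustration only.
delta, p_value, reject = two_sample_z_test([10, 11, 12, 13, 14], [1, 2, 3, 4, 5])
print(delta, p_value, reject)
```

With samples as small as the toy data, or the six-person example, the Normal approximation is not trustworthy; the sketch is meant for the large-sample setting the deck is heading toward.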
  • 31. Big Data meets Classic Stat: Wait, our metrics may not be Normal! We care about the “mean of the metric”, not the distribution of the metric itself. Central Limit Theorem: the “mean of the metric” will be approximately Normal if the sample size is LARGE!
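A quick simulation can make the Central Limit Theorem claim concrete. The setup is an assumption chosen for illustration: the metric is drawn from an exponential distribution (strongly right-skewed), each sample has 200 observations, and skewness serves as a rough measure of non-Normality (it is 0 for a Normal distribution).

```python
import random


def skewness(xs):
    """Population skewness: third central moment over variance**1.5."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)

# The raw metric: heavily skewed, clearly not Normal.
raw = [rng.expovariate(1.0) for _ in range(20_000)]

# The "mean of the metric": 2,000 sample means, each over 200 observations.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(200)) / 200
    for _ in range(2_000)
]

skew_raw = skewness(raw)
skew_means = skewness(sample_means)
print(f"skewness of raw metric:   {skew_raw:.2f}")
print(f"skewness of sample means: {skew_means:.2f}")
```

The raw metric should show skewness near 2, while the sample means should sit close to 0. How large "LARGE" must be depends on how heavy-tailed the metric is.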
  • 32. Big Data meets Classic Stat: Assumptions with t-test - Normality of %delta - Guaranteed with large sample sizes - Independent Samples - Not too many 0’s That’s IT!!! - Easy to automate. - Simple and general.
  • 36. What are “Tests”? • Statistical tests are just procedures that depend on data to make a decision. • Engineerify: Statistical tests are functions that take in data and treatments, and return a boolean. Guarantees: • By comparing the p-value to a 5% threshold, we control P( Test says difference exists | In reality NO difference ) <= 5% • By designing the test to have 80% power, we control P( Test says difference exists | In reality difference exists ) >= 80% • Increasing this often requires more data
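The "function that returns a boolean" view, and the two guarantees, can be checked by simulation. Everything below is an illustrative assumption: the test is a large-sample Normal approximation, the data are Normal with unit variance, each group has 100 units, and the true effect of 0.45 is arbitrary. The names `says_difference_exists` and `rejection_rate` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def says_difference_exists(treatment, control, alpha=0.05):
    """A statistical test as a function: data in, boolean out."""
    n_t, n_c = len(treatment), len(control)
    mean_t, mean_c = sum(treatment) / n_t, sum(control) / n_c
    var_t = sum((x - mean_t) ** 2 for x in treatment) / (n_t - 1)
    var_c = sum((x - mean_c) ** 2 for x in control) / (n_c - 1)
    z = (mean_t - mean_c) / sqrt(var_t / n_t + var_c / n_c)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def rejection_rate(true_effect, n=100, repeats=2000, seed=0):
    """Share of simulated experiments in which the test says 'difference'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treatment = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        hits += says_difference_exists(treatment, control)
    return hits / repeats


# No real difference: the rejection rate estimates the false positive rate.
false_positive_rate = rejection_rate(true_effect=0.0)
# A real difference: the rejection rate estimates the power.
power = rejection_rate(true_effect=0.45)
print(false_positive_rate, power)
```

The first number should land near 5%. The second depends on the effect size and sample size chosen here, and shrinking either one lowers it, which is the sense in which more power requires more data.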
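  • 40. Meaning: All treatments split, in reality, into useless treatments (no difference) and impactful treatments (a difference exists). Guarantees through conventional thresholds: - Useless treatment, test says “No difference”: >= 95% - Useless treatment, test says “Difference exists”: <= 5% (jargon: significance level) - Impactful treatment, test says “No difference”: <= 20% - Impactful treatment, test says “Difference exists”: >= 80% (jargon: power)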
  • 44. Meaning: - Most appropriate over repeated decision making - E.g. spammer or not - Not seeing a difference could mean: there is no difference, or there is not enough power - Seeing a difference could mean: there is a difference, or we got unlucky/lucky - Your specific treatment is either impactful or not (100% or 0%). Not what most people want to hear…
  • 48. Complaints with Hypothesis Testing • People get really stuck on p-values and tests. • Confusing, boring, and formulaic. • Statistical significance != Scientific significance • You could detect a .000001 difference, so what? • Multiple hypothesis testing • A 5% false positive rate is 1 out of 20. Quite high! • http://xkcd.com/882/ • Most published results are still false (Ioannidis 2005) • What is it answering? • Nothing specific about your test; the probabilities are over repeated trials.
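The "1 out of 20" complaint compounds quickly when many tests are run. The arithmetic below assumes 20 independent tests in which no treatment has any real effect, each run at exactly the 5% threshold. The Bonferroni correction shown is one common remedy; it is not mentioned in the slides.

```python
alpha = 0.05      # per-test false positive rate
num_tests = 20    # assumed number of independent tests with no real effect

# Chance that at least one of the tests reports a (false) difference.
p_any_false_positive = 1 - (1 - alpha) ** num_tests

# Bonferroni correction (an added illustration, not from the slides):
# compare each p-value to alpha / num_tests instead of alpha.
bonferroni_alpha = alpha / num_tests
p_any_with_bonferroni = 1 - (1 - bonferroni_alpha) ** num_tests

print(f"uncorrected: {p_any_false_positive:.3f}")
print(f"Bonferroni:  {p_any_with_bonferroni:.3f}")
```

Uncorrected, the chance of at least one false alarm is roughly 64%. The correction brings it back under 5%, at the cost of lower power for each individual test.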
  • 50. Abuse: Prosecutor’s Fallacy Both children of a British mother died within a short period of time. The mother was convicted of murder because the p-value was low: if she were innocent, the chance of both children dying would be low. p-value = P( two deaths | innocent ) In fact, we should be looking at P( innocent | two deaths ). Confusing the two is the prosecutor’s fallacy.
  • 52. Example: baseline matters! All mothers split into guilty mothers and innocent mothers, and a fraction of each group experiences two deaths. The p-value can be small, but the baseline (the number of innocent mothers) can be huge.
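The effect of the baseline can be worked through with Bayes' rule. All three input numbers below are invented purely for illustration and have nothing to do with the actual case: 1 in 100,000 mothers is assumed guilty, an innocent mother is assumed to lose two children with probability 1 in 100,000 (the "small p-value"), and a guilty mother with probability 1.

```python
def p_innocent_given_deaths(p_guilty, p_deaths_if_innocent, p_deaths_if_guilty):
    """Bayes' rule: P(innocent | two deaths)."""
    p_innocent = 1 - p_guilty
    innocent_with_deaths = p_innocent * p_deaths_if_innocent
    guilty_with_deaths = p_guilty * p_deaths_if_guilty
    return innocent_with_deaths / (innocent_with_deaths + guilty_with_deaths)


# HYPOTHETICAL inputs, chosen only to show the effect of the baseline.
posterior = p_innocent_given_deaths(
    p_guilty=1e-5,
    p_deaths_if_innocent=1e-5,   # P(two deaths | innocent), the tiny p-value
    p_deaths_if_guilty=1.0,
)
print(f"P(innocent | two deaths) = {posterior:.4f}")
```

Under these made-up numbers the p-value is 0.00001, yet the probability of innocence given two deaths is still about one half, because innocent mothers vastly outnumber guilty ones.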
  • 53. Any Alternatives? P( innocent | two deaths ) is what we want… but does it make sense? Bayesian methodology: P( difference exists | data ) This requires knowing P( difference exists ), i.e. the prior - Philosophical debate, “What is a probability?” - Easy to cheat the numbers
  • 54. Questions? - How to deal with multiple hypothesis testing? - What are we doing in the company? - Rumor has it that “Multi-armed bandit > A/B testing”?

Editor's Notes

  1. Wine testing pairing vs just two groups