SlideShare a Scribd company logo
A/B Testing: Avoiding
Common Pitfalls
Danielle Jabin

March 5, 2014
1

Make all the world’s music
available instantly to everyone,
wherever and whenever they
want it
2

As of March 5, 2014
3

Over 24 million active users

As of March 5, 2014
4

Access to more than 20 million
songs

As of March 5, 2014
5
6

But can we make it even
easier?
7

We can try…
…with A/B testing!
8

So…what’s an A/B test?
9

Control

A
Pitfall #1: Not limiting
your error rate
11

Source: assets.20bits.com/20081027/normal-curve-small.png
12

What if I flip a coin 100 times and get 51 heads?
13

What if I flip a coin 100 times and get 5 heads?
14
15

The likelihood of obtaining a
certain value under a given
distribution is measured by its
p-value
16

If there is a low likelihood that a
change is due to chance alone,
we call our results statistically
significant
17

What if I flip a coin 100 times and get 5 heads?
18

Statistical significance is measured by alpha
● alpha levels of 5% and 1% are most commonly used
– Alternatively: P(significant) = .05 or .01
19

Each alpha has a corresponding Z-score
alpha

Z-score (two-sided test)

.10

1.65

.05

1.96

.01

2.58
20

The Z-score tells us how far a
particular value is from the
mean (and what the corresponding
likelihood is)
21

Source: assets.20bits.com/20081027/normal-curve-small.png
22

Compute the Z-score at the end of the test
23

Standard deviation (σ) tells us
how spread out the numbers
are
24
25

To lock in error rates before you
start, fix your sample size
26

What should my sample size be?
● To lock in error rates before you start a test, fix your sample size
Represents the
desired power
(typically .84 for 80%
power).

Sample size in each
group (assumes equal
sized groups)

2s (Z b + Za /2 )
n=
2
difference Represents the desired
2

Standard deviation of
the outcome variable

Source: www.stanford.edu/~kcobb/hrp259/lecture11.ppt

2

Effect Size (the
difference in
means)

level of statistical
significance (typically
1.96).
27

Recap: running an A/B test
● Compute your sample size
– Using alpha, beta, standard deviation of your metric, and effect size
● Run your test! But stop once you’ve reached the fixed sample size stopping point
● Compute your z-score and compare it with the z-score for the chosen alpha level
28

Control

A
29

Resulting Z-score?
30

33.3
Pitfall #2: Stopping your
test before the fixed
sample size stopping
point
32

Sample size for varying alpha levels
● With σ = 10, difference in means = 1

Two-sided test
alpha = .10, beta = .80

1230

alpha = .05, beta = .80

1568

alpha = .01, beta = .80

2339
33

Let’s see some numbers
● 1,000 experiments with 200,000 fake participants divided randomly into two groups both
receiving the exact same version, A, with a 3% conversion rate

90% significance
reached
95% significance
reached
99% significance
reached
Source: destack.home.xs4all.nl/projects/significance/

Stop at first point of
significance
654 of 1,000

Ended as significant

427 of 1,000

49 of 1,000

146 of 1,000

14 of 1,000

100 of 1,000
34

Remedies
● Don’t peek
● Okay, maybe you can peek, but don’t stop or make a decision before you reach the fixed
sample size stopping point
● Sequential sampling
35

Control

A

B
Pitfall #3: Making
multiple comparisons in
one test
37

A test can be one of two things: significant or not significant
● P(significant) + P(not significant) = 1
● Let’s take an alpha of .05
– P(significant) = .05
– P(not significant) = 1 – P(significant) = 1 - .05 = .95
38

What about for two comparisons?
● P(at least 1 significant) = 1 - P(none of the 2 are significant)
● P(none of the 2 are significant) = P(not significant)*P(not significant) = .95*.95 = .9025
● P(at least 1 significant) = 1 - .9025 = .0975
39

What about for two comparisons?

●That’s almost 2x (1.95x, to be precise) your .05
significance rate!
40

And it just gets worse…
P(at least 1 signifcant)

An increase of…

5 variations

1 – (1-.05)^5 = .23

4.6x

10 variations

1 – (1-.05)^10 = .40

8x

20 variations

1 – (1-.05)^20 = .64

12.8x
41

How can we remedy this?
●Bonferroni correction
– Divide P(significant), your alpha, by the number of variations you are testing, n
– alpha/n becomes the new level of statistical significance
42

So what about two comparisons now?
● Our new P(significant) = .05/2 = .025
● Our new P(not significant) = 1 - .025 = .975
● P(at least 1 significant) = 1 - P(none of the 2 are significant)
● P(none of the 2 are significant) = P(not significant)*P(not significant) = .975*.975 = .951
● P(at least 1 significant) = 1 - .951 = .0499
43

P(significant) stays under .05 
Corrected alpha

P(at least 1 signifcant)

5 variations

.05/5 = .01

1 – (1-.01)^5 = .049

10 variations

.05/10 = .005

1 – (1-.005)^10 = .049

20 variations

.05/20 = .0025

1 – (1-.0025)^20 = .049
Questions?
Appendix
46

A/B test steps:
1. Decide what to test
2. Determine a metric to test
3. Formulate your hypothesis
1. Select an effect size threshold: what change of the metric would make a rollout
worthwhile?
4. Calculate sample size (your stopping point)
1. Decide your Type I (alpha) and Type 2 (beta) error levels and the corresponding zscores
2. Determine the standard deviation of your metric
5. Run your test! But stop once you’ve reached the fixed sample size stopping point
6. Compute your z-score and compare it with the z-score for your chosen alpha level
47

Type I and Type II error
● Type I error: incorrectly reject a true null hypothesis
– alpha
● Type II error: incorrectly accept a false null hypothesis
– beta
– Power: 1 - beta
48

Z-score reference table
alpha

One-sided test

Two-sided test

.10

1.28

1.65

.05

1.65

1.96

.01

2.33

2.58
49

Z-score for proportions (e.g. conversion)

More Related Content

What's hot

What's hot (20)

Customer Centricity and Product Led Growth by Airbnb Product & Growth
Customer Centricity and Product Led Growth by Airbnb Product & Growth Customer Centricity and Product Led Growth by Airbnb Product & Growth
Customer Centricity and Product Led Growth by Airbnb Product & Growth
 
A/B Testing at Pinterest: Building a Culture of Experimentation
A/B Testing at Pinterest: Building a Culture of Experimentation A/B Testing at Pinterest: Building a Culture of Experimentation
A/B Testing at Pinterest: Building a Culture of Experimentation
 
A/B Testing Framework Design
A/B Testing Framework DesignA/B Testing Framework Design
A/B Testing Framework Design
 
Developing a Product Vision by Amazon Sr Product Manager
Developing a Product Vision by Amazon Sr Product ManagerDeveloping a Product Vision by Amazon Sr Product Manager
Developing a Product Vision by Amazon Sr Product Manager
 
10 Guidelines for A/B Testing
10 Guidelines for A/B Testing10 Guidelines for A/B Testing
10 Guidelines for A/B Testing
 
How spotify makes product
How spotify makes productHow spotify makes product
How spotify makes product
 
Test & Learn: How to Find Your Product's North Star Metric
Test & Learn: How to Find Your Product's North Star Metric Test & Learn: How to Find Your Product's North Star Metric
Test & Learn: How to Find Your Product's North Star Metric
 
Measuring What Matters in Your Product by Amazon Product Leader.pdf
Measuring What Matters in Your Product by Amazon Product Leader.pdfMeasuring What Matters in Your Product by Amazon Product Leader.pdf
Measuring What Matters in Your Product by Amazon Product Leader.pdf
 
Basics of AB testing in online products
Basics of AB testing in online productsBasics of AB testing in online products
Basics of AB testing in online products
 
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.jsNetflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
 
Startup Metrics for Pirates
Startup Metrics for PiratesStartup Metrics for Pirates
Startup Metrics for Pirates
 
The MVP checklist
The MVP checklistThe MVP checklist
The MVP checklist
 
SAMPLE SIZE – The indispensable A/B test calculation that you’re not making
SAMPLE SIZE – The indispensable A/B test calculation that you’re not makingSAMPLE SIZE – The indispensable A/B test calculation that you’re not making
SAMPLE SIZE – The indispensable A/B test calculation that you’re not making
 
Data-Driven UI/UX Design with A/B Testing
Data-Driven UI/UX Design with A/B TestingData-Driven UI/UX Design with A/B Testing
Data-Driven UI/UX Design with A/B Testing
 
4 Steps Toward Scientific A/B Testing
4 Steps Toward Scientific A/B Testing4 Steps Toward Scientific A/B Testing
4 Steps Toward Scientific A/B Testing
 
From Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover WeeklyFrom Idea to Execution: Spotify's Discover Weekly
From Idea to Execution: Spotify's Discover Weekly
 
How to Product Manage a Marketplace Business by Uber PM
How to Product Manage a Marketplace Business by Uber PMHow to Product Manage a Marketplace Business by Uber PM
How to Product Manage a Marketplace Business by Uber PM
 
Controlled Experimentation aka A/B Testing for PMs by Tinder Sr PM
Controlled Experimentation aka A/B Testing for PMs by Tinder Sr PMControlled Experimentation aka A/B Testing for PMs by Tinder Sr PM
Controlled Experimentation aka A/B Testing for PMs by Tinder Sr PM
 
Effect Mapping: A Better Way to Get Really Usable Results Out of IT Projects
Effect Mapping: A Better Way to Get Really Usable Results Out of IT ProjectsEffect Mapping: A Better Way to Get Really Usable Results Out of IT Projects
Effect Mapping: A Better Way to Get Really Usable Results Out of IT Projects
 
Product-Led Growth by Amazon Senior Product Manager
Product-Led Growth by Amazon Senior Product ManagerProduct-Led Growth by Amazon Senior Product Manager
Product-Led Growth by Amazon Senior Product Manager
 

Viewers also liked

The math-behind-ab-testing
The math-behind-ab-testingThe math-behind-ab-testing
The math-behind-ab-testing
Amit Sawhney
 
Big Data At Spotify
Big Data At SpotifyBig Data At Spotify
Big Data At Spotify
Adam Kawa
 
A/B Testing: You Might be Driving in the Wrong Direction
A/B Testing: You Might be Driving in the Wrong DirectionA/B Testing: You Might be Driving in the Wrong Direction
A/B Testing: You Might be Driving in the Wrong Direction
Kissmetrics on SlideShare
 

Viewers also liked (20)

Making Better Mistakes Tomorrow
Making Better Mistakes TomorrowMaking Better Mistakes Tomorrow
Making Better Mistakes Tomorrow
 
Measuring team performance at spotify slideshare
Measuring team performance at spotify slideshareMeasuring team performance at spotify slideshare
Measuring team performance at spotify slideshare
 
Data at Spotify
Data at SpotifyData at Spotify
Data at Spotify
 
Africa DevOps Day 2015
Africa DevOps Day 2015Africa DevOps Day 2015
Africa DevOps Day 2015
 
The math-behind-ab-testing
The math-behind-ab-testingThe math-behind-ab-testing
The math-behind-ab-testing
 
Managing Experiment at Spotify
Managing Experiment at SpotifyManaging Experiment at Spotify
Managing Experiment at Spotify
 
Big Data At Spotify
Big Data At SpotifyBig Data At Spotify
Big Data At Spotify
 
Algorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at SpotifyAlgorithmic Music Recommendations at Spotify
Algorithmic Music Recommendations at Spotify
 
Explore and Evaluate - A different intro to research and data
Explore and Evaluate - A different intro to research and dataExplore and Evaluate - A different intro to research and data
Explore and Evaluate - A different intro to research and data
 
Being a Data Driven Business
Being a Data Driven Business Being a Data Driven Business
Being a Data Driven Business
 
Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved Growing up with agile - how the Spotify 'model' has evolved
Growing up with agile - how the Spotify 'model' has evolved
 
An Intro to Learning Organization
An Intro to Learning OrganizationAn Intro to Learning Organization
An Intro to Learning Organization
 
How data drives spotify
How data drives spotifyHow data drives spotify
How data drives spotify
 
A/B Testing for Lean Startups
A/B Testing for Lean StartupsA/B Testing for Lean Startups
A/B Testing for Lean Startups
 
Fluent at agile - agile sverige 2014
Fluent at agile - agile sverige 2014Fluent at agile - agile sverige 2014
Fluent at agile - agile sverige 2014
 
A/B Testing: You Might be Driving in the Wrong Direction
A/B Testing: You Might be Driving in the Wrong DirectionA/B Testing: You Might be Driving in the Wrong Direction
A/B Testing: You Might be Driving in the Wrong Direction
 
온라인 서비스 개선을 데이터 활용법 - 김진영 (How We Use Data)
온라인 서비스 개선을 데이터 활용법  - 김진영 (How We Use Data)온라인 서비스 개선을 데이터 활용법  - 김진영 (How We Use Data)
온라인 서비스 개선을 데이터 활용법 - 김진영 (How We Use Data)
 
Product Owner presentation for Spotify
Product Owner presentation for SpotifyProduct Owner presentation for Spotify
Product Owner presentation for Spotify
 
eMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype CycleeMetrics London - The AB Testing Hype Cycle
eMetrics London - The AB Testing Hype Cycle
 
Testing a 2D Platformer with Spock
Testing a 2D Platformer with SpockTesting a 2D Platformer with Spock
Testing a 2D Platformer with Spock
 

Similar to A/B Testing Pitfalls and Lessons Learned at Spotify

Uji liliefors
Uji lilieforsUji liliefors
Uji liliefors
jojun
 
WEEK 6 – HOMEWORK 6 LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
WEEK 6 – HOMEWORK 6  LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docxWEEK 6 – HOMEWORK 6  LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
WEEK 6 – HOMEWORK 6 LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
cockekeshia
 
Tips & tricks for Quantitative Aptitude
Tips & tricks for Quantitative AptitudeTips & tricks for Quantitative Aptitude
Tips & tricks for Quantitative Aptitude
Amber Bhaumik
 
CrashCourse_0622
CrashCourse_0622CrashCourse_0622
CrashCourse_0622
Dexen Xi
 

Similar to A/B Testing Pitfalls and Lessons Learned at Spotify (20)

Quantitative Aptitude Test (QAT)- Tips & Tricks
Quantitative Aptitude Test (QAT)- Tips & TricksQuantitative Aptitude Test (QAT)- Tips & Tricks
Quantitative Aptitude Test (QAT)- Tips & Tricks
 
Quantitative Aptitude Test (QAT)-Tips & Tricks
Quantitative Aptitude Test (QAT)-Tips & TricksQuantitative Aptitude Test (QAT)-Tips & Tricks
Quantitative Aptitude Test (QAT)-Tips & Tricks
 
Uji liliefors
Uji lilieforsUji liliefors
Uji liliefors
 
WEEK 6 – HOMEWORK 6 LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
WEEK 6 – HOMEWORK 6  LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docxWEEK 6 – HOMEWORK 6  LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
WEEK 6 – HOMEWORK 6 LANE CHAPTERS, 11, 12, AND 13; ILLOWSKY CHAP.docx
 
Quant tips edited
Quant tips editedQuant tips edited
Quant tips edited
 
7. the t distribution
7. the t distribution7. the t distribution
7. the t distribution
 
Design of Experiments
Design of Experiments Design of Experiments
Design of Experiments
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
MATHS
MATHSMATHS
MATHS
 
Tips & tricks for Quantitative Aptitude
Tips & tricks for Quantitative AptitudeTips & tricks for Quantitative Aptitude
Tips & tricks for Quantitative Aptitude
 
CrashCourse_0622
CrashCourse_0622CrashCourse_0622
CrashCourse_0622
 
Statistics for CRO - Conversion Conference London
Statistics for CRO - Conversion Conference LondonStatistics for CRO - Conversion Conference London
Statistics for CRO - Conversion Conference London
 
01_SLR_final (1).pptx
01_SLR_final (1).pptx01_SLR_final (1).pptx
01_SLR_final (1).pptx
 
Conversion Conference Berlin
Conversion Conference BerlinConversion Conference Berlin
Conversion Conference Berlin
 
Lecture_Wk08.pdf
Lecture_Wk08.pdfLecture_Wk08.pdf
Lecture_Wk08.pdf
 
Memorization of Various Calculator shortcuts
Memorization of Various Calculator shortcutsMemorization of Various Calculator shortcuts
Memorization of Various Calculator shortcuts
 
Machine learning mathematicals.pdf
Machine learning mathematicals.pdfMachine learning mathematicals.pdf
Machine learning mathematicals.pdf
 
Lecture 4
Lecture 4Lecture 4
Lecture 4
 
Week8 Live Lecture for Final Exam
Week8 Live Lecture for Final ExamWeek8 Live Lecture for Final Exam
Week8 Live Lecture for Final Exam
 
MAT1033.3.1.ppt
MAT1033.3.1.pptMAT1033.3.1.ppt
MAT1033.3.1.ppt
 

Recently uploaded

anas about venice for grade 6f about venice
anas about venice for grade 6f about veniceanas about venice for grade 6f about venice
anas about venice for grade 6f about venice
anasabutalha2013
 
Memorandum Of Association Constitution of Company.ppt
Memorandum Of Association Constitution of Company.pptMemorandum Of Association Constitution of Company.ppt
Memorandum Of Association Constitution of Company.ppt
seri bangash
 
Enterprise Excellence is Inclusive Excellence.pdf
Enterprise Excellence is Inclusive Excellence.pdfEnterprise Excellence is Inclusive Excellence.pdf
Enterprise Excellence is Inclusive Excellence.pdf
KaiNexus
 
FINAL PRESENTATION.pptx12143241324134134
FINAL PRESENTATION.pptx12143241324134134FINAL PRESENTATION.pptx12143241324134134
FINAL PRESENTATION.pptx12143241324134134
LR1709MUSIC
 
Cree_Rey_BrandIdentityKit.PDF_PersonalBd
Cree_Rey_BrandIdentityKit.PDF_PersonalBdCree_Rey_BrandIdentityKit.PDF_PersonalBd
Cree_Rey_BrandIdentityKit.PDF_PersonalBd
creerey
 
NewBase 24 May 2024 Energy News issue - 1727 by Khaled Al Awadi_compresse...
NewBase   24 May  2024  Energy News issue - 1727 by Khaled Al Awadi_compresse...NewBase   24 May  2024  Energy News issue - 1727 by Khaled Al Awadi_compresse...
NewBase 24 May 2024 Energy News issue - 1727 by Khaled Al Awadi_compresse...
Khaled Al Awadi
 

Recently uploaded (20)

anas about venice for grade 6f about venice
anas about venice for grade 6f about veniceanas about venice for grade 6f about venice
anas about venice for grade 6f about venice
 
New Product Development.kjiy7ggbfdsddggo9lo
New Product Development.kjiy7ggbfdsddggo9loNew Product Development.kjiy7ggbfdsddggo9lo
New Product Development.kjiy7ggbfdsddggo9lo
 
Lookback Analysis
Lookback AnalysisLookback Analysis
Lookback Analysis
 
5 Things You Need To Know Before Hiring a Videographer
5 Things You Need To Know Before Hiring a Videographer5 Things You Need To Know Before Hiring a Videographer
5 Things You Need To Know Before Hiring a Videographer
 
Matt Conway - Attorney - A Knowledgeable Professional - Kentucky.pdf
Matt Conway - Attorney - A Knowledgeable Professional - Kentucky.pdfMatt Conway - Attorney - A Knowledgeable Professional - Kentucky.pdf
Matt Conway - Attorney - A Knowledgeable Professional - Kentucky.pdf
 
falcon-invoice-discounting-a-premier-platform-for-investors-in-india
falcon-invoice-discounting-a-premier-platform-for-investors-in-indiafalcon-invoice-discounting-a-premier-platform-for-investors-in-india
falcon-invoice-discounting-a-premier-platform-for-investors-in-india
 
Memorandum Of Association Constitution of Company.ppt
Memorandum Of Association Constitution of Company.pptMemorandum Of Association Constitution of Company.ppt
Memorandum Of Association Constitution of Company.ppt
 
Enterprise Excellence is Inclusive Excellence.pdf
Enterprise Excellence is Inclusive Excellence.pdfEnterprise Excellence is Inclusive Excellence.pdf
Enterprise Excellence is Inclusive Excellence.pdf
 
sales plan presentation by mckinsey alum
sales plan presentation by mckinsey alumsales plan presentation by mckinsey alum
sales plan presentation by mckinsey alum
 
FINAL PRESENTATION.pptx12143241324134134
FINAL PRESENTATION.pptx12143241324134134FINAL PRESENTATION.pptx12143241324134134
FINAL PRESENTATION.pptx12143241324134134
 
Taurus Zodiac Sign_ Personality Traits and Sign Dates.pptx
Taurus Zodiac Sign_ Personality Traits and Sign Dates.pptxTaurus Zodiac Sign_ Personality Traits and Sign Dates.pptx
Taurus Zodiac Sign_ Personality Traits and Sign Dates.pptx
 
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
Accpac to QuickBooks Conversion Navigating the Transition with Online Account...
 
Cree_Rey_BrandIdentityKit.PDF_PersonalBd
Cree_Rey_BrandIdentityKit.PDF_PersonalBdCree_Rey_BrandIdentityKit.PDF_PersonalBd
Cree_Rey_BrandIdentityKit.PDF_PersonalBd
 
RMD24 | Retail media: hoe zet je dit in als je geen AH of Unilever bent? Heid...
RMD24 | Retail media: hoe zet je dit in als je geen AH of Unilever bent? Heid...RMD24 | Retail media: hoe zet je dit in als je geen AH of Unilever bent? Heid...
RMD24 | Retail media: hoe zet je dit in als je geen AH of Unilever bent? Heid...
 
NewBase 24 May 2024 Energy News issue - 1727 by Khaled Al Awadi_compresse...
NewBase   24 May  2024  Energy News issue - 1727 by Khaled Al Awadi_compresse...NewBase   24 May  2024  Energy News issue - 1727 by Khaled Al Awadi_compresse...
NewBase 24 May 2024 Energy News issue - 1727 by Khaled Al Awadi_compresse...
 
12 Conversion Rate Optimization Strategies for Ecommerce Websites.pdf
12 Conversion Rate Optimization Strategies for Ecommerce Websites.pdf12 Conversion Rate Optimization Strategies for Ecommerce Websites.pdf
12 Conversion Rate Optimization Strategies for Ecommerce Websites.pdf
 
What are the main advantages of using HR recruiter services.pdf
What are the main advantages of using HR recruiter services.pdfWhat are the main advantages of using HR recruiter services.pdf
What are the main advantages of using HR recruiter services.pdf
 
Business Valuation Principles for Entrepreneurs
Business Valuation Principles for EntrepreneursBusiness Valuation Principles for Entrepreneurs
Business Valuation Principles for Entrepreneurs
 
Meaningful Technology for Humans: How Strategy Helps to Deliver Real Value fo...
Meaningful Technology for Humans: How Strategy Helps to Deliver Real Value fo...Meaningful Technology for Humans: How Strategy Helps to Deliver Real Value fo...
Meaningful Technology for Humans: How Strategy Helps to Deliver Real Value fo...
 
Filing Your Delaware Franchise Tax A Detailed Guide
Filing Your Delaware Franchise Tax A Detailed GuideFiling Your Delaware Franchise Tax A Detailed Guide
Filing Your Delaware Franchise Tax A Detailed Guide
 

A/B Testing Pitfalls and Lessons Learned at Spotify

  • 1. A/B Testing: Avoiding Common Pitfalls Danielle Jabin March 5, 2014
  • 2. 1 Make all the world’s music available instantly to everyone, wherever and whenever they want it
  • 3. 2 As of March 5, 2014
  • 4. 3 Over 24 million active users As of March 5, 2014
  • 5. 4 Access to more than 20 million songs As of March 5, 2014
  • 6. 5
  • 7. 6 But can we make it even easier?
  • 8. 7 We can try… …with A/B testing!
  • 11. Pitfall #1: Not limiting your error rate
  • 13. 12 What if I flip a coin 100 times and get 51 heads?
  • 14. 13 What if I flip a coin 100 times and get 5 heads?
  • 15. 14
  • 16. 15 The likelihood of obtaining a certain value under a given distribution is measured by its p-value
  • 17. 16 If there is a low likelihood that a change is due to chance alone, we call our results statistically significant
  • 18. 17 What if I flip a coin 100 times and get 5 heads?
  • 19. 18 Statistical significance is measured by alpha ● alpha levels of 5% and 1% are most commonly used – Alternatively: P(significant) = .05 or .01
  • 20. 19 Each alpha has a corresponding Z-score alpha Z-score (two-sided test) .10 1.65 .05 1.96 .01 2.58
  • 21. 20 The Z-score tells us how far a particular value is from the mean (and what the corresponding likelihood is)
  • 23. 22 Compute the Z-score at the end of the test
  • 24. 23 Standard deviation (σ) tells us how spread out the numbers are
  • 25. 24
  • 26. 25 To lock in error rates before you start, fix your sample size
  • 27. 26 What should my sample size be? ● To lock in error rates before you start a test, fix your sample size Represents the desired power (typically .84 for 80% power). Sample size in each group (assumes equal sized groups) 2s (Z b + Za /2 ) n= 2 difference Represents the desired 2 Standard deviation of the outcome variable Source: www.stanford.edu/~kcobb/hrp259/lecture11.ppt 2 Effect Size (the difference in means) level of statistical significance (typically 1.96).
  • 28. 27 Recap: running an A/B test ● Compute your sample size – Using alpha, beta, standard deviation of your metric, and effect size ● Run your test! But stop once you’ve reached the fixed sample size stopping point ● Compute your z-score and compare it with the z-score for the chosen alpha level
  • 32. Pitfall #2: Stopping your test before the fixed sample size stopping point
  • 33. 32 Sample size for varying alpha levels ● With σ = 10, difference in means = 1 Two-sided test alpha = .10, beta = .80 1230 alpha = .05, beta = .80 1568 alpha = .01, beta = .80 2339
  • 34. 33 Let’s see some numbers ● 1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion rate 90% significance reached 95% significance reached 99% significance reached Source: destack.home.xs4all.nl/projects/significance/ Stop at first point of significance 654 of 1,000 Ended as significant 427 of 1,000 49 of 1,000 146 of 1,000 14 of 1,000 100 of 1,000
  • 35. 34 Remedies ● Don’t peek ● Okay, maybe you can peek, but don’t stop or make a decision before you reach the fixed sample size stopping point ● Sequential sampling
  • 37. Pitfall #3: Making multiple comparisons in one test
  • 38. 37 A test can be one of two things: significant or not significant ● P(significant) + P(not significant) = 1 ● Let’s take an alpha of .05 – P(significant) = .05 – P(not significant) = 1 – P(significant) = 1 - .05 = .95
  • 39. 38 What about for two comparisons? ● P(at least 1 significant) = 1 - P(none of the 2 are significant) ● P(none of the 2 are significant) = P(not significant)*P(not significant) = .95*.95 = .9025 ● P(at least 1 significant) = 1 - .9025 = .0975
  • 40. 39 What about for two comparisons? ●That’s almost 2x (1.95x, to be precise) your .05 significance rate!
  • 41. 40 And it just gets worse… P(at least 1 signifcant) An increase of… 5 variations 1 – (1-.05)^5 = .23 4.6x 10 variations 1 – (1-.05)^10 = .40 8x 20 variations 1 – (1-.05)^20 = .64 12.8x
  • 42. 41 How can we remedy this? ●Bonferroni correction – Divide P(significant), your alpha, by the number of variations you are testing, n – alpha/n becomes the new level of statistical significance
  • 43. 42 So what about two comparisons now? ● Our new P(significant) = .05/2 = .025 ● Our new P(not significant) = 1 - .025 = .975 ● P(at least 1 significant) = 1 - P(none of the 2 are significant) ● P(none of the 2 are significant) = P(not significant)*P(not significant) = .975*.975 = .951 ● P(at least 1 significant) = 1 - .951 = .0499
  • 44. 43 P(significant) stays under .05  Corrected alpha P(at least 1 signifcant) 5 variations .05/5 = .01 1 – (1-.01)^5 = .049 10 variations .05/10 = .005 1 – (1-.005)^10 = .049 20 variations .05/20 = .0025 1 – (1-.0025)^20 = .049
  • 47. 46 A/B test steps: 1. Decide what to test 2. Determine a metric to test 3. Formulate your hypothesis 1. Select an effect size threshold: what change of the metric would make a rollout worthwhile? 4. Calculate sample size (your stopping point) 1. Decide your Type I (alpha) and Type 2 (beta) error levels and the corresponding zscores 2. Determine the standard deviation of your metric 5. Run your test! But stop once you’ve reached the fixed sample size stopping point 6. Compute your z-score and compare it with the z-score for your chosen alpha level
  • 48. 47 Type I and Type II error ● Type I error: incorrectly reject a true null hypothesis – alpha ● Type II error: incorrectly accept a false null hypothesis – beta – Power: 1 - beta
  • 49. 48 Z-score reference table alpha One-sided test Two-sided test .10 1.28 1.65 .05 1.65 1.96 .01 2.33 2.58
  • 50. 49 Z-score for proportions (e.g. conversion)