Most A/B testing results
are Illusory
Martin Goodson, Skimlinks
These are my opinions, not those of my
employer!
What’s an A/B test?
Example: Free delivery
A: Control
B: Variant
‘How can you talk for 40 minutes
about A/B testing?’
A/B tests are very easy to get wrong
What my experience is based on
What this talk is about
3 Statistical concepts
Errors and consequences
These errors are exactly how A/B testing
software works
What this talk is about
Statistical Power
Multiple Testing
Regression to the Mean
What is Statistical Power?
The probability that you will detect a true
difference between two samples
What is Statistical Power?
Example: are men taller than women, on
average?
What is Statistical Power?
Example: free delivery on a website
Why is Statistical Power important?
1. False negatives
2. False positives
Precision
Proportion of true positives in the positive
results
It’s a function of power, significance level
and prevalence.
If you have good power?
Out of 100 tests
10 really drive uplift
You detect 8
5 false positives
8/13 of positive tests are real
If you have bad power?
Out of 100 tests
10 really drive uplift
You detect 3
5 false positives
3/8 of winning tests are real!
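These precision numbers fall straight out of a one-line formula. A minimal Python sketch; the 10% prevalence and the 80% vs 30% power figures are just the assumptions behind the two scenarios above:

```python
def precision(power, alpha, prevalence):
    """Proportion of 'winning' tests that reflect a real uplift.

    power:      probability of detecting a true effect
    alpha:      significance level (false positive rate)
    prevalence: fraction of tested changes that truly work
    """
    true_positives = power * prevalence
    false_positives = alpha * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# Good power (80%): 8 true wins vs ~5 false ones -> ~8/13 real
print(precision(power=0.80, alpha=0.05, prevalence=0.10))  # ~0.64
# Bad power (30%): 3 true wins vs ~5 false ones -> ~3/8 real
print(precision(power=0.30, alpha=0.05, prevalence=0.10))  # ~0.40
```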
Marketer: ‘We need results in 2 weeks time’
Me: ‘We can’t run this test for only two weeks; we won’t get robust results’
Marketer: ‘We need results in 2 weeks time’
Me: ‘We can’t run this test for only two weeks; we won’t get robust results’
Marketer: ‘Why are you being so negative?’
Calculating Power
Alpha: probability of a positive result when
the null hypothesis is true (5%)
Beta: probability of missing a positive result
when there is a true effect
Power = 1 - Beta (80-90%)
Calculating Power
Use a power calculator:
Online
R (power.prop.test)
python (statsmodels.stats.power)
Approximate sample sizes
Using a power calculator and asking for 80%
power and a significance level of 5%:
6000 conversions to detect 5% uplift
1600 conversions to detect 10% uplift
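As a sanity check, figures like these can be reproduced with statsmodels. The baseline conversion rate is not stated on the slide; the 1% used here is an assumption (for small baselines the required number of conversions is nearly independent of the baseline rate):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def conversions_needed(baseline, relative_uplift, alpha=0.05, power=0.80):
    """Approximate conversions per arm needed to detect a relative uplift."""
    p1, p2 = baseline, baseline * (1 + relative_uplift)
    effect = proportion_effectsize(p2, p1)  # Cohen's h for two proportions
    visitors = NormalIndPower().solve_power(effect_size=effect, alpha=alpha,
                                            power=power,
                                            alternative='two-sided')
    return visitors * baseline  # turn visitors per arm into conversions

print(conversions_needed(0.01, 0.05))  # ~6000 conversions for a 5% uplift
print(conversions_needed(0.01, 0.10))  # ~1600 conversions for a 10% uplift
```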
Multiple testing
Effect of multiple testing
If you run 20 tests at a significance level of 5%
you should expect about 1 false ‘win’, just by chance.
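The arithmetic behind that claim, as a quick check:

```python
# 20 tests of changes that do nothing, each judged at the 5% level
n_tests, alpha = 20, 0.05
print(n_tests * alpha)             # 1.0 expected false 'win'
print(1 - (1 - alpha) ** n_tests)  # ~0.64 chance of at least one false win
```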
Giving targets for successful tests.
Stopping tests early
Stopping tests early
Simulations show that stopping an A/A test
as soon as you see a positive result will
produce a ‘successful’ test 41% of the time.
Stopping tests early
That works out to a precision of 20%
Negative uplift.
Stopping an A/B test with a negative effect
results in a win 9% of the time!
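The talk doesn’t specify the simulation setup, so here is a minimal sketch of the peeking effect under assumed parameters (5% conversion rate, a peek every 200 visitors, up to 10,000 visitors per arm); the exact false-win rate depends on how often and how long you peek:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)

def aa_test_stopped_early(p=0.05, n_max=10_000, peek_every=200, alpha=0.05):
    """One A/A test (identical arms). Return True if any interim peek
    looks 'significant', i.e. we would have stopped and declared a win."""
    a = rng.random(n_max) < p
    b = rng.random(n_max) < p
    for n in range(peek_every, n_max + 1, peek_every):
        counts = np.array([a[:n].sum(), b[:n].sum()])
        _, pval = proportions_ztest(counts, np.array([n, n]))
        if pval < alpha:
            return True
    return False

runs = 1_000
print(sum(aa_test_stopped_early() for _ in range(runs)) / runs)
# far above the nominal 5%; roughly 40% with this many peeks
```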
A True Story
Regression to the mean
Give 100 students a true/false test
They all answer randomly
Take only the top-scoring 10% of the class
Test them again
What will the results be?
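A minimal simulation of this classroom experiment; the 50-question test length is an assumption (the slide only fixes 100 students):

```python
import numpy as np

rng = np.random.default_rng(42)
n_students, n_questions = 100, 50

first = rng.binomial(n_questions, 0.5, size=n_students)   # everyone guesses
top10 = np.argsort(first)[-10:]                           # top-scoring 10%
retest = rng.binomial(n_questions, 0.5, size=top10.size)  # they guess again

print(first[top10].mean())  # ~30/50: the 'best' students on the first test
print(retest.mean())        # ~25/50: back to pure chance on the retest
```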
Estimates of uplift are generally
wrong.
What you need to do to get it right
● Do a power calculation first to estimate
sample size
● Use a valid hypothesis - don’t use a
scattergun approach
● Do not stop the test early
● Perform a second ‘validation’ test
My details
martingoodson@gmail.com
@martingoodson
http://goo.gl/jvhwmB
Download my whitepaper on A/B testing here
Skimlinks After Party!
Levante Bar
5 minutes away
Come hungry!
Invites + Map at the booth
http://skimlinks.com/jobs