PyData London 2014: Martin Goodson - Most A/B Testing Results are Illusory

Presentation Transcript

    • Most A/B testing results are illusory. Martin Goodson, Skimlinks
    • These are my opinions, not those of my employer!
    • What’s an A/B test? Example: free delivery. A: control, B: variant.
    • ‘How can you talk for 40 minutes about A/B testing?’
    • A/B tests are very easy to get wrong
    • What my experience is based on
    • What this talk is about: 3 statistical concepts; their errors and consequences; and how these errors are exactly how A/B testing software works.
    • What this talk is about: statistical power, multiple testing, regression to the mean.
    • What is Statistical Power? The probability that you will detect a true difference between two samples
    • What is Statistical Power? Example: are men taller than women, on average?
    • What is Statistical Power? Example: free delivery on a website
    • Why is Statistical Power important? 1. False negatives 2. False positives
    • Precision: the proportion of true positives among the positive results. It’s a function of power, significance level, and prevalence.
    • If you have good power? Out of 100 tests, 10 really drive uplift. You detect 8 of them, plus 5 false positives: 8/13 of positive tests are real.
    • If you have bad power? Out of 100 tests, 10 really drive uplift. You detect only 3, plus 5 false positives: just 3/8 of winning tests are real! (The sketch below reproduces this arithmetic.)
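
    The arithmetic on the two slides above follows from a simple formula: precision = (power × prevalence) / (power × prevalence + alpha × (1 − prevalence)). A minimal Python sketch; the 30% figure for the “bad power” case is an assumption chosen to match the slide’s “you detect 3”:

        # Precision of an A/B testing programme: the share of "winning" tests
        # that reflect a real effect, given power, significance level (alpha),
        # and prevalence (the fraction of tested ideas that truly drive uplift).

        def precision(power, alpha, prevalence):
            true_positives = power * prevalence          # real effects detected
            false_positives = alpha * (1 - prevalence)   # nulls that "win" anyway
            return true_positives / (true_positives + false_positives)

        # Good power (80%): 8 real wins vs ~5 false ones out of 100 tests -> ~8/13.
        print(precision(power=0.8, alpha=0.05, prevalence=0.1))  # ~0.64

        # Bad power (assumed 30%): 3 real wins vs ~5 false ones -> ~3/8.
        print(precision(power=0.3, alpha=0.05, prevalence=0.1))  # ~0.40
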
    • Marketer: ‘We need results in two weeks.’ Me: ‘We can’t run this test for only two weeks; we won’t get robust results.’ Marketer: ‘Why are you being so negative?’
    • Calculating Power. Alpha: the probability of a positive result when the null hypothesis is true (5%). Beta: the probability of not seeing a positive result when the null hypothesis is false, i.e. when a real difference exists. Power = 1 − Beta (80–90%).
    • Calculating Power. Use a power calculator: online, in R (power.prop.test), or in Python (statsmodels.stats.power).
    • Approximate sample sizes. Using a power calculator and asking for 80% power and a significance level of 5%: roughly 6000 conversions to detect a 5% uplift, or 1600 conversions to detect a 10% uplift (see the sketch below).
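
    A minimal sketch using statsmodels, as suggested on the slide above. The 1% baseline conversion rate is an assumption made for illustration; the required number of conversions is roughly independent of the baseline and comes out close to the slide’s figures:

        # Sample size for a two-proportion test at 80% power, 5% significance.
        from statsmodels.stats.power import NormalIndPower
        from statsmodels.stats.proportion import proportion_effectsize

        baseline = 0.01  # assumed baseline conversion rate (not from the slides)

        for uplift in (0.05, 0.10):
            effect = proportion_effectsize(baseline, baseline * (1 + uplift))
            n_per_arm = NormalIndPower().solve_power(
                effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
            )
            conversions = 2 * n_per_arm * baseline  # total across both arms
            print(f"{uplift:.0%} uplift: {n_per_arm:,.0f} visitors per arm, "
                  f"~{conversions:,.0f} conversions in total")
        # ~6,400 conversions for a 5% uplift, ~1,600 for 10% -- close to the slide.
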
    • Multiple testing
    • Effect of multiple testing: if you run 20 tests at a significance level of 5%, you will obtain about 1 ‘win’ just by chance (simulated below).
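
    A minimal simulation of that claim: batches of 20 A/A tests (no real difference between arms), each evaluated at alpha = 5%. The traffic numbers are assumptions for illustration:

        # Run many batches of 20 A/A tests and count the "wins" per batch that
        # appear purely by chance at a 5% significance level.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        n, p = 10_000, 0.01   # visitors per arm and true conversion rate (assumed)

        wins_per_batch = []
        for _ in range(500):                  # 500 batches of 20 tests
            batch_wins = 0
            for _ in range(20):
                a, b = rng.binomial(n, p), rng.binomial(n, p)
                table = [[a, n - a], [b, n - b]]
                _, p_value, _, _ = stats.chi2_contingency(table, correction=False)
                batch_wins += p_value < 0.05
            wins_per_batch.append(batch_wins)

        print(np.mean(wins_per_batch))  # ~1 chance "win" per 20 tests
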
    • Giving targets for successful tests.
    • Stopping tests early
    • Stopping tests early: simulations show that stopping an A/A test as soon as you see a positive result will produce a ‘successful’ test 41% of the time.
    • Stopping tests early: that works out to a precision of about 20% (plug an effective alpha of 41% into the precision calculation above).
    • Negative uplift: stopping early can turn an A/B test whose variant has a genuinely negative effect into a ‘win’ 9% of the time! (See the peeking simulation below.)
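
    A sketch of the kind of peeking simulation behind those figures. The slides don’t state the peeking schedule, so the batch size, number of peeks, and conversion rate below are assumptions; the exact false-positive rate depends on them, but any frequent peeking drives it far above the nominal 5%:

        # A/A test with "peeking": check a two-proportion z-test after every batch
        # of visitors and stop at the first p < 0.05. Even with no real difference,
        # the test is declared a winner far more often than 5% of the time.
        import numpy as np

        rng = np.random.default_rng(1)
        p_true = 0.05                  # same conversion rate in both arms (assumed)
        batch, n_peeks = 1_000, 100    # peek every 1,000 visitors, up to 100,000

        false_wins, n_sims = 0, 2_000
        for _ in range(n_sims):
            a = b = n = 0
            for _ in range(n_peeks):
                a += rng.binomial(batch, p_true)
                b += rng.binomial(batch, p_true)
                n += batch
                pooled = (a + b) / (2 * n)
                se = np.sqrt(2 * pooled * (1 - pooled) / n)
                if se > 0 and abs(a - b) / n > 1.96 * se:  # two-sided p < 0.05
                    false_wins += 1
                    break

        print(false_wins / n_sims)  # ~0.4 under these assumptions, vs nominal 0.05
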
    • A True Story
    • Regression to the mean: give 100 students a true/false test, on which they all answer randomly. Take only the top-scoring 10% of the class and test them again. What will the results be? (Simulated below.)
    • Estimates of uplift are generally wrong: winning variants are selected for their high observed uplift, so the true uplift is usually smaller than the test suggests.
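
    A minimal simulation of the classroom example above; the 50-question test length is an assumption (the slide doesn’t give one):

        # Regression to the mean: 100 students answer a true/false test at random,
        # we keep the top-scoring 10%, then retest them. Their first scores look
        # impressive; their retest scores fall straight back to chance.
        import numpy as np

        rng = np.random.default_rng(2)
        n_students, n_questions = 100, 50  # 50 questions is an assumption

        first = rng.binomial(n_questions, 0.5, size=n_students) / n_questions
        top10 = np.argsort(first)[-10:]    # top-scoring 10% of the class
        retest = rng.binomial(n_questions, 0.5, size=10) / n_questions

        print(first[top10].mean())  # ~0.6: the apparent "best" students
        print(retest.mean())        # ~0.5: back to chance on the retest
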
    • What you need to do to get it right: ● Do a power calculation first to estimate sample size ● Use a valid hypothesis - don’t use a scattergun approach ● Do not stop the test early ● Perform a second ‘validation’ test
    • My details: martingoodson@gmail.com, @martingoodson. Download my whitepaper on A/B testing here: http://goo.gl/jvhwmB
    • Skimlinks after-party! Levante Bar, 5 minutes away. Come hungry! Invites + map at the booth. http://skimlinks.com/jobs