
# 10 Guidelines for A/B Testing

My talk for PyData NYC 2018.


### 10 Guidelines for A/B Testing

1. Emily Robinson @robinson_es 10 Guidelines for A/B Testing
2. About Me ➔ R User (😱) ➔ Background in the social sciences ➔ Formerly at Etsy ➔ Data Scientist at DataCamp
3. What is A/B Testing?
4. A/B testing is everywhere
5. My perspective: millions of visitors daily; data engineering pipeline set up
6. Generating numbers is easy; generating numbers you should trust is hard! Source: Trustworthy online controlled experiments: five puzzling outcomes explained
7. Guidelines
8. Disclaimer
9. This is Bobo. He is our fictional PM for the day
10. Situation: Bobo: “Well, we’re hoping this test will increase registrations, search clicks, and course starts.” Problem: The test increased registrations by 5%, but decreased course starts by 3%
11. 1. Have one key metric per experiment ➔ Clarifies decision-making ➔ Can have additional “guardrail” metrics that you don’t want to negatively impact
12. Situation: Bobo: “I have 100 test ideas. How long is each going to take to run? And which ones should we choose?” Problem: Ideas are cheap; prioritizing them is difficult
13. 2. Use your key metric to do a power calculation ➔ 80% power = if there’s an effect of this size, there’s an 80% chance you detect it ➔ 10,000 daily visitors, 10% conversion rate: how many days to detect a 5% increase? ➔ https://bookingcom.github.io/powercalculator/
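The power calculation above can be sketched with the standard two-proportion sample-size formula (a minimal stdlib-only sketch; the Booking.com calculator linked on the slide uses its own methodology, and the 5% increase is treated here as a relative lift from 10% to 10.5%):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p_base, p_new, alpha=0.05, power=0.8):
    """Sample size per group for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_new) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))) ** 2
    return ceil(numerator / (p_base - p_new) ** 2)

# 10% baseline conversion, detect a relative 5% lift (10% -> 10.5%)
n = sample_size_per_group(0.10, 0.105)
days = ceil(2 * n / 10_000)  # 10,000 daily visitors, split 50/50
```

With these inputs the answer is tens of thousands of users per group and roughly two weeks of traffic, which is exactly why the power calculation has to happen before the test, not after.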
14. Situation: Bobo: “I checked the experiment today and we significantly increased conversion rate! Quick, stop the test!” Problem: stopping as soon as a test looks significant (“peeking”) inflates your false positive rate. Source: http://varianceexplained.org/r/bayesian-ab-testing/, David Robinson
15. 3. Run your experiment the length you’ve planned on ➔ Stick to what you arrived at with your power analysis ➔ Advanced: always-valid inference and sequential testing
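The danger of not running the planned length can be demonstrated with a small A/A simulation (a hypothetical sketch, not from the talk): both groups have the same true conversion rate, yet stopping at the first peek that crosses p < .05 produces far more than 5% false positives.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=500, peeks=10, step=100, p=0.10, seed=42):
    """A/A test: both groups share true rate p, so every 'win' is a false positive."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold
    false_positives = 0
    for _ in range(n_sims):
        a = b = n = 0
        for _ in range(peeks):
            a += sum(rng.random() < p for _ in range(step))
            b += sum(rng.random() < p for _ in range(step))
            n += step
            pooled = (a + b) / (2 * n)
            se = sqrt(2 * pooled * (1 - pooled) / n)
            # stop the "experiment" at the first significant-looking peek
            if se > 0 and abs(a / n - b / n) / se > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims

rate = peeking_false_positive_rate()
```

With ten peeks the observed rate lands well above the nominal 0.05, which is the whole point of slide 15.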
16. Situation: Bobo: “I know the test didn’t work overall, but when I look at Canadian users on mobile we increased conversion by 10%!” Problem: This is multiple hypothesis testing and will increase your false positive rate
17. 4. Don’t look for differences in every possible segment ➔ Pre-specify hypotheses ➔ Run separate tests ➔ Can use methods to adjust for multiple hypothesis testing
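The inflation and one common adjustment method can be sketched as follows (a stdlib-only illustration; the segment p-values are made up):

```python
# With 20 independent segment tests at alpha = 0.05, the chance of at least
# one false positive is 1 - 0.95**20, roughly 64%.
family_fpr = 1 - 0.95 ** 20

def holm_bonferroni(pvals, alpha=0.05):
    """Holm step-down correction: returns which hypotheses to reject."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical per-segment p-values: naive testing would call two "significant"
segment_pvals = [0.03, 0.20, 0.004, 0.60]
decisions = holm_bonferroni(segment_pvals)
```

After the Holm correction only the 0.004 segment survives; the 0.03 result that looked significant on its own does not clear the adjusted threshold.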
18. Situation: Bobo: “The experiment was a big success! The split was 50.5/49.5 instead of 50/50 as planned, but that’s so small it doesn’t matter, right?” Problem: With 200k people in your experiment, a 50.5/49.5 split has p < .0001. You have bucketing skew, also known as sample-ratio mismatch
19. 5. Make sure your experiment is balanced ➔ Use a proportion test to check your split ➔ If unbalanced, do not use the results ➔ Bad news: difficult to debug. Check segments
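The proportion test can be sketched as a two-sided z-test of the observed split against the planned one (stdlib-only sketch; the 101,000/99,000 counts are the 50.5/49.5 split of 200k users from the previous slide):

```python
from math import sqrt
from statistics import NormalDist

def srm_p_value(n_a, n_b, expected_share=0.5):
    """Two-sided z-test: is the observed split consistent with the planned one?"""
    n = n_a + n_b
    z = (n_a - n * expected_share) / sqrt(n * expected_share * (1 - expected_share))
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = srm_p_value(101_000, 99_000)  # 50.5/49.5 split of 200k users
```

The result is p < .0001, confirming the slide: a split that looks trivially close to 50/50 is wildly improbable under correct bucketing, so the results should not be used.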
20. Situation: Bobo: “I read this article about how much better multi-armed bandits are than traditional A/B tests. Why don’t we use that?” Problem: Not fully understanding the assumptions of those methods
21. 6. Don’t overcomplicate your methods ➔ Get the basics right ➔ Designing tests right > super-sophisticated methods
22. Situation: Bobo: “Well, nothing went up, but nothing went down either, so let’s just launch it!” Problem: There may be a negative effect too small to detect, and the feature adds technical upkeep
23. 7. Be careful of launching things because they “don’t hurt” ➔ Decide whether to “launch on neutral” beforehand ➔ Non-inferiority testing
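The non-inferiority test mentioned above can be sketched as: launch on neutral only if the confidence bound on the treatment effect rules out a pre-chosen harm margin (a hypothetical stdlib-only sketch with made-up counts and a margin of -0.5 percentage points):

```python
from math import sqrt
from statistics import NormalDist

def non_inferior(conv_t, n_t, conv_c, n_c, margin=-0.005, alpha=0.05):
    """One-sided check: is treatment - control reliably above `margin`?"""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    diff = p_t - p_c
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    lower = diff - NormalDist().inv_cdf(1 - alpha) * se  # one-sided lower bound
    return lower > margin

# A neutral-looking test: 10.02% vs 10.00% conversion on 50k users each
ok = non_inferior(5_010, 50_000, 5_000, 50_000)
```

The key design choice is that the margin is decided beforehand: with these counts the neutral result clears a -0.5pp margin, but a clearly worse variant (e.g. 9.6% vs 10.0%) would not.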
24. Situation: Bobo: “Hey, we just finished this experiment. Can you analyze it for us?” Problem: “To consult the statistician after an experiment is finished is often merely to ask [them] to conduct a post mortem examination. [They] can perhaps say what the experiment died of.” - Ronald Fisher
25. 8. Have a data scientist/analyst involved in the whole process ➔ Helps decide whether it should be an experiment at all ➔ Make sure you can measure what you want ➔ Can surface problems along the way
26. Situation: Bobo: “Hey, we accidentally added everyone to the experiment. Can we still use our dashboards to monitor it?” Problem: Non-impacted people add noise, decreasing power
27. 9. Only include people in your analysis who could have been affected ➔ Start tracking people only after they see the change ➔ Can be tricky – e.g. changing the threshold for a free shipping offer from $25 to $35
28. Situation: Bobo: “We spent 6 months redesigning this page and made 50 changes to make it awesome, but the A/B test shows it did worse. Why?” Problem: Time was wasted, and with so many changes it’s hard or impossible to tell what the problem was
29. 10. Focus on smaller, incremental tests ➔ Work in small design-develop-measure cycles ➔ Test assumptions
30. Conclusion
31. Recap 1. Have one key metric per experiment 2. Use your key metric to do a power calculation 3. Run your experiment for the length you’ve planned on 4. Don’t look for differences in every possible segment 5. Make sure your experiment groups are balanced 6. Don’t overcomplicate your methods 7. Be careful of launching things because they don’t hurt 8. Have a data scientist/analyst involved in the whole process 9. Only include people in your analysis who could have been affected 10. Focus on smaller, incremental tests
32. Research papers ➔ Controlled experiments on the web: survey and practical guide (2008) ➔ Seven rules of thumb for web site experiments (2014) ➔ A dirty dozen: twelve common metric interpretation pitfalls in online controlled experiments (2017) ➔ Democratizing online controlled experiments at Booking.com (2017)
33. Blog posts and presentations ➔ Design for Continuous Experimentation by Dan McKinley ➔ Scaling Airbnb’s Experimentation Platform by Jonathan Parks ➔ Please, please don’t A/B test that by Tal Raviv ➔ How Etsy handles peeking in A/B Testing by Callie McRee and Kelly Shen
34. Thank you! hookedondata.org @robinson_es bit.ly/guidelinesab