A walkthrough of the most common statistical pitfalls that marketers encounter in CRO testing.

- 1. Your Results Are Invalid: Statistics for CRO (@THCapper)
- 2. A Good A/B Test Result: “10% Uplift, With 95% Significance”
- 3. “10% Uplift, With 95% Significance” ● What does this mean? ● Is this correct?
- 4. Easy Questions?
- 5. Most Tools Encourage Mistakes
- 6. Risk
- 7. Marketer: “Roll it out!” Statistician (me): *sobs*
- 8. “That’s Risky”
- 9. “Advanced”
- 10. You will learn today: ● The most common serious errors in A/B testing ● How to avoid them ● How to interpret your result ● Whether to roll it out
- 11. How to Run an A/B Test
- 12. 1. Test design 2. Results interpretation 3. Decision
- 13. Jargon: Null Hypothesis
- 14. Jargon: Null Hypothesis ● The hypothesis that your variant and original are functionally equivalent
- 15. e.g. an A/A Test: A vs. A
- 16. Jargon: P-Value
- 17. Jargon: P-Value ● The chance of a result at least this extreme if the null hypothesis is true ● E.g. 0.05 for 95% significance
- 18. Jargon: Critical Value
- 19. Jargon: Critical Value ● The threshold you compare against when deciding whether to reject the null hypothesis: the significance level (e.g. 0.05) for a p-value, or the matching cutoff for the test statistic
- 20. 1. Test Design
- 21. How Many Tests?
- 22. Multivariate Testing [diagram: variants A, B, C, D, E, F across “Landing Page” and “Product Pages”]
- 23. Multivariate Testing ● Landing Page: A, B ● Product Page: C, D
- 24–27. Multivariate Testing [diagram build: landing page variants A, B paired with product page variants C, D]
- 28. Multivariate Testing ● Landing Page: A: 5%, B: 7.5%
- 29. Multivariate Testing ● Product Page: C: 5%, D: 7.5%
- 30. Multivariate Testing [diagram: landing page variants A, B paired with product page variants C, D]
- 31. Multivariate Testing ● AC: 0% ● BD: 5% ● BC: 10% ● AD: 10%
- 32. [diagram: A, B]
- 33. “Constantly Iterate”
- 34. Multiple Testing: A, B, C, D, E, F
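- 35. False Positives [diagram: results labelled “Test: Healthy” and “Test: Ill”]
- 36. False Positives [diagram: “Test: Healthy” and “Test: Ill”, with the False Negatives marked]
- 37. False Positives [diagram: “Test: Healthy” and “Test: Ill”, with the False Positives and False Negatives marked]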
- 38. Multiple Testing 1 A/A test: 5% chance of achieving 95% significance.
- 39. Multiple Testing 1 A/A Test: 5% chance
- 40. Multiple Testing ● 1 A/A Test: 5% chance ● 2 A/A Tests: 9.75% chance
- 41. Multiple Testing ● 1 A/A Test: 5% chance ● 2 A/A Tests: 9.75% chance ● 3 A/A Tests: 14.26% chance
- 42. Multiple Testing ● 1 A/A Test: 5% chance ● 2 A/A Tests: 9.75% chance ● 3 A/A Tests: 14.26% chance ● 4 A/A Tests: 18.55% chance
- 43. Multiple Testing ● 1 A/A Test: 5% chance ● 2 A/A Tests: 9.75% chance ● 3 A/A Tests: 14.26% chance ● 4 A/A Tests: 18.55% chance ● n A/A Tests: 1-0.95^n
- 44. Multiple Testing Solutions: 1. Accept risk of false positives
- 45. Multiple Testing Solutions: 1. Accept risk of false positives 2. Bonferroni correction
- 46. Bonferroni Correction ● Standard: p-value vs. 0.05
- 47. Bonferroni Correction ● Standard: p-value vs. 0.05 ● Bonferroni (approximation): p-value vs. 0.05/N
- 48. Šidák Correction (exact for independent tests) ● Standard: p-value vs. 0.05 ● Šidák: p-value vs. 1-(1-0.05)^(1/N)
- 49. Multiple Testing Solutions: 1. Accept risk of false positives 2. Bonferroni correction 3. Holm-Bonferroni correction
- 50. Choosing the Right Metric
- 51. Choosing the Right Metric Conversion Rate vs. Average Session Value
- 52. Choosing the Right Metric Conversion Rate vs. Average Session Value Profit?
- 53. Stopping Rules
- 54. Stopping Rules Common: When my test reaches significance.
- 55. “Significance so far” varies over time.
- 56. Stopping Rules [figure: sequence of markers Y Y Y Y Y N N N N N]
- 57. Stopping Rules [figure: sequence of markers Y Y Y Y Y Y Y N N N]
- 58–59. Stopping Rules [chart: only the value 20000 survives]
- 60. Exceptions https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
- 61. Stopping Rules Solutions: 1. Sequential testing - e.g. Optimizely 2. Bayesian testing - e.g. VWO 3. Predetermined sample size
- 62. evanmiller.org/ab-testing/sample-size.html
- 63. Sample Size for Average Session Value Testing
- 64. Standard Deviation ● =stdev(B:B) ● =stdev.s(B:B)
- 65. powerandsamplesize.com/Calculators/
- 66. Cutting Your Losses
- 67. Test Design Recap ● Contamination ● Multiple Testing ● Metric Choice ● Stopping Rules
- 68. 1. Test design 2. Results interpretation 3. Decision
- 69. 2. Results Interpretation
- 70. Interpreting the P-Value
- 71. Interpreting the P-value ● 1 test reaches 95% significance: 5% chance of data at least this extreme if the variants are functionally equivalent.
- 72. 0
- 73. Analogy
- 74. Analogy Question: How likely is it that my analytics or site are broken?
- 75. Analogy Question: How likely is it that my analytics or site are broken? Non-Answer: We only go a whole day with no conversions once every 2 months.
- 76. Analytics is broken with probability 1 or 0.
- 77. Interpreting the P-value Question: How likely is it that this variation actually does nothing? Non-Answer: We’d only see a difference this big 5% of the time.
- 78. Meanwhile in Industry Tools: ● “Chance to beat baseline” ● “We are 95% certain that the changes in test “B” will improve your conversion rate”
- 79. Unanswered Questions
- 80. Unanswered Questions Question: How likely is it that the increase will be less than predicted?
- 81. Unanswered Questions Question: How likely is it that the increase will be negative?
- 82. One Mistake: Probability of Outcome given Data vs. Probability of Data given Null
- 83. Unanswered Questions Question: How likely is it that these results are a fluke?
- 84. Confidence Intervals
- 85. Confidence Interval of Conversion Rate
- 86. Overlapping Confidence Intervals
- 87. Everything Else Still Applies
- 88. Choosing the Right Metric
- 89. evanmiller.org/ab-testing/t-test.html
- 90. Results Interpretation Recap ● Check Revenue ● P-Value ● Confidence Intervals
- 91. 1. Test design 2. Results interpretation 3. Decision
- 92. 3. Decision
- 93. A Good A/B Test Result: “10% Uplift, With 95% Significance”
- 94. But what about this? “10% Uplift, With 60% Significance”
- 95. Jargon: P-Value ● The chance of a result at least this extreme if the null hypothesis is true ● E.g. 0.05 for 95% significance
- 96. “10% Uplift, With 60% Significance” ● 40% chance of data at least this extreme if the variation is functionally identical
- 97. “10% Uplift, With 60% Significance” ● 40% chance of data at least this extreme if the variation is functionally identical ● The variation is probably better than the baseline
- 98. Drug Trials vs. Investment Banking
- 99. Are You OK With False Positives?
- 100. Data is Expensive
- 101. Data is Expensive: ● Opportunity Cost ● Exploration vs. Exploitation
- 102. Historical Comparisons are Invalid
- 103. Hang on… Why Should I Care About Significance?
- 104. 1. Ignoring Significance Doesn’t Allow You to Ignore Statistics
- 105. 2. Risk Aversion
- 106. Risk Factors: ● Agility ● Business attitudes ● What’s the worst that could happen?
- 107. Decision Recap ● Significant vs. Winning ● Risk ● Exploration vs. Exploitation
- 108. Conclusion: 3 Takeaways
- 109. 1. Think about significance and risk during test design
- 110. 2. Remember your real KPI: Profit
- 111. 3. You’re not testing medicines
- 112. @THCapper Takeaways: 1. Think about significance and risk during test design 2. Remember your real KPI: Profit 3. You’re not testing medicines
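
The multiple-testing figures on slides 38 to 43 follow from the formula 1-0.95^n, which assumes the A/A tests are independent. A minimal sketch that reproduces them; the function name `chance_of_false_positive` is illustrative and not from the deck:

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent A/A tests reaches significance."""
    return 1 - (1 - alpha) ** n_tests


# The first four rows match the slide values: 5%, 9.75%, 14.26%, 18.55%.
for n in (1, 2, 3, 4):
    print(f"{n} A/A test(s): {chance_of_false_positive(n):.2%} chance")
```

The same function can be called with larger `n_tests` to see how quickly the risk grows when many variants or metrics are checked at once.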
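
Slides 44 to 49 list three ways to handle multiple testing. The sketch below shows how the thresholds might be computed, assuming a family-wise significance level of 0.05. The function names and the p-values in the example are hypothetical, and the exact form 1-(1-0.05)^(1/N) is labelled here by its usual name, the Šidák correction:

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold using the simple approximation alpha / N."""
    return alpha / n_tests


def sidak_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold using the exact form for independent tests."""
    return 1 - (1 - alpha) ** (1 / n_tests)


def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm-Bonferroni step-down: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    rejected = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected


p_values = [0.01, 0.015, 0.02, 0.3]  # hypothetical results from four tests
print(bonferroni_threshold(len(p_values)))
print(sidak_threshold(len(p_values)))
print(holm_bonferroni(p_values))
```

With these hypothetical p-values, the plain 0.05/N threshold rejects only the first null hypothesis, while the Holm-Bonferroni procedure rejects the first three, which is why it is often preferred when several tests run together.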

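
Slides 53 to 59 warn against stopping as soon as a test reaches significance, because “significance so far” varies over time. One way to see the effect is to simulate A/A tests and compare two rules: stop at the first significant peek, or look only once at a predetermined sample size. This is a rough sketch; the conversion rate, sample sizes, peek interval and the pooled z-test are all assumptions chosen for illustration:

```python
import math
import random


def z_test_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rate (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_tests(n_sims=300, visitors_per_arm=2000, peek_every=200,
                      rate=0.05, alpha=0.05, seed=1):
    """Return (share significant at any peek, share significant at the end)."""
    rng = random.Random(seed)
    ever_count = 0
    end_count = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        ever_significant = False
        p = 1.0
        for visitor in range(1, visitors_per_arm + 1):
            # Both arms share the same true rate, so any "win" is a false positive.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if visitor % peek_every == 0:
                p = z_test_p_value(conv_a, visitor, conv_b, visitor)
                if p < alpha:
                    ever_significant = True
        ever_count += ever_significant
        end_count += p < alpha  # p from the final peek at the full sample size
    return ever_count / n_sims, end_count / n_sims


peeking, at_end = simulate_aa_tests()
print(f"False positives when stopping at the first significant peek: {peeking:.1%}")
print(f"False positives with a predetermined sample size: {at_end:.1%}")
```

The first figure should come out noticeably higher than the second, which is the motivation for the solutions on slide 61.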
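
Slides 61 to 65 point to calculators for a predetermined sample size, for conversion rate and for average session value. The sketch below uses the textbook normal-approximation formulas, which may differ slightly from what those calculators compute. The 80% power default and the example inputs are assumptions:

```python
import math
from statistics import NormalDist


def sample_size_conversion_rate(baseline: float, relative_uplift: float,
                                alpha: float = 0.05, power: float = 0.8) -> int:
    """Visitors needed per variant to detect a relative uplift in conversion rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def sample_size_session_value(std_dev: float, min_difference: float,
                              alpha: float = 0.05, power: float = 0.8) -> int:
    """Sessions needed per variant to detect a difference in average session value.

    std_dev is the standard deviation of session value, e.g. from =stdev.s(B:B).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * std_dev ** 2 / min_difference ** 2)


# Hypothetical inputs: 5% baseline conversion rate, 10% relative uplift.
print(sample_size_conversion_rate(0.05, 0.10))
# Hypothetical inputs: standard deviation of 50, smallest difference of interest 2.
print(sample_size_session_value(50, 2))
```

For the hypothetical conversion-rate inputs this gives a little over 31,000 visitors per variant, which illustrates why small uplifts on low baseline rates need long tests.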

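
Slides 70 to 71 and 84 to 86 cover the p-value and confidence intervals for conversion rate. A minimal sketch of both, assuming a pooled two-proportion z-test and the simple normal-approximation (Wald) interval. The visitor and conversion counts are hypothetical and chosen to give a 10% uplift:

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Chance of a difference at least this extreme if A and B are equivalent."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def conversion_rate_interval(conversions: int, visitors: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


# Hypothetical test: A converts 500 of 10,000 visitors, B converts 550 of 10,000.
p = two_proportion_p_value(500, 10_000, 550, 10_000)
low_a, high_a = conversion_rate_interval(500, 10_000)
low_b, high_b = conversion_rate_interval(550, 10_000)
print(f"p-value: {p:.3f}")
print(f"A: {low_a:.2%} to {high_a:.2%}")
print(f"B: {low_b:.2%} to {high_b:.2%}")
```

In this hypothetical case the p-value is roughly 0.11 and the two intervals overlap, so a headline “10% uplift” alone says little about how much of that uplift would survive a rollout.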
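
The distinction on slide 82 can be written out explicitly. The notation here is an assumption: H_0 stands for the null hypothesis, and the second line is Bayes' rule, which shows that turning a p-value into the probability marketers usually want requires a prior P(H_0) that a significance test does not supply:

```latex
\begin{align*}
\text{p-value} &= P(\text{data at least this extreme} \mid H_0) \\
P(H_0 \mid \text{data}) &= \frac{P(\text{data} \mid H_0)\, P(H_0)}{P(\text{data})}
\end{align*}
```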