
Conversion Conference Berlin

A walkthrough of the most common statistical pitfalls that marketers encounter in CRO testing.


  1. @THCapper YOUR RESULTS ARE INVALID: STATISTICS FOR CRO
  2. A Good A/B Test Result: “10% Uplift, With 95% Significance”
  3. “10% Uplift, With 95% Significance” ● What does this mean? ● Is this correct?
  4. Easy Questions?
  5. Most Tools Encourage Mistakes
  6. Risk
  7. Marketer: “Roll it out!” Statistician (me): *sobs*
  8. “That’s Risky”
  9. “Advanced”
  10. You will learn today: ● The most common serious errors in A/B testing ● How to avoid them ● How to interpret your result ● Whether to roll it out
  11. How to Run an A/B Test
  12. 1. Test design 2. Results interpretation 3. Decision
  13. Jargon: Null Hypothesis
  14. Jargon: Null Hypothesis ● The hypothesis that your variant and original are functionally equivalent
  15. e.g. an A/A test: A vs. A
  16. Jargon: P-Value
  17. Jargon: P-Value ● The chance of a result at least this extreme if the null hypothesis is true ● E.g. 0.05 for 95% significance
  18. Jargon: Critical Value
  19. Jargon: Critical Value ● What you compare your p-value with when deciding whether to reject the null hypothesis
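A quick way to build intuition for the three definitions above is to simulate A/A tests: when the null hypothesis is true, p-values are uniformly distributed, so about 5% of A/A tests land below a 0.05 critical value purely by chance. A minimal sketch in Python (the 5% baseline conversion rate and sample sizes are assumed for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n, p = 10_000, 0.05  # visitors per arm; assumed baseline conversion rate
trials, false_positives = 1_000, 0

for _ in range(trials):
    a = rng.binomial(n, p)  # conversions in arm A
    b = rng.binomial(n, p)  # conversions in arm B: identical, so an A/A test
    _, p_value, _, _ = chi2_contingency([[a, n - a], [b, n - b]])
    if p_value < 0.05:  # p-value compared with the critical value
        false_positives += 1

print(f"A/A tests reaching 95% significance: {false_positives / trials:.1%}")  # ~5%
```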
  20. 1. Test Design
  21. How Many Tests?
  22. Multivariate Testing [diagram: variants A-F split across Landing Page and Product Pages]
  23. Multivariate Testing [diagram: Landing Page variants A, B; Product Page variants C, D]
  24. Multivariate Testing [diagram: pairing A/B with C/D]
  25. Multivariate Testing [diagram: A/B x C/D combinations, continued]
  26. Multivariate Testing [diagram: A/B x C/D combinations, continued]
  27. Multivariate Testing [diagram: A/B x C/D combinations, continued]
  28. Multivariate Testing. Landing Page: A: 5%, B: 7.5%
  29. Multivariate Testing. Product Page: C: 5%, D: 7.5%
  30. Multivariate Testing [diagram: A/B x C/D combinations]
  31. Multivariate Testing. Combinations: AC: 0%, BD: 5%, BC: 10%, AD: 10%
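The percentages on slides 28-31 are the point of this sequence: tested page by page, B (7.5%) and D (7.5%) each beat their counterparts, yet the combination BD converts at only 5%, while BC and AD reach 10%. Picking winners per page and combining them assumes there are no interactions between pages. A tiny sketch of that naive logic, using the slides' figures:

```python
# Conversion rates measured page by page (from slides 28-29)
landing = {"A": 0.05, "B": 0.075}
product = {"C": 0.05, "D": 0.075}

# Naive rule: combine the two individual winners
naive = max(landing, key=landing.get) + max(product, key=product.get)
print(naive)  # "BD"

# Combination rates actually measured (slide 31): interactions dominate
measured = {"AC": 0.00, "BD": 0.05, "BC": 0.10, "AD": 0.10}
print(max(measured, key=measured.get))  # "BC" (tied with "AD" at 10%)
```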
  32. A B
  33. “Constantly Iterate”
  34. Multiple Testing: A B C D E F
  35. False Positives [diagram: test says “Healthy” vs. “Ill”]
  36. False Positives, False Negatives [diagram, continued]
  37. False Positives, False Negatives [diagram, continued]
  38. Multiple Testing: 1 A/A test: 5% chance of achieving 95% significance.
  39. Multiple Testing: 1 A/A test: 5% chance
  40. Multiple Testing: 1 A/A test: 5% chance; 2 A/A tests: 9.75% chance
  41. Multiple Testing: 1 A/A test: 5% chance; 2 A/A tests: 9.75% chance; 3 A/A tests: 14.26% chance
  42. Multiple Testing: 1 A/A test: 5% chance; 2 A/A tests: 9.75% chance; 3 A/A tests: 14.26% chance; 4 A/A tests: 18.55% chance
  43. Multiple Testing: 1 A/A test: 5% chance; 2 A/A tests: 9.75% chance; 3 A/A tests: 14.26% chance; 4 A/A tests: 18.55% chance; n A/A tests: 1 - 0.95^n
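The probabilities on slides 38-43 all come from the closing formula: with n independent tests of true nulls, the chance of at least one false positive is 1 - 0.95^n. A one-liner reproduces the table:

```python
# Family-wise error rate: chance that at least one of n independent A/A tests
# reaches 95% significance by luck alone.
for n in (1, 2, 3, 4, 10):
    print(f"{n} A/A tests: {1 - 0.95 ** n:.2%}")
# 1: 5.00%, 2: 9.75%, 3: 14.26%, 4: 18.55%, 10: 40.13%
```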
  44. Multiple Testing Solutions: 1. Accept risk of false positives
  45. Multiple Testing Solutions: 1. Accept risk of false positives 2. Bonferroni correction
  46. Bonferroni Approximation. Standard: p-value vs. 0.05
  47. Bonferroni Approximation. Standard: p-value vs. 0.05; Approximation: p-value vs. 0.05/N
  48. Bonferroni Correction. Standard: p-value vs. 0.05; Bonferroni: p-value vs. 1 - (1 - 0.05)^(1/N)
  49. Multiple Testing Solutions: 1. Accept risk of false positives 2. Bonferroni correction 3. Holm-Bonferroni correction
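A terminology note: the exact threshold 1 - (1 - α)^(1/N) on slide 48 is usually called the Šidák correction; α/N is the simpler Bonferroni bound that approximates it. Both, plus the Holm-Bonferroni step-down procedure from slide 49, are easy to apply in code. A sketch using statsmodels, with invented p-values for illustration:

```python
from statsmodels.stats.multitest import multipletests

alpha, n_tests = 0.05, 4
print("Bonferroni threshold:", alpha / n_tests)               # 0.0125
print("Sidak threshold:", 1 - (1 - alpha) ** (1 / n_tests))   # ~0.0127

# Holm-Bonferroni: a step-down procedure over the whole family of p-values
p_values = [0.010, 0.020, 0.030, 0.400]  # hypothetical results of 4 tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=alpha, method="holm")
print(reject)      # which null hypotheses to reject at family-wise alpha 0.05
print(p_adjusted)  # Holm-adjusted p-values
```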
  50. Choosing the Right Metric
  51. Choosing the Right Metric: Conversion Rate vs. Average Session Value
  52. Choosing the Right Metric: Conversion Rate vs. Average Session Value. Profit?
  53. Stopping Rules
  54. Stopping Rules. Common: “When my test reaches significance.”
  55. “Significance so far” varies over time.
  56. Stopping Rules [chart: “significant so far?” flips between Y and N over the course of a test]
  57. Stopping Rules [chart, continued: a different run flips between Y and N in a different pattern]
  58. Stopping Rules [chart: “significance so far” over a sample of 20,000]
  59. [chart, continued: sample of 20,000]
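The flip-flopping charts above are why “stop when significant” fails: even an A/A test wanders in and out of significance, and checking repeatedly means you catch it on a lucky excursion. A simulation makes the inflation concrete; a minimal sketch assuming a 5% baseline rate and ten interim looks:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
batch, peeks, p_base = 2_000, 10, 0.05  # assumed traffic per look and baseline
trials, stopped_early = 500, 0

for _ in range(trials):
    a_conv = b_conv = visitors = 0
    for _ in range(peeks):
        a_conv += rng.binomial(batch, p_base)  # A/A test: same rate both arms
        b_conv += rng.binomial(batch, p_base)
        visitors += batch
        # Two-proportion z-test at this interim look
        pooled = (a_conv + b_conv) / (2 * visitors)
        se = (pooled * (1 - pooled) * 2 / visitors) ** 0.5
        z = (a_conv - b_conv) / visitors / se
        if abs(z) > norm.ppf(0.975):  # "significant so far": stop and ship
            stopped_early += 1
            break

print(f"A/A tests declared winners via peeking: {stopped_early / trials:.1%}")
# far above the nominal 5%
```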
  60. Exceptions: https://en.wikipedia.org/wiki/Sequential_probability_ratio_test
  61. Stopping Rules Solutions: 1. Sequential testing, e.g. Optimizely 2. Bayesian testing, e.g. VWO 3. Predetermined sample size
  62. evanmiller.org/ab-testing/sample-size.html
  63. Sample Size for Average Session Value Testing
  64. Standard Deviation: =STDEV(B:B) or =STDEV.S(B:B)
  65. powerandsamplesize.com/Calculators/
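As an alternative to the calculators above, the predetermined sample size can be computed directly. A sketch using statsmodels, assuming a 5% baseline conversion rate, a hoped-for 10% relative uplift, and (for the session-value case) made-up revenue figures:

```python
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Conversion-rate test: 5% baseline, aiming to detect 5% -> 5.5%
effect = proportion_effectsize(0.055, 0.05)
n = NormalIndPower().solve_power(effect, alpha=0.05, power=0.8)
print(f"~{n:,.0f} visitors per arm")

# Average-session-value test: effect size = expected uplift / standard
# deviation, which is why the slides pull stdev out of your revenue data
effect = 0.50 / 12.0  # assumed: EUR 0.50 uplift, EUR 12 standard deviation
n = TTestIndPower().solve_power(effect, alpha=0.05, power=0.8)
print(f"~{n:,.0f} sessions per arm")
```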
  66. Cutting Your Losses
  67. Test Design Recap: Contamination, Multiple Testing, Metric Choice, Stopping Rules
  68. 1. Test design 2. Results interpretation 3. Decision
  69. 2. Results Interpretation
  70. Interpreting the P-Value
  71. Interpreting the P-value: 1 test reaches 95% significance: 5% chance of data this extreme if variants functionally equivalent.
  72. 0
  73. Analogy
  74. Analogy. Question: How likely is it that my analytics or site are broken?
  75. Analogy. Question: How likely is it that my analytics or site are broken? Non-Answer: We only go a whole day with no conversions once every 2 months.
  76. Analytics is broken with probability 1 or 0.
  77. Interpreting the P-value. Question: How likely is it that this variation actually does nothing? Non-Answer: We’d only see a difference this big 5% of the time.
  78. Meanwhile in Industry. Tools: ● “Chance to beat baseline” ● “We are 95% certain that the changes in test ‘B’ will improve your conversion rate”
  79. Unanswered Questions
  80. Unanswered Questions. Question: How likely is it that the increase will be less than predicted?
  81. Unanswered Questions. Question: How likely is it that the increase will be negative?
  82. One Mistake: Probability of Outcome given Data vs. Probability of Data given Null
  83. Unanswered Questions. Question: How likely is it that these results are a fluke?
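Here Bayes' rule shows why “probability of data given null” is not “probability this is a fluke”: the answer depends on the base rate of genuinely winning variants. A sketch with assumed numbers (1 in 10 tested variants truly works, 80% power):

```python
prior_works = 0.10  # assumed: 1 in 10 variants genuinely improves things
power = 0.80        # chance a real improvement reaches significance
alpha = 0.05        # chance a do-nothing variant reaches significance anyway

# Among all tests that come back "significant", how many are flukes?
p_significant = prior_works * power + (1 - prior_works) * alpha
p_fluke = (1 - prior_works) * alpha / p_significant
print(f"P(fluke | significant) = {p_fluke:.0%}")  # 36%, not 5%
```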
  84. Confidence Intervals
  85. Confidence Interval of Conversion Rate
  86. Overlapping Confidence Intervals
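Confidence intervals for conversion rates are straightforward to compute; a sketch using statsmodels with invented counts, including the overlap check the slide alludes to:

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical results: conversions out of visitors, per arm
a_low, a_high = proportion_confint(500, 10_000, alpha=0.05, method="wilson")
b_low, b_high = proportion_confint(560, 10_000, alpha=0.05, method="wilson")
print(f"A: [{a_low:.4f}, {a_high:.4f}]")
print(f"B: [{b_low:.4f}, {b_high:.4f}]")

# Caveat: intervals can overlap somewhat while the difference between the
# arms is still significant, so test the difference directly rather than
# eyeballing the overlap.
print("Intervals overlap:", a_high > b_low)
```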
  87. Everything Else Still Applies
  88. Choosing the Right Metric
  89. evanmiller.org/ab-testing/t-test.html
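For average session value, the comparison behind the t-test calculator above can also be run directly; a sketch with made-up per-session revenue, using Welch's t-test since the two arms' variances need not match:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Hypothetical per-session revenue; most sessions buy nothing
control = rng.choice([0.0, 40.0], size=5_000, p=[0.95, 0.05])
variant = rng.choice([0.0, 40.0], size=5_000, p=[0.94, 0.06])

t_stat, p_value = ttest_ind(variant, control, equal_var=False)  # Welch
print(f"mean A: {control.mean():.2f}, mean B: {variant.mean():.2f}, p = {p_value:.3f}")
```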
  90. Results Interpretation Recap: Check Revenue, P-Value, Confidence Intervals
  91. 1. Test design 2. Results interpretation 3. Decision
  92. 3. Decision
  93. A Good A/B Test Result: “10% Uplift, With 95% Significance”
  94. But what about this? “10% Uplift, With 60% Significance”
  95. Jargon: P-Value ● The chance of a result at least this extreme if the null hypothesis is true ● E.g. 0.05 for 95% significance
  96. “10% Uplift, With 60% Significance” ● 40% chance of data at least this extreme if variation functionally identical
  97. “10% Uplift, With 60% Significance” ● 40% chance of data at least this extreme if variation functionally identical ● The variation is probably better than the baseline
  98. Drug Trials vs. Investment Banking
  99. Are You OK With False Positives?
  100. Data is Expensive
  101. Data is Expensive: ● Opportunity Cost ● Exploration vs. Exploitation
  102. Historical Comparisons are Invalid
  103. Hang on… Why Should I Care About Significance?
  104. 1. Ignoring Significance Doesn’t Allow You to Ignore Statistics
  105. 2. Risk Aversion
  106. Risk Factors: ● Agility ● Business attitudes ● What’s the worst that could happen?
  107. Decision Recap: Significant vs. Winning, Risk, Exploration vs. Exploitation
  108. Conclusion: 3 Takeaways
  109. 1. Think about significance and risk during test design
  110. 2. Remember your real KPI: Profit
  111. 3. You’re not testing medicines
  112. @THCapper Takeaways: 1. Think about significance and risk during test design 2. Remember your real KPI: Profit 3. You’re not testing medicines
