Published in:
Marketing


If you are squeezing out fractions of performance, it will take longer.

http://www.evanmiller.org/how-not-to-run-an-ab-test.html for the 26.1% number

- 1. #SMX #12C3 @AdriaK How to Avoid the First Two When Producing the Latter Lies, Damned Lies, and Search Marketing Statistics Adria Kyne Vistaprint
- 2. #SMX #12C3 @AdriaK Problems • Using samples that are too small • Using significance as a stopping point for a test Solutions • More rigor with fixed-sample tests • Using sequential sampling tests • Bayesian testing Bonus Pro Tip for achieving valid samples Today’s Topics
- 3. #SMX #12C3 @AdriaK 1) Make sure that we understand what actually happened 2) Be sure that we can use these results to predict the future What is the Whole Point of This Anyway?
- 4. #SMX #12C3 @AdriaK 1. We want to know whether the variation is better, worse, or the same as the original. 2. We don’t want to see a positive outcome that isn’t really there— a false positive or Type I error 3. We don’t want to miss a positive outcome—a Type II error. Basics of Hypothesis Testing
- 5. #SMX #12C3 @AdriaK Your product page has an average 2.0% CR. You make a bunch of tweaks to the design, and after 30,000 visits, your CR is 2.25%. You think you’re a genius, and so you tell your boss. Score! #1 A Common (Sad) Story
- 6. #SMX #12C3 @AdriaK At the end of the month, your revenue is no higher. You look bad. The change you saw was not “significant,” because your sample size wasn’t big enough. Yes, 30,000 visits was not enough. You spoke too soon.
- 7. #SMX #12C3 @AdriaK I gotta be cruel to be kind.
- 8. #SMX #12C3 @AdriaK The smaller the difference, the bigger the sample you’ll need: 2% - 3% is a 50% increase 2.0%-2.5% is a 25% increase 2.0% - 2.25% is a 12.5% increase For standard A/B hypothesis tests
- 9. #SMX #12C3 @AdriaK Decide on How Much Impact Your Change Should Have

  | Visits | CR | Orders | AOV | Revenue | Annual Increase |
  |---|---|---|---|---|---|
  | 20,000 | 2.00% | 400 | $50 | $20,000 | |
  | 20,000 | 2.25% | 450 | $50 | $22,500 | $30,000 |
  | 20,000 | 2.50% | 500 | $50 | $25,000 | $60,000 |

  How much of a difference do you want to be able to detect with your test?
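The arithmetic behind the Annual Increase column can be sketched as follows. This is a minimal sketch that assumes each row describes one month of traffic (the slide does not say so, but $2,500 of extra revenue times 12 matches the $30,000 shown); the function name and the `periods_per_year` parameter are illustrative.

```python
def annual_revenue_increase(visits, baseline_cr, new_cr, aov, periods_per_year=12):
    """Extra revenue per year from moving baseline_cr to new_cr.

    Assumes `visits` is the traffic for one period (here: one month).
    """
    baseline_revenue = visits * baseline_cr * aov
    new_revenue = visits * new_cr * aov
    return (new_revenue - baseline_revenue) * periods_per_year


# Reproduce the two non-baseline rows of the table.
for new_cr in (0.0225, 0.025):
    increase = annual_revenue_increase(20_000, 0.02, new_cr, 50)
    print(f"CR {new_cr:.2%}: annual increase ${round(increase):,}")
```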
- 10. #SMX #12C3 @AdriaK Pick a Sample Size Calculator: “power analysis for two independent proportions”
  - Power analysis: minimum sample size
  - Independent: we’re showing the variants to different visitors
  - Proportions: we’re comparing rates, which are proportions
- 11. #SMX #12C3 @AdriaK Is the variation higher or lower than the original? Use a “two-tailed test.” A 5% significance level is common; that is, there’s a 5% chance of a false positive. 80% statistical power is common; there is a 20% chance (1 in 5) that if there was an effect, we’d miss it. Calculator Options http://bit.ly/25zI5Rv P1 = your control CR, e.g. 0.02 for 2%. P2 = your likely test CR, e.g. 0.025 for 2.5%.
- 12. #SMX #12C3 @AdriaK The effect of using 0.05 and 80% is that we are 4 times more likely to get a false negative than a false positive. We’re more concerned about making things worse. We accept a higher chance that we won’t see a positive effect that is actually there. Consequences of Significance and Power Choices
- 13. #SMX #12C3 @AdriaK Those are arbitrary choices. We’re not testing pharmaceuticals. Are we really so terrified that we’ll roll out a page that isn’t an improvement? NOBODY IS GOING TO DIE
- 14. #SMX #12C3 @AdriaK Means that I love you. Baby.
- 15. #SMX #12C3 @AdriaK Necessary Sample Sizes: 1% change 3,826; 0.5% change 13,809; 0.25% change 52,238
- 16. #SMX #12C3 @AdriaK Detecting a 12.5% increase in Conversion Rate requires 52,238 visits for each sample
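For readers who want to see where a number like 52,238 comes from, here is a minimal sketch of a standard sample-size formula for comparing two independent proportions (two-tailed, normal approximation). It is an assumption that the linked calculator uses something close to this; the sketch lands within a few visits of the slide's figure rather than matching it exactly, and the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Approximate visits needed per variant to detect p1 vs p2.

    Uses n = (z_alpha/2 + z_beta)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2,
    a textbook normal-approximation formula (assumed, not taken from the deck).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed significance
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


# 2.0% control CR against progressively smaller improvements.
for p2 in (0.03, 0.025, 0.0225):
    print(f"2.0% -> {p2:.2%}: {sample_size_per_variant(0.02, p2):,} visits per variant")
```

The smaller the difference you want to detect, the faster the required sample grows: halving the difference roughly quadruples the visits needed.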
- 17. #SMX #12C3 @AdriaK Photo by Marilynn Windust https://ronmitchelladventure.com
- 18. #SMX #12C3 @AdriaK You’re hoping for a 0.25% uplift on a 2.0% average CR. The Control is getting 2.0% CR, and the Variant is getting 3.0% CR! #2 Another Common (Sad) Story “Why haven’t we switched to the test variant? It’s CLEARLY WINNING.”
- 19. #SMX #12C3 @AdriaK So you test the significance level. Success! The difference is significant. You roll out the new page, and... ...nothing happens And this is how things go awry
- 20. #SMX #12C3 @AdriaK A significance calculation assumes that the sample size was fixed in advance It assumes that you have a valid sample So when you ignore this and run until you get a “significant result,” you’re misusing the math Why didn’t it work?
- 21. #SMX #12C3 @AdriaK If you hit a period that happens to be performing well You may succumb to the temptation to stop while you’re ahead Repeated significance testing increases the rate of false positives Friends don’t let friends test significance prematurely Image: Public Domain, via Wikipedia
- 22. #SMX #12C3 @AdriaK Why repeated significance testing is a problem
- 23. #SMX #12C3 @AdriaK 5% significance means that even if there is no difference between the test and the control We’ll see an imaginary difference in the test 5% of the time Remember what significance means?
- 24. #SMX #12C3 @AdriaK Repeated Significance Testing is The Devil. Given: there is no actual difference between two test variants.

  One observation:

  | | Option 1 | Option 2 |
  |---|---|---|
  | 1st observation | Significant | No difference |
  | Likelihood | 5% chance | 95% chance |

  Two observations, deciding only at the end of the test:

  | | Option 1 | Option 2 | Option 3 | Option 4 |
  |---|---|---|---|---|
  | 1st observation | Significant | No difference | Significant | No difference |
  | 2nd observation | Significant | Significant | No difference | No difference |
  | End of Test | Significant | Significant | No difference | No difference |

  Likelihood: 5% chance of a significant result, 95% chance of no difference.

  Two observations, stopping as soon as a result is significant:

  | | Option 1 | Option 2 | Option 3 | Option 4 |
  |---|---|---|---|---|
  | 1st observation | Significant | No difference | Significant | No difference |
  | 2nd observation | - | Significant | - | No difference |
  | End of Test | Significant | Significant | Significant | No difference |

  Likelihood: 26% chance of a significant result, 74% chance of no difference.
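The inflation of false positives can be explored with a small A/A simulation, where both arms share the same true conversion rate so any "significant" result is a false positive. This is a rough sketch, not the calculation behind the slide's 26% figure: the pooled z-test, the 20% conversion rate, the number of looks, and all names are illustrative assumptions.

```python
import random
from math import sqrt


def is_significant(conv_a, conv_b, n, z_crit=1.96):
    """Two-sided pooled z-test for two proportions with n visits per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_b - conv_a) / n / se > z_crit


def simulate_aa_tests(experiments=300, looks=10, visits_per_look=200, cr=0.2, seed=1):
    """Return (rate when stopping at any significant look, rate at final look only)."""
    rng = random.Random(seed)
    stopped_early = significant_at_end = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        ever_significant = False
        for look in range(1, looks + 1):
            # Both arms draw from the SAME conversion rate: no real difference.
            conv_a += sum(rng.random() < cr for _ in range(visits_per_look))
            conv_b += sum(rng.random() < cr for _ in range(visits_per_look))
            now = is_significant(conv_a, conv_b, look * visits_per_look)
            ever_significant = ever_significant or now
        stopped_early += ever_significant
        significant_at_end += now
    return stopped_early / experiments, significant_at_end / experiments


peeking_rate, final_only_rate = simulate_aa_tests()
print(f"false positives when peeking: {peeking_rate:.1%}")
print(f"false positives at the planned end only: {final_only_rate:.1%}")
```

Because the final look is also one of the peeks, the peeking rate can never be lower than the final-only rate; the simulation shows how much higher it tends to be.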
- 25. #SMX #12C3 @AdriaK See the slippery slope in action!

  | | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | Day 6 | Day 7 |
  |---|---|---|---|---|---|---|---|
  | Control conversions | 150 | 313 | 448 | 636 | 750 | 922 | 1098 |
  | Control CR (2.00% expected) | 2.01% | 2.10% | 2.00% | 2.13% | 2.01% | 2.06% | 2.10% |
  | Variant conversions | 175 | 332 | 498 | 695 | 835 | 993 | 1174 |
  | Variant CR (2.25% expected) | 2.35% | 2.23% | 2.23% | 2.33% | 2.24% | 2.22% | 2.25% |
  | Visits/Variant | 7,460 | 14,920 | 22,380 | 29,840 | 37,300 | 44,760 | 52,220 |
  | Significant? | not | not | not | not | SIGNIFICANT | not | not |
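The daily significance checks on the slide can be replayed from its cumulative conversion counts. This sketch assumes a two-sided pooled z-test for two proportions at the 5% level; the deck does not say which test was used, but with these counts that test also flags only Day 5.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(conv_a, conv_b, visits_per_variant):
    """Pooled two-proportion z-test p-value (equal visits in both arms)."""
    n = visits_per_variant
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Cumulative conversions per day, copied from the slide.
control = [150, 313, 448, 636, 750, 922, 1098]
variant = [175, 332, 498, 695, 835, 993, 1174]
visits = [7_460 * day for day in range(1, 8)]

significant = [
    two_sided_p_value(c, v, n) < 0.05
    for c, v, n in zip(control, variant, visits)
]
for day, flag in enumerate(significant, start=1):
    print(f"Day {day}: {'SIGNIFICANT' if flag else 'not'}")
```

Stopping on Day 5 would have declared a winner that the full sample on Day 7 does not support.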
- 26. #SMX #12C3 @AdriaK Smart marketers PRE-COMMIT to a valid sample size And do not test for significance before they’ve collected it! Therefore:
- 27. #SMX #12C3 @AdriaK Because you have to be able to satisfy impatient observers But I neeeeeeed to test significance repeatedly!
- 28. #SMX #12C3 @AdriaK Solves the problem of repeated significance testing Allows you to stop the test early if the Variant is a winner Works with low conversion rates (under 10%) Sequential A/B Testing Image: http://geneticsandbeyond.blogspot.com/2014/08/the-puffinss-lair-sweat-of-hippos.html
- 29. #SMX #12C3 @AdriaK 1. Determine your sample size N (number of total conversions) 2. Measure the success of your Control and Variant groups 3. Check for stopping points: if Variant - Control reaches 2.25√N, the Variant wins; if Control - Variant reaches 2.25√N, the Control wins; if Variant + Control reaches N, there is no winner. Sequential experiment design http://bit.ly/1sSDz29
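The stopping rules on this slide can be sketched as a small check run after each batch of data. Variant and Control here are conversion counts, and each rule is treated as "reaches or exceeds" the threshold; the function name and return strings are illustrative, and N is assumed to come from a sequential sampling calculator.

```python
from math import sqrt


def sequential_decision(variant_conversions, control_conversions, n):
    """Apply the slide's three stopping rules to the current conversion counts."""
    threshold = 2.25 * sqrt(n)
    if variant_conversions - control_conversions >= threshold:
        return "variant wins"
    if control_conversions - variant_conversions >= threshold:
        return "control wins"
    if variant_conversions + control_conversions >= n:
        return "no winner"
    return "keep testing"


# With N = 400 total conversions, the lead needed to win is 2.25 * 20 = 45.
print(sequential_decision(100, 50, 400))   # lead of 50
print(sequential_decision(30, 20, 400))    # lead of 10, only 50 conversions so far
print(sequential_decision(205, 195, 400))  # N reached without a big enough lead
```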
- 30. #SMX #12C3 @AdriaK Sequential Sampling Calculator http://bit.ly/1TM1LKv
- 31. #SMX #12C3 @AdriaK Given a baseline conversion rate p Minimum detectable effect you want to see is d 1.5p + d < 36% When less than 36%, a sequential test will be shorter p = 2.0%, d = 12.5% (2.25% CR) 1.5p + d = 15.5% When to choose a fixed sample vs. a sequential test
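The rule of thumb above is simple enough to compute by hand, but a short sketch makes the units explicit: p is the baseline conversion rate and d is the relative lift you want to detect, both written as fractions. The function names are illustrative.

```python
def sequential_score(baseline_cr, relative_mde):
    """Compute 1.5p + d, with both inputs given as fractions (0.02 for 2%)."""
    return 1.5 * baseline_cr + relative_mde


def prefer_sequential(baseline_cr, relative_mde, cutoff=0.36):
    """True when the rule of thumb says a sequential test should be shorter."""
    return sequential_score(baseline_cr, relative_mde) < cutoff


# The slide's example: 2.0% baseline, 12.5% relative lift (2.25% CR).
print(f"{sequential_score(0.02, 0.125):.1%}", prefer_sequential(0.02, 0.125))
```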
- 32. #SMX #12C3 @AdriaK Variant CR = better than Control P-value = 0.18 (i.e. greater than our 0.05 significance level) When Good Math Leads to Bad Career Moves So how did the test go? Neither. We didn’t achieve significance. So which version won? We stopped this morning. So why did you stop it?! Just show it to another 10,000 visitors. We can’t do that. We have to accept that the test is over. This guy is not a team player. I am so screwed... Well, the null hypothesis... blah blah blah Blah blah p-value blah blah blah blah Image: 20th Century Fox via Amazon
- 33. #SMX #12C3 @AdriaK Communicating results is hard. So which one performs better? There is a 95% probability that the results we saw are not due to random chance! Why can’t this guy just answer a straight question? I hate my life. Image: 20th Century Fox via Amazon
- 34. #SMX #12C3 @AdriaK How to stop your test at any time and still make valid inferences!! Much easier to understand and explain the results!! Bayes’ Theorem Image via Wikipedia
- 35. #SMX #12C3 @AdriaK What’s the Difference?
  - Frequentist
    - Assumes that there is no difference, and finds the probability that chance alone could have produced the experimental results seen
    - Focuses on not getting Type I errors
    - Most people don’t understand what the results mean
  - Bayesian
    - Finds the probability that the test is better
    - More forgiving of Type I errors
    - Easier to understand and communicate to non-technical audiences
- 36. #SMX #12C3 @AdriaK Why Don’t Marketers Use Bayes’ Theorem? Calculus. This formula determines the probability that B will beat A in the long run. There’s a slightly different one if you have three test groups, etc.
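The slide's formula did not survive in this transcript, so the following is only a sketch of one well-known closed form for the probability that B beats A. It assumes each conversion rate has a Beta posterior built from a uniform Beta(1, 1) prior, which may differ from what the slide showed; the function names are illustrative.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def prob_b_beats_a(conv_a, visits_a, conv_b, visits_b):
    """P(CR_B > CR_A) under independent Beta posteriors with uniform priors."""
    alpha_a, beta_a = conv_a + 1, visits_a - conv_a + 1
    alpha_b, beta_b = conv_b + 1, visits_b - conv_b + 1
    total = 0.0
    for i in range(alpha_b):
        # Work in log space so large counts do not overflow.
        total += exp(
            log_beta(alpha_a + i, beta_a + beta_b)
            - log(beta_b + i)
            - log_beta(1 + i, beta_b)
            - log_beta(alpha_a, beta_a)
        )
    return total


# Hypothetical data: 100 conversions from 1,000 visits vs 130 from 1,000.
print(f"P(B beats A) = {prob_b_beats_a(100, 1_000, 130, 1_000):.1%}")
```

The output is a direct answer to "how likely is it that the variant is better?", which is the figure the Bayesian calculators later in the deck report.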
- 37. #SMX #12C3 @AdriaK Online calculators are your friends! But Wait!
- 38. #SMX #12C3 @AdriaK Wins and losses data Graph • Probability distributions Table • Probability of being best • Spread of conversion rates Cool online Bayesian calculator http://bit.ly/24mKJaY
- 39. #SMX #12C3 @AdriaK 1. Decide on the probability you’re comfortable with 2. Decide how much variance you’re willing to accept How to use this calculator
- 40. #SMX #12C3 @AdriaK 96% probability that B is better But what’s the real CR? Needs more data High spread, less overlap
- 41. #SMX #12C3 @AdriaK Not very much CR variance But B is only 70% likely to be better Low spread, high overlap
- 42. #SMX #12C3 @AdriaK Variance of CR isn’t as bad. Separation of peaks means that the CRs are different. 94% probability that B is better. We aren’t certain about the actual CR. Less spread, less overlap. Sample size is only 100 conversions each!
- 43. #SMX #12C3 @AdriaK You might actually see
- 44. #SMX #12C3 @AdriaK Allows you to start the test with some assumptions, called “priors” Can include: • the prior success probability (our belief about the average conversion rate) • How much variance you expect Bayesian’s interesting twist
- 45. #SMX #12C3 @AdriaK 1. Set your “priors” 2. Input your test data 3. Get back the probability that the test variant performs better Different cool Bayesian calculator http://bit.ly/1Wzrtro
- 46. #SMX #12C3 @AdriaK Actual, Understandable Results
- 47. #SMX #12C3 @AdriaK You can make inferences from low traffic and low conversions When someone says "What's the probability that the new page outperforms the old one?", you can give them an answer! Advantage of Bayesian results
- 48. #SMX #12C3 @AdriaK 1. You know how not to run a fixed sample test 2. You know you can run a sequential sample test when you need ongoing information about the results 3. You know how to run a Bayesian test, where you can keep checking your progress AND explain the results easily So now what?
- 49. #SMX #12C3 @AdriaK Are you trying to detect a big difference, or a small difference? Use the formula 1.5p + d big difference - use a normal fixed sample test (>36%) small difference - use a sequential test (< 36%) Do the people you report to get confused or unhappy when you try to explain significance and p-values to them? Run a Bayesian test Review: How to Design your Experiment
- 50. #SMX #12C3 @AdriaK So That’s It, Then?
  - Tests using significance: 1. Use a sample calculator 2. Run the test for the specified sample 3. Profit!
  - Bayesian test: 1. Decide how solid you want your probability estimate to be 2. Run the test and update the data 3. Profit!
- 51. #SMX #12C3 @AdriaK I’m all about the tough love.
- 52. #SMX #12C3 @AdriaK We are not measuring consistent user groups • Time of day • Day of week • Seasonality • Sales The Problem of Illusory Lift
- 53. #SMX #12C3 @AdriaK Run your tests long enough to cover at least one entire traffic/conversion cycle Monday-Sunday or equivalent full week Account for business cycles
- 54. #SMX #12C3 @AdriaK Daily differences in performance
- 55. #SMX #12C3 @AdriaK Don’t run your test too long Visitors delete their cookies and will pollute your samples Account for user behavior
- 56. #SMX #12C3 @AdriaK It’s probably more than you think. Nearly 40 percent of Internet users delete cookies from their primary computers on at least a monthly basis (JupiterResearch 2005). 53 percent delete cookies, cache or browsing history to help protect their privacy online (TRUSTe/National Cyber Security Alliance U.S. Consumer Privacy Index, January 2016).
- 57. #SMX #12C3 @AdriaK • Pre-commit to a sample size/experimental design • Fixed Sample A/B testing – no peeking before it’s done • Sequential A/B testing – built-in peeking • Bayesian – easier to understand the results • Collect samples for a full business cycle, but not too long Summary
- 58. #SMX #12C3 @AdriaK Calculators I used
  - Fixed sample calculator: Stats Dept., U of British Columbia http://bit.ly/25zI5Rv
  - Sequential sampling calculator: Evan Miller http://bit.ly/1TM1LKv
  - Simple Bayesian calculator: Peak Conversion http://bit.ly/24mKJaY
  - Bayesian calculator with priors: Lyst http://bit.ly/1Wzrtro
- 59. #SMX #12C3 @AdriaK LEARN MORE: UPCOMING @SMX EVENTS THANK YOU! SEE YOU AT THE NEXT #SMX
