Test for business growth - analyzing A/B-test with a Bayesian approach

Online Dialogue
May. 10, 2016

1. SUBTITLE BELOW
2. @AM_Klaassen A bit about me…
3. @AM_Klaassen
4. What I do…
5. @AM_Klaassen My lovely colleagues
6. @AM_Klaassen Conversion rate optimization Analytics Psychology
7. @AM_Klaassen
8. Lots of A/B-tests
9. @AM_Klaassen Adding direct value Learning user behavior
10. The challenge of a successful A/B-test program
11. You need at least 1 winner in 2 weeks
12. Low energy in test team
13. Low visibility in the organization
14. A/B-test program dies
15. We have 1 in 4 significant winners
16. Most only 1 in 8
17. So you need 4 tests per week
18. You need high traffic volumes
19. So, we need more winners!
20. 2 Solutions
21. 1. Improve your test program
22. @AM_Klaassen Improve your implementation rate
23. @AM_Klaassen Improve your implementation rate
24. @AM_Klaassen Improve your implementation rate
25. 2. Redefine your winners
26. Challenges of Frequentist statistics
27. @AM_Klaassen Challenge 1: Hard to Understand In a frequentist test you state a null hypothesis: H0 = Variation A and B have the same conversion rate
28. @AM_Klaassen Say for example you did an experiment and the p-value of that test was 0.01. Challenge 1: Hard to Understand http://onlinedialogue.com/abtest-visualization-excel/
29. @AM_Klaassen Which statement about the p-value (p=0.01) is true? a) You have absolutely disproved the null hypothesis: that is, there is no difference between the variations b) There is a 1% chance of observing a difference as large as you observed even if the two means are identical c) There’s a 99% chance that B is better than A Challenge 1: Hard to Understand
30. @AM_Klaassen Which statement about the p-value is true? a) You have absolutely disproved the null hypothesis: that is, there is no difference between the variations b) There is a 1% chance of observing a difference as large as you observed even if the two means are identical c) There’s a 99% chance that B is better than A Challenge 1: Hard to Understand
31. @AM_Klaassen Which statement about the p-value is true? a) You have absolutely disproved the null hypothesis: that is, there is no difference between the variations b) There is a 1% chance of observing a difference as large as you observed even if the two means are identical c) There’s a 99% chance that B is better than A Challenge 1: Hard to Understand
32. @AM_Klaassen So the p-value only tells you: How unlikely is it that you found this result, given that the null hypothesis is true (that there is no difference between the conversion rates) Challenge 1: Hard to Understand
33. @AM_Klaassen Confused HiPPO
34. @AM_Klaassen Challenge 2: Focus on finding proof
35. @AM_Klaassen Challenge 2: Focus on finding proof
36. @AM_Klaassen Challenge 2: Focus on finding proof
37. @AM_Klaassen What’s the alternative? Frequentist statistics Bayesian statistics
38. Advantages of Bayesian statistics
39. @AM_Klaassen 1. No statistical terminology involved 2. Answers the question directly: ‘what is the probability that variation B is better than A’ Advantage 1: Easy to understand
40. @AM_Klaassen Advantage 1: Easy to understand
41. @AM_Klaassen Remember…?
42. @AM_Klaassen Advantage 1: Easy to understand
43. Happy HiPPO
44. @AM_Klaassen A test result is the probability that B outperforms A: ranging from 0% - 100% Adv 2: Focus on risk assessment
45. @AM_Klaassen Adv 2: Focus on risk assessment 11% 89% Download PDF: ondi.me/change
46. Depends on the cost
47. @AM_Klaassen Take the cost into account
48. @AM_Klaassen Take the cost into account
49. @AM_Klaassen Ondi.me/bayes/
50. @AM_Klaassen Make a risk assessment IMPLEMENT B PROBABILITY Expected risk 10.9% Expected uplift 89.1% Contribution
51. @AM_Klaassen Make a risk assessment
52. @AM_Klaassen Make a risk assessment IMPLEMENT B PROBABILITY AVERAGE DROP/UPLIFT Expected risk 10.9% -1.85% Expected uplift 89.1% 5.92% Contribution
53. @AM_Klaassen Make a risk assessment IMPLEMENT B PROBABILITY AVERAGE DROP/UPLIFT * EFFECT ON REVENUE Expected risk 10.9% -1.85% - \$ 115,220 Expected uplift 89.1% 5.92% \$ 370,700 Contribution \$ 317,936 * Based on 6 months and an average order value of € 100
54. @AM_Klaassen Calculate the ROI IMPLEMENT B BUSINESS CASE Contribution \$ 317,936
55. @AM_Klaassen Calculate the ROI IMPLEMENT B BUSINESS CASE Contribution \$ 317,936 Margin (20%) \$ 63,587 Cost of implementation \$ 15,000
56. @AM_Klaassen Calculate the ROI IMPLEMENT B BUSINESS CASE Contribution \$ 317,936 Margin (20%) \$ 63,587 Cost of implementation \$ 15,000 ROI 424%
57. @AM_Klaassen Or the payback period IMPLEMENT B BUSINESS CASE Average CR change 5.00%
58. @AM_Klaassen Or the payback period IMPLEMENT B BUSINESS CASE Average CR change 5.00% Extra margin per week \$ 2,400 Cost of implementation \$ 15,000
59. @AM_Klaassen Or the payback period IMPLEMENT B BUSINESS CASE Average CR change 5.00% Extra margin per week \$ 2,400 Cost of implementation \$ 15,000 Payback period 6.25 weeks
60. We still need the scientist
61. @AM_Klaassen Adding direct value Learning user behavior
62. @AM_Klaassen The cut-off probability for implementation is not the same as the cut-off probability for a learning CHANCE LEARNING? < 70 % No learning 70 – 85 % Indication – need retest to confirm 85 – 95 % Strong indication – need follow-up test to confirm > 95 % Learning We still need the scientist!
63. Comparison
64. @AM_Klaassen Comparison both methods • 50 A/B-tests, • 50,000 visitors per variation, • conversion rate of 2%, • average order value of \$100, • minimum contribution of \$150,000 in 6 months time • (equivalent to \$30,000 extra margin : ROI of 200%)
65. @AM_Klaassen Comparison both methods
66. @AM_Klaassen Comparison both methods
67. @AM_Klaassen Comparison both methods FREQUENTIST BAYESIAN Implementations 10 29
68. @AM_Klaassen Comparison both methods FREQUENTIST BAYESIAN Implementations 10 29 Expected uplift \$ 4,682,600 \$11,068,800 Expected risk \$ 234,130 \$ 2,984,800
69. @AM_Klaassen Comparison both methods FREQUENTIST BAYESIAN Implementations 10 29 Expected uplift \$ 4,682,600 \$11,068,800 Expected risk \$ 234,130 \$ 2,984,800 Risk % 5% 27% Contribution \$4,448,470 \$9,757,489 Margin (20%) \$ 889,974 \$1,951,498
70. @AM_Klaassen Maximize margin \$
71. Higher implementation rate
72. Maximize revenue and margin
73. Happy HiPPO
74. Higher energy in test team
75. Higher visibility in the organization
76. Successful A/B-test program
77. THANK YOU! Download PDF: ondi.me/change Bayesian calculator: ondi.me/bayes Slide deck: ondi.me/annemarie @AM_Klaassen annemarie@onlinedialogue.com nl.linkedin.com/in/amklaassen

Editor's Notes

1. First of all, thank you so much for having me over! I have already spent a few days in Jamaica and it is truly amazing!
2. So, a bit about me. I flew in on Saturday from the Netherlands, so I live in this tiny country in Europe. And I can tell you the sun doesn’t shine as often or as strongly there as in Jamaica. I already got a pretty nasty sunburn. I have 8 years of web analytics experience and I’m always looking to find real insights from user data.
3. I may not look the part, but I’m basically a nerd. Every A/B-test I have ever done has been analyzed using Excel. I have built my own test evaluation tools based on my statistical knowledge from university. You will get to see some screenshots of these Excel tools I’ve built in this presentation. And one of these tools has now been released as a webtool as well. I’ll come back to this later.
4. And the other thing about me: I love to travel and explore the world, especially to islands. So this conference with fellow nerds on a tropical island is just perfect!
5. So what is it that I do on a daily basis?
6. I work at Online Dialogue: a data-driven conversion optimization agency in the Netherlands. Some of you may be familiar with my crazy bosses Ton and Bart. We have a mixed group of analysts, psychologists, UX designers, developers and project leads.
7. We combine data insights with psychological insights for evidence-based growth.
8. With every client we use this framework for conversion optimization. First we look at the data and determine the pages with the highest test power: which pages have enough visitors and conversions to be able to test on? Then we look at the paths visitors take on the website to make a booking or place an order. So, what are the main online customer journeys? And where are the biggest leaks in this process? These data findings are then sent to the psychologist. He or she combines this data with scientific research to come up with hypotheses to test. These hypotheses are then briefed to the designer, who will come up with test variations. These variations are then tested in several A/B-tests (since you cannot prove a hypothesis based on one experiment). The learnings of these A/B-tests are then combined into overall learnings, which can then be shared with the rest of the organization.
9. In order to run this program, we run lots and lots of A/B-tests.
10. The purpose of those A/B-tests is to add direct value in the short term: you want to increase the revenue that is coming into the website. And in the long run you want to really learn from user behavior. What is it that triggers the visitors on that particular website? And how can we use those insights to come up with even better optimization efforts on other parts of the site?
11. This of course sounds terrific, but in practice we see a lot of A/B-test programs cancelled. There is a real challenge in keeping such a program alive. If not everyone in your organisation believes in A/B-testing you will have a hard time proving its worth.
12. In order to have a successful A/B-test program we believe you need at least 1 winner in 2 weeks, so that every other week the site changes for the better.
13. Otherwise you will drain the energy out of your test team. Test team members put a lot of time and energy into finding the insights, developing test variations and analyzing them. If these efforts aren’t rewarded, their energy drops.
14. Another, more important consequence is that you will have lower visibility in the organization. If you only deliver a winning variation once a month or less, you will not be perceived as very important to the business. So you will be deprioritized.
15. So if the energy of the test team drops and visibility in the organization is low, your A/B-test program will die!
16. On average we have a success rate of 25% with our clients.
17. In the market this success rate is even lower. Some companies only have like 1 in 8 winners.
18. So in order to get to 1 implementation within 2 weeks you need at least 4 tests per week.
19. But you don’t just run 4 tests a week. You need a lot of resources for this AND you need high traffic volumes, which you don’t always have.
20. So if you cannot run 4 tests a week, you need a higher implementation rate to get the 1 winner in 2 weeks.
21. There are 2 solutions for this:
22. This can be achieved by a couple of things: you can get more data insights before you run the test. So you improve your conversion study (refer to Peep’s talk).
23. And you can start using consumer psychology and scientific research to combine customer journey analysis with scientific insights for better hypotheses.
24. And in the test phase you could: - Test bolder changes. Bolder changes normally mean you are more likely to change visitor behaviour. - And/or run the test on more traffic, to be able to recognize lower uplifts.
25. And the other solution is to look at the statistics you are using to determine winners. Because you can redefine what is perceived as a winner. Should you really not implement a non-significant test variation?
26. There are a couple of challenges with the traditional t-test, which is most commonly used.
27. First of all, it’s really hard to understand what a test result actually tells you. When you use a t-test (which we all have been using) you state a null hypothesis. You may recall this from your statistics classes. You calculate the p-value and decide whether or not to reject the null hypothesis. So you try to reject the hypothesis that the conversion rates are the same.
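To make the procedure above concrete, here is a minimal sketch of a frequentist test on two conversion rates. The talk speaks of a t-test; this sketch swaps in a pooled two-proportion z-test, a common large-sample stand-in, so it is not necessarily what the Excel tool computes. The function name and the visitor and conversion counts are made up for illustration.

```python
import math


def two_sided_p_value(visitors_a: int, conversions_a: int,
                      visitors_b: int, conversions_b: int) -> float:
    """Two-sided p-value for H0: A and B have the same conversion rate.

    Pooled two-proportion z-test (normal approximation), used here as an
    assumed stand-in for the t-test mentioned in the talk.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Under H0 both variations share one rate, estimated from the pooled data.
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical data: 50,000 visitors per variation.
p = two_sided_p_value(50_000, 1_000, 50_000, 1_100)
print(f"p-value: {p:.4f}")
print("reject H0 at 5%" if p < 0.05 else "do not reject H0 at 5%")
```

Note what the number means: it is the chance of seeing a difference at least this large if the two rates were in fact equal, not the chance that B is better than A.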
28. So, suppose you did an experiment and the p-value of that test was 0.01. You measured an uplift of 9.58%, and the graph indicates you have a winner.
29. It is the second one, but you have to read it more than once just to get a grip on what it says. You will have a hard time explaining this to your team and higher up the organization.
30. What you actually want the result to tell you is what the chance is that B is better than A, but that is not what the p-value tells you.
31. Try explaining that to your manager. The HiPPO in your organization or higher management won’t understand what the heck you are talking about. They just want to know if they should implement the variation in order to make more money.
32. The second challenge is that with a t-test an A/B-test can only have 2 outcomes: winner or no winner. And the focus is on finding those real winners. You want to take as little risk as possible. This stems from the fact that t-tests have been used in a lot of medical research as well. Of course you don’t want to bring a medicine to the market if you’re not 100% sure that it won’t make people worse or kill them. But businesses aren’t run this way. You need to take some risk to grow your business.
33. When you look at this test result you will conclude that the experiment was a success. The p-value is very low, so it needs to be implemented. And you can expect an uplift in conversion rate of well over 8% after implementation.
34. But based on this test result you would conclude that there is no winner, that it mustn’t be implemented and that the measured uplift in conversion rate wasn’t enough. So you will see this as a loser and move on to another test idea. However, there seems to be a positive movement (the measured uplift is 5%), but it isn’t big enough to be recognized as a significant winner. You probably only need a few more conversions.
35. So what’s the alternative then? The most common approach to analysing A/B-tests is the t-test (which is based on frequentist statistics). But, over the last couple of years more and more software packages (like VWO) are switching to Bayesian statistics.
36. So what are the advantages of using Bayesian statistics instead?
37. First, there is no statistical terminology involved. There’s no null hypothesis, no p-value and no false positives. You don’t have to explain that even if you have a winning variation there’s still a chance it won’t make you money (the false positive rate). When you express the result as a chance, you automatically see there’s also a chance it won’t be a winner. Second, it shows you the probability that B is better than A. Probability is very easy to explain. Everyone understands this.
38. So this is the Excel visualization of a Bayesian A/B-test result I developed. You see the number of users and conversions per variation, the conversion rates, the measured uplift and the chance that B is better than A. Easy, right? In the graph you see the chances of the expected uplift after implementation. These lie in a range. The more traffic and conversions, the more certain you are of the actual uplift.
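One common way to arrive at a “chance that B is better than A” is to sample from the posterior distribution of each conversion rate and count how often B comes out ahead. The sketch below does that with a uniform Beta(1, 1) prior and Monte Carlo draws. The prior, the function name and the example counts are assumptions for illustration; the talk does not say how the Excel tool or the webtool computes its figure.

```python
import random


def probability_b_beats_a(visitors_a: int, conversions_a: int,
                          visitors_b: int, conversions_b: int,
                          draws: int = 20_000, seed: int = 0) -> float:
    """Estimate P(rate_B > rate_A) by sampling Beta posteriors.

    Assumes a uniform Beta(1, 1) prior on each conversion rate, so the
    posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conversions_a, 1 + visitors_a - conversions_a)
        rate_b = rng.betavariate(1 + conversions_b, 1 + visitors_b - conversions_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


# Hypothetical data: 50,000 visitors per variation.
chance = probability_b_beats_a(50_000, 1_000, 50_000, 1_050)
print(f"Chance that B is better than A: {chance:.1%}")
```

The same draws also give the range of plausible uplifts: the spread of `rate_b / rate_a - 1` across the samples narrows as traffic and conversions grow.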
39. But these numbers are the same as in our previous example. With a t-test you would conclude that there is no winner and you need to move on to another test idea.
40. So you have an 89.1% chance that B will outperform A. I think every manager will take that chance and implement the variation. And he or she will understand this result.
41. So you get a happy Hippo!
42. The second advantage of using a Bayesian A/B-test evaluation method is that it doesn’t have a binary outcome like the t-test does. A test result won’t tell you winner / no winner, but gives a probability between 0% and 100% that the variation performs better than the original.
43. Instead of focusing on trying to find absolute truths, you can do a risk assessment. Does the chance of an uplift outweigh the chance of losing money when you implement the variation?
44. That of course depends on the cost of implementing the variation.
45. In this example you have an 89.1% chance the variation is better than the original. So in 89.1% of the samples the difference in conversion rate between B and A is higher than 0. But in order to earn back the costs of implementation you want to know the chance it will earn you at least x revenue.
46. So how big should the increase in revenue be to justify implementation? Say for example that a test implementation costs 15,000 and your margin is 20%, so you need at least 75,000 extra revenue. So what is the chance that the implementation of the variation will earn you that amount within 6 months? In this example this is still an 82% chance. So you will probably implement it. We calculated the expected revenue over 6 months’ time (this is a ballpark for how long an A/B-test result has an effect on the business: the environment changes, and some effects will last longer, some shorter).
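The break-even figure above is the implementation cost divided by the margin rate. A small sketch of that step, plus a hypothetical helper that turns a list of simulated extra-revenue outcomes into a “chance of at least x revenue”, could look like this. Both function names and the sample revenue values are illustrative assumptions, not output from the webtool.

```python
from typing import Iterable


def break_even_revenue(implementation_cost: float, margin_rate: float) -> float:
    """Extra revenue needed so that the margin on it covers the implementation cost."""
    if not 0 < margin_rate <= 1:
        raise ValueError("margin_rate must be in (0, 1]")
    return implementation_cost / margin_rate


def chance_of_reaching(extra_revenue_draws: Iterable[float], target_revenue: float) -> float:
    """Share of simulated outcomes whose extra revenue meets the target."""
    draws = list(extra_revenue_draws)
    if not draws:
        raise ValueError("need at least one draw")
    return sum(revenue >= target_revenue for revenue in draws) / len(draws)


# Figures from the talk: cost of 15,000 and a 20% margin.
target = break_even_revenue(15_000, 0.20)
print(f"Extra revenue needed: {target:,.0f}")

# Made-up simulated 6-month revenue outcomes, for illustration only.
simulated = [50_000, 80_000, 120_000, 90_000]
print(f"Chance of reaching the target: {chance_of_reaching(simulated, target):.0%}")
```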
47. So this is the webtool I talked about earlier to make these Bayesian calculations yourself. Just input your test and business case data and calculate! The first graph will show you the main test result + also the chance of at least x revenue.
48. When you know the cost of testing (the cost to implement the variation) you can also calculate the ROI of the test.
49. The average drop of the red bars is -1.85%. The average uplift of the green bars is 5.92%.
50. So you have a 10.9% chance of a drop in conversion rate of 1.85%. And you have an 89.1% chance of an uplift in conversion rate of 5.92%.
51. In money terms this translates to a drop in revenue of 115,220 dollars or an increase in revenue of 370,700 dollars. Multiply 10.9% by the drop in revenue, add 89.1% times the uplift in revenue, and you have the contribution of this test.
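The contribution described above is a probability-weighted sum of the two revenue outcomes. A minimal sketch, with a hypothetical function name, is shown below. With the rounded percentages from the slide it lands at roughly 317.7 thousand dollars, slightly under the 317,936 shown on the slide; the gap is presumably because the tool works with unrounded probabilities, which is an assumption on my part.

```python
def expected_contribution(p_drop: float, revenue_drop: float,
                          p_uplift: float, revenue_uplift: float) -> float:
    """Probability-weighted revenue effect of implementing the variation.

    `revenue_drop` is passed as a negative number.
    """
    return p_drop * revenue_drop + p_uplift * revenue_uplift


# Rounded figures from the slide: 10.9% risk of -115,220 and 89.1% chance of +370,700.
contribution = expected_contribution(0.109, -115_220, 0.891, 370_700)
print(f"Contribution: {contribution:,.0f}")
```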
52. With this contribution you calculate the ROI
53. So if you have a margin of 20% this means this test will earn you over 63 thousand dollars. The cost of implementation is 15 thousand dollars.
54. This then results in an ROI of 424%. Pretty positive, right?
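The ROI on the slide follows from two steps: take the margin on the contribution, then divide by the cost of implementation. The sketch below reproduces that with the slide’s numbers. Note that it defines ROI as margin divided by cost, because that is what matches the 424% on the slide; other definitions subtract the cost first. The function name is hypothetical.

```python
def roi_from_contribution(contribution: float, margin_rate: float,
                          implementation_cost: float) -> float:
    """ROI as margin earned divided by implementation cost (the slide's apparent definition)."""
    if implementation_cost <= 0:
        raise ValueError("implementation_cost must be positive")
    margin = contribution * margin_rate
    return margin / implementation_cost


# Figures from the slide: contribution 317,936, margin 20%, cost 15,000.
roi = roi_from_contribution(317_936, 0.20, 15_000)
print(f"ROI: {roi:.0%}")
```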
55. You could also look at the payback period. So you don’t look at the 6 months revenue, but calculate how long it would take to earn back the investment. In this example you have measured a change in conversion rate of 5%
56. A 5% uplift means an extra margin of 24 hundred dollars each week. Set that against the 15 thousand dollar investment.
57. It will take 6 and a quarter weeks to earn back this investment. You can set certain cut-off values as to when to implement. If it takes longer than 3 months to earn it back, then don’t implement.
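The payback period is simply the investment divided by the extra margin per week. The sketch below adds the cut-off rule from the notes; treating 3 months as 13 weeks is my assumption, as are the function names.

```python
def payback_weeks(implementation_cost: float, extra_margin_per_week: float) -> float:
    """Weeks needed for the extra weekly margin to cover the implementation cost."""
    if extra_margin_per_week <= 0:
        raise ValueError("extra_margin_per_week must be positive")
    return implementation_cost / extra_margin_per_week


def should_implement(weeks: float, max_weeks: float = 13) -> bool:
    """Cut-off rule: implement only if the payback period is within `max_weeks`.

    13 weeks is an assumed stand-in for the "3 months" mentioned in the notes.
    """
    return weeks <= max_weeks


# Figures from the slide: cost 15,000 and 2,400 extra margin per week.
weeks = payback_weeks(15_000, 2_400)
print(f"Payback period: {weeks} weeks, implement: {should_implement(weeks)}")
```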
58. However, there is a word of caution to this. Because, we still need the scientist!
59. As you might recall we test to add direct value, but we still want to learn about user behavior
60. We take these numbers as a ballpark. If the test has a probability lower than 70% we won’t see it as a learning. If the percentage lies between 70 and 85% we see it as an indication something is there, but we need a retest to confirm the learning. Anything between 85 and 95% is a very strong indication. So we would do follow-up tests on other parts of the website to see if it works there too. And the same as with a t-test: when the chance is higher than 95% we see it as a real learning. So even though you would implement the previous test, it doesn’t prove the stated hypothesis. It shows a strong indication, but to be sure the hypothesis is true you need follow-up tests to confirm this learning.
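These ballpark bands can be written down as a small lookup. The sketch below uses the labels from the slide; how the exact boundary values (70, 85 and 95) are assigned is my assumption, since the slide does not say, and the function name is hypothetical.

```python
def classify_learning(probability_pct: float) -> str:
    """Map the chance that B beats A (in percent) to a learning category.

    Boundary handling is an assumption: 70 and 85 fall in the higher band,
    and exactly 95 still counts as a strong indication.
    """
    if not 0 <= probability_pct <= 100:
        raise ValueError("probability_pct must be between 0 and 100")
    if probability_pct < 70:
        return "No learning"
    if probability_pct < 85:
        return "Indication, need retest to confirm"
    if probability_pct <= 95:
        return "Strong indication, need follow-up test to confirm"
    return "Learning"


# The example test from the talk has an 89.1% chance that B is better.
print(classify_learning(89.1))
```

So the 89.1% test would be implemented on business grounds, yet it only counts as a strong indication for the underlying hypothesis.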
61. So what does this mean in terms of revenue over time if you compare the two methods?
62. So I looked at 50 example test results and whether each would be implemented based on a t-test and based on a Bayesian risk assessment.
63. So these are the numbers of the first 10 tests with the Bayesian probability and the significance indication based on a t-test.
64. I also calculated the revenue each test would add to the business in 6 months, based on a Bayesian and a frequentist approach.
65. Based on 50 tests, 1 in 5 was a significant winner if you use frequentist statistics, so you will implement 10 test variations over time. When you use a Bayesian test evaluation, the number of tests that are implemented rises to 29. That is a whopping 58%.
66. As you see, the expected uplift of using Bayesian is way higher than using frequentist statistics. But the risk is also higher.
67. With frequentist statistics you have a 5% risk (the false positive rate), with this Bayesian example 27%. But because the uplift is way higher as well, you end up with a higher contribution for the Bayesian approach in the end.
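The percentages in this comparison can be checked against the slide figures. The sketch below computes the implementation rate as implemented tests divided by tests run, and the risk percentage as expected risk divided by expected uplift. That second definition is inferred from the numbers on the slide (it reproduces the 5% and 27%), not stated in the talk, and the function name is hypothetical.

```python
def share(part: float, whole: float) -> float:
    """Return part / whole as a fraction."""
    if whole == 0:
        raise ValueError("whole must be non-zero")
    return part / whole


TESTS_RUN = 50

# Slide figures per method: implementations, expected uplift, expected risk.
methods = {
    "Frequentist": (10, 4_682_600, 234_130),
    "Bayesian": (29, 11_068_800, 2_984_800),
}

for name, (implemented, uplift, risk) in methods.items():
    print(f"{name}: implementation rate {share(implemented, TESTS_RUN):.0%}, "
          f"risk {share(risk, uplift):.0%}")
```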
68. When you put these numbers in a graph you will see this: the Bayesian approach will increase your implementation rate and earn you way more money.
69. So you get a happy Hippo!