@AM_Klaassen
Challenge 1: Hard to Understand
In a frequentist test you state a null hypothesis:
H0: Variation A and B have the same conversion rate
Say, for example, you did an experiment and the p-value of that test was 0.01.
http://onlinedialogue.com/abtest-visualization-excel/
Which statement about the p-value (p = 0.01) is true?
a) You have absolutely disproved the null hypothesis: that is, there is no difference between the variations
b) There is a 1% chance of observing a difference as large as you observed even if the two means are identical
c) There’s a 99% chance that B is better than A
So the p-value only tells you how unlikely it is that you found this result, given that the null hypothesis is true (that there is no difference between the conversion rates).
Advantage 1: Easy to understand
1. No statistical terminology involved
2. Answers the question directly: ‘what is the probability that variation B is better than A’
Make a risk assessment

IMPLEMENT B       PROBABILITY   AVERAGE DROP/UPLIFT   EFFECT ON REVENUE*
Expected risk     10.9%         -1.85%                -$115,220
Expected uplift   89.1%         5.92%                 $370,700
Contribution                                          $317,936

* Based on 6 months and an average order value of $100
Or the payback period

IMPLEMENT B: BUSINESS CASE
Average CR change        5.00%
Extra margin per week    $2,400
Cost of implementation   $15,000
Payback period           6.25 weeks
The cut-off probability for implementation is not the same as the cut-off probability for a learning.

CHANCE      LEARNING?
< 70%       No learning
70 – 85%    Indication – need retest to confirm
85 – 95%    Strong indication – need follow-up test to confirm
> 95%       Learning

We still need the scientist!
Comparison of both methods
• 50 A/B-tests
• 50,000 visitors per variation
• conversion rate of 2%
• average order value of $100
• minimum contribution of $150,000 in 6 months’ time (equivalent to $30,000 extra margin: an ROI of 200%)
First of all, thank you so much for having me over! I have already spent a few days in Jamaica and it is truly amazing!
So, a bit about me. I flew in Saturday from the Netherlands. I live in this tiny country in Europe, and I can tell you the sun doesn’t shine as often or as strongly as in Jamaica. I already got a pretty nasty sunburn.
I have 8 years of web analytics experience and I’m always looking to find real insights in user data.
I may not look the part, but I’m basically a nerd.
Every A/B-test I have ever done has been analyzed using Excel.
I have built my own test evaluation tools based on my statistical knowledge from university.
You will get to see some screenshots of these Excel tools I’ve built in this presentation. And one of these tools has now been released as a webtool as well. I’ll come back to this later.
And the other thing about me: I love to travel and explore the world, especially to islands.
So this conference with fellow nerds on a tropical island is just perfect!
So what is it that I do on a daily basis?
I work at Online Dialogue: a data-driven conversion optimization agency in the Netherlands. Some of you may be familiar with my crazy bosses Ton and Bart. We have a mixed group of analysts, psychologists, UX designers, developers and project leads.
We combine data insights with psychological insights for evidence based growth.
With every client we use this framework for conversion optimization.
First we look at the data and determine the pages with the highest test power: which pages have enough visitors and conversions to be able to test on?
Then we look at the paths visitors take on the website to make a booking or place an order. So, what are the main online customer journeys? And where are the biggest leaks in this process?
These data findings are then sent to the psychologist. He or she combines this data with scientific research to come up with hypotheses to test.
These hypotheses are then briefed to the designer, who will come up with test variations.
These variations are then tested in several A/B-tests (since you cannot prove a hypothesis based on one experiment).
The learnings of these A/B-tests are then combined into overall learnings, which can then be shared with the rest of the organization.
In order to run this program, we run lots and lots of A/B-tests.
The purpose of those A/B-tests is to add direct value in the short term – you want to increase the revenue that is coming in to the website.
And in the long run to really learn from user behavior.
What is it that triggers the visitors on that particular website?
And how can we use those insights to come up with even better optimization efforts on other parts of the site?
This of course sounds terrific, but in practice we see a lot of A/B-test programs cancelled. There is a real challenge in keeping such a program alive.
If not everyone in your organization believes in A/B-testing, you will have a hard time proving its worth.
In order to have a successful A/B-test program we believe you need at least 1 winner in 2 weeks, so that every other week the site changes for the better.
Otherwise you will drain the energy out of your test team. Test team members put a lot of time and energy into finding insights, developing test variations and analyzing them. If these efforts aren’t rewarded, their energy drops.
And another more important consequence is that you will have lower visibility in the organization. If you only deliver a winning variation once a month or less you will not be perceived as very important to the business. So you will be deprioritized.
So if the energy of the test team drops and visibility in the organization is low, your A/B-test program will die!
On average we have a success rate of 25% with our clients.
In the market this success rate is even lower. Some companies only have like 1 in 8 winners.
So in order to get to 1 implementation within 2 weeks you need at least 4 tests per week
But you don’t just run 4 tests a week. You need a lot of resources for this AND you need high traffic volumes, which you don’t always have.
So if you cannot run 4 tests a week, you need a higher implementation rate to get the 1 winner in 2 weeks
There are 2 solutions for this:
This can be achieved by a couple of things: you can get more data insights before you run the test.
So you improve your conversion study (refer to Peep’s talk).
And you can start using consumer psychology and scientific research to combine customer journey analysis with scientific insights for better hypotheses.
And in the test phase you could:
- Test bolder changes. Bolder changes normally mean you are more likely to change visitor behavior.
- And/or run the test on more traffic, to be able to detect smaller uplifts.
And the other solution is to look at the statistics you are using to determine winners. Because you can redefine what is perceived as a winner. Should you really not implement a non-significant test variation?
There are a couple of challenges with the traditional t-test, which is most commonly used.
First of all. It’s really hard to understand what a test result actually tells you.
When you use a t-test (which we all have been using) you state a null hypothesis. You may recall this from your statistics classes. You calculate the p-value and decide to reject the null hypothesis or not. So you try to reject the hypothesis that the conversion rates are the same.
So, suppose you did an experiment and the p-value of that test was 0.01
You measured an uplift of 9.58% - and the graph indicates you have a winner.
It is the second one, but you have to read it more than once just to get a grip on what it says. You will have a hard time explaining this to your team and higher up in the organization.
What you actually want the result to tell you is what the chance is that B is better than A, but that is not what the p-value tells you.
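To make that concrete: a minimal sketch of the two-proportion z-test behind a result like this, assuming scipy for the normal tail probability. The visitor and conversion counts are made up, chosen only so that they land close to the example’s 9.58% uplift and p-value of 0.01.

```python
# Illustrative two-proportion z-test (made-up counts, not the slide's data).
# The p-value answers one narrow question: if A and B truly convert at the
# same rate, how likely is a difference at least this large?
from math import sqrt
from scipy.stats import norm

visitors_a, conversions_a = 75_000, 1_500   # 2.00% conversion rate
visitors_b, conversions_b = 75_000, 1_644   # ~2.19%, a ~9.6% relative uplift

cr_a = conversions_a / visitors_a
cr_b = conversions_b / visitors_b
cr_pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)

se = sqrt(cr_pooled * (1 - cr_pooled) * (1 / visitors_a + 1 / visitors_b))
z = (cr_b - cr_a) / se
p_value = 2 * norm.sf(abs(z))   # two-sided tail probability under H0

print(f"z = {z:.2f}, p-value = {p_value:.3f}")   # roughly p = 0.01
```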
Try explaining that to your manager. The hippo in your organization or higher management won’t understand what the heck you are talking about. They just want to know if they should implement the variation in order to make more money.
The second challenge is that with a t-test an A/B-test can only have 2 outcomes: winner or no winner.
And the focus is on finding those real winners. You want to take as little risk as possible.
This stems from the fact that t-tests have been used in a lot of medical research as well. Of course you don’t want to bring a medicine to the market if you’re not 100% sure that it won’t make people worse or kill them. But businesses aren’t run this way. You need to take some risk to grow your business.
When you look at this test result you will conclude that the experiment was a success. The p-value is very low. So it needs to be implemented. And you can expect an uplift in conversion rate of well over 8% after implementation
But based on this test result you would conclude that there is no winner, that it shouldn’t be implemented and that the measured uplift in conversion rate wasn’t enough. So you will see this as a loser and move on to another test idea.
However, there seems to be a positive movement (the measured uplift is 5%), but it isn’t big enough to recognize as a significant winner. You probably only need a few more conversions.
So what’s the alternative then?
The most common approach to analysing A/B-tests is the t-test (which is based on frequentist statistics).
But, over the last couple of years more and more software packages (like VWO) are switching to Bayesian statistics.
So what are the advantages of using Bayesian statistics instead?
First, there is no statistical terminology involved. There’s no null hypothesis, no p-value and no false positives. You don’t have to explain that if you have a winning variation there’s still a chance the variation won’t make you money: the false positive rate. If you report a probability, you automatically see there’s also a chance it won’t be a winner.
Second, it shows you the probability that B is better than A. Probability is very easy to explain. Everyone understands this.
So this is the Excel visualization of a Bayesian A/B-test result I developed.
You see the number of users and conversions per variation, the conversion rates, the measured uplift and the chance that B is better than A.
Easy right?
In the graph you see the probability of each possible uplift after implementation. These lie in a range: the more traffic and conversions you have, the more certain you are of the actual uplift.
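For anyone who wants to redo this calculation outside Excel: a minimal sketch of a Bayesian evaluation in Python, not the actual tool. It assumes a Beta(1, 1) prior and uses made-up counts chosen so the outcome lands near this example (roughly an 89% chance that B is better and a measured uplift of around 5%).

```python
# Sketch of a Bayesian A/B evaluation (illustrative counts, Beta(1,1) prior).
# "Chance that B is better than A" = share of posterior samples in which the
# sampled conversion rate of B exceeds that of A.
import numpy as np

rng = np.random.default_rng(42)
n_samples = 200_000

visitors_a, conversions_a = 61_000, 1_220   # 2.0% conversion rate
visitors_b, conversions_b = 61_000, 1_281   # 2.1%, a 5% measured uplift

cr_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, n_samples)
cr_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, n_samples)

prob_b_better = np.mean(cr_b > cr_a)
relative_uplift = (cr_b - cr_a) / cr_a      # posterior of the uplift after implementation

print(f"P(B > A) = {prob_b_better:.1%}")
print(f"Median expected uplift = {np.median(relative_uplift):.2%}")
```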
But these numbers are the same as in our previous example. With a t-test you would conclude that there is no winner and you need to move on to another test idea.
So you have an 89.1% chance that B will outperform A. I think every manager will take that chance and implement the variation.
And he / she will understand this result.
So you get a happy Hippo!
The second advantage of using a Bayesian A/B-test evaluation method is that it doesn’t have a binary outcome like the t-test does.
A test result won’t tell you winner / no winner, but a probability between 0 and 100% that the variation performs better than the original.
Instead of focusing on trying to find absolute truths, you can do a risk assessment.
Does the chance of an uplift outweigh the chance of losing money when you implement the variation?
That of course depends on the cost of implementing the variation.
In this example you have an 89.1% chance the variation is better than the original. So in 89.1% of the samples the difference in conversion rate between B and A is higher than 0.
But in order to earn back the costs of implementation you want to know the chance it will earn you at least x revenue.
So how big should the increase in revenue be to justify implementation?
Say for example that a test implementation costs $15,000 and your margin is 20% - so you need at least $75,000 in extra revenue.
So what is the chance that implementing the variation will earn you that amount within 6 months? In this example this is still an 82% chance. So you will probably implement it.
We calculated the expected revenue over 6 months’ time (this is a ballpark for how long an A/B-test result has an effect on the business – the environment changes, so some effects will last longer, some shorter).
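The ‘chance of at least x revenue’ step can be sketched the same way. Everything business-side below is a placeholder I picked to stay in the neighbourhood of this example: a $100 average order value, 500,000 visitors per month exposed to the change, the 6-month horizon and the $75,000 threshold derived above. The speaker’s tool may translate conversion-rate changes into revenue differently.

```python
# Chance that the variation earns at least a given amount of extra revenue,
# estimated from the same Beta posteriors (placeholder business assumptions).
import numpy as np

rng = np.random.default_rng(7)
n_samples = 200_000

visitors_a, conversions_a = 61_000, 1_220
visitors_b, conversions_b = 61_000, 1_281

cr_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, n_samples)
cr_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, n_samples)

aov = 100                        # average order value in dollars (assumed)
monthly_visitors = 500_000       # traffic the implemented change is exposed to (assumed)
months = 6                       # ballpark horizon used in the talk

extra_orders = (cr_b - cr_a) * monthly_visitors * months
extra_revenue = extra_orders * aov           # posterior of the 6-month revenue impact

threshold = 75_000               # revenue needed to cover a $15,000 cost at a 20% margin
print(f"P(extra revenue >= ${threshold:,}) = {np.mean(extra_revenue >= threshold):.1%}")
```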
So this is the webtool I talked about earlier to make these Bayesian calculations yourself. Just input your test and business case data and calculate!
The first graph will show you the main test result, plus the chance of earning at least x revenue.
When you know the cost of testing (the cost to implement the variation) you can also calculate the ROI of the test.
The average drop of the red bars is -1.85%.
The average uplift of the green bars is 5.92%.
So you have a 10.9% chance of a drop in conversion rate of 1.85%.
And you have an 89.1% chance of an uplift in conversion rate of 5.92%.
In money terms this translates to a drop in revenue of $115,220 or an increase in revenue of $370,700.
Multiply 10.9% by the drop in revenue, add 89.1% times the uplift in revenue, and you have the contribution of this test.
With this contribution you calculate the ROI.
So if you have a margin of 20%, this means this test will earn you over $63,000.
The cost of implementation is $15,000.
This then results in an ROI of 424%. Pretty positive, right?
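Written out, the arithmetic is only a few lines. The probabilities and revenue figures come from the risk-assessment table above; the small gap with the slide’s $317,936 contribution comes from the rounded 10.9% / 89.1% probabilities.

```python
# Contribution and ROI exactly as described above (table numbers, 20% margin).
p_drop, revenue_drop = 0.109, -115_220       # expected risk over 6 months
p_uplift, revenue_uplift = 0.891, 370_700    # expected uplift over 6 months

contribution = p_drop * revenue_drop + p_uplift * revenue_uplift   # ~$317,700
margin = 0.20 * contribution                                       # ~$63,500
cost_of_implementation = 15_000
roi = margin / cost_of_implementation                              # ~4.2x, i.e. ~424%

print(f"Contribution ${contribution:,.0f}, margin ${margin:,.0f}, ROI {roi:.0%}")
```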
You could also look at the payback period. So you don’t look at the 6 months revenue, but calculate how long it would take to earn back the investment.
In this example you have measured a change in conversion rate of 5%
A 5% uplift means an extra margin of $2,400 each week.
If you take the $15,000 investment,
it will take 6.25 weeks to earn back this investment.
You can set certain cut-off values as to when to implement: if it takes longer than 3 months to earn it back, then don’t implement.
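The payback arithmetic is just as short. The 13-week cut-off below is my reading of the 3-month rule mentioned above, not a number from the talk.

```python
# Payback period for the business case above, with a simple cut-off rule.
extra_margin_per_week = 2_400
cost_of_implementation = 15_000

payback_weeks = cost_of_implementation / extra_margin_per_week   # 6.25 weeks
implement = payback_weeks <= 13                                  # roughly 3 months (assumed cut-off)

print(f"Payback period: {payback_weeks:.2f} weeks -> implement: {implement}")
```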
However, there is a word of caution to this, because we still need the scientist!
As you might recall we test to add direct value, but we still want to learn about user behavior
We take these numbers as a ballpark. If the test has a probability lower than 70% we won’t see it as a learning. If the percentage lies between 70 and 85% we see it as an indication something is there, but we need a retest to confirm the learning.
Anything between 85 and 95% is a very strong indication. So we would do follow-up tests on other parts of the website to see if it works there too. And the same as with a t-test: when the chance is higher than 95% we see it as a real learning.
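If you want to wire these ballpark cut-offs into a report, a tiny helper like this is enough (my own phrasing of the rule, not the speaker’s tooling).

```python
def learning_label(prob_b_better: float) -> str:
    """Map 'chance that B beats A' to the learning categories used in the talk."""
    if prob_b_better > 0.95:
        return "Learning"
    if prob_b_better >= 0.85:
        return "Strong indication - follow-up test needed to confirm"
    if prob_b_better >= 0.70:
        return "Indication - retest needed to confirm"
    return "No learning"

print(learning_label(0.891))   # the running example: strong indication, not yet a learning
```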
So even though you would implement the previous test, it doesn’t prove the stated hypothesis. It shows a strong indication, but to be sure the hypothesis is true you need follow-up tests to confirm this learning.
So what does this mean in terms of revenue over time if you compare the two methods?
So I looked at 50 example test results and whether each would be implemented based on a t-test and based on a Bayesian risk assessment.
These are the numbers for the first 10 tests, with the Bayesian probability and the significance indication based on a t-test.
I also calculated the revenue each test would add to the business in 6 months, based on a Bayesian and a frequentist approach.
Based on 50 tests, 1 in 5 was a significant winner if you use frequentist statistics: so you would implement 10 test variations over time.
When you use a Bayesian test evaluation, the number of tests that are implemented rises to 29! That is a whopping 58%.
As you see, the expected uplift of the Bayesian approach is way higher than that of the frequentist approach. But the risk is also higher.
With frequentist statistics you have a 5% risk (the false positive rate); with this Bayesian example it is 27%.
But because the uplift is way higher as well, you end up with a higher contribution for the Bayesian approach in the end.
When you put these numbers in a graph you will see this: the Bayesian approach will increase your implementation rate and earn you way more money.
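To get a feel for this kind of comparison without the original 50 test results, here is a rough simulation under the stated assumptions (50 tests, 50,000 visitors per variation, a 2% conversion rate, a $100 order value). The distribution of true effects, the 6-month traffic and the fixed 75% Bayesian cut-off are placeholders of mine, so the exact counts and dollar totals will differ from the talk; the point is only to contrast the two decision rules.

```python
# Rough simulation contrasting a frequentist and a Bayesian implementation rule.
# True effects, horizon traffic and the 75% cut-off are assumed, not the talk's data
# (the talk derives its cut-off from a business-case risk assessment instead).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

n_tests = 50
visitors = 50_000                # per variation, as in the comparison slide
base_cr = 0.02
aov = 100
horizon_visitors = 3_000_000     # assumed 6-month traffic exposed to each change

true_uplifts = rng.normal(0.0, 0.03, n_tests)     # assumed relative effects of the variations
cr_b_true = base_cr * (1 + true_uplifts)

conv_a = rng.binomial(visitors, base_cr, n_tests)
conv_b = rng.binomial(visitors, cr_b_true, n_tests)

# Frequentist rule: implement only significant winners (one-sided p < 0.025).
pooled = (conv_a + conv_b) / (2 * visitors)
se = np.sqrt(pooled * (1 - pooled) * 2 / visitors)
z = (conv_b - conv_a) / visitors / se
freq_implement = norm.sf(z) < 0.025

# Bayesian rule: implement when P(B > A) clears an assumed 75% cut-off.
prob_b_better = np.empty(n_tests)
for i in range(n_tests):
    a = rng.beta(1 + conv_a[i], 1 + visitors - conv_a[i], 20_000)
    b = rng.beta(1 + conv_b[i], 1 + visitors - conv_b[i], 20_000)
    prob_b_better[i] = np.mean(b > a)
bayes_implement = prob_b_better > 0.75

# True 6-month contribution of every implemented variation.
contribution = (cr_b_true - base_cr) * horizon_visitors * aov
for name, decision in [("frequentist", freq_implement), ("Bayesian", bayes_implement)]:
    print(f"{name}: {decision.sum()} of {n_tests} implemented, "
          f"true contribution ${contribution[decision].sum():,.0f}")
```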