A workshop at MeasureCamp Amsterdam about building a data-driven test strategy. Where can you test? What should you test? How do you analyze the results?
Scientific literature (verified, 2nd party)
What do we know from scientific literature? In general about decision-making processes, and specifically about the type of products sold.

So the p-value only tells you: how unlikely it is that you found this result, given that the null hypothesis is true (that there is no difference between the conversion rates).
I am Annemarie Klaassen and I work as an analytics and optimization expert at Online Dialogue. I studied at Tilburg University, where I completed my master's in Leisure Studies and Marketing Management. I have a real passion for data and traveling. I actually just returned from a trip to NY, so I'm a bit jetlagged. Hopefully you won't notice it too much.
We work at OD: a conversion rate optimization agency in Utrecht. Our goal is to grow businesses by improving their conversion rate.
There are a couple more conversion rate optimization agencies in the Netherlands, but our USP is the combination of analytics and psychology.
We combine data insights with psychological insights for evidence based growth.
We do this for a bunch of clients in the Netherlands and also for some pretty cool international clients.
For most of them we do high-velocity testing, which means we run multiple tests per week.
PIPE is a mnemonic to remember what to do:
P: Potential
I: Impact
P: Power
E: Ease
Where does the attention go on the page? Which elements are and aren't used?
The first thing you do is map out all the different page types on your website, then look at the weekly unique visitors on each page type and the conversions through those pages.
Then you determine whether the pages have enough test power – based on these numbers.
Now you might wonder what test power actually means. Well..
If you test at a 90% confidence level (a significance level of 10%), then you accept a 10% false positive rate: in 10% of the tests where there is no real difference, you will still declare a winner.
If you test with a Power of 80%, then in 20% of the tests where there actually is a winner, you won't detect it.
The test power is the likelihood that an experiment will detect an effect when there is an effect to be detected. You want to make sure you can find the winning variation in the collected data.
The Power depends on 3 elements: the sample size (how much traffic you run your test on), the effect size (the actual uplift in conversion) and the chosen significance level.
If you visit Abtestguide.com you can calculate the Power of a test given the number of visitors and conversions and the expected uplift. In this case you have 10.000 visitors per variation and 1.000 conversions in the control, and you expect an uplift of 5%. This results in a Power of only 65%. That is not very high: you will only detect a winner 65% of the time when there is one to be detected.
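If you prefer to script this instead of using the calculator, here is a minimal sketch in Python using statsmodels. It is not the ABTestGuide implementation, and the number it prints depends on assumptions such as a one- versus two-sided test, so it may not match the calculator's 65% exactly:

```python
# A minimal power calculation for a two-proportion A/B-test, using
# statsmodels. Not the ABTestGuide implementation: the printed value
# depends on one- vs two-sided assumptions, so it may differ from 65%.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 1000 / 10_000     # control conversion rate: 10%
variant = baseline * 1.05    # expected relative uplift of 5%

# Cohen's h effect size for two proportions
effect_size = proportion_effectsize(variant, baseline)

power = NormalIndPower().power(
    effect_size=effect_size,
    nobs1=10_000,            # visitors per variation
    alpha=0.10,              # 90% confidence level, one-sided
    alternative="larger",
)
print(f"Power: {power:.1%}")
```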
A rule of thumb is a Power of at least 80%. To increase the Power of a test you can do 3 things:
First, you can increase the sample size: the number of visitors in your experiment. If you double the test duration (so you get 20.000 visitors and 2.000 conversions), the distributions of the 2 variations lie further apart. Hence, the Power increases to 85,4%.
The second element is the effect size: how much uplift do you expect from your variation? If you expect an uplift of 10% instead of 5%, your test Power increases immensely.
You need to be aware of what kind of uplift can be expected from an A/B-test. You learn this by doing a lot of experiments, but it's quite rare to find a winning variation with an uplift higher than 10%. Most of the time it's not higher than 5%. This also depends on the type of test you're doing: if you only change a headline you probably won't get a 10% uplift.
The third element is the chosen significance level: if you test at a lower confidence level, the Power goes up, but at the cost of a higher false positive rate. A small sweep over all three levers is sketched below.
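A sketch of that sweep, using the same statsmodels approach as above; the 10% baseline conversion rate is an assumption carried over from the running example:

```python
from itertools import product

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.10  # assumed control conversion rate

# Power for every combination of sample size, uplift and significance level.
for n, uplift, alpha in product([10_000, 20_000], [0.05, 0.10], [0.10, 0.05]):
    es = proportion_effectsize(baseline * (1 + uplift), baseline)
    power = NormalIndPower().power(es, nobs1=n, alpha=alpha, alternative="larger")
    print(f"n={n:>6}  uplift={uplift:.0%}  alpha={alpha:.2f}  power={power:.1%}")
```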
P: Potential
I: Impact
P: Power
E: Ease
You can look at different segments in your data, at click behavior per variation, at time on page, and at other micro-conversions.
What are the main ways of analysing A/B-tests then?
The most common approach to analysing A/B-tests is the t-test (which is based on frequentist statistics).
But, over the last couple of years Bayesian statistics have grown in popularity.
I will try to explain both in a bit.
We will start with frequentist statistics.
Frequentist testing is very much like a court trial in the US.
The null hypothesis says that the defendant is innocent
and the alternative hypothesis says that the defendant is guilty.
We then present evidence, or in other words, collect data.
Then, we judge this evidence and ask ourselves the question, could the data plausibly have happened by chance if the null hypothesis were true?
If the data were likely to have occurred under the assumption that the null hypothesis were true, then we would fail to reject the null hypothesis, and state that the evidence is not sufficient to suggest that the defendant is guilty.
If the data were very unlikely to have occurred, then the evidence raises more than a reasonable doubt about the null hypothesis, and hence we reject the null hypothesis.
This judging of evidence is done with the p-value.
I will give an example of how this translates to an A/B-test.
When you use a t-test you first state a null hypothesis. You calculate the p-value and decide whether or not to reject the null hypothesis. So you try to reject the hypothesis that the conversion rates are the same.
So, suppose you did an experiment and the p-value of that test was 0.01. The p-value in this experiment tells you that there is a 1% chance of observing a difference as large as you observed even if the two means are identical.
The p-value is very low, so the H0 gets to go.
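As a sketch, this is what such an evaluation could look like in Python, using the two-proportion z-test from statsmodels; the variant's conversion count is an assumption for illustration, not a number from the talk:

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [1050, 1000]    # variant B, control A (B's count is assumed)
visitors = [10_000, 10_000]   # visitors per variation

# One-sided test of H0 "the conversion rates are the same"
# against H1 "B converts better than A".
z_stat, p_value = proportions_ztest(conversions, visitors, alternative="larger")
print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}")

# Declare a winner only if p is below the chosen significance level.
alpha = 0.10
print("winner" if p_value < alpha else "no winner")
```

With counts like these the measured uplift is 5%, yet the p-value typically stays above 0.10, so the verdict is "no winner"; that is exactly the kind of result discussed next.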
The other challenge with frequentist statistics is that an A/B-test can only have 2 outcomes: you either have a winner or no winner.
And the focus is on finding those real winners. You want to take as little risk as possible.
This is not so surprising if you take into account that t-tests have been used in a lot of medical research as well. Of course you don't want to bring a medicine to the market if you're not 100% sure that it won't make people worse or kill them. You don't want to take any risk whatsoever.
But businesses aren’t run this way. You need to take some risk in order to grow your business.
If you take a look at this test result you would conclude that there is no winner and that it shouldn't be implemented, because the measured uplift in conversion rate wasn't enough. So you see this as a loser and move on to another test idea.
However, there seems to be a positive movement (the measured uplift is 5%), but it isn't big enough to be recognized as a significant winner. You probably only need a few more conversions.
If Frequentists statistics confronts us with these kind of challenges, what’s the alternative then?
Well as I said earlier, the most common approach to analysing A/B-tests is the t-test (which is based on frequentist statistics).
But, over the last couple of years more and more software packages (like VWO and Google Optimize) are switching to Bayesian statistics.
And that’s not without reason, because using Bayesian statistics makes more sense, since it better suits how businesses are run and I will show you why.
So, when you use Bayesian statistics to evaluate your A/B-test, there is no difficult statistical terminology involved anymore. There's no null hypothesis, no p-value or z-value, et cetera. It just shows you the measured uplift and the probability that B is better than A. Easy, right?
Everyone can understand this.
Based on the same numbers of the A/B-test we showed you earlier, you have an 89,1% chance that B will actually be better than A.
Probably every manager would understand this and will like these odds.
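One common way to compute such a probability (a sketch, not necessarily the method behind any particular tool) is to give each variation a Beta posterior with a uniform prior and compare Monte Carlo samples. The counts below follow the running example, with the variant's conversions assumed:

```python
import numpy as np

rng = np.random.default_rng(42)
draws = 1_000_000

# Posterior per variation: Beta(1 + conversions, 1 + non-conversions),
# i.e. a uniform Beta(1, 1) prior updated with the observed data.
a = rng.beta(1 + 1000, 1 + 9000, draws)   # control: 1000 / 10.000
b = rng.beta(1 + 1050, 1 + 8950, draws)   # variant: 1050 / 10.000 (assumed)

# Fraction of samples where the variant beats the control.
print(f"P(B > A) = {(b > a).mean():.1%}")   # roughly 88-89% with these counts
```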
Recently we turned this Bayesian Excel calculator into a webtool as well. It's free for everyone to use.
If you visit this URL you can input your test data and calculate! It will return the chance that B outperforms A.
When using a Bayesian A/B-test evaluation method you no longer have a binary outcome like the t-test does.
A test result won't tell you winner / no winner, but gives a percentage between 0 and 100%: the probability that the variation performs better than the original.
In this example 89,1%.
The question that remains is: is this enough to be implemented?
What you can do is make a risk assessment. You can calculate what the results mean in terms of revenue.
When the client decides to implement the variation, they have a 10.9% chance of a drop in revenue of 200.000 euro in 6 months' time (at an average order value of 175 euro).
But on the other hand, they also have an 89.1% chance that the variation is actually better and brings in nearly 650.000 euro.
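Such a risk table can be derived directly from the posterior samples. A sketch, where the 6-month visitor count is an assumption for illustration (the real figures depend on the client's traffic):

```python
import numpy as np

rng = np.random.default_rng(7)
draws = 1_000_000

# Same Beta posteriors as before (variant counts assumed).
a = rng.beta(1 + 1000, 1 + 9000, draws)   # control
b = rng.beta(1 + 1050, 1 + 8950, draws)   # variant

visitors_6m = 500_000   # assumed visitors over the next 6 months
order_value = 175       # average order value in euros

# Turn the posterior difference in conversion rate into euros.
revenue_delta = (b - a) * visitors_6m * order_value

print(f"Chance of a revenue drop: {(revenue_delta < 0).mean():.1%}")
print(f"Average loss when B is worse:  {revenue_delta[revenue_delta < 0].mean():,.0f} euro")
print(f"Average gain when B is better: {revenue_delta[revenue_delta > 0].mean():,.0f} euro")
```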
You can show this table to your boss and ask whether he would place the bet.
Well, that depends on a couple of things. If you implement a test variation with a probability of 51%, then you're not doing much better than flipping a coin. The risk of implementing a losing variation is quite high.
Depending on the type of business you may be more or less willing to take risks. If you are a start-up you might want to take more risk than a full-grown business, but still, we don't really like the chance of losing money, so what we see with our clients is that most need a probability of at least 70%.
But it also depends on the type of test. If you only changed a headline, the risk is lower than when you need to implement a new functionality on the page, which consumes much more resources. Hence, you will need a higher probability.
The purpose of A/B-testing is of course to add direct value, but we also want to learn about user behavior. If you really want to learn from user behavior, you need to test very strictly (say with >95%). Otherwise you only have a hunch, not proof.
We take these numbers as a ballpark. If the test has a probability lower than 70%, we won't count it as a learning. If the percentage lies between 70 and 85%, we see it as an indication something is there, but we need a retest to confirm the learning.
Anything between 85 and 95% is a very strong indication, so we would do follow-up tests on other parts of the website to see if it works there too. And the same as with a t-test: when the chance is higher than 95%, we see it as a real learning.
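These ballpark thresholds are easy to encode. A hypothetical helper (the cut-offs are the ones named above, not any industry standard):

```python
def classify_learning(prob_b_beats_a: float) -> str:
    """Map P(B > A) to the ballpark learning categories described above."""
    if prob_b_beats_a >= 0.95:
        return "real learning"
    if prob_b_beats_a >= 0.85:
        return "very strong indication: follow up elsewhere on the site"
    if prob_b_beats_a >= 0.70:
        return "indication: retest to confirm"
    return "no learning"

print(classify_learning(0.891))  # the running example: very strong indication
```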
So even though you would implement the previous test, it doesn’t prove the stated hypothesis. It shows a strong indication, but to be sure the hypothesis is true you need follow-up tests to confirm this learning.