A workshop at MeasureCamp Amsterdam about building a data-driven test strategy. Where can you test? What should you test? How do you analyze the results?
Scientific literature (verified, 2nd party)
What do we know from scientific literature? In general about decision-making processes, and specifically about the type of products sold.

So the p-value only tells you: how unlikely it is that you found this result, given that the null hypothesis is true (that there is no difference between the conversion rates).
I am Annemarie Klaassen and I work as an analytics and optimization expert at Online Dialogue. I studied at Tilburg University, where I completed my master's in Leisure Studies and Marketing Management. I have a real passion for data and traveling. I actually just returned from a trip to NY, so I'm a bit jetlagged. Hopefully you won't notice it too much.
We work at OD: a conversion rate optimization agency in Utrecht. Our goal is to grow businesses by improving their conversion rate.
There are a couple more conversion rate optimization agencies in the Netherlands, but our USP is the combination of analytics and psychology.
We combine data insights with psychological insights for evidence based growth.
We do this for a bunch of clients in the Netherlands and also for some pretty cool international clients.
For most of them we do high-velocity testing, which means we run multiple tests per week.
PIPE is a mnemonic to remember what to do:
P: Potential
I: Impact
P: Power
E: Ease
Where does the attention go on the page? Which elements are and aren't used?
The first thing you do is map out all the different page types on your website, then look at the weekly unique visitors on each page type and the conversions through those pages.
Then you determine whether the pages have enough test power – based on these numbers.
Now you might wonder what test power actually means. Well..
If you test at a 90% confidence level (a significance level of 10%), then you accept a 10% false positive rate: in 10% of the tests where there is no real difference, you will still declare a winner.
If you test with a Power of 80%, then in 20% of the tests where there actually is a winner, you won't detect it.
The test power is the likelihood that an experiment will detect an effect when there is an effect to be detected. You want to make sure you can find the winning variation in the collected data.
The Power depends on 3 elements: the sample size (how much traffic you run your test on), the effect size (the actual uplift in conversion) and the chosen significance level.
If you visit Abtestguide.com you can calculate the Power of a test given the number of visitors and conversions and the expected uplift. In this case you have 10.000 visitors per variation and 1.000 conversions in the control, and you expect an uplift of 5%. This results in a Power of only 65%. That is not very high: you will only detect a winner 65% of the time when there is one to be detected.
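If you prefer to script this instead of using the calculator, here is a minimal sketch in Python using statsmodels. It is not the ABTestGuide implementation, and the number it prints depends on assumptions such as a one- versus two-sided test, so it may not match the calculator's 65% exactly:

```python
# A minimal power calculation for a two-proportion A/B-test, using
# statsmodels. Not the ABTestGuide implementation: the printed value
# depends on one- vs two-sided assumptions, so it may differ from 65%.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 1000 / 10_000     # control conversion rate: 10%
variant = baseline * 1.05    # expected relative uplift of 5%

# Cohen's h effect size for two proportions
effect_size = proportion_effectsize(variant, baseline)

power = NormalIndPower().power(
    effect_size=effect_size,
    nobs1=10_000,            # visitors per variation
    alpha=0.10,              # 90% confidence level, one-sided
    alternative="larger",
)
print(f"Power: {power:.1%}")
```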
A rule of thumb is a Power of at least 80%. To increase the Power of a test you can do 3 things:
First, you can increase the sample size: the number of visitors in your experiment. If you double the test duration (so you get 20.000 visitors and 2.000 conversions), the distributions of the 2 variations lie further apart. Hence, the Power increases to 85,4%.
The second element is the effect size: how much uplift do you expect from your variation? If you expect an uplift of 10% instead of 5%, your test Power increases immensely.
You need to be aware of what kind of uplift can be expected from an A/B-test. You learn this by doing a lot of experiments, but it's quite rare to find a winning variation with an uplift higher than 10%. Most of the time it's not higher than 5%. This also depends on the type of test you're doing: if you only change a headline you probably won't get a 10% uplift.
The third element is the chosen significance level: if you test at a lower confidence level, the Power goes up, but at the cost of a higher false positive rate. A small sweep over all three levers is sketched below.
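A sketch of that sweep, using the same statsmodels approach as above; the 10% baseline conversion rate is an assumption carried over from the running example:

```python
from itertools import product

from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.10  # assumed control conversion rate

# Power for every combination of sample size, uplift and significance level.
for n, uplift, alpha in product([10_000, 20_000], [0.05, 0.10], [0.10, 0.05]):
    es = proportion_effectsize(baseline * (1 + uplift), baseline)
    power = NormalIndPower().power(es, nobs1=n, alpha=alpha, alternative="larger")
    print(f"n={n:>6}  uplift={uplift:.0%}  alpha={alpha:.2f}  power={power:.1%}")
```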
P: Potential
I: Impact
P: Power
E: Ease
You can look at different segments in your data, at click behavior per variation, at time on page, and at other micro-conversions.
What are the main ways of analysing A/B-tests then?
The most common approach to analysing A/B-tests is the t-test (which is based on frequentist statistics).
But, over the last couple of years Bayesian statistics have grown in popularity.
I will try to explain both in a bit.
We will start with frequentist statistics.
Frequentist testing is very much like a court trial in the US.
The null hypothesis says that the defendant is innocent
and the alternative hypothesis says that the defendant is guilty.
We then present evidence, or in other words, collect data.
Then, we judge this evidence and ask ourselves the question, could the data plausibly have happened by chance if the null hypothesis were true?
If the data were likely to have occurred under the assumption that the null hypothesis were true, then we would fail to reject the null hypothesis, and state that the evidence is not sufficient to suggest that the defendant is guilty.
If the data were very unlikely to have occurred, then the evidence raises more than a reasonable doubt about the null hypothesis, and hence we reject the null hypothesis.
This judging of evidence is done with the p-value.
I will give an example of how this translates to an A/B-test.
When you use a t-test you first state a null hypothesis. You calculate the p-value and decide whether or not to reject the null hypothesis. So you try to reject the hypothesis that the conversion rates are the same.
So, suppose you did an experiment and the p-value of that test was 0.01. The p-value in this experiment tells you that there is a 1% chance of observing a difference as large as you observed even if the two means are identical.
The p-value is very low, so the H0 gets to go.
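As a sketch, this is what such an evaluation could look like in Python, using the two-proportion z-test from statsmodels; the variant's conversion count is an assumption for illustration, not a number from the talk:

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [1050, 1000]    # variant B, control A (B's count is assumed)
visitors = [10_000, 10_000]   # visitors per variation

# One-sided test of H0 "the conversion rates are the same"
# against H1 "B converts better than A".
z_stat, p_value = proportions_ztest(conversions, visitors, alternative="larger")
print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}")

# Declare a winner only if p is below the chosen significance level.
alpha = 0.10
print("winner" if p_value < alpha else "no winner")
```

With counts like these the measured uplift is 5%, yet the p-value typically stays above 0.10, so the verdict is "no winner"; that is exactly the kind of result discussed next.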
The other challenge with frequentist statistics is that an A/B-test can only have 2 outcomes: you either have a winner or no winner.
And the focus is on finding those real winners. You want to take as little risk as possible.
This is not so surprising if you take into account that t-tests have been used in a lot of medical research as well. Of course you don't want to bring a medicine to the market if you're not 100% sure that it won't make people worse or kill them. You don't want to take any risk whatsoever.
But businesses aren’t run this way. You need to take some risk in order to grow your business.
If you take a look at this test result you would conclude that there is no winner and that it shouldn't be implemented, because the measured uplift in conversion rate wasn't enough. So you see this as a loser and move on to another test idea.
However, there seems to be a positive movement (the measured uplift is 5%), but it isn't big enough to be recognized as a significant winner. You probably only need a few more conversions.
If Frequentists statistics confronts us with these kind of challenges, what’s the alternative then?
Well as I said earlier, the most common approach to analysing A/B-tests is the t-test (which is based on frequentist statistics).
But, over the last couple of years more and more software packages (like VWO and Google Optimize) are switching to Bayesian statistics.
And that’s not without reason, because using Bayesian statistics makes more sense, since it better suits how businesses are run and I will show you why.
So, when you use Bayesian statistics to evaluate your A/B-test, there is no difficult statistical terminology involved anymore. There's no null hypothesis, no p-value or z-value, et cetera. It just shows you the measured uplift and the probability that B is better than A. Easy, right?
Everyone can understand this.
Based on the same numbers of the A/B-test we showed you earlier, you have an 89,1% chance that B will actually be better than A.
Probably every manager would understand this and will like these odds.
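One common way to compute such a probability (a sketch, not necessarily the method behind any particular tool) is to give each variation a Beta posterior with a uniform prior and compare Monte Carlo samples. The counts below follow the running example, with the variant's conversions assumed:

```python
import numpy as np

rng = np.random.default_rng(42)
draws = 1_000_000

# Posterior per variation: Beta(1 + conversions, 1 + non-conversions),
# i.e. a uniform Beta(1, 1) prior updated with the observed data.
a = rng.beta(1 + 1000, 1 + 9000, draws)   # control: 1000 / 10.000
b = rng.beta(1 + 1050, 1 + 8950, draws)   # variant: 1050 / 10.000 (assumed)

# Fraction of samples where the variant beats the control.
print(f"P(B > A) = {(b > a).mean():.1%}")   # roughly 88-89% with these counts
```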
Recently we turned this Bayesian Excel calculator into a webtool as well. It's free for everyone to use.
If you visit this URL you can input your test data and calculate! It will return the chance that B outperforms A.
When using a Bayesian A/B-test evaluation method you no longer have a binary outcome like the t-test does.
A test result won't tell you winner / no winner, but gives a percentage between 0 and 100%: the probability that the variation performs better than the original.
In this example 89,1%.
The question that remains is: is this enough to be implemented?
What you can do is make a risk assessment. You can calculate what the results mean in terms of revenue.
When the client decides to implement the variation, they have a 10.9% chance of a drop in revenue of 200.000 euro in 6 months' time (at an average order value of 175 euro).
But on the other hand, they also have an 89.1% chance that the variation is actually better and brings in nearly 650.000 euro.
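Such a risk table can be derived directly from the posterior samples. A sketch, where the 6-month visitor count is an assumption for illustration (the real figures depend on the client's traffic):

```python
import numpy as np

rng = np.random.default_rng(7)
draws = 1_000_000

# Same Beta posteriors as before (variant counts assumed).
a = rng.beta(1 + 1000, 1 + 9000, draws)   # control
b = rng.beta(1 + 1050, 1 + 8950, draws)   # variant

visitors_6m = 500_000   # assumed visitors over the next 6 months
order_value = 175       # average order value in euros

# Turn the posterior difference in conversion rate into euros.
revenue_delta = (b - a) * visitors_6m * order_value

print(f"Chance of a revenue drop: {(revenue_delta < 0).mean():.1%}")
print(f"Average loss when B is worse:  {revenue_delta[revenue_delta < 0].mean():,.0f} euro")
print(f"Average gain when B is better: {revenue_delta[revenue_delta > 0].mean():,.0f} euro")
```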
You can show this table to your boss and ask whether he would place the bet.
Well, that depends on a couple of things. If you implement a test variation with a probability of 51%, then you're not doing much better than flipping a coin. The risk of implementing a losing variation is quite high.
Depending on the type of business you may be more or less willing to take risks. If you are a start-up you might want to take more risk than a full-grown business, but still, we don't really like the chance of losing money, so what we see with our clients is that most need a probability of at least 70%.
But it also depends on the type of test. If you only changed a headline, the risk is lower than when you need to implement a new functionality on the page, which consumes much more resources. Hence, you will need a higher probability.
The purpose of A/B-testing is of course to add direct value, but we also want to learn about user behavior. If you really want to learn from user behavior, you need to test very strictly (say with >95%). Otherwise you only have a hunch, not proof.
We take these numbers as a ballpark. If the test has a probability lower than 70%, we won't count it as a learning. If the percentage lies between 70 and 85%, we see it as an indication something is there, but we need a retest to confirm the learning.
Anything between 85 and 95% is a very strong indication, so we would do follow-up tests on other parts of the website to see if it works there too. And the same as with a t-test: when the chance is higher than 95%, we see it as a real learning.
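These ballpark thresholds are easy to encode. A hypothetical helper (the cut-offs are the ones named above, not any industry standard):

```python
def classify_learning(prob_b_beats_a: float) -> str:
    """Map P(B > A) to the ballpark learning categories described above."""
    if prob_b_beats_a >= 0.95:
        return "real learning"
    if prob_b_beats_a >= 0.85:
        return "very strong indication: follow up elsewhere on the site"
    if prob_b_beats_a >= 0.70:
        return "indication: retest to confirm"
    return "no learning"

print(classify_learning(0.891))  # the running example: very strong indication
```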
So even though you would implement the previous test, it doesn’t prove the stated hypothesis. It shows a strong indication, but to be sure the hypothesis is true you need follow-up tests to confirm this learning.