There are countless ways to create flawed A/B tests, and even many CRO experts fall into these traps. A flawed test either yields no data at all or produces faulty data that leads you astray. From test setup to concluding analysis, this talk covered where online testing goes awry.
#INBOUND17
SCIENTIFIC METHOD
Unlike intuitive, philosophical, or religious methods for acquiring knowledge, the scientific method relies on empirical, repeatable tests to reveal the truth
QUESTION → OBSERVATIONS → HYPOTHESIS
1. NOT DOING ENOUGH RESEARCH
RESEARCH APPROACHES
QUALITATIVE QUANTITATIVE
RESEARCH-DRIVEN TEST DESIGN
FACTORS FOR TEST DURATION
2. SKIPPING TEST DURATION CALCULATION
1. Traffic Volume (Visitors)
2. Baseline Success Rate (KPI Completions)
3. Difference Between Experiences (Minimum Detectable Effect (MDE))
4. Statistical Significance
5. Statistical Power
EXPERIMENTAL DESIGN
QUESTION → OBSERVATIONS → HYPOTHESIS
3. MISSING A HYPOTHESIS
SOURCE: LIVE SCIENCE
“The basic idea of a hypothesis is that there is no pre-determined outcome.”
RISKS
• Not testing impactful changes
• Lack of focus in the experiment
• Tests can take far longer than they need to
• Chance of sample pollution (users with different devices or cleared cookies getting into a different experience)
• False positives (see next slide)
5. SETTING UP TOO MANY VARIATIONS
STATISTICAL RISK
Confidence Level = 100% − Significance Level
Significance = .05 → Confidence = 95%
.05 × 20 variations = 1 variation expected to look significant purely by chance
.05 × 80 variations = 4 variations expected to look significant purely by chance
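The arithmetic above can be sketched in a few lines. The function names are mine; the expected-false-positive numbers match the slide, and the familywise error rate (not on the slide, but the standard companion quantity) shows the chance of at least one fluke:

```python
def expected_false_positives(alpha, variations):
    """Expected number of variations that look 'significant' by chance alone."""
    return alpha * variations

def familywise_error_rate(alpha, variations):
    """Probability of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** variations

expected_false_positives(0.05, 20)  # 1.0, matching the slide
expected_false_positives(0.05, 80)  # 4.0, matching the slide
familywise_error_rate(0.05, 20)     # ~0.64: roughly a 64% chance of at least one fluke
```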
BONFERRONI CORRECTION
Confidence Level = 100% − Significance Level
Desired Confidence Level = 95% (α = .05)
.05 / 20 = .0025 → 0.25% significance level
100% − 0.25% = 99.75% confidence level for an individual test
.05 / 80 = .000625 → 0.0625% significance level
100% − 0.0625% = 99.9375% confidence level for an individual test
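The Bonferroni correction is just a division: shrink the per-test significance level so the whole family of comparisons stays at your desired level. A minimal sketch (function names are mine) reproducing the slide's numbers:

```python
def bonferroni_alpha(alpha, comparisons):
    """Per-test significance level that keeps the familywise level at `alpha`."""
    return alpha / comparisons

def per_test_confidence(alpha, comparisons):
    """Confidence level each individual test must reach."""
    return 1 - bonferroni_alpha(alpha, comparisons)

per_test_confidence(0.05, 20)  # 0.9975   → 99.75%, as on the slide
per_test_confidence(0.05, 80)  # 0.999375 → 99.9375%, as on the slide
```

The practical consequence: each variation must clear a much higher confidence bar, which demands far more traffic per variation, another reason to limit how many variations you run.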
FOOC FIXES
• Make sure the testing snippet is in the <head> (as high as possible!)
• Reduce the size of your testing snippet
• Don’t use testing software to make development changes (slows down the test load)
• Make sure jQuery is above the testing snippet on the page
• Use raw JavaScript instead of jQuery
• QA, QA, QA!
6. IGNORING PAGE FLICKER (FOOC)
EXAMPLE
Desired Update (Mid-Test)
Unfortunate Result
7. CHANGING THE SITE MID-TEST
EXAMPLE 2 (SAME TEST)
Original KPI measurement: stores.site.com
Day 3: URL changed to site.com/stores
KPI tracking broken
CHANGING THE KPI
If you use a metric further up the conversion funnel to speed up testing, you have to make sure that there is a direct, measurable relationship between that metric and your actual KPI
8. MEASURING THE WRONG KPI (OR CHANGING IT)
WAYS TO CHANGE TRAFFIC MID-TEST
• Start a paid search (PPC) campaign centered around a new keyword grouping
• Start a promotion (attracts deal-focused users)
• Have a publicity moment (news coverage, reddit, social media virality, etc.)
• Many, many more
9. DRIVING NEW TRAFFIC MID-TEST
SOURCE: OPTIMIZELY
“When you change a variation’s traffic allocation mid-experiment, all new users will be allocated accordingly from then on. However, all users that entered your experiment before the change will be bucketed into the same variation they entered previously, altering the results and making it difficult to interpret the conversion rate.”
10. CHANGING TRAFFIC ALLOCATION
WHY DOES THIS HAPPEN?
• Your company is risk-averse
• The executives feel like this will speed up the test (it doesn’t)
• Someone gets antsy (or excited) when results aren’t behaving the way they expected
• Etc.
WHAT’S THE RISK?
Risk of changing allocation mid-test:
N = 1M visitors per day
Friday: treatment performed better
Saturday: treatment performed better
Combined: treatment appears to have performed worse
Simpson’s Paradox
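Simpson's Paradox is easy to reproduce numerically. In this sketch the counts are hypothetical (the talk only gives N = 1M/day): treatment allocation is raised from 1% on a high-converting Friday to 50% on a low-converting Saturday, so the treatment wins each day yet loses overall because its traffic is concentrated on the weak day:

```python
# Hypothetical (visitors, conversions) counts; treatment allocation was
# raised from 1% on Friday to 50% on Saturday mid-test.
friday   = {"control": (990_000, 49_500), "treatment": (10_000, 550)}
saturday = {"control": (500_000, 10_000), "treatment": (500_000, 12_500)}

def rate(visitors, conversions):
    return conversions / visitors

for day, data in (("Friday", friday), ("Saturday", saturday)):
    c, t = rate(*data["control"]), rate(*data["treatment"])
    print(f"{day}: control {c:.1%}, treatment {t:.1%}")  # treatment wins both days

# Sum visitors and conversions across days for each arm.
combined = {
    arm: tuple(map(sum, zip(friday[arm], saturday[arm])))
    for arm in ("control", "treatment")
}
c, t = rate(*combined["control"]), rate(*combined["treatment"])
print(f"Combined: control {c:.1%}, treatment {t:.1%}")  # control wins overall
```

The per-day comparisons are fair (same day, both arms), but the combined comparison mixes unequal day weights per arm, which is exactly the distortion that changing allocation mid-test introduces.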
TIME SEGMENTS
If daily traffic isn’t representative of all traffic, you need to reach statistical significance across ‘time segment rotations,’ not necessarily days
Think of your traditional behavior cycles as ‘time segments’
13. IGNORING TEMPORAL FLUCTUATIONS
SEGMENTATION GROUPS
• Buyer modalities
• Gender
• Age
• Region
• Social groupings
• Etc.
14. ASSUMING SOMETHING WORKS FOR EVERYONE
STRATEGIES
• Find allies in your organization: site owners, project managers, key stakeholders
• Understand product development cycles and request queue structures
15. NOT IMPLEMENTING POSITIVE RESULTS