5. AB Test Hype Cycle
Zen Plumbing
@OptimiseOrDie
[Hype cycle chart, x-axis: Timeline. Labels: Discovered AB testing; Tested stupid ideas, lots; Most AB or MVT tests are bullshit; Triage, Triangulation, Prioritisation, Maths]
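6. Craig’s Cynical Quadrant
• Improves revenue: Yes, Improves UX: No – Client delighted (and fires you for another agency)
• Improves revenue: Yes, Improves UX: Yes – Client fucking delighted
• Improves revenue: No, Improves UX: No – Client absolutely fucking furious
• Improves revenue: No, Improves UX: Yes – Client fires you (then wins an award for your work)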
7. #1 : You’re doing it in the wrong place
8. #1 : You’re doing it in the wrong place
There are 4 areas a CRO expert always looks at:
1. Inbound attrition (medium, source, landing page,
keyword, intent and many more…)
2. Key conversion points (product, basket, registration)
3. Processes, lifecycles and steps (forms, logins,
registration, checkout, onboarding, emails, push)
4. Layers of engagement (search, category, product, add)
1. Use visitor flow reports for attrition – very useful.
2. For key conversion points, look at loss rates &
interactions
3. Processes and steps – look at funnels or make your own
4. Layers and engagement – make a ring model
12. Within a layer
[Diagram: Page 1 to Page 5 within a layer, with routes to Exit, to a Deeper Layer, and to Micro Conversions (Email, Wishlist, Contact, Like)]
13. #1 : Make a Money Model
• Get to know the flow and loss (leaks) inbound, inside and
through key processes or conversion points.
• Once you know the key steps you’re losing people at and how
much traffic you have – make a money model.
• 20,000 see the basket page – what’s the basket page to
checkout page ratio?
• Estimate how much you think you can shift the key metric
(e.g. basket adds, basket -> checkout)
• What downstream revenue or profit would that generate?
• Sort by the money column
• Congratulations – you’ve now built the world’s first IT plan for
growth with a return on investment estimate attached!
• I’ll talk more about prioritising later – but a good real world
analogy for you to use:
14. Think like a store owner!
If you can’t refurbish the entire store, which floors or departments will you invest in optimising?
Wherever there is:
• Footfall
• Low return
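The money model from slide 13 can be sketched as a small table sorted by the money column. This is only an illustration under stated assumptions: the 20,000 basket page visitors come from the slide, but every rate, order value and step name below is made up for the example, and the revenue formula is one simple way to estimate downstream value, not the only one.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    monthly_visitors: int         # traffic reaching this step
    current_rate: float           # share that currently continues to the next step
    estimated_rate: float         # your estimate of the rate after optimisation
    downstream_conversion: float  # share of those continuing who end up buying
    average_order_value: float    # hypothetical revenue per order

    def extra_revenue(self) -> float:
        """Estimated extra monthly revenue if the key metric shifts as guessed."""
        extra_progressions = self.monthly_visitors * (self.estimated_rate - self.current_rate)
        return extra_progressions * self.downstream_conversion * self.average_order_value


# All numbers except the 20,000 basket visitors are hypothetical.
steps = [
    Step("product -> basket add", 80_000, 0.10, 0.11, 0.20, 60.0),
    Step("basket -> checkout", 20_000, 0.40, 0.44, 0.50, 60.0),
    Step("checkout -> payment", 8_000, 0.50, 0.53, 1.00, 60.0),
]

# "Sort by the money column"
money_model = sorted(steps, key=Step.extra_revenue, reverse=True)

for step in money_model:
    print(f"{step.name:<24} {step.extra_revenue():>10,.0f}")
```

The top row is where the store owner would refurbish first: plenty of footfall and a return that is low enough to improve.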
15. #2 : Your hypothesis is crap!
Insight - Inputs
#FAIL
• Competitor copying
• Guessing
• Dice rolling
• Panic
• Competitor change
• An article the CEO read
• Ego
• Opinion
• Cherished notions
• Marketing whims
• Cosmic rays
• Not ‘on brand’ enough
• IT inflexibility
• Internal company needs
• Some dumbass consultant
• Shiny feature blindness
• Knee jerk reactions
16. #2 : These are the inputs you need…
Insight - Inputs
• Eye tracking
• Segmentation
• Surveys
• Sales and Call Centre
• Customer contact
• Social analytics
• Session Replay
• Usability testing
• Forms analytics
• Search analytics
• Voice of Customer
• Market research
• A/B and MVT testing
• Big & unstructured data
• Web analytics
• Competitor evals
• Customer services
17. #2 : Brainstorming the test
• Check your inputs
• Assemble the widest possible team
• Share your data and research
• Design Emotive Writing guidelines
18. #2 : Emotive Writing - example
Customers do not know what to do and need support and advice
• Emphasize the fact that you understand that their situation is stressful
• Emphasize your expertise and leadership in vehicle glazing and that you will help them get the best solution for their situation
• Explain what they will need to do online and during the call-back so that they know what the next steps will be
• Explain that they will be able to ask any other questions they might have during the call-back
Customers do not feel confident in assessing the damage
• Emphasize the fact that you will help them assess the damage correctly online
Customers need to understand the benefits of booking online
• Emphasize that the online booking system is quick, easy and provides all the information they need regarding their appointment and general costs
Customers mistrust insurers and find dealing with their insurance situation very frustrating
• Where possible communicate the fact that the job is most likely to be free for insured customers, or good value for money for cash customers
• Show that you understand the hassle of dealing with insurance companies – emphasise that you will handle their insurance paperwork for them, freeing them of this burden
Some customers cannot be bothered to take action to fix their car glass
• Emphasize the consequences of not doing anything, e.g. ‘It’s going to cost you more if the chip develops into a crack’
19. #2 : THE DARK SIDE
“Keep your family safe and get back on the road fast with Autoglass.”
20. #2 : NOW YOU CAN BEGIN
• You should have inputs, research, data, guidelines
• Sit down with the team and prompt with 12 questions:
– Who is this page (or process) for?
– What problem does this solve for the user?
– How do we know they need it?
– What is the primary action we want people to take?
– What might prompt the user to take this action?
– How will we know if this is doing what we want it to do?
– How do people get to this page?
– How long are people here on this page?
– What can we remove from this page?
– How can we test this solution with people?
– How are we solving the users’ needs in different and better ways than other
places on our site?
– If this is a homepage, ask these too (bit.ly/1fX2RAa)
21. #2 : PROMPT YOURSELF
• Check your UX or Copywriting guidelines.
• Use Get Mental Notes (www.GetMentalNotes.com)
• What levers can we apply now?
• Create a hypothesis:
“WE BELIEVE THAT DOING [A] FOR PEOPLE [B] WILL MAKE OUTCOME [C] HAPPEN. WE'LL KNOW THIS WHEN WE SEE DATA [D] AND FEEDBACK [E]”
22. #2 : THE FUN BIT!
• Collaborative Sketching
• Brainwriting
• Refine and Test!
23. We believe that doing [A] for People [B] will make outcome [C] happen.
We’ll know this when we observe data [D] and obtain feedback [E]. (reverse)
24. #2 : Solutions
• You need multiple tool inputs
– Tool decks are here : www.slideshare.net/sullivac
• Collaborative, Customer connected team
– If you’re not doing this, you’re hosed
• Session replay tools provide vital input
– Get vital additional customer evidence
• Simple page Analytics don’t cut it
– Invest in your analytics, especially event
tracking
• Ego, Opinion, Cherished notions – fill gaps
– Fill these vacuums with insights and data
• Champion the user
– Give them a chair at every meeting
25. #2 : HYPOTHESIS DESIGN SUMMARY
• Inputs – get the right stuff
• Research, Guidelines, Data
• Framing the problem(s)
• Questions to get you going
• Use card prompts for Psychology
• Create a hypothesis
• Collaborative Sketching
• Brainwriting
• Refine and Check Hypothesis
• Instrument and Test
27. #3 : No analytics integration
• Investigating problems with tests
• Segmentation of results
• Tests that fail, flip or move around
• Tests that don’t make sense
• Broken test setups
• What drives the averages you see?
29. #4 : The test will finish after you die
[Cartoon caption: “These Danish porn sites are so hardcore! We’re still waiting for our AB tests to finish!”]
• Use a test length calculator like this one:
• visualwebsiteoptimizer.com/ab-split-test-duration/
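To get a feel for what a test length calculator does, here is a rough sketch. It uses the textbook normal-approximation sample size for comparing two proportions; that is an assumption for illustration and not necessarily the formula behind the calculator linked above. The function names, the 5% significance level and the 80% power are also assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Visitors needed in each variant to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def duration_days(baseline_rate, relative_lift, variants, daily_visitors):
    """Days of traffic needed when visitors are split evenly across variants."""
    per_variant = sample_size_per_variant(baseline_rate, relative_lift)
    return ceil(per_variant * variants / daily_visitors)


# Hypothetical page: 5% conversion, 1,000 visitors a day, 2 variants.
for lift in (0.05, 0.10, 0.20):
    print(f"lift {lift:.0%}: {duration_days(0.05, lift, 2, 1_000)} days")
```

Halving the lift you hope to detect roughly quadruples the traffic needed, which is how a timid test on a low traffic page ends up finishing after you die.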
30. #5 : You don’t test for long enough
• The minimum length
– 2 business cycles (so you can cross check)
– Usually a week, 2 weeks, Month
– Always test ‘whole’ not partial cycles
– Be aware of multiple cycles
– Don’t self stop!
– PURCHASE CYCLES – KNOW THEM
31. Business & Purchase Cycles
[Chart labels: Start Test, Finish, Avg Cycle]
• Customers change
• Your traffic mix changes
• Markets, competitors
• Be aware of all the waves
• Always test whole cycles
• Minimum 2 cycles (wk/mo)
• Don’t exclude slower buyers
32. #5 : You don’t test for long enough
• How long after that
– I aim for a minimum 250 outcomes, ideally 350+ for each ‘creative’
– If you test 4 recipes, that’s 1400 outcomes needed
– You should have worked out how long each batch of 350 needs
before you start!
– 95% confidence is the cherry – not the cake - BUT BIG SECRET -> (p
values are unreliable)
– If you segment, you’ll need more data
– It may need a bigger sample if the response rates are similar*
– Use a test length calculator but be aware of BARE MINIMUM TO
EXPECT
– Important insider tip – watch the error bars! The +/- stuff
* Stats geeks know I’m glossing over something here: test time depends
on how far the two experiments separate in relative performance as
well as how volatile the test response is. I’ll talk about this when I record this
one! This is why testing similar stuff sux.
33. #5 : You put faith in the Confidence value
95%, 99%, 99.99%
‘Confidence’ or ‘Chance to beat baseline’ – what’s that?
• It’s a stats thing
• Seriously, look at this one LAST in your testing
• Purchase Cycle, Business Cycles, Sample Size, Error bar
separation – ALL come before this one. Got it?
• Why? It’s to do with p-values. Read this article:
• http://bit.ly/1gq9dtd
• If you rely on confidence, you are relying upon
something that’s unreliable and moves around,
particularly early in testing.
• Don’t be fooled by your testing package – watch the
error bars instead of confidence.
34. #5 : The tennis court
– Let’s say we want to estimate, on average, what height Roger Federer
and Nadal hit the ball over the net at. So, let’s start the match:
35. First Set Federer 6-4
– We start to collect values
62cm +/- 2cm    63.5cm +/- 2cm
36. Second Set – Nadal 7-6
– Nadal starts sending them low over the net
62cm +/- 1cm    62.5cm +/- 1cm
37. Final Set Nadal 7-6
– We start to collect values
61.8cm +/- 0.3cm    62cm +/- 0.3cm
38. Let’s look at this a different way
62.5cm +/- 1cm
9.1% ± 0.3    9.3% ± 0.3
40. Graph is a range, not a line:
9.1 ± 1.9% 9.1 ± 0.9% 9.1 ± 0.3%
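The shrinking ± values can be reproduced approximately with a few lines. This sketch uses the simple normal-approximation (Wald) interval at 95%, which is an assumption: testing packages may compute their error bars differently. The visitor counts are invented to give a 9.1% rate at three sample sizes.

```python
from math import sqrt
from statistics import NormalDist


def error_bar(conversions, visitors, confidence=0.95):
    """Return (rate, margin) so the estimate reads as rate ± margin."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate, margin


def intervals_overlap(a, b):
    """True when two (rate, margin) ranges share any ground."""
    (rate_a, margin_a), (rate_b, margin_b) = a, b
    return abs(rate_a - rate_b) < margin_a + margin_b


# Hypothetical samples, all converting at 9.1%.
small = error_bar(91, 1_000)
medium = error_bar(364, 4_000)
large = error_bar(3_185, 35_000)
for rate, margin in (small, medium, large):
    print(f"{rate:.1%} ± {margin:.1%}")

# A rival creative at 9.3% on the same large sample still overlaps.
rival = error_bar(3_255, 35_000)
print("overlap:", intervals_overlap(large, rival))
```

Even with 35,000 visitors per creative, 9.1% and 9.3% are not clearly separated, which is the point of watching the range rather than the line.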
41. #5 : How long to test?
• The minimum length:
– 2 business cycles and > purchase cycle as a minimum, regardless of
outcomes. Test for less and you’re biasing the sample.
– ALWAYS ALWAYS TEST WHOLE CYCLES.
– 250 ABSOLUTE MINIMUM FOR ANY SAMPLE, 350+ nicer, 1000 sweet!
– Error bar separation (or minimal overlap) between creatives
– Ignore 95%+ confidence (it’s unreliable)
– Use a test calculator (VWO have a nice one).
– Work out your ‘test units’ – how long to get 350 outcomes for each
creative in your test.
– This is a minimum you should expect but sample size (or overlap) may
mean you need longer
– When to stop?
42. #5 : When to stop
• Self stopping is a huge problem:
– “I stopped the test when it looked good”
– “It hit 20% on Thursday, so I figured – time to cut and run”
– “We need test time for something else. Looks good to us”
– “We’ve got a big sample now so why not finish it today?”
• False Positives and Negatives
– If you cut part of a business cycle, you bias the segments you have in
the test.
– So if you ignore weekend shoppers by stopping your test on Friday, that
will affect results
– The other problem is FALSE POSITIVES and FALSE NEGATIVES
43. #5 : When to stop
Checkpoint               Scenario 1     Scenario 2     Scenario 3     Scenario 4
After 200 observations   Insignificant  Insignificant  Significant!   Significant!
After 500 observations   Insignificant  Significant!   Insignificant  Significant!
End of experiment        Insignificant  Significant!   Insignificant  Significant!

Checkpoint               Scenario 1     Scenario 2     Scenario 3     Scenario 4
After 200 observations   Insignificant  Insignificant  Significant!   Significant!
After 500 observations   Insignificant  Significant!   trial stopped  trial stopped
End of experiment        Insignificant  Significant!   Significant!   Significant!
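Scenario 3 is the dangerous one: a test that would have ended insignificant gets declared a winner because somebody stopped it at an early peek. A small simulation makes the effect visible. This is a sketch under assumptions: both arms have an identical 10% conversion rate (so every "winner" is a false positive), significance is a pooled two-sided z-test at 95%, and the peek interval, run count and seed are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 95% threshold


def is_significant(conv_a, conv_b, n):
    """Pooled z-test for two arms that have each seen n visitors."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False
    standard_error = sqrt(2 * pooled * (1 - pooled) / n)
    return abs(conv_a - conv_b) / n / standard_error > Z_CRIT


def false_positive_rates(runs=300, visitors=2_000, peek_every=100, rate=0.10, seed=1):
    """Share of A/A tests called 'significant' when peeking vs. waiting."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        seen_significant = False
        for n in range(1, visitors + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % peek_every == 0 and is_significant(conv_a, conv_b, n):
                seen_significant = True  # a self-stopper would have quit here
        peeking_hits += seen_significant
        fixed_hits += is_significant(conv_a, conv_b, visitors)
    return peeking_hits / runs, fixed_hits / runs


peeking, fixed = false_positive_rates()
print(f"stop at first significant peek: {peeking:.0%} false positives")
print(f"analyse once at the end:        {fixed:.0%} false positives")
```

Because the final check is also one of the peeks, the peeking rate can never be lower than the fixed-horizon rate; in practice it tends to be several times higher.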
44. #5 : When to stop
• So – what to do?
• Run a test calculator
• Set the test time to hit the highest of the minimums
• What minimums do you mean?
– Minimum sample (250, 350, higher)
– Business cycles (2+)
– Purchase cycles (1 or 2+)
– What your test calculator says
• The longest one is how long it’s gonna take.
• Set the test time
• Run the test
• Stop the test at the end, on a whole cycle
• Analyse
• That’s it!
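The "highest of the minimums" rule is simple arithmetic, sketched here. The function name, parameter names and the example inputs are assumptions for illustration; the 350 outcome and 2 business cycle minimums come from the slides.

```python
from math import ceil


def planned_test_days(daily_outcomes_per_recipe, business_cycle_days,
                      purchase_cycle_days, calculator_days,
                      min_outcomes=350, min_business_cycles=2):
    """Take the longest of the minimums, then round up to whole business cycles."""
    days_for_sample = ceil(min_outcomes / daily_outcomes_per_recipe)
    minimums = [
        days_for_sample,                            # minimum sample
        min_business_cycles * business_cycle_days,  # business cycles
        purchase_cycle_days,                        # purchase cycle
        calculator_days,                            # what the test calculator says
    ]
    longest = max(minimums)
    return ceil(longest / business_cycle_days) * business_cycle_days


# Hypothetical site: 20 outcomes a day per recipe, weekly business cycle,
# 10 day purchase cycle, and a calculator estimate of 16 days.
print(planned_test_days(20, 7, 10, 16))
```

Here the sample minimum (18 days) is the longest, and rounding up to whole weeks gives a 21 day test that stops at the end of a cycle.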
45. #6 : The early stages of a test…
• Ignore the graphs. Don’t draw conclusions. Don’t dance. Calm down.
• Get a feel for the test but don’t do anything yet!
• Remember – in A/B - 50% of returning visitors will see a new shiny website!
• Until your test has had at least 2 business cycles and 250+ outcomes, don’t bother
even getting remotely excited!
• Watching regularly is good though. You’re looking for anything that looks really
odd – if everyone is looking (but not concluding) then oddities will get spotted.
• All tests move around or show big swings early in the testing cycle. Here is a very
high traffic site – it still takes 10 days to start settling. Lower traffic sites will
stretch this period further.
47. #7 – BIG SECRET!
• Over 40% of tests have had QA issues.
• It’s very easy to break or bias the testing
• Browser testing: www.crossbrowsertesting.com, www.browserstack.com, www.spoon.net, www.cloudtesting.com, www.multibrowserviewer.com, www.saucelabs.com
• Mobile devices: www.deviceanywhere.com, www.perfectomobile.com, www.opendevicelab.com
48. #7 : What other QA testing should I do?
• Testing from several locations (office, home, elsewhere)
• Testing the IP filtering is set up
• Test tags are firing correctly (analytics and the test tool)
• Test as a repeat visitor and check session timeouts
• Cross check figures from 2+ sources
• Monitor closely from launch, recheck, watch
• WATCH FOR BIAS!
49. #8 : Tests are random and not prioritised
Once you have a list of potential test areas, rank them by opportunity vs. effort.
The common ranking metrics that I use include:
• Opportunity (revenue, impact)
• Dev resource
• Time to market
• Risk / Complexity
Make yourself a quadrant
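One way to turn those ranking metrics into a quadrant is sketched below. The 1 to 5 scoring scale, the way effort is summed, the cut-off values and the quadrant labels are all assumptions made for this example, as are the test ideas themselves.

```python
from typing import NamedTuple


class Idea(NamedTuple):
    name: str
    opportunity: int     # 1 (small) to 5 (large revenue or impact)
    dev_resource: int    # 1 (cheap) to 5 (expensive)
    time_to_market: int  # 1 (fast) to 5 (slow)
    risk: int            # 1 (simple) to 5 (risky or complex)

    @property
    def effort(self) -> int:
        return self.dev_resource + self.time_to_market + self.risk


def quadrant(idea, opportunity_cut=3, effort_cut=8):
    """Place an idea in one of four hypothetical quadrants."""
    high_opportunity = idea.opportunity >= opportunity_cut
    low_effort = idea.effort <= effort_cut
    if high_opportunity and low_effort:
        return "do first"
    if high_opportunity:
        return "plan carefully"
    if low_effort:
        return "fill-in"
    return "park"


ideas = [
    Idea("New recommendation engine", 2, 5, 5, 4),
    Idea("Basket page trust messaging", 5, 1, 1, 1),
    Idea("Footer link colour", 1, 1, 1, 1),
    Idea("Checkout redesign", 5, 5, 4, 4),
]

# Rank by opportunity per unit of effort.
ranked = sorted(ideas, key=lambda i: i.opportunity / i.effort, reverse=True)
for idea in ranked:
    print(f"{idea.name:<30} {quadrant(idea)}")
```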
50. #9 : Your cycles are too slow
[Chart: Conversion against Months (0, 6, 12, 18)]
51. #9 : Solutions
• Give Priority Boarding for opportunities
– The best seats reserved for metric shifters
• Release more often to close the gap
– More testing resource helps, analytics ‘hawk eye’
• Kaizen – continuous improvement
– Others call it JFDI (just f***ing do it)
• Make changes AS WELL as tests, basically!
– These small things add up
• RUSH Hair booking – Over 100 changes
– No functional changes at all – 37% improvement
• In between product lifecycles?
– The added lift for 10 days work, worth 360k
53. #10 : How do I know when it’s ready?
• The hallmarks of a cooked test are:
– It’s done at least 1 or preferably 2+ business cycles, and at least one if
not two purchase cycles
– You have at least 250-350 outcomes for each recipe
– It’s not moving around hugely at creative or segment level
performance
– The test results are clear – even if the precise values are not
– The intervals are not overlapping (much)
– If a test is still moving around, you need to investigate
– FIND OUT WHAT MARKETING ARE DOING
– FIND OUT WHAT EVERYONE IS DOING
– Be careful about limited time period campaigns (e.g. TV, print,
online)
– If you know when TV (or other big campaigns) are running, try
one week with TV and one without during tests – very
interesting.
55. #11: Your test fails
• Learn from the failure! If you can’t learn from the failure, you’ve
designed a crap test.
• Next time you design, imagine all your stuff failing. What would
you do? If you don’t know or you’re not sure, get it changed so
that a negative becomes insightful.
• So : failure itself at a creative or variable level should tell you
something.
• On a failed test, always analyse the segmentation and analytics
• One or more segments will be over and under
• Check for varied performance
• Now add the failure info to your Knowledge Base:
• Look at it carefully – what does the failure tell you? Which
element do you think drove the failure?
• If you know what failed (e.g. making the price bigger) then you
have very useful information
• You turned the handle the wrong way
• Now brainstorm a new test
56. #12 : The test is ‘about the same’
• Analyse the segmentation
• Check the analytics and instrumentation
• One or more segments may be over and under
• They may be cancelling out – the average is a lie
• The segment level performance will help you (beware of
small sample sizes)
• If you genuinely have a test which failed to move any
segments, it’s a crap test – be bolder
• This usually happens when it isn’t bold or brave enough in
shifting away from the original design, particularly on
lower traffic sites
• Get testing again!
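A tiny worked example of segments cancelling out. The numbers are invented so that the two variants look identical overall while desktop and mobile move in opposite directions.

```python
# (visitors, conversions) per variant, per segment; all figures hypothetical.
segments = {
    "desktop": {"A": (10_000, 500), "B": (10_000, 560)},
    "mobile": {"A": (10_000, 300), "B": (10_000, 240)},
}


def overall_rate(variant):
    """Conversion rate across all segments combined."""
    visitors = sum(seg[variant][0] for seg in segments.values())
    conversions = sum(seg[variant][1] for seg in segments.values())
    return conversions / visitors


def relative_lift(segment):
    """Lift of B over A inside one segment."""
    visitors_a, conversions_a = segments[segment]["A"]
    visitors_b, conversions_b = segments[segment]["B"]
    return (conversions_b / visitors_b) / (conversions_a / visitors_a) - 1


print(f"overall A {overall_rate('A'):.1%} vs B {overall_rate('B'):.1%}")
for name in segments:
    print(f"{name}: {relative_lift(name):+.0%}")
```

Overall both variants convert at 4.0%, yet B is clearly better on desktop and clearly worse on mobile. With real data, check each segment has enough outcomes before trusting the split.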
57. #13 : The test keeps moving around
• There are three reasons it is moving around
– Your sample size (outcomes) is still too small
– The external traffic mix, customers or reaction has
suddenly changed or
– Your inbound marketing driven traffic mix is
completely volatile (very rare)
• Check the sample size
• Check all your marketing activity
• Check the instrumentation
• If no reason, check segmentation
58. #14 : The test has flipped on me
• Something like this can happen:
• Check your sample size. If it’s still small, then expect this until the test
settles.
• If the test does genuinely flip – and quite severely – then something has
changed with the traffic mix, the customer base or your advertising.
Maybe the PPC budget ran out? Seriously!
• To analyse a flipped test, you’ll need to check your segmented data. This
is why you have a split testing package AND an analytics system.
• The segmented data will help you to identify the source of the shift in
response to your test. I rarely get a flipped one and it’s always something
59. #15 : Should I run an A/A test first?
• No – and this is why:
– It’s a waste of time
– It’s easier to test and monitor instead
– You are eating into test time
– Also applies to A/A/B/B testing
– A/B/A running at 25%/50%/25% is the best
• Read my post here: http://bit.ly/WcI9EZ
60. #16 : Nobody feels the test
• You promised a 25% rise in checkouts - you only see 2%
• Traffic, Advertising, Marketing may have changed
• Check they’re using the same precise metrics
• Run a calibration exercise
• I often leave a 5 or 10% stub running in a test
• This tracks old creative once new one goes live
• If conversion is also down for that one, BINGO!
• Remember – the AB test is an estimate – it doesn’t
precisely record future performance
• This is why infrequent testing is bad
• Always be trying a new test instead of basking in the
glory of one you ran 6 months ago. You’re only as good
as your next test.
61. #17 : You forgot about Mobile & Tablet
• If you’re AB testing a responsive site, pay attention
• Content will break differently on many screens
• Know thy users and their devices
• Use Bango or Google Analytics to define a test list
• Make sure you test mobile devices & viewports
• What looks good on your desk may not be for the user
• Harder to design cross device tests
• You’ll need to segment mobile, tablet & desktop response
in the analytics or AB testing package
• Your personal phone is not a device mix
• Ask me about making your device list
• Buy core devices, rent the rest from deviceanywhere.com
62. #18 : Oh shit – no traffic
• If small volumes, contact customers – reach out.
• If data volumes aren’t there, there are still customers!
• Drive design from levers you can apply – game the system
• Pick clean and simple clusters of change (hypothesis driven)
• Use a goal at an earlier ring stage or funnel step
• Beware of using clickthroughs when attrition is high on the
other side
• Try before and after testing on identical time periods
(measure in analytics model)
• Be careful about small sample sizes (<100 outcomes)
• Are you working automated emails?
• Fix JFDI, performance and UX issues too!
63. #18 : Oh shit – no traffic
• Forget MVT or A/B/N tests – run your numbers
• Test things with high impact – don’t be a wuss!
• Use UX, Session Replay to aid insight
• Run a task gap survey (4Q style)
• Run a dropped basket survey (LF style)
• Run a general survey + check social + other sites
• Run sitewide tests that appear on all pages or large clusters
of pages –
• UVPs (“We are a cool brand”), USPs (“Free returns!”), UCPs
(“10% off today”).
• Headers, Footers, Nudge Bars, USP bars, footer changes,
Navigation, Product pages, Delivery info etc.
64. #19 : I chose the wrong test type
• A/B testing – good for:
– A single change of content or design layout
– A group of related changes (e.g. payment security)
– Finding a new and radical shift for a template design
– Lower traffic pages or shorter test times
• Multivariate testing – good for:
– Higher traffic pages
– Groups of unrelated changes (e.g. delivery & security)
– Multiple content or design style changes
– Finding specific drivers of test lifts
– Testing multiple versions (e.g. click here, book now, go)
– Where you need to understand strong and weak cross variable
interactions
– Don’t use to settle arguments or sloppy thinking!
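Why multivariate testing suits higher traffic pages becomes obvious once you count recipes. The sketch below assumes a full-factorial MVT (every combination gets traffic) and the 350 outcomes per recipe guideline from earlier; the factor names and the daily outcome figure are hypothetical, while the three button texts come from the slide.

```python
from itertools import product
from math import ceil

# Hypothetical MVT: three unrelated elements being varied at once.
factors = {
    "button text": ["click here", "book now", "go"],
    "delivery message": ["none", "free delivery"],
    "security badge": ["none", "padlock"],
}

# Full factorial: every combination of every level is one recipe.
recipes = list(product(*factors.values()))


def days_needed(recipe_count, daily_outcomes, outcomes_per_recipe=350):
    """Days to collect the minimum outcomes for every recipe."""
    return ceil(recipe_count * outcomes_per_recipe / daily_outcomes)


daily_outcomes = 100  # assumed total conversions per day on this page
print(f"MVT: {len(recipes)} recipes, {days_needed(len(recipes), daily_outcomes)} days")
print(f"A/B: 2 recipes, {days_needed(2, daily_outcomes)} days")
```

With these assumptions the MVT needs six times the test time of a plain A/B test, before rounding up to whole business cycles.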
66. #20 – Other flavours of testing
• Micro testing (tiny change) – good for:
– Proving to the boss that testing works
– Demonstrating to IT that it works without impact
– Showing the impact of a seemingly tiny change
– Proof of concept before larger test
• Funnel testing – good for:
– Checkouts
– Lead gen
– Forms processes
– Quotations
– Any multi-step process with data entry
• Fake it and Build it – good for:
– Testing new business ideas
– Trying out promotions on a test sample
– Estimating impact before you build
– Helps you calculate ROI
– You can even split test entire server farms
67. #20 – Other flavours of testing
“Congratulations! Today you’re the lucky winner of our random awards programme. You get all these extra features for free, on us. Enjoy.”
68. Top F***ups for 2014
1. Testing in the wrong place
2. Your hypothesis inputs are crap
3. No analytics integration
4. Your test will finish after you die
5. You don’t test for long enough
6. You peek before it’s ready
7. No QA for your split test
8. Opportunities are not prioritised
9. Testing cycles are too slow
10. You don’t know when tests are ready
11. Your test fails
12. The test is ‘about the same’
13. Test keeps moving around
14. Test flips behaviour
15. You run an A/A test and waste time
16. Nobody ‘feels’ the test
17. You forgot you were responsive
18. You forgot you had no traffic
19. You ran the wrong test type
20. You didn’t try all the flavours of testing
80. #4 : GREAT COPYWRITING
“On the average, five times as many people read the headline as read the body copy. When you have written your headline, you have spent eighty cents out of your dollar.”
David Ogilvy
“In 9 years and 40M split tests with visitors, the majority of my testing success came from playing with the words.”
81. The 5 Legged Optimisation Barstool
#1 Culture & Team
#2 UX, CX, Service Design, Insight
#3 Toolkit & Analytics investment
#4 Persuasive Copywriting
#5 Experimentation tools & process
86. The Best Companies…
• Invest continually in analytics instrumentation, tools, people
• Use an Agile, iterative, cross-silo, one team project culture
• Prefer collaborative tools to having lots of meetings
• Prioritise development based on numbers and insight
• Practice real continuous product improvement, not SLEDD*
• Are fixing bugs, cruft, bad stuff as well as optimising
• Source photos and content that support persuasion and utility
• Have cross channel, cross device design, testing and QA
• Segment their data for valuable insights, every test or change
• Continually reduce cycle (iteration) time in their process
• Blend ‘long’ design, continuous improvement AND split tests
• Make optimisation the engine of change, not the slave of ego
* Single Large Expensive Doomed Developments
87. THE FUTURE OF TESTING –
CONDUCTRICS.COM
slidesha.re/1ivS68s