A/B Testing for Everyone

Pavel Dmitriev
Some slides taken from talks by Ronny Kohavi

About Me
• B.S. Applied Math @ Moscow State University, Russia
• Ph.D. Computer Science @ Cornell University, focused on applied Machine Learning
• 3 years @ Yahoo!, worked on web crawling and indexing optimization
• 8 years @ Microsoft, worked on experimentation in Bing, MSN, O365, Skype, Windows
• 5 months @ Outreach, working on experimentation, ML, NLP

Outline
• Intro to A/B testing
• Examples of real experiments
• Experimentation adoption across industries
• Five challenges preventing faster adoption, in Sales and in Software

The Life of a Great Idea – True Bing Story
Control – Existing Display
Treatment – new idea called Long Ad Titles

The Life of a Great Idea
• It was one of hundreds of ideas on the table, and it seemed … meh …
• It stayed in the backlog for months (slide timeline: Feb, March, April, May, June)
• Many features were ranked above it, and it was clear the idea was not going to make it any time soon
• The engineer thought it was trivial to implement. He implemented it and started an A/B test.
• Immediately an alert fired: revenue was abnormally high (which usually indicates a bug)
• But in this case there was no bug. The idea increased Bing’s revenue by 12% (over $100M/year), without hurting user experience metrics!

We are bad at assessing the value of ideas
• The best revenue-generating idea in Bing history was badly rated and delayed for months!
At Microsoft, we ran a study in Bing and found that only ~1/3 of ideas developed were actually good for users and business, ~1/3 were neutral, and ~1/3 were bad
• Only in Software Engineering?
In Sales, contradictory “best practices” are abundant. For example, best day to contact the prospect is …
In Medicine, correctly evaluating an idea, e.g. a new drug, is a matter of life and death. The FDA and EMA do not trust expert opinions and mandate the use of Randomized Controlled Trials
We can’t trust our gut! To make the right choices we need data from real users!

Collecting Usage Data
• Companies have always been collecting data to learn what their users appear to
value
Interviews, focus groups, questionnaires, and other similar techniques are great at revealing
what users say they do
Although rich with qualitative information, the learnings from these techniques are typically
based on small samples and risk being biased, making it hard to generalize
• With the internet connectivity of the products, companies can collect feedback data
to learn what their customers actually value
Telemetry and logging reveal what the customers actually do

Use Data Correctly - Correlation is not Causation
• Seattle is known for its rain
• Whenever I see people on the street carrying
umbrellas, very soon it starts raining
• I may conclude that umbrellas cause the rain,
and decide to ban them
• Banning umbrellas, however, won’t stop the
rain; it will just make everyone more wet
Photo by Mike Waller, taken from Flickr
Relying on correlations isn’t just neutral, it’s often harmful to the business!

Correlation is not Causation – Real Example
• You observe the churn rates for users using/not-using your feature:
25% of new users who do NOT use your feature churn (stop using the product 30 days later); only 10% of new users who use your feature churn
• [Wrong] Conclusion: your feature reduces churn and is thus critical for retention
Flaw: The relationship between the feature and retention is correlational; the data above is insufficient for any causal conclusion
• Example: Users who see error messages in Office 365 churn less.
This does NOT mean we should show more error messages. They are just heavier users of Office 365
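To see how a feature with zero causal effect can still look protective, here is a minimal simulation sketch. All numbers are assumptions made up for illustration (share of heavy users, churn rates, feature adoption rates); they are not the Office 365 data. Churn depends only on whether a user is "heavy", yet the observational comparison shows a large gap while the randomized comparison shows roughly none.

```python
import random

def churn_gaps(n=100_000, seed=0):
    """Return (observational gap, randomized gap) in churn rate.

    Gap = churn rate without the feature minus churn rate with it.
    The feature itself has NO effect on churn in this simulation.
    """
    rng = random.Random(seed)
    # has_feature -> [number of users, number who churned]
    observed = {True: [0, 0], False: [0, 0]}
    randomized = {True: [0, 0], False: [0, 0]}
    for _ in range(n):
        heavy = rng.random() < 0.5                       # assumed: half are heavy users
        churned = rng.random() < (0.10 if heavy else 0.30)   # churn depends on usage only
        chose_feature = rng.random() < (0.8 if heavy else 0.2)  # heavy users find the feature
        assigned_feature = rng.random() < 0.5            # A/B test: coin flip
        for table, has_feature in ((observed, chose_feature),
                                   (randomized, assigned_feature)):
            table[has_feature][0] += 1
            table[has_feature][1] += churned

    def gap(table):
        without = table[False][1] / table[False][0]
        with_feature = table[True][1] / table[True][0]
        return without - with_feature

    return gap(observed), gap(randomized)

observational_gap, randomized_gap = churn_gaps()
print(f"observational churn gap: {observational_gap:.3f}")
print(f"randomized churn gap:    {randomized_gap:.3f}")
```

The observational gap comes entirely from who chooses to use the feature; random assignment removes that selection, which is the point of the slide.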

Using Data Correctly – Before and After
Flaw: This approach misses time-related factors such as external events, weekends, holidays, seasonality, etc.
[Chart: "Before and after example", Amazon Kindle Sales, Website A vs. Website B. A second copy of the chart adds the annotation: Oprah calls Kindle "her new favorite thing"]
The new site (B) is always worse than the original (A), the opposite of what observational data suggests
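The before/after trap can be reproduced with a few lines of arithmetic. This is a sketch with made-up daily numbers (not the real Kindle data): an external event doubles demand on the same day site B launches, and B is assumed to sell 10% less than A at any given demand level.

```python
# Hypothetical daily demand: an external event doubles it from day 5 onward.
daily_demand = [10] * 5 + [20] * 5
B_FACTOR = 0.9  # assumption: site B always sells 10% less than site A

# Before/after comparison: A runs on days 0-4, B replaces it on days 5-9.
sales_before_a = sum(daily_demand[:5])
sales_after_b = sum(d * B_FACTOR for d in daily_demand[5:])

# Concurrent A/B test: every day, traffic is split 50/50 between A and B.
ab_sales_a = sum(d / 2 for d in daily_demand)
ab_sales_b = sum(d / 2 * B_FACTOR for d in daily_demand)

print("before/after:", sales_before_a, "->", sales_after_b)   # B looks better
print("concurrent A/B:", ab_sales_a, "vs", ab_sales_b)         # B is worse
```

Because both variants in the concurrent test see the same external event, it cancels out of the comparison; in the before/after comparison it is attributed to the new site.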

A/B Tests in One Slide
• Other names: Controlled Experiments, Randomized Clinical Trials (RCTs)
• Can have more than two variants: A/B/C/etc. tests are common
• Must run statistical tests to confirm differences are not due to chance
A/B Tests are the best scientific way to prove causality!
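As an illustration of "must run statistical tests", here is a minimal sketch of a two-proportion z-test using only the Python standard library. It is one common choice for rate metrics such as reply rate or clickthrough rate, not necessarily the test used in the experiments in this talk; the function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one rate."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 100/1000 replies in A, 130/1000 replies in B.
z, p = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value says a difference this large would be unlikely if the variants were truly the same; it does not say how large the true effect is.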

Real Examples
• Three experiments
• Each had enough users for statistical validity
• For each experiment I’ll tell you the success metric
• Your job is to guess the result
Please stand up
You’ll choose between three options by raising your left hand, raising your right hand, or leaving both hands down
If you get it wrong, please sit down
• Since there are 3 choices for each question, random guessing implies 100%/3^3 ≈ 4% will get all three questions right. Let’s see how much better than random you can do.
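The random-guessing baseline is a one-line calculation, shown here exactly with fractions (the variable names are only for illustration):

```python
from fractions import Fraction

choices_per_question = 3
questions = 3

# Probability of guessing every question right by pure chance.
p_all_correct = Fraction(1, choices_per_question) ** questions
print(p_all_correct, f"= {float(p_all_correct):.1%}")  # 1/27 = 3.7%
```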

Example 1: Outreach Email (Step 9, Day 7)
• Success metric: Reply Rate
Left version:
Hey {{first_name}},
In short, we're a sales automation platform that makes your
reps life a lot easier. Our average companies (based on 1100+
companies) have tripled their reply rates on cold outbound
emails and boosted rep productivity by 2x.
We take what your best reps are doing and automate that
across your entire team so your weaker reps can work at the
highest possible same level. We also solve the issue of follow
up falling through the cracks and reps not going deep enough.
When can I get a few minutes on your calendar to discuss?
{{sender.first_name}}
Right version:
{{first_name}},
I'm sure in your role you get a ton of sales-driven emails, probably most of which are
spam you have no interest in. My goal is to provide enough value to warrant a 15 minute
call with you.
What we do is put your sales process into a structured series of touch points which takes
care of your follow-up process for you. This ramps up reps activities and ensures that every
lead is thoroughly worked, never gets lost and receives the 5 to 12 touches where 80% of
sales happen.
Second, we do all the administrative work for in your CRM (Salesforce). This frees up your
reps time, logs their activities, and gives you 100% accurate reporting.
Finally, we open up the "Black Box" of sales and show you in real time how each rep is
performing, what activities they're doing, and what is and isn't working. This provides a solid
foundation to accurately forecast results, improve your outreach and train your team.
Over 1100 companies (like CenturyLink, Adobe, and Marketo) use us and their average rep
saves 2 hrs a day, and 2X's their productivity.
If you see value here can we set up a time next Tuesday or Wednesday to discuss?
• Left: shorter, more “salesy”
• Right: longer, more “socially
• Raise your left hand if you think the Left version wins (stat-sig)
• Raise your right hand if you think the Right version wins (stat-sig)
• Don’t raise your hand if they are about the same (no stat-sig difference)

Example 1: Outreach Email (Step 9, Day 7)
• Left template has 70% higher reply rate… However, most replies are
negative or unsubscribe requests. The right template has higher positive
• If you did not raise your hand, sit down…
• If you raised your right hand, sit down…

Example 2: SERP Truncation
• SERP is a Search Engine Result Page
(shown on the right)
• Success Metric: Clickthrough Rate on first SERP
(ignore issues with click/back, page 2, etc.)
• Version A: show 10 algorithmic results
• Version B: show 8 algorithmic results by
removing the last two results (shown on the right)
• All else the same: task pane, ads, related
searches
• Raise your left hand if you think version A wins (10 results)
• Raise your right hand if you think version B wins (8 results)
• Don’t raise your hand if they are about the same

Example 2: SERP Truncation
• If you raised your left hand, sit down…
• If you raised your right hand, sit down…
• With over 3M users in each variant, we could not
detect a stat-sig delta. Users simply shifted the
clicks from the last two algorithmic results to
other elements of the page.
• Rule of Thumb: Shifting clicks is easy. Reducing
abandonment is hard.

Example 3: Windows Search Box
• The search box in the lower left corner of the screen on Windows machines
• Success metrics: more searches (and thus more Bing revenue)
• Raise your left hand if you think the Left version wins
• Raise your right hand if you think the Right version wins
• Don’t raise your hand if they are about the same

Example 3: Windows Search Box
• If you did not raise your hand, sit down…
• If you raised your left hand, sit down…
• The four variants we actually tested in order of performance are:
Type here to search (winner)
What can I help you find?
Ask me anything (Control - the design that shipped with Windows 10)
Search the web and Windows (worst)
Stop guessing – get the data!

Experimentation Adoption: Microsoft

Experimentation Adoption: Software Industry
• http://www.exp-growth.com/ - survey to determine the state of experimentation maturity (Fabijan et al, ICSE 2017, SEAA 2018)
[Chart: State of Exp Growth, by maturity stage: Crawl, Walk, Run, Fly]

Other industries? Let’s look at Sales
• Most of Outreach’s ~2500 customers fall into the Crawl stage, with many not doing any A/B testing at all
• Few sales organizations have a systematic experimentation program
• Huge potential: some experiments we ran doubled reply rates!

What are the reasons for low adoption in Sales?
• A few facts about sales
Very traditional industry (some say the oldest profession on earth), slow to change
No formal education or degrees; considered entry-level and low-paying
Requires extreme mental toughness. You are constantly ignored and told no. You’ve got a monthly quota, and if you don’t meet it 3 months in a row, you are fired
• There’s a fear of change: sales managers are afraid to try new ideas, fearing they may cause harm and result in missing their quota

What are the reasons for low adoption in Sales?
• Inadequate support for experimentation in sales tools, leading to most tests being invalid, and inability to confidently make decisions even on valid tests
No statistical testing
Any user can turn the variants on/off any time during the test
Any user can edit the email being tested any time during the test
Vast majority of the tests are broken (e.g. imbalance in deliveries)
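One way to catch an imbalance in deliveries automatically is a sample ratio mismatch (SRM) check: a chi-square goodness-of-fit test of the observed split against the intended one. The sketch below is an assumption about how such a check could look, not a description of any particular sales tool; the function name and the counts are hypothetical. It uses the fact that for one degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)).

```python
from math import erfc, sqrt

def srm_p_value(delivered_a, delivered_b, expected_share_a=0.5):
    """P-value for H0: deliveries follow the intended A/B split."""
    total = delivered_a + delivered_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((delivered_a - expected_a) ** 2 / expected_a
            + (delivered_b - expected_b) ** 2 / expected_b)
    # Chi-square survival function with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))

balanced = srm_p_value(5020, 4980)     # small deviation from 50/50
imbalanced = srm_p_value(5300, 4700)   # large deviation from 50/50
print(f"balanced split:   p = {balanced:.3f}")
print(f"imbalanced split: p = {imbalanced:.2e}")
```

A very small p-value means the split itself is suspect, so the test results should not be trusted until the cause of the imbalance is found.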

How to increase the adoption?
• We need to make experimentation
Trustworthy – results are correct and easily understood
Safe – impact of testing bad ideas is limited
Easy to use – enable non-technical sales managers and executives to answer their questions
• These are the same things I worked on trying to increase adoption of
experimentation at Microsoft!
Except… the bar is higher!!!

Five Gaps Between the Needs of the Sales Industry and the Experimentation State of the Art
1. No open source trustworthy A/B testing solution
2. Difficult to come up with the right metrics
3. Small sample sizes
4. Difficult to understand results of statistical tests
5. Hard to translate business questions into experiment designs
Solving these issues will help accelerate experimentation adoption in Sales, Software, and other domains

#1. Open Source A/B Testing Platform
• Pretty much anything “platform” is open source, except A/B testing
Wasabi, the only option, is not maintained
• Our http://exp-growth.com survey showed that most companies build their own platform from scratch (Fabijan et al, SEAA 2018).
This is hard - a big investment few companies can afford.
[Chart: Type of Exp Platform, share of companies on a 0% to 80% axis: Internally developed platform, Third party platform, No platform (manual coding of experiments)]
• There’s a need for an easy to deploy and integrate open source A/B testing solution that is easy to use, supports several common experiment designs, and provides safety features

#2. Determining the right metrics
• How to judge the result of an A/B test?
OEC = Overall Evaluation Criterion, or OMTM = One Metric That Matters
A single metric or a few key metrics with well-defined decision criteria
• Two key properties:
1. Alignment with long-term company goals (directionality)
2. Ability to impact (sensitivity)
• Finding a good OMTM is hard, in Sales and in Software Products
Simple metrics like Opens or Replies to sales emails are not predictive of future sales (fail directionality)
Long-term metrics like Sales or Revenue take too long to measure - a typical sales cycle takes months - and are hard to impact via small changes like email content (fail sensitivity)
Outreach solution – Positive Replies, where “positive” is determined via an ML classifier
See A/B Testing at Scale Tutorial for examples from Software industry

#3. Small Sample Sizes
• A typical 2-week A/B test for a mid-size Outreach customer will only have hundreds to thousands of data points in each variant
This translates to being able to detect only changes of ~20% or more
• Solutions:
Run bigger tests (at Outreach we recommend always running 50/50 tests)
Select more sensitive metrics: a 20% increase in Revenue is hard, a 20% increase in Positive Replies is easier
Start by focusing on bigger changes rather than small tweaks. As the company grows and the volume of sales activity increases, you can focus on smaller and smaller changes
Implement smarter experiment designs (e.g. cross-over design) and analysis methods (e.g. CUPED)
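To make the link between sample size and detectable change concrete, here is a rough sketch of the standard minimum detectable effect (MDE) approximation for a rate metric. The 10% baseline rate, the 5% significance level, and the 80% power are assumptions chosen for illustration, not Outreach figures.

```python
from math import sqrt
from statistics import NormalDist

def relative_mde(baseline_rate, n_per_variant, alpha=0.05, power=0.8):
    """Approximate smallest relative lift detectable in a 50/50 test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between two equal-sized variants.
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_variant)
    return (z_alpha + z_power) * standard_error / baseline_rate

# Hypothetical 10% baseline rate at several sample sizes per variant.
for n in (500, 2000, 5000):
    print(f"n = {n:>5} per variant -> detectable lift of about {relative_mde(0.10, n):.0%}")
```

Under these assumptions a few thousand data points per variant detect lifts of roughly 17% (at 5,000) to 27% (at 2,000), which is in line with the ~20% figure above. The formula also shows why more sensitive metrics and bigger tests help: the detectable lift shrinks only with the square root of the sample size.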

#4. Understanding Experiment Results
• Standard way of evaluating experiments via Null Hypothesis Testing can be easily misinterpreted, leading to wrong conclusions
See Steve Goodman’s A Dirty Dozen for 12 ways to get it wrong
Can’t show p-values to sales reps, need an easier way to interpret results
• Treatment effect may be different on different sub-populations
Results may vary depending on country, browser, location, prospect persona, sales step, etc.
How to automatically detect and visualize such heterogeneous results?

#4. Understanding Experiment Results
• Each experiment needs to have clear success criteria, mapping unambiguously to positive/negative outcomes
• Summarize results and learnings in an easy to understand visual way (Fabijan et al, SEAA 2018)

#5. Answering Business Questions
• Traditionally, A/B testing has been used to answer simple yes/no questions like
Does my new medicine help?
Should I ship my new feature?
Is my new email subject line better?
• However, managers and execs think of bigger, more difficult questions
Does embedding videos in e-mails help?
How urgently should sales reps reply to prospects?
How much should I invest in improving performance of my site?
• Using A/B testing to help answer these questions can help greatly accelerate adoption of
experimentation
Run a series of experiments on embedding video across all key scenarios
Run a series of experiments notifying users to reply with different delays across multiple scenarios
Run a series of “slowdown” experiments to estimate impact of performance on revenue
• Need to develop design patterns for such “learning experiment series”

Summary
• We are bad at assessing the value of our ideas. Don’t trust experts – get the data!
• A/B testing is the best scientific way to measure causal impact of your work on users and business
• Experimentation adoption is growing in Software Industry, but very low in other industries like
Sales
• Five challenges slowing down the adoption:
1. No open source trustworthy A/B testing solution
2. Difficult to come up with the right metrics
3. Small sample sizes
4. Difficult to understand results of statistical tests
5. Hard to translate business questions into experiment designs
• Solving these challenges will not only help Sales, it will accelerate experimentation adoption in Software and other industries, bringing experimentation to Everyone!

Questions?
Slides will be posted on my LinkedIn page: www.linkedin.com/in/paveldmitriev/
