
A/B Testing for Everyone


At Microsoft I experienced how A/B testing grew from being occasionally used by a few teams in Bing and MSN several years ago to being widely used by many Microsoft products, including Office, Windows, Xbox, Skype, Visual Studio, and others. In some products it is already a standard, required part of the software release process, helping ensure software quality, understand customer value, and make better data-driven decisions. In other products it is growing steadily. At Microsoft, A/B testing is winning and will soon be part of everyone's daily job. However, when I left Microsoft to join Outreach, a startup that makes sales automation software, I was exposed to a different world. Even though Outreach provided A/B testing functionality, it was rarely used, and the usage was often incorrect. While the need for trustworthy decision making through A/B testing in sales was clear, it was also clear that simply giving sales teams an A/B testing system like the one we had at Microsoft would not be enough. I learned that there is a big difference between a Microsoft engineer and a sales representative with respect to their needs for successfully using A/B testing. In this talk I will discuss the gaps: what are our experimentation platforms, tools, and processes, which were built for highly trained engineers, missing to make A/B testing truly available to everyone? I will also discuss ongoing work and future research directions to fill these gaps. While required to make A/B testing a success in sales, I believe that solving these problems will also help increase adoption and successful usage of A/B testing in the software industry.


A/B Testing for Everyone

  1. A/B Testing for Everyone. Pavel Dmitriev. Some slides taken from talks by Ronny Kohavi.
  2. About Me • B.S. Applied Math @ Moscow State University, Russia • Ph.D. Computer Science @ Cornell University, focused on applied Machine Learning • 3 years @ Yahoo!, worked on web crawling and indexing optimization • 8 years @ Microsoft, worked on experimentation in Bing, MSN, O365, Skype, Windows • 5 months @ Outreach, working on experimentation, ML, NLP
  3. Outline • Intro to A/B testing • Examples of real experiments • Experimentation adoption across industries • Five challenges preventing faster adoption, in Sales and in Software
  4. The Life of a Great Idea – True Bing Story • Control: existing ad display • Treatment: a new idea called Long Ad Titles
  5. The Life of a Great Idea • It was one of hundreds of ideas on the table, and it seemed …meh… • It stayed in the backlog for months (February through June) • Many features were ranked above it; it was clear the idea was not going to make it any time soon • An engineer thought it was trivial to implement. He implemented it and started an A/B test. • Immediately an alert fired: revenue was abnormally high (which usually indicates a bug) • But in this case there was no bug. The idea increased Bing's revenue by 12% (over $100M/year), without hurting user experience metrics!
  6. We are bad at assessing the value of ideas • The best revenue-generating idea in Bing's history was badly rated and delayed for months! At Microsoft, we ran a study in Bing and found that only ~1/3 of the ideas developed were actually good for users and the business, ~1/3 were neutral, and ~1/3 were bad • Only in software engineering? In Sales, contradictory "best practices" are abundant. For example, the best day to contact a prospect is … • In Medicine, correctly evaluating an idea, e.g. a new drug, is a matter of life and death. The FDA and EMA do not trust expert opinions and mandate the use of Randomized Controlled Trials • We can't trust our gut! To make the right choices we need data from real users!
  7. Collecting Usage Data • Companies have always collected data to learn what their users appear to value: interviews, focus groups, questionnaires, and similar techniques are great at revealing what users say they do. Although rich in qualitative information, the learnings from these techniques are typically based on small samples and risk being biased, making them hard to generalize • With internet-connected products, companies can collect feedback data to learn what their customers actually value: telemetry and logging reveal what customers actually do
  8. Use Data Correctly – Correlation is not Causation • Seattle is known for its rain • Whenever I see people on the street carrying umbrellas, it soon starts raining • I may conclude that umbrellas cause the rain, and decide to ban them • Banning umbrellas, however, won't stop the rain; it will just make everyone wetter • Relying on correlations isn't just neutral, it's often harmful to the business! (Photo by Mike Waller, from Flickr)
  9. Correlation is not Causation – Real Example • You observe the churn rates for users using/not using your feature: 25% of new users who do NOT use your feature churn (stop using the product 30 days later); only 10% of new users who use your feature churn • [Wrong] Conclusion: your feature reduces churn and is thus critical for retention. Flaw: the relationship between the feature and retention is correlational; the data above is insufficient for any causal conclusion • Example: users who see error messages in Office 365 churn less. This does NOT mean we should show more error messages. They are just heavier users of Office 365
  10. Using Data Correctly – Before and After • Flaw: this approach misses time-related factors such as external events, weekends, holidays, seasonality, etc. • [Charts: Amazon Kindle sales on Website A vs. Website B, before and after Oprah calls the Kindle "her new favorite thing"] • The new site (B) is always worse than the original (A), the opposite of what the observational before/after data suggests
  11. A/B Tests in One Slide • Other names: Controlled Experiments, Randomized Clinical Trials (RCTs) • Can have more than two variants: A/B/C/etc. tests are common • Must run statistical tests to confirm differences are not due to chance (see the sketch below) • A/B Tests are the best scientific way to prove causality!
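
For concreteness, here is a minimal sketch (not from the talk) of the kind of statistical test behind a two-variant A/B test on a conversion-style metric: a two-proportion z-test. The function name and counts are illustrative.

```python
# A minimal sketch of the statistical test behind a two-variant A/B test
# on a conversion-style metric: a two-proportion z-test.
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (absolute lift of B over A, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)             # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, 2 * norm.sf(abs(z))                # two-sided p-value

# Illustrative counts: 5.0% vs 5.6% conversion on 10,000 users per variant.
lift, p = two_proportion_ztest(500, 10_000, 560, 10_000)
print(f"lift={lift:+.4f}, p={p:.3f}")  # call it stat-sig only if p < 0.05
```
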
  12. Real Examples • Three experiments. Each had enough users for statistical validity. For each experiment I'll tell you the success metric; your job is to guess the result • Please stand up. You'll choose between three options by raising your left hand, raising your right hand, or leaving both hands down. If you get it wrong, please sit down • Since there are 3 choices for each question, random guessing implies 100%/3^3 ≈ 4% will get all three questions right. Let's see how much better than random you can do.
  13. Example 1: Outreach Email (Step 9, Day 7) • Success metric: Reply Rate • Left template: Hey {{first_name}}, In short, we're a sales automation platform that makes your reps life a lot easier. Our average companies (based on 1100+ companies) have tripled their reply rates on cold outbound emails and boosted rep productivity by 2x. We take what your best reps are doing and automate that across your entire team so your weaker reps can work at the highest possible same level. We also solve the issue of follow up falling through the cracks and reps not going deep enough. When can I get a few minutes on your calendar to discuss? {{sender.first_name}} • Right template: {{first_name}}, I'm sure in your role you get a ton of sales-driven emails, probably most of which are spam you have no interest in. My goal is to provide enough value to warrant a 15 minute call with you. What we do is put your sales process into a structured series of touch points which takes care of your follow-up process for you. This ramps up reps activities and ensures that every lead is thoroughly worked, never gets lost and receives the 5 to 12 touches where 80% of sales happen. Second, we do all the administrative work for in your CRM (Salesforce). This frees up your reps time, logs their activities, and gives you 100% accurate reporting. Finally, we open up the "Black Box" of sales and show you in real time how each rep is performing, what activities they're doing, and what is and isn't working. This provides a solid foundation to accurately forecast results, improve your outreach and train your team. Over 1100 companies (like CenturyLink, Adobe, and Marketo) use us and their average rep saves 2 hrs a day, and 2X's their productivity. If you see value here can we set up a time next Tuesday or Wednesday to discuss? {{sender.first_name}} • Left: shorter, more "salesy" • Right: longer, more "social" • Raise your left hand if you think the Left version wins (stat-sig) • Raise your right hand if you think the Right version wins (stat-sig) • Don't raise your hand if you think they are about the same (no stat-sig difference)
  14. Example 1: Outreach Email (Step 9, Day 7) • [Same two email templates as the previous slide] • The Left template has a 70% higher reply rate… however, most of the replies are negative or unsubscribe requests. The Right template has a higher positive reply rate • If you did not raise your hand, sit down… • If you raised your right hand, sit down…
  15. Example 2: SERP Truncation • A SERP is a Search Engine Result Page (shown on the right) • Success metric: Clickthrough Rate on the first SERP (ignore issues with click/back, page 2, etc.) • Version A: show 10 algorithmic results • Version B: show 8 algorithmic results by removing the last two (shown on the right) • All else the same: task pane, ads, related searches • Raise your left hand if you think version A wins (10 results) • Raise your right hand if you think version B wins (8 results) • Don't raise your hand if you think they are about the same
  16. Example 2: SERP Truncation • If you raised your left hand, sit down… • If you raised your right hand, sit down… • With over 3M users in each variant, we could not detect a stat-sig delta. Users simply shifted the clicks from the last two algorithmic results to other elements of the page. • Rule of Thumb: shifting clicks is easy; reducing abandonment is hard.
  17. Example 3: Windows Search Box • The search box in the lower-left corner of the screen on Windows machines • Success metric: more searches (and thus more Bing revenue) • Raise your left hand if you think the Left version wins • Raise your right hand if you think the Right version wins • Don't raise your hand if you think they are about the same
  18. Example 3: Windows Search Box • If you did not raise your hand, sit down… • If you raised your left hand, sit down… • The four variants we actually tested, in order of performance: 1. "Type here to search" (winner) 2. "What can I help you find?" 3. "Ask me anything" (Control, the design that shipped with Windows 10) 4. "Search the web and Windows" (worst) • Stop guessing – get the data!
  19. Experimentation Adoption: Microsoft • [Chart slide]
  20. Experimentation Adoption: Software Industry • http://www.exp-growth.com/ – a survey to determine the state of experimentation maturity (Fabijan et al., ICSE 2017, SEAA 2018) • [Chart: number of surveyed companies in each maturity stage: Crawl, Walk, Run, Fly]
  21. Other industries? Let's look at Sales • Most of Outreach's ~2,500 customers fall into the Crawl stage, with many not doing any A/B testing at all • Few sales organizations have a systematic experimentation program • Huge potential: some experiments we ran doubled reply rates!
  22. What are the reasons for low adoption in Sales? • A few facts about sales: a very traditional industry (some say the oldest profession on earth), slow to change; no formal education or degrees, considered entry-level, and the pay is low; requires extreme mental toughness – you are constantly ignored and told no; you've got a monthly quota, and if you don't meet it 3 months in a row, you are fired • There's a fear of change: sales managers are afraid to try new ideas, fearing it may cause harm and result in missing their quota
  23. What are the reasons for low adoption in Sales? • Inadequate support for experimentation in sales tools, leading to most tests being invalid and an inability to confidently make decisions even on valid tests: no statistical testing; any user can turn the variants on/off at any time during the test; any user can edit the email being tested at any time during the test; the vast majority of tests are broken (e.g., imbalance in deliveries)
  24. How to increase the adoption? • We need to make experimentation: Trustworthy – results are correct and easily understood; Safe – the impact of testing bad ideas is limited; Easy to use – enable non-technical sales managers and executives to answer their questions • These are the same things I worked on when trying to increase adoption of experimentation at Microsoft! Except… the bar is higher!!!
  25. Five Gaps between the needs of the Sales industry and the experimentation state of the art: 1. No open source trustworthy A/B testing solution 2. Difficult to come up with the right metrics 3. Small sample sizes 4. Difficult to understand the results of statistical tests 5. Hard to translate business questions into experiment designs • Solving these issues will help accelerate experimentation adoption in Sales, Software, and other domains
  26. #1. Open Source A/B Testing Platform • Pretty much every kind of "platform" is available open source, except A/B testing; Wasabi, the only option, is not maintained • Our http://exp-growth.com survey showed that most companies build their own platform from scratch (Fabijan et al., SEAA 2018). This is hard – a big investment few companies can afford. • [Chart: share of surveyed companies using an internally developed platform, a third-party platform, or no platform (manual coding of experiments)] • There's a need for an easy-to-deploy, easy-to-integrate open source A/B testing solution that is easy to use, supports several common experiment designs, and provides safety features (one core building block is sketched below)
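
As one illustration of what such a platform must get right, here is a hedged sketch of deterministic, hash-based variant assignment, a standard building block of A/B testing systems; the experiment name and split are made up.

```python
# Deterministic, hash-based variant assignment: each user always gets the
# same variant, with no assignment log to store. Names are illustrative.
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    # Salting the hash with the experiment name decorrelates bucketing
    # across concurrent experiments on the same user population.
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 1000
    return variants[bucket * len(variants) // 1000]     # equal split

print(assign_variant("user-42", "long-ad-titles"))      # stable across calls
```
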
  27. #2. Determining the Right Metrics • How to judge the result of an A/B test? OEC = Overall Evaluation Criterion, or OMTM = One Metric That Matters: a single metric, or a few key metrics, with well-defined decision criteria • Two key properties: 1. Alignment with long-term company goals (directionality) 2. Ability to impact (sensitivity) • Finding a good OMTM is hard, in Sales and in software products: simple metrics like Opens or Replies to sales emails are not predictive of a future sale (they fail directionality); long-term metrics like Sales or Revenue take too long to measure – a typical sales cycle takes months – and are hard to impact via small changes like email content (they fail sensitivity) • Outreach's solution: Positive Replies, where "positive" is determined via an ML classifier (a hypothetical sketch follows below) • See the A/B Testing at Scale Tutorial for examples from the software industry
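
The deck does not describe how the classifier works, so the following is a purely hypothetical sketch of a "positive reply" classifier using scikit-learn; the training replies and labels are invented toy data.

```python
# Hypothetical sketch: label sales-email replies as positive (1) or
# negative (0) with a simple text classifier. Toy data, not Outreach's model.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

replies = [
    "Sounds interesting, can we talk on Tuesday?",    # positive
    "Sure, send over a calendar invite.",             # positive
    "Please remove me from your list.",               # negative
    "Not interested, stop emailing me.",              # negative
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(replies, labels)
print(clf.predict(["Sure, we can talk Tuesday."]))    # likely [1] on this toy model
```
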
  28. #3. Small Sample Sizes • A typical 2-week A/B test for a mid-size Outreach customer will only have hundreds to thousands of data points in each variant. This translates to being able to detect only changes of ~20% or more (see the power calculation sketched below) • Solutions: run bigger tests (at Outreach we recommend always running 50/50 tests); select more sensitive metrics – a 20% increase in Revenue is hard, a 20% increase in Positive Replies is easier; start by focusing on bigger changes rather than small tweaks – as the company grows and the volume of sales activity increases, you can focus on smaller and smaller changes; implement smarter experiment designs (e.g., cross-over designs) and analysis methods (e.g., CUPED)
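
To make the "~20% or more" claim concrete, here is a small sketch of the minimum-detectable-effect arithmetic for a proportion metric, together with the core of the CUPED adjustment the slide mentions. The baseline rate, sample size, and simulated data are illustrative assumptions.

```python
# Sketch of the sample-size arithmetic behind small-sample A/B tests,
# plus the core of CUPED variance reduction. Numbers are illustrative.
import numpy as np
from scipy.stats import norm

def relative_mde(baseline_rate, n_per_variant, alpha=0.05, power=0.8):
    """Minimum detectable relative lift for a two-sample proportion test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    se = np.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_variant)
    return z * se / baseline_rate

print(f"{relative_mde(0.10, 5000):.0%}")   # ~17% with 5,000 points per variant

def cuped_adjust(metric, covariate):
    """CUPED: remove the variance explained by a pre-experiment covariate."""
    theta = np.cov(metric, covariate)[0, 1] / np.var(covariate, ddof=1)
    return metric - theta * (covariate - covariate.mean())

rng = np.random.default_rng(0)
pre = rng.normal(10, 2, 1000)              # pre-experiment activity per user
post = pre + rng.normal(0, 1, 1000)        # in-experiment metric, correlated
print(post.var(), cuped_adjust(post, pre).var())  # variance drops sharply
```
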
  29. #4. Understanding Experiment Results • The standard way of evaluating experiments via Null Hypothesis Testing can easily be misinterpreted, leading to wrong conclusions. See Steve Goodman's "A Dirty Dozen" for 12 ways to get it wrong. We can't show p-values to sales reps; we need an easier way to interpret results (one hypothetical approach is sketched below) • The treatment effect may differ across sub-populations: results may vary depending on country, browser, location, prospect persona, sales step, etc. How do we automatically detect and visualize such heterogeneous results?
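
One hypothetical way to avoid showing raw p-values is to translate the effect estimate and its confidence interval into a plain-language verdict. A minimal sketch; the wording, decision rule, and inputs are all made up for illustration.

```python
# Hypothetical sketch: show a verdict instead of a p-value by translating
# the effect estimate and its confidence interval into plain language.
def summarize(lift, ci_low, ci_high):
    rng = f"(likely between {ci_low:+.1%} and {ci_high:+.1%})"
    if ci_low > 0:
        return f"Winner: reply rate up {lift:+.1%} {rng}."
    if ci_high < 0:
        return f"Loser: reply rate down {abs(lift):.1%} {rng}."
    return "Inconclusive: the difference could plausibly be zero."

print(summarize(0.021, 0.004, 0.038))
# -> Winner: reply rate up +2.1% (likely between +0.4% and +3.8%).
```
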
  30. #4. Understanding Experiment Results • Each experiment needs to have clear success criteria, mapping unambiguously to positive/negative outcomes • Summarize results and learnings in an easy-to-understand visual way (Fabijan et al., SEAA 2018)
  31. #5. Answering Business Questions • Traditionally, A/B testing has been used to answer simple yes/no questions like: Does my new medicine help? Should I ship my new feature? Is my new email subject line better? • However, managers and execs think of bigger, more difficult questions: Does embedding videos in e-mails help? How urgently should sales reps reply to prospects? How much should I invest in improving the performance of my site? • Using A/B testing to help answer these questions can greatly accelerate adoption of experimentation: run a series of experiments on embedding video across all key scenarios; run a series of experiments notifying users to reply with different delays across multiple scenarios; run a series of "slowdown" experiments to estimate the impact of performance on revenue • We need to develop design patterns for such "learning experiment series"
  32. Summary • We are bad at assessing the value of our ideas. Don't trust experts – get the data! • A/B testing is the best scientific way to measure the causal impact of your work on users and the business • Experimentation adoption is growing in the software industry, but is very low in other industries like Sales • Five challenges are slowing down adoption: 1. No open source trustworthy A/B testing solution 2. Difficult to come up with the right metrics 3. Small sample sizes 4. Difficult to understand the results of statistical tests 5. Hard to translate business questions into experiment designs • Solving these challenges will not only help Sales, it will accelerate experimentation adoption in Software and other industries, bringing experimentation to Everyone!
  33. Questions? Slides will be posted on my LinkedIn page: www.linkedin.com/in/paveldmitriev/
