How to Correctly Use Experimentation in PM by Google PM

Main takeaways:
- Common misconceptions and pitfalls in using experimentation
- Best practices on using the scientific method for experimentation
- Evaluating how other experimentation techniques such as Multi-Armed Bandit and Multivariate Testing can help you solve different types of problems

  1. www.productschool.com How to Correctly Use Experimentation in PM by Google PM
  2. Join 35,000+ Product Managers on free resources and discover great job opportunities. Job Portal: prdct.school/PSJobPortal | Events and Slack: prdct.school/events-slack
  3. Courses: Product Management. Learn the skills you need to land a Product Manager job.
  4. Courses: Coding for Managers. Build a website and gain the technical knowledge to lead software engineers.
  5. Courses: Data Analytics for Managers. Learn the skills to understand web analytics, SQL and machine learning concepts.
  6. Courses: Digital Marketing for Managers. Learn how to acquire more users and convert them into clients.
  7. Courses: UX Design for Managers. Gain a deeper understanding of your users and deliver an exceptional end-to-end experience.
  8. Courses: Product Leadership. For experienced Product Managers looking to gain strategic skills needed for top leadership roles.
  9. Courses: Corporate Training. Level up your team’s Product Management skills.
  10. Tonight’s Speaker: Ruben Lozano
  11. A/B
  12. A (control): BUY NOW, 21% vs. B (variation): BUY NOW, 35%
  13. “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” (Abraham Maslow)
  14. Agenda 1. User Research 101 2. Experiments 101 a. Why run live experiments? 3. Deep Dive on Live Experiments a. Process overview i. MVT ii. MAB b. Common mistakes c. Best practices 4. Questions
  15. “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka’ but ‘That’s funny...’” (Isaac Asimov)
  16. User Research 101: Love of learning
  17. What is user research? User research is a systematic approach to discovering users' aspirations, goals, tasks, needs, pain points, and information and interaction requirements. Research grounds, verifies, and validates what a team builds.
  18. Types of user research: a 3-dimensional framework. 1. Attitudinal vs. Behavioral ("what people say" vs. "what people do") 2. Qualitative vs. Quantitative (qualitative methods are better suited for answering why, whereas quantitative methods do a better job answering how many and how much) 3. Context of Use (a. Natural or near-natural, b. Scripted, c. Not using the product, d. A hybrid). Source: https://www.nngroup.com/articles/which-ux-research-methods/
  19. Landscape of user research methods (chart keyed by context of product use during data collection). Source: https://www.nngroup.com/articles/which-ux-research-methods/
  20. Where does research fit into product development? Foundational research, iterative research, and evaluative research.
  21. Foundational research. 1. Purpose: a. Uncover opportunities, b. Inform product direction and strategy, c. Continuously engage with users. 2. Research methods: a. Ethnographic field studies, b. Diary or camera studies, c. Competitive analysis. 3. Study participants: a. Future and current users.
  22. Iterative research. 1. Purpose: a. Explore design concepts, b. Identify usability issues. 2. Research methods: a. In-lab usability studies, b. Walkthroughs, c. User testing. 3. Study participants: a. Future and current users, early access.
  23. Evaluative research. 1. Purpose: a. Test how the product performs with target users, b. Improve the product. 2. Research methods: a. Surveys, b. Lab usability studies, c. Live experiments. 3. Study participants: a. Current or potential users.
  24. Experiments 101: What is an experiment?
  25. What is an experiment? An experiment is a way to test a hypothesis about the quality of a product. An experiment may also refer to the gradual launch of a new feature. Note: tests, while they are an important part of the software development journey, are not experiments, since you know in advance the result you expect.
  26. How does it work? The total population (e.g. users, requests, queries) is split by experiment diversion: an unaffected population keeps the default experience, the control group gets the default value, and the treatment group gets the experimental value.
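The diversion shown on slide 26 is usually implemented as a deterministic hash of the user and experiment identifiers, so the same user always lands in the same group. Below is a minimal Python sketch, assuming an illustrative 10% exposure split evenly between control and treatment; the function name and percentages are examples, not something from the deck.

    import hashlib

    def assign_bucket(user_id: str, experiment: str, exposure_pct: float = 10.0) -> str:
        """Deterministically assign a user to 'treatment', 'control', or 'unaffected'."""
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = (int(digest, 16) % 10_000) / 100.0   # uniform value in [0, 100)
        if bucket >= exposure_pct:
            return "unaffected"                        # keeps the default experience
        # The exposed slice is split evenly: treatment gets the experimental value, control the default.
        return "treatment" if bucket < exposure_pct / 2 else "control"

    # Example: divert 10% of traffic for one experiment.
    print(assign_bucket("user-42", "prime_toggle_v1"))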
  27. Success metric: a quantifiable measure of a desirable outcome. Product / sample metric: Facebook / DAU (daily active users), Airbnb / nights booked, WhatsApp / messages sent, Quora / number of questions answered, Amazon / revenue. Long-term growth.
  28. Why run live experiments?
  29. Obviously, this is going to work. Humans are terrible at making predictions: 1. Hindsight bias 2. Identifiable victim effect 3. Observational selection bias 4. Projection bias 5. Anchoring bias … and hundreds of cognitive biases...
  30. Just do a pre/post. The world is a fascinating place.
  31. Just do a pre/post. Brazil Search Traffic, June 2014 (chart).
  32. A/B isolates the impact of just the product changes.
  33. Is something a good idea? (Two mock listings, A and B: ORGANIC HONEYCRISP APPLE, GRANNY SMITH APPLE, FUJI APPLE / ORGANIC HONEYCRISP APPLE, GRANNY SMITH APPLE.)
  34. What’s the best iteration of a good idea? (Three variants, A, B, and C, of the result “Yahoo! - Wikipedia, the free encyclopedia”, shown with the URLs en.wikipedia.org... and en.wikipedia.org/wiki/Yahoo.)
  35. Why do live experiments? To learn: 1. Is something a good idea? 2. What’s the best iteration of a good idea? 3. How much is the impact of something? 4. Is something causing a problem? 5. Is something still a good idea? 6. What is the long-term impact of something?
  36. Live experiments are not magic wands.
  37. bring the science
  38. Fundamentals of experiment design. The scientific method is an empirical method of acquiring knowledge. It is the systematic observation, measurement, and experimentation of a hypothesis. 1. Observation 2. Hypothesis 3. Design 4. Experiment 5. Analysis 6. Prove/Reject
  39. Deep Dive on Live Experiments: How to succeed?
  40. Process overview
  41. PM flavor of scientific method: 1. Ask a question 2. Do background research 3. Develop a hypothesis 4. Test the hypothesis 5. Analyze data 6. Draw conclusions 7. Communicate results
  42. 1. Ask a question ● How can I increase user happiness? ● How can I increase usage of my product? ● How can I simplify code without changing metrics? ● How can I increase revenue attributed to my product? ● How can I affect click behavior?
  43. 2. Do background research ● What others have done before ○ Are you doing something different? ○ Did something change since the previous attempt? ● Quantitative data ○ Behavioral metrics ○ Funnel metrics ○ Downstream metrics ○ Surveys ○ Trends ● Qualitative data ○ Perceptions ○ Attitudes ○ Assumptions ○ Preferences
  44. 3. Develop hypothesis. A testable explanation for a phenomenon. Before performing an experiment and observing what happens, we first clearly identify what we "think" will happen. We make an "educated guess." We set out to prove or disprove the hypothesis.
  45. 3. Develop hypothesis (BUY NOW vs. BUY NOW). Candidate statements: It will make the product awesome! / In our focus group, people liked this design better / Click-through rate will increase / The color red is out of harmony with the world / Users associate red with danger
  46. 3. Develop hypothesis (BUY NOW vs. BUY NOW). Click-through rate will increase. Users associate red with danger.
  47. Example. Hypothesis: Prime users will spend more $ if they can easily narrow their search results to prime products. Design experiment: Show a prime toggle on the navigation bar for all US prime users on iOS.
  48. 4. Test the hypothesis (A vs. B).
  49. 4. Test the hypothesis (B). 1. Success metric 2. Triggering criteria 3. Duration of the experiment 4. Launch criteria. Option 1: +2.5% Revenue [1.9%, 3.1%], p=0.02. Option 2: +1.3% Revenue [-0.3%, 2.9%], p=0.15.
  50. 5. Analyze the data. Statistical significance is the likelihood that the numeric difference between a control and treatment outcome is not due to random chance. The null hypothesis states there is no significant difference between control and treatment; any observed difference is due to sampling or experimental error. The p-value evaluates how well the sample data support the argument that the null hypothesis is true; a low p-value suggests you can reject the null hypothesis. A confidence interval is a range of values (lower and upper bound) that is likely to contain an unknown population parameter.
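As a concrete illustration of the significance, p-value, and confidence-interval ideas on slide 50, here is a minimal sketch of analyzing a two-arm conversion experiment. It assumes the statsmodels library and made-up counts; the deck does not prescribe any particular tool.

    # pip install statsmodels  (assumed library, not specified in the deck)
    from statsmodels.stats.proportion import proportions_ztest, confint_proportions_2indep

    conversions = [350, 300]    # treatment, control (made-up numbers)
    exposures   = [1000, 1000]  # users diverted into each arm

    # Null hypothesis: treatment and control convert at the same rate.
    z_stat, p_value = proportions_ztest(conversions, exposures)

    # 95% confidence interval for the difference in conversion rate (treatment minus control).
    low, high = confint_proportions_2indep(conversions[0], exposures[0],
                                           conversions[1], exposures[1])

    print(f"p-value = {p_value:.4f}, 95% CI for the lift = [{low:.3f}, {high:.3f}]")
    if p_value < 0.05:
        print("Reject the null hypothesis: the difference is statistically significant.")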
  51. 5. Analyze the data
  52. 5. Analyze the data. Results can read as significantly positive, significantly negative, inconclusive, or flat* (still inconclusive), judged relative to 0% and a practical significance boundary (marked at -0.5% on the chart).
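One way to read that decision chart in code is to classify the confidence interval of the lift against zero and a practical significance threshold. A small sketch, assuming a 0.5% practical-significance band as marked on the slide; the function name and thresholds are illustrative.

    def classify_result(ci_low: float, ci_high: float, practical_sig: float = 0.5) -> str:
        """Classify an experiment from the confidence interval of the lift, in percent."""
        if ci_low > 0:
            return "significantly positive"
        if ci_high < 0:
            return "significantly negative"
        if -practical_sig < ci_low and ci_high < practical_sig:
            return "flat (still inconclusive, but bounded inside the practical-significance band)"
        return "inconclusive"

    print(classify_result(1.9, 3.1))    # Option 1 from slide 49: significantly positive
    print(classify_result(-0.3, 2.9))   # Option 2 from slide 49: inconclusive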
  53. 6. Draw conclusions. Hypothesis: Prime users will spend more $ if they can easily narrow their search results to prime products. 1. Validation 2. Description 3. Evaluation a. Arguments in favor b. Arguments against c. Key observations d. Durable learning e. Next steps
  54. 7. Communicate results
  55. Multivariate Testing: tests multiple variants simultaneously to determine how the combinations of multiple variables influence the output.
  56. Example: an APPLE product card with a buy button. Font Color (4): Black, Red, Gray, Blue. Button Label (3): BUY NOW, BUY, PURCHASE. Image (4): Option A, Option B, Option C, Option D. 48 combinations.
  57. MVT. Pros: ● Measures interactions between elements ● Multiple variants of specified variables are tested in a user interface. Cons: ● Much more traffic is typically required ● All combinations of the variants must make sense together.
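The traffic cost of MVT follows directly from the combinatorics: every factor multiplies the number of cells, and each cell needs enough traffic on its own. A quick sketch enumerating the 48 combinations from the example slide:

    from itertools import product

    font_colors   = ["black", "red", "gray", "blue"]
    button_labels = ["BUY NOW", "BUY", "PURCHASE"]
    images        = ["Option A", "Option B", "Option C", "Option D"]

    variants = list(product(font_colors, button_labels, images))
    print(len(variants))   # 48 cells (4 x 3 x 4), each needing enough traffic for a readable result
    print(variants[0])     # ('black', 'BUY NOW', 'Option A')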
  58. Multi-armed Bandits: optimizes the objective function by updating the randomization distribution as the experiment progresses.
  59. Comparing A/B/n vs. MAB (two charts of traffic allocation to arms A, B, and C over weeks 1-6).
  60. Multi-armed Bandits. Pros: ● Produces answers far more quickly ● Safer to allocate a larger fraction of traffic to experiments ● Increases experimentation output. Cons: ● Overvalues short-term behavior ● Can lead to self-fulfilling properties where an initial positive change is overly valued ● Can’t quantify the impact against a control.
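A common way to implement the "update the randomization distribution as the experiment progresses" idea is Thompson sampling over Bernoulli arms. A minimal simulation sketch; the click-through rates below are invented purely to show how traffic concentrates on the best arm.

    import random

    true_ctr = {"A": 0.21, "B": 0.35, "C": 0.28}   # assumed true rates, unknown to the bandit
    alpha = {arm: 1 for arm in true_ctr}           # Beta prior: successes + 1
    beta  = {arm: 1 for arm in true_ctr}           # Beta prior: failures + 1

    for _ in range(10_000):
        # Draw a plausible CTR for each arm from its posterior and play the arm with the highest draw.
        arm = max(true_ctr, key=lambda a: random.betavariate(alpha[a], beta[a]))
        clicked = random.random() < true_ctr[arm]
        alpha[arm] += clicked
        beta[arm]  += not clicked

    # Pulls per arm: traffic gradually shifts toward "B", the best arm in this simulation.
    print({arm: alpha[arm] + beta[arm] - 2 for arm in true_ctr})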
  61. Common mistakes
  62. Ethical ● Are you reinforcing biases? ● Are you using dark patterns to improve KPIs? ● Are you deceiving users? ● Are you violating privacy? ● What if the description of your experiment was published in the NYT?
  63. Time-related ● Stopping tests too early ● Not testing long enough, or not waiting long enough for the right metrics ● Testing for too long ● Not running tests for full weeks ● Running a split test during the holidays ● Forgetting learnability
  64. Configuration ● Loose triggering ● Ignoring different traffic sources ● Sending too little traffic to variations ● Changing experiment settings in the middle of a test ● Not considering mobile traffic variables
  65. Design ● Running tests that are not based on a hypothesis ● Using too many metrics ● Using the metric that you can measure, not the one that matters ● Not setting quantifiable goals ● Making too many changes in a single treatment ● Using too many variations
  66. Analysis ● Taking the numbers at face value ● Ignoring the difference between local and global metrics ● Not taking seasonal changes into account ● Ignoring technical issues (flickering, latency, etc.) ● Incorrect post-test segmentation ● P-hacking or data fishing!
  67. Best practices
  68. Choose the right metrics ● Ask “How do we define success?” and “How do we measure user value?” ○ Think both short-term and long-term ○ Be more inclusive ● Metrics should reflect changes that your product controls ● Avoid using lagging indicators ● Align on the success metrics beyond your own team
  69. Be a good wannabe scientist ● Bias against rolling out flat experiments, unless it’s a complete redesign ● Be suspicious if you didn’t predict a specific result in advance; retest to verify ● Look for a consistent story in the metrics ● The more you slice and dice your data, the more false positives you’ll get ● Run an A/A test!
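The "slice and dice" and A/A-test points are easy to see in a simulation: with no real difference at all, roughly 5% of comparisons still come out "significant" at p < 0.05. A small sketch, assuming numpy and scipy (neither is named in the deck):

    # pip install numpy scipy  (assumed libraries)
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    false_positives = 0
    for _ in range(1_000):
        # Both arms draw from the same 10% conversion rate: a pure A/A test.
        a = rng.binomial(1, 0.10, size=5_000)
        b = rng.binomial(1, 0.10, size=5_000)
        _, p = stats.ttest_ind(a, b)
        false_positives += int(p < 0.05)

    # Expect roughly 5% of A/A comparisons to look "significant" by chance alone.
    print(false_positives / 1_000)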
  70. Create and follow templates and processes ● Intake forms ● Pre-experiment design ● Post-experiment analysis ● Consolidation of results
  71. Pre-experiment design ● Experiment information ○ Summary ● What’s changing? ○ Control vs. treatment ● Objective ○ Goal and hypothesis ● Triggering criteria ○ Device, geography, target user, etc. ● Launch criteria ○ Launch if... ○ Definition of flat, Bonferroni correction ● Experiment duration
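Since the launch criteria on slide 71 mention a Bonferroni correction, here is a minimal sketch of what that adjustment looks like when the launch decision depends on several metrics. The metric names and p-values are invented for illustration.

    def bonferroni_significant(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
        """Flag which metrics remain significant after a Bonferroni correction.

        Each p-value is compared against alpha divided by the number of metrics tested.
        """
        threshold = alpha / len(p_values)
        return {metric: p < threshold for metric, p in p_values.items()}

    # Hypothetical launch-criteria metrics for a single experiment.
    print(bonferroni_significant({"revenue": 0.010, "ctr": 0.030, "latency": 0.200}))
    # With three metrics the per-metric threshold is 0.05 / 3 = 0.0167, so only revenue passes.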
  72. Post-experiment analysis ● Executive Summary ○ Overview with final decision ● Reasons to Launch ○ Based on launch criteria ● Reasons not to Launch ○ Based on launch criteria ● Latency Impact ● Key observations ○ “That’s funny” ● Durable learnings ○ Overall insights ● Detailed analysis
  73. Implement an experimentation process 1. Intake request 2. Pre-experiment review 3. Monitoring experiment 4. Post-experiment review 5. Experimentation council 6. Socialize results 7. Document results
  74. www.productschool.com Part-time Product Management, Coding, Data Analytics, Digital Marketing, UX Design, Product Leadership courses and Corporate Training

Editor's Notes

  • "As you checked in we sent you an email to join our online communities, events, and to apply for product management jobs. As members of the Product School community we'd like to provide you with these resources at your disposal."
  • A/B testing, at its most basic, is a way to compare two versions of something to figure out which performs better.
  • You start an A/B test by deciding what it is you want to test. Let’s start with a simple example: the color of the buy button on your website. Then you need to know how you want to evaluate its performance. In this case, let’s say your metric is the number of visitors who click on the button. To run the test, you show two sets of users (assigned at random when they visit the site) the different versions (where the only thing different is the color of the button) and determine which influenced your success metric the most. In this case, which button color caused more visitors to click? Seems simple enough, right? Well, you'd be surprised how often A/B tests are performed the wrong way. A bad A/B test can lead to a decision that isn't in your best interest. At best, these bad A/B tests are a waste of time. At worst, they can end up costing you a lot of money in the long run.
  • The appeal of A/B tests is how straightforward and simple they are to run (and many people designing these experiments don’t have a statistics background)

    I’ve seen and interacted with many people who use A/B testing as a lazy way to do their job: “I don’t know what to do, so I will run an A/B test.” Or, “I love data, so I want to collect data to support this decision.” RED FLAG!
  • Foundational research literally puts in place the foundational understanding of what user needs exist, what would be of value for people, and so on. Such research is about building deep empathy and understanding for people, this is where we address what value, if any, we can bring to people with what we build. While in some contexts, the idea of foundational research is that it comes *before* design and development, that it feeds into “requirements”, in our vision taking a foundational stance and conducting foundational research is an ongoing part of every project. In order to accomplish Product Excellence, we need to be always engaging with what value we are bringing to our users.
    Iterative research is conducted to refine flows, features, and components, to make them optimal for support of the user’s task accomplishment. Objective and subjective measures are collected to make decisions between design options.
    Evaluative research is conducted when the product is almost “baked”, and when the product is launched.
  • The next step is to choose a success metric.

    Think long-term
  • There was a product release in search, but the team was not concerned
  • There are pros and cons to this UI change (adding pictures from web pages) Pro: should help users evaluate a result before clicking Con: people like images, so they will draw their attention away from results without images, even if those are good results.
  • Here, you don't really have any good model for what might be best, so you just try lots of things.
    Also note that even small changes can produce user behavior differences.
  • We sometimes do experiments just to gain knowledge of how users react. The goal is to help us to better understand future experiments
    Did we do it right? Sometimes some weird bug only shows up with real traffic. Looking at metrics and realizing “that’s weird” can help find those bugs.

    Is it still good? Best practice would be to periodically re-evaluate all features and make sure they are still useful; if not, unlaunch!

    Longer term impact: metrics look good, let's launch but "hold back" from a set of users to monitor longer term performance.
  • Hypothesis has two parts:
    a story that explains how users will be affected by the change and why that's good for them, and
    how you will measure that change in behavior.

    Sometimes you need to explain your change a little more as well to make the connection clear. Sometimes user behavior changes and metrics are almost inseparable (e.g. if you expect behavior change to be "takes less time to find good result" then measuring "time to click" is almost obvious.)

    1. Sounds silly, but a lot of hypotheses amount to this: “my feature is obviously good”. Teams may push back against experimentation because it is “hard” or “takes too long”. Don’t take this at face value. Ask why, and make sure to communicate to analysts and experimentation teams when it’s actually true.
    Find proxies. At the end of the day, having awesome features is not the goal; making people’s lives better or solving a problem is the goal.
    2. This is a good proxy! For some features, we can actually measure growth in usage of Google Search, although we need a more specific definition of “growth”. (There are efforts to take a more Google-wide view of this… are we improving Search at the expense of something else, for example?) But: this takes a long time to measure, and though we have lots of data, only really big improvements actually move the needle. Sometimes you may be asked to do a “growth holdback” for a launch. 3. This is an important reason to run experiments! Sometimes we make a change that should not have any effect, but we need to verify that. 4. Another valid hypothesis. The good thing about this is it mentions a specific metric we think will change. 5. This mentions a metric, but not its direction. Sometimes we don’t know: if we change the link colors to one of those other 41 shades of blue, it will probably change CTR, but it could be hard to guess how! Important point: our intuition (even that of experienced and trained people) is often wrong about these things, and that’s why just relying on principles won’t work.
  • 1. Sounds silly, but a lot of hypotheses amount to this: “my feature is obviously good”. Teams may push back against experimentation because it is “hard” or “takes too long”. Don’t take this at face value. Ask why, and make sure to communicate to analysts and experimentation teams when it’s actually true.
    “following design principles will lead you to create features / products that make Google awesome”. Design principles are not laws of nature. And even laws of nature: we run experiments to verify their consequences.

    Find proxies. At the end of the day, having awesome features is not the goal, making people’s lives better or solving problem is the goal
    2. You have background research. It is testable, but there is no explanation.
    3. It is a success metric. It is testable. But it doesn’t tell you why. 4. It is an explanation, but it is not testable. 5. It is a hypothesis--but not really a hypothesis for our testing context
  • Because users associate red with danger, I predict that the click-through rate will increase.
  • Tracking, bug, is the data good?
    What am I seeing? How did it perform?
    Craft the story
