Experimentation Platform at Netflix


A high-level explanation of the A/B testing platform at Netflix and current open positions on the team.

Published in: Technology

  1. 1. A/B Testing at Netflix: Experimentation Platform
     Steve Urban
  2. 2. Importance of A/B Testing at Netflix
     • Technology is just one part of the equation: a culture of experimentation is the other essential part
     • All product ideas are subjected to the scientific method, with actual data supporting changes before they are rolled out to all users
     • The effectiveness of any idea is measured without bias: the seniority of the person proposing the idea is irrelevant
  3. 3. Our Users
     A/B testing enables product decisions throughout Netflix, with our users spread across all departments:
     • Data Scientists: Does this new ranking algorithm result in more plays?
     • Product Managers: Does this new UI reduce the time for users to find content?
     • Marketing: Which email campaign resulted in more new subscribers?
     • Content: Which thumbnail image resulted in more streams of Daredevil?
     • Engineers: Is the new implementation of this streaming algorithm more performant when internet connectivity is spotty?
     • and so on...
  4. 4. A/B Testing Platform Objectives
     • Being an internal tool is not an excuse for poor UX
     • Given the diverse expertise of our users, workflows must be simple and effective while providing value
     • Cover all generic test management scenarios
     • Easily accommodate unique experimentation needs as they come up
     • Ingest and combine real-time behavioral data and batch metadata from numerous sources
  5. 5. We’re Hiring
     We’re looking for a Full-Stack Engineer to help across the board:
     • Collaborate with users across Netflix to understand their UI needs
     • Be part of a team of engineers and UX experts
     • Tech stack: Java, React, Node
     • Data visualization experience is a plus
     We need a Server-Side Engineer with expertise designing distributed systems:
     • Help design and rebuild our allocation engine
     • Experience processing large datasets, including efficient incorporation of near real-time data
     • Expertise with various Big Data databases
     • Machine learning experience is a plus
     Netflix has a unique culture. Read about it here.
  7. 7. Which Version is Better? A or B
  8. 8. Given that I Watched House of Cards... Which set of recommendations is better? A or B
  9. 9. Hard to Answer Without Disciplined Experimentation: A? or B?
  10. 10. A/B Testing Process
      Hypothesis: Retention and/or engagement will improve with the new recommendation algorithm.
      Process: Randomly group users from the target population into different buckets. Apart from the test experiences, all other factors are held constant.
      ● Control Group: Continue to experience the current version (A)
      ● Test Group B: Experience version B
      ● Test Group C: Experience version C
  11. 11. A/B Testing Process Continued
      Analyze & Compare Key Results
      ● Algorithm A (Control): Viewing hours delta: N/A, as this is what we are measuring the other options against
      ● Algorithm B: Viewing hours delta: +2.3%. Statistically significant: Yes. 2.3% better than the control, and we’re confident about it
      ● Algorithm C: Viewing hours delta: -5.7%. Statistically significant: Yes. Ouch! Don’t use this algorithm.
      ● ...
  12. 12. Data-Driven Results: A or B
  13. 13. Technology Stack
      [Architecture diagram. Labels: Experimentation Service; REST API (Allocate Customers, Retrieve Allocations); Define Experiments; Experiment Criteria; Evaluate Eligibility; Sampling; Metadata; Allocations; Persist/Retrieve Allocations; Ad Hoc Queries; Real-time Analysis & Monitoring; Persist Metrics; Health Metrics; Visualize; Other Netflix Services]
  14. 14. Allocation & Stratification
      ● Randomly distribute and assign customers to a variant in the experiment utilizing Stratified Sampling
      ● Start, Stop, and Track allocations in near real-time
      “Random sampling” with enforcement of sample proportions across regions
      All US Regions, Percentage of Users*: North East 22%, South East 13%, South West 17%, ...
      *Numerical values are for illustrative purposes only and are totally made up
  15. 15. Segmentation
      ● Divide a broad target population into subsets with similar properties
      ● Some tests are meant to measure impact on specific populations
      ● Must maintain scale and low latencies
      Target Population: segmentation by specific properties
      ○ Haven’t used a tablet to access Netflix in n days
      ○ Used a game console to access Netflix within last n days
      ○ Smart TV users
  16. 16. Test Health
      ● All test experiences are not equal, but we must ensure this isn’t due to buggy implementations
      ● Issues can be device specific, so we must monitor at device, test, and experience granularity
      ● The example below is super-simplified; we need to create visualizations that effectively convey test health, internationally, across thousands of devices
      Example:
      ○ Control Cell: No errors/fallbacks
      ○ Experience A: Issue on TV UI detected
      ○ Experience B: No errors/fallbacks
  17. 17. ABlaze UI: Test Lifecycle Management
      Initial Planning: Test Configuration Screens
      ● Determine hypothesis
      ● Implement each test experience
      Schedule Test: Scheduler View
      ● Define real-time rules & conditions
      ● Consider potential conflicts
      Monitor Test: Dashboard and Alert Views
      ● Monitor test health over time
        ○ Real-time analysis and alerting on metrics and allocations
      ● Pull test if bugs/issues present themselves
      Hypothesis Evaluation: Comparison Views
      ● Interactive filtering, analysis, & visualization of data
      ● Call success or failure of test
      Implement or Re-Test
      ● Devise plan to roll winning experience (if any) out to production
      ● Else, potentially revise hypothesis and retest
  18. 18. Some Challenges
      • Operate resiliently and at low latencies, despite:
        ○ Customer allocations taking place in real-time
        ○ Need for near real-time insights into test health over massive datasets
        ○ Data that is distributed across multiple clusters
      • Data processing:
        ○ Joins across billions of rows of data from many sources can cause a massive increase in the number of rows
        ○ Efficient management of datasets to support interactive analysis, dashboards, etc.
      • Rich and flexible filtering to support interactive analysis
      • Extract forecasts and insights
      • Oh, and make it as easy to use as possible for the users...
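
Slide 10 describes randomly grouping users into buckets (control A, test B, test C). One common way to do this is to hash the customer and experiment identifiers so that the same customer always lands in the same bucket. The sketch below illustrates that general technique only; the slides do not say how Netflix's allocation engine works, and the function name `allocate` and the identifiers used here are hypothetical.

```python
import hashlib


def allocate(customer_id: str, experiment_id: str, cells: list[str]) -> str:
    """Deterministically map a customer to one cell of an experiment.

    Hypothetical helper: hashing (experiment, customer) gives a stable,
    roughly uniform assignment without storing any state.
    """
    key = f"{experiment_id}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and reduce it to a cell index.
    bucket = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[bucket]


if __name__ == "__main__":
    cells = ["A (control)", "B", "C"]
    counts = {cell: 0 for cell in cells}
    for i in range(30_000):
        counts[allocate(f"customer-{i}", "new-recs-algo", cells)] += 1
    print(counts)  # each cell should receive roughly a third of customers
```

Including the experiment identifier in the hash key means a customer's bucket in one test is independent of their bucket in another, which helps keep "all other factors" constant across concurrent tests.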
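
Slide 11 reports a "viewing hours delta" for each test algorithm together with whether it is statistically significant. As a rough illustration of how such a comparison could be computed, the sketch below uses a two-sided, two-sample z-test on mean viewing hours, which is a reasonable approximation for large samples. The choice of test, the 0.05 threshold, and the name `compare_to_control` are assumptions; the slides do not state which statistical method the platform uses.

```python
import math
from statistics import NormalDist, mean, variance


def compare_to_control(control: list[float], test: list[float], alpha: float = 0.05) -> dict:
    """Compare mean viewing hours of a test cell against the control cell.

    Returns the relative delta in percent, a two-sided p-value from a
    large-sample z-test, and whether the p-value is below alpha.
    """
    mean_c, mean_t = mean(control), mean(test)
    # Standard error of the difference in means (unequal variances allowed).
    se = math.sqrt(variance(control) / len(control) + variance(test) / len(test))
    if se == 0:
        raise ValueError("both samples are constant; the z-test is undefined")
    z = (mean_t - mean_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {
        "delta_pct": 100 * (mean_t - mean_c) / mean_c,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


if __name__ == "__main__":
    # Made-up viewing hours, for illustration only.
    control = [10.0, 12.0] * 500
    test_b = [10.5, 12.5] * 500
    print(compare_to_control(control, test_b))
```

A positive `delta_pct` with `significant` set to true corresponds to the "better than the control, and we're confident about it" case on the slide; a significant negative delta corresponds to the "don't use this algorithm" case.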
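
Slide 14 describes stratified sampling as random sampling with enforcement of sample proportions across regions. A minimal sketch of that idea follows: customers are grouped by region, shuffled within each region, and dealt out to cells in turn, so every cell ends up with nearly the same regional mix. The function name `stratified_allocate`, the input shape, and the customer counts are hypothetical, and this is not presented as the platform's actual algorithm.

```python
import random
from collections import defaultdict


def stratified_allocate(customers, cells, seed=0):
    """Assign customers to cells while preserving regional proportions.

    customers: iterable of (customer_id, region) pairs.
    Returns a dict mapping customer_id to a cell name.
    """
    rng = random.Random(seed)
    by_region = defaultdict(list)
    for customer_id, region in customers:
        by_region[region].append(customer_id)

    allocation = {}
    for region in sorted(by_region):
        ids = by_region[region]
        rng.shuffle(ids)  # random order within the stratum
        # Random starting cell so leftovers do not always favor the first cell.
        offset = rng.randrange(len(cells))
        for i, customer_id in enumerate(ids):
            allocation[customer_id] = cells[(i + offset) % len(cells)]
    return allocation


if __name__ == "__main__":
    # Region sizes are made up for illustration.
    customers = (
        [(f"ne-{i}", "North East") for i in range(220)]
        + [(f"se-{i}", "South East") for i in range(130)]
        + [(f"sw-{i}", "South West") for i in range(170)]
    )
    allocation = stratified_allocate(customers, ["A", "B"])
    print(sum(1 for cell in allocation.values() if cell == "A"))
```

Within each region, the cell sizes differ by at most one customer, which is what "enforcement of sample proportions" amounts to in this simplified setting.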
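
Slide 15 lists segment definitions such as "haven't used a tablet in n days" and "used a game console within the last n days". These can be read as simple predicates over a customer's per-device last-access dates. The sketch below assumes a hypothetical `last_used` mapping from device type to the most recent access date; the real platform's data model is not described in the slides.

```python
from datetime import date, timedelta


def used_within(last_used: dict[str, date], device: str, n_days: int, today: date) -> bool:
    """True if the customer accessed the service on `device` within the last n_days."""
    seen = last_used.get(device)
    return seen is not None and (today - seen) <= timedelta(days=n_days)


def segments_for(last_used: dict[str, date], n_days: int, today: date) -> set[str]:
    """Return the illustrative segments a customer belongs to."""
    segments = set()
    if not used_within(last_used, "tablet", n_days, today):
        segments.add("no_tablet_in_n_days")
    if used_within(last_used, "game_console", n_days, today):
        segments.add("game_console_within_n_days")
    if "smart_tv" in last_used:
        segments.add("smart_tv_user")
    return segments


if __name__ == "__main__":
    today = date(2024, 1, 31)
    customer = {"game_console": date(2024, 1, 25), "smart_tv": date(2023, 12, 1)}
    print(sorted(segments_for(customer, n_days=7, today=today)))
```

Evaluating predicates like these per customer is cheap, but doing so at allocation time for every request is where the slide's "scale and low latencies" requirement comes in.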