Machine Learning at Netflix Scale


Netflix is the world’s leading Internet television network with over 48 million members in more than 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series. Netflix uses machine learning to deliver a personalized experience to each one of our 48 million users.

In this talk you will hear about the machine learning algorithms that power almost every part of the Netflix experience, including some of our recent work on distributed neural networks on AWS GPUs. You will also get an insight into our approach to innovation, which includes offline experimentation and online A/B testing. Finally, you will learn about the system architectures that enable all of this at Netflix scale.

Published in: Engineering, Technology, Education

  • - Who in the audience has an ML background?
    Who has a big data background?
    Who’s an engineer?
    Going to cover:
    A bit of everything: a few models, our approach to the architecture of ML systems, and how it all comes together.
    Feel free to ask questions as we go along.
  • - We use machine learning in many places at Netflix, but perhaps the place we’re best known for ML is our recommender systems and our personalization.
    - So I wanted to start with a quick overview of what personalization is at Netflix.
  • If you’ve logged into Netflix before, this should look familiar. This is what it looks like when you log in to our website.
    What you might not realize, however, is that almost every element on this page is driven by an ML algorithm.
  • - There are the obvious recommendations. We have a row of explicit recommendations, where we pull together everything we know about you and present our “top picks” for you.
  • You’ll also see “Genre” rows, which provide shows around a particular theme.
    Movies are tagged in our system based on a number of different aspects.
    The tags are added editorially by our team of content experts.
    Which genres we pick, however, is personalized. So “Movies based on books” is shown to me based on my predicted likelihood of wanting to watch this genre.
    There’s also a level of personalization within the row itself. A genre like “Movies based on books” spans a lot of different tastes.
    For example, movies about Wall Street, documentaries on the GFC, and Young Adult fiction are all types of “Movies based on books”, but they serve different tastes.
    But based on what we know about you, we can construct a set of “Movies based on books” tailored to your particular view of what that means.
  • We also do “Similar” rows. So as the title says, because I last watched Bob’s Burgers, here are some choices that are similar to that.
  • Even our marketing images are personalized. Many of the hero images and much of the marketing you see within Netflix are personalized to your taste.
    I see OITNB here because it fits with my tastes.
  • Finally we put it all together.
    Unsurprisingly, most of what people play is from the top left-hand corner. If they are forced to scroll further down or right, that means we failed to predict what they want to watch.
    So we also rank the entire page. I’ve already shown how we rank the titles within each row left-to-right. We also rank the rows top-to-bottom, so that the rows most relevant to you are pushed to the top of the page.
  • The net result of this personalization is that 75% of what our users watch is selected from the homepage, from the rows I’ve just shown you.
    This means that we’ve been able to provide a very personalized experience for our users, where what they see on the homepage when they log in to Netflix matches pretty well with what they want to watch.
  • - Okay, I’m going to take a minute now to provide some back story.
  • Who’s heard of the Netflix Prize?
    It ran from 2006 to 2009.
    - The 2007 Progress Prize went to Team KorBell (AT&T). The grand prize was won in 2009 by BellKor’s Pragmatic Chaos.
  • The challenge was:
    We give you 100M anonymized user ratings to build a “rating prediction” model with.
    We then get you to predict 2.8M ratings that we held back, so we already know what those users actually rated.
    If you can improve on our predicted ratings by 10%, then we give you 1 million dollars.
    We measure this as the root mean squared error (RMSE) between your predicted ratings and the real ratings that we held back.
    - Team KorBell (AT&T) won the 2007 Progress Prize.
    - They improved the predictions by 8.43%.
    - The 2009 grand prize winner, BellKor’s Pragmatic Chaos, reached 10.06%.
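As a rough illustration of the metric described above, here is a minimal Python sketch of RMSE and of the relative improvement over a baseline. The function names and all the ratings are made up for the example; this is not the actual Netflix Prize scoring code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def improvement(baseline_rmse, candidate_rmse):
    """Relative RMSE improvement of a candidate over a baseline, as a fraction."""
    return (baseline_rmse - candidate_rmse) / baseline_rmse


# Hypothetical held-back ratings and two sets of predictions (made-up numbers).
held_back = [4, 3, 5, 2]
baseline = [3.0, 3.0, 3.0, 3.0]    # e.g. always predict the same rating
candidate = [3.5, 3.0, 4.5, 2.5]   # a model that tracks the real ratings better

gain = improvement(rmse(baseline, held_back), rmse(candidate, held_back))
print(f"relative improvement: {gain:.1%}")
```

Under this definition, a 10% improvement means the candidate’s RMSE is 0.9 times the baseline’s RMSE.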
  • Two significant algorithms came out of the Netflix Prize.
    SVD - Prize RMSE: 0.8914
    RBM - Prize RMSE: 0.8990
    They were known in academia already, but hadn’t made their way out into industry recommender systems.
    I talk through how SVD works at a high level in later slides
    These two algorithms are still used in parts of the Netflix Recommender System to this day.
  • - There are limitations though.
    Ratings != Plays. People’s ratings are somewhat “aspirational”. People may rate Citizen Kane 5 stars, but what they watch is Sharknado.
    For our use case, we’re interested in predicting what people actually want to watch, not predicting what they think are critically worthy movies.
  • Also Netflix has changed a lot since the start of the Netflix Prize.
    In 2006 we were mailing out DVDs. Now we’re more about streaming to devices.
    This also changed people’s viewing habits.
    The investment in selecting a great DVD that the entire family can watch was higher. Everyone had to agree on it, and getting it wrong might ruin your night.
    With streaming, people want content that is more personalized and more sensitive to the context of what they want to watch NOW.
  • Also Netflix has grown. A lot.
    The algorithms that worked in 2006 don’t necessarily work at the volume we now have.
  • - Okay, so let’s dive a little into the models and data we use to do our personalization.
  • On the data side we have a lot to work with.
    There’s a lot of signal that we get beyond straight plays/ratings.
    If you think about it, the context in which someone chooses what to watch tells you a lot too.
  • So I want to give you a quick overview of how SVD (aka matrix factorization) works. This is one of the classic algorithms used in the Netflix Prize, and was a big breakthrough at the time. This should give you a flavor of how these systems work.
    The basic model is: rating = mean rating + my bias + movie bias + interaction.
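To make the basic model concrete, here is a small Python sketch of a biased matrix factorization prediction: mean rating plus user bias plus movie bias plus the dot product of the latent user and item vectors. The mean and the two biases are the numbers on the slide (2.50, -1.5, 1.2). The two latent vectors are hypothetical, picked so that the interaction term comes out near 1.35 and the total near the slide’s 3.55.

```python
def predict_rating(global_mean, user_bias, item_bias, user_vec, item_vec):
    """Predicted rating = mean + user bias + item bias + p . q (the interaction)."""
    if len(user_vec) != len(item_vec):
        raise ValueError("latent vectors must have the same number of factors")
    interaction = sum(p * q for p, q in zip(user_vec, item_vec))
    return global_mean + user_bias + item_bias + interaction


# Hypothetical 2-factor latent vectors (not from the talk).
aish = [0.9, 0.5]            # latent user vector p
house_of_cards = [1.0, 0.9]  # latent item vector q

prediction = predict_rating(2.50, -1.5, 1.2, aish, house_of_cards)
print(round(prediction, 2))  # approximately 3.55
```

In a real system the biases and vectors are learned from the ratings matrix rather than set by hand. Real models also tend to use many more than two factors.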
  • - To make that more visual
  • So that’s one of the foundational algorithms used in recommender systems. But things have moved on a lot since then too.
    These days we’re mostly focused on ranking rather than rating prediction. This allows us to balance things like diversity, freshness, and global popularity against our prediction of how well a title fits your tastes.
    We are A/B testing (or have A/B tested) many of these. Which algorithm to use really depends on your application and what you’re trying to achieve. All have pros and cons.
    You’ll likely end up with a few different algorithms for different parts of the problem.
    The important thing is to test them in your production system.
  • Over time we’ve been able to improve on the results we got from the Netflix Prize.
    It’s been a combination of adding more data, and adding in more sophisticated models
    As you can see here, we’ve moved things on a lot. These are improvements to Netflix’s core business metrics. So even a 1% improvement equates to real benefits to the business
    One quick note: Always make sure you select a realistic baseline to test against. Just straight global popularity is usually pretty tough to beat. So you can fool yourself if you’re not testing against that, or your equivalent of that.
  • - So now you have an idea of what a recommender system algorithm looks like, let’s see how you can productionize it.
  • So here’s the core workflow you’ll need to support. Whatever decisions you make about your architecture, you’ll need to make this process seamless.
    Machine Learning Approach
    Define the problem (what you think needs solving, or a hypothesis of what can be improved)
    Gather data on which to train the model
    Experiment offline to see if you can improve over the baseline
    Produce the model/algorithm and deploy
    Track key metrics in production to see if the hypothesis is proven
  • - Here’s a blueprint for different layers you’ll need.
    - We’ll step through each area next.
  • Okay, let’s start with the front-end (aka online). I won’t cover much here, except to point out that you’ll need an extremely good data pipeline.
    You’ll spend 90% of your time building this.
    It often needs to be built by an engineering team in collaboration with your researchers.
  • There are many different types of data you’ll want to capture.
    Incl. what your algorithms are doing. You’ll need to correct for presentation bias.
    And the context in which users interact with you, and how they behave.
  • - Need backend service that can accept and aggregate all these disparate data sources
    Want to look at technologies like Suro, Kafka, etc
    Stream to longer term (cheap) storage (S3, HDFS)
  • Need common framework that makes it easier to instrument your code for events.
    Adopt early and get into every app as “standard”
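One way to picture a common instrumentation framework is a single event type and one logging call that every app uses. The sketch below is purely illustrative: `Event`, `EventLogger` and the field names are invented for this example and are not Netflix’s actual framework. The in-memory buffer stands in for a real sink that would forward events to an aggregation service.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    event_type: str   # e.g. "play", "rating", "impression"
    user_id: str
    item_id: str
    context: dict = field(default_factory=dict)        # device, row, position, ...
    timestamp: float = field(default_factory=time.time)


class EventLogger:
    """Buffers serialized events in memory (stand-in for a real event sink)."""

    def __init__(self):
        self.buffer = []

    def log(self, event_type, user_id, item_id, **context):
        event = Event(event_type, user_id, item_id, context)
        self.buffer.append(json.dumps(asdict(event), sort_keys=True))
        return event


logger = EventLogger()
# Impressions record what the algorithm showed, which is what lets you
# correct for presentation bias later.
logger.log("impression", "user-1", "movie-42", row="Top Picks", position=0)
logger.log("play", "user-1", "movie-42", device="tv")
```

The point of a shared schema is that impressions, plays and ratings all arrive in one consistent shape, so downstream jobs can join them without per-app parsing.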
  • Okay, let’s talk about where you (typically) define and train your models.
    Most of your models will be produced offline and embedded in production.
    You’ll need a platform that allows easy experimentation across diverse tools: R, IPython, in-house.
    A common format (can be code) that allows you to embed models once learned.
  • Common confusion: Models change less than you think
    Values you’ll be plugging in, can still be real-time
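A small Python sketch of this idea: the weights are fixed after offline training, while one of the feature values (here, a movie’s real-time popularity) changes from request to request. The weights, feature names and the logistic form are assumptions made for illustration, not the model used in the talk.

```python
import math

# Weights learned offline (hypothetical values). These change rarely.
OFFLINE_WEIGHTS = {"bias": -1.0, "taste_match": 2.0, "realtime_popularity": 1.5}


def score(features, weights=OFFLINE_WEIGHTS):
    """Logistic score: fixed offline weights applied to current feature values."""
    z = weights["bias"] + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Same model, same user taste match, different real-time popularity values.
quiet = score({"taste_match": 0.5, "realtime_popularity": 0.1})
trending = score({"taste_match": 0.5, "realtime_popularity": 0.9})
print(quiet < trending)
```

The model itself is not retrained between the two calls. Only the input value changes.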
  • Let’s walk through an example of a model we train: neural networks.
  • These days we use GPUs (CUDA) to train the network.
    Thousands of cores
    Massively parallel
    Computing power is what’s changed. ANNs are really an old idea.
  • But still need to explore hyper-parameter space.
    Learning rate theta
  • Parameters
    How many layers, and how deep
  • AWS offers GPU compute instances.
    Approach. Conduct search over many different architectures / parameters
    - Distribute different architecture to each instance
    - Train model
    - Evaluate
    You can get smarter about how you explore this space. Rather than doing a grid search, you search in the areas most likely to yield improvement.
    60 cents an hour. That is a comparative fortune next to other instances, but it only takes a few hours to train a model that is used in production for weeks (or months).
    Perfect for experimental work
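The search loop described above can be sketched in a few lines of Python. This sketch uses plain random search, a simpler stand-in for the smarter guided search the notes mention, and a thread pool stands in for farming each configuration out to its own GPU instance. `train_and_evaluate` is a made-up placeholder that returns a synthetic loss, not a real training run. The parameter ranges are assumptions.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def sample_config(rng):
    """Draw one hypothetical architecture / hyper-parameter setting."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform in [1e-4, 1e-1]
        "hidden_layers": rng.randint(1, 4),
        "units_per_layer": rng.choice([64, 128, 256, 512]),
    }


def train_and_evaluate(config):
    """Placeholder for 'train on one instance, then evaluate'; synthetic loss."""
    lr_penalty = (math.log10(config["learning_rate"]) + 2.5) ** 2
    depth_penalty = abs(config["hidden_layers"] - 2) * 0.1
    return lr_penalty + depth_penalty


def random_search(n_trials, seed=0, workers=4):
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(n_trials)]
    # Each config is independent, so the trials can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        losses = list(pool.map(train_and_evaluate, configs))
    best_loss, best_config = min(zip(losses, configs), key=lambda pair: pair[0])
    return best_config, best_loss


best_config, best_loss = random_search(n_trials=20)
print(best_config, best_loss)
```

Because the trials never talk to each other, the same structure works whether the workers are threads, processes or separate machines.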
  • Your offline models won’t reflect sudden changes in behavior that they haven’t seen before.
    Here’s OITNB and House of Cards (as searched for on Google). These can represent massive shifts in global user behavior, which can throw the model off.
    Also, some models degrade faster than others. You see this especially with tree models.
  • Another problem: The models themselves still run in production (even though they’re trained offline). This limits how sophisticated you can make your models. They still need to return results within your SLAs.
  • One Solution. Near-line computing.
    Re-train models based on events from the system
    Pre-compute results where you can
  • Now you don’t always have to pre-compute the final results. The beauty of the near-line approach is that it lets you half-bake the model: the parts that are more static are pre-generated, and the parts that are more sensitive to changes get worked out on the fly.
    Remember our SVD model. U is users, M is movies, and R is ratings.
    - It turns out that solving for U, if you know M and R, is a simple least-squares problem. With modern linear algebra libraries we can compute that in milliseconds.
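Here is a minimal, dependency-free Python sketch of that least-squares step for a single user: given the latent vectors of the movies the user rated and the ratings themselves, solve the normal equations (MᵀM)u = Mᵀr for the user vector u. The movie vectors and ratings are made up. In practice you would call a linear algebra library and usually add regularization, which is left out here to keep the sketch short.

```python
def solve_linear_system(a, b):
    """Solve a x = b (a square) by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough independent ratings")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def solve_user_vector(movie_vectors, ratings):
    """Least-squares user vector u minimizing sum((m . u - r)^2) over rated movies."""
    k = len(movie_vectors[0])
    mtm = [[sum(m[i] * m[j] for m in movie_vectors) for j in range(k)] for i in range(k)]
    mtr = [sum(m[i] * r for m, r in zip(movie_vectors, ratings)) for i in range(k)]
    return solve_linear_system(mtm, mtr)


# Hypothetical pre-computed movie vectors (rows of M) and one user's ratings.
movie_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ratings = [2.0, 1.5, 3.5]
user_vector = solve_user_vector(movie_vectors, ratings)
print(user_vector)  # close to [2.0, 1.5]
```

The system being solved is only k by k, where k is the number of latent factors, which is why this is cheap enough to do on the fly.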
  • Recomputes are event driven. No need to re-compute if nothing has changed
    So in this example, we re-compute the latent vectors representing my tastes, whenever there’s more information available about me to re-train that vector with.
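The event-driven pattern can be sketched as a small store that recomputes one user’s vector only when a new event for that user arrives, and otherwise just serves the last published result. `UserVectorStore` and its methods are hypothetical names for this illustration. The `recompute` function passed in is a trivial placeholder for a real solver.

```python
class UserVectorStore:
    """Recomputes a user's vector when an event arrives; reads are plain lookups."""

    def __init__(self, recompute):
        self._recompute = recompute   # function: list of events -> vector
        self._events = {}             # user_id -> events seen so far
        self._vectors = {}            # user_id -> last published vector
        self.recompute_count = 0

    def on_event(self, user_id, event):
        """Near-line path: a new event triggers a recompute for that user only."""
        self._events.setdefault(user_id, []).append(event)
        self._vectors[user_id] = self._recompute(self._events[user_id])
        self.recompute_count += 1

    def get(self, user_id):
        """Online path: no computation, just return the published vector."""
        return self._vectors.get(user_id)


# Placeholder recompute: a real one would re-solve the user's latent vector.
store = UserVectorStore(recompute=lambda events: [float(len(events))])
store.on_event("aish", "played House of Cards")
store.get("aish")
store.get("aish")             # repeated reads do not trigger any recompute
print(store.recompute_count)  # one event so far, so one recompute
```

Users who generate no events cost nothing. The work scales with activity rather than with the size of the member base.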
  • Machine Learning at Netflix Scale

    1. Machine Learning At Netflix Scale. Aish Fenton, Manager - Research Engineering, @aishfenton
    2. Everything is a recommendation
    3. 4
    4. Top Picks for Aish
    5. Movies based on books
    6. Because you watched Bob’s Burgers
    7. Rank based on your taste
    8. 75% of plays come from homepage
    9. Back Story…
    10. Proxy question: ▪ Accuracy in predicted rating ▪ Improve by 10% = $1 million! What we were interested in: ▪ High quality recommendations (chart labels: predicted, actual)
    11. SVD, RBMs. Top two results still used in production!
    12. >
    13. 2006, 2013
    14. • > 44M members • > 40 countries • > 5B hours in Q3 2013 • Log 100B events/day • 31.62% of peak US downstream traffic
    15. Data and Models
    16. ▪ > 40M subscribers ▪ Ratings: ~5M/day ▪ Searches: > 3M/day ▪ Plays: > 50M/day ▪ Streamed hours: 5B hours in Q3 2013. Data sources: Geo Info, Time, Impressions, Device Info, Metadata, Social, Ratings, Demographics, Member Behavior, Plays
    17. Aish (Latent User Vector), House of Cards (Latent Item Vector)
    18. Diagram of R, U, M: users u1, u2, u3 (Aish); movies m1, m2, m3 (House of Cards); rating 3.53
    19. Mean Rating, My Bias, Movie Bias, Interaction
    20. My rating for House of Cards: 3.55 = 2.50 + -1.5 + 1.2 + pq (Mean Rating, My Bias, Movie Bias, Interaction)
    21. Diagram of R, U, M, T: users u1, u2, u3 (Aish); movies m1, m2, m3 (House of Cards); time t1, t2, t3; values 3.53, 2.35, 1.34
    22. ▪ Matrix/Tensor Factorization ▪ Regression models (Logistic, Linear, Elastic nets) ▪ Factorization Machines ▪ Restricted Boltzmann Machines ▪ Markov Chains & other graph models ▪ Clustering / Topic Models ▪ Neural Networks ▪ Association Rules ▪ GBDT/RF ▪ …
    23. Chart, Improvement Over Baseline (axis 0% to 300%): Popularity; + Ratings; + More Features & Optimized Models
    24. Anatomy of a Machine Learning Platform
    25. Problem, Data, Experiment Offline, Produce Model, Test / Metrics
    26. Architecture diagram. Layers: Online, Near-line, Offline. Components: UI Clients, Event Distribution, Online Algs, Model Trainer, Pre-compute, AB Test Metrics, API Layer, Monitoring, Hadoop / Data Warehouse, Experimentation Platform, S3 / HDFS, Offline Metrics, Query Tools, Models (x2)
    27. Architecture diagram (same as slide 26)
    28. Many different types of data… ▪ App Logs ▪ User Actions ▪ Ratings ▪ Plays ▪ Queue Adds ▪ Algo Actions ▪ Impressions (Presentation Bias) ▪ Context ▪ Device Info ▪ User Demographics ▪ Social ▪ Time ▪ …
    29. Architecture diagram (same as slide 26), plus labels: Embedded, Embedded
    30. Weights; Real-time popularity of movie
    31. Example: Neural Network Training
    32. θ; Input, Hidden Layer, Output
    33. Input, Hidden Layers, Output
    34. Neural Network Training: 1,536 cores, G2 Instances, $0.60 p/h
    35. But… things can go astray
    36. Architecture diagram (same as slide 26)
    37. R, U, M; Pre-compute; u1, u2, u3; Online
    38. Architecture diagram (same as slide 26), plus labels: Aish played HoC; Publish new model for Aish
    39. Aish Fenton @aishfenton