
Lessons Learned from Building Machine Learning Software at Netflix

Talk from Software Engineering for Machine Learning Workshop (SW4ML) at the Neural Information Processing Systems (NIPS) 2014 conference in Montreal, Canada on 2014-12-13.

Abstract:
Building a real system that incorporates machine learning as a part can be a difficult effort, both in terms of the algorithmic and engineering challenges involved. In this talk I will focus on the engineering side and discuss some of the practical issues we’ve encountered in developing real machine learning systems at Netflix and some of the lessons we’ve learned over time. I will describe our approach for building machine learning systems and how it comes from a desire to balance many different, and sometimes conflicting, requirements such as handling large volumes of data, choosing and adapting good algorithms, keeping recommendations fresh and accurate, remaining responsive to user actions, and also being flexible to accommodate research and experimentation. I will focus on what it takes to put machine learning into a real system that works in a feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. I will address the particular software engineering challenges that we’ve faced in running our algorithms at scale in the cloud. I will also mention some simple design patterns that we’ve found to be useful across a wide variety of machine-learned systems.

Published in: Technology

Lessons Learned from Building Machine Learning Software at Netflix

  1. Lessons Learned from Building Machine Learning Software at Netflix
      • Justin Basilico, Page Algorithms Engineering
      • December 13, 2014
      • @JustinBasilico
      • Workshop 2014
  2. Introduction
  3. Introduction: 2006, 2014
  4. Netflix Scale
      • > 50M members
      • > 40 countries
      • > 1000 device types
      • Hours: > 2B/month
      • Plays: > 70M/day
      • Log 100B events/day
      • 34.2% of peak US downstream traffic
  5. Goal: Help members find content to watch and enjoy to maximize member satisfaction and retention
  6. Everything is a Recommendation
      • Rows, Ranking
      • Over 75% of what people watch comes from our recommendations
      • Recommendations are driven by Machine Learning
  7. Machine Learning Approach: Problem, Data, Metrics, Model, Algorithm
  8. Models & Algorithms
      • Regression (Linear, logistic, elastic net)
      • SVD and other Matrix Factorizations
      • Factorization Machines
      • Restricted Boltzmann Machines
      • Deep Neural Networks
      • Markov Models and Graph Algorithms
      • Clustering
      • Latent Dirichlet Allocation
      • Gradient Boosted Decision Trees/Random Forests
      • Gaussian Processes
      • …
  9. Design Considerations
      • Recommendations: Personal, Accurate, Diverse, Novel, Fresh
      • Software: Scalable, Responsive, Resilient, Efficient, Flexible
  10. Software Stack (http://techblog.netflix.com)
  11. Lessons Learned
  12. Lesson 1: Be flexible about where and when computation happens.
  13. System Architecture
      • Offline: Process data
      • Nearline: Process events
      • Online: Process requests
      • Learning, Features, or Model evaluation can be done at any level
      • More details on Netflix Techblog
      [Architecture diagram. Labels: Member, UI Client, Play/Rate/Browse events, Event Distribution (Netflix.Manhattan), User Event Queue, Query results (Netflix.Hermes), Offline Data, Model training, Models, Online Data Service, Algorithm Service, Recommendations; Machine Learning Algorithm boxes in each of Offline, Nearline, and Online Computation]
  14. Where to place components? Example: Matrix Factorization
      • Offline:
         • Collect sample of play data
         • Run batch learning algorithm like SGD to produce factorization
         • Publish video factors
      • Nearline:
         • Solve user factors
         • Compute user-video dot products
         • Store scores in cache
      • Online:
         • Presentation-context filtering
         • Serve recommendations
      [Architecture diagram as on the previous slide, annotated with $X \approx UV^t$, $V$, $Au_i = b$, $s_{ij} = u_i v_j$, and $s_{ij} > t$]
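
The offline/nearline/online split for matrix factorization might look roughly like the sketch below. It assumes NumPy and synthetic data. The offline SGD step is replaced by fixed random video factors, and the ridge term in the solve, the function names, and the threshold value are illustrative assumptions, not details from the talk (the slide only states $Au_i = b$).

```python
import numpy as np

def solve_user_factors(V, played, reg=0.1):
    """Nearline step: solve A u = b for one user's factor vector.

    V      : (n_videos, k) video factors published by the offline job
    played : (n_videos,) observed play signal for this user (one row of X)
    reg    : ridge term (an assumption; it keeps A invertible)
    """
    k = V.shape[1]
    A = V.T @ V + reg * np.eye(k)
    b = V.T @ played
    return np.linalg.solve(A, b)

def score_videos(u, V):
    """Nearline step: s_ij = u_i . v_j for every video j."""
    return V @ u

def filter_for_presentation(scores, threshold, already_shown=()):
    """Online step: keep videos with s_ij > t, best first, minus excluded ones."""
    excluded = set(already_shown)
    ranked = np.argsort(-scores)
    return [int(j) for j in ranked
            if scores[j] > threshold and int(j) not in excluded]

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 2))        # stand-in for offline-trained video factors
true_u = np.array([1.0, -0.5])     # hypothetical user taste vector
played = V @ true_u                # synthetic, noise-free play signal

u = solve_user_factors(V, played, reg=1e-8)
scores = score_videos(u, V)        # in a real system these would go to a cache
recs = filter_for_presentation(scores, threshold=0.0, already_shown=[0])
```

Only the last function would need to run at request time; the two nearline steps could be triggered by play events and their output cached.
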
  15. Lesson 2: Think about distribution starting from the outermost levels.
  16. Three levels of Learning Distribution/Parallelization
      1. For each subset of the population (e.g. region)
         • Want independently trained and tuned models
      2. For each combination of (hyper)parameters
         • Simple: Grid search
         • Better: Bayesian optimization using Gaussian Processes
      3. For each subset of the training data
         • Distribute over machines (e.g. ADMM)
         • Multi-core parallelism (e.g. HogWild)
         • Or… use GPUs
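
As a rough illustration of the two outer levels, the sketch below treats every (region, hyperparameter) pair as an independent job. It uses the "simple" grid search option, a thread pool as a stand-in for a cluster scheduler, and a closed-form ridge regression as a stand-in for the level-3 training routine. The region names, data, and grid values are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import numpy as np

def make_region_data(seed):
    """Synthetic train/validation split for one region (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(80, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=80)
    return X[:60], y[:60], X[60:], y[60:]

def train_and_validate(region, reg, data):
    """Level 3 stand-in: fit ridge regression, return validation error."""
    X_train, y_train, X_val, y_val = data[region]
    k = X_train.shape[1]
    w = np.linalg.solve(X_train.T @ X_train + reg * np.eye(k),
                        X_train.T @ y_train)
    err = float(np.mean((X_val @ w - y_val) ** 2))
    return region, reg, err

data = {"us": make_region_data(1), "eu": make_region_data(2)}
grid = [0.01, 0.1, 1.0, 10.0]

# Levels 1 and 2: each (region, hyperparameter) pair is an independent job,
# so the outer loops can be distributed without any coordination.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda job: train_and_validate(*job, data),
                            product(data, grid)))

# Keep the best hyperparameter per region: one tuned model per population subset.
best = {}
for region, reg, err in results:
    if region not in best or err < best[region][1]:
        best[region] = (reg, err)
```

Because the outer jobs share nothing, they are the cheapest place to add parallelism; the inner training routine can then be optimized separately.
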
  17. Example: Training Neural Networks
      • Level 1: Machines in different AWS regions
      • Level 2: Machines in same AWS region
         • Spearmint or MOE for parameter optimization
         • Condor, StarCluster, Mesos, etc. for coordination
      • Level 3: Highly optimized, parallel CUDA code on GPUs
  18. Lesson 3: Design application software for experimentation.
  19. Example development process
      • Experimentation environment: Idea, Data, Offline Modeling (R, Python, MATLAB, …), Iterate, Final model
      • Production environment (A/B test): Implement in production system (Java, C++, …), Actual output
      • Problems between the two: Data discrepancies, Code discrepancies, Missing post-processing logic, Performance issues
  20. Avoid dual implementations
      • Experiment code and Production code, for Experiment and Production, on a Shared Engine
      • Shared Engine: Models, Features, Algorithms, …
  21. Solution: Share and lean towards production
      • Developing machine learning is an iterative process
         • Want a short pipeline to rapidly try ideas
         • Want to see output of complete system, not just learned component
      • Make application components easy to experiment with
         • Share them between online, nearline, and offline
         • Make it possible to run individual parts of the software
         • Use the real code whenever possible
      • Have well-defined interfaces and formats to allow you to go off the beaten path
  22. Lesson 4: Make algorithms extensible and modular.
  23. Make algorithms and models extensible and modular
      • Algorithms often need to be tailored for a specific application
         • Treating an algorithm as a black box is limiting
         • Better to make algorithms extensible and modular to allow for customization
      • Separate models and algorithms
         • Many algorithms can learn the same model (e.g. linear binary classifier)
         • Many algorithms can be trained on the same types of data
      • Support composing algorithms
      [Diagram contrasting two designs ("Vs."), with labels Data, Parameters, Model, Algorithm]
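
One way to read "separate models and algorithms" is sketched below: a single model class that only knows how to predict, and two unrelated learning algorithms that both return it. The class names, the `learn` method, and the toy data are assumptions made for this example; they are not taken from the talk or from any particular library.

```python
import numpy as np

class LinearBinaryClassifier:
    """The model: parameters plus a prediction rule, with no learning logic."""
    def __init__(self, weights, bias=0.0):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def score(self, x):
        return float(self.weights @ x + self.bias)

    def predict(self, x):
        return 1 if self.score(x) >= 0.0 else -1

class Perceptron:
    """One algorithm that learns a LinearBinaryClassifier."""
    def __init__(self, epochs=20):
        self.epochs = epochs

    def learn(self, X, y):
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(self.epochs):
            for xi, yi in zip(X, y):
                if yi * (w @ xi + b) <= 0:   # misclassified: move toward example
                    w, b = w + yi * xi, b + yi
        return LinearBinaryClassifier(w, b)

class LogisticRegressionGD:
    """A different algorithm that produces the same kind of model."""
    def __init__(self, step=0.5, iterations=500):
        self.step, self.iterations = step, iterations

    def learn(self, X, y):
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(self.iterations):
            margins = y * (X @ w + b)
            coeff = -y / (1.0 + np.exp(margins))  # d/dz of log(1 + exp(-y z))
            w -= self.step * (X.T @ coeff) / len(y)
            b -= self.step * coeff.mean()
        return LinearBinaryClassifier(w, b)

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
models = [algo.learn(X, y) for algo in (Perceptron(), LogisticRegressionGD())]
predictions = [[m.predict(x) for x in X] for m in models]
```

Code that consumes the model (serving, evaluation, serialization) then never needs to know which algorithm produced it.
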
  24. Provide building blocks
      • Don’t start from scratch
         • Linear algebra: Vectors, Matrices, …
         • Statistics: Distributions, tests, …
         • Models, features, metrics, ensembles, …
         • Cost, distance, kernel, … functions
         • Optimization, inference, …
         • Layers, activation functions, …
         • Initializers, stopping criteria, …
         • …
         • Domain-specific components
      • Build abstractions on familiar concepts
      • Make the software put them together
  25. Example: Tailoring Random Forests
      • Use a custom tree split
      • Customize to run it for an hour
      • Report a custom metric each iteration
      • Inspect the ensemble
      • Using Cognitive Foundry: http://github.com/algorithmfoundry/Foundry
  26. Lesson 5: Describe your input and output transformations with your model.
  27. Putting learning in an application
      • Application
      • Feature Encoding
      • Machine Learned Model: $\mathbb{R}^d \to \mathbb{R}^k$
      • Output Decoding
      • Application or model code?
  28. Example: Simple ranking system
      • High-level API: List<Video> rank(User u, List<Video> videos)
      • Example model description file:
          {
            "type": "ScoringRanker",
            "scorer": {
              "type": "FeatureScorer",
              "features": [
                {"type": "Popularity", "days": 10},
                {"type": "PredictedRating"}
              ],
              "function": {
                "type": "Linear",
                "bias": -0.5,
                "weights": {
                  "popularity": 0.2,
                  "predictedRating": 1.2,
                  "predictedRating*popularity": 3.5
                }
              }
            }
          }
      • Parts of the description: Ranker, Scorer, Features, Linear function, Feature transformations
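
A small interpreter for a description file like the one on this slide could look like the sketch below. Several things here are assumptions rather than anything stated in the talk: the rule that a feature's weight key is its type name with a lowercase first letter, the reading of `a*b` as a product of two features, and the toy `user` and `video` dictionaries that stand in for real data services.

```python
import json

DESCRIPTION = """
{ "type": "ScoringRanker",
  "scorer": { "type": "FeatureScorer",
    "features": [ {"type": "Popularity", "days": 10},
                  {"type": "PredictedRating"} ],
    "function": { "type": "Linear", "bias": -0.5,
      "weights": { "popularity": 0.2, "predictedRating": 1.2,
                   "predictedRating*popularity": 3.5 } } } }
"""

# Hypothetical feature implementations backed by in-memory dictionaries.
def popularity(user, video, days):
    return video["plays_by_window"][days]

def predicted_rating(user, video):
    return user["predicted_ratings"][video["id"]]

FEATURES = {"Popularity": popularity, "PredictedRating": predicted_rating}

def build_ranker(description):
    spec = json.loads(description)
    assert spec["type"] == "ScoringRanker"
    scorer = spec["scorer"]
    function = scorer["function"]
    assert scorer["type"] == "FeatureScorer" and function["type"] == "Linear"

    def encode(user, video):
        """Feature encoding: compute each declared feature by name."""
        values = {}
        for feature in scorer["features"]:
            params = {k: v for k, v in feature.items() if k != "type"}
            name = feature["type"][0].lower() + feature["type"][1:]
            values[name] = FEATURES[feature["type"]](user, video, **params)
        return values

    def score(user, video):
        values = encode(user, video)
        total = function["bias"]
        for term, weight in function["weights"].items():
            product = 1.0
            for name in term.split("*"):   # "a*b" is read as an interaction
                product *= values[name]
            total += weight * product
        return total

    def rank(user, videos):
        return sorted(videos, key=lambda v: score(user, v), reverse=True)

    rank.score = score   # exposed so the scorer can be inspected on its own
    return rank

user = {"predicted_ratings": {"a": 4.0, "b": 2.0}}
videos = [{"id": "b", "plays_by_window": {10: 0.9}},
          {"id": "a", "plays_by_window": {10: 0.5}}]
rank = build_ranker(DESCRIPTION)
ranked_ids = [v["id"] for v in rank(user, videos)]
```

Because the feature encoding is named in the description itself, the same file can drive both an experiment and the production ranker without a second implementation of the transformations.
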
  29. Lesson 6: Don’t just rely on metrics for testing.
  30. Importance of Testing
      • Temptation: Use validation metrics to test software
         • When things work this seems great
         • When metrics don’t improve: was it the code, data, metric, idea, …?
      • Machine learning code involves intricate math and logic
         • Rounding issues, corner cases, …
         • Is that a + or -? (The math or paper could be wrong.)
      • Solution: Unit test
         • Testing of metric code is especially important
      • Test the whole system
         • Compare output for unexpected changes across versions
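
To make "testing of metric code" concrete, here is a minimal sketch using Python's `unittest`. NDCG@k is chosen only as a familiar ranking metric for the example; the talk does not say which metrics were tested. The tests check a hand-computed value and the kind of corner cases the slide warns about.

```python
import math
import unittest

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    """NDCG@k; defined here as 0.0 when no item is relevant."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

class NdcgTest(unittest.TestCase):
    def test_perfect_ranking_scores_one(self):
        self.assertAlmostEqual(ndcg_at_k([3, 2, 1, 0], k=4), 1.0)

    def test_hand_computed_value(self):
        # DCG = 0/log2(2) + 1/log2(3); ideal DCG = 1/log2(2) = 1
        self.assertAlmostEqual(ndcg_at_k([0, 1], k=2), 1 / math.log2(3))

    def test_corner_cases(self):
        self.assertEqual(ndcg_at_k([], k=5), 0.0)           # empty list
        self.assertEqual(ndcg_at_k([0, 0, 0], k=3), 0.0)    # nothing relevant
        self.assertAlmostEqual(ndcg_at_k([1], k=10), 1.0)   # k beyond the list

suite = unittest.defaultTestLoader.loadTestsFromTestCase(NdcgTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A bug in the metric itself would otherwise surface only as a puzzling validation number, which is exactly the situation where it is hard to tell code, data, and idea apart.
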
  31. Conclusions
  32. Two ways to solve computational problems
      • Software Development: Know solution, Write code, Compile code, Test code, Deploy code
      • Machine Learning (steps may involve Software Development): Know relevant data, Develop algorithmic approach, Train model on data using algorithm, Validate model with metrics, Deploy model
  33. Take-aways for building machine learning software
      • Building machine learning is an iterative process
      • Make experimentation easy
      • Take a holistic view of both the application and experimental environments
      • Optimize only what matters
      • Testing can be hard but is worthwhile
  34. Thank You
      • Justin Basilico, jbasilico@netflix.com, @JustinBasilico
      • We’re hiring
