
Is that a Time Machine? Some Design Patterns for Real World Machine Learning Systems


Talk from ICML 2016 workshop on Machine Learning Systems about some design patterns we use at Netflix for building machine learning systems. In particular, focusing on avoiding problems that can come up with differences between offline (experimental/lab) and online (live/production) code and data.



  1. Is that a Time Machine? Some Design Patterns for Real-World Machine Learning Systems
     Justin Basilico, Page Algorithms Engineering, Netflix
     ICML ML Systems Workshop, June 24, 2016
     @JustinBasilico
  2. Introduction
  3. Focus: 2006 → 2016
  4. Netflix Scale
      > 81M members
      > 190 countries
      > 1000 device types
      > 3B hours/month
      > 36% of peak US downstream traffic
  5. Goal
     Help members find content to watch and enjoy, in order to maximize member satisfaction and retention.
  6. Machine Learning is Everywhere
     Rows, Ranking
     Over 80% of what people watch comes from our recommendations.
  7. Models & Algorithms
      Regression (linear, logistic, elastic net)
      SVD and other matrix factorizations
      Factorization Machines
      Restricted Boltzmann Machines
      Deep Neural Networks
      Markov Models and Graph Algorithms
      Clustering
      Latent Dirichlet Allocation
      Gradient Boosted Decision Trees / Random Forests
      Gaussian Processes
      …
  8. Systems
      AWS Cloud
      Online: Microservices, Java, EVCache, Cassandra
      Offline: Hive on S3; Spark, Docker, Meson
     [Architecture diagram: user events (play, rate, browse, …) flow from the UI client through an event queue and event distribution (Netflix.Hermes, Netflix.Manhattan); offline, nearline, and online layers each run machine learning algorithms, with offline model training publishing models to online computation. More details on the Netflix Techblog.]
  9. Design Patterns
  10. Why Design Patterns for ML Systems?
      Idea → Experiment → Live: problems arise at each transition.
  11. Design patterns provide…
       Common solutions to common problems
        No need to re-invent them
        A menu of approaches
       Reusable abstractions
        Transcend specific implementations
       Common terminology
        Eases communication of how something works
  12. Some machine learning patterns…
      The Hulk, The Lumberjack, The Online Archive, The Time Machine, The Sentinel, The Precog, The Dagobah, The Anytime Algorithm, The Parameter Oracle, The LEGO, The Terminator, The Inception, The Feature Encoder, The Hoarder, The Transformer, The Parameter Server, The Log Space, The Matrix Transposed, The Overflow, The Substitute
      Thanks to: Aish Fenton, Yves Raimond, Dave Ray, Hossein Taghavi, Anuj Shah, DB Tsai, …
  13. Machine Learning in an Application
      Inside the application, the machine-learned model is applied as: Feature Encoding → Predictor → Output Decoding.
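The encode → predict → decode flow on this slide can be sketched in a few lines. This is a minimal illustration, not Netflix code; all class and field names here (`FeatureEncoder`, `Predictor`, `OutputDecoder`, the toy features and weights) are invented for the example.

```python
class FeatureEncoder:
    """Turns raw application context into a numeric feature vector."""
    def encode(self, context):
        return [float(context["hour_of_day"]) / 24.0,
                1.0 if context["is_member"] else 0.0]

class Predictor:
    """A trained model; here a fixed linear scorer for illustration."""
    def __init__(self, weights):
        self.weights = weights
    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

class OutputDecoder:
    """Maps a raw model score to an application-level decision."""
    def decode(self, score, threshold=0.5):
        return "recommend" if score >= threshold else "skip"

class ModelInApplication:
    """The slide's pipeline: Feature Encoding -> Predictor -> Output Decoding."""
    def __init__(self, encoder, predictor, decoder):
        self.encoder, self.predictor, self.decoder = encoder, predictor, decoder
    def apply(self, context):
        features = self.encoder.encode(context)
        return self.decoder.decode(self.predictor.predict(features))

app = ModelInApplication(FeatureEncoder(), Predictor([1.0, 0.4]), OutputDecoder())
print(app.apply({"hour_of_day": 20, "is_member": True}))  # -> recommend
```

The point of the wrapper is that the antipattern on the next slide shows up precisely when any of these three pieces differs between training and serving.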
  14. Antipattern: The Phantom Menace (AKA Training/Serving Skew)
      Different code/data/platform between training and applying the model. (© Lucasfilm Ltd.)
  15. "Typical" ML Pipeline: A Tale of Two Worlds
      Offline (training pipeline): Historical Data → Generate Features / Collect Labels → Train Models → Evaluate Models → Validate & Select Models → Publish Model, with experimentation throughout.
      Online (application): Live Data → Load Model → Application Logic.
  16. The Sentinel
      Validate the model/data in the online environment before letting it go live.
      "You shall not pass!" (© New Line Cinema)
  17. Sentinel: Structure
      Offline: Model → Model Publisher.
      Online: a Sentinel Service (Model Loader → Model Validator) checks the model before the Application's Model Loader picks it up; on failure, alert and republish.
      Some potential checks:
       File format is valid
       Dependent data is available
       Accuracy on shadow live data
       Feature distributions match
       Output is properly calibrated
  18. Sentinel
      Example: Checking that a new ranking model is valid and performs better than the previous one
      Pros:
       Using a model requires that both code and data be available
       Models may need to be versioned alongside code changes
       Ensures that a new model is no worse than the previous one
      Cons:
       The Sentinel needs to stay in sync with application code
       Difficult to choose failure thresholds for data-based checks
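A Sentinel's validation step can be sketched as a gate that runs every check and blocks promotion on any failure. This is a minimal sketch under invented assumptions: the model and shadow-data structures, the check names, and the 0.001 accuracy tolerance are all hypothetical, not the talk's implementation.

```python
def validate_model(candidate, live, shadow_data):
    """Run the sentinel checks; return (ok, list of failed check names)."""
    checks = {
        # File format is valid
        "file_format": candidate.get("format") == "v2",
        # Dependent data is available in the online environment
        "dependent_data": all(dep in shadow_data["available"]
                              for dep in candidate["deps"]),
        # Accuracy on shadow live data is no worse than the live model's
        "accuracy_no_worse": (candidate["shadow_accuracy"]
                              >= live["shadow_accuracy"] - 0.001),
    }
    failures = [name for name, passed in checks.items() if not passed]
    return (len(failures) == 0, failures)

candidate = {"format": "v2", "deps": ["video_metadata"], "shadow_accuracy": 0.83}
live = {"shadow_accuracy": 0.81}
shadow = {"available": {"video_metadata", "viewing_history"}}
ok, failures = validate_model(candidate, live, shadow)
print(ok, failures)  # -> True []
```

The hard part called out in the cons list lives in the last check: the tolerance (here 0.001) is exactly the kind of data-based failure threshold that is difficult to choose.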
  19. The Hulk (AKA Offline Precompute)
      Train and evaluate your full model offline, then publish the final outputs.
      Scale for production by batching and brute force. (© Disney)
  20. Offline Precompute: Example Structure
      Offline: Historical Data → Generate Features → Model Evaluation (Predictor → Decode Output) → Data Publisher, saving key → output pairs.
      Online: the Application looks up precomputed outputs in a Cache by key.
  21. Offline Precompute (AKA The Hulk)
      Example: Computing unpersonalized video-to-video similarities
      Pros:
       Easy to set up based on experiment code
       Decouples implementation from the online platform
       Can use more computationally expensive models
      Cons:
       Can't depend on online facts or fresh data
       May have data gaps (e.g. handling new videos, users, etc.)
       May require cleanup to make consistent with online data
       Model output is based on offline data, so it may not be properly calibrated
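Using the slide's own example (unpersonalized video-to-video similarity), the pattern can be sketched as: run the expensive model over every key offline, publish only key → output pairs, and serve by lookup. The similarity function and video titles below are toy stand-ins, not Netflix's model.

```python
def offline_precompute(videos, similarity):
    """Brute-force the full model offline; publish final outputs per key."""
    cache = {}
    for v in videos:
        ranked = sorted((u for u in videos if u != v),
                        key=lambda u: similarity(v, u), reverse=True)
        cache[v] = ranked  # publish outputs, never the model itself
    return cache

def online_lookup(cache, video, default=()):
    """Online path never runs the model; unknown keys are the data gaps
    the cons list warns about (e.g. a brand-new video)."""
    return cache.get(video, default)

sim = lambda a, b: -abs(len(a) - len(b))  # toy similarity on title length
cache = offline_precompute(["Narcos", "Okja", "Stranger Things"], sim)
print(online_lookup(cache, "Okja"))           # -> ['Narcos', 'Stranger Things']
print(online_lookup(cache, "BrandNewShow"))   # data gap -> ()
```

The second lookup shows the main con concretely: anything that appears after the batch run simply is not in the cache, so the application needs a fallback.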
  22. The Lumberjack (AKA Feature Logging)
      Train the model on features logged online from within the application. (Image via YouTube)
  23. Feature Logging: Structure
      Online: Live Data → Generate Features (driven by a Feature Config) → Predictor → Decode Output, logging each request's features with an id.
      Offline: join the Feature Log with Labels by id → Train Models → Model Evaluation.
  24. Feature Logging (AKA The Lumberjack)
      Example: Features of pages, rows, and videos in page generation
      Pros:
       Train on features exactly as seen online
       Easy to deploy the trained model
       Can include the impact of upstream application logic
      Cons:
       Requires production-grade feature code and deployment
       Takes time to log enough data
       All dependent data must also be in production
       Adds risk to production servers for experimental features
       Feature data can be large and may require sampling
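The structure on the previous slide reduces to two halves: the online path logs the exact features it served, keyed by a request id, and the offline path later joins those logged features with labels by that id. A minimal sketch, with invented feature names and in-memory stand-ins for the feature log and label store:

```python
feature_log = []   # stand-in for the online feature log
labels = {}        # stand-in for the collected-labels store

def serve(request_id, context):
    """Online: generate features, log them, and return them for serving."""
    features = {"row_pos": context["row"], "hour": context["hour"] % 24}
    feature_log.append({"id": request_id, "features": features})
    return features

def record_label(request_id, played):
    """Label collection, e.g. whether the member played the video."""
    labels[request_id] = 1 if played else 0

def build_training_set():
    """Offline: join logged features with labels by request id."""
    return [(entry["features"], labels[entry["id"]])
            for entry in feature_log if entry["id"] in labels]

serve("r1", {"row": 0, "hour": 20})
serve("r2", {"row": 3, "hour": 26})
record_label("r1", played=True)
training = build_training_set()
print(training)  # -> [({'row_pos': 0, 'hour': 20}, 1)]
```

Because training reads only what was logged, the model sees features exactly as served, which is the pattern's main pro; the unlabeled `r2` request also hints at why enough data takes time to accumulate.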
  25. The Online Archive
      Have online services save history and expose it to offline systems via a batch interface. (© Lucasfilm Ltd.)
  26. Online Archive: Structure
      Online: the Application reads from the service through its live interface.
      Offline: Generate Features and Collect Labels from Live + Historical Data through the batch interface, then Train Model.
  27. Online Archive
      Example: Filtering online viewing history
      Pros:
       Provides access to the online view of the data at any time
       Can experiment with new features
      Cons:
       All dependent data needs to keep track of its full history
       Only works for small data
       Requires the batch interface to also be available within the application
       May be other processes that edit history (e.g. slow-arriving events)
       The service needs to handle two very different request loads so batch queries don't bring down the live system
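The core of the pattern is one service keeping full history behind two interfaces: a small, latest-first live interface for online requests and a full-history batch interface for offline feature and label generation. A minimal in-memory sketch, using the slide's viewing-history example (the class and method names are invented):

```python
class ViewingHistoryService:
    """One service, two interfaces: live (small, recent) and batch (full)."""
    def __init__(self):
        self._history = {}  # member_id -> list of (timestamp, video)

    def record(self, member_id, timestamp, video):
        self._history.setdefault(member_id, []).append((timestamp, video))

    def live_recent(self, member_id, n=3):
        """Live interface: a small recent slice, cheap enough for online use."""
        return [v for _, v in sorted(self._history.get(member_id, []))[-n:]]

    def batch_as_of(self, timestamp):
        """Batch interface: every member's history up to a point in time,
        for offline feature generation and label collection."""
        return {m: [(t, v) for t, v in events if t <= timestamp]
                for m, events in self._history.items()}

svc = ViewingHistoryService()
svc.record("m1", 1, "Okja")
svc.record("m1", 2, "Narcos")
print(svc.live_recent("m1"))   # online view  -> ['Okja', 'Narcos']
print(svc.batch_as_of(1))      # offline view -> {'m1': [(1, 'Okja')]}
```

The cons list follows directly from this shape: `batch_as_of` scans everything (so it only works for small data), and in a real service it would need separate capacity so batch scans cannot starve live traffic.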
  28. The Time Machine
      Snapshot facts and share feature generation code between online and offline.
  29. Time Machine: Example Structure
      Online: a Snapshotter logs facts from Live Data, Data Services, and Bulk Data into a Fact Log; the Application generates features (per the Feature Config) for its Predictor and decodes the output.
      Offline: Generate Features from the snapshotted facts using the same Feature Config, join with Labels, and run Model Evaluation; Other Models can reuse the snapshots.
  30. Time Machine
      Example: Training ranking models in Spark
      Pros:
       Easy to experiment with new features offline
       Allows testing the impact of modifying non-ML components
       Can construct full application output after trying a new model
       Can share snapshots across applications to help build new ones
      Cons:
       Fact data volume can be high and may require sampling
       Snapshotting requires deciding which contexts to collect data for
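The two ingredients of the Time Machine can be sketched together: snapshot the raw facts at request time, and run one shared feature-generation function both online (on live facts) and offline (on the snapshot). That makes features reproducible and lets new feature configs be tried against historical contexts. Everything below (the fact/feature names, the config format) is an invented illustration:

```python
fact_log = []  # snapshotted facts, keyed by context id

def generate_features(facts, config):
    """One shared feature function, used by both the online and offline paths."""
    feats = {}
    if "history_len" in config:
        feats["history_len"] = len(facts["viewing_history"])
    if "country" in config:
        feats["country"] = facts["country"]
    return feats

def online_request(context_id, facts, config):
    """Online: snapshot the facts for this context, then compute features."""
    fact_log.append({"context": context_id, "facts": facts})
    return generate_features(facts, config)

served = online_request("c1",
                        {"viewing_history": ["Okja", "Narcos"], "country": "US"},
                        config=["history_len"])

# Later, offline: "travel back" by replaying the snapshot, here with a
# *new* feature config that was never deployed online.
snapshot = fact_log[0]
replayed = generate_features(snapshot["facts"],
                             config=["history_len", "country"])
print(served)    # -> {'history_len': 2}
print(replayed)  # -> {'history_len': 2, 'country': 'US'}
```

Snapshotting facts rather than features is the key design choice: logged features freeze one config forever, while logged facts let the offline side recompute old features exactly or compute new ones, at the cost of the data volume the cons list mentions.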
  31. Conclusions
  32. Conclusion
      Some design patterns for avoiding online-offline discrepancies:
       The Sentinel
       The Hulk
       The Lumberjack
       The Online Archive
       The Time Machine
      What useful patterns do you see for ML systems? Share them!
  33. Thank You
      Justin Basilico (@JustinBasilico)