Talk from the ICML 2016 Workshop on Machine Learning Systems about some design patterns we use at Netflix for building machine learning systems, focusing in particular on avoiding problems that arise from differences between offline (experimental/lab) and online (live/production) code and data.
Is that a Time Machine?
Some Design Patterns for Real-World Machine Learning Systems
Justin Basilico
Page Algorithms Engineering
ICML ML Systems Workshop
June 24, 2016
@JustinBasilico
DeLorean image by JMortonPhoto.com & OtoGodfrey.com
Design patterns provide…
  Common solutions to common problems
    No need to re-invent them
    A menu of approaches
  Reusable abstractions
    Transcend specific implementations
  Common terminology
    Eases communication of how something works
Some machine learning patterns…
The Hulk
The Lumberjack
The Online Archive
The Time Machine
The Sentinel
The Precog
The Dagobah
The Anytime Algorithm
The Parameter Oracle
The LEGO
The Terminator
The Inception
The Feature Encoder
The Hoarder
The Transformer
The Parameter Server
The Log Space
The Matrix Transposed
The Overflow
The Substitute
Thanks to: Aish Fenton, Yves Raimond, Dave Ray, Hossein Taghavi, Anuj Shah, DB Tsai, …
Machine Learning in an Application
[Diagram: the Application calls a Predictor, which wraps Feature Encoding, a Machine Learned Model, and Output Decoding]
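As a rough sketch of that structure, the following Python shows one way such a Predictor abstraction might bundle the three components behind a single interface; the class, field, and function names here are illustrative assumptions, not Netflix's actual API.

```python
# Minimal sketch of a Predictor that bundles feature encoding,
# a learned model, and output decoding behind one interface.
# All names here are illustrative, not Netflix's actual API.
from typing import Any, Callable, Dict, List


class Predictor:
    def __init__(self,
                 encode_features: Callable[[Dict[str, Any]], List[float]],
                 model: Callable[[List[float]], float],
                 decode_output: Callable[[float], Any]):
        self.encode_features = encode_features  # raw context -> feature vector
        self.model = model                      # feature vector -> raw score
        self.decode_output = decode_output      # raw score -> application output

    def predict(self, context: Dict[str, Any]) -> Any:
        features = self.encode_features(context)
        score = self.model(features)
        return self.decode_output(score)


# Example wiring with trivial stand-ins:
predictor = Predictor(
    encode_features=lambda ctx: [float(ctx["hour_of_day"]) / 24.0],
    model=lambda x: 2.0 * x[0] + 0.1,
    decode_output=lambda score: "show_row" if score > 0.5 else "hide_row",
)
print(predictor.predict({"hour_of_day": 20}))  # -> "show_row"
```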
“Typical” ML Pipeline: A tale of two worlds
[Diagram: Offline (Experimentation), a Training Pipeline takes Historical Data, Generates Features, Trains Models, then Validates & Selects Models, Evaluates the Model, and Publishes it. Online, the Application Loads the published Model and applies Application Logic to Live Data, while Collect Labels feeds outcomes back into the Historical Data.]
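To make the risk of having two worlds concrete, here is a toy Python illustration (all names hypothetical) of how separately maintained offline and online feature code can silently drift:

```python
# Toy illustration (all names hypothetical) of the "two worlds" problem:
# offline and online each implement feature generation, and they can drift.

def generate_features_offline(record):
    # Offline experiment code: normalizes watch time to hours.
    return [record["watch_minutes"] / 60.0]

def generate_features_online(record):
    # Online production code: someone left it in minutes.
    return [record["watch_minutes"]]

record = {"watch_minutes": 90}
assert generate_features_offline(record) != generate_features_online(record)
# The model was trained on the offline encoding, so online predictions
# are computed on differently scaled features -- a silent discrepancy.
```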
Sentinel
Example: Checking that a new ranking model is valid and performs better than the previous one (see the sketch below)
Pros:
  Using a model requires that both code and data are available
  Models may need to be versioned alongside code changes
  Ensures that a new model is no worse than the previous one
Cons:
  Sentinel needs to be in sync with application code
  Difficult to choose failure thresholds for data-based checks
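As a rough illustration, here is a minimal Python sketch of such a gate; the function name, the accuracy metric, and the probability sanity check are all illustrative assumptions, not the actual Netflix system.

```python
# Minimal sketch of a Sentinel gate before model publish.
# All names, metrics, and thresholds here are illustrative assumptions.
from typing import Callable, List, Tuple

def sentinel_check(new_model: Callable[[float], float],
                   old_model: Callable[[float], float],
                   validation: List[Tuple[float, int]],
                   min_relative_gain: float = 0.0) -> bool:
    """Pass only if the new model is sane and no worse than the old one."""
    def accuracy(model):
        return sum(1 for x, y in validation
                   if round(model(x)) == y) / len(validation)
    # Data-based sanity check: outputs must be valid probabilities.
    # Choosing such failure thresholds is the hard part in practice.
    sane = all(0.0 <= new_model(x) <= 1.0 for x, _ in validation)
    return sane and accuracy(new_model) >= accuracy(old_model) * (1.0 + min_relative_gain)

# Toy usage with stand-in "models":
validation = [(0.2, 0), (0.8, 1), (0.9, 1)]
old = lambda x: 0.5                     # always unsure
new = lambda x: min(max(x, 0.0), 1.0)   # passes scores through
print(sentinel_check(new, old, validation))  # True: safe to publish
```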
Offline Precompute: Example Structure
[Diagram: Offline, Historical Data flows through Generate Features into Model Evaluation, where a Predictor Decodes the Output and a Data Publisher saves key -> output pairs to a Cache. Online, the Application serves requests by a lookup into that Cache.]
Offline Precompute (aka The Hulk)
Example: Computing unpersonalized video-to-video similarities (sketched below)
Pros:
  Easy to set up based on experiment code
  Decouples implementation from the online platform
  Can use more computationally expensive models
Cons:
  Can’t depend on online facts or fresh data
  May have data gaps (e.g. handling new videos, users, etc.)
  May require cleanup to make consistent with online data
  Model output is based on offline data, so it may not be properly calibrated
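Here is a minimal Python sketch of the pattern; the cache layer, key scheme, and similarity function are illustrative stand-ins, not the production system.

```python
# Minimal sketch of Offline Precompute ("The Hulk"); the cache layer
# and key scheme are illustrative assumptions.

# --- Offline batch job: score everything, publish key -> output ---
def precompute(catalog, similarity_model, cache):
    for video_id in catalog:
        # Arbitrarily expensive offline computation is fine here.
        ranked = similarity_model(video_id, catalog)
        cache[f"sims:{video_id}"] = ranked  # the Data Publisher step

# --- Online request path: a plain cache lookup, no model in sight ---
def similar_videos(video_id, cache, default=()):
    # Data gap risk: brand-new videos won't have an entry yet.
    return cache.get(f"sims:{video_id}", default)

# Toy usage with a dict standing in for the cache service:
cache = {}
catalog = ["A", "B", "C"]
precompute(catalog, lambda v, c: [x for x in c if x != v], cache)
print(similar_videos("A", cache))    # ['B', 'C']
print(similar_videos("NEW", cache))  # () -- the new-item gap
```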
The Lumberjack (aka Feature Logging)
Train a model on features logged online from within an application
Image via YouTube
Feature Logging (aka The Lumberjack)
Example: Features of pages, rows, and videos in page generation (sketched below)
Pros:
  Train on features exactly as seen online
  Easy to deploy the trained model
  Can include the impact of upstream application logic
Cons:
  Requires production-grade feature code and deployment
  Takes time to log enough data
  All dependent data also needs to be in production
  Adds risk to production servers for experimental features
  Feature data can be large; may require sampling
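A minimal Python sketch of the idea, assuming a simple append-only list as the logging sink (in practice this would be a durable log or table); all names are illustrative:

```python
# Minimal sketch of Feature Logging ("The Lumberjack").
# The logging sink, sampling, and training step are illustrative assumptions.
import json
import random

FEATURE_LOG = []  # stand-in for a durable log (e.g. a stream or table)

def score_and_log(context, model, sample_rate=0.1):
    """Online path: compute features, score, and log the exact features used."""
    features = {"row_position": context["row"], "hour": context["hour"]}
    if random.random() < sample_rate:  # feature data can be large; sample
        FEATURE_LOG.append(json.dumps({"features": features,
                                       "context_id": context["id"]}))
    return model(features)

# Offline, later: join logged features with labels and train on features
# exactly as they were seen online -- no re-implementation to drift.
def training_examples(labels_by_context):
    for line in FEATURE_LOG:
        rec = json.loads(line)
        label = labels_by_context.get(rec["context_id"])
        if label is not None:
            yield rec["features"], label

# Online usage with a stand-in model:
model = lambda f: 0.01 * f["row_position"] + 0.02 * f["hour"]
score_and_log({"id": "req1", "row": 3, "hour": 20}, model, sample_rate=1.0)
print(list(training_examples({"req1": 1})))
```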
Online Archive: Structure
[Diagram: a data service over Live + Historical Data exposes two interfaces: a live interface used Online by the Application, and a batch interface used Offline to Generate Features, Collect Labels, and Train a Model.]
Online Archive
Example: Filtering online viewing history (sketched below)
Pros:
  Provides access to the online view of data at any time
  Can experiment with new features
Cons:
  All dependent data needs to keep track of its full history
  Only works for small data
  Requires the batch interface to also be available within the application
  Other processes may edit the history (e.g. slow-arriving events)
  Service needs to handle two very different request loads so that batch queries don’t bring down the live system
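A minimal Python sketch of one possible archive, using an in-memory event list as a stand-in for the data service; the class and method names are assumptions:

```python
# Minimal sketch of an Online Archive: one service keeps full history
# and serves both a live interface and a batch interface.
# The event-sourcing layout here is an illustrative assumption.
from bisect import bisect_right

class ViewingHistoryArchive:
    def __init__(self):
        self._events = {}  # user_id -> sorted list of (timestamp, video_id)

    def record_view(self, user_id, timestamp, video_id):
        self._events.setdefault(user_id, []).append((timestamp, video_id))
        self._events[user_id].sort()  # tolerate slow-arriving events

    # Live interface: current state, called on the request path.
    def current_history(self, user_id):
        return [v for _, v in self._events.get(user_id, [])]

    # Batch interface: state "as of" any past time, for offline training.
    def history_as_of(self, user_id, timestamp):
        events = self._events.get(user_id, [])
        cut = bisect_right([t for t, _ in events], timestamp)
        return [v for _, v in events[:cut]]

archive = ViewingHistoryArchive()
archive.record_view("u1", 100, "A")
archive.record_view("u1", 200, "B")
print(archive.current_history("u1"))     # ['A', 'B']
print(archive.history_as_of("u1", 150))  # ['A'] -- online view at t=150
```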
The Time Machine
Snapshot facts and share feature generation code
DeLorean image by JMortonPhoto.com & OtoGodfrey.com
Time Machine: Example Structure
[Diagram: Online, the Application’s Predictor runs Generate Features and Decode Output over Live Data drawn from a Data Service and Bulk Data, while a Snapshotter writes those facts to a Fact Log. Offline, Model Evaluation replays the same Generate Features code, driven by a shared Feature Config, over the Fact Log and Labels to produce Features; the snapshots also feed Other Models.]
Time Machine
Example: Training ranking models in Spark* (sketched below)
Pros:
  Easy to experiment with new features offline
  Allows testing the impact of modifying non-ML components
  Can construct full application output after trying a new model
  Can share snapshots across applications to help build new ones
Cons:
  Fact data volume can be high; may require sampling
  Snapshotting requires deciding which contexts to collect data for
* See http://bit.ly/sparktimetravel for more info
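A minimal Python sketch of the idea; the fact-log layout, config mechanism, and function names are illustrative assumptions, not the Spark implementation linked above:

```python
# Minimal sketch of the Time Machine pattern: snapshot raw facts online,
# then replay shared feature-generation code over them offline.
# Storage layout and function names are illustrative assumptions.
import json
import time

FACT_LOG = []  # stand-in for a fact store keyed by (context, time)

def snapshot(context_id, facts):
    """Online Snapshotter: persist the raw facts, not derived features."""
    FACT_LOG.append({"context_id": context_id, "ts": time.time(),
                     "facts": json.dumps(facts)})

def generate_features(facts, config):
    """Shared feature code, used both online and for offline replay.
    `config` lets experiments switch features without forking the code."""
    features = {}
    if "recent_views" in config:
        features["num_recent_views"] = len(facts.get("recent_views", []))
    if "country" in config:
        features["is_us"] = 1.0 if facts.get("country") == "US" else 0.0
    return features

# Offline: travel back in time and build a training set, possibly with
# a *new* feature config the online system never computed.
def replay(config):
    for entry in FACT_LOG:
        yield entry["context_id"], generate_features(
            json.loads(entry["facts"]), config)

snapshot("ctx1", {"recent_views": ["A", "B"], "country": "US"})
print(list(replay({"recent_views", "country"})))
# [('ctx1', {'num_recent_views': 2, 'is_us': 1.0})]
```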
Conclusion
Some design patterns for avoiding online-offline discrepancies:
  The Sentinel
  The Hulk
  The Lumberjack
  The Online Archive
  The Time Machine
What useful patterns do you see for ML systems?
Share them!