Is that a Time Machine?
Some Design Patterns for Real-World Machine Learning Systems
Justin Basilico
Page Algorithms Engineering
ICML ML Systems Workshop
June 24, 2016
@JustinBasilico
DeLorean image by JMortonPhoto.com & OtoGodfrey.com
Introduction
Focus: 2006 → 2016
Netflix Scale
• > 81M members
• > 190 countries
• > 1000 device types
• > 3B hours/month
• > 36% of peak US downstream traffic
Goal
Help members find content to watch and enjoy, to maximize member satisfaction and retention
Machine Learning is Everywhere
[Image: the Netflix homepage, where ML drives both which Rows to show and the Ranking of videos within them]
Over 80% of what people watch comes from our recommendations
Models & Algorithms
• Regression (linear, logistic, elastic net)
• SVD and other Matrix Factorizations
• Factorization Machines
• Restricted Boltzmann Machines
• Deep Neural Networks
• Markov Models and Graph Algorithms
• Clustering
• Latent Dirichlet Allocation
• Gradient Boosted Decision Trees / Random Forests
• Gaussian Processes
• …
Systems
• AWS Cloud
• Online:
  • Microservices
  • Java
  • EVCache, Cassandra
• Offline:
  • Hive on S3
  • Spark, Docker, Meson
[Diagram: Netflix recommendation system architecture. OFFLINE: model training over offline data. NEARLINE: event distribution (Netflix.Hermes, Netflix.Manhattan) feeds a user event queue into nearline computation of machine learning algorithms, which write models and results to an online data service. ONLINE: the algorithm service performs online computation and returns query results as recommendations to the UI client while the member plays, rates, and browses.]
More details on the Netflix Techblog
Design Patterns
Why Design Patterns for ML Systems?
[Diagram: Idea → Experiment → Live, with problems arising at each transition]
Design patterns provide…
• Common solutions to common problems
  • No need to re-invent them
  • A menu of approaches
• Reusable abstractions
  • Transcend specific implementations
• Common terminology
  • Eases communication of how something works
Some machine learning patterns…
• The Hulk
• The Lumberjack
• The Online Archive
• The Time Machine
• The Sentinel
• The Precog
• The Dagobah
• The Anytime Algorithm
• The Parameter Oracle
• The LEGO
• The Terminator
• The Inception
• The Feature Encoder
• The Hoarder
• The Transformer
• The Parameter Server
• The Log Space
• The Matrix Transposed
• The Overflow
• The Substitute
Thanks to: Aish Fenton, Yves Raimond, Dave Ray, Hossein Taghavi, Anuj Shah, DB Tsai, …
Machine Learning in an Application
[Diagram: where does machine learning fit inside an application? A Predictor component (Feature Encoding → Machine Learned Model → Output Decoding) embedded in the Application]
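To make the diagram concrete, here is a minimal sketch of such a Predictor in Python, assuming hypothetical encoder, model, and decoder objects with encode/score/decode methods (these names are illustrative, not from the deck):

```python
class Predictor:
    """Wraps feature encoding, the learned model, and output decoding
    so the application talks to a single component."""

    def __init__(self, encoder, model, decoder):
        self.encoder = encoder  # raw application context -> feature vector
        self.model = model      # feature vector -> raw score(s)
        self.decoder = decoder  # raw score(s) -> application-level output

    def predict(self, context):
        features = self.encoder.encode(context)
        scores = self.model.score(features)
        return self.decoder.decode(scores)
```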
Antipattern: The Phantom Menace
(AKA Training/Serving Skew)
Different code, data, or platform between training and applying the model
© Lucasfilm Ltd.
“Typical” ML Pipeline: A tale of two worlds
[Diagram: Offline (experimentation): Historical Data and Collected Labels → Generate Features → Train Models → Validate & Select Models → Evaluate Model → Publish Model. Online: the Application (Application Logic over Live Data) loads the published model.]
The Sentinel
Validate the model/data in the online environment before letting it go live
“You shall not pass!”
© New Line Cinema
Sentinel: Structure
[Diagram: offline, a Model Publisher publishes the model; online, a Sentinel Service (Model Loader → Model Validator) checks it before the Application’s own Model Loader picks it up, alerting and requesting a republish on failure.]
Some potential checks:
• File format is valid
• Dependent data is available
• Accuracy on shadow live data
• Feature distributions match
• Output is properly calibrated
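As a minimal sketch of the “accuracy on shadow live data” check, assuming models expose a predict(features) method, shadow_data is a list of (features, label) pairs, and the 1% tolerance is purely illustrative:

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in examples
                  if model.predict(features) == label)
    return correct / len(examples)

def validate_candidate(candidate, current, shadow_data, tolerance=0.01):
    """Let the candidate through only if it is no worse than the current model."""
    cand_acc = accuracy(candidate, shadow_data)
    curr_acc = accuracy(current, shadow_data)
    if cand_acc < curr_acc - tolerance:
        # A real sentinel would alert and request a republish here.
        print(f"REJECT: candidate {cand_acc:.3f} vs current {curr_acc:.3f}")
        return False
    return True
```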
Sentinel
• Example: Checking that a new ranking model is valid and performs better than the previous one
• Pros:
  • Using a model requires that both code and data be available
  • Models may need to be versioned alongside code changes
  • Ensures that a new model is no worse than the previous one
• Cons:
  • The sentinel needs to stay in sync with application code
  • Difficult to choose failure thresholds for data-based checks
The Hulk
(AKA Offline Precompute)
Train and evaluate your full model offline, then publish the final outputs
Scale for production by batching and brute force
© Disney
Offline Precompute: Example Structure
[Diagram: offline, Model Evaluation runs the full pipeline over Historical Data (Generate Features → Predictor → Decode Output); a Data Publisher saves the resulting key → output pairs to an online Cache, which the Application consults with a simple lookup.]
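A minimal sketch of the pattern, assuming a plain dict stands in for the online cache (EVCache in the deck) and a hypothetical sim(v, u) function stands in for the expensive model:

```python
def precompute(videos, sim, top_k=10):
    """Batch job: score every video pair offline, keep the top-k neighbors."""
    table = {}
    for v in videos:
        neighbors = sorted((u for u in videos if u != v),
                           key=lambda u: sim(v, u), reverse=True)
        table[v] = neighbors[:top_k]  # key -> final output, ready to serve
    return table

# Online, the application does a plain lookup; no model runs at request time.
cache = precompute(videos=["A", "B", "C"],
                   sim=lambda v, u: hash((v, u)) % 100)  # toy similarity
print(cache["A"])
```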
Offline Precompute (aka The Hulk)
• Example: Computing unpersonalized video-to-video similarities
• Pros:
  • Easy to set up based on experiment code
  • Decouples implementation from the online platform
  • Can use more computationally expensive models
• Cons:
  • Can’t depend on online facts or fresh data
  • May have data gaps (e.g. handling new videos, users, etc.)
  • May require cleanup to make consistent with online data
  • Model output is based on offline data; may not be properly calibrated
The Lumberjack
(AKA Feature Logging)
Train the model on features logged online from within an application
Image via YouTube
Feature Logging: Structure
[Diagram: online, the Application generates features from Live Data via a Feature Config, the Predictor decodes the output, and each request’s features are logged with an id to a Feature Log. Offline, the logged features are joined with Labels to Train Models and run Model Evaluation.]
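A minimal sketch, assuming an in-memory list stands in for the durable feature log and labels arrive later keyed by the same request id (all names here are illustrative):

```python
import uuid

feature_log = []  # stand-in for a durable log (e.g. events shipped to Hive/S3)

def serve(context, encode, model):
    """Online path: compute features once, log them, serve the prediction."""
    request_id = str(uuid.uuid4())
    features = encode(context)
    feature_log.append({"id": request_id, "features": features})
    return request_id, model(features)

def training_examples(labels_by_id):
    """Offline path: join the logged features with labels on the request id."""
    return [(row["features"], labels_by_id[row["id"]])
            for row in feature_log if row["id"] in labels_by_id]
```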
Feature Logging (aka The Lumberjack)
• Example: Features of pages, rows, and videos in page generation
• Pros:
  • Train on features exactly as seen online
  • Easy to deploy the trained model
  • Can include the impact of upstream application logic
• Cons:
  • Requires production-grade feature code and deployment
  • Takes time to log enough data
  • All dependent data also needs to be in production
  • Adds risk to production servers for experimental features
  • Feature data can be large; may require sampling
The Online Archive
Have online services save history and expose it to offline systems via a batch interface
© Lucasfilm Ltd.
Online Archive: Structure
[Diagram: a single service holds Live + Historical Data. The Application queries it through a live interface; offline, a batch interface feeds Generate Features and Collect Labels into Train Model.]
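A minimal sketch, assuming one in-memory service keeps the full history and serves both interfaces (a real deployment would isolate the two request loads from each other):

```python
class ViewingHistoryService:
    """One service, two interfaces: live queries and offline batch scans."""

    def __init__(self):
        self._events = []  # full history, never truncated

    def record(self, member_id, video_id, timestamp):
        self._events.append((member_id, video_id, timestamp))

    def recent(self, member_id, n=10):
        """Live interface: small, fast queries for the application."""
        mine = [e for e in self._events if e[0] == member_id]
        return sorted(mine, key=lambda e: e[2])[-n:]

    def scan(self):
        """Batch interface: stream everything for offline feature generation."""
        yield from self._events
```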
Online Archive
• Example: Filtering online viewing history
• Pros:
  • Provides access to the online view of data at any time
  • Can experiment with new features
• Cons:
  • All dependent data needs to keep track of all history
  • Only works for small data
  • Requires the batch interface to also be available within the application
  • There may be other processes that edit history (e.g. slow-arriving events)
  • The service needs to handle two very different request loads so batch queries don’t bring down the live system
The Time Machine
Snapshot facts and share feature generation code
DeLorean image by JMortonPhoto.com & OtoGodfrey.com
Time Machine: Example Structure
[Diagram: online, a Snapshotter records facts from the Data Service (Live Data and Bulk Data) into a Fact Log, while the Application’s Predictor generates features through a shared Feature Config and decodes its output. Offline, the same Generate Features code replays the snapshotted facts, joined with Labels, for Model Evaluation and for Other Models.]
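A minimal sketch of snapshotting and replay, assuming facts are logged as JSON lines and a single generate_features definition is the code shared between online and offline (helper names and the feature itself are illustrative):

```python
import json
import time

def snapshot(fact_log_path, context_id, facts):
    """Online: append the raw facts used for this request to the fact log."""
    with open(fact_log_path, "a") as f:
        f.write(json.dumps({"context": context_id,
                            "time": time.time(),
                            "facts": facts}) + "\n")

def generate_features(facts):
    """Shared feature code: the one definition both worlds call."""
    return {"num_plays": len(facts.get("plays", []))}

def replay(fact_log_path):
    """Offline: regenerate features exactly as they would have looked online."""
    with open(fact_log_path) as f:
        for line in f:
            row = json.loads(line)
            yield row["context"], generate_features(row["facts"])
```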
Time Machine
• Example: Training ranking models in Spark*
• Pros:
  • Easy to experiment with new features offline
  • Allows testing the impact of modifying non-ML components
  • Can construct full application output after trying a new model
  • Can share snapshots across applications to help build new ones
• Cons:
  • Fact data volume can be high; may require sampling
  • Snapshotting requires deciding which contexts to collect data for
* See http://bit.ly/sparktimetravel for more info
Conclusions
Conclusion
• Some design patterns for avoiding online-offline discrepancies:
  • The Sentinel
  • The Hulk
  • The Lumberjack
  • The Online Archive
  • The Time Machine
• What useful patterns do you see for ML systems? Share them!
Thank You
Justin Basilico
jbasilico@netflix.com
@JustinBasilico
