1
Is that a Time Machine?
Some Design Patterns for Real-World Machine Learning Systems
Justin Basilico
Page Algorithms Engineering
ICML ML Systems Workshop
June 24, 2016
@JustinBasilico
DeLorean image by JMortonPhoto.com & OtoGodfrey.com
2
Introduction
3
Focus
2006 2016
4
Netflix Scale
 > 81M members
 > 190 countries
 > 1000 device types
 > 3B hours/month
 > 36% of peak US downstream traffic
5
Goal
Help members find content to watch and enjoy
to maximize member satisfaction and retention
6
Machine Learning is Everywhere
Rows
Ranking
Over 80% of what people watch comes from our recommendations
7
Models & Algorithms
 Regression (linear, logistic, elastic net)
 SVD and other Matrix Factorizations
 Factorization Machines
 Restricted Boltzmann Machines
 Deep Neural Networks
 Markov Models and Graph Algorithms
 Clustering
 Latent Dirichlet Allocation
 Gradient Boosted Decision Trees/Random Forests
 Gaussian Processes
 Gaussian Processes
 …
8
Systems
 AWS Cloud
 Online:
 Microservices
 Java
 EVCache, Cassandra
 Offline:
 Hive on S3
 Spark, Docker, Meson
[Diagram: system architecture spanning OFFLINE, NEARLINE, and ONLINE layers. Labels: Member, UI Client, Algorithm Service, Online Computation, Online Data Service, Models, Nearline Computation, Event Distribution, User Event Queue, Offline Data, Model training, Offline Computation, Machine Learning Algorithm (in each layer), Query results, Recommendations, "Play, Rate, Browse..." events, Netflix.Hermes, Netflix.Manhattan]
More details on Netflix Techblog
9
Design Patterns
10
Why Design Patterns for ML Systems?
[Diagram: Idea → Experiment → Live, with a “Problem” marked at two points along the path]
11
Design patterns provide…
 Common solutions to common problems
 No need to re-invent them
 A menu of approaches
 Reusable abstractions
 Transcend specific implementations
 Common terminology
 Eases communications of how something works
12
Some machine learning patterns…
 The Hulk
 The Lumberjack
 The Online Archive
 The Time Machine
 The Sentinel
 The Precog
 The Dagobah
 The Anytime Algorithm
 The Parameter Oracle
 The LEGO
 The Terminator
 The Inception
 The Feature Encoder
 The Hoarder
 The Transformer
 The Parameter Server
 The Log Space
 The Matrix Transposed
 The Overflow
 The Substitute
Thanks to: Aish Fenton, Yves Raimond, Dave Ray, Hossein Taghavi, Anuj Shah, DB Tsai, …
13
Machine Learning in an Application
[Diagram: an Application containing a Predictor. Labels: Machine Learning, Feature Encoding, Machine Learned Model, Output Decoding]
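A minimal sketch of the pieces named on this slide, assuming the Predictor composes feature encoding, a learned model, and output decoding in that order. The feature names, weights, and the logistic decoding are hypothetical and stand in for whatever a real application would use.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical feature names, chosen only for illustration.
FEATURE_ORDER = ["hours_watched", "is_new_member"]


def encode_features(raw: Dict[str, float]) -> List[float]:
    """Feature encoding: map application data to the vector the model expects."""
    return [float(raw.get(name, 0.0)) for name in FEATURE_ORDER]


def linear_model(weights: List[float], bias: float) -> Callable[[List[float]], float]:
    """Stand-in for a machine-learned model: returns a raw score."""
    def score(x: List[float]) -> float:
        return sum(w * v for w, v in zip(weights, x)) + bias
    return score


def decode_output(score: float) -> float:
    """Output decoding: turn the raw score into a probability."""
    return 1.0 / (1.0 + math.exp(-score))


@dataclass
class Predictor:
    """Bundles encoding, model, and decoding so the application calls one object."""
    encode: Callable[[Dict[str, float]], List[float]]
    model: Callable[[List[float]], float]
    decode: Callable[[float], float]

    def predict(self, raw: Dict[str, float]) -> float:
        return self.decode(self.model(self.encode(raw)))


predictor = Predictor(encode_features, linear_model([0.5, -1.0], 0.0), decode_output)
probability = predictor.predict({"hours_watched": 2.0, "is_new_member": 1.0})
```

Keeping the three steps behind one object means the application never touches the model's raw inputs or outputs directly, which is what lets the same Predictor be reused offline.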
14
Antipattern:
The Phantom Menace
(AKA Training/Serving Skew)
Different code/data/platform between training and applying model
© Lucasfilm Ltd.
15
“Typical” ML Pipeline: A tale of two worlds
[Diagram: an Offline Training Pipeline and an Online Application. Training Pipeline labels: Historical Data, Generate Features, Train Models, Validate & Select Models, Publish Model. Application labels: Live Data, Load Model, Application Logic. Other labels: Collect Labels, Evaluate Model, Experimentation]
16
The Sentinel
Validate model/data in online environment before letting it go live
“You shall not pass!”
© New Line Cinema
17
Sentinel: Structure
[Diagram: Offline and Online sides. Labels: Model, Model Publisher, Sentinel Service (Model Loader, Model Validator), Application (Model Loader), Alert, Republish]
Some potential checks:
• File format is valid
• Dependent data is available
• Accuracy on shadow live data
• Feature distributions match
• Output is properly calibrated
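A rough sketch of how a few of these checks might look in a validator. Everything here is an assumption for illustration: the `Candidate` type, the thresholds, and the use of per-feature means as a stand-in for a full distribution comparison are not from the talk.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, binary label)


@dataclass
class Candidate:
    """A newly published model plus summary statistics from its training data."""
    predict: Callable[[Sequence[float]], float]
    training_feature_means: List[float]


def accuracy(predict: Callable[[Sequence[float]], float], shadow: List[Example]) -> float:
    """Fraction of shadow examples whose thresholded prediction matches the label."""
    hits = sum(1 for x, y in shadow if (predict(x) >= 0.5) == bool(y))
    return hits / len(shadow)


def validate(candidate: Candidate,
             current_predict: Callable[[Sequence[float]], float],
             shadow: List[Example],
             max_mean_shift: float = 0.25,
             max_calibration_gap: float = 0.2) -> List[str]:
    """Return the list of failed checks; an empty list means the model may go live."""
    failures = []
    # Feature distributions match (approximated by comparing means only).
    for i, train_mean in enumerate(candidate.training_feature_means):
        live_mean = mean(x[i] for x, _ in shadow)
        if abs(live_mean - train_mean) > max_mean_shift:
            failures.append(f"feature {i} mean shifted")
    # Accuracy on shadow live data is no worse than the current model.
    if accuracy(candidate.predict, shadow) < accuracy(current_predict, shadow):
        failures.append("accuracy below current model")
    # Output is calibrated: mean prediction is close to the observed positive rate.
    mean_prediction = mean(candidate.predict(x) for x, _ in shadow)
    positive_rate = mean(y for _, y in shadow)
    if abs(mean_prediction - positive_rate) > max_calibration_gap:
        failures.append("output poorly calibrated")
    return failures


shadow_data = [([0.9], 1), ([0.8], 1), ([0.2], 0), ([0.1], 0)]
current = lambda x: 0.6
good = Candidate(predict=lambda x: x[0], training_feature_means=[0.5])
bad = Candidate(predict=lambda x: 1.0 - x[0], training_feature_means=[0.9])
good_failures = validate(good, current, shadow_data)
bad_failures = validate(bad, current, shadow_data)
```

A non-empty failure list would trigger the Alert path; an empty one would let the model be republished to the application.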
18
Sentinel
 Example: Checking that a new ranking model is valid and performs better than the previous one
 Pros:
 Using a model requires that both code and data are available
 Models may need to be versioned alongside code changes
 Ensures that a new model is no worse than the previous one
 Cons:
 Sentinel needs to be in sync with application code
 Difficult to choose failure thresholds for data-based checks
19
The Hulk
(AKA Offline Precompute)
Train and evaluate your full model offline then publish final outputs
Scale for production by batching and brute force
© Disney
20
Offline Precompute: Example Structure
[Diagram: Offline and Online sides. Offline labels: Historical Data, Model Evaluation, Predictor (Generate Features, Decode Output), Data Publisher, save. Online labels: Application, Cache, lookup (key -> output)]
21
Offline Precompute (aka The Hulk)
 Example: Computing unpersonalized video-to-video similarities
 Pros:
 Easy to set up based on experiment code
 Decouples implementation from online platform
 Can use more computationally expensive models
 Cons:
 Can’t depend on online facts or fresh data
 May have data gaps (e.g. handling new videos, users, etc.)
 May require cleanup to make consistent with online data
 Model output based on offline data; may not be properly calibrated
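A small sketch of the precompute idea for the video-to-video example, assuming cosine similarity over made-up play-count vectors. The data, the similarity choice, and the in-memory dict standing in for the cache are all hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical historical data: video id -> play counts per (made-up) member segment.
HISTORICAL_PLAYS: Dict[str, List[float]] = {
    "video_a": [10.0, 0.0, 5.0],
    "video_b": [8.0, 1.0, 4.0],
    "video_c": [0.0, 9.0, 0.0],
}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precompute_similars(plays: Dict[str, List[float]], top_n: int = 2) -> Dict[str, List[str]]:
    """Offline batch job: brute-force all pairs and keep only the final outputs."""
    table = {}
    for video, vector in plays.items():
        scored = [(cosine(vector, other_vector), other)
                  for other, other_vector in plays.items() if other != video]
        scored.sort(reverse=True)
        table[video] = [other for _, other in scored[:top_n]]
    return table


# "Publish": a real system would write this table to a cache; a dict stands in here.
CACHE = precompute_similars(HISTORICAL_PLAYS)


def lookup_similars(video_id: str) -> List[str]:
    """Online path: a key -> output lookup; no model runs at request time."""
    return CACHE.get(video_id, [])  # a video added after the batch run has no entry


similar_to_a = lookup_similars("video_a")
similar_to_new = lookup_similars("video_new")
```

The empty result for `video_new` is the data-gap con from the list above: anything that did not exist when the batch ran has no precomputed output.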
22
The Lumberjack
(AKA Feature Logging)
Train model on features logged online from within an application
Image via YouTube
23
Feature Logging: Structure
[Diagram: Offline and Online sides. Online labels: Application, Live Data, Feature Config, Predictor (Generate Features, Decode Output), Model Evaluation, log id. Offline labels: Feature Log, Labels, Train Models]
24
Feature Logging (aka The Lumberjack)
 Example: Features of pages, rows, and videos in page generation
 Pros:
 Train on features exactly as seen online
 Easy to deploy trained model
 Can include impact of up-stream application logic
 Cons:
 Requires production-grade feature code and deployment
 Takes time to log enough data
 Need all dependent data also in production
 Adds risk to production servers for experimental features
 Feature data can be large; may require sampling
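A minimal sketch of feature logging, assuming features are written as JSON lines keyed by a log id and later joined to labels offline. The feature names, the in-memory list standing in for the feature log, and the scoring function are hypothetical.

```python
import json
import uuid
from typing import Callable, Dict, List, Tuple

FEATURE_LOG: List[str] = []  # stand-in for a durable log written by the application


def generate_features(context: Dict[str, float]) -> Dict[str, float]:
    """Production feature code (hypothetical features)."""
    return {
        "row_position": context["row_position"],
        "video_age_days": context["video_age_days"],
    }


def serve(context: Dict[str, float],
          score_fn: Callable[[Dict[str, float]], float]) -> Tuple[str, float]:
    """Online: compute features, log them under a fresh id, then score."""
    features = generate_features(context)
    log_id = uuid.uuid4().hex
    FEATURE_LOG.append(json.dumps({"log_id": log_id, "features": features}))
    return log_id, score_fn(features)


def build_training_set(labels: Dict[str, int]) -> List[Tuple[Dict[str, float], int]]:
    """Offline: join logged features to labels by log id."""
    rows = []
    for line in FEATURE_LOG:
        record = json.loads(line)
        if record["log_id"] in labels:
            rows.append((record["features"], labels[record["log_id"]]))
    return rows


score = lambda f: 1.0 / (1.0 + f["row_position"])
first_id, _ = serve({"row_position": 0.0, "video_age_days": 3.0}, score)
second_id, _ = serve({"row_position": 4.0, "video_age_days": 30.0}, score)
# Only the first impression received a label (for example, a play).
training_rows = build_training_set({first_id: 1})
```

Because training reads exactly what the application computed, there is no second feature implementation to drift; the cost is that a new feature produces no training data until it has been deployed and logged.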
25
The Online Archive
Have online services save history and expose to offline systems via batch interface
© Lucasfilm Ltd.
26
Online Archive: Structure
[Diagram: Offline and Online sides. Labels: Live + Historical Data, live interface, batch interface, Application, Generate Features, Collect Labels, Train Model]
27
Online Archive
 Example: Filtering online viewing history
 Pros:
 Provides access to online view of data at any time
 Can experiment with new features
 Cons:
 All dependent data needs to keep track of all history
 Only works for small data
 Requires batch interface also available within application
 May be other processes that edit history (e.g. slow arriving events)
 Service needs to handle two very different request loads so batch queries don’t bring down the live system
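A sketch of a service with both interfaces, assuming viewing history is stored as timestamped events. The class and method names are hypothetical, and filtering by event timestamp ignores when an event actually arrived, which is the slow-arriving-events caveat listed above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ViewingHistoryService:
    """Hypothetical online service that keeps full history, not just current state."""

    def __init__(self) -> None:
        self._events: Dict[str, List[Tuple[int, str]]] = defaultdict(list)

    def record_play(self, member_id: str, timestamp: int, video_id: str) -> None:
        self._events[member_id].append((timestamp, video_id))
        self._events[member_id].sort()  # keep events in time order

    def current(self, member_id: str) -> List[str]:
        """Live interface: the history as the application sees it now."""
        return [video for _, video in self._events[member_id]]

    def as_of_batch(self, requests: List[Tuple[str, int]]) -> List[List[str]]:
        """Batch interface: the history as it stood at each requested time."""
        return [
            [video for ts, video in self._events[member] if ts <= as_of]
            for member, as_of in requests
        ]


service = ViewingHistoryService()
service.record_play("member_1", 100, "video_a")
service.record_play("member_1", 200, "video_b")
live_view = service.current("member_1")
past_views = service.as_of_batch([("member_1", 150), ("member_1", 250)])
```

Offline training code can call `as_of_batch` with the time of each labeled event to see the history the application would have seen then.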
28
The Time Machine
Snapshot facts and share feature generation code
DeLorean image by JMortonPhoto.com & OtoGodfrey.com
29
Time Machine: Example Structure
[Diagram labels: Application, Live Data, Feature Config, Predictor (Generate Features, Decode Output), Online, Snapshotter, Data Service, Bulk Data, Other Models, Fact Log, Generate Features, Features, Labels, Model Evaluation]
30
Time Machine
 Example: Training ranking models in Spark*
 Pros:
 Easy to experiment with new features offline
 Allows testing impact of modifying non-ML components
 Can construct full application output after trying new model
 Can share snapshots across applications to help build new ones
 Cons:
 Fact data volume can be high; may require sampling
 Snapshotting requires deciding contexts to collect data for
* See http://bit.ly/sparktimetravel for more info
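A minimal sketch of the two ideas in this pattern: snapshot raw facts rather than features, and run one shared feature function both online and offline. The fact fields, feature names, and the in-memory fact log are hypothetical; the talk's example uses Spark, which this sketch does not attempt to reproduce.

```python
import copy
from typing import Any, Callable, Dict, List

Facts = Dict[str, Any]
FACT_LOG: List[Facts] = []  # stand-in for the stored snapshots


def generate_features(facts: Facts) -> Dict[str, float]:
    """Shared feature code: the same function runs online and offline."""
    history = facts["viewing_history"]
    return {
        "num_plays": float(len(history)),
        "watched_candidate_before": float(facts["candidate_video"] in history),
    }


def snapshot(context_id: str, facts: Facts) -> None:
    """Snapshotter: store a copy of the raw facts for a selected context."""
    FACT_LOG.append({"context_id": context_id, **copy.deepcopy(facts)})


def online_features(context_id: str, facts: Facts) -> Dict[str, float]:
    snapshot(context_id, facts)
    return generate_features(facts)


def offline_features(
    feature_fn: Callable[[Facts], Dict[str, float]] = generate_features,
) -> Dict[str, Dict[str, float]]:
    """Offline: replay snapshots through any feature function, including new ones."""
    return {snap["context_id"]: feature_fn(snap) for snap in FACT_LOG}


live = online_features(
    "context_1",
    {"viewing_history": ["video_a", "video_b"], "candidate_video": "video_b"},
)
replayed = offline_features()
# A new experimental feature can be computed offline without any redeployment.
experimental = offline_features(
    lambda f: {"history_is_empty": float(len(f["viewing_history"]) == 0)}
)
```

Since the snapshots hold facts and not features, a feature invented after the snapshot was taken can still be computed for that moment in time.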
31
Conclusions
32
Conclusion
 Some design patterns for avoiding online-offline discrepancies
 The Sentinel
 The Hulk
 The Lumberjack
 The Online Archive
 The Time Machine
 What useful patterns do you see for ML systems?
 Share them!
33
Thank You
Justin Basilico
jbasilico@netflix.com
@JustinBasilico

The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 

Is that a Time Machine? Some Design Patterns for Real World Machine Learning Systems

  • 1. Is that a Time Machine? Some Design Patterns for Real-World Machine Learning Systems
      • Justin Basilico, Page Algorithms Engineering
      • ICML ML Systems Workshop, June 24, 2016
      • @JustinBasilico
      • DeLorean image by JMortonPhoto.com & OtoGodfrey.com
  • 4. Netflix Scale
      • > 81M members
      • > 190 countries
      • > 1000 device types
      • > 3B hours/month
      • > 36% of peak US downstream traffic
  • 5. Goal
      • Help members find content to watch and enjoy to maximize member satisfaction and retention
  • 6. Machine Learning is Everywhere
      • Rows
      • Ranking
      • Over 80% of what people watch comes from our recommendations
  • 7. Models & Algorithms
      • Regression (linear, logistic, elastic net)
      • SVD and other Matrix Factorizations
      • Factorization Machines
      • Restricted Boltzmann Machines
      • Deep Neural Networks
      • Markov Models and Graph Algorithms
      • Clustering
      • Latent Dirichlet Allocation
      • Gradient Boosted Decision Trees/Random Forests
      • Gaussian Processes
      • …
  • 8. Systems
      • AWS Cloud
      • Online:
          • Microservices
          • Java
          • EVCache, Cassandra
      • Offline:
          • Hive on S3
          • Spark, Docker, Meson
      • [Diagram labels: Netflix.Hermes, Netflix.Manhattan, OFFLINE, NEARLINE, ONLINE, Offline Computation, Nearline Computation, Online Computation, Machine Learning Algorithm (×3), Models, Model training, Offline Data, Online Data Service, Event Distribution, User Event Queue, Algorithm Service, UI Client, Member, Query results, Recommendations, Play, Rate, Browse...]
      • More details on Netflix Techblog
  • 10. Why Design Patterns for ML Systems?
      • [Diagram labels: Idea, Experiment, Live, Problem (×2)]
  • 11. Design patterns provide…
      • Common solutions to common problems
      • No need to re-invent them
      • A menu of approaches
      • Reusable abstractions
      • Transcend specific implementations
      • Common terminology
      • Eases communications of how something works
  • 12. Some machine learning patterns…
      • The Hulk
      • The Lumberjack
      • The Online Archive
      • The Time Machine
      • The Sentinel
      • The Precog
      • The Dagobah
      • The Anytime Algorithm
      • The Parameter Oracle
      • The LEGO
      • The Terminator
      • The Inception
      • The Feature Encoder
      • The Hoarder
      • The Transformer
      • The Parameter Server
      • The Log Space
      • The Matrix Transposed
      • The Overflow
      • The Substitute
      • Thanks to: Aish Fenton, Yves Raimond, Dave Ray, Hossein Taghavi, Anuj Shah, DB Tsai, …
  • 13. Machine Learning in an Application
      • [Diagram labels: Application, Machine Learning, ?, Predictor, Feature Encoding, Machine Learned Model, Output Decoding]
  • 14. Antipattern: The Phantom Menace (AKA Training/Serving Skew)
      • Different code/data/platform between training and applying model
      • Image © Lucasfilm Ltd.
  • 15. “Typical” ML Pipeline: A tale of two worlds
      • [Diagram labels: Training Pipeline, Application, Offline, Online, Historical Data, Generate Features, Collect Labels, Train Models, Validate & Select Models, Publish Model, Load Model, Application Logic, Live Data, Evaluate Model, Experimentation]
  • 16. The Sentinel
      • Validate model/data in online environment before letting it go live
      • “You shall not pass!”
      • Image © New Line Cinema
  • 17. Sentinel: Structure
      • [Diagram labels: Sentinel Service, Application, Offline, Online, Model, Model Publisher, Model Loader (×2), Model Validator, Alert, Republish]
      • Some potential checks:
          • File format is valid
          • Dependent data is available
          • Accuracy on shadow live data
          • Feature distributions match
          • Output is properly calibrated
  • 18. Sentinel
      • Example: Checking that new ranking model is valid and performs better than previous one
      • Pros:
          • Using a model requires both code and data are available
          • Models may need to be versioned along-side code changes
          • Ensure that a new model is no worse than previous one
      • Cons:
          • Sentinel needs to be in sync with application code
          • Difficult to choose failure thresholds for data-based checks
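The Sentinel slides list checks such as "accuracy on shadow live data" and "feature distributions match". The following is a minimal sketch of how two such checks might gate a model before it goes live. Every name here (`CheckResult`, `run_sentinel`, `should_go_live`, the `max_mean_drift` threshold) is hypothetical and chosen for illustration; the talk does not describe Netflix's actual validator.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def accuracy(predict, examples):
    """Fraction of (features, label) pairs that the model predicts correctly."""
    return sum(predict(features) == label for features, label in examples) / len(examples)


def run_sentinel(candidate, baseline, shadow_examples,
                 training_feature_mean, max_mean_drift=0.1):
    """Run illustrative checks on a candidate model using shadow live data.

    Assumptions: models are plain callables, and only the first feature's
    mean is compared against the value seen in training.
    """
    cand_acc = accuracy(candidate, shadow_examples)
    base_acc = accuracy(baseline, shadow_examples)
    live_mean = mean(features[0] for features, _ in shadow_examples)
    drift = abs(live_mean - training_feature_mean)
    return [
        # "Ensure that a new model is no worse than previous one"
        CheckResult("accuracy_vs_previous", cand_acc >= base_acc,
                    f"candidate={cand_acc:.3f} previous={base_acc:.3f}"),
        # "Feature distributions match" (reduced here to a single mean)
        CheckResult("feature_mean_drift", drift <= max_mean_drift,
                    f"drift={drift:.3f}"),
    ]


def should_go_live(results):
    """The model passes only if every check passed; otherwise alert or republish."""
    return all(result.passed for result in results)
```

The `max_mean_drift` default is arbitrary, which mirrors the con noted on the slide: failure thresholds for data-based checks are difficult to choose.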
  • 19. The Hulk (AKA Offline Precompute)
      • Train and evaluate your full model offline then publish final outputs
      • Scale for production by batching and brute force
      • Images © Disney
  • 20. Offline Precompute: Example Structure
      • [Diagram labels: Application, Cache, Historical Data, Offline, Online, Generate Features, Predictor, Model Evaluation, Decode Output, Data Publisher, save, lookup, key -> output]
  • 21. Offline Precompute (aka The Hulk)
      • Example: Computing unpersonalized video-to-video similarities
      • Pros:
          • Easy to set up based on experiment code
          • Decouples implementation from online platform
          • Can use more computationally expensive models
      • Cons:
          • Can’t depend on online facts or fresh data
          • May have data gaps (e.g. handling new videos, users, etc.)
          • May require cleanup to make consistent with online data
          • Model output based on offline data; may not be properly calibrated
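To make the "key -> output" lookup concrete, here is a small sketch of offline precompute for the video-to-video similarity example. The cosine scoring, the in-memory dict standing in for the cache, and the function names are all assumptions made for illustration; the slides do not say how the similarities are computed or stored.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def precompute_similars(video_vectors, top_n=2):
    """Offline step: brute-force all pairs and keep only the final outputs."""
    table = {}
    for video_id, vector in video_vectors.items():
        scored = [(cosine(vector, other), other_id)
                  for other_id, other in video_vectors.items()
                  if other_id != video_id]
        # Highest score first; ties broken by id so the output is deterministic.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        table[video_id] = [other_id for _, other_id in scored[:top_n]]
    return table


def lookup_similars(cache, video_id):
    """Online step: a plain key lookup, with an empty result for unknown videos."""
    return list(cache.get(video_id, ()))
```

The empty result for an unknown key illustrates the "data gaps" con: a video added after the offline run has no precomputed entry until the next batch.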
  • 22. The Lumberjack (AKA Feature Logging)
      • Train model on features logged online from within an application
      • Image via YouTube
  • 23. Feature Logging: Structure
      • [Diagram labels: Application, Offline, Online, Live Data, Feature Config, Generate Features, Predictor, Model Evaluation, Decode Output, Feature Log, log, id, Labels, Train Models]
  • 24. Feature Logging (aka The Lumberjack)
      • Example: Features of pages, rows, and videos in page generation
      • Pros:
          • Train on features exactly as seen online
          • Easy to deploy trained model
          • Can include impact of up-stream application logic
      • Cons:
          • Requires production-grade feature code and deployment
          • Takes time to log enough data
          • Need all dependent data also in production
          • Adds risk to production servers for experimental features
          • Feature data can be large; may require sampling
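A rough sketch of the feature-logging flow: the application writes the features it actually served under a log id, and training later joins that log with labels by the same id. The JSON-lines format and the names `log_features` and `build_training_set` are assumptions for illustration only.

```python
import json


def log_features(log_file, log_id, features):
    """Online: append the features exactly as the predictor saw them, one JSON object per line."""
    log_file.write(json.dumps({"log_id": log_id, "features": features}) + "\n")


def build_training_set(log_lines, labels_by_id):
    """Offline: join logged features with labels on log id.

    Records with no label yet are skipped rather than given a guessed label.
    """
    rows = []
    for line in log_lines:
        record = json.loads(line)
        label = labels_by_id.get(record["log_id"])
        if label is not None:
            rows.append((record["features"], label))
    return rows
```

For example, `log_file` could be any writable text stream (an `io.StringIO` in a test, a rotating file in a service). Because the features are read back rather than recomputed, the training rows match what was served online, at the cost of waiting for enough logged data.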
  • 25. The Online Archive
      • Have online services save history and expose to offline systems via batch interface
      • Image © Lucasfilm Ltd.
  • 26. Online Archive: Structure
      • [Diagram labels: Application, Offline, Online, Live + Historical Data, live interface, batch interface, Generate Features, Collect Labels, Train Model]
  • 27. Online Archive
      • Example: Filtering online viewing history
      • Pros:
          • Provides access to online view of data at any time
          • Can experiment with new features
      • Cons:
          • All dependent data needs to keep track of all history
          • Only works for small data
          • Requires batch interface also available within application
          • May be other processes that edit history (e.g. slow arriving events)
          • Service needs to handle two very different request loads so batch queries don’t bring down the live system
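One way to picture the Online Archive is a single service that keeps timestamped history and answers both a live query and an "as of" batch query. The sketch below is hypothetical: the class name, the integer timestamps, and the in-memory storage are illustrative assumptions, not a description of any Netflix service.

```python
from collections import defaultdict


class ViewingHistoryService:
    """Keeps every event so offline jobs can ask what the service knew at a past time."""

    def __init__(self):
        self._events = defaultdict(list)  # member_id -> [(timestamp, video_id)]

    def record(self, member_id, timestamp, video_id):
        # Events may arrive out of order, so ordering is applied at read time.
        self._events[member_id].append((timestamp, video_id))

    def live_history(self, member_id):
        """Live interface: the current view for one member."""
        return [video for _, video in sorted(self._events.get(member_id, []))]

    def batch_history_as_of(self, member_ids, as_of):
        """Batch interface: the view at time `as_of` for many members at once."""
        return {
            member: [video
                     for timestamp, video in sorted(self._events.get(member, []))
                     if timestamp <= as_of]
            for member in member_ids
        }
```

Note that a late event with an old timestamp changes the answer to an earlier `as_of` query, which is the "slow arriving events" caveat from the slide. A real deployment would also need to isolate batch load from live traffic, which this sketch does not attempt.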
  • 28. The Time Machine
      • Snapshot facts and share feature generation code
      • DeLorean image by JMortonPhoto.com & OtoGodfrey.com
  • 29. Time Machine: Example Structure
      • [Diagram labels: Application, Online, Live Data, Data Service, Bulk Data, Other Models, Snapshotter, Fact Log, Feature Config, Generate Features (×2), Features, Predictor, Model Evaluation, Decode Output, Labels]
  • 30. Time Machine
      • Example: Training ranking models in Spark*
      • Pros:
          • Easy to experiment with new features offline
          • Allows testing impact of modifying non-ML components
          • Can construct full application output after trying new model
          • Can share snapshots across applications to help build new ones
      • Cons:
          • Fact data volume can be high; may require sampling
          • Snapshotting requires deciding contexts to collect data for
      • * See http://bit.ly/sparktimetravel for more info
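The two halves of the Time Machine, snapshotting facts and sharing feature generation code, can be sketched as follows. This is plain Python rather than the Spark setup the slide refers to, and the fact fields (`plays`, `tenure_days`), the `Snapshotter` class, and the helper names are invented for illustration.

```python
import copy


def generate_features(facts):
    """Shared feature code: the same function serves online prediction and offline training."""
    return {
        "num_plays": len(facts["plays"]),
        "is_new_member": facts["tenure_days"] < 30,
    }


class Snapshotter:
    """Records raw facts (not features) for selected contexts at request time."""

    def __init__(self):
        self.fact_log = []

    def snapshot(self, context_id, facts):
        # Deep copy so later changes to live data cannot alter the snapshot.
        self.fact_log.append((context_id, copy.deepcopy(facts)))


def offline_training_rows(fact_log, labels_by_context):
    """Offline: replay the shared feature code over snapshotted facts and attach labels."""
    return [(generate_features(facts), labels_by_context[context_id])
            for context_id, facts in fact_log
            if context_id in labels_by_context]
```

Because facts are stored instead of features, a new feature can be added to `generate_features` and computed over old snapshots without waiting for fresh logs, which is the main contrast with feature logging.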
  • 32. Conclusion
      • Some design patterns for avoiding online-offline discrepancies
          • The Sentinel
          • The Hulk
          • The Lumberjack
          • The Online Archive
          • The Time Machine
      • What useful patterns do you see for ML systems?
      • Share them!
  • 33. Thank You
      • Justin Basilico
      • jbasilico@netflix.com
      • @JustinBasilico