Production and Beyond: Deploying and Managing Machine Learning Models
Rajat Arya, Senior Product Manager
Alice Zheng, Director of Data Science
What is Production?
• Deployment: easily serve live predictions.
• Evaluation: measuring the quality of deployed models.
• Monitoring: tracking model quality & operations.
• Management: choosing between deployed models.
Lifecycle of ML in Production
Evaluation
Monitoring
Deployment
Management
The Setup
Suppose we are building a website with product
recommendations, trained using Amazon reviews.
• 34.6M reviews
• 2.4M products
• 6.6M users
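As a concrete starting point, here is a minimal sketch of training such a recommender in GraphLab Create, the stack used throughout this deck; the file path and column names are assumptions about how the review data is laid out.

import graphlab as gl

# Load the review history (the file path and column names are hypothetical).
reviews = gl.SFrame.read_csv('amazon_reviews.csv')

# Train a factorization-based recommender on (user, product, rating) triples.
model = gl.recommender.factorization_recommender.create(
    reviews, user_id='user', item_id='product', target='rating')

# Top-10 product recommendations for one user.
print(model.recommend(users=['some-user-id'], k=10))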
Deployment System
[Diagram: historical data feeds batch training of the model; the model makes real-time predictions on live data; feedback from the predictions flows back into the historical data.]
Deployment System
[Diagram: the same pipeline viewed as a request/response loop: user input goes to the batch-trained model, which returns recommendations as real-time predictions.]
Batch Training: DIY
• Distributed: use the entire cluster efficiently
• Scalable: scale nodes up or down
• Co-located with data: no data transfer
• Easy to schedule, launch, and monitor: operational metrics, dashboards, alarming
Dato Distributed
[Diagram: the batch-training half of the pipeline (historical data, batch training, model) handled by Dato Distributed.]
Dato Distributed Architecture
[Diagram: GraphLab Create code submitted as jobs to a Dato Distributed cluster.]
Dato Distributed
With one line of code, launch a long-running cluster of machines for parallel / distributed execution of jobs from GraphLab Create.
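A rough sketch of what that looks like. The cluster name, S3 paths, and nightly_train function are hypothetical, and the exact deploy API signatures may differ between GraphLab Create versions.

import graphlab as gl

# Hypothetical training function to run remotely; it retrains the recommender
# on the latest historical data and saves the result.
def nightly_train():
    reviews = gl.SFrame('s3://my-bucket/amazon-reviews')
    model = gl.recommender.factorization_recommender.create(
        reviews, user_id='user', item_id='product', target='rating')
    model.save('s3://my-bucket/models/recommender')

# One call launches a long-running cluster in EC2 ...
cluster = gl.deploy.ec2_cluster.create(
    name='nightly-training',
    s3_path='s3://my-bucket/dato-distributed',
    ec2_config=gl.deploy.Ec2Config(instance_type='m3.xlarge'),
    num_hosts=3)

# ... and one call submits the training function as a distributed job.
job = gl.deploy.job.create(nightly_train, environment=cluster)
print(job.get_status())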
Deployment System
[Diagram: the same pipeline, now highlighting the real-time side: the model serving predictions on live data.]
Real-time Predictions: DIY
• Ease of integration: REST endpoints, language independence
• Low latency: fast predictions, with caching
• Fault tolerant: replicated models, alarming, metrics
• Scalable: scale up or down
• Maintainable: easy to update / deploy models
[Diagram: the model serving real-time predictions on live data.]
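To make those DIY requirements concrete, here is a minimal sketch of a hand-rolled REST prediction endpoint with a naive cache. Flask and the model path are assumptions, not part of the deck's stack, and a production service would also need replication, metrics, and alarming.

import graphlab as gl
from flask import Flask, jsonify

app = Flask(__name__)
model = gl.load_model('s3://my-bucket/models/recommender')  # hypothetical path
cache = {}  # naive in-process cache; a real deployment would use a shared cache

@app.route('/recommend/<user_id>')
def recommend(user_id):
    if user_id not in cache:
        recs = model.recommend(users=[user_id], k=10)
        cache[user_id] = list(recs['product'].astype(str))
    return jsonify(recommendations=cache[user_id])

if __name__ == '__main__':
    app.run(port=8080)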
Dato Predictive Services Architecture
[Diagram: clients (website, mobile, browser, etc.) call a REST API backed by a distributed cache and the deployed model, all managed by Dato Predictive Services.]
Dato Predictive Services
With one line of code, launch a fault-tolerant, scalable, robust, and maintainable cluster that puts a service-oriented architecture on top of machine learning models.
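A hedged sketch of the corresponding workflow with the Predictive Services API. The service name, S3 paths, and exact method and parameter names are assumptions from that era's API and may differ across versions.

import graphlab as gl

# Hypothetical names and paths; parameter order and names may differ by version.
ec2 = gl.deploy.Ec2Config(instance_type='m3.xlarge', region='us-west-2')
ps = gl.deploy.predictive_service.create(
    'recommender-service', ec2, 's3://my-bucket/predictive-service', num_hosts=3)

# Expose the trained recommender behind the service's REST API.
model = gl.load_model('s3://my-bucket/models/recommender')
ps.add('recommend', model)
ps.apply_changes()

# Any client can now hit the REST endpoint; from Python the same call is:
print(ps.query('recommend', method='recommend',
               data={'users': ['some-user-id'], 'k': 10}))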
Dato Predictive Services Benchmark
Amazon review dataset (34.6M reviews, 6.6M users, 2.4M products).
Measured end-to-end latency under a moderate generated load:
• Average: < 65 ms
• P99: < 100 ms
Setup: 3-node deployment on AWS m3.xlarge instances (4 cores, 15 GB RAM each).
Recommendation System
[Diagram: the complete pipeline: historical data, batch training, the model, and real-time predictions on live data.]
Lifecycle of ML in Production
Evaluation
Monitoring
Deployment
Management
What happens after (initial) deployment
After deployment
• Evaluation & monitoring: evaluate and track metrics over time.
• Management: react to feedback from deployed models.
ML in production - 101
[Diagram: historical data feeds batch training of the model, which makes real-time predictions on live data; feedback from the predictions flows back into the historical data.]
ML in production - 101
[Diagram: the same pipeline with a second model (Model 2) trained and serving predictions alongside the first.]
Key questions
• When to update a model?
• How to choose between existing models?
• Answer: continuous evaluation and testing
What is evaluation?
Evaluation = predictions + metric
Two decisions: What data? Which metric?
Evaluating a recommender
[Diagram: offline, predictions on historical data are scored with a ranking loss; online, predictions on live data are scored by user engagement.]
Evaluating a recommender
• Offline evaluation (ranking loss on historical data): decides when to update the model.
• Online evaluation (user engagement on live data): decides how to choose between models.
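As an illustration of the two metric flavors, a small self-contained sketch; the metric choices and numbers are illustrative, not from the deck.

def precision_at_k(recommended, relevant, k=10):
    """Offline ranking metric on held-out historical data."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / float(k)

def click_through_rate(impressions, clicks):
    """Online engagement metric computed from live traffic logs."""
    return clicks / float(impressions) if impressions else 0.0

# Offline: did the model rank items the user later bought near the top?
print(precision_at_k(['p1', 'p2', 'p3'], relevant=['p2', 'p9'], k=3))  # ~0.33
# Online: of the users shown recommendations, how many clicked?
print(click_through_rate(impressions=2000, clicks=200))               # 0.10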
Updating ML models
Why update?
• Trends and user tastes change over time
• Model performance drops
When to update?
• Track statistics of data over time
• Monitor both offline & online metrics on live data
• Update when offline metric diverges from online metrics
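A toy sketch of that divergence rule; the threshold, the daily cadence, and the assumption that the offline and online metrics live on a comparable scale are all simplifications.

def should_retrain(offline_scores, online_scores, window=7, tolerance=0.05):
    """Flag retraining when the gap between the offline metric (estimated on
    historical data) and the online metric (observed on live data) grows.
    Assumes both metrics are tracked daily and are on a comparable scale."""
    gaps = [abs(off - on) for off, on in
            zip(offline_scores[-window:], online_scores[-window:])]
    return sum(gaps) / len(gaps) > tolerance

offline = [0.31, 0.30, 0.31, 0.30, 0.29, 0.30, 0.30]  # e.g. offline precision@10
online  = [0.30, 0.28, 0.26, 0.24, 0.22, 0.20, 0.18]  # e.g. online engagement rate
print(should_retrain(offline, online))  # True: time to schedule a retrain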
Choosing between ML models
Strategy 1: A/B testing. Select the best model and use it all the time.
• Group A (Model 1): 2000 visits, 10% CTR
• Group B (Model 2): 2000 visits, 30% CTR
• Outcome: everybody gets Model 2.
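As a sketch of how the winner in Strategy 1 would actually be called, a two-proportion z-test on the click counts above; the choice of test and significance handling are assumptions, since the slide gives only the raw rates.

from math import sqrt, erf

def two_proportion_z_test(clicks_a, visits_a, clicks_b, visits_b):
    """Two-sided z-test for a difference in click-through rates."""
    p_a = clicks_a / float(visits_a)
    p_b = clicks_b / float(visits_b)
    p_pool = (clicks_a + clicks_b) / float(visits_a + visits_b)
    se = sqrt(p_pool * (1 - p_pool) * (1.0 / visits_a + 1.0 / visits_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Group A (Model 1): 2000 visits at 10% CTR; Group B (Model 2): 2000 visits at 30% CTR.
z, p = two_proportion_z_test(clicks_a=200, visits_a=2000, clicks_b=600, visits_b=2000)
print(z, p)  # z is about 15.8, p-value effectively 0: Model 2 wins decisively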
Choosing between ML models
Strategy 2: multi-armed bandits. A statistician walks into a casino…
• Slot machine with pay-off $1:$1000: play this 85% of the time
• Slot machine with pay-off $1:$200: play this 10% of the time
• Slot machine with pay-off $1:$500: play this 5% of the time
Choosing between ML models
A statistician walks into an ML production environment…
• Model 1 (pay-off $1:$1000): use this 85% of the time (exploitation)
• Model 2 (pay-off $1:$200): use this 10% of the time (exploration)
• Model 3 (pay-off $1:$500): use this 5% of the time (exploration)
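A minimal epsilon-greedy sketch of the bandit idea, routing each request to a model and learning the split from click feedback. The 85/10/5 split above is one hand-tuned policy; epsilon-greedy is a simple automated alternative, and the reward model here is just clicks.

import random

class EpsilonGreedyBandit:
    """Route each incoming request to one of the deployed models, mostly
    exploiting the best-performing one and occasionally exploring the others."""

    def __init__(self, n_models, epsilon=0.15):
        self.epsilon = epsilon
        self.pulls = [0] * n_models      # requests served by each model
        self.rewards = [0.0] * n_models  # clicks credited to each model

    def choose(self):
        if random.random() < self.epsilon:            # explore
            return random.randrange(len(self.pulls))
        rates = [r / p if p else 0.0 for r, p in zip(self.rewards, self.pulls)]
        return max(range(len(rates)), key=rates.__getitem__)  # exploit

    def update(self, model_idx, clicked):
        self.pulls[model_idx] += 1
        self.rewards[model_idx] += 1.0 if clicked else 0.0

bandit = EpsilonGreedyBandit(n_models=3)
model_idx = bandit.choose()            # pick a model for this request
# ... serve recommendations from models[model_idx], then log the outcome:
bandit.update(model_idx, clicked=True)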
MAB vs. A/B testing
Why MAB?
• Continuous optimization, “set and forget”
• Maximize overall reward
Why A/B test?
• Simple to understand
• Single winner
• Tricky to do right
Other production considerations
• Versioning
• Logging
• Provenance
• Dashboards
• Reports
“Machine Learning: The High Interest Credit Card of Technical Debt,” D. Sculley et al., Google, 2014
“Two big challenges in machine learning,” Leon Bottou, ICML 2015 invited talk
Conclusions
• Lifecycle of ML in production: deployment, evaluation, monitoring, management.
• Deployment: Dato Distributed & Dato Predictive Services.
• Evaluation, monitoring & management: A/B testing, multi-armed bandits & much more.
• Dato: a one-stop shop for all stages of the ML lifecycle, with a simple, platform-agnostic interface.
@datoinc, #DataSmt


Editor's Notes

  • #9 For both hyper-parameter tuning and model training, the system we are looking for should be: distributed (do embarrassingly parallel things in parallel; do ML things that distribute well in parallel); scalable (scale nodes up or down); co-located with the data (execution happens where the data lives); and easy to schedule, launch, and monitor (same code as the model; operational metrics, dashboards, alarming; easy integration with Dato Predictive Services).
  • #10 Why does this architecture meet the requirements? It is distributed, scalable, co-located with the data, and easy to schedule, launch, and monitor.
  • #11 Dato Distributed - one command to launch a long-running cluster of machines to do parallel / distributed execution of Jobs from GraphLab Create. These clusters can be launched in the cloud in AWS EC2 or on-premise in Hadoop YARN or Spark clusters.
  • #14 Meets the requirements: ease of integration, low latency, fault tolerance, scalability, maintainability.
  • #15 Dato Predictive Services - with one line we deploy a fault-tolerant, scalable, robust, and maintainable cluster to put a service-oriented architecture on machine learning models. We can choose to deploy in AWS EC2, or on-premise in our Hadoop YARN or Spark cluster environments.  
  • #16 We launch a 3-node Predictive Service deployment on AWS EC2 instances (m3.xlarge, with 4 cores and 15 GB RAM each) and deploy our recommender model. I should mention that the underlying dataset has 6.6M users and 2.4M products. We measure operational metrics for getting a real-time set of recommendations for a given user, and see average latency < 65 ms. And of course average round-trip latency is insufficient for a production system, so we also measure the 99th-percentile (P99) latency, and see it is < 100 ms.
  • #39 Same requirements as #9: for hyper-parameter tuning and model training, the system should be distributed, scalable, co-located with the data, and easy to schedule, launch, and monitor.
  • #42 For periodic training and distributed feature engineering we decide to use our on-premise Spark cluster, since our data is already stored on HDFS and we already use Spark DataFrames in our data engineering pipeline. So we schedule a nightly job to take the historical data and train our FactorizationRecommender model as a Spark Job, and as part of that job the trained model is updated on the Predictive Service deployment we launched earlier. And with Dato Distributed installed on the Spark cluster, our data scientists can now run distributed hyperparameter tuning regularly, and find an optimal set of parameters for the FactorizationRecommender model.
  • #45 So far so good: we get together in a meeting with the web team to get the frontend to start using the new recommendation system we've developed. In that meeting, one of the data scientists on the team mentions, 'I didn't tune the hyperparameters on that recommender model; there is probably an opportunity to get better results from the model.' One of the software engineers asks, 'How regularly will the model be trained?' Uh-oh, we hadn't thought of those things. The web team puts the frontend REST API work on their next sprint, and the data science team goes back to think about how to incorporate hyperparameter tuning and frequent model training into the recommender application.