Catch Me If You Can: Keeping up with ML Models in Production
Shreya Shankar
May 2021
Outline
1 Introduction
2 Case study: deploying a simple model
3 Post-deployment challenges
4 Monitoring and tracing
5 Conclusion
Introductions
BS and MS from Stanford Computer Science
First machine learning engineer at Viaduct, an applied ML startup
Working with terabytes of time series data
Building infrastructure for large-scale machine learning and data analytics
Responsibilities spanning recruiting, SWE, ML, product, and more
Academic and industry ML are different
Most academic ML efforts are focused on training techniques
Fairness
Generalizability to test sets with unknown distributions
Security
Industry places emphasis on an Agile working style
In industry, we want to train few models but do lots of inference
What happens beyond the validation or test set?
The depressing truth about ML IRL
87% of data science projects don't make it to production1
Data in the "real world" is not necessarily clean and balanced, like canonical benchmark datasets (ex: ImageNet)
Data in the "real world" is always changing
Showing high performance on a fixed train and validation set ≠ consistent high performance when that model is deployed
1 https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/
Today's talk
Not about how to improve model performance on a validation or test set
Not about feature engineering, machine learning algorithms, or data science
Discussing:
Case study of challenges faced post-deployment of models
Showcase of tools to monitor production pipelines and enable Agile teams of different stakeholders to quickly identify performance degradation
Motivating when to retrain machine learning models in production
General problem statement
Common to experience performance drift after a model is deployed
Unavoidable in many cases
Time series data
High-variance data
Intuitively, a model and pipeline should not be stale
Want the pipeline to reflect the most recent data points
Will simulate performance drift with a toy task: predicting whether a taxi rider will give their driver a high tip or not
Case study description
Using publicly available NYC Taxicab Trip Data (data in public S3 bucket s3://nyc-tlc/trip data/)
For this exercise, we will train and "deploy" a model to predict whether the user gave a large tip
Using Python, Prometheus, Grafana, mlflow, mltrace
Goal of this exercise is to demonstrate what can go wrong post-deployment and how to troubleshoot bugs
Diagnosing failure points in the pipeline is challenging for Agile teams when pipelines are very complex!
Dataset description
Simple toy example using the publicly available NYC Taxicab Trip Data2
Tabular dataset where each record corresponds to one taxi ride
Pickup and dropoff times
Number of passengers
Trip distance
Pickup and dropoff location zones
Fare, toll, and tip amounts
Monthly files from Jan 2009 to June 2020
Each month is about 1 GB of data → over 100 GB total
2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Learning task
Let X be the train covariate matrix (features) and y be the train target vector (labels)
Xi represents an n-dim feature vector corresponding to the ith ride, and yi ∈ {0, 1}, where 1 represents the rider tipping more than 20% of the total fare
Want to learn a classifier f such that f(Xi) ≈ yi for all Xi ∈ X
We do not want to overfit, meaning F1(Xtrain) ≈ F1(Xtest)
We will use a RandomForestClassifier with basic hyperparameters (see the sketch below)
n_estimators=100, max_depth=10
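For concreteness, here is a minimal sketch of this learning task with scikit-learn. The talk's pipeline wraps this in a RandomForestModelWrapper class; the parquet path below is an assumption, and the column names come from the featuregen stage described later.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Assumed local copy of the January 2020 featuregen output
train_df = pd.read_parquet("features_2020_01.pq")
feature_columns = ["passenger_count", "trip_distance", "trip_time", "trip_speed"]
X, y = train_df[feature_columns], train_df["high_tip_indicator"]  # y is 1 when the tip exceeds 20% of the fare

# Basic hyperparameters from the slide
clf = RandomForestClassifier(n_estimators=100, max_depth=10)
clf.fit(X, y)
print(f1_score(y, clf.predict(X)))  # training-set F1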
Toy ML Pipeline
Code lives in toy-ml-pipeline3, an example of an end-to-end pipeline for our high tip prediction task.
3 https://github.com/shreyashankar/toy-ml-pipeline
Toy ML pipeline utilities
S3 bucket for intermediate storage
io library
dumps versioned output of each stage every time a stage is run
loads the latest version of outputs so we can use them as inputs to another stage
Serve predictions via a Flask application:

global mw
mw = models.RandomForestModelWrapper.load('training/models')

@app.route('/predict', methods=['POST'])
def predict():
    req = request.get_json()
    df = pd.read_json(req['data'])
    df['prediction'] = mw.predict(df)
    result = {'prediction': df['prediction'].to_list()}
    return jsonify(result)
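As a usage example, a client such as the inference.py script mentioned later could call this endpoint like the sketch below; the host, port, and feature values are assumptions, and the payload shape mirrors the handler's pd.read_json(req['data']) call.

import pandas as pd
import requests

# One row of illustrative feature values, using the columns from the training snippet
features_df = pd.DataFrame([{
    'pickup_weekday': 4, 'pickup_hour': 18, 'pickup_minute': 30, 'work_hours': 1,
    'passenger_count': 1, 'trip_distance': 2.5, 'trip_time': 780.0, 'trip_speed': 0.0032,
    'PULocationID': 186, 'DOLocationID': 48, 'RatecodeID': 1,
}])

# Assumes the Flask app above is running locally on port 5000
resp = requests.post('http://localhost:5000/predict', json={'data': features_df.to_json()})
print(resp.json()['prediction'])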
ETL
Cleaning stage reads in raw data for each month and removes:
Rows with timestamps that don't fall within the month range
Rows with $0 fares
Featuregen stage computes the following features (see the sketch below):
Trip-based: passenger count, trip distance, trip time, trip speed
Pickup-based: pickup weekday, pickup hour, pickup minute, work hour indicator
Basic categorical: pickup location code, dropoff location code, type of ride
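A rough pandas sketch of these two stages, assuming the yellow-taxi column names (tpep_pickup_datetime, fare_amount, and so on); the real pipeline's filters and feature definitions may differ.

import pandas as pd

def clean_month(df: pd.DataFrame, month_start: str, month_end: str) -> pd.DataFrame:
    # Drop rows outside the month range and rows with $0 fares
    in_range = df['tpep_pickup_datetime'].between(month_start, month_end)
    nonzero_fare = df['fare_amount'] > 0
    return df[in_range & nonzero_fare]

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    # Trip-based and pickup-based features described above
    out = df.copy()
    out['trip_time'] = (out['tpep_dropoff_datetime'] - out['tpep_pickup_datetime']).dt.total_seconds()
    out['trip_speed'] = out['trip_distance'] / out['trip_time'].clip(lower=1)
    out['pickup_weekday'] = out['tpep_pickup_datetime'].dt.weekday
    out['pickup_hour'] = out['tpep_pickup_datetime'].dt.hour
    out['pickup_minute'] = out['tpep_pickup_datetime'].dt.minute
    out['work_hours'] = ((out['pickup_weekday'] < 5) & out['pickup_hour'].between(9, 17)).astype(int)
    return out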
Offline training and evaluation
Train on January 2020, validate on February 2020
We will measure F1 score for our metric (a quick numeric check follows below)
precision = (number of true positives) / (number of records predicted to be positive)
recall = (number of true positives) / (number of positives in the dataset)
F1 = (2 * precision * recall) / (precision + recall)
Higher F1 score is better; we want few false positives and false negatives
Steps
1 Split the output of the featuregen stage into train and validation sets
2 Train and validate the model in a notebook
3 Train the model for production in another notebook
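A quick numeric check of the formulas above, with made-up counts:

# Toy numbers: 80 true positives, 20 false positives, 40 false negatives
tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)  # 0.80
recall = tp / (tp + fn)     # ~0.67
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727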
Train and test split code snippet

train_month = '2020_01'
test_month = '2020_02'

# Load latest features for train and test sets
train_df = io.load_output_df(os.path.join('features', train_month))
test_df = io.load_output_df(os.path.join('features', test_month))

# Save train and test sets
print(io.save_output_df(train_df, 'training/files/train'))
print(io.save_output_df(test_df, 'training/files/test'))

Output
s3://toy-applied-ml-pipeline/dev/training/files/train/20210501-165847.pq
s3://toy-applied-ml-pipeline/dev/training/files/test/20210501-170136.pq
Training code snippet

feature_columns = [
    'pickup_weekday', 'pickup_hour',
    'pickup_minute', 'work_hours',
    'passenger_count', 'trip_distance',
    'trip_time', 'trip_speed',
    'PULocationID', 'DOLocationID', 'RatecodeID'
]
label_column = 'high_tip_indicator'
model_params = {'max_depth': 10}

# Create and train model
mw = models.RandomForestModelWrapper(
    feature_columns=feature_columns, model_params=model_params)
mw.train(train_df, label_column)
Training and test set evaluation

train_score = mw.score(train_df, label_column)
test_score = mw.score(test_df, label_column)
mw.add_metric('train_f1', train_score)
mw.add_metric('test_f1', test_score)
print('Metrics:')
print(mw.get_metrics())

Output
Metrics:
{'train_f1': 0.7310504153094398, 'test_f1': 0.735223195868965}
Train the production model
Simulate deployment in March 2020 (COVID-19 incoming!)
In practice, train models on different windows / do more thorough data science
Here, we will just train a model on February 2020 with the same parameters that we explored before:

train_df = io.load_output_df(os.path.join('features', '2020_02'))
prod_mw = models.RandomForestModelWrapper(
    feature_columns=feature_columns, model_params=model_params)
prod_mw.train(train_df, label_column)
train_score = prod_mw.score(train_df, label_column)
prod_mw.add_metric('train_f1', train_score)
prod_mw.save('training/models', dev=False)
Deploy model
Flask server that runs predict on features passed in via request
Separate file inference.py that repeatedly sends requests containing features
Logging metrics in the Flask app via Prometheus (see the sketch below)
Visualizing metrics via Grafana
Demo
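A minimal sketch of that Prometheus wiring with the prometheus_client library; the metric names and the /metrics route are illustrative, and mw is loaded from the talk's models module as in the serving snippet earlier.

import pandas as pd
from flask import Flask, request, jsonify
from prometheus_client import Counter, Histogram, generate_latest

app = Flask(__name__)
mw = models.RandomForestModelWrapper.load('training/models')  # as in the serving snippet

PREDICTION_COUNT = Counter('prediction_requests_total', 'Number of /predict calls')
PREDICTION_VALUE = Histogram('prediction_value', 'Distribution of model outputs')

@app.route('/metrics')
def metrics():
    # Prometheus scrapes this endpoint; Grafana charts the stored series
    return generate_latest()

@app.route('/predict', methods=['POST'])
def predict():
    df = pd.read_json(request.get_json()['data'])
    df['prediction'] = mw.predict(df)
    PREDICTION_COUNT.inc()
    for p in df['prediction']:
        PREDICTION_VALUE.observe(float(p))  # hist.observe(val), as on the Prometheus slide
    return jsonify({'prediction': df['prediction'].to_list()})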
Inspect F1 scores at the end of every day

Output
         day  rolling_f1  daily_f1
178123     1    0.576629  0.576629
370840     2    0.633320  0.677398
592741     3    0.649983  0.675877
...
1514827   28    0.659934  0.501860
1520358   29    0.659764  0.537860
1529847   30    0.659530  0.576178
Deeper dive into March F1 scores

         day  rolling_f1  daily_f1
178123     1    0.576629  0.576629
370840     2    0.633320  0.677398
592741     3    0.649983  0.675877
...
1507234   27    0.660228  0.576993
1514827   28    0.659934  0.501860
1520358   29    0.659764  0.537860
1529847   30    0.659530  0.576178

Large discrepancy between the rolling metric and the daily metric
What you monitor is important: the daily metric drops significantly towards the end of the month
Visible effect of COVID-19 on the data
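One way the daily and rolling columns could be computed, sketched with pandas and scikit-learn; the column names (day, label, prediction) are assumptions about the logged prediction table.

import pandas as pd
from sklearn.metrics import f1_score

def daily_and_rolling_f1(preds: pd.DataFrame) -> pd.DataFrame:
    # preds: one row per ride, with the day of month, the true label, and the prediction
    rows = []
    for day in sorted(preds['day'].unique()):
        daily = preds[preds['day'] == day]
        so_far = preds[preds['day'] <= day]  # everything up to and including this day
        rows.append({
            'day': day,
            'daily_f1': f1_score(daily['label'], daily['prediction']),
            'rolling_f1': f1_score(so_far['label'], so_far['prediction']),
        })
    return pd.DataFrame(rows)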
Evaluating the same model on following months

Output
2020-04  F1: 0.5714705472990737
2020-05  F1: 0.5530868473460906
2020-06  F1: 0.5967621469282887

Performance gets significantly worse! We will need to retrain models periodically.
Challenge: lag
In this example we are simulating a "live" deployment on historical data, so we have no lag
In practice there are many types of lag
Feature lag: our system only learns about the ride well after it has occurred
Label lag: our system only learns of the label (fare, tip) well after the ride has occurred
The evaluation metric will inherently be lagging
We might not be able to train on the most recent data
Challenge: distribution shift
Data "drifts" over time, and models will need to be retrained to reflect such drift
Open question: how often do you retrain a model?
Retraining adds complexity to the overall system (more artifacts to version and keep track of)
Retraining can be expensive in terms of compute
Retraining can take time and energy from people
Demo
How do you know when the data has "drifted?"4
4 Rabanser, Günnemann, and Lipton, Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift.
Challenge: quantifying distribution shift
We only have 11 features, so we'll skip the dimensionality reduction step
"Multi...
Kolmogorov-Smirnov test made simple
For each feature, we will run a 2-sided Kolmogorov-Smirnov test (see the sketch below)
Continuous data, non-parametric
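A sketch of that per-feature test using scipy; the two DataFrames are assumed to hold, say, the January (reference) and March (live) feature sets, and feature_columns is the list from the training snippet.

from scipy.stats import ks_2samp

def ks_per_feature(reference_df, current_df, feature_columns):
    # Two-sided KS test per feature; returns each feature's p-value
    p_values = {}
    for col in feature_columns:
        stat, p_value = ks_2samp(reference_df[col], current_df[col])
        p_values[col] = p_value
    return p_values

# Note: with millions of rows, even small shifts yield tiny p-values,
# so a low p-value alone does not mean the drift is practically important.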
How do you know when the data has "drifted?"
Comparing January 2020 and February 2020 datasets using the KS test:
Output
f...
How do you know when the data has "drifted?"
For our "offline" train and evaluation sets (January and February), we get ex...
How do you know when the data has "drifted?"
You don’t really know.
An unsatisfying solution is to retrain the model to be...
Types of production ML bugs
Data changed or corrupted
Data dependency / infra issues
Logical error in the ETL code
Logical...
Why production ML bugs are hard to find and address
ML code fails silently (few runtime or compiler errors)
Little or no u...
What to monitor?
Agile philosophy: "Continuous Delivery." Monitor proactively!
Especially hard when there are no labels
Fi...
Prometheus monitoring for ML
Selling points: easy to call hist.observe(val)
Pain points: user (you) needs to define the hist...
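That pain point is about choosing histogram buckets by hand; a sketch with prometheus_client, where the metric name and bucket edges are arbitrary:

from prometheus_client import Histogram

# Prometheus's default buckets are tuned for request latencies, so for a
# probability-like model output the user has to pick the edges themselves.
prediction_hist = Histogram(
    'high_tip_prediction',
    'Distribution of predicted high-tip scores',
    buckets=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)

prediction_hist.observe(0.73)  # hist.observe(val)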
PromQL for ML monitoring
Question: average output over the last 5 minutes
Query: sum(rate(histogram_sum[5m])) / sum(rate(histogram_count[5m]))
...
Retraining regularly
Suppose we retrain our model on a weekly basis.
Use MLFlow Model Registry7 to keep track of which mod...
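A sketch of what weekly retraining plus registration could look like, assuming an MLflow tracking server with a registry backend is configured; the registered model name is illustrative, not from the talk.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

def weekly_retrain(train_df, feature_columns, label_column):
    clf = RandomForestClassifier(n_estimators=100, max_depth=10)
    clf.fit(train_df[feature_columns], train_df[label_column])

    with mlflow.start_run():
        # Log the fitted model and register it as a new version under one name,
        # so it is always clear which version is the latest candidate for serving.
        mlflow.sklearn.log_model(
            clf,
            artifact_path='model',
            registered_model_name='high_tip_classifier',
        )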
mltrace9
Coarse-grained lineage and tracing
Designed specifically for complex data or ML pipelines
Designed specifically fo...
9 https://github.com/loglabs/mltrace
mltrace design principles
Utter simplicity (user knows everything going on)
No need for users to manually set component dependencies
mltrace concepts10
Pipelines are made of components (ex: cleaning, featuregen, split, train)
In ML pipelines, different arti...
10 https://mltrace.readthedocs.io
mltrace integration example

def clean_data(raw_data_filename: str, raw_df: pd.DataFrame,
               month: str, year: str,
               comp...

@register('cleaning', input_vars=['raw_data_filename'],
          output_vars=['output_path'])
def cl...

from mltrace import create_component, tag_component, register

create_component('cleaning', ...
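As a hedged guess at how the decorator-style integration fits together end to end, the sketch below uses only the mltrace calls shown above; create_component's remaining arguments and the function body are assumptions, so check the mltrace docs for the exact signatures.

import pandas as pd
from mltrace import create_component, register

# Register the component once so that runs of clean_data are logged against it
# (description and owner arguments are assumed here)
create_component('cleaning', 'Removes out-of-range timestamps and $0 fares', 'shreya')

@register('cleaning', input_vars=['raw_data_filename'], output_vars=['output_path'])
def clean_data(raw_data_filename: str, raw_df: pd.DataFrame, month: str, year: str) -> str:
    clean_df = raw_df[raw_df['fare_amount'] > 0]
    output_path = f'clean/{year}_{month}.pq'  # local variable named in output_vars
    clean_df.to_parquet(output_path)
    return output_path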
mltrace UI demo
mltrace immediate roadmap13
UI feature to easily see if a component is stale
There are newer versions of the component's o...
13 Please email shreyashankar@berkeley.edu if...
Talk recap
Introduced an end-to-end ML pipeline for a toy task
Demonstrated performance drift or degradation when a model ...
Areas of future work in MLOps
Only scratched the surface with challenges discussed today
Many more of these problems aroun...
14 D'Amour et al., Underspecification Presents Challenges for Credibility in Modern Machine Learning.
Miscellaneous
Code for this talk: https://github.com/shreyashankar/debugging-ml-talk
Toy ML Pipeline with mltrace, Prometheus...
References
D'Amour, Alexander et al. Underspecification Presents Challenges for Credibility in Modern Machine Learning. 2020.
Rabanser, Stephan, Stephan Günnemann, and Zachary C. Lipton. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. 2019.
Feedback
Your feedback is important to us. Don't forget to rate and review the sessions.
Catch Me If You Can: Keeping Up With ML Models in Production


Advances in machine learning and big data are disrupting every industry. However, even when companies do deploy models to production, they face significant challenges in staying there: performance often degrades substantially from offline benchmarks over time, a phenomenon known as performance drift. Models deployed over an extended period of time commonly experience this drift because the underlying data distribution changes.

In this talk, we discuss approaches to mitigate the effects of performance drift, illustrating our methods on a sample prediction task. Leveraging our experience at a startup deploying and monitoring production-grade ML pipelines for predictive maintenance, we also address several aspects of machine learning often overlooked in academia, such as the incorporation of non-technical collaborators and the integration of machine learning in an agile framework. This talk will consist of:

Using Python, Dask, and open-source datasets to demonstrate an example of training and validating a model in an offline setting that subsequently experiences performance degradation when it is deployed.
Using MLFlow, Prometheus, and Grafana to show how one can build tools to monitor production pipelines and enable teams of different stakeholders to quickly identify performance degradation using the right metrics.
Proposing a checklist of criteria for when to retrain machine learning models in production.
This talk will be a slideshow presentation accompanied by a Python notebook demo. It is aimed at engineers who deploy and debug models in production, but may be of broader interest to people building machine learning-based products. It requires familiarity with machine learning basics (train/test sets, decision trees).

  1. 1. Catch Me If You Can: Keeping up with ML Models in Production Shreya Shankar May 2021 Shreya Shankar Keeping up with ML in Production May 2021 1 / 50
  2. 2. Outline 1 Introduction 2 Case study: deploying a simple model 3 Post-deployment challenges 4 Monitoring and tracing 5 Conclusion Shreya Shankar Keeping up with ML in Production May 2021 2 / 50
  3. 3. Introductions BS and MS from Stanford Computer Science Shreya Shankar Keeping up with ML in Production May 2021 3 / 50
  4. 4. Introductions BS and MS from Stanford Computer Science First machine learning engineer at Viaduct, an applied ML startup Shreya Shankar Keeping up with ML in Production May 2021 3 / 50
  5. 5. Introductions BS and MS from Stanford Computer Science First machine learning engineer at Viaduct, an applied ML startup Working with terabytes of time series data Shreya Shankar Keeping up with ML in Production May 2021 3 / 50
  6. 6. Introductions BS and MS from Stanford Computer Science First machine learning engineer at Viaduct, an applied ML startup Working with terabytes of time series data Building infrastructure for large-scale machine learning and data analytics Shreya Shankar Keeping up with ML in Production May 2021 3 / 50
  7. 7. Introductions BS and MS from Stanford Computer Science First machine learning engineer at Viaduct, an applied ML startup Working with terabytes of time series data Building infrastructure for large-scale machine learning and data analytics Responsibilities spanning recruiting, SWE, ML, product, and more Shreya Shankar Keeping up with ML in Production May 2021 3 / 50
  8. 8. Academic and industry ML are different Most academic ML efforts are focused on training techniques Shreya Shankar Keeping up with ML in Production May 2021 4 / 50
  9. 9. Academic and industry ML are different Most academic ML efforts are focused on training techniques Fairness Shreya Shankar Keeping up with ML in Production May 2021 4 / 50
  10. 10. Academic and industry ML are different Most academic ML efforts are focused on training techniques Fairness Generalizability to test sets with unknown distributions Shreya Shankar Keeping up with ML in Production May 2021 4 / 50
  11. 11. Academic and industry ML are different Most academic ML efforts are focused on training techniques Fairness Generalizability to test sets with unknown distributions Security Shreya Shankar Keeping up with ML in Production May 2021 4 / 50
  12. 12. Academic and industry ML are different Most academic ML efforts are focused on training techniques Fairness Generalizability to test sets with unknown distributions Security Industry places emphasis on Agile working style Shreya Shankar Keeping up with ML in Production May 2021 4 / 50
  13. 13. Academic and industry ML are different Most academic ML efforts are focused on training techniques Fairness Generalizability to test sets with unknown distributions Security Industry places emphasis on Agile working style In industry, we want to train few models but do lots of inference Shreya Shankar Keeping up with ML in Production May 2021 4 / 50
  14. 14. Academic and industry ML are different Most academic ML efforts are focused on training techniques Fairness Generalizability to test sets with unknown distributions Security Industry places emphasis on Agile working style In industry, we want to train few models but do lots of inference What happens beyond the validation or test set? Shreya Shankar Keeping up with ML in Production May 2021 4 / 50
  15. 15. The depressing truth about ML IRL 87% of data science projects don’t make it to production1 1 https://venturebeat.com/2019/07/19/ why-do-87-of-data-science-projects-never-make-it-into-production/ Shreya Shankar Keeping up with ML in Production May 2021 5 / 50
  16. 16. The depressing truth about ML IRL 87% of data science projects don’t make it to production1 Data in the "real world" is not necessarily clean and balanced, like canonical benchmark datasets (ex: ImageNet) 1 https://venturebeat.com/2019/07/19/ why-do-87-of-data-science-projects-never-make-it-into-production/ Shreya Shankar Keeping up with ML in Production May 2021 5 / 50
  17. 17. The depressing truth about ML IRL 87% of data science projects don’t make it to production1 Data in the "real world" is not necessarily clean and balanced, like canonical benchmark datasets (ex: ImageNet) Data in the "real world" is always changing 1 https://venturebeat.com/2019/07/19/ why-do-87-of-data-science-projects-never-make-it-into-production/ Shreya Shankar Keeping up with ML in Production May 2021 5 / 50
  18. 18. The depressing truth about ML IRL 87% of data science projects don’t make it to production1 Data in the "real world" is not necessarily clean and balanced, like canonical benchmark datasets (ex: ImageNet) Data in the "real world" is always changing Showing high performance on a fixed train and validation set 6= consistent high performance when that model is deployed 1 https://venturebeat.com/2019/07/19/ why-do-87-of-data-science-projects-never-make-it-into-production/ Shreya Shankar Keeping up with ML in Production May 2021 5 / 50
  19. 19. Today’s talk Not about how to improve model performance on a validation or test set Shreya Shankar Keeping up with ML in Production May 2021 6 / 50
  20. 20. Today’s talk Not about how to improve model performance on a validation or test set Not about feature engineering, machine learning algorithms, or data science Shreya Shankar Keeping up with ML in Production May 2021 6 / 50
  21. 21. Today’s talk Not about how to improve model performance on a validation or test set Not about feature engineering, machine learning algorithms, or data science Discussing: Shreya Shankar Keeping up with ML in Production May 2021 6 / 50
  22. 22. Today’s talk Not about how to improve model performance on a validation or test set Not about feature engineering, machine learning algorithms, or data science Discussing: Case study of challenges faced post-deployment of models Shreya Shankar Keeping up with ML in Production May 2021 6 / 50
  23. 23. Today’s talk Not about how to improve model performance on a validation or test set Not about feature engineering, machine learning algorithms, or data science Discussing: Case study of challenges faced post-deployment of models Showcase of tools to monitor production pipelines and enable Agile teams of different stakeholders to quickly identify performance degradation Shreya Shankar Keeping up with ML in Production May 2021 6 / 50
  24. 24. Today’s talk Not about how to improve model performance on a validation or test set Not about feature engineering, machine learning algorithms, or data science Discussing: Case study of challenges faced post-deployment of models Showcase of tools to monitor production pipelines and enable Agile teams of different stakeholders to quickly identify performance degradation Motivating when to retrain machine learning models in production Shreya Shankar Keeping up with ML in Production May 2021 6 / 50
  25. 25. General problem statement Common to experience performance drift after a model is deployed Shreya Shankar Keeping up with ML in Production May 2021 7 / 50
  26. 26. General problem statement Common to experience performance drift after a model is deployed Unavoidable in many cases Shreya Shankar Keeping up with ML in Production May 2021 7 / 50
  27. 27. General problem statement Common to experience performance drift after a model is deployed Unavoidable in many cases Time series data Shreya Shankar Keeping up with ML in Production May 2021 7 / 50
  28. 28. General problem statement Common to experience performance drift after a model is deployed Unavoidable in many cases Time series data High-variance data Shreya Shankar Keeping up with ML in Production May 2021 7 / 50
  29. 29. General problem statement Common to experience performance drift after a model is deployed Unavoidable in many cases Time series data High-variance data Intuitively, a model and pipeline should not be stale Shreya Shankar Keeping up with ML in Production May 2021 7 / 50
  30. 30. General problem statement Common to experience performance drift after a model is deployed Unavoidable in many cases Time series data High-variance data Intuitively, a model and pipeline should not be stale Want the pipeline to reflect most recent data points Shreya Shankar Keeping up with ML in Production May 2021 7 / 50
  31. 31. General problem statement Common to experience performance drift after a model is deployed Unavoidable in many cases Time series data High-variance data Intuitively, a model and pipeline should not be stale Want the pipeline to reflect most recent data points Will simulate performance drift with a toy task — predicting whether a taxi rider will give their driver a high tip or not Shreya Shankar Keeping up with ML in Production May 2021 7 / 50
  32. 32. Case study description Using publicly available NYC Taxicab Trip Data (data in public S3 bucket s3://nyc-tlc/trip data/) Shreya Shankar Keeping up with ML in Production May 2021 8 / 50
  33. 33. Case study description Using publicly available NYC Taxicab Trip Data (data in public S3 bucket s3://nyc-tlc/trip data/) For this exercise, we will train and "deploy" a model to predict whether the user gave a large tip Shreya Shankar Keeping up with ML in Production May 2021 8 / 50
  34. 34. Case study description Using publicly available NYC Taxicab Trip Data (data in public S3 bucket s3://nyc-tlc/trip data/) For this exercise, we will train and "deploy" a model to predict whether the user gave a large tip Using Python, Prometheus, Grafana, mlflow, mltrace Shreya Shankar Keeping up with ML in Production May 2021 8 / 50
  35. 35. Case study description Using publicly available NYC Taxicab Trip Data (data in public S3 bucket s3://nyc-tlc/trip data/) For this exercise, we will train and "deploy" a model to predict whether the user gave a large tip Using Python, Prometheus, Grafana, mlflow, mltrace Goal of this exercise is to demonstrate what can go wrong post-deployment and how to troubleshoot bugs Shreya Shankar Keeping up with ML in Production May 2021 8 / 50
  36. 36. Case study description Using publicly available NYC Taxicab Trip Data (data in public S3 bucket s3://nyc-tlc/trip data/) For this exercise, we will train and "deploy" a model to predict whether the user gave a large tip Using Python, Prometheus, Grafana, mlflow, mltrace Goal of this exercise is to demonstrate what can go wrong post-deployment and how to troubleshoot bugs Diagnosing failure points in the pipeline is challenging for Agile teams when pipelines are very complex! Shreya Shankar Keeping up with ML in Production May 2021 8 / 50
  37. 37. Dataset description Simple toy example using the publicly available NYC Taxicab Trip Data2 2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page Shreya Shankar Keeping up with ML in Production May 2021 9 / 50
  38. 38. Dataset description Simple toy example using the publicly available NYC Taxicab Trip Data2 Tabular dataset where each record corresponds to one taxi ride 2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page Shreya Shankar Keeping up with ML in Production May 2021 9 / 50
  39. 39. Dataset description Simple toy example using the publicly available NYC Taxicab Trip Data2 Tabular dataset where each record corresponds to one taxi ride Pickup and dropoff times 2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page Shreya Shankar Keeping up with ML in Production May 2021 9 / 50
  40. 40. Dataset description Simple toy example using the publicly available NYC Taxicab Trip Data2 Tabular dataset where each record corresponds to one taxi ride Pickup and dropoff times Number of passengers 2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page Shreya Shankar Keeping up with ML in Production May 2021 9 / 50
  41. 41. Dataset description Simple toy example using the publicly available NYC Taxicab Trip Data2 Tabular dataset where each record corresponds to one taxi ride Pickup and dropoff times Number of passengers Trip distance 2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page Shreya Shankar Keeping up with ML in Production May 2021 9 / 50
  42. 42. Dataset description Simple toy example using the publicly available NYC Taxicab Trip Data2 Tabular dataset where each record corresponds to one taxi ride Pickup and dropoff times Number of passengers Trip distance Pickup and dropoff location zones 2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page Shreya Shankar Keeping up with ML in Production May 2021 9 / 50
  43. 43. Dataset description Simple toy example using the publicly available NYC Taxicab Trip Data2 Tabular dataset where each record corresponds to one taxi ride Pickup and dropoff times Number of passengers Trip distance Pickup and dropoff location zones Fare, toll, and tip amounts 2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page Shreya Shankar Keeping up with ML in Production May 2021 9 / 50
  44. 44. Dataset description Simple toy example using the publicly available NYC Taxicab Trip Data2 Tabular dataset where each record corresponds to one taxi ride Pickup and dropoff times Number of passengers Trip distance Pickup and dropoff location zones Fare, toll, and tip amounts Monthly files from Jan 2009 to June 2020 2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page Shreya Shankar Keeping up with ML in Production May 2021 9 / 50
  45. 45. Dataset description Simple toy example using the publicly available NYC Taxicab Trip Data2 Tabular dataset where each record corresponds to one taxi ride Pickup and dropoff times Number of passengers Trip distance Pickup and dropoff location zones Fare, toll, and tip amounts Monthly files from Jan 2009 to June 2020 Each month is about 1 GB of data → over 100 GB 2 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page Shreya Shankar Keeping up with ML in Production May 2021 9 / 50
  46. 46. Learning task Let X be the train covariate matrix (features) and y be the train target vector (labels) Shreya Shankar Keeping up with ML in Production May 2021 10 / 50
  47. 47. Learning task Let X be the train covariate matrix (features) and y be the train target vector (labels) Xi represents an n-dim feature vector corresponding to the ith ride and yi ∈ [0, 1] where 1 represents the rider tipping more than 20% of the total fare Shreya Shankar Keeping up with ML in Production May 2021 10 / 50
  48. 48. Learning task Let X be the train covariate matrix (features) and y be the train target vector (labels) Xi represents an n-dim feature vector corresponding to the ith ride and yi ∈ [0, 1] where 1 represents the rider tipping more than 20% of the total fare Want to learn classifier f such that f (Xi ) ≈ yi for all Xi ∈ X Shreya Shankar Keeping up with ML in Production May 2021 10 / 50
  49. 49. Learning task Let X be the train covariate matrix (features) and y be the train target vector (labels) Xi represents an n-dim feature vector corresponding to the ith ride and yi ∈ [0, 1] where 1 represents the rider tipping more than 20% of the total fare Want to learn classifier f such that f (Xi ) ≈ yi for all Xi ∈ X We do not want to overfit, meaning F1(Xtrain) ≈ F1(Xtest) Shreya Shankar Keeping up with ML in Production May 2021 10 / 50
  50. 50. Learning task Let X be the train covariate matrix (features) and y be the train target vector (labels) Xi represents an n-dim feature vector corresponding to the ith ride and yi ∈ [0, 1] where 1 represents the rider tipping more than 20% of the total fare Want to learn classifier f such that f (Xi ) ≈ yi for all Xi ∈ X We do not want to overfit, meaning F1(Xtrain) ≈ F1(Xtest) We will use a RandomForestClassifier with basic hyperparameters Shreya Shankar Keeping up with ML in Production May 2021 10 / 50
  51. 51. Learning task Let X be the train covariate matrix (features) and y be the train target vector (labels) Xi represents an n-dim feature vector corresponding to the ith ride and yi ∈ [0, 1] where 1 represents the rider tipping more than 20% of the total fare Want to learn classifier f such that f (Xi ) ≈ yi for all Xi ∈ X We do not want to overfit, meaning F1(Xtrain) ≈ F1(Xtest) We will use a RandomForestClassifier with basic hyperparameters n_estimators=100, max_depth=10 Shreya Shankar Keeping up with ML in Production May 2021 10 / 50
  52. 52. Toy ML Pipeline Code lives in toy-ml-pipeline3, an example of an end-to-end pipeline for our high tip prediction task. 3 https://github.com/shreyashankar/toy-ml-pipeline Shreya Shankar Keeping up with ML in Production May 2021 11 / 50
  53. 53. Toy ML pipeline utilities S3 bucket for intermediate storage Shreya Shankar Keeping up with ML in Production May 2021 12 / 50
  54. 54. Toy ML pipeline utilities S3 bucket for intermediate storage io library Shreya Shankar Keeping up with ML in Production May 2021 12 / 50
  55. 55. Toy ML pipeline utilities S3 bucket for intermediate storage io library dumps versioned output of each stage every time a stage is run Shreya Shankar Keeping up with ML in Production May 2021 12 / 50
  56. 56. Toy ML pipeline utilities S3 bucket for intermediate storage io library dumps versioned output of each stage every time a stage is run load latest version of outputs so we can use them as inputs to another stage Shreya Shankar Keeping up with ML in Production May 2021 12 / 50
  57. 57. Toy ML pipeline utilities S3 bucket for intermediate storage io library dumps versioned output of each stage every time a stage is run load latest version of outputs so we can use them as inputs to another stage Serve predictions via Flask application global mw mw = models. RandomForestModelWrapper .load(’training/ models ’) @app.route(’/predict ’, methods=[’POST ’]) def predict (): req = request.get_json () df = pd.read_json(req[’data ’]) df[’prediction ’] = mw.predict(df) result = { ’prediction ’: df[’prediction ’].to_list () } return jsonify(result) Shreya Shankar Keeping up with ML in Production May 2021 12 / 50
  58. 58. ETL Cleaning stage reads in raw data for each month and removes: Shreya Shankar Keeping up with ML in Production May 2021 13 / 50
  59. 59. ETL Cleaning stage reads in raw data for each month and removes: Rows with timestamps that don’t fall within the month range Shreya Shankar Keeping up with ML in Production May 2021 13 / 50
  60. 60. ETL Cleaning stage reads in raw data for each month and removes: Rows with timestamps that don’t fall within the month range Rows with $0 fares Shreya Shankar Keeping up with ML in Production May 2021 13 / 50
  61. 61. ETL Cleaning stage reads in raw data for each month and removes: Rows with timestamps that don’t fall within the month range Rows with $0 fares Featuregen stage computes the following features: Shreya Shankar Keeping up with ML in Production May 2021 13 / 50
  62. 62. ETL Cleaning stage reads in raw data for each month and removes: Rows with timestamps that don’t fall within the month range Rows with $0 fares Featuregen stage computes the following features: Trip-based: passenger count, trip distance, trip time, trip speed Shreya Shankar Keeping up with ML in Production May 2021 13 / 50
  63. 63. ETL Cleaning stage reads in raw data for each month and removes: Rows with timestamps that don’t fall within the month range Rows with $0 fares Featuregen stage computes the following features: Trip-based: passenger count, trip distance, trip time, trip speed Pickup-based: pickup weekday, pickup hour, pickup minute, work hour indicator Shreya Shankar Keeping up with ML in Production May 2021 13 / 50
  64. 64. ETL Cleaning stage reads in raw data for each month and removes: Rows with timestamps that don’t fall within the month range Rows with $0 fares Featuregen stage computes the following features: Trip-based: passenger count, trip distance, trip time, trip speed Pickup-based: pickup weekday, pickup hour, pickup minute, work hour indicator Basic categorical: pickup location code, dropoff location code, type of ride Shreya Shankar Keeping up with ML in Production May 2021 13 / 50
  65. 65. Offline training and evaluation Train on January 2020, validate on February 2020 Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  66. 66. Offline training and evaluation Train on January 2020, validate on February 2020 We will measure F1 score for our metric Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  67. 67. Offline training and evaluation Train on January 2020, validate on February 2020 We will measure F1 score for our metric precision = number of true positives number of records predicted to be positive Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  68. 68. Offline training and evaluation Train on January 2020, validate on February 2020 We will measure F1 score for our metric precision = number of true positives number of records predicted to be positive recall = number of true positives number of positives in the dataset Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  69. 69. Offline training and evaluation Train on January 2020, validate on February 2020 We will measure F1 score for our metric precision = number of true positives number of records predicted to be positive recall = number of true positives number of positives in the dataset F1 = 2∗precision∗recall precision+recall Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  70. 70. Offline training and evaluation Train on January 2020, validate on February 2020 We will measure F1 score for our metric precision = number of true positives number of records predicted to be positive recall = number of true positives number of positives in the dataset F1 = 2∗precision∗recall precision+recall Higher F1 score is better, want to have low false positives and false negatives Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  71. 71. Offline training and evaluation Train on January 2020, validate on February 2020 We will measure F1 score for our metric precision = number of true positives number of records predicted to be positive recall = number of true positives number of positives in the dataset F1 = 2∗precision∗recall precision+recall Higher F1 score is better, want to have low false positives and false negatives Steps Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  72. 72. Offline training and evaluation Train on January 2020, validate on February 2020 We will measure F1 score for our metric precision = number of true positives number of records predicted to be positive recall = number of true positives number of positives in the dataset F1 = 2∗precision∗recall precision+recall Higher F1 score is better, want to have low false positives and false negatives Steps 1 Split the output of the featuregen stage into train and validation sets Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  73. 73. Offline training and evaluation Train on January 2020, validate on February 2020 We will measure F1 score for our metric precision = number of true positives number of records predicted to be positive recall = number of true positives number of positives in the dataset F1 = 2∗precision∗recall precision+recall Higher F1 score is better, want to have low false positives and false negatives Steps 1 Split the output of the featuregen stage into train and validation sets 2 Train and validate the model in a notebook Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  74. 74. Offline training and evaluation Train on January 2020, validate on February 2020 We will measure F1 score for our metric precision = number of true positives number of records predicted to be positive recall = number of true positives number of positives in the dataset F1 = 2∗precision∗recall precision+recall Higher F1 score is better, want to have low false positives and false negatives Steps 1 Split the output of the featuregen stage into train and validation sets 2 Train and validate the model in a notebook 3 Train the model for production in another notebook Shreya Shankar Keeping up with ML in Production May 2021 14 / 50
  75. 75. Train and test split code snippet train_month = ’2020_01 ’ test_month = ’2020_02 ’ # Load latest features for train and test sets train_df = io. load_output_df (os.path.join(’features ’, train_month)) test_df = io.load_output_df(os.path.join(’features ’, test_month)) # Save train and test sets print(io.save_output_df (train_df , ’training/files/train ’)) print(io.save_output_df (test_df , ’training/files/test ’)) Output s3://toy-applied-ml-pipeline/dev/training/files/train/20210501-165847.pq s3://toy-applied-ml-pipeline/dev/training/files/test/20210501-170136.pq Shreya Shankar Keeping up with ML in Production May 2021 15 / 50
  76. 76. Training code snippet feature_columns = [ ’pickup_weekday ’, ’pickup_hour ’, ’pickup_minute ’, ’work_hours ’, ’passenger_count ’, ’trip_distance ’, ’trip_time ’, ’trip_speed ’, ’PULocationID ’, ’DOLocationID ’, ’RatecodeID ’ ] label_column = ’high_tip_indicator ’ model_params = {’max_depth ’: 10} # Create and train model mw = models. RandomForestModelWrapper (feature_columns = feature_columns , model_params= model_params) mw.train(train_df , label_column) Shreya Shankar Keeping up with ML in Production May 2021 16 / 50
  77. 77. Training and test set evaluation train_score = mw.score(train_df , label_column) test_score = mw.score(test_df , label_column) mw.add_metric(’train_f1 ’, train_score) mw.add_metric(’test_f1 ’, test_score) print(’Metrics:’) print(mw.get_metrics ()) Output Metrics: ’train_f1’: 0.7310504153094398, ’test_f1’: 0.735223195868965 Shreya Shankar Keeping up with ML in Production May 2021 17 / 50
  78. 78. Train the production model Simulate deployment in March 2020 (COVID-19 incoming!) Shreya Shankar Keeping up with ML in Production May 2021 18 / 50
  79. 79. Train the production model Simulate deployment in March 2020 (COVID-19 incoming!) In practice, train models on different windows/do more thorough data science Shreya Shankar Keeping up with ML in Production May 2021 18 / 50
  80. 80. Train the production model Simulate deployment in March 2020 (COVID-19 incoming!) In practice, train models on different windows/do more thorough data science Here, we will just train a model on February 2020 with the same parameters that we explored before train_df = io. load_output_df (os.path.join(’features ’, ’ 2020 -02’)) prod_mw = models. RandomForestModelWrapper ( feature_columns = feature_columns , model_params=model_params) prod_mw.train(train_df , label_column) train_score = prod_mw.score(train_df , label_column) prod_mw.add_metric(’train_f1 ’, train_score) prod_mw.save(’training/models ’, dev=False) Shreya Shankar Keeping up with ML in Production May 2021 18 / 50
  81. 81. Deploy model Flask server that runs predict on features passed in via request Shreya Shankar Keeping up with ML in Production May 2021 19 / 50
  82. 82. Deploy model Flask server that runs predict on features passed in via request Separate file inference.py that repeatedly sends requests containing features Shreya Shankar Keeping up with ML in Production May 2021 19 / 50
  83. 83. Deploy model Flask server that runs predict on features passed in via request Separate file inference.py that repeatedly sends requests containing features Logging metrics in the Flask app via Prometheus Shreya Shankar Keeping up with ML in Production May 2021 19 / 50
  84. 84. Deploy model Flask server that runs predict on features passed in via request Separate file inference.py that repeatedly sends requests containing features Logging metrics in the Flask app via Prometheus Visualizing metrics via Grafana Shreya Shankar Keeping up with ML in Production May 2021 19 / 50
  85. 85. Deploy model Flask server that runs predict on features passed in via request Separate file inference.py that repeatedly sends requests containing features Logging metrics in the Flask app via Prometheus Visualizing metrics via Grafana Demo Shreya Shankar Keeping up with ML in Production May 2021 19 / 50
  86. 86. Inspect F1 scores at the end of every day Output day rolling_f1 daily_f1 178123 1 0.576629 0.576629 370840 2 0.633320 0.677398 592741 3 0.649983 0.675877 . . . 1514827 28 0.659934 0.501860 1520358 29 0.659764 0.537860 1529847 30 0.659530 0.576178 Shreya Shankar Keeping up with ML in Production May 2021 20 / 50
  87. 87. Deeper dive into March F1 scores day rolling_f1 daily_f1 178123 1 0.576629 0.576629 370840 2 0.633320 0.677398 592741 3 0.649983 0.675877 . . . 1507234 27 0.660228 0.576993 1514827 28 0.659934 0.501860 1520358 29 0.659764 0.537860 1529847 30 0.659530 0.576178 Large discrepancy between rolling metric and daily metric Shreya Shankar Keeping up with ML in Production May 2021 21 / 50
  88. 88. Deeper dive into March F1 scores day rolling_f1 daily_f1 178123 1 0.576629 0.576629 370840 2 0.633320 0.677398 592741 3 0.649983 0.675877 . . . 1507234 27 0.660228 0.576993 1514827 28 0.659934 0.501860 1520358 29 0.659764 0.537860 1529847 30 0.659530 0.576178 Large discrepancy between rolling metric and daily metric What you monitor is important – daily metric significantly drops towards the end Shreya Shankar Keeping up with ML in Production May 2021 21 / 50
  89. 89. Deeper dive into March F1 scores day rolling_f1 daily_f1 178123 1 0.576629 0.576629 370840 2 0.633320 0.677398 592741 3 0.649983 0.675877 . . . 1507234 27 0.660228 0.576993 1514827 28 0.659934 0.501860 1520358 29 0.659764 0.537860 1529847 30 0.659530 0.576178 Large discrepancy between rolling metric and daily metric What you monitor is important – daily metric significantly drops towards the end Visible effect of COVID-19 on the data Shreya Shankar Keeping up with ML in Production May 2021 21 / 50
  90. 90. Evaluating the same model on following months

    2020-04 F1: 0.5714705472990737
    2020-05 F1: 0.5530868473460906
    2020-06 F1: 0.5967621469282887

Performance gets significantly worse! We will need to retrain models periodically. Shreya Shankar Keeping up with ML in Production May 2021 22 / 50
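A minimal sketch of how those numbers might be produced, assuming the io helpers, prod_mw model wrapper, and label_column from the training slide:

    import os

    for month in ['2020-04', '2020-05', '2020-06']:
        eval_df = io.load_output_df(os.path.join('features', month))
        print(month, 'F1:', prod_mw.score(eval_df, label_column))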
  91. 91. Challenge: lag In this example we are simulating a "live" deployment on historical data, so we have no lag Shreya Shankar Keeping up with ML in Production May 2021 23 / 50
  92. 92. Challenge: lag In this example we are simulating a "live" deployment on historical data, so we have no lag In practice there are many types of lag Shreya Shankar Keeping up with ML in Production May 2021 23 / 50
  93. 93. Challenge: lag In this example we are simulating a "live" deployment on historical data, so we have no lag In practice there are many types of lag Feature lag: our system only learns about the ride well after it has occurred Shreya Shankar Keeping up with ML in Production May 2021 23 / 50
  94. 94. Challenge: lag In this example we are simulating a "live" deployment on historical data, so we have no lag In practice there are many types of lag Feature lag: our system only learns about the ride well after it has occurred Label lag: our system only learns of the label (fare, tip) well after the ride has occurred Shreya Shankar Keeping up with ML in Production May 2021 23 / 50
  95. 95. Challenge: lag In this example we are simulating a "live" deployment on historical data, so we have no lag In practice there are many types of lag Feature lag: our system only learns about the ride well after it has occurred Label lag: our system only learns of the label (fare, tip) well after the ride has occurred The evaluation metric will inherently be lagging Shreya Shankar Keeping up with ML in Production May 2021 23 / 50
  96. 96. Challenge: lag In this example we are simulating a "live" deployment on historical data, so we have no lag In practice there are many types of lag Feature lag: our system only learns about the ride well after it has occurred Label lag: our system only learns of the label (fare, tip) well after the ride has occurred The evaluation metric will inherently be lagging We might not be able to train on the most recent data Shreya Shankar Keeping up with ML in Production May 2021 23 / 50
  97. 97. Challenge: distribution shift Data "drifts" over time, and models will need to be retrained to reflect such drift Shreya Shankar Keeping up with ML in Production May 2021 24 / 50
  98. 98. Challenge: distribution shift Data "drifts" over time, and models will need to be retrained to reflect such drift Open question: how often do you retrain a model? Shreya Shankar Keeping up with ML in Production May 2021 24 / 50
  99. 99. Challenge: distribution shift Data "drifts" over time, and models will need to be retrained to reflect such drift Open question: how often do you retrain a model? Retraining adds complexity to the overall system (more artifacts to version and keep track of) Shreya Shankar Keeping up with ML in Production May 2021 24 / 50
  100. 100. Challenge: distribution shift Data "drifts" over time, and models will need to be retrained to reflect such drift Open question: how often do you retrain a model? Retraining adds complexity to the overall system (more artifacts to version and keep track of) Retraining can be expensive in terms of compute Shreya Shankar Keeping up with ML in Production May 2021 24 / 50
  101. 101. Challenge: distribution shift Data "drifts" over time, and models will need to be retrained to reflect such drift Open question: how often do you retrain a model? Retraining adds complexity to the overall system (more artifacts to version and keep track of) Retraining can be expensive in terms of compute Retraining can take time and energy from people Shreya Shankar Keeping up with ML in Production May 2021 24 / 50
  102. 102. Challenge: distribution shift Data "drifts" over time, and models will need to be retrained to reflect such drift Open question: how often do you retrain a model? Retraining adds complexity to the overall system (more artifacts to version and keep track of) Retraining can be expensive in terms of compute Retraining can take time and energy from people Open question: how do you know when the data has "drifted?" Shreya Shankar Keeping up with ML in Production May 2021 24 / 50
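The talk leaves retraining cadence as an open question. One illustrative heuristic, not prescribed by the talk and with the margin chosen arbitrarily here, is to trigger retraining once labels arrive and the daily metric falls well below the rolling metric the model had been holding:

    def should_retrain(rolling_f1: float, daily_f1: float, margin: float = 0.05) -> bool:
        # Retrain if today's (lagged) F1 has fallen noticeably below the cumulative F1.
        return daily_f1 < rolling_f1 - margin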
  103. 103. Demo Shreya Shankar Keeping up with ML in Production May 2021 25 / 50
  104. 104. How do you know when the data has "drifted?" 4 4 Rabanser, Günnemann, and Lipton, Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. Shreya Shankar Keeping up with ML in Production May 2021 26 / 50
  105. 105. Challenge: quantifying distribution shift We only have 11 features, so we’ll skip the dimensionality reduction step Shreya Shankar Keeping up with ML in Production May 2021 27 / 50
  106. 106. Challenge: quantifying distribution shift We only have 11 features, so we’ll skip the dimensionality reduction step "Multiple univariate testing seem to offer comparable performance to multivariate testing" (Rabanser et al.) Shreya Shankar Keeping up with ML in Production May 2021 27 / 50
  107. 107. Challenge: quantifying distribution shift We only have 11 features, so we’ll skip the dimensionality reduction step "Multiple univariate testing seem to offer comparable performance to multivariate testing" (Rabanser et al.) Maybe this is specific to their MNIST and CIFAR experiments Shreya Shankar Keeping up with ML in Production May 2021 27 / 50
  108. 108. Challenge: quantifying distribution shift We only have 11 features, so we’ll skip the dimensionality reduction step "Multiple univariate testing seem to offer comparable performance to multivariate testing" (Rabanser et al.) Maybe this is specific to their MNIST and CIFAR experiments Regardless, we will employ multiple univariate testing Shreya Shankar Keeping up with ML in Production May 2021 27 / 50
  109. 109. Challenge: quantifying distribution shift We only have 11 features, so we’ll skip the dimensionality reduction step "Multiple univariate testing seem to offer comparable performance to multivariate testing" (Rabanser et al.) Maybe this is specific to their MNIST and CIFAR experiments Regardless, we will employ multiple univariate testing For each feature, we will run a 2-sided Kolmogorov-Smirnov test Shreya Shankar Keeping up with ML in Production May 2021 27 / 50
  110. 110. Challenge: quantifying distribution shift We only have 11 features, so we’ll skip the dimensionality reduction step "Multiple univariate testing seem to offer comparable performance to multivariate testing" (Rabanser et al.) Maybe this is specific to their MNIST and CIFAR experiments Regardless, we will employ multiple univariate testing For each feature, we will run a 2-sided Kolmogorov-Smirnov test Continuous data, non-parametric test Shreya Shankar Keeping up with ML in Production May 2021 27 / 50
  111. 111. Challenge: quantifying distribution shift We only have 11 features, so we’ll skip the dimensionality reduction step "Multiple univariate testing seem to offer comparable performance to multivariate testing" (Rabanser et al.) Maybe this is specific to their MNIST and CIFAR experiments Regardless, we will employ multiple univariate testing For each feature, we will run a 2-sided Kolmogorov-Smirnov test Continuous data, non-parametric test Compute the largest difference Z = \sup_z |F_p(z) − F_q(z)|, where F_p and F_q are the empirical CDFs of the data we are comparing Shreya Shankar Keeping up with ML in Production May 2021 27 / 50
  115. 115. Kolmogorov-Smirnov test made simple For each feature, we will run a 2-sided Kolmogorov-Smirnov test Continuous data, non-parametric test Compute the largest difference Z = \sup_z |F_p(z) − F_q(z)|, where F_p and F_q are the empirical CDFs of the data we are comparing Example comparing two random normally distributed PDFs Shreya Shankar Keeping up with ML in Production May 2021 28 / 50
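A minimal sketch of multiple univariate testing with SciPy: run a two-sided KS test per feature between two feature DataFrames (say January and February). The DataFrame and column names are assumptions; the KS statistic is exactly the sup-difference of empirical CDFs above.

    import pandas as pd
    from scipy.stats import ks_2samp

    def ks_per_feature(df_p: pd.DataFrame, df_q: pd.DataFrame) -> pd.DataFrame:
        rows = []
        for col in df_p.columns:
            stat, p_value = ks_2samp(df_p[col], df_q[col])  # two-sided by default
            rows.append({'feature': col, 'statistic': stat, 'p_value': p_value})
        return pd.DataFrame(rows).sort_values('statistic', ascending=False)

    # e.g. ks_per_feature(jan_df[feature_columns], feb_df[feature_columns])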
  116. 116. How do you know when the data has "drifted?" Comparing January 2020 and February 2020 datasets using the KS test:

        feature            statistic   p_value
    0   pickup_weekday     0.046196    0.000000e+00
    2   work_hours         0.028587    0.000000e+00
    6   trip_time          0.017205    0.000000e+00
    7   trip_speed         0.035415    0.000000e+00
    1   pickup_hour        0.009676    8.610133e-258
    5   trip_distance      0.005312    5.266602e-78
    8   PULocationID       0.004083    2.994877e-46
    9   DOLocationID       0.003132    2.157559e-27
    4   passenger_count    0.002947    2.634493e-24
    10  RatecodeID         0.002616    3.047481e-19
    3   pickup_minute      0.000702    8.861498e-02

Shreya Shankar Keeping up with ML in Production May 2021 29 / 50
  117. 117. How do you know when the data has "drifted?" For our "offline" train and evaluation sets (January and February), we get extremely low p-values! 5 Lin, Lucas, and Shmueli, “Too Big to Fail: Large Samples and the p-Value Problem”. Shreya Shankar Keeping up with ML in Production May 2021 30 / 50
  118. 118. How do you know when the data has "drifted?" For our "offline" train and evaluation sets (January and February), we get extremely low p-values! This method can have a high "false positive rate" 5 Lin, Lucas, and Shmueli, “Too Big to Fail: Large Samples and the p-Value Problem”. Shreya Shankar Keeping up with ML in Production May 2021 30 / 50
  119. 119. How do you know when the data has "drifted?" For our "offline" train and evaluation sets (January and February), we get extremely low p-values! This method can have a high "false positive rate" In my experience, this method has flagged distributions as "significantly different" more than I want it to 5 Lin, Lucas, and Shmueli, “Too Big to Fail: Large Samples and the p-Value Problem”. Shreya Shankar Keeping up with ML in Production May 2021 30 / 50
  120. 120. How do you know when the data has "drifted?" For our "offline" train and evaluation sets (January and February), we get extremely low p-values! This method can have a high "false positive rate" In my experience, this method has flagged distributions as "significantly different" more than I want it to In the era of "big data" (we have millions of data points), p-values are not useful to look at5 5 Lin, Lucas, and Shmueli, “Too Big to Fail: Large Samples and the p-Value Problem”. Shreya Shankar Keeping up with ML in Production May 2021 30 / 50
  121. 121. How do you know when the data has "drifted?" You don’t really know. An unsatisfying solution is to retrain the model to be as fresh as possible. Shreya Shankar Keeping up with ML in Production May 2021 31 / 50
  122. 122. Types of production ML bugs Data changed or corrupted Shreya Shankar Keeping up with ML in Production May 2021 32 / 50
  123. 123. Types of production ML bugs Data changed or corrupted Data dependency / infra issues Shreya Shankar Keeping up with ML in Production May 2021 32 / 50
  124. 124. Types of production ML bugs Data changed or corrupted Data dependency / infra issues Logical error in the ETL code Shreya Shankar Keeping up with ML in Production May 2021 32 / 50
  125. 125. Types of production ML bugs Data changed or corrupted Data dependency / infra issues Logical error in the ETL code Logical error in retraining a model Shreya Shankar Keeping up with ML in Production May 2021 32 / 50
  126. 126. Types of production ML bugs Data changed or corrupted Data dependency / infra issues Logical error in the ETL code Logical error in retraining a model Logical error in promoting a retrained model to inference step Shreya Shankar Keeping up with ML in Production May 2021 32 / 50
  127. 127. Types of production ML bugs Data changed or corrupted Data dependency / infra issues Logical error in the ETL code Logical error in retraining a model Logical error in promoting a retrained model to inference step Change in consumer behavior Shreya Shankar Keeping up with ML in Production May 2021 32 / 50
  128. 128. Types of production ML bugs Data changed or corrupted Data dependency / infra issues Logical error in the ETL code Logical error in retraining a model Logical error in promoting a retrained model to inference step Change in consumer behavior Many more Shreya Shankar Keeping up with ML in Production May 2021 32 / 50
  129. 129. Why production ML bugs are hard to find and address ML code fails silently (few runtime or compiler errors) Shreya Shankar Keeping up with ML in Production May 2021 33 / 50
  130. 130. Why production ML bugs are hard to find and address ML code fails silently (few runtime or compiler errors) Little or no unit or integration testing in feature engineering and ML code Shreya Shankar Keeping up with ML in Production May 2021 33 / 50
  131. 131. Why production ML bugs are hard to find and address ML code fails silently (few runtime or compiler errors) Little or no unit or integration testing in feature engineering and ML code Very few (if any) people have end-to-end visibility on an ML pipeline Shreya Shankar Keeping up with ML in Production May 2021 33 / 50
  132. 132. Why production ML bugs are hard to find and address ML code fails silently (few runtime or compiler errors) Little or no unit or integration testing in feature engineering and ML code Very few (if any) people have end-to-end visibility on an ML pipeline Underdeveloped monitoring and tracing tools Shreya Shankar Keeping up with ML in Production May 2021 33 / 50
  133. 133. Why production ML bugs are hard to find and address ML code fails silently (few runtime or compiler errors) Little or no unit or integration testing in feature engineering and ML code Very few (if any) people have end-to-end visibility on an ML pipeline Underdeveloped monitoring and tracing tools So many artifacts to keep track of, especially if you retrain your models Shreya Shankar Keeping up with ML in Production May 2021 33 / 50
  134. 134. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  135. 135. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Especially hard when there are no labels Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  136. 136. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Especially hard when there are no labels First-pass monitoring Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  137. 137. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Especially hard when there are no labels First-pass monitoring Model output distributions Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  138. 138. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Especially hard when there are no labels First-pass monitoring Model output distributions Averages, percentiles Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  139. 139. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Especially hard when there are no labels First-pass monitoring Model output distributions Averages, percentiles More advanced monitoring Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  140. 140. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Especially hard when there are no labels First-pass monitoring Model output distributions Averages, percentiles More advanced monitoring Model input (feature) distributions Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  141. 141. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Especially hard when there are no labels First-pass monitoring Model output distributions Averages, percentiles More advanced monitoring Model input (feature) distributions ETL intermediate output distributions Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  142. 142. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Especially hard when there are no labels First-pass monitoring Model output distributions Averages, percentiles More advanced monitoring Model input (feature) distributions ETL intermediate output distributions Can get tedious if you have many features or steps in ETL Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  143. 143. What to monitor? Agile philosophy: "Continuous Delivery." Monitor proactively! Especially hard when there are no labels First-pass monitoring Model output distributions Averages, percentiles More advanced monitoring Model input (feature) distributions ETL intermediate output distributions Can get tedious if you have many features or steps in ETL Still unclear how to quantify, detect, and act on Shreya Shankar Keeping up with ML in Production May 2021 34 / 50
  144. 144. Prometheus monitoring for ML
Selling points:
    Easy to call hist.observe(val)
    Easy to integrate with Grafana
    Many applications already use Prometheus
Pain points:
    User (you) needs to define the histogram buckets
    Metric naming, tracking, and visualization clutter
    PromQL is not very intuitive for ML-based tasks
Shreya Shankar Keeping up with ML in Production May 2021 35 / 50
  145. 145. PromQL for ML monitoring
Average output [5m]: sum(rate(histogram_sum[5m])) / sum(rate(histogram_count[5m])), time series view
Median output [5m]: histogram_quantile(0.5, sum by (job, le) (rate(histogram_bucket[5m]))), time series view
Probability distribution of outputs [5m]: sum(rate(histogram_bucket[5m])) by (le), heatmap view6
6 Jordan, A simple solution for monitoring ML systems. Shreya Shankar Keeping up with ML in Production May 2021 36 / 50
  146. 146. Retraining regularly Suppose we retrain our model on a weekly basis. Use MLFlow Model Registry7 to keep track of which model is the best model Now, we need some tracing8... 7 https://www.mlflow.org/docs/latest/model-registry.html 8 Hermann and Del Balso, Scaling Machine Learning at Uber with Michelangelo. Shreya Shankar Keeping up with ML in Production May 2021 37 / 50
  147. 147. Retraining regularly Suppose we retrain our model on a weekly basis. Use MLFlow Model Registry7 to keep track of which model is the best model Alternatively, can also manage versioning and a registry ourselves (what we do now) Now, we need some tracing8... 7 https://www.mlflow.org/docs/latest/model-registry.html 8 Hermann and Del Balso, Scaling Machine Learning at Uber with Michelangelo. Shreya Shankar Keeping up with ML in Production May 2021 37 / 50
  148. 148. Retraining regularly Suppose we retrain our model on a weekly basis. Use MLFlow Model Registry7 to keep track of which model is the best model Alternatively, can also manage versioning and a registry ourselves (what we do now) At inference time, pull the latest/best model from our model registry Now, we need some tracing8... 7 https://www.mlflow.org/docs/latest/model-registry.html 8 Hermann and Del Balso, Scaling Machine Learning at Uber with Michelangelo. Shreya Shankar Keeping up with ML in Production May 2021 37 / 50
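As a hedged sketch of the registry approach, assuming an MLflow tracking server is configured and a registered model name such as "tip_model" (a hypothetical name), the inference service can load whichever version is currently in the Production stage:

    import mlflow.pyfunc

    # "models:/<registered model name>/<stage>" resolves to the latest version in that stage.
    model = mlflow.pyfunc.load_model('models:/tip_model/Production')
    preds = model.predict(features_df)  # features_df: pandas DataFrame of features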
  149. 149. mltrace Coarse-grained lineage and tracing 9 https://github.com/loglabs/mltrace Shreya Shankar Keeping up with ML in Production May 2021 38 / 50
  150. 150. mltrace Coarse-grained lineage and tracing Designed specifically for complex data or ML pipelines 9 https://github.com/loglabs/mltrace Shreya Shankar Keeping up with ML in Production May 2021 38 / 50
  151. 151. mltrace Coarse-grained lineage and tracing Designed specifically for complex data or ML pipelines Designed specifically for Agile multidisciplinary teams 9 https://github.com/loglabs/mltrace Shreya Shankar Keeping up with ML in Production May 2021 38 / 50
  152. 152. mltrace Coarse-grained lineage and tracing Designed specifically for complex data or ML pipelines Designed specifically for Agile multidisciplinary teams Alpha release9 contains Python API to log run information and UI to view traces for outputs 9 https://github.com/loglabs/mltrace Shreya Shankar Keeping up with ML in Production May 2021 38 / 50
  158. 158. mltrace design principles Utter simplicity (user knows everything going on) No need for users to manually set component dependencies – the tool should detect dependencies based on I/O of a component’s run API designed for both engineers and data scientists UI designed for people to help triage issues even if they didn’t build the ETL or models themselves When there is a bug, whose backlog do you put it on? Enable people who may not have developed the model to investigate the bug Shreya Shankar Keeping up with ML in Production May 2021 39 / 50
  159. 159. mltrace concepts Pipelines are made of components (ex: cleaning, featuregen, split, train) 10 https://mltrace.readthedocs.io/en/latest 11 https://docs.dagster.io/concepts/solids-pipelines/solids 12 https://www.mlflow.org/docs/latest/tracking.html Shreya Shankar Keeping up with ML in Production May 2021 40 / 50
  160. 160. mltrace concepts Pipelines are made of components (ex: cleaning, featuregen, split, train) In ML pipelines, different artifacts are produced (inputs and outputs) when the same component is run more than once 10 https://mltrace.readthedocs.io/en/latest 11 https://docs.dagster.io/concepts/solids-pipelines/solids 12 https://www.mlflow.org/docs/latest/tracking.html Shreya Shankar Keeping up with ML in Production May 2021 40 / 50
  161. 161. mltrace concepts Pipelines are made of components (ex: cleaning, featuregen, split, train) In ML pipelines, different artifacts are produced (inputs and outputs) when the same component is run more than once Component and ComponentRun objects10 10 https://mltrace.readthedocs.io/en/latest 11 https://docs.dagster.io/concepts/solids-pipelines/solids 12 https://www.mlflow.org/docs/latest/tracking.html Shreya Shankar Keeping up with ML in Production May 2021 40 / 50
  162. 162. mltrace concepts Pipelines are made of components (ex: cleaning, featuregen, split, train) In ML pipelines, different artifacts are produced (inputs and outputs) when the same component is run more than once Component and ComponentRun objects10 Decorator interface similar to Dagster "solids"11 10 https://mltrace.readthedocs.io/en/latest 11 https://docs.dagster.io/concepts/solids-pipelines/solids 12 https://www.mlflow.org/docs/latest/tracking.html Shreya Shankar Keeping up with ML in Production May 2021 40 / 50
  163. 163. mltrace concepts Pipelines are made of components (ex: cleaning, featuregen, split, train) In ML pipelines, different artifacts are produced (inputs and outputs) when the same component is run more than once Component and ComponentRun objects10 Decorator interface similar to Dagster "solids"11 User specifies component name, input and output variables 10 https://mltrace.readthedocs.io/en/latest 11 https://docs.dagster.io/concepts/solids-pipelines/solids 12 https://www.mlflow.org/docs/latest/tracking.html Shreya Shankar Keeping up with ML in Production May 2021 40 / 50
  164. 164. mltrace concepts Pipelines are made of components (ex: cleaning, featuregen, split, train) In ML pipelines, different artifacts are produced (inputs and outputs) when the same component is run more than once Component and ComponentRun objects10 Decorator interface similar to Dagster "solids"11 User specifies component name, input and output variables Alternative Pythonic interface similar to MLFlow tracking12 10 https://mltrace.readthedocs.io/en/latest 11 https://docs.dagster.io/concepts/solids-pipelines/solids 12 https://www.mlflow.org/docs/latest/tracking.html Shreya Shankar Keeping up with ML in Production May 2021 40 / 50
  165. 165. mltrace concepts Pipelines are made of components (ex: cleaning, featuregen, split, train) In ML pipelines, different artifacts are produced (inputs and outputs) when the same component is run more than once Component and ComponentRun objects10 Decorator interface similar to Dagster "solids"11 User specifies component name, input and output variables Alternative Pythonic interface similar to MLFlow tracking12 User creates instance of ComponentRun object and calls log_component_run on the object 10 https://mltrace.readthedocs.io/en/latest 11 https://docs.dagster.io/concepts/solids-pipelines/solids 12 https://www.mlflow.org/docs/latest/tracking.html Shreya Shankar Keeping up with ML in Production May 2021 40 / 50
  166. 166. mltrace integration example

    def clean_data(raw_data_filename: str, raw_df: pd.DataFrame,
                   month: str, year: str, component: str) -> str:
        # calendar.monthrange returns (weekday of day 1, number of days in the month)
        _, last = calendar.monthrange(int(year), int(month))
        first_day = f'{year}-{month}-01'
        last_day = f'{year}-{month}-{last:02d}'
        clean_df = helpers.remove_zero_fare_and_oob_rows(raw_df, first_day, last_day)
        # Write "clean" df to s3
        output_path = io.save_output_df(clean_df, component)
        return output_path

Shreya Shankar Keeping up with ML in Production May 2021 41 / 50
  167. 167. mltrace integration example

    @register('cleaning', input_vars=['raw_data_filename'],
              output_vars=['output_path'])
    def clean_data(raw_data_filename: str, raw_df: pd.DataFrame,
                   month: str, year: str, component: str) -> str:
        # calendar.monthrange returns (weekday of day 1, number of days in the month)
        _, last = calendar.monthrange(int(year), int(month))
        first_day = f'{year}-{month}-01'
        last_day = f'{year}-{month}-{last:02d}'
        clean_df = helpers.remove_zero_fare_and_oob_rows(raw_df, first_day, last_day)
        # Write "clean" df to s3
        output_path = io.save_output_df(clean_df, component)
        return output_path

Shreya Shankar Keeping up with ML in Production May 2021 42 / 50
  168. 168. mltrace integration example

    from mltrace import create_component, tag_component, register

    create_component('cleaning',
                     'Cleans the raw data with basic OOB criteria.',
                     'shreya')
    tag_component('cleaning', ['etl'])

    # Run function
    raw_df = io.read_file(month, year)
    raw_data_filename = io.get_raw_data_filename('2020', '01')
    output_path = clean_data(raw_data_filename, raw_df, '01', '2020', 'clean/2020_01')
    print(output_path)

Shreya Shankar Keeping up with ML in Production May 2021 43 / 50
  169. 169. mltrace UI demo Shreya Shankar Keeping up with ML in Production May 2021 44 / 50
  170. 170. mltrace immediate roadmap13 UI feature to easily see if a component is stale 13 Please email shreyashankar@berkeley.edu if you’re interested in contributing! Shreya Shankar Keeping up with ML in Production May 2021 45 / 50
  171. 171. mltrace immediate roadmap13 UI feature to easily see if a component is stale There are newer versions of the component’s outputs 13 Please email shreyashankar@berkeley.edu if you’re interested in contributing! Shreya Shankar Keeping up with ML in Production May 2021 45 / 50
  172. 172. mltrace immediate roadmap13 UI feature to easily see if a component is stale There are newer versions of the component’s outputs The latest run of a component happened weeks or months ago 13 Please email shreyashankar@berkeley.edu if you’re interested in contributing! Shreya Shankar Keeping up with ML in Production May 2021 45 / 50
  173. 173. mltrace immediate roadmap13 UI feature to easily see if a component is stale There are newer versions of the component’s outputs The latest run of a component happened weeks or months ago CLI to interact with mltrace logs 13 Please email shreyashankar@berkeley.edu if you’re interested in contributing! Shreya Shankar Keeping up with ML in Production May 2021 45 / 50
  174. 174. mltrace immediate roadmap13 UI feature to easily see if a component is stale There are newer versions of the component’s outputs The latest run of a component happened weeks or months ago CLI to interact with mltrace logs Scala Spark integrations (or REST API for logging) 13 Please email shreyashankar@berkeley.edu if you’re interested in contributing! Shreya Shankar Keeping up with ML in Production May 2021 45 / 50
  175. 175. mltrace immediate roadmap13 UI feature to easily see if a component is stale There are newer versions of the component’s outputs The latest run of a component happened weeks or months ago CLI to interact with mltrace logs Scala Spark integrations (or REST API for logging) Prometheus integrations to easily monitor outputs 13 Please email shreyashankar@berkeley.edu if you’re interested in contributing! Shreya Shankar Keeping up with ML in Production May 2021 45 / 50
  176. 176. mltrace immediate roadmap13 UI feature to easily see if a component is stale There are newer versions of the component’s outputs The latest run of a component happened weeks or months ago CLI to interact with mltrace logs Scala Spark integrations (or REST API for logging) Prometheus integrations to easily monitor outputs Possibly other MLOps tool integrations 13 Please email shreyashankar@berkeley.edu if you’re interested in contributing! Shreya Shankar Keeping up with ML in Production May 2021 45 / 50
  177. 177. Talk recap Introduced an end-to-end ML pipeline for a toy task Shreya Shankar Keeping up with ML in Production May 2021 46 / 50
  178. 178. Talk recap Introduced an end-to-end ML pipeline for a toy task Demonstrated performance drift or degradation when a model was deployed via Prometheus metric logging and Grafana dashboard visualizations Shreya Shankar Keeping up with ML in Production May 2021 46 / 50
  179. 179. Talk recap Introduced an end-to-end ML pipeline for a toy task Demonstrated performance drift or degradation when a model was deployed via Prometheus metric logging and Grafana dashboard visualizations Showed via statistical tests that it’s hard to know when exactly to retrain a model Shreya Shankar Keeping up with ML in Production May 2021 46 / 50
  180. 180. Talk recap Introduced an end-to-end ML pipeline for a toy task Demonstrated performance drift or degradation when a model was deployed via Prometheus metric logging and Grafana dashboard visualizations Showed via statistical tests that it’s hard to know when exactly to retrain a model Discussed challenges around continuously retraining models and maintaining ML production pipelines Shreya Shankar Keeping up with ML in Production May 2021 46 / 50
  181. 181. Talk recap Introduced an end-to-end ML pipeline for a toy task Demonstrated performance drift or degradation when a model was deployed via Prometheus metric logging and Grafana dashboard visualizations Showed via statistical tests that it’s hard to know when exactly to retrain a model Discussed challenges around continuously retraining models and maintaining ML production pipelines Introduced mltrace, a tool developed for ML pipelines that performs coarse-grained lineage and tracing Shreya Shankar Keeping up with ML in Production May 2021 46 / 50
  182. 182. Areas of future work in MLOps Only scratched the surface with challenges discussed today 14 D’Amour et al., Underspecification Presents Challenges for Credibility in Modern Machine Learning. Shreya Shankar Keeping up with ML in Production May 2021 47 / 50
  183. 183. Areas of future work in MLOps Only scratched the surface with challenges discussed today Many more of these problems around deep learning 14 D’Amour et al., Underspecification Presents Challenges for Credibility in Modern Machine Learning. Shreya Shankar Keeping up with ML in Production May 2021 47 / 50
  184. 184. Areas of future work in MLOps Only scratched the surface with challenges discussed today Many more of these problems around deep learning Embeddings as features – if someone updates the upstream embedding model, do all the data scientists downstream need to immediately change their models? 14 D’Amour et al., Underspecification Presents Challenges for Credibility in Modern Machine Learning. Shreya Shankar Keeping up with ML in Production May 2021 47 / 50
  185. 185. Areas of future work in MLOps Only scratched the surface with challenges discussed today Many more of these problems around deep learning Embeddings as features – if someone updates the upstream embedding model, do all the data scientists downstream need to immediately change their models? "Underspecified" pipelines can pose threats14 14 D’Amour et al., Underspecification Presents Challenges for Credibility in Modern Machine Learning. Shreya Shankar Keeping up with ML in Production May 2021 47 / 50
  186. 186. Areas of future work in MLOps Only scratched the surface with challenges discussed today Many more of these problems around deep learning Embeddings as features – if someone updates the upstream embedding model, do all the data scientists downstream need to immediately change their models? "Underspecified" pipelines can pose threats14 How to enable anyone to build dashboards or monitor pipelines? 14 D’Amour et al., Underspecification Presents Challenges for Credibility in Modern Machine Learning. Shreya Shankar Keeping up with ML in Production May 2021 47 / 50
  187. 187. Areas of future work in MLOps Only scratched the surface with challenges discussed today Many more of these problems around deep learning Embeddings as features – if someone updates the upstream embedding model, do all the data scientists downstream need to immediately change their models? "Underspecified" pipelines can pose threats14 How to enable anyone to build dashboards or monitor pipelines? ML people know what to monitor, infra people know how to monitor 14 D’Amour et al., Underspecification Presents Challenges for Credibility in Modern Machine Learning. Shreya Shankar Keeping up with ML in Production May 2021 47 / 50
  188. 188. Areas of future work in MLOps Only scratched the surface with challenges discussed today Many more of these problems around deep learning Embeddings as features – if someone updates the upstream embedding model, do all the data scientists downstream need to immediately change their models? "Underspecified" pipelines can pose threats14 How to enable anyone to build dashboards or monitor pipelines? ML people know what to monitor, infra people know how to monitor We should strive to build tools to allow engineers and data scientists to be more self sufficient 14 D’Amour et al., Underspecification Presents Challenges for Credibility in Modern Machine Learning. Shreya Shankar Keeping up with ML in Production May 2021 47 / 50
  189. 189. Miscellaneous Code for this talk: https://github.com/shreyashankar/debugging-ml-talk Shreya Shankar Keeping up with ML in Production May 2021 48 / 50
  190. 190. Miscellaneous Code for this talk: https://github.com/shreyashankar/debugging-ml-talk Toy ML Pipeline with mltrace, Prometheus, and Grafana: https://github.com/shreyashankar/toy-ml-pipeline/tree/ shreyashankar/db Shreya Shankar Keeping up with ML in Production May 2021 48 / 50
  191. 191. Miscellaneous Code for this talk: https://github.com/shreyashankar/debugging-ml-talk Toy ML Pipeline with mltrace, Prometheus, and Grafana: https://github.com/shreyashankar/toy-ml-pipeline/tree/ shreyashankar/db mltrace: https://github.com/loglabs/mltrace Shreya Shankar Keeping up with ML in Production May 2021 48 / 50
  192. 192. Miscellaneous Code for this talk: https://github.com/shreyashankar/debugging-ml-talk Toy ML Pipeline with mltrace, Prometheus, and Grafana: https://github.com/shreyashankar/toy-ml-pipeline/tree/ shreyashankar/db mltrace: https://github.com/loglabs/mltrace Contact info Shreya Shankar Keeping up with ML in Production May 2021 48 / 50
  193. 193. Miscellaneous Code for this talk: https://github.com/shreyashankar/debugging-ml-talk Toy ML Pipeline with mltrace, Prometheus, and Grafana: https://github.com/shreyashankar/toy-ml-pipeline/tree/ shreyashankar/db mltrace: https://github.com/loglabs/mltrace Contact info Twitter: @sh_reya Shreya Shankar Keeping up with ML in Production May 2021 48 / 50
  194. 194. Miscellaneous Code for this talk: https://github.com/shreyashankar/debugging-ml-talk Toy ML Pipeline with mltrace, Prometheus, and Grafana: https://github.com/shreyashankar/toy-ml-pipeline/tree/ shreyashankar/db mltrace: https://github.com/loglabs/mltrace Contact info Twitter: @sh_reya Email: shreyashankar@berkeley.edu Shreya Shankar Keeping up with ML in Production May 2021 48 / 50
  195. 195. Miscellaneous Code for this talk: https://github.com/shreyashankar/debugging-ml-talk Toy ML Pipeline with mltrace, Prometheus, and Grafana: https://github.com/shreyashankar/toy-ml-pipeline/tree/ shreyashankar/db mltrace: https://github.com/loglabs/mltrace Contact info Twitter: @sh_reya Email: shreyashankar@berkeley.edu Thank you! Shreya Shankar Keeping up with ML in Production May 2021 48 / 50
  196. 196. References
D’Amour, Alexander et al. Underspecification Presents Challenges for Credibility in Modern Machine Learning. 2020. arXiv: 2011.03395 [cs.LG].
Hermann, Jeremy and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo. 2018. URL: https://eng.uber.com/scaling-michelangelo/.
Jordan, Jeremy. A simple solution for monitoring ML systems. 2021. URL: https://www.jeremyjordan.me/ml-monitoring/.
Lin, Mingfeng, Henry Lucas, and Galit Shmueli. "Too Big to Fail: Large Samples and the p-Value Problem". In: Information Systems Research 24 (Dec. 2013), pp. 906–917. DOI: 10.1287/isre.2013.0480.
Rabanser, Stephan, Stephan Günnemann, and Zachary C. Lipton. Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. 2019. arXiv: 1810.11953 [stat.ML].
Shreya Shankar Keeping up with ML in Production May 2021 49 / 50
  197. 197. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions. Shreya Shankar Keeping up with ML in Production May 2021 50 / 50

Advances in machine learning and big data are disrupting every industry. However, even when companies deploy to production, they face significant challenges in staying there: performance often degrades significantly from offline benchmarks over time, a phenomenon known as performance drift, typically caused by changing data distributions. In this talk, we discuss approaches to mitigate the effects of performance drift, illustrating our methods on a sample prediction task. Leveraging our experience at a startup deploying and monitoring production-grade ML pipelines for predictive maintenance, we also address several aspects of machine learning often overlooked in academia, such as the incorporation of non-technical collaborators and the integration of machine learning in an agile framework. This talk will consist of: Using Python, Dask, and open-source datasets to demonstrate an example of training and validating a model in an offline setting that subsequently experiences performance degradation when it is deployed. Using MLFlow, Prometheus, and Grafana to show how one can build tools to monitor production pipelines and enable teams of different stakeholders to quickly identify performance degradation using the right metrics. Proposing a checklist of criteria for when to retrain machine learning models in production. This talk will be a slideshow presentation accompanied by a Python notebook demo. It is aimed toward engineers who deploy and debug models in production, but may be of broader interest to people building machine learning-based products, and requires a familiarity with machine learning basics (train/test sets, decision trees).
