Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel
Deployment & Monitoring
Full Stack Deep Learning - UC Berkeley Spring 2021
2
Full Stack Deep Learning - UC Berkeley Spring 2021
3
Full Stack Deep Learning - UC Berkeley Spring 2021
Today
• Deployment (AKA, how to get your model into production)

• Monitoring (AKA how to keep it healthy once it’s there)
4
Full Stack Deep Learning - UC Berkeley Spring 2021
Today
• Deployment
• Monitoring
5
Full Stack Deep Learning - UC Berkeley Spring 2021
Where in the architecture should your model go?
6
Client Server Database
Local Remote
Request
Response
Full Stack Deep Learning - UC Berkeley Spring 2021
Batch prediction
7
Client Server Database
Local Remote
Request
Response
Model
Full Stack Deep Learning - UC Berkeley Spring 2021
Batch prediction
• Periodically run your model on new data and cache the results in a
database

• Works if the universe of inputs is relatively small (e.g., 1 prediction per
user)
8
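To make the pattern concrete, here is a minimal sketch of such a batch job (not from the slides; the model object, feature source, and SQLite cache are illustrative assumptions):

```python
# Minimal sketch of a periodic batch-prediction job (all names are illustrative).
# The "model" is any object with a predict() method; results are cached in SQLite
# so the online serving path becomes a simple key lookup.
import sqlite3

def run_batch_predictions(model, user_features, db_path="predictions.db"):
    """user_features: iterable of (user_id, feature_vector) pairs, e.g. one per user."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions (user_id TEXT PRIMARY KEY, score REAL)"
    )
    for user_id, features in user_features:
        score = float(model.predict([features])[0])
        conn.execute(
            "INSERT OR REPLACE INTO predictions (user_id, score) VALUES (?, ?)",
            (user_id, score),
        )
    conn.commit()
    conn.close()
```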
Full Stack Deep Learning - UC Berkeley Spring 2021
Batch prediction
Pros:
• Simple to implement
• Relatively low latency to the user
Cons:
• Doesn’t scale to complex input types
• Users don’t get the most up-to-date predictions
• Models frequently become “stale”, which can be hard to detect
9
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
10
Full Stack Deep Learning - UC Berkeley Spring 2021
Model-in-service
11
Client Server Database
Local Remote
Request
Response
Model
Full Stack Deep Learning - UC Berkeley Spring 2021
Model-in-service
• Package up your model and include it in your deployed web server

• Web server loads the model and calls it to make predictions
12
Full Stack Deep Learning - UC Berkeley Spring 2021
Model-in-service
Pros:
• Re-uses your existing infrastructure
Cons:
• Web server may be written in a different language
• Models may change more frequently than server code
• Large models can eat into the resources for your web server
• Server hardware not optimized for your model (e.g., no GPUs)
• Model & server may scale differently
13
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
14
Full Stack Deep Learning - UC Berkeley Spring 2021
Model-as-service
15
Client Server Database
Local Remote
Request
Response
Model
Full Stack Deep Learning - UC Berkeley Spring 2021
Model-as-service
• Run your model on its own web server

• The backend (or the client itself) interacts with the model by making requests to the model service and receiving responses back
16
Full Stack Deep Learning - UC Berkeley Spring 2021
Model-as-service
Pros:
• Dependability — model bugs less likely to crash the web app
• Scalability — choose optimal hardware for the model and scale it appropriately
• Flexibility — easily reuse a model across multiple apps
Cons:
• Can add latency
• Adds infrastructural complexity
• Now you have to run a model service…
17
Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs

• Dependency management

• Performance optimization

• Horizontal scaling

• Deployment

• Managed options
18
Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs
• Dependency management

• Performance optimization

• Horizontal scaling

• Deployment

• Managed options
19
Full Stack Deep Learning - UC Berkeley Spring 2021
REST APIs
• Serving predictions in response to canonically-formatted HTTP requests

• There are alternatives like gRPC (which is actually used in TensorFlow Serving) and GraphQL (not terribly relevant to model services)
20
Full Stack Deep Learning - UC Berkeley Spring 2021
REST API example
21
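The example on this slide is an image; as a stand-in, here is a minimal prediction endpoint sketched with FastAPI (the framework choice and request/response fields are assumptions, not taken from the slide):

```python
# Minimal model-serving REST endpoint (a sketch; all names are illustrative).
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    instances: List[List[float]]  # a batch of feature vectors

class PredictResponse(BaseModel):
    predictions: List[float]

def model_predict(instances: List[List[float]]) -> List[float]:
    # Stand-in for the real model call (e.g., a loaded PyTorch or sklearn model).
    return [sum(x) for x in instances]

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    return PredictResponse(predictions=model_predict(request.instances))
```

Run it with `uvicorn app:app` and POST JSON such as {"instances": [[1.0, 2.0]]} to /predict.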
Full Stack Deep Learning - UC Berkeley Spring 2021
Formats for requests and responses
• Sadly, no standard yet
22
Google Cloud
Azure
AWS Sagemaker
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
23
Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs

• Dependency management
• Performance optimization

• Horizontal scaling

• Deployment

• Managed options
24
Full Stack Deep Learning - UC Berkeley Spring 2021
Dependency management for model servers
• Model predictions depend on code, model weights, and dependencies.
All need to be present on your web server

• Dependencies cause trouble. Hard to make consistent, hard to update, etc. Even changing a TensorFlow version can change your model

• Two strategies:

• Constrain the dependencies for your model

• Use containers
25
Full Stack Deep Learning - UC Berkeley Spring 2021
A standard neural net format: ONNX
• The promise: define network in any language, run it consistently anywhere

• The reality: since the libraries change quickly, there are often bugs in the
translation layer

• What about non-library code like feature transformations?
26
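For example, exporting a PyTorch model to ONNX is a single call (a sketch; per the caveat above, the exported graph's outputs should still be checked against the original model):

```python
# Sketch: export a (toy) PyTorch model to ONNX, then verify outputs before trusting it.
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 4), torch.nn.Softmax(dim=1))
model.eval()
dummy_input = torch.randn(1, 16)

torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["probs"],
)
# Load "model.onnx" with onnxruntime and compare outputs to the original model;
# the translation layer is exactly where the bugs mentioned above tend to hide.
```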
Full Stack Deep Learning - UC Berkeley Spring 2021
Managing dependencies with containers (i.e., Docker)
• Docker vs VM

• Dockerfile and layer-based images

• DockerHub and the ecosystem

• Kubernetes and other orchestrators
27
Testing & Deployment - Docker
Full Stack Deep Learning - UC Berkeley Spring 2021
28
https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b
No OS -> Light weight
Testing & Deployment - Docker
Full Stack Deep Learning - UC Berkeley Spring 2021
• Spin up a container for
every discrete task

• For example, a web app
might have four containers:

• Web server

• Database

• Job queue

• Worker
29
https://www.docker.com/what-container
Light weight -> Heavy use
Testing & Deployment - Docker
Full Stack Deep Learning - UC Berkeley Spring 2021
Dockerfile
30
Testing & Deployment - Docker
Full Stack Deep Learning - UC Berkeley Spring 2021
Strong Ecosystem
• Images are easy to
find, modify, and
contribute back to
DockerHub

• Private images
easy to store in
same place
31
https://docs.docker.com/engine/docker-overview
Testing & Deployment - Docker
Full Stack Deep Learning - UC Berkeley Spring 2021
32
https://www.docker.com/what-container#/package_software
Testing & Deployment - Docker
Full Stack Deep Learning - UC Berkeley Spring 2021
Container Orchestration
• Distributing containers onto underlying VMs
or bare metal

• Kubernetes is the open-source winner, but
cloud providers have good offerings, too.
33
Testing & Deployment - Docker
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
34
Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs

• Dependency management

• Performance optimization
• Horizontal scaling

• Deployment

• Managed options
35
Full Stack Deep Learning - UC Berkeley Spring 2021
Making inference on a single machine more efficient
• GPU or no GPU?

• Concurrency

• Model distillation

• Quantization

• Caching

• Batching

• Sharing the GPU

• Libraries
36
Full Stack Deep Learning - UC Berkeley Spring 2021
GPU or no GPU?
• GPU pros
• Probably the same hardware you trained on

• In the limit of model size, batch-size tuning, etc., usually higher throughput

• GPU cons
• More complex to set up

• Often more expensive
37
Full Stack Deep Learning - UC Berkeley Spring 2021
Concurrency
• What?
• Multiple copies of the model running on different CPUs or cores

• How?
• Be careful about thread tuning
38
https://blog.roblox.com/2020/05/scaled-bert-serve-1-billion-daily-requests-cpus/
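A common recipe (a sketch under assumptions, not the setup from the linked Roblox post): give each worker process its own model copy and a single intra-op thread, then scale out with a process manager such as gunicorn.

```python
# Sketch: one model copy per worker process, one intra-op thread per copy.
# "model.pt" is a placeholder for a TorchScript model file.
import torch

torch.set_num_threads(1)            # avoid thread oversubscription when running many workers
model = torch.jit.load("model.pt")  # each worker process loads its own copy
model.eval()

def predict(batch: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return model(batch)

# Then run several workers in parallel, e.g. `gunicorn -w 8 app:app`,
# so each CPU core serves its own copy of the model.
```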
Full Stack Deep Learning - UC Berkeley Spring 2021
Model distillation
• What?
• Train a smaller model to imitate your larger one

• How?
• Several techniques outlined below

• Can be finicky to do yourself — infrequently used in practice

• Exception — pretrained distilled models like DistilBERT
39
https://heartbeat.fritz.ai/research-guide-model-distillation-techniques-for-deep-learning-4a100801c0eb
Full Stack Deep Learning - UC Berkeley Spring 2021
Quantization
• What?
• Execute some or all of the operations in your model with a smaller
numerical representation than floats (e.g., INT8)

• Some tradeoffs with accuracy

• How?
• PyTorch and TensorFlow Lite have quantization built in

• Can also run quantization-aware training, which often results in higher
accuracy
40
https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
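As a concrete example, PyTorch's built-in dynamic quantization converts the Linear layers of a trained model to INT8 in one call (a minimal sketch; re-check accuracy afterwards):

```python
# Sketch: post-training dynamic quantization of a trained PyTorch model.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8  # run Linear layers in INT8
)
# `quantized` is a drop-in replacement for inference; re-check accuracy on a held-out set.
```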
Full Stack Deep Learning - UC Berkeley Spring 2021
Caching
• What?
• For some ML models, some inputs are more common than others

• Instead of always calling the model, first check the cache

• How?
• Can get very fancy

• Basic way uses functools
41
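The "basic way" with functools looks roughly like this (a sketch; the model call is a stand-in, and lru_cache requires hashable inputs):

```python
# Sketch: memoize predictions for repeated inputs with functools.lru_cache.
from functools import lru_cache

def expensive_model_predict(features: tuple) -> float:
    # Stand-in for the real (slow) model call.
    return sum(features)

@lru_cache(maxsize=10_000)
def cached_predict(features: tuple) -> float:
    # lru_cache requires hashable arguments, so features are passed as a tuple.
    return expensive_model_predict(features)

# cached_predict((1.0, 2.0, 3.0)) computes once; identical inputs are then served from the cache.
```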
Full Stack Deep Learning - UC Berkeley Spring 2021
Batching
• What?
• ML models often achieve higher throughput when doing prediction in parallel, especially on a GPU

• How?
• Gather requests until you have a batch, run prediction, return results to users

• Batch size needs to be tuned

• You need to have a way to shortcut the process if latency becomes too long

• Probably don’t want to implement this yourself
42
Full Stack Deep Learning - UC Berkeley Spring 2021
Sharing the GPU
• What?
• Your model may not take up all of the GPU memory with your inference
batch size. Why not run multiple models on the same GPU?

• How?
• You’ll probably want to use a model serving solution that supports this
out of the box
43
Full Stack Deep Learning - UC Berkeley Spring 2021
Model serving libraries
44
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
45
Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs

• Dependency management

• Performance optimization

• Horizontal scaling
• Deployment

• Managed options
46
Full Stack Deep Learning - UC Berkeley Spring 2021
Horizontal scaling
• What?
• If you have too much traffic for a single machine, split traffic among multiple
machines

• How?
• Spin up multiple copies of your service and split traffic using a load balancer

• In practice, two common methods

• Container orchestration (i.e., Kubernetes)

• Serverless (e.g., AWS Lambda)
47
Full Stack Deep Learning - UC Berkeley Spring 2021
Container orchestration
48
Full Stack Deep Learning - UC Berkeley Spring 2021
Frameworks for ML deployment on kubernetes
49
Full Stack Deep Learning - UC Berkeley Spring 2021
Deploying code as serverless functions
• App code and dependencies are packaged into .zip files or docker
containers with a single entry point function

• AWS Lambda (or Google Cloud Functions, or Azure Functions) manages
everything else: instant scaling to 10,000+ requests per second, load
balancing, etc.

• Only pay for compute-time.
50
Testing & Deployment - Web Deployment
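A serverless inference entry point is just a single handler function (a sketch; the model loading and event fields are illustrative, not a specific service's schema):

```python
# Sketch of a serverless (AWS Lambda style) inference entry point.
# The model object and event fields are illustrative placeholders.
import json

MODEL = None  # loaded once per container and reused across invocations

def load_model():
    # Placeholder: load real weights from the deployment package or object storage.
    return lambda features: sum(features)

def lambda_handler(event, context):
    global MODEL
    if MODEL is None:
        MODEL = load_model()
    features = json.loads(event["body"])["features"]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": MODEL(features)}),
    }
```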
Full Stack Deep Learning - UC Berkeley Spring 2021
51
Testing & Deployment - Web Deployment
Full Stack Deep Learning - UC Berkeley Spring 2021
Deploying code as serverless functions
• Cons:
• Limited size of deployment package

• CPU-only, limited execution time

• Can be challenging to build pipelines of models

• Little to no state management (e.g., for caching)

• Limited deployment tooling
52
Testing & Deployment - Web Deployment
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
53
Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs

• Dependency management

• Performance optimization

• Horizontal scaling

• Deployment
• Managed options
54
Full Stack Deep Learning - UC Berkeley Spring 2021
Model deployment
• What?
• If serving is how you turn a model into something that can respond to
requests, deployment is how you roll out, manage, and update these
services

• How?
• You probably want to be able to roll out gradually, roll back instantly,
and deploy pipelines of models

• Your deployment library may take care of this for you
55
Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs

• Dependency management

• Performance optimization

• Horizontal scaling

• Deployment

• Managed options
56
Full Stack Deep Learning - UC Berkeley Spring 2021
Managed options
57
Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: takeaways
• If you are doing CPU inference, you can get away with scaling by launching more servers or going serverless.

• Serverless makes sense if you can get away with CPUs and traffic is spiky
or low-volume.

• If using GPU inference, serving tools like TF Serving, Triton, and TorchServe will save you time

• Worth keeping an eye on the startups in this space for GPU inference
58
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
59
Full Stack Deep Learning - UC Berkeley Spring 2021
Edge prediction
60
Client Server Database
Local Remote
Request
Response
Model
Full Stack Deep Learning - UC Berkeley Spring 2021
Edge prediction
• Send model weights to the client device

• Client loads the model and interacts with it directly
61
Full Stack Deep Learning - UC Berkeley Spring 2021
Edge prediction
Pros:
• Low-latency
• Does not require an internet connection
• Data security — data doesn’t need to leave the user’s device
Cons:
• Often limited hardware resources available
• Embedded and mobile frameworks are less full-featured than TensorFlow / PyTorch
• Difficult to update models
• Difficult to monitor and debug when things go wrong
62
Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
63
TensorRT: Model optimizer and inference runtime for TensorFlow + NVIDIA devices
Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
64
Apache TVM: Library-agnostic and target-device agnostic inference runtime
Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
65
TFLite: TensorFlow on mobile / edge devices
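For instance, converting a trained TensorFlow SavedModel to a .tflite file takes a few lines (a sketch; the SavedModel path is a placeholder):

```python
# Sketch: convert a TensorFlow SavedModel to a TensorFlow Lite flatbuffer.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional: enables default quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```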
Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
66
PyTorch mobile: PyTorch on iOS and Android
Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
67
TensorFlow.js: TensorFlow in the browser
Full Stack Deep Learning - UC Berkeley Spring 2021
68
https://heartbeat.fritz.ai/core-ml-vs-ml-kit-which-mobile-machine-learning-framework-is-right-for-you-e25c5d34c765
CoreML
- Released at Apple WWDC 2017
- Inference only
- https://coreml.store/
ML Kit
- Announced at Google I/O 2018
- Either via API or on-device
- Offers pre-trained models, or can upload a TensorFlow Lite model
Fritz
- Supposed to work with both CoreML and ML Kit
Testing & Deployment - Hardware/Mobile
Full Stack Deep Learning - UC Berkeley Spring 2021
More efficient models
• Quantization and distillation from above

• Mobile-friendly model architectures
69
Full Stack Deep Learning - UC Berkeley Spring 2021
70
MobileNets
[Figure comparing convolution blocks: standard, 1x1, Inception, MobileNet]
https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
Testing & Deployment - Hardware/Mobile
Full Stack Deep Learning - UC Berkeley Spring 2021
71
https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
Testing & Deployment - Hardware/Mobile
Full Stack Deep Learning - UC Berkeley Spring 2021
Recommended case study: DistilBERT
72
https://medium.com/huggingface/distilbert-8cf3380435b5
Full Stack Deep Learning - UC Berkeley Spring 2021
Mindsets for edge deployment
• Choose your architecture with your target hardware in mind

• You can make up a factor of 2-10 through distillation, quantization, and other
tricks, but not more than that

• Once you have a model that works on your edge device, you can iterate locally
as long as you add model size and latency to your metrics and avoid
regressions

• Treat tuning the model for your device as an additional risk in the deployment
cycle and test it accordingly

• E.g., always test your models on production hardware before deploying

• Since models can be finicky, it’s a good idea to build fallback mechanisms into
the application in case the model fails or is too slow
73
Full Stack Deep Learning - UC Berkeley Spring 2021
Edge deployment: conclusion
• Web deployment is easier, so prefer it unless you really need edge deployment

• Choose your framework to match the available hardware and
corresponding mobile frameworks, or try TVM to be more flexible

• Start considering hardware constraints at the beginning of the project and
choose architectures accordingly
74
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
75
Full Stack Deep Learning - UC Berkeley Spring 2021
Today
• Deployment

• Monitoring
76
Full Stack Deep Learning - UC Berkeley Spring 2021
Train, test, deploy. You’re done, right?
• Validation loss is below your target performance

• Test loss is not much worse than validation

• Your model performs well across all critical slices and
metrics

• Qualitatively the predictions make sense

• You verified that the prod model has the same performance
characteristics as the dev model

• You verified that the prod model is indeed better than the
previous one
77
Training & troubleshooting
lecture
Testing lecture
Full Stack Deep Learning - UC Berkeley Spring 2021
Train, test, deploy. You’re done, right?
• Validation loss is below your target performance

• Test loss is not much worse than validation

• Your model performs well across all critical slices and
metrics

• Qualitatively the predictions make sense

• You verified that the prod model has the same
performance characteristics as the dev model

• You verified that the prod model is indeed better than the
previous one
78
Training & troubleshooting
lecture
Testing lecture
What else could go wrong?
Full Stack Deep Learning - UC Berkeley Spring 2021
Model performance degrades post-deployment
79
Supervised ML: approximate p(y ∣ x) by learning a parameterized model fθ on data sampled from p(x, y)
What can go wrong? | Names | Examples
• p(x) changes | Data drift | Bug in upstream data pipeline; malicious users; launch in a new region; add new users with different demographics
• p(y | x) changes | Model drift, concept drift | User behavior changes (in response to your model!)
• Sample doesn’t adequately approximate p(x, y) | The long tail, domain shift | Tasks where outliers matter; bug in training data pipeline; bias in the sampling process
Full Stack Deep Learning - UC Berkeley Spring 2021
Types of data drift
80
Full Stack Deep Learning - UC Berkeley Spring 2021
Instantaneous drift (i.e., shift)
Examples
• Model deployed in a new domain

• A bug is introduced in the preprocessing pipeline

• Big external shift, like Covid
81
Full Stack Deep Learning - UC Berkeley Spring 2021
Gradual drift
Examples
• Users’ preferences change over time

• New concepts get introduced to the corpus over time
82
Full Stack Deep Learning - UC Berkeley Spring 2021
Periodic drift
Examples
• Users’ preferences are seasonal

• People in different timezones use your model differently
83
Full Stack Deep Learning - UC Berkeley Spring 2021
Temporary drift
Examples
• Malicious user attacks your model

• A new user with different demographics tries your product and then churns

• Someone uses your product in a way it was not intended to be used
84
Full Stack Deep Learning - UC Berkeley Spring 2021
This can have a huge impact in practice
85
Story from global e-commerce company:
• Bug in retraining pipeline caused recommendations
not to be updated for new users
• New users got the same stale recommendations
for weeks before the bug was caught
• Estimated $Ms in revenue lost
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
86
Full Stack Deep Learning - UC Berkeley Spring 2021
Strategies for monitoring ML models
87
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?

• How do I measure if it’s changed?

• How do I tell if that change is bad?

• Tools for monitoring

• Monitoring and your broader ML system
88
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?
• How do I measure if it’s changed?

• How do I tell if that change is bad?

• Tools for monitoring

• Monitoring and your broader ML system
89
Full Stack Deep Learning - UC Berkeley Spring 2021
Four types of signals to monitor
90
• Model metrics (e.g., accuracy)
• Model inputs and predictions
• System performance
• Business metrics
(the chart arranges these along an axis from more informative but harder to measure, to less informative but easier to measure)
Full Stack Deep Learning - UC Berkeley Spring 2021
Four types of signals to monitor
91
• Model metrics (e.g., accuracy)
• Model inputs and predictions
• System performance
• Business metrics
Highlighted: ground truth performance metrics
Full Stack Deep Learning - UC Berkeley Spring 2021
Four types of signals to monitor
92
• Model metrics (e.g., accuracy)
• Model inputs and predictions
• System performance
• Business metrics
Highlighted: approximate performance metrics
Full Stack Deep Learning - UC Berkeley Spring 2021
Four types of signals to monitor
93
• Model metrics (e.g., accuracy)
• Model inputs and predictions
• System performance
• Business metrics
Highlighted: system health metrics
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
94
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?

• How do I measure if it’s changed?
• How do I tell if that change is bad?

• Tools for monitoring

• Monitoring and your broader ML system
95
Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
96
Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
97
1. Select a window of “good” data to serve as a reference
Full Stack Deep Learning - UC Berkeley Spring 2021
Selecting a reference window
• You can use a fixed window of production data you believe to be healthy

• Some papers [1] advocate for using a sliding window of production data

• In practice, most of the time you probably should use your training or eval
data as the reference
98
[1] Automatic Model Monitoring for Data Streams, Pinto et al. (https://arxiv.org/abs/1908.04240)
Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
99
2. Select a new window of data to measure distance on
Full Stack Deep Learning - UC Berkeley Spring 2021
Selecting the measurement window
• Problem-dependent — not aware of a principled approach

• Pragmatic solution: pick one or several window sizes with a reasonable
amount of data and slide them [1]

• Special case: window size = 1, i.e., outlier detection [2]
100
[1] To avoid recomputing metrics over and over again when you slide the window, it’s worth looking into the literature on mergeable
(quantile) sketching algorithms (https://www.sketchingbigdata.org/)
[2] E.g., check out https://pyod.readthedocs.io/en/latest/
Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
101
3. Compare the windows using a distance metric
Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
102
3. Compare the windows using a distance metric
Full Stack Deep Learning - UC Berkeley Spring 2021
Distance metrics (1D data)
• Rule-based distance metrics (aka, data quality)

• Statistical distance metrics

• KL divergence

• Kolmogorov-Smirnov statistic

• D_1 distance

• Others
103
Full Stack Deep Learning - UC Berkeley Spring 2021
Rule-based distance metrics (i.e., data quality)
• Min, max, mean within
acceptable range

• Enough data points

• Not too many missing values

• Col A > Col B

• Etc
104
Do this!
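Such rule-based checks can be a plain function over a batch of inputs, for instance with pandas (a sketch; the column names and thresholds are made up):

```python
# Sketch: rule-based data quality checks on a batch of model inputs.
# Column names and thresholds are illustrative.
from typing import List

import pandas as pd

def check_batch(df: pd.DataFrame) -> List[str]:
    failures = []
    if len(df) < 1000:
        failures.append("not enough data points")
    if df["age"].isna().mean() > 0.01:
        failures.append("too many missing values in 'age'")
    if not df["price"].between(0, 10_000).all():
        failures.append("'price' outside acceptable range")
    if not (df["col_a"] > df["col_b"]).all():
        failures.append("expected col_a > col_b")
    return failures  # an empty list means the batch passed
```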
Full Stack Deep Learning - UC Berkeley Spring 2021
Kullback-Leibler divergence
• DKL(P ∥ Q) = 𝔼x∼P [log(P(x)/Q(x))]
• Sensitive to what happens on the tails of the distribution
• Not interpretable
• What if P and Q don’t have exactly the same range?
105
Friends don’t let
friends monitor the KL
Full Stack Deep Learning - UC Berkeley Spring 2021
Kolmogorov-Smirnov Statistic
• K = sup_{t∈(0,1)} |B(t)| (B is a Brownian bridge)
• Max distance between CDFs
• Easy to interpret
• Used widely in practice
106
K-S = “Oh Yes”
Full Stack Deep Learning - UC Berkeley Spring 2021
D1 distance
• d1(p, q) = ∑_{i=1}^{n} |p_i − q_i|
• Sum of distances between CDFs
• Used at Google [1]
• Isn’t this a bit hacky?
107
If google does it….
[1] Data Validation for Machine Learning (Breck et al) https://mlsys.org/Conferences/2019/doc/2019/167.pdf
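As a sketch, the same quantity can be computed in a few lines of numpy by binning both samples onto a shared grid and comparing the binned CDFs (the bin count is an arbitrary choice here):

```python
# Sketch: D1 distance between two 1-D samples, computed over binned CDFs.
import numpy as np

def d1_distance(reference: np.ndarray, production: np.ndarray, bins: int = 50) -> float:
    lo = min(reference.min(), production.min())
    hi = max(reference.max(), production.max())
    p, _ = np.histogram(reference, bins=bins, range=(lo, hi))
    q, _ = np.histogram(production, bins=bins, range=(lo, hi))
    p_cdf = np.cumsum(p) / p.sum()  # empirical CDF of the reference window
    q_cdf = np.cumsum(q) / q.sum()  # empirical CDF of the production window
    return float(np.abs(p_cdf - q_cdf).sum())
```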
Full Stack Deep Learning - UC Berkeley Spring 2021
Other distance metrics?
• Earth mover distance

• Population stability index

• Etc
108
Worthy of some
research.
Request for research: how do drifts according to different distance metrics affect
model performance differently?
Full Stack Deep Learning - UC Berkeley Spring 2021
Distance metrics (>1D data)
• Open problem
• Maximum Mean Discrepancy: MMD(ℱ, p, q) = ‖μ_p − μ_q‖²_ℱ
• Do a bunch of 1D comparisons
• Doesn’t capture cross-correlation
• Multiple hypothesis testing [1]
• Prioritize some features for 1D comparisons
• Feature importance?
• Projections
109
[1] Look into corrections like the Bonferroni correction (https://en.wikipedia.org/wiki/Bonferroni_correction)
Full Stack Deep Learning - UC Berkeley Spring 2021
Detecting high-dimensional data drift using projections
110
Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift

• Analytical projections, e.g.,

• Mean pixel value

• Length of sentence

• (feature_1 - feature_2) / feature_3

• Random projections (e.g., linear)

• Statistical projections, e.g.,

• Autoencoder or other density model

• PCA or T-SNE
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
111
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?

• How do I measure if it’s changed?

• How do I tell if that change is bad?
• Tools for monitoring

• Monitoring and your broader ML system
112
Full Stack Deep Learning - UC Berkeley Spring 2021
Setting thresholds on test values
• Statistical tests (e.g., KS-Test)

• Typically return a p-value: the probability of seeing a difference at least this extreme if the two distributions were actually the same
• When you have a lot of data, you will get very small p-values for small shifts
113
https://blog.anomalo.com/dynamic-data-testing-f831435dba90
Used most in practice
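For example, scipy's two-sample KS test returns both the statistic and the p-value (a sketch; the reference and production arrays stand in for your windows):

```python
# Sketch: two-sample Kolmogorov-Smirnov test between a reference window and a production window.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)   # e.g., a feature in the training data
production = rng.normal(loc=0.1, scale=1.0, size=10_000)  # e.g., the same feature in recent traffic

result = stats.ks_2samp(reference, production)
print(result.statistic, result.pvalue)
# With this much data even a small shift produces a tiny p-value,
# so in practice it is common to threshold the statistic (max CDF distance) as well.
```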
Full Stack Deep Learning - UC Berkeley Spring 2021
Request for research
• Approximating model performance changes given different distribution
changes
114
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
115
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?

• How do I measure if it’s changed?

• How do I tell if that change is bad?

• Tools for monitoring
• Monitoring and your broader ML system
116
Full Stack Deep Learning - UC Berkeley Spring 2021
System monitoring tools
117
• Alarms for when things go
wrong, and records for tuning
things.

• Cloud providers have decent
monitoring solutions.

• Anything that can be logged can be monitored (e.g., data skew)

• Will explore in lab
Full Stack Deep Learning - UC Berkeley Spring 2021
System monitoring tools
118
Full Stack Deep Learning - UC Berkeley Spring 2021
Data quality tools
119
Full Stack Deep Learning - UC Berkeley Spring 2021
ML monitoring
120
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
121
Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?

• How do I measure if it’s changed?

• How do I tell if that change is bad?

• Tools for monitoring

• Monitoring and your broader ML system
122
Full Stack Deep Learning - UC Berkeley Spring 2021
Monitoring is more central to ML than to traditional software
In traditional software…
• Bugs are usually obvious
and/or cause loud failures

• The data you monitor
(system metrics, etc) is
mostly valuable to detect &
diagnose problems
123
… in ML
• Bugs usually cause silent degradations in performance

• The data you monitor
(system metrics, etc) is
literally the code used to
produce the next model
Full Stack Deep Learning - UC Berkeley Spring 2021
A traditional view of monitoring
124
Evaluation Production
Training Monitoring
Full Stack Deep Learning - UC Berkeley Spring 2021
The reality of monitoring in ML
125
Evaluation Production
Training
Monitoring
Full Stack Deep Learning - UC Berkeley Spring 2021
The reality of monitoring in ML
126
Evaluation Production
Training
Eval Store
Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
127
Collect data
Clean and
label
Train Report
Select problem
Test
Deploy
Monitor
Eval Store
Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
128
Collect data
Clean and
label
Train Report
Select problem
Test
Deploy
Monitor
Train
Eval Store
• Register data distribution and
performance for this model
• Warn us if training data looks
too different than prod
Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
129
Collect data
Clean and
label
Train Report
Select problem
Deploy
Monitor
Train
Test
Eval Store
• Register performance for this
model on all test slices
• Pull historical data that has been flagged as “interesting” (e.g., gave another model trouble)
• Pull definitions of slices
Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
130
Test
Collect data
Clean and
label
Train Report
Select problem
Monitor
Train
Deploy
Eval Store
• Run a shadow test or AB test by
pulling the diff in model
performance between versions
• Log data and approximate
performance back to the eval
store
Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
131
Deploy Test
Collect data
Clean and
label
Train Report
Select problem Train
Monitor
Eval Store
• Fire an alert when approximate
performance on any of our
slices dips below a threshold
Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
132
Monitor Deploy Test
Clean and
label
Train Report
Select problem Train
Collect data
Eval Store
• Log more data with low or
uncertain approximate
performance
Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
133
Collect data
Monitor Deploy Test
Train Report
Select problem Train
Clean and
label
Eval Store
• Inspect & label data with low
approximate performance
Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
134
Clean and
label
Collect data
Monitor Deploy Test
Train Report
Select problem Train
Eval Store
• Retrain when approximate
performance dips below a
threshold
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
135
Full Stack Deep Learning - UC Berkeley Spring 2021
Monitoring ML models: takeaways
• You should be monitoring your models

• Start by looking at data quality metrics and system metrics because it’s
easiest

• In a perfect world, your testing and monitoring systems should be linked,
and should help you close the data flywheel 

• Lots of activity in tooling — watch this space!

• Big open research questions still to be answered
136
Full Stack Deep Learning - UC Berkeley Spring 2021
Thank you!
137

More Related Content

What's hot

PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
Jinwon Lee
 
Evaluating LLM Models for Production Systems Methods and Practices -
Evaluating LLM Models for Production Systems Methods and Practices -Evaluating LLM Models for Production Systems Methods and Practices -
Evaluating LLM Models for Production Systems Methods and Practices -
alopatenko
 

What's hot (20)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
Reinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face TransformersReinventing Deep Learning
 with Hugging Face Transformers
Reinventing Deep Learning
 with Hugging Face Transformers
 
Machine Learning and Data Mining: 13 Nearest Neighbor and Bayesian Classifiers
Machine Learning and Data Mining: 13 Nearest Neighbor and Bayesian ClassifiersMachine Learning and Data Mining: 13 Nearest Neighbor and Bayesian Classifiers
Machine Learning and Data Mining: 13 Nearest Neighbor and Bayesian Classifiers
 
Xgboost
XgboostXgboost
Xgboost
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
 
Designing more efficient convolution neural network
Designing more efficient convolution neural networkDesigning more efficient convolution neural network
Designing more efficient convolution neural network
 
오토인코더의 모든 것
오토인코더의 모든 것오토인코더의 모든 것
오토인코더의 모든 것
 
Object Detection with Transformers
Object Detection with TransformersObject Detection with Transformers
Object Detection with Transformers
 
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
FixMatch: Simplifying Semi-Supervised Learning with Consistency and ConfidenceFixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
 
Word embeddings
Word embeddingsWord embeddings
Word embeddings
 
A note on word embedding
A note on word embeddingA note on word embedding
A note on word embedding
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
Skip, residual and densely connected RNN architectures
Skip, residual and densely connected RNN architecturesSkip, residual and densely connected RNN architectures
Skip, residual and densely connected RNN architectures
 
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for VisionPR-317: MLP-Mixer: An all-MLP Architecture for Vision
PR-317: MLP-Mixer: An all-MLP Architecture for Vision
 
Interpretability beyond feature attribution quantitative testing with concept...
Interpretability beyond feature attribution quantitative testing with concept...Interpretability beyond feature attribution quantitative testing with concept...
Interpretability beyond feature attribution quantitative testing with concept...
 
Adaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent PatternsAdaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent Patterns
 
Evaluating LLM Models for Production Systems Methods and Practices -
Evaluating LLM Models for Production Systems Methods and Practices -Evaluating LLM Models for Production Systems Methods and Practices -
Evaluating LLM Models for Production Systems Methods and Practices -
 
Convolutional neural network from VGG to DenseNet
Convolutional neural network from VGG to DenseNetConvolutional neural network from VGG to DenseNet
Convolutional neural network from VGG to DenseNet
 
Hyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine LearningHyperparameter Optimization for Machine Learning
Hyperparameter Optimization for Machine Learning
 
Support vector machines (svm)
Support vector machines (svm)Support vector machines (svm)
Support vector machines (svm)
 

Similar to Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)

Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafka
confluent
 

Similar to Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021) (20)

Democratizing machine learning on kubernetes
Democratizing machine learning on kubernetesDemocratizing machine learning on kubernetes
Democratizing machine learning on kubernetes
 
Nordic infrastructure Conference 2017 - SQL Server in DevOps
Nordic infrastructure Conference 2017 - SQL Server in DevOpsNordic infrastructure Conference 2017 - SQL Server in DevOps
Nordic infrastructure Conference 2017 - SQL Server in DevOps
 
Continuous Integration & Continuous Delivery
Continuous Integration & Continuous DeliveryContinuous Integration & Continuous Delivery
Continuous Integration & Continuous Delivery
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
Webinar kubernetes and-spark
Webinar  kubernetes and-sparkWebinar  kubernetes and-spark
Webinar kubernetes and-spark
 
Building Efficient Parallel Testing Platforms with Docker
Building Efficient Parallel Testing Platforms with DockerBuilding Efficient Parallel Testing Platforms with Docker
Building Efficient Parallel Testing Platforms with Docker
 
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
 
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache SparkBuild, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
Build, Scale, and Deploy Deep Learning Pipelines Using Apache Spark
 
Building High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in KafkaBuilding High-Throughput, Low-Latency Pipelines in Kafka
Building High-Throughput, Low-Latency Pipelines in Kafka
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OS
 
Innovating faster with SBT, Continuous Delivery, and LXC
Innovating faster with SBT, Continuous Delivery, and LXCInnovating faster with SBT, Continuous Delivery, and LXC
Innovating faster with SBT, Continuous Delivery, and LXC
 
DevOps in Silos
DevOps in SilosDevOps in Silos
DevOps in Silos
 
Deploy your machine learning models to production with Kubernetes
Deploy your machine learning models to production with KubernetesDeploy your machine learning models to production with Kubernetes
Deploy your machine learning models to production with Kubernetes
 
Containerized architectures for deep learning
Containerized architectures for deep learningContainerized architectures for deep learning
Containerized architectures for deep learning
 
Continuous Deployment of your Application @JUGtoberfest
Continuous Deployment of your Application @JUGtoberfestContinuous Deployment of your Application @JUGtoberfest
Continuous Deployment of your Application @JUGtoberfest
 
Improving WordPress Development and Deployments with Docker
Improving WordPress Development and Deployments with DockerImproving WordPress Development and Deployments with Docker
Improving WordPress Development and Deployments with Docker
 
Parallel Distributed Deep Learning on HPCC Systems
Parallel Distributed Deep Learning on HPCC SystemsParallel Distributed Deep Learning on HPCC Systems
Parallel Distributed Deep Learning on HPCC Systems
 
Spring Framework 3.2 - What's New
Spring Framework 3.2 - What's NewSpring Framework 3.2 - What's New
Spring Framework 3.2 - What's New
 
Kubeflow.pptx
Kubeflow.pptxKubeflow.pptx
Kubeflow.pptx
 
Docker for the enterprise
Docker for the enterpriseDocker for the enterprise
Docker for the enterprise
 

More from Sergey Karayev

Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.
Sergey Karayev
 

More from Sergey Karayev (17)

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
 
Testing and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningTesting and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep Learning
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep Learning
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep Learning
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)

  • 1. Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel Deployment & Monitoring
  • 2. Full Stack Deep Learning - UC Berkeley Spring 2021 2
  • 3. Full Stack Deep Learning - UC Berkeley Spring 2021 3
  • 4. Full Stack Deep Learning - UC Berkeley Spring 2021 Today • Deployment (AKA, how to get your model into production) • Monitoring (AKA how to keep it healthy once it’s there) 4
  • 5. Full Stack Deep Learning - UC Berkeley Spring 2021 Today • Deployment • Monitoring 5
  • 6. Full Stack Deep Learning - UC Berkeley Spring 2021 Where in the architecture should your model go? 6 Client Server Database Local Remote Request Response
  • 7. Full Stack Deep Learning - UC Berkeley Spring 2021 Batch prediction 7 Client Server Database Local Remote Request Response Model
  • 8. Full Stack Deep Learning - UC Berkeley Spring 2021 Batch prediction • Periodically run your model on new data and cache the results in a database • Works if the universe of inputs is relatively small (e.g., 1 prediction per user) 8
  • 9. Full Stack Deep Learning - UC Berkeley Spring 2021 Batch prediction • Simple to implement • Relatively low latency to the user 9 • Doesn’t scale to complex input types • Users don’t get the most up-to- date predictions • Models frequently become “stale”, which can be hard to detect Pros Cons
  • 10. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 10
  • 11. Full Stack Deep Learning - UC Berkeley Spring 2021 Model-in-service 11 Client Server Database Local Remote Request Response Model
  • 12. Full Stack Deep Learning - UC Berkeley Spring 2021 Model-in-service • Package up your model and include it in your deployed web server • Web server loads the model and calls it to make predictions 12
  • 13. Full Stack Deep Learning - UC Berkeley Spring 2021 Model-in-service • Re-uses your existing infrastructure 13 • Web server may be written in a different language • Models may change more frequently than server code • Large models can eat into the resources for your web server • Server hardware not optimized for your model (e.g., no GPUs) • Model & server may scale differently Pros Cons
  • 14. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 14
  • 15. Full Stack Deep Learning - UC Berkeley Spring 2021 Model-as-service 15 Client Server Database Local Remote Request Response Model
  • 16. Full Stack Deep Learning - UC Berkeley Spring 2021 Model-as-service • Run your model on its own web server • The backend (or the client itself) interact with the model by making requests to the model service and receiving responses back 16
  • 17. Full Stack Deep Learning - UC Berkeley Spring 2021 Model-as-service • Dependability — model bugs less likely to crash the web app • Scalability — choose optimal hardware for the model and scale it appropriately • Flexibility — easily reuse a model across multiple apps 17 • Can add latency • Adds infrastructural complexity • Now you have to run a model service… Pros Cons
  • 18. Full Stack Deep Learning - UC Berkeley Spring 2021 Building a model service: the basics • REST APIs • Dependency management • Performance optimization • Horizontal scaling • Deployment • Managed options 18
  • 19. Full Stack Deep Learning - UC Berkeley Spring 2021 Building a model service: the basics • REST APIs • Dependency management • Performance optimization • Horizontal scaling • Deployment • Managed options 19
  • 20. Full Stack Deep Learning - UC Berkeley Spring 2021 REST APIs • Serving predictions in response to canonically-formatted HTTP requests • There are alternatives like GRPC (which is actually used in tensorflow serving) and GraphQL (not terribly relevant to model services) 20
  • 21. Full Stack Deep Learning - UC Berkeley Spring 2021 REST API example 21
  • 22. Full Stack Deep Learning - UC Berkeley Spring 2021 Formats for requests and responses • Sadly, no standard yet 22 Google Cloud Azure AWS Sagemaker
  • 23. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 23
  • 24. Full Stack Deep Learning - UC Berkeley Spring 2021 Building a model service: the basics • REST APIs • Dependency management • Performance optimization • Horizontal scaling • Deployment • Managed options 24
  • 25. Full Stack Deep Learning - UC Berkeley Spring 2021 Dependency management for model servers • Model predictions depend on code, model weights, and dependencies. All need to be present on your web server • Dependencies cause trouble. Hard to make consistent, hard to update, etc. Even changing a tensorflow version can change your model • Two strategies: • Constrain the dependencies for your model • Use containers 25
  • 26. Full Stack Deep Learning - UC Berkeley Spring 2021 A standard neural net format: ONNX • The promise: define network in any language, run it consistently anywhere • The reality: since the libraries change quickly, there are often bugs in the translation layer • What about non-library code like feature transformations? 26
  • 27. Full Stack Deep Learning - UC Berkeley Spring 2021 Managing dependencies with containers (i.e., Docker) • Docker vs VM • Dockerfile and layer-based images • DockerHub and the ecosystem • Kubernetes and other orchestrators 27 Testing & Deployment - Docker
  • 28. Full Stack Deep Learning - UC Berkeley Spring 2021 28 https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b No OS -> Light weight Testing & Deployment - Docker
  • 29. Full Stack Deep Learning - UC Berkeley Spring 2021 • Spin up a container for every discrete task • For example, a web app might have four containers: • Web server • Database • Job queue • Worker 29 https://www.docker.com/what-container Light weight -> Heavy use Testing & Deployment - Docker
  • 30. Full Stack Deep Learning - UC Berkeley Spring 2021 Dockerfile 30 Testing & Deployment - Docker
  • 31. Full Stack Deep Learning - UC Berkeley Spring 2021 Strong Ecosystem • Images are easy to find, modify, and contribute back to DockerHub • Private images easy to store in same place 31 https://docs.docker.com/engine/docker-overview Testing & Deployment - Docker
  • 32. Full Stack Deep Learning - UC Berkeley Spring 2021 32 https://www.docker.com/what-container#/package_software Testing & Deployment - Docker
  • 33. Full Stack Deep Learning - UC Berkeley Spring 2021 Container Orchestration • Distributing containers onto underlying VMs or bare metal • Kubernetes is the open-source winner, but cloud providers have good offerings, too. 33 Testing & Deployment - Docker
  • 34. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 34
  • 35. Full Stack Deep Learning - UC Berkeley Spring 2021 Building a model service: the basics • REST APIs • Dependency management • Performance optimization • Horizontal scaling • Deployment • Managed options 35
  • 36. Full Stack Deep Learning - UC Berkeley Spring 2021 Making inference on a single machine more efficient • GPU or no GPU? • Concurrency • Model distillation • Quantization • Caching • Batching • Sharing the GPU • Libraries 36
  • 37. Full Stack Deep Learning - UC Berkeley Spring 2021 GPU or no GPU? • GPU pros • Probably the same hardware you trained on • At larger model sizes and with batch size tuning, usually higher throughput • GPU cons • More complex to set up • Often more expensive 37
  • 38. Full Stack Deep Learning - UC Berkeley Spring 2021 Concurrency • What? • Multiple copies of the model running on different CPUs or cores • How? • Be careful about thread tuning 38 https://blog.roblox.com/2020/05/scaled-bert-serve-1-billion-daily-requests-cpus/
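A small sketch of the thread-tuning caveat, in the spirit of the linked Roblox post: if several copies of the model run as separate worker processes on one CPU host, cap the threads each copy uses so the replicas don't oversubscribe the same cores. The `NUM_WORKERS` environment variable and the even split are assumptions for illustration.

```python
import os

import torch

# Assumed: you launch this many worker processes (e.g., gunicorn/uvicorn workers).
workers = int(os.environ.get("NUM_WORKERS", "4"))
cores = os.cpu_count() or 4

# Give each replica an equal slice of the cores for intra-op parallelism,
# and keep inter-op parallelism minimal, so replicas don't fight for CPU.
torch.set_num_threads(max(1, cores // workers))
torch.set_num_interop_threads(1)
```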
  • 39. Full Stack Deep Learning - UC Berkeley Spring 2021 Model distillation • What? • Train a smaller model to imitate your larger one • How? • Several techniques outlined below • Can be finicky to do yourself — infrequently used in practice • Exception — pretrained distilled models like DistilBERT 39 https://heartbeat.fritz.ai/research-guide-model-distillation-techniques-for-deep-learning-4a100801c0eb
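For intuition about what "train a smaller model to imitate your larger one" means in code, here is a minimal sketch of the classic Hinton-style distillation loss (one of the techniques surveyed in the linked guide, not a complete recipe). The function name and the temperature/weighting hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```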
  • 40. Full Stack Deep Learning - UC Berkeley Spring 2021 Quantization • What? • Execute some or all of the operations in your model with a smaller numerical representation than floats (e.g., INT8) • Some tradeoffs with accuracy • How? • PyTorch and TensorFlow Lite have quantization built-in • Can also run quantization-aware training, which often results in higher accuracy 40
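As a concrete example of the built-in support, here is a sketch of PyTorch's post-training dynamic quantization, which stores the weights of selected layer types in int8 and dequantizes on the fly. The toy model and shapes are illustrative; this is the simplest of several quantization modes, not a full recipe.

```python
import torch
import torch.nn as nn

# Illustrative float model; in practice this is your trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Quantize the Linear layers' weights to int8 post-training.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface; smaller weights, often faster on CPU
```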
  • 41. Full Stack Deep Learning - UC Berkeley Spring 2021 Caching • What? • For some ML models, some inputs are more common than others • Instead of always calling the model, first check the cache • How? • Can get very fancy • Basic way uses functools 41
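The "basic way uses functools" point looks like this in practice: memoize predictions for repeated inputs. This only works for hashable inputs and a single process; the `expensive_model_call` name is an illustrative stand-in for real inference, and fancier setups would use a shared cache instead.

```python
import functools


def expensive_model_call(text: str) -> float:
    # Stand-in for real (slow) model inference.
    return float(len(text))


@functools.lru_cache(maxsize=10_000)
def cached_predict(text: str) -> float:
    # Repeated inputs hit the in-process cache instead of the model.
    return expensive_model_call(text)
```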
  • 42. Full Stack Deep Learning - UC Berkeley Spring 2021 Batching • What? • ML models often achieve higher throughput when doing prediction in parallel, especially on a GPU • How? • Gather predictions until you have a batch, run prediction, return to user • Batch size needs to be tuned • You need to have a way to shortcut the process if latency becomes too long • Probably don't want to implement this yourself (see the toy sketch below) 42
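Purely for intuition (the slide's advice stands: use a serving tool's dynamic batching rather than rolling your own), here is a toy sketch of the gather-until-batch-or-deadline loop. The queue item shape (`"input"`, `"reply"` callback), batch size, and deadline are all illustrative assumptions.

```python
import queue
import time


def batching_loop(request_queue, predict_batch, max_batch_size=32, max_wait_s=0.01):
    """Collect requests until the batch is full or the latency deadline passes."""
    while True:
        batch = [request_queue.get()]  # block until the first request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size and time.monotonic() < deadline:
            try:
                batch.append(request_queue.get(timeout=max(deadline - time.monotonic(), 0.0)))
            except queue.Empty:
                break  # deadline hit: shortcut so latency doesn't grow unbounded
        inputs = [item["input"] for item in batch]
        for item, prediction in zip(batch, predict_batch(inputs)):
            item["reply"](prediction)  # hypothetical callback returning the result to the caller
```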
  • 43. Full Stack Deep Learning - UC Berkeley Spring 2021 Sharing the GPU • What? • Your model may not take up all of the GPU memory with your inference batch size. Why not run multiple models on the same GPU? • How? • You’ll probably want to use a model serving solution that supports this out of the box 43
  • 44. Full Stack Deep Learning - UC Berkeley Spring 2021 Model serving libraries 44
  • 45. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 45
  • 46. Full Stack Deep Learning - UC Berkeley Spring 2021 Building a model service: the basics • REST APIs • Dependency management • Performance optimization • Horizontal scaling • Deployment • Managed options 46
  • 47. Full Stack Deep Learning - UC Berkeley Spring 2021 Horizontal scaling • What? • If you have too much traffic for a single machine, split traffic among multiple machines • How? • Spin up multiple copies of your service and split traffic using a load balancer • In practice, two common methods • Container orchestration (i.e., Kubernetes) • Serverless (e.g., AWS Lambda) 47
  • 48. Full Stack Deep Learning - UC Berkeley Spring 2021 Container orchestration 48
  • 49. Full Stack Deep Learning - UC Berkeley Spring 2021 Frameworks for ML deployment on kubernetes 49
  • 50. Full Stack Deep Learning - UC Berkeley Spring 2021 Deploying code as serverless functions • App code and dependencies are packaged into .zip files or docker containers with a single entry point function • AWS Lambda (or Google Cloud Functions, or Azure Functions) manages everything else: instant scaling to 10,000+ requests per second, load balancing, etc. • Only pay for compute-time. 50 Testing & Deployment - Web Deployment
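For the "single entry point function" idea, here is a hedged sketch of the handler shape AWS Lambda expects behind an API Gateway proxy integration. The handler name, event body format, and the placeholder `load_model` are assumptions; the key pattern is loading the model at import time so warm invocations reuse it.

```python
import json


def load_model():
    # Placeholder: in a real deployment, load weights bundled in the package or fetched from S3.
    return lambda features: sum(features)


MODEL = load_model()  # loaded once per container, shared across warm invocations


def handler(event, context):
    body = json.loads(event.get("body", "{}"))
    prediction = MODEL(body["features"])
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```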
  • 51. Full Stack Deep Learning - UC Berkeley Spring 2021 51 Testing & Deployment - Web Deployment
  • 52. Full Stack Deep Learning - UC Berkeley Spring 2021 Deploying code as serverless functions • Cons: • Limited size of deployment package • CPU-only, limited execution time • Can be challenging to build pipelines of models • Little to no state management (e.g., for caching) • Limited deployment tooling 52 Testing & Deployment - Web Deployment
  • 53. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 53
  • 54. Full Stack Deep Learning - UC Berkeley Spring 2021 Building a model service: the basics • REST APIs • Dependency management • Performance optimization • Horizontal scaling • Deployment • Managed options 54
  • 55. Full Stack Deep Learning - UC Berkeley Spring 2021 Model deployment • What? • If serving is how you turn a model into something that can respond to requests, deployment is how you roll out, manage, and update these services • How? • You probably want to be able to roll out gradually, roll back instantly, and deploy pipelines of models • Your deployment library may take care of this for you 55
  • 56. Full Stack Deep Learning - UC Berkeley Spring 2021 Building a model service: the basics • REST APIs • Dependency management • Performance optimization • Horizontal scaling • Deployment • Managed options 56
  • 57. Full Stack Deep Learning - UC Berkeley Spring 2021 Managed options 57
  • 58. Full Stack Deep Learning - UC Berkeley Spring 2021 Building a model service: takeaways • If you are doing CPU inference, you can get away with scaling by launching more servers, or going serverless. • Serverless makes sense if you can get away with CPUs and traffic is spiky or low-volume. • If you are doing GPU inference, serving tools like TensorFlow Serving, Triton, and TorchServe will save you time • Worth keeping an eye on the startups in this space for GPU inference 58
  • 59. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 59
  • 60. Full Stack Deep Learning - UC Berkeley Spring 2021 Edge prediction 60 Client Server Database Local Remote Request Response Model
  • 61. Full Stack Deep Learning - UC Berkeley Spring 2021 Edge prediction • Send model weights to the client device • Client loads the model and interacts with it directly 61
  • 62. Full Stack Deep Learning - UC Berkeley Spring 2021 Edge prediction • Low-latency • Does not require an internet connection • Data security — data doesn’t need to leave the user’s device 62 • Often limited hardware resources available • Embedded and mobile frameworks are less full featured than TensorFlow / PyTorch • Difficult to update models • Difficult to monitor and debug when things go wrong Pros Cons
  • 63. Full Stack Deep Learning - UC Berkeley Spring 2021 Tools for edge deployment 63 TensorRT: Model optimizer and inference runtime for TensorFlow + NVIDIA devices
  • 64. Full Stack Deep Learning - UC Berkeley Spring 2021 Tools for edge deployment 64 Apache TVM: Library-agnostic and target-device agnostic inference runtime
  • 65. Full Stack Deep Learning - UC Berkeley Spring 2021 Tools for edge deployment 65 TFLite: TensorFlow on mobile / edge devices
  • 66. Full Stack Deep Learning - UC Berkeley Spring 2021 Tools for edge deployment 66 PyTorch mobile: PyTorch on iOS and Android
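As a sketch of how the PyTorch Mobile path typically works, the model is traced to TorchScript, optionally optimized for mobile, and saved as a single file that the on-device runtime loads. The MobileNetV2 architecture, input shape, and output file name are illustrative.

```python
import torch
import torchvision
from torch.utils.mobile_optimizer import optimize_for_mobile

# Illustrative model; in practice, your trained network in eval mode.
model = torchvision.models.mobilenet_v2().eval()
example = torch.randn(1, 3, 224, 224)

scripted = torch.jit.trace(model, example)       # freeze the model as TorchScript
mobile_ready = optimize_for_mobile(scripted)     # fuse ops, etc. for mobile runtimes
mobile_ready._save_for_lite_interpreter("model.ptl")  # file shipped to the client app
```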
  • 67. Full Stack Deep Learning - UC Berkeley Spring 2021 Tools for edge deployment 67 TensorFlow.js: TensorFlow in the browser
  • 68. Full Stack Deep Learning - UC Berkeley Spring 2021 68 https://heartbeat.fritz.ai/core-ml-vs-ml-kit-which-mobile-machine-learning-framework-is-right-for-you-e25c5d34c765
  • Core ML: released at Apple WWDC 2017; inference only; model zoo at https://coreml.store/
  • ML Kit: announced at Google I/O 2018; runs either via API or on-device; offers pre-trained models, or you can upload a TensorFlow Lite model
  • Fritz (source of the linked comparison) is supposed to work with both Core ML and ML Kit
  Testing & Deployment - Hardware/Mobile
  • 69. Full Stack Deep Learning - UC Berkeley Spring 2021 More efficient models • Quantization and distillation from above • Mobile-friendly model architectures 69
  • 70. Full Stack Deep Learning - UC Berkeley Spring 2021 70 MobileNets (figure comparing standard, 1x1, Inception, and MobileNet convolution blocks; from https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d) Testing & Deployment - Hardware/Mobile
  • 71. Full Stack Deep Learning - UC Berkeley Spring 2021 71 (figure from https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d) Testing & Deployment - Hardware/Mobile
  • 72. Full Stack Deep Learning - UC Berkeley Spring 2021 Recommended case study: DistilBERT 72 https://medium.com/huggingface/distilbert-8cf3380435b5
  • 73. Full Stack Deep Learning - UC Berkeley Spring 2021 Mindsets for edge deployment • Choose your architecture with your target hardware in mind • You can make up a factor of 2-10 through distillation, quantization, and other tricks, but not more than that • Once you have a model that works on your edge device, you can iterate locally as long as you add model size and latency to your metrics and avoid regressions • Treat tuning the model for your device as an additional risk in the deployment cycle and test it accordingly • E.g., always test your models on production hardware before deploying • Since models can be finicky, it’s a good idea to build fallback mechanisms into the application in case the model fails or is too slow 73
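A small sketch of "add model size and latency to your metrics" so edge-targeted iteration can catch regressions; the function name and the warmup/iteration counts are illustrative, and in practice you'd run this on the target hardware rather than your development machine.

```python
import time

import torch


def model_stats(model, example_input, warmup=5, iters=50):
    """Return parameter count and average single-example latency in milliseconds."""
    num_params = sum(p.numel() for p in model.parameters())
    with torch.no_grad():
        for _ in range(warmup):
            model(example_input)  # warm up caches / lazy initialization
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
    latency_ms = (time.perf_counter() - start) / iters * 1000
    return {"num_params": num_params, "latency_ms": latency_ms}
```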
  • 74. Full Stack Deep Learning - UC Berkeley Spring 2021 Edge deployment: conclusion • Web deployment is easier, so prefer it unless you truly need to run on the edge • Choose your framework to match the available hardware and corresponding mobile frameworks, or try TVM to be more flexible • Start considering hardware constraints at the beginning of the project and choose architectures accordingly 74
  • 75. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 75
  • 76. Full Stack Deep Learning - UC Berkeley Spring 2021 Today • Deployment • Monitoring 76
  • 77. Full Stack Deep Learning - UC Berkeley Spring 2021 Train, test, deploy. You’re done, right? • Validation loss is below your target performance • Test loss is not much worse than validation • Your model performs well across all critical slices and metrics • Qualitatively the predictions make sense • You verified that the prod model has the same performance characteristics as the dev model • You verified that the prod model is indeed better than the previous one 77 Training & troubleshooting lecture Testing lecture
  • 78. Full Stack Deep Learning - UC Berkeley Spring 2021 Train, test, deploy. You’re done, right? • Validation loss is below your target performance • Test loss is not much worse than validation • Your model performs well across all critical slices and metrics • Qualitatively the predictions make sense • You verified that the prod model has the same performance characteristics as the dev model • You verified that the prod model is indeed better than the previous one 78 Training & troubleshooting lecture Testing lecture What else could go wrong?
  • 79. Full Stack Deep Learning - UC Berkeley Spring 2021 Model performance degrades post-deployment 79 Supervised ML: approximate p(y | x) by learning a parameterized model f_θ on data sampled from p(x, y). What can go wrong?
  • p(x) changes. Names: data drift. Examples: bug in upstream data pipeline; malicious users; launch in a new region; add new users with different demographics
  • p(y | x) changes. Names: model drift, concept drift. Examples: users' behavior changes (in response to your model!)
  • Sample doesn't adequately approximate p(x, y). Names: the long tail, domain shift. Examples: tasks where outliers matter; bug in training data pipeline; bias in the sampling process
  • 80. Full Stack Deep Learning - UC Berkeley Spring 2021 Types of data drift 80 Time Value
  • 81. Full Stack Deep Learning - UC Berkeley Spring 2021 Instantaneous drift (i.e., shift) Examples • Model deployed in a new domain • A bug is introduced in the preprocessing pipeline • Big external shift, like Covid 81 Time Value
  • 82. Full Stack Deep Learning - UC Berkeley Spring 2021 Gradual drift Examples • Users' preferences change over time • New concepts get introduced to the corpus over time 82 Time Value
  • 83. Full Stack Deep Learning - UC Berkeley Spring 2021 Periodic drift Examples • Users' preferences are seasonal • People in different timezones use your model differently 83 Time Value
  • 84. Full Stack Deep Learning - UC Berkeley Spring 2021 Temporary drift Examples • Malicious user attacks your model • A new user with different demographics tries your product and then churns • Someone uses your product in a way it was not intended to be used 84 Time Value
  • 85. Full Stack Deep Learning - UC Berkeley Spring 2021 This can have a huge impact in practice 85 Story from global e-commerce company: • Bug in retraining pipeline caused recommendations not to be updated for new users • New users got the same stale recommendations for weeks before the bug was caught • Estimated $Ms in revenue lost
  • 86. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 86
  • 87. Full Stack Deep Learning - UC Berkeley Spring 2021 Strategies for monitoring ML models 87
  • 88. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • What should I monitor? • How do I measure if it’s changed? • How do I tell if that change is bad? • Tools for monitoring • Monitoring and your broader ML system 88
  • 89. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • What should I monitor? • How do I measure if it’s changed? • How do I tell if that change is bad? • Tools for monitoring • Monitoring and your broader ML system 89
  • 90. Full Stack Deep Learning - UC Berkeley Spring 2021 Four types of signals to monitor, ordered from most informative / hardest to measure to least informative / easiest to measure: business metrics, model metrics (e.g., accuracy), model inputs and predictions, system performance 90
  • 91. Full Stack Deep Learning - UC Berkeley Spring 2021 Four types of signals to monitor: model metrics (e.g., accuracy) are your ground truth performance metrics 91
  • 92. Full Stack Deep Learning - UC Berkeley Spring 2021 Four types of signals to monitor: model inputs and predictions give you approximate performance metrics 92
  • 93. Full Stack Deep Learning - UC Berkeley Spring 2021 Four types of signals to monitor: system performance gives you system health metrics 93
  • 94. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 94
  • 95. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • What should I monitor? • How do I measure if it’s changed? • How do I tell if that change is bad? • Tools for monitoring • Monitoring and your broader ML system 95
  • 96. Full Stack Deep Learning - UC Berkeley Spring 2021 Measuring distribution change 96 Time Value
  • 97. Full Stack Deep Learning - UC Berkeley Spring 2021 Measuring distribution change 97 Time Value 1. Select a window of “good” data to serve as a reference
  • 98. Full Stack Deep Learning - UC Berkeley Spring 2021 Selecting a reference window • You can use a fixed window of production data you believe to be healthy • Some papers [1] advocate for using a sliding window of production data • In practice, most of the time you probably should use your training or eval data as the reference 98 [1] Automatic Model Monitoring for Data Streams, Pinto et al. (https://arxiv.org/abs/1908.04240)
  • 99. Full Stack Deep Learning - UC Berkeley Spring 2021 Measuring distribution change 99 Time Value 2. Select a new window of data to measure distance on
  • 100. Full Stack Deep Learning - UC Berkeley Spring 2021 Selecting the measurement window • Problem-dependent — not aware of a principled approach • Pragmatic solution: pick one or several window sizes with a reasonable amount of data and slide them [1] • Special case: window size = 1, i.e., outlier detection [2] 100 [1] To avoid recomputing metrics over and over again when you slide the window, it’s worth looking into the literature on mergeable (quantile) sketching algorithms (https://www.sketchingbigdata.org/) [2] E.g., check out https://pyod.readthedocs.io/en/latest/
  • 101. Full Stack Deep Learning - UC Berkeley Spring 2021 Measuring distribution change 101 Time Value 3. Compare the windows using a distance metric
  • 102. Full Stack Deep Learning - UC Berkeley Spring 2021 Measuring distribution change 102 Time Value 3. Compare the windows using a distance metric
  • 103. Full Stack Deep Learning - UC Berkeley Spring 2021 Distance metrics (1D data) • Rule-based distance metrics (aka, data quality) • Statistical distance metrics • KL divergence • Kolmogorov-Smirnov statistic • D_1 distance • Others 103
  • 104. Full Stack Deep Learning - UC Berkeley Spring 2021 Rule-based distance metrics (i.e., data quality) • Min, max, mean within acceptable range • Enough data points • Not too many missing values • Col A > Col B • Etc 104 Do this!
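A sketch of the rule-based checks above, written as plain assertions over a pandas DataFrame of recent production inputs. The column names, thresholds, and the "col_a >= col_b" relationship are illustrative assumptions about your schema.

```python
import pandas as pd


def check_batch(df: pd.DataFrame):
    """Return a list of human-readable data-quality failures for this window of data."""
    failures = []
    if len(df) < 1000:
        failures.append("too few rows")
    if df["age"].isna().mean() > 0.05:
        failures.append("too many missing ages")
    if not df["age"].dropna().between(0, 120).all():
        failures.append("age outside acceptable range")
    if not (df["col_a"] >= df["col_b"]).all():
        failures.append("col_a < col_b")
    return failures
```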
  • 105. Full Stack Deep Learning - UC Berkeley Spring 2021 Kullback-Leibler divergence • D_KL(P ‖ Q) = E_{x∼P}[log(P(x)/Q(x))] • Sensitive to what happens in the tails of the distribution • Not interpretable • What if P and Q don't have exactly the same support? 105 Friends don’t let friends monitor the KL
  • 106. Full Stack Deep Learning - UC Berkeley Spring 2021 Kolmogorov-Smirnov Statistic • D = sup_x |F_ref(x) − F_new(x)|, the max distance between the two empirical CDFs (its null distribution is that of sup_{t∈(0,1)} |B(t)| for a Brownian bridge B) • Easy to interpret • Used widely in practice 106 K-S = “Oh Yes”
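In practice this is one or two lines with scipy: compare a reference window of a feature against a recent production window using the two-sample KS test. The synthetic data below is illustrative; in a real pipeline the arrays would come from your training set and your logged production inputs.

```python
import numpy as np
from scipy import stats

# Illustrative 1D feature windows: reference (e.g., training data) vs. recent production data.
reference = np.random.normal(0.0, 1.0, size=5000)
production = np.random.normal(0.3, 1.0, size=5000)

statistic, p_value = stats.ks_2samp(reference, production)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.2e}")
# Caveat (echoed later in the lecture): with lots of data, tiny shifts produce tiny
# p-values, so thresholding the statistic itself (the max CDF gap) is often more useful.
```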
  • 107. Full Stack Deep Learning - UC Berkeley Spring 2021 D1 distance • d_1(p, q) = Σ_{i=1}^n |p_i − q_i| • Sum of (absolute) distances between the CDFs • Used at Google [1] • Isn’t this a bit hacky? 107 If Google does it…. [1] Data Validation for Machine Learning (Breck et al.) https://mlsys.org/Conferences/2019/doc/2019/167.pdf
  • 108. Full Stack Deep Learning - UC Berkeley Spring 2021 Other distance metrics? • Earth mover's distance • Population stability index • Etc 108 Worthy of some research. Request for research: how do drifts according to different distance metrics affect model performance differently?
  • 109. Full Stack Deep Learning - UC Berkeley Spring 2021 Distance metrics (>1D data) • Open problem • Maximum Mean Discrepancy: MMD(ℱ, p, q) = ‖μ_p − μ_q‖²_ℱ • Do a bunch of 1D comparisons • Doesn't capture cross-correlation • Multiple hypothesis testing [1] • Prioritize some features for 1D comparisons • Feature importance? • Projections 109 [1] Look into corrections like the Bonferroni correction (https://en.wikipedia.org/wiki/Bonferroni_correction)
  • 110. Full Stack Deep Learning - UC Berkeley Spring 2021 Detecting high-dimensional data drift using projections 110 Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift • Analytical projections, e.g., • Mean pixel value • Length of sentence • (feature_1 - feature_2) / feature_3 • Random projections (e.g., linear) • Statistical projections, e.g., • Autoencoder or other density model • PCA or T-SNE
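A sketch of the projection idea: reduce high-dimensional inputs (here, images) to a 1D analytical projection such as mean pixel value, then run an ordinary 1D drift test on the projected values. The array shapes and the synthetic "darker images in prod" shift are illustrative.

```python
import numpy as np
from scipy import stats


def mean_pixel_projection(images: np.ndarray) -> np.ndarray:
    # images: (N, H, W, C) -> (N,) one scalar (mean pixel value) per image
    return images.reshape(len(images), -1).mean(axis=1)


reference_images = np.random.rand(500, 32, 32, 3)          # e.g., training images
production_images = np.random.rand(500, 32, 32, 3) * 0.8   # e.g., darker images in production

statistic, p_value = stats.ks_2samp(
    mean_pixel_projection(reference_images),
    mean_pixel_projection(production_images),
)
print(statistic, p_value)
```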
  • 111. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 111
  • 112. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • What should I monitor? • How do I measure if it’s changed? • How do I tell if that change is bad? • Tools for monitoring • Monitoring and your broader ML system 112
  • 113. Full Stack Deep Learning - UC Berkeley Spring 2021 Setting thresholds on test values • Statistical tests (e.g., the KS test) • Typically return a p-value for the null hypothesis that the two data distributions are the same • When you have a lot of data, you will get very small p-values for even small shifts 113 https://blog.anomalo.com/dynamic-data-testing-f831435dba90 Used most in practice
  • 114. Full Stack Deep Learning - UC Berkeley Spring 2021 Request for research • Approximating model performance changes given different distribution changes 114
  • 115. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 115
  • 116. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • What should I monitor? • How do I measure if it’s changed? • How do I tell if that change is bad? • Tools for monitoring • Monitoring and your broader ML system 116
  • 117. Full Stack Deep Learning - UC Berkeley Spring 2021 System monitoring tools 117 • Alarms for when things go wrong, and records for tuning things. • Cloud providers have decent monitoring solutions. • Anything that can be logged can be monitored (e.g., data skew) • Will explore in lab
  • 118. Full Stack Deep Learning - UC Berkeley Spring 2021 System monitoring tools 118
  • 119. Full Stack Deep Learning - UC Berkeley Spring 2021 Data quality tools 119
  • 120. Full Stack Deep Learning - UC Berkeley Spring 2021 ML monitoring 120
  • 121. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 121
  • 122. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • What should I monitor? • How do I measure if it’s changed? • How do I tell if that change is bad? • Tools for monitoring • Monitoring and your broader ML system 122
  • 123. Full Stack Deep Learning - UC Berkeley Spring 2021 Monitoring is more central to ML than to traditional software In traditional software… • Bugs are usually obvious and/or cause loud failures • The data you monitor (system metrics, etc) is mostly valuable to detect & diagnose problems 123 … in ML • Bugs usually cause silent degradations in performance • The data you monitor (system metrics, etc) is literally the code used to produce the next model
  • 124. Full Stack Deep Learning - UC Berkeley Spring 2021 A traditional view of monitoring 124 Evaluation Production Training Monitoring
  • 125. Full Stack Deep Learning - UC Berkeley Spring 2021 The reality of monitoring in ML 125 Evaluation Production Training Monitoring
  • 126. Full Stack Deep Learning - UC Berkeley Spring 2021 The reality of monitoring in ML 126 Evaluation Production Training Eval Store
  • 127. Full Stack Deep Learning - UC Berkeley Spring 2021 Closing the data flywheel 127 Collect data Clean and label Train Report Select problem Test Deploy Monitor Eval Store
  • 128. Full Stack Deep Learning - UC Berkeley Spring 2021 Closing the data flywheel 128 Collect data Clean and label Train Report Select problem Test Deploy Monitor Train Eval Store • Register data distribution and performance for this model • Warn us if training data looks too different than prod
  • 129. Full Stack Deep Learning - UC Berkeley Spring 2021 Closing the data flywheel 129 Collect data Clean and label Train Report Select problem Deploy Monitor Train Test Eval Store • Register performance for this model on all test slices • Pull historical data that has been flagged as “interesting” (e.g., gave another model trouble) • Pull definitions of slices
  • 130. Full Stack Deep Learning - UC Berkeley Spring 2021 Closing the data flywheel 130 Test Collect data Clean and label Train Report Select problem Monitor Train Deploy Eval Store • Run a shadow test or AB test by pulling the diff in model performance between versions • Log data and approximate performance back to the eval store
  • 131. Full Stack Deep Learning - UC Berkeley Spring 2021 Closing the data flywheel 131 Deploy Test Collect data Clean and label Train Report Select problem Train Monitor Eval Store • Fire an alert when approximate performance on any of our slices dips below a threshold
  • 132. Full Stack Deep Learning - UC Berkeley Spring 2021 Closing the data flywheel 132 Monitor Deploy Test Clean and label Train Report Select problem Train Collect data Eval Store • Log more data with low or uncertain approximate performance
  • 133. Full Stack Deep Learning - UC Berkeley Spring 2021 Closing the data flywheel 133 Collect data Monitor Deploy Test Train Report Select problem Train Clean and label Eval Store • Inspect & label data with low approximate performance
  • 134. Full Stack Deep Learning - UC Berkeley Spring 2021 Closing the data flywheel 134 Clean and label Collect data Monitor Deploy Test Train Report Select problem Train Eval Store • Retrain when approximate performance dips below a threshold
  • 135. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 135
  • 136. Full Stack Deep Learning - UC Berkeley Spring 2021 Monitoring ML models: takeaways • You should be monitoring your models • Start by looking at data quality metrics and system metrics because it’s easiest • In a perfect world, your testing and monitoring systems should be linked, and should help you close the data flywheel • Lots of activity in tooling — watch this space! • Big open research questions still to be answered 136
  • 137. Full Stack Deep Learning - UC Berkeley Spring 2021 Thank you! 137