This document covers deployment and monitoring of machine learning models. It first discusses where a model should sit in a system architecture, comparing batch prediction, model-in-service, and model-as-service. It then covers building model services: REST APIs, dependency management with containers, performance optimization, horizontal scaling, deployment, managed options, and deploying models to edge devices. The second half covers monitoring: how models degrade in production, how to measure distribution change, and how monitoring closes the data flywheel.
4. Full Stack Deep Learning - UC Berkeley Spring 2021
Today
• Deployment (AKA, how to get your model into production)
• Monitoring (AKA how to keep it healthy once it’s there)
4
5. Full Stack Deep Learning - UC Berkeley Spring 2021
Today
• Deployment
• Monitoring
5
6. Full Stack Deep Learning - UC Berkeley Spring 2021
Where in the architecture should your model go?
6
(Diagram: a local client sends requests to a remote server, which talks to a database and returns responses.)
7. Full Stack Deep Learning - UC Berkeley Spring 2021
Batch prediction
7
(Diagram: same client–server–database setup; the model runs offline and writes its predictions into the database.)
8. Full Stack Deep Learning - UC Berkeley Spring 2021
Batch prediction
• Periodically run your model on new data and cache the results in a
database
• Works if the universe of inputs is relatively small (e.g., 1 prediction per
user)
8
9. Full Stack Deep Learning - UC Berkeley Spring 2021
Batch prediction
Pros
• Simple to implement
• Relatively low latency to the user
Cons
• Doesn’t scale to complex input types
• Users don’t get the most up-to-date predictions
• Models frequently become “stale”, which can be hard to detect
9
10. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
10
11. Full Stack Deep Learning - UC Berkeley Spring 2021
Model-in-service
11
(Diagram: the model is packaged inside the web server, which handles client requests and talks to the database.)
12. Full Stack Deep Learning - UC Berkeley Spring 2021
Model-in-service
• Package up your model and include it in your deployed web server
• Web server loads the model and calls it to make predictions
12
13. Full Stack Deep Learning - UC Berkeley Spring 2021
Model-in-service
Pros
• Re-uses your existing infrastructure
Cons
• Web server may be written in a different language
• Models may change more frequently than server code
• Large models can eat into the resources for your web server
• Server hardware not optimized for your model (e.g., no GPUs)
• Model & server may scale differently
13
14. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
14
15. Full Stack Deep Learning - UC Berkeley Spring 2021
Model-as-service
15
(Diagram: the model runs as its own service; the server (or the client) calls it over the network.)
16. Full Stack Deep Learning - UC Berkeley Spring 2021
Model-as-service
• Run your model on its own web server
• The backend (or the client itself) interacts with the model by making requests to the model service and receiving responses back
16
17. Full Stack Deep Learning - UC Berkeley Spring 2021
Model-as-service
Pros
• Dependability — model bugs less likely to crash the web app
• Scalability — choose optimal hardware for the model and scale it appropriately
• Flexibility — easily reuse a model across multiple apps
Cons
• Can add latency
• Adds infrastructural complexity
• Now you have to run a model service…
17
18. Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs
• Dependency management
• Performance optimization
• Horizontal scaling
• Deployment
• Managed options
18
19. Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs
• Dependency management
• Performance optimization
• Horizontal scaling
• Deployment
• Managed options
19
20. Full Stack Deep Learning - UC Berkeley Spring 2021
REST APIs
• Serving predictions in response to canonically-formatted HTTP requests
• There are alternatives like gRPC (which is actually used in TensorFlow Serving) and GraphQL (not terribly relevant to model services)
20
21. Full Stack Deep Learning - UC Berkeley Spring 2021
REST API example
21
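The original slide shows a code screenshot. As a minimal sketch of such an endpoint (using Flask; the route name, payload format, and stand-in predict function are illustrative, not the lecture's exact code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(text: str) -> dict:
    # Stand-in for real model inference.
    return {"label": "positive" if "good" in text else "negative"}

@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Parse a JSON request body, run inference, and return a JSON response.
    payload = request.get_json(force=True)
    return jsonify(predict(payload.get("text", "")))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```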
22. Full Stack Deep Learning - UC Berkeley Spring 2021
Formats for requests and responses
• Sadly, no standard yet
22
Google Cloud
Azure
AWS Sagemaker
23. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
23
24. Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs
• Dependency management
• Performance optimization
• Horizontal scaling
• Deployment
• Managed options
24
25. Full Stack Deep Learning - UC Berkeley Spring 2021
Dependency management for model servers
• Model predictions depend on code, model weights, and dependencies.
All need to be present on your web server
• Dependencies cause trouble. Hard to make consistent, hard to update, etc. Even changing a TensorFlow version can change your model
• Two strategies:
• Constrain the dependencies for your model
• Use containers
25
26. Full Stack Deep Learning - UC Berkeley Spring 2021
A standard neural net format: ONNX
• The promise: define network in any language, run it consistently anywhere
• The reality: since the libraries change quickly, there are often bugs in the
translation layer
• What about non-library code like feature transformations?
26
27. Full Stack Deep Learning - UC Berkeley Spring 2021
Managing dependencies with containers (i.e., Docker)
• Docker vs VM
• Dockerfile and layer-based images
• DockerHub and the ecosystem
• Kubernetes and other orchestrators
27
Testing & Deployment - Docker
28. Full Stack Deep Learning - UC Berkeley Spring 2021
28
https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b
No guest OS (containers share the host kernel) -> lightweight
Testing & Deployment - Docker
29. Full Stack Deep Learning - UC Berkeley Spring 2021
• Spin up a container for
every discrete task
• For example, a web app
might have four containers:
• Web server
• Database
• Job queue
• Worker
29
https://www.docker.com/what-container
Light weight -> Heavy use
Testing & Deployment - Docker
30. Full Stack Deep Learning - UC Berkeley Spring 2021
Dockerfile
30
Testing & Deployment - Docker
31. Full Stack Deep Learning - UC Berkeley Spring 2021
Strong Ecosystem
• Images are easy to
find, modify, and
contribute back to
DockerHub
• Private images
easy to store in
same place
31
https://docs.docker.com/engine/docker-overview
Testing & Deployment - Docker
32. Full Stack Deep Learning - UC Berkeley Spring 2021
32
https://www.docker.com/what-container#/package_software
Testing & Deployment - Docker
33. Full Stack Deep Learning - UC Berkeley Spring 2021
Container Orchestration
• Distributing containers onto underlying VMs
or bare metal
• Kubernetes is the open-source winner, but
cloud providers have good offerings, too.
33
Testing & Deployment - Docker
34. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
34
35. Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs
• Dependency management
• Performance optimization
• Horizontal scaling
• Deployment
• Managed options
35
36. Full Stack Deep Learning - UC Berkeley Spring 2021
Making inference on a single machine more efficient
• GPU or no GPU?
• Concurrency
• Model distillation
• Quantization
• Caching
• Batching
• Sharing the GPU
• Libraries
36
37. Full Stack Deep Learning - UC Berkeley Spring 2021
GPU or no GPU?
• GPU pros
• Probably the same hardware you trained on
• In the limit of model size, batch-size tuning, etc., usually higher throughput
• GPU cons
• More complex to set up
• Often more expensive
37
38. Full Stack Deep Learning - UC Berkeley Spring 2021
Concurrency
• What?
• Multiple copies of the model running on different CPUs or cores
• How?
• Be careful about thread tuning
38
https://blog.roblox.com/2020/05/scaled-bert-serve-1-billion-daily-requests-cpus/
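As a concrete illustration of the thread-tuning caveat: when you run several model replicas per machine (e.g., multiple web-server workers), it usually helps to cap each replica's intra-op threads so the copies don't fight over cores. A minimal sketch, with illustrative values rather than recommendations:

```python
import os
# Set before importing torch so the OpenMP runtime picks it up.
os.environ["OMP_NUM_THREADS"] = "1"

import torch

# Cap per-replica threading so multiple model copies on one machine don't
# oversubscribe the CPUs; tune these for your core and worker counts.
torch.set_num_threads(1)          # intra-op parallelism
torch.set_num_interop_threads(1)  # inter-op parallelism
```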
39. Full Stack Deep Learning - UC Berkeley Spring 2021
Model distillation
• What?
• Train a smaller model to imitate your larger one
• How?
• Several techniques outlined below
• Can be finicky to do yourself — infrequently used in practice
• Exception — pretrained distilled models like DistilBERT
39
https://heartbeat.fritz.ai/research-guide-model-distillation-techniques-for-deep-learning-4a100801c0eb
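For intuition only, a sketch of one common recipe (soft-target distillation in the style of Hinton et al.); the temperature and mixing weight are illustrative hyperparameters, and as the slide notes, this is rarely worth hand-rolling in practice:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: the usual cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```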
40. Full Stack Deep Learning - UC Berkeley Spring 2021
Quantization
• What?
• Execute some or all of the operations in your model with a smaller
numerical representation than floats (e.g., INT8)
• Some tradeoffs with accuracy
• How?
• PyTorch and TensorFlow Lite have quantization built-in
• Can also run quantization-aware training, which often results in higher
accuracy
40
https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
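A minimal sketch of post-training dynamic quantization in PyTorch (the toy model stands in for a real trained network; TensorFlow Lite offers an analogous post-training path):

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained float32 model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Dynamic quantization: Linear weights stored as int8, activations quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface; smaller and often faster on CPU
```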
41. Full Stack Deep Learning - UC Berkeley Spring 2021
Caching
• What?
• For some ML models, some inputs are more common than others
• Instead of always calling the model, first check the cache
• How?
• Can get very fancy
• Basic way uses functools
41
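A minimal sketch of the functools approach (run_model is a stand-in for real inference; note that cached arguments must be hashable, so pass strings or tuples rather than arrays):

```python
from functools import lru_cache

def run_model(text: str) -> float:
    # Stand-in for an expensive model call.
    return float(len(text))

@lru_cache(maxsize=10_000)
def predict(text: str) -> float:
    # Repeated inputs are served from the in-process cache instead of re-running the model.
    return run_model(text)

predict("hello world")  # computed
predict("hello world")  # served from cache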
42. Full Stack Deep Learning - UC Berkeley Spring 2021
Batching
• What?
• ML models often achieve higher throughput when doing prediction in parallel,
especially in a GPU
• How?
• Gather predictions until you have a batch, run prediction, return to user
• Batch size needs to be tuned
• You need to have a way to shortcut the process if latency becomes too long
• Probably don’t want to implement this yourself
42
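The slide advises against rolling your own batching, but a toy sketch helps show the moving parts: collect requests up to a batch size or a latency deadline, run one batched prediction, then hand each caller its result (predict_batch and the limits are illustrative):

```python
import queue
import threading
import time
from concurrent.futures import Future

request_queue: queue.Queue = queue.Queue()

def predict_batch(inputs):
    # Stand-in for a vectorized model call (e.g., one forward pass on a GPU).
    return [len(x) for x in inputs]

def batching_worker(max_batch_size=32, max_wait_s=0.01):
    while True:
        batch = [request_queue.get()]            # block until one request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < max_batch_size:       # fill the batch...
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break                            # ...but never wait past the deadline
            try:
                batch.append(request_queue.get(timeout=timeout))
            except queue.Empty:
                break
        inputs, futures = zip(*batch)
        for fut, pred in zip(futures, predict_batch(list(inputs))):
            fut.set_result(pred)

threading.Thread(target=batching_worker, daemon=True).start()

def predict(x):
    fut = Future()
    request_queue.put((x, fut))
    return fut.result()                          # caller blocks until its prediction is ready

print(predict("hello"))
```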
43. Full Stack Deep Learning - UC Berkeley Spring 2021
Sharing the GPU
• What?
• Your model may not take up all of the GPU memory with your inference
batch size. Why not run multiple models on the same GPU?
• How?
• You’ll probably want to use a model serving solution that supports this
out of the box
43
44. Full Stack Deep Learning - UC Berkeley Spring 2021
Model serving libraries
44
45. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
45
46. Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs
• Dependency management
• Performance optimization
• Horizontal scaling
• Deployment
• Managed options
46
47. Full Stack Deep Learning - UC Berkeley Spring 2021
Horizontal scaling
• What?
• If you have too much traffic for a single machine, split traffic among multiple
machines
• How?
• Spin up multiple copies of your service and split traffic using a load balancer
• In practice, two common methods
• Container orchestration (e.g., Kubernetes)
• Serverless (e.g., AWS Lambda)
47
48. Full Stack Deep Learning - UC Berkeley Spring 2021
Container orchestration
48
49. Full Stack Deep Learning - UC Berkeley Spring 2021
Frameworks for ML deployment on Kubernetes
49
50. Full Stack Deep Learning - UC Berkeley Spring 2021
Deploying code as serverless functions
• App code and dependencies are packaged into .zip files or Docker containers with a single entry point function
• AWS Lambda (or Google Cloud Functions, or Azure Functions) manages
everything else: instant scaling to 10,000+ requests per second, load
balancing, etc.
• Only pay for compute-time.
50
Testing & Deployment - Web Deployment
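A minimal sketch of what a Python serverless entry point can look like (shown for an AWS Lambda-style handler; the toy load_model and request format are illustrative — in practice the weights would ship in the package/container or be fetched at cold start):

```python
import json

def load_model():
    # Stand-in for loading real weights; runs once per container,
    # so warm invocations reuse the loaded model.
    return lambda text: {"label": "positive" if "good" in text else "negative"}

MODEL = load_model()

def handler(event, context):
    # Single entry point: parse the request, run inference, return a response.
    body = json.loads(event.get("body", "{}"))
    prediction = MODEL(body.get("text", ""))
    return {"statusCode": 200, "body": json.dumps(prediction)}
```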
51. Full Stack Deep Learning - UC Berkeley Spring 2021
51
Testing & Deployment - Web Deployment
52. Full Stack Deep Learning - UC Berkeley Spring 2021
Deploying code as serverless functions
• Cons:
• Limited size of deployment package
• CPU-only, limited execution time
• Can be challenging to build pipelines of models
• Little to no state management (e.g., for caching)
• Limited deployment tooling
52
Testing & Deployment - Web Deployment
53. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
53
54. Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs
• Dependency management
• Performance optimization
• Horizontal scaling
• Deployment
• Managed options
54
55. Full Stack Deep Learning - UC Berkeley Spring 2021
Model deployment
• What?
• If serving is how you turn a model into something that can respond to
requests, deployment is how you roll out, manage, and update these
services
• How?
• You probably want to be able to roll out gradually, roll back instantly,
and deploy pipelines of models
• Your deployment library may take care of this for you
55
56. Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: the basics
• REST APIs
• Dependency management
• Performance optimization
• Horizontal scaling
• Deployment
• Managed options
56
57. Full Stack Deep Learning - UC Berkeley Spring 2021
Managed options
57
58. Full Stack Deep Learning - UC Berkeley Spring 2021
Building a model service: takeaways
• If you are doing CPU inference, you can get away with scaling by launching more servers, or going serverless.
• Serverless makes sense if you can get away with CPUs and traffic is spiky or low-volume.
• If using GPU inference, serving tools like TensorFlow Serving, Triton, and TorchServe will save you time.
• Worth keeping an eye on the startups in this space for GPU inference.
58
59. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
59
60. Full Stack Deep Learning - UC Berkeley Spring 2021
Edge prediction
60
(Diagram: the model runs locally on the client device; the client still talks to the remote server and database for everything else.)
61. Full Stack Deep Learning - UC Berkeley Spring 2021
Edge prediction
• Send model weights to the client device
• Client loads the model and interacts with it directly
61
62. Full Stack Deep Learning - UC Berkeley Spring 2021
Edge prediction
Pros
• Low latency
• Does not require an internet connection
• Data security — data doesn’t need to leave the user’s device
Cons
• Often limited hardware resources available
• Embedded and mobile frameworks are less full-featured than TensorFlow / PyTorch
• Difficult to update models
• Difficult to monitor and debug when things go wrong
62
63. Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
63
TensorRT: Model optimizer and inference runtime for TensorFlow + NVIDIA devices
64. Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
64
Apache TVM: Library-agnostic and target-device agnostic inference runtime
65. Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
65
TFLite: TensorFlow on mobile / edge devices
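A sketch of the usual conversion path, assuming a TensorFlow SavedModel exported at an illustrative path ("saved_model_dir"):

```python
import tensorflow as tf

# Convert a SavedModel to a .tflite flatbuffer for mobile / edge runtimes.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables post-training optimizations (e.g., quantization)
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```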
66. Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
66
PyTorch Mobile: PyTorch on iOS and Android
67. Full Stack Deep Learning - UC Berkeley Spring 2021
Tools for edge deployment
67
TensorFlow.js: TensorFlow in the browser
68. Full Stack Deep Learning - UC Berkeley Spring 2021
68
https://heartbeat.fritz.ai/core-ml-vs-ml-kit-which-mobile-machine-learning-framework-is-right-for-you-e25c5d34c765
CoreML
- Released at Apple WWDC 2017
- Inference only
- https://coreml.store/ — supposed to work with both CoreML and ML Kit
ML Kit
- Announced at Google I/O 2018
- Either via API or on-device
- Offers pre-trained models, or can upload a TensorFlow Lite model
Testing & Deployment - Hardware/Mobile
69. Full Stack Deep Learning - UC Berkeley Spring 2021
More efficient models
• Quantization and distillation from above
• Mobile-friendly model architectures
69
70. Full Stack Deep Learning - UC Berkeley Spring 2021
70
MobileNets
(Figure: standard vs. 1x1 vs. Inception-style convolution blocks — the building blocks behind MobileNet’s speed.)
https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
Testing & Deployment - Hardware/Mobile
71. Full Stack Deep Learning - UC Berkeley Spring 2021
71
https://medium.com/@yu4u/why-mobilenet-and-its-variants-e-g-shufflenet-are-fast-1c7048b9618d
Testing & Deployment - Hardware/Mobile
72. Full Stack Deep Learning - UC Berkeley Spring 2021
Recommended case study: DistilBERT
72
https://medium.com/huggingface/distilbert-8cf3380435b5
73. Full Stack Deep Learning - UC Berkeley Spring 2021
Mindsets for edge deployment
• Choose your architecture with your target hardware in mind
• You can make up a factor of 2-10 through distillation, quantization, and other
tricks, but not more than that
• Once you have a model that works on your edge device, you can iterate locally
as long as you add model size and latency to your metrics and avoid
regressions
• Treat tuning the model for your device as an additional risk in the deployment
cycle and test it accordingly
• E.g., always test your models on production hardware before deploying
• Since models can be finicky, it’s a good idea to build fallback mechanisms into
the application in case the model fails or is too slow
73
74. Full Stack Deep Learning - UC Berkeley Spring 2021
Edge deployment: conclusion
• Web deployment is easier, so prefer it unless you specifically need edge deployment
• Choose your framework to match the available hardware and
corresponding mobile frameworks, or try TVM to be more flexible
• Start considering hardware constraints at the beginning of the project and
choose architectures accordingly
74
75. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
75
76. Full Stack Deep Learning - UC Berkeley Spring 2021
Today
• Deployment
• Monitoring
76
77. Full Stack Deep Learning - UC Berkeley Spring 2021
Train, test, deploy. You’re done, right?
• Validation loss is below your target performance
• Test loss is not much worse than validation
• Your model performs well across all critical slices and
metrics
• Qualitatively the predictions make sense
• You verified that the prod model has the same performance
characteristics as the dev model
• You verified that the prod model is indeed better than the
previous one
77
Training & troubleshooting
lecture
Testing lecture
78. Full Stack Deep Learning - UC Berkeley Spring 2021
Train, test, deploy. You’re done, right?
• Validation loss is below your target performance
• Test loss is not much worse than validation
• Your model performs well across all critical slices and
metrics
• Qualitatively the predictions make sense
• You verified that the prod model has the same
performance characteristics as the dev model
• You verified that the prod model is indeed better than the
previous one
78
Training & troubleshooting
lecture
Testing lecture
What else could go wrong?
79. Full Stack Deep Learning - UC Berkeley Spring 2021
Model performance degrades post-deployment
79
Supervised ML: approximate p(y ∣ x) by learning a parameterized model fθ on data sampled from p(x, y)
What can go wrong? | Names | Examples
p(x) changes | Data drift | Bug in upstream data pipeline; malicious users; launch in a new region; add new users with different demographics
p(y ∣ x) changes | Model drift; concept drift | User behavior changes (in response to your model!)
Sample doesn’t adequately approximate p(x, y) | The long tail; domain shift | Tasks where outliers matter; bug in training data pipeline; bias in the sampling process
80. Full Stack Deep Learning - UC Berkeley Spring 2021
Types of data drift
80
(Plot: a feature’s value over time.)
81. Full Stack Deep Learning - UC Berkeley Spring 2021
Instantaneous drift (i.e., shift)
Examples
• Model deployed in a new domain
• A bug is introduced in the preprocessing pipeline
• Big external shift, like Covid
81
82. Full Stack Deep Learning - UC Berkeley Spring 2021
Gradual drift
Examples
• User preferences change over time
• New concepts get introduced to the corpus over time
82
83. Full Stack Deep Learning - UC Berkeley Spring 2021
Periodic drift
Examples
• User preferences are seasonal
• People in different timezones use your model differently
83
84. Full Stack Deep Learning - UC Berkeley Spring 2021
Temporary drift
Examples
• Malicious user attacks your model
• A new user with different demographics tries your product and then churns
• Someone uses your product in a way it was not intended to be used
84
Time
Value
85. Full Stack Deep Learning - UC Berkeley Spring 2021
This can have a huge impact in practice
85
Story from global e-commerce company:
• Bug in retraining pipeline caused recommendations
not to be updated for new users
• New users got the same stale recommendations
for weeks before the bug was caught
• Estimated $Ms in revenue lost
86. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
86
87. Full Stack Deep Learning - UC Berkeley Spring 2021
Strategies for monitoring ML models
87
88. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?
• How do I measure if it’s changed?
• How do I tell if that change is bad?
• Tools for monitoring
• Monitoring and your broader ML system
88
89. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?
• How do I measure if it’s changed?
• How do I tell if that change is bad?
• Tools for monitoring
• Monitoring and your broader ML system
89
90. Full Stack Deep Learning - UC Berkeley Spring 2021
Four types of signals to monitor
90
Model metrics (e.g., accuracy)
Model inputs and predictions
System performance
Business metrics
(These sit on a spectrum: the more informative a signal is, the harder it tends to be to measure.)
91. Full Stack Deep Learning - UC Berkeley Spring 2021
Four types of signals to monitor
91
(Same spectrum, highlighting ground truth performance metrics — e.g., model metrics like accuracy.)
92. Full Stack Deep Learning - UC Berkeley Spring 2021
Four types of signals to monitor
92
(Same spectrum, highlighting approximate performance metrics — model inputs and predictions.)
93. Full Stack Deep Learning - UC Berkeley Spring 2021
Four types of signals to monitor
93
(Same spectrum, highlighting system health metrics — system performance.)
94. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
94
95. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?
• How do I measure if it’s changed?
• How do I tell if that change is bad?
• Tools for monitoring
• Monitoring and your broader ML system
95
96. Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
96
97. Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
97
1. Select a window of “good” data to serve as a reference
98. Full Stack Deep Learning - UC Berkeley Spring 2021
Selecting a reference window
• You can use a fixed window of production data you believe to be healthy
• Some papers [1] advocate for using a sliding window of production data
• In practice, most of the time you probably should use your training or eval
data as the reference
98
[1] Automatic Model Monitoring for Data Streams, Pinto et al. (https://arxiv.org/abs/1908.04240)
99. Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
99
2. Select a new window of data to measure distance on
100. Full Stack Deep Learning - UC Berkeley Spring 2021
Selecting the measurement window
• Problem-dependent — not aware of a principled approach
• Pragmatic solution: pick one or several window sizes with a reasonable
amount of data and slide them [1]
• Special case: window size = 1, i.e., outlier detection [2]
100
[1] To avoid recomputing metrics over and over again when you slide the window, it’s worth looking into the literature on mergeable
(quantile) sketching algorithms (https://www.sketchingbigdata.org/)
[2] E.g., check out https://pyod.readthedocs.io/en/latest/
101. Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
101
3. Compare the windows using a distance metric
102. Full Stack Deep Learning - UC Berkeley Spring 2021
Measuring distribution change
102
3. Compare the windows using a distance metric
103. Full Stack Deep Learning - UC Berkeley Spring 2021
Distance metrics (1D data)
• Rule-based distance metrics (aka, data quality)
• Statistical distance metrics
• KL divergence
• Kolmogorov-Smirnov statistic
• D_1 distance
• Others
103
104. Full Stack Deep Learning - UC Berkeley Spring 2021
Rule-based distance metrics (i.e., data quality)
• Min, max, mean within
acceptable range
• Enough data points
• Not too many missing values
• Col A > Col B
• Etc
104
Do this!
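A sketch of what such rules can look like in code, assuming a pandas DataFrame of recent production inputs (the column names and thresholds are purely illustrative):

```python
from typing import List

import pandas as pd

def data_quality_checks(df: pd.DataFrame) -> List[str]:
    failures = []
    if len(df) < 1_000:                              # enough data points
        failures.append("too few rows in window")
    if df["age"].isna().mean() > 0.05:               # not too many missing values
        failures.append("too many missing 'age' values")
    if not df["age"].between(0, 120).all():          # min/max within acceptable range
        failures.append("'age' out of range")
    if not (df["col_a"] > df["col_b"]).all():        # cross-column rule (Col A > Col B)
        failures.append("'col_a' not greater than 'col_b' for some rows")
    return failures
```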
105. Full Stack Deep Learning - UC Berkeley Spring 2021
Kullback-Leibler divergence
• D_KL(P ∥ Q) = 𝔼_{x∼P} [log (P(x) / Q(x))]
• Sensitive to what happens on the tails of the distribution
• Not interpretable
• What if P and Q don’t have exactly the same range?
105
Friends don’t let
friends monitor the KL
106. Full Stack Deep Learning - UC Berkeley Spring 2021
Kolmogorov-Smirnov Statistic
• K = sup_x |F_P(x) − F_Q(x)| — the max distance between the CDFs of P and Q
• Easy to interpret
• Used widely in practice
106
K-S = “Oh Yes”
107. Full Stack Deep Learning - UC Berkeley Spring 2021
D1 distance
• d1(p, q) = Σ_{i=1..n} |p_i − q_i|
• Sum of distances between CDFs
• Used at Google [1]
• Isn’t this a bit hacky?
107
107
If google does it….
[1] Data Validation for Machine Learning (Breck et al) https://mlsys.org/Conferences/2019/doc/2019/167.pdf
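A rough sketch of the idea, assuming 1-D samples binned into histograms (the bin count is arbitrary, and the paper's exact definition differs in detail, so treat this as illustrative):

```python
import numpy as np

def d1_distance(p_samples, q_samples, bins=30):
    # Bin both samples on a shared range, normalize, and sum absolute differences.
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p_hist, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q_hist, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = p_hist / p_hist.sum()
    q = q_hist / q_hist.sum()
    return np.abs(p - q).sum()

reference = np.random.normal(0.0, 1.0, size=10_000)
production = np.random.normal(0.3, 1.0, size=10_000)
print(d1_distance(reference, production))
```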
108. Full Stack Deep Learning - UC Berkeley Spring 2021
Other distance metrics?
• Earth mover distance
• Population stability index
• Etc
108
Worthy of some
research.
Request for research: how do drifts according to different distance metrics affect
model performance differently?
109. Full Stack Deep Learning - UC Berkeley Spring 2021
Distance metrics (>1D data)
• Open problem
• Maximum Mean Discrepancy: MMD(ℱ, p, q) = ‖μ_p − μ_q‖²_ℱ
• Do a bunch of 1D comparisons
• Doesn’t capture cross-correlation
• Multiple hypothesis testing [1]
• Prioritize some features for 1D comparisons
• Feature importance?
• Projections
109
[1] Look into corrections like the Bonferroni correction (https://en.wikipedia.org/wiki/Bonferroni_correction)
110. Full Stack Deep Learning - UC Berkeley Spring 2021
Detecting high-dimensional data drift using projections
110
Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
• Analytical projections, e.g.,
• Mean pixel value
• Length of sentence
• (feature_1 - feature_2) / feature_3
• Random projections (e.g., linear)
• Statistical projections, e.g.,
• Autoencoder or other density model
• PCA or t-SNE
111. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
111
112. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?
• How do I measure if it’s changed?
• How do I tell if that change is bad?
• Tools for monitoring
• Monitoring and your broader ML system
112
113. Full Stack Deep Learning - UC Berkeley Spring 2021
Setting thresholds on test values
• Statistical tests (e.g., KS-Test)
• Typically return a p-value for the null hypothesis that the two data distributions are the same
• When you have a lot of data, you will get very small p-values for small shifts
113
https://blog.anomalo.com/dynamic-data-testing-f831435dba90
Used most in practice
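A minimal sketch with scipy's two-sample KS test (the synthetic arrays stand in for a reference window and a recent production window):

```python
import numpy as np
from scipy import stats

reference = np.random.normal(0.0, 1.0, size=5_000)    # e.g., a feature from training data
production = np.random.normal(0.1, 1.0, size=5_000)   # the same feature from a recent window

statistic, p_value = stats.ks_2samp(reference, production)
# With large windows, even tiny shifts yield tiny p-values, so in practice you alert
# on the statistic (effect size) and/or a stricter threshold, not p < 0.05 alone.
print(statistic, p_value)
```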
114. Full Stack Deep Learning - UC Berkeley Spring 2021
Request for research
• Approximating model performance changes given different distribution
changes
114
115. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
115
116. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?
• How do I measure if it’s changed?
• How do I tell if that change is bad?
• Tools for monitoring
• Monitoring and your broader ML system
116
117. Full Stack Deep Learning - UC Berkeley Spring 2021
System monitoring tools
117
• Alarms for when things go
wrong, and records for tuning
things.
• Cloud providers have decent
monitoring solutions.
• Anything that can be logged can be monitored (e.g., data skew)
• Will explore in lab
118. Full Stack Deep Learning - UC Berkeley Spring 2021
System monitoring tools
118
119. Full Stack Deep Learning - UC Berkeley Spring 2021
Data quality tools
119
120. Full Stack Deep Learning - UC Berkeley Spring 2021
ML monitoring
120
121. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
121
122. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• What should I monitor?
• How do I measure if it’s changed?
• How do I tell if that change is bad?
• Tools for monitoring
• Monitoring and your broader ML system
122
123. Full Stack Deep Learning - UC Berkeley Spring 2021
Monitoring is more central to ML than to traditional software
In traditional software…
• Bugs are usually obvious
and/or cause loud failures
• The data you monitor
(system metrics, etc) is
mostly valuable to detect &
diagnose problems
123
… in ML
• Bugs usually cause silent degradations in performance
• The data you monitor (inputs, predictions, etc.) is literally the code used to produce the next model
124. Full Stack Deep Learning - UC Berkeley Spring 2021
A traditional view of monitoring
124
Evaluation Production
Training Monitoring
125. Full Stack Deep Learning - UC Berkeley Spring 2021
The reality of monitoring in ML
125
Evaluation Production
Training
Monitoring
126. Full Stack Deep Learning - UC Berkeley Spring 2021
The reality of monitoring in ML
126
Evaluation Production
Training
Eval Store
127. Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
127
Collect data
Clean and
label
Train Report
Select problem
Test
Deploy
Monitor
Eval Store
128. Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
128
Collect data
Clean and
label
Train Report
Select problem
Test
Deploy
Monitor
Train
Eval Store
• Register data distribution and
performance for this model
• Warn us if training data looks
too different than prod
129. Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
129
Collect data
Clean and
label
Train Report
Select problem
Deploy
Monitor
Train
Test
Eval Store
• Register performance for this
model on all test slices
• Pull historical data that has been flagged as “interesting” (e.g., gave another model trouble)
• Pull definitions of slices
130. Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
130
Test
Collect data
Clean and
label
Train Report
Select problem
Monitor
Train
Deploy
Eval Store
• Run a shadow test or AB test by
pulling the diff in model
performance between versions
• Log data and approximate
performance back to the eval
store
131. Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
131
Deploy Test
Collect data
Clean and
label
Train Report
Select problem Train
Monitor
Eval Store
• Fire an alert when approximate
performance on any of our
slices dips below a threshold
132. Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
132
Monitor Deploy Test
Clean and
label
Train Report
Select problem Train
Collect data
Eval Store
• Log more data with low or
uncertain approximate
performance
133. Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
133
Collect data
Monitor Deploy Test
Train Report
Select problem Train
Clean and
label
Eval Store
• Inspect & label data with low
approximate performance
134. Full Stack Deep Learning - UC Berkeley Spring 2021
Closing the data flywheel
134
Clean and
label
Collect data
Monitor Deploy Test
Train Report
Select problem Train
Eval Store
• Retrain when approximate
performance dips below a
threshold
135. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
135
136. Full Stack Deep Learning - UC Berkeley Spring 2021
Monitoring ML models: takeaways
• You should be monitoring your models
• Start by looking at data quality metrics and system metrics because it’s
easiest
• In a perfect world, your testing and monitoring systems should be linked,
and should help you close the data flywheel
• Lots of activity in tooling — watch this space!
• Big open research questions still to be answered
136
137. Full Stack Deep Learning - UC Berkeley Spring 2021
Thank you!
137