Monitoring Models in Production
Keeping track of complex models in a complex world
Jannes Klaas
About me
• International Business @ RSM; Financial Economics @ Oxford Saïd
• Course developer, machine learning @ Turing Society
• Author, “Machine Learning for Finance”, out in July
• ML consultant for non-profits / impact investors
• Previously: Urban Planning @ IHS Rotterdam & destroyer of my own startup
The life and times of an ML practitioner
“We send you the data, you send us back a model, then we take it from there” – Consulting Clients
“Define an approach, evaluate on a common benchmark and publish” – Academia
Repeat after me
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
Machine learning 101
Estimate some function y = f(x) using (x, y) pairs
The estimated function hopefully represents the true relationship between x and y
The model is a function of the data
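A minimal sketch of that setup, assuming scikit-learn and a toy one-dimensional problem (everything here is illustrative, not from the slides):

import numpy as np
from sklearn.linear_model import LinearRegression

# toy (x, y) pairs; in practice these come from your training set
x = np.arange(100, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + np.random.normal(scale=2.0, size=100)

f_hat = LinearRegression().fit(x, y)   # estimate f from the (x, y) pairs
y_pred = f_hat.predict(x)              # hopefully close to the true f(x)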
Problems you encounter in
production
• The world changes, your training data might
no longer depict the real world
• Your model inputs might change
• There might be unintended bugs and side
effects in complex models
• Models influence the world they try to model
• Model decay: Your model usually becomes
worse over time
Are models a liability after shipping?
No, the real world is the perfect training environment
Datasets are only an approximation of the real world
Active learning on real world examples can greatly reduce your data needs
Online learning
• Update the model continuously as new data streams in
• Good if you have a continuous stream of ground truth as well
• Needs more monitoring to ensure the model does not go off track
• Can be expensive for big models
• Might need separate training / inference hardware
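A minimal sketch of the online-learning loop, assuming scikit-learn's SGDClassifier, a binary problem, and hypothetical stream_batches() / log_metrics() helpers standing in for your own data pipeline and monitoring:

from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = [0, 1]  # all classes must be declared for partial_fit

# stream_batches() yields (features, ground-truth labels) as new data arrives
for X_batch, y_batch in stream_batches():
    model.partial_fit(X_batch, y_batch, classes=classes)
    log_metrics(model, X_batch, y_batch)  # monitor after every update so the model cannot silently drift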
Active learning
Make predictions
Request labels for low-confidence examples
Train on those ‘special cases’
Production is an opportunity for learning
Monitoring is part of training
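A rough sketch of that loop in Python (model, X_live, the labeled data, request_label, and the confidence threshold are all placeholders, not part of the original deck):

import numpy as np

CONFIDENCE_THRESHOLD = 0.6  # tune per use case

probs = model.predict_proba(X_live)      # make predictions
confidence = probs.max(axis=1)           # top-class probability
uncertain = np.where(confidence < CONFIDENCE_THRESHOLD)[0]

for i in uncertain:
    request_label(X_live[i])             # hypothetical labeling queue (e.g. a tool like Prodigy)

# once labels come back, train on those 'special cases'
model.partial_fit(X_labeled, y_labeled)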
Model monitoring vs Ops monitoring
• Model monitoring tracks model behavior
• Inherently stochastic
• Can be driven by user behavior
• Almost certainly looking for unknown unknowns
• Few established guidelines on what to monitor
Monitoring inputs
• E.g. images arriving at the model are very small, very dark, high contrast, etc.
• More similar to ops monitoring, as there can be obvious failures
• Monitor classic stats and compare to training data: means, standard deviations, correlations, KL divergence between training & live data
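A minimal sketch of such an input check, assuming live_batch is a NumPy array of incoming inputs, train_stats holds per-feature means and standard deviations computed on the training set, and alert() is a placeholder notification hook:

import numpy as np

live_mean = live_batch.mean(axis=0)

# flag features whose live mean drifted more than 3 training standard deviations
drift = np.abs(live_mean - train_stats["mean"]) / (train_stats["std"] + 1e-9)
drifted = np.where(drift > 3.0)[0]
if drifted.size > 0:
    alert("input drift", drifted.tolist())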
Output monitoring
Harder: people might just upload more plane images one day
Monitoring the prediction distribution is surprisingly helpful
Monitor confidence (highest probability – lowest probability)
Compare against other model predictions
Compare against ground truth
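A small sketch of both ideas, assuming probs is the matrix of class probabilities the model produced over a window of live traffic:

import numpy as np

pred_classes = probs.argmax(axis=1)
# live class distribution, to be compared against the training distribution
live_dist = np.bincount(pred_classes, minlength=probs.shape[1]) / len(pred_classes)

# confidence as defined on the slide: highest minus lowest class probability
confidence = probs.max(axis=1) - probs.min(axis=1)
mean_confidence = confidence.mean()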
Ground truth
• In the absence of a ground truth signal, ground truth needs to be established manually
• Can be done by data scientists themselves with good UI design
• Yields extra insights: ‘Our model does worse when Instagram filters are applied’ / ‘Many users take sideways pictures’
• Prioritize low confidence predictions for active learning
• Sample randomly for monitoring
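A sketch of that two-queue sampling strategy (confidence, X_live, the queue sizes, and request_label are assumptions carried over from the earlier sketches):

import numpy as np

# low-confidence predictions go to the labeling queue for active learning
active_idx = np.argsort(confidence)[:100]

# an unbiased random sample goes to the labeling queue for monitoring accuracy
monitor_idx = np.random.choice(len(confidence), size=50, replace=False)

for i in np.union1d(active_idx, monitor_idx):
    request_label(X_live[i])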
Implementation Example: Prodigy
Alerting / Monitoring is a UI/UX problem
• The terms might be very hard to explain or interpret
• Who here would comfortably write down the formula for KL divergence and explain what it means?
• Key metrics are different depending on the use case
• Non-data scientists might have to make decisions based on alerts
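For reference, the formula in question, with P the training distribution and Q the live distribution:

D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}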
Alerting Example
[Bar chart: training vs. live distribution of dog breeds (Husky, Chihuahua, Mastiff, Pug, Labrador, Poodle, Retriever, Terrier), counts 0–25; series: Train, Live]
Alerting Example
• Detected KL(Train ‖ Live) = 1.56394694, which is out of bounds
• Detected model output distribution significantly different from training data
• Detected an unexpected number of pictures classified as Pugs
Model accountability
• Who made the decision?
  • Model versioning: all versions need to be retained
• On which grounds was the decision made?
  • All input data needs to be retained and must be linked to a transaction ID
• Why was the decision made?
  • Use tools like LIME to interpret models
  • Still a long way to interpretable deep models, but we are getting there
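As a rough sketch of the LIME idea (shown for a tabular model; LIME also ships image and text explainers), assuming X_train, feature_names, class_names, a fitted model, and a live example x_live exist:

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=class_names,
    mode="classification",
)

# explain a single live decision, stored alongside its transaction ID for accountability
explanation = explainer.explain_instance(x_live, model.predict_proba, num_features=5)
print(explanation.as_list())  # per-feature contributions to this prediction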
nth order effects
[Diagram: a spectrum from model metrics (accuracy) via user behavior (e.g. CTR) and business metrics (revenue) to societal impact; moving along the spectrum, effects become harder to monitor and larger in impact]
Large impact effects…
• … are hard to monitor
• … are not what data scientists are trained for
• … only show with large scale deployment
• … are time delayed
• … are influenced by exogenous factors, too
Monitoring high order effects
Users are desperate to improve your model, let them!
User input is a meta metric showing how well your model selection does
Implementation example
Host the monitoring system as a separate microservice
Use Flask to serve the model
The Flask service calls the monitor
Alternatively, the client can call the monitor
A simple monitoring system with Flask
[Architecture diagram: the user sends an image to the Keras + Flask serving app and receives a classification; the serving app forwards the image + classification to a SciKit + Flask monitor, which stores each transaction in a transaction DB and sends alerts to the data scientist; the data scientist provides benchmark data to the monitor]
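A minimal sketch of the serving-to-monitor call, assuming the monitor microservice exposes a /monitor endpoint (the URL, payload shape, and report_to_monitor helper are illustrative, not part of the original deck):

import requests

MONITOR_URL = "http://monitor:5001/monitor"  # hypothetical address of the monitoring microservice

def report_to_monitor(transaction_id, predictions):
    """Forward a served prediction to the monitor without blocking the response path."""
    payload = {"transaction_id": transaction_id, "predictions": predictions}
    try:
        requests.post(MONITOR_URL, json=payload, timeout=1.0)
    except requests.RequestException:
        pass  # monitoring failures must never take down serving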
Bare Bones Flask Serving
import flask
from keras.applications.resnet50 import ResNet50, decode_predictions

app = flask.Flask(__name__)
model = ResNet50(weights="imagenet")  # illustrative pretrained model; the slide does not name one

@app.route("/predict", methods=["POST"])
def predict():
    data = {"success": False}
    image = flask.request.files["image"].read()
    image = prepare_image(image, target=(224, 224))  # helper (defined elsewhere): decode, resize, preprocess
    preds = model.predict(image)
    results = decode_predictions(preds)
    data["predictions"] = []
    for (_, label, prob) in results[0]:  # decode_predictions yields (id, label, prob) triples
        r = {"label": label, "probability": float(prob)}
        data["predictions"].append(r)
    data["success"] = True
    return flask.jsonify(data)
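Assuming the sketch above runs locally on port 5000, calling it could look like this (endpoint name and file are illustrative):

import requests

resp = requests.post("http://localhost:5000/predict",
                     files={"image": open("dog.jpg", "rb")})
print(resp.json())  # e.g. {"success": true, "predictions": [{"label": "pug", "probability": 0.83}, ...]}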
Statistical monitoring with SciKit
import numpy as np
import scipy.stats

# pk, qk: class distributions to compare (e.g. training benchmark vs. live predictions);
# lookup maps class index -> label, alert() is the notification hook
ent = scipy.stats.entropy(pk, qk, base=2)  # KL(pk || qk) in bits
if ent > threshold:
    abs_diff = np.abs(pk - qk)
    worst_offender = lookup[np.argmax(abs_diff)]  # class that drifted the most
    max_deviation = np.max(abs_diff)
    alert(model_id, ent,
          worst_offender, max_deviation)
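A sketch of how pk and qk might be built from predicted class labels (n_classes, the label arrays, and the threshold value are assumptions):

import numpy as np

def class_distribution(labels, n_classes):
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return (counts + 1e-9) / (counts.sum() + n_classes * 1e-9)  # smooth to avoid zero bins and infinite KL

pk = class_distribution(train_pred_labels, n_classes)  # benchmark from training data
qk = class_distribution(live_pred_labels, n_classes)   # rolling window of live traffic
threshold = 0.5  # bits; tune on historical data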
Data science teams should own the whole process:
Define approach → Feature engineering → Train model → Deploy → Monitor
Unsolved challenges
• Model versioning
• Dataset versioning
• Continuous Integration for data scientists
• Communication and understanding of model
metrics in the Org
• Managing higher order effects
Recommended reading
• Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
• Breck et al. (2016), What’s your ML Test Score? A rubric for ML production systems. https://ai.google/research/pubs/pub45742
• How Zendesk Serves TensorFlow Models in Production. https://medium.com/zendesk-engineering/how-zendesk-serves-tensorflow-models-in-production-751ee22f0f4b
• Machine Learning for Finance ;) https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-finance