Monitoring Models in Production
Keeping track of complex models in a complex world
Jannes Klaas
About me
• International Business @ RSM; Financial Economics @ Oxford Saïd
• Course developer, machine learning @ Turing Society
• Author, “Machine Learning for Finance”, out in July
• ML consultant for non-profits / impact investors
• Previously: Urban Planning @ IHS Rotterdam & destroyer of my own startup
The life and times of an ML practitioner
“We send you the data, you send us back a model, then we take it from there” – Consulting Clients
“Define an approach, evaluate on a common benchmark and publish” – Academia
Repeat after me
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
It is not done after we ship
Machine learning 101
Estimate some function y = f(x) using (x, y) pairs
The estimated function hopefully represents the true relationship between x and y
The model is a function of the data
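A minimal sketch of that setup, assuming scikit-learn and a toy one-dimensional problem (everything here is illustrative, not from the slides):

import numpy as np
from sklearn.linear_model import LinearRegression

# toy (x, y) pairs; in practice these come from your training set
x = np.arange(100, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + np.random.normal(scale=2.0, size=100)

f_hat = LinearRegression().fit(x, y)   # estimate f from the (x, y) pairs
y_pred = f_hat.predict(x)              # hopefully close to the true f(x)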
Problems you encounter in
production
• The world changes, your training data might
no longer depict the real world
• Your model inputs might change
• There might be unintended bugs and side
effects in complex models
• Models influence the world they try to model
• Model decay: Your model usually becomes
worse over time
Are models a liability after shipping?
No, the real world is the perfect training environment
Datasets are only an approximation of the real world
Active learning on real world examples can greatly reduce your data needs
Online learning
• Update the model continuously as new data streams in
• Good if you have a continuous stream of ground truth as well
• Needs more monitoring to ensure the model does not go off track
• Can be expensive for big models
• Might need separate training / inference hardware
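A minimal sketch of the online-learning loop, assuming scikit-learn's SGDClassifier, a binary problem, and hypothetical stream_batches() / log_metrics() helpers standing in for your own data pipeline and monitoring:

from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = [0, 1]  # all classes must be declared for partial_fit

# stream_batches() yields (features, ground-truth labels) as new data arrives
for X_batch, y_batch in stream_batches():
    model.partial_fit(X_batch, y_batch, classes=classes)
    log_metrics(model, X_batch, y_batch)  # monitor after every update so the model cannot silently drift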
Active learning
Make predictions
Request labels for low-confidence examples
Train on those ‘special cases’
Production is an opportunity for learning
Monitoring is part of training
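A rough sketch of that loop in Python (model, X_live, the labeled data, request_label, and the confidence threshold are all placeholders, not part of the original deck):

import numpy as np

CONFIDENCE_THRESHOLD = 0.6  # tune per use case

probs = model.predict_proba(X_live)      # make predictions
confidence = probs.max(axis=1)           # top-class probability
uncertain = np.where(confidence < CONFIDENCE_THRESHOLD)[0]

for i in uncertain:
    request_label(X_live[i])             # hypothetical labeling queue (e.g. a tool like Prodigy)

# once labels come back, train on those 'special cases'
model.partial_fit(X_labeled, y_labeled)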
Model monitoring vs Ops monitoring
• Model monitoring tracks model behavior
• Inherently stochastic
• Can be driven by user behavior
• Almost certainly looking for unknown unknowns
• Few established guidelines on what to monitor
Monitoring inputs
• E.g. images arriving at the model are very small, very dark, high contrast, etc.
• More similar to ops monitoring, as there can be obvious failures
• Monitor classic stats and compare to training data: means, standard deviations, correlations, KL divergence between training & live data
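A minimal sketch of such an input check, assuming live_batch is a NumPy array of incoming inputs, train_stats holds per-feature means and standard deviations computed on the training set, and alert() is a placeholder notification hook:

import numpy as np

live_mean = live_batch.mean(axis=0)

# flag features whose live mean drifted more than 3 training standard deviations
drift = np.abs(live_mean - train_stats["mean"]) / (train_stats["std"] + 1e-9)
drifted = np.where(drift > 3.0)[0]
if drifted.size > 0:
    alert("input drift", drifted.tolist())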
Output monitoring
Harder: people might just upload more plane images one day
Monitoring the prediction distribution is surprisingly helpful
Monitor confidence (highest probability – lowest probability)
Compare against other model predictions
Compare against ground truth
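A small sketch of both ideas, assuming probs is the matrix of class probabilities the model produced over a window of live traffic:

import numpy as np

pred_classes = probs.argmax(axis=1)
# live class distribution, to be compared against the training distribution
live_dist = np.bincount(pred_classes, minlength=probs.shape[1]) / len(pred_classes)

# confidence as defined on the slide: highest minus lowest class probability
confidence = probs.max(axis=1) - probs.min(axis=1)
mean_confidence = confidence.mean()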
Ground truth
• In the absence of a ground truth signal, ground truth needs to be established manually
• Can be done by data scientists themselves with good UI design
• Yields extra insights: ‘Our model does worse when Instagram filters are applied’ / ‘Many users take sideways pictures’
• Prioritize low confidence predictions for active learning
• Sample randomly for monitoring
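A sketch of that two-queue sampling strategy (confidence, X_live, the queue sizes, and request_label are assumptions carried over from the earlier sketches):

import numpy as np

# low-confidence predictions go to the labeling queue for active learning
active_idx = np.argsort(confidence)[:100]

# an unbiased random sample goes to the labeling queue for monitoring accuracy
monitor_idx = np.random.choice(len(confidence), size=50, replace=False)

for i in np.union1d(active_idx, monitor_idx):
    request_label(X_live[i])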
Implementation Example: Prodigy
Alerting / Monitoring is a UI/UX problem
• The terms might be very hard to explain or interpret
• Who here would comfortably write down the formula for KL divergence and explain what it means?
• Key metrics are different depending on the use case
• Non-data scientists might have to make decisions based on alerts
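For reference, the formula in question, with P the training distribution and Q the live distribution:

D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}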
Alerting Example
[Bar chart: training vs. live distribution of dog breeds (Husky, Chihuahua, Mastiff, Pug, Labrador, Poodle, Retriever, Terrier), counts 0–25; series: Train, Live]
Alerting Example
• Detected KL(Train ‖ Live) = 1.56394694, which is out of bounds
• Detected model output distribution significantly different from training data
• Detected an unexpected number of pictures classified as Pugs
Model accountability
• Who made the decision?
  • Model versioning: all versions need to be retained
• On which grounds was the decision made?
  • All input data needs to be retained and must be linked to a transaction ID
• Why was the decision made?
  • Use tools like LIME to interpret models
  • Still a long way to interpretable deep models, but we are getting there
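As a rough sketch of the LIME idea (shown for a tabular model; LIME also ships image and text explainers), assuming X_train, feature_names, class_names, a fitted model, and a live example x_live exist:

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=class_names,
    mode="classification",
)

# explain a single live decision, stored alongside its transaction ID for accountability
explanation = explainer.explain_instance(x_live, model.predict_proba, num_features=5)
print(explanation.as_list())  # per-feature contributions to this prediction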
nth order effects
[Diagram: a spectrum from model metrics (accuracy) via user behavior (e.g. CTR) and business metrics (revenue) to societal impact; moving along the spectrum, effects become harder to monitor and larger in impact]
Large impact effects…
• … are hard to monitor
• … are not what data scientists are trained for
• … only show with large scale deployment
• … are time delayed
• … are influenced by exogenous factors, too
Monitoring high order effects
Users are desperate to improve your model, let them!
User input is a meta metric showing how well your model selection does
Implementation example
Host the monitoring system as a separate microservice
Use Flask to serve the model
The Flask service calls the monitor
Alternatively, the client can call the monitor
A simple monitoring system with Flask
[Architecture diagram: the user sends an image to the Keras + Flask serving app and receives a classification; the serving app forwards the image + classification to a SciKit + Flask monitor, which stores each transaction in a transaction DB and sends alerts to the data scientist; the data scientist provides benchmark data to the monitor]
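A minimal sketch of the serving-to-monitor call, assuming the monitor microservice exposes a /monitor endpoint (the URL, payload shape, and report_to_monitor helper are illustrative, not part of the original deck):

import requests

MONITOR_URL = "http://monitor:5001/monitor"  # hypothetical address of the monitoring microservice

def report_to_monitor(transaction_id, predictions):
    """Forward a served prediction to the monitor without blocking the response path."""
    payload = {"transaction_id": transaction_id, "predictions": predictions}
    try:
        requests.post(MONITOR_URL, json=payload, timeout=1.0)
    except requests.RequestException:
        pass  # monitoring failures must never take down serving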
Bare Bones Flask Serving
import flask
from keras.applications.resnet50 import ResNet50, decode_predictions

app = flask.Flask(__name__)
model = ResNet50(weights="imagenet")  # illustrative pretrained model; the slide does not name one

@app.route("/predict", methods=["POST"])
def predict():
    data = {"success": False}
    image = flask.request.files["image"].read()
    image = prepare_image(image, target=(224, 224))  # helper (defined elsewhere): decode, resize, preprocess
    preds = model.predict(image)
    results = decode_predictions(preds)
    data["predictions"] = []
    for (_, label, prob) in results[0]:  # decode_predictions yields (id, label, prob) triples
        r = {"label": label, "probability": float(prob)}
        data["predictions"].append(r)
    data["success"] = True
    return flask.jsonify(data)
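Assuming the sketch above runs locally on port 5000, calling it could look like this (endpoint name and file are illustrative):

import requests

resp = requests.post("http://localhost:5000/predict",
                     files={"image": open("dog.jpg", "rb")})
print(resp.json())  # e.g. {"success": true, "predictions": [{"label": "pug", "probability": 0.83}, ...]}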
Statistical monitoring with SciKit
import numpy as np
import scipy.stats

# pk, qk: class distributions to compare (e.g. training benchmark vs. live predictions);
# lookup maps class index -> label, alert() is the notification hook
ent = scipy.stats.entropy(pk, qk, base=2)  # KL(pk || qk) in bits
if ent > threshold:
    abs_diff = np.abs(pk - qk)
    worst_offender = lookup[np.argmax(abs_diff)]  # class that drifted the most
    max_deviation = np.max(abs_diff)
    alert(model_id, ent,
          worst_offender, max_deviation)
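A sketch of how pk and qk might be built from predicted class labels (n_classes, the label arrays, and the threshold value are assumptions):

import numpy as np

def class_distribution(labels, n_classes):
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return (counts + 1e-9) / (counts.sum() + n_classes * 1e-9)  # smooth to avoid zero bins and infinite KL

pk = class_distribution(train_pred_labels, n_classes)  # benchmark from training data
qk = class_distribution(live_pred_labels, n_classes)   # rolling window of live traffic
threshold = 0.5  # bits; tune on historical data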
Data science teams should own the whole process:
Define approach → Feature engineering → Train model → Deploy → Monitor
Unsolved challenges
• Model versioning
• Dataset versioning
• Continuous Integration for data scientists
• Communication and understanding of model
metrics in the Org
• Managing higher order effects
Recommended reading
• Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems. https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
• Breck et al. (2016), What’s your ML Test Score? A rubric for ML production systems. https://ai.google/research/pubs/pub45742
• How Zendesk Serves TensorFlow Models in Production. https://medium.com/zendesk-engineering/how-zendesk-serves-tensorflow-models-in-production-751ee22f0f4b
• Machine Learning for Finance ;) https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-finance