Monitoring Models in Production


How to monitor machine learning models in production. Given at PyData Amsterdam 2018.

Published in: Technology

  1. Monitoring Models in Production: Keeping track of complex models in a complex world. Jannes Klaas
  2. About me
     International Business @ RSM
     Financial Economics @ Oxford Saïd
     Course developer machine learning @ Turing Society
     Author “Machine Learning for Finance”, out in July
     ML consultant for non-profits / impact investors
     Prev. Urban Planning @ IHS Rotterdam & Destroyer of my Startup
  3. The life and times of an ML practitioner
     “We send you the data, you send us back a model, then we take it from there” – Consulting clients
     “Define an approach, evaluate on common benchmark and publish” – Academia
  4. Repeat after me
     It is not done after we ship. It is not done after we ship. It is not done after we ship. It is not done after we ship. It is not done after we ship. It is not done after we ship. It is not done after we ship. It is not done after we ship.
  5. Machine learning 101
     Estimate some function y = f(x) using (x, y) pairs
     The estimated function hopefully represents the true relationship between x and y
     The model is a function of the data
  6. Problems you encounter in production
     • The world changes; your training data might no longer depict the real world
     • Your model inputs might change
     • There might be unintended bugs and side effects in complex models
     • Models influence the world they try to model
     • Model decay: your model usually becomes worse over time
  7. Are models a liability after shipping?
     No, the real world is the perfect training environment
     Datasets are only an approximation of the real world
     Active learning on real-world examples can greatly reduce your data needs
  8. Online learning
     • Update the model continuously as new data streams in
     • Good if you have a continuous stream of ground truth as well
     • Needs more monitoring to ensure the model does not go off track
     • Can be expensive for big models
     • Might need separate training / inference hardware
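
The online learning loop can be sketched as follows. This is a minimal illustration, not the method from the talk: it assumes a one-feature linear model updated by one gradient step per example, and the class name `OnlineLinearModel`, the learning rate and the smoothing factor are all hypothetical choices. The smoothed error is the kind of signal one could watch to notice a model going off track.

```python
class OnlineLinearModel:
    """Tiny linear model y ~ w*x + b, updated one example at a time."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate
        self.running_error = None  # exponentially smoothed absolute error

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y_true):
        """One gradient step on squared error; returns the error before the step."""
        error = self.predict(x) - y_true
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error
        abs_error = abs(error)
        if self.running_error is None:
            self.running_error = abs_error
        else:
            self.running_error = 0.9 * self.running_error + 0.1 * abs_error
        return abs_error


model = OnlineLinearModel()
# Simulated stream of (x, y) pairs with ground truth y = 2x + 1 (made-up data)
stream = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)] * 200
for x, y in stream:
    model.update(x, y)
```

In a real deployment the ground truth would arrive from the live system rather than from a simulated list, and a rising `running_error` would be a reason to alert rather than to keep updating blindly.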
  9. Active learning
     Make predictions
     Request labels for low-confidence examples
     Train on those ‘special cases’
     Production is an opportunity for learning
     Monitoring is part of training
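
The selection step of the active learning loop above might look like the sketch below. It assumes predictions arrive as per-class probability lists and uses the top probability as the confidence score; the function name `select_for_labelling`, the threshold of 0.6 and the example ids are hypothetical.

```python
def select_for_labelling(predictions, confidence_threshold=0.6):
    """Return ids of examples whose top class probability is below the threshold.

    predictions: dict mapping example id -> list of class probabilities.
    """
    queue = []
    for example_id, probs in predictions.items():
        confidence = max(probs)
        if confidence < confidence_threshold:
            queue.append((confidence, example_id))
    # Least confident first, so labelling effort goes where the model is most unsure
    queue.sort()
    return [example_id for _, example_id in queue]


# Made-up live predictions for illustration
live_predictions = {
    "img_001": [0.97, 0.02, 0.01],
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.55, 0.44, 0.01],
    "img_004": [0.88, 0.10, 0.02],
}
to_label = select_for_labelling(live_predictions)
```

The examples returned here would be sent to a human for labelling and then added to the next training round.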
  10. Model monitoring vs Ops monitoring
      • Model monitoring monitors model behavior
      • Inherently stochastic
      • Can be driven by user behavior
      • Almost certainly looking for unknown unknowns
      • Few established guidelines on what to monitor
  11. Monitoring inputs
      • E.g. images arriving at the model very small, very dark, high contrast, etc.
      More similar to ops monitoring, as there can be obvious failures
      Monitor classic stats and compare them to the training data:
      • Means
      • Standard deviations
      • Correlations
      • KL divergence between training & live data
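
Comparing classic input statistics against the training data could be sketched like this. The example assumes a single scalar feature per image (mean brightness) and flags drift when the live mean moves more than one training standard deviation; the function names, the threshold and the numbers are hypothetical.

```python
import statistics


def summarize(values):
    """Mean and (population) standard deviation of one input feature."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def input_drift(train_values, live_values, max_shift_in_std=1.0):
    """Flag a feature whose live mean moved more than max_shift_in_std
    training standard deviations away from the training mean."""
    train = summarize(train_values)
    live = summarize(live_values)
    if train["std"] == 0:
        raise ValueError("training feature is constant; cannot scale the shift")
    shift = abs(live["mean"] - train["mean"]) / train["std"]
    return {"train": train, "live": live, "shift_in_std": shift,
            "drifted": shift > max_shift_in_std}


# Made-up per-image mean brightness (0 = black, 1 = white)
train_brightness = [0.52, 0.48, 0.55, 0.50, 0.47, 0.53, 0.49, 0.51]
live_brightness = [0.12, 0.15, 0.10, 0.18, 0.14, 0.11, 0.16, 0.13]  # very dark images
report = input_drift(train_brightness, live_brightness)
```

The same pattern extends to standard deviations or correlations by comparing those summaries instead of the mean.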
  12. Output monitoring
      Harder than input monitoring: people might just upload more plane images one day
      Monitoring the prediction distribution is surprisingly helpful
      Monitor confidence (highest probability – lowest probability)
      Compare against other models' predictions
      Compare against ground truth
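
Two of the output checks above, the prediction distribution and the confidence measure, can be sketched in a few lines. The confidence function follows the slide's definition (highest minus lowest probability); the function names and the sample labels are hypothetical.

```python
from collections import Counter


def prediction_distribution(predicted_labels):
    """Share of predictions per class."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return {label: count / total for label, count in counts.items()}


def confidence(probs):
    """Confidence as defined on the slide: highest minus lowest probability."""
    return max(probs) - min(probs)


# Made-up batch of live predictions
live_labels = ["pug", "pug", "husky", "pug", "poodle", "pug", "pug", "husky"]
dist = prediction_distribution(live_labels)
margin = confidence([0.7, 0.2, 0.1])
```

The resulting distribution is what would be compared against the training distribution or against another model's predictions.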
  13. Ground truth
      • In the absence of a ground truth signal, ground truth needs to be established manually
      • Can be done by data scientists themselves with good UI design
      • Yields extra insights: ‘Our model does worse when Instagram filters are applied’ / ‘Many users take sideways pictures’
      • Prioritize low-confidence predictions for active learning
      • Sample randomly for monitoring
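
Random sampling for monitoring, as opposed to prioritizing low-confidence cases, gives an unbiased estimate of live accuracy. The sketch below assumes transactions are identified by string ids and that reviewers return (predicted, human) label pairs; the function names, the sample size and the seed are hypothetical.

```python
import random


def sample_for_review(transaction_ids, sample_size, seed=None):
    """Uniform random sample of transactions to label by hand."""
    rng = random.Random(seed)
    return rng.sample(transaction_ids, min(sample_size, len(transaction_ids)))


def accuracy(reviewed):
    """reviewed: list of (predicted_label, human_label) pairs."""
    correct = sum(1 for predicted, truth in reviewed if predicted == truth)
    return correct / len(reviewed)


# Made-up transaction ids
transaction_ids = [f"tx_{i:04d}" for i in range(1000)]
review_batch = sample_for_review(transaction_ids, sample_size=50, seed=42)
```

Accuracy measured only on the low-confidence queue would understate how the model does overall, which is why the two samples are kept separate.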
  14. Implementation example: Prodigy
  15. Alerting / Monitoring is a UI/UX problem
      • The terms might be very hard to explain or interpret
      • Who here would comfortably write down the formula for KL divergence and explain what it means?
      • Key metrics differ depending on the use case
      • Non-data-scientists might have to make decisions based on alerts
  16. Alerting example
      [Bar chart: Training versus live distribution of dog breeds. Breeds: Husky, Chihuahua, Mastiff, Pug, Labrador, Poodle, Retriever, Terrier. Series: Train, Live. Axis range: 0 to 25.]
  17. Alerting example
      • Detected $D_{KL}(\mathrm{Train} \,\|\, \mathrm{Live}) = 1.56394694$, which is out of bounds
      • Detected model output distribution significantly different from training data
      • Detected an unexpected amount of pictures classified as Pugs
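
The step from a raw statistic to an alert a non-specialist can act on could be sketched as below. The KL divergence is computed directly from its definition, D_KL(p || q) = sum over classes of p_i * log2(p_i / q_i); the function names, the threshold and the three-class distributions are hypothetical, and the sketch assumes every live probability is nonzero (for instance after smoothing).

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions over the same classes.

    Assumes q has no zero entries where p is positive.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def describe_shift(classes, train_dist, live_dist, threshold):
    """Return None if within bounds, else a plain-language alert string."""
    divergence = kl_divergence(train_dist, live_dist)
    if divergence <= threshold:
        return None
    diffs = [live - train for train, live in zip(train_dist, live_dist)]
    worst = max(range(len(classes)), key=lambda i: abs(diffs[i]))
    direction = "more" if diffs[worst] > 0 else "fewer"
    return (f"Unexpectedly {direction} pictures classified as {classes[worst]}: "
            f"{live_dist[worst]:.0%} live versus {train_dist[worst]:.0%} in training "
            f"(KL divergence {divergence:.2f} bits).")


# Made-up distributions for illustration
classes = ["Husky", "Pug", "Poodle"]
message = describe_shift(classes, [0.4, 0.2, 0.4], [0.2, 0.6, 0.2], threshold=0.1)
```

Naming the class that moved the most, rather than reporting only the divergence, is one way to make the alert readable for people who have to decide what to do about it.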
  18. Model accountability
      • Who made the decision?
      • Model versioning: all versions need to be retained
      • On which grounds was the decision made?
      • All input data needs to be retained and must be linked to a transaction ID
      • Why was the decision made?
      • Use tools like LIME to interpret models
      • Still a long way to interpretable deep models, but we are getting there
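
Linking each decision to a model version and its input might be recorded along these lines. The sketch is an assumption about what such a record could contain, not a format from the talk: the `DecisionRecord` fields, the version string and the use of a SHA-256 hash as the key under which the raw input is stored are all hypothetical.

```python
import hashlib
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DecisionRecord:
    transaction_id: str   # links the decision to the retained input
    model_version: str    # which model made the decision
    input_sha256: str     # fingerprint of the exact input that was scored
    prediction: str
    created_at: str


def record_decision(model_version, input_bytes, prediction):
    """Build an immutable audit record for one prediction."""
    return DecisionRecord(
        transaction_id=str(uuid.uuid4()),
        model_version=model_version,
        input_sha256=hashlib.sha256(input_bytes).hexdigest(),
        prediction=prediction,
        created_at=datetime.now(timezone.utc).isoformat(),
    )


record = record_decision("dog-classifier-1.3.0", b"raw image bytes", "pug")
log_line = json.dumps(asdict(record))  # ready to append to an audit log
```

The hash only identifies the input; the raw input itself still has to be retained somewhere under that key for the record to answer on which grounds the decision was made.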
  19. nth order effects
      [Diagram: effects ordered from easy to monitor / small impact to hard to monitor / large impact]
      Model metrics (Accuracy)
      User behavior (e.g. CTR)
      Business metrics (Revenue)
      Societal impact
  20. Large impact effects…
      • … are hard to monitor
      • … are not what data scientists are trained for
      • … only show with large-scale deployment
      • … are time delayed
      • … are influenced by exogenous factors, too
  21. Monitoring high order effects
      Users are desperate to improve your model, let them!
      User input is a meta metric showing how well your model selection does
  22. Implementation example
      Hosting the monitoring system as a separate microservice
      Using Flask to serve the model
      The Flask service calls the monitor
      Alternatively, the client can call the monitor
  23. A simple monitoring system with Flask
      [Architecture diagram. Components: User, Keras + Flask, SciKit + Flask, Transaction DB, Data Scientist. Labelled flows: Image, Classification, Image + Classification, Store transaction, Provide benchmark data, Alerts.]
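
The "Store transaction" and "Provide benchmark data" flows of the diagram could be backed by something as small as the sketch below. It uses SQLite from the Python standard library purely for illustration; the table layout, the function names and the model id are hypothetical and not taken from the talk.

```python
import sqlite3


def open_transaction_db(path=":memory:"):
    """Open (and if needed create) the transaction store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "model_id TEXT NOT NULL, "
        "label TEXT NOT NULL, "
        "probability REAL NOT NULL)"
    )
    return conn


def store_transaction(conn, model_id, label, probability):
    """Called by the monitor for every classification it receives."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO transactions (model_id, label, probability) VALUES (?, ?, ?)",
            (model_id, label, probability),
        )


def label_distribution(conn, model_id):
    """Share of stored predictions per label, usable as benchmark data."""
    rows = conn.execute(
        "SELECT label, COUNT(*) FROM transactions WHERE model_id = ? GROUP BY label",
        (model_id,),
    ).fetchall()
    total = sum(count for _, count in rows)
    return {label: count / total for label, count in rows}


conn = open_transaction_db()
for label, prob in [("pug", 0.9), ("pug", 0.8), ("husky", 0.7), ("pug", 0.6)]:
    store_transaction(conn, "dog-classifier", label, prob)
benchmark = label_distribution(conn, "dog-classifier")
```

Keeping the store behind a few functions like these makes it easy to host the monitor as its own service, as the previous slide suggests.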
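  24. Bare Bones Flask Serving
          import flask

          # app, model, prepare_image and decode_predictions are assumed to be
          # defined elsewhere (Flask app object, loaded Keras model, helpers).
          @app.route("/predict", methods=["POST"])
          def predict():
              data = {"success": False}
              # Read the uploaded image and resize it to the model's input size
              image = flask.request.files["image"].read()
              image = prepare_image(image, target=(224, 224))
              # Classify and decode the top predictions
              preds = model.predict(image)
              results = decode_predictions(preds)
              data["predictions"] = []
              for (label, prob) in results[0]:
                  r = {"label": label, "probability": float(prob)}
                  data["predictions"].append(r)
              data["success"] = True
              return flask.jsonify(data)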
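  25. Statistical monitoring with SciKit
          import numpy as np
          import scipy.stats

          # pk, qk: NumPy arrays holding the two class distributions to compare;
          # lookup maps a class index to its name. threshold, model_id and
          # alert are assumed to be defined elsewhere.
          # With two arguments, entropy() returns the KL divergence D(pk || qk), here in bits.
          ent = scipy.stats.entropy(pk, qk, base=2)
          if ent > threshold:
              abs_diff = np.abs(pk - qk)
              # Class whose share deviates most between the two distributions
              worst_offender = lookup[np.argmax(abs_diff)]
              max_deviation = np.max(abs_diff)
              alert(model_id, ent, worst_offender, max_deviation)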
  26. Data science teams should own the whole process
      Define approach → Feature engineering → Train model → Deploy → Monitor
  27. Unsolved challenges
      • Model versioning
      • Dataset versioning
      • Continuous integration for data scientists
      • Communication and understanding of model metrics in the org
      • Managing higher order effects
  28. Recommended reading
      • Sculley et al. (2015) Hidden Technical Debt in Machine Learning Systems in-machine-learning-systems.pdf
      • Breck et al. (2016) What’s your ML Test Score? A rubric for ML production systems
      • How Zendesk Serves TensorFlow Models in Production tensorflow-models-in-production-751ee22f0f4b
      • Machine Learning for Finance ;) data-and-business-intelligence/machine-learning-finance