
Why APM Is Not the Same As ML Monitoring


Application performance monitoring (APM) has become the cornerstone of software engineering, allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent applications built with machine learning, traditional APM quickly becomes insufficient to identify and remedy the production issues these modern applications encounter.

As a lead software engineer at New Relic, I built high-performance monitoring systems with my team, including Insights, Mobile, and SixthSense. When I transitioned to building ML monitoring software, I found that the architectural principles and design choices underlying APM were not a good fit for this brand-new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.

In this talk, I draw on my team's and my own experience building an ML monitoring system from the ground up and deploying it on customer workloads, spanning large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM fail to carry over to ML monitoring. You'll learn why, see what ML monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML monitoring architecture.



  1. ML Monitoring is not APM. Cory A. Johannsen, Product Engineer, Verta Inc. www.verta.ai
  2. Agenda ▴ What is APM? ▴ What is ML monitoring? ▴ How ML monitoring and APM differ ▴ The unique needs of ML monitoring ▴ A very cool solution to model monitoring from Verta
  3. About https://www.verta.ai/product - End-to-end MLOps platform for ML model delivery, operations, and management - Kubernetes-based operations stack for ML - 23 years as a software engineer - Embedded systems, enterprise software, SaaS - 6 years in APM working at scale
  4. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  5. What is APM?
  6. What is APM? ▴ Application Performance Monitoring ▴ Metrics ○ Name ○ Value ○ Labels ○ Timestamp ▴ Visualization ▴ Alerting
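
To make the shape of an APM data point concrete, here is a minimal sketch in Python. The fields mirror the slide; the Metric class itself and the sample metric names are hypothetical illustrations, not any particular vendor's API.

    from dataclasses import dataclass, field
    from time import time

    @dataclass
    class Metric:
        """One APM data point: a named value with labels and a timestamp."""
        name: str
        value: float
        labels: dict = field(default_factory=dict)
        timestamp: float = field(default_factory=time)

    # A request-latency sample tagged with the service and endpoint it came from.
    sample = Metric(
        name="http.request.duration_ms",
        value=42.7,
        labels={"service": "checkout", "endpoint": "/api/cart"},
    )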
  7. What do I care about monitoring in APM? ▴ Health ▴ Availability ▴ Performance ▴ Stability ▴ Notification
  8. APM in practice ▴ Production operations ▴ Diagnostics and debugging ▴ Critical incident response
  9. What is Model Monitoring?
  10. Ensuring model results are consistently of high quality ▴ Know when models are failing ▴ Quickly find the root cause ▴ Close the loop by fast recovery *We refer to latency, throughput, etc. collectively as model service health
  11. Know when a model fails ▴ Without ground truth, model failures are challenging to detect ▴ Need to monitor complex statistical summaries ○ Distributions, anomalies, missing values, quantiles, etc. ○ Often model-specific
  Quickly find the root cause ▴ A model is one part of an inference pipeline ▴ Need a global view of the pipeline jungle to see where the root issue may be
  Close the loop ▴ Intelligent detection and alerting to pre-emptively identify issues and trigger remediation ▴ Execute re-trains, fallback models, and human intervention
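
As one illustration of the statistical summaries slide 11 calls for, below is a minimal profiler sketch using numpy. The function name, the choice of statistics, and the bin count are assumptions made for illustration.

    import numpy as np

    def profile_feature(values):
        """Compute simple statistical summaries for one feature column.

        `values` is a 1-D sequence that may contain NaNs for missing entries.
        """
        arr = np.asarray(values, dtype=float)
        present = arr[~np.isnan(arr)]
        return {
            "missing_rate": 1.0 - present.size / arr.size,
            "quantiles": dict(zip(
                ("p25", "p50", "p75", "p99"),
                np.quantile(present, [0.25, 0.50, 0.75, 0.99]),
            )),
            "histogram": np.histogram(present, bins=10),
        }

    summary = profile_feature([1.0, float("nan"), 2.5, 3.0, float("nan"), 4.1])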
  12. How APM and ML monitoring align ▴ Error rate, throughput, latency ○ You need to know your production systems are operational ▴ Visualization ○ You need to see change over time ▴ Alerting ○ You need to know when something has gone wrong (and only when something has gone wrong)
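
Where the two disciplines align, the basic service-health numbers are computed the same way in both worlds. A small sketch of deriving error rate, throughput, and tail latency from one window of request records; the field names and the 5xx-counts-as-error convention are assumptions.

    import numpy as np

    def service_health(statuses, latencies_ms, window_seconds):
        """Error rate, throughput, and tail latency for one time window."""
        statuses = np.asarray(statuses)
        latencies_ms = np.asarray(latencies_ms, dtype=float)
        return {
            "error_rate": float(np.mean(statuses >= 500)),
            "throughput_rps": statuses.size / window_seconds,
            "latency_p50_ms": float(np.percentile(latencies_ms, 50)),
            "latency_p99_ms": float(np.percentile(latencies_ms, 99)),
        }

    health = service_health(
        statuses=[200, 200, 500, 200, 503],
        latencies_ms=[12.0, 15.2, 240.0, 9.8, 310.5],
        window_seconds=60,
    )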
  13. What do you care about in ML monitoring? ▴ Distributions ○ Training versus test ○ Iteration over iteration ○ Live predictions ▴ Drift ○ Change in distribution over time
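
One common way to quantify a change in distribution over time for a numeric feature is a two-sample Kolmogorov-Smirnov test. The sketch below uses scipy with synthetic data; both the test and the fixed p-value threshold are illustrative choices, not a prescription.

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    training = rng.normal(loc=0.0, scale=1.0, size=10_000)  # reference window
    live = rng.normal(loc=0.3, scale=1.0, size=2_000)       # shifted production window

    stat, p_value = ks_2samp(training, live)
    if p_value < 0.01:
        print(f"drift suspected: KS statistic={stat:.3f}, p={p_value:.2e}")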
  14. How APM and ML monitoring differ ▴ Error rate, throughput, latency ○ Necessary, but no longer sufficient ▴ Not all work is production work ○ ML monitoring happens from the beginning of the pipeline ▴ APM can tell you what is wrong ○ ML monitoring is about understanding why
  15. What makes ML monitoring unique ▴ Quantitative analysis of model performance ○ Information you can use ▴ Controlled comparison of distributions ○ Repeatable ○ Reliable ○ Consistent ▴ Alerting on meaningful deviation ○ Actionable ○ Timely ○ Accurate
  16. Only you know the shape of your data ▴ Every model and pipeline is different and specialized ○ You built them, you understand them ▴ You know what metrics and distributions are valuable ○ This is your model; you know the data and processes that created it ▴ You know the expected distributions ○ You can determine whether the behavior is correct
  17. Only you know how to measure change ▴ Compare to a reference set ○ Training, test, or golden data set ▴ Compare to a baseline ○ Calculate a baseline from your data or production systems ▴ Compare to something else ○ Use a comparison that makes sense in your domain
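
The Population Stability Index (PSI) is one widely used score for the compare-to-a-reference-set approach. This is a generic sketch, not Verta's implementation; binning by the reference sample and the 0.2 rule of thumb are conventional but arbitrary choices.

    import numpy as np

    def psi(reference, current, bins=10):
        """Population Stability Index between a reference and a current sample."""
        edges = np.histogram_bin_edges(reference, bins=bins)
        ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
        cur_pct = np.histogram(current, bins=edges)[0] / len(current)
        # Avoid log(0) when a bin is empty in either sample.
        ref_pct = np.clip(ref_pct, 1e-6, None)
        cur_pct = np.clip(cur_pct, 1e-6, None)
        return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

    # Rule of thumb: PSI > 0.2 often signals a shift worth investigating.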
  18. Only you know when a change matters ▴ You know your model and tolerances ▴ You know when a deviation is significant (or not!) ▴ You know when these conditions need to change
  19. Verta understands model monitoring ▴ Designed for your workflows ▴ Easy integration to capture your monitoring data ▴ Visualize and understand your metrics, distributions, and drift ▴ Get alerted when you should, not otherwise
  20. Introducing a generalized framework for Model Monitoring
  21. Concepts ▴ Monitored Entity: a reference name (e.g., a model or pipeline) that you want to monitor ▴ Profiler: a function that computes statistics about your data ▴ Summary: a collection of statistics about your data (the output of a profiler) ○ Samples: instances of a summary, i.e., individual statistics ○ Labels: key-value pairs attached to summary samples, used for rich filtering and aggregation ▴ Alerter: triggered periodically, it can talk to the Verta API to fetch information about summaries and identify whether they look wrong
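
To make the relationships between these concepts concrete, here is a hypothetical sketch using plain Python dataclasses. It models the vocabulary of the slide and is not the actual Verta client API.

    from dataclasses import dataclass, field
    from time import time
    from typing import Callable, Dict, List

    @dataclass
    class SummarySample:
        """One instance of a summary: a statistic plus labels and a timestamp."""
        summary_name: str
        value: float
        labels: Dict[str, str] = field(default_factory=dict)
        timestamp: float = field(default_factory=time)

    @dataclass
    class MonitoredEntity:
        """A named model or pipeline whose summaries we track."""
        name: str
        samples: List[SummarySample] = field(default_factory=list)

        def run_profiler(self, profiler: Callable[[object], List[SummarySample]], data) -> None:
            """A profiler is just a function from data to summary samples."""
            self.samples.extend(profiler(data))

    def alerter(entity: MonitoredEntity, summary_name: str, threshold: float):
        """Triggered periodically: return the samples of one summary that look wrong."""
        return [s for s in entity.samples
                if s.summary_name == summary_name and s.value > threshold]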
  22. How does it work? 1. Define the monitored entity (e.g., a model, dataset, or pipeline) 2. Define the summaries to monitor for that entity 3. Run profilers (manually or automatically) to produce summary samples 4. View samples and define alerts 5. Get alerted (e.g., via Slack) 6. Close the loop!
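
The six steps map onto code roughly as follows. Everything here (the entity name, the missing-rate summary, the threshold, the print standing in for a Slack webhook) is illustrative rather than the real Verta workflow.

    import numpy as np

    # 1. Define the monitored entity (here, just a name).
    entity = "fraud-model-v3"

    # 2-3. Define a summary and run a profiler over a batch of feature values.
    batch = np.array([1.2, np.nan, 0.7, np.nan, 2.4])
    missing_rate = float(np.mean(np.isnan(batch)))  # one summary sample

    # 4. Define an alert: fire when missing values exceed a tolerance.
    THRESHOLD = 0.25

    # 5. Get alerted (a print standing in for a Slack notification).
    if missing_rate > THRESHOLD:
        print(f"[{entity}] ALERT: missing_rate={missing_rate:.2f} > {THRESHOLD}")
        # 6. Close the loop: trigger a retrain, rollback, or human review here.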
  23. How does it work? [Architecture diagram: data/model pipelines feed live and batch models; predictions flow into a prediction log and a time-series DB of statistical summaries; ground truth is joined in; alerts drive remediation via retraining, rollback, or a human in the loop]
  24. Summary ▴ Performance monitoring is no longer sufficient for the needs of modern ML systems ○ Model monitoring starts at the beginning of the pipeline and continues through production ○ Batch and live can be addressed in the same framework ▴ Knowing something is wrong is not enough; you need to know why ▴ Timely, actionable alerting is mandatory ▴ Building these tools in-house is difficult, error-prone, and expensive ▴ Spark is a fantastic tool for enabling model monitoring
  25. Monitor Your Models with Verta ▴ Visit monitoring.verta.ai today and see it in action ▴ Join our community ▴ Get more out of your models ▴ Get more out of your alerts
  26. Thank you. Cory A. Johannsen, Product Engineer, Verta Inc. www.verta.ai
