Machine Learning Vital Signs: Metrics and Monitoring of AI in Production
This talk covers tracking machine learning models in production to ensure they stay reliable, consistent, and performant over time. Production models interact with the real world, and it is terrifying how often nobody has any idea how they are performing on live data. The world changes! Bias and variance can creep into your models over time, and you should know when that happens.
6. The world changes slowly
Over time the nature of the world changes
Our models will not work as well
7. The world changes abruptly
Big things can happen and fundamentally change the world
This may render previous models less useful or worthless
8. The world changes periodically
Seasonal and periodic changes happen
This can impact model effectiveness temporarily or permanently
9. Weird things happen then go away
Current events can change the world for a short period of time
Model effectiveness changes (usually for the worse) for a short period of time
10. Bugs
They will happen
Can be troublesome to detect in machine learning pipelines
11. Adversaries
Bad people exist
Could they exploit your model or training set to your detriment?
12. Proposed solution: Metrics & Monitoring
Instrument your models with “vital signs”
Catch your model in time when it is:
• Suddenly breaking
• Drifting into worthlessness
• Doing something strange
13. Machine Learning Vital Signs
• A metric from a productionized model that you can monitor for change over time
• Have alerts in place that detect:
• An unacceptable amount of drift over time
• A surprising spike of errors in a single period
• What is the average of the vital?
• What is the standard deviation of the vital?
• What are acceptable bounds for the vital?
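The stats above can be sketched as a tiny monitor class. `VitalSign`, its field names, and the 3-sigma default are illustrative assumptions, not any particular library's API:

```python
"""Minimal sketch of a vital-sign monitor: track a metric's history
and alert when a new observation falls outside acceptable bounds."""
from statistics import mean, stdev

class VitalSign:
    def __init__(self, name, history, n_sigma=3.0):
        # history: past values of this vital collected in production
        self.name = name
        self.history = list(history)
        self.n_sigma = n_sigma  # width of the acceptable band

    def bounds(self):
        """Acceptable bounds: mean +/- n_sigma standard deviations."""
        mu, sigma = mean(self.history), stdev(self.history)
        return mu - self.n_sigma * sigma, mu + self.n_sigma * sigma

    def check(self, value):
        """Return True if the new observation is within bounds."""
        lo, hi = self.bounds()
        ok = lo <= value <= hi
        self.history.append(value)  # keep tracking over time
        return ok
```

In practice the `check` result would feed an alerting system rather than just return a boolean.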
14. Vital: Accuracy
How often the model is correct or not correct
• Naturally will decrease over time
• Big dips (or jumps) can be indicative of something wrong
• Can mimic how the data was initially labeled
• Automatically labeled as part of the data
• Manually labeled… uh oh
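Accuracy as a vital can be computed per time window, with a flag for big dips or jumps. The function names and the 0.05 threshold are assumptions for illustration:

```python
def window_accuracy(preds, labels):
    """Fraction of predictions in this window that match the labels."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

def flag_accuracy_change(prev_acc, curr_acc, max_delta=0.05):
    """Flag when accuracy moves more than max_delta in one period,
    in either direction -- big jumps can be as suspicious as dips."""
    return abs(curr_acc - prev_acc) > max_delta
```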
16. Vital: Accuracy Per Label
How often the model is correct or not correct, for each potential output label
• More fine-grained than Accuracy
• Can catch problems that overall Accuracy misses under large class imbalance
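A per-label breakdown takes only a few lines; this is an illustrative sketch, keyed on the true label so a collapsing minority class shows up even when overall accuracy looks fine:

```python
from collections import defaultdict

def accuracy_per_label(preds, labels):
    """Accuracy broken out by true label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y in zip(preds, labels):
        total[y] += 1
        correct[y] += int(p == y)
    return {y: correct[y] / total[y] for y in total}
```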
18. Vital: Model Agreement
How often the previous models, no longer in production, agree with the new model
• Some disagreement is natural, but a large amount of disagreement can indicate a bug or problem
• Can be an alternative to Accuracy if Accuracy is hard to measure
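Computing an agreement rate between a previous (shadow) model and the live model might look like this; the names are assumptions:

```python
def agreement_rate(old_preds, new_preds):
    """Fraction of inputs where two models give the same prediction.
    Run the previous model alongside the new one on the same inputs."""
    agree = sum(a == b for a, b in zip(old_preds, new_preds))
    return agree / len(old_preds)
```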
20. Vital: Output Distribution
How often each class is predicted, or the distribution of regression output values
• Can catch long-term trends, permanent changes, and seasonal changes
• Can catch bugs and problems with large swings outside of a few standard deviations
• Can be an alternative to Accuracy, but it is hard to tell the difference between “weird” and “bad”
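One simple way to compare a window's output distribution against a baseline is total variation distance. This sketch assumes classification outputs; an alert threshold on the distance would sit on top:

```python
from collections import Counter

def class_distribution(preds):
    """Empirical distribution of predicted classes in one window."""
    n = len(preds)
    counts = Counter(preds)
    return {c: counts[c] / n for c in counts}

def total_variation(base, curr):
    """Total variation distance between two class distributions:
    0 means identical, 1 means completely disjoint."""
    classes = set(base) | set(curr)
    return 0.5 * sum(abs(base.get(c, 0) - curr.get(c, 0)) for c in classes)
```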
22. Vital: Canaries
Does a test input case predict what we expect?
• Can catch obvious issues if a test case the model should get
right returns a wrong output value
• Good at testing all-or-nothing problems, but struggles with trends
• Very simple to implement and brutally effective
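A canary runner can be just a loop over hand-picked inputs with known answers. The `predict` callable and the dict shape here are assumptions:

```python
def run_canaries(predict, canaries):
    """Run test inputs the model should always get right.
    predict: the model's inference function.
    canaries: mapping of input -> expected output.
    Returns a dict of failures (input -> actual output);
    an empty dict means all canaries passed."""
    failures = {}
    for x, expected in canaries.items():
        got = predict(x)
        if got != expected:
            failures[x] = got
    return failures
```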
24. Vital: Human Complaints
Do humans agree with what the model is doing?
• People love complaining about AI
• Harness that power to give you feedback
• Effective in large-scale applications that interact with humans
• Can double as a continuous data labeling exercise
25. Metrics and Monitoring Tips
• Figure out which vital signs can be done for each model
• Create log files
• Send the logfiles somewhere
• Make pretty charts
• Build a dashboard
• Watch it
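The logging steps above can start as simply as appending JSON lines to a file for a log shipper to forward to your dashboards; the file name and record shape here are illustrative:

```python
import json
import time

def log_vital(name, value, path="vitals.log"):
    """Append one vital-sign observation as a JSON line."""
    record = {"ts": time.time(), "vital": name, "value": value}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```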