Machine Learning Vital Signs: Metrics and Monitoring of AI in Production
This talk covers tracking machine learning models in production to ensure they stay reliable, consistent, and performant over time. Production models interact with the real world, and it is terrifying how often nobody has any idea how they are performing on live data. The world changes! Bias and variance can creep into your models over time, and you should know when that happens.
6. The world changes slowly
Over time the nature of the world changes
Our models will not work as well
7. The world changes abruptly
Big things can happen and fundamentally change the world
This may render previous models less useful or worthless
8. The world changes periodically
Seasonal and periodic changes happen
This can impact model effectiveness temporarily or permanently
9. Weird things happen then go away
Current events can change the world for a short period of time
Model effectiveness changes (usually for the worse) for a short period of time
10. Bugs
They will happen
Can be troublesome to detect in machine learning pipelines
11. Adversaries
Bad people exist
Could they exploit your model or training set to your detriment?
12. Proposed solution: Metrics & Monitoring
Instrument your models with “vital signs”
Catch your model in time when it is:
• Suddenly breaking
• Drifting into worthlessness
• Doing something strange
13. Machine Learning Vital Signs
• A metric from a productionized model that you can monitor for change over time
• Have alerts in place that detect:
• An unacceptable amount of drift over time
• A surprising spike of errors in a single period
• What is the average of the vital?
• What is the standard deviation of the vital?
• What are acceptable bounds for the vital?
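The stats above can be sketched as a tiny monitor class. `VitalSign`, its field names, and the 3-sigma default are illustrative assumptions, not any particular library's API:

```python
"""Minimal sketch of a vital-sign monitor: track a metric's history
and alert when a new observation falls outside acceptable bounds."""
from statistics import mean, stdev

class VitalSign:
    def __init__(self, name, history, n_sigma=3.0):
        # history: past values of this vital collected in production
        self.name = name
        self.history = list(history)
        self.n_sigma = n_sigma  # width of the acceptable band

    def bounds(self):
        """Acceptable bounds: mean +/- n_sigma standard deviations."""
        mu, sigma = mean(self.history), stdev(self.history)
        return mu - self.n_sigma * sigma, mu + self.n_sigma * sigma

    def check(self, value):
        """Return True if the new observation is within bounds."""
        lo, hi = self.bounds()
        ok = lo <= value <= hi
        self.history.append(value)  # keep tracking over time
        return ok
```

In practice the `check` result would feed an alerting system rather than just return a boolean.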
14. Vital: Accuracy
How often the model is correct or not correct
• Naturally will decrease over time
• Big dips (or jumps) can be indicative of something wrong
• Can mimic how the data was initially labeled
• Automatically labeled as part of the data
• Manually labeled… uh oh
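Accuracy as a vital can be computed per time window, with a flag for big dips or jumps. The function names and the 0.05 threshold are assumptions for illustration:

```python
def window_accuracy(preds, labels):
    """Fraction of predictions in this window that match the labels."""
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

def flag_accuracy_change(prev_acc, curr_acc, max_delta=0.05):
    """Flag when accuracy moves more than max_delta in one period,
    in either direction -- big jumps can be as suspicious as dips."""
    return abs(curr_acc - prev_acc) > max_delta
```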
16. Vital: Accuracy Per Label
How often the model is correct or not correct, for each potential output label
• More fine-grained than Accuracy
• Can catch problems that overall Accuracy misses under large class imbalance
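A per-label breakdown takes only a few lines; this is an illustrative sketch, keyed on the true label so a collapsing minority class shows up even when overall accuracy looks fine:

```python
from collections import defaultdict

def accuracy_per_label(preds, labels):
    """Accuracy broken out by true label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y in zip(preds, labels):
        total[y] += 1
        correct[y] += int(p == y)
    return {y: correct[y] / total[y] for y in total}
```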
18. Vital: Model Agreement
How often the previous models, no longer in production, agree with the new model
• Some disagreement is natural, but a large amount of disagreement can indicate a bug or problem
• Can be an alternative to Accuracy if Accuracy is hard to measure
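Computing an agreement rate between a previous (shadow) model and the live model might look like this; the names are assumptions:

```python
def agreement_rate(old_preds, new_preds):
    """Fraction of inputs where two models give the same prediction.
    Run the previous model alongside the new one on the same inputs."""
    agree = sum(a == b for a, b in zip(old_preds, new_preds))
    return agree / len(old_preds)
```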
20. Vital: Output Distribution
How often each class is predicted, or the distribution of regression output values
• Can catch long-term trends, permanent changes, and seasonal changes
• Can catch bugs and problems with large swings outside of a few standard deviations
• Can be an alternative to Accuracy, but it is hard to tell the difference between “weird” and “bad”
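One simple way to compare a window's output distribution against a baseline is total variation distance. This sketch assumes classification outputs; an alert threshold on the distance would sit on top:

```python
from collections import Counter

def class_distribution(preds):
    """Empirical distribution of predicted classes in one window."""
    n = len(preds)
    counts = Counter(preds)
    return {c: counts[c] / n for c in counts}

def total_variation(base, curr):
    """Total variation distance between two class distributions:
    0 means identical, 1 means completely disjoint."""
    classes = set(base) | set(curr)
    return 0.5 * sum(abs(base.get(c, 0) - curr.get(c, 0)) for c in classes)
```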
22. Vital: Canaries
Does a test input case predict what we expect?
• Can catch obvious issues if a test case the model should get
right returns a wrong output value
• Good at testing all-or-nothing problems, but struggles with trends
• Very simple to implement and brutally effective
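A canary runner can be just a loop over hand-picked inputs with known answers. The `predict` callable and the dict shape here are assumptions:

```python
def run_canaries(predict, canaries):
    """Run test inputs the model should always get right.
    predict: the model's inference function.
    canaries: mapping of input -> expected output.
    Returns a dict of failures (input -> actual output);
    an empty dict means all canaries passed."""
    failures = {}
    for x, expected in canaries.items():
        got = predict(x)
        if got != expected:
            failures[x] = got
    return failures
```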
24. Vital: Human Complaints
Do humans agree with what the model is doing?
• People love complaining about AI
• Harness that power to give you feedback
• Effective in large-scale applications that interact with humans
• Can double as a continuous data labeling exercise
25. Metrics and Monitoring Tips
• Figure out which vital signs can be done for each model
• Create log files
• Send the logfiles somewhere
• Make pretty charts
• Build a dashboard
• Watch it
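The logging steps above can start as simply as appending JSON lines to a file for a log shipper to forward to your dashboards; the file name and record shape here are illustrative:

```python
import json
import time

def log_vital(name, value, path="vitals.log"):
    """Append one vital-sign observation as a JSON line."""
    record = {"ts": time.time(), "vital": name, "value": value}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```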