Real-world Strategies for Debugging Machine Learning Systems
You used cross-validation, early stopping, grid search, monotonicity constraints, and regularization to train a generalizable, interpretable, and stable machine learning (ML) model. Its fit statistics look just fine on out-of-time test data, and better than the linear model it’s replacing. You selected your probability cutoff based on business goals and you even containerized your model to create a real-time scoring engine for your pals in information technology (IT). Time to deploy?

Not so fast. Current best practices for ML model training and assessment can be insufficient for high-stakes, real-world systems. Much like other complex IT systems, ML models must be debugged for logical or run-time errors and security vulnerabilities. Recent, high-profile failures have made it clear that ML models must also be debugged for disparate impact and other types of discrimination.

This presentation introduces model debugging, an emergent discipline focused on finding and fixing errors in the internal mechanisms and outputs of ML models. Model debugging attempts to test ML models like code (because they are code). It enhances trust in ML directly by increasing accuracy in new or holdout data, by decreasing or identifying hackable attack surfaces, or by decreasing discrimination. As a side effect, model debugging should also increase the understanding and interpretability of model mechanisms and predictions.

  1. Real-World Strategies for Model Debugging
     Patrick Hall, Principal Scientist, bnh.ai; Visiting Faculty, George Washington School of Business
     Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data, analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
  2. All software has bugs. Machine learning is software.
  3. Model Debugging
     ▪ Model debugging is an emergent discipline focused on remediating errors in the internal mechanisms and outputs of machine learning (ML) models.
     ▪ Model debugging attempts to test ML models like software (because models are code).
     ▪ Model debugging is similar to regression diagnostics, but for ML models.
     ▪ Model debugging promotes trust directly and enhances interpretability as a side effect.
     See https://debug-ml-iclr2019.github.io for numerous model debugging approaches.
  4. Why Debug ML Models?
  5. AI Incidents on the Rise
     This information is based on a qualitative assessment of 146 publicly reported incidents between 2015 and 2020.
  6. Common Failure Modes
     This information is based on a qualitative assessment of 169 publicly reported incidents between 1988 and February 1, 2021.
  7. Regulatory and Legal Considerations
     EU: Proposal for a Regulation on a European Approach for Artificial Intelligence
     https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-european-approach-artificial-intelligence
     ● Article 17 - Quality management system (c): “techniques, procedures and systematic actions to be used for the development, quality control and quality assurance of the high-risk AI system”
     U.S. FTC: Using Artificial Intelligence and Algorithms
     https://www.ftc.gov/news-events/blogs/business-blog/2020/04/using-artificial-intelligence-algorithms
     ● “Make sure that your AI models are validated and revalidated to ensure that they work as intended”
     Brookings Institution: Products liability law as a way to address AI harms
     https://www.brookings.edu/research/products-liability-law-as-a-way-to-address-ai-harms/
     ● “Manufacturers have an obligation to make products that will be safe when used in reasonably foreseeable ways. If an AI system is used in a foreseeable way and yet becomes a source of harm, a plaintiff could assert that the manufacturer was negligent in not recognizing the possibility of that outcome.”
  8. Textbook assessment is insufficient for real-world woes ...
     The Strawman Model: gmono
  9. The Strawman: gmono
     ▪ Constrained, monotonic GBM probability of default (PD) classifier, gmono (see the training sketch below).
     ▪ Grid search over hundreds of models.
     ▪ Best model selected by validation-based early stopping.
     ▪ Seemingly well regularized (row and column sampling, explicit specification of L1 and L2 penalties).
     ▪ No evidence of over- or underfitting.
     ▪ Better validation logloss than the benchmark GLM.
     ▪ Decision threshold selected by maximizing the F1 statistic.
     ▪ BUT traditional assessment can be insufficient!
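How such a strawman might be trained, as a minimal sketch: the feature list, constraint signs, hyperparameters, and the `data` DataFrame below are illustrative assumptions rather than the deck's actual configuration, and a recent xgboost (>= 1.6) is assumed for constructor-level early stopping.

```python
# A hedged sketch only: `data`, the feature list, the constraint signs, and the
# hyperparameters are illustrative assumptions, not the deck's exact setup.
# Assumes xgboost >= 1.6 (early_stopping_rounds in the constructor).
import xgboost as xgb
from sklearn.model_selection import train_test_split

features = ["LIMIT_BAL", "PAY_0", "PAY_2", "PAY_3", "BILL_AMT1", "PAY_AMT1"]
target = "DEFAULT_NEXT_MONTH"

X_train, X_valid, y_train, y_valid = train_test_split(
    data[features], data[target], test_size=0.3, stratify=data[target], random_state=12345
)

# +1 forces predicted PD to increase with a feature, -1 to decrease, 0 leaves it unconstrained.
constraints = (-1, 1, 1, 1, 0, -1)

gmono = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    subsample=0.8,             # row sampling
    colsample_bytree=0.8,      # column sampling
    reg_alpha=0.01,            # explicit L1 penalty
    reg_lambda=0.01,           # explicit L2 penalty
    monotone_constraints=constraints,
    eval_metric="logloss",
    early_stopping_rounds=50,  # validation-based early stopping
    random_state=12345,
)
gmono.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
```

A grid search over hundreds of such models would simply wrap this in a loop over candidate hyperparameter settings.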
  10. ML Models Can Be Unnecessary
     gmono is a glorified business rule: IF PAY_0 > 1, THEN DEFAULT. PAY_0 is overemphasized.
  11. ML Models Perpetuate Sociological Biases
     Group disparity metrics are out-of-range for gmono across different marital statuses.
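To make the group disparity check above concrete, here is a minimal sketch of an adverse impact ratio computed across marital statuses. The column name, the coding of the favorable outcome, and the 0.8-1.25 guideline range in the usage comment are illustrative assumptions, not the deck's exact methodology.

```python
# Hypothetical sketch of an adverse impact ratio (AIR) check; column names,
# outcome coding, and the 0.8-1.25 guideline are illustrative assumptions.
import pandas as pd

def adverse_impact_ratio(preds, groups, reference_group):
    """Each group's favorable-outcome rate divided by the reference group's rate."""
    favorable = pd.Series(preds == 0, index=groups.index)  # 0 = default not predicted
    ref_rate = favorable[groups == reference_group].mean()
    return {g: favorable[groups == g].mean() / ref_rate for g in groups.unique()}

# air = adverse_impact_ratio(gmono.predict(X_valid), X_valid["MARRIAGE"], reference_group=1)
# flagged = {g: r for g, r in air.items() if not 0.8 <= r <= 1.25}  # out-of-range groups
```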
  12. ML Models Have Security Vulnerabilities
     Full-size image available: https://resources.oreilly.com/examples/0636920415947/blob/master/Attack_Cheat_Sheet.png
  13. Methods of Debugging: Software Quality Assurance and IT Governance
  14. IT Governance and Software QA
     Software Quality Assurance (QA)
     ● Unit testing
     ● Integration testing
     ● Functional testing
     ● Chaos testing
     More for ML:
     ● Reproducible benchmarks
     ● Random attack
     IT Governance
     ● Incident response
     ● Managed development processes
     ● Code reviews (even pair programming)
     ● Security and privacy policies
     More for ML: Model governance and model risk management (MRM)
     ● Executive oversight
     ● Documentation standards
     ● Multiple lines of defense
     ● Model inventories and monitoring
     Further Reading: Interagency Guidance on Model Risk Management (SR 11-7) https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf
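Of the ML-specific QA items above, the random attack needs almost no tooling: score the model on random or nonsense data and treat crashes and impossible outputs as findings. A minimal sketch, assuming a fitted scikit-learn-style classifier; the candidate values and summary statistics are illustrative.

```python
# A hedged sketch of a random attack: candidate values, row count, and summary
# statistics are illustrative; `model` is any fitted classifier with predict_proba.
import numpy as np
import pandas as pd

def random_attack(model, feature_names, n_rows=10000, seed=12345):
    rng = np.random.default_rng(seed)
    # Deliberately unrealistic values: huge magnitudes, negatives, NaN, inf.
    candidates = np.array([-1e12, -1.0, 0.0, 1.0, 1e12, np.nan, np.inf])
    attack = pd.DataFrame(
        rng.choice(candidates, size=(n_rows, len(feature_names))),
        columns=feature_names,
    )
    try:
        probs = model.predict_proba(attack)[:, 1]
    except Exception as e:  # crashes on nonsense input are findings, too
        return {"error": repr(e)}
    return {
        "min_prob": float(np.nanmin(probs)),   # probabilities outside a sane range?
        "max_prob": float(np.nanmax(probs)),
        "n_nan": int(np.isnan(probs).sum()),   # NaN predictions signal trouble
    }

# report = random_attack(gmono, features)
```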
  15. IT Governance and Software QA
     ● Due to hype, data scientists and ML engineers are often:
       ○ Excused from basic QA requirements and IT governance.
       ○ Allowed to operate in violation of security and privacy policies (and laws).
     ● Many organizations have incident response plans for all mission-critical computing except ML.
     ● Very few nonregulated organizations practice MRM.
     ● We are in the Wild West days of AI.
     Further Reading: Overview of Debugging ML Models (Google) https://developers.google.com/machine-learning/testing-debugging/common/overview
  16. Methods of Debugging: Detection and Remediation Strategies
  17. Sensitivity Analysis
     ● ML models behave in complex and unexpected ways.
     ● The only way to know how they will behave is to test them.
     ● With sensitivity analysis, we can test model behavior in interesting, critical, adversarial, or random situations.
     Important Tests:
     ● Visualizations of model performance (ALE, ICE, partial dependence)
     ● Stress-testing and adversarial example searches
     ● Random attacks
     ● Tests for underspecification
     Source: http://www.vias.org/tmdatanaleng/cc_ann_extrapolation.html
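The first test listed above, ICE curves with partial dependence, can be computed by brute force in a few lines. A minimal sketch, assuming a fitted binary classifier and a pandas DataFrame of rows to score; the feature name and grid in the usage comment are illustrative.

```python
# A hedged sketch of brute-force ICE and partial dependence for one feature;
# `model`, `X`, and the usage values below are assumptions.
import numpy as np

def ice_and_pd(model, X, feature, grid):
    ice = {}
    for value in grid:
        X_tmp = X.copy()
        X_tmp[feature] = value                         # set every row to this grid value
        ice[value] = model.predict_proba(X_tmp)[:, 1]  # one ICE point per row
    # Partial dependence is the average of the ICE curves at each grid value.
    pd_curve = {value: float(np.mean(p)) for value, p in ice.items()}
    return ice, pd_curve

# e.g., sweep PAY_0 over its observed range:
# ice, pdp = ice_and_pd(gmono, X_valid, "PAY_0", range(-2, 9))
```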
  18. Sensitivity Analysis Example—Partial Dependence
     ▪ Training data is sparse for PAY_0 > 1.
     ▪ ICE curves indicate that partial dependence is likely trustworthy and empirically confirm monotonicity, but also expose adversarial attack vulnerabilities.
     ▪ Partial dependence and ICE indicate gmono likely learned very little for PAY_0 > 1.
     ▪ PAY_0 = missing gives the lowest probability of default?!
     Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
  19. Sensitivity Analysis Example—Adversarial Example Search
     An adversarial example is a row of data that evokes a strange prediction—we can learn a lot from them.
     Adversarial search confirms multiple avenues of attack and exposes a potential flaw in gmono's inductive logic: default is predicted for customers who make payments above their credit limit.
     Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
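One way such a search could be run is a greedy random perturbation of a single seed row, keeping edits that push the prediction in the direction of interest. This is an illustrative heuristic, not the procedure in the linked notebook; the function and its parameters are hypothetical.

```python
# A hedged, heuristic sketch: greedy random search for an adversarial row.
# `model`, `seed_row` (a one-row DataFrame), and `feature_ranges`
# (e.g., {"PAY_0": (-2, 8)}) are assumptions.
import numpy as np

def adversarial_search(model, seed_row, feature_ranges, n_iter=5000, maximize=True, seed=0):
    rng = np.random.default_rng(seed)
    best_row = seed_row.copy()
    best_prob = float(model.predict_proba(best_row)[:, 1][0])
    for _ in range(n_iter):
        candidate = best_row.copy()
        feature = rng.choice(list(feature_ranges))      # pick one feature to perturb
        low, high = feature_ranges[feature]
        candidate[feature] = rng.uniform(low, high)     # random single-feature edit
        prob = float(model.predict_proba(candidate)[:, 1][0])
        improved = prob > best_prob if maximize else prob < best_prob
        if improved:                                    # keep edits that push the prediction
            best_row, best_prob = candidate, prob
    return best_row, best_prob
```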
  20. Residual Analysis
     ● Learning from mistakes is important.
     ● Residual analysis is the mathematical study of modeling mistakes.
     ● With residual analysis, we can see the mistakes our models are likely to make and correct or mitigate them.
     Important Tests:
     ● Residuals by feature and level
     ● Segmented error analysis (including differential validity tests for social discrimination)
     ● Shapley contributions to logloss
     ● Models of residuals
     Source: Residual (Sur)Realism https://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/Residual_Surrealism_TAS_2007.pdf
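The next few slides work with per-row logloss residuals: for a binary target, the residual for row i is -(y_i * log(p_i) + (1 - y_i) * log(1 - p_i)). A minimal helper, with variable names assumed:

```python
# Per-row logloss residuals for a binary target; variable names are assumptions.
import numpy as np

def logloss_residuals(y, p, eps=1e-10):
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)  # avoid log(0)
    y = np.asarray(y, dtype=float)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))      # one residual per row

# resid = logloss_residuals(y_valid, gmono.predict_proba(X_valid)[:, 1])
```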
  21. Residual Analysis Example—Segmented Error
     For PAY_0:
     ▪ Notable change in accuracy and error characteristics for PAY_0 > 1.
     ▪ Varying performance across segments can also be an indication of underspecification.
     ▪ For SEX, accuracy and error characteristics vary little across individuals represented in the training data. Nondiscrimination should be tested by more involved disparate impact analysis.
     Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
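A minimal sketch of the kind of segmented error table behind this slide, reporting accuracy, false positive rate, and false negative rate by level of a chosen feature. Column names and the default cutoff are illustrative assumptions.

```python
# A hedged sketch of segmented error analysis; column names and the default
# cutoff are illustrative assumptions.
import numpy as np
import pandas as pd

def segmented_errors(y, p, segment, cutoff=0.5):
    df = pd.DataFrame({
        "y": np.asarray(y),
        "pred": (np.asarray(p) >= cutoff).astype(int),
        "segment": np.asarray(segment),
    })
    def _stats(g):
        tp = int(((g.y == 1) & (g.pred == 1)).sum())
        tn = int(((g.y == 0) & (g.pred == 0)).sum())
        fp = int(((g.y == 0) & (g.pred == 1)).sum())
        fn = int(((g.y == 1) & (g.pred == 0)).sum())
        return pd.Series({
            "n": len(g),
            "accuracy": (tp + tn) / len(g),
            "false_pos_rate": fp / max(fp + tn, 1),
            "false_neg_rate": fn / max(fn + tp, 1),
        })
    return df.groupby("segment").apply(_stats)

# e.g., segmented_errors(y_valid, gmono.predict_proba(X_valid)[:, 1], X_valid["PAY_0"])
```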
  22. Residual Analysis Example—Shapley Values
     Globally important features PAY_3 and PAY_2 are more important, on average, to the loss than to the predictions!
     Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
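Shapley contributions to the loss can be estimated with the shap package, assuming a shap version whose TreeExplainer accepts model_output="log_loss" with interventional feature perturbation (this needs a background sample and the true labels). The background size and variable names below are illustrative.

```python
# A hedged sketch only: assumes a shap version whose TreeExplainer supports
# model_output="log_loss"; background size and variable names are illustrative.
import numpy as np
import shap

background = X_train.sample(100, random_state=12345)   # small background sample
explainer = shap.TreeExplainer(
    gmono,
    data=background,
    feature_perturbation="interventional",
    model_output="log_loss",    # attributions to the loss, not the prediction
)
loss_shap = explainer.shap_values(X_valid, y=y_valid)   # labels are required for loss attributions

# Mean absolute contribution of each feature to logloss; compare with the usual
# prediction-space importances to spot features that hurt more than they help.
loss_importance = dict(zip(X_valid.columns, np.abs(loss_shap).mean(axis=0)))
```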
  23. Residual Analysis Example—Modeling Residuals
     This tree encodes rules describing when gmono is probably wrong!
     Decision tree model of gmono DEFAULT_NEXT_MONTH=1 logloss residuals with 3-fold CV MSE = 0.0070 and R² = 0.8871.
     Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
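A minimal sketch of fitting such a surrogate tree to logloss residuals and printing its rules, reusing the logloss_residuals helper sketched earlier. The tree depth, cross-validation setup, and use of the validation frame are illustrative choices, not necessarily those behind the numbers quoted on the slide.

```python
# A hedged sketch of a surrogate tree over logloss residuals; depth, CV setup,
# and the reuse of X_valid are illustrative choices.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor, export_text

resid = logloss_residuals(y_valid, gmono.predict_proba(X_valid)[:, 1])
surrogate = DecisionTreeRegressor(max_depth=4, random_state=12345)
r2 = cross_val_score(surrogate, X_valid, resid, cv=3, scoring="r2").mean()

surrogate.fit(X_valid, resid)
print(f"3-fold CV R^2 of the residual tree: {r2:.4f}")
print(export_text(surrogate, feature_names=list(X_valid.columns)))  # human-readable "where gmono is wrong" rules
```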
  24. Benchmark Models
     ● Technical progress in training: Take small steps from reproducible benchmarks. How else do you know if the code changes you made today to your incredibly complex ML system made it any better?
     ● Sanity checks on real-world performance: Compare complex model predictions to benchmark model predictions. How else can you know if your incredibly complex ML system is giving strange predictions on real-world data?
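The second bullet can be operationalized as a simple disagreement monitor: flag rows where the complex model and a trusted benchmark (here assumed to be a fitted logistic regression, `glm`) disagree by more than some tolerance. A minimal sketch with assumed names and threshold.

```python
# A hedged sketch of a benchmark disagreement check; `glm`, the tolerance, and
# variable names are assumptions.
import numpy as np

def benchmark_disagreement(complex_model, benchmark_model, X, tol=0.30):
    p_complex = complex_model.predict_proba(X)[:, 1]
    p_benchmark = benchmark_model.predict_proba(X)[:, 1]
    gap = np.abs(p_complex - p_benchmark)
    return np.flatnonzero(gap > tol)   # row indices worth a second look before actioning

# suspicious_rows = benchmark_disagreement(gmono, glm, X_new)
```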
  25. Remediation of gmono Strawman
     ▪ Overemphasis of PAY_0:
       ▪ Collect better data!
       ▪ Engineer features for payment trends or stability.
       ▪ Strong regularization or missing value injection.
     ▪ Sparsity of PAY_0 > 1 training data: Get better data! (Increase observation weights?)
     ▪ Payments ≥ credit limit: Inference-time model assertion (sketched below).
     ▪ Disparate impact: Model selection by minimal disparate impact. (Pre-, in-, post-processing?)
     ▪ Security vulnerabilities: API throttling, authentication, real-time model monitoring.
     ▪ Large logloss importance: Evaluate dropping non-robust features.
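A minimal sketch of the inference-time model assertion mentioned above: override a default prediction when the most recent payment meets or exceeds the credit limit, and return the overridden row indices for human review. Column names, the cutoff, and the override policy are illustrative assumptions.

```python
# A hedged sketch of an inference-time model assertion; column names, the
# cutoff, and the override policy are illustrative assumptions.
import numpy as np

def score_with_assertions(model, X, cutoff=0.26):
    probs = model.predict_proba(X)[:, 1]
    preds = (probs >= cutoff).astype(int)
    # Assertion: a most recent payment at or above the credit limit should not
    # yield a default prediction.
    pays_in_full = (X["PAY_AMT1"] >= X["LIMIT_BAL"]).to_numpy()
    flagged = pays_in_full & (preds == 1)
    preds[flagged] = 0                       # override the model ...
    return preds, np.flatnonzero(flagged)    # ... and log the rows for human review
```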
  26. Process Remediation Strategies
     ● Appeal and override: Always enable users to appeal inevitable wrong decisions.
     ● Audits or red-teaming: Pay external experts to find bugs and problems.
     ● Bug bounties: Pay rewards to researchers (and teenagers) who find bugs in your (ML) software.
     ● Demographic and professional diversity: Diverse teams spot different kinds of problems.
     ● Domain expertise: Understand the context in which you are operating; crucial for testing.
     ● Incident response plan: Complex systems fail; be prepared.
     ● IT governance and QA: Treat ML systems like other mission-critical software assets!
     ● Model risk management: Empower executives; align incentives; challenge and document design decisions; and monitor models.
     ● Past known incidents: Those who ignore history are doomed to repeat it.
  27. Technical Remediation Strategies
     ▪ Anomaly detection: Strange predictions can signal performance or security problems.
     ▪ Calibration to past data: Make output probabilities meaningful in the real world.
     ▪ Experimental design: Use science to select training data that addresses your implicit hypotheses.
     ▪ Interpretable models/XAI: It’s easier to debug systems we can actually understand.
     ▪ Manual prediction limits: Don’t let models make embarrassing, harmful, or illegal predictions (see the sketch below).
     ▪ Model or model artifact editing: Directly edit the inference code of your model.
     ▪ Model monitoring: Always watch the behavior of ML models in the real world.
     ▪ Monotonicity and interaction constraints: Force your models to obey reality.
     ▪ Strong regularization or missing value injection: Penalize your models for overemphasizing non-robust input features.
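As one small example, manual prediction limits can be implemented by clamping output probabilities into a business-approved range before they reach downstream decisions. The bounds below are illustrative assumptions.

```python
# A hedged sketch of manual prediction limits; the approved bounds are
# illustrative assumptions.
import numpy as np

APPROVED_MIN, APPROVED_MAX = 0.02, 0.95

def limited_predict_proba(model, X):
    probs = model.predict_proba(X)[:, 1]
    return np.clip(probs, APPROVED_MIN, APPROVED_MAX)  # never emit out-of-policy extremes
```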
  28. References and Resources
  29. Must Reads
     ▪ AI Incidents: Study and catalog incidents so you don’t repeat them. Same processes from transportation incidents.
     ▪ Fundamental Limitations: ML must be constrained and tested in the context of domain knowledge … or it doesn’t really work. Some things cannot be predicted … no matter how good the data or how many data scientists are involved.
     ▪ Risk Management: Executive oversight, incentives, culture, and process are crucial to mitigate risk.
  30. Resources
     ModelTracker: Redesigning Performance Analysis Tools for Machine Learning
     https://www.microsoft.com/en-us/research/publication/modeltracker-redesigning-performance-analysis-tools-for-machine-learning/
     BIML Interactive Machine Learning Risk Framework
     https://berryvilleiml.com/interactive/
     Debugging Machine Learning Models
     https://debug-ml-iclr2019.github.io/
     Safe and Reliable Machine Learning
     https://www.dropbox.com/s/sdu26h96bc0f4l7/FAT19-AI-Reliability-Final.pdf?dl=0
     Tools: allennlp, cleverhans, manifold, SALib, shap, What-If Tool
  31. QUESTIONS? • CONTACT US • CONTACT@BNH.AI
     Patrick Hall, Principal Scientist, bnh.ai, ph@bnh.ai
     Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data, analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
  32. Feedback
     Your feedback is important to us. Don’t forget to rate and review the sessions.
