Real-World Strategies for
Model Debugging
Patrick Hall
Principal Scientist, bnh.ai
Visiting Faculty, George Washington School of Business
Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data,
analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
All software has bugs.
Machine learning is software.
Model Debugging
▪ Model debugging is an emergent discipline focused on remediating errors in the
internal mechanisms and outputs of machine learning (ML) models.
▪ Model debugging attempts to test ML models like software (because models are
code).
▪ Model debugging is similar to regression diagnostics, but for ML models.
▪ Model debugging promotes trust directly and enhances interpretability as a side
effect.
See https://debug-ml-iclr2019.github.io for numerous model debugging approaches.
Why Debug ML Models?
AI Incidents on the Rise
This information is based on a qualitative assessment of 146 publicly reported incidents between 2015 and 2020.
Common Failure Modes
This information is based on a qualitative assessment of 169 publicly reported incidents between 1988 and February 1, 2021.
Regulatory and Legal Considerations
EU: Proposal for a Regulation on a European Approach for Artificial Intelligence
https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-european-approach-artificial-intelligence
● Article 17 - Quality management system (c): “techniques, procedures and systematic actions to be
used for the development, quality control and quality assurance of the high-risk AI system”
U.S. FTC: Using Artificial Intelligence and Algorithms
https://www.ftc.gov/news-events/blogs/business-blog/2020/04/using-artificial-intelligence-algorithms
● “Make sure that your AI models are validated and revalidated to ensure that they work as intended”
Brookings Institution: Products liability law as a way to address AI harms
https://www.brookings.edu/research/products-liability-law-as-a-way-to-address-ai-harms/
● “Manufacturers have an obligation to make products that will be safe when used in reasonably
foreseeable ways. If an AI system is used in a foreseeable way and yet becomes a source of harm, a
plaintiff could assert that the manufacturer was negligent in not recognizing the possibility of that
outcome.”
Textbook assessment is insufficient for real-world woes ...
The Strawman Model: gmono
The Strawman: gmono
▪ Constrained, monotonic GBM probability of default (PD) classifier, gmono (a training sketch follows below).
▪ Grid search over hundreds of models.
▪ Best model selected by validation-based early stopping.
▪ Seemingly well regularized (row and column sampling, explicit L1 and L2 penalties).
▪ No evidence of over- or underfitting.
▪ Better validation logloss than the benchmark GLM.
▪ Decision threshold selected by maximizing the F1 statistic.
▪ BUT traditional assessment can be insufficient!
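A minimal training sketch in the spirit of gmono, using XGBoost's native API with monotone constraints and validation-based early stopping. The file path, the set of constrained features, and every hyperparameter below are illustrative assumptions, not gmono's actual settings.

```python
# Sketch: a constrained, monotonic GBM PD classifier with validation-based early
# stopping, in the spirit of gmono. File path, features, and hyperparameters are
# illustrative assumptions.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

data = pd.read_csv("credit_card_default.csv")        # assumed local copy of the data
y_name = "DEFAULT_NEXT_MONTH"
x_names = [c for c in data.columns if c != y_name]
train, valid = train_test_split(data, test_size=0.3, random_state=12345)

# Increasing monotone constraints on the repayment-status features only (assumption).
pay_status = {"PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"}
constraints = "(" + ",".join("1" if c in pay_status else "0" for c in x_names) + ")"

params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.05,
    "subsample": 0.8,                 # row sampling
    "colsample_bytree": 0.8,          # column sampling
    "alpha": 0.1,                     # L1 penalty
    "lambda": 1.0,                    # L2 penalty
    "monotone_constraints": constraints,
    "seed": 12345,
}

dtrain = xgb.DMatrix(train[x_names], label=train[y_name])
dvalid = xgb.DMatrix(valid[x_names], label=valid[y_name])
g_mono = xgb.train(params, dtrain, num_boost_round=1000,
                   evals=[(dvalid, "valid")],
                   early_stopping_rounds=50,          # validation-based early stopping
                   verbose_eval=False)
```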
ML Models Can Be Unnecessary
gmono is a glorified business rule: IF PAY_0 > 1, THEN DEFAULT.
PAY_0 is overemphasized.
ML Models Perpetuate Sociological Biases
Group disparity metrics are out of range for gmono across different marital statuses (one common check is sketched below).
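One common group disparity check is the adverse impact ratio (AIR), sketched below across levels of MARRIAGE. The scored frame's column names, the 0.15 decision cutoff, and the 0.8 to 1.25 reference range are assumptions; a real fairness assessment needs a fuller disparate impact analysis.

```python
# Sketch: adverse impact ratio (AIR) by marital status for a PD classifier.
# Column names, cutoff, and the 0.8-1.25 rule of thumb are assumptions.
import pandas as pd

def adverse_impact_ratio(scored: pd.DataFrame, group: str = "MARRIAGE",
                         p_name: str = "p_default", cutoff: float = 0.15) -> pd.DataFrame:
    # Acceptance rate = share of each group predicted NOT to default at the cutoff.
    accept = (scored[p_name] < cutoff).groupby(scored[group]).mean()
    air = accept / accept.max()                       # most-favored group as reference
    out = pd.DataFrame({"acceptance_rate": accept, "AIR": air})
    out["out_of_range"] = ~air.between(0.8, 1.25)
    return out

# Usage (assuming a scored validation frame with MARRIAGE and p_default columns):
# print(adverse_impact_ratio(scored_valid))
```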
ML Models Have Security Vulnerabilities
Full-size image available: https://resources.oreilly.com/examples/0636920415947/blob/master/Attack_Cheat_Sheet.png
Methods of Debugging:
Software Quality Assurance and IT Governance
IT Governance and Software QA
Software Quality Assurance (QA)
● Unit testing
● Integration testing
● Functional testing
● Chaos testing
More for ML:
● Reproducible benchmarks
● Random attack
IT Governance
● Incident response
● Managed development processes
● Code reviews (even pair programming)
● Security and privacy policies
More for ML: Model governance and
model risk management (MRM)
● Executive oversight
● Documentation standards
● Multiple lines of defense
● Model inventories and monitoring
Further Reading:
Interagency Guidance on Model Risk Management (SR 11-7)
https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf
● Due to hype, data scientists and ML engineers are often:
○ Excused from basic QA requirements and IT governance.
○ Allowed to operate in violation of security and privacy policies
(and laws).
● Many organizations have incident response plans for all
mission-critical computing except ML.
● Very few nonregulated organizations practice MRM.
● We are in the Wild West days of AI.
Further Reading: Overview of Debugging ML Models (Google)
https://developers.google.com/machine-learning/testing-debugging/common/overview
Methods of Debugging:
Detection and Remediation Strategies
Sensitivity Analysis
● ML models behave in complex and
unexpected ways.
● The only way to know how they will behave is
to test them.
● With sensitivity analysis, we can test model
behavior in interesting, critical, adversarial, or
random situations.
Important Tests:
● Visualizations of model performance
(ALE, ICE, partial dependence)
● Stress-testing and adversarial example
searches
● Random attacks (sketched below)
● Tests for underspecification
Source: http://www.vias.org/tmdatanaleng/cc_ann_extrapolation.html
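A minimal random attack harness, assuming nothing about the model beyond a generic scoring function: score random and nonsensical rows and treat crashes, non-finite outputs, or out-of-range probabilities as findings.

```python
# Sketch: random attack harness. `predict_fn` and the column list are placeholders
# for your own scoring function and schema.
import numpy as np
import pandas as pd

def random_attack(predict_fn, columns, n_rows=10_000, seed=12345):
    rng = np.random.default_rng(seed)
    # Deliberately extreme values, plus injected NaN and inf.
    frame = pd.DataFrame(rng.normal(0, 1e6, size=(n_rows, len(columns))), columns=list(columns))
    frame.iloc[::7, 0] = np.nan
    frame.iloc[::11, -1] = np.inf
    try:
        preds = np.asarray(predict_fn(frame))
    except Exception as err:                          # crashes are findings, not noise
        return [f"random attack raised: {err!r}"]
    problems = []
    if not np.all(np.isfinite(preds)):
        problems.append("non-finite predictions")
    elif preds.min() < 0 or preds.max() > 1:
        problems.append("probabilities outside [0, 1]")
    return problems or ["no obvious problems found"]

# Usage (assuming a fitted classifier `model` and its feature list `x_names`):
# print(random_attack(lambda f: model.predict_proba(f)[:, 1], x_names))
```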
Sensitivity Analysis Example—Partial Dependence
▪ Training data is sparse for PAY_0 > 1.
▪ ICE curves indicate that partial dependence is likely trustworthy and empirically confirm monotonicity, but they also expose adversarial attack vulnerabilities.
▪ Partial dependence and ICE indicate gmono likely learned very little for PAY_0 > 1.
▪ PAY_0 = missing gives the lowest probability of default?!
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
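A hand-rolled ICE and partial dependence sketch in the spirit of the linked notebook. The fitted model (assumed to expose a scikit-learn-style predict_proba), the validation frame, and the PAY_0 grid are assumptions.

```python
# Sketch: ICE curves and partial dependence for one feature, e.g. PAY_0.
import numpy as np
import pandas as pd

def ice_and_pd(model, frame: pd.DataFrame, feature: str, grid, n_rows: int = 100, seed: int = 12345):
    """Return (ICE curve matrix, partial dependence vector) over `grid`."""
    sample = frame.sample(n=min(n_rows, len(frame)), random_state=seed)
    curves = np.empty((len(sample), len(grid)))
    for j, value in enumerate(grid):
        temp = sample.copy()
        temp[feature] = value                         # force every row to this value
        curves[:, j] = model.predict_proba(temp)[:, 1]
    return curves, curves.mean(axis=0)                # PD is the average of the ICE curves

# Usage (PAY_0 takes integer values roughly in [-2, 8] in the credit card data):
# curves, pd_hat = ice_and_pd(model, valid[x_names], "PAY_0", np.arange(-2, 9))
```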
Sensitivity Analysis Example—Adversarial Example Search
An adversarial example is a row of data that evokes a strange prediction; we can learn a lot from such examples.
Adversarial example search confirms multiple avenues of attack and exposes a potential flaw in gmono's inductive logic: default is predicted for customers who make payments above their credit limit.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
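A simple random-perturbation adversarial example search, in the spirit of the linked notebook but not its exact procedure. The fitted model, starting row, and feature ranges are assumptions.

```python
# Sketch: random single-feature perturbations that move the predicted PD the most.
import numpy as np
import pandas as pd

def adversary_search(model, row: pd.DataFrame, feature_ranges: dict,
                     n_tries: int = 5000, seed: int = 12345):
    """Return (perturbed one-row frame, absolute change in predicted probability)."""
    rng = np.random.default_rng(seed)
    base = model.predict_proba(row)[:, 1][0]
    best_row, best_delta = row, 0.0
    for _ in range(n_tries):
        candidate = row.copy()
        feature = rng.choice(list(feature_ranges))
        low, high = feature_ranges[feature]
        candidate[feature] = rng.uniform(low, high)   # perturb one feature at a time
        delta = abs(model.predict_proba(candidate)[:, 1][0] - base)
        if delta > best_delta:
            best_row, best_delta = candidate, delta
    return best_row, best_delta

# Usage: probe payment amounts far above the credit limit, per the flaw described above.
# adv_row, delta = adversary_search(model, valid[x_names].iloc[[0]],
#                                   {"PAY_AMT1": (0, 1_000_000), "LIMIT_BAL": (10_000, 500_000)})
```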
Residual Analysis
● Learning from mistakes is important.
● Residual analysis is the mathematical study of
modeling mistakes.
● With residual analysis, we can see the mistakes
our models are likely to make and correct or
mitigate them.
Important Tests:
● Residuals by feature and level
● Segmented error analysis
(including differential validity tests for
social discrimination)
● Shapley contributions to logloss
● Models of residuals
Source: Residual (Sur)Realism
https://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/Residual_Surrealism_TAS_2007.pdf
Residual Analysis Example—Segmented Error
For PAY_0:
▪ Notable change in accuracy and error characteristics for PAY_0 > 1.
▪ Varying performance across segments can also be an indication of underspecification.
▪ For SEX, accuracy and error characteristics vary little across individuals represented in the training data. Nondiscrimination should still be tested with a more involved disparate impact analysis.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
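A sketch of segmented error analysis: accuracy and logloss within each level of a segmenting feature such as PAY_0 or SEX. The scored validation frame's column names and the 0.15 cutoff are assumptions.

```python
# Sketch: per-segment accuracy and logloss for a scored validation frame.
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss

def segmented_errors(scored: pd.DataFrame, segment: str, cutoff: float = 0.15,
                     y_name: str = "DEFAULT_NEXT_MONTH", p_name: str = "p_default") -> pd.DataFrame:
    rows = []
    for level, chunk in scored.groupby(segment):
        y, p = chunk[y_name], chunk[p_name]
        rows.append({
            segment: level,
            "n": len(chunk),
            "accuracy": accuracy_score(y, (p > cutoff).astype(int)),
            "logloss": log_loss(y, p, labels=[0, 1]),  # labels guard one-class segments
        })
    return pd.DataFrame(rows)

# Usage: print(segmented_errors(scored_valid, "PAY_0")); print(segmented_errors(scored_valid, "SEX"))
```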
Residual Analysis Example—Shapley Values
Globally important features PAY_3 and PAY_2 are more important, on average, to the loss than to the predictions!
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
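A sketch comparing each feature's average Shapley contribution to the predictions against its contribution to the logloss, using the shap package's loss-explanation mode for tree models (available in recent shap releases). The fitted GBM and labeled validation frame are assumptions.

```python
# Sketch: Shapley contributions to predictions vs. to logloss for a tree-based model.
import numpy as np
import pandas as pd
import shap

def prediction_vs_loss_importance(model, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    background = X.sample(min(100, len(X)), random_state=12345)
    pred_shap = shap.TreeExplainer(model).shap_values(X)
    loss_shap = shap.TreeExplainer(
        model,
        data=background,
        feature_perturbation="interventional",
        model_output="log_loss",                      # attribute the loss, not the prediction
    ).shap_values(X, y)
    out = pd.DataFrame({
        "prediction_share": np.abs(pred_shap).mean(axis=0),
        "loss_share": np.abs(loss_shap).mean(axis=0),
    }, index=X.columns)
    out = out / out.sum(axis=0)                       # normalize so the columns are comparable
    out["check_robustness"] = out["loss_share"] > out["prediction_share"]
    return out.sort_values("loss_share", ascending=False)

# Usage: print(prediction_vs_loss_importance(model, valid[x_names], valid["DEFAULT_NEXT_MONTH"]))
```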
Residual Analysis Example—Modeling Residuals
This tree encodes rules describing when gmono is probably wrong!
Decision tree model of gmono DEFAULT_NEXT_MONTH=1 logloss residuals, with 3-fold CV MSE = 0.0070 and R² = 0.8871.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
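A sketch of modeling per-row logloss residuals with a shallow decision tree whose printed rules describe where the model is probably wrong. The scored frame, feature list, and tree depth are assumptions; the MSE and R² on the slide come from a similar but not identical setup.

```python
# Sketch: interpretable decision tree fit to per-row logloss residuals.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor, export_text

def residual_tree(scored: pd.DataFrame, x_names, y_name: str = "DEFAULT_NEXT_MONTH",
                  p_name: str = "p_default", max_depth: int = 4) -> DecisionTreeRegressor:
    eps = 1e-15
    p = scored[p_name].clip(eps, 1 - eps)
    y = scored[y_name]
    resid = -(y * np.log(p) + (1 - y) * np.log(1 - p))          # per-row logloss
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=12345)
    r2 = cross_val_score(tree, scored[x_names], resid, cv=3, scoring="r2").mean()
    tree.fit(scored[x_names], resid)
    print(f"3-fold CV R^2 of the residual tree: {r2:.4f}")
    print(export_text(tree, feature_names=list(x_names)))       # human-readable rules
    return tree

# Usage: residual_tree(scored_valid, x_names)
```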
Benchmark Models
● Technical progress in training: Take small steps from reproducible
benchmarks. How else do you know if the code changes you made
today to your incredibly complex ML system made it any better?
● Sanity checks on real-world performance: Compare complex model predictions to benchmark model predictions (see the sketch below). How else can you know if your incredibly complex ML system is giving strange predictions on real-world data?
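A sketch of such a sanity check: fit a simple GLM benchmark and flag rows where the complex model disagrees with it sharply. The fitted models, data frames, and the 0.25 disagreement threshold are assumptions.

```python
# Sketch: flag rows where a complex model and a GLM benchmark disagree sharply.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def benchmark_disagreement(complex_model, X_train, y_train, X_new, threshold: float = 0.25):
    benchmark = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    p_complex = complex_model.predict_proba(X_new)[:, 1]
    p_benchmark = benchmark.predict_proba(X_new)[:, 1]
    disagreement = np.abs(p_complex - p_benchmark)
    return X_new.loc[disagreement > threshold]        # rows worth a manual review

# Usage: suspects = benchmark_disagreement(model, train[x_names], train[y_name], valid[x_names])
```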
Remediation of the gmono Strawman
▪ Overemphasis of PAY_0:
▪ Collect better data!
▪ Engineer features for payment trends or stability.
▪ Strong regularization or missing value injection.
▪ Sparsity of PAY_0 > 1 training data: Get better data! (Increase observation weights?)
▪ Payments ≥ credit limit: Inference-time model assertion (sketched below).
▪ Disparate impact: Model selection by minimal disparate impact.
(Pre-, in-, post-processing?)
▪ Security vulnerabilities: API throttling, authentication, real-time model monitoring.
▪ Large logloss importance: Evaluate dropping non-robust features.
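A sketch of the inference-time model assertion mentioned above for the payments ≥ credit limit flaw. Column names and the probability cap are assumptions, not a recommended policy.

```python
# Sketch: override the raw score when a payment meets or exceeds the credit limit.
import pandas as pd

def predict_with_assertion(model, frame: pd.DataFrame) -> pd.Series:
    scores = pd.Series(model.predict_proba(frame)[:, 1], index=frame.index)
    overpaid = frame["PAY_AMT1"] >= frame["LIMIT_BAL"]
    # Assertion: paying at or above the credit limit should not, by itself, drive a default call.
    scores.loc[overpaid] = scores.loc[overpaid].clip(upper=0.15)   # assumed cap
    return scores

# Usage: scores = predict_with_assertion(model, new_applications[x_names])
```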
Process Remediation Strategies
● Appeal and override: Always enable users to appeal inevitable wrong decisions.
● Audits or red-teaming: Pay external experts to find bugs and problems.
● Bug bounties: Pay rewards to researchers (and teenagers) who find bugs in your
(ML) software.
● Demographic and professional diversity: Diverse teams spot different kinds of problems.
● Domain expertise: Understand the context in which you are operating; crucial for testing.
● Incident response plan: Complex systems fail; be prepared.
● IT governance and QA: Treat ML systems like other mission-critical software assets!
● Model risk management: Empower executives; align incentives; challenge and document
design decisions; and monitor models.
● Past known incidents: Those who ignore history are doomed to repeat it.
Technical Remediation Strategies
▪ Anomaly detection: Strange predictions can signal performance or security problems (see the monitoring sketch after this list).
▪ Calibration to past data: Make output probabilities meaningful in the real world.
▪ Experimental design: Use science to select training data that addresses your implicit
hypotheses.
▪ Interpretable models/XAI: It’s easier to debug systems we can actually understand.
▪ Manual prediction limits: Don’t let models make embarrassing, harmful, or illegal
predictions.
▪ Model or model artifact editing: Directly edit the inference code of your model.
▪ Model monitoring: Always watch the behavior of ML models in the real world.
▪ Monotonicity and interaction constraints: Force your models to obey reality.
▪ Strong regularization or missing value injection: Penalize your models for
overemphasizing non-robust input features.
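A sketch of anomaly detection in support of model monitoring: fit an isolation forest on the training features and route incoming rows that look unlike the training data for review, since strange inputs often precede strange or adversarial predictions. The data frames and contamination rate are assumptions.

```python
# Sketch: isolation-forest input monitor for rows that look unlike the training data.
import pandas as pd
from sklearn.ensemble import IsolationForest

def fit_input_monitor(X_train: pd.DataFrame, contamination: float = 0.01) -> IsolationForest:
    return IsolationForest(contamination=contamination, random_state=12345).fit(X_train)

def flag_anomalies(monitor: IsolationForest, X_new: pd.DataFrame) -> pd.DataFrame:
    return X_new[monitor.predict(X_new) == -1]        # -1 marks anomalous rows

# Usage:
# monitor = fit_input_monitor(train[x_names])
# review_queue = flag_anomalies(monitor, incoming[x_names])
```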
References and Resources
Must Reads
● AI Incidents: Study and catalog incidents so you don't repeat them, using the same processes as for transportation incidents.
● Fundamental Limitations: ML must be constrained and tested in the context of domain knowledge … or it doesn't really work. Some things cannot be predicted … no matter how good the data or how many data scientists are involved.
● Risk Management: Executive oversight, incentives, culture, and process are crucial to mitigate risk.
Resources
ModelTracker: Redesigning Performance Analysis Tools for Machine Learning
https://www.microsoft.com/en-us/research/publication/modeltracker-redesigning-performance-analysis-tools-for-machine-learning/
BIML Interactive Machine Learning Risk Framework
https://berryvilleiml.com/interactive/
Debugging Machine Learning Models
https://debug-ml-iclr2019.github.io/
Safe and Reliable Machine Learning
https://www.dropbox.com/s/sdu26h96bc0f4l7/FAT19-AI-Reliability-Final.pdf?dl=0
Tools:
allennlp, cleverhans, manifold, SALib, shap, What-If Tool
QUESTIONS? • CONTACT US • CONTACT@BNH.AI
Patrick Hall
Principal Scientist, bnh.ai
ph@bnh.ai
Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data,
analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
