Real-World Strategies for
Model Debugging
Patrick Hall
Principal Scientist, bnh.ai
Visiting Faculty, George Washington School of Business
Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data,
analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
All software has bugs.
Machine learning is software.
Model Debugging
▪ Model debugging is an emergent discipline focused on remediating errors in the
internal mechanisms and outputs of machine learning (ML) models.
▪ Model debugging attempts to test ML models like software (because models are
code).
▪ Model debugging is similar to regression diagnostics, but for ML models.
▪ Model debugging promotes trust directly and enhances interpretability as a side
effect.
See https://debug-ml-iclr2019.github.io for numerous model debugging approaches.
Why Debug ML Models?
AI Incidents on the Rise
This information is based on a qualitative assessment of 146 publicly reported incidents between 2015 and 2020.
Common Failure Modes
This information is based on a qualitative assessment of 169 publicly reported incidents between 1988 and February 1, 2021.
Regulatory and Legal Considerations
EU: Proposal for a Regulation on a European Approach for Artificial Intelligence
https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-european-approach-artificial-intelligence
● Article 17 - Quality management system (c): “techniques, procedures and systematic actions to be
used for the development, quality control and quality assurance of the high-risk AI system”
U.S. FTC: Using Artificial Intelligence and Algorithms
https://www.ftc.gov/news-events/blogs/business-blog/2020/04/using-artificial-intelligence-algorithms
● “Make sure that your AI models are validated and revalidated to ensure that they work as intended”
Brookings Institution: Products liability law as a way to address AI harms
https://www.brookings.edu/research/products-liability-law-as-a-way-to-address-ai-harms/
● “Manufacturers have an obligation to make products that will be safe when used in reasonably
foreseeable ways. If an AI system is used in a foreseeable way and yet becomes a source of harm, a
plaintiff could assert that the manufacturer was negligent in not recognizing the possibility of that
outcome.”
Textbook assessment is insufficient for real-world woes ...
The Strawman Model: gmono
The Strawman: gmono
▪ Constrained, monotonic GBM probability of default (PD) classifier, gmono (a training sketch follows below).
▪ Grid search over hundreds of models.
▪ Best model selected by validation-based early stopping.
▪ Seemingly well regularized (row and column sampling, explicit L1 and L2 penalties).
▪ No evidence of over- or underfitting.
▪ Better validation logloss than the benchmark GLM.
▪ Decision threshold selected by maximizing the F1 statistic.
▪ BUT traditional assessment can be insufficient!
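A minimal training sketch in the spirit of gmono, using XGBoost's native API with monotone constraints and validation-based early stopping. The file path, the set of constrained features, and every hyperparameter below are illustrative assumptions, not gmono's actual settings.

```python
# Sketch: a constrained, monotonic GBM PD classifier with validation-based early
# stopping, in the spirit of gmono. File path, features, and hyperparameters are
# illustrative assumptions.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

data = pd.read_csv("credit_card_default.csv")        # assumed local copy of the data
y_name = "DEFAULT_NEXT_MONTH"
x_names = [c for c in data.columns if c != y_name]
train, valid = train_test_split(data, test_size=0.3, random_state=12345)

# Increasing monotone constraints on the repayment-status features only (assumption).
pay_status = {"PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"}
constraints = "(" + ",".join("1" if c in pay_status else "0" for c in x_names) + ")"

params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.05,
    "subsample": 0.8,                 # row sampling
    "colsample_bytree": 0.8,          # column sampling
    "alpha": 0.1,                     # L1 penalty
    "lambda": 1.0,                    # L2 penalty
    "monotone_constraints": constraints,
    "seed": 12345,
}

dtrain = xgb.DMatrix(train[x_names], label=train[y_name])
dvalid = xgb.DMatrix(valid[x_names], label=valid[y_name])
g_mono = xgb.train(params, dtrain, num_boost_round=1000,
                   evals=[(dvalid, "valid")],
                   early_stopping_rounds=50,          # validation-based early stopping
                   verbose_eval=False)
```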
ML Models Can Be Unnecessary
gmono is a glorified business rule: IF PAY_0 > 1, THEN DEFAULT.
PAY_0 is overemphasized.
ML Models Perpetuate Sociological Biases
Group disparity metrics are out of range for gmono across different marital statuses (one common check is sketched below).
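One common group disparity check is the adverse impact ratio (AIR), sketched below across levels of MARRIAGE. The scored frame's column names, the 0.15 decision cutoff, and the 0.8 to 1.25 reference range are assumptions; a real fairness assessment needs a fuller disparate impact analysis.

```python
# Sketch: adverse impact ratio (AIR) by marital status for a PD classifier.
# Column names, cutoff, and the 0.8-1.25 rule of thumb are assumptions.
import pandas as pd

def adverse_impact_ratio(scored: pd.DataFrame, group: str = "MARRIAGE",
                         p_name: str = "p_default", cutoff: float = 0.15) -> pd.DataFrame:
    # Acceptance rate = share of each group predicted NOT to default at the cutoff.
    accept = (scored[p_name] < cutoff).groupby(scored[group]).mean()
    air = accept / accept.max()                       # most-favored group as reference
    out = pd.DataFrame({"acceptance_rate": accept, "AIR": air})
    out["out_of_range"] = ~air.between(0.8, 1.25)
    return out

# Usage (assuming a scored validation frame with MARRIAGE and p_default columns):
# print(adverse_impact_ratio(scored_valid))
```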
ML Models Have Security Vulnerabilities
Full-size image available: https://resources.oreilly.com/examples/0636920415947/blob/master/Attack_Cheat_Sheet.png
Methods of Debugging:
Software Quality Assurance and IT Governance
IT Governance and Software QA
Software Quality Assurance (QA)
● Unit testing
● Integration testing
● Functional testing
● Chaos testing
More for ML:
● Reproducible benchmarks
● Random attack
IT Governance
● Incident response
● Managed development processes
● Code reviews (even pair programming)
● Security and privacy policies
More for ML: Model governance and
model risk management (MRM)
● Executive oversight
● Documentation standards
● Multiple lines of defense
● Model inventories and monitoring
Further Reading:
Interagency Guidance on Model Risk Management (SR 11-7)
https://www.federalreserve.gov/supervisionreg/srletters/sr1107a1.pdf
● Due to hype, data scientists and ML engineers are often:
○ Excused from basic QA requirements and IT governance.
○ Allowed to operate in violation of security and privacy policies
(and laws).
● Many organizations have incident response plans for all
mission-critical computing except ML.
● Very few nonregulated organizations practice MRM.
● We are in the Wild West days of AI.
Further Reading: Overview of Debugging ML Models (Google)
https://developers.google.com/machine-learning/testing-debugging/common/overview
Methods of Debugging:
Detection and Remediation Strategies
Sensitivity Analysis
● ML models behave in complex and
unexpected ways.
● The only way to know how they will behave is
to test them.
● With sensitivity analysis, we can test model
behavior in interesting, critical, adversarial, or
random situations.
Important Tests:
● Visualizations of model performance
(ALE, ICE, partial dependence)
● Stress-testing and adversarial example
searches
● Random attacks (sketched below)
● Tests for underspecification
Source: http://www.vias.org/tmdatanaleng/cc_ann_extrapolation.html
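A minimal random attack harness, assuming nothing about the model beyond a generic scoring function: score random and nonsensical rows and treat crashes, non-finite outputs, or out-of-range probabilities as findings.

```python
# Sketch: random attack harness. `predict_fn` and the column list are placeholders
# for your own scoring function and schema.
import numpy as np
import pandas as pd

def random_attack(predict_fn, columns, n_rows=10_000, seed=12345):
    rng = np.random.default_rng(seed)
    # Deliberately extreme values, plus injected NaN and inf.
    frame = pd.DataFrame(rng.normal(0, 1e6, size=(n_rows, len(columns))), columns=list(columns))
    frame.iloc[::7, 0] = np.nan
    frame.iloc[::11, -1] = np.inf
    try:
        preds = np.asarray(predict_fn(frame))
    except Exception as err:                          # crashes are findings, not noise
        return [f"random attack raised: {err!r}"]
    problems = []
    if not np.all(np.isfinite(preds)):
        problems.append("non-finite predictions")
    elif preds.min() < 0 or preds.max() > 1:
        problems.append("probabilities outside [0, 1]")
    return problems or ["no obvious problems found"]

# Usage (assuming a fitted classifier `model` and its feature list `x_names`):
# print(random_attack(lambda f: model.predict_proba(f)[:, 1], x_names))
```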
Sensitivity Analysis Example—Partial Dependence
▪ Training data is sparse for PAY_0 > 1.
▪ ICE curves indicate that partial dependence is likely trustworthy and empirically confirm monotonicity, but they also expose adversarial attack vulnerabilities.
▪ Partial dependence and ICE indicate gmono likely learned very little for PAY_0 > 1.
▪ PAY_0 = missing gives the lowest probability of default?!
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
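A hand-rolled ICE and partial dependence sketch in the spirit of the linked notebook. The fitted model (assumed to expose a scikit-learn-style predict_proba), the validation frame, and the PAY_0 grid are assumptions.

```python
# Sketch: ICE curves and partial dependence for one feature, e.g. PAY_0.
import numpy as np
import pandas as pd

def ice_and_pd(model, frame: pd.DataFrame, feature: str, grid, n_rows: int = 100, seed: int = 12345):
    """Return (ICE curve matrix, partial dependence vector) over `grid`."""
    sample = frame.sample(n=min(n_rows, len(frame)), random_state=seed)
    curves = np.empty((len(sample), len(grid)))
    for j, value in enumerate(grid):
        temp = sample.copy()
        temp[feature] = value                         # force every row to this value
        curves[:, j] = model.predict_proba(temp)[:, 1]
    return curves, curves.mean(axis=0)                # PD is the average of the ICE curves

# Usage (PAY_0 takes integer values roughly in [-2, 8] in the credit card data):
# curves, pd_hat = ice_and_pd(model, valid[x_names], "PAY_0", np.arange(-2, 9))
```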
Sensitivity Analysis Example—Adversarial Example Search
An adversarial example is a row of data that evokes a strange prediction; we can learn a lot from such examples.
Adversarial example search confirms multiple avenues of attack and exposes a potential flaw in gmono's inductive logic: default is predicted for customers who make payments above their credit limit.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_sens_analysis_redux.ipynb
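A simple random-perturbation adversarial example search, in the spirit of the linked notebook but not its exact procedure. The fitted model, starting row, and feature ranges are assumptions.

```python
# Sketch: random single-feature perturbations that move the predicted PD the most.
import numpy as np
import pandas as pd

def adversary_search(model, row: pd.DataFrame, feature_ranges: dict,
                     n_tries: int = 5000, seed: int = 12345):
    """Return (perturbed one-row frame, absolute change in predicted probability)."""
    rng = np.random.default_rng(seed)
    base = model.predict_proba(row)[:, 1][0]
    best_row, best_delta = row, 0.0
    for _ in range(n_tries):
        candidate = row.copy()
        feature = rng.choice(list(feature_ranges))
        low, high = feature_ranges[feature]
        candidate[feature] = rng.uniform(low, high)   # perturb one feature at a time
        delta = abs(model.predict_proba(candidate)[:, 1][0] - base)
        if delta > best_delta:
            best_row, best_delta = candidate, delta
    return best_row, best_delta

# Usage: probe payment amounts far above the credit limit, per the flaw described above.
# adv_row, delta = adversary_search(model, valid[x_names].iloc[[0]],
#                                   {"PAY_AMT1": (0, 1_000_000), "LIMIT_BAL": (10_000, 500_000)})
```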
Residual Analysis
● Learning from mistakes is important.
● Residual analysis is the mathematical study of
modeling mistakes.
● With residual analysis, we can see the mistakes
our models are likely to make and correct or
mitigate them.
Important Tests:
● Residuals by feature and level
● Segmented error analysis
(including differential validity tests for
social discrimination)
● Shapley contributions to logloss
● Models of residuals
Source: Residual (Sur)Realism
https://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/Residual_Surrealism_TAS_2007.pdf
Residual Analysis Example—Segmented Error
For PAY_0:
▪ Notable change in accuracy and error characteristics for PAY_0 > 1.
▪ Varying performance across segments can also be an indication of underspecification.
▪ For SEX, accuracy and error characteristics vary little across individuals represented in the training data. Nondiscrimination should still be tested with a more involved disparate impact analysis.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
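A sketch of segmented error analysis: accuracy and logloss within each level of a segmenting feature such as PAY_0 or SEX. The scored validation frame's column names and the 0.15 cutoff are assumptions.

```python
# Sketch: per-segment accuracy and logloss for a scored validation frame.
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss

def segmented_errors(scored: pd.DataFrame, segment: str, cutoff: float = 0.15,
                     y_name: str = "DEFAULT_NEXT_MONTH", p_name: str = "p_default") -> pd.DataFrame:
    rows = []
    for level, chunk in scored.groupby(segment):
        y, p = chunk[y_name], chunk[p_name]
        rows.append({
            segment: level,
            "n": len(chunk),
            "accuracy": accuracy_score(y, (p > cutoff).astype(int)),
            "logloss": log_loss(y, p, labels=[0, 1]),  # labels guard one-class segments
        })
    return pd.DataFrame(rows)

# Usage: print(segmented_errors(scored_valid, "PAY_0")); print(segmented_errors(scored_valid, "SEX"))
```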
Residual Analysis Example—Shapley Values
Globally important features PAY_3 and PAY_2 are more important, on average, to the loss than to the predictions!
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
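A sketch comparing each feature's average Shapley contribution to the predictions against its contribution to the logloss, using the shap package's loss-explanation mode for tree models (available in recent shap releases). The fitted GBM and labeled validation frame are assumptions.

```python
# Sketch: Shapley contributions to predictions vs. to logloss for a tree-based model.
import numpy as np
import pandas as pd
import shap

def prediction_vs_loss_importance(model, X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    background = X.sample(min(100, len(X)), random_state=12345)
    pred_shap = shap.TreeExplainer(model).shap_values(X)
    loss_shap = shap.TreeExplainer(
        model,
        data=background,
        feature_perturbation="interventional",
        model_output="log_loss",                      # attribute the loss, not the prediction
    ).shap_values(X, y)
    out = pd.DataFrame({
        "prediction_share": np.abs(pred_shap).mean(axis=0),
        "loss_share": np.abs(loss_shap).mean(axis=0),
    }, index=X.columns)
    out = out / out.sum(axis=0)                       # normalize so the columns are comparable
    out["check_robustness"] = out["loss_share"] > out["prediction_share"]
    return out.sort_values("loss_share", ascending=False)

# Usage: print(prediction_vs_loss_importance(model, valid[x_names], valid["DEFAULT_NEXT_MONTH"]))
```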
Residual Analysis Example—Modeling Residuals
This tree encodes rules describing when gmono is probably wrong!
Decision tree model of gmono DEFAULT_NEXT_MONTH=1 logloss residuals, with 3-fold CV MSE = 0.0070 and R² = 0.8871.
Example code: https://nbviewer.jupyter.org/github/jphall663/interpretable_machine_learning_with_python/blob/master/debugging_resid_analysis_redux.ipynb
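A sketch of modeling per-row logloss residuals with a shallow decision tree whose printed rules describe where the model is probably wrong. The scored frame, feature list, and tree depth are assumptions; the MSE and R² on the slide come from a similar but not identical setup.

```python
# Sketch: interpretable decision tree fit to per-row logloss residuals.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor, export_text

def residual_tree(scored: pd.DataFrame, x_names, y_name: str = "DEFAULT_NEXT_MONTH",
                  p_name: str = "p_default", max_depth: int = 4) -> DecisionTreeRegressor:
    eps = 1e-15
    p = scored[p_name].clip(eps, 1 - eps)
    y = scored[y_name]
    resid = -(y * np.log(p) + (1 - y) * np.log(1 - p))          # per-row logloss
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=12345)
    r2 = cross_val_score(tree, scored[x_names], resid, cv=3, scoring="r2").mean()
    tree.fit(scored[x_names], resid)
    print(f"3-fold CV R^2 of the residual tree: {r2:.4f}")
    print(export_text(tree, feature_names=list(x_names)))       # human-readable rules
    return tree

# Usage: residual_tree(scored_valid, x_names)
```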
Benchmark Models
● Technical progress in training: Take small steps from reproducible
benchmarks. How else do you know if the code changes you made
today to your incredibly complex ML system made it any better?
● Sanity checks on real-world performance: Compare complex model predictions to benchmark model predictions (see the sketch below). How else can you know if your incredibly complex ML system is giving strange predictions on real-world data?
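A sketch of such a sanity check: fit a simple GLM benchmark and flag rows where the complex model disagrees with it sharply. The fitted models, data frames, and the 0.25 disagreement threshold are assumptions.

```python
# Sketch: flag rows where a complex model and a GLM benchmark disagree sharply.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def benchmark_disagreement(complex_model, X_train, y_train, X_new, threshold: float = 0.25):
    benchmark = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    p_complex = complex_model.predict_proba(X_new)[:, 1]
    p_benchmark = benchmark.predict_proba(X_new)[:, 1]
    disagreement = np.abs(p_complex - p_benchmark)
    return X_new.loc[disagreement > threshold]        # rows worth a manual review

# Usage: suspects = benchmark_disagreement(model, train[x_names], train[y_name], valid[x_names])
```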
Remediation of the gmono Strawman
▪ Overemphasis of PAY_0:
▪ Collect better data!
▪ Engineer features for payment trends or stability.
▪ Strong regularization or missing value injection.
▪ Sparsity of PAY_0 > 1 training data: Get better data! (Increase observation weights?)
▪ Payments ≥ credit limit: Inference-time model assertion (sketched below).
▪ Disparate impact: Model selection by minimal disparate impact.
(Pre-, in-, post-processing?)
▪ Security vulnerabilities: API throttling, authentication, real-time model monitoring.
▪ Large logloss importance: Evaluate dropping non-robust features.
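A sketch of the inference-time model assertion mentioned above for the payments ≥ credit limit flaw. Column names and the probability cap are assumptions, not a recommended policy.

```python
# Sketch: override the raw score when a payment meets or exceeds the credit limit.
import pandas as pd

def predict_with_assertion(model, frame: pd.DataFrame) -> pd.Series:
    scores = pd.Series(model.predict_proba(frame)[:, 1], index=frame.index)
    overpaid = frame["PAY_AMT1"] >= frame["LIMIT_BAL"]
    # Assertion: paying at or above the credit limit should not, by itself, drive a default call.
    scores.loc[overpaid] = scores.loc[overpaid].clip(upper=0.15)   # assumed cap
    return scores

# Usage: scores = predict_with_assertion(model, new_applications[x_names])
```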
Process Remediation Strategies
● Appeal and override: Always enable users to appeal inevitable wrong decisions.
● Audits or red-teaming: Pay external experts to find bugs and problems.
● Bug bounties: Pay rewards to researchers (and teenagers) who find bugs in your
(ML) software.
● Demographic and professional diversity: Diverse teams spot different kinds of problems.
● Domain expertise: Understand the context in which you are operating; crucial for testing.
● Incident response plan: Complex systems fail; be prepared.
● IT governance and QA: Treat ML systems like other mission-critical software assets!
● Model risk management: Empower executives; align incentives; challenge and document
design decisions; and monitor models.
● Past known incidents: Those who ignore history are doomed to repeat it.
Technical Remediation Strategies
▪ Anomaly detection: Strange predictions can signal performance or security problems (see the monitoring sketch after this list).
▪ Calibration to past data: Make output probabilities meaningful in the real world.
▪ Experimental design: Use science to select training data that addresses your implicit
hypotheses.
▪ Interpretable models/XAI: It’s easier to debug systems we can actually understand.
▪ Manual prediction limits: Don’t let models make embarrassing, harmful, or illegal
predictions.
▪ Model or model artifact editing: Directly edit the inference code of your model.
▪ Model monitoring: Always watch the behavior of ML models in the real world.
▪ Monotonicity and interaction constraints: Force your models to obey reality.
▪ Strong regularization or missing value injection: Penalize your models for
overemphasizing non-robust input features.
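A sketch of anomaly detection in support of model monitoring: fit an isolation forest on the training features and route incoming rows that look unlike the training data for review, since strange inputs often precede strange or adversarial predictions. The data frames and contamination rate are assumptions.

```python
# Sketch: isolation-forest input monitor for rows that look unlike the training data.
import pandas as pd
from sklearn.ensemble import IsolationForest

def fit_input_monitor(X_train: pd.DataFrame, contamination: float = 0.01) -> IsolationForest:
    return IsolationForest(contamination=contamination, random_state=12345).fit(X_train)

def flag_anomalies(monitor: IsolationForest, X_new: pd.DataFrame) -> pd.DataFrame:
    return X_new[monitor.predict(X_new) == -1]        # -1 marks anomalous rows

# Usage:
# monitor = fit_input_monitor(train[x_names])
# review_queue = flag_anomalies(monitor, incoming[x_names])
```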
References and Resources
Must Reads
● AI Incidents: Study and catalog incidents so you don't repeat them, using the same processes as for transportation incidents.
● Fundamental Limitations: ML must be constrained and tested in the context of domain knowledge … or it doesn't really work. Some things cannot be predicted … no matter how good the data or how many data scientists are involved.
● Risk Management: Executive oversight, incentives, culture, and process are crucial to mitigate risk.
Resources
ModelTracker: Redesigning Performance Analysis Tools for Machine Learning
https://www.microsoft.com/en-us/research/publication/modeltracker-redesigning-performance-analysis-tools-for-machine-learning/
BIML Interactive Machine Learning Risk Framework
https://berryvilleiml.com/interactive/
Debugging Machine Learning Models
https://debug-ml-iclr2019.github.io/
Safe and Reliable Machine Learning
https://www.dropbox.com/s/sdu26h96bc0f4l7/FAT19-AI-Reliability-Final.pdf?dl=0
Tools:
allennlp, cleverhans, manifold, SALib, shap, What-If Tool
QUESTIONS? • CONTACT US • CONTACT@BNH.AI
Patrick Hall
Principal Scientist, bnh.ai
ph@bnh.ai
Disclaimer: bnh.ai leverages a unique blend of legal and technical expertise to protect and advance clients’ data,
analytics, and AI investments. Not all firm personnel, including named partners, are authorized to practice law.
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
