2. What is Machine Learning?
"Machine Learning is a field of study that gives computers the ability to learn without
being explicitly programmed" - Arthur Samuel, 1959
"A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by
P, improves with the experience E." - Tom M. Mitchell.
Supervised learning: model Y = f(x) to match data (x, y) (see the sketch after this list)
• Regression
• Classification
• Parametric models
  • Linear models
  • Polynomial models
  • Logistic models
  • Neural network models, including convolutional neural networks
• Non-parametric models
  • Kernel ridge regression
  • Decision trees
  • Gaussian process regression
  • Kernel SVM
Unsupervised learning
• Clustering
• Dimensionality reduction
• Autoencoders
Reinforcement learning
• Robotics, etc.
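A minimal sketch (assuming scikit-learn and synthetic data; the model choices and hyperparameters are illustrative, not from the talk) of fitting the same (x, y) data with a parametric linear model and a non-parametric kernel ridge model:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.kernel_ridge import KernelRidge

    # Synthetic (x, y) data: y = f(x) + noise, with a nonlinear f
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
    y = np.sin(X).ravel() + 0.1 * rng.randn(100)

    # Parametric model: a straight line described by two parameters
    linear = LinearRegression().fit(X, y)

    # Non-parametric model: kernel ridge regression with an RBF kernel
    krr = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5).fit(X, y)

    print("linear R^2:      ", round(linear.score(X, y), 2))
    print("kernel ridge R^2:", round(krr.score(X, y), 2))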
4. Pitfall #1: Overfitting
5. Pitfall #1: Overfitting
Check for overfitting with cross-validation: look at the gap between performance on the training data and performance on the test data. It should be as small as possible.
If the gap is large, add regularization and/or change the hyperparameters.
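A minimal sketch of this check, assuming scikit-learn and synthetic data (the model and the alpha values are illustrative assumptions):

    from sklearn.datasets import make_regression
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import cross_validate

    X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

    # Compare training vs. test scores across folds; a large gap suggests overfitting.
    for alpha in [1e-9, 1.0]:
        model = KernelRidge(kernel="rbf", alpha=alpha)
        cv = cross_validate(model, X, y, cv=5, return_train_score=True)
        print(f"alpha={alpha:g}  train R^2={cv['train_score'].mean():.2f}  "
              f"test R^2={cv['test_score'].mean():.2f}")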
6. Pitfall #1: Overfitting
Always show your test data score and your training data score
(Figure: total error decomposed into error from bias and error from overfitting.)
7. What is bias?
Meanings of the term “bias”:
• Statistical bias: the “bias” part of the error term, from the model not being the true model
• Biased training data
  • Training data collected in a biased way
  • Target signal leaks into the data
• Social bias: when the ML system does things that are against our values
The last two are closely related.
8. Statistical bias
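As a reference point (a standard result, not specific to this talk), statistical bias is the systematic part of a model's error, and the expected squared prediction error at a point x decomposes as

    \mathbb{E}\left[(y - \hat{f}(x))^2\right]
        = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
        + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
        + \sigma^2

where f is the true function, \hat{f} is the fitted model (random through the training sample), and \sigma^2 is the irreducible noise.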
9. Biased training data
The famous “tank story”
10. Biased training data
Should we be telling the tank story?
Gwern (https://www.gwern.net/Tanks) argues not, since:
• The story is often described as fact when there is no evidence it actually happened; higher epistemic rigor should be demanded.
• “The tank story tends to promote complacency and underestimation of the state of the art.”
Yet the story is most likely based on real research done in the 1960s on identifying tanks in aerial photos.[1] However, the published research corrects for brightness levels by applying a Laplacian filter to the images.
1. L. N. Kanal and N. C. Randall. 1964. Recognition system design by statistical analysis. In Proceedings of the 1964 19th ACM National Conference (ACM '64).
11. Biased training data
A better story:
In the 1990s, the Cost-Effective HealthCare (CEHC) project funded a study to see if ML could predict the risk of death for patients with pneumonia.
The most accurate model was a multitask neural net, with an AUC of 0.86, compared to 0.77 for logistic regression.
The system was almost fielded, but the researchers felt it was risky to put a black-box model into production without knowing how it was working. So they trained a rule-based learning system on the same data. It had lower accuracy, but was highly transparent. One rule it learned was:
HasAsthma(x) => LowerRisk(x)
The rule reflects a real pattern in the training data: pneumonia patients with asthma were routinely admitted straight to intensive care and treated aggressively, so their recorded outcomes were better, even though asthma actually raises risk. Acting on the rule would have been dangerous.
Cooper et al., Predicting dire outcomes of patients with community acquired pneumonia, Journal of Biomedical Informatics, vol. 38, no. 5, pp. 347-366, 2005.
Caruana et al., 2015. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
https://vimeo.com/125940125
12. Bias
Problems of bias in social applications
Kate Crawford (NIPS 2017 keynote) identifies two types of harms:
Harms of allocation
• Discrimination in products & services
  • Mortgage approval
  • Parole granting
  • Insurance rates
Harms of representation
• More subtle
• Perpetuation of social inequalities and stereotypes we don’t want to be perpetuated
• Misrepresentation of sensitive topics like personal and group identity
13. Bias
Examples of harms of allocation
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015.1 (2015): 92-112.
14. Bias
Examples of harms of representation
15. Bias
Examples of harms of representation
A newly published study (Buolamwini and Gebru, "Gender Shades," 2018) found high gender-classification error rates for dark-skinned women: 21% for Microsoft's system and 35% for IBM's, versus less than 1% error for white males. (Reported in The New York Times.)
16. Bias
Examples of harms of representation
This problem was fixed by Google in Dec. 2016
17. Bias
Examples of harms of representation
Sweeney, L. Discrimination in Online Ad Delivery. Communications of the ACM, vol. 56, no. 5, pp. 44-54 (2013). http://arxiv.org/abs/1301.6822
18. Bias
Examples of harms of representation
19. Bias
Implicit gender bias in word2vec
Bolukbasi et al., "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings," arXiv:1607.06520 (2016)
Gender associations judged to be biased: midwife:doctor, sewing:carpentry, registered_nurse:physician
Judged not biased: feminine:manly, convent:monastery, handbag:briefcase, etc.
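A hedged sketch of probing these analogy directions, assuming gensim and its downloadable pretrained GoogleNews word2vec vectors (a large download; the word pairs are illustrative):

    import gensim.downloader as api

    # Pretrained word2vec vectors, the kind of embedding studied by Bolukbasi et al.
    kv = api.load("word2vec-google-news-300")

    # "man is to doctor as woman is to ?"
    print(kv.most_similar(positive=["woman", "doctor"], negative=["man"], topn=3))

    # "she is to he as midwife is to ?"
    print(kv.most_similar(positive=["he", "midwife"], negative=["she"], topn=3))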
20. Parole granting case
Consider the COMPAS algorithm used to inform parole decisions. It uses 100 variables/features/risk factors; race is not explicitly considered.
It was widely reported in the media as being biased because it assigned high-risk scores to black defendants at a much higher rate than to white defendants.
Yet, conditional on risk factors considered legitimate (such as the number of prior convictions), the system did not exhibit bias between white and black defendants (Corbett-Davies et al., 2017).
Trying to make the algorithm 'fair' has a real social cost, quantified by Corbett-Davies et al.
Debiasing demo at https://research.google.com/bigpicture/attacking-discrimination-in-ml/
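A hedged sketch (entirely hypothetical data, not COMPAS data) of the kind of calibration check Corbett-Davies et al. discuss: within each risk-score bin, compare observed reoffense rates across groups.

    import pandas as pd

    # Hypothetical records: group label, assigned risk score, observed outcome
    df = pd.DataFrame({
        "group":      ["A", "B"] * 300,
        "risk_score": [1, 1, 5, 5, 9, 9] * 100,
        "reoffended": [0, 0, 1, 1, 1, 1] * 100,
    })

    # Observed reoffense rate per score bin, per group; similar columns
    # indicate the score is calibrated across groups.
    calibration = (df.groupby(["risk_score", "group"])["reoffended"]
                     .mean()
                     .unstack("group"))
    print(calibration)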
21. Simpson's paradox (aka confounding)
Stanford Admissions
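A small, purely hypothetical illustration of the paradox (the numbers are invented, not from the talk): one group has the higher admit rate within every department, yet the lower admit rate overall.

    import pandas as pd

    df = pd.DataFrame({
        "dept":     ["X", "X", "Y", "Y"],
        "group":    ["men", "women", "men", "women"],
        "applied":  [100, 20, 20, 100],
        "admitted": [60, 13, 4, 25],
    })

    # Within each department, women have the higher admit rate...
    df["rate"] = df["admitted"] / df["applied"]
    print(df)

    # ...but aggregated over departments, men do; the confounder is which
    # department each group mostly applied to.
    overall = df.groupby("group")[["admitted", "applied"]].sum()
    print(overall["admitted"] / overall["applied"])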
22. Pitfall – not cleaning your data
The “Schenectady Problem”: “the fallacies of self-reported data”
https://s6.io/schenectady-12345/
ZIP code 12345 belongs to Schenectady, NY (General Electric's ZIP code), so placeholder entries of “12345” in web forms make Schenectady appear wildly over-represented in self-reported data.
23. Pitfall - Not normalizing your data
Kernel methods are based on the distance between points: if one feature (dimension) is very large, it will dominate.
Normalization also helps speed up optimization of models by removing “long valleys” in the cost function.
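A minimal sketch (scikit-learn, synthetic data; the model choice is illustrative) of how standardizing the features changes a distance-based model when one feature lives on a much larger scale:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    X[:, 1] *= 1000.0                # second feature on a much larger scale
    y = X[:, 0] + X[:, 1] / 1000.0   # both features matter equally

    raw    = SVR(kernel="rbf")
    scaled = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    print("unscaled R^2:", cross_val_score(raw, X, y, cv=5).mean())
    print("scaled   R^2:", cross_val_score(scaled, X, y, cv=5).mean())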
24. Simpson's paradox
25. Pitfall – not comparing with baseline predictors
Scikit-learn contains a dummy regressor, which just returns the mean of y, as well as dummy classifiers.
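A minimal sketch (scikit-learn, synthetic data; the "real" model is an arbitrary choice) of comparing against such a trivial baseline:

    from sklearn.datasets import make_regression
    from sklearn.dummy import DummyRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

    baseline = DummyRegressor(strategy="mean")     # always predicts mean(y)
    model    = RandomForestRegressor(random_state=0)

    print("baseline R^2:", cross_val_score(baseline, X, y, cv=5).mean())  # ~0
    print("model    R^2:", cross_val_score(model, X, y, cv=5).mean())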
26. Pitfall - Not normalizing your data
It’s sometimes important to normalize the target variable as well.
log(y) or logistic(y) can be used to “squash” the data into a narrower range of values.
(Figure: results for Kernel Ridge Regression, Random Forest, and Support Vector Regression.)
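A minimal sketch (scikit-learn, synthetic skewed targets; the choice of kernel ridge and of a log transform is illustrative) of squashing the target before fitting:

    import numpy as np
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.datasets import make_regression
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
    y = np.exp(y / 100.0)            # heavily skewed, strictly positive target

    plain  = KernelRidge(alpha=1.0)
    logged = TransformedTargetRegressor(regressor=KernelRidge(alpha=1.0),
                                        func=np.log, inverse_func=np.exp)

    print("raw target R^2:", cross_val_score(plain, X, y, cv=5).mean())
    print("log target R^2:", cross_val_score(logged, X, y, cv=5).mean())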
27. Pitfall: trying to extrapolate
What’s the next number in this
sequence?
1, 3, 5, 7, ?
Correct solution: 217,341
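A hedged numpy sketch of the joke: a quartic polynomial passes exactly through 1, 3, 5, 7 and then 217,341, so fitting the observed points by itself tells you nothing about how the curve continues.

    import numpy as np

    x = np.array([1, 2, 3, 4, 5])
    y = np.array([1, 3, 5, 7, 217341])

    # The unique degree-4 polynomial through these five points: it matches
    # 1, 3, 5, 7 exactly and still gives 217,341 as the "next" value.
    coeffs = np.polyfit(x, y, deg=4)
    print(np.round(np.polyval(coeffs, x)))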
28. Pitfall: trying to extrapolate
Sometimes you get lucky, but typically nonlinear machine learning models do not extrapolate beyond the range of their training data.
29. Is “data science” a “science”?
Technically, yes. Data scientists generally follow the scientific method:
• They collect data.
• They create a “hypothesis” (the model to be fit).
• They see whether the model can fit the data. If it doesn't, some parameters are tweaked.
• Eventually, they test the model on held-out test data.
• If the model works, it goes into production (“becomes a theory”).
But are ML models falsifiable? ... Sort of(?)
30. ML models are “bad explanations”
https://www.ted.com/talks/david_deutsch_a_new_way_to_explain_explanation
Good explanations of the world cannot easily be changed to accommodate new data.
31. One way of looking at it…
32. Meta lesson – don't be arrogant!
“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” - John von Neumann
When you do regression, even with deep learning, typically all you are really doing is curve fitting! Some ML can be recast as data compression. You are not coming up with good explanations of what is happening.
33. The End
Thanks for listening!