Avoiding Machine Learning Pitfalls
Dan Elton
Tech Valley Machine Learning Meetup
11/20/17
Pitfall #1: Overfitting
• Check for overfitting with cross-validation.
• Look at the gap between performance on the training data and performance on the test data; it should be as small as possible (see the sketch below).
• To reduce overfitting, add regularization and/or change hyperparameters.
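A minimal sketch of this check with scikit-learn; the dataset, model, and alpha values here are illustrative placeholders, not anything from the talk:

```python
# Sketch: measure the train/test gap and watch how regularization strength
# (ridge regression's alpha) changes it.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0.01, 1.0, 100.0]:   # larger alpha = stronger regularization
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5).mean()
    print(f"alpha={alpha:>6}: train R2={train_r2:.3f}  test R2={test_r2:.3f}  "
          f"gap={train_r2 - test_r2:.3f}  5-fold CV R2={cv_r2:.3f}")
```

A large positive gap flags overfitting; poor scores on both sets with a small gap point to bias (underfitting) instead.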
Always report both your training-set score and your test-set score.
[Figure: error curve annotated with "error from bias" and "error from overfitting".]
What is bias?
Meanings of the term "bias":
• Statistical bias: the "bias" part of the error term, arising from the model not being the true model
• Biased training data:
  • training data collected in a biased way
  • the target signal leaking into the data
• Social bias: when the ML system does things that are against our values
The last two are closely related.
Biased training data
The famous “tank story”
Biased training data
Should we be telling the tank story? Gwern (https://www.gwern.net/Tanks) argues not, since:
• The story is often presented as fact when there is no evidence it actually happened; higher epistemic rigor should be demanded.
• "the tank story tends to promote complacency and underestimation of the state of the art"
Yet the story is most likely based on real research from the 1960s on identifying tanks in aerial photos [1]. However, the published research corrects for brightness levels by applying a Laplacian filter to the images.
1. L. N. Kanal and N. C. Randall. 1964. Recognition system design by statistical analysis. In Proceedings of the 1964 19th ACM National Conference (ACM '64).
Biased training data
A better story (than the tank story): in the 1990s, the Cost Effective Health Care (CEHC) program funded a study to see whether ML could predict the risk of death for patients with pneumonia. The most accurate model was a multi-task neural net, with an AUC of 0.86, compared to 0.77 for logistic regression.
The system was almost fielded, but the researchers felt it was risky to put a black-box model into production without knowing how it was working, so they also trained a rule-based learner on the same data. It had lower accuracy but was highly transparent. One rule it learned was:
HasAsthma(x) → LowerRisk(x)
(As Caruana et al. explain, this reflects a quirk of the training data: pneumonia patients with asthma were treated more aggressively, often in the ICU, so their recorded mortality was lower; the treatment signal leaked into the target.)
Cooper et al., "Predicting dire outcomes of patients with community acquired pneumonia," Journal of Biomedical Informatics 38(5), pp. 347-366, 2005.
Caruana et al., "Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
https://vimeo.com/125940125
Bias
Problems of bias in social applications
Kate Crawford (NIPS 2017 keynote) identifies two types of harm:
Harms of allocation
• Discrimination in products & services:
  • mortgage approval
  • parole granting
  • insurance rates
Harms of representation
• More subtle
• Perpetuation of social inequalities and stereotypes we don't want perpetuated
• Misrepresentation of sensitive topics like personal and group identity
Bias
Examples of harms of allocation
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated experiments on ad privacy settings." Proceedings on Privacy Enhancing Technologies 2015.1 (2015): 92-112.
Bias
Examples of harms of representation
Bias
Examples of harms of representation
This problem was fixed by Google in Dec. 2016
Bias
Examples of harms of representation
Sweeney, L. "Discrimination in Online Ad Delivery." Communications of the ACM, Vol. 56, No. 5, pp. 44-54 (2013). http://arxiv.org/abs/1301.6822
Bias
Implicit gender bias in word2vec
Bolukbasi et al., "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings," arXiv:1607.06520 (2016)
Gender-biased associations: midwife:doctor, sewing:carpentry, registered_nurse:physician
Not biased: feminine:manly, convent:monastery, handbag:briefcase, etc.
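Not code from the paper, just a rough sketch of how one might probe such associations using gensim's pre-trained vectors. The model identifier comes from gensim-data (the download is roughly 1.6 GB), and the she/he similarity difference below is a crude proxy for a gender direction, not the paper's debiasing method:

```python
# Sketch: probing gendered analogies in pre-trained word2vec vectors with gensim.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")   # pre-trained Google News word2vec

# "man is to doctor as woman is to ?" via vector arithmetic
print(vectors.most_similar(positive=["woman", "doctor"], negative=["man"], topn=3))

# Crude check: how much more similar is each occupation to "she" than to "he"?
for word in ["nurse", "physician", "homemaker", "programmer"]:
    diff = vectors.similarity(word, "she") - vectors.similarity(word, "he")
    print(f"{word:12s} she-vs-he similarity difference: {diff:+.3f}")
```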
Pitfall: not cleaning your data
The "Schenectady problem": ZIP code 12345 belongs to Schenectady, NY, and people who don't want to give their real ZIP code often type 12345 into forms, so self-reported location data ends up skewed toward Schenectady. It is a classic example of "the fallacies of self-reported data."
https://s6.io/schenectady-12345/
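A tiny illustrative check of the kind that catches this; the dataframe and the "zip_code" column are hypothetical stand-ins for your own data:

```python
# Sketch: flagging suspicious placeholder values in self-reported ZIP codes.
import pandas as pd

df = pd.DataFrame({"zip_code": ["12345", "90210", "12345", "00000", "12180", "12345"]})

print(df["zip_code"].value_counts().head(10))   # one ZIP dominating is a red flag

placeholders = {"12345", "00000", "99999"}      # common junk entries people type in
print("suspect rows:", df["zip_code"].isin(placeholders).sum())
```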
Pitfall: not normalizing your data
Kernel methods are based on the distance between points: if one feature (dimension) has a much larger scale than the others, it will dominate.
Normalization also helps speed up optimization by removing "long valleys" in the cost function.
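A minimal sketch with scikit-learn; the synthetic data and the choice of SVR are illustrative, and the blown-up first feature stands in for a real column measured in much larger units:

```python
# Sketch: standardizing features before a kernel method so that no single
# feature dominates the distance computation.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X[:, 0] *= 1e4   # one feature on a huge scale dominates RBF distances

unscaled = SVR(kernel="rbf", C=100)
scaled = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100))

print("unscaled CV R2:", cross_val_score(unscaled, X, y, cv=5).mean())
print("scaled   CV R2:", cross_val_score(scaled, X, y, cv=5).mean())
```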
Pitfall: not normalizing your data (continued)
It is sometimes important to normalize the target variable as well. log(y) or logistic(y) can be used to squash the data into a narrower range of values.
[Figure: results for kernel ridge regression, random forest, and support vector regression.]
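One way to do this in scikit-learn is TransformedTargetRegressor, which applies the transform when fitting and inverts it when predicting; the data here is synthetic and the kernel ridge settings are placeholders:

```python
# Sketch: squashing a heavy-tailed, strictly positive target with log before
# fitting, and automatically inverting the transform at prediction time.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = np.exp(3 * X[:, 0] + rng.normal(scale=0.3, size=300))   # log-normal-ish target

model = TransformedTargetRegressor(
    regressor=KernelRidge(kernel="rbf", alpha=0.1),
    func=np.log,            # fit on log(y)
    inverse_func=np.exp,    # predictions come back on the original scale
)
model.fit(X, y)
print(model.predict(X[:3]), y[:3])
```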
Pitfall: not comparing with baseline predictors
Scikit-learn contains a dummy regressor, which simply predicts the mean of y, as well as dummy classifiers.
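A minimal sketch of the baseline comparison (synthetic data, illustrative model choice):

```python
# Sketch: always compare a real model against scikit-learn's dummy baseline.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

baseline = DummyRegressor(strategy="mean")   # always predicts mean(y): R2 near 0
model = KernelRidge(alpha=1.0)

print("dummy R2:", cross_val_score(baseline, X, y, cv=5).mean())
print("model R2:", cross_val_score(model, X, y, cv=5).mean())
```

DummyClassifier (e.g. strategy="most_frequent") plays the same role for classification problems.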
Pitfall: trying to extrapolate
What's the next number in this sequence?
1, 3, 5, 7, ?
"Correct" solution: 217,341. (Any fifth value is consistent with some polynomial that passes exactly through 1, 3, 5, 7.)
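The punchline is that a degree-4 polynomial passes exactly through 1, 3, 5, 7 and any fifth value you like; a quick numerical check with numpy:

```python
# Sketch: a degree-4 polynomial fits 1, 3, 5, 7 exactly and still lets the
# "next" term be anything at all, here 217341.
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 3, 5, 7, 217341])

coeffs = np.polyfit(x, y, deg=4)          # 5 points, degree 4: exact interpolation
print(np.polyval(coeffs, x).round(2))     # reproduces [1, 3, 5, 7, 217341]
```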
Pitfall: trying to extrapolate
Sometimes you get lucky, but nonlinear machine learning models typically do not extrapolate beyond the range of their training data.
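A small sketch of the failure mode with two common nonlinear regressors (synthetic data; the specific models and numbers are illustrative):

```python
# Sketch: nonlinear regressors fit y = x^2 well inside the training range
# but fail badly outside it (trees go flat; RBF kernels decay toward a constant).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 5, size=(200, 1))
y_train = x_train.ravel() ** 2

x_new = np.array([[1.0], [4.0], [8.0], [12.0]])   # last two lie outside [0, 5]

for model in (RandomForestRegressor(random_state=0), SVR(kernel="rbf", C=100)):
    model.fit(x_train, y_train)
    print(type(model).__name__, model.predict(x_new).round(1),
          "true:", x_new.ravel() ** 2)
```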
Is "data science" a "science"?
Technically, yes. Data scientists generally follow the scientific method:
• They collect data.
• They create a "hypothesis" (the model to be fit).
• They see whether the model can fit the data; if it doesn't, some parameters are tweaked.
• Eventually, they test the model on held-out test data.
• If the model works, it goes into production (it "becomes a theory").
But are ML models falsifiable? ... Sort of(?)
ML models are "bad explanations"
Good explanations of the world cannot easily be changed to accommodate new data (David Deutsch).
https://www.ted.com/talks/david_deutsch_a_new_way_to_explain_explanation
One way of looking at it…
Meta-lesson: don't be arrogant!
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." - John von Neumann
When you do regression, even with deep learning, typically all you are really doing is curve fitting. Some ML can be recast as data compression. You are not coming up with good explanations of what is happening.
The End
Thanks for listening!
Pitfall: trying to use machine learning for extrapolation
Pitfall: extrapolating relative model performance from small data to large data