Common Errors in Machine Learning
References
• http://ml.posthaven.com/machine-learning-done-wrong by Cheng-Tao Chu
• http://dataskeptic.com/epnotes/machine-learning-done-wrong.php on Data Skeptic episode #101
Overview
• Basics (8 topics)
  • Just enough knowledge to be dangerous
  • Thinking scientifically about black box solutions
• Intermediate (4 topics)
  • Developed instincts on methods
  • Ability to penalize and tune effectively
• Advanced (2 topics)
  • Mastery of methods
  • Pushing the information-theoretic limits of what can be extracted from data
Linearity assumption

Temperature (°F) | Heart Rate (bpm) | Healthy
98               | 86               | Yes
97               | 92               | Yes
104              | 100              | No
78               | 50               | No
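The pattern above is not linearly separable on temperature: healthy readings sit in a mid-range band with unhealthy values on both sides. A minimal sketch (pure Python, data copied from the table) checking that no single threshold splits the classes:

```python
# Rows from the table above: (temperature, heart_rate, healthy)
rows = [(98, 86, True), (97, 92, True), (104, 100, False), (78, 50, False)]

def separable_by_threshold(values, labels):
    """Return True if some single cut point puts all healthy cases on one side."""
    for t in sorted(values):
        for flip in (False, True):
            preds = [(v <= t) != flip for v in values]
            if preds == labels:
                return True
    return False

temps = [r[0] for r in rows]
labels = [r[2] for r in rows]
print(separable_by_threshold(temps, labels))  # False: healthy is a mid-range band
```

A linear model on temperature alone cannot express "healthy if 97–99, unhealthy otherwise"; a tree or an engineered feature like |temp − 98| can.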
Leakage

Hours practice per week | Trainer years experience | Score | Got medal
60                      | 10                       | 9.8   | Yes
40                      | 30                       | 9.6   | Yes
55                      | 15                       | 6.7   | No
54                      | 20                       | 8.2   | No

Predicting Olympic figure skating winners.
Data cleaning failures
Class imbalance

Store               | Amount | Distance from Home (miles) | Is Fraud
Walmart             | 100    | 10                         | No
Walmart             | 500    | 45                         | Yes
Ralph's             | 125    | 2                          | No
Taco Bell           | 6      | 10                         | No
Omar's Exotic Birds | 50     | 25                         | No
7-Eleven            | 8      | 15                         | No
SM City parking     | 3      | 30                         | No
United Oil          | 22     | 6                          | No
Roto-Rooter         | 500    | 45                         | No
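With only one fraud case in nine rows, a model that always predicts "No" looks accurate while catching nothing. A quick sketch (pure Python, labels copied from the Is Fraud column):

```python
labels = ["No", "Yes", "No", "No", "No", "No", "No", "No", "No"]  # Is Fraud column

# A classifier that always predicts "No" scores high accuracy while
# flagging zero fraud — on imbalanced data, accuracy is the wrong metric.
accuracy = sum(label == "No" for label in labels) / len(labels)
recall_on_fraud = 0.0  # the always-"No" model never catches the one fraud case
print(f"{accuracy:.2f}")  # 0.89
```

Precision/recall on the minority class, resampling, or class weights address this; raw accuracy hides it.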
Mis-handling of missing values

Loan issuer         | Income | Owns home? | Default on loan
Tukey Bank          | 56k    | No         | No
Fisher Credit Union | 100k   | Yes        | No
Ponzi Bank          | 80k    | N/A        | Yes
Madoff Trust        | 85k    | N/A        | Yes
Generic Bank        | 78k    | Yes        | No
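Dropping the N/A rows here would discard every defaulter; the missingness itself carries signal. A small sketch (pure Python, data from the table) encoding N/A as its own indicator feature:

```python
# Rows from the table: (issuer, income_k, owns_home, defaulted); None = N/A
rows = [("Tukey Bank", 56, "No", False),
        ("Fisher Credit Union", 100, "Yes", False),
        ("Ponzi Bank", 80, None, True),
        ("Madoff Trust", 85, None, True),
        ("Generic Bank", 78, "Yes", False)]

# Dropping N/A rows removes every defaulter from the training set. Instead,
# encode missingness as a feature — here it tracks default exactly.
missing_flags = [owns is None for _, _, owns, _ in rows]
defaults = [d for *_, d in rows]
print(missing_flags == defaults)  # True: the N/A pattern is itself predictive
```

Before imputing or dropping, ask why the value is missing; as the notes say, the provenance of the data matters.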
Overfitting

f1  | f2  | f3  | f4  | f5  | f6  | f7  | … | f999 | y
0.4 | 0.2 | 0.7 | 0.6 | 0.1 | 0.9 | 0.5 | … | 0.4  | 0.9
0.3 | 0.3 | 0.4 | 0.1 | 0.9 | 0.4 | 0.6 | … | 0.7  | 0.5
0.8 | 0.8 | 0.3 | 0.1 | 0.7 | 0.2 | 0.4 | … | 0.1  | 0.4
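With 999 features and only 3 rows, ordinary least squares can match the training targets exactly while learning nothing generalizable. A minimal sketch (numpy, with random data standing in for the table):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((3, 999))           # 3 rows, 999 features, as in the table
y = np.array([0.9, 0.5, 0.4])

# With far more features than rows, least squares fits the training data
# exactly — a perfect "fit" that says nothing about unseen data.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_error = np.abs(X @ w - y).max()
print(train_error < 1e-8)  # True: zero training error, pure memorization
```

Cross-validation, not training fit, is what exposes this.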
Assuming a representative sample

[Diagram: a population of individuals labeled Current Customer, Former Customer, and Never a Customer — sampling only one group does not represent the whole population.]
Not knowing your algorithm
8 Basics (summary)
1. Linearity assumption
2. Leakage
3. Data cleaning failures
4. Class imbalance
5. Mis-handling of missing values
6. Overfitting
7. Assuming a representative sample
8. Not knowing your algorithm
Experimental design errors
• Non-random A/B tests
• User leakage
• Insufficient control
• Not controlling for multiple comparisons
• Death by 1000 slices
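Slicing one experiment a thousand ways is a multiple-comparisons trap: under the null hypothesis p-values are uniform, so at α = 0.05 roughly 50 slices look "significant" by chance. A small simulation (pure Python; the seed and counts are illustrative choices of mine):

```python
import random

random.seed(0)
n_tests, alpha = 1000, 0.05

# Under the null hypothesis, p-values are uniform on [0, 1]. Slicing the
# data 1000 ways therefore yields ~50 spurious "discoveries" by chance.
p_values = [random.random() for _ in range(n_tests)]
significant = sum(p < alpha for p in p_values)
print(significant)  # close to alpha * n_tests = 50
```

Corrections such as Bonferroni (test each slice at α / n) or false-discovery-rate control are the standard remedies.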
Inappropriate interpretation of model parameters

"To increase revenue, we need to buy de-humidifiers."
Ignoring heteroskedasticity
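A minimal simulation (pure Python, synthetic data of my own) of heteroskedastic residuals, where the error spread grows with the predictor:

```python
import random

random.seed(1)
# Synthetic data whose noise scales with x: the residual spread is not
# constant across the range of the predictor (heteroskedasticity).
xs = [i / 100 for i in range(1, 1001)]
ys = [2 * x + random.gauss(0, 0.5 * x) for x in xs]
residuals = [y - 2 * x for x, y in zip(xs, ys)]

def variance(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

half = len(xs) // 2
print(variance(residuals[half:]) > variance(residuals[:half]))  # True
```

Plotting residuals against fitted values makes this pattern obvious, which is one reason to never stop doing EDA.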
Failure to regularize correctly
Est. Purchase Price = -1000 * odometer + 100,000 * MPG
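Coefficient magnitudes like these reflect feature units as much as effects, and a penalty applied to unstandardized features shrinks them unevenly. A sketch (numpy, synthetic car-price data I made up, closed-form ridge regression) showing why features should be standardized before regularizing:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
odometer = rng.uniform(10_000, 150_000, n)  # miles: large-scale feature
mpg = rng.uniform(15, 50, n)                # small-scale feature
price = -0.1 * odometer + 200.0 * mpg + rng.normal(0, 500, n)

def ridge(X, y, lam=10.0):
    # Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X_raw = np.column_stack([odometer, mpg])
X_raw = X_raw - X_raw.mean(axis=0)          # center features
X_std = X_raw / X_raw.std(axis=0)           # standardize features
y = price - price.mean()

w_raw = ridge(X_raw, y)  # raw units: the odometer coefficient looks tiny
w_std = ridge(X_std, y)  # standardized: odometer actually dominates
print(abs(w_raw[0]) < abs(w_raw[1]), abs(w_std[0]) > abs(w_std[1]))  # True True
```

On raw features the odometer coefficient looks negligible next to MPG's, yet per standard deviation the odometer drives more of the price; an unnormalized penalty would punish the two coefficients on entirely different scales.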
4 Intermediate (summary)
1. Experimental design errors
2. Inappropriate interpretation of model parameters
3. Ignoring heteroskedasticity
4. Failure to regularize appropriately
Not understanding Goodhart’s Law
“When a measure becomes a target,
it ceases to be a good measure”
Losing sight of the business use case
2 Advanced (summary)
1. Not understanding Goodhart’s Law
2. Losing sight of the business use case
General advice
• Never stop learning and reading papers
• Never stop doing EDA
• Spend 20% of your time trying to break your own models
Thanks!
@DataSkeptic
DataSkeptic.com


Editor's Notes

  • #4 Your mileage may vary; these reflect when I happened to make them in my career.
  • #5 I rarely encounter properly linear data in my work. Explain: tidy data frame, design matrix, objective function.
  • #6 Leakage is usually more subtle than this; time is a common culprit (e.g., a "visited the cancel page on the site" feature).
  • #7 Extreme values that are not just outliers but errors; the data is being recorded differently.
  • #8 ML takes the easy way out (predicting the majority class).
  • #9 Don't just drop those; ask about the provenance of the data.
  • #10 Too many features; not controlling for multiple comparisons.
  • #12 K-means ignores "num previous owners" because its scale is different.
  • #15 Fit describes the training data (better measured with CV), but that doesn't mean you can control the outcome: "We just need rich customers."
  • #16 A lot of methods are fairly robust to this problem, though some statistical tests will fail. More importantly, it tells you there's a signal in your data that you haven't exploited.
  • #17 Must normalize features; regularization should be used but is sometimes ignored.
  • #20 In an interesting coincidence, this problem is perfectly solved by my thesis!
  • #21 By some definitions, an advanced person makes few errors