Objective: to separate hype from reality in evaluating machine learning claims
1. What is machine learning
2. Machine learning: what's the point?
3. Interpolation and extrapolation
4. Statistics for evaluation
5. t-distributed stochastic neighbor embedding
6. Reflections
The license on SlideShare is set to "All Rights Reserved" because individual slides have different licenses. The overall structure of this presentation is CC BY 4.0.
1. Michael M. Hoffman
Princess Margaret Cancer Centre
Vector Institute
Department of Medical Biophysics
Department of Computer Science
University of Toronto
https://hoffmanlab.org/
Evaluating machine learning claims
@michaelhoffman
2. Andrew Ng, fair use. https://twitter.com/AndrewYNg/status/930938692310482944
Should radiologists worry about their jobs?
2017
3. Stanford ML Group, fair use. https://stanfordmlgroup.github.io/competitions/mura/
Andrew Ng, fair use. https://twitter.com/AndrewYNg/status/930938692310482944
Should radiologists worry about their jobs?
2017 2018
9. Chihuahua and muffin deep learning
Photos
1) Learning
Michael Hoffman, CC BY 4.0. Photos by Karen Zack, fair use. https://twitter.com/teenybiscuit/status/707727863571582978
10. Chihuahua and muffin deep learning
Photos
1) Learning
Labels
Michael Hoffman, CC BY 4.0. Photos by Karen Zack, fair use. https://twitter.com/teenybiscuit/status/707727863571582978
muffin chihuahua
11. Chihuahua and muffin deep learning
Photos
1) Learning
Backprop
Labels
Michael Hoffman, CC BY 4.0. Photos by Karen Zack, fair use. https://twitter.com/teenybiscuit/status/707727863571582978
Model
muffin chihuahua
15. It’s already been done
Enkhtogtokh Togootogtokh and Amarzaya Amartuvshin, CC BY 4.0. https://arxiv.org/abs/1801.09573
16. Machine learning: regression
Input variable
1) Learning
Response
variable
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
Least squares
17. Machine learning: regression
Input variable
1) Learning
Least squares
Slope
Intercept
Response
variable
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
18. Input variable
1) Learning
Least squares
Slope
Intercept
Response
variable
Regression
function
Input variable
2) Prediction
Machine learning: regression
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
19. Input variable
1) Learning
Least squares
Slope
Intercept
Response
variable
Regression
function
Input variable
Response
variable
2) Prediction
Machine learning: regression
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
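The two steps above can be sketched in a few lines. A minimal sketch assuming NumPy, with made-up numbers standing in for the plotted data:

```python
import numpy as np

# Made-up (input, response) pairs standing in for the plotted data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # response variable

# 1) Learning: least squares finds the slope and intercept
slope, intercept = np.polyfit(x, y, deg=1)

# 2) Prediction: apply the learned regression function to a new input
def predict(x_new):
    return slope * x_new + intercept

print(predict(6.0))   # note: 6.0 lies beyond the training inputs (extrapolation)
```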
20. Robert Hazen, fair use. https://science.sciencemag.org/content/202/4370/823
David Roubik, fair use. https://science.sciencemag.org/content/201/4360/1030
21. Degree of polynomial and fundamental trade-off
• As the polynomial degree increases, the training error goes down.
• But approximation error goes up: we start overfitting with large M.
Adapted from Arnaud Doucet and Mark Schmidt, CC BY 4.0. http://www.cs.ubc.ca/~arnaud/stat535/slides5_revised.pdf
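A quick numerical check of this trade-off (a sketch assuming NumPy; the noisy-curve data are invented): training error only falls as M grows, while error on held-out points does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy training points from an underlying curve, plus held-out points
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=50)

errors = {}
for M in (1, 3, 9):
    coef = np.polyfit(x_train, y_train, deg=M)
    train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    errors[M] = (train_err, test_err)
    print(f"M={M}: training error {train_err:.4f}, held-out error {test_err:.4f}")

# M=9 fits the 10 training points exactly (training error ~0),
# but the held-out error stays large: overfitting.
```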
23. Supervised learning
• We are given training data where we know labels:
• But there is also test data we want to label:
X =                                                  y =
Egg   Milk  Fish  Wheat  Shellfish  Peanuts  …       Sick?
0     0.7   0     0.3    0          0                +
0.3   0.7   0     0.6    0          0.01             +
0     0     0     0.8    0          0                –
0.3   0.7   1.2   0      0.10       0.01             +
0.3   0     1.2   0.3    0.10       0.01             +
𝑋 =                                                  𝑦 =
Egg   Milk  Fish  Wheat  Shellfish  Peanuts  …       Sick?
0.5   0     1     0.6    2          1                ?
0     0.7   0     1      0          0                ?
3     1     0     0.5    0          0                ?
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
24. Supervised learning
• Typical supervised learning steps:
1. Build model based on training data X and y.
2. Model makes predictions 𝑦 on test data 𝑋.
• In machine learning:
– What we care about is the test error!
– You’ve only learned if you can do well in new situations.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
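Those two steps, sketched with a deliberately simple stand-in model: a one-nearest-neighbour classifier in plain Python, using the food-diary training data from slide 23.

```python
# Training data from slide 23: food amounts eaten, and whether the person got sick
X = [[0,   0.7, 0,   0.3, 0,    0],     # Egg, Milk, Fish, Wheat, Shellfish, Peanuts
     [0.3, 0.7, 0,   0.6, 0,    0.01],
     [0,   0,   0,   0.8, 0,    0],
     [0.3, 0.7, 1.2, 0,   0.10, 0.01],
     [0.3, 0,   1.2, 0.3, 0.10, 0.01]]
y = ["+", "+", "-", "+", "+"]           # Sick?

# 1. "Build" a one-nearest-neighbour model: it simply memorizes X and y.
# 2. Predict each test row's label from its closest training row.
def predict(x_new):
    dists = [sum((a - b) ** 2 for a, b in zip(row, x_new)) for row in X]
    return y[dists.index(min(dists))]

X_test = [[0.5, 0,   1, 0.6, 2, 1],
          [0,   0.7, 0, 1,   0, 0],
          [3,   1,   0, 0.5, 0, 0]]
print([predict(row) for row in X_test])
```

Note the model predicts its own training labels perfectly (distance zero), which says nothing about test error.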
25. The golden rule of machine learning evaluation
The test data cannot influence
the training phase in any way.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
26. Parameters and hyperparameters
• Parameters: decision tree rules
– Parameters control how well we fit a dataset.
– We “train” a model by trying to find the best
parameters on training data.
• Hyperparameters: decision tree depth
– Hyperparameters control how complex our
model is.
– We can’t “train” a hyperparameter.
• You can always fit training data better by making the model more complicated.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
27. Tuning hyperparameters
• How do we set hyperparameters?
• We care about test error.
• But we can’t look at test data.
• So what do we do?????
• One answer: Use part of your dataset to approximate test error.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
28. Adapted from Tiffany Timbers, CC BY 4.0. https://github.com/UBC-DSCI/dsci-100/
tuning set
training set
whole dataset
29. tuning set
training set
whole dataset
tuning set
predict class
for tuning set
Adapted from Tiffany Timbers, CC BY 4.0. https://github.com/UBC-DSCI/dsci-100/
30. Terminology for datasets
Purpose                Unified terminology   Traditional ML terminology   Biomedical terminology
Learn parameters       Training              Training                     Discovery
Tune hyperparameters   Tuning                Validation                   —
Measure performance    Test                  Test                         —
Show generalization    Validation            —                            —
Michael Hoffman, CC BY 4.0.
31. Choosing hyperparameters with tuning set
• So to choose a good value of depth (“hyperparameter”), we could:
– Try a depth-1 decision tree, compute tuning error.
– Try a depth-2 decision tree, compute tuning error.
– Try a depth-3 decision tree, compute tuning error.
– …
– Try a depth-20 decision tree, compute tuning error.
– Return the depth with the lowest tuning error.
• After choosing the hyperparameter, we often
re-train on the full training set with the chosen value.
Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
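The same loop in code, a sketch assuming NumPy and using polynomial degree as the hyperparameter in place of tree depth (the dataset is invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Split the training data into a smaller training set and a tuning set
x_train, y_train = x[:30], y[:30]
x_tune, y_tune = x[30:], y[30:]

def tuning_error(degree):
    coef = np.polyfit(x_train, y_train, deg=degree)
    return np.mean((np.polyval(coef, x_tune) - y_tune) ** 2)

# Try each hyperparameter value; keep the one with the lowest tuning error
best_degree = min(range(1, 11), key=tuning_error)
print("chosen degree:", best_degree)

# After choosing, re-train on the full training set with that hyperparameter
final_model = np.polyfit(x, y, deg=best_degree)
```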
32. Should you trust them?
• Scenario 1:
– “I built a model based on the data you gave me.”
– “It classified your data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably not:
– They are reporting training error.
– This might have nothing to do with test error.
– For example, they could have fit a very deep decision tree.
• Why ‘probably’?
– If they only tried a few very simple models, the 98% might be reliable.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
33. Should you trust them?
• Scenario 2:
– “I built a model based on half of the data you gave me.”
– “It classified the other half of the data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably:
– They computed the test error once.
– This is an unbiased approximation of the test error.
– Trust them if you believe they didn’t violate the golden rule.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
34. Should you trust them?
• Scenario 3:
– “I built 1 billion models based on half of the data you gave me.”
– “One of them classified the other half of the data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably not:
– They computed the test error a huge number of times.
– Maximizing over these errors is a biased approximation of test error.
– They tried so many models, one of them is likely to work by chance.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
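A simulation makes the bias concrete (plain Python; 20,000 coin-flip "models" stand in for the billion): the best of many useless models looks strong on the half it was selected on, and reverts to chance on new data.

```python
import random

random.seed(0)
n = 50
held_out = [random.randint(0, 1) for _ in range(n)]   # labels of "the other half"
new_data = [random.randint(0, 1) for _ in range(n)]   # labels of truly new data

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

# "Build" 20,000 models that only guess at random; keep the best-looking one.
best_acc, best_seed = 0.0, None
for _ in range(20_000):
    seed = random.getrandbits(64)                 # a "model" is just a seed
    r = random.Random(seed)
    acc = accuracy([r.randint(0, 1) for _ in range(n)], held_out)
    if acc > best_acc:
        best_acc, best_seed = acc, seed

print("best accuracy on the selection half:", best_acc)

# The winning model's predictions on genuinely new data: back to coin-flipping.
r = random.Random(best_seed)
_ = [r.randint(0, 1) for _ in range(n)]           # replay its held-out guesses
new_preds = [r.randint(0, 1) for _ in range(n)]   # its guesses for new data
print("same model on new data:", accuracy(new_preds, new_data))
```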
35. Should you trust them?
• Scenario 4:
– “I built 1 billion models based on the first third of the data you gave me.”
– “One of them classified the second third of the data with 98% accuracy.”
– “It also classified the last third of the data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably:
– They computed the first tuning error a huge number of times.
– But they had a test set that they only looked at once.
– The test set gives unbiased test error approximation.
– This is ideal, if they didn’t violate golden rule on the last third.
– Assuming the data points were independent of each other in the first place.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
37. How cross-validation is meant to be used
[Flowchart: data from class one and class two → preprocess data → cross-validation repeatedly splits the data into train data and tuning data → train classifier and get accuracy on tuning data for each split → determine analysis parameters → publish average accuracy]
Adapted from Brad Wyble et al., CC BY-NC 4.0. https://doi.org/10.1101/078816
38. How overfitting happens with cross-validation
[Flowchart: entire dataset (data from class one and class two) → preprocess data → with initial analysis parameters, cross-validation splits into train data and tuning data → train classifier and get accuracy on tuning data for each split → average accuracy good enough? If no, modify analysis parameters and retest on the same data; if yes, publish average accuracy]
Adapted from Brad Wyble et al., CC BY-NC 4.0. https://doi.org/10.1101/078816
39. Using a lock box to obtain a true estimate of out-of-sample accuracy
[Flowchart: entire dataset (data from class one and class two) → set aside some data in a lock box; the rest becomes the parameter optimization set → preprocess data → with initial parameters, cross-validation splits into train data and tuning data → train classifier and get accuracy on tuning data for each split → average accuracy good enough? If no, modify analysis parameters and retest on the same data; if yes, preprocess the lock box and publish accuracies on both the parameter optimization set and the lock box]
Adapted from Brad Wyble et al., CC BY-NC 4.0. https://doi.org/10.1101/078816
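A sketch of the lock-box discipline in plain Python (the 20% fraction and the shuffling seed are arbitrary choices, not from the paper):

```python
import random

def lockbox_split(data, lockbox_fraction=0.2, seed=0):
    """Set aside a lock box before any analysis happens."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_lock = int(len(rows) * lockbox_fraction)
    return rows[n_lock:], rows[:n_lock]   # (parameter optimization set, lock box)

records = list(range(100))                # stand-in for real records
opt_set, lock_box = lockbox_split(records)

# Iterate freely on opt_set: preprocess, cross-validate, modify parameters...
# The lock box is opened exactly once, at the very end, for the final accuracy.
assert len(lock_box) == 20
assert not set(lock_box) & set(opt_set)   # no overlap, so the estimate is honest
```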
51. The confusion matrix
Jake Lever, Martin Krzywinski, and Naomi Altman, non-commercial use only. https://www.nature.com/articles/nmeth.3945
sensitivity: aka true positive rate (TPR)
precision: aka positive predictive value (PPV)
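The figure's metrics all reduce to the four counts TP, FP, FN, TN; a small helper (plain Python, with hypothetical counts) makes the definitions concrete:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Common metrics computed from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity (TPR, recall)": tp / (tp + fn),
        "specificity (TNR)": tn / (tn + fp),
        "precision (PPV)": tp / (tp + fp),
        "FDR": fp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts: 5 TP, 3 FP, 2 FN, 90 TN
for name, value in confusion_metrics(tp=5, fp=3, fn=2, tn=90).items():
    print(f"{name}: {value:.3f}")
```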
62. The same value of a metric can correspond to
very different classifier performance
Jake Lever, Martin Krzywinski, and Naomi Altman, non-commercial use only. https://www.nature.com/articles/nmeth.3945
63. [Plot: classifier score distributions and the resulting curve of false positive rate (FPR) = FP/(FP+TN) versus true positive rate (TPR) = sensitivity = recall = TP/(TP+FN)]
Adapted from Dariya Sydykova, MIT License. https://github.com/dariyasydykova/open_projects
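Those two axes define the ROC curve: sweep a threshold over the classifier's scores and record (FPR, TPR) at each step. A sketch in plain Python with invented scores (it assumes no tied scores; ties would need merging):

```python
def roc_points(scores, labels):
    """(FPR, TPR) at every threshold, from strictest to loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]        # strictest threshold: predict nothing positive
    for i in order:              # loosen the threshold one example at a time
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]   # invented classifier scores
labels = [1, 1, 0, 1, 0, 1, 0, 0]                    # true classes
print(roc_points(scores, labels))
```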
72. Recommended reading
Classification evaluation. https://www.nature.com/articles/nmeth.3945
I tried a bunch of things: the dangers of unexpected overfitting in classification. https://doi.org/10.1101/078816
Evolution of Translational Omics: Lessons Learned and the Path Forward (2012), National Academy of Sciences. https://www.nap.edu/read/13297/chapter/1
The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. https://doi.org/10.1371/journal.pone.0118432
The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) Statement. https://www.tripod-statement.org/TRIPOD
How to use t-SNE effectively. https://distill.pub/2016/misread-tsne/
Michael Hoffman, CC BY 4.0.
73. Reflections
• The most important thing about machine learning evaluation is that it
accurately describes performance in a realistic deployment scenario.
• The Golden Rule: the test data cannot influence the training phase in
any way.
• Controls: they’re not just for wet lab experiments.
• If positive predictive value (precision) isn’t mentioned, ask.
• Never trust performance boiled down to a single number.
• Work published in good journals is not necessarily done correctly.
Michael Hoffman, CC BY 4.0.
74. Acknowledgments Mark Schmidt
Shannon Ellis
Dariya Sydykova
Brad Wyble
Martin Krzywinski
Tiffany Timbers
Arnaud Doucet
Casey Greene
Benjamin Haibe-Kains
Funding
Canadian Institutes of Health Research; Princess
Margaret Cancer Foundation; Natural Sciences
and Engineering Research Council of Canada;
Ontario Institute for Cancer Research; Ontario
Ministry of Economic Development, Job Creation
and Trade; Medicine by Design; McLaughlin
Centre
The Hoffman Lab
Samantha Wilson
Eric Roberts
Mickaël Mendez
Linh Huynh
Coby Viner
Rachel Chan
Natalia Mukhina
Editor's Notes
- need laser pointer
- turn Workrave off
turn phone off
turn iPad off
Blue and gray circles indicate cases known to be positive (TP + FN) and negative (FP + TN), respectively, and blue and gray backgrounds/squares depict cases predicted as positive (TP + FP) and negative (FN + TN), respectively. Equations for calculating each metric are encoded graphically in terms of the quantities in the confusion matrix. FDR, false discovery rate.
(a–d) Each panel shows three different classification scenarios with a table of corresponding values of accuracy (ac), sensitivity (sn), precision (pr), F1 score (F1) and Matthews correlation coefficient (MCC). Scenarios in a group have the same value (0.8) for the metric in bold in each table: (a) accuracy, (b) sensitivity (recall), (c) precision and (d) F1 score. In each panel, those observations that do not contribute to the corresponding metric are struck through with a red line. The color-coding is the same as in Figure 1; for example, blue circles (cases known to be positive) on a gray background (predicted to be negative) are FNs.
Say we decided to see how well expression and TF binding correlate
How it translates to correct predictions
High prediction for some TFs that are not sequence specific
t-SNE is not a clustering method!!!
Controls! They’re not just for wet lab experiments
And finally, I'd like to thank you for your very kind attention.