Objective: to separate hype from reality in evaluating machine learning claims
1. What is machine learning
2. Machine learning: what's the point?
3. Interpolation and extrapolation
4. Statistics for evaluation
5. t-distributed stochastic neighbor embedding
6. Reflections
The license on SlideShare is set to "All Rights Reserved" because individual slides have different licenses. The overall structure of this presentation is CC BY 4.0.
1. Michael M. Hoffman
Princess Margaret Cancer Centre
Vector Institute
Department of Medical Biophysics
Department of Computer Science
University of Toronto
https://hoffmanlab.org/
Evaluating machine learning claims
@michaelhoffman
2. Andrew Ng, fair use. https://twitter.com/AndrewYNg/status/930938692310482944
Should radiologists worry about their jobs?
2017
3. Stanford ML Group, fair use. https://stanfordmlgroup.github.io/competitions/mura/
Andrew Ng, fair use. https://twitter.com/AndrewYNg/status/930938692310482944
Should radiologists worry about their jobs?
2017 2018
9. Chihuahua and muffin deep learning
Photos
1) Learning
Michael Hoffman, CC BY 4.0. Photos by Karen Zack, fair use. https://twitter.com/teenybiscuit/status/707727863571582978
10. Chihuahua and muffin deep learning
Photos
1) Learning
Labels
Michael Hoffman, CC BY 4.0. Photos by Karen Zack, fair use. https://twitter.com/teenybiscuit/status/707727863571582978
muffin chihuahua
11. Chihuahua and muffin deep learning
Photos
1) Learning
Backprop
Labels
Michael Hoffman, CC BY 4.0. Photos by Karen Zack, fair use. https://twitter.com/teenybiscuit/status/707727863571582978
Model
muffin chihuahua
15. It’s already been done
Enkhtogtokh Togootogtokh and Amarzaya Amartuvshin, CC BY 4.0. https://arxiv.org/abs/1801.09573
16. Machine learning: regression
Input variable
1) Learning
Response
variable
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
Least squares
17. Machine learning: regression
Input variable
1) Learning
Least squares
Slope
Intercept
Response
variable
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
18. Input variable
1) Learning
Least squares
Slope
Intercept
Response
variable
Regression
function
Input variable
2) Prediction
Machine learning: regression
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
19. Input variable
1) Learning
Least squares
Slope
Intercept
Response
variable
Regression
function
Input variable
Response
variable
2) Prediction
Machine learning: regression
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
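The two steps above can be sketched in a few lines. A minimal sketch assuming NumPy, with made-up numbers standing in for the plotted data:

```python
import numpy as np

# Made-up (input, response) pairs standing in for the plotted data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # input variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # response variable

# 1) Learning: least squares finds the slope and intercept
slope, intercept = np.polyfit(x, y, deg=1)

# 2) Prediction: apply the learned regression function to a new input
def predict(x_new):
    return slope * x_new + intercept

print(predict(6.0))   # note: 6.0 lies beyond the training inputs (extrapolation)
```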
20. Robert Hazen, fair use. https://science.sciencemag.org/content/202/4370/823
David Roubik, fair use. https://science.sciencemag.org/content/201/4360/1030
21. Degree of polynomial and fundamental trade-off
• As the polynomial degree increases, the training error goes down.
• But approximation error goes up: we start overfitting with large M.
Adapted from Arnaud Doucet and Mark Schmidt, CC BY 4.0. http://www.cs.ubc.ca/~arnaud/stat535/slides5_revised.pdf
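A quick numerical check of this trade-off (a sketch assuming NumPy; the noisy-curve data are invented): training error only falls as M grows, while error on held-out points does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy training points from an underlying curve, plus held-out points
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.2, size=50)

errors = {}
for M in (1, 3, 9):
    coef = np.polyfit(x_train, y_train, deg=M)
    train_err = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    errors[M] = (train_err, test_err)
    print(f"M={M}: training error {train_err:.4f}, held-out error {test_err:.4f}")

# M=9 fits the 10 training points exactly (training error ~0),
# but the held-out error stays large: overfitting.
```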
23. Supervised learning
• We are given training data where we know labels:
• But there is also test data we want to label:
X =                                                  y =
Egg   Milk  Fish  Wheat  Shellfish  Peanuts  …       Sick?
0     0.7   0     0.3    0          0                +
0.3   0.7   0     0.6    0          0.01             +
0     0     0     0.8    0          0                –
0.3   0.7   1.2   0      0.10       0.01             +
0.3   0     1.2   0.3    0.10       0.01             +
𝑋 =                                                  𝑦 =
Egg   Milk  Fish  Wheat  Shellfish  Peanuts  …       Sick?
0.5   0     1     0.6    2          1                ?
0     0.7   0     1      0          0                ?
3     1     0     0.5    0          0                ?
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
24. Supervised learning
• Typical supervised learning steps:
1. Build model based on training data X and y.
2. Model makes predictions 𝑦 on test data 𝑋.
• In machine learning:
– What we care about is the test error!
– You’ve only learned if you can do well in new situations.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
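Those two steps, sketched with a deliberately simple stand-in model: a one-nearest-neighbour classifier in plain Python, using the food-diary training data from slide 23.

```python
# Training data from slide 23: food amounts eaten, and whether the person got sick
X = [[0,   0.7, 0,   0.3, 0,    0],     # Egg, Milk, Fish, Wheat, Shellfish, Peanuts
     [0.3, 0.7, 0,   0.6, 0,    0.01],
     [0,   0,   0,   0.8, 0,    0],
     [0.3, 0.7, 1.2, 0,   0.10, 0.01],
     [0.3, 0,   1.2, 0.3, 0.10, 0.01]]
y = ["+", "+", "-", "+", "+"]           # Sick?

# 1. "Build" a one-nearest-neighbour model: it simply memorizes X and y.
# 2. Predict each test row's label from its closest training row.
def predict(x_new):
    dists = [sum((a - b) ** 2 for a, b in zip(row, x_new)) for row in X]
    return y[dists.index(min(dists))]

X_test = [[0.5, 0,   1, 0.6, 2, 1],
          [0,   0.7, 0, 1,   0, 0],
          [3,   1,   0, 0.5, 0, 0]]
print([predict(row) for row in X_test])
```

Note the model predicts its own training labels perfectly (distance zero), which says nothing about test error.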
25. The golden rule of machine learning evaluation
The test data cannot influence
the training phase in any way.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
26. Parameters and hyperparameters
• Parameters: decision tree rules
– Parameters control how well we fit a dataset.
– We “train” a model by trying to find the best
parameters on training data.
• Hyperparameters: decision tree depth
– Hyperparameters control how complex our
model is.
– We can’t “train” a hyperparameter.
• You can always fit training data better by making the model more complicated.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
27. Tuning hyperparameters
• How do we set hyperparameters?
• We care about test error.
• But we can’t look at test data.
• So what do we do?????
• One answer: Use part of your dataset to approximate test error.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
28. Adapted from Tiffany Timbers, CC BY 4.0. https://github.com/UBC-DSCI/dsci-100/
tuning set
training set
whole dataset
29. tuning set
training set
whole dataset
tuning set
predict class
for tuning set
Adapted from Tiffany Timbers, CC BY 4.0. https://github.com/UBC-DSCI/dsci-100/
30. Terminology for datasets
Purpose                Unified terminology   Traditional ML terminology   Biomedical terminology
Learn parameters       Training              Training                     Discovery
Tune hyperparameters   Tuning                Validation                   —
Measure performance    Test                  Test                         —
Show generalization    Validation            —                            —
Michael Hoffman, CC BY 4.0.
31. Choosing hyperparameters with tuning set
• So to choose a good value of depth (“hyperparameter”), we could:
– Try a depth-1 decision tree, compute tuning error.
– Try a depth-2 decision tree, compute tuning error.
– Try a depth-3 decision tree, compute tuning error.
– …
– Try a depth-20 decision tree, compute tuning error.
– Return the depth with the lowest tuning error.
• After choosing the hyperparameter, we often
re-train on the full training set with the chosen value.
Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
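The same loop in code, a sketch assuming NumPy and using polynomial degree as the hyperparameter in place of tree depth (the dataset is invented):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Split the training data into a smaller training set and a tuning set
x_train, y_train = x[:30], y[:30]
x_tune, y_tune = x[30:], y[30:]

def tuning_error(degree):
    coef = np.polyfit(x_train, y_train, deg=degree)
    return np.mean((np.polyval(coef, x_tune) - y_tune) ** 2)

# Try each hyperparameter value; keep the one with the lowest tuning error
best_degree = min(range(1, 11), key=tuning_error)
print("chosen degree:", best_degree)

# After choosing, re-train on the full training set with that hyperparameter
final_model = np.polyfit(x, y, deg=best_degree)
```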
32. Should you trust them?
• Scenario 1:
– “I built a model based on the data you gave me.”
– “It classified your data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably not:
– They are reporting training error.
– This might have nothing to do with test error.
– For example, they could have fit a very deep decision tree.
• Why ‘probably’?
– If they only tried a few very simple models, the 98% might be reliable.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
33. Should you trust them?
• Scenario 2:
– “I built a model based on half of the data you gave me.”
– “It classified the other half of the data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably:
– They computed the test error once.
– This is an unbiased approximation of the test error.
– Trust them if you believe they didn’t violate the golden rule.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
34. Should you trust them?
• Scenario 3:
– “I built 1 billion models based on half of the data you gave me.”
– “One of them classified the other half of the data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably not:
– They computed the test error a huge number of times.
– Maximizing over these errors is a biased approximation of test error.
– They tried so many models, one of them is likely to work by chance.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
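A simulation makes the bias concrete (plain Python; 20,000 coin-flip "models" stand in for the billion): the best of many useless models looks strong on the half it was selected on, and reverts to chance on new data.

```python
import random

random.seed(0)
n = 50
held_out = [random.randint(0, 1) for _ in range(n)]   # labels of "the other half"
new_data = [random.randint(0, 1) for _ in range(n)]   # labels of truly new data

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

# "Build" 20,000 models that only guess at random; keep the best-looking one.
best_acc, best_seed = 0.0, None
for _ in range(20_000):
    seed = random.getrandbits(64)                 # a "model" is just a seed
    r = random.Random(seed)
    acc = accuracy([r.randint(0, 1) for _ in range(n)], held_out)
    if acc > best_acc:
        best_acc, best_seed = acc, seed

print("best accuracy on the selection half:", best_acc)

# The winning model's predictions on genuinely new data: back to coin-flipping.
r = random.Random(best_seed)
_ = [r.randint(0, 1) for _ in range(n)]           # replay its held-out guesses
new_preds = [r.randint(0, 1) for _ in range(n)]   # its guesses for new data
print("same model on new data:", accuracy(new_preds, new_data))
```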
35. Should you trust them?
• Scenario 4:
– “I built 1 billion models based on the first third of the data you gave me.”
– “One of them classified the second third of the data with 98% accuracy.”
– “It also classified the last third of the data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably:
– They computed the first tuning error a huge number of times.
– But they had a test set that they only looked at once.
– The test set gives unbiased test error approximation.
– This is ideal, if they didn’t violate golden rule on the last third.
– Assuming the data points were independent of each other in the first place.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
37. How cross-validation is meant to be used
[Flowchart: data from class one and class two → preprocess data → cross-validation repeatedly splits the data into train data and tuning data → train classifier and get accuracy on tuning data for each split → determine analysis parameters → publish average accuracy]
Adapted from Brad Wyble et al., CC BY-NC 4.0. https://doi.org/10.1101/078816
38. How overfitting happens with cross-validation
[Flowchart: entire dataset (data from class one and class two) → preprocess data → with initial analysis parameters, cross-validation splits into train data and tuning data → train classifier and get accuracy on tuning data for each split → average accuracy good enough? If no, modify analysis parameters and retest on the same data; if yes, publish average accuracy]
Adapted from Brad Wyble et al., CC BY-NC 4.0. https://doi.org/10.1101/078816
39. Using a lock box to obtain a true estimate of out-of-sample accuracy
[Flowchart: entire dataset (data from class one and class two) → set aside some data in a lock box; the rest becomes the parameter optimization set → preprocess data → with initial parameters, cross-validation splits into train data and tuning data → train classifier and get accuracy on tuning data for each split → average accuracy good enough? If no, modify analysis parameters and retest on the same data; if yes, preprocess the lock box and publish accuracies on both the parameter optimization set and the lock box]
Adapted from Brad Wyble et al., CC BY-NC 4.0. https://doi.org/10.1101/078816
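A sketch of the lock-box discipline in plain Python (the 20% fraction and the shuffling seed are arbitrary choices, not from the paper):

```python
import random

def lockbox_split(data, lockbox_fraction=0.2, seed=0):
    """Set aside a lock box before any analysis happens."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_lock = int(len(rows) * lockbox_fraction)
    return rows[n_lock:], rows[:n_lock]   # (parameter optimization set, lock box)

records = list(range(100))                # stand-in for real records
opt_set, lock_box = lockbox_split(records)

# Iterate freely on opt_set: preprocess, cross-validate, modify parameters...
# The lock box is opened exactly once, at the very end, for the final accuracy.
assert len(lock_box) == 20
assert not set(lock_box) & set(opt_set)   # no overlap, so the estimate is honest
```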
51. The confusion matrix
Jake Lever, Martin Krzywinski, and Naomi Altman, non-commercial use only. https://www.nature.com/articles/nmeth.3945
sensitivity: aka true positive rate (TPR)
precision: aka positive predictive value (PPV)
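The figure's metrics all reduce to the four counts TP, FP, FN, TN; a small helper (plain Python, with hypothetical counts) makes the definitions concrete:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Common metrics computed from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity (TPR, recall)": tp / (tp + fn),
        "specificity (TNR)": tn / (tn + fp),
        "precision (PPV)": tp / (tp + fp),
        "FDR": fp / (tp + fp),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts: 5 TP, 3 FP, 2 FN, 90 TN
for name, value in confusion_metrics(tp=5, fp=3, fn=2, tn=90).items():
    print(f"{name}: {value:.3f}")
```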
62. The same value of a metric can correspond to
very different classifier performance
Jake Lever, Martin Krzywinski, and Naomi Altman, non-commercial use only. https://www.nature.com/articles/nmeth.3945
63. [Plot: classifier score distributions and the resulting curve of false positive rate (FPR) = FP/(FP+TN) versus true positive rate (TPR) = sensitivity = recall = TP/(TP+FN)]
Adapted from Dariya Sydykova, MIT License. https://github.com/dariyasydykova/open_projects
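Those two axes define the ROC curve: sweep a threshold over the classifier's scores and record (FPR, TPR) at each step. A sketch in plain Python with invented scores (it assumes no tied scores; ties would need merging):

```python
def roc_points(scores, labels):
    """(FPR, TPR) at every threshold, from strictest to loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]        # strictest threshold: predict nothing positive
    for i in order:              # loosen the threshold one example at a time
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]   # invented classifier scores
labels = [1, 1, 0, 1, 0, 1, 0, 0]                    # true classes
print(roc_points(scores, labels))
```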
72. Recommended reading
Classification evaluation. https://www.nature.com/articles/nmeth.3945
I tried a bunch of things: the dangers of unexpected overfitting in classification. https://doi.org/10.1101/078816
Evolution of Translational Omics: Lessons Learned and the Path Forward (2012), National Academy of Sciences. https://www.nap.edu/read/13297/chapter/1
The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. https://doi.org/10.1371/journal.pone.0118432
The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) Statement. https://www.tripod-statement.org/TRIPOD
How to use t-SNE effectively. https://distill.pub/2016/misread-tsne/
Michael Hoffman, CC BY 4.0.
73. Reflections
• The most important thing about machine learning evaluation is that it
accurately describes performance in a realistic deployment scenario.
• The Golden Rule: the test data cannot influence the training phase in
any way.
• Controls: they’re not just for wet lab experiments.
• If positive predictive value (precision) isn’t mentioned, ask.
• Never trust performance boiled down to a single number.
• Work published in good journals is not necessarily done correctly.
Michael Hoffman, CC BY 4.0.
74. Acknowledgments Mark Schmidt
Shannon Ellis
Dariya Sydykova
Brad Wyble
Martin Krzywinski
Tiffany Timbers
Arnaud Doucet
Casey Greene
Benjamin Haibe-Kains
Funding
Canadian Institutes of Health Research; Princess
Margaret Cancer Foundation; Natural Sciences
and Engineering Research Council of Canada;
Ontario Institute for Cancer Research; Ontario
Ministry of Economic Development, Job Creation
and Trade; Medicine by Design; McLaughlin
Centre
The Hoffman Lab
Samantha Wilson
Eric Roberts
Mickaël Mendez
Linh Huynh
Coby Viner
Rachel Chan
Natalia Mukhina
Editor's Notes
- need laser pointer
- turn Workrave off
turn phone off
turn iPad off
Blue and gray circles indicate cases known to be positive (TP + FN) and negative (FP + TN), respectively, and blue and gray backgrounds/squares depict cases predicted as positive (TP + FP) and negative (FN + TN), respectively. Equations for calculating each metric are encoded graphically in terms of the quantities in the confusion matrix. FDR, false discovery rate.
(a–d) Each panel shows three different classification scenarios with a table of corresponding values of accuracy (ac), sensitivity (sn), precision (pr), F1 score (F1) and Matthews correlation coefficient (MCC). Scenarios in a group have the same value (0.8) for the metric in bold in each table: (a) accuracy, (b) sensitivity (recall), (c) precision and (d) F1 score. In each panel, those observations that do not contribute to the corresponding metric are struck through with a red line. The color-coding is the same as in Figure 1; for example, blue circles (cases known to be positive) on a gray background (predicted to be negative) are FNs.
Say we decided to see how well expression and TF binding correlate
How it translates to correct predictions
High prediction for some TFs that are not sequence specific
t-SNE is not a clustering method!!!
Controls! They’re not just for wet lab experiments
And finally, I'd like to thank you for your very kind attention.