Michael M. Hoffman
Princess Margaret Cancer Centre
Vector Institute
Department of Medical Biophysics
Department of Computer Science
University of Toronto
https://hoffmanlab.org/
Evaluating machine learning claims
@michaelhoffman
Andrew Ng, fair use. https://twitter.com/AndrewYNg/status/930938692310482944
Should radiologists worry about their jobs?
2017
Stanford ML Group, fair use. https://stanfordmlgroup.github.io/competitions/mura/
Andrew Ng, fair use. https://twitter.com/AndrewYNg/status/930938692310482944
Should radiologists worry about their jobs?
2017 2018
Today’s objective
To separate hype from reality in
evaluating machine learning claims
What is machine learning?
What is machine learning?
Data
1) Learning
Method
Model
Michael Hoffman, CC BY 4.0.
Data
1) Learning
Method
Model
Data Method Predictions
2) Prediction
Michael Hoffman, CC BY 4.0.
What is machine learning?
Karen Zack, fair use.
https://twitter.com/teenybiscuit/status/707727863571582978
Chihuahua and muffin deep learning
Photos
1) Learning
Michael Hoffman, CC BY 4.0. Photos by Karen Zack, fair use. https://twitter.com/teenybiscuit/status/707727863571582978
Chihuahua and muffin deep learning
Photos
1) Learning
Labels
Michael Hoffman, CC BY 4.0. Photos by Karen Zack, fair use. https://twitter.com/teenybiscuit/status/707727863571582978
muffin chihuahua
Chihuahua and muffin deep learning
Photos
1) Learning
Backprop
Labels
Michael Hoffman, CC BY 4.0. Photos by Karen Zack, fair use. https://twitter.com/teenybiscuit/status/707727863571582978
Model
muffin chihuahua
Photos
1) Learning
Backprop
Labels
Photos
2) Prediction
Michael Hoffman, CC BY 4.0.
Model
Chihuahua and muffin deep learning
= ?
Photos
1) Learning
Backprop
Labels
Photos Inference Labels
2) Prediction
Michael Hoffman, CC BY 4.0.
Model
Chihuahua and muffin deep learning
= ?
Photos
1) Learning
Backprop
Labels
Photos Inference Labels
2) Prediction
Michael Hoffman, CC BY 4.0.
Model
Chihuahua and muffin deep learning
= chihuahua
It’s already been done
Enkhtogtokh Togootogtokh and Amarzaya Amartuvshin, CC BY 4.0. https://arxiv.org/abs/1801.09573
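To make the two phases concrete, here is a minimal sketch of learning by backprop and prediction by inference, assuming PyTorch is available. The random tensors stand in for real labeled photos, and the model shape, batch size, and label encoding are illustrative, not from the talk or the cited paper.

```python
# Minimal sketch of the two phases on chihuahua/muffin photos (PyTorch assumed).
# Tensors, model size, and labels are illustrative placeholders.
import torch
from torch import nn

model = nn.Sequential(                        # tiny classifier, not a real CNN
    nn.Flatten(),
    nn.Linear(3 * 64 * 64, 128), nn.ReLU(),
    nn.Linear(128, 2),                        # two labels: muffin, chihuahua
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

# 1) Learning: backprop on labeled photos
photos = torch.rand(8, 3, 64, 64)             # stand-in for a real photo batch
labels = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
opt.zero_grad()
loss = loss_fn(model(photos), labels)
loss.backward()                               # backprop
opt.step()

# 2) Prediction: inference on new photos
with torch.no_grad():
    new_photos = torch.rand(2, 3, 64, 64)
    predicted = model(new_photos).argmax(dim=1)   # 0 = muffin, 1 = chihuahua
```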
Machine learning: regression
Input variable
1) Learning
Response
variable
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
Least squares
Machine learning: regression
Input variable
1) Learning
Least squares
Slope
Intercept
Response
variable
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
Input variable
1) Learning
Least squares
Slope
Intercept
Response
variable
Regression
function
Input variable
2) Prediction
Machine learning: regression
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
Input variable
1) Learning
Least squares
Slope
Intercept
Response
variable
Regression
function
Input variable
Response
variable
2) Prediction
Machine learning: regression
Michael Hoffman, CC BY 4.0. Plot by Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
×
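As a sketch of the same two phases for regression (not from the talk; NumPy assumed, data synthetic): least squares learns the slope and intercept, and the regression function then predicts responses for new inputs, including ones outside the training range.

```python
# 1) Learning: least squares gives slope and intercept.
# 2) Prediction: apply the regression function to new inputs.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)                    # input variable
y = 2.0 * x + 1.0 + rng.normal(0, 1, 30)      # response variable (synthetic)

slope, intercept = np.polyfit(x, y, deg=1)    # least squares fit
y_new = slope * np.array([3.0, 12.0]) + intercept
# 3.0 interpolates; 12.0 extrapolates beyond the training range
```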
Robert Hazen, fair use. https://science.sciencemag.org/content/202/4370/823
David Roubik, fair use. https://science.sciencemag.org/content/201/4360/1030
Degree of polynomial and fundamental trade-off
• As the polynomial degree increases, the training error goes down.
• But approximation error goes up: we start overfitting with large M.
Adapted from Arnaud Doucet and Mark Schmidt, CC BY 4.0. http://www.cs.ubc.ca/~arnaud/stat535/slides5_revised.pdf
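A small simulation of the trade-off, with synthetic data and illustrative degrees: training error falls as M grows, while held-out error eventually rises.

```python
# Training error falls monotonically with polynomial degree M;
# held-out error eventually rises (overfitting). Data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(0, 0.3, 40)      # noisy ground truth
x_tr, y_tr, x_te, y_te = x[:20], y[:20], x[20:], y[20:]

for M in (1, 3, 9, 15):
    coeffs = np.polyfit(x_tr, y_tr, deg=M)      # least squares fit
    err_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    err_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"M={M:2d}  train MSE={err_tr:.3f}  held-out MSE={err_te:.3f}")
```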
Machine learning:
what’s the point?
Supervised learning
• We are given training data where we know labels:
• But there is also test data we want to label:
Egg Milk Fish Wheat Shellfish Peanuts …
0 0.7 0 0.3 0 0
0.3 0.7 0 0.6 0 0.01
0 0 0 0.8 0 0
0.3 0.7 1.2 0 0.10 0.01
0.3 0 1.2 0.3 0.10 0.01
Sick?
+
+
–
+
+
X = y =
Egg Milk Fish Wheat Shellfish Peanuts …
0.5 0 1 0.6 2 1
0 0.7 0 1 0 0
3 1 0 0.5 0 0
Sick?
?
?
?
X̃ = ŷ =
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
Supervised learning
• Typical supervised learning steps:
1. Build model based on training data X and y.
2. Model makes predictions ŷ on test data X̃.
• In machine learning:
– What we care about is the test error!
– You’ve only learned if you can do well in new situations.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
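The two steps as a sketch, assuming scikit-learn; the feature matrices below are random stand-ins for the allergen table above, and the model choice is illustrative.

```python
# Step 1: build model on training data X, y.  Step 2: predict on test data.
# scikit-learn assumed; features are random stand-ins for the allergen table.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((5, 6))           # training features (egg, milk, fish, ...)
y = np.array([1, 1, 0, 1, 1])    # sick? (+ = 1, - = 0)
X_test = rng.random((3, 6))      # new patients with unknown labels

model = DecisionTreeClassifier(random_state=0).fit(X, y)   # 1) learning
y_hat = model.predict(X_test)                              # 2) prediction
# What we care about is the error of y_hat on new data, not the training fit.
```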
The golden rule of machine learning evaluation
The test data cannot influence
the training phase in any way.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
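A common, easy-to-miss violation of the golden rule, sketched with scikit-learn on hypothetical data: fitting preprocessing on all the data lets test-set statistics leak into the training phase.

```python
# Golden-rule violation: fitting the scaler on ALL data leaks test-set
# statistics (means, variances) into the training phase.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = np.random.rand(100, 6), np.random.randint(0, 2, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Wrong: the test data influences training via the shared scaler fit.
leaky = StandardScaler().fit(np.vstack([X_train, X_test]))

# Right: fit preprocessing on training data only, then apply it to test data.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
```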
Parameters and hyperparameters
• Parameters: decision tree rules
– Parameters control how well we fit a dataset.
– We “train” a model by trying to find the best
parameters on training data.
• Hyperparameters: decision tree depth
– Hyperparameters control how complex our
model is.
– We can’t “train” a hyperparameter.
• You can always fit training data better by making the model more complicated.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
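A quick illustration of that last point, assuming scikit-learn: on pure-noise labels, raising the depth hyperparameter lets the tree's parameters fit the training data arbitrarily well.

```python
# "You can always fit training data better by making the model more
# complicated": training accuracy is non-decreasing in tree depth.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = rng.random((200, 6)), rng.integers(0, 2, 200)   # pure-noise labels

for depth in (1, 5, 20, None):                         # None = unlimited
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(f"depth={depth}  training accuracy={tree.score(X, y):.2f}")
# Unlimited depth memorizes the noise: parameters do the fitting,
# the depth hyperparameter controls how complex the fit can get.
```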
Tuning hyperparameters
• How do we set hyperparameters?
• We care about test error.
• But we can’t look at test data.
• So what do we do?????
• One answer: Use part of your dataset to approximate test error.
Adapted from Mark Schmidt, CC BY 4.0.
https://www.cs.ubc.ca/~schmidtm/Courses/
[Diagram, built over two slides: the whole dataset is split into a training set and a tuning set; the model trained on the training set then predicts class labels for the tuning set]
Adapted from Tiffany Timbers, CC BY 4.0. https://github.com/UBC-DSCI/dsci-100/
Terminology for datasets
                       Unified       Traditional ML   Biomedical
                       terminology   terminology      terminology
Learn parameters       Training      Training         Discovery
Tune hyperparameters   Tuning        Validation       —
Measure performance    Test          Test             —
Show generalization    Validation    —                —
Michael Hoffman, CC BY 4.0.
Choosing hyperparameters with tuning set
• So to choose a good value of depth (“hyperparameter”), we could:
– Try a depth-1 decision tree, compute tuning error.
– Try a depth-2 decision tree, compute tuning error.
– Try a depth-3 decision tree, compute tuning error.
– …
– Try a depth-20 decision tree, compute tuning error.
– Return the depth with the lowest tuning error.
• After choosing the hyperparameter, we often
re-train on the full training set with the chosen value.
Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
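The same sweep as a sketch, assuming scikit-learn; the data and split sizes are placeholders.

```python
# Try depths 1..20, keep the depth with the lowest tuning error,
# then re-train on the full training set with that depth.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = rng.random((300, 6)), rng.integers(0, 2, 300)
X_train, X_tune, y_train, y_tune = train_test_split(X, y, random_state=0)

errors = {}
for depth in range(1, 21):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    errors[depth] = 1 - tree.score(X_tune, y_tune)     # tuning error

best_depth = min(errors, key=errors.get)
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final.fit(np.vstack([X_train, X_tune]), np.concatenate([y_train, y_tune]))
```

Note that the test set never appears in this loop; only the tuning set gets reused.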
Should you trust them?
• Scenario 1:
– “I built a model based on the data you gave me.”
– “It classified your data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably not:
– They are reporting training error.
– This might have nothing to do with test error.
– For example, they could have fit a very deep decision tree.
• Why ‘probably’?
– If they only tried a few very simple models, the 98% might be reliable.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
Should you trust them?
• Scenario 2:
– “I built a model based on half of the data you gave me.”
– “It classified the other half of the data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably:
– They computed the test error once.
– This is an unbiased approximation of the test error.
– Trust them if you believe they didn’t violate the golden rule.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
Should you trust them?
• Scenario 3:
– “I built 1 billion models based on half of the data you gave me.”
– “One of them classified the other half of the data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably not:
– They computed the test error a huge number of times.
– Maximizing over these errors is a biased approximation of test error.
– They tried so many models, one of them is likely to work by chance.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
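A small simulation of why this fails (illustrative numbers; 100,000 models stand in for 1 billion): even classifiers that guess completely at random include one that scores well on a fixed held-out half.

```python
# With enough models, one classifies a noise dataset well by luck alone.
import numpy as np

rng = np.random.default_rng(0)
y_other_half = rng.integers(0, 2, 50)        # held-out labels, pure noise

best = 0.0
for _ in range(100_000):                     # stand-in for "1 billion models"
    guesses = rng.integers(0, 2, 50)         # a "model" that guesses randomly
    best = max(best, (guesses == y_other_half).mean())
print(f"best accuracy among random models: {best:.0%}")   # well above 50%
```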
Should you trust them?
• Scenario 4:
– “I built 1 billion models based on the first third of the data you gave me.”
– “One of them classified the second third of the data with 98% accuracy.”
– “It also classified the last third of the data with 98% accuracy.”
– “It should get 98% accuracy on the rest of your data.”
• Probably:
– They computed the first tuning error a huge number of times.
– But they had a test set that they only looked at once.
– The test set gives unbiased test error approximation.
– This is ideal, if they didn’t violate golden rule on the last third.
– This assumes the data points were independent of each other in the first place.
Adapted from Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
Adapted from Tiffany Timbers, CC BY 4.0. https://github.com/UBC-DSCI/dsci-100/
How Cross Validation is meant to be used
[Flowchart: Determine Analysis Parameters → Preprocess Data (Data from Class One, Data from Class Two) → Cross Validation splits the data into Train Data and Tuning Data → Train Classifier and Get Accuracy on Tuning Data, once per fold → Publish Average Accuracy]
Adapted from Brad Wyble et al., CC BY-NC 4.0. https://doi.org/10.1101/078816
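The intended pattern as a sketch, assuming scikit-learn: analysis parameters fixed in advance, cross-validation run once, average accuracy reported without looping back.

```python
# Cross-validation as intended: parameters fixed up front, folds scored
# once, average accuracy reported; no retesting on the same data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = rng.random((200, 6)), rng.integers(0, 2, 200)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)  # decided in advance
scores = cross_val_score(clf, X, y, cv=3)   # train/tuning split per fold
print(f"average accuracy: {scores.mean():.2f}")            # publish this
```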
How Overfitting happens with Cross Validation
[Flowchart: Initial Analysis Parameters → Preprocess Data (entire data set: Class One and Class Two) → Cross Validation splits the data into Train Data and Tuning Data → Train Classifier and Get Accuracy on Tuning Data, per fold → Average Accuracy Good Enough? If no: Modify Analysis Parameters and retest on the same data; if yes: Publish Average Accuracy]
Adapted from Brad Wyble et al., CC BY-NC 4.0. https://doi.org/10.1101/078816
Using a Lock Box to obtain a true estimate of out-of-sample accuracy
[Flowchart: from the entire data set, Set Aside Some Data in a Lock Box; the remainder becomes the Parameter Optimization Set → Initial Parameters → Preprocess Data → Cross Validation (Train Data / Tuning Data) → Train Classifier and Get Accuracy on Tuning Data → Average Accuracy Good Enough? If no: Modify Analysis Parameters and retest on the same data; if yes: Preprocess Lock Box and Publish Accuracies on Parameter Optimization Set and Lock Box]
Adapted from Brad Wyble et al., CC BY-NC 4.0. https://doi.org/10.1101/078816
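A lock-box sketch, assuming scikit-learn (split sizes illustrative): set data aside before any tuning, tune as much as you like on the rest, and open the lock box exactly once at the end.

```python
# Lock box: set data aside BEFORE any tuning; open it once, at the end.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = rng.random((300, 6)), rng.integers(0, 2, 300)
X_opt, X_lock, y_opt, y_lock = train_test_split(X, y, test_size=0.25,
                                                random_state=0)

# Tune freely on the parameter optimization set...
best_depth = max(range(1, 21),
                 key=lambda d: cross_val_score(
                     DecisionTreeClassifier(max_depth=d, random_state=0),
                     X_opt, y_opt, cv=3).mean())
cv_acc = cross_val_score(DecisionTreeClassifier(max_depth=best_depth,
                                                random_state=0),
                         X_opt, y_opt, cv=3).mean()

# ...then open the lock box exactly once and publish both accuracies.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final.fit(X_opt, y_opt)
print(f"parameter optimization set (CV): {cv_acc:.2f}")
print(f"lock box (opened once):          {final.score(X_lock, y_lock):.2f}")
```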
Interpolation and extrapolation
No Free Lunch, Consistency, and the Future
[Figure built up over a sequence of slides]
Mark Schmidt, CC BY 4.0. https://www.cs.ubc.ca/~schmidtm/Courses/
Statistics for evaluation
The confusion matrix
[Figure built up over a sequence of slides: the confusion matrix (TP, FP, FN, TN) and the metrics derived from it, including sensitivity/recall, aka true positive rate (TPR), and precision, aka positive predictive value (PPV)]
Jake Lever, Martin Krzywinski, and Naomi Altman, non-commercial use only. https://www.nature.com/articles/nmeth.3945
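The matrix and its derived metrics in a few lines, assuming scikit-learn; the labels here are toy values.

```python
# Confusion matrix and the metrics built from it.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # toy labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # toy predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # recall, aka true positive rate (TPR)
precision   = tp / (tp + fp)   # aka positive predictive value (PPV)
specificity = tn / (tn + fp)   # true negative rate
accuracy    = (tp + tn) / (tp + tn + fp + fn)
```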
The same value of a metric can correspond to
very different classifier performance
Jake Lever, Martin Krzywinski, and Naomi Altman, non-commercial use only. https://www.nature.com/articles/nmeth.3945
[Animated figures, built up over several slides: a classifier score distribution alongside the resulting ROC curve (x-axis: false positive rate (FPR) = FP/(FP+TN); y-axis: true positive rate (TPR) = sensitivity = recall = TP/(TP+FN)) and the corresponding precision-recall curve (x-axis: recall; y-axis: precision = positive predictive value (PPV) = TP/(TP+FP))]
Adapted from Dariya Sydykova, MIT License. https://github.com/dariyasydykova/open_projects
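A sketch of why the choice of plot matters on imbalanced data, assuming scikit-learn (the 1:100 imbalance and score distributions are made up): the ROC summary can look healthy while average precision reveals weak positive predictive value.

```python
# On imbalanced data, the ROC curve can look fine while the
# precision-recall view reveals poor positive predictive value.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_pos, n_neg = 100, 10_000                          # 1:100 imbalance (made up)
y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
scores = np.concatenate([rng.normal(1, 1, n_pos),   # positives score higher
                         rng.normal(0, 1, n_neg)])

print(f"ROC AUC:           {roc_auc_score(y, scores):.2f}")
print(f"average precision: {average_precision_score(y, scores):.2f}")
# average precision comes out much lower than ROC AUC here
```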
Michael Hoffman, CC BY 4.0. https://twitter.com/michaelhoffman/status/1148254326881705984
Reflections
Recommended reading
Classification evaluation. https://www.nature.com/articles/nmeth.3945
I tried a bunch of things: the dangers of unexpected overfitting in classification.
https://doi.org/10.1101/078816
Evolution of Translational Omics: Lessons Learned and the Path Forward (2012),
National Academy of Sciences. https://www.nap.edu/read/13297/chapter/1
The precision-recall plot is more informative than the ROC plot when evaluating
binary classifiers on imbalanced datasets.
https://doi.org/10.1371/journal.pone.0118432
The TRIPOD (Transparent Reporting of a multivariable prediction model for
Individual Prognosis Or Diagnosis) Statement. https://www.tripod-statement.org/TRIPOD
How to use t-SNE effectively. https://distill.pub/2016/misread-tsne/
Michael Hoffman, CC BY 4.0.
Reflections
• The most important thing about machine learning evaluation is that it
accurately describes performance in a realistic deployment scenario.
• The Golden Rule: the test data cannot influence the training phase in
any way.
• Controls: they’re not just for wet lab experiments.
• If positive predictive value (precision) isn’t mentioned, ask.
• Never trust performance boiled down to a single number.
• Published literature, even at good journals, is not necessarily done correctly.
Michael Hoffman, CC BY 4.0.
Acknowledgments Mark Schmidt
Shannon Ellis
Dariya Sydykova
Brad Wyble
Martin Krzywinski
Tiffany Timbers
Arnaud Doucet
Casey Greene
Benjamin Haibe-Kains
Funding
Canadian Institutes of Health Research; Princess
Margaret Cancer Foundation; Natural Sciences
and Engineering Research Council of Canada;
Ontario Institute for Cancer Research; Ontario
Ministry of Economic Development, Job Creation
and Trade; Medicine by Design; McLaughlin
Centre
The Hoffman Lab
Samantha Wilson
Eric Roberts
Mickaël Mendez
Linh Huynh
Coby Viner
Rachel Chan
Natalia Mukhina
Editor's Notes
1. Reminders: need laser pointer; turn Workrave off; turn phone off; turn iPad off.
2–12. (Confusion matrix slides) Blue and gray circles indicate cases known to be positive (TP + FN) and negative (FP + TN), respectively, and blue and gray backgrounds/squares depict cases predicted as positive (TP + FP) and negative (FN + TN), respectively. Equations for calculating each metric are encoded graphically in terms of the quantities in the confusion matrix. FDR, false discovery rate.
  13. (a–d) Each panel shows three different classification scenarios with a table of corresponding values of accuracy (ac), sensitivity (sn), precision (pr), F1 score (F1) and Matthews correlation coefficient (MCC). Scenarios in a group have the same value (0.8) for the metric in bold in each table: (a) accuracy, (b) sensitivity (recall), (c) precision and (d) F1 score. In each panel, those observations that do not contribute to the corresponding metric are struck through with a red line. The color-coding is the same as in Figure 1; for example, blue circles (cases known to be positive) on a gray background (predicted to be negative) are FNs.
14–15. Say we decided to see how well expression and TF binding correlate.
16. How this translates to correct predictions: high predictions for some TFs that are not sequence specific.
17. t-SNE is not a clustering method! Controls: they’re not just for wet lab experiments.
  18. And finally, I'd like to thank you for your very kind attention.