The forgotten practicalities of machine learning:
Evaluating machine learning models
and their diagnostic value
Gaël Varoquaux
Based on Varoquaux & Colliot,
hal.archives-ouvertes.fr/hal-03682454
Evaluation: the Achilles heel of machine learning practice
Brain imaging prediction challenge [Traut... 2022]
Autism status prediction
10 000 € incentives
Analysts overfit the public set
G Varoquaux 1
Evaluation: the Achilles heel of machine learning practice
Kaggle competitions [Varoquaux and Cheplygina 2022]
[Figure: for four Kaggle medical-imaging competitions, the observed improvement in score of the top model over the 10% best submissions (winner gap), compared with the evaluation noise between the public and private test sets]
- Lung cancer classification: prize $1 000 000, test size max 1K (score axis -0.75 to +0.75)
- Pneumonia detection (localization): prize $30 000, test size 3 000 (score axis -0.15 to +0.15)
- Covid-19 abnormality localization: prize $100 000, test size 1 200 (score axis -0.01 to +0.01)
- Nerve segmentation: prize $100 000, test size 5.5K (score axis -0.04 to +0.04)
Many submissions fall within evaluation error
G Varoquaux 1
Evaluation: the Achilles heel of machine learning practice
Deep learning literature [Bouthillier... 2021]
[Figure: accuracy of published results over the years 2012-2020 on cifar10 (90-100%) and sst2 (85-100%), distinguishing non-'SOTA' results and significant vs non-significant improvements]
Published improvements often fall within error bars
G Varoquaux 1
Problem setting: defining objects
(x, y) ∈ X × Y drawn from D
a function f : X → Y such that f(x) ≈ y    Notation: ŷ := f(x)
Evaluate the risk
Error function l : Y × Y → ℝ captures the cost of errors
Expected error 𝔼_{x,y∼D}[ l(f(x), y) ]
⚠ Training might be done with
- another D
- another l
G Varoquaux 2
Outline
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
2 Evaluation procedures
Evaluating a prediction rule
Evaluating a training procedure
G Varoquaux 3
1 Performance metrics
Metrics must capture and reflect application
Going further for image analysis: [Maier-Hein... 2022]
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
Summary metric: accuracy
Accuracy: the fraction of correctly classified samples
Simple summary, but an overly simplified picture:
it does not distinguish false detections from misses
eg a false detection of a brain tumor?
- Very bad if it leads to brain surgery
- Minor if it leads to confirmatory imaging
The relevant cost (error metric) depends on the application setting
G Varoquaux 6
Confusion matrix: beyond accuracy
                         Truth: Disease status
                            D+        D−
Predicted      T+           TP        FP
test result    T−           FN        TN
G Varoquaux 7
Metrics on positive and negative classes
T denotes the test (classifier output); D denotes the diseased status.
Sensitivity (also called recall): fraction of positive samples actually retrieved
    Sensitivity = TP / (TP + FN)        Estimates P(T+ | D+)
Specificity: fraction of negative samples actually classified as negative
    Specificity = TN / (TN + FP)        Estimates P(T− | D−)
Positive predictive value (PPV, also called precision): fraction of the positively classified samples which are indeed positive
    PPV = TP / (TP + FP)                Estimates P(D+ | T+)
Negative predictive value (NPV): fraction of the negatively classified samples which are indeed negative
    NPV = TN / (TN + FN)                Estimates P(D− | T−)
G Varoquaux 8
[Varoquaux and Colliot 2022]
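A minimal sketch (not from the slides) of computing these four metrics with scikit-learn; the toy labels are made up.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # D+ = 1, D- = 0
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])   # T+ = 1, T- = 0

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # estimates P(T+ | D+)
specificity = tn / (tn + fp)   # estimates P(T- | D-)
ppv = tp / (tp + fp)           # precision, estimates P(D+ | T+)
npv = tn / (tn + fn)           # estimates P(D- | T-)
print(sensitivity, specificity, ppv, npv)
```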
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
Accuracy, sensitivity, specificity
                         Truth: Disease status
                            D+        D−
Predicted      T+           TP        FP
test result    T−           FN        TN
Accuracy is uninformative under class imbalance:
with 90% of class 0, predicting only class 0 gives Acc = 90%
Balanced accuracy: errors on class 0 and class 1
Sensitivity (also called recall): fraction of class 1 retrieved.   TP / (TP + FN)
Specificity: fraction of class 0 actually classified as 0.   TN / (TN + FP)
Balanced accuracy: ½ (sensitivity + specificity)
Sensitivity: P(T+|D+)    Specificity: P(T−|D−)
G Varoquaux 10
Asking the right question: P(T+|D+) vs P(D+|T+)
Positive predictive value (via Bayes' theorem):
P(D+ | T+) = sensitivity × prevalence / [ (1 − specificity) × (1 − prevalence) + sensitivity × prevalence ]
[Venn diagram: diseased vs test-positive populations]
Summary metric: markedness = PPV + NPV − 1
Drawback: depends on prevalence
⇒ characterizes not only the classifier, but also the dataset
G Varoquaux 11
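A minimal sketch of the Bayes computation above; the sensitivity, specificity and prevalence values are made up.

```python
def ppv_from_bayes(sensitivity, specificity, prevalence):
    """P(D+ | T+) for a classifier applied to a population with a given prevalence."""
    num = sensitivity * prevalence
    den = num + (1 - specificity) * (1 - prevalence)
    return num / den

# A sensitive and specific test can still have a low PPV when the disease is rare
print(ppv_from_bayes(0.90, 0.95, prevalence=0.01))   # ~0.15
```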
Odds ratios for invariance to sampling
Definition: odds of a:  O(a) = P(a) / (1 − P(a))
Likelihood ratio of the positive class:
    LR+ = O(D+|T+) / O(D+) = Sensitivity / (1 − Specificity)
Independent of class prevalence¹
Use the prevalence of the target population to compute O(D+|T+)
Useful to extrapolate from a case-control study to the target population
¹Proof in Varoquaux & Colliot, "Evaluating machine learning models and their diagnostic value"
G Varoquaux 12
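A minimal sketch of using LR+ to extrapolate to a target population; the helper names and numbers are ours.

```python
def lr_plus(sensitivity, specificity):
    return sensitivity / (1 - specificity)

def ppv_from_lr(lr, prevalence):
    pre_test_odds = prevalence / (1 - prevalence)
    post_test_odds = lr * pre_test_odds            # O(D+|T+) = LR+ x O(D+)
    return post_test_odds / (1 + post_test_odds)   # odds back to a probability

lr = lr_plus(0.90, 0.95)                  # = 18, independent of prevalence
print(ppv_from_lr(lr, prevalence=0.01))   # ~0.15, matches the Bayes formula above
print(ppv_from_lr(lr, prevalence=0.30))   # ~0.89 on a higher-prevalence population
```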
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
Multi-threshold metrics: ROC curve
[Figure: ROC curves, true positive rate (= sensitivity or recall) vs false positive rate: excellent (AUC=0.95), good (AUC=0.88), poor (AUC=0.75), chance (AUC=0.50)]
Classifiers output continuous scores;
the ROC curve is obtained by varying the decision threshold
Summary metric: AUC (Area Under the Curve); chance level = 0.5
G Varoquaux 14
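A minimal sketch with scikit-learn on synthetic imbalanced data; the ROC is computed on the classifier's continuous scores, not on thresholded predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Continuous scores: predicted probability of the positive class
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)   # one (FPR, TPR) point per threshold
print("ROC AUC:", roc_auc_score(y_test, scores))
```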
Multi-threshold metrics: Precision Recall curve
[Figure: precision-recall curves, precision (= PPV) vs recall (= TPR or sensitivity): excellent (AUC=0.96), good (AUC=0.92), poor (AUC=0.78), chance (AUC=0.57)]
Better dynamic range than ROC for imbalanced classes
Area Under the Curve: the chance level depends on class imbalance
G Varoquaux 15
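A minimal sketch reusing y_test and scores from the ROC example above; note that the chance level of the precision-recall curve is the class prevalence, not 0.5.

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# y_test, scores: as in the ROC example above
precision, recall, thresholds = precision_recall_curve(y_test, scores)
print("Average precision:", average_precision_score(y_test, scores))
print("Chance level (class prevalence):", y_test.mean())
```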
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
Interpreting classifier score as a probability? – Calibration
Calibration: among the samples given a confidence score s, the observed fraction of positives is s
Computed in bins on the score s
ECE (expected calibration error): average calibration error over all bins
[Figure: calibration plot of observed fraction of positives vs mean confidence score; the diagonal is perfect calibration, above it the classifier is underconfident, below it overconfident; example: GaussianNB]
⚠ Does not control individual probabilities
G Varoquaux 17
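A minimal sketch reusing y_test and scores from the examples above; the ECE is computed here with 10 equal-width bins, one possible choice among several.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# y_test, scores: as in the ROC example above
prob_true, prob_pred = calibration_curve(y_test, scores, n_bins=10)
# (prob_pred, prob_true) is the calibration plot: the diagonal is perfect calibration

# Expected calibration error: bin-size-weighted average of |observed - predicted|
bins = np.linspace(0.0, 1.0, 11)
bin_ids = np.clip(np.digitize(scores, bins) - 1, 0, 9)
ece = 0.0
for b in range(10):
    mask = bin_ids == b
    if mask.any():
        ece += mask.mean() * abs(y_test[mask].mean() - scores[mask].mean())
print("ECE:", ece)
```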
Classifier score as a probability? – Individual probabilities
Does the classifier score approach P(y|X)?
Metric:  Brier score = Σᵢ (ŝᵢ − yᵢ)²
    yᵢ: observed (binary) label;  ŝᵢ: confidence score
Minimal for ŝ = P(y|X)
Scaled version:  Brier skill score = 1 − Brier(ŝ, y) / Brier(ȳ, y)
    ȳ: class prevalence
1 = perfect prediction
0 = scores not more informative than class prevalence
G Varoquaux 18
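A minimal sketch reusing the same variables; scikit-learn's brier_score_loss is the mean rather than the sum of squared differences, which leaves the skill score unchanged.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# y_test, scores: as in the ROC example above
brier = brier_score_loss(y_test, scores)
# Reference: a constant prediction at the class prevalence
brier_ref = brier_score_loss(y_test, np.full_like(scores, y_test.mean()))
brier_skill = 1 - brier / brier_ref   # 1 = perfect, 0 = no better than prevalence
print(brier, brier_skill)
```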
Choose your metrics well
Machine learning research chases metrics;
these should reflect the application as well as possible
Worry about P(D+|T+)
- Accuracy is a reasonable proxy only for balanced classes
- LR+ is interesting to keep in mind
Worry about uncertainty
- Calibration quantifies average errors
- The Brier score controls individual uncertainty
A single number does not tell the whole story
G Varoquaux 19
2 Evaluation procedures
Controlled estimation of e = 𝔼_{x,y∼D}[ l(f(x), y) ]
Corresponding research paper: [Bouthillier... 2021]
Definitions: what are we benchmarking?
Scenario 1: a prediction rule
We are given f : X → Y
For application claims: eg medicine
Scenario 2: a training procedure
We are given a procedure that outputs a prediction rule f̂
from training data (X, y) ∈ (X × Y)ⁿ
For machine-learning research (claims on algorithms)
G Varoquaux 21
Model evaluation 101
In-sample error is not meaningful:
the training data are not independent of the learned model (X, y ⊥̸⊥ f̂),
so the expectation 𝔼 is estimated incorrectly.
Typically: split the full data into a train set and a test set
(test set typically 10%: a trade-off between better learning and better estimation)
G Varoquaux 22
2 Evaluation procedures
Evaluating a prediction rule
Evaluating a training procedure
External validation: before putting into production
We are given f : X → Y
Xtest should be different enough from Xtrain
- No repeated acquisition of the same individual in train & test [Little... 2017]
- Ideally: show generalization to a new site, later in time...
Xtest should be representative of the target population
Sample Xtest:
- to match statistical moments
- to minimize a confounding association (shortcuts) [Chyzhyk... 2018]
G Varoquaux 24
Evaluation error: Sampling noise on test set
Evaluation quality is limited by the number of test examples
[Figure: binomial distribution of the error on test accuracy for n_test = 1 000, roughly ±2% wide]
Sampling noise¹: confidence-interval width
    n_test = 1 000:    5.7%
    n_test = 10 000:   1.8%
    n_test = 100 000:  0.6%
¹The data at hand (eg the test set) is just a small sample of the full population "in the wild", and sampling other data will lead to other results.
G Varoquaux 25
[Varoquaux 2018]
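A minimal sketch of where such interval widths come from, using the normal approximation to the binomial; with an accuracy around 0.7 this roughly reproduces the widths quoted above.

```python
import numpy as np

def binomial_ci_width(acc, n_test, z=1.96):
    # Normal approximation to the binomial: +/- z * sqrt(p (1 - p) / n)
    return 2 * z * np.sqrt(acc * (1 - acc) / n_test)

for n in (1_000, 10_000, 100_000):
    print(n, f"{binomial_ci_width(0.7, n):.1%}")   # ~5.7%, 1.8%, 0.6%
```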
Evaluation error: Sampling noise on test set
[Figure: standard deviation of test accuracy (%) vs test-set size (10² to 10⁶), in theory (binomial: Binom(n', 0.66), Binom(n', 0.91), Binom(n', 0.95)) and in practice (random splits "in the wild": Glue-RTE BERT, n'=277; Glue-SST2 BERT, n'=872; CIFAR10 VGG11, n'=10000)]
G Varoquaux 26
[Bouthillier... 2021]
Evaluation noise is not negligible – in Kaggle competitions
[Figure: for four Kaggle medical-imaging competitions, the actual improvement of the top model over the 10% best submissions (winner gap), compared with the evaluation noise between the public and private test sets]
- Lung cancer classification (test size max 1K): smaller improvements than noise (diminishing returns)
- Schizophrenia classification (test size 120): diminishing returns
- Lung tumor segmentation (test size max 6k): poorer score on the private set (overfit)
- Nerve segmentation (test size 5.5K)
G Varoquaux 27
[Varoquaux and Cheplygina 2022]
Confidence intervals & statistical testing
Evaluate uncertainty to know when to stop and what to trust
(diminishing returns, creeping complexity)
Confidence interval¹: range of values compatible with the observations
Null hypothesis testing: p-value: the chance to observe the results if a null hypothesis were true
Typical null (model comparison): models p1 and p2 give the same expected error
¹Technically, we do not distinguish this from a credible interval.
G Varoquaux 28
Statistical testing with a given test set
[Diagram: full data split into a train set and a test set]
Simple distribution of metrics, eg accuracy¹: binomial
Confidence intervals as a function of the test-set size N and the observed accuracy:
    N        65%            80%            90%            95%
    100      [-9.0% 9.0%]   [-8.0% 8.0%]   [-6.0% 5.0%]   [-5.0% 4.0%]
    1000     [-3.0% 2.9%]   [-2.5% 2.4%]   [-1.9% 1.8%]   [-1.4% 1.3%]
    10000    [-0.9% 0.9%]   [-0.8% 0.8%]   [-0.6% 0.6%]   [-0.4% 0.4%]
    100000   [-0.3% 0.3%]   [-0.2% 0.2%]   [-0.2% 0.2%]   [-0.1% 0.1%]
Table from [Varoquaux and Colliot 2022]
¹Also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]
G Varoquaux 29
Statistical testing with a given test set
[Diagram: full data split into a train set and a test set]
Simple distribution of metrics, eg accuracy¹: binomial
Non-parametric alternatives:
- Confidence intervals on the performance of a prediction rule: bootstrap the test set
- Statistical testing (comparing prediction rules with correlated errors): permutations [Bandos... 2005]
  Sample the null by randomly switching predictions from p1 and p2.
¹Also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]
G Varoquaux 29
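A minimal sketch of the two non-parametric procedures: bootstrapping the test set for a confidence interval, and a permutation test that randomly swaps the two prediction rules' outputs; function names and defaults are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval on test accuracy."""
    n = len(y_true)
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # resample the test set with replacement
        accs.append(np.mean(y_true[idx] == y_pred[idx]))
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])

def permutation_test_accuracy(y_true, pred_1, pred_2, n_perm=10_000):
    """Two-sided p-value for the accuracy difference between two prediction rules."""
    observed = np.mean(pred_1 == y_true) - np.mean(pred_2 == y_true)
    diffs = []
    for _ in range(n_perm):
        swap = rng.random(len(y_true)) < 0.5     # randomly switch the two predictions
        p1 = np.where(swap, pred_2, pred_1)
        p2 = np.where(swap, pred_1, pred_2)
        diffs.append(np.mean(p1 == y_true) - np.mean(p2 == y_true))
    return np.mean(np.abs(diffs) >= abs(observed))
```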
2 Evaluation procedures
Evaluating a prediction rule
Evaluating a training procedure
Benchmarking to conclude on good training procedures
We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)ⁿ
We want machine-learning research claims (novel frobnicate improves prediction)
Many arbitrary components:
torch.manual_seed(3407) ?? [Picard 2021]
It is useless to tune random seeds (for weight init, dropout, data augmentation):
gains will not carry over to new training data
G Varoquaux 31
Arbitrary variance in a machine-learning benchmark
Variance when rerunning an evaluation,
modifying arbitrary elements:
[Figure: relative scale of the variance induced by each arbitrary element (numerical noise, dropout, weights init, data order, data augmentation, data bootstrap, noisy grid search, random search, Bayesian optimization), for bert-rte, bert-sst2, vgg, and on average]
Across various computer vision and NLP tasks
G Varoquaux 32
[Bouthillier... 2021]
Uncertainty due to test set sampling
The test set remains a limited sample
of the population
The train-test split is an arbitrary choice
[Figure: same variance plot as in the previous slide (bert-rte, bert-sst2, vgg, average)]
G Varoquaux 33
[Bouthillier... 2021]
Better evaluation
Sample these arbitrary choices multiple times [Bouthillier... 2021]
G Varoquaux 34
Better evaluation procedure
Cross-validation
multiple train-test splits
Randomize all arbitrary factors in folds
But a full learning pipeline comes with hyper-parameters,
which are set to minimize the error on left-out data.
[Diagram: full data split into a train set, a validation set (to make model choices), and a test set (to evaluate the model)]
G Varoquaux 34
Benchmarks accounting for hyper-parameter selection
Setting hyper-parameters is part of the pipeline
It must be evaluated
One option: train, validation, and test splits + manual tuning of hyperparameters
[Diagram: full data split into a train set, a validation set (to make model choices), and a test set (to evaluate the model)]
⚠ Rampant overfit of the validation set [Makarova... 2021]
Multiple splits + automated tuning of hyperparameters
[Diagram: nested splits; hyper-parameter tuning on an inner validation set, evaluation in an outer loop of train-test splits]
G Varoquaux 35
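A minimal sketch of this nested loop with scikit-learn; the model and parameter grid are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# Inner loop: hyper-parameter tuning on validation splits of each training fold
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)
# Outer loop: 5 train-test splits evaluating the whole tuning + fitting procedure
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())
```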
Hyper-parameter optimization procedures
Grid search
Vary each hyper-parameter on a grid
G Varoquaux 36
Hyper-parameter optimization procedures
Grid search
Random search [Bergstra and Bengio 2012]: preferable to grid search for more than 2 parameters
Sub-optimal hyper-parameters on models routinely lead to invalid conclusions; see refs in [Bouthillier... 2021]
[Figure: grid search vs randomized search over one important and one unimportant hyperparameter; random search covers the region of good hyperparameters better]
G Varoquaux 36
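A minimal sketch of random search with scikit-learn; the model and the sampling distribution are placeholders.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},   # sample C log-uniformly
    n_iter=30, cv=5, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```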
Benchmarking training procedures (eg to compare them)
Control arbitrary fluctuations (that will not generalize). Sample all of:
- data sampling: multiple train-test splits (cross-validation)
- arbitrary choices (seeds): randomize them all
- hyper-parameters: hyper-parameter optimization (too expensive to randomize)
In practice: set hyper-parameters once, then randomize model seeds and data splits [Bouthillier... 2021]
Counterintuitive: randomization decorrelates errors, improving benchmarks
G Varoquaux 37
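A minimal sketch of this recipe: hyper-parameters fixed once, then both the data split and the model seed are redrawn on every repetition; the model is a placeholder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)  # model seed
    scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
print(np.mean(scores), np.std(scores))   # compare differences to this std
```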
Accounting for variance in conclusions
Confidence intervals & statistical testing
[Diagram: full data split into a train set and a test set]
Statistical testing with multiple folds
Challenge: folds are not independent¹
⚠ t-tests / Wilcoxon across folds are not valid:
don't divide the std by the number of folds
Approximate corrections for dependence across folds²:
- 5x2cv: repeat 5 times a randomized 2-fold split [Dietterich 1998]
- Corrected resampled t-test statistic [Nadeau and Bengio 2003]
¹Train sets overlap, and often test sets also do.
²Does not account for sources of variance beyond data sampling, eg random seeds, hyper-parameters.
G Varoquaux 38
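A minimal sketch of the corrected resampled t-test, in its usual form where the variance of the per-split differences is inflated by a factor 1/J + n_test/n_train; the function name is ours.

```python
import numpy as np
from scipy import stats

def corrected_resampled_ttest(diffs, n_train, n_test):
    """diffs: per-split score differences between two pipelines (length J)."""
    diffs = np.asarray(diffs, dtype=float)
    J = len(diffs)
    var = diffs.var(ddof=1)
    # Correction of Nadeau & Bengio (2003) for the overlap between training sets
    t = diffs.mean() / np.sqrt((1.0 / J + n_test / n_train) * var)
    p = 2 * stats.t.sf(abs(t), df=J - 1)   # two-sided p-value
    return t, p
```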
Statistical tests: across datasets
(more general claims on training pipelines)
Challenge: metrics are not comparable across datasets
⇒ tests based on rank statistics
Wilcoxon signed rank test
Tests how often p1 outperforms p2 across datasets
G Varoquaux 39
[Demšar 2006]
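A minimal sketch with scipy; the per-dataset scores of the two pipelines are made up.

```python
import numpy as np
from scipy.stats import wilcoxon

scores_p1 = np.array([0.810, 0.742, 0.923, 0.684, 0.775, 0.856, 0.707, 0.888])
scores_p2 = np.array([0.789, 0.754, 0.890, 0.640, 0.760, 0.830, 0.700, 0.870])
stat, p_value = wilcoxon(scores_p1, scores_p2)   # signed-rank test on paired scores
print(p_value)
```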
Statistical tests: multiple pipelines across datasets
Challenge: multiple comparisons¹
The Wilcoxon-Holm approach [Demšar 2006]: pairwise comparisons + Bonferroni-Holm correction
The Friedman-Nemenyi approach² [Demšar 2006]:
1. Friedman test across all pipelines (omnibus)
2. Critical difference via the Nemenyi test
[Figure: critical difference diagram ranking five classifiers (clf1-clf5) by mean accuracy rank]
Replicability analysis [Dror... 2017]
- Perform dataset-level pairwise tests
- Combine by testing (partial conjunction multiple testing): "Does p1 perform better than p2 on at least u datasets?"
- More powerful for a small number of datasets
¹If we do many tests, some will show large differences by chance.
²The Holm approach can be more interesting when considering comparisons to a referent classifier.
G Varoquaux 40
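A minimal sketch of the Friedman omnibus step with scipy; the scores are made up, and the Nemenyi post-hoc / critical-difference diagram requires extra tooling (eg the scikit-posthocs package), not shown here.

```python
import numpy as np
from scipy.stats import friedmanchisquare

scores = np.array([           # rows: 6 datasets, columns: 3 pipelines
    [0.81, 0.79, 0.84],
    [0.74, 0.75, 0.78],
    [0.92, 0.90, 0.93],
    [0.68, 0.64, 0.70],
    [0.77, 0.74, 0.80],
    [0.85, 0.84, 0.88],
])
stat, p_value = friedmanchisquare(*scores.T)   # one argument per pipeline
print(p_value)
```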
Statistical tests: beyond null-hypothesis testing
Sample size is a problem
- Across datasets: significance typically requires > 15 datasets
- Within a dataset (repeating folds, seeds...): many repetitions make any difference significant¹
⚠ Underpowered experiments bring no evidence
Shortcomings of null-hypothesis testing
- Significance decreases with more comparisons²
- Statistical significance ≠ practical significance
¹Though as the total test-set size is limited, they do not bring more evidence for generalization.
²FDR (False Discovery Rate) corrections attempt to address this.
[Demšar 2008]
Statistical tests: accounting for effect sizes
Neyman-Pearson view of hypothesis testing
Two hypotheses, H0 and H1
H1: p1 outperforms p2 by a margin¹
Which is most likely, H0 or H1?
Requires choosing the margin
Related to superiority testing in clinical trials [Lesaffre 2008]
¹Related to the rejection region in the Neyman-Pearson lemma.
G Varoquaux 42
Pragmatic compromises
Test on P(p1 > p2) > δ
δ > 0.5: Neyman-Pearson view
Evaluate P(p1 > p2) by resampling:
randomize everything (data splits, seeds, ...)
Gaussian approximation: amounts to comparing differences to standard deviations
Not an inference on the expected difference in performance¹
¹Unlike the standard error, the standard deviation does not shrink to zero with the number of resamplings.
G Varoquaux 43
[Bouthillier... 2021]
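A minimal sketch of this compromise: estimate P(p1 > p2) from scores collected over randomized reruns (the score arrays below are placeholders) and compare it to a margin δ.

```python
import numpy as np

rng = np.random.default_rng(0)
scores_p1 = rng.normal(0.80, 0.02, size=50)   # e.g. 50 randomized reruns of pipeline 1
scores_p2 = rng.normal(0.79, 0.02, size=50)   # and of pipeline 2

# Probability that a rerun of p1 beats a rerun of p2, over all pairs of reruns
prob = np.mean(scores_p1[:, None] > scores_p2[None, :])
delta = 0.75
print(prob, prob > delta)
```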
Summary – comparing learning procedures
Sample multiple sources of variance:
- data split
- random seeds
- hyper-parameter tuning
Null-hypothesis testing: no t-test on cross-validation!
Don't misinterpret the p-value:
- Not significant: more data could change that
- Significant differences may be trivial
Detect practical differences:
delta in performance > standard deviation
G Varoquaux 44
Better experimental procedures
Crack the black box open
A prediction score is seldom insightful
Ablation studies: remove/change atomic elements
Learning curves
Better benchmarking in these studies:
Tune hyper-parameters to the same quality
Randomize everything
Account for variance in conclusions
G Varoquaux 45
Summary – Better benchmarking procedures
Reminder: your validation measure is intrinsically unreliable (sampling noise)
An arbitrary choice (random seed) may give seemingly-good results that do not generalize
Optimizing test accuracy will explore the tails
Evaluation is a bottleneck
[Figure: confidence intervals on accuracy widen as the number of available samples shrinks (300, 200, 100, 30), from about ±2% up to roughly -15%/+12%]
G Varoquaux 46
Model evaluation
Choice of performance metrics
- Should be suited to the application setting
- Machine learning research tends to chase metrics
Evaluation procedures
- Account for variance
- Different sources of variance when applying prediction rules vs learning them
Careful benchmarking is crucial
- Optimistic flukes will not generalize
- What is our purpose? External validity
G Varoquaux 47
@GaelVaroquaux
References I
A. I. Bandos, H. E. Rockette, and D. Gur. A permutation test sensitive to differences in
areas for comparing roc curves from a paired design. Statistics in medicine, 24:2873,
2005.
J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of
Machine Learning Research, 13:281, 2012.
X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto,
N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, ... Accounting for variance in
machine learning benchmarks. Proceedings of Machine Learning and Systems, 3, 2021.
D. Chyzhyk, G. Varoquaux, B. Thirion, and M. Milham. Controlling a confound in predictive
models with a test set minimizing its effect. In 2018 International Workshop on
Pattern Recognition in Neuroimaging (PRNI), pages 1–4. IEEE, 2018.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of
Machine Learning Research, 7:1–30, 2006.
J. Demšar. On the appropriateness of statistical tests in machine learning. In Workshop
on Evaluation Methods for Machine Learning in conjunction with ICML, page 65.
Citeseer, 2008.
G Varoquaux 48
References II
T. G. Dietterich. Approximate statistical tests for comparing supervised classification
learning algorithms. Neural computation, 10(7):1895–1923, 1998.
R. Dror, G. Baumer, M. Bogomolov, and R. Reichart. Replicability analysis for natural language
processing: Testing significance with multiple datasets. Transactions of the
Association for Computational Linguistics, 2017.
E. Lesaffre. Superiority, equivalence, and non-inferiority trials. Bulletin of the NYU
hospital for joint diseases, 66(2), 2008.
M. A. Little, G. Varoquaux, S. Saeb, L. Lonini, A. Jayaraman, D. C. Mohr, and K. P. Kording.
Using and understanding cross-validation strategies. Perspectives on Saeb et al.
GigaScience, 6(5):1–6, 2017.
L. Maier-Hein, A. Reinke, E. Christodoulou, B. Glocker, P. Godau, F. Isensee, J. Kleesiek,
M. Kozubek, M. Reyes, M. A. Riegler, ... Metrics reloaded: Pitfalls and recommendations
for image analysis validation. arXiv preprint arXiv:2206.01653, 2022.
G Varoquaux 49
References III
A. Makarova, H. Shen, V. Perrone, A. Klein, J. B. Faddoul, A. Krause, M. Seeger, and
C. Archambeau. Overfitting in bayesian optimization: an empirical study and
early-stopping solution. arXiv preprint arXiv:2104.08166, 2021.
C. Nadeau and Y. Bengio. Inference for the generalization error. Machine learning, 52(3):
239–281, 2003.
D. Picard. Torch.manual_seed(3407) is all you need: On the influence of random seeds
in deep learning architectures for computer vision. arXiv:2109.08203, 2021.
N. Traut, K. Heuer, G. Lemaître, A. Beggiato, D. Germanaud, M. Elmaleh, A. Bethegnies,
L. Bonnasse-Gahot, W. Cai, S. Chambon, ... Insights from an autism imaging biomarker
challenge: promises and threats to biomarker discovery. NeuroImage, 255:119171, 2022.
G. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars.
Neuroimage, 180:68–77, 2018.
G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological
failures and recommendations for the future. NPJ digital medicine, 5(1):1–8, 2022.
G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic
value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.
G Varoquaux 50

More Related Content

What's hot

Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Balázs Hidasi
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavAgile Testing Alliance
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer VisionSungjoon Choi
 
Python and Machine Learning
Python and Machine LearningPython and Machine Learning
Python and Machine Learningtrygub
 
Vectorization In NLP.pptx
Vectorization In NLP.pptxVectorization In NLP.pptx
Vectorization In NLP.pptxChode Amarnath
 
Die freie Programmiersprache Python
Die freie Programmiersprache Python Die freie Programmiersprache Python
Die freie Programmiersprache Python Andreas Schreiber
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationAdnan Masood
 
Unsupervised Video Anomaly Detection: A brief overview
Unsupervised Video Anomaly Detection: A brief overviewUnsupervised Video Anomaly Detection: A brief overview
Unsupervised Video Anomaly Detection: A brief overviewRidge-i, Inc.
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning BasicsSuresh Arora
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersFunctional Imperative
 
Experimenting the TextTiling Algorithm
Experimenting the TextTiling AlgorithmExperimenting the TextTiling Algorithm
Experimenting the TextTiling AlgorithmEstelle Delpech
 
Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learningmahutte
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMNYC Predictive Analytics
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural NetworksDatabricks
 
Bayesian classification
Bayesian classificationBayesian classification
Bayesian classificationManu Chandel
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Modelspetitegeek
 
Learning sets of rules, Sequential Learning Algorithm,FOIL
Learning sets of rules, Sequential Learning Algorithm,FOILLearning sets of rules, Sequential Learning Algorithm,FOIL
Learning sets of rules, Sequential Learning Algorithm,FOILPavithra Thippanaik
 

What's hot (20)

Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017Deep Learning in Recommender Systems - RecSys Summer School 2017
Deep Learning in Recommender Systems - RecSys Summer School 2017
 
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep YadavMachine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
Machine learning by Dr. Vivek Vijay and Dr. Sandeep Yadav
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
Python and Machine Learning
Python and Machine LearningPython and Machine Learning
Python and Machine Learning
 
Vectorization In NLP.pptx
Vectorization In NLP.pptxVectorization In NLP.pptx
Vectorization In NLP.pptx
 
Die freie Programmiersprache Python
Die freie Programmiersprache Python Die freie Programmiersprache Python
Die freie Programmiersprache Python
 
Deep Learning for Computer Vision: Object Detection (UPC 2016)
Deep Learning for Computer Vision: Object Detection (UPC 2016)Deep Learning for Computer Vision: Object Detection (UPC 2016)
Deep Learning for Computer Vision: Object Detection (UPC 2016)
 
Belief Networks & Bayesian Classification
Belief Networks & Bayesian ClassificationBelief Networks & Bayesian Classification
Belief Networks & Bayesian Classification
 
Unsupervised Video Anomaly Detection: A brief overview
Unsupervised Video Anomaly Detection: A brief overviewUnsupervised Video Anomaly Detection: A brief overview
Unsupervised Video Anomaly Detection: A brief overview
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
 
Perceptron
PerceptronPerceptron
Perceptron
 
Bayesian network
Bayesian networkBayesian network
Bayesian network
 
Experimenting the TextTiling Algorithm
Experimenting the TextTiling AlgorithmExperimenting the TextTiling Algorithm
Experimenting the TextTiling Algorithm
 
Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learning
 
Intro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVMIntro to Classification: Logistic Regression & SVM
Intro to Classification: Logistic Regression & SVM
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
Bayesian classification
Bayesian classificationBayesian classification
Bayesian classification
 
Expectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture ModelsExpectation Maximization and Gaussian Mixture Models
Expectation Maximization and Gaussian Mixture Models
 
Learning sets of rules, Sequential Learning Algorithm,FOIL
Learning sets of rules, Sequential Learning Algorithm,FOILLearning sets of rules, Sequential Learning Algorithm,FOIL
Learning sets of rules, Sequential Learning Algorithm,FOIL
 

Similar to Evaluating machine learning models and their diagnostic value

Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationThomas Ploetz
 
13ClassifierPerformance.pdf
13ClassifierPerformance.pdf13ClassifierPerformance.pdf
13ClassifierPerformance.pdfssuserdce5c21
 
WEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedWEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedDataminingTools Inc
 
WEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been LearnedWEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been Learnedweka Content
 
EvaluationMetrics.pptx
EvaluationMetrics.pptxEvaluationMetrics.pptx
EvaluationMetrics.pptxshuchismitjha2
 
Download the presentation
Download the presentationDownload the presentation
Download the presentationbutest
 
Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017Andres Hernandez
 
An infinite population has a standard deviation of 10.  A random s.docx
An infinite population has a standard deviation of 10.  A random s.docxAn infinite population has a standard deviation of 10.  A random s.docx
An infinite population has a standard deviation of 10.  A random s.docxgreg1eden90113
 
How to Measure Uncertainty
How to Measure UncertaintyHow to Measure Uncertainty
How to Measure UncertaintyRandox
 
Common mistakes in measurement uncertainty calculations
Common mistakes in measurement uncertainty calculationsCommon mistakes in measurement uncertainty calculations
Common mistakes in measurement uncertainty calculationsGH Yeoh
 
7. evaluation of diagnostic test
7. evaluation of diagnostic test7. evaluation of diagnostic test
7. evaluation of diagnostic testAshok Kulkarni
 
QCP user manual EN.pdf
QCP user manual EN.pdfQCP user manual EN.pdf
QCP user manual EN.pdfEmerson Ceras
 
Lecture7 cross validation
Lecture7 cross validationLecture7 cross validation
Lecture7 cross validationStéphane Canu
 
2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methods2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methodsKrish_ver2
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IMachine Learning Valencia
 

Similar to Evaluating machine learning models and their diagnostic value (20)

Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- EvaluationBridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation
 
Errors2
Errors2Errors2
Errors2
 
Inferences about Two Proportions
 Inferences about Two Proportions Inferences about Two Proportions
Inferences about Two Proportions
 
13ClassifierPerformance.pdf
13ClassifierPerformance.pdf13ClassifierPerformance.pdf
13ClassifierPerformance.pdf
 
WEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been LearnedWEKA: Credibility Evaluating Whats Been Learned
WEKA: Credibility Evaluating Whats Been Learned
 
WEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been LearnedWEKA:Credibility Evaluating Whats Been Learned
WEKA:Credibility Evaluating Whats Been Learned
 
EvaluationMetrics.pptx
EvaluationMetrics.pptxEvaluationMetrics.pptx
EvaluationMetrics.pptx
 
Download the presentation
Download the presentationDownload the presentation
Download the presentation
 
Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017Andres hernandez ai_machine_learning_london_nov2017
Andres hernandez ai_machine_learning_london_nov2017
 
An infinite population has a standard deviation of 10.  A random s.docx
An infinite population has a standard deviation of 10.  A random s.docxAn infinite population has a standard deviation of 10.  A random s.docx
An infinite population has a standard deviation of 10.  A random s.docx
 
Testing a claim about a mean
Testing a claim about a mean  Testing a claim about a mean
Testing a claim about a mean
 
evaluation and credibility-Part 1
evaluation and credibility-Part 1evaluation and credibility-Part 1
evaluation and credibility-Part 1
 
How to Measure Uncertainty
How to Measure UncertaintyHow to Measure Uncertainty
How to Measure Uncertainty
 
Common mistakes in measurement uncertainty calculations
Common mistakes in measurement uncertainty calculationsCommon mistakes in measurement uncertainty calculations
Common mistakes in measurement uncertainty calculations
 
7. evaluation of diagnostic test
7. evaluation of diagnostic test7. evaluation of diagnostic test
7. evaluation of diagnostic test
 
QCP user manual EN.pdf
QCP user manual EN.pdfQCP user manual EN.pdf
QCP user manual EN.pdf
 
Lecture7 cross validation
Lecture7 cross validationLecture7 cross validation
Lecture7 cross validation
 
2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methods2.8 accuracy and ensemble methods
2.8 accuracy and ensemble methods
 
Screening
ScreeningScreening
Screening
 
L2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms IL2. Evaluating Machine Learning Algorithms I
L2. Evaluating Machine Learning Algorithms I
 

More from Gael Varoquaux

Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingGael Varoquaux
 
Machine learning with missing values
Machine learning with missing valuesMachine learning with missing values
Machine learning with missing valuesGael Varoquaux
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Gael Varoquaux
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Gael Varoquaux
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingGael Varoquaux
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesGael Varoquaux
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomesGael Varoquaux
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingGael Varoquaux
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Gael Varoquaux
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingGael Varoquaux
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingGael Varoquaux
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible scienceGael Varoquaux
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovationGael Varoquaux
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsGael Varoquaux
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data scienceGael Varoquaux
 
Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataGael Varoquaux
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsGael Varoquaux
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsityGael Varoquaux
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Gael Varoquaux
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIGael Varoquaux
 

More from Gael Varoquaux (20)

Measuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imagingMeasuring mental health with machine learning and brain imaging
Measuring mental health with machine learning and brain imaging
 
Machine learning with missing values
Machine learning with missing valuesMachine learning with missing values
Machine learning with missing values
 
Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...Better neuroimaging data processing: driven by evidence, open communities, an...
Better neuroimaging data processing: driven by evidence, open communities, an...
 
Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?Functional-connectome biomarkers to meet clinical needs?
Functional-connectome biomarkers to meet clinical needs?
 
Atlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mappingAtlases of cognition with large-scale human brain mapping
Atlases of cognition with large-scale human brain mapping
 
Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomes
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imaging
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imaging
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible science
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovation
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data science
 
Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of data
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questions
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsity
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRI
 

Recently uploaded

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 

Recently uploaded (20)

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 

Evaluating machine learning models and their diagnostic value

• 13. Metrics on positive and negative classes. T denotes the test (classifier output), D the diseased status.
   - Sensitivity (also called recall): fraction of positive samples actually retrieved. Sensitivity = TP / (TP + FN); estimates P(T+ | D+)
   - Specificity: fraction of negative samples actually classified as negative. Specificity = TN / (TN + FP); estimates P(T− | D−)
   - Positive predictive value (PPV, also called precision): fraction of the positively classified samples which are indeed positive. PPV = TP / (TP + FP); estimates P(D+ | T+)
   - Negative predictive value (NPV): fraction of the negatively classified samples which are indeed negative. NPV = TN / (TN + FN); estimates P(D− | T−)
   [Varoquaux and Colliot 2022]
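A minimal sketch (not from the slides, with made-up labels) of how these quantities can be computed from a confusion matrix with scikit-learn:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])   # D+/D- ground truth (hypothetical)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1])   # T+/T- classifier output (hypothetical)

# confusion_matrix returns [[TN, FP], [FN, TP]] for labels (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # estimates P(T+ | D+), a.k.a. recall
specificity = tn / (tn + fp)   # estimates P(T- | D-)
ppv = tp / (tp + fp)           # estimates P(D+ | T+), a.k.a. precision
npv = tn / (tn + fn)           # estimates P(D- | T-)
balanced_accuracy = 0.5 * (sensitivity + specificity)
print(sensitivity, specificity, ppv, npv, balanced_accuracy)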
  • 14. 1 Performance metrics Basic classification metrics Evaluating imbalanced binary classification Multi-threshold metrics for classification Confidence scores and calibration
• 15. Accuracy, sensitivity, specificity
   (Confusion matrix: columns D+/D−, rows T+/T−, cells TP, FP, FN, TN.)
   Accuracy is uninformative under class imbalance: with 90% of samples in class 0, predicting only class 0 gives Acc = 90%.
   Balanced accuracy accounts for errors on class 0 and class 1:
   - Sensitivity (also called recall): fraction of class 1 retrieved, TP / (TP + FN); estimates P(T+ | D+)
   - Specificity: fraction of class 0 actually classified as 0, TN / (TN + FP); estimates P(T− | D−)
   - Balanced accuracy = ½ (sensitivity + specificity)
• 16. Asking the right question: P(T+|D+) vs P(D+|T+)
   Positive predictive value (via Bayes' theorem):
   P(D+ | T+) = sensitivity × prevalence / [ (1 − specificity) × (1 − prevalence) + sensitivity × prevalence ]
   (Illustrated with a Venn diagram of "Diseased" vs "Test positive".)
   Summary metric: markedness = PPV + NPV − 1.
   Drawback: depends on prevalence ⇒ characterizes not only the classifier, but also the dataset.
• 17. Odds ratios for invariance to sampling
   Definition: odds of a: O(a) = P(a) / (1 − P(a))
   Likelihood ratio of the positive class: LR+ = O(D+ | T+) / O(D+) = Sensitivity / (1 − Specificity)
   Independent of class prevalence¹. Use the prevalence of the target population to compute O(D+ | T+): useful to extrapolate from a case-control study to the target population.
   ¹Proof in Varoquaux and Colliot, "Evaluating machine learning models and their diagnostic value"
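A minimal sketch of the extrapolation just described, with assumed (hypothetical) sensitivity, specificity, and target-population prevalence; it computes the PPV both with the Bayes formula of slide 16 and through LR+ and the prior odds:

sensitivity, specificity = 0.90, 0.85   # measured on the evaluation sample (assumed values)
prevalence = 0.01                       # prevalence in the target population (assumed value)

# Bayes formula from slide 16
ppv = (sensitivity * prevalence) / (
    (1 - specificity) * (1 - prevalence) + sensitivity * prevalence
)

lr_plus = sensitivity / (1 - specificity)     # LR+: independent of prevalence
prior_odds = prevalence / (1 - prevalence)    # O(D+)
posterior_odds = lr_plus * prior_odds         # O(D+ | T+)
ppv_from_odds = posterior_odds / (1 + posterior_odds)  # same value as `ppv`

print(f"PPV = {ppv:.3f}, LR+ = {lr_plus:.1f}, PPV via odds = {ppv_from_odds:.3f}")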
  • 18. 1 Performance metrics Basic classification metrics Evaluating imbalanced binary classification Multi-threshold metrics for classification Confidence scores and calibration
• 19. Multi-threshold metrics: ROC curve
   Classifiers output continuous scores; the ROC curve is obtained by varying the decision threshold.
   (Figure: true positive rate = sensitivity or recall vs false positive rate; example curves: excellent AUC=0.95, good AUC=0.88, poor AUC=0.75, chance AUC=0.50.)
   Summary metric: AUC (Area Under the Curve); chance level = 0.5.
• 20. Multi-threshold metrics: precision-recall curve
   (Figure: precision = PPV vs recall = TPR or sensitivity; example curves: excellent AUC=0.96, good AUC=0.92, poor AUC=0.78, chance AUC=0.57.)
   Better dynamic range than ROC for imbalanced classes.
   Area under the curve: the chance level depends on class imbalance.
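A minimal, illustrative sketch of both multi-threshold summaries on a synthetic imbalanced problem (the data and the logistic-regression model are placeholders, not from the slides):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_curve, roc_auc_score,
                             precision_recall_curve, average_precision_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, roc_thresholds = roc_curve(y_test, scores)          # obtained by varying the threshold
precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

print("ROC AUC:", roc_auc_score(y_test, scores))              # chance level = 0.5
print("Average precision:", average_precision_score(y_test, scores))
# the chance level of average precision is roughly the class prevalence, not 0.5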
  • 21. 1 Performance metrics Basic classification metrics Evaluating imbalanced binary classification Multi-threshold metrics for classification Confidence scores and calibration
• 22. Interpreting the classifier score as a probability? Calibration
   Calibration: the average error rate over all samples with score s is s. Computed in bins on the score s.
   ECE (expected calibration error): average error over all bins.
   (Calibration plot: observed fraction of positives vs mean confidence score; above the diagonal = underconfident, below = overconfident; example: GaussianNB vs a perfectly calibrated model.)
   Caveat: calibration does not control individual probabilities.
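A minimal sketch of a reliability diagram and a simple binned ECE, on synthetic data with a GaussianNB model (all data and modelling choices are illustrative, not from the slides):

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scores = GaussianNB().fit(X_train, y_train).predict_proba(X_test)[:, 1]

prob_true, prob_pred = calibration_curve(y_test, scores, n_bins=10)
# prob_pred: mean confidence score per bin; prob_true: observed fraction of positives

def expected_calibration_error(y, s, n_bins=10):
    """Average |mean score - observed fraction of positives|, weighted by bin size."""
    bin_ids = np.digitize(s, np.linspace(0, 1, n_bins + 1)[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            ece += in_bin.mean() * abs(s[in_bin].mean() - y[in_bin].mean())
    return ece

print("ECE:", expected_calibration_error(y_test, scores))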
• 23. Classifier score as a probability? Individual probabilities
   Does the classifier approach P(y|X)?
   Metric: Brier score = Σᵢ (ŝᵢ − yᵢ)², where ŝᵢ is the confidence score and yᵢ the observed (binary) label. Minimal for ŝ = P(y|X).
   Scaled version: Brier skill score = 1 − Brier(ŝ, y) / Brier(ȳ, y), where ȳ is the class prevalence.
   1 = perfect prediction; 0 = scores no more informative than the class prevalence.
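A minimal sketch of the Brier score and a skill score against the class prevalence, reusing the `y_test` and `scores` variables from the calibration sketch above:

import numpy as np
from sklearn.metrics import brier_score_loss

brier = brier_score_loss(y_test, scores)                 # mean squared error of the scores
brier_prevalence = brier_score_loss(y_test, np.full_like(scores, y_test.mean()))
brier_skill = 1 - brier / brier_prevalence   # 1 = perfect, 0 = no better than the prevalence
print(f"Brier: {brier:.3f}, Brier skill score: {brier_skill:.3f}")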
• 24. Choose well your metrics
   Machine-learning research chases metrics; these should reflect the application as well as possible.
   Worry about P(D+|T+):
   - Accuracy is a reasonable proxy only for balanced classes
   - LR+ is interesting to keep in mind
   Worry about uncertainty:
   - Calibration quantifies average errors
   - The Brier score controls individual uncertainty
   A single number does not tell the whole story.
• 25. 2 Evaluation procedures
   Controlled estimation of e = E_{x,y∼D}[ l(f(x), y) ]
   Corresponding research paper: [Bouthillier... 2021]
• 26. Definitions: what are we benchmarking?
   Scenario 1: a prediction rule. We are given f : X → Y.
   Scenario 2: a training procedure. We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)ⁿ.
• 27. Definitions: what are we benchmarking?
   Scenario 1: a prediction rule. We are given f : X → Y. For application claims, e.g. medicine.
   Scenario 2: a training procedure. We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)ⁿ. For machine-learning research (claims on algorithms).
• 28. Model evaluation 101
   In-sample error is not meaningful: the training data (X, y) are not independent of f̂, so the expectation E is estimated wrongly.
• 29. Model evaluation 101
   In-sample error is not meaningful: the training data (X, y) are not independent of f̂, so the expectation E is estimated wrongly.
   Typically: split the full data into a train set and a test set (the test set is typically 10%), trading off better learning vs better estimation.
  • 30. 2 Evaluation procedures Evaluating a prediction rule Evaluating a training procedure
• 31. External validation
   Before putting in production: we are given f : X → Y.
   Xtest different enough from Xtrain:
   - No repeated acquisitions of the same individual across train and test [Little... 2017]
   - Ideally: show generalization to a new site, later in time...
   Xtest representative of the target population. Sample Xtest:
   - To match statistical moments
   - To minimize a confounding association (shortcuts) [Chyzhyk... 2018]
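A minimal sketch (with made-up data) of keeping all acquisitions of the same individual on the same side of the split, using scikit-learn's group-aware splitters; the `subject_ids` array is hypothetical:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = rng.randint(0, 2, size=100)
subject_ids = np.repeat(np.arange(20), 5)   # 20 individuals, 5 acquisitions each (hypothetical)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject_ids))

# No individual appears on both sides of the split
assert not set(subject_ids[train_idx]) & set(subject_ids[test_idx])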
• 32. Evaluation error: sampling noise on the test set
   Evaluation quality is limited by the number of test examples.
   (Figure: binomial distribution of the error on test accuracy for ntest = 1000, roughly ±2%.)
   Confidence intervals: ntest = 1 000, interval of 5.7%; ntest = 10 000, interval of 1.8%; ntest = 100 000, interval of 0.6%.
   ¹Sampling noise: the data at hand (e.g. the test set) is just a small sample of the full population "in the wild", and sampling other data will lead to other results.
   [Varoquaux 2018]
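A small sketch of the binomial sampling-noise argument: the width of an approximate 95% confidence interval on accuracy for different test-set sizes. The accuracy of 0.7 is an assumption on my part, chosen because it roughly reproduces the interval widths quoted on the slide:

import numpy as np

def accuracy_interval_width(accuracy, n_test, z=1.96):
    """Full width of the normal-approximation 95% confidence interval on accuracy."""
    return 2 * z * np.sqrt(accuracy * (1 - accuracy) / n_test)

for n_test in (1_000, 10_000, 100_000):
    print(f"n_test = {n_test}: interval width ~ {100 * accuracy_interval_width(0.7, n_test):.1f}%")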
• 33. Evaluation error: sampling noise on the test set "in the wild"
   (Figure: standard deviation of test accuracy vs test-set size, in theory (from a binomial) and in practice (random splits), for Glue-RTE BERT (n=277), Glue-SST2 BERT (n=872), and CIFAR10 VGG11 (n=10000).)
   [Bouthillier... 2021]
• 34. Evaluation noise is not negligible in Kaggle competitions
   - Lung cancer classification (test size: max 1K): smaller improvements than noise; diminishing returns
   - Schizophrenia classification (test size: 120): diminishing returns
   - Lung tumor segmentation (test size: max 6K): poorer score on the private set; overfit
   - Nerve segmentation (test size: 5.5K)
   (Each panel shows the improvement of the top model over the 10% best, the winner gap, the evaluation noise between public and private sets, and the actual improvement.)
   [Varoquaux and Cheplygina 2022]
• 35. Confidence intervals and statistical testing
   Evaluate uncertainty to know when to stop and what to trust (diminishing returns, creeping complexity).
   Confidence interval¹: range of values compatible with the observations.
   Null-hypothesis testing: the p-value is the chance of observing the results if a null hypothesis were true.
   Typical null for model comparison: models p1 and p2 give the same expected error.
   ¹Technically not making the difference with a credible interval.
• 36. Statistical testing with a given test set
   A simple distribution for metrics such as accuracy¹: binomial.
   Confidence intervals as a function of the test-set size N (rows) and the observed accuracy (columns):
   N         65%             80%             90%             95%
   100       [-9.0%, 9.0%]   [-8.0%, 8.0%]   [-6.0%, 5.0%]   [-5.0%, 4.0%]
   1000      [-3.0%, 2.9%]   [-2.5%, 2.4%]   [-1.9%, 1.8%]   [-1.4%, 1.3%]
   10000     [-0.9%, 0.9%]   [-0.8%, 0.8%]   [-0.6%, 0.6%]   [-0.4%, 0.4%]
   100000    [-0.3%, 0.3%]   [-0.2%, 0.2%]   [-0.2%, 0.2%]   [-0.1%, 0.1%]
   Table from [Varoquaux and Colliot 2022]
   ¹Also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]
• 37. Statistical testing with a given test set
   Non-parametric alternatives to the binomial:
   - Confidence intervals on the performance of a prediction rule: bootstrap the test set
   - Statistical testing (comparing prediction rules with correlated errors): permutations [Bandos... 2005], sampling the null by randomly switching the predictions of p1 and p2
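A minimal sketch of both ideas on a fixed test set. `y_test`, `pred_1`, and `pred_2` are assumed to be arrays holding the true labels and the two prediction rules' outputs; they are placeholders, not variables defined in the slides:

import numpy as np

rng = np.random.RandomState(0)
n = len(y_test)

# Bootstrap confidence interval on the accuracy of prediction rule 1
boot_acc = []
for _ in range(10_000):
    idx = rng.randint(0, n, n)                    # resample test examples with replacement
    boot_acc.append(np.mean(pred_1[idx] == y_test[idx]))
print("95% bootstrap CI:", np.percentile(boot_acc, [2.5, 97.5]))

# Paired permutation test on the accuracy difference (same test samples for both rules)
observed = np.mean(pred_1 == y_test) - np.mean(pred_2 == y_test)
null = []
for _ in range(10_000):
    swap = rng.rand(n) < 0.5                      # randomly exchange the two rules' predictions
    p1 = np.where(swap, pred_2, pred_1)
    p2 = np.where(swap, pred_1, pred_2)
    null.append(np.mean(p1 == y_test) - np.mean(p2 == y_test))
p_value = np.mean(np.abs(null) >= abs(observed))
print("permutation p-value:", p_value)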
  • 38. 2 Evaluation procedures Evaluating a prediction rule Evaluating a training procedure
• 39. Benchmarking to conclude on good training procedures
   We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)ⁿ.
   We want to make machine-learning research claims (novel frobnicate improves prediction).
   Many arbitrary components: torch.manual_seed(3407)?? [Picard 2021]
   It is useless to tune random seeds (for weight init, dropout, data augmentation): the gains will not carry over to new training data.
• 40. Arbitrary variance in a machine-learning benchmark
   Variance when rerunning an evaluation while modifying arbitrary elements.
   (Figure: relative scale of the variance due to numerical noise, dropout, weights init, data order, data augmentation, data (bootstrap), noisy grid search, random search, and Bayesian optimization, shown for bert-rte, bert-sst2, a vgg model, and on average, across various computer vision and NLP tasks.)
   [Bouthillier... 2021]
• 41. Uncertainty due to test-set sampling
   The test set remains a limited sample of the population; the train-test split is an arbitrary choice.
   (Figure: same variance sources as above.)
   [Bouthillier... 2021]
• 42. Uncertainty due to test-set sampling
   The test set remains a limited sample of the population; the train-test split is an arbitrary choice.
   Better evaluation: sample these arbitrary choices multiple times.
   [Bouthillier... 2021]
• 43. Better evaluation procedure
   Cross-validation: multiple train-test splits. Randomize all arbitrary factors across folds.
• 44. Better evaluation procedure
   Cross-validation: multiple train-test splits. Randomize all arbitrary factors across folds.
   But a full learning pipeline comes with hyper-parameters, which are set to minimize the error on left-out data: split the full data into a train set (to fit), a validation set (to make model choices), and a test set (to evaluate the model).
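A minimal, illustrative sketch (synthetic data and an arbitrary model, not from the slides) of estimating the spread due to the arbitrary train-test split by repeating it:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)   # 30 random train-test splits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}  (spread across splits)")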
• 45. Benchmarks accounting for hyper-parameter selection
   Setting hyper-parameters is part of the pipeline; it must be evaluated.
   One option: train, validation, and test splits, plus manual tuning of hyper-parameters (make model choices on the validation set, evaluate the model on the test set). Risk: rampant overfit of the validation set [Makarova... 2021].
• 46. Benchmarks accounting for hyper-parameter selection
   Setting hyper-parameters is part of the pipeline; it must be evaluated.
   One option: train, validation, and test splits, plus manual tuning of hyper-parameters. Risk: rampant overfit of the validation set [Makarova... 2021].
   Alternative: multiple splits plus automated tuning of hyper-parameters, with an inner loop tuning hyper-parameters on validation sets and an outer loop over train-test splits.
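A minimal sketch of this nested scheme (the SVC model, its grid, and the fold counts are illustrative choices, not from the slides):

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # sets hyper-parameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # evaluates the full pipeline

tuned_svc = GridSearchCV(SVC(),
                         {"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]},
                         cv=inner_cv)
scores = cross_val_score(tuned_svc, X, y, cv=outer_cv)        # tuning happens inside each outer fold
print(scores.mean(), scores.std())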
• 47. Hyper-parameter optimization procedures
   Grid search: vary each hyper-parameter on a grid.
• 48. Hyper-parameter optimization procedures
   Grid search; random search [Bergstra and Bengio 2012], preferable to grid search for more than 2 parameters.
   Sub-optimal hyper-parameters on models routinely lead to invalid conclusions (see refs in [Bouthillier... 2021]).
   (Figure: with an important and an unimportant hyper-parameter, randomized search covers the region of good hyper-parameters better than grid search.)
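A minimal sketch of random search drawing hyper-parameters from distributions instead of a grid (the SVC model and the ranges are illustrative assumptions):

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=50, random_state=0)          # 50 random draws instead of a full grid
search.fit(X, y)
print(search.best_params_, search.best_score_)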
• 49. Benchmarking training procedures (e.g. to compare them)
   Control arbitrary fluctuations (that will not generalize). Sample them all:
   - Data sampling: multiple train-test splits (cross-validation)
   - Arbitrary choices (seeds): randomize them all
   - Hyper-parameters: hyper-parameter optimization (too expensive to randomize; in practice, set hyper-parameters once, then randomize model seeds and data splits [Bouthillier... 2021])
   Counterintuitive: randomization decorrelates errors, improving benchmarks.
• 50. Accounting for variance in conclusions: confidence intervals and statistical testing
   Statistical testing with multiple folds. Challenge: the folds are not independent¹.
   - t-tests or Wilcoxon tests across folds are not valid; don't divide the standard deviation by the number of folds
   - Approximate corrections for the dependence across folds²:
     5x2cv: repeat 5 times a randomized 2-fold split [Dietterich 1998]
     Corrected resampled t-test statistic [Nadeau and Bengio 2003]
   ¹Train sets overlap, and often test sets do too.
   ²Does not account for sources of variance beyond data sampling, e.g. random seeds or hyper-parameters.
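A minimal sketch of the corrected resampled t-test [Nadeau and Bengio 2003]: the variance of the per-split score differences is inflated by n_test/n_train to compensate for the overlap between training sets. The differences used here are made up:

import numpy as np
from scipy import stats

def corrected_resampled_ttest(diff, n_train, n_test):
    """diff: per-split score differences (pipeline 1 - pipeline 2) over k random splits."""
    diff = np.asarray(diff, dtype=float)
    k = len(diff)
    corrected_var = (1.0 / k + n_test / n_train) * diff.var(ddof=1)
    t = diff.mean() / np.sqrt(corrected_var)
    p = 2 * stats.t.sf(abs(t), df=k - 1)      # two-sided p-value
    return t, p

# e.g. differences observed over 10 random 80/20 splits of 1000 samples (hypothetical numbers)
diff = [0.02, 0.01, 0.03, -0.01, 0.02, 0.00, 0.01, 0.02, 0.01, 0.03]
print(corrected_resampled_ttest(diff, n_train=800, n_test=200))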
• 51. Statistical tests across datasets (more general claims on training pipelines)
   Challenge: metrics are not comparable across datasets ⇒ tests based on rank statistics.
   Wilcoxon signed-rank test: tests how often p1 outperforms p2 across datasets.
   [Demšar 2006]
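A minimal sketch of comparing two pipelines across datasets with this rank-based test, using one score per (pipeline, dataset); the scores below are made up for illustration:

import numpy as np
from scipy.stats import wilcoxon

scores_p1 = np.array([0.81, 0.74, 0.92, 0.68, 0.77, 0.85, 0.70, 0.88, 0.79, 0.90,
                      0.66, 0.83, 0.75, 0.91, 0.72])   # pipeline 1, one score per dataset
scores_p2 = np.array([0.79, 0.75, 0.90, 0.65, 0.76, 0.84, 0.69, 0.86, 0.78, 0.88,
                      0.67, 0.81, 0.74, 0.90, 0.71])   # pipeline 2, same 15 datasets

stat, p_value = wilcoxon(scores_p1, scores_p2)   # paired, rank-based: scale-free across datasets
print(stat, p_value)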
• 52. Statistical tests: multiple pipelines across datasets
   Challenge: multiple comparisons¹.
   - The Wilcoxon-Holm approach [Demšar 2006]: pairwise comparisons + Bonferroni-Holm correction.
   - The Friedman-Nemenyi approach² [Demšar 2006]: 1. Friedman test across all pipelines (omnibus); 2. critical difference via the Nemenyi test.
     (Critical difference diagram, classifiers ordered by mean accuracy rank: clf1 4.2000, clf2 3.7667, clf4 3.5000, clf5 2.0000, clf3 1.5333.)
   - Replicability analysis [Dror... 2017]: perform dataset-level pairwise tests, then combine by partial-conjunction multiple testing: "Does p1 perform better than p2 on at least u datasets?" More powerful for a small number of datasets.
   ¹If we do many tests, some will show large differences by chance.
   ²The Holm approach can be more interesting when considering comparisons to a referent classifier.
• 53. Statistical tests: beyond null-hypothesis testing
   Sample size is a problem:
   - Across datasets, significance typically requires more than 15 datasets
   - Within a dataset (repeating folds, seeds, ...), enough repetitions make any difference significant¹
   Underpowered experiments bring no evidence.
   Shortcomings of null-hypothesis testing: significance decreases with more comparisons; statistical significance ≠ practical significance.
   ¹Though, as the total test-set size is limited, repetitions do not bring more evidence for generalization.
   [Demšar 2008]
• 54. Statistical tests: beyond null-hypothesis testing
   Same slide, adding a caveat on the multiple-comparison issue²:
   ²FDR (False Discovery Rate) procedures attempt to address it.
   [Demšar 2008]
• 55. Statistical tests: accounting for effect sizes
   Neyman-Pearson view of hypothesis testing: two hypotheses, H0 and H1, where H1 is "p1 outperforms p2 by a margin"¹. Which is most likely?
   Requires choosing the margin. Related to superiority testing in clinical trials [Lesaffre 2008].
   ¹Related to the rejection region in the Neyman-Pearson lemma.
• 56. Pragmatic compromises
   Test P(p1 > p2) > δ with δ > 0.5: the Neyman-Pearson view.
   Evaluate P(p1 > p2) by resampling, randomizing everything: data splits, seeds, ...
   Gaussian approximation: amounts to comparing differences to standard deviations.
   Not an inference on the expected difference in performance¹.
   ¹Unlike the standard error, the standard deviation does not shrink to zero with the number of resamplings.
   [Bouthillier... 2021]
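A minimal sketch of estimating P(p1 > p2) from paired scores collected while randomizing splits and seeds. `scores_p1` and `scores_p2` are assumed arrays with one entry per randomized run; they are placeholders, not defined in the slides:

import numpy as np
from scipy.stats import norm

diff = np.asarray(scores_p1) - np.asarray(scores_p2)   # one paired difference per run

p_empirical = np.mean(diff > 0)                        # fraction of runs where p1 wins
p_gaussian = norm.cdf(diff.mean() / diff.std(ddof=1))  # compare the mean difference to its std
print(p_empirical, p_gaussian)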
• 57. Summary: comparing learning procedures
   - Sample multiple sources of variance: data splits, random seeds, hyper-parameter tuning
   - Null-hypothesis testing: no plain t-test across cross-validation folds!
   - Don't mis-interpret the p-value: "not significant" means more data could change the conclusion, and significant differences may be trivial
   - Detect practical differences: compare the delta in performance to the standard deviation
• 58. Better experimental procedures
   Crack the black box open: a prediction score is seldom insightful.
   - Ablation studies: remove or change atomic elements
   - Learning curves
   Better benchmarking in these experiments:
   - Tune hyper-parameters to the same quality
   - Randomize everything
   - Account for variance in the conclusions
• 59. Summary: better benchmarking procedures
   Reminder: your validation measure is intrinsically unreliable (sampling noise). An arbitrary choice (random seed) may give seemingly good results that do not generalize. Optimizing test accuracy will explore the tails.
   Evaluation is a bottleneck.
   (Figure: confidence intervals on accuracy shrinking with the number of available samples, from roughly -15%/+12% at 30 samples to about ±2% at 300 samples.)
• 60. Model evaluation
   Choice of performance metrics: they should be suited to the application setting; machine learning tends to do metric chasing.
   Evaluation procedures: account for variance; the sources of variance differ when applying prediction rules vs learning them.
   Careful benchmarking is crucial: optimistic flukes will not generalize.
   What is our purpose? External validity.
   @GaelVaroquaux
• 61. References I
   A. I. Bandos, H. E. Rockette, and D. Gur. A permutation test sensitive to differences in areas for comparing ROC curves from a paired design. Statistics in Medicine, 24:2873, 2005.
   J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281, 2012.
   X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto, N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, ... Accounting for variance in machine learning benchmarks. Proceedings of Machine Learning and Systems, 3, 2021.
   D. Chyzhyk, G. Varoquaux, B. Thirion, and M. Milham. Controlling a confound in predictive models with a test set minimizing its effect. In 2018 International Workshop on Pattern Recognition in Neuroimaging (PRNI), pages 1-4. IEEE, 2018.
   J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1-30, 2006.
   J. Demšar. On the appropriateness of statistical tests in machine learning. In Workshop on Evaluation Methods for Machine Learning in conjunction with ICML, page 65, 2008.
• 62. References II
   T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1923, 1998.
   R. Dror, G. Baumer, M. Bogomolov, and R. Reichart. Replicability analysis for natural language processing: Testing significance with multiple datasets. Transactions of the Association for Computational Linguistics, 2017.
   E. Lesaffre. Superiority, equivalence, and non-inferiority trials. Bulletin of the NYU Hospital for Joint Diseases, 66(2), 2008.
   M. A. Little, G. Varoquaux, S. Saeb, L. Lonini, A. Jayaraman, D. C. Mohr, and K. P. Kording. Using and understanding cross-validation strategies. Perspectives on Saeb et al. GigaScience, 6(5):1-6, 2017.
   L. Maier-Hein, A. Reinke, E. Christodoulou, B. Glocker, P. Godau, F. Isensee, J. Kleesiek, M. Kozubek, M. Reyes, M. A. Riegler, ... Metrics reloaded: Pitfalls and recommendations for image analysis validation. arXiv preprint arXiv:2206.01653, 2022.
• 63. References III
   A. Makarova, H. Shen, V. Perrone, A. Klein, J. B. Faddoul, A. Krause, M. Seeger, and C. Archambeau. Overfitting in Bayesian optimization: an empirical study and early-stopping solution. arXiv preprint arXiv:2104.08166, 2021.
   C. Nadeau and Y. Bengio. Inference for the generalization error. Machine Learning, 52(3):239-281, 2003.
   D. Picard. torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision. arXiv:2109.08203, 2021.
   N. Traut, K. Heuer, G. Lemaître, A. Beggiato, D. Germanaud, M. Elmaleh, A. Bethegnies, L. Bonnasse-Gahot, W. Cai, S. Chambon, ... Insights from an autism imaging biomarker challenge: promises and threats to biomarker discovery. NeuroImage, 255:119171, 2022.
   G. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars. NeuroImage, 180:68-77, 2018.
   G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digital Medicine, 5(1):1-8, 2022.
   G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.