Model evaluation is, in my opinion, the most overlooked step of the machine-learning pipeline. Reliably estimating a model's performance for a given purpose is both crucial and difficult. In this talk, I first discuss choosing a metric that is informative for the application, stressing the importance of class prevalence in classification settings. I then discuss procedures to estimate generalization performance, drawing a distinction between evaluating a learning procedure and evaluating a prediction rule, and showing how to put confidence intervals on the performance estimates.
Evaluating machine learning models and their diagnostic value
1. The forgotten practicalities of machine learning:
Evaluating machine learning models
and their diagnostic value
Gaël Varoquaux
Based on Varoquaux and Colliot,
hal.archives-ouvertes.fr/hal-03682454
2. Evaluation: the Achilles heel of machine learning practice
Brain imaging prediction challenge [Traut... 2022]
Autism status prediction
10 000 € in prize incentives
Analysts overfit the public set
3. Evaluation: the Achilles heel of machine learning practice
Kaggle competitions [Varoquaux and Cheplygina 2022]
Observed improvement in score
[Figure: for four competitions — lung cancer classification (prize $1,000,000, test size ≤ 1K), pneumonia detection/localization (prize $30,000, test size 3,000), Covid-19 abnormality localization (prize $100,000, test size 1,200), nerve segmentation (prize $100,000, test size 5.5K) — the improvement of the top model over the 10% best submissions and the winner gap are compared to the evaluation noise between public and private sets.]
Many submissions fall within evaluation error
4. Evaluation: the Achilles heel of machine learning practice
Deep learning literature [Bouthillier... 2021]
[Figure: published accuracy on cifar10 and sst2 from 2012 to 2020, distinguishing significant from non-significant improvements among non-'SOTA' results.]
Published improvements often fall within error bars
5. Problem setting: defining objects
Data: (x, y) ∈ X × Y drawn from a distribution D
A prediction rule: a function f : X → Y such that f(x) ≈ y. Notation: ŷ ≝ f(x)
Evaluate the risk
Error function l : Y × Y → ℝ captures the cost of errors
Expected error: 𝔼_{x,y ∼ D}[ l(f(x), y) ]
Note: training might be done with another D or another l
6. Outline
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
2 Evaluation procedures
Evaluating a prediction rule
Evaluating a training procedure
7. 1 Performance metrics
Metrics must capture and reflect the application
Going further for image analysis: [Maier-Hein... 2022]
8. 1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
9. Summary metric: accuracy
Accuracy: the fraction of correct predictions
Simple summary
Overly simplified picture
Does not distinguish false detections from misses
eg a false detection of a brain tumor?
- Very bad if it leads to brain surgery
- Minor if it leads to confirmatory imaging
The relevant cost (error metric) depends on the application setting
10. Confusion matrix: beyond accuracy
                        Truth: disease status
                          D+      D−
Predicted     T+          TP      FP
test result   T−          FN      TN
13. Metrics on positive and negative classes
T denotes the test (classifier output); D denotes the diseased status.
Sensitivity (also called recall): fraction of positive samples actually retrieved
Sensitivity = TP / (TP + FN)    Estimates P(T+ | D+)
Specificity: fraction of negative samples actually classified as negative
Specificity = TN / (TN + FP)    Estimates P(T− | D−)
Positive predictive value (PPV, also called precision): fraction of the positively classified samples which are indeed positive
PPV = TP / (TP + FP)    Estimates P(D+ | T+)
Negative predictive value (NPV): fraction of the negatively classified samples which are indeed negative
NPV = TN / (TN + FN)    Estimates P(D− | T−)
[Varoquaux and Colliot 2022]
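A minimal sketch of these four metrics computed from a confusion matrix with scikit-learn; the y_true / y_pred arrays are made-up toy labels, not from the talk.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # D+ = 1, D- = 0
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # T+ = 1, T- = 0

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # estimates P(T+ | D+), aka recall
specificity = tn / (tn + fp)   # estimates P(T- | D-)
ppv = tp / (tp + fp)           # estimates P(D+ | T+), aka precision
npv = tn / (tn + fn)           # estimates P(D- | T-)
print(sensitivity, specificity, ppv, npv)
```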
14. 1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
15. Accuracy, sensitivity, specificity
(Confusion matrix as above: truth D+/D−, prediction T+/T−.)
Accuracy is uninformative under class imbalance
With 90% of class 0, predicting only class 0 gives Acc = 90%
Balanced accuracy: averages the errors on class 0 and class 1
Sensitivity (also called recall): fraction of class 1 retrieved: TP / (TP + FN)
Specificity: fraction of class 0 actually classified as 0: TN / (TN + FP)
Balanced accuracy = ½ (sensitivity + specificity)
Sensitivity estimates P(T+ | D+); specificity estimates P(T− | D−)
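To illustrate the point above, a small sketch with made-up toy data: with 90% of class 0, the constant majority-class prediction reaches 90% accuracy but only 50% balanced accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)     # 90% of class 0
y_pred = np.zeros(100, dtype=int)          # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.90
print(balanced_accuracy_score(y_true, y_pred))  # 0.50
```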
16. Asking the right question: P(T+|D+) vs P(D+|T+)
Positive predictive value (via Bayes' theorem):
P(D+ | T+) = sensitivity × prevalence / [ sensitivity × prevalence + (1 − specificity) × (1 − prevalence) ]
Summary metric: markedness = PPV + NPV − 1
Drawback: depends on the prevalence
⇒ characterizes not only the classifier, but also the dataset
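A small sketch of the Bayes-rule formula above, showing how the same sensitivity and specificity give very different PPVs depending on prevalence; the numeric values are illustrative assumptions.

```python
def ppv(sensitivity, specificity, prevalence):
    # P(D+ | T+) via Bayes' theorem
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print(ppv(0.9, 0.9, prevalence=0.5))    # 0.90
print(ppv(0.9, 0.9, prevalence=0.01))   # ~0.08: most positive tests are false alarms
```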
17. Odds ratios for invariance to sampling
Definition: the odds of a: O(a) = P(a) / (1 − P(a))
Likelihood ratio of the positive class:
LR+ = O(D+ | T+) / O(D+) = sensitivity / (1 − specificity)
Independent of class prevalence¹
Use the prevalence of the target population to compute O(D+ | T+)
Useful to extrapolate from a case-control study to the target population
¹Proof in Varoquaux and Colliot, “Evaluating machine learning models and their diagnostic value”
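A sketch of using LR+ to extrapolate from a case-control study to a target population; the sensitivity, specificity, and prevalence values are assumptions.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    return sensitivity / (1 - specificity)          # independent of prevalence

def post_test_probability(lr_plus, prevalence):
    pre_test_odds = prevalence / (1 - prevalence)   # O(D+)
    post_test_odds = lr_plus * pre_test_odds        # O(D+ | T+) = LR+ x O(D+)
    return post_test_odds / (1 + post_test_odds)    # back to a probability

lr_plus = positive_likelihood_ratio(sensitivity=0.9, specificity=0.9)   # 9.0
print(post_test_probability(lr_plus, prevalence=0.01))                  # ~0.083
```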
18. 1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
19. Multi-threshold metrics: ROC curve
[Figure: ROC curves — true positive rate (sensitivity, recall) vs false positive rate — for classifiers of varying quality: Excellent AUC=0.95, Good AUC=0.88, Poor AUC=0.75, Chance AUC=0.50.]
Classifiers output continuous scores
The ROC curve is obtained by varying the decision threshold
Summary metric: AUC (Area Under the Curve)
Chance level = 0.5
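A minimal sketch of computing the ROC curve and its AUC from continuous classifier scores; the y_true / y_score arrays are made-up toy values.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one point per threshold
print(roc_auc_score(y_true, y_score))               # area under the ROC curve
```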
20. Multi-threshold metrics: Precision Recall curve
[Figure: precision-recall curves — precision (PPV) vs recall (TPR, sensitivity) — for the same classifiers: Excellent AUC=0.96, Good AUC=0.92, Poor AUC=0.78, Chance AUC=0.57.]
Better dynamic range than ROC for imbalanced classes
Area Under the Curve: the chance level depends on class imbalance
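The corresponding sketch for the precision-recall curve; note that the chance level of its area is the class prevalence (same made-up arrays as above, repeated so the snippet is self-contained).

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(average_precision_score(y_true, y_score))   # summary of the PR curve
print(y_true.mean())                              # prevalence = chance level
```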
21. 1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
22. Interpreting classifier score as a probability? – Calibration
Calibration
Among samples with confidence score s, the observed fraction of positives should be s
Computed in bins on the score s
ECE (expected calibration error): average calibration error over the bins
[Calibration plot: observed fraction of positives vs mean confidence score per bin, with the perfectly calibrated diagonal; points above the diagonal are underconfident, below are overconfident; GaussianNB shown as an example.]
Caveat: calibration does not control individual probabilities
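A sketch of a binned calibration (reliability) curve with scikit-learn; the ECE-like number printed here is an unweighted average over bins, a simplification of the expected calibration error, and the probabilities are made-up.

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.9, 0.7, 0.3, 0.6, 0.8, 0.2, 0.4, 0.65, 0.95])

frac_pos, mean_score = calibration_curve(y_true, y_prob, n_bins=5)
print(np.mean(np.abs(frac_pos - mean_score)))   # rough, unweighted ECE over bins
```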
23. Classifier score as a probability? – Individual probabilities
Does the classifier score approach P(y | X)?
Metric: Brier score = Σᵢ (ŝᵢ − yᵢ)²
where ŝᵢ is the confidence score and yᵢ the observed (binary) label
Minimal for ŝ = P(y | X)
Scaled version: Brier skill score = 1 − Brier(ŝ, y) / Brier(ȳ, y)
where ȳ is the class prevalence
1 = perfect prediction
0 = scores no more informative than the class prevalence
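A sketch of the Brier score and the Brier skill score against the prevalence baseline; labels and scores are made-up.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.9, 0.7, 0.3, 0.6, 0.8, 0.2, 0.4, 0.65, 0.95])

brier = brier_score_loss(y_true, y_prob)
baseline = brier_score_loss(y_true, np.full_like(y_prob, y_true.mean()))
print(1 - brier / baseline)   # Brier skill score: 1 = perfect, 0 = prevalence only
```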
24. Choose your metrics well
Machine-learning research chases metrics
These should reflect the application as well as possible
Worry about P(D+ | T+)
- Accuracy is a reasonable proxy only for balanced classes
- LR+ is interesting to keep in mind
Worry about uncertainty
- Calibration quantifies average errors
- The Brier score controls individual uncertainty
A single number does not tell the whole story
27. Definitions: what are we benchmarking?
Scenario 1: a prediction rule
We are given f : X → Y
For application claims, eg medicine
Scenario 2: a training procedure
We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)ⁿ
For machine-learning research (claims on algorithms)
29. Model evaluation 101
In-sample error is not meaningful: (X, y) are not independent of f̂, so the expectation 𝔼 is estimated wrong
Typically: split the full data into a train set and a test set
The test set is typically 10%, trading off better learning vs better estimation
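A minimal sketch of held-out evaluation with a 10% test split, as mentioned above; the dataset and model are arbitrary stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # out-of-sample accuracy
```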
31. External validation: before putting into production
We are given f : X → Y
Xtest should be different enough from Xtrain
- No repeated acquisition of the same individual across train and test
[Little... 2017]
- Ideally: show generalization to a new site, later in time...
Xtest should be representative of the target population
Sample Xtest:
- To match statistical moments
- To minimize a confounding association (shortcuts)
[Chyzhyk... 2018]
32. Evaluation error: Sampling noise on test set
Evaluation quality is limited by the number of test examples
Sampling noise¹ for ntest = 1 000:
[Figure: binomial distribution of the error on test accuracy, spanning roughly −2% to +2% for ntest = 1 000.]
Confidence intervals:
ntest = 1 000: interval 5.7%
ntest = 10 000: interval 1.8%
ntest = 100 000: interval 0.6%
¹The data at hand (eg the test set) is just a small sample of the full population “in the wild”, and sampling other data will lead to other results.
[Varoquaux 2018]
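A sketch reproducing interval widths like those above from the binomial distribution; the accuracy value of 0.7 is an assumption chosen to match the quoted widths.

```python
from scipy.stats import binom

accuracy = 0.7
for n_test in (1_000, 10_000, 100_000):
    lo, hi = binom.interval(0.95, n_test, accuracy)   # counts of correct predictions
    print(n_test, f"interval width: {(hi - lo) / n_test:.1%}")
```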
33. Evaluation error: Sampling noise on test set
[Figure: standard deviation of accuracy (%) as a function of test-set size, “in the wild”. In theory: from a binomial distribution — Binom(n′, 0.66), Binom(n′, 0.91), Binom(n′, 0.95). In practice: random splits for Glue-RTE BERT (n′=277), Glue-SST2 BERT (n′=872), CIFAR10 VGG11 (n′=10 000).]
[Bouthillier... 2021]
34. Evaluation noise is not negligible – in Kaggle competitions
[Figure: for four competitions, the improvement of the top model over the 10% best submissions, the winner gap, and the actual improvement are compared to the evaluation noise between public and private sets — lung cancer classification (test size ≤ 1K): smaller improvements than noise, diminishing returns; schizophrenia classification (test size 120): diminishing returns; lung tumor segmentation (test size ≤ 6K): poorer score on the private set, overfit; nerve segmentation (test size 5.5K).]
[Varoquaux and Cheplygina 2022]
35. Confidence intervals & statistical testing
Evaluate uncertainty to know when to stop and what to trust
(diminishing returns, creeping complexity)
Confidence interval¹: range of values compatible with the observations
Null-hypothesis testing: p-value: the chance of observing the results if a null hypothesis were true
Typical null, for model comparison: models p1 and p2 give the same expected error
¹Technically, we do not distinguish this from a credible interval.
36. Statistical testing with a given test set
(Split the full data into a train set and a test set.)
Simple distribution of metrics, eg accuracy¹: binomial
Confidence intervals:
N        65%             80%             90%             95%
100      [-9.0% 9.0%]    [-8.0% 8.0%]    [-6.0% 5.0%]    [-5.0% 4.0%]
1000     [-3.0% 2.9%]    [-2.5% 2.4%]    [-1.9% 1.8%]    [-1.4% 1.3%]
10000    [-0.9% 0.9%]    [-0.8% 0.8%]    [-0.6% 0.6%]    [-0.4% 0.4%]
100000   [-0.3% 0.3%]    [-0.2% 0.2%]    [-0.2% 0.2%]    [-0.1% 0.1%]
Table from [Varoquaux and Colliot 2022]
¹Also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]
37. Statistical testing with a given test set
(Split the full data into a train set and a test set.)
Simple distribution of metrics, eg accuracy¹: binomial
Non-parametric alternatives:
- Confidence intervals for the performance of a prediction rule: bootstrap the test set
- Statistical testing (comparing prediction rules with correlated errors): permutations [Bandos... 2005]
  Sample the null by randomly swapping predictions between p1 and p2.
¹Also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]
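A sketch of bootstrapping the test set to get a confidence interval on the accuracy of a fixed prediction rule; the simulated y_test / y_pred arrays are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.8, y_test, 1 - y_test)   # ~80% accurate rule

boot_accs = []
for _ in range(1_000):
    idx = rng.integers(0, len(y_test), size=len(y_test))   # resample with replacement
    boot_accs.append(np.mean(y_test[idx] == y_pred[idx]))

print(np.percentile(boot_accs, [2.5, 97.5]))   # 95% bootstrap interval
```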
39. Benchmarking to conclude on good training procedures
We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)ⁿ
We want machine-learning research claims
(novel frobnicate improves prediction)
Many arbitrary components:
torch.manual_seed(3407)?? [Picard 2021]
Useless to tune random seeds
(for weight init, dropout, data augmentation)
improvements will not carry over to new training data
40. Arbitrary variance in a machine-learning benchmark
Variance when rerunning an evaluation,
modifying arbitrary elements:
[Figure: relative scale of the variance from each arbitrary element — numerical noise, dropout, weight init, data order, data augmentation, data (bootstrap), noisy grid search, random search, Bayesian optimization — for bert-rte, bert-sst2, vgg, and on average.]
Across various computer vision and NLP tasks
[Bouthillier... 2021]
42. Uncertainty due to test set sampling
The test set remains a limited sample
of the population
The train-test split is an arbitrary choice
[Figure: same variance decomposition as above, for bert-rte, bert-sst2, vgg, and on average; the full data is split into train and test sets.]
Better evaluation
Sample these arbitrary choices multiple times [Bouthillier... 2021]
44. Better evaluation procedure
Cross-validation: multiple train-test splits
Randomize all arbitrary factors across the folds
But a full learning pipeline comes with hyper-parameters
These are set to minimize the error on left-out data
(Diagram: the full data is split into a train set, a validation set used to make model choices, and a test set used to evaluate the model.)
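A sketch of cross-validation with multiple randomized train-test splits, so that the arbitrary split is sampled several times; the dataset and model are arbitrary stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
print(scores.mean(), scores.std())   # spread across splits, not just one number
```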
46. Benchmarks accounting for hyper-parameter selection
Setting hyper-parameters is part of the pipeline
It must be evaluated
One option: train, validation, and test splits + manual tuning of hyper-parameters
(Diagram: full data split into train, validation, and test sets — make model choices on the validation set, evaluate the model on the test set.)
Rampant overfit of the validation set
[Makarova... 2021]
Better: multiple splits + automated tuning of hyper-parameters
(Diagram: an outer loop of train/test splits, with an inner validation set for hyper-parameter tuning.)
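A sketch of that second option: hyper-parameter tuning in an inner loop (GridSearchCV) and evaluation in an outer loop of splits, i.e. nested cross-validation; the dataset, model, and grid are arbitrary stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
inner = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                     param_grid={"svc__C": [0.1, 1, 10]}, cv=5)   # tunes on validation folds
outer = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)  # evaluation loop
print(cross_val_score(inner, X, y, cv=outer).mean())
```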
48. Hyper-parameter optimization procedures
Grid search vs random search [Bergstra and Bengio 2012]
Prefer random search to grid search for more than 2 parameters
Sub-optimal hyper-parameters on models routinely lead to invalid conclusions
See refs in [Bouthillier... 2021]
[Figure: grid search vs randomized search over two hyper-parameters (one important, one unimportant); randomized search covers the region of good hyper-parameters better.]
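A sketch of randomized search over hyper-parameters, the option preferred above when there are more than a couple of parameters; the distributions and model are arbitrary stand-ins.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)
```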
49. Benchmarking training procedures (eg to compare them)
Control arbitrary fluctuations (they will not generalize)
Sample them all:
- data sampling → multiple train-test splits (cross-validation)
- arbitrary choices (seeds) → randomize them all
- hyper-parameters → hyper-parameter optimization (too expensive to randomize)
In practice: set hyper-parameters once, then randomize model seeds and data splits [Bouthillier... 2021]
Counterintuitive: randomization decorrelates errors, improving benchmarks
50. Accounting for variance in conclusions
Confidence intervals & statistical testing
Statistical testing with multiple folds
Challenge: the folds are not independent¹
t-test / Wilcoxon across folds are not valid
Don't divide the std by the number of folds
Approximate corrections for the dependence across folds²:
- 5x2cv: repeat 5 times a randomized 2-fold split [Dietterich 1998]
- Corrected resampled t-test statistic [Nadeau and Bengio 2003]
¹Train sets overlap, and often test sets do too.
²Does not account for sources of variance beyond data sampling, eg random seeds, hyper-parameters.
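A sketch of the corrected resampled t-test of Nadeau and Bengio (2003): the variance of the per-split score differences is inflated by (1/k + n_test/n_train) to account for the overlap between folds; the differences and split sizes are made-up.

```python
import numpy as np
from scipy import stats

diffs = np.array([0.02, 0.01, 0.03, 0.00, 0.02,
                  0.01, 0.02, 0.03, 0.01, 0.02])   # per-split score differences p1 - p2
k = len(diffs)
n_train, n_test = 900, 100                         # sizes used in each split

corrected_var = diffs.var(ddof=1) * (1 / k + n_test / n_train)
t_stat = diffs.mean() / np.sqrt(corrected_var)
p_value = 2 * stats.t.sf(abs(t_stat), df=k - 1)
print(t_stat, p_value)
```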
51. Statistical tests: across datasets
(more general claims on training pipelines)
Challenge: metrics are not comparable across datasets
⇒ Tests based on rank statistics
Wilcoxon signed-rank test: tests how often p1 outperforms p2 across datasets
[Demšar 2006]
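A sketch of comparing two pipelines across datasets with the Wilcoxon signed-rank test on their per-dataset scores; the scores are made-up.

```python
import numpy as np
from scipy.stats import wilcoxon

scores_p1 = np.array([0.81, 0.74, 0.90, 0.66, 0.85, 0.79, 0.88, 0.72])
scores_p2 = np.array([0.79, 0.75, 0.86, 0.64, 0.83, 0.78, 0.85, 0.70])

stat, p_value = wilcoxon(scores_p1, scores_p2)   # rank-based, scale-free across datasets
print(p_value)
```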
52. Statistical tests: multiple pipelines across datasets
Challenge: multiple comparisons¹
The Wilcoxon-Holm approach [Demšar 2006]:
pairwise comparisons + Bonferroni-Holm correction
The Friedman-Nemenyi approach² [Demšar 2006]:
1. Friedman test across all pipelines (omnibus)
2. Critical difference via the Nemenyi test
[Figure: critical-difference diagram of accuracy ranks for five classifiers (clf1 ≈ 4.20, clf2 ≈ 3.77, clf4 ≈ 3.50, clf5 ≈ 2.00, clf3 ≈ 1.53).]
Replicability analysis [Dror... 2017]:
perform dataset-level pairwise tests, then combine by testing (partial conjunction multiple testing):
“Does p1 perform better than p2 on at least u datasets?”
More powerful for a small number of datasets
¹If we do many tests, some will show large differences by chance.
²The Holm approach can be more interesting when considering comparisons to a referent classifier.
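A sketch of the Friedman omnibus test, the first step of the Friedman-Nemenyi approach; the score matrix (datasets × pipelines) is made-up.

```python
import numpy as np
from scipy.stats import friedmanchisquare

scores = np.array([[0.81, 0.79, 0.70],    # rows: datasets, columns: pipelines
                   [0.74, 0.75, 0.69],
                   [0.90, 0.86, 0.80],
                   [0.66, 0.64, 0.60],
                   [0.85, 0.83, 0.78]])
stat, p_value = friedmanchisquare(*scores.T)   # one score vector per pipeline
print(p_value)
```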
54. Statistical tests: beyond null-hypothesis testing
Sample size is a problem
Across datasets: significance typically requires 15 datasets
Within a dataset (repeating folds, seeds...): many repetitions make any difference significant¹
Underpowered experiments bring no evidence
Shortcomings of null-hypothesis testing
Significance decreases with more comparisons²
Statistical significance ≠ practical significance
¹Though, as the total test-set size is limited, they do not bring more evidence for generalization.
²FDR (False Discovery Rate) procedures attempt to address this.
[Demšar 2008]
55. Statistical tests: accounting for effect sizes
Neyman-Pearson view of hypothesis testing
Two hypotheses, H0 and H1
H1: p1 outperforms p2 by a margin¹
Which is most likely, H0 or H1?
Requires choosing the margin
Related to superiority testing in clinical trials [Lesaffre 2008]
¹Related to the rejection region in the Neyman-Pearson lemma.
56. Pragmatic compromises
Test whether P(p1 > p2) > δ
δ > 0.5 relates to the Neyman-Pearson view
Evaluate P(p1 > p2) by resampling
Randomize everything: data splits, seeds, ...
Gaussian approximation: amounts to comparing differences to standard deviations
Not an inference on the expected difference in performance¹
¹Unlike the standard error, the standard deviation does not shrink to zero with the number of resamplings.
[Bouthillier... 2021]
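A sketch of this pragmatic compromise: estimate P(p1 > p2) from paired resampled evaluations and compare it to a chosen threshold δ; the per-split scores are made-up.

```python
import numpy as np

scores_p1 = np.array([0.82, 0.80, 0.85, 0.79, 0.83, 0.81, 0.84, 0.80])
scores_p2 = np.array([0.80, 0.81, 0.82, 0.78, 0.82, 0.80, 0.81, 0.79])

p_better = np.mean(scores_p1 > scores_p2)   # fraction of resampled runs where p1 wins
print(p_better > 0.75)                      # compare to a chosen threshold delta
```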
57. Summary – comparing learning procedures
Sample multiple sources of variance:
- data splits
- random seeds
- hyper-parameter tuning
Null-hypothesis testing: no naive t-test on cross-validation folds!
Don't misinterpret the p-value:
- Not significant: more data could change that
- Significant differences may be trivial
Detect practical differences:
compare the delta in performance to the standard deviation
58. Better experimental procedures
Crack the black box open
A prediction score is seldom insightful
Ablation studies: remove or change atomic elements
Learning curves
Better benchmarking in these:
- Tune hyper-parameters to the same quality
- Randomize everything
- Account for variance in conclusions
59. Summary – Better benchmarking procedures
Reminder: your validation measure is intrinsically unreliable (sampling noise)
An arbitrary choice (random seed) may give seemingly good results that do not generalize
Optimizing test accuracy will explore the tails
Evaluation is a bottleneck
[Figure: cross-validation error bars widen as the number of available samples shrinks — from roughly ±2% with 300 samples to roughly ±15% with 30 samples.]
60. Model evaluation
Choice of performance metrics
Should be suited to the application setting
Machine learning tends to chase metrics
Evaluation procedures
Account for variance
Different sources of variance when applying prediction rules vs learning them
Careful benchmarking is crucial
Optimistic flukes will not generalize
What is our purpose? External validity
@GaelVaroquaux
61. References I
A. I. Bandos, H. E. Rockette, and D. Gur. A permutation test sensitive to differences in
areas for comparing roc curves from a paired design. Statistics in medicine, 24:2873,
2005.
J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of
Machine Learning Research, 13:281, 2012.
X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto,
N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, ... Accounting for variance in
machine learning benchmarks. Proceedings of Machine Learning and Systems, 3, 2021.
D. Chyzhyk, G. Varoquaux, B. Thirion, and M. Milham. Controlling a confound in predictive
models with a test set minimizing its effect. In 2018 International Workshop on
Pattern Recognition in Neuroimaging (PRNI), pages 1–4. IEEE, 2018.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of
Machine Learning Research, 7:1–30, 2006.
J. Demšar. On the appropriateness of statistical tests in machine learning. In Workshop
on Evaluation Methods for Machine Learning in conjunction with ICML, page 65.
Citeseer, 2008.
62. References II
T. G. Dietterich. Approximate statistical tests for comparing supervised classification
learning algorithms. Neural computation, 10(7):1895–1923, 1998.
R. Dror, G. Baumer, M. Bogomolov, and R. Reichart. Replicability analysis for natural language
processing: Testing significance with multiple datasets. Transactions of the
Association for Computational Linguistics, 2017.
E. Lesaffre. Superiority, equivalence, and non-inferiority trials. Bulletin of the NYU
hospital for joint diseases, 66(2), 2008.
M. A. Little, G. Varoquaux, S. Saeb, L. Lonini, A. Jayaraman, D. C. Mohr, and K. P. Kording.
Using and understanding cross-validation strategies. Perspectives on Saeb et al.
GigaScience, 6(5):1–6, 2017.
L. Maier-Hein, A. Reinke, E. Christodoulou, B. Glocker, P. Godau, F. Isensee, J. Kleesiek,
M. Kozubek, M. Reyes, M. A. Riegler, ... Metrics reloaded: Pitfalls and recommendations
for image analysis validation. arXiv preprint arXiv:2206.01653, 2022.
63. References III
A. Makarova, H. Shen, V. Perrone, A. Klein, J. B. Faddoul, A. Krause, M. Seeger, and
C. Archambeau. Overfitting in bayesian optimization: an empirical study and
early-stopping solution. arXiv preprint arXiv:2104.08166, 2021.
C. Nadeau and Y. Bengio. Inference for the generalization error. Machine learning, 52(3):
239–281, 2003.
D. Picard. Torch.manual_seed(3407) is all you need: On the influence of random seeds
in deep learning architectures for computer vision. arXiv:2109.08203, 2021.
N. Traut, K. Heuer, G. Lemaître, A. Beggiato, D. Germanaud, M. Elmaleh, A. Bethegnies,
L. Bonnasse-Gahot, W. Cai, S. Chambon, ... Insights from an autism imaging biomarker
challenge: promises and threats to biomarker discovery. NeuroImage, 255:119171, 2022.
G. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars.
Neuroimage, 180:68–77, 2018.
G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological
failures and recommendations for the future. NPJ digital medicine, 5(1):1–8, 2022.
G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic
value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.