Model evaluation is, in my opinion, the most overlooked step of the machine-learning pipeline. Reliably estimating a model's performance for a given purpose is both crucial and difficult. In this talk, I first discuss choosing a metric that is informative for the application, stressing the importance of class prevalence in classification settings. I then discuss procedures to estimate generalization performance, drawing a distinction between evaluating a learning procedure and evaluating a prediction rule, and showing how to put confidence intervals on the performance estimates.
Evaluating machine learning models and their diagnostic value
1. The forgotten practicalities of machine learning:
Evaluating machine learning models
and their diagnostic value
Gaël Varoquaux
Based on Varoquaux and Colliot,
hal.archives-ouvertes.fr/hal-03682454
2. Evaluation: the Achilles heel of machine learning practice
Brain imaging prediction challenge [Traut... 2022]
Autism status prediction
10 000 € in prize incentives
Analysts overfit the public set
3. Evaluation: the Achilles heel of machine learning practice
Kaggle competitions [Varoquaux and Cheplygina 2022]
Observed improvement in score
[Figure: for four competitions — lung cancer classification (prize $1,000,000, test size ≤ 1K), pneumonia detection/localization (prize $30,000, test size 3,000), Covid-19 abnormality localization (prize $100,000, test size 1,200), nerve segmentation (prize $100,000, test size 5.5K) — the improvement of the top model over the 10% best submissions and the winner gap are compared to the evaluation noise between public and private sets.]
Many submissions fall within evaluation error
4. Evaluation: the Achilles heel of machine learning practice
Deep learning literature [Bouthillier... 2021]
[Figure: published accuracy on cifar10 and sst2 from 2012 to 2020, distinguishing significant from non-significant improvements among non-'SOTA' results.]
Published improvements often fall within error bars
5. Problem setting: defining objects
Data: (x, y) ∈ X × Y drawn from a distribution D
A prediction rule: a function f : X → Y such that f(x) ≈ y. Notation: ŷ ≝ f(x)
Evaluate the risk
Error function l : Y × Y → ℝ captures the cost of errors
Expected error: 𝔼_{x,y ∼ D}[ l(f(x), y) ]
Note: training might be done with another D or another l
6. Outline
1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
2 Evaluation procedures
Evaluating a prediction rule
Evaluating a training procedure
7. 1 Performance metrics
Metrics must capture and reflect the application
Going further for image analysis: [Maier-Hein... 2022]
8. 1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
9. Summary metric: accuracy
Accuracy: the fraction of correct predictions
Simple summary
Overly simplified picture
Does not distinguish false detections from misses
eg a false detection of a brain tumor?
- Very bad if it leads to brain surgery
- Minor if it leads to confirmatory imaging
The relevant cost (error metric) depends on the application setting
10. Confusion matrix: beyond accuracy
                        Truth: disease status
                          D+      D−
Predicted     T+          TP      FP
test result   T−          FN      TN
13. Metrics on positive and negative classes
T denotes the test (classifier output); D denotes the diseased status.
Sensitivity (also called recall): fraction of positive samples actually retrieved
Sensitivity = TP / (TP + FN)    Estimates P(T+ | D+)
Specificity: fraction of negative samples actually classified as negative
Specificity = TN / (TN + FP)    Estimates P(T− | D−)
Positive predictive value (PPV, also called precision): fraction of the positively classified samples which are indeed positive
PPV = TP / (TP + FP)    Estimates P(D+ | T+)
Negative predictive value (NPV): fraction of the negatively classified samples which are indeed negative
NPV = TN / (TN + FN)    Estimates P(D− | T−)
[Varoquaux and Colliot 2022]
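A minimal sketch of these four metrics computed from a confusion matrix with scikit-learn; the y_true / y_pred arrays are made-up toy labels, not from the talk.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # D+ = 1, D- = 0
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # T+ = 1, T- = 0

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # estimates P(T+ | D+), aka recall
specificity = tn / (tn + fp)   # estimates P(T- | D-)
ppv = tp / (tp + fp)           # estimates P(D+ | T+), aka precision
npv = tn / (tn + fn)           # estimates P(D- | T-)
print(sensitivity, specificity, ppv, npv)
```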
14. 1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
15. Accuracy, sensitivity, specificity
(Confusion matrix as above: truth D+/D−, prediction T+/T−.)
Accuracy is uninformative under class imbalance
With 90% of class 0, predicting only class 0 gives Acc = 90%
Balanced accuracy: averages the errors on class 0 and class 1
Sensitivity (also called recall): fraction of class 1 retrieved: TP / (TP + FN)
Specificity: fraction of class 0 actually classified as 0: TN / (TN + FP)
Balanced accuracy = ½ (sensitivity + specificity)
Sensitivity estimates P(T+ | D+); specificity estimates P(T− | D−)
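To illustrate the point above, a small sketch with made-up toy data: with 90% of class 0, the constant majority-class prediction reaches 90% accuracy but only 50% balanced accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)     # 90% of class 0
y_pred = np.zeros(100, dtype=int)          # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.90
print(balanced_accuracy_score(y_true, y_pred))  # 0.50
```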
16. Asking the right question: P(T+|D+) vs P(D+|T+)
Positive predictive value (via Bayes' theorem):
P(D+ | T+) = sensitivity × prevalence / [ sensitivity × prevalence + (1 − specificity) × (1 − prevalence) ]
Summary metric: markedness = PPV + NPV − 1
Drawback: depends on the prevalence
⇒ characterizes not only the classifier, but also the dataset
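A small sketch of the Bayes-rule formula above, showing how the same sensitivity and specificity give very different PPVs depending on prevalence; the numeric values are illustrative assumptions.

```python
def ppv(sensitivity, specificity, prevalence):
    # P(D+ | T+) via Bayes' theorem
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print(ppv(0.9, 0.9, prevalence=0.5))    # 0.90
print(ppv(0.9, 0.9, prevalence=0.01))   # ~0.08: most positive tests are false alarms
```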
17. Odds ratios for invariance to sampling
Definition: the odds of a: O(a) = P(a) / (1 − P(a))
Likelihood ratio of the positive class:
LR+ = O(D+ | T+) / O(D+) = sensitivity / (1 − specificity)
Independent of class prevalence¹
Use the prevalence of the target population to compute O(D+ | T+)
Useful to extrapolate from a case-control study to the target population
¹Proof in Varoquaux and Colliot, “Evaluating machine learning models and their diagnostic value”
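A sketch of using LR+ to extrapolate from a case-control study to a target population; the sensitivity, specificity, and prevalence values are assumptions.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    return sensitivity / (1 - specificity)          # independent of prevalence

def post_test_probability(lr_plus, prevalence):
    pre_test_odds = prevalence / (1 - prevalence)   # O(D+)
    post_test_odds = lr_plus * pre_test_odds        # O(D+ | T+) = LR+ x O(D+)
    return post_test_odds / (1 + post_test_odds)    # back to a probability

lr_plus = positive_likelihood_ratio(sensitivity=0.9, specificity=0.9)   # 9.0
print(post_test_probability(lr_plus, prevalence=0.01))                  # ~0.083
```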
18. 1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
19. Multi-threshold metrics: ROC curve
[Figure: ROC curves — true positive rate (sensitivity, recall) vs false positive rate — for classifiers of varying quality: Excellent AUC=0.95, Good AUC=0.88, Poor AUC=0.75, Chance AUC=0.50.]
Classifiers output continuous scores
The ROC curve is obtained by varying the decision threshold
Summary metric: AUC (Area Under the Curve)
Chance level = 0.5
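A minimal sketch of computing the ROC curve and its AUC from continuous classifier scores; the y_true / y_score arrays are made-up toy values.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one point per threshold
print(roc_auc_score(y_true, y_score))               # area under the ROC curve
```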
20. Multi-threshold metrics: Precision Recall curve
[Figure: precision-recall curves — precision (PPV) vs recall (TPR, sensitivity) — for the same classifiers: Excellent AUC=0.96, Good AUC=0.92, Poor AUC=0.78, Chance AUC=0.57.]
Better dynamic range than ROC for imbalanced classes
Area Under the Curve: the chance level depends on class imbalance
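The corresponding sketch for the precision-recall curve; note that the chance level of its area is the class prevalence (same made-up arrays as above, repeated so the snippet is self-contained).

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(average_precision_score(y_true, y_score))   # summary of the PR curve
print(y_true.mean())                              # prevalence = chance level
```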
21. 1 Performance metrics
Basic classification metrics
Evaluating imbalanced binary classification
Multi-threshold metrics for classification
Confidence scores and calibration
22. Interpreting classifier score as a probability? – Calibration
Calibration
Among samples with confidence score s, the observed fraction of positives should be s
Computed in bins on the score s
ECE (expected calibration error): average calibration error over the bins
[Calibration plot: observed fraction of positives vs mean confidence score per bin, with the perfectly calibrated diagonal; points above the diagonal are underconfident, below are overconfident; GaussianNB shown as an example.]
Caveat: calibration does not control individual probabilities
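A sketch of a binned calibration (reliability) curve with scikit-learn; the ECE-like number printed here is an unweighted average over bins, a simplification of the expected calibration error, and the probabilities are made-up.

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.9, 0.7, 0.3, 0.6, 0.8, 0.2, 0.4, 0.65, 0.95])

frac_pos, mean_score = calibration_curve(y_true, y_prob, n_bins=5)
print(np.mean(np.abs(frac_pos - mean_score)))   # rough, unweighted ECE over bins
```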
23. Classifier score as a probability? – Individual probabilities
Does the classifier score approach P(y | X)?
Metric: Brier score = Σᵢ (ŝᵢ − yᵢ)²
where ŝᵢ is the confidence score and yᵢ the observed (binary) label
Minimal for ŝ = P(y | X)
Scaled version: Brier skill score = 1 − Brier(ŝ, y) / Brier(ȳ, y)
where ȳ is the class prevalence
1 = perfect prediction
0 = scores no more informative than the class prevalence
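A sketch of the Brier score and the Brier skill score against the prevalence baseline; labels and scores are made-up.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.9, 0.7, 0.3, 0.6, 0.8, 0.2, 0.4, 0.65, 0.95])

brier = brier_score_loss(y_true, y_prob)
baseline = brier_score_loss(y_true, np.full_like(y_prob, y_true.mean()))
print(1 - brier / baseline)   # Brier skill score: 1 = perfect, 0 = prevalence only
```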
24. Choose your metrics well
Machine-learning research chases metrics
These should reflect the application as well as possible
Worry about P(D+ | T+)
- Accuracy is a reasonable proxy only for balanced classes
- LR+ is interesting to keep in mind
Worry about uncertainty
- Calibration quantifies average errors
- The Brier score controls individual uncertainty
A single number does not tell the whole story
27. Definitions: what are we benchmarking?
Scenario 1: a prediction rule
We are given f : X → Y
For application claims, eg medicine
Scenario 2: a training procedure
We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)ⁿ
For machine-learning research (claims on algorithms)
29. Model evaluation 101
In-sample error is not meaningful: (X, y) are not independent of f̂, so the expectation 𝔼 is estimated wrong
Typically: split the full data into a train set and a test set
The test set is typically 10%, trading off better learning vs better estimation
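A minimal sketch of held-out evaluation with a 10% test split, as mentioned above; the dataset and model are arbitrary stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # out-of-sample accuracy
```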
31. External validation: before putting into production
We are given f : X → Y
Xtest should be different enough from Xtrain
- No repeated acquisition of the same individual across train and test
[Little... 2017]
- Ideally: show generalization to a new site, later in time...
Xtest should be representative of the target population
Sample Xtest:
- To match statistical moments
- To minimize a confounding association (shortcuts)
[Chyzhyk... 2018]
32. Evaluation error: Sampling noise on test set
Evaluation quality is limited by the number of test examples
Sampling noise¹ for ntest = 1 000:
[Figure: binomial distribution of the error on test accuracy, spanning roughly −2% to +2% for ntest = 1 000.]
Confidence intervals:
ntest = 1 000: interval 5.7%
ntest = 10 000: interval 1.8%
ntest = 100 000: interval 0.6%
¹The data at hand (eg the test set) is just a small sample of the full population “in the wild”, and sampling other data will lead to other results.
[Varoquaux 2018]
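A sketch reproducing interval widths like those above from the binomial distribution; the accuracy value of 0.7 is an assumption chosen to match the quoted widths.

```python
from scipy.stats import binom

accuracy = 0.7
for n_test in (1_000, 10_000, 100_000):
    lo, hi = binom.interval(0.95, n_test, accuracy)   # counts of correct predictions
    print(n_test, f"interval width: {(hi - lo) / n_test:.1%}")
```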
33. Evaluation error: Sampling noise on test set
[Figure: standard deviation of accuracy (%) as a function of test-set size, “in the wild”. In theory: from a binomial distribution — Binom(n′, 0.66), Binom(n′, 0.91), Binom(n′, 0.95). In practice: random splits for Glue-RTE BERT (n′=277), Glue-SST2 BERT (n′=872), CIFAR10 VGG11 (n′=10 000).]
[Bouthillier... 2021]
34. Evaluation noise is not negligible – in Kaggle competitions
[Figure: for four competitions, the improvement of the top model over the 10% best submissions, the winner gap, and the actual improvement are compared to the evaluation noise between public and private sets — lung cancer classification (test size ≤ 1K): smaller improvements than noise, diminishing returns; schizophrenia classification (test size 120): diminishing returns; lung tumor segmentation (test size ≤ 6K): poorer score on the private set, overfit; nerve segmentation (test size 5.5K).]
[Varoquaux and Cheplygina 2022]
35. Confidence intervals & statistical testing
Evaluate uncertainty to know when to stop and what to trust
(diminishing returns, creeping complexity)
Confidence interval¹: range of values compatible with the observations
Null-hypothesis testing: p-value: the chance of observing the results if a null hypothesis were true
Typical null, for model comparison: models p1 and p2 give the same expected error
¹Technically, we do not distinguish this from a credible interval.
36. Statistical testing with a given test set
(Split the full data into a train set and a test set.)
Simple distribution of metrics, eg accuracy¹: binomial
Confidence intervals:
N        65%             80%             90%             95%
100      [-9.0% 9.0%]    [-8.0% 8.0%]    [-6.0% 5.0%]    [-5.0% 4.0%]
1000     [-3.0% 2.9%]    [-2.5% 2.4%]    [-1.9% 1.8%]    [-1.4% 1.3%]
10000    [-0.9% 0.9%]    [-0.8% 0.8%]    [-0.6% 0.6%]    [-0.4% 0.4%]
100000   [-0.3% 0.3%]    [-0.2% 0.2%]    [-0.2% 0.2%]    [-0.1% 0.1%]
Table from [Varoquaux and Colliot 2022]
¹Also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]
37. Statistical testing with a given test set
(Split the full data into a train set and a test set.)
Simple distribution of metrics, eg accuracy¹: binomial
Non-parametric alternatives:
- Confidence intervals for the performance of a prediction rule: bootstrap the test set
- Statistical testing (comparing prediction rules with correlated errors): permutations [Bandos... 2005]
  Sample the null by randomly swapping predictions between p1 and p2.
¹Also for sensitivity, NPV, PPV, see [Varoquaux and Colliot 2022]
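A sketch of bootstrapping the test set to get a confidence interval on the accuracy of a fixed prediction rule; the simulated y_test / y_pred arrays are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.8, y_test, 1 - y_test)   # ~80% accurate rule

boot_accs = []
for _ in range(1_000):
    idx = rng.integers(0, len(y_test), size=len(y_test))   # resample with replacement
    boot_accs.append(np.mean(y_test[idx] == y_pred[idx]))

print(np.percentile(boot_accs, [2.5, 97.5]))   # 95% bootstrap interval
```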
39. Benchmarking to conclude on good training procedures
We are given a procedure that outputs a prediction rule f̂ from training data (X, y) ∈ (X × Y)ⁿ
We want machine-learning research claims
(novel frobnicate improves prediction)
Many arbitrary components:
torch.manual_seed(3407)?? [Picard 2021]
Useless to tune random seeds
(for weight init, dropout, data augmentation)
improvements will not carry over to new training data
40. Arbitrary variance in a machine-learning benchmark
Variance when rerunning an evaluation,
modifying arbitrary elements:
[Figure: relative scale of the variance from each arbitrary element — numerical noise, dropout, weight init, data order, data augmentation, data (bootstrap), noisy grid search, random search, Bayesian optimization — for bert-rte, bert-sst2, vgg, and on average.]
Across various computer vision and NLP tasks
[Bouthillier... 2021]
42. Uncertainty due to test set sampling
The test set remains a limited sample
of the population
The train-test split is an arbitrary choice
[Figure: same variance decomposition as above, for bert-rte, bert-sst2, vgg, and on average; the full data is split into train and test sets.]
Better evaluation
Sample these arbitrary choices multiple times [Bouthillier... 2021]
44. Better evaluation procedure
Cross-validation: multiple train-test splits
Randomize all arbitrary factors across the folds
But a full learning pipeline comes with hyper-parameters
These are set to minimize the error on left-out data
(Diagram: the full data is split into a train set, a validation set used to make model choices, and a test set used to evaluate the model.)
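A sketch of cross-validation with multiple randomized train-test splits, so that the arbitrary split is sampled several times; the dataset and model are arbitrary stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
print(scores.mean(), scores.std())   # spread across splits, not just one number
```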
46. Benchmarks accounting for hyper-parameter selection
Setting hyper-parameters is part of the pipeline
It must be evaluated
One option: train, validation, and test splits + manual tuning of hyper-parameters
(Diagram: full data split into train, validation, and test sets — make model choices on the validation set, evaluate the model on the test set.)
Rampant overfit of the validation set
[Makarova... 2021]
Better: multiple splits + automated tuning of hyper-parameters
(Diagram: an outer loop of train/test splits, with an inner validation set for hyper-parameter tuning.)
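A sketch of that second option: hyper-parameter tuning in an inner loop (GridSearchCV) and evaluation in an outer loop of splits, i.e. nested cross-validation; the dataset, model, and grid are arbitrary stand-ins.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
inner = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                     param_grid={"svc__C": [0.1, 1, 10]}, cv=5)   # tunes on validation folds
outer = ShuffleSplit(n_splits=10, test_size=0.1, random_state=0)  # evaluation loop
print(cross_val_score(inner, X, y, cv=outer).mean())
```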
48. Hyper-parameter optimization procedures
Grid search vs random search [Bergstra and Bengio 2012]
Prefer random search to grid search for more than 2 parameters
Sub-optimal hyper-parameters on models routinely lead to invalid conclusions
See refs in [Bouthillier... 2021]
[Figure: grid search vs randomized search over two hyper-parameters (one important, one unimportant); randomized search covers the region of good hyper-parameters better.]
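A sketch of randomized search over hyper-parameters, the option preferred above when there are more than a couple of parameters; the distributions and model are arbitrary stand-ins.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_)
```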
49. Benchmarking training procedures (eg to compare them)
Control arbitrary fluctuations (they will not generalize)
Sample them all:
- data sampling → multiple train-test splits (cross-validation)
- arbitrary choices (seeds) → randomize them all
- hyper-parameters → hyper-parameter optimization (too expensive to randomize)
In practice: set hyper-parameters once, then randomize model seeds and data splits [Bouthillier... 2021]
Counterintuitive: randomization decorrelates errors, improving benchmarks
50. Accounting for variance in conclusions
Confidence intervals & statistical testing
Statistical testing with multiple folds
Challenge: the folds are not independent¹
t-test / Wilcoxon across folds are not valid
Don't divide the std by the number of folds
Approximate corrections for the dependence across folds²:
- 5x2cv: repeat 5 times a randomized 2-fold split [Dietterich 1998]
- Corrected resampled t-test statistic [Nadeau and Bengio 2003]
¹Train sets overlap, and often test sets do too.
²Does not account for sources of variance beyond data sampling, eg random seeds, hyper-parameters.
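A sketch of the corrected resampled t-test of Nadeau and Bengio (2003): the variance of the per-split score differences is inflated by (1/k + n_test/n_train) to account for the overlap between folds; the differences and split sizes are made-up.

```python
import numpy as np
from scipy import stats

diffs = np.array([0.02, 0.01, 0.03, 0.00, 0.02,
                  0.01, 0.02, 0.03, 0.01, 0.02])   # per-split score differences p1 - p2
k = len(diffs)
n_train, n_test = 900, 100                         # sizes used in each split

corrected_var = diffs.var(ddof=1) * (1 / k + n_test / n_train)
t_stat = diffs.mean() / np.sqrt(corrected_var)
p_value = 2 * stats.t.sf(abs(t_stat), df=k - 1)
print(t_stat, p_value)
```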
51. Statistical tests: across datasets
(more general claims on training pipelines)
Challenge: metrics are not comparable across datasets
⇒ Tests based on rank statistics
Wilcoxon signed-rank test: tests how often p1 outperforms p2 across datasets
[Demšar 2006]
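A sketch of comparing two pipelines across datasets with the Wilcoxon signed-rank test on their per-dataset scores; the scores are made-up.

```python
import numpy as np
from scipy.stats import wilcoxon

scores_p1 = np.array([0.81, 0.74, 0.90, 0.66, 0.85, 0.79, 0.88, 0.72])
scores_p2 = np.array([0.79, 0.75, 0.86, 0.64, 0.83, 0.78, 0.85, 0.70])

stat, p_value = wilcoxon(scores_p1, scores_p2)   # rank-based, scale-free across datasets
print(p_value)
```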
52. Statistical tests: multiple pipelines across datasets
Challenge: multiple comparisons¹
The Wilcoxon-Holm approach [Demšar 2006]:
pairwise comparisons + Bonferroni-Holm correction
The Friedman-Nemenyi approach² [Demšar 2006]:
1. Friedman test across all pipelines (omnibus)
2. Critical difference via the Nemenyi test
[Figure: critical-difference diagram of accuracy ranks for five classifiers (clf1 ≈ 4.20, clf2 ≈ 3.77, clf4 ≈ 3.50, clf5 ≈ 2.00, clf3 ≈ 1.53).]
Replicability analysis [Dror... 2017]:
perform dataset-level pairwise tests, then combine by testing (partial conjunction multiple testing):
“Does p1 perform better than p2 on at least u datasets?”
More powerful for a small number of datasets
¹If we do many tests, some will show large differences by chance.
²The Holm approach can be more interesting when considering comparisons to a referent classifier.
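A sketch of the Friedman omnibus test, the first step of the Friedman-Nemenyi approach; the score matrix (datasets × pipelines) is made-up.

```python
import numpy as np
from scipy.stats import friedmanchisquare

scores = np.array([[0.81, 0.79, 0.70],    # rows: datasets, columns: pipelines
                   [0.74, 0.75, 0.69],
                   [0.90, 0.86, 0.80],
                   [0.66, 0.64, 0.60],
                   [0.85, 0.83, 0.78]])
stat, p_value = friedmanchisquare(*scores.T)   # one score vector per pipeline
print(p_value)
```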
54. Statistical tests: beyond null-hypothesis testing
Sample size is a problem
Across datasets: significance typically requires 15 datasets
Within a dataset (repeating folds, seeds...): many repetitions make any difference significant¹
Underpowered experiments bring no evidence
Shortcomings of null-hypothesis testing
Significance decreases with more comparisons²
Statistical significance ≠ practical significance
¹Though, as the total test-set size is limited, they do not bring more evidence for generalization.
²FDR (False Discovery Rate) procedures attempt to address this.
[Demšar 2008]
55. Statistical tests: accounting for effect sizes
Neyman-Pearson view of hypothesis testing
Two hypotheses, H0 and H1
H1: p1 outperforms p2 by a margin¹
Which is most likely, H0 or H1?
Requires choosing the margin
Related to superiority testing in clinical trials [Lesaffre 2008]
¹Related to the rejection region in the Neyman-Pearson lemma.
56. Pragmatic compromises
Test whether P(p1 > p2) > δ
δ > 0.5 relates to the Neyman-Pearson view
Evaluate P(p1 > p2) by resampling
Randomize everything: data splits, seeds, ...
Gaussian approximation: amounts to comparing differences to standard deviations
Not an inference on the expected difference in performance¹
¹Unlike the standard error, the standard deviation does not shrink to zero with the number of resamplings.
[Bouthillier... 2021]
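A sketch of this pragmatic compromise: estimate P(p1 > p2) from paired resampled evaluations and compare it to a chosen threshold δ; the per-split scores are made-up.

```python
import numpy as np

scores_p1 = np.array([0.82, 0.80, 0.85, 0.79, 0.83, 0.81, 0.84, 0.80])
scores_p2 = np.array([0.80, 0.81, 0.82, 0.78, 0.82, 0.80, 0.81, 0.79])

p_better = np.mean(scores_p1 > scores_p2)   # fraction of resampled runs where p1 wins
print(p_better > 0.75)                      # compare to a chosen threshold delta
```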
57. Summary – comparing learning procedures
Sample multiple sources of variance:
- data splits
- random seeds
- hyper-parameter tuning
Null-hypothesis testing: no naive t-test on cross-validation folds!
Don't misinterpret the p-value:
- Not significant: more data could change that
- Significant differences may be trivial
Detect practical differences:
compare the delta in performance to the standard deviation
58. Better experimental procedures
Crack the black box open
A prediction score is seldom insightful
Ablation studies: remove or change atomic elements
Learning curves
Better benchmarking in these:
- Tune hyper-parameters to the same quality
- Randomize everything
- Account for variance in conclusions
59. Summary – Better benchmarking procedures
Reminder: your validation measure is intrinsically unreliable (sampling noise)
An arbitrary choice (random seed) may give seemingly good results that do not generalize
Optimizing test accuracy will explore the tails
Evaluation is a bottleneck
[Figure: cross-validation error bars widen as the number of available samples shrinks — from roughly ±2% with 300 samples to roughly ±15% with 30 samples.]
60. Model evaluation
Choice of performance metrics
Should be suited to the application setting
Machine learning tends to chase metrics
Evaluation procedures
Account for variance
Different sources of variance when applying prediction rules vs learning them
Careful benchmarking is crucial
Optimistic flukes will not generalize
What is our purpose? External validity
@GaelVaroquaux
61. References I
A. I. Bandos, H. E. Rockette, and D. Gur. A permutation test sensitive to differences in
areas for comparing roc curves from a paired design. Statistics in medicine, 24:2873,
2005.
J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of
Machine Learning Research, 13:281, 2012.
X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto,
N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, ... Accounting for variance in
machine learning benchmarks. Proceedings of Machine Learning and Systems, 3, 2021.
D. Chyzhyk, G. Varoquaux, B. Thirion, and M. Milham. Controlling a confound in predictive
models with a test set minimizing its effect. In 2018 International Workshop on
Pattern Recognition in Neuroimaging (PRNI), pages 1–4. IEEE, 2018.
J. Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of
Machine Learning Research, 7:1–30, 2006.
J. Demšar. On the appropriateness of statistical tests in machine learning. In Workshop
on Evaluation Methods for Machine Learning in conjunction with ICML, page 65.
Citeseer, 2008.
62. References II
T. G. Dietterich. Approximate statistical tests for comparing supervised classification
learning algorithms. Neural computation, 10(7):1895–1923, 1998.
R. Dror, G. Baumer, M. Bogomolov, and R. Reichart. Replicability analysis for natural language
processing: Testing significance with multiple datasets. Transactions of the
Association for Computational Linguistics, 2017.
E. Lesaffre. Superiority, equivalence, and non-inferiority trials. Bulletin of the NYU
hospital for joint diseases, 66(2), 2008.
M. A. Little, G. Varoquaux, S. Saeb, L. Lonini, A. Jayaraman, D. C. Mohr, and K. P. Kording.
Using and understanding cross-validation strategies. Perspectives on Saeb et al.
GigaScience, 6(5):1–6, 2017.
L. Maier-Hein, A. Reinke, E. Christodoulou, B. Glocker, P. Godau, F. Isensee, J. Kleesiek,
M. Kozubek, M. Reyes, M. A. Riegler, ... Metrics reloaded: Pitfalls and recommendations
for image analysis validation. arXiv preprint arXiv:2206.01653, 2022.
63. References III
A. Makarova, H. Shen, V. Perrone, A. Klein, J. B. Faddoul, A. Krause, M. Seeger, and
C. Archambeau. Overfitting in bayesian optimization: an empirical study and
early-stopping solution. arXiv preprint arXiv:2104.08166, 2021.
C. Nadeau and Y. Bengio. Inference for the generalization error. Machine learning, 52(3):
239–281, 2003.
D. Picard. Torch.manual_seed(3407) is all you need: On the influence of random seeds
in deep learning architectures for computer vision. arXiv:2109.08203, 2021.
N. Traut, K. Heuer, G. Lemaître, A. Beggiato, D. Germanaud, M. Elmaleh, A. Bethegnies,
L. Bonnasse-Gahot, W. Cai, S. Chambon, ... Insights from an autism imaging biomarker
challenge: promises and threats to biomarker discovery. NeuroImage, 255:119171, 2022.
G. Varoquaux. Cross-validation failure: small sample sizes lead to large error bars.
Neuroimage, 180:68–77, 2018.
G. Varoquaux and V. Cheplygina. Machine learning for medical imaging: methodological
failures and recommendations for the future. NPJ digital medicine, 5(1):1–8, 2022.
G. Varoquaux and O. Colliot. Evaluating machine learning models and their diagnostic
value. https://hal.archives-ouvertes.fr/hal-03682454/, 2022.