Cross-validation Tutorial: What, how and which?

Pradeep Reddy Raamana
Machine Learning and Neuroimaging Researcher, Rotman Research Institute at Baycrest
raamana.com

"Statistics [from cross-validation] are like bikinis:
what they reveal is suggestive, but what they conceal is vital."
Goals for Today
• What is cross-validation? (training set vs. test set)
• How to perform it?
• What are the effects of different CV choices? (negative bias, unbiased, positive bias)
What is generalizability?
• Available: the data in hand (a sample*)
• Desired: accuracy on unseen data (the population*), i.e. out-of-sample predictions
• Getting there requires avoiding overfitting
*sample and population have precise statistical definitions
CV helps quantify generalizability
Why cross-validate?
• Bigger training set → better learning; bigger test set → better testing.
• Key: train & test sets must be disjoint, and the dataset (sample size) is fixed.
• So they grow at the expense of each other!
• Cross-validate to maximize both.
Use cases
• “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used.”
• Use cases:
  • to estimate generalizability (test accuracy)
  • to pick optimal parameters (model selection)
  • to compare performance (model comparison)
(Figure: distribution of accuracy, in %, from repetitions of CV)
Key Aspects of CV
1. How you split the dataset into train/test
  • maximal independence between training and test sets is desired
  • this split could be
    • over samples (rows; e.g. individual diagnosis: healthy vs. disease)
    • over time (columns; e.g. task prediction in fMRI)
    (a sketch of both kinds of split follows below)
2. How often you repeat the randomized splits
  • to expose the classifier to the full variability of the data
  • as many times as you can, e.g. 100
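To make the two kinds of split concrete, here is a minimal sketch (not from the slides) using scikit-learn: a stratified split over samples (rows) and a group-wise split over blocks such as sessions or subjects. The arrays X, y and the per-sample block labels are made-up placeholders.

# Minimal sketch, assuming scikit-learn; X, y and `blocks` are illustrative.
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

rng = np.random.RandomState(0)
X = rng.randn(100, 20)                    # 100 samples x 20 features
y = rng.randint(0, 2, size=100)           # e.g. healthy (0) vs. disease (1)
blocks = np.repeat(np.arange(10), 10)     # e.g. 10 sessions/subjects, 10 samples each

# Split over samples (rows): stratification keeps class proportions per fold.
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    pass  # fit on X[train_idx], evaluate on X[test_idx]

# Split over time/blocks: all samples of a session/subject stay on one side,
# keeping the test fold independent of the training fold.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=blocks):
    assert set(blocks[train_idx]).isdisjoint(blocks[test_idx])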
Validation set
• Training set: assesses goodness of fit of the model; estimates are biased* towards the training set.
• Test set: used to optimize parameters; estimates become biased towards the test set.
• Validation set: used to evaluate generalization; it is independent of both the training and test sets.
• The whole dataset is thus split into training, test and validation sets: the inner loop cycles over train/test splits, and the outer loop cycles over validation splits.
*biased towards X —> overfit to X
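A minimal nested-CV sketch, assuming scikit-learn: the inner loop (GridSearchCV) tunes hyperparameters on train/test splits, while each outer fold plays the role of the validation set. The synthetic data, linear SVM and C grid are illustrative choices, not prescribed by the slides.

# Minimal nested-CV sketch, assuming scikit-learn; all names below are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # inner loop: tune C
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)   # outer loop: report

tuned_svm = GridSearchCV(SVC(kernel='linear'), param_grid={'C': [0.01, 0.1, 1, 10]},
                         cv=inner_cv)

# Each outer test fold is never seen by the inner loop that picks the hyperparameters.
outer_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(outer_scores.mean(), outer_scores.std())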
Terminology

Training set
  Purpose (do's): train the model to learn its core parameters.
  Don'ts (invalid use): don't report the training error as the test error!
  Alternative names: training set (no confusion).

Testing set
  Purpose (do's): optimize hyperparameters.
  Don'ts (invalid use): don't do feature selection or anything supervised on the test set to learn or optimize!
  Alternative names: validation set (or tweaking, tuning, optimization set).

Validation set
  Purpose (do's): evaluate the fully-optimized classifier to report performance.
  Don'ts (invalid use): don't use it in any way to train the classifier or optimize parameters.
  Alternative names: test set (more accurately, reporting set).
K-fold CV
• In each trial (1, 2, …, k), one fold serves as the test set (e.g. the 4th fold) and the remaining folds form the training set.
• Test sets in different trials are indeed mutually disjoint.
• Note: the different folds won't be contiguous.
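A minimal k-fold sketch with scikit-learn; the synthetic dataset and logistic-regression classifier are placeholders for illustration only.

# Minimal k-fold CV sketch, assuming scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=120, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffling => folds are not contiguous
fold_acc = []
for train_idx, test_idx in kf.split(X):                # each fold is the test set exactly once
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_acc.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(fold_acc), fold_acc)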
Repeated Holdout CV
• In each trial (1, 2, …, n), set aside an independent subsample of the whole dataset (e.g. 30%) for testing and train on the rest.
• Note: there could be overlap among the test sets from different trials! Hence a large n is recommended.
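A minimal repeated-holdout sketch using scikit-learn's ShuffleSplit; the 30% test fraction echoes the example above, and everything else (data, classifier) is an illustrative assumption.

# Minimal repeated-holdout sketch, assuming scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=10, random_state=0)

holdout = ShuffleSplit(n_splits=100, test_size=0.30, random_state=0)  # large n, since test sets can overlap
accs = []
for train_idx, test_idx in holdout.split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(np.mean(accs), np.std(accs))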
CV has many variations!
• k-fold, with k = 2, 3, 5, 10, 20
• repeated hold-out (random subsampling)
  • train % = 50, 63.2, 75, 80, 90
• stratified
  • across train/test
  • across classes
  (Figure: Controls vs. MCIc split into Training (CN), Training (MCIc) and the corresponding test sets.)
• inverted: very small training set, large test set
• leave one [unit] out, where unit —> sample / pair / tuple / condition / task / block
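As an illustration of the stratified variant, here is a minimal sketch with scikit-learn's StratifiedShuffleSplit; the binary labels stand in for two imbalanced classes (e.g. controls vs. patients) and the 80/20 split is an arbitrary choice.

# Minimal stratified repeated-holdout sketch, assuming scikit-learn.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.RandomState(0)
y = np.array([0] * 60 + [1] * 20)          # imbalanced classes, e.g. 60 controls vs. 20 patients
X = rng.randn(len(y), 5)

sss = StratifiedShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
for train_idx, test_idx in sss.split(X, y):
    # stratification keeps the 75%/25% class proportions in both training and test sets
    print('train: %.2f patients, test: %.2f patients' % (y[train_idx].mean(), y[test_idx].mean()))
    break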
Measuring bias in CV measurements
• Split the whole dataset into a training set + test set (used for the inner CV) and a held-out validation set.
• The inner CV yields the cross-validation accuracy; the held-out set yields the validation accuracy.
• Ideally the two are approximately equal (≈). If the CV accuracy exceeds the validation accuracy, the CV estimate is positively biased; if it falls below, it is negatively biased.
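A minimal sketch of this bias check, assuming scikit-learn: hold out a validation set once, estimate accuracy by inner CV on the rest, and compare the two numbers. The dataset, classifier and split sizes are illustrative.

# Minimal bias-check sketch, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

# Set aside a validation set that the inner CV never touches.
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

clf = SVC(kernel='linear', C=1.0)
cv_acc = cross_val_score(clf, X_dev, y_dev,
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0)).mean()
val_acc = clf.fit(X_dev, y_dev).score(X_val, y_val)

# cv_acc > val_acc suggests positive (optimistic) bias; cv_acc < val_acc, negative bias.
print('inner-CV accuracy: %.3f   validation accuracy: %.3f' % (cv_acc, val_acc))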
fMRI datasets

Dataset   Intra- or inter-subject?   # samples   # blocks (sessions or subjects)   Tasks
Haxby     Intra                      209         12 sessions                       various
Duncan    Inter                      196         49 subjects                       various
Wager     Inter                      390         34 subjects                       various
Cohen     Inter                      80          24 subjects                       various
Moran     Inter                      138         36 subjects                       various
Henson    Inter                      286         16 subjects                       various
Knops     Inter                      14          19 subjects                       various

Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
Repeated holdout (10 trials, 20% test)
(Figure: classifier accuracy via cross-validation plotted against classifier accuracy on the validation set, with regions marked unbiased, negatively biased, and positively biased.)
CV vs. Validation: real data
(Figure: CV accuracy vs. validation accuracy on real data, spanning negative bias through unbiased to positive bias, with an annotation marking the conservative, negatively biased side.)
Simulations: known ground truth
CV vs. Validation
(Figure: comparison of CV and validation estimates, spanning negative bias, unbiased and positive bias.)
Commensurability across folds
• It's not enough to properly split each fold and accurately evaluate classifier performance!
• Not all measures are commensurate across folds!
  • e.g. decision scores from an SVM: the reference hyperplane, and hence the zero point, differ from fold to fold (the figure shows two folds producing different boundaries L1 and L2 in the x1-x2 feature space).
  • hence they cannot be pooled across folds to construct a single ROC!
• Instead, build the ROC and compute the AUC within each fold (AUC1, AUC2, …, AUCn), then average the AUC across folds (a sketch follows below).
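A minimal sketch of the recommended per-fold procedure, assuming scikit-learn: compute the AUC within each fold from that fold's SVM decision scores and average the AUCs, rather than pooling scores across folds. Data and classifier are illustrative.

# Minimal per-fold AUC sketch, assuming scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
    scores = clf.decision_function(X[test_idx])     # scores are only comparable within this fold
    fold_aucs.append(roc_auc_score(y[test_idx], scores))

print('mean AUC across folds: %.3f' % np.mean(fold_aucs))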
Performance Metrics

Accuracy / Error rate
  Commensurate across folds? Yes.
  Advantages: universally applicable; handles multi-class problems.
  Disadvantages: sensitive to class- and cost-imbalance.

Area under the ROC (AUC)
  Commensurate across folds? Only when the ROC is computed within each fold.
  Advantages: averages over all ratios of misclassification costs.
  Disadvantages: not easily extendable to multi-class problems.

F1 score
  Commensurate across folds? Yes.
  Advantages: standard in information retrieval.
  Disadvantages: does not take true negatives into account.
Subtle Sources of Bias in CV

k-hacking
  Approach: try many k's in k-fold CV (or different training %) and report only the best.
  How to avoid it: pick k=10, repeat it many times (n>200 or as many as possible) and report the full distribution (not box plots).

metric-hacking (m-hacking)
  Approach: try different performance metrics (accuracy, AUC, F1, error rate) and report the best.
  How to avoid it: choose the most appropriate and recognized metric for the problem, e.g. AUC for binary classification.

ROI-hacking (r-hacking)
  Approach: assess many ROIs (or their features, or combinations), but report only the best.
  How to avoid it: adopt a whole-brain, data-driven approach to discover the best ROIs within an inner CV, then report their out-of-sample predictive accuracy.

feature- or dataset-hacking (d-hacking)
  Approach: try subsets of feature[s] or subsamples of dataset[s], but report only the best.
  How to avoid it: use and report on everything: all analyses on all datasets; try inter-dataset CV; run non-parametric statistical comparisons!

*The exact incidence of these hacking approaches ("sexy names I made up") is unknown, but non-zero.
Overfitting
(Figure: examples of an underfit, a good fit, and an overfit model.)
50 shades of overfitting
(Figure: extrapolating population per decade, © MathWorks; "human annihilation?")
Reference: Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The Parable of Google Flu: Traps in Big Data Analysis. Science, 343, 1203-1205.
"Clever forms of overfitting"
(from http://hunch.net/?p=22)
Is overfitting always bad?
• Some credit card fraud detection systems use it successfully.
• Others?
Limitations of CV
• The number of CV repetitions needed increases with:
  • sample size: a larger sample calls for a larger number of repetitions, which is especially costly when model training is computationally expensive.
  • the number of model parameters (exponentially), in order to choose the best combination.
Recommendations
• Ensure the test set is truly independent of the training set!
  • It is easy to commit mistakes in complicated analyses!
• Use repeated holdout (10-50% for testing)
  • respecting the sample/dependency structure
  • ensuring independence between the train & test sets
• Use the biggest test set, and a large number of repetitions, when possible.
  • Not possible with leave-one-sample-out.
Conclusions
• Results could vary considerably with a different CV scheme.
• CV results can have high variance (>10%).
• Document the CV scheme in detail:
  • type of split
  • number of repetitions
  • full distribution of estimates
• Proper splitting is not enough; proper pooling is needed too.
• Bad examples of reporting:
  • just the mean: 𝜇%
  • mean and std. dev.: 𝜇±𝜎%
• Good example:
  • "Using 250 iterations of 10-fold cross-validation, we obtain the following distribution of AUC."
  (a minimal reporting sketch follows below)
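In the spirit of the "good example" above, here is a minimal reporting sketch, assuming scikit-learn; 250 stratified 80/20 holdout splits stand in for repeated 10-fold CV, and the dataset and classifier are placeholders.

# Minimal sketch of reporting the full distribution of CV estimates, assuming scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=25, random_state=0)

cv = StratifiedShuffleSplit(n_splits=250, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring='roc_auc')

# Report the whole distribution, not just the mean (or mean +/- SD).
print('median %.3f, IQR [%.3f, %.3f], range [%.3f, %.3f]' % (
    np.median(scores), *np.percentile(scores, [25, 75]), scores.min(), scores.max()))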
References
• Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage. http://doi.org/10.1016/j.neuroimage.2016.10.038
• Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.
• Forman, G. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter.

neuropredict: github.com/raamana/neuropredict
Acknowledgements
• Gaël Varoquaux

Now, it's time to cross-validate! (xkcd)
Useful tools

scikit-learn
  Target audience: generic ML. Language: Python. Number of ML techniques: many.
  Neuroimaging-oriented? No. Coding required? Yes. Effort needed: high.
  Use case: to try many ML techniques.

neuropredict
  Target audience: neuroimagers. Language: Python. Number of ML techniques: 1 (more soon).
  Neuroimaging-oriented? Yes. Coding required? No. Effort needed: easy.
  Use case: quick evaluation of predictive performance!

nilearn
  Target audience: neuroimagers. Language: Python. Number of ML techniques: few.
  Neuroimaging-oriented? Yes. Coding required? Yes. Effort needed: medium.
  Use case: some image processing is required.

PRoNTo
  Target audience: neuroimagers. Language: Matlab. Number of ML techniques: few.
  Neuroimaging-oriented? Yes. Coding required? Yes. Effort needed: high.
  Use case: integration with Matlab.

PyMVPA
  Target audience: neuroimagers. Language: Python. Number of ML techniques: few.
  Neuroimaging-oriented? Yes. Coding required? Yes. Effort needed: high.
  Use case: integration with Python.

Weka
  Target audience: generic ML. Language: Java. Number of ML techniques: many.
  Neuroimaging-oriented? No. Coding required? Yes. Effort needed: high.
  Use case: GUI to try many techniques.

Shogun
  Target audience: generic ML. Language: C++. Number of ML techniques: many.
  Neuroimaging-oriented? No. Coding required? Yes. Effort needed: high.
  Use case: efficient.
Model selection
Reference: Friedman, J., Hastie, T., & Tibshirani, R. (2008). The Elements of Statistical Learning. Springer Series in Statistics. Springer, Berlin.
Datasets
• 7 fMRI datasets: intra-subject and inter-subject
• OASIS: gender discrimination from VBM maps
• Simulations
Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
1 of 160

Recommended

Cross validation by
Cross validationCross validation
Cross validationRidhaAfrawe
3.2K views30 slides
Linear regression by
Linear regressionLinear regression
Linear regressionMartinHogg9
12.2K views48 slides
Machine Learning-Linear regression by
Machine Learning-Linear regressionMachine Learning-Linear regression
Machine Learning-Linear regressionkishanthkumaar
3.3K views10 slides
K-Folds Cross Validation Method by
K-Folds Cross Validation MethodK-Folds Cross Validation Method
K-Folds Cross Validation MethodSHUBHAM GUPTA
3.9K views21 slides
Machine learning session4(linear regression) by
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
3.3K views13 slides
Classification Based Machine Learning Algorithms by
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
9.9K views41 slides

More Related Content

What's hot

Naive Bayes Classifier in Python | Naive Bayes Algorithm | Machine Learning A... by
Naive Bayes Classifier in Python | Naive Bayes Algorithm | Machine Learning A...Naive Bayes Classifier in Python | Naive Bayes Algorithm | Machine Learning A...
Naive Bayes Classifier in Python | Naive Bayes Algorithm | Machine Learning A...Edureka!
1.6K views36 slides
Decision tree and random forest by
Decision tree and random forestDecision tree and random forest
Decision tree and random forestLippo Group Digital
7K views17 slides
Hyperparameter Tuning by
Hyperparameter TuningHyperparameter Tuning
Hyperparameter TuningJon Lederman
2.9K views87 slides
Understanding random forests by
Understanding random forestsUnderstanding random forests
Understanding random forestsMarc Garcia
1.5K views44 slides
Machine Learning - Accuracy and Confusion Matrix by
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion MatrixAndrew Ferlitsch
3.3K views7 slides
Decision trees in Machine Learning by
Decision trees in Machine Learning Decision trees in Machine Learning
Decision trees in Machine Learning Mohammad Junaid Khan
6.7K views19 slides

What's hot(20)

Naive Bayes Classifier in Python | Naive Bayes Algorithm | Machine Learning A... by Edureka!
Naive Bayes Classifier in Python | Naive Bayes Algorithm | Machine Learning A...Naive Bayes Classifier in Python | Naive Bayes Algorithm | Machine Learning A...
Naive Bayes Classifier in Python | Naive Bayes Algorithm | Machine Learning A...
Edureka!1.6K views
Hyperparameter Tuning by Jon Lederman
Hyperparameter TuningHyperparameter Tuning
Hyperparameter Tuning
Jon Lederman2.9K views
Understanding random forests by Marc Garcia
Understanding random forestsUnderstanding random forests
Understanding random forests
Marc Garcia1.5K views
Machine Learning - Accuracy and Confusion Matrix by Andrew Ferlitsch
Machine Learning - Accuracy and Confusion MatrixMachine Learning - Accuracy and Confusion Matrix
Machine Learning - Accuracy and Confusion Matrix
Andrew Ferlitsch3.3K views
Presentation on supervised learning by Tonmoy Bhagawati
Presentation on supervised learningPresentation on supervised learning
Presentation on supervised learning
Tonmoy Bhagawati24.3K views
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio by Marina Santini
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain RatioLecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Lecture 4 Decision Trees (2): Entropy, Information Gain, Gain Ratio
Marina Santini151.4K views
Logistic Regression | Logistic Regression In Python | Machine Learning Algori... by Simplilearn
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Logistic Regression | Logistic Regression In Python | Machine Learning Algori...
Simplilearn4.2K views
Boosting Algorithms Omar Odibat by omarodibat
Boosting Algorithms Omar Odibat Boosting Algorithms Omar Odibat
Boosting Algorithms Omar Odibat
omarodibat1.3K views
Bias and variance trade off by VARUN KUMAR
Bias and variance trade offBias and variance trade off
Bias and variance trade off
VARUN KUMAR712 views
NAIVE BAYES CLASSIFIER by Knoldus Inc.
NAIVE BAYES CLASSIFIERNAIVE BAYES CLASSIFIER
NAIVE BAYES CLASSIFIER
Knoldus Inc.5.9K views
Dimension Reduction: What? Why? and How? by Kazi Toufiq Wadud
Dimension Reduction: What? Why? and How?Dimension Reduction: What? Why? and How?
Dimension Reduction: What? Why? and How?
Kazi Toufiq Wadud1.6K views
Naive Bayes by CloudxLab
Naive BayesNaive Bayes
Naive Bayes
CloudxLab7.5K views
Classification in data mining by Sulman Ahmed
Classification in data mining Classification in data mining
Classification in data mining
Sulman Ahmed8.4K views
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science... by Edureka!
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Edureka!3.3K views

Similar to Cross-validation Tutorial: What, how and which?

crossvalidation.pptx by
crossvalidation.pptxcrossvalidation.pptx
crossvalidation.pptxPriyadharshiniG41
21 views18 slides
How predictive models help Medicinal Chemists design better drugs_webinar by
How predictive models help Medicinal Chemists design better drugs_webinarHow predictive models help Medicinal Chemists design better drugs_webinar
How predictive models help Medicinal Chemists design better drugs_webinarAnn-Marie Roche
753 views40 slides
How to Conduct and Interpret Tests of Differences by
How to Conduct and Interpret Tests of DifferencesHow to Conduct and Interpret Tests of Differences
How to Conduct and Interpret Tests of DifferencesStatistics Solutions
97 views22 slides
05-00-ACA-Data-Intro.pdf by
05-00-ACA-Data-Intro.pdf05-00-ACA-Data-Intro.pdf
05-00-ACA-Data-Intro.pdfAlexanderLerch4
4 views31 slides
Meetup_Consumer_Credit_Default_Vers_2_All by
Meetup_Consumer_Credit_Default_Vers_2_AllMeetup_Consumer_Credit_Default_Vers_2_All
Meetup_Consumer_Credit_Default_Vers_2_AllBernard Ong
419 views21 slides

Similar to Cross-validation Tutorial: What, how and which?(20)

How predictive models help Medicinal Chemists design better drugs_webinar by Ann-Marie Roche
How predictive models help Medicinal Chemists design better drugs_webinarHow predictive models help Medicinal Chemists design better drugs_webinar
How predictive models help Medicinal Chemists design better drugs_webinar
Ann-Marie Roche753 views
Meetup_Consumer_Credit_Default_Vers_2_All by Bernard Ong
Meetup_Consumer_Credit_Default_Vers_2_AllMeetup_Consumer_Credit_Default_Vers_2_All
Meetup_Consumer_Credit_Default_Vers_2_All
Bernard Ong419 views
Barga Data Science lecture 9 by Roger Barga
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9
Roger Barga202 views
LETS PUBLISH WITH MORE RELIABLE & PRESENTABLE MODELLING.pptx by shamsul2010
LETS PUBLISH WITH MORE RELIABLE & PRESENTABLE MODELLING.pptxLETS PUBLISH WITH MORE RELIABLE & PRESENTABLE MODELLING.pptx
LETS PUBLISH WITH MORE RELIABLE & PRESENTABLE MODELLING.pptx
shamsul20103 views
Progression by Regression: How to increase your A/B Test Velocity by Stitch Fix Algorithms
Progression by Regression: How to increase your A/B Test VelocityProgression by Regression: How to increase your A/B Test Velocity
Progression by Regression: How to increase your A/B Test Velocity
shubhampresentation-180430060134.pptx by ABINASHPADHY6
shubhampresentation-180430060134.pptxshubhampresentation-180430060134.pptx
shubhampresentation-180430060134.pptx
ABINASHPADHY65 views
Stance classification. Presentation at QMUL 11 Nov 2020 by Weverify
Stance classification. Presentation at QMUL 11 Nov 2020Stance classification. Presentation at QMUL 11 Nov 2020
Stance classification. Presentation at QMUL 11 Nov 2020
Weverify22 views
Stance classification. Uni Cambridge 22 Jan 2021 by Weverify
Stance classification. Uni Cambridge 22 Jan 2021Stance classification. Uni Cambridge 22 Jan 2021
Stance classification. Uni Cambridge 22 Jan 2021
Weverify113 views
Mini datathon - Bengaluru by Kunal Jain
Mini datathon - BengaluruMini datathon - Bengaluru
Mini datathon - Bengaluru
Kunal Jain435 views
IME 672 - Classifier Evaluation I.pptx by Temp762476
IME 672 - Classifier Evaluation I.pptxIME 672 - Classifier Evaluation I.pptx
IME 672 - Classifier Evaluation I.pptx
Temp7624767 views
Workshop on SPSS: Basic to Intermediate Level by Hiram Ting
Workshop on SPSS: Basic to Intermediate LevelWorkshop on SPSS: Basic to Intermediate Level
Workshop on SPSS: Basic to Intermediate Level
Hiram Ting2.6K views
flankr: EPS presentation by JimGrange
flankr: EPS presentationflankr: EPS presentation
flankr: EPS presentation
JimGrange1.8K views

Recently uploaded

shivam tiwari.pptx by
shivam tiwari.pptxshivam tiwari.pptx
shivam tiwari.pptxAanyaMishra4
5 views14 slides
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx by
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptxDataScienceConferenc1
5 views12 slides
[DSC Europe 23] Ivana Sesic - Use of AI in Public Health.pptx by
[DSC Europe 23] Ivana Sesic - Use of AI in Public Health.pptx[DSC Europe 23] Ivana Sesic - Use of AI in Public Health.pptx
[DSC Europe 23] Ivana Sesic - Use of AI in Public Health.pptxDataScienceConferenc1
5 views15 slides
Advanced_Recommendation_Systems_Presentation.pptx by
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptxneeharikasingh29
5 views9 slides
Data Journeys Hard Talk workshop final.pptx by
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptxinfo828217
10 views18 slides
[DSC Europe 23] Aleksandar Tomcic - Adversarial Attacks by
[DSC Europe 23] Aleksandar Tomcic - Adversarial Attacks[DSC Europe 23] Aleksandar Tomcic - Adversarial Attacks
[DSC Europe 23] Aleksandar Tomcic - Adversarial AttacksDataScienceConferenc1
5 views20 slides

Recently uploaded(20)

[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx by DataScienceConferenc1
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
Advanced_Recommendation_Systems_Presentation.pptx by neeharikasingh29
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptx
Data Journeys Hard Talk workshop final.pptx by info828217
Data Journeys Hard Talk workshop final.pptxData Journeys Hard Talk workshop final.pptx
Data Journeys Hard Talk workshop final.pptx
info82821710 views
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M... by DataScienceConferenc1
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
[DSC Europe 23] Milos Grubjesic Empowering Business with Pepsico s Advanced M...
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20047 views
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P... by DataScienceConferenc1
[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P...
SUPER STORE SQL PROJECT.pptx by khan888620
SUPER STORE SQL PROJECT.pptxSUPER STORE SQL PROJECT.pptx
SUPER STORE SQL PROJECT.pptx
khan88862013 views
LIVE OAK MEMORIAL PARK.pptx by ms2332always
LIVE OAK MEMORIAL PARK.pptxLIVE OAK MEMORIAL PARK.pptx
LIVE OAK MEMORIAL PARK.pptx
ms2332always7 views
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init... by DataScienceConferenc1
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
[DSC Europe 23][Cryptica] Martin_Summer_Digital_central_bank_money_Ideas_init...
Survey on Factuality in LLM's.pptx by NeethaSherra1
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptx
NeethaSherra17 views
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ... by DataScienceConferenc1
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
[DSC Europe 23] Danijela Horak - The Innovator’s Dilemma: to Build or Not to ...
3196 The Case of The East River by ErickANDRADE90
3196 The Case of The East River3196 The Case of The East River
3196 The Case of The East River
ErickANDRADE9017 views
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ... by DataScienceConferenc1
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
[DSC Europe 23] Predrag Ilic & Simeon Rilling - From Data Lakes to Data Mesh ...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... by StatsCommunications
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
UNEP FI CRS Climate Risk Results.pptx by pekka28
UNEP FI CRS Climate Risk Results.pptxUNEP FI CRS Climate Risk Results.pptx
UNEP FI CRS Climate Risk Results.pptx
pekka2811 views

Cross-validation Tutorial: What, how and which?

  • 1. Cross-validation: what, how and which? Pradeep Reddy Raamana raamana.com “Statistics [from cross-validation] are like bikinis. what they reveal is suggestive, but what they conceal is vital.”
  • 3. P. Raamana Goals for Today • What is cross-validation? Training set Test set 2
  • 4. P. Raamana Goals for Today • What is cross-validation? • How to perform it? Training set Test set ≈ℵ≈ 2
  • 5. P. Raamana Goals for Today • What is cross-validation? • How to perform it? • What are the effects of different CV choices? Training set Test set ≈ℵ≈ 2
  • 6. P. Raamana Goals for Today • What is cross-validation? • How to perform it? • What are the effects of different CV choices? Training set Test set ≈ℵ≈ negative bias unbiased positive bias 2
  • 7. P. Raamana What is generalizability? available data (sample*) 3*has a statistical definition
  • 8. P. Raamana What is generalizability? available data (sample*) 3*has a statistical definition
  • 9. P. Raamana What is generalizability? available data (sample*) desired: accuracy on 
 unseen data (population*) 3*has a statistical definition
  • 10. P. Raamana What is generalizability? available data (sample*) desired: accuracy on 
 unseen data (population*) 3*has a statistical definition
  • 11. P. Raamana What is generalizability? available data (sample*) desired: accuracy on 
 unseen data (population*) out-of-sample predictions 3*has a statistical definition
  • 12. P. Raamana What is generalizability? available data (sample*) desired: accuracy on 
 unseen data (population*) out-of-sample predictions 3 avoid 
 overfitting *has a statistical definition
  • 13. P. Raamana CV helps quantify generalizability 4
  • 14. P. Raamana CV helps quantify generalizability 4
  • 15. P. Raamana CV helps quantify generalizability 4
  • 16. P. Raamana CV helps quantify generalizability 4
  • 18. P. Raamana Why cross-validate? Training set Test set bigger training set better learning 5
  • 19. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set 5
  • 20. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. 5
  • 21. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. And the dataset or sample size is fixed. 5
  • 22. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. And the dataset or sample size is fixed. They grow at the expense of each other! 5
  • 23. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. And the dataset or sample size is fixed. They grow at the expense of each other! 5
  • 24. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. And the dataset or sample size is fixed. They grow at the expense of each other! cross-validate to maximize both 5
  • 26. P. Raamana Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” 6
  • 27. P. Raamana Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” • Use cases: 6
  • 28. P. Raamana accuracydistribution
 fromrepetitionofCV(%) Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” • Use cases: • to estimate generalizability 
 (test accuracy) 6
  • 29. P. Raamana accuracydistribution
 fromrepetitionofCV(%) Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” • Use cases: • to estimate generalizability 
 (test accuracy) • to pick optimal parameters 
 (model selection) 6
  • 30. P. Raamana accuracydistribution
 fromrepetitionofCV(%) Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” • Use cases: • to estimate generalizability 
 (test accuracy) • to pick optimal parameters 
 (model selection) • to compare performance 
 (model comparison). 6
  • 32. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test 7
  • 33. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. 7
  • 34. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) samples (rows) 7
  • 35. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) samples (rows) 7 healt hy dise ase
  • 36. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) • over time (for task prediction in fMRI) time (columns) samples (rows) 7 healt hy dise ase
  • 37. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) • over time (for task prediction in fMRI) time (columns) samples (rows) 7 healt hy dise ase
  • 38. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) • over time (for task prediction in fMRI) 2. How often you repeat randomized splits? •to expose classifier to full variability •As many as times as you can e.g. 100 ≈ℵ≈ time (columns) samples (rows) 7 healt hy dise ase
  • 39. P. Raamana Validation set Training set 8*biased towards X —> overfit to X
  • 40. P. Raamana Validation set goodness of fit of the model Training set 8*biased towards X —> overfit to X
  • 41. P. Raamana Validation set goodness of fit of the model biased* towards the training set Training set 8*biased towards X —> overfit to X
  • 42. P. Raamana Validation set goodness of fit of the model biased* towards the training set Training set Test set 8*biased towards X —> overfit to X
  • 43. P. Raamana Validation set goodness of fit of the model biased* towards the training set Training set Test set ≈ℵ≈ 8*biased towards X —> overfit to X
  • 44. P. Raamana Validation set optimize parameters goodness of fit of the model biased* towards the training set Training set Test set ≈ℵ≈ 8*biased towards X —> overfit to X
  • 45. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set Training set Test set ≈ℵ≈ 8*biased towards X —> overfit to X
  • 46. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set Training set Test set Validation set ≈ℵ≈ 8*biased towards X —> overfit to X
  • 47. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set evaluate generalization independent of training or test sets Training set Test set Validation set ≈ℵ≈ 8*biased towards X —> overfit to X
  • 48. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set evaluate generalization independent of training or test sets Whole dataset Training set Test set Validation set ≈ℵ≈ 8*biased towards X —> overfit to X
  • 49. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set evaluate generalization independent of training or test sets Whole dataset Training set Test set Validation set ≈ℵ≈ inner-loop 8*biased towards X —> overfit to X
  • 50. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set evaluate generalization independent of training or test sets Whole dataset Training set Test set Validation set ≈ℵ≈ inner-loop outer-loop 8*biased towards X —> overfit to X
  • 53. P. Raamana Terminology 9 Data split Training Testing Validation Purpose (Do’s) Train model to learn its core parameters Optimize
 hyperparameters Evaluate fully-optimized classifier to report performance
  • 54. P. Raamana Terminology 9 Data split Training Testing Validation Purpose (Do’s) Train model to learn its core parameters Optimize
 hyperparameters Evaluate fully-optimized classifier to report performance Don’ts (Invalid use) Don’t report training error as the test error! Don’t do feature selection or anything supervised on test set to learn or optimize! Don’t use it in any way to train classifier or optimize parameters
  • 55. P. Raamana Terminology 9 Data split Training Testing Validation Purpose (Do’s) Train model to learn its core parameters Optimize
 hyperparameters Evaluate fully-optimized classifier to report performance Don’ts (Invalid use) Don’t report training error as the test error! Don’t do feature selection or anything supervised on test set to learn or optimize! Don’t use it in any way to train classifier or optimize parameters Alternative names Training 
 (no confusion) Validation 
 (or tweaking, tuning, optimization set) Test set (more accurately reporting set)
  • 58. P. Raamana K-fold CV Train Test, 4th fold trial 1 2 … k 10
  • 59. P. Raamana K-fold CV Train Test, 4th fold trial 1 2 … k 10
  • 60. P. Raamana K-fold CV Train Test, 4th fold trial 1 2 … k 10
  • 61. P. Raamana K-fold CV Train Test, 4th fold trial 1 2 … k 10
  • 62. P. Raamana K-fold CV Test sets in different trials are indeed mutually disjoint Train Test, 4th fold trial 1 2 … k 10
  • 63. P. Raamana K-fold CV Test sets in different trials are indeed mutually disjoint Train Test, 4th fold trial 1 2 … k Note: different folds won’t be contiguous. 10
  • 64. P. Raamana K-fold CV Test sets in different trials are indeed mutually disjoint Train Test, 4th fold trial 1 2 … k Note: different folds won’t be contiguous. 10
  • 65. P. Raamana Repeated Holdout CV Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  • 66. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  • 67. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  • 68. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  • 69. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  • 70. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Note: there could be overlap among the test sets 
 from different trials! Hence large n is recommended. Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  • 71–85. P. Raamana CV has many variations! (a sketch mapping these onto scikit-learn splitters follows below)
  • k-fold, with k = 2, 3, 5, 10, 20
  • repeated hold-out (random subsampling), with train % = 50, 63.2, 75, 80, 90
  • stratified: across train/test splits and across classes (e.g. Controls vs. MCIc, with training and test sets drawn from each class)
  • inverted: very small training set, large test set
  • leave one [unit] out, where the unit can be a sample / pair / tuple / condition / task / block
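As a rough mapping, most of these variations correspond to standard scikit-learn splitters; the sketch below lists the obvious candidates (the group labels for leave-one-[unit]-out would be whatever subject/session/block identifiers the data provide).

```python
# Each splitter can be passed as cv=... to cross_val_score / cross_validate.
from sklearn.model_selection import (KFold, StratifiedKFold, StratifiedShuffleSplit,
                                     LeaveOneOut, LeaveOneGroupOut)

k_fold      = KFold(n_splits=10, shuffle=True, random_state=0)              # k-fold
stratified  = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)    # stratified across classes
rep_holdout = StratifiedShuffleSplit(n_splits=100, train_size=0.80)         # repeated hold-out, train % = 80
loo         = LeaveOneOut()                                                 # leave one sample out
logo        = LeaveOneGroupOut()  # leave one [unit] out: pass groups=subject/session/block labels
```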
  • 86–92. P. Raamana Measuring bias in CV measurements: split the whole dataset into a training set and a set-aside validation set. Run inner CV on the training set to obtain the cross-validation accuracy, then compare it (≈) against the validation accuracy on the validation set. Agreement indicates an unbiased CV estimate; disagreement indicates positive or negative bias.
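A minimal sketch of this comparison, assuming scikit-learn and synthetic data; the split sizes and the classifier are illustrative assumptions.

```python
# Compare accuracy estimated by inner CV against accuracy on a locked-away validation set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=100, random_state=0)

# Hold out a validation set that the inner CV never sees.
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000)
cv_acc = cross_val_score(clf, X_dev, y_dev, cv=10).mean()   # cross-validation accuracy
val_acc = clf.fit(X_dev, y_dev).score(X_val, y_val)         # validation accuracy

# cv_acc > val_acc suggests positive (optimistic) bias; cv_acc < val_acc, negative bias.
print(f"inner-CV accuracy: {cv_acc:.2f}, validation accuracy: {val_acc:.2f}")
```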
  • 93. P. Raamana fMRI datasets

  Dataset | Intra- or inter-subject? | # samples | # blocks (sessions or subjects) | Tasks
  Haxby   | Intra | 209 | 12 sessions | various
  Duncan  | Inter | 196 | 49 subjects | various
  Wager   | Inter | 390 | 34 subjects | various
  Cohen   | Inter |  80 | 24 subjects | various
  Moran   | Inter | 138 | 36 subjects | various
  Henson  | Inter | 286 | 16 subjects | various
  Knops   | Inter |  14 | 19 subjects | various

  Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
  • 94–103. P. Raamana Repeated holdout (10 trials, 20% test). [Figure: classifier accuracy via cross-validation plotted against classifier accuracy on the validation set; points near the identity line are unbiased, with regions of negative and positive bias on either side.]
  • 104–108. P. Raamana CV vs. Validation: real data. [Figure: comparison of CV and validation accuracy on real data, annotated with negative bias, unbiased, and positive bias; the conservative region is marked.]
  • 112–115. P. Raamana CV vs. Validation. [Figure: comparison of CV and validation accuracy, annotated with negative bias, unbiased, and positive bias.]
  • 117–124. P. Raamana Commensurability across folds
  • It is not enough to properly split each fold and accurately evaluate classifier performance within it!
  • Not all measures are commensurate across folds.
  • e.g. decision scores from an SVM: the reference plane and its zero differ from fold to fold,
  • hence they cannot be pooled across folds to construct a single ROC.
  • Instead, build an ROC and compute the AUC within each fold (AUC1, AUC2, AUC3, …, AUCn), then average the AUCs across folds (see the sketch below).
  [Figure: train/test splits yielding AUC1 … AUCn, and a two-class scatter (x1, x2) with fold-specific decision boundaries L1 and L2.]
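A minimal sketch of the recommended per-fold pooling, assuming scikit-learn; the dataset and the linear SVM are illustrative placeholders.

```python
# SVM decision scores are not commensurate across folds, so compute ROC/AUC
# within each fold and average the AUCs, rather than pooling raw scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    scores = clf.decision_function(X[test_idx])
    fold_aucs.append(roc_auc_score(y[test_idx], scores))   # ROC/AUC within this fold only

print(f"AUC: {np.mean(fold_aucs):.2f} ± {np.std(fold_aucs):.2f} across folds")
```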
  • 125. P. Raamana Performance Metrics

  Metric                | Commensurate across folds?            | Advantages                                           | Disadvantages
  Accuracy / Error rate | Yes                                   | Universally applicable; multi-class                  | Sensitive to class- and cost-imbalance
  Area under ROC (AUC)  | Only when ROC is computed within fold | Averages over all ratios of misclassification costs  | Not easily extendable to multi-class problems
  F1 score              | Yes                                   | Information retrieval                                | Does not take true negatives into account
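For fold-wise estimates of several of these metrics at once, scikit-learn's cross_validate is one option; AUC is then computed within each fold, keeping it commensurate. A minimal sketch with placeholder data:

```python
# Estimate accuracy, AUC and F1 fold-wise in one pass.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

scores = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=10,
                        scoring=("accuracy", "roc_auc", "f1"))
for metric in ("test_accuracy", "test_roc_auc", "test_f1"):
    print(metric, scores[metric].mean().round(2))
```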
  • 126. P. Raamana Subtle Sources of Bias in CV

  Type*                       | Approach                                                                                     | Sexy name I made up | How to avoid it?
  k-hacking                   | Try many k's in k-fold CV (or different training %) and report only the best                | k-hacking           | Pick k = 10, repeat it many times (n > 200, or as many as possible) and report the full distribution (not box plots)
  metric-hacking              | Try different performance metrics (accuracy, AUC, F1, error rate) and report only the best  | m-hacking           | Choose the most appropriate and recognized metric for the problem, e.g. AUC for binary classification
  ROI-hacking                 | Assess many ROIs (or their features, or combinations), but report only the best             | r-hacking           | Adopt a whole-brain, data-driven approach to discover the best ROIs within an inner CV, then report their out-of-sample predictive accuracy
  feature- or dataset-hacking | Try subsets of feature[s] or subsamples of dataset[s], but report only the best             | d-hacking           | Use and report on everything: all analyses on all datasets, try inter-dataset CV, run non-parametric statistical comparisons!

  *Exact incidence of these hacking approaches is unknown, but non-zero.
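A minimal sketch of the suggested remedy for k-hacking (fixed k = 10, many repetitions, full distribution reported), assuming scikit-learn; the data and model are placeholders.

```python
# Fix k = 10, repeat many times (>200 splits in total) and report the distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=25, random_state=0)  # 250 splits
acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Report the full distribution (e.g. percentiles), not only a mean or a box plot.
print(np.percentile(acc, [5, 25, 50, 75, 95]).round(2))
```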
  • 132. P. Raamana 50 shades of overfitting. [Figure: population vs. decade (© MathWorks), where an overfit extrapolation predicts human annihilation?]
  • 133–134. P. Raamana 50 shades of overfitting. Reference: Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The Parable of Google Flu: Traps in Big Data Analysis. Science, 14 March, 343: 1203–1205.
  • 135. P. Raamana “Clever forms of overfitting”, from http://hunch.net/?p=22
  • 136. P. Raamana Is overfitting always bad? Some credit card fraud detection systems use it successfully. Others?
  • 139–144. P. Raamana Limitations of CV
  • The number of CV repetitions increases with:
  • sample size: a large sample → a large number of repetitions, especially if model training is computationally expensive;
  • the number of model parameters, exponentially, in order to choose the best combination (see the back-of-the-envelope sketch below).
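A back-of-the-envelope sketch of why this matters; the hyper-parameter grid below is a hypothetical example, not from the talk.

```python
# Total model fits = (# hyper-parameter combinations) x (# folds) x (# repetitions).
values_per_parameter = [10, 10, 5]      # e.g. three hyper-parameters on a grid
n_combinations = 1
for v in values_per_parameter:
    n_combinations *= v                 # grows exponentially with the number of parameters

n_folds, n_repeats = 10, 25
print("model fits needed:", n_combinations * n_folds * n_repeats)  # 500 * 10 * 25 = 125,000
```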
  • 146–148. P. Raamana Recommendations
  • Ensure the test set is truly independent of the training set! It is easy to commit mistakes in complicated analyses.
  • Use repeated holdout (10–50% for testing), respecting the sample/dependency structure and ensuring independence between the train and test sets (a group-aware sketch follows below).
  • Use the biggest test set and a large number of repetitions when possible; this is not possible with leave-one-sample-out.
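A minimal sketch of repeated holdout that respects a subject-level dependency structure, assuming scikit-learn's GroupShuffleSplit; the subject labels and data are synthetic placeholders.

```python
# All samples from the same subject (group) land entirely in train or entirely in test.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit

X, y = make_classification(n_samples=200, n_features=30, random_state=0)
subjects = np.repeat(np.arange(20), 10)     # 20 hypothetical subjects, 10 samples each

gss = GroupShuffleSplit(n_splits=50, test_size=0.20, random_state=0)
accs = []
for train_idx, test_idx in gss.split(X, y, groups=subjects):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))

print(f"accuracy over {len(accs)} repetitions: median {np.median(accs):.2f}")
```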
  • 150–154. P. Raamana Conclusions
  • Results could vary considerably with a different CV scheme.
  • CV results can have substantial variance (>10%).
  • Document the CV scheme in detail: type of split, number of repetitions, and the full distribution of estimates.
  • Proper splitting is not enough; proper pooling is needed too.
  • Bad examples of reporting: just the mean 𝜇%, or the mean and std. dev. 𝜇±𝜎%.
  • Good example: “Using 250 iterations of 10-fold cross-validation, we obtain the following distribution of AUC.”
  • 155. P. Raamana References
  • Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage. http://doi.org/10.1016/j.neuroimage.2016.10.038
  • Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.
  • Forman, G. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter.
  neuropredict: github.com/raamana/neuropredict
  • 157. Now, it’s time to cross-validate! xkcd
  • 158. Useful tools

  Software/toolbox | Target audience | Language | # of ML techniques | Neuroimaging-oriented? | Coding required? | Effort needed | Use case
  scikit-learn     | Generic ML      | Python   | Many               | No  | Yes | High   | To try many ML techniques
  neuropredict     | Neuroimagers    | Python   | 1 (more soon)      | Yes | No  | Easy   | Quick evaluation of predictive performance!
  nilearn          | Neuroimagers    | Python   | Few                | Yes | Yes | Medium | Some image processing is required
  PRoNTo           | Neuroimagers    | Matlab   | Few                | Yes | Yes | High   | Integration with Matlab
  PyMVPA           | Neuroimagers    | Python   | Few                | Yes | Yes | High   | Integration with Python
  Weka             | Generic ML      | Java     | Many               | No  | Yes | High   | GUI to try many techniques
  Shogun           | Generic ML      | C++      | Many               | No  | Yes | High   | Efficient
  • 159. P. Raamana Model selection. Friedman, J., Hastie, T., & Tibshirani, R. (2008). The Elements of Statistical Learning. Springer series in statistics. Springer, Berlin.
  • 160. P. Raamana Datasets
  • 7 fMRI datasets: intra-subject and inter-subject
  • OASIS: gender discrimination from VBM maps
  • Simulations
  Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.