Cross-validation: what, how and which?
Pradeep Reddy Raamana
raamana.com

“Statistics [from cross-validation] are like bikinis.
What they reveal is suggestive, but what they conceal is vital.”
Goals for Today
• What is cross-validation?
• How to perform it?
• What are the effects of different CV choices?
[Figure: dataset split into a training set and a test set; CV estimates may show negative bias, be unbiased, or show positive bias]
What is generalizability?
• Available: data in hand (a sample*). Desired: accuracy on unseen data (the population*).
• Generalizability is estimated via out-of-sample predictions, which also help avoid overfitting.
*sample and population have statistical definitions
Why cross-validate?
• A bigger training set means better learning; a bigger test set means better testing.
• Key: train & test sets must be disjoint.
• And the dataset (sample size) is fixed.
• So they grow at the expense of each other!
• Cross-validate to maximize both.
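A minimal sketch of this trade-off, assuming scikit-learn and its bundled breast-cancer toy dataset (the data and classifier are illustrative choices, not from the slides): a single split sacrifices data on one side, whereas k-fold CV rotates the roles so every sample is tested once while still training on k-1 folds each time.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # placeholder dataset
clf = LogisticRegression(max_iter=5000)

# Option 1: one fixed split -- whatever goes into the test set is lost for training.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
single_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# Option 2: 5-fold CV -- train on ~80% and test on ~20%, rotated over all folds.
cv_accs = cross_val_score(clf, X, y, cv=5)
print(f"single split: {single_acc:.3f}, 5-fold CV: {cv_accs.mean():.3f} ± {cv_accs.std():.3f}")
```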
Use cases
• “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used.”
• Use cases:
  • to estimate generalizability (test accuracy)
  • to pick optimal parameters (model selection)
  • to compare performance (model comparison)
[Figure: accuracy distribution from repetition of CV (%)]
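A hedged sketch of the three use cases with scikit-learn; the dataset, classifiers and parameter grid below are placeholder choices for illustration, not taken from the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1) estimate generalizability (test accuracy)
acc_svm = cross_val_score(SVC(), X, y, cv=10)

# 2) pick optimal parameters (model selection)
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=10).fit(X, y)

# 3) compare performance (model comparison)
acc_tree = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

print(acc_svm.mean(), grid.best_params_, acc_tree.mean())
```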
Key Aspects of CV
1. How you split the dataset into train/test
  • Maximal independence between training and test sets is desired.
  • This split could be
    • over samples (e.g. individual diagnosis)
    • over time (for task prediction in fMRI)
2. How often you repeat randomized splits
  • to expose the classifier to the full variability
  • as many times as you can, e.g. 100
[Figure: data matrix with samples as rows (healthy vs. disease) and time as columns]
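A small illustrative sketch of both kinds of split, assuming scikit-learn; the random data, labels and session assignments are made up for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, GroupKFold

X = np.random.rand(120, 10)
y = np.repeat([0, 1], 60)                # e.g. healthy vs. disease
sessions = np.tile(np.arange(12), 10)    # e.g. fMRI session/block of each sample

# split over samples, repeated 100 times to expose the classifier to variability
splitter = StratifiedShuffleSplit(n_splits=100, test_size=0.25, random_state=0)
for train_idx, test_idx in splitter.split(X, y):
    pass  # train and evaluate here

# split over time/blocks: whole sessions stay together, keeping train and test
# sets maximally independent
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=sessions):
    pass  # no session appears in both the training and the test set
```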
Validation set
Whole dataset = Training set + Test set + Validation set
• Training set: measures goodness of fit of the model; biased* towards the training set.
• Test set: used to optimize parameters; biased towards the test set.
• Validation set: used to evaluate generalization; independent of the training and test sets.
• The training and test sets form the inner loop; the validation set is only used in the outer loop (a nested CV sketch follows below).
*biased towards X —> overfit to X
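One way to realize this inner/outer structure is nested CV in scikit-learn, sketched below; the estimator and parameter grid are placeholders and the fold counts are illustrative, not prescribed by the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# inner loop: training vs. test folds, used only to optimize hyperparameters
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                     cv=KFold(n_splits=5, shuffle=True, random_state=0))

# outer loop: validation folds that the inner optimization never touches
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=1))
print(outer_scores.mean())
```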
Terminology

Data split: Training
  Purpose (Do’s): train the model to learn its core parameters
  Don’ts (invalid use): don’t report the training error as the test error!
  Alternative names: training set (no confusion)

Data split: Testing
  Purpose (Do’s): optimize hyperparameters
  Don’ts (invalid use): don’t do feature selection or anything supervised on the test set to learn or optimize!
  Alternative names: validation set (or tweaking, tuning, optimization set)

Data split: Validation
  Purpose (Do’s): evaluate the fully-optimized classifier to report performance
  Don’ts (invalid use): don’t use it in any way to train the classifier or optimize parameters
  Alternative names: test set (more accurately, reporting set)
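A minimal sketch, assuming scikit-learn, of carving out the three splits named above with two calls to train_test_split; the 20%/25% proportions are illustrative choices, not from the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# hold out the validation (reporting) set first and never touch it while tuning
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# split the remainder into training (learn core parameters)
# and testing (optimize hyperparameters)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
```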
K-fold CV
• The dataset is split into k folds; in each trial (1, 2, …, k), one fold (e.g. the 4th) serves as the test set and the remaining folds form the training set.
• Test sets in different trials are indeed mutually disjoint.
• Note: different folds won’t be contiguous.
[Figure: k trials of k-fold CV, with the 4th fold highlighted as the test set]
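A quick sketch with scikit-learn's KFold (the tiny 20-sample array is only for display): with shuffle=True the test folds are mutually disjoint yet their indices are not contiguous.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)
for trial, (train_idx, test_idx) in enumerate(
        KFold(n_splits=5, shuffle=True, random_state=0).split(X), start=1):
    print(f"trial {trial}: test indices {test_idx}")
# the 5 test index sets are mutually disjoint and together cover all 20 samples
```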
Repeated Holdout CV
• Set aside an independent subsample of the whole dataset (e.g. 30%) for testing, in each of n trials (1, 2, …, n).
• Note: there could be overlap among the test sets from different trials! Hence a large n is recommended.
[Figure: n random train/test splits of the whole dataset]
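A sketch of repeated hold-out via scikit-learn's ShuffleSplit (the 100-sample array and n=50 trials are arbitrary): 30% is set aside for testing in each trial, and test sets from different trials may overlap.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)
splitter = ShuffleSplit(n_splits=50, test_size=0.30, random_state=0)

test_sets = [set(test_idx) for _, test_idx in splitter.split(X)]
overlap = test_sets[0] & test_sets[1]
print(f"samples shared by the first two test sets: {len(overlap)}")
```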
CV has many variations!
• k-fold, k = 2, 3, 5, 10, 20
• repeated hold-out (random subsampling)
• train % = 50, 63.2, 75, 80, 90
• stratified
  • across train/test
  • across classes
• inverted: very small training set, large test set
• leave one [unit] out:
  • unit —> sample / pair / tuple / condition / task / block
[Figure: stratified split of Controls (CN) and MCIc subjects into training and test sets]
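An illustrative sketch of two of these variations with scikit-learn; the random data, the 2:1 class imbalance (standing in for CN vs. MCIc) and the block labels are all made up for the example.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold

X = np.random.rand(90, 5)
y = np.array([0] * 60 + [1] * 30)       # imbalanced classes, e.g. CN vs. MCIc
blocks = np.repeat(np.arange(9), 10)    # e.g. task blocks or subjects

# stratified k-fold: every fold preserves the 2:1 class ratio
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    pass  # train and evaluate here

# leave one [unit] out, where the unit is a block/subject
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=blocks):
    pass  # each trial leaves one whole block out for testing
```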
Measuring bias in CV measurements
• Split the whole dataset into a training set and test set (used for the inner CV) plus a held-out validation set.
• The inner CV yields the cross-validation accuracy; the held-out validation set yields the validation accuracy.
• Check whether cross-validation accuracy ≈ validation accuracy: the discrepancy reveals positive bias, no bias, or negative bias.
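A sketch of that bias check, assuming scikit-learn; the dataset, classifier and 70/30 split are illustrative, not from the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC()
cv_acc = cross_val_score(clf, X_dev, y_dev, cv=10).mean()    # cross-validation accuracy
val_acc = clf.fit(X_dev, y_dev).score(X_val, y_val)          # validation accuracy

print(f"CV: {cv_acc:.3f}  validation: {val_acc:.3f}  difference: {cv_acc - val_acc:+.3f}")
# a positive difference suggests the CV estimate is positively biased, a negative one the reverse
```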
fMRI datasets

Dataset | Intra- or inter-subject? | # samples | # blocks (sessions or subjects) | Tasks
Haxby | Intra | 209 | 12 sessions | various
Duncan | Inter | 196 | 49 subjects | various
Wager | Inter | 390 | 34 subjects | various
Cohen | Inter | 80 | 24 subjects | various
Moran | Inter | 138 | 36 subjects | various
Henson | Inter | 286 | 16 subjects | various
Knops | Inter | 14 | 19 subjects | various

Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
Commensurability across folds
• It’s not enough to properly split each fold and accurately evaluate classifier performance!
• Not all measures across folds are commensurate!
  • e.g. decision scores from an SVM (the reference plane and zero are different across folds!)
  • hence they cannot be pooled across folds to construct an ROC!
  • Instead, make an ROC per fold, compute AUC per fold, and then average AUC across folds (sketched below)!
[Figure: per-fold SVM decision boundaries (L1, L2) in feature space (x1, x2), yielding AUC1, AUC2, AUC3, …, AUCn]
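A sketch of the per-fold approach with scikit-learn (dataset and classifier are placeholder choices): the ROC/AUC is computed within each fold, where the decision scores are comparable, and only the AUC values are averaged.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    clf = SVC().fit(X[train_idx], y[train_idx])
    scores = clf.decision_function(X[test_idx])   # only comparable within this fold
    fold_aucs.append(roc_auc_score(y[test_idx], scores))

print(f"AUC: {np.mean(fold_aucs):.3f} ± {np.std(fold_aucs):.3f} over {len(fold_aucs)} folds")
```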
Performance Metrics

Metric: Accuracy / Error rate
  Commensurate across folds? Yes
  Advantages: universally applicable; handles multi-class problems
  Disadvantages: sensitive to class- and cost-imbalance

Metric: Area under ROC (AUC)
  Commensurate across folds? Only when the ROC is computed within each fold
  Advantages: averages over all ratios of misclassification costs
  Disadvantages: not easily extendable to multi-class problems

Metric: F1 score
  Commensurate across folds? Yes
  Advantages: standard in information retrieval
  Disadvantages: does not take true negatives into account
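For reference, a minimal sketch of computing all three metrics per fold with scikit-learn's cross_validate (so each stays commensurate across folds); the dataset and classifier are again placeholders.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
results = cross_validate(SVC(), X, y, cv=10,
                         scoring=["accuracy", "roc_auc", "f1"])
for name in ("accuracy", "roc_auc", "f1"):
    print(name, results[f"test_{name}"].mean())
```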
Subtle Sources of Bias in CV

Type*: k-hacking
  Approach: try many k’s in k-fold CV (or different training %) and report only the best
  Sexy name I made up: k-hacking
  How to avoid it: pick k=10, repeat it many times (n>200 or as many as possible) and report the full distribution (not box plots)

Type*: metric-hacking
  Approach: try different performance metrics (accuracy, AUC, F1, error rate), and report the best
  Sexy name I made up: m-hacking
  How to avoid it: choose the most appropriate and recognized metric for the problem, e.g. AUC for binary classification

Type*: ROI-hacking
  Approach: assess many ROIs (or their features, or combinations), but report only the best
  Sexy name I made up: r-hacking
  How to avoid it: adopt a whole-brain data-driven approach to discover the best ROIs within an inner CV, then report their out-of-sample predictive accuracy

Type*: feature- or dataset-hacking
  Approach: try subsets of feature[s] or subsamples of dataset[s], but report only the best
  Sexy name I made up: d-hacking
  How to avoid it: use and report on everything: all analyses on all datasets; try inter-dataset CV; run non-parametric statistical comparisons!

*The exact incidence of these hacking approaches is unknown, but non-zero.
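A sketch of the suggested remedy for k-hacking using scikit-learn (the 10x25 = 250 scores and the classifier are illustrative): fix k=10, repeat the CV many times, and report the full distribution rather than a single number.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=25, random_state=0)  # 250 scores
scores = cross_val_score(SVC(), X, y, cv=cv)
# report/plot the full distribution of `scores`, not just its mean or a box plot
```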
50 shades of overfitting

Reference: Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The Parable of Google Flu: Traps in Big Data Analysis. Science, 343 (14 March), 1203-1205.
Limitations of CV
• The number of CV repetitions increases with
  • sample size: a large sample —> a large number of repetitions
    • especially costly if model training is computationally expensive
  • the number of model parameters (exponentially), to choose the best combination!
Recommendations
• Ensure the test set is truly independent of the training set!
  • It is easy to make mistakes in complicated analyses!
• Use repeated hold-out (10-50% for testing), as sketched below,
  • respecting the sample/dependency structure
  • ensuring independence between the train & test sets
• Use the biggest test set and the largest number of repetitions possible
  • Not possible with leave-one-sample-out.
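One way to run a repeated hold-out that respects the dependency structure is scikit-learn's GroupShuffleSplit, sketched below; the random data and the "5 samples per subject" grouping are made-up assumptions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(200, 8)
y = np.random.randint(0, 2, size=200)
subjects = np.repeat(np.arange(40), 5)    # 5 samples per subject

splitter = GroupShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
for train_idx, test_idx in splitter.split(X, y, groups=subjects):
    # all samples of a given subject end up on the same side of the split
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```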
Conclusions
• Results can vary considerably with a different CV scheme.
• CV results can have high variance (>10%).
• Document the CV scheme in detail:
  • type of split
  • number of repetitions
  • full distribution of estimates
• Proper splitting is not enough; proper pooling is needed too.
• Bad examples of reporting:
  • just the mean: 𝜇%
  • mean and std. dev. only: 𝜇±𝜎%
• Good example:
  • “Using 250 iterations of 10-fold cross-validation, we obtain the following distribution of AUC.”
References
• Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage. http://doi.org/10.1016/j.neuroimage.2016.10.038
• Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40-79.
• Forman, G. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter.

neuropredict: github.com/raamana/neuropredict
Useful tools

Software/toolbox | Target audience | Language | # ML techniques | Neuroimaging-oriented? | Coding required? | Effort needed | Use case
scikit-learn | Generic ML | Python | Many | No | Yes | High | To try many ML techniques
neuropredict | Neuroimagers | Python | 1 (more soon) | Yes | No | Easy | Quick evaluation of predictive performance!
nilearn | Neuroimagers | Python | Few | Yes | Yes | Medium | Some image processing is required
PRoNTo | Neuroimagers | Matlab | Few | Yes | Yes | High | Integration with Matlab
PyMVPA | Neuroimagers | Python | Few | Yes | Yes | High | Integration with Python
Weka | Generic ML | Java | Many | No | Yes | High | GUI to try many techniques
Shogun | Generic ML | C++ | Many | No | Yes | High | Efficient
Model selection
Reference: Friedman, J., Hastie, T., & Tibshirani, R. (2008). The Elements of Statistical Learning. Berlin: Springer Series in Statistics.
Datasets
• 7 fMRI datasets
  • intra-subject
  • inter-subject
• OASIS: gender discrimination from VBM maps
• Simulations

Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.