Cross-validation: what, how and which?
Pradeep Reddy Raamana
raamana.com

“Statistics [from cross-validation] are like bikinis. What they reveal is suggestive, but what they conceal is vital.”
Goals for Today
• What is cross-validation?
• How to perform it?
• What are the effects of different CV choices?
What is generalizability?
• available data (sample*) vs. desired: accuracy on unseen data (population*)
• out-of-sample predictions: avoid overfitting
*has a statistical definition
CV helps quantify generalizability
Why cross-validate?
• bigger training set: better learning
• bigger test set: better testing
Key: train and test sets must be disjoint, and the dataset (sample size) is fixed, so they grow at the expense of each other. Cross-validate to maximize both.
Use cases
• “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used.”
• Use cases:
  • to estimate generalizability (test accuracy)
  • to pick optimal parameters (model selection)
  • to compare performance (model comparison)
Key Aspects of CV
1. How you split the dataset into train/test
  • maximal independence between training and test sets is desired
  • the split could be over samples (e.g., individual diagnosis) or over time (e.g., task prediction in fMRI)
2. How often you repeat the randomized splits
  • to expose the classifier to the full variability in the data
  • as many times as you can afford, e.g., 100
Validation set
• Training set: measures goodness of fit of the model; biased* towards the training set
• Test set: used to optimize parameters; biased towards the test set
• Validation set: evaluates generalization, independent of the training and test sets
• The train/test split forms the inner loop; the held-out validation set forms the outer loop over the whole dataset.
*biased towards X: overfit to X
Terminology
• Training set. Purpose: train the model to learn its core parameters. Don't: report the training error as the test error! Alternative name: training set (no confusion).
• Testing set. Purpose: optimize hyperparameters. Don't: do feature selection or anything supervised on the test set to learn or optimize! Alternative names: validation set (or tweaking, tuning, optimization set).
• Validation set. Purpose: evaluate the fully-optimized classifier to report performance. Don't: use it in any way to train the classifier or optimize parameters. Alternative name: test set (more accurately, reporting set).
K-fold CV
• In trial i (i = 1, 2, …, k), the i-th fold (e.g., the 4th) is the test set and the remaining k−1 folds form the training set.
• Test sets in different trials are indeed mutually disjoint.
• Note: different folds won't be contiguous.
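To make the splitting concrete, here is a minimal plain-Python sketch of k-fold index generation (the helper `k_fold_indices` is illustrative, not from the slides); the initial shuffle is why folds are not contiguous:

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for each of the k trials.

    Test sets across trials are mutually disjoint and together
    cover the whole dataset exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # folds won't be contiguous
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j in range(k) if j != i for idx in folds[j]]
        yield train, test

# 10 samples, 5 folds: each trial tests on 2 samples, trains on 8
splits = list(k_fold_indices(10, k=5))
```

Each sample appears in exactly one test set, which is the defining property of k-fold CV.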
Repeated Holdout CV
• Set aside an independent subsample of the whole dataset (e.g., 30%) for testing; repeat over trials 1, 2, …, n.
• Note: there could be overlap among the test sets from different trials! Hence a large n is recommended.
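A matching plain-Python sketch for repeated holdout (`repeated_holdout` is an illustrative helper, not from the slides); unlike k-fold, nothing prevents the test sets of different trials from overlapping:

```python
import random

def repeated_holdout(n_samples, n_trials, test_fraction=0.3, seed=0):
    """Yield (train, test) index lists for n_trials random subsamples.

    Each trial sets aside an independent random test subsample
    (e.g. 30%); test sets of different trials may overlap.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_trials):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random split every trial
        yield indices[n_test:], indices[:n_test]

# 10 samples, 4 trials, 30% test: 3 test and 7 train samples per trial
splits = list(repeated_holdout(10, n_trials=4))
```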
CV has many variations!
• k-fold, k = 2, 3, 5, 10, 20
• repeated hold-out (random subsampling): train % = 50, 63.2, 75, 80, 90
• stratified: across train/test splits and across classes (e.g., Controls vs. MCIc)
• inverted: very small training set, large test set
• leave one [unit] out: unit can be a sample / pair / tuple / condition / task / block
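Stratification across classes can be sketched as follows (plain Python; `stratified_holdout` and the example labels are illustrative only): each class contributes proportionally to both the training and test sets, so class balance is preserved.

```python
import random

def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps its proportion in train and test."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])    # per-class test share
        train.extend(members[n_test:])   # per-class train share
    return sorted(train), sorted(test)

labels = ['CN'] * 10 + ['MCIc'] * 10   # e.g. Controls vs. MCIc
train, test = stratified_holdout(labels)
```

With 10 samples per class and a 20% test fraction, each class contributes exactly 2 samples to the test set.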
Measuring bias in CV measurements
• Hold out a validation set from the whole dataset; run inner-CV (train/test splits) on the rest to get the cross-validation accuracy.
• The validation accuracy comes from the held-out validation set; ideally, cross-validation accuracy ≈ validation accuracy.
• If the CV accuracy exceeds the validation accuracy, the CV estimate has positive bias; if it falls below, negative bias; otherwise it is unbiased.
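The bias check above can be sketched end-to-end with a toy majority-class "classifier" (everything here, including `cv_vs_validation`, is an illustrative assumption, not the study's actual pipeline):

```python
import random

def majority_class(train_labels):
    # toy stand-in for a real classifier: predict the most frequent label
    return max(set(train_labels), key=train_labels.count)

def accuracy(prediction, labels):
    return sum(y == prediction for y in labels) / len(labels)

def cv_vs_validation(labels, n_val, k=5, seed=0):
    """Return (inner-CV accuracy, held-out validation accuracy).

    A large gap (cv_acc >> val_acc) suggests the CV estimate
    is positively (optimistically) biased.
    """
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    val_idx, dev_idx = idx[:n_val], idx[n_val:]
    # inner CV over the development portion (train + test sets)
    fold_accs = []
    for i in range(k):
        test_f = dev_idx[i::k]
        train_f = [j for j in dev_idx if j not in test_f]
        pred = majority_class([labels[j] for j in train_f])
        fold_accs.append(accuracy(pred, [labels[j] for j in test_f]))
    cv_acc = sum(fold_accs) / k
    # retrain on all development data, evaluate once on the validation set
    pred = majority_class([labels[j] for j in dev_idx])
    val_acc = accuracy(pred, [labels[j] for j in val_idx])
    return cv_acc, val_acc

cv_acc, val_acc = cv_vs_validation([0] * 30 + [1] * 20, n_val=10)
```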
fMRI datasets

Dataset | Intra- or inter-subject? | # samples | # blocks (sessions or subjects) | Tasks
Haxby | Intra | 209 | 12 sessions | various
Duncan | Inter | 196 | 49 subjects | various
Wager | Inter | 390 | 34 subjects | various
Cohen | Inter | 80 | 24 subjects | various
Moran | Inter | 138 | 36 subjects | various
Henson | Inter | 286 | 16 subjects | various
Knops | Inter | 14 | 19 subjects | various

Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
Repeated holdout (10 trials, 20% test)
[Figure: classifier accuracy via cross-validation vs. classifier accuracy on the validation set, per dataset]
CV vs. Validation: real data
[Figure: distribution of CV minus validation accuracy; negative bias (conservative), unbiased, positive bias]
Simulations: known ground truth
CV vs. Validation
[Figure: negative bias, unbiased, positive bias]
Commensurability across folds
• It's not enough to properly split each fold and accurately evaluate classifier performance within each fold: the resulting estimates must also be commensurate, i.e., directly comparable, across folds.
Performance Metrics

Metric | Commensurate across folds? | Advantages | Disadvantages
Accuracy / Error rate | Yes | … | …
Subtle Sources of Bias in CV

Type* | Approach | How to avoid it?
k-hacking | Try many k's in k-fold CV … | …

*“sexy name I made up”
Overfitting
[Illustration: underfit, good fit, overfit]
50 shades of overfitting
[Illustration, © mathworks: extrapolating population by decade; human annihilation?]
50 shades of overfitting
Reference: David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science.
“Clever forms of overfitting”, from http://hunch.net/?p=22
Is overfitting always bad?
• Some credit card fraud detection systems use it successfully.
• Others?
Limitations of CV
• The number of CV repetitions increases with
  • sample size: a large sample means a large number of repetitions …
Recommendations
• Ensure the test set is truly independent of the training set!
  • easy to commit mistakes in co…
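One common way that independence gets broken in practice is computing preprocessing statistics (e.g., feature scaling) on the whole dataset before splitting. A minimal sketch of the safe pattern, with the illustrative helper `zscore_params` (not from the slides):

```python
def zscore_params(values):
    """Mean and standard deviation, estimated on the TRAINING set only."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, std

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Right: estimate scaling on the training set, then apply it to both splits.
# Wrong (leakage): estimating mean/std on train + test before splitting.
mean, std = zscore_params(train)
train_z = [(v - mean) / std for v in train]
test_z = [(v - mean) / std for v in test]
```

The test set is transformed with training-set statistics, so no information flows from test to train.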
Conclusions
• Results could vary considerably with a different CV scheme.
• CV results can have high variance (>10%) …
References
• Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
Acknowledgements
Gaël Varoquaux
Now, it's time to cross-validate! (xkcd)
Useful tools

Software/toolbox | Target audience | Language | Number of ML techniques | Neuro-imaging oriented? | Coding required …
Model selection
Friedman, J., Hastie, T., & Tibshirani, R. (2008). The Elements of Statistical Learning. Springer.
Datasets
• 7 fMRI datasets: intra-subject and inter-subject
• OASIS: gender discrimination from VBM maps
• Simulations
Cross-validation Tutorial: What, how and which?
Upcoming SlideShare
Loading in …5
×

7

Share

Download to read offline

Cross-validation Tutorial: What, how and which?

Download to read offline

Cross validation Primer: What, how and which?

Tutorial, workshop, sources of bias, overfitting.

Cross-validation Tutorial: What, how and which?

  1. 1. Cross-validation: what, how and which? Pradeep Reddy Raamana raamana.com “Statistics [from cross-validation] are like bikinis. what they reveal is suggestive, but what they conceal is vital.”
  2. 2. P. Raamana Goals for Today 2
  3. 3. P. Raamana Goals for Today • What is cross-validation? Training set Test set 2
  4. 4. P. Raamana Goals for Today • What is cross-validation? • How to perform it? Training set Test set ≈ℵ≈ 2
  5. 5. P. Raamana Goals for Today • What is cross-validation? • How to perform it? • What are the effects of different CV choices? Training set Test set ≈ℵ≈ 2
  6. 6. P. Raamana Goals for Today • What is cross-validation? • How to perform it? • What are the effects of different CV choices? Training set Test set ≈ℵ≈ negative bias unbiased positive bias 2
  7. 7. P. Raamana What is generalizability? available data (sample*) 3*has a statistical definition
  8. 8. P. Raamana What is generalizability? available data (sample*) 3*has a statistical definition
  9. 9. P. Raamana What is generalizability? available data (sample*) desired: accuracy on 
 unseen data (population*) 3*has a statistical definition
  10. 10. P. Raamana What is generalizability? available data (sample*) desired: accuracy on 
 unseen data (population*) 3*has a statistical definition
  11. 11. P. Raamana What is generalizability? available data (sample*) desired: accuracy on 
 unseen data (population*) out-of-sample predictions 3*has a statistical definition
  12. 12. P. Raamana What is generalizability? available data (sample*) desired: accuracy on 
 unseen data (population*) out-of-sample predictions 3 avoid 
 overfitting *has a statistical definition
  13. 13. P. Raamana CV helps quantify generalizability 4
  14. 14. P. Raamana CV helps quantify generalizability 4
  15. 15. P. Raamana CV helps quantify generalizability 4
  16. 16. P. Raamana CV helps quantify generalizability 4
  17. 17. P. Raamana Why cross-validate? Training set Test set 5
  18. 18. P. Raamana Why cross-validate? Training set Test set bigger training set better learning 5
  19. 19. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set 5
  20. 20. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. 5
  21. 21. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. And the dataset or sample size is fixed. 5
  22. 22. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. And the dataset or sample size is fixed. They grow at the expense of each other! 5
  23. 23. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. And the dataset or sample size is fixed. They grow at the expense of each other! 5
  24. 24. P. Raamana Why cross-validate? Training set Test set bigger training set better learning better testing bigger test set Key: Train & test sets must be disjoint. And the dataset or sample size is fixed. They grow at the expense of each other! cross-validate to maximize both 5
  25. 25. P. Raamana Use cases 6
  26. 26. P. Raamana Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” 6
  27. 27. P. Raamana Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” • Use cases: 6
  28. 28. P. Raamana accuracydistribution
 fromrepetitionofCV(%) Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” • Use cases: • to estimate generalizability 
 (test accuracy) 6
  29. 29. P. Raamana accuracydistribution
 fromrepetitionofCV(%) Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” • Use cases: • to estimate generalizability 
 (test accuracy) • to pick optimal parameters 
 (model selection) 6
  30. 30. P. Raamana accuracydistribution
 fromrepetitionofCV(%) Use cases • “When setting aside data for parameter estimation and validation of results can not be afforded, cross-validation (CV) is typically used” • Use cases: • to estimate generalizability 
 (test accuracy) • to pick optimal parameters 
 (model selection) • to compare performance 
 (model comparison). 6
  31. 31. P. Raamana Key Aspects of CV 7
  32. 32. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test 7
  33. 33. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. 7
  34. 34. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) samples (rows) 7
  35. 35. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) samples (rows) 7 healt hy dise ase
  36. 36. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) • over time (for task prediction in fMRI) time (columns) samples (rows) 7 healt hy dise ase
  37. 37. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) • over time (for task prediction in fMRI) time (columns) samples (rows) 7 healt hy dise ase
  38. 38. P. Raamana Key Aspects of CV 1. How you split the dataset into train/test •maximal independence between 
 training and test sets is desired. •This split could be • over samples (e.g. indiv. diagnosis) • over time (for task prediction in fMRI) 2. How often you repeat randomized splits? •to expose classifier to full variability •As many as times as you can e.g. 100 ≈ℵ≈ time (columns) samples (rows) 7 healt hy dise ase
  39. 39. P. Raamana Validation set Training set 8*biased towards X —> overfit to X
  40. 40. P. Raamana Validation set goodness of fit of the model Training set 8*biased towards X —> overfit to X
  41. 41. P. Raamana Validation set goodness of fit of the model biased* towards the training set Training set 8*biased towards X —> overfit to X
  42. 42. P. Raamana Validation set goodness of fit of the model biased* towards the training set Training set Test set 8*biased towards X —> overfit to X
  43. 43. P. Raamana Validation set goodness of fit of the model biased* towards the training set Training set Test set ≈ℵ≈ 8*biased towards X —> overfit to X
  44. 44. P. Raamana Validation set optimize parameters goodness of fit of the model biased* towards the training set Training set Test set ≈ℵ≈ 8*biased towards X —> overfit to X
  45. 45. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set Training set Test set ≈ℵ≈ 8*biased towards X —> overfit to X
  46. 46. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set Training set Test set Validation set ≈ℵ≈ 8*biased towards X —> overfit to X
  47. 47. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set evaluate generalization independent of training or test sets Training set Test set Validation set ≈ℵ≈ 8*biased towards X —> overfit to X
  48. 48. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set evaluate generalization independent of training or test sets Whole dataset Training set Test set Validation set ≈ℵ≈ 8*biased towards X —> overfit to X
  49. 49. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set evaluate generalization independent of training or test sets Whole dataset Training set Test set Validation set ≈ℵ≈ inner-loop 8*biased towards X —> overfit to X
  50. 50. P. Raamana Validation set optimize parameters goodness of fit of the model biased towards the test set biased* towards the training set evaluate generalization independent of training or test sets Whole dataset Training set Test set Validation set ≈ℵ≈ inner-loop outer-loop 8*biased towards X —> overfit to X
  51. 51. P. Raamana Terminology 9
  52. 52. P. Raamana Terminology 9 Data split Training Testing Validation
  53. 53. P. Raamana Terminology 9 Data split Training Testing Validation Purpose (Do’s) Train model to learn its core parameters Optimize
 hyperparameters Evaluate fully-optimized classifier to report performance
  54. 54. P. Raamana Terminology 9 Data split Training Testing Validation Purpose (Do’s) Train model to learn its core parameters Optimize
 hyperparameters Evaluate fully-optimized classifier to report performance Don’ts (Invalid use) Don’t report training error as the test error! Don’t do feature selection or anything supervised on test set to learn or optimize! Don’t use it in any way to train classifier or optimize parameters
  55. 55. P. Raamana Terminology 9 Data split Training Testing Validation Purpose (Do’s) Train model to learn its core parameters Optimize
 hyperparameters Evaluate fully-optimized classifier to report performance Don’ts (Invalid use) Don’t report training error as the test error! Don’t do feature selection or anything supervised on test set to learn or optimize! Don’t use it in any way to train classifier or optimize parameters Alternative names Training 
 (no confusion) Validation 
 (or tweaking, tuning, optimization set) Test set (more accurately reporting set)
  56. 56. P. Raamana K-fold CV 10
  57. 57. P. Raamana K-fold CV 10
  58. 58. P. Raamana K-fold CV Train Test, 4th fold trial 1 2 … k 10
  59. 59. P. Raamana K-fold CV Train Test, 4th fold trial 1 2 … k 10
  60. 60. P. Raamana K-fold CV Train Test, 4th fold trial 1 2 … k 10
  61. 61. P. Raamana K-fold CV Train Test, 4th fold trial 1 2 … k 10
  62. 62. P. Raamana K-fold CV Test sets in different trials are indeed mutually disjoint Train Test, 4th fold trial 1 2 … k 10
  63. 63. P. Raamana K-fold CV Test sets in different trials are indeed mutually disjoint Train Test, 4th fold trial 1 2 … k Note: different folds won’t be contiguous. 10
  64. 64. P. Raamana K-fold CV Test sets in different trials are indeed mutually disjoint Train Test, 4th fold trial 1 2 … k Note: different folds won’t be contiguous. 10
  65. 65. P. Raamana Repeated Holdout CV Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  66. 66. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  67. 67. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  68. 68. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  69. 69. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  70. 70. P. Raamana Repeated Holdout CV Train Test trial 1 2 … n Note: there could be overlap among the test sets 
 from different trials! Hence large n is recommended. Set aside an independent subsample (e.g. 30%) for testing whole dataset 11
  71. 71. P. Raamana CV has many variations! 12
  72. 72. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 12
  73. 73. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) 12
  74. 74. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 12
  75. 75. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified 12
  76. 76. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test 12
  77. 77. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test 12 Controls MCIc
  78. 78. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test 12 Controls MCIc Training (MCIc)Training (CN)
  79. 79. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test 12 Controls MCIc Training (MCIc)Training (CN) Test Set (CN) Tes
  80. 80. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test 12 Controls MCIc Training (MCIc)Training (CN) Test Set (CN) Tes
  81. 81. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test 12 Controls MCIc Training (MCIc)Training (CN) Test Set (CN) Tes
  82. 82. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test • across classes 12 Controls MCIc Training (MCIc)Training (CN) Test Set (CN) Tes
  83. 83. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test • across classes 12 Controls MCIc Training (MCIc)Training (CN) Test Set (CN) Tes •inverted: 
 very small training, large testing
  84. 84. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test • across classes 12 Controls MCIc Training (MCIc)Training (CN) Test Set (CN) Tes •inverted: 
 very small training, large testing •leave one [unit] out:
  85. 85. P. Raamana CV has many variations! •k-fold, k = 2, 3, 5, 10, 20 •repeated hold-out (random subsampling) •train % = 50, 63.2, 75, 80, 90 •stratified • across train/test • across classes 12 Controls MCIc Training (MCIc)Training (CN) Test Set (CN) Tes •inverted: 
 very small training, large testing •leave one [unit] out: • unit —> sample / pair / tuple / condition / task / block out
P. Raamana Measuring bias in CV measurements
[Figure: the whole dataset is split into a training set (on which inner-CV yields the cross-validation accuracy) and a held-out validation set (which yields the validation accuracy); if the two are ≈ equal, the CV estimate is unbiased; otherwise it carries positive or negative bias]
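The bias-measurement procedure in the figure can be sketched with scikit-learn; the synthetic data and classifier choice here are illustrative, not from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hold out a validation set that the inner CV never sees.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = SVC(kernel='linear')

# Inner CV on the training set -> cross-validation accuracy.
cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()

# Refit on the full training set, score on the validation set -> validation accuracy.
val_acc = clf.fit(X_train, y_train).score(X_val, y_val)

bias = cv_acc - val_acc  # > 0: positive (optimistic) bias; < 0: negative bias
print(f'CV accuracy: {cv_acc:.3f}, validation accuracy: {val_acc:.3f}, bias: {bias:+.3f}')
```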
P. Raamana fMRI datasets

Dataset | Intra- or inter-subject? | # samples | # blocks (sessions or subjects) | Tasks
Haxby   | Intra | 209 | 12 sessions | various
Duncan  | Inter | 196 | 49 subjects | various
Wager   | Inter | 390 | 34 subjects | various
Cohen   | Inter |  80 | 24 subjects | various
Moran   | Inter | 138 | 36 subjects | various
Henson  | Inter | 286 | 16 subjects | various
Knops   | Inter |  14 | 19 subjects | various

Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
P. Raamana Repeated holdout (10 trials, 20% test)
[Figure: classifier accuracy via cross-validation plotted against classifier accuracy on the validation set; points on the diagonal are unbiased, points off it are negatively or positively biased]
P. Raamana CV vs. Validation: real data
[Figure: distribution of the difference between cross-validation and validation accuracy on real data, marking negative bias, unbiased, and positive bias; CV tends to be conservative]
P. Raamana Simulations: known ground truth
P. Raamana CV vs. Validation
[Figure: summary of the difference between cross-validation and validation accuracy, spanning negative bias, unbiased, and positive bias]
P. Raamana Commensurability across folds
• It's not enough to properly split each fold and accurately evaluate classifier performance!
• Not all measures are commensurate across folds!
• e.g. decision scores from an SVM (the reference plane and its zero differ across folds!)
• hence they cannot be pooled across folds to construct a single ROC!
• Instead, build the ROC and compute the AUC per fold, and then average the AUC across folds!
[Figure: train/test splits yielding AUC1, AUC2, AUC3, …, AUCn; two SVM decision boundaries L1 and L2 in the (x1, x2) plane illustrate the differing references]
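Per-fold AUC averaging, as recommended above, might look like this in scikit-learn (the classifier and data are placeholders); note that the decision scores are never pooled across folds:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

fold_aucs = []
for train_idx, test_idx in cv.split(X, y):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
    # Decision scores are only comparable WITHIN this fold:
    scores = clf.decision_function(X[test_idx])
    fold_aucs.append(roc_auc_score(y[test_idx], scores))  # ROC/AUC per fold

mean_auc = np.mean(fold_aucs)  # average the per-fold AUCs, not the pooled scores
print(f'per-fold AUCs: {np.round(fold_aucs, 3)}, mean AUC: {mean_auc:.3f}')
```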
P. Raamana Performance Metrics

Metric | Commensurate across folds? | Advantages | Disadvantages
Accuracy / Error rate | Yes | Universally applicable; multi-class | Sensitive to class- and cost-imbalance
Area under ROC (AUC) | Only when ROC is computed within each fold | Averages over all ratios of misclassification costs | Not easily extendable to multi-class problems
F1 score | Yes | Information retrieval | Does not take true negatives into account
P. Raamana Subtle Sources of Bias in CV

Type* | Approach | Sexy name I made up | How to avoid it?
k-hacking | Try many k's in k-fold CV (or different training %) and report only the best | k-hacking | Pick k=10, repeat it many times (n>200 or as many as possible) and report the full distribution (not box plots)
metric-hacking | Try different performance metrics (accuracy, AUC, F1, error rate), and report only the best | m-hacking | Choose the most appropriate and recognized metric for the problem, e.g. AUC for binary classification
ROI-hacking | Assess many ROIs (or their features, or combinations), but report only the best | r-hacking | Adopt a whole-brain data-driven approach to discover the best ROIs within an inner CV, then report their out-of-sample predictive accuracy
feature- or dataset-hacking | Try subsets of features or subsamples of datasets, but report only the best | d-hacking | Use and report on everything: all analyses on all datasets; try inter-dataset CV; run non-parametric statistical comparisons!

*The exact incidence of these hacking approaches is unknown, but non-zero.
P. Raamana Overfitting
[Figure: example fits illustrating underfit, good fit, and overfit]
P. Raamana 50 shades of overfitting
[Figure: population extrapolated by decade with an overfit model, predicting human annihilation? © MathWorks]
P. Raamana 50 shades of overfitting
Reference: David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. 2014. "The Parable of Google Flu: Traps in Big Data Analysis." Science, 14 March, 343: 1203-1205.
P. Raamana "Clever forms of overfitting" (from http://hunch.net/?p=22)
P. Raamana Is overfitting always bad? Some credit card fraud detection systems use it successfully. Others?
P. Raamana Limitations of CV
• Number of CV repetitions increases with
• sample size: large sample → large number of repetitions, esp. if the model training is computationally expensive
• number of model parameters, exponentially, to choose the best combination!
P. Raamana Recommendations
• Ensure the test set is truly independent of the training set!
• it is easy to commit mistakes in complicated analyses!
• Use repeated hold-out (10-50% for testing)
• respecting the sample/dependency structure
• ensuring independence between train & test sets
• Use the biggest test set, and a large number of repetitions when possible
• not possible with leave-one-sample-out.
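Respecting the dependency structure, as recommended above, means splitting by the dependent unit (e.g. subject) rather than by sample. A sketch with `GroupShuffleSplit`, where the data and subject labels are placeholders:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(120, 8))             # placeholder features
y = rng.integers(0, 2, size=120)          # placeholder labels
subjects = np.repeat(np.arange(30), 4)    # 30 subjects, 4 samples each

# Repeated hold-out that holds out 20% of SUBJECTS (not samples) each time,
# so no subject ever appears in both the train and test sets.
gss = GroupShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
for train_idx, test_idx in gss.split(X, y, groups=subjects):
    overlap = set(subjects[train_idx]) & set(subjects[test_idx])
    assert not overlap, 'train and test sets share a subject!'
print('all 50 splits keep train and test subjects disjoint')
```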
P. Raamana Conclusions
• Results could vary considerably with a different CV scheme
• CV results can have high variance (>10%)
• Document the CV scheme in detail: type of split, number of repetitions, full distribution of estimates
• Proper splitting is not enough; proper pooling is needed too.
• Bad examples: just the mean 𝜇%, or std. dev. 𝜇±𝜎%
• Good example: "Using 250 iterations of 10-fold cross-validation, we obtain the following distribution of AUC."
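Producing a full distribution of estimates, in the spirit of the good example above, might look like this (the classifier and data are placeholders, and the repeat count is chosen to yield 250 estimates):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 25 repetitions of 10-fold CV -> 250 accuracy estimates: report the full
# distribution, not just the mean or mean +/- SD.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=25, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f'{scores.size} estimates; median {np.median(scores):.3f}, '
      f'IQR [{np.percentile(scores, 25):.3f}, {np.percentile(scores, 75):.3f}]')
```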
P. Raamana References
• Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage. http://doi.org/10.1016/j.neuroimage.2016.10.038
• Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40-79.
• Forman, G. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter.
neuropredict: github.com/raamana/neuropredict
P. Raamana Acknowledgements: Gael Varoquaux
Now, it's time to cross-validate! (xkcd)
Useful tools

Software/toolbox | Target audience | Language | # of ML techniques | Neuroimaging-oriented? | Coding required? | Effort needed | Use case
scikit-learn | Generic ML | Python | Many | No | Yes | High | To try many ML techniques
neuropredict | Neuroimagers | Python | 1 (more soon) | Yes | No | Easy | Quick evaluation of predictive performance!
nilearn | Neuroimagers | Python | Few | Yes | Yes | Medium | Some image processing is required
PRoNTo | Neuroimagers | Matlab | Few | Yes | Yes | High | Integration with Matlab
PyMVPA | Neuroimagers | Python | Few | Yes | Yes | High | Integration with Python
Weka | Generic ML | Java | Many | No | Yes | High | GUI to try many techniques
Shogun | Generic ML | C++ | Many | No | Yes | High | Efficient
P. Raamana Model selection. Friedman, J., Hastie, T., & Tibshirani, R. (2008). The Elements of Statistical Learning. Springer, Berlin: Springer Series in Statistics.
P. Raamana Datasets
• 7 fMRI datasets: intra-subject and inter-subject
• OASIS: gender discrimination from VBM maps
• Simulations
Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
Cross validation Primer: What, how and which? Tutorial, workshop, sources of bias, overfitting.
