
Cross-validation Tutorial: What, how and which?


Cross-validation Primer: What, how and which?

Tutorial, workshop, sources of bias, overfitting.


1. Cross-validation: what, how and which? Pradeep Reddy Raamana, raamana.com. “Statistics [from cross-validation] are like bikinis. What they reveal is suggestive, but what they conceal is vital.”
2-6. Goals for Today
• What is cross-validation?
• How to perform it?
• What are the effects of different CV choices (negative bias, unbiased, positive bias)?
7-12. What is generalizability?
• The available data is a sample*; what we want is accuracy on unseen data (the population*).
• Generalizability is assessed via out-of-sample predictions, while avoiding overfitting.
(*both terms have statistical definitions)
13-16. CV helps quantify generalizability.
17-24. Why cross-validate?
• A bigger training set means better learning; a bigger test set means better testing.
• Key: the training and test sets must be disjoint, and the dataset (sample size) is fixed, so they grow at the expense of each other.
• Cross-validate to maximize both.
25-30. Use cases
• “When setting aside data for parameter estimation and validation of results cannot be afforded, cross-validation (CV) is typically used.”
• Use cases: to estimate generalizability (test accuracy), to pick optimal parameters (model selection), and to compare performance (model comparison).
(Figure: distribution of accuracy (%) from repetitions of CV.)
31-38. Key Aspects of CV
1. How you split the dataset into train/test:
• maximal independence between training and test sets is desired;
• the split could be over samples (rows; e.g. individual diagnosis, healthy vs. disease) or over time (columns; e.g. task prediction in fMRI).
2. How often you repeat the randomized splits:
• to expose the classifier to the full variability of the data;
• as many times as you can, e.g. 100 (see the sketch after this list).
39-50. Validation set
• Training set: assesses the goodness of fit of the model; that fit is biased* towards the training set.
• Test set: used to optimize parameters; results become biased towards the test set.
• Validation set: used to evaluate generalization, independently of the training and test sets.
• Whole dataset = training set + test set (inner loop) + validation set (outer loop); a nested-CV sketch follows.
(*biased towards X -> overfit to X)
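The inner/outer loop structure can be made concrete with nested cross-validation. A minimal sketch with scikit-learn (the dataset, model, and parameter grid are stand-in assumptions, not taken from the slides):

```python
# Hypothetical sketch: nested CV -- the inner loop tunes hyperparameters,
# the outer loop estimates generalization on data never used for tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # tuning splits
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # reporting splits

tuned_svm = GridSearchCV(SVC(kernel='linear'),
                         param_grid={'C': [0.01, 0.1, 1, 10]}, cv=inner_cv)
outer_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(outer_scores)  # report the distribution, not just the mean
```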
51-55. Terminology (a minimal split following this scheme is sketched below)
• Training set. Purpose: train the model to learn its core parameters. Don't: report the training error as the test error! Alternative name: training set (no confusion).
• Testing set. Purpose: optimize hyperparameters. Don't: do feature selection or anything supervised on this set to learn or optimize! Alternative names: validation (or tweaking, tuning, optimization) set.
• Validation set. Purpose: evaluate the fully-optimized classifier to report performance. Don't: use it in any way to train the classifier or optimize parameters. Alternative name: test set (more accurately, reporting set).
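A minimal three-way split in this terminology; a sketch only, assuming scikit-learn and stand-in data, with 60/20/20 proportions chosen arbitrarily (not from the slides):

```python
# Hypothetical sketch: carve out training, testing (tuning) and validation (reporting) sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data

# First hold out the validation (reporting) set; never touch it while training or tuning.
X_dev, X_valid, y_dev, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# Then split the rest into training and testing (hyperparameter-tuning) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=0)
```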
56-64. K-fold CV (see the sketch below)
• The dataset is split into k folds; in each trial (1, 2, ..., k), one fold (e.g. the 4th) serves as the test set and the remaining folds form the training set.
• Test sets in different trials are indeed mutually disjoint.
• Note: different folds won't be contiguous.
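A minimal k-fold sketch with scikit-learn; the stand-in data, classifier, k = 5, and shuffling are illustrative assumptions:

```python
# Hypothetical sketch: k-fold CV -- each sample is tested exactly once.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data

kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # shuffled: folds are not contiguous
for trial, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
    print(f"trial {trial}: accuracy = {clf.score(X[test_idx], y[test_idx]):.2f}")
```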
65-70. Repeated Holdout CV (see the sketch below)
• In each of n trials, set aside an independent subsample of the whole dataset (e.g. 30%) for testing and train on the rest.
• Note: there could be overlap among the test sets from different trials! Hence a large n is recommended.
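A minimal repeated-holdout sketch; the stand-in data, n = 100 trials, and the 30% test fraction are illustrative assumptions:

```python
# Hypothetical sketch: repeated holdout (random subsampling) CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data

# 100 random 70/30 train/test splits; test sets from different trials may overlap.
holdout = ShuffleSplit(n_splits=100, test_size=0.3, random_state=0)
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=holdout)
print(scores.mean(), scores.std())
```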
71-85. CV has many variations! (two of them are sketched below)
• k-fold, with k = 2, 3, 5, 10, 20
• repeated hold-out (random subsampling), with train % = 50, 63.2, 75, 80, 90
• stratified: across train/test splits and across classes (e.g. Controls vs. MCIc, keeping class proportions similar in training and test sets)
• inverted: very small training set, large test set
• leave one [unit] out, where the unit can be a sample / pair / tuple / condition / task / block
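Two of these variants sketched with scikit-learn; the stand-in data, the block labels, and the split settings are assumptions:

```python
# Hypothetical sketch: stratified repeated holdout and leave-one-block-out CV.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, StratifiedShuffleSplit

X, y = make_classification(n_samples=200, n_features=20,
                           weights=[0.7, 0.3], random_state=0)  # imbalanced stand-in data
blocks = np.repeat(np.arange(20), 10)  # e.g. 20 runs/subjects, 10 samples each

# Stratified: every train/test split preserves the class proportions.
stratified = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

# Leave one [unit] out: here the unit is a block (could be a subject, task, session, ...).
logo = LeaveOneGroupOut()
print(stratified.get_n_splits(X, y), logo.get_n_splits(X, y, groups=blocks))
```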
86-92. Measuring bias in CV measurements
• Split the whole dataset into a validation set and the remainder; run an inner CV (train/test splits) on the remainder to obtain the cross-validation accuracy.
• Compare the cross-validation accuracy with the validation accuracy: if the two are about equal, the CV estimate is unbiased; otherwise it is positively or negatively biased (see the sketch below).
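A minimal sketch of this comparison; the stand-in data, classifier, and split proportions are assumptions:

```python
# Hypothetical sketch: compare the inner-CV accuracy to the accuracy on a held-out validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)  # stand-in data

# Outer split: hold out a validation set that the inner CV never sees.
X_rest, X_valid, y_rest, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SVC(kernel='linear')
cv_accuracy = cross_val_score(clf, X_rest, y_rest, cv=10).mean()   # inner-CV estimate
valid_accuracy = clf.fit(X_rest, y_rest).score(X_valid, y_valid)   # validation estimate
print(f"CV: {cv_accuracy:.2f}  validation: {valid_accuracy:.2f}  "
      f"bias: {cv_accuracy - valid_accuracy:+.2f}")
```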
93. fMRI datasets
• Haxby: intra-subject, 209 samples, 12 sessions, various tasks
• Duncan: inter-subject, 196 samples, 49 subjects, various tasks
• Wager: inter-subject, 390 samples, 34 subjects, various tasks
• Cohen: inter-subject, 80 samples, 24 subjects, various tasks
• Moran: inter-subject, 138 samples, 36 subjects, various tasks
• Henson: inter-subject, 286 samples, 16 subjects, various tasks
• Knops: inter-subject, 14 samples, 19 subjects, various tasks
Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
94-103. Repeated holdout (10 trials, 20% test): scatter plot of classifier accuracy via cross-validation against classifier accuracy on the validation set, with regions annotated as unbiased, negatively biased, and positively biased.
104-108. CV vs. Validation: real data (figure; annotations: negative bias, unbiased, positive bias, conservative).
109-111. Simulations: known ground truth (figure).
112-115. CV vs. Validation (figure; negative bias, unbiased, positive bias).
116-124. Commensurability across folds
• It is not enough to properly split each fold and accurately evaluate classifier performance!
• Not all measures are commensurate across folds, e.g. decision scores from an SVM (the reference plane and its zero differ from fold to fold), hence they cannot be pooled across folds to construct a single ROC.
• Instead, build an ROC and compute the AUC within each fold (AUC1, AUC2, ..., AUCn), then average the AUC across folds (see the sketch below).
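A minimal per-fold AUC sketch; the stand-in data and classifier are assumptions:

```python
# Hypothetical sketch: compute the ROC AUC within each fold and then average,
# instead of pooling SVM decision scores across folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data

fold_aucs = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                           random_state=0).split(X, y):
    clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
    scores = clf.decision_function(X[test_idx])  # only comparable within this fold
    fold_aucs.append(roc_auc_score(y[test_idx], scores))
print(np.mean(fold_aucs), np.std(fold_aucs))
```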
125. Performance Metrics
• Accuracy / error rate: commensurate across folds. Advantages: universally applicable, handles multi-class. Disadvantages: sensitive to class and cost imbalance.
• Area under the ROC (AUC): commensurate only when the ROC is computed within each fold. Advantages: averages over all ratios of misclassification costs. Disadvantages: not easily extendable to multi-class problems.
• F1 score: commensurate across folds. Advantages: widely used in information retrieval. Disadvantages: does not take true negatives into account.
126. Subtle Sources of Bias in CV
• k-hacking: trying many k's in k-fold CV (or different training percentages) and reporting only the best. How to avoid it: pick k = 10, repeat it many times (n > 200, or as many as possible), and report the full distribution (not box plots).
• metric-hacking (m-hacking): trying different performance metrics (accuracy, AUC, F1, error rate) and reporting the best. How to avoid it: choose the most appropriate and recognized metric for the problem, e.g. AUC for binary classification.
• ROI-hacking (r-hacking): assessing many ROIs (or their features, or combinations) but reporting only the best. How to avoid it: adopt a whole-brain, data-driven approach to discover the best ROIs within an inner CV, then report their out-of-sample predictive accuracy.
• feature- or dataset-hacking (d-hacking): trying subsets of features or subsamples of datasets but reporting only the best. How to avoid it: use and report on everything (all analyses on all datasets), try inter-dataset CV, and run non-parametric statistical comparisons.
(*The exact incidence of these hacking approaches is unknown, but non-zero.)
127-131. Overfitting: figure contrasting an underfit, a good fit, and an overfit model (see the sketch below).
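A minimal sketch of the underfit / good fit / overfit contrast using polynomial regression; this example is entirely illustrative and not from the slides:

```python
# Hypothetical sketch: the training score keeps improving with model complexity,
# while the cross-validated score reveals under- and over-fitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy stand-in data
X = x[:, None]

for degree in (1, 4, 15):  # underfit, good fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_r2 = model.fit(X, y).score(X, y)
    cv_r2 = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree {degree:2d}: train R^2 = {train_r2:.2f}, CV R^2 = {cv_r2:.2f}")
```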
132. 50 shades of overfitting: figure of population vs. decade (© MathWorks); human annihilation?
133-134. 50 shades of overfitting: the parable of Google Flu. Reference: Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The Parable of Google Flu: Traps in Big Data Analysis. Science, 14 March, 343: 1203-1205.
135. “Clever forms of overfitting” (from http://hunch.net/?p=22).
136. Is overfitting always bad? Some credit card fraud detection systems use it successfully. Others?
137-143. Limitations of CV
• The number of CV repetitions increases with:
• sample size: a large sample requires a large number of repetitions, especially if model training is computationally expensive;
• the number of model parameters (exponentially), to choose the best combination.
144-147. Recommendations
• Ensure the test set is truly independent of the training set! It is easy to make mistakes in complicated analyses.
• Use repeated holdout (10-50% for testing), respecting the sample/dependency structure and ensuring independence between train and test sets.
• Use the biggest test set and as many repetitions as possible; this is not possible with leave-one-sample-out.
148-153. Conclusions
• Results could vary considerably with a different CV scheme.
• CV results can have substantial variance (>10%).
• Document the CV scheme in detail: type of split, number of repetitions, and the full distribution of estimates.
• Proper splitting is not enough; proper pooling is needed too.
• Bad examples of reporting: just the mean (𝜇%), or mean and standard deviation (𝜇±𝜎%).
• Good example: “Using 250 iterations of 10-fold cross-validation, we obtain the following distribution of AUC.” (A sketch of such a run follows.)
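A minimal sketch of that reporting recommendation; the stand-in data and classifier are assumptions:

```python
# Hypothetical sketch: 250 iterations of 10-fold CV, reporting the full distribution of AUCs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)  # stand-in data

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=250, random_state=0)
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=cv, scoring='roc_auc')
# Report the whole distribution (e.g. a histogram or percentiles), not just the mean.
print(np.percentile(scores, [2.5, 50, 97.5]))
```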
154. References
• Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage. http://doi.org/10.1016/j.neuroimage.2016.10.038
• Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40-79.
• Forman, G. (2010). Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. ACM SIGKDD Explorations Newsletter.
neuropredict: github.com/raamana/neuropredict
155. Acknowledgements: Gaël Varoquaux
156. Now, it's time to cross-validate! (xkcd)
157. Useful tools
• scikit-learn: generic ML, Python, many ML techniques, not neuroimaging-oriented, coding required, high effort; use case: trying many ML techniques.
• neuropredict: for neuroimagers, Python, 1 technique (more soon), neuroimaging-oriented, no coding required, easy; use case: quick evaluation of predictive performance.
• nilearn: for neuroimagers, Python, few techniques, neuroimaging-oriented, coding required, medium effort; some image processing is required.
• PRoNTo: for neuroimagers, Matlab, few techniques, neuroimaging-oriented, coding required, high effort; integration with Matlab.
• PyMVPA: for neuroimagers, Python, few techniques, neuroimaging-oriented, coding required, high effort; integration with Python.
• Weka: generic ML, Java, many techniques, not neuroimaging-oriented, coding required, high effort; GUI to try many techniques.
• Shogun: generic ML, C++, many techniques, not neuroimaging-oriented, coding required, high effort; efficient.
158. Model selection. Friedman, J., Hastie, T., & Tibshirani, R. (2008). The Elements of Statistical Learning. Springer Series in Statistics. Springer, Berlin.
159. Datasets
• 7 fMRI datasets (intra-subject and inter-subject)
• OASIS: gender discrimination from VBM maps
• Simulations
Reference: Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y., & Thirion, B. (2016). Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage.
