
Cross-validation to assess decoder performance: the good, the bad, and the ugly


Decoding, MVPA, and predictive models for neuroimaging diagnosis or prognosis all rely on cross-validation to measure the predictive accuracy of the model, and optionally to tune the decoder. Cross-validation tests predictive power on left-out data that was not seen during training of the predictive model. It is appealing as it is non-parametric and asymptotically unbiased. Common practice in neuroimaging relies on leave-one-out, yet statistical theory [1] suggests that this is suboptimal, as the small test set leads to large variance and is easily biased by sample correlations.
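
To make the contrast concrete, here is a minimal sketch comparing leave-one-out with repeated random splits; it assumes scikit-learn and synthetic data standing in for decoding features, neither of which is taken from the study itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, ShuffleSplit, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for a decoding problem: many features, few samples.
X, y = make_classification(n_samples=100, n_features=300, n_informative=10,
                           random_state=0)
decoder = LinearSVC(dual=False)

# Leave-one-out: each test set holds a single sample, so every fold returns
# a 0/1 score and the resulting estimate is high-variance.
loo_scores = cross_val_score(decoder, X, y, cv=LeaveOneOut())

# Repeated random splits leaving out 20%: fewer folds, larger test sets.
splitter = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
split_scores = cross_val_score(decoder, X, y, cv=splitter)

print(f"leave-one-out accuracy: {loo_scores.mean():.2f}")
print(f"random-split accuracy:  {split_scores.mean():.2f} "
      f"(std {split_scores.std():.2f} across splits)")
```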

Decoders usually come with a hyper-parameter that controls the regularization, i.e. a bias/variance tradeoff. In machine learning, this tradeoff is typically adjusted to the signal-to-noise ratio of the data using cross-validation to maximize predictive accuracy. In this case, the accuracy of the decoding must be measured on an independent "validation set", using a "nested cross-validation".
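
This nested scheme amounts to an inner parameter-selection loop wrapped in an outer evaluation loop. The sketch below assumes scikit-learn and a synthetic dataset; it is not the analysis code of the study:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for a decoding dataset.
X, y = make_classification(n_samples=200, n_features=300, n_informative=10,
                           random_state=0)

# Inner loop: choose the regularization parameter C by cross-validation.
inner_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
tuned_svm = GridSearchCV(LinearSVC(dual=False),
                         param_grid={"C": [1e-3, 1e-2, 1e-1, 1, 10, 100]},
                         cv=inner_cv)

# Outer loop: measure the accuracy of the tuned decoder on splits that the
# inner loop never sees, so the reported accuracy is not optimistically biased.
outer_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=1)
scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(f"nested cross-validation accuracy: {scores.mean():.2f}")
```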

Here we assess these practices empirically on neuroimaging data, to derive guidelines.


# Methods

Given 8 open datasets from openfMRI [2], we assess cross-validation on 35 decoding tasks, 15 of which are within-subject. We leave a large validation set untouched and perform nested cross-validation on the rest of the data. In a first experiment, we compare the accuracy of the decoder as measured by cross-validation with that measured on the left-out data. In a second experiment, we use the nested cross-validation to tune the decoders, either by refitting with the best parameter or by averaging the best models. We used standard linear decoders: SVM and logistic regression, both sparse (l1 penalty) and non-sparse (l2 penalty).
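
The second experiment can be illustrated with a sketch of the four linear decoders and of the two tuning strategies, refitting with the best C versus averaging the per-fold best models. It assumes scikit-learn and synthetic data; the study's actual implementations and settings may differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for a decoding dataset; keep a large validation set untouched.
X, y = make_classification(n_samples=400, n_features=500, n_informative=20,
                           random_state=0)
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.5,
                                              stratify=y, random_state=0)

# The four linear decoders: SVM and logistic regression, l1 and l2 penalties.
decoders = {
    "SVM-l2":    lambda C: LinearSVC(penalty="l2", C=C, dual=False),
    "SVM-l1":    lambda C: LinearSVC(penalty="l1", C=C, dual=False),
    "logreg-l2": lambda C: LogisticRegression(penalty="l2", C=C, max_iter=1000),
    "logreg-l1": lambda C: LogisticRegression(penalty="l1", C=C,
                                              solver="liblinear"),
}

Cs = np.logspace(-4, 4, 9)
cv = StratifiedKFold(n_splits=5)

for name, make_decoder in decoders.items():
    fold_best = []            # (coef, intercept) of the best model in each fold
    mean_scores = np.zeros(len(Cs))
    for train, test in cv.split(X_dev, y_dev):
        models = [make_decoder(C).fit(X_dev[train], y_dev[train]) for C in Cs]
        scores = [m.score(X_dev[test], y_dev[test]) for m in models]
        mean_scores += np.asarray(scores) / cv.get_n_splits()
        best = models[int(np.argmax(scores))]
        fold_best.append((best.coef_, best.intercept_))

    # Strategy 1: refit on all development data with the best C overall.
    refit = make_decoder(Cs[int(np.argmax(mean_scores))]).fit(X_dev, y_dev)
    acc_refit = refit.score(X_val, y_val)

    # Strategy 2: average the per-fold best linear models instead of refitting.
    coef = np.mean([c for c, _ in fold_best], axis=0)
    intercept = np.mean([b for _, b in fold_best], axis=0)
    pred = (X_val @ coef.T + intercept).ravel() > 0
    acc_avg = np.mean(pred == y_val)

    print(f"{name}: refit {acc_refit:.2f}, averaged {acc_avg:.2f}")
```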

We assess a variety of cross-validation strategies: leaving out single samples, leaving out full sessions or subjects, and repeated random splits leaving out 20% of the sessions or subjects.
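
Such strategies can be expressed with group-aware splitters, where a `groups` array of session or subject labels encodes the dependence structure. The snippet below is a sketch under that assumption, again using scikit-learn and synthetic data rather than the study's own pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (GroupShuffleSplit, LeaveOneGroupOut,
                                     cross_val_score)
from sklearn.svm import LinearSVC

# Synthetic data with 10 hypothetical "sessions" (or subjects) of 20 samples each.
X, y = make_classification(n_samples=200, n_features=300, n_informative=10,
                           random_state=0)
groups = np.repeat(np.arange(10), 20)   # session/subject label of each sample
decoder = LinearSVC(dual=False)

# Leave one full session/subject out at a time.
logo_scores = cross_val_score(decoder, X, y, groups=groups, cv=LeaveOneGroupOut())

# Repeated random splits leaving out 20% of the sessions/subjects.
gss = GroupShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
gss_scores = cross_val_score(decoder, X, y, groups=groups, cv=gss)

print(f"leave-one-group-out accuracy: {logo_scores.mean():.2f}")
print(f"20%-of-groups-out accuracy:   {gss_scores.mean():.2f}")
```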

# Conclusions

The first finding is a confirmation of the theory that repeated random splits should be preferred to leave-one-sample-out: they are less fragile, and less computationally costly.

Second, we find large error bars on cross-validation estimates of predictive power, of 10% or more, particularly for within-subject analyses, likely because of marked sample inhomogeneities.

Finally, we find that setting decoder parameters by nested cross-validation does not lead to much prediction gain, in particular in the case of non-sparse models. This is probably a consequence of our second finding.

These conclusions are crucial for decoding and information mapping, which rely on measuring prediction accuracy. This measure is more fragile than practitioners often assume.


Cross-validation to assess decoder performance: the good, the bad, and the ugly

  1. Cross-validation to assess decoder performance: the good, the bad, and the ugly. Gaël Varoquaux. https://hal.archives-ouvertes.fr/hal-01332785
  2. Measuring prediction accuracy: to find the best method (computer scientists); for information mapping, i.e. an omnibus test (cognitive neuroimaging). Cross-validation is asymptotically unbiased and non-parametric.
  3. Outline: 1. some theory; 2. empirical results on brain imaging.
  4. Some theory: the full data is split into a train set and a test set.
  5. Cross-validation: test on independent data (a train set and a validation set).
  6. Cross-validation: test on independent data, looping over splits of the full data into train and test sets; this measures prediction accuracy.
  7. Choice of cross-validation strategy: test on independent data; be robust to confounding dependences (leave subjects out, or sessions out); loop (more loops = more data points); balance the error in training the model against the error on the test set.
  8. Choice of cross-validation strategy, theory: the negative bias (underestimated performance) decreases with the size of the training set [Arlot & Celisse 2010, sec. 5.1]; the variance decreases with the size of the test set [Arlot & Celisse 2010, sec. 5.2]. Leave out 10-20% of the data, with many random splits respecting the dependency structure.
  9. Tuning hyper-parameters (slides 9-10): the computer scientist says you need to set C in your SVM; C is tuned over a grid from 10⁻⁴ to 10⁴, using a training set and a validation set.
  11. Nested cross-validation: test on independent data with two loops; the outer loop leaves a validation set out of the full data, and the nested loop splits the remaining data into train and test sets.
  12. Empirical results on brain imaging, using the same nested scheme: the outer loop leaves out a validation set, the nested loop splits the rest into train and test sets.
  13. Datasets and tasks: 7 fMRI datasets (6 from openfMRI); Haxby: 5 subjects, 15 intra-subject predictions; inter-subject predictions on 6 studies; OASIS VBM, gender discrimination; HCP MEG task, intra-subject, working memory. Number of samples: ∼200 (min 80, max 400); accuracies between 62% and 96%.
  14. Experiment 1, measuring cross-validation error: leave out a large validation set, measure the error by cross-validation on the rest, and compare the two.
  15. Cross-validated measure versus validation set: scatter plot of the accuracy measured by cross-validation against the accuracy on the validation set (both from 50% to 100%), for intra-subject and inter-subject tasks.
  16. Different cross-validation strategies (slides 16-19 progressively build one table): difference between the accuracy measured by cross-validation and the accuracy on the validation set.

      | Cross-validation strategy     | Intra-subject | Inter-subject |
      |-------------------------------|---------------|---------------|
      | Leave one sample out          | −22% to +19%  | +3% to +43%   |
      | Leave one subject/session out | −10% to +10%  | −21% to +17%  |
      | 20% left out, 3 splits        | −11% to +11%  | −24% to +16%  |
      | 20% left out, 10 splits       | −9% to +9%    | −24% to +14%  |
      | 20% left out, 50 splits       | −9% to +8%    | −23% to +13%  |
  20. Simple simulations (slides 20-21): two Gaussian-separated clouds with auto-correlated noise; 200 decoding samples and 10 000 validation samples, so the validation set approximates the asymptotic regime (figures show the simulated data).
  22. Different cross-validation strategies on MEG data and simulations: difference between the accuracy measured by cross-validation and the accuracy on the validation set.

      | Cross-validation strategy | MEG data     | Simulations  |
      |---------------------------|--------------|--------------|
      | Leave one sample out      | −16% to +14% | +4% to +33%  |
      | Leave one block out       | −15% to +13% | −8% to +8%   |
      | 20% left out, 3 splits    | −15% to +12% | −10% to +11% |
      | 20% left out, 10 splits   | −13% to +10% | −8% to +8%   |
      | 20% left out, 50 splits   | −12% to +10% | −7% to +7%   |
  23. Experiment 2, parameter tuning (slides 23-24): compare different strategies on the validation set: 1. use the default C = 1; 2. use C = 1000; 3. choose the best C by cross-validation and refit; 4. average the best models found in cross-validation. Decoders: non-sparse (SVM-l2, logistic regression-l2) and sparse (SVM-l1, logistic regression-l1).
  25. Cross-validation for tuning? Two panels (non-sparse models, sparse models) show the impact on prediction accuracy, from −8% to +8%, of each strategy (CV + averaging, CV + refitting, C = 1, C = 1000) for SVM and logistic regression.
  26. Cross-validation, lessons learned (@GaelVaroquaux, slides 26-29 progressively build this list): don't use leave-one-out; use random 10-20% splits respecting the sample structure; cross-validation has error bars of ±10%; cross-validation is inefficient for parameter tuning (prefer C = 1 for SVM-l2, and model averaging for SVM-l1). https://hal.archives-ouvertes.fr/hal-01332785
  30. References: S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.
