Cross-Validation

2,423 views

Published on

Published in: Technology, Education
0 Comments
3 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,423
On SlideShare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
89
Comments
0
Likes
3
Embeds 0
No embeds

No notes for slide

Cross-Validation

  1. 1. Cross-validation for detecting and preventing overfitting Andrew W. Moore Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in Professor giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint School of Computer Science originals are available. If you make use of a significant portion of these slides in your own lecture, please include this Carnegie Mellon University message, or the following link to the source repository of Andrew’s tutorials: www.cs.cmu.edu/~awm http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully awm@cs.cmu.edu received. 412-268-7599 Copyright © Andrew W. Moore Slide 1
  2. 2. A Regression Problem y = f(x) + noise Can we learn f from this data? y Let’s consider three methods… x Copyright © Andrew W. Moore Slide 2
  3. 3. Linear Regression y x Copyright © Andrew W. Moore Slide 3
  4. 4. Linear Regression Univariate Linear regression with a constant term: X= 3 y= 7 X Y Originally 3 7 1 3 discussed in the previous Andrew 1 3 : : Lecture: “Neural Nets” : : x1=(3).. y1=7.. Copyright © Andrew W. Moore Slide 4
  5. 5. Linear Regression Univariate Linear regression with a constant term: X= 3 y= 7 X Y 3 7 1 3 1 3 : : : : x1=(3).. y1=7.. Z= 1 y= 3 7 1 1 3 : : z1=(1,3).. y1=7.. zk=(1,xk) Copyright © Andrew W. Moore Slide 5
  6. 6. Linear Regression Univariate Linear regression with a constant term: X= 3 y= 7 X Y 3 7 1 3 1 3 : : : : x1=(3).. y1=7.. Z= 1 y= 3 7 1 1 3 β=(ZTZ)-1(ZTy) : : z1=(1,3).. y1=7.. yest = β0+ β1 x zk=(1,xk) Copyright © Andrew W. Moore Slide 6
  7. 7. Quadratic Regression y x Copyright © Andrew W. Moore Slide 7
  8. 8. Quadratic Regression Much more about X= 3 y= 7 X Y this in the future Andrew Lecture: 3 7 1 3 “Favorite Regression 1 3 : : Algorithms” : : x1=(3,2).. y1=7.. y= 1 3 9 Z= 7 1 1 1 3 β=(ZTZ)-1(ZTy) : : yest = β0+ β1 x+ β2 x2 z=(1 , x, x2,) Copyright © Andrew W. Moore Slide 8
  9. 9. Join-the-dots Also known as piecewise linear nonparametric regression if that makes you feel better y x Copyright © Andrew W. Moore Slide 9
  10. 10. Which is best? y y x x Why not choose the method with the best fit to the data? Copyright © Andrew W. Moore Slide 10
  11. 11. What do we really want? y y x x Why not choose the method with the best fit to the data? “How well are you going to predict future data drawn from the same distribution?” Copyright © Andrew W. Moore Slide 11
  12. 12. The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a training set y x Copyright © Andrew W. Moore Slide 12
  13. 13. The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a training set y 3. Perform your regression on the training x set (Linear regression example) Copyright © Andrew W. Moore Slide 13
  14. 14. The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a training set y 3. Perform your regression on the training x set 4. Estimate your future (Linear regression example) performance with the test Mean Squared Error = 2.4 set Copyright © Andrew W. Moore Slide 14
  15. 15. The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a training set y 3. Perform your regression on the training x set (Quadratic regression example) 4. Estimate your future performance with the test Mean Squared Error = 0.9 set Copyright © Andrew W. Moore Slide 15
  16. 16. The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a training set y 3. Perform your regression on the training x set 4. Estimate your future (Join the dots example) performance with the test Mean Squared Error = 2.2 set Copyright © Andrew W. Moore Slide 16
  17. 17. The test set method Good news: •Very very simple •Can then simply choose the method with the best test-set score Bad news: •What’s the downside? Copyright © Andrew W. Moore Slide 17
  18. 18. The test set method Good news: •Very very simple •Can then simply choose the method with the best test-set score Bad news: We say the •Wastes data: we get an estimate of the “test-set best method to apply to 30% less data estimator of performance •If we don’t have much data, our test-set has high variance” might just be lucky or unlucky Copyright © Andrew W. Moore Slide 18
  19. 19. LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let (xk,yk) be the kth record y x Copyright © Andrew W. Moore Slide 19
  20. 20. LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let (xk,yk) be the kth record 2. Temporarily remove (xk,yk) from the dataset y x Copyright © Andrew W. Moore Slide 20
  21. 21. LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let (xk,yk) be the kth record 2. Temporarily remove (xk,yk) from the dataset 3. Train on the remaining R-1 y datapoints x Copyright © Andrew W. Moore Slide 21
  22. 22. LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let (xk,yk) be the kth record 2. Temporarily remove (xk,yk) from the dataset 3. Train on the remaining R-1 y datapoints 4. Note your error (xk,yk) x Copyright © Andrew W. Moore Slide 22
  23. 23. LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let (xk,yk) be the kth record 2. Temporarily remove (xk,yk) from the dataset 3. Train on the remaining R-1 y datapoints 4. Note your error (xk,yk) When you’ve done all points, x report the mean error. Copyright © Andrew W. Moore Slide 23
  24. 24. LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let (xk,yk) be the kth record 2. Temporarily y y y remove (xk,yk) from the dataset x x x 3. Train on the remaining R-1 datapoints 4. Note your error (xk,yk) y y y When you’ve done all points, report the mean x x x error. MSELOOCV = 2.12 y y y x x x Copyright © Andrew W. Moore Slide 24
  25. 25. LOOCV for Quadratic Regression For k=1 to R 1. Let (xk,yk) be the kth record 2. Temporarily y y y remove (xk,yk) from the dataset x x x 3. Train on the remaining R-1 datapoints 4. Note your error (xk,yk) y y y When you’ve done all points, report the mean x x x error. MSELOOCV =0.962 y y y x x x Copyright © Andrew W. Moore Slide 25
  26. 26. LOOCV for Join The Dots For k=1 to R 1. Let (xk,yk) be the kth record 2. Temporarily y y y remove (xk,yk) from the dataset x x x 3. Train on the remaining R-1 datapoints 4. Note your error (xk,yk) y y y When you’ve done all points, report the mean x x x error. MSELOOCV =3.33 y y y x x x Copyright © Andrew W. Moore Slide 26
  27. 27. Which kind of Cross Validation? Downside Upside Test-set Variance: unreliable Cheap estimate of future performance Leave- Expensive. Doesn’t one-out waste data Has some weird behavior ..can we get the best of both worlds? Copyright © Andrew W. Moore Slide 27
  28. 28. k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) y x Copyright © Andrew W. Moore Slide 28
  29. 29. k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. y x Copyright © Andrew W. Moore Slide 29
  30. 30. k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. y Find the test-set sum of errors on the green points. x Copyright © Andrew W. Moore Slide 30
  31. 31. k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. y Find the test-set sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find x the test-set sum of errors on the blue points. Copyright © Andrew W. Moore Slide 31
  32. 32. k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. y Find the test-set sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find x the test-set sum of errors on the blue points. Linear Regression MSE3FOLD=2.05 Then report the mean error Copyright © Andrew W. Moore Slide 32
  33. 33. k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. y Find the test-set sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find x the test-set sum of errors on the blue points. Quadratic Regression MSE3FOLD=1.11 Then report the mean error Copyright © Andrew W. Moore Slide 33
  34. 34. k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. y Find the test-set sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find x the test-set sum of errors on the blue points. Joint-the-dots MSE3FOLD=2.93 Then report the mean error Copyright © Andrew W. Moore Slide 34
  35. 35. Which kind of Cross Validation? Downside Upside Test-set Variance: unreliable Cheap estimate of future performance Leave- Expensive. Doesn’t waste data one-out Has some weird behavior 10-fold Wastes 10% of the data. Only wastes 10%. Only 10 times more expensive 10 times more expensive than test set instead of R times. 3-fold Wastier than 10-fold. Slightly better than test- Expensivier than test set set R-fold Identical to Leave-one-out Copyright © Andrew W. Moore Slide 35
  36. 36. Which kind of Cross Validation? Downside Upside Test-set Variance: unreliable Cheap estimate of future performance But note: One of Leave- Expensive. Andrew’s joys in data Doesn’t waste life is one-out Has some weird behavioralgorithmic tricks for Wastes 10% of the data. makingwastescheap Only Only these 10%. 10-fold 10 times more expensive 10 times more expensive than testset instead of R times. 3-fold Wastier than 10-fold. Slightly better than test- Expensivier than testset set R-fold Identical to Leave-one-out Copyright © Andrew W. Moore Slide 36
  37. 37. CV-based Model Selection • We’re trying to decide which algorithm to use. • We train each machine and make a table… TRAINERR i fi 10-FOLD-CV-ERR Choice 1 f1 2 f2 ⌦ 3 f3 4 f4 5 f5 6 f6 Copyright © Andrew W. Moore Slide 37
  38. 38. CV-based Model Selection • Example: Choosing number of hidden units in a one- hidden-layer neural net. • Step 1: Compute 10-fold CV error for six different model classes: Algorithm TRAINERR 10-FOLD-CV-ERR Choice 0 hidden units 1 hidden units ⌦ 2 hidden units 3 hidden units 4 hidden units 5 hidden units • Step 2: Whichever model class gave best CV score: train it with all the data, and that’s the predictive model you’ll use. Copyright © Andrew W. Moore Slide 38
  39. 39. CV-based Model Selection • Example: Choosing “k” for a k-nearest-neighbor regression. • Step 1: Compute LOOCV error for six different model classes: Algorithm TRAINERR 10-fold-CV-ERR Choice K=1 K=2 K=3 ⌦ K=4 K=5 K=6 • Step 2: Whichever model class gave best CV score: train it with all the data, and that’s the predictive model you’ll use. Copyright © Andrew W. Moore Slide 39
  40. 40. CV-based Model Selection • Example: Choosing “k” for a k-nearest-neighbor regression. The reason is Computational. For k- • Step 1: Compute LOOCV error for six different model NN (and all other nonparametric methods) LOOCV happens to be as classes: Why did we use 10-fold-CV for cheap as regular predictions. neural nets and LOOCV for k- nearest neighbor? No good reason, except it looked like things were getting worse as K Algorithm TRAINERR LOOCV-ERR Choice was increasing And why stop at K=6 K=1 Are we guaranteed that a local Sadly, no. And in fact, the relationship can be very bumpy. K=2 optimum of K vs LOOCV will be the global optimum? K=3 What should we do if we are ⌦ K=4 depressed at the expense of Idea One: K=1, K=2, K=4, K=8, K=5 K=16, K=32, K=64 … K=1024 doing LOOCV for K= 1 through Idea Two: Hillclimbing from an initial 1000? K=6 guess at K • Step 2: Whichever model class gave best CV score: train it with all the data, and that’s the predictive model you’ll use. Copyright © Andrew W. Moore Slide 40
  41. 41. CV-based Model Selection • Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? Copyright © Andrew W. Moore Slide 41
  42. 42. CV-based Model Selection • Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? • Degree of polynomial in polynomial regression • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier. • The Kernel Width in Kernel Regression • The Kernel Width in Locally Weighted Regression • The Bayesian Prior in Bayesian Regression These involve choosing the value of a real-valued parameter. What should we do? Copyright © Andrew W. Moore Slide 42
  43. 43. CV-based Model Selection • Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? • Degree of polynomial in polynomial regression • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier. • The Kernel Width in Kernel Regression • The Kernel Width in Locally Weighted Regression • The Bayesian Prior in Bayesian Regression These involve Idea One: Consider a discrete set of values (often best to consider a set of values with choosing the value of a exponentially increasing gaps, as in the K-NN real-valued parameter. example). ∂ LOOCV What should we do? Idea Two: Compute and then ∂ Parameter do gradianet descent. Copyright © Andrew W. Moore Slide 43
  44. 44. CV-based Model Selection • Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far? • Degree of polynomial in polynomial regression • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier. • The Kernel Width in Kernel Regression • The Kernel Width in Locally Weighted Regression tors of a non- fac : The scale • The Bayesian Prior in Bayesian Regression distance metric Also parametric These involve Idea One: Consider a discrete set of values (often best to consider a set of values with choosing the value of a exponentially increasing gaps, as in the K-NN real-valued parameter. example). ∂ LOOCV What should we do? Idea Two: Compute and then ∂ Parameter do gradianet descent. Copyright © Andrew W. Moore Slide 44
  45. 45. CV-based Algorithm Choice • Example: Choosing which regression algorithm to use • Step 1: Compute 10-fold-CV error for six different model classes: Algorithm TRAINERR 10-fold-CV-ERR Choice 1-NN 10-NN Linear Reg’n ⌦ Quad reg’n LWR, KW=0.1 LWR, KW=0.5 • Step 2: Whichever algorithm gave best CV score: train it with all the data, and that’s the predictive model you’ll use. Copyright © Andrew W. Moore Slide 45
  46. 46. Alternatives to CV-based model selection • Model selection methods: 1. Cross-validation 2. AIC (Akaike Information Criterion) 3. BIC (Bayesian Information Criterion) 4. VC-dimension (Vapnik-Chervonenkis Dimension) Only directly applicable to choosing classifiers Described in a future Lecture Copyright © Andrew W. Moore Slide 46
  47. 47. Which model selection method is best? 1. (CV) Cross-validation 2. AIC (Akaike Information Criterion) 3. BIC (Bayesian Information Criterion) 4. (SRMVC) Structural Risk Minimize with VC-dimension • AIC, BIC and SRMVC advantage: you only need the training error. • CV error might have more variance • SRMVC is wildly conservative • Asymptotically AIC and Leave-one-out CV should be the same • Asymptotically BIC and carefully chosen k-fold should be same • You want BIC if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding) • Many alternatives---including proper Bayesian approaches. • It’s an emotional issue. Copyright © Andrew W. Moore Slide 47
  48. 48. Other Cross-validation issues • Can do “leave all pairs out” or “leave-all- ntuples-out” if feeling resourceful. • Some folks do k-folds in which each fold is an independently-chosen subset of the data • Do you know what AIC and BIC are? If so… • LOOCV behaves like AIC asymptotically. • k-fold behaves like BIC if you choose k carefully If not… • Nyardely nyardely nyoo nyoo Copyright © Andrew W. Moore Slide 48
  49. 49. Cross-Validation for regression • Choosing the number of hidden units in a neural net • Feature selection (see later) • Choosing a polynomial degree • Choosing which regressor to use Copyright © Andrew W. Moore Slide 49
  50. 50. Supervising Gradient Descent • This is a weird but common use of Test-set validation • Suppose you have a neural net with too many hidden units. It will overfit. • As gradient descent progresses, maintain a graph of MSE-testset-error vs. Iteration Training Set Use the weights you Mean Squared found on this iteration Test Set Error Iteration of Gradient Descent Copyright © Andrew W. Moore Slide 50
  51. 51. Supervising Gradient Descent • This is a weird but common use of Test-set validation • Suppose you have a neural net with too Relies on an intuition that a not-fully- many hiddenminimized set of overfit. somewhat like units. It will weights is • As gradient descent progresses, maintain a having fewer parameters. graph of MSE-testset-error vs. Iteration Works pretty well in practice, apparently Training Set Use the weights you Mean Squared found on this iteration Test Set Error Iteration of Gradient Descent Copyright © Andrew W. Moore Slide 51
  52. 52. Cross-validation for classification • Instead of computing the sum squared errors on a test set, you should compute… Copyright © Andrew W. Moore Slide 52
  53. 53. Cross-validation for classification • Instead of computing the sum squared errors on a test set, you should compute… The total number of misclassifications on a testset. Copyright © Andrew W. Moore Slide 53
  54. 54. Cross-validation for classification • Instead of computing the sum squared errors on a test set, you should compute… The total number of misclassifications on a testset. • What’s LOOCV of 1-NN? • What’s LOOCV of 3-NN? • What’s LOOCV of 22-NN? Copyright © Andrew W. Moore Slide 54
  55. 55. Cross-validation for classification • Instead of computing the sum squared errors on a test set, you should compute… The total number of misclassifications on a testset. • But there’s a more sensitive alternative: Compute log P(all test outputs|all test inputs, your model) Copyright © Andrew W. Moore Slide 55
  56. 56. Cross-Validation for classification • Choosing the pruning parameter for decision trees • Feature selection (see later) • What kind of Gaussian to use in a Gaussian- based Bayes Classifier • Choosing which classifier to use Copyright © Andrew W. Moore Slide 56
  57. 57. Cross-Validation for density estimation • Compute the sum of log-likelihoods of test points Example uses: • Choosing what kind of Gaussian assumption to use • Choose the density estimator • NOT Feature selection (testset density will almost always look better with fewer features) Copyright © Andrew W. Moore Slide 57
  58. 58. Feature Selection • Suppose you have a learning algorithm LA and a set of input attributes { X1 , X2 .. Xm } • You expect that LA will only find some subset of the attributes useful. • Question: How can we use cross-validation to find a useful subset? • Four ideas: Another fun area in which • Forward selection Andrew has spent a lot of his • Backward elimination wild youth • Hill Climbing • Stochastic search (Simulated Annealing or GAs) Copyright © Andrew W. Moore Slide 58
  59. 59. Very serious warning • Intensive use of cross validation can overfit. • How? • What can be done about it? Copyright © Andrew W. Moore Slide 59
  60. 60. Very serious warning • Intensive use of cross validation can overfit. • How? • Imagine a dataset with 50 records and 1000 attributes. • You try 1000 linear regression models, each one using one of the attributes. • What can be done about it? Copyright © Andrew W. Moore Slide 60
  61. 61. Very serious warning • Intensive use of cross validation can overfit. • How? • Imagine a dataset with 50 records and 1000 attributes. • You try 1000 linear regression models, each one using one of the attributes. • The best of those 1000 looks good! • What can be done about it? Copyright © Andrew W. Moore Slide 61
  62. 62. Very serious warning • Intensive use of cross validation can overfit. • How? • Imagine a dataset with 50 records and 1000 attributes. • You try 1000 linear regression models, each one using one of the attributes. • The best of those 1000 looks good! • But you realize it would have looked good even if the output had been purely random! • What can be done about it? • Hold out an additional testset before doing any model selection. Check the best model performs well even on the additional testset. • Or: Randomization Testing Copyright © Andrew W. Moore Slide 62
  63. 63. What you should know • Why you can’t use “training-set-error” to estimate the quality of your learning algorithm on your data. • Why you can’t use “training set error” to choose the learning algorithm • Test-set cross-validation • Leave-one-out cross-validation • k-fold cross-validation • Feature selection methods • CV for classification, regression & densities Copyright © Andrew W. Moore Slide 63

×