Slide 1: Cross-validation for detecting and preventing overfitting
Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm, awm@cs.cmu.edu, 412-268-7599
Copyright © Andrew W. Moore
Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.
Slide 2: A Regression Problem. (Plot: scattered data, y vs. x.) y = f(x) + noise. Can we learn f from this data? Let's consider three methods…
Slide 3: Linear Regression. (Plot: y vs. x.)
Slide 4: Linear Regression. Univariate linear regression with a constant term. The data table has columns X and Y with rows (3, 7), (1, 3), …, i.e. X = (3, 1, …), y = (7, 3, …), so x1 = (3), y1 = 7, and so on. Originally discussed in the previous Andrew Lecture: "Neural Nets".
Slide 5: Linear Regression. Same setup. Prepend a constant 1 to each input: Z = [(1, 3), (1, 1), …], y = (7, 3, …), so z1 = (1, 3), zk = (1, xk), y1 = 7.
Slide 6: Linear Regression. Same setup. Solve β = (ZᵀZ)⁻¹(Zᵀy); then y_est = β0 + β1 x.
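A minimal sketch of the recipe on slides 4-6, using NumPy; the toy data values and variable names are illustrative, not taken from the slides:

```python
import numpy as np

# Toy univariate data in the spirit of the slide's table (illustrative values).
x = np.array([3.0, 1.0, 5.0, 2.0])
y = np.array([7.0, 3.0, 11.0, 5.0])

# z_k = (1, x_k): prepend a constant column so beta_0 is the intercept term.
Z = np.column_stack([np.ones_like(x), x])

# beta = (Z^T Z)^(-1) Z^T y, computed with a linear solver rather than an explicit inverse.
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)

# y_est = beta_0 + beta_1 * x
y_est = Z @ beta
print(beta, y_est)
```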
Slide 7: Quadratic Regression. (Plot: y vs. x.)
Slide 8: Quadratic Regression. Same data table: X = (3, 1, …), y = (7, 3, …), x1 = (3), y1 = 7. Each input becomes z = (1, x, x²), so Z = [(1, 3, 9), (1, 1, 1), …], y = (7, 3, …). β = (ZᵀZ)⁻¹(Zᵀy); y_est = β0 + β1 x + β2 x². Much more about this in the future Andrew Lecture: "Favorite Regression Algorithms".
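The same normal-equation recipe extends to the quadratic case by adding an x² column to Z; a sketch under the same illustrative toy data as above:

```python
import numpy as np

x = np.array([3.0, 1.0, 5.0, 2.0])
y = np.array([7.0, 3.0, 11.0, 5.0])

# z = (1, x, x^2): each row of Z now has three terms.
Z = np.column_stack([np.ones_like(x), x, x ** 2])

# beta = (Z^T Z)^(-1) Z^T y as before; y_est = beta_0 + beta_1 x + beta_2 x^2.
beta = np.linalg.solve(Z.T @ Z, Z.T @ y)
y_est = Z @ beta
print(beta, y_est)
```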
Slide 9: Join-the-dots. (Plot: y vs. x.) Also known as piecewise linear nonparametric regression, if that makes you feel better.
Slide 10: Which is best? (Two plots: y vs. x.) Why not choose the method with the best fit to the data?
Slide 11: What do we really want? (Same two plots.) Why not choose the method with the best fit to the data? "How well are you going to predict future data drawn from the same distribution?"
Slide 12: The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set.
Slide 13: The test set method. 1. Randomly choose 30% of the data to be in a test set. 2. The remainder is a training set. 3. Perform your regression on the training set. (Linear regression example.)
Slide 14: The test set method. Steps 1-3 as before. 4. Estimate your future performance with the test set. (Linear regression example.) Mean Squared Error = 2.4.
Slide 15: The test set method. Steps 1-4 as before. (Quadratic regression example.) Mean Squared Error = 0.9.
Slide 16: The test set method. Steps 1-4 as before. (Join-the-dots example.) Mean Squared Error = 2.2.
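A sketch of the four-step test-set method from slides 12-16. The helper name `test_set_mse` and the generic `fit(x, y)` trainer (which must return a predictor callable) are assumptions for illustration; the 30% split and squared-error scoring follow the slides:

```python
import numpy as np

def test_set_mse(x, y, fit, test_fraction=0.3, seed=0):
    """Steps 1-4: random 30% test set, train on the rest, report test-set MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test, train = idx[:n_test], idx[n_test:]
    predictor = fit(x[train], y[train])          # step 3: regress on the training set only
    residuals = y[test] - predictor(x[test])     # step 4: score on the held-out test set
    return float(np.mean(residuals ** 2))
```

For example, `fit` could wrap the normal-equation sketch above and return `lambda x_new: beta[0] + beta[1] * x_new`.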
Slide 17: The test set method. Good news: very very simple; can then simply choose the method with the best test-set score. Bad news: what's the downside?
Slide 18: The test set method. Good news: very very simple; can then simply choose the method with the best test-set score. Bad news: wastes data (we get an estimate of the best method to apply to 30% less data), and if we don't have much data, our test set might just be lucky or unlucky. We say the "test-set estimator of performance has high variance".
Slide 19: LOOCV (Leave-one-out Cross Validation). For k = 1 to R: 1. Let (xk, yk) be the kth record.
Slide 20: LOOCV. For k = 1 to R: 1. Let (xk, yk) be the kth record. 2. Temporarily remove (xk, yk) from the dataset.
Slide 21: LOOCV. Steps 1-2 as before. 3. Train on the remaining R-1 datapoints.
Slide 22: LOOCV. Steps 1-3 as before. 4. Note your error on (xk, yk).
Slide 23: LOOCV. Steps 1-4 as before. When you've done all points, report the mean error.
Slide 24: LOOCV. Full procedure as above. (Nine plots: the linear fit with each point left out in turn.) MSE_LOOCV = 2.12.
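A sketch of the LOOCV loop exactly as listed on slides 19-24, again assuming a generic `fit(x, y)` trainer that returns a predictor (the function name is made up):

```python
import numpy as np

def loocv_mse(x, y, fit):
    """Leave each record out in turn, train on the other R-1, average the squared errors."""
    errors = []
    for k in range(len(x)):                      # for k = 1 to R
        keep = np.arange(len(x)) != k            # temporarily remove (x_k, y_k)
        predictor = fit(x[keep], y[keep])        # train on the remaining R-1 datapoints
        errors.append((y[k] - predictor(x[k:k + 1]))[0] ** 2)  # note your error on (x_k, y_k)
    return float(np.mean(errors))                # report the mean error
```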
Slide 25: LOOCV for Quadratic Regression. Same procedure. (Nine plots, one per left-out point.) MSE_LOOCV = 0.962.
Slide 26: LOOCV for Join The Dots. Same procedure. (Nine plots, one per left-out point.) MSE_LOOCV = 3.33.
Slide 27: Which kind of Cross Validation?
Test-set: downside is variance (an unreliable estimate of future performance); upside is that it's cheap.
Leave-one-out: downside is that it's expensive and has some weird behavior; upside is that it doesn't waste data.
…can we get the best of both worlds?
Slide 28: k-fold Cross Validation. Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, colored Red, Green and Blue).
Slide 29: k-fold Cross Validation. For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.
Slide 30: k-fold Cross Validation. Likewise for the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points.
Slide 31: k-fold Cross Validation. Likewise for the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points.
Slide 32: k-fold Cross Validation. Then report the mean error. Linear Regression: MSE_3FOLD = 2.05.
Slide 33: k-fold Cross Validation. Quadratic Regression: MSE_3FOLD = 1.11.
Slide 34: k-fold Cross Validation. Join-the-dots: MSE_3FOLD = 2.93.
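A sketch of k-fold cross-validation as described on slides 28-34, with k=3 by default. The partition bookkeeping and the function name are mine; `fit` is again a generic trainer returning a predictor:

```python
import numpy as np

def kfold_mse(x, y, fit, k=3, seed=0):
    """Randomly split into k partitions; for each, train on the rest and sum the test errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)               # the k "colored" partitions
    total_sq_err = 0.0
    for fold in folds:
        train = np.setdiff1d(idx, fold)          # all the points not in this partition
        predictor = fit(x[train], y[train])
        total_sq_err += float(np.sum((y[fold] - predictor(x[fold])) ** 2))
    return total_sq_err / len(x)                 # then report the mean error
```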
Slide 35: Which kind of Cross Validation?
Test-set: downside is variance (an unreliable estimate of future performance); upside is that it's cheap.
Leave-one-out: downside is that it's expensive and has some weird behavior; upside is that it doesn't waste data.
10-fold: downside is that it wastes 10% of the data and is 10 times more expensive than the test-set method; upside is that it only wastes 10% and is only 10 times more expensive instead of R times.
3-fold: downside is that it's wastier than 10-fold and expensivier than the test-set method; upside is that it's slightly better than test-set.
R-fold: identical to leave-one-out.
Slide 36: Which kind of Cross Validation? (Same table.) But note: one of Andrew's joys in life is algorithmic tricks for making these cheap.
Slide 37: CV-based Model Selection. We're trying to decide which algorithm to use. We train each machine and make a table with columns fi, TRAINERR, 10-FOLD-CV-ERR, Choice, and rows f1 through f6; in the example, f3 gets the check mark.
Slide 38: CV-based Model Selection. Example: choosing the number of hidden units in a one-hidden-layer neural net. Step 1: compute the 10-fold CV error for six different model classes: 0 through 5 hidden units (table columns: Algorithm, TRAINERR, 10-FOLD-CV-ERR, Choice; 2 hidden units gets the check mark). Step 2: whichever model class gave the best CV score: train it with all the data, and that's the predictive model you'll use.
Slide 39: CV-based Model Selection. Example: choosing "k" for a k-nearest-neighbor regression. Step 1: compute the LOOCV error for six different model classes: K=1 through K=6. Step 2: whichever model class gave the best CV score: train it with all the data, and that's the predictive model you'll use.
Slide 40: CV-based Model Selection. (Table: Algorithm K=1 through K=6, TRAINERR, LOOCV-ERR, Choice; K=4 gets the check mark.) Same two steps as before.
Q: Why did we use 10-fold-CV for neural nets and LOOCV for k-nearest neighbor? A: The reason is computational. For k-NN (and all other nonparametric methods) LOOCV happens to be as cheap as regular predictions.
Q: And why stop at K=6? A: No good reason, except it looked like things were getting worse as K was increasing.
Q: Are we guaranteed that a local optimum of K vs. LOOCV will be the global optimum? A: Sadly, no. And in fact, the relationship can be very bumpy.
Q: What should we do if we are depressed at the expense of doing LOOCV for K=1 through 1000? A: Idea One: K=1, K=2, K=4, K=8, K=16, K=32, K=64 … K=1024. Idea Two: hillclimbing from an initial guess at K.
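A sketch of the two-step selection recipe from slides 38-40: score each candidate model class by CV, then retrain the winner on all the data. The dictionary of candidate `fit` functions and the helper name are assumptions for illustration; `cv_score` can be the `loocv_mse` or `kfold_mse` sketch above:

```python
def select_by_cv(candidates, x, y, cv_score):
    """candidates: dict of name -> fit(x, y); cv_score: e.g. loocv_mse or kfold_mse."""
    scores = {name: cv_score(x, y, fit) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)           # whichever model class gave the best CV score...
    return candidates[best](x, y), best, scores  # ...train it with all the data and use that
```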
Slide 41: CV-based Model Selection. Can you think of other decisions we can ask Cross Validation to make for us, based on other machine learning algorithms in the class so far?
Slide 42: CV-based Model Selection. Examples: • Degree of polynomial in polynomial regression • Whether to use full, diagonal or spherical Gaussians in a Gaussian Bayes Classifier • The kernel width in Kernel Regression • The kernel width in Locally Weighted Regression • The Bayesian prior in Bayesian Regression. These involve choosing the value of a real-valued parameter. What should we do?
Slide 43: CV-based Model Selection. Same examples. Idea One: consider a discrete set of values (often best to consider a set of values with exponentially increasing gaps, as in the k-NN example). Idea Two: compute ∂LOOCV/∂Parameter and then do gradient descent.
Slide 44: CV-based Model Selection. Same examples and ideas. Also: the scale factors of a non-parametric distance metric.
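"Idea One" from slides 43-44, sketched for a single real-valued parameter such as a kernel width: score an exponentially spaced grid of candidate values and keep the best. The `make_fit` factory (which turns a parameter value into a `fit(x, y)` trainer) and the grid bounds are hypothetical; it reuses the `loocv_mse` sketch above:

```python
import numpy as np

def choose_parameter(make_fit, x, y, low=1e-3, high=1e3, points_per_decade=1):
    """Try exponentially spaced parameter values (Idea One) and pick the LOOCV winner."""
    n = int(round(np.log10(high / low) * points_per_decade)) + 1
    grid = np.logspace(np.log10(low), np.log10(high), n)
    scores = [loocv_mse(x, y, make_fit(value)) for value in grid]
    return float(grid[int(np.argmin(scores))])
```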
Slide 45: CV-based Algorithm Choice. Example: choosing which regression algorithm to use. Step 1: compute the 10-fold-CV error for six different model classes: 1-NN, 10-NN, Linear Reg'n, Quad reg'n, LWR with KW=0.1, LWR with KW=0.5 (table columns: Algorithm, TRAINERR, 10-fold-CV-ERR, Choice; Quad reg'n gets the check mark). Step 2: whichever algorithm gave the best CV score: train it with all the data, and that's the predictive model you'll use.
Slide 46: Alternatives to CV-based model selection. Model selection methods: 1. Cross-validation; 2. AIC (Akaike Information Criterion); 3. BIC (Bayesian Information Criterion); 4. VC-dimension (Vapnik-Chervonenkis Dimension), only directly applicable to choosing classifiers. AIC, BIC and VC-dimension are described in a future Lecture.
Slide 47: Which model selection method is best? 1. (CV) Cross-validation; 2. AIC (Akaike Information Criterion); 3. BIC (Bayesian Information Criterion); 4. (SRMVC) Structural Risk Minimization with VC-dimension. AIC, BIC and SRMVC advantage: you only need the training error. CV error might have more variance. SRMVC is wildly conservative. Asymptotically AIC and leave-one-out CV should be the same. Asymptotically BIC and carefully chosen k-fold should be the same. You want BIC if you want the best structure instead of the best predictor (e.g. for clustering or Bayes Net structure finding). Many alternatives, including proper Bayesian approaches. It's an emotional issue.
Slide 48: Other Cross-validation issues. Can do "leave all pairs out" or "leave-all-n-tuples-out" if feeling resourceful. Some folks do k-folds in which each fold is an independently-chosen subset of the data. Do you know what AIC and BIC are? If so: LOOCV behaves like AIC asymptotically, and k-fold behaves like BIC if you choose k carefully. If not: nyardely nyardely nyoo nyoo.
Slide 49: Cross-Validation for regression. Uses: choosing the number of hidden units in a neural net; feature selection (see later); choosing a polynomial degree; choosing which regressor to use.
Slide 50: Supervising Gradient Descent. This is a weird but common use of test-set validation. Suppose you have a neural net with too many hidden units. It will overfit. As gradient descent progresses, maintain a graph of MSE-testset-error vs. iteration. (Plot: mean squared error vs. iteration of gradient descent, with one curve for the training set and one for the test set; annotation: "Use the weights you found on this iteration.")
Slide 51: Supervising Gradient Descent. Same as slide 50, plus: relies on an intuition that a not-fully-minimized set of weights is somewhat like having fewer parameters. Works pretty well in practice, apparently.
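A sketch of "supervising gradient descent" with a held-out test set: track the test-set MSE at every iteration and keep the weights from the iteration where it was lowest. The `step` and `test_mse` callables are placeholders for whatever network and optimizer you actually use:

```python
import copy

def early_stopped_weights(weights, step, test_mse, max_iters=1000):
    """Run gradient descent, but keep the weights from the iteration with lowest test-set MSE."""
    best_weights, best_score = copy.deepcopy(weights), test_mse(weights)
    for _ in range(max_iters):
        weights = step(weights)              # one gradient-descent update on the training set
        score = test_mse(weights)            # monitor MSE on the held-out test set
        if score < best_score:
            best_weights, best_score = copy.deepcopy(weights), score
    return best_weights                      # "use the weights you found on this iteration"
```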
Slide 52: Cross-validation for classification. Instead of computing the sum squared errors on a test set, you should compute…
Slide 53: Cross-validation for classification. …the total number of misclassifications on a test set.
Slide 54: Cross-validation for classification. The total number of misclassifications on a test set. What's the LOOCV of 1-NN? What's the LOOCV of 3-NN? What's the LOOCV of 22-NN?
Slide 55: Cross-validation for classification. The total number of misclassifications on a test set. But there's a more sensitive alternative: compute log P(all test outputs | all test inputs, your model).
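The two classification scores suggested on slides 53-55, sketched side by side: the misclassification count, and the more sensitive test-set log-likelihood log P(all test outputs | all test inputs, model). The `predict_label` and `predict_proba` callables are assumptions about the model's interface, and `y_test` is assumed to hold integer class indices:

```python
import numpy as np

def misclassification_count(predict_label, x_test, y_test):
    """Total number of misclassifications on a test set."""
    return int(np.sum(predict_label(x_test) != y_test))

def test_log_likelihood(predict_proba, x_test, y_test):
    """log P(all test outputs | all test inputs, model), assuming independent test points."""
    probs = predict_proba(x_test)            # shape (n_test, n_classes), rows sum to 1
    return float(np.sum(np.log(probs[np.arange(len(y_test)), y_test])))
```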
Slide 56: Cross-Validation for classification. Uses: choosing the pruning parameter for decision trees; feature selection (see later); what kind of Gaussian to use in a Gaussian-based Bayes Classifier; choosing which classifier to use.
Slide 57: Cross-Validation for density estimation. Compute the sum of log-likelihoods of test points. Example uses: choosing what kind of Gaussian assumption to use; choosing the density estimator; NOT feature selection (test-set density will almost always look better with fewer features).
Slide 58: Feature Selection. Suppose you have a learning algorithm LA and a set of input attributes {X1, X2, …, Xm}. You expect that LA will only find some subset of the attributes useful. Question: how can we use cross-validation to find a useful subset? Four ideas: forward selection, backward elimination, hill climbing, stochastic search (simulated annealing or GAs). Another fun area in which Andrew has spent a lot of his wild youth.
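A sketch of forward selection, the first of the four ideas on slide 58: greedily add whichever attribute most improves the CV score, and stop when nothing helps. The function name is mine; `fit` is a generic trainer that accepts whatever attribute columns it is given, and `cv_score` can be the `loocv_mse` or `kfold_mse` sketch above:

```python
import numpy as np

def forward_selection(fit, X, y, cv_score):
    """Greedily add the attribute that most improves the CV score; stop when none helps."""
    chosen, remaining, best_score = [], list(range(X.shape[1])), np.inf
    while remaining:
        scores = {j: cv_score(X[:, chosen + [j]], y, fit) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_score:     # no remaining attribute improves the score: stop
            return chosen
        chosen.append(j_best)
        remaining.remove(j_best)
        best_score = scores[j_best]
    return chosen
```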
Slide 59: Very serious warning. Intensive use of cross validation can overfit. How? What can be done about it?
Slide 60: Very serious warning. Intensive use of cross validation can overfit. How? Imagine a dataset with 50 records and 1000 attributes. You try 1000 linear regression models, each one using one of the attributes. What can be done about it?
Slide 61: Very serious warning. As above, and: the best of those 1000 looks good!
Slide 62: Very serious warning. As above, and: but you realize it would have looked good even if the output had been purely random! What can be done about it? Hold out an additional test set before doing any model selection; check that the best model performs well even on the additional test set. Or: randomization testing.
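A sketch of the first safeguard on slide 62: carve off an extra hold-out set before any model selection, do all the CV-based shopping on the rest, then confirm the winner on the untouched hold-out. Split sizes, the function name, and the `candidates` dictionary are illustrative assumptions:

```python
import numpy as np

def select_then_confirm(candidates, X, y, cv_score, holdout_fraction=0.25, seed=0):
    """Model-select with CV on one chunk of data; sanity-check the winner on data CV never saw."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_hold = int(round(holdout_fraction * len(y)))
    hold, work = idx[:n_hold], idx[n_hold:]
    # All intensive CV-based model selection happens on the "work" portion only.
    scores = {name: cv_score(X[work], y[work], fit) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    predictor = candidates[best](X[work], y[work])
    holdout_mse = float(np.mean((y[hold] - predictor(X[hold])) ** 2))
    return best, holdout_mse                 # a poor holdout_mse warns that the CV itself overfit
```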
Slide 63: What you should know. Why you can't use "training-set error" to estimate the quality of your learning algorithm on your data. Why you can't use "training-set error" to choose the learning algorithm. Test-set cross-validation. Leave-one-out cross-validation. k-fold cross-validation. Feature selection methods. CV for classification, regression and densities.