7 - Model Assessment and Selection

Published on

Slides of a report given at the Machine Learning Seminar Series '11 at Kazan (Volga Region) Federal University. See http://cll.niimm.ksu.ru/cms/main/seminars/mlseminar

Published in: Technology, Business
Comments
  • 'Model selection and validation: Package e1071 has function tune() for hyperparameter tuning, and function errorest() (ipred) can be used for error rate estimation. The cost parameter C for support vector machines can be chosen utilizing the functionality of package svmpath. Functions for ROC analysis and other visualisation techniques for comparing candidate classifiers are available from package ROCR. Package caret provides miscellaneous functions for building predictive models, including parameter tuning and variable importance measures. The package can be used with various parallel implementations (e.g. MPI, NWS etc.).' [Taken from http://cran.r-project.org/web/views/MachineLearning.html]


Transcript

  • 1. Model Assessment and Selection
    Machine Learning Seminar Series '11
    Nikita Zhiltsov, Kazan (Volga Region) Federal University, Russia
    18 November 2011
  • 2. Outline
    1 Bias, Variance and Model Complexity
    2 Nature of Prediction Error
    3 Error Estimation: Analytical Methods (AIC, BIC, SRM Approach)
    4 Error Estimation: Sample Re-use (Cross-validation, Bootstrapping)
    5 Model Assessment in R
  • 3. Outline (repeated as a section divider for Bias, Variance and Model Complexity)
  • 4. Notation
    x = (x_1, ..., x_D) ∈ X, a vector of inputs
    t ∈ T, a target variable
    y(x), a prediction model
    L(t, y(x)), the loss function for measuring errors
    Usual choices for regression: L(t, y(x)) = (y(x) − t)^2 (squared error) or |y(x) − t| (absolute error)
    ... and for classification: L(t, y(x)) = I(y(x) ≠ t) (0-1 loss) or −2 log p_t(x) (log-likelihood loss)
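    These loss functions translate directly into R. A minimal illustrative sketch (not part of the original slides; the function names are invented for the example):

      squared_error  <- function(t, y) (y - t)^2            # regression: squared error
      absolute_error <- function(t, y) abs(y - t)            # regression: absolute error
      zero_one_loss  <- function(t, y) as.numeric(y != t)    # classification: 0-1 loss

      squared_error(3, 2.5)    # 0.25
      absolute_error(3, 2.5)   # 0.5
      zero_one_loss("a", "b")  # 1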
  • 5. Notation (cont.)
    err = (1/N) Σ_{i=1}^N L(t_i, y(x_i)) (training error)
    Err_D = E_D[L(t, y(x))] (test error, i.e. prediction error, for a given training set D)
    Err = E[Err_D] = E[L(t, y(x))] (expected test error)
    NB: Most methods effectively estimate only Err.
  • 6. Typical behavior of test and training error
    Example:
    Training error is not a good estimate of the test error.
    There is some intermediate model complexity that gives the minimum expected test error.
  • 7. Defining our goals
    Model Selection: estimating the performance of different models in order to choose the best one.
    Model Assessment: having chosen a final model, estimating its generalization error on new data.
  • 8. Data-rich situation
    The training set is used to learn the models.
    The validation set is used to estimate prediction error for model selection.
    The test set is used for assessment of the generalization error of the chosen model.
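    As an illustration of the three-way split, a hedged R sketch (not from the slides); the 50/25/25 proportions are arbitrary, and the housing data frame is the one loaded later on slide 29:

      set.seed(1)
      n       <- nrow(housing)    # assumes the housing data frame from slide 29
      idx     <- sample(n)        # random permutation of the row indices
      n_train <- floor(0.50 * n)
      n_valid <- floor(0.25 * n)
      train <- housing[idx[1:n_train], ]
      valid <- housing[idx[(n_train + 1):(n_train + n_valid)], ]
      test  <- housing[idx[(n_train + n_valid + 1):n], ]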
  • 9. Outline (repeated as a section divider for Nature of Prediction Error)
  • 10. Bias-Variance Decomposition
    Let's consider the expected loss E[L] for the regression task:
    E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt
    Under squared error loss, h(x) = E[t|x] = ∫ t p(t|x) dt is the optimal prediction.
    Then E[L] can be decomposed into the sum of three parts:
    E[L] = bias² + variance + noise
    where
    bias² = ∫ (E_D[y(x; D)] − h(x))² p(x) dx
    variance = ∫ E_D[(y(x; D) − E_D[y(x; D)])²] p(x) dx
    noise = ∫∫ (h(x) − t)² p(x, t) dx dt
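    The three terms can be estimated by simulation. A sketch in R (an illustration, not from the slides): the true function h(x), the noise level, and the cubic fit are arbitrary choices.

      set.seed(42)
      h      <- function(x) sin(2 * pi * x)    # assumed "true" regression function
      sigma  <- 0.3                            # noise standard deviation
      x_test <- seq(0, 1, length.out = 50)
      B <- 200                                 # number of simulated training sets D
      preds <- replicate(B, {
        x <- runif(30)                         # draw a training set of size 30
        t <- h(x) + rnorm(30, sd = sigma)
        fit <- lm(t ~ poly(x, 3))              # fit a cubic polynomial
        predict(fit, newdata = data.frame(x = x_test))
      })
      bias2    <- mean((rowMeans(preds) - h(x_test))^2)   # squared bias averaged over x
      variance <- mean(apply(preds, 1, var))              # variance over training sets, averaged over x
      noise    <- sigma^2
      c(bias2 = bias2, variance = variance, noise = noise)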
  • 11. Bias-Variance Decomposition: Examples
    For a linear model y(x, w) = Σ_{j=1}^p w_j x_j (all w_j ≠ 0), the in-sample error is:
    Err = (1/N) Σ_{i=1}^N (ȳ(x_i) − h(x_i))² + (p/N) σ² + σ²
    For a ridge regression model (Tikhonov regularization):
    Err = (1/N) Σ_{i=1}^N {(ŷ(x_i) − h(x_i))² + (ȳ(x_i) − ŷ(x_i))²} + Var + σ²
    where ŷ(x_i) is the best-fitting linear approximation to h.
  • 12. Behavior of bias and variance (figure)
  • 13. Bias-variance tradeoff
    Example: regression with squared loss vs. classification with 0-1 loss.
    In the second case, prediction error is no longer the sum of squared bias and variance
    ⇒ the best choices of tuning parameters may differ substantially in the two settings.
  • 14. Outline (repeated as a section divider for Error Estimation: Analytical Methods)
  • 15. Analytical methods: AIC, BIC, SRM
    They give in-sample estimates in the general form:
    Êrr = err + ŵ
    where ŵ is an estimate of the average optimism.
    By using ŵ, the methods penalize too-complex models.
    Unlike regularization, they do not impose a specific regularization parameter λ.
    Each criterion defines its own notion of model complexity involved in the penalty term.
  • 16. Akaike Information Criterion (AIC)
    Applicable for linear models.
    Either log-likelihood loss or squared error loss is used.
    Given a set of models indexed by a tuning parameter α, denote by d(α) the number of parameters of each model. Then
    AIC(α) = err + 2 (d(α)/N) σ̂²
    where σ̂² is typically estimated by the mean squared error of a low-bias model.
    Finally, we choose the model giving the smallest AIC.
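    In R, base AIC() computes the log-likelihood version of the criterion, −2·logLik + 2·d, which is closely related to the squared-error form above for Gaussian linear models. A small sketch on the housing data used later in the slides (the candidate formulas are invented examples):

      fit_full  <- lm(MEDV ~ ., data = housing)             # all attributes
      fit_small <- lm(MEDV ~ RM + LSTAT, data = housing)    # a smaller candidate model
      AIC(fit_full, fit_small)                              # the model with the smaller AIC is preferred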
  • 17. Akaike Information Criterion (AIC): Example
    Phoneme recognition task (N = 1000).
    The input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies.
    Linear logistic regression is used to predict the phoneme class.
    Here d(α) is the number of basis functions.
  • 18. Bayesian Information Criterion (BIC)
    BIC, like AIC, is applicable in settings where log-likelihood maximization is involved:
    BIC = (N/σ̂²) (err + (log N) (d/N) σ̂²)
    BIC is proportional to AIC, with the factor 2 replaced by log N.
    For N > 8, BIC tends to penalize complex models more heavily than AIC.
    BIC also provides the posterior probability of each model m:
    e^{−BIC_m/2} / Σ_{l=1}^M e^{−BIC_l/2}
    BIC is asymptotically consistent as N → ∞.
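    A matching sketch with base R's BIC(), including the posterior model probabilities from the slide (the candidate models are illustrative, again using the housing data introduced later):

      fits <- list(full  = lm(MEDV ~ ., data = housing),
                   mid   = lm(MEDV ~ RM + LSTAT + PTRATIO, data = housing),
                   small = lm(MEDV ~ RM, data = housing))
      bic  <- sapply(fits, BIC)
      post <- exp(-0.5 * (bic - min(bic)))   # subtracting min(bic) only for numerical stability
      post / sum(post)                       # approximate posterior probability of each model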
  • 19. Structural Risk Minimization
    The Vapnik-Chervonenkis (VC) theory provides a general measure of model complexity and gives associated bounds on the optimism.
    Such a complexity measure, the VC dimension, is defined as follows: the VC dimension of the class of functions {f(x, α)} is the largest number of points that can be shattered by members of {f(x, α)}.
    E.g., a linear indicator function in p dimensions has VC dimension p + 1; sin(αx) has infinite VC dimension.
  • 20. Structural Risk Minimization (cont.)
    If we fit N training points using {f(x, α)} having VC dimension h, then with probability at least 1 − η the following bound holds:
    Err ≤ err + √( (h/N)(ln(2N/h) + 1) − (ln η)/N )
    The SRM approach fits a nested sequence of models of increasing VC dimensions h_1 < h_2 < ... and then chooses the model with the smallest upper bound.
    The SVM classifier efficiently carries out the SRM approach.
    Issues:
    There is difficulty in calculating the VC dimension of a class of functions.
    In practice, the upper bound is often very loose.
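    To get a feeling for how loose the bound can be, here is a small numeric sketch of the formula above (the values of err, h, N and η are made up):

      vc_bound <- function(err, h, N, eta = 0.05) {
        err + sqrt(h / N * (log(2 * N / h) + 1) - log(eta) / N)
      }
      vc_bound(err = 0.10, h = 10,  N = 1000)   # about 0.36: already much larger than the training error
      vc_bound(err = 0.05, h = 500, N = 1000)   # greater than 1: vacuous for 0-1 loss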
  • 21. Outline (repeated as a section divider for Error Estimation: Sample Re-use)
  • 22. Sample re-use: cross-validation, bootstrapping
    These methods directly (and quite accurately) estimate the average generalization error.
    The extra-sample error is evaluated rather than the in-sample one (test input vectors do not need to coincide with training ones).
    They can be used with any loss function and with nonlinear, adaptive fitting techniques.
    However, they may underestimate the true error for fitting methods such as trees.
  • 23. Cross-validation
    Probably the simplest and most widely used method; however, it is time-consuming.
    The CV procedure is as follows:
    1 Split the data into K roughly equal-sized parts.
    2 For the k-th part, fit the model y^{−k}(x) to the other K − 1 parts.
    3 The cross-validation estimate of the prediction error is then
      CV = (1/N) Σ_{i=1}^N L(t_i, y^{−k(i)}(x_i))
    The case K = N (leave-one-out cross-validation) is roughly unbiased, but can have high variance.
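    A bare-bones K-fold CV loop for a linear model, matching the formula above (a sketch, not the slides' code; the later slides use crossval() from the bootstrap package instead):

      cv_error <- function(formula, data, K = 10) {
        n     <- nrow(data)
        folds <- sample(rep(1:K, length.out = n))   # random assignment of rows to folds
        resp  <- all.vars(formula)[1]               # name of the target variable
        se    <- numeric(n)
        for (k in 1:K) {
          fit <- lm(formula, data = data[folds != k, ])   # fit on the other K - 1 parts
          idx <- which(folds == k)
          se[idx] <- (data[[resp]][idx] - predict(fit, newdata = data[idx, ]))^2
        }
        mean(se)   # CV = (1/N) * sum_i L(t_i, y^{-k(i)}(x_i))
      }
      # cv_error(MEDV ~ ., housing)   # e.g. on the housing data loaded later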
  • 24. Cross-validation (cont.)
    In practice, 5- or 10-fold cross-validation is recommended.
    CV tends to overestimate the true prediction error on small datasets.
    Often the one-standard-error rule is used with CV (see the example figure): we choose the most parsimonious model whose error is no more than one standard error above the error of the best model.
    In the example, a model with p = 9 would be chosen.
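    The one-standard-error rule itself is easy to express in R; a sketch with made-up numbers (cv_err and cv_se stand for hypothetical per-model CV errors and their standard errors):

      p      <- 1:6                                  # candidate model sizes
      cv_err <- c(40, 30, 25, 23, 22.5, 22.4)        # hypothetical CV error for each p
      cv_se  <- c(3.0, 2.5, 2.0, 1.8, 1.8, 1.8)      # hypothetical standard errors
      best      <- which.min(cv_err)
      threshold <- cv_err[best] + cv_se[best]
      min(p[cv_err <= threshold])                    # most parsimonious model within one SE (here p = 4)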
  • 25. Bootstrapping
    A general method for assessing statistical accuracy.
    Given a training set, the bootstrapping procedure steps are:
    1 Randomly draw datasets with replacement from it; each sample is of the same size as the original one.
    2 This is done B times, producing B bootstrap datasets.
    3 Fit the model to each of the bootstrap datasets.
    4 Examine the prediction error using the original training set as a test set:
      Êrr_boot = (1/N) Σ_{i=1}^N (1/|C^{−i}|) Σ_{b ∈ C^{−i}} L(t_i, y^{*b}(x_i))
      where C^{−i} is the set of indices of the bootstrap samples that do not contain observation i.
    To alleviate the upward bias, the .632 estimator is used:
      Êrr^(.632) = 0.368 err + 0.632 Êrr_boot
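    A hedged sketch (not the slides' code) of the leave-one-out bootstrap error Êrr_boot and the .632 estimator for a linear model:

      boot632 <- function(formula, data, B = 100) {
        n     <- nrow(data)
        resp  <- all.vars(formula)[1]
        err_i <- matrix(NA, nrow = n, ncol = B)   # loss of observation i under bootstrap fit b
        for (b in 1:B) {
          idx <- sample(n, replace = TRUE)        # bootstrap sample of the same size as the data
          fit <- lm(formula, data = data[idx, ])
          out <- setdiff(1:n, idx)                # observations not contained in bootstrap sample b
          err_i[out, b] <- (data[[resp]][out] - predict(fit, newdata = data[out, ]))^2
        }
        err_boot  <- mean(rowMeans(err_i, na.rm = TRUE), na.rm = TRUE)   # average only over b in C^{-i}
        err_train <- mean(residuals(lm(formula, data = data))^2)         # training error err
        0.368 * err_train + 0.632 * err_boot                             # the .632 estimator
      }
      # boot632(MEDV ~ ., housing)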
  • 26. Outline (repeated as a section divider for Model Assessment in R)
  • 27. http://r-project.org
    A free software environment for statistical computing and graphics.
    R packages for machine learning and data mining: kernlab, rpart, randomForest, animation, gbm, tm, etc.
    R packages for evaluation: bootstrap, boot.
    RStudio IDE.
  • 28. Housing dataset at the UCI Machine Learning repository
    http://archive.ics.uci.edu/ml/datasets/Housing
    Housing values in suburbs of Boston.
    506 instances, 13 attributes + 1 numeric class attribute (MEDV).
  • 29. Loading data in R
      housing <- read.table("~/projects/r/housing.data", header=T)
      attach(housing)
  • 30. Cross-validation example in R
    Helper function: creating a function using crossval() from the bootstrap package.
      eval <- function(fit, k=10){
        require(bootstrap)
        theta.fit <- function(x,y){lsfit(x,y)}
        theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
        x <- fit$model[,2:ncol(fit$model)]
        y <- fit$model[,1]
        results <- crossval(x,y,theta.fit,theta.predict,ngroup=k)
        squared.error <- sum((y-results$cv.fit)^2)/length(y)
        cat("Cross-validated squared error =", squared.error, "\n")
      }
  • 31. Cross-validation example in R: Model assessment
      fit <- lm(MEDV ~ ., data=housing)   # A linear model that uses all the attributes
      eval(fit)
      Cross-validated squared error = 23.15827
      fit <- lm(MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS, data=housing)   # Less complex model
      eval(fit)
      Cross-validated squared error = 23.24319
      fit <- lm(MEDV ~ RM, data=housing)   # Too simple model
      eval(fit)
      Cross-validated squared error = 44.38424
  • 32. Bootstrapping example in R
    Helper function: creating a function using the boot() function from the boot package.
      sqer <- function(formula, data, indices){
        d <- data[indices,]
        fit <- lm(formula, data=d)
        return(sum(fit$residuals^2)/length(fit$residuals))
      }
  • 33. Bootstrapping example in R: Model assessment
      results <- boot(data=housing, statistic=sqer, R=1000, formula=MEDV ~ .)   # 1000 bootstrapped datasets
      print(results)
      Bootstrap Statistics :
            original       bias    std. error
      t1*   21.89483   -0.76001      2.296025
      results <- boot(data=housing, statistic=sqer, R=1000,
                      formula=MEDV ~ ZN+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT+CRIM+CHAS)
      print(results)
      Bootstrap Statistics :
            original        bias    std. error
      t1*   22.88726  -0.5400892      2.744437
      results <- boot(data=housing, statistic=sqer, R=1000, formula=MEDV ~ RM)
      print(results)
      Bootstrap Statistics :
            original        bias    std. error
      t1*   43.60055  -0.3379168      5.407933
  • 34. Resources
    T. Hastie, R. Tibshirani, J. Friedman. The Elements of Statistical Learning, 2008.
    Stanford Engineering Everywhere CS229 Machine Learning, Handouts 4 and 5.
    http://videolectures.net/stanfordcs229f07_machine_learning/
