Linear Model Selection and Regularization
[ISLR.2013.Ch6-6]
Theodore Grammatikopoulos∗
Tue 6th Jan, 2015
Abstract
The linear model has distinct advantages in terms of inference and, on real-world
problems, is often surprisingly competitive in relation to non-linear methods.
Here, we will discuss some ways in which the simple linear model can be improved, by
replacing plain least squares fitting with the alternative fitting procedures of (i) Subset
Selection, (ii) Shrinkage and (iii) Dimension Reduction. The reason to search for such
improvements is twofold: (a) Prediction accuracy: In cases where the number of observations
n is not much larger than the number of predictors p, there can be
a lot of variability in the least squares fit, resulting in models with poor predictive
capabilities. Furthermore, when p > n there is no longer a unique least squares
coefficient estimate. By constraining or shrinking the estimated coefficients, we can
often substantially reduce the variance at the cost of a negligible increase in bias and
thereby improve the accuracy of our models. (b) Model Interpretability: It is often the
case that some or many of the variables used in a multiple regression model are in
fact not associated with the response. By removing such irrelevant variables we can
obtain a model that is more easily interpreted. Here, we also discuss methods of
automatically performing feature selection or variable selection.
## OTN License Agreement: Oracle Technology Network - Developer
## Oracle Distribution of R version 3.0.1 (--) Good Sport
## Copyright (C) The R Foundation for Statistical Computing
## Platform: x86_64-unknown-linux-gnu (64-bit)

∗ e-mail: tgrammat@gmail.com
1 Subset Selection Methods
1.1 Best Subset Selection
Here, given the Hitters{ISLR} data, we want to predict a baseball player’s Salary on
the basis of various statistics associated with their performance in previous years.
Note that the Salary column of the Hitters data has some missing values. Therefore, we
should remove them before applying any variable selection method.
library(ISLR)
dim(Hitters)
## [1] 322 20
sum(is.na(Hitters$Salary))
## [1] 59
sum(is.na(Hitters))
## [1] 59
Hitters.Cleaned <- na.omit(Hitters)
dim(Hitters.Cleaned)
## [1] 263 20
attach(Hitters.Cleaned)
To perform the Best Subset Selection method and plot a table of models with the suggested
predictor variables we use the regsubsets() function of the leaps R library.
library(leaps)
reg.fit.Hitters.Cleaned <- regsubsets(Salary ~ ., data = Hitters.Cleaned,
    method = "exhaustive")
summary(reg.fit.Hitters.Cleaned)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters.Cleaned,
##     method = "exhaustive")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
## 1 ( 1 ) " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " "
## CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
## 1 ( 1 ) " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " " " " " "
## 3 ( 1 ) " " " " "*" " " " " " " "*"
## 4 ( 1 ) " " " " "*" " " " " "*" "*"
## 5 ( 1 ) " " " " "*" " " " " "*" "*"
## 6 ( 1 ) " " " " "*" " " " " "*" "*"
## 7 ( 1 ) "*" " " " " " " " " "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " "*" "*"
## Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
In the resulting table above, the asterisk indicates that a given variable is included in the
corresponding model. For instance, this output indicates that the best two-variable model
contains only Hits and CRBI. Here, we have chosen to perform the "exhaustive"
subset selection method, i.e. we scanned all possible subset models for each
number of predictors. The regsubsets() function returns by default models with a
maximum of 8 predictors (nvmax = 8), but this can be changed accordingly.
reg.fitFull.Hitters.Cleaned <- regsubsets(Salary ~ ., data = Hitters.Cleaned,
    method = "exhaustive", nvmax = 19)
reg.summary.Hitters.Cleaned <- summary(reg.fitFull.Hitters.Cleaned)
Plotting RSS, adjR2, Cp and BIC for all the models that the full run of
regsubsets() suggests, we can gain a better understanding of what the best choice
would probably be.
par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 4, 0))

# RSS Plot
plot(reg.summary.Hitters.Cleaned$rss, xlab = "Number of Variables",
    ylab = "RSS", type = "l")
# adjR2 Plot
plot(reg.summary.Hitters.Cleaned$adjr2, xlab = "Number of Variables",
    ylab = "Adjusted RSq", type = "l")
which.max.adjr2 <- which.max(reg.summary.Hitters.Cleaned$adjr2)
points(which.max.adjr2, reg.summary.Hitters.Cleaned$adjr2[which.max.adjr2],
    col = "red", cex = 2, pch = 4)
legend("topright", inset = 0.05, paste("Best p =", as.character(which.max.adjr2)),
    text.col = "red")

# Cp Plot
plot(reg.summary.Hitters.Cleaned$cp, xlab = "Number of Predictors",
    ylab = "Cp", type = "l")
which.min.cp <- which.min(reg.summary.Hitters.Cleaned$cp)
points(which.min.cp, reg.summary.Hitters.Cleaned$cp[which.min.cp],
    col = "red", cex = 2, pch = 4)
legend("topright", inset = 0.05, paste("Best p =", as.character(which.min.cp)),
    text.col = "red")

# BIC Plot
plot(reg.summary.Hitters.Cleaned$bic, xlab = "Number of Predictors",
    ylab = "BIC", type = "l")
which.min.bic <- which.min(reg.summary.Hitters.Cleaned$bic)
points(which.min.bic, reg.summary.Hitters.Cleaned$bic[which.min.bic],
    col = "red", cex = 2, pch = 4)
legend("topright", inset = 0.05, paste("Best p =", as.character(
which.min.bic)),
text.col = "red")
# Overall Title
title(main = "RSS, adjR2, Cp, BIC Plots vs Number of Predictors 
n[regsubsets - Hitters]",
outer = TRUE)
Figure 1: RSS, adjR2, Cp, BIC Plots vs “Number of Predictors” as calculated by the
regsubsets() R function [Hitters].
Note that the BIC criterion suggests a smaller number of predictors, i.e. p = 6, as the
best fit. This is expected, since by construction the BIC criterion for linear regression (LR)
models introduces a larger penalty for models with more predictors. More specifically, the
AIC criterion for LR models is given by the formula

$$\mathrm{AIC} = \frac{1}{n \hat{\sigma}^2} \left( \mathrm{RSS} + 2\, d\, \hat{\sigma}^2 \right), \tag{1}$$

whereas the BIC criterion is quantified by

$$\mathrm{BIC} \sim \frac{1}{n} \left( \mathrm{RSS} + \log(n)\, d\, \hat{\sigma}^2 \right), \tag{2}$$

and log(n) > 2 for n > 7. Note also that $\hat{\sigma}^2$ is the estimated variance of the errors
associated with the LR fit. The Cp (AIC) and adjR2 criteria, on the other hand, suggest 10 and 11
predictors respectively as the best choice of LR fit.
To be more specific about the particular subset of variables suggested by the
regsubsets() function for each selection criterion, we use the built-in
plot.regsubsets() function:
par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(reg.fitFull.Hitters.Cleaned, scale = "r2", main = "R2: criterion")
plot(reg.fitFull.Hitters.Cleaned, scale = "adjr2", main = "adjR2: criterion")
plot(reg.fitFull.Hitters.Cleaned, scale = "Cp", main = "Cp: criterion")
plot(reg.fitFull.Hitters.Cleaned, scale = "bic", main = "BIC: criterion")

# Title Plot
title(main = "Best Subset Selection of Variables \n regsubsets [Hitters]",
    outer = TRUE)
Figure 2: Subset of selected variables as suggested by the regsubsets() R function for
each selection criterion, R2, adjR2, Cp and BIC [Hitters].
The plots in Figure 2 reveal that the regsubsets() function returns

• adjR2 criterion: (AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWalks, LeagueN, DivisionW, PutOuts, Assists) as the best selected variables

• Cp criterion: (AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWalks, DivisionW, PutOuts, Assists)

• BIC criterion: (AtBat, Hits, Walks, CRBI, DivisionW, PutOuts)
To obtain the coefficient estimates of this last model:
coef(reg.fitFull.Hitters.Cleaned, 6)
## (Intercept) AtBat Hits Walks
## 91.5117981 -1.8685892 7.6043976 3.6976468
## CRBI DivisionW PutOuts
## 0.6430169 -122.9515338 0.2643076
1.2 Forward and Backward Step-wise Selection
We can also use the regsubsets() function to perform forward step-wise or backward
step-wise selection, using the argument method = "forward" or method = "backward"
respectively. Note that in our case the number of observations n is far larger than the
number of predictors p, i.e. n ≫ p, so Backward Step-wise Selection can be applied
just as Best Subset Selection could previously.
# Fwd Stepwise Selection
reg.fitFwd.Hitters.Cleaned <- regsubsets(Salary ~ ., data = Hitters.Cleaned,
    nvmax = 19, method = "forward")

# Bwd Stepwise Selection
reg.fitBwd.Hitters.Cleaned <- regsubsets(Salary ~ ., data = Hitters.Cleaned,
    nvmax = 19, method = "backward")
More specifically, in the Forward Step-wise case the particular subset of variables
selected by the regsubsets() function is shown in Figure 3.
par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(reg.fitFwd.Hitters.Cleaned, scale = "r2", main = "R2: criterion")
plot(reg.fitFwd.Hitters.Cleaned, scale = "adjr2", main = "adjR2: criterion")
plot(reg.fitFwd.Hitters.Cleaned, scale = "Cp", main = "Cp: criterion")
plot(reg.fitFwd.Hitters.Cleaned, scale = "bic", main = "BIC: criterion")

# Title Plot
title(main = "Fwd Stepwise Selection of Variables \n regsubsets [Hitters]",
    outer = TRUE)
Figure 3: Subset of selected variables as suggested by the regsubsets() R function for
each selection criterion, R2, adjR2, Cp and BIC [Fwd Step-wise Selection, Hitters].
In the Backward Step-wise case the selected subset of variables returned by the
regsubsets() function is shown in Figure 4 below.
par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(reg.fitBwd.Hitters.Cleaned, scale = "r2", main = "R2: criterion")
plot(reg.fitBwd.Hitters.Cleaned, scale = "adjr2", main = "adjR2: criterion")
plot(reg.fitBwd.Hitters.Cleaned, scale = "Cp", main = "Cp: criterion")
plot(reg.fitBwd.Hitters.Cleaned, scale = "bic", main = "BIC: criterion")

# Title Plot
title(main = "Bwd Stepwise Selection of Variables \n regsubsets [Hitters]",
    outer = TRUE)
Comparing these results across all the statistical criteria, we find that the two methods,
Best Subset and Forward Step-wise Variable Selection, have the same outcome.
The only difference is observed under the Backward Step-wise Selection procedure and
for the BIC criterion, where the CRuns and CWalks variables are also suggested by the
model. However, this is easily justified by the larger penalty that the BIC information
criterion introduces for models with more predictors. That is, during the Forward Step-wise
Selection procedure the underlying algorithm is more reluctant to add new variables
to the model's predictor space when the BIC criterion is used, which results in a smaller,
and finally different, model compared with the one that the Backward Step-wise Selection
method gives. A quick programmatic check of such comparisons is sketched below.
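As an illustrative sketch (not part of the original analysis), the variable sets selected at a given size by two fitted regsubsets objects can be compared directly via their coefficient names:

# Sketch: do best-subset and forward selection agree on the 6-variable model?
# setdiff() returns character(0) when the two sets of names are identical.
setdiff(names(coef(reg.fitFull.Hitters.Cleaned, 6)),
    names(coef(reg.fitFwd.Hitters.Cleaned, 6)))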
To see the specific coefficients of the models that the forward and backward step-wise
selection methods suggest:
# Fwd Stepwise Model - BIC
coef(reg.fitFwd.Hitters.Cleaned, 6)
## (Intercept) AtBat Hits Walks
## 91.5117981 -1.8685892 7.6043976 3.6976468
## CRBI DivisionW PutOuts
## 0.6430169 -122.9515338 0.2643076
# Bwd Stepwise Model - BIC
coef(reg.fitBwd.Hitters.Cleaned, 8)
## (Intercept) AtBat Hits Walks
## 117.1520434 -2.0339209 6.8549136 6.4406642
## CRuns CRBI CWalks DivisionW
## 0.7045391 0.5273238 -0.8066062 -123.7798366
## PutOuts
## 0.2753892
Figure 4: Subset of selected variables as suggested by the regsubsets() R function for
each selection criterion, R2, adjR2, Cp and BIC [Bwd Step-wise Selection, Hitters].
1.3 Cross-Validation: Choosing Among Models
In the previous section we chose among a set of models of different sizes using the Cp,
BIC, and adjR2 statistical criteria. As an alternative to these methods, we can directly
estimate the test error using the Validation-Set approach and Cross-Validation (CV) methods.
The main advantages of the CV methods are:

• a direct estimate of the test error,
• fewer assumptions about the true underlying model,
• applicability to a wider range of model selection tasks.

In the past, performing CV was computationally prohibitive for many problems with large
p and/or n. Therefore, AIC, Cp, BIC and adjR2 were more attractive approaches to
choose a model. Nowadays, however, with fast computing machines the required
computations are no longer an issue.
In order for these approaches to yield accurate estimates of the test error, we must use
only the training observations to perform all aspects of model-fitting, including the variable
selection. If the full data set is used to perform the best subset selection step, the cross-
validation errors that we obtain will not be accurate estimates of the test error (over-fitting).
More specifically, k-fold cross-validation randomly divides the initial data set into
k parts (folds), and repeatedly fits the model on k − 1 of the folds while testing it on
the remaining one. The cross-validation error returned by this procedure is, for a
quantitative response,

$$\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i \,,$$

or

$$\mathrm{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i \neq \hat{y}_i) \,,$$

for a categorical one.
Cross-validation (CV) methods have two major advantages in comparison with the simple
Validation-Set approach: (1) they are less sensitive to the particular split of the observations,
and (2) they span the observational data set more evenly, giving each fold in turn the role
of test data while the remaining folds form the training pool. Therefore, cross-validation
methods usually avoid over-estimation of the test error rate.

Note also that k-fold CV with k < n is less computationally intensive than the so-called
Leave-One-Out Cross-Validation (LOOCV), i.e. CV(n), and is therefore usually preferred for
large data set studies.
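Before turning to a packaged solution, the sketch below shows how such a k-fold CV could be run by hand over the regsubsets() models. The helper predict.regsubsets() is our own (hypothetical) addition, since the leaps library provides no predict() method for regsubsets objects.

# Hypothetical helper: predictions from the size-id model of a regsubsets fit
predict.regsubsets <- function(object, newdata, id, ...) {
    form <- as.formula(object$call[[2]])
    mat <- model.matrix(form, newdata)
    coefi <- coef(object, id = id)
    drop(mat[, names(coefi)] %*% coefi)
}

# 10-fold CV: fit on k - 1 folds, test on the remaining one, for every size
k <- 10
set.seed(11)
folds <- sample(rep(1:k, length = nrow(Hitters.Cleaned)))
cv.errors <- matrix(NA, k, 19)
for (j in 1:k) {
    fit <- regsubsets(Salary ~ ., data = Hitters.Cleaned[folds != j, ],
        nvmax = 19)
    for (i in 1:19) {
        pred <- predict(fit, Hitters.Cleaned[folds == j, ], id = i)
        cv.errors[j, i] <- mean((Hitters.Cleaned$Salary[folds == j] - pred)^2)
    }
}
which.min(colMeans(cv.errors))  # subset size with the smallest mean CV error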
The best way to apply such methods to our example data set, Hitters, is by utilizing
functions of the bestglm R library.
library(bestglm)
set.seed(564)
train <- sample(c(TRUE, FALSE), nrow(Hitters.Cleaned), rep = TRUE)
test <- (!train)
set.seed(328)
X <- Hitters.Cleaned[train, -19]
y <- Hitters.Cleaned[train, 19]
Xy <- cbind(X, y)
# delete-d CV (d = 10), with random subsamples [Shao, 1997]
bestglm.CVd.Hitters <- bestglm(Xy, family = gaussian, IC = "CV",
    CVArgs = list(Method = "d", K = 10, REP = 1), TopModels = 5,
    method = "exhaustive", nvmax = "default")
## Note: binary categorical variables converted to 0-1 so 'leaps' could be used.
bestglm.CVd.Hitters$Title
## [1] "CVd(d = 10, REP = 1)nNo BICq equivalent"
bestglm.CVd.Hitters$ModelReport
## $NullModel
## Deviance DF
## 26971550 128
##
## $LEAPSQ
## [1] TRUE
##
## $glmQ
## [1] FALSE
##
## $gaussianQ
## [1] TRUE
##
## $NumDF
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $CategoricalQ
## [1] TRUE
##
## $IncludeInterceptQ
## [1] TRUE
##
## $Bestk
## [1] 7
bestglm.CVd.Hitters$BestModel
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) Walks CAtBat CHits
## 236.0867 4.4159 -0.4669 1.6483
## CRBI CWalks DivisionW PutOuts
## 0.8894 -0.3068 -136.1346 0.2146
Alternatively, we can use either (1) the “HTF k-fold CV” method (CVHTM) [Hastie et al., 2011] or
(2) the “adjusted k-fold CV” (CVDH) [Davison, 2010], by passing the appropriate
arguments to the bestglm() function:
# Hastie et al. K-fold CV (CVHTM)
bestglm.CVHTM.Hitters <- bestglm(Xy, family = gaussian, IC = "CV",
    CVArgs = list(Method = "HTF", K = 10, REP = 1), TopModels = 5,
    method = "exhaustive", nvmax = "default")
## Note: binary categorical variables converted to 0-1 so 'leaps' could be used.
bestglm.CVHTM.Hitters$Title
## [1] "CV(K = 10, REP = 1)nBICq equivalent for q in
(1.52529666674894e-08, 0.050282763767736)"
bestglm.CVHTM.Hitters$ModelReport
## $NullModel
## Deviance DF
## 26971550 128
##
## $LEAPSQ
## [1] TRUE
##
## $glmQ
## [1] FALSE
##
## $gaussianQ
## [1] TRUE
##
## $NumDF
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $CategoricalQ
## [1] TRUE
##
## $IncludeInterceptQ
## [1] TRUE
##
## $Bestk
## [1] 1
bestglm.CVHTM.Hitters$BestModel
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) CRBI
## 319.7668 0.7198
# adjusted K-fold CV (CVDH)
bestglm.CVDH.Hitters <- bestglm(Xy, family = gaussian, IC = "CV",
    CVArgs = list(Method = "DH", K = 10, REP = 1), TopModels = 5,
    method = "exhaustive", nvmax = "default")
## Note: binary categorical variables converted to 0-1 so 'leaps' could be used.
bestglm.CVDH.Hitters$Title
## [1] "CVAdj(K = 10, REP = 1)nBICq equivalent for q in
(0.855658979391075, 0.862338943358585)"
bestglm.CVDH.Hitters$ModelReport
## $NullModel
## Deviance DF
## 26971550 128
##
## $LEAPSQ
## [1] TRUE
##
## $glmQ
## [1] FALSE
##
## $gaussianQ
## [1] TRUE
##
## $NumDF
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $CategoricalQ
## [1] TRUE
##
## $IncludeInterceptQ
## [1] TRUE
##
## $Bestk
## [1] 10
bestglm.CVDH.Hitters$BestModel
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) AtBat Hits Walks
## 329.5892 -2.5351 7.1477 6.7482
## CAtBat CRuns CRBI CWalks
## -0.1742 1.8657 0.7837 -1.1051
## DivisionW PutOuts Assists
## -119.8939 0.2362 0.2686
2 Ridge Regression and the Lasso
In this section we are going to use the so-called shrinkage methods to better fit linear
predictive models and describe the Salary response in terms of other predictors.
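For reference, both shrinkage methods minimize the residual sum of squares plus a penalty on the size of the coefficients,

$$\hat{\beta}^{\,\mathrm{ridge}} = \arg\min_{\beta} \Big\{ \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}\,, \qquad \hat{\beta}^{\,\mathrm{lasso}} = \arg\min_{\beta} \Big\{ \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}\,,$$

and it is the ℓ1 penalty of the lasso that allows some coefficients to be set exactly to zero.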
We will use the glmnet package in order to perform both the Ridge Regression and the Lasso
shrinkage methods. Its fitting functions have a slightly different syntax from what we have
used so far. In particular, we must pass in an X matrix of predictors as well as a y vector
of responses.
X <- model.matrix(Salary ~ ., Hitters.Cleaned)[, -1]
y <- Hitters.Cleaned$Salary
The model.matrix() function is particularly useful in cases like this; not only does it
produce a matrix corresponding to the 19 predictors, but it also automatically transforms
any categorical variables into dummy variables. This is important because the glmnet()
function can only be applied to numerical inputs.
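As a quick illustrative check, the two-level League factor becomes a single 0/1 dummy column in X:

levels(Hitters.Cleaned$League)  # "A" "N"
head(X[, "LeagueN"])  # 1 when League == "N", 0 otherwise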
2.1 Ridge Regression
Here, we fit a Ridge Regression linear model, leaving the model's tuning parameter λ to
be determined by a cross-validation procedure.
library(glmnet)
## Loading required package: Matrix
## Loading required package: lattice
## Loaded glmnet 1.9-8
# Ridge Regression (alpha = 0) - Cross-Validation with
# nfolds = sample size (LOOCV)
set.seed(456)

# LOOCV Procedure: nfolds must be passed to cv.glmnet explicitly
nfolds <- sum(train)
glmnet.ridge.fit <- cv.glmnet(X[train, ], y[train], type.measure = "mse",
    family = "gaussian", alpha = 0, standardize = TRUE,
    nfolds = nfolds, grouped = FALSE)
The Mean-Squared Error, MSE, plot vs the log(Lambda) tuning parameter is given in
Figure 5 below.
par(mfrow = c(1, 1), mar = c(4, 4, 2, 2), oma = c(0, 0, 5, 0))
plot(glmnet.ridge.fit)
title(main = "MSE vs Log(Lambda) n Ridge Regression [LOOCV,
Hitters]",
outer = TRUE)
Figure 5: Mean-Squared Error, MSE, vs the Log(Lambda) tuning parameter - Ridge Re-
gression Fit under LOOCV procedure (Hitters).
In Figure 6 we plot the ridge regression coefficients vs the control parameter λ, and vs the
fraction of (null) deviance explained by the model.
par(mfrow = c(1, 2), mar = c(4, 4, 2, 2), oma = c(0, 0, 5, 0))

# Minimum Mean CV Error
cvm.Min.Index <- which.min(glmnet.ridge.fit$cvm)

# Ridge Regression coefs vs Lambda Control Parameter
plot(glmnet.ridge.fit$glmnet.fit, xvar = "lambda", label = TRUE)
abline(v = log(glmnet.ridge.fit$lambda[cvm.Min.Index]), lty = "dashed",
    col = "black")

# Ridge Regression coefs vs Deviance Ratio (plotted on a linear scale, no log)
plot(glmnet.ridge.fit$glmnet.fit, xvar = "dev", label = TRUE)
abline(v = glmnet.ridge.fit$glmnet.fit$dev.ratio[cvm.Min.Index],
    lty = "dashed", col = "black")

title(main = "Ridge Regression Coefficients vs Log(Lambda) and \n Fraction of Deviance Explained [glmnet(LOOCV), Hitters]",
    outer = TRUE)
Figure 6: Ridge regression coefficients vs the Log(Lambda) control parameter and the
fraction of Dev.Ratio explained (glmnet under LOOCV procedure, Hitters).
The dashed vertical line in the figure above denotes the best λ parameter as calculated
by the LOOCV procedure. Note that, as expected, none of the regression coefficients is
driven exactly to zero by this shrinkage method. A quick illustrative check follows.
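As a sketch (the count depends on the CV run, but for ridge all predictors should survive), we can count the non-zero coefficients at the selected λ:

ridge.coefs <- coef(glmnet.ridge.fit, s = "lambda.min")
sum(ridge.coefs[-1] != 0)  # number of non-zero coefficients, intercept excluded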
Finally, we apply the fitted ridge regression model to the test data set, using the λ
parameter determined by the cross-validation procedure above. We also calculate
the corresponding Mean-Squared Error for the test part of Hitters.
predict(glmnet.ridge.fit, X[test, ], type = "coefficients", s = "lambda.min")
## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 178.489344920
## AtBat 0.035448419
## Hits 0.608045444
## HmRun 0.930883558
## Runs 0.698708995
## RBI 0.654999526
## Walks 1.466577745
## Years 1.818638592
## CAtBat 0.009877955
## CHits 0.051122325
## CHmRun 0.330954429
## CRuns 0.096656674
## CRBI 0.107287429
## CWalks 0.056021178
## LeagueN -0.697704037
## DivisionW -70.184986685
## PutOuts 0.151336492
## Assists -0.027297267
## Errors -2.055848395
## NewLeagueN -12.596879362
glmnet.ridge.pred <- predict(glmnet.ridge.fit, X[test, ], type = "response",
    s = "lambda.min")
# test MSE
mean((glmnet.ridge.pred - y[test])^2)
## [1] 97569.09
2.2 The Lasso
In this section we fit a Lasso linear model, leaving the model's tuning parameter λ to be
determined by a cross-validation procedure.
# Lasso (alpha = 1) - Cross-Validation with nfolds = sample size (LOOCV)
set.seed(234)

# LOOCV Procedure: nfolds must be passed to cv.glmnet explicitly
nfolds <- sum(train)
glmnet.lassoLOOCV.fit <- cv.glmnet(X[train, ], y[train], type.measure = "mse",
    family = "gaussian", alpha = 1, standardize = TRUE,
    nfolds = nfolds, grouped = FALSE)
The Mean-Squared Error, MSE, plot vs the log(Lambda) tuning parameter is given in
Figure 7.
par(mfrow = c(1, 1), mar = c(4, 4, 2, 2), oma = c(0, 0, 5, 0))
plot(glmnet.lassoLOOCV.fit)
title(main = "MSE vs Log(Lambda) n Lasso Linear Fit [LOOCV,
Hitters]",
outer = TRUE)
In Figure 8 we plot the Lasso coefficients vs the control parameter λ, and vs the fraction
of (null) deviance explained by the model.
par(mfrow = c(1, 2), mar = c(4, 4, 2, 2), oma = c(0, 0, 5, 0))

# Minimum Mean CV Error
cvm.Min.Index <- which.min(glmnet.lassoLOOCV.fit$cvm)

# Lasso coefs vs Lambda Control Parameter
plot(glmnet.lassoLOOCV.fit$glmnet.fit, xvar = "lambda", label = TRUE)
abline(v = log(glmnet.lassoLOOCV.fit$lambda[cvm.Min.Index]),
    lty = "dashed", col = "black")

# Lasso coefs vs Deviance Ratio (plotted on a linear scale, no log)
plot(glmnet.lassoLOOCV.fit$glmnet.fit, xvar = "dev", label = TRUE)
abline(v = glmnet.lassoLOOCV.fit$glmnet.fit$dev.ratio[cvm.Min.Index],
    lty = "dashed", col = "black")

title(main = "Lasso Coefficients vs Log(Lambda) and \n Fraction of Deviance Explained [glmnet(LOOCV), Hitters]",
    outer = TRUE)
Figure 8: Lasso coefficients vs the Log(Lambda) control parameter and the fraction of
Dev.Ratio explained (glmnet under LOOCV procedure, Hitters).
The dashed vertical line in the figure above denotes the best λ parameter as calculated
by the LOOCV procedure. Here, in contrast with the ridge regression case, some of the
coefficients are driven exactly to zero by the shrinkage method.
Indeed, applying the fitted lasso model to the test data set, using the λ parameter
determined by the cross-validation procedure above, we get the following results:
predict(glmnet.lassoLOOCV.fit, X[test, ], type = "coefficients", s = "lambda.min")
## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 156.72063595
## AtBat .
## Hits 1.15605082
## HmRun .
## Runs .
## RBI .
## Walks 2.18274618
## Years .
## CAtBat .
## CHits 0.01354589
## CHmRun .
## CRuns 0.12054023
## CRBI 0.44744187
## CWalks .
## LeagueN .
## DivisionW -108.31835711
## PutOuts 0.24083595
## Assists .
## Errors -1.85682194
## NewLeagueN .
glmnet.lassoLOOCV.pred <- predict(glmnet.lassoLOOCV.fit, X[test, ],
    type = "response", s = "lambda.min")
# test MSE
mean((glmnet.lassoLOOCV.pred - y[test])^2)
## [1] 89742.07
Note that the corresponding test Mean-Squared Error, MSE, is noticeably lower than the
one found above with the ridge regression fit.
3 PCR and PLS Regression
3.1 Principal Components Regression
Principal components regression (PCR) can be performed using the pcr() function,
which is part of the pls R library. We now apply PCR to the Hitters data of our
example and test how successfully we can predict the Salary variable. As before, we
use a cleaned version of the data set, i.e. a data.frame without missing values.
In addition, we standardize all the predictors prior to applying the method.
library(pls)
set.seed(54)
In order to better estimate the model test error we perform the PCR calculation using
LOOCV; the data set is not large, so the method does not need great computational
power. For a larger data set we could alternatively use validation = "CV". Summarizing
our current results:
pcr.fit.Hitters <- pcr(Salary ~ ., data = Hitters.Cleaned, subset = train,
    scale = TRUE, validation = "LOO")
summary(pcr.fit.Hitters)
## Data: X dimension: 129 19
## Y dimension: 129 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 129 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## CV 460.8 390.9 395.3 397.2 398.5
## adjCV 460.8 390.9 395.3 397.2 398.4
## 5 comps 6 comps 7 comps 8 comps 9 comps
## CV 400.6 392.9 396.2 400.7 403.4
## adjCV 400.5 392.7 396.1 400.6 403.2
## 10 comps 11 comps 12 comps 13 comps 14 comps
## CV 413.7 419.9 423.0 442.3 438.4
## adjCV 413.5 419.7 422.7 442.0 438.0
## 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 431.1 419.8 424.2 422.8 436.4
## adjCV 430.8 419.5 423.8 422.4 435.9
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 38.34 61.02 70.97 78.61 84.17
## Salary 31.51 31.82 31.82 31.96 32.65
## 6 comps 7 comps 8 comps 9 comps 10 comps
## X 88.97 92.32 95.20 96.45 97.58
## Salary 37.07 37.36 37.36 37.45 37.98
## 11 comps 12 comps 13 comps 14 comps 15 comps
## X 98.26 98.82 99.28 99.54 99.78
## Salary 38.02 38.05 38.46 40.76 42.71
## 16 comps 17 comps 18 comps 19 comps
## X 99.93 99.98 99.99 100.00
## Salary 45.34 47.40 48.75 48.83
The CV score is provided for each possible number of components, ranging from M = 0
onwards. Note that pcr() reports the root mean squared error of prediction (RMSEP); in
order to obtain the usual MSE, we must square this quantity.
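For example, the CV row of the estimates array returned by the RMSEP() helper of the pls package can be squared directly (the indexing below assumes a single response, as here):

rmsep.pcr <- RMSEP(pcr.fit.Hitters)
(rmsep.pcr$val["CV", 1, ])^2  # CV MSE for each model size, M = 0, 1, ..., 19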
To obtain a graphical representation of the resulting cross-validation scores:

par(mfrow = c(1, 1))
validationplot(pcr.fit.Hitters, val.type = "MSEP", xlab = "Number of Predictors",
    ylab = "MSE", main = "MSE vs Number of Predictors [PCR - Hitters]")
From Figure 9 we see that the cross-validation error is smallest when very few components
are used: the printed CV scores attain their minimum at M = 1 and are nearly as small at
M = 2, the number of components we use below.
The summary() function also provides the percentage of variance explained in the pre-
dictors and in the response using different numbers of components. This concept is
discussed in greater detail in later articles. Briefly, we can think of this as the amount
of information about the predictors or the response that is captured using M principal
components.
Next, we evaluate the model’s predictive accuracy by applying the PCR fit on the test data
set.
pcr.pred.Hitters <- predict(pcr.fit.Hitters, X[test, ], ncomp = 2)
mean((pcr.pred.Hitters - y[test])^2)
## [1] 104209.3
The test MSE of the model is comparable with the results obtained using ridge regression
and the lasso. However, PCR models are more difficult to interpret, because the method
performs no variable selection and does not directly produce coefficient estimates.
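That said, the implied regression coefficients on the original predictor scale can still be recovered from a pcr fit for any chosen M, for instance:

drop(coef(pcr.fit.Hitters, ncomp = 2))  # implied coefficients of the M = 2 model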
Figure 9: Mean-Squared Error, MSE, vs Number of Predictors as calculated by the
pcr() function under LOOCV.
3.2 Partial Least Squares
Here, we implement the Partial Least Squares (PLS) algorithm, again utilizing functions
of the pls library.
We fit the model on the training data set, plot the resulting cross-validation
scores and finally evaluate the model's predictive accuracy.
set.seed(478)
pls.fit.Hitters <- plsr(Salary ~ ., data = Hitters.Cleaned, subset = train,
    scale = TRUE, validation = "LOO")
summary(pls.fit.Hitters)
## Data: X dimension: 129 19
## Y dimension: 129 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 129 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## CV 460.8 392.9 407.5 409.1 422.6
## adjCV 460.8 392.8 407.4 408.9 422.3
## 5 comps 6 comps 7 comps 8 comps 9 comps
## CV 449.2 441.0 436.6 426.6 427.8
## adjCV 448.6 440.5 436.2 426.3 427.4
## 10 comps 11 comps 12 comps 13 comps 14 comps
## CV 425.1 409.5 422.2 428.1 425.3
## adjCV 424.6 408.9 421.8 427.7 424.9
## 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 428.9 426.9 430.3 429.9 436.4
## adjCV 428.4 426.5 429.8 429.5 435.9
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 38.19 46.91 65.76 70.18 74.45
## Salary 33.24 37.74 38.67 40.98 43.46
## 6 comps 7 comps 8 comps 9 comps 10 comps
## X 79.99 86.04 89.87 93.62 94.11
## Salary 44.53 45.36 46.14 46.43 47.24
## 11 comps 12 comps 13 comps 14 comps 15 comps
## X 95.09 97.08 98.37 98.82 99.15
## Salary 47.65 47.88 48.09 48.36 48.55
## 16 comps 17 comps 18 comps 19 comps
## X 99.59 99.77 99.99 100.00
## Salary 48.62 48.78 48.81 48.83
validationplot(pls.fit.Hitters, val.type = "MSEP", xlab = "Number of Predictors",
    ylab = "MSE", main = "MSE vs Number of Predictors [PLS - Hitters]")
Figure 10: Mean Squared Error (MSE) vs "Number of Predictors" as calculated by
the plsr() function under LOOCV.
As shown in Figure 10, the cross-validation error is again smallest when very few
components are used (the printed CV scores attain their minimum at M = 1 and are nearly
as small at M = 2), and we again use M = 2 components. The test MSE can then be computed
and compared with the ridge regression, lasso and PCR results above.

pls.pred.Hitters <- predict(pls.fit.Hitters, X[test, ], ncomp = 2)

# test MSE
mean((pls.pred.Hitters - y[test])^2)
References
[Davison, 2010] Davison, A. C. (2010). Bootstrap Methods and Their Application. Cambridge University Press, international edition.

[Hastie et al., 2011] Hastie, T., Tibshirani, R., and Friedman, J. (2011). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2nd edition.

More Related Content

What's hot

Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on rAbhik Seal
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in RJeffrey Breen
 
Data handling in r
Data handling in rData handling in r
Data handling in rAbhik Seal
 
Time Series Analysis by JavaScript LL matsuri 2013
Time Series Analysis by JavaScript LL matsuri 2013 Time Series Analysis by JavaScript LL matsuri 2013
Time Series Analysis by JavaScript LL matsuri 2013 Daichi Morifuji
 
Exploring Modeling - Best Practices with Aerospike Data Types
Exploring Modeling - Best Practices with Aerospike Data TypesExploring Modeling - Best Practices with Aerospike Data Types
Exploring Modeling - Best Practices with Aerospike Data TypesRonen Botzer
 
Exploring Modeling - Doing More with Lists
Exploring Modeling - Doing More with ListsExploring Modeling - Doing More with Lists
Exploring Modeling - Doing More with ListsRonen Botzer
 
Advanced pg_stat_statements: Filtering, Regression Testing & more
Advanced pg_stat_statements: Filtering, Regression Testing & moreAdvanced pg_stat_statements: Filtering, Regression Testing & more
Advanced pg_stat_statements: Filtering, Regression Testing & moreLukas Fittl
 
Countdown to Zero - Counter Use Cases in Aerospike
Countdown to Zero - Counter Use Cases in AerospikeCountdown to Zero - Counter Use Cases in Aerospike
Countdown to Zero - Counter Use Cases in AerospikeRonen Botzer
 
GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxDatabricks
 

What's hot (12)

Data manipulation on r
Data manipulation on rData manipulation on r
Data manipulation on r
 
Dplyr and Plyr
Dplyr and PlyrDplyr and Plyr
Dplyr and Plyr
 
Rsplit apply combine
Rsplit apply combineRsplit apply combine
Rsplit apply combine
 
Grouping & Summarizing Data in R
Grouping & Summarizing Data in RGrouping & Summarizing Data in R
Grouping & Summarizing Data in R
 
R 기초 anova
R 기초   anovaR 기초   anova
R 기초 anova
 
Data handling in r
Data handling in rData handling in r
Data handling in r
 
Time Series Analysis by JavaScript LL matsuri 2013
Time Series Analysis by JavaScript LL matsuri 2013 Time Series Analysis by JavaScript LL matsuri 2013
Time Series Analysis by JavaScript LL matsuri 2013
 
Exploring Modeling - Best Practices with Aerospike Data Types
Exploring Modeling - Best Practices with Aerospike Data TypesExploring Modeling - Best Practices with Aerospike Data Types
Exploring Modeling - Best Practices with Aerospike Data Types
 
Exploring Modeling - Doing More with Lists
Exploring Modeling - Doing More with ListsExploring Modeling - Doing More with Lists
Exploring Modeling - Doing More with Lists
 
Advanced pg_stat_statements: Filtering, Regression Testing & more
Advanced pg_stat_statements: Filtering, Regression Testing & moreAdvanced pg_stat_statements: Filtering, Regression Testing & more
Advanced pg_stat_statements: Filtering, Regression Testing & more
 
Countdown to Zero - Counter Use Cases in Aerospike
Countdown to Zero - Counter Use Cases in AerospikeCountdown to Zero - Counter Use Cases in Aerospike
Countdown to Zero - Counter Use Cases in Aerospike
 
GeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony FoxGeoMesa on Apache Spark SQL with Anthony Fox
GeoMesa on Apache Spark SQL with Anthony Fox
 

Similar to Linear Model Selection and Regularization (Article 6 - Practical exercises)

Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
Random forest algorithm for regression a beginner's guide
Random forest algorithm for regression   a beginner's guideRandom forest algorithm for regression   a beginner's guide
Random forest algorithm for regression a beginner's guideprateek kumar
 
Gradient boosting for regression problems with example basics of regression...
Gradient boosting for regression problems with example   basics of regression...Gradient boosting for regression problems with example   basics of regression...
Gradient boosting for regression problems with example basics of regression...prateek kumar
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine LearningAmanBhalla14
 
CIV1900 Matlab - Plotting & Coursework
CIV1900 Matlab - Plotting & CourseworkCIV1900 Matlab - Plotting & Coursework
CIV1900 Matlab - Plotting & CourseworkTUOS-Sam
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Yao Yao
 
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Parth Khare
 
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...Dr. Volkan OBAN
 
Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016Spencer Fox
 
Lab on contrasts, estimation, and power
Lab on contrasts, estimation, and powerLab on contrasts, estimation, and power
Lab on contrasts, estimation, and powerrichardchandler
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_publicLong Nguyen
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examplesDennis
 

Similar to Linear Model Selection and Regularization (Article 6 - Practical exercises) (20)

Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
FINAL_TAKE_HOME
FINAL_TAKE_HOMEFINAL_TAKE_HOME
FINAL_TAKE_HOME
 
Random forest algorithm for regression a beginner's guide
Random forest algorithm for regression   a beginner's guideRandom forest algorithm for regression   a beginner's guide
Random forest algorithm for regression a beginner's guide
 
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
 
Iowa_Report_2
Iowa_Report_2Iowa_Report_2
Iowa_Report_2
 
R Language Introduction
R Language IntroductionR Language Introduction
R Language Introduction
 
R programming language
R programming languageR programming language
R programming language
 
Gradient boosting for regression problems with example basics of regression...
Gradient boosting for regression problems with example   basics of regression...Gradient boosting for regression problems with example   basics of regression...
Gradient boosting for regression problems with example basics of regression...
 
R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
CIV1900 Matlab - Plotting & Coursework
CIV1900 Matlab - Plotting & CourseworkCIV1900 Matlab - Plotting & Coursework
CIV1900 Matlab - Plotting & Coursework
 
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
Mini-lab 1: Stochastic Gradient Descent classifier, Optimizing Logistic Regre...
 
Xgboost
XgboostXgboost
Xgboost
 
Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017Big Data Mining in Indian Economic Survey 2017
Big Data Mining in Indian Economic Survey 2017
 
Basic Analysis using Python
Basic Analysis using PythonBasic Analysis using Python
Basic Analysis using Python
 
R Programming Intro
R Programming IntroR Programming Intro
R Programming Intro
 
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
Optimization and Mathematical Programming in R and ROI - R Optimization Infra...
 
Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016Introduction to R Short course Fall 2016
Introduction to R Short course Fall 2016
 
Lab on contrasts, estimation, and power
Lab on contrasts, estimation, and powerLab on contrasts, estimation, and power
Lab on contrasts, estimation, and power
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_public
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examples
 

Recently uploaded

How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsBrainSell Technologies
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...ThinkInnovation
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Klinik Aborsi
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token PredictionNABLAS株式会社
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...ssuserf63bd7
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...mikehavy0
 
bams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptxbams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptxJocylDuran
 
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...varanasisatyanvesh
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样wsppdmt
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontangsiskavia95
 
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptxClient Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptxStephen266013
 
DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1sinhaabhiyanshu
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024patrickdtherriault
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabiaahmedjiabur940
 

Recently uploaded (20)

How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
 
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
Abortion Clinic in Kempton Park +27791653574 WhatsApp Abortion Clinic Service...
 
bams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptxbams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptx
 
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...Simplify hybrid data integration at an enterprise scale. Integrate all your d...
Simplify hybrid data integration at an enterprise scale. Integrate all your d...
 
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
如何办理澳洲拉筹伯大学毕业证(LaTrobe毕业证书)成绩单原件一模一样
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
 
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptxClient Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 

Linear Model Selection and Regularization (Article 6 - Practical exercises)

  • 1. Linear Model Selection and Regularization [ISLR.2013.Ch6-6] Theodore Grammatikopoulos∗ Tue 6th Jan, 2015 Abstract The linear model has distinct advantages in terms of inference and, on real-world problems, and it is often surprisingly competitive in relation to non-linear methods. Here, we will discuss some ways in which the simple linear model can be improved, by replacing plain least squares fitting with the alternative fitting procedures of (i) Subset Selection, (ii) Shrinkage and (iii) Dimensional Reduction. The reason to search for such improvements is twofold: (a) Prediction accuracy: In cases that the observational data n are not much larger than the number of predictors p, i.e. n > p, there can be a lot of variability in the least squares fit resulting in models with poor predictive capabilities. Furthermore, in case that p > n there is no longer a unique least squares coefficient estimate. By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias and finally improve the accuracy of our models, (b) Model Interpret-ability: It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. By removing such irrelevant variables we can obtain a model that is more easily interpreted. Here, we also discuss methods of automatically performing feature selection or variable selection. ## OTN License Agreement: Oracle Technology Network - Developer ## Oracle Distribution of R version 3.0.1 (--) Good Sport ## Copyright (C) The R Foundation for Statistical Computing ## Platform: x86_64-unknown-linux-gnu (64-bit) D:20150106213107+02’00’ ∗ e-mail:tgrammat@gmail.com 1
  • 2. 1 Subset Selection Methods 1.1 Best Subset Selection Here, given the Hitters{ISLR} data, we want to predict a baseball player’s Salary on the basis of various statistics associated with their performance in previous years. Note, that the Salary column of Hitters data has some missing values. Therefore, we better remove any missing values before applying any variable selection method. library(ISLR) dim(Hitters) ## [1] 322 20 sum(is.na(Hitters$Salary)) ## [1] 59 sum(is.na(Hitters)) ## [1] 59 Hitters.Cleaned <- na.omit(Hitters) dim(Hitters.Cleaned) ## [1] 263 20 attach(Hitters.Cleaned) To perform the Best Subset Selection method and plot a table of models with the suggested predictor variables we use the regsubsets() function of the leaps R library. library(leaps) reg.fit.Hitters.Cleaned <- regsubsets(Salary ~ ., data = Hitters. Cleaned, method = "exhaustive") summary(reg.fit.Hitters.Cleaned) 2
  • 3. ## Subset selection object ## Call: regsubsets.formula(Salary ~ ., data = Hitters.Cleaned, method = "exhaustive") ## 19 Variables (and intercept) ## Forced in Forced out ## AtBat FALSE FALSE ## Hits FALSE FALSE ## HmRun FALSE FALSE ## Runs FALSE FALSE ## RBI FALSE FALSE ## Walks FALSE FALSE ## Years FALSE FALSE ## CAtBat FALSE FALSE ## CHits FALSE FALSE ## CHmRun FALSE FALSE ## CRuns FALSE FALSE ## CRBI FALSE FALSE ## CWalks FALSE FALSE ## LeagueN FALSE FALSE ## DivisionW FALSE FALSE ## PutOuts FALSE FALSE ## Assists FALSE FALSE ## Errors FALSE FALSE ## NewLeagueN FALSE FALSE ## 1 subsets of each size up to 8 ## Selection Algorithm: exhaustive ## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits ## 1 ( 1 ) " " " " " " " " " " " " " " " " " " ## 2 ( 1 ) " " "*" " " " " " " " " " " " " " " ## 3 ( 1 ) " " "*" " " " " " " " " " " " " " " ## 4 ( 1 ) " " "*" " " " " " " " " " " " " " " ## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " " ## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " ## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" ## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " ## CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts ## 1 ( 1 ) " " " " "*" " " " " " " " " ## 2 ( 1 ) " " " " "*" " " " " " " " " ## 3 ( 1 ) " " " " "*" " " " " " " "*" ## 4 ( 1 ) " " " " "*" " " " " "*" "*" ## 5 ( 1 ) " " " " "*" " " " " "*" "*" ## 6 ( 1 ) " " " " "*" " " " " "*" "*" 3
  • 4. ## 7 ( 1 ) "*" " " " " " " " " "*" "*" ## 8 ( 1 ) "*" "*" " " "*" " " "*" "*" ## Assists Errors NewLeagueN ## 1 ( 1 ) " " " " " " ## 2 ( 1 ) " " " " " " ## 3 ( 1 ) " " " " " " ## 4 ( 1 ) " " " " " " ## 5 ( 1 ) " " " " " " ## 6 ( 1 ) " " " " " " ## 7 ( 1 ) " " " " " " ## 8 ( 1 ) " " " " " " In the resulting table above, the asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits and CRBI. Here, we have chosen to perform the "exhaustive" subset selection method, i.e. scanned the total number of possible subset models for the given number of predictors. regsubsets() function returns by default models with a maximum of 8 predictors (nvmax = 8), but this can be changed accordingly. reg.fitFull.Hitters.Cleaned <- regsubsets(Salary ~ ., data = Hitters.Cleaned, method = "exhaustive", nvmax = 19) reg.summary.Hitters.Cleaned <- summary(reg.fitFull.Hitters. Cleaned) Plotting RSS, adjR2, Cp and BIC for all the models that the full implementation of regsubsets() suggests, we can have better understanding of what the best choice would probably be. par(mfrow = c(1, 1)) par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 4, 0)) # RSS Plot plot(reg.summary.Hitters.Cleaned$rss, xlab = "Number of Variables ", ylab = "RSS", type = "l") 4
  • 5. # adjR2 Plot plot(reg.summary.Hitters.Cleaned$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l") which.max.adjr2 <- which.max(reg.summary.Hitters.Cleaned$adjr2) points(which.max.adjr2, reg.summary.Hitters.Cleaned$adjr2[which. max.adjr2], col = "red", cex = 2, pch = 4) legend("topright", inset = 0.05, paste("Best p =", as.character( which.max.adjr2)), text.col = "red") # Cp Plot plot(reg.summary.Hitters.Cleaned$cp, xlab = "Number of Predictors ", ylab = "Cp", type = "l") which.min.cp <- which.min(reg.summary.Hitters.Cleaned$cp) points(which.min.cp, reg.summary.Hitters.Cleaned$cp[which.min.cp ], col = "red", cex = 2, pch = 4) legend("topright", inset = 0.05, paste("Best p =", as.character( which.min.cp)), text.col = "red") # BIC Plot plot(reg.summary.Hitters.Cleaned$bic, xlab = "Number of Predictors", ylab = "BIC", type = "l") which.min.bic <- which.min(reg.summary.Hitters.Cleaned$bic) points(which.min.bic, reg.summary.Hitters.Cleaned$bic[which.min. bic], col = "red", cex = 2, pch = 4) 5
  • 6. legend("topright", inset = 0.05, paste("Best p =", as.character( which.min.bic)), text.col = "red") # Overall Title title(main = "RSS, adjR2, Cp, BIC Plots vs Number of Predictors n[regsubsets - Hitters]", outer = TRUE) Figure 1: RSS, adjR2, Cp, BIC Plots vs “Number of Predictors” as calculated by the regsubsets() R function [Hitters]. 6
Note that the BIC criterion suggests a smaller number of predictors, p = 6, as the best fit. This is expected, since by construction the BIC criterion for linear regression (LR) models introduces a larger penalty for models with more predictors. More specifically, the AIC criterion for LR models is given by the formula

AIC = \frac{1}{n \hat{\sigma}^2} \left( \mathrm{RSS} + 2 d \hat{\sigma}^2 \right) ,    (1)

whereas the BIC criterion is quantified by

BIC \sim \frac{1}{n} \left( \mathrm{RSS} + \log(n) \, d \, \hat{\sigma}^2 \right) ,    (2)

and log(n) > 2 for n > 7. Here d is the number of predictors of the fitted model and \hat{\sigma}^2 is the estimated variance of the errors associated with the LR fit. The Cp (AIC) and adjR2 criteria, on the other hand, suggest 10 and 11 predictors respectively as the best choice of LR fit.

To see the particular subset of variables that the regsubsets() function suggests under each selection criterion, we use the built-in plot.regsubsets() function:

par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(reg.fitFull.Hitters.Cleaned, scale = "r2", main = "R2: criterion")
plot(reg.fitFull.Hitters.Cleaned, scale = "adjr2", main = "adjR2: criterion")
plot(reg.fitFull.Hitters.Cleaned, scale = "Cp", main = "Cp: criterion")
plot(reg.fitFull.Hitters.Cleaned, scale = "bic", main = "BIC: criterion")

# Title Plot
title(main = "Best Subset Selection of Variables\nregsubsets [Hitters]", outer = TRUE)
Figure 2: Subset of selected variables as suggested by the regsubsets() R function for each selection criterion, R2, adjR2, Cp and BIC [Hitters].

The plots in Figure 2 reveal that the regsubsets() function returns:

• adjR2 criterion: (AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWalks, LeagueN, DivisionW, PutOuts, Assists) as the best selected variables

• Cp criterion: (AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWalks, DivisionW, PutOuts, Assists)

• BIC criterion: (AtBat, Hits, Walks, CRBI, DivisionW, PutOuts)
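As a quick sanity check of equation (2), the BIC curve can also be rebuilt by hand from the RSS values stored in the regsubsets() summary. The following is an illustrative sketch, not part of the printed output above; it estimates \hat{\sigma}^2 from the full 19-variable least squares fit, and since regsubsets() reports BIC only up to constant shifts, the two versions should be compared by the location of their minima rather than by their raw values.

# Illustrative sketch: reconstruct the BIC of equation (2) from RSS,
# estimating sigma^2 from the full least squares fit
n <- nrow(Hitters.Cleaned)
sigma2.hat <- summary(lm(Salary ~ ., data = Hitters.Cleaned))$sigma^2
d <- 1:19
bic.hand <- (reg.summary.Hitters.Cleaned$rss + log(n) * d * sigma2.hat) / n
which.min(bic.hand)                          # should point to p = 6
which.min(reg.summary.Hitters.Cleaned$bic)   # as reported by regsubsets()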
To obtain the coefficient estimates of this last model:

coef(reg.fitFull.Hitters.Cleaned, 6)

## (Intercept) AtBat Hits Walks
## 91.5117981 -1.8685892 7.6043976 3.6976468
## CRBI DivisionW PutOuts
## 0.6430169 -122.9515338 0.2643076

1.2 Forward and Backward Step-wise Selection

We can also use the regsubsets() function to perform forward step-wise or backward step-wise selection, using the argument method = "forward" or method = "backward" respectively. Note that in our case the number of observations is by far larger than the number of predictors, i.e. n ≫ p, so Backward Step-wise Selection can be applied here just as Best Subset Selection was previously.

# Fwd Stepwise Selection
reg.fitFwd.Hitters.Cleaned <- regsubsets(Salary ~ ., data = Hitters.Cleaned,
    nvmax = 19, method = "forward")

# Bwd Stepwise Selection
reg.fitBwd.Hitters.Cleaned <- regsubsets(Salary ~ ., data = Hitters.Cleaned,
    nvmax = 19, method = "backward")

More specifically, in the Forward Step-wise case the particular subset of selected variables returned by the regsubsets() function is shown in Figure 3.

par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(reg.fitFwd.Hitters.Cleaned, scale = "r2", main = "R2: criterion")
plot(reg.fitFwd.Hitters.Cleaned, scale = "adjr2", main = "adjR2: criterion")
plot(reg.fitFwd.Hitters.Cleaned, scale = "Cp", main = "Cp: criterion")
plot(reg.fitFwd.Hitters.Cleaned, scale = "bic", main = "BIC: criterion")

# Title Plot
title(main = "Fwd Stepwise Selection of Variables\nregsubsets [Hitters]", outer = TRUE)

Figure 3: Subset of selected variables as suggested by the regsubsets() R function for each selection criterion, R2, adjR2, Cp and BIC [Fwd Step-wise Selection, Hitters].
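Exhaustive and forward step-wise searches need not agree once the model size grows, because forward selection can never drop a variable chosen at an earlier step. A quick illustrative check, not shown in the original text, is to compare the coefficient sets of the two searches at a fixed size where they can differ, e.g. seven variables:

# Illustrative comparison at a fixed model size
coef(reg.fitFull.Hitters.Cleaned, 7)   # best subset, 7 variables
coef(reg.fitFwd.Hitters.Cleaned, 7)    # forward step-wise, 7 variables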
In the Backward Step-wise case the particular subset of selected variables returned by the regsubsets() function is shown in Figure 4 below.

par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(reg.fitBwd.Hitters.Cleaned, scale = "r2", main = "R2: criterion")
plot(reg.fitBwd.Hitters.Cleaned, scale = "adjr2", main = "adjR2: criterion")
plot(reg.fitBwd.Hitters.Cleaned, scale = "Cp", main = "Cp: criterion")
plot(reg.fitBwd.Hitters.Cleaned, scale = "bic", main = "BIC: criterion")

# Title Plot
title(main = "Bwd Stepwise Selection of Variables\nregsubsets [Hitters]", outer = TRUE)

Comparing these results across all the statistical criteria, we find that Best Subset and Forward Step-wise Variable Selection have the same outcome. The only difference is observed under the Backward Step-wise Selection procedure and for the BIC criterion, where the CRuns and CWalks variables are also suggested by the model. This is easily justified by the larger penalty that the BIC information criterion introduces for models with more predictors: during Forward Step-wise Selection, the underlying algorithm is more reluctant to add new variables to the model's predictor space when the BIC criterion is used, which results in a smaller, and here different, model compared with the one that the Backward Step-wise Selection method gives.

To obtain the specific coefficients of the models that the forward and backward step-wise selection methods suggest:

# Fwd Stepwise Model - BIC
coef(reg.fitFwd.Hitters.Cleaned, 6)

## (Intercept) AtBat Hits Walks
## 91.5117981 -1.8685892 7.6043976 3.6976468
## CRBI DivisionW PutOuts
## 0.6430169 -122.9515338 0.2643076

# Bwd Stepwise Model - BIC
coef(reg.fitBwd.Hitters.Cleaned, 8)

## (Intercept) AtBat Hits Walks
## 117.1520434 -2.0339209 6.8549136 6.4406642
## CRuns CRBI CWalks DivisionW
## 0.7045391 0.5273238 -0.8066062 -123.7798366
## PutOuts
## 0.2753892

Figure 4: Subset of selected variables as suggested by the regsubsets() R function for each selection criterion, R2, adjR2, Cp and BIC [Bwd Step-wise Selection, Hitters].
1.3 Cross-Validation: Choosing Among Models

In the previous section we chose among a set of models of different sizes using the Cp, BIC, and adjR2 statistical criteria. As an alternative to these methods, we can directly estimate the test error using the Validation-Set approach and Cross-Validation (CV) methods. The main advantages of the CV methods are:

• They provide a direct estimate of the test error.

• They make fewer assumptions about the true underlying model.

• They can be used in a wider range of model selection tasks.

In the past, performing CV was computationally prohibitive for many problems with large p and/or n, so AIC, Cp, BIC and adjR2 were more attractive approaches for choosing a model. With today's fast computing machines, however, the required computations are no longer an issue.

In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model-fitting, including the variable selection. If the full data set is used to perform the best subset selection step, the cross-validation errors that we obtain will not be accurate estimates of the test error (over-fitting).

More specifically, k-fold cross-validation randomly partitions the initial data set into k folds, and repeatedly builds a model on k − 1 of the folds while testing it on the remaining held-out fold. The cross-validation error returned by this procedure is, for a quantitative response,

CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i ,

or, for a categorical one,

CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i) .

Cross-validation methods have two major advantages in comparison with the simple Validation-Set approach: (1) they are less sensitive to the particular split of the observation data set, and (2) they span the observational data set more democratically, giving each fold in turn the role
of held-out test data while the remaining folds serve as the training pool. Therefore, cross-validation methods usually avoid over-estimation of the test error rate.

Note also that k-fold CV with k < n is less computationally intensive than the so-called Leave-One-Out Cross-Validation (LOOCV), i.e. CV with k = n, and is therefore usually preferred for studies of large data sets. The best way to apply such methods to our data set example, Hitters, is by utilizing functions of the bestglm R library.

library(bestglm)

set.seed(564)
train <- sample(c(TRUE, FALSE), nrow(Hitters.Cleaned), replace = TRUE)
test <- (!train)

set.seed(328)
X <- Hitters.Cleaned[train, -19]
y <- Hitters.Cleaned[train, 19]
Xy <- cbind(X, y)

# delete-d CV (d = 10), with random subsamples [Shao, 1997]
bestglm.CVd.Hitters <- bestglm(Xy, family = gaussian, IC = "CV",
    CVArgs = list(Method = "d", K = 10, REP = 1), TopModels = 5,
    method = "exhaustive", nvmax = "default")

## Note: binary categorical variables converted to 0-1 so 'leaps' could be used.

bestglm.CVd.Hitters$Title

## [1] "CVd(d = 10, REP = 1)\nNo BICq equivalent"

bestglm.CVd.Hitters$ModelReport

## $NullModel
## Deviance DF
## 26971550 128
##
## $LEAPSQ
## [1] TRUE
##
## $glmQ
## [1] FALSE
##
## $gaussianQ
## [1] TRUE
##
## $NumDF
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $CategoricalQ
## [1] TRUE
##
## $IncludeInterceptQ
## [1] TRUE
##
## $Bestk
## [1] 7

bestglm.CVd.Hitters$BestModel

##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) Walks CAtBat CHits
## 236.0867 4.4159 -0.4669 1.6483
## CRBI CWalks DivisionW PutOuts
## 0.8894 -0.3068 -136.1346 0.2146
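Since TopModels = 5 was passed above, the returned bestglm object should also carry the five best subsets ranked by the CV criterion. A quick, illustrative way to inspect them (this output is not reproduced here):

# Illustrative: the five best subsets found under the CV criterion
bestglm.CVd.Hitters$BestModels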
Or we can even use either (1) the "HTF k-fold CV" method (CVHTM) [Hastie et al., 2011] or (2) the "adjusted k-fold CV" (CVDH) [Davison, 2010], by passing the appropriate arguments to the bestglm() function.

# Hastie et al. k-fold CV (CVHTM)
bestglm.CVHTM.Hitters <- bestglm(Xy, family = gaussian, IC = "CV",
    CVArgs = list(Method = "HTF", K = 10, REP = 1), TopModels = 5,
    method = "exhaustive", nvmax = "default")

## Note: binary categorical variables converted to 0-1 so 'leaps' could be used.

bestglm.CVHTM.Hitters$Title

## [1] "CV(K = 10, REP = 1)\nBICq equivalent for q in (1.52529666674894e-08, 0.050282763767736)"

bestglm.CVHTM.Hitters$ModelReport

## $NullModel
## Deviance DF
## 26971550 128
##
## $LEAPSQ
## [1] TRUE
##
## $glmQ
## [1] FALSE
##
## $gaussianQ
## [1] TRUE
##
## $NumDF
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $CategoricalQ
## [1] TRUE
##
## $IncludeInterceptQ
## [1] TRUE
##
## $Bestk
## [1] 1

bestglm.CVHTM.Hitters$BestModel
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) CRBI
## 319.7668 0.7198

# adjusted k-fold CV (CVDH)
bestglm.CVDH.Hitters <- bestglm(Xy, family = gaussian, IC = "CV",
    CVArgs = list(Method = "DH", K = 10, REP = 1), TopModels = 5,
    method = "exhaustive", nvmax = "default")

## Note: binary categorical variables converted to 0-1 so 'leaps' could be used.

bestglm.CVDH.Hitters$Title

## [1] "CVAdj(K = 10, REP = 1)\nBICq equivalent for q in (0.855658979391075, 0.862338943358585)"

bestglm.CVDH.Hitters$ModelReport

## $NullModel
## Deviance DF
## 26971550 128
##
## $LEAPSQ
## [1] TRUE
##
## $glmQ
## [1] FALSE
##
## $gaussianQ
## [1] TRUE
##
## $NumDF
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $CategoricalQ
## [1] TRUE
##
## $IncludeInterceptQ
## [1] TRUE
##
## $Bestk
## [1] 10

bestglm.CVDH.Hitters$BestModel

##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) AtBat Hits Walks
## 329.5892 -2.5351 7.1477 6.7482
## CAtBat CRuns CRBI CWalks
## -0.1742 1.8657 0.7837 -1.1051
## DivisionW PutOuts Assists
## -119.8939 0.2362 0.2686

2 Ridge Regression and the Lasso

In this section we use the so-called shrinkage methods to better fit linear predictive models and describe the Salary response in terms of the other predictors. We will use the glmnet package in order to perform both the Ridge Regression and Lasso shrinkage methods. Its fitting functions have a slightly different syntax from what we have used so far: in particular, we must pass in an X matrix of predictors as well as a y response vector.

X <- model.matrix(Salary ~ ., Hitters.Cleaned)[, -1]
y <- Hitters.Cleaned$Salary

The model.matrix() function is particularly useful in cases like this; not only does it produce a matrix corresponding to the 19 predictors, but it also automatically transforms
any categorical variables into dummy variables. This is important because the glmnet() function can only be applied to numerical inputs.

2.1 Ridge Regression

Here, we fit a Ridge Regression linear model, leaving the model's tuning parameter, λ, to be determined by a cross-validation procedure.

library(glmnet)

## Loading required package: Matrix
## Loading required package: lattice
## Loaded glmnet 1.9-8

# Ridge Regression (alpha = 0), cross-validated with
# nfolds = training sample size (LOOCV)
set.seed(456)

# LOOCV Procedure: one fold per training observation
nfolds <- sum(train)
glmnet.ridge.fit <- cv.glmnet(X[train, ], y[train], type.measure = "mse",
    family = "gaussian", alpha = 0, nfolds = nfolds, standardize = TRUE)

The Mean-Squared Error (MSE) plot vs the log(λ) tuning parameter is given in Figure 5 below.

par(mfrow = c(1, 1), mar = c(4, 4, 2, 2), oma = c(0, 0, 5, 0))
plot(glmnet.ridge.fit)
title(main = "MSE vs Log(Lambda)\nRidge Regression [LOOCV, Hitters]", outer = TRUE)
Figure 5: Mean-Squared Error, MSE, vs the Log(Lambda) tuning parameter - Ridge Regression fit under the LOOCV procedure (Hitters).

In Figure 6 we plot the ridge regression coefficients vs the lambda control parameter, λ, and the fraction of the (null) deviance which is explained by the model.

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2), oma = c(0, 0, 5, 0))

# Minimum Mean CV Error
cvm.Min.Index <- which.min(glmnet.ridge.fit$cvm)

# Ridge Regression coefs vs Lambda Control Parameter
plot(glmnet.ridge.fit$glmnet.fit, xvar = "lambda", label = TRUE)
abline(v = log(glmnet.ridge.fit$lambda[cvm.Min.Index]), lty = "dashed",
    col = "black")
# Ridge Regression coefs vs Deviance Ratio (x-axis is the raw ratio, not its log)
plot(glmnet.ridge.fit$glmnet.fit, xvar = "dev", label = TRUE)
abline(v = glmnet.ridge.fit$glmnet.fit$dev.ratio[cvm.Min.Index],
    lty = "dashed", col = "black")

title(main = "Ridge Regression Coefficients vs Log(Lambda) and\nFraction of Deviance Explained [glmnet(LOOCV), Hitters]",
    outer = TRUE)

Figure 6: Ridge regression coefficients vs the Log(Lambda) control parameter and the fraction of deviance explained (glmnet under the LOOCV procedure, Hitters).
The dashed vertical line in the figure above denotes the best λ parameter as calculated by the LOOCV procedure. Note that, as expected, none of the regression coefficients is driven exactly to zero by this shrinkage method.

Finally, we apply the fitted ridge regression model to the test data set, using the λ parameter determined by the cross-validation procedure above. We also calculate the corresponding Mean-Squared Error for the test part of Hitters.

predict(glmnet.ridge.fit, X[test, ], type = "coefficients", s = "lambda.min")

## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 178.489344920
## AtBat 0.035448419
## Hits 0.608045444
## HmRun 0.930883558
## Runs 0.698708995
## RBI 0.654999526
## Walks 1.466577745
## Years 1.818638592
## CAtBat 0.009877955
## CHits 0.051122325
## CHmRun 0.330954429
## CRuns 0.096656674
## CRBI 0.107287429
## CWalks 0.056021178
## LeagueN -0.697704037
## DivisionW -70.184986685
## PutOuts 0.151336492
## Assists -0.027297267
## Errors -2.055848395
## NewLeagueN -12.596879362

glmnet.ridge.pred <- predict(glmnet.ridge.fit, X[test, ], type = "response",
    s = "lambda.min")

# test MSE
mean((glmnet.ridge.pred - y[test])^2)

## [1] 97569.09
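For reference, the cv.glmnet() object also stores lambda.1se, the largest λ whose CV error lies within one standard error of the minimum. The following is a small illustrative sketch, not part of the original analysis, showing how this more conservative (more heavily shrunken) choice would be evaluated on the same test set:

# One-standard-error rule: compare the two tuning parameter choices
glmnet.ridge.fit$lambda.min
glmnet.ridge.fit$lambda.1se

ridge.pred.1se <- predict(glmnet.ridge.fit, X[test, ], type = "response",
    s = "lambda.1se")
mean((ridge.pred.1se - y[test])^2)   # test MSE under the 1-SE rule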
2.2 The Lasso

In this section we fit a Lasso linear model, leaving the model's tuning parameter, λ, to be determined by a cross-validation procedure.

# Lasso (alpha = 1), cross-validated with
# nfolds = training sample size (LOOCV)
set.seed(234)

# LOOCV Procedure: one fold per training observation
nfolds <- sum(train)
glmnet.lassoLOOCV.fit <- cv.glmnet(X[train, ], y[train], type.measure = "mse",
    family = "gaussian", alpha = 1, nfolds = nfolds, standardize = TRUE)

The Mean-Squared Error (MSE) plot vs the log(λ) tuning parameter is given in Figure 7.

par(mfrow = c(1, 1), mar = c(4, 4, 2, 2), oma = c(0, 0, 5, 0))
plot(glmnet.lassoLOOCV.fit)
title(main = "MSE vs Log(Lambda)\nLasso Linear Fit [LOOCV, Hitters]", outer = TRUE)

In Figure 8 we plot the Lasso coefficients vs the lambda control parameter, λ, and the fraction of the (null) deviance which is explained by the model.

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2), oma = c(0, 0, 5, 0))

# Minimum Mean CV Error
Figure 7: Mean-Squared Error (MSE) vs the Log(Lambda) tuning parameter - Lasso linear fit under the LOOCV procedure (Hitters).

cvm.Min.Index <- which.min(glmnet.lassoLOOCV.fit$cvm)

# Lasso coefs vs Lambda Control Parameter
plot(glmnet.lassoLOOCV.fit$glmnet.fit, xvar = "lambda", label = TRUE)
abline(v = log(glmnet.lassoLOOCV.fit$lambda[cvm.Min.Index]), lty = "dashed",
    col = "black")

# Lasso coefs vs Deviance Ratio
plot(glmnet.lassoLOOCV.fit$glmnet.fit, xvar = "dev", label = TRUE)
abline(v = glmnet.lassoLOOCV.fit$glmnet.fit$dev.ratio[cvm.Min.Index],
    lty = "dashed", col = "black")

title(main = "Lasso Coefficients vs Log(Lambda) and\nFraction of Deviance Explained [glmnet(LOOCV), Hitters]",
    outer = TRUE)

Figure 8: Lasso coefficients vs the Log(Lambda) control parameter and the fraction of deviance explained (glmnet under the LOOCV procedure, Hitters).
The dashed vertical line in the figure above denotes the best λ parameter as calculated by the LOOCV procedure. Here, in contrast with the ridge regression case, some of the coefficients are set exactly to zero by the shrinkage method. Indeed, applying the fitted lasso linear model to the test data set, using the λ parameter determined by the cross-validation procedure above, we get the following results:

predict(glmnet.lassoLOOCV.fit, X[test, ], type = "coefficients", s = "lambda.min")

## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 156.72063595
## AtBat .
## Hits 1.15605082
## HmRun .
## Runs .
## RBI .
## Walks 2.18274618
## Years .
## CAtBat .
## CHits 0.01354589
## CHmRun .
## CRuns 0.12054023
## CRBI 0.44744187
## CWalks .
## LeagueN .
## DivisionW -108.31835711
## PutOuts 0.24083595
## Assists .
## Errors -1.85682194
## NewLeagueN .

glmnet.lassoLOOCV.pred <- predict(glmnet.lassoLOOCV.fit, X[test, ],
    type = "response", s = "lambda.min")

# test MSE
mean((glmnet.lassoLOOCV.pred - y[test])^2)

## [1] 89742.07
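The sparsity of the selected model can also be summarized programmatically rather than read off the coefficient listing. A minimal illustrative sketch, not part of the original output, counting the predictors kept and dropped at lambda.min:

# Count non-zero lasso coefficients (intercept excluded)
lasso.coefs <- predict(glmnet.lassoLOOCV.fit, type = "coefficients",
    s = "lambda.min")
sum(lasso.coefs[-1, 1] != 0)   # predictors retained by the lasso
sum(lasso.coefs[-1, 1] == 0)   # predictors shrunk exactly to zero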
Note that the corresponding test Mean-Squared Error (MSE) is noticeably lower than the one found above with the ridge regression fit.

3 PCR and PLS Regression

3.1 Principal Components Regression

Principal components regression (PCR) can be performed with the pcr() function, which is part of the pls R library. We now apply PCR to the Hitters data of our example, and test how successfully we can predict the Salary variable. As before, we should use a cleaned version of the data set, i.e. a data.frame without missing values. In addition, we need to standardize all the predictors prior to applying the method.

library(pls)

set.seed(54)

In order to better estimate the model test error we perform the PCR calculation using LOOCV; the data set is not large and the method does not need great computational power. In the case of a larger data set we could alternatively use the "CV" flag. Summarizing our current results:

pcr.fit.Hitters <- pcr(Salary ~ ., data = Hitters.Cleaned, subset = train,
    scale = TRUE, validation = "LOO")
summary(pcr.fit.Hitters)

## Data: X dimension: 129 19
## Y dimension: 129 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 129 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## CV 460.8 390.9 395.3 397.2 398.5
## adjCV 460.8 390.9 395.3 397.2 398.4
## 5 comps 6 comps 7 comps 8 comps 9 comps
## CV 400.6 392.9 396.2 400.7 403.4
## adjCV 400.5 392.7 396.1 400.6 403.2
## 10 comps 11 comps 12 comps 13 comps 14 comps
## CV 413.7 419.9 423.0 442.3 438.4
## adjCV 413.5 419.7 422.7 442.0 438.0
## 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 431.1 419.8 424.2 422.8 436.4
## adjCV 430.8 419.5 423.8 422.4 435.9
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 38.34 61.02 70.97 78.61 84.17
## Salary 31.51 31.82 31.82 31.96 32.65
## 6 comps 7 comps 8 comps 9 comps 10 comps
## X 88.97 92.32 95.20 96.45 97.58
## Salary 37.07 37.36 37.36 37.45 37.98
## 11 comps 12 comps 13 comps 14 comps 15 comps
## X 98.26 98.82 99.28 99.54 99.78
## Salary 38.02 38.05 38.46 40.76 42.71
## 16 comps 17 comps 18 comps 19 comps
## X 99.93 99.98 99.99 100.00
## Salary 45.34 47.40 48.75 48.83

The CV score is provided for each possible number of components, ranging from M = 0 onwards. Note that pcr() reports the root mean squared error of prediction (RMSEP); in order to obtain the usual MSE, we must square this quantity. To obtain a graphical representation of the resulting cross-validation scores:

par(mfrow = c(1, 1))
validationplot(pcr.fit.Hitters, val.type = "MSEP", xlab = "Number of Components",
    ylab = "MSE", main = "MSE vs Number of Components [PCR - Hitters]")

From Figure 9, and from the RMSEP table above, we see that the cross-validation error is smallest for a very small number of components (the minimum occurs at M = 1, with M = 2 performing almost as well).

The summary() function also provides the percentage of variance explained in the predictors and in the response using different numbers of components. This concept is discussed in greater detail in later articles. Briefly, we can think of this as the amount of information about the predictors or the response that is captured using M principal components.
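Rather than reading the minimum off the plot, the cross-validated errors can also be extracted programmatically with the MSEP() accessor of the pls package. A minimal illustrative sketch:

# Extract the cross-validated MSE and locate the best component count
pcr.msep <- MSEP(pcr.fit.Hitters, estimate = "CV")
# val is indexed [estimate, response, 0..19 components]; subtract 1 so
# that the intercept-only model counts as M = 0
which.min(pcr.msep$val[1, 1, ]) - 1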
Next, we evaluate the model's predictive accuracy by applying the PCR fit with M = 2 components to the test data set.

pcr.pred.Hitters <- predict(pcr.fit.Hitters, X[test, ], ncomp = 2)
mean((pcr.pred.Hitters - y[test])^2)

## [1] 104209.3

The test MSE of the model is comparable with the results obtained using ridge regression and the lasso. However, PCR models are more difficult to interpret, because the method performs no variable selection and does not even directly produce coefficient estimates.

Figure 9: Mean-Squared Error, MSE, vs the number of components as calculated by the pcr() function under LOOCV.
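If we settle on M = 2, common practice is to refit PCR on the full cleaned data set with the chosen number of components before putting the model to use. A hedged sketch of that final step, which is not carried out in the original analysis:

# Possible final step: refit PCR on all of Hitters.Cleaned with M = 2
pcr.fit.final <- pcr(Salary ~ ., data = Hitters.Cleaned, scale = TRUE,
    ncomp = 2)
summary(pcr.fit.final)   # output omitted here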
3.2 Partial Least Squares

Here, we implement the Partial Least Squares (PLS) algorithm, again utilizing functions of the pls library. As before, we build the model on the training data set, make a plot of the resulting cross-validation scores, and finally evaluate the model's predictive accuracy.

set.seed(478)
pls.fit.Hitters <- plsr(Salary ~ ., data = Hitters.Cleaned, subset = train,
    scale = TRUE, validation = "LOO")
summary(pls.fit.Hitters)

## Data: X dimension: 129 19
## Y dimension: 129 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 129 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## CV 460.8 392.9 407.5 409.1 422.6
## adjCV 460.8 392.8 407.4 408.9 422.3
## 5 comps 6 comps 7 comps 8 comps 9 comps
## CV 449.2 441.0 436.6 426.6 427.8
## adjCV 448.6 440.5 436.2 426.3 427.4
## 10 comps 11 comps 12 comps 13 comps 14 comps
## CV 425.1 409.5 422.2 428.1 425.3
## adjCV 424.6 408.9 421.8 427.7 424.9
## 15 comps 16 comps 17 comps 18 comps 19 comps
## CV 428.9 426.9 430.3 429.9 436.4
## adjCV 428.4 426.5 429.8 429.5 435.9
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 38.19 46.91 65.76 70.18 74.45
## Salary 33.24 37.74 38.67 40.98 43.46
## 6 comps 7 comps 8 comps 9 comps 10 comps
## X 79.99 86.04 89.87 93.62 94.11
## Salary 44.53 45.36 46.14 46.43 47.24
## 11 comps 12 comps 13 comps 14 comps 15 comps
## X 95.09 97.08 98.37 98.82 99.15
## Salary 47.65 47.88 48.09 48.36 48.55
## 16 comps 17 comps 18 comps 19 comps
## X 99.59 99.77 99.99 100.00
## Salary 48.62 48.78 48.81 48.83

validationplot(pls.fit.Hitters, val.type = "MSEP", xlab = "Number of Components",
    ylab = "MSE", main = "MSE vs Number of Components [PLS - Hitters]")

Figure 10: Mean-Squared Error (MSE) vs the number of components as calculated by the plsr() function under LOOCV.
As shown in Figure 10, the cross-validation error is again smallest for a very small number of components, so we once more evaluate the fit with M = 2 components on the test data set.

pls.pred.Hitters <- predict(pls.fit.Hitters, X[test, ], ncomp = 2)

# test MSE
mean((pls.pred.Hitters - y[test])^2)

The resulting test MSE can then be compared directly with the values obtained above by the ridge regression, lasso and PCR methods.

References

[Davison, 2010] Davison, A. C. (2010). Bootstrap Methods and Their Application. Cambridge University Press, international edition.

[Hastie et al., 2011] Hastie, T., Tibshirani, R., and Friedman, J. (2011). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2nd edition.