Linear Model Selection and Regularization (Article 6 - Practical exercises)

Linear Model Selection and Regularization
[ISLR.2013.Ch6-6]
Theodore Grammatikopoulos∗
Tue 6th
Jan, 2015
Abstract
The linear model has distinct advantages in terms of inference and, on real-world
problems, and it is often surprisingly competitive in relation to non-linear methods.
Here, we will discuss some ways in which the simple linear model can be improved, by
replacing plain least squares fitting with the alternative fitting procedures of (i) Subset
Selection, (ii) Shrinkage and (iii) Dimensional Reduction. The reason to search for such
improvements is twofold: (a) Prediction accuracy: In cases that the observational data
n are not much larger than the number of predictors p, i.e. n > p, there can be
a lot of variability in the least squares fit resulting in models with poor predictive
capabilities. Furthermore, in case that p > n there is no longer a unique least squares
coefficient estimate. By constraining or shrinking the estimated coefficients, we can
often substantially reduce the variance at the cost of a negligible increase in bias and
finally improve the accuracy of our models, (b) Model Interpret-ability: It is often the
case that some or many of the variables used in a multiple regression model are in
fact not associated with the response. By removing such irrelevant variables we can
obtain a model that is more easily interpreted. Here, we also discuss methods of
automatically performing feature selection or variable selection.
## OTN License Agreement: Oracle Technology Network -
Developer
## Oracle Distribution of R version 3.0.1 (--) Good Sport
## Copyright (C) The R Foundation for Statistical Computing
## Platform: x86_64-unknown-linux-gnu (64-bit)
D:20150106213107+02’00’
∗
e-mail:tgrammat@gmail.com
1

1 Subset Selection Methods
1.1 Best Subset Selection
Here, given the Hitters{ISLR} data, we want to predict a baseball player’s Salary on
the basis of various statistics associated with their performance in previous years.
Note, that the Salary column of Hitters data has some missing values. Therefore, we
better remove any missing values before applying any variable selection method.
library(ISLR)
dim(Hitters)
## [1] 322 20
sum(is.na(Hitters$Salary))
## [1] 59
sum(is.na(Hitters))
## [1] 59
Hitters.Cleaned <- na.omit(Hitters)
dim(Hitters.Cleaned)
## [1] 263 20
attach(Hitters.Cleaned)
To perform the Best Subset Selection method and plot a table of models with the suggested
predictor variables we use the regsubsets() function of the leaps R library.
library(leaps)
reg.fit.Hitters.Cleaned <- regsubsets(Salary ~ ., data = Hitters.
Cleaned,
method = "exhaustive")
summary(reg.fit.Hitters.Cleaned)
2

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters.Cleaned,
method = "exhaustive")
## 19 Variables (and intercept)
## Forced in Forced out
## AtBat FALSE FALSE
## Hits FALSE FALSE
## HmRun FALSE FALSE
## Runs FALSE FALSE
## RBI FALSE FALSE
## Walks FALSE FALSE
## Years FALSE FALSE
## CAtBat FALSE FALSE
## CHits FALSE FALSE
## CHmRun FALSE FALSE
## CRuns FALSE FALSE
## CRBI FALSE FALSE
## CWalks FALSE FALSE
## LeagueN FALSE FALSE
## DivisionW FALSE FALSE
## PutOuts FALSE FALSE
## Assists FALSE FALSE
## Errors FALSE FALSE
## NewLeagueN FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
## 1 ( 1 ) " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " "*" " " " " " "
## 7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*"
## 8 ( 1 ) "*" "*" " " " " " " "*" " " " " " "
## CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts
## 1 ( 1 ) " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " " " " " "
## 3 ( 1 ) " " " " "*" " " " " " " "*"
## 4 ( 1 ) " " " " "*" " " " " "*" "*"
## 5 ( 1 ) " " " " "*" " " " " "*" "*"
## 6 ( 1 ) " " " " "*" " " " " "*" "*"
3

## 7 ( 1 ) "*" " " " " " " " " "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " "*" "*"
## Assists Errors NewLeagueN
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
In the resulting table above, the asterisk indicates that a given variable is included in the
corresponding model. For instance, this output indicates that the best two-variable model
contains only Hits and CRBI. Here, we have chosen to perform the "exhaustive"
subset selection method, i.e. scanned the total number of possible subset models for the
given number of predictors. regsubsets() function returns by default models with a
maximum of 8 predictors (nvmax = 8), but this can be changed accordingly.
reg.fitFull.Hitters.Cleaned <- regsubsets(Salary ~ ., data =
Hitters.Cleaned,
method = "exhaustive", nvmax = 19)
reg.summary.Hitters.Cleaned <- summary(reg.fitFull.Hitters.
Cleaned)
Plotting RSS, adjR2, Cp and BIC for all the models that the full implementation of
regsubsets() suggests, we can have better understanding of what the best choice
would probably be.
par(mfrow = c(1, 1))
par(mfrow = c(2, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 4, 0))
# RSS Plot
plot(reg.summary.Hitters.Cleaned$rss, xlab = "Number of Variables
",
ylab = "RSS", type = "l")
4

# adjR2 Plot
plot(reg.summary.Hitters.Cleaned$adjr2, xlab = "Number of
Variables",
ylab = "Adjusted RSq", type = "l")
which.max.adjr2 <- which.max(reg.summary.Hitters.Cleaned$adjr2)
points(which.max.adjr2, reg.summary.Hitters.Cleaned$adjr2[which.
max.adjr2],
col = "red", cex = 2, pch = 4)
legend("topright", inset = 0.05, paste("Best p =", as.character(
which.max.adjr2)),
text.col = "red")
# Cp Plot
plot(reg.summary.Hitters.Cleaned$cp, xlab = "Number of Predictors
",
ylab = "Cp", type = "l")
which.min.cp <- which.min(reg.summary.Hitters.Cleaned$cp)
points(which.min.cp, reg.summary.Hitters.Cleaned$cp[which.min.cp
],
which.min.cp)),
text.col = "red")
# BIC Plot
plot(reg.summary.Hitters.Cleaned$bic, xlab = "Number of
Predictors",
ylab = "BIC", type = "l")
which.min.bic <- which.min(reg.summary.Hitters.Cleaned$bic)
points(which.min.bic, reg.summary.Hitters.Cleaned$bic[which.min.
bic],
5

which.min.bic)),
text.col = "red")
# Overall Title
title(main = "RSS, adjR2, Cp, BIC Plots vs Number of Predictors
n[regsubsets - Hitters]",
outer = TRUE)
Figure 1: RSS, adjR2, Cp, BIC Plots vs “Number of Predictors” as calculated by the
regsubsets() R function [Hitters].
6

Note that the BIC criterion suggests a smaller number of predictors, i.e. p = 6, as a
best fit. This is expected since by construction the BIC criterion for linear regression (LR)
models, introduce larger penalty for models with more predictors. More specifically, the
AIC criterion for LR models is given by the formula
AIC =
1
n ˆσ2
RSS + 2 d ˆσ2
, (1)
whereas the BIC criterion is quantified by
BIC ∼
1
n
RSS + log(n) d ˆσ2
, (2)
and log(n) > 2 for n > 7. Note also that ˆσ2
is the estimated variance of associated errors
with the LR fit. Cp (AIC) and adjR2 criteria, on the other hand, suggest 10 and 11
predictors respectively as the best choice of LR fit.
To be more specific about the particular subset of selected variables which are suggested
by the regsubsets() function and for each selection criterion, we use the built-in
plot.regsubsets() function
plot(reg.fitFull.Hitters.Cleaned, scale = "r2", main = "R2:
criterion")
plot(reg.fitFull.Hitters.Cleaned, scale = "adjr2", main = "adjR2:
criterion")
plot(reg.fitFull.Hitters.Cleaned, scale = "Cp", main = "Cp:
criterion")
plot(reg.fitFull.Hitters.Cleaned, scale = "bic", main = "BIC:
criterion")
# Title Plot
title(main = "Best Subset Selection of Variables n regsubsets [
Hitters]",
outer = TRUE)
7

Figure 2: Subset of selected variables as suggested by the regsubsets() R function for
each selection criterion, R2, adjR2, Cp and BIC [Hitters].
The plots in Figure 2 reveal that the regsubsets() function returns
ˆ adjR2 criterion: (AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWalks, LeagueN,
DivisionW, PutOuts, Assists) as the best selected variables
ˆ Cp criterion: (AtBat, Hits, Walks, CAtBat, CRuns, CRBI, CWalks, DivisionW,
PutOuts, Assists)
ˆ BIC criterion: (AtBat, Hits, Walks, CRBI, DivisionW, PutOuts)
8

Requiring to know the coeﬃcients estimates of this last model
coef(reg.fitFull.Hitters.Cleaned, 6)
## (Intercept) AtBat Hits Walks
## 91.5117981 -1.8685892 7.6043976 3.6976468
## CRBI DivisionW PutOuts
## 0.6430169 -122.9515338 0.2643076
1.2 Forward and Backward Step-wise Selection
We can also use the regsubsets function to perform forward step-wise or backward
step-wise selection, using the argument method ="forward" or "backward" respec-
tively. Note that in our case the observation data, n, are by far a larger number than the
number of predictors, p, i.e. n p, and the Backward Step-wise Selection can be also
applied as the Best Subset Selection previously could.
# Fwd Stepwise Selection
reg.fitFwd.Hitters.Cleaned <- regsubsets(Salary ~ ., data =
Hitters.Cleaned,
nvmax = 19, method = "forward")
# Bwd Stepwise Selection
reg.fitBwd.Hitters.Cleaned <- regsubsets(Salary ~ ., data =
Hitters.Cleaned,
nvmax = 19, method = "backward")
More speciﬁcally, in the Forward Step-wise case the particular subset of selection variables
which are returned from the regsubsets() function are shown in Figure 3.
plot(reg.fitFwd.Hitters.Cleaned, scale = "r2", main = "R2:
criterion")
plot(reg.fitFwd.Hitters.Cleaned, scale = "adjr2", main = "adjR2:
criterion")
9

plot(reg.fitFwd.Hitters.Cleaned, scale = "Cp", main = "Cp:
criterion")
plot(reg.fitFwd.Hitters.Cleaned, scale = "bic", main = "BIC:
criterion")
# Title Plot
title(main = "Fwd Stepwise Selection of Variables n regsubsets [
Hitters]",
outer = TRUE)
each selection criterion, R2, adjR2, Cp and BIC [Fwd Step-wise Selection, Hitters].
10

In the Backward Step-wise case the particular subset of selection variables which are
returned from the regsubsets() function are shown in Figure 4 below.
plot(reg.fitBwd.Hitters.Cleaned, scale = "r2", main = "R2:
criterion")
plot(reg.fitBwd.Hitters.Cleaned, scale = "adjr2", main = "adjR2:
criterion")
plot(reg.fitBwd.Hitters.Cleaned, scale = "Cp", main = "Cp:
criterion")
plot(reg.fitBwd.Hitters.Cleaned, scale = "bic", main = "BIC:
criterion")
# Title Plot
title(main = "Bwd Stepwise Selection of Variables n regsubsets [
Hitters]",
outer = TRUE)
Comparing these results across all the statistics criteria, we figure out that both methods,
namely Best Subset and Forward Step-wise Variable Selection have the same outcome.
The only difference is observed under the Backward Step-wise Selection procedure and
for the BIC criterion, where the CRuns and CWalks variables, are also suggested by the
model. However, this is easily justifiable due to the larger penalty that BIC information
criterion introduce for models with more predictors. That is, during the Forward Step-
wise Selection procedure, the underlying algorithm is more reluctant to add new variables
in model’s predictor space when the BIC criterion is used, which results in a smaller
and finally different model comparing with the one that the Backward Step-wise Selection
method gives.
Requiring to know the specific coefficients of the models that forward and backward step-
wise selection methods suggest:
# Fwd Stepwise Model - BIC
coef(reg.fitFwd.Hitters.Cleaned, 6)
## 91.5117981 -1.8685892 7.6043976 3.6976468
11

## CRBI DivisionW PutOuts
## 0.6430169 -122.9515338 0.2643076
# Bwd Stepwise Model - BIC
coef(reg.fitBwd.Hitters.Cleaned, 8)
## 117.1520434 -2.0339209 6.8549136 6.4406642
## CRuns CRBI CWalks DivisionW
## 0.7045391 0.5273238 -0.8066062 -123.7798366
## PutOuts
## 0.2753892
each selection criterion, R2, adjR2, Cp and BIC [Bwd Step-wise Selection, Hitters].
12

1.3 Cross-Validation: Choosing Among Models
In the previous section we chose among a set of models of different sizes using the Cp,
BIC, and adjR2 statistics criteria. As an alternative to these methods, we can directly
estimate the test error using Validation-Set approach and Cross-Validation (CV) methods.
Main advantages of the CV methods are:
ˆ Direct estimate of the test error.
ˆ Makes fewer assumptions for the true underlying model.
ˆ It can be used in a wider range of model selection tasks.
In past, performing CV was computationally prohibitive for many problems with large
p and/or n. Therefore, AIC, Cp, BIC and adjR2 were more attractive approaches to
choose a model. However, nowadays with the fast computing machines, the required
computations are not any more an issue.
In order for these approaches to yield accurate estimates of the test error, we must use
only the training observations to perform all aspects of model-fitting, including the variable
selection. If the full data set is used to perform the best subset selection step, the cross-
validation errors that we obtain will not be accurate estimates of the test error (over-fitting).
More specifically, cross-validation re-sampling methods concern the random selection of
k sequential parts of the initial data set, and repeatedly build a model in one k-th part,
whereas testing it on the remaining (k − 1)-folds. The cross-validation error returned by
this procedure is for a quantitative response
CV(k) =
1
k
k
i=1
MSEi ,
or
CV(n) =
1
n
n
i=1
1l(yi ˆyi ) ,
for a categorical one.
Cross-validation (CV) methods has two major advantages in comparison with the simple
Validation set approach: (1) Are less sensitive in the particular observation data set, and (2)
Spans the observational data set more democratically, giving in each k-fold the character
13

of the training data pool whereas keeping the others as test data arena. Therefore, cross-
validation methods usually avoid over-estimation of the test error rate.
Note also that CV methods are less computationally intensive from the so-called Leave-
One-Out-Cross-Validation (LOOCV), CV (k=1), and therefore usually preferred for large data
sets studies.
The best way to apply such methods for our data set example, Hitters, is by utilizing
functions of the bestglm, R library.
library(bestglm)
set.seed(564)
train <- sample(c(TRUE, FALSE), nrow(Hitters.Cleaned), rep = TRUE
)
test <- (!train)
set.seed(328)
X <- Hitters.Cleaned[train, -19]
y <- Hitters.Cleaned[train, 19]
Xy <- cbind(X, y)
# delete-d CVd k=10, with random subsamples [Shao, 1997]
bestglm.CVd.Hitters <- bestglm(Xy, family = gaussian, IC = "CV",
CVArgs = list(Method = "d", K = 10, REP = 1), TopModels = 5,
method = "exhaustive", nvmax = "default")
## Note: binary categorical variables converted to 0-1 so 'leaps'
could be used.
bestglm.CVd.Hitters$Title
## [1] "CVd(d = 10, REP = 1)nNo BICq equivalent"
bestglm.CVd.Hitters$ModelReport
## $NullModel
## Deviance DF
## 26971550 128
##
14

## $LEAPSQ
## [1] TRUE
##
## $glmQ
## [1] FALSE
##
## $gaussianQ
## [1] TRUE
##
## $NumDF
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $CategoricalQ
## [1] TRUE
##
## $IncludeInterceptQ
## [1] TRUE
##
## $Bestk
## [1] 7
bestglm.CVd.Hitters$BestModel
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1],
FALSE),
## drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) Walks CAtBat CHits
## 236.0867 4.4159 -0.4669 1.6483
## CRBI CWalks DivisionW PutOuts
## 0.8894 -0.3068 -136.1346 0.2146
Or we can even use either the (1) “HTF k-fold CV” method (CVHTM) [Hastie et al., 2011] or
the (2) “adjusted K-fold CV” (CVDH) [by A. C. Davison, 2010] by parsing the appropriate
arguments in the bestglm() function
# Hastie et' al. K-fold CV (CVHTM)
15

bestglm.CVHTM.Hitters <- bestglm(Xy, family = gaussian, IC = "CV
",
CVArgs = list(Method = "HTF", K = 10, REP = 1), TopModels =
5,
could be used.
bestglm.CVHTM.Hitters$Title
## [1] "CV(K = 10, REP = 1)nBICq equivalent for q in
(1.52529666674894e-08, 0.050282763767736)"
bestglm.CVHTM.Hitters$ModelReport
## $NullModel
## Deviance DF
## 26971550 128
##
## $LEAPSQ
## [1] TRUE
##
## $glmQ
## [1] FALSE
##
## $gaussianQ
## [1] TRUE
##
## $NumDF
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $CategoricalQ
## [1] TRUE
##
## [1] TRUE
##
## $Bestk
## [1] 1
bestglm.CVHTM.Hitters$BestModel
16

##
## Call:
FALSE),
##
## Coefficients:
## (Intercept) CRBI
## 319.7668 0.7198
# adjusted K-fold CV (CVDH)
bestglm.CVDH.Hitters <- bestglm(Xy, family = gaussian, IC = "CV",
CVArgs = list(Method = "DH", K = 10, REP = 1), TopModels = 5,
could be used.
bestglm.CVDH.Hitters$Title
## [1] "CVAdj(K = 10, REP = 1)nBICq equivalent for q in
(0.855658979391075, 0.862338943358585)"
bestglm.CVDH.Hitters$ModelReport
## $NullModel
## Deviance DF
## 26971550 128
##
## $LEAPSQ
## [1] TRUE
##
## $glmQ
## [1] FALSE
##
## $gaussianQ
## [1] TRUE
##
## $NumDF
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
17

## $CategoricalQ
## [1] TRUE
##
## [1] TRUE
##
## $Bestk
## [1] 10
bestglm.CVDH.Hitters$BestModel
##
## Call:
FALSE),
##
## Coefficients:
## 329.5892 -2.5351 7.1477 6.7482
## CAtBat CRuns CRBI CWalks
## -0.1742 1.8657 0.7837 -1.1051
## DivisionW PutOuts Assists
## -119.8939 0.2362 0.2686
2 Ridge Regression and the Lasso
In this section we are going to use the so-called shrinkage methods to better ﬁt linear
predictive models and describe the Salary response in terms of other predictors.
We will use the glmnet package in order to perform both Ridge Regression and Lasso
shrinkage methods. This function has slightly diﬀerent syntax from what we have used
so far. In particular, we must pass in an X matrix of predictors as well as a y vector of
response.
X <- model.matrix(Salary ~ ., Hitters.Cleaned)[, -1]
y <- Hitters.Cleaned$Salary
The model.matrix() function is particular useful in cases like this; not only does it
produce a matrix corresponding to the 19 predictors but it also automatically transforms
18

any categorical variables into dummy variables. This is important because the glmnet()
function can be only applied on numerical data inputs.
2.1 Ridge Regression
Here, we ﬁt a Ridge Regression linear model leaving the model’s tuning parameter, λ, to
be determined by a cross-validation procedure
library(glmnet)
## Loading required package: Matrix
## Loading required package: lattice
## Loaded glmnet 1.9-8
# Calculate the Ridge Regression (alpha=0) - Cross
# Validations nfolds=Sample Size (LOOCV)
set.seed(456)
# LOOCV Procedure
nfolds = nrow(Hitters.Cleaned)
glmnet.ridge.fit <- cv.glmnet(X[train, ], y[train], type.measure
= "mse",
family = "gaussian", alpha = 0, standardize = TRUE)
The Mean-Squared Error, MSE, plot vs the log(Lambda) tuning parameter is given in
Figure 5 below.
plot(glmnet.ridge.fit)
title(main = "MSE vs Log(Lambda) n Ridge Regression [LOOCV,
Hitters]",
outer = TRUE)
19

Figure 5: Mean-Squared Error, MSE, vs the Log(Lambda) tuning parameter - Ridge Re-
gression Fit under LOOCV procedure (Hitters).
In Figure 6 we plot the ridge regression coeﬃcients vs the lambda control parameter, λ,
and the fraction of (null) deviance ratio which is explained by the model.
# Minimum Mean CV Error
cvm.Min.Index <- which.min(glmnet.ridge.fit$cvm)
# Ridge Regression coefs vs Lambda Control Parameter
plot(glmnet.ridge.fit$glmnet.fit, xvar = "lambda", label = TRUE)
abline(v = log(glmnet.ridge.fit$lambda[cvm.Min.Index]), lty = "
dashed",
col = "black")
20

# Ridge Regression coefs vs Deviance Ratio
plot(glmnet.ridge.fit$glmnet.fit, xvar = "dev", label = TRUE)
abline(v = log(glmnet.ridge.fit$glmnet.fit$dev.ratio[cvm.Min.
Index]),
lty = "dashed", col = "black")
title(main = "Ridge Regression Coefficients vs Log(Lambda) and n
Fraction of Deviance Explained [glmnet(LOOCV), Hitters]",
outer = TRUE)
Figure 6: Ridge regression coeﬃcients vs the Log(Lambda) control parameter and the
fraction of Dev.Ratio explained (glmnet under LOOCV procedure, Hitters).
21

The dashed vertical line in the figure above denotes the best λ parameter as calculated
by the LOOCV procedure. Note that as expected, none of the regression coefficients are
actually vanished by the application of the shrinkage method.
Finally, we refit our ridge regression model on the test data set using the λ parameter
which has been determined by the cross-validation procedure above. We also calculate
the corresponding Mean-Squared Error for the test data part of Hitters.
predict(glmnet.ridge.fit, X[test, ], type = "coefficients", s = "
lambda.min")
## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 178.489344920
## AtBat 0.035448419
## Hits 0.608045444
## HmRun 0.930883558
## Runs 0.698708995
## RBI 0.654999526
## Walks 1.466577745
## Years 1.818638592
## CAtBat 0.009877955
## CHits 0.051122325
## CHmRun 0.330954429
## CRuns 0.096656674
## CRBI 0.107287429
## CWalks 0.056021178
## LeagueN -0.697704037
## DivisionW -70.184986685
## PutOuts 0.151336492
## Assists -0.027297267
## Errors -2.055848395
## NewLeagueN -12.596879362
glmnet.ridge.pred <- predict(glmnet.ridge.fit, X[test, ], type =
"response",
s = "lambda.min")
# test MSE
mean((glmnet.ridge.pred - y[test])^2)
## [1] 97569.09
22

2.2 The Lasso
In this section we ﬁt a Lasso linear model leaving the model tuning parameter, λ, to be
determined by a cross-validation procedure.
# Calculate the Ridge Regression (alpha=0) - Cross
# Validations Sample Size (LOOCV)
set.seed(234)
# LOOCV Procedure
nfolds = nrow(Hitters.Cleaned)
glmnet.lassoLOOCV.fit <- cv.glmnet(X[train, ], y[train], type.
measure = "mse",
family = "gaussian", alpha = 1, standardize = TRUE)
The Mean-Squared Error, MSE, plot vs the log(Lambda) tuning parameter is given in
Figure 7.
plot(glmnet.lassoLOOCV.fit)
title(main = "MSE vs Log(Lambda) n Lasso Linear Fit [LOOCV,
Hitters]",
outer = TRUE)
In Figure 8 we plot the Lasso coeﬃcients vs the lambda control parameter, λ, and the
fraction of (null) deviance ratio which is explained by the model.
# Minimum Mean CV Error
23

Figure 7: Mean-Squared Error (MSE) vs the Log(Lambda) tuning parameter - Lasso Linear
Fit under LOOCV procedure (Hitters).
cvm.Min.Index <- which.min(glmnet.lassoLOOCV.fit$cvm)
# Ridge Regression coefs vs Lambda Control Parameter
plot(glmnet.lassoLOOCV.fit$glmnet.fit, xvar = "lambda", label =
TRUE)
abline(v = log(glmnet.lassoLOOCV.fit$lambda[cvm.Min.Index]),
# Ridge Regression coefs vs Deviance Ratio
plot(glmnet.lassoLOOCV.fit$glmnet.fit, xvar = "dev", label = TRUE
)
24

abline(v = log(glmnet.lassoLOOCV.fit$glmnet.fit$dev.ratio[cvm.Min
.Index]),
title(main = "Lasso Coefficients vs Log(Lambda) and n Fraction
of Deviance Explained [glmnet(LOOCV), Hitters]",
outer = TRUE)
Figure 8: Lasso coeﬃcients vs the Log(Lambda) control parameter and the fraction of
Dev.Ratio explained (glmnet under LOOCV procedure, Hitters).
25

The dashed vertical line in the figure above denotes the best λ parameter as calculated
by the LOOCV procedure. Here in contrast with the ridge regression case some of the
coefficients are actually vanished by the application of the shrinkage method.
Indeed, refitting our lasso linear model on the test data set by using the λ parameter
which has been determined by the cross-validation procedure above, we get the following
results
predict(glmnet.lassoLOOCV.fit, X[test, ], type = "coefficients",
s = "lambda.min")
## 20 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 156.72063595
## AtBat .
## Hits 1.15605082
## HmRun .
## Runs .
## RBI .
## Walks 2.18274618
## Years .
## CAtBat .
## CHits 0.01354589
## CHmRun .
## CRuns 0.12054023
## CRBI 0.44744187
## CWalks .
## LeagueN .
## DivisionW -108.31835711
## PutOuts 0.24083595
## Assists .
## Errors -1.85682194
## NewLeagueN .
glmnet.lassoLOOCV.pred <- predict(glmnet.lassoLOOCV.fit, X[test,
], type = "response", s = "lambda.min")
# test MSE
mean((glmnet.lassoLOOCV.pred - y[test])^2)
## [1] 89742.07
26

Note that the corresponding test Mean-Squared Error, MSE, is significantly lower than the
one found above with the ridge regression fit.
3 PCR and PLS Regression
3.1 Principal Components Regression
Principal components regression (PCR) can be performed by using the pcr() function,
which is part of the pls, R library. We now apply PCR to the Hitters data of our
example, and test how successfully we can predict the Salary variable. As before, we
should use a cleaned version of the data set, i.e. a data.frame without missing values.
In addition, we need to standardize all the predictors prior to applying the method.
library(pls)
set.seed(54)
In order to better estimate the model test error we perform the PCR calculation using
LOOCV. The data set is not large and the method does not need great computational
power. In case of a larger data set we could alternatively use the "CV" flag. Summarizing
our current results
pcr.fit.Hitters <- pcr(Salary ~ ., data = Hitters.Cleaned, subset
= train,
scale = TRUE, validation = "LOO")
summary(pcr.fit.Hitters)
## Data: X dimension: 129 19
## Y dimension: 129 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 129 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## CV 460.8 390.9 395.3 397.2 398.5
## adjCV 460.8 390.9 395.3 397.2 398.4
## 5 comps 6 comps 7 comps 8 comps 9 comps
## CV 400.6 392.9 396.2 400.7 403.4
## adjCV 400.5 392.7 396.1 400.6 403.2
27

## CV 413.7 419.9 423.0 442.3 438.4
## adjCV 413.5 419.7 422.7 442.0 438.0
## CV 431.1 419.8 424.2 422.8 436.4
## adjCV 430.8 419.5 423.8 422.4 435.9
##
## TRAINING: % variance explained
## X 38.34 61.02 70.97 78.61 84.17
## Salary 31.51 31.82 31.82 31.96 32.65
## X 88.97 92.32 95.20 96.45 97.58
## Salary 37.07 37.36 37.36 37.45 37.98
## X 98.26 98.82 99.28 99.54 99.78
## Salary 38.02 38.05 38.46 40.76 42.71
## 16 comps 17 comps 18 comps 19 comps
## X 99.93 99.98 99.99 100.00
## Salary 45.34 47.40 48.75 48.83
The CV score is provided for each possible number of components, ranging from M = 0
on-wards. Note that pcr() reports the root mean squared error; in order to obtain the
usual MSE, we must square this quantity.
To obtain a graphical representation of the resulting cross-validation scores
validationplot(pcr.fit.Hitters, val.type = "MSEP", xlab = "Number
of Predictors",
ylab = "MSE", main = "MSE vs Number of Predictors [PCR -
Hitters]")
From Figure 9 we see that the smallest cross-validation error occurs when M = 2 compo-
nents are used.
The summary() function also provides the percentage of variance explained in the pre-
dictors and in the response using diﬀerent numbers of components. This concept is
discussed in greater detail in later articles. Brieﬂy, we can think of this as the amount
of information about the predictors or the response that is captured using M principal
components.
28

Next, we evaluate the model’s predictive accuracy by applying the PCR fit on the test data
set.
pcr.pred.Hitters <- predict(pcr.fit.Hitters, X[test, ], ncomp =
2)
mean((pcr.pred.Hitters - y[test])^2)
## [1] 104209.3
The test MSE of the model is comparable with the results obtained using ridge regression
and the lasso. However, PCR models are more difficult to interpret because it does not
perform any kind of variable selection or even directly produce coefficient estimates.
Figure 9: Mean-Squared Error, MSE, vs Number of Predictors as calculated by the
pcr() function under LOOCV.
29

3.2 Partial Least Squares
Here, we implement the Partial Least Squares (PLS) algorithm by also utilizing functions
in the pls library.
Again we build the model on the test data set, make a plot of the resulting cross-validation
scores and ﬁnally evaluate the models’ predictive accuracy.
set.seed(478)
pls.fit.Hitters <- plsr(Salary ~ ., data = Hitters.Cleaned,
subset = train,
scale = TRUE, validation = "LOO")
summary(pls.fit.Hitters)
## Data: X dimension: 129 19
## Y dimension: 129 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 129 leave-one-out segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps
## CV 460.8 392.9 407.5 409.1 422.6
## adjCV 460.8 392.8 407.4 408.9 422.3
## CV 449.2 441.0 436.6 426.6 427.8
## adjCV 448.6 440.5 436.2 426.3 427.4
## CV 425.1 409.5 422.2 428.1 425.3
## adjCV 424.6 408.9 421.8 427.7 424.9
## CV 428.9 426.9 430.3 429.9 436.4
## adjCV 428.4 426.5 429.8 429.5 435.9
##
## TRAINING: % variance explained
## X 38.19 46.91 65.76 70.18 74.45
## Salary 33.24 37.74 38.67 40.98 43.46
## X 79.99 86.04 89.87 93.62 94.11
## Salary 44.53 45.36 46.14 46.43 47.24
30

## X 95.09 97.08 98.37 98.82 99.15
## Salary 47.65 47.88 48.09 48.36 48.55
## 16 comps 17 comps 18 comps 19 comps
## X 99.59 99.77 99.99 100.00
## Salary 48.62 48.78 48.81 48.83
validationplot(pls.fit.Hitters, val.type = "MSEP", xlab = "Number
of Predictors",
ylab = "MSE", main = "MSE vs Number of Predictors [PLS -
Hitters]")
Figure 10: Mean Squared Error (MSE) vs "Number of Predictors" as calculated by
the plsr() function under LOOCV.
31

As shown in Figure 10, the smallest cross-validation error again occurs when M = 2
components are used. The test MSE is found to be signiﬁcant smaller in this case, than
the one obtained by ridge regression, lasso and PCR methods before.
pls.pred.Hitters <- predict(pls.fit.Hitters, X[test, ], ncomp =
2)
mean((pls.pred.Hitters - y[test]))
## [1] 75.20599
mean((pls.pred.Hitters - y[test]))
## [1] 75.20599
References
[by A. C. Davison, 2010] by A. C. Davison (2010). Bootstrap Methods and Their Applica-
tion. Cambridge University Press, international edition edition.
[Hastie et al., 2011] Hastie, T., Tibshirani, R., and Friedman, J. (2011). The Elements
of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer
Series in Statistics). Springer, 2nd ed. 2009. corr. 7th printing 2013 edition.
32

Linear Model Selection and Regularization (Article 6 - Practical exercises)

Recommended

Recommended

More Related Content

What's hot

What's hot (12)

Similar to Linear Model Selection and Regularization (Article 6 - Practical exercises)

Similar to Linear Model Selection and Regularization (Article 6 - Practical exercises) (20)

Recently uploaded

Recently uploaded (20)

Linear Model Selection and Regularization (Article 6 - Practical exercises)