Linear Model Selection and Regularization
• Recall the linear model
Y = β0 + β1X1 + · · · + βpXp + ε.
• In the lectures that follow, we consider some approaches for
extending the linear model framework. In the lectures
covering Chapter 7 of the text, we generalize the linear
model in order to accommodate non-linear, but still
additive, relationships.
• In the lectures covering Chapter 8 we consider even more
general non-linear models.
1 / 57
In praise of linear models!
• Despite its simplicity, the linear model has distinct
advantages in terms of its interpretability and often shows
good predictive performance.
• Hence we discuss in this lecture some ways in which the
simple linear model can be improved, by replacing ordinary
least squares fitting with some alternative fitting
procedures.
2 / 57
Why consider alternatives to least squares?
• Prediction Accuracy: especially when p > n, to control the
variance.
• Model Interpretability: By removing irrelevant features —
that is, by setting the corresponding coefficient estimates
to zero — we can obtain a model that is more easily
interpreted. We will present some approaches for
automatically performing feature selection.
3 / 57
Three classes of methods
• Subset Selection. We identify a subset of the p predictors
that we believe to be related to the response. We then fit a
model using least squares on the reduced set of variables.
• Shrinkage. We fit a model involving all p predictors, but
the estimated coefficients are shrunken towards zero
relative to the least squares estimates. This shrinkage (also
known as regularization) has the effect of reducing variance
and can also perform variable selection.
• Dimension Reduction. We project the p predictors into a
M-dimensional subspace, where M < p. This is achieved by
computing M different linear combinations, or projections,
of the variables. Then these M projections are used as
predictors to fit a linear regression model by least squares.
4 / 57
Subset Selection
Best subset and stepwise model selection procedures
Best Subset Selection
1. Let M0 denote the null model, which contains no
predictors. This model simply predicts the sample mean
for each observation.
2. For k = 1, 2, . . . , p:
(a) Fit all (p choose k) models that contain exactly k predictors.
(b) Pick the best among these (p choose k) models, and call it Mk. Here
best is defined as having the smallest RSS, or equivalently the
largest R2.
3. Select a single best model from among M0, . . . , Mp using
cross-validated prediction error, Cp (AIC), BIC, or
adjusted R2.
5 / 57
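(The slides do not tie the algorithm to any software. As an illustration only, here is a minimal Python sketch of the search above, assuming NumPy, an n × p predictor matrix X, and a response vector y; all names are hypothetical.)

```python
import itertools
import numpy as np

def best_subset(X, y):
    """For each size k, return the predictor set with the smallest RSS (M_1, ..., M_p)."""
    n, p = X.shape
    best_models = {}
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in itertools.combinations(range(p), k):
            Xk = np.column_stack([np.ones(n), X[:, list(subset)]])  # add an intercept column
            beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)            # least squares fit
            rss = np.sum((y - Xk @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        best_models[k] = (best_vars, best_rss)
    return best_models
```

Even this compact loop fits all 2^p − 1 non-empty subsets, which is why best subset selection becomes infeasible for large p; the final choice among M0, . . . , Mp would still be made with cross-validated prediction error, Cp (AIC), BIC, or adjusted R2, as in step 3.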
Example: Credit data set
[Figure: residual sum of squares (left) and R2 (right) of each model, plotted against the number of predictors.]
For each possible model containing a subset of the ten predictors
in the Credit data set, the RSS and R2 are displayed. The red
frontier tracks the best model for a given number of predictors,
according to RSS and R2. Though the data set contains only
ten predictors, the x-axis ranges from 1 to 11, since one of the
variables is categorical and takes on three values, leading to the
creation of two dummy variables.
6 / 57
Extensions to other models
• Although we have presented best subset selection here for
least squares regression, the same ideas apply to other
types of models, such as logistic regression.
• The deviance— negative two times the maximized
log-likelihood— plays the role of RSS for a broader class of
models.
7 / 57
Stepwise Selection
• For computational reasons, best subset selection cannot be
applied with very large p. Why not?
• Best subset selection may also suffer from statistical
problems when p is large: the larger the search space, the
higher the chance of finding models that look good on the
training data, even though they might not have any
predictive power on future data.
• Thus an enormous search space can lead to overfitting and
high variance of the coefficient estimates.
• For both of these reasons, stepwise methods, which explore
a far more restricted set of models, are attractive
alternatives to best subset selection.
8 / 57
Forward Stepwise Selection
• Forward stepwise selection begins with a model containing
no predictors, and then adds predictors to the model,
one-at-a-time, until all of the predictors are in the model.
• In particular, at each step the variable that gives the
greatest additional improvement to the fit is added to the
model.
9 / 57
In Detail
Forward Stepwise Selection
1. Let M0 denote the null model, which contains no
predictors.
2. For k = 0, . . . , p − 1:
2.1 Consider all p − k models that augment the predictors in
Mk with one additional predictor.
2.2 Choose the best among these p − k models, and call it
Mk+1. Here best is defined as having smallest RSS or
highest R2.
3. Select a single best model from among M0, . . . , Mp using
cross-validated prediction error, Cp (AIC), BIC, or
adjusted R2.
10 / 57
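(Again purely as an illustration, not part of the original slides: a minimal Python/NumPy sketch of forward stepwise selection, with hypothetical X and y as before.)

```python
import numpy as np

def rss(X_active, y):
    """RSS of a least squares fit, with intercept, on the given columns (possibly none)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X_active]) if X_active.size else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

def forward_stepwise(X, y):
    """Return the sequence of selected index sets M_0, M_1, ..., M_p."""
    n, p = X.shape
    active, path = [], [[]]                      # start from the null model M_0
    for _ in range(p):
        # add the single remaining predictor that reduces the RSS the most
        best_j = min(set(range(p)) - set(active),
                     key=lambda j: rss(X[:, active + [j]], y))
        active = active + [best_j]
        path.append(list(active))
    return path
```

Each pass fits only p − k candidate models, so the whole path costs roughly 1 + p(p + 1)/2 fits instead of 2^p.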
More on Forward Stepwise Selection
• Computational advantage over best subset selection is
clear.
• It is not guaranteed to find the best possible model out of
all 2p models containing subsets of the p predictors. Why
not? Give an example.
11 / 57
Credit data example
# Variables   Best subset                        Forward stepwise
One           rating                             rating
Two           rating, income                     rating, income
Three         rating, income, student            rating, income, student
Four          cards, income, student, limit      rating, income, student, limit
The first four selected models for best subset selection and
forward stepwise selection on the Credit data set. The first
three models are identical but the fourth models differ.
12 / 57
Backward Stepwise Selection
• Like forward stepwise selection, backward stepwise selection
provides an efficient alternative to best subset selection.
• However, unlike forward stepwise selection, it begins with
the full least squares model containing all p predictors, and
then iteratively removes the least useful predictor,
one-at-a-time.
13 / 57
Backward Stepwise Selection: details
Backward Stepwise Selection
1. Let Mp denote the full model, which contains all p
predictors.
2. For k = p, p − 1, . . . , 1:
2.1 Consider all k models that contain all but one of the
predictors in Mk, for a total of k − 1 predictors.
2.2 Choose the best among these k models, and call it Mk−1.
Here best is defined as having smallest RSS or highest R2.
3. Select a single best model from among M0, . . . , Mp using
cross-validated prediction error, Cp (AIC), BIC, or
adjusted R2.
14 / 57
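(For completeness, a matching sketch of backward stepwise selection, reusing the rss helper from the forward stepwise sketch above; it assumes n > p so that the full model can be fit.)

```python
def backward_stepwise(X, y):
    """Return the sequence of selected index sets M_p, M_{p-1}, ..., M_0."""
    n, p = X.shape
    active, path = list(range(p)), [list(range(p))]   # start from the full model M_p
    while active:
        # drop the predictor whose removal leaves the smallest RSS
        worst_j = min(active,
                      key=lambda j: rss(X[:, [i for i in active if i != j]], y))
        active.remove(worst_j)
        path.append(list(active))
    return path
```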
More on Backward Stepwise Selection
• Like forward stepwise selection, the backward selection
approach searches through only 1 + p(p + 1)/2 models, and
so can be applied in settings where p is too large to apply
best subset selection.
• Like forward stepwise selection, backward stepwise
selection is not guaranteed to yield the best model
containing a subset of the p predictors.
• Backward selection requires that the number of samples n
is larger than the number of variables p (so that the full
model can be fit). In contrast, forward stepwise can be
used even when n < p, and so is the only viable subset
method when p is very large.
15 / 57
Choosing the Optimal Model
• The model containing all of the predictors will always have
the smallest RSS and the largest R2, since these quantities
are related to the training error.
• We wish to choose a model with low test error, not a model
with low training error. Recall that training error is usually
a poor estimate of test error.
• Therefore, RSS and R2 are not suitable for selecting the
best model among a collection of models with different
numbers of predictors.
16 / 57
Estimating test error: two approaches
• We can indirectly estimate test error by making an
adjustment to the training error to account for the bias due
to overfitting.
• We can directly estimate the test error, using either a
validation set approach or a cross-validation approach, as
discussed in previous lectures.
• We illustrate both approaches next.
17 / 57
Cp, AIC, BIC, and Adjusted R2
• These techniques adjust the training error for the model
size, and can be used to select among a set of models with
different numbers of variables.
• The next figure displays Cp, BIC, and adjusted R2 for the
best model of each size produced by best subset selection
on the Credit data set.
18 / 57
Credit data example
[Figure: Cp (left), BIC (center), and adjusted R2 (right) for the best model of each size, plotted against the number of predictors.]
19 / 57
Now for some details
• Mallow’s Cp:
Cp = (1/n) ( RSS + 2 d σ̂2 ),
where d is the total # of parameters used and σ̂2 is an
estimate of the variance of the error associated with each
response measurement.
• The AIC criterion is defined for a large class of models fit
by maximum likelihood:
AIC = −2 log L + 2 · d
where L is the maximized value of the likelihood function
for the estimated model.
• In the case of the linear model with Gaussian errors,
maximum likelihood and least squares are the same thing,
and Cp and AIC are equivalent. Prove this.
20 / 57
Details on BIC
BIC = (1/n) ( RSS + log(n) d σ̂2 ).
• Like Cp, the BIC will tend to take on a small value for a
model with a low test error, and so generally we select the
model that has the lowest BIC value.
• Notice that BIC replaces the 2dσ̂2 used by Cp with a
log(n)dσ̂2 term, where n is the number of observations.
• Since log n > 2 for any n > 7, the BIC statistic generally
places a heavier penalty on models with many variables,
and hence results in the selection of smaller models than
Cp. See Figure on slide 19.
21 / 57
Adjusted R2
• For a least squares model with d variables, the adjusted R2
statistic is calculated as
Adjusted R2 = 1 − [ RSS/(n − d − 1) ] / [ TSS/(n − 1) ],
where TSS is the total sum of squares.
• Unlike Cp, AIC, and BIC, for which a small value indicates
a model with a low test error, a large value of adjusted R2
indicates a model with a small test error.
• Maximizing the adjusted R2 is equivalent to minimizing
RSS/(n − d − 1). While RSS always decreases as the number of
variables in the model increases, RSS/(n − d − 1) may increase or
decrease, due to the presence of d in the denominator.
• Unlike the R2 statistic, the adjusted R2 statistic pays a
price for the inclusion of unnecessary variables in the
model. See Figure on slide 19.
22 / 57
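(A small illustrative helper, not from the slides, that evaluates the three criteria above for a candidate model. The slides leave σ̂2 unspecified beyond "an estimate of the error variance"; a common choice, assumed here, is the residual variance of the full least squares model.)

```python
import numpy as np

def selection_criteria(rss, n, d, sigma2_hat, tss):
    """Cp, BIC, and adjusted R^2 as defined on slides 20-22.

    rss        : residual sum of squares of the candidate model
    n          : number of observations
    d          : number of predictors in the candidate model
    sigma2_hat : estimate of the error variance (e.g. from the full model)
    tss        : total sum of squares, sum((y - mean(y))**2)
    """
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2
```

One then picks the model size with the smallest Cp or BIC, or the largest adjusted R2.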
Validation and Cross-Validation
• Each of the procedures returns a sequence of models Mk
indexed by model size k = 0, 1, 2, . . .. Our job here is to
select k̂. Once selected, we will return model Mk̂.
• We compute the validation set error or the cross-validation
error for each model Mk under consideration, and then
select the k for which the resulting estimated test error is
smallest.
• This procedure has an advantage relative to AIC, BIC, Cp,
and adjusted R2, in that it provides a direct estimate of
the test error, and doesn’t require an estimate of the error
variance σ2.
• It can also be used in a wider range of model selection
tasks, even in cases where it is hard to pinpoint the model
degrees of freedom (e.g. the number of predictors in the
model) or hard to estimate the error variance σ2.
23 / 57
Credit data example
[Figure: square root of BIC, validation set error, and cross-validation error on the Credit data, each plotted against the number of predictors.]
24 / 57
Details of Previous Figure
• The validation errors were calculated by randomly selecting
three-quarters of the observations as the training set, and
the remainder as the validation set.
• The cross-validation errors were computed using k = 10
folds. In this case, the validation and cross-validation
methods both result in a six-variable model.
• However, all three approaches suggest that the four-, five-,
and six-variable models are roughly equivalent in terms of
their test errors.
• In this setting, we can select a model using the
one-standard-error rule. We first calculate the standard
error of the estimated test MSE for each model size, and
then select the smallest model for which the estimated test
error is within one standard error of the lowest point on
the curve. What is the rationale for this?
25 / 57
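(A sketch of the one-standard-error rule just described, assuming the per-fold cross-validation MSEs have already been computed for each model size; the function and variable names are hypothetical.)

```python
import numpy as np

def one_standard_error_rule(cv_errors_by_size):
    """cv_errors_by_size maps model size k -> array of per-fold CV MSEs.

    Returns the smallest k whose mean CV error lies within one standard
    error of the lowest mean CV error on the curve.
    """
    sizes = sorted(cv_errors_by_size)
    means = {k: np.mean(cv_errors_by_size[k]) for k in sizes}
    ses = {k: np.std(cv_errors_by_size[k], ddof=1) / np.sqrt(len(cv_errors_by_size[k]))
           for k in sizes}
    k_min = min(sizes, key=lambda k: means[k])
    threshold = means[k_min] + ses[k_min]
    return min(k for k in sizes if means[k] <= threshold)
```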
Shrinkage Methods
Ridge regression and Lasso
• The subset selection methods use least squares to fit a
linear model that contains a subset of the predictors.
• As an alternative, we can fit a model containing all p
predictors using a technique that constrains or regularizes
the coefficient estimates, or equivalently, that shrinks the
coefficient estimates towards zero.
• It may not be immediately obvious why such a constraint
should improve the fit, but it turns out that shrinking the
coefficient estimates can significantly reduce their variance.
26 / 57
Ridge regression
• Recall that the least squares fitting procedure estimates
β0, β1, . . . , βp using the values that minimize
RSS = Σ_{i=1}^{n} ( yi − β0 − Σ_{j=1}^{p} βj xij )².
• In contrast, the ridge regression coefficient estimates β̂R
are the values that minimize
Σ_{i=1}^{n} ( yi − β0 − Σ_{j=1}^{p} βj xij )² + λ Σ_{j=1}^{p} βj² = RSS + λ Σ_{j=1}^{p} βj²,
where λ ≥ 0 is a tuning parameter, to be determined
separately.
27 / 57
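(For a fixed λ the ridge criterion above has a well-known closed-form minimizer. The sketch below assumes X and y have been centered, so that the intercept, which is not penalized, can be handled separately; see also the standardization discussion on slide 31.)

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Ridge estimates for centered X and y: minimizes RSS + lam * sum(beta_j^2)."""
    n, p = X.shape
    # beta_ridge = (X'X + lam * I)^(-1) X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```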
Ridge regression: continued
• As with least squares, ridge regression seeks coefficient
estimates that fit the data well, by making the RSS small.
• However, the second term, λ Σ_{j} βj², called a shrinkage
penalty, is small when β1, . . . , βp are close to zero, and so it
has the effect of shrinking the estimates of βj towards zero.
• The tuning parameter λ serves to control the relative
impact of these two terms on the regression coefficient
estimates.
• Selecting a good value for λ is critical; cross-validation is
used for this.
28 / 57
Credit data example
[Figure: standardized ridge coefficient estimates for the Credit data (Income, Limit, Rating, Student highlighted), plotted against λ (left panel) and against ‖β̂R_λ‖2 / ‖β̂‖2 (right panel).]
29 / 57
Details of Previous Figure
• In the left-hand panel, each curve corresponds to the ridge
regression coefficient estimate for one of the ten variables,
plotted as a function of λ.
• The right-hand panel displays the same ridge coefficient
estimates as the left-hand panel, but instead of displaying
λ on the x-axis, we now display ‖β̂R_λ‖2 / ‖β̂‖2, where β̂
denotes the vector of least squares coefficient estimates.
• The notation ‖β‖2 denotes the ℓ2 norm (pronounced “ell
2”) of a vector, and is defined as ‖β‖2 = √( Σ_{j=1}^{p} βj² ).
30 / 57
Ridge regression: scaling of predictors
• The standard least squares coefficient estimates are scale
equivariant: multiplying Xj by a constant c simply leads to
a scaling of the least squares coefficient estimates by a
factor of 1/c. In other words, regardless of how the jth
predictor is scaled, Xj β̂j will remain the same.
• In contrast, the ridge regression coefficient estimates can
change substantially when multiplying a given predictor by
a constant, due to the sum of squared coefficients term in
the penalty part of the ridge regression objective function.
• Therefore, it is best to apply ridge regression after
standardizing the predictors, using the formula
x̃ij = xij / √( (1/n) Σ_{i=1}^{n} (xij − x̄j)² )
31 / 57
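(In practice the standardization and the ridge fit are usually combined in one step. A sketch using scikit-learn, which the slides do not mention; Ridge's alpha plays the role of λ, and StandardScaler rescales each predictor to unit variance in the spirit of the formula above.)

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each predictor, then fit ridge regression; alpha corresponds to
# the tuning parameter lambda in the objective on slide 27.
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
# ridge_model.fit(X, y); ridge_model.predict(X_new)   # X, y, X_new are placeholders
```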
Why Does Ridge Regression Improve Over Least
Squares?
The Bias-Variance tradeoff
[Figure: squared bias, variance, and test MSE, plotted against λ (left panel) and against ‖β̂R_λ‖2 / ‖β̂‖2 (right panel).]
Simulated data with n = 50 observations, p = 45 predictors, all having
nonzero coefficients. Squared bias (black), variance (green), and test
mean squared error (purple) for the ridge regression predictions on a
simulated data set, as a function of λ and ‖β̂R_λ‖2 / ‖β̂‖2. The
horizontal dashed lines indicate the minimum possible MSE. The
purple crosses indicate the ridge regression models for which the MSE
is smallest. 32 / 57
The Lasso
• Ridge regression does have one obvious disadvantage:
unlike subset selection, which will generally select models
that involve just a subset of the variables, ridge regression
will include all p predictors in the final model.
• The Lasso is a relatively recent alternative to ridge
regression that overcomes this disadvantage. The lasso
coefficients, β̂L_λ, minimize the quantity
Σ_{i=1}^{n} ( yi − β0 − Σ_{j=1}^{p} βj xij )² + λ Σ_{j=1}^{p} |βj| = RSS + λ Σ_{j=1}^{p} |βj|.
• In statistical parlance, the lasso uses an ℓ1 (pronounced
“ell 1”) penalty instead of an ℓ2 penalty. The ℓ1 norm of a
coefficient vector β is given by ‖β‖1 = Σ_{j} |βj|.
33 / 57
The Lasso: continued
• As with ridge regression, the lasso shrinks the coefficient
estimates towards zero.
• However, in the case of the lasso, the ℓ1 penalty has the
effect of forcing some of the coefficient estimates to be
exactly equal to zero when the tuning parameter λ is
sufficiently large.
• Hence, much like best subset selection, the lasso performs
variable selection.
• We say that the lasso yields sparse models — that is,
models that involve only a subset of the variables.
• As in ridge regression, selecting a good value of λ for the
lasso is critical; cross-validation is again the method of
choice.
34 / 57
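(A lasso sketch in the same style, again using scikit-learn as an assumed tool. Note that scikit-learn's Lasso minimizes (1/(2n)) RSS + alpha ‖β‖1, so alpha corresponds to the λ above only up to that scaling.)

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The l1 penalty sets some coefficients exactly to zero, giving a sparse model.
lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
# lasso_model.fit(X, y)
# selected = [j for j, b in enumerate(lasso_model[-1].coef_) if b != 0.0]
```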
Example: Credit dataset
[Figure: standardized lasso coefficient estimates for the Credit data (Income, Limit, Rating, Student highlighted), plotted against λ (left panel) and against ‖β̂L_λ‖1 / ‖β̂‖1 (right panel).]
35 / 57
The Variable Selection Property of the Lasso
Why is it that the lasso, unlike ridge regression, results in
coefficient estimates that are exactly equal to zero?
One can show that the lasso and ridge regression coefficient
estimates solve the problems
minimize_β Σ_{i=1}^{n} ( yi − β0 − Σ_{j=1}^{p} βj xij )²  subject to  Σ_{j=1}^{p} |βj| ≤ s
and
minimize_β Σ_{i=1}^{n} ( yi − β0 − Σ_{j=1}^{p} βj xij )²  subject to  Σ_{j=1}^{p} βj² ≤ s,
respectively.
36 / 57
The Lasso Picture
[Figure: contours of the RSS together with the lasso constraint region |β1| + |β2| ≤ s (a diamond) and the ridge constraint region β1² + β2² ≤ s (a circle). The corners of the diamond lie on the axes, so the lasso solution can land exactly on an axis, setting a coefficient to zero.]
37 / 57
Comparing the Lasso and Ridge Regression
[Figure: squared bias, variance, and test MSE for the lasso, plotted against λ (left panel) and against R2 on the training data (right panel).]
Left: Plots of squared bias (black), variance (green), and test
MSE (purple) for the lasso on simulated data set of Slide 32.
Right: Comparison of squared bias, variance and test MSE
between lasso (solid) and ridge (dashed). Both are plotted
against their R2 on the training data, as a common form of
indexing. The crosses in both plots indicate the lasso model for
which the MSE is smallest.
38 / 57
Comparing the Lasso and Ridge Regression: continued
[Figure: squared bias, variance, and test MSE, plotted against λ (left panel) and against R2 on the training data (right panel).]
Left: Plots of squared bias (black), variance (green), and test
MSE (purple) for the lasso. The simulated data is similar to
that in Slide 38, except that now only two predictors are related
to the response. Right: Comparison of squared bias, variance
and test MSE between lasso (solid) and ridge (dashed). Both
are plotted against their R2 on the training data, as a common
form of indexing. The crosses in both plots indicate the lasso
model for which the MSE is smallest.
39 / 57
Conclusions
• These two examples illustrate that neither ridge regression
nor the lasso will universally dominate the other.
• In general, one might expect the lasso to perform better
when the response is a function of only a relatively small
number of predictors.
• However, the number of predictors that is related to the
response is never known a priori for real data sets.
• A technique such as cross-validation can be used in order
to determine which approach is better on a particular data
set.
40 / 57
Selecting the Tuning Parameter for Ridge Regression
and Lasso
• As for subset selection, for ridge regression and lasso we
require a method to determine which of the models under
consideration is best.
• That is, we require a method for selecting a value for the
tuning parameter λ or, equivalently, the value of the
constraint s.
• Cross-validation provides a simple way to tackle this
problem. We choose a grid of λ values, and compute the
cross-validation error rate for each value of λ.
• We then select the tuning parameter value for which the
cross-validation error is smallest.
• Finally, the model is re-fit using all of the available
observations and the selected value of the tuning
parameter.
41 / 57
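(A sketch of this grid-plus-cross-validation recipe using scikit-learn's LassoCV, an assumed tool; RidgeCV works analogously for ridge regression.)

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Ten-fold cross-validation over a grid of 100 candidate penalty values;
# after choosing the best value, the model is refit on all observations.
alphas = np.logspace(-3, 1, 100)
cv_lasso = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=10))
# cv_lasso.fit(X, y)
# chosen_alpha = cv_lasso[-1].alpha_
```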
Credit data example
[Figure: cross-validation error (left) and standardized coefficient estimates (right) for ridge regression on the Credit data, each plotted against λ.]
Left: Cross-validation errors that result from applying ridge
regression to the Credit data set with various values of λ.
Right: The coefficient estimates as a function of λ. The vertical
dashed lines indicate the value of λ selected by cross-validation.
42 / 57
Simulated data example
[Figure: ten-fold cross-validation error (left) and standardized lasso coefficient estimates (right), each plotted against ‖β̂L_λ‖1 / ‖β̂‖1.]
Left: Ten-fold cross-validation MSE for the lasso, applied to the
sparse simulated data set from Slide 39. Right: The
corresponding lasso coefficient estimates are displayed. The
vertical dashed lines indicate the lasso fit for which the
cross-validation error is smallest.
43 / 57
Dimension Reduction Methods
• The methods that we have discussed so far in this chapter
have involved fitting linear regression models, via least
squares or a shrunken approach, using the original
predictors, X1, X2, . . . , Xp.
• We now explore a class of approaches that transform the
predictors and then fit a least squares model using the
transformed variables. We will refer to these techniques as
dimension reduction methods.
44 / 57
Dimension Reduction Methods: details
• Let Z1, Z2, . . . , ZM represent M < p linear combinations of
our original p predictors. That is,
Zm = Σ_{j=1}^{p} φmj Xj        (1)
for some constants φm1, . . . , φmp.
• We can then fit the linear regression model,
yi = θ0 + Σ_{m=1}^{M} θm zim + εi,   i = 1, . . . , n,        (2)
using ordinary least squares.
• Note that in model (2), the regression coefficients are given
by θ0, θ1, . . . , θM . If the constants φm1, . . . , φmp are chosen
wisely, then such dimension reduction approaches can often
outperform OLS regression.
45 / 57
• Notice that from definition (1),
Σ_{m=1}^{M} θm zim = Σ_{m=1}^{M} θm Σ_{j=1}^{p} φmj xij = Σ_{j=1}^{p} ( Σ_{m=1}^{M} θm φmj ) xij = Σ_{j=1}^{p} βj xij,
where
βj = Σ_{m=1}^{M} θm φmj.        (3)
• Hence model (2) can be thought of as a special case of the
original linear regression model.
• Dimension reduction serves to constrain the estimated βj
coefficients, since now they must take the form (3).
• Can win in the bias-variance tradeoff.
46 / 57
Principal Components Regression
• Here we apply principal components analysis (PCA)
(discussed in Chapter 10 of the text) to define the linear
combinations of the predictors, for use in our regression.
• The first principal component is that (normalized) linear
combination of the variables with the largest variance.
• The second principal component has largest variance,
subject to being uncorrelated with the first.
• And so on.
• Hence with many correlated original variables, we replace
them with a small set of principal components that capture
their joint variation.
47 / 57
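(PCR is simply PCA followed by least squares on the leading components. A scikit-learn sketch, assuming that library and a hypothetical choice M = 3; in practice M is chosen by cross-validation, as on slide 53.)

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, keep the first M principal components Z_1, ..., Z_M,
# then fit ordinary least squares on those M derived predictors.
M = 3
pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
# pcr.fit(X, y); pcr.predict(X_new)
```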
Pictures of PCA
[Figure: Ad Spending versus Population for 100 cities.]
The population size (pop) and ad spending (ad) for 100
different cities are shown as purple circles. The green solid line
indicates the first principal component, and the blue dashed line
indicates the second principal component.
48 / 57
Pictures of PCA: continued
[Figure: left panel, Ad Spending versus Population; right panel, 2nd Principal Component versus 1st Principal Component.]
A subset of the advertising data. Left: The first principal
component, chosen to minimize the sum of the squared
perpendicular distances to each point, is shown in green. These
distances are represented using the black dashed line segments.
Right: The left-hand panel has been rotated so that the first
principal component lies on the x-axis.
49 / 57
Pictures of PCA: continued
[Figure: Population (left) and Ad Spending (right) plotted against the 1st Principal Component scores.]
Plots of the first principal component scores zi1 versus pop and
ad. The relationships are strong.
50 / 57
Pictures of PCA: continued
[Figure: Population (left) and Ad Spending (right) plotted against the 2nd Principal Component scores.]
Plots of the second principal component scores zi2 versus pop
and ad. The relationships are weak.
51 / 57
Application to Principal Components Regression
[Figure: squared bias, variance, and test MSE plotted against the number of components for two simulated data sets.]
PCR was applied to two simulated data sets. The black, green,
and purple lines correspond to squared bias, variance, and test
mean squared error, respectively. Left: Simulated data from
slide 32. Right: Simulated data from slide 39.
52 / 57
Choosing the number of directions M
[Figure: standardized PCR coefficient estimates for the Credit data (Income, Limit, Rating, Student) and the 10-fold cross-validation MSE, each plotted against the number of components M.]
Left: PCR standardized coefficient estimates on the Credit
data set for different values of M. Right: The 10-fold cross
validation MSE obtained using PCR, as a function of M.
53 / 57
Partial Least Squares
• PCR identifies linear combinations, or directions, that best
represent the predictors X1, . . . , Xp.
• These directions are identified in an unsupervised way, since
the response Y is not used to help determine the principal
component directions.
• That is, the response does not supervise the identification
of the principal components.
• Consequently, PCR suffers from a potentially serious
drawback: there is no guarantee that the directions that
best explain the predictors will also be the best directions
to use for predicting the response.
54 / 57
Partial Least Squares: continued
• Like PCR, PLS is a dimension reduction method, which
first identifies a new set of features Z1, . . . , ZM that are
linear combinations of the original features, and then fits a
linear model via OLS using these M new features.
• But unlike PCR, PLS identifies these new features in a
supervised way – that is, it makes use of the response Y in
order to identify new features that not only approximate
the old features well, but also that are related to the
response.
• Roughly speaking, the PLS approach attempts to find
directions that help explain both the response and the
predictors.
55 / 57
Details of Partial Least Squares
• After standardizing the p predictors, PLS computes the
first direction Z1 by setting each φ1j in (1) equal to the
coefficient from the simple linear regression of Y onto Xj.
• One can show that this coefficient is proportional to the
correlation between Y and Xj.
• Hence, in computing Z1 = Σ_{j=1}^{p} φ1j Xj, PLS places the
highest weight on the variables that are most strongly
related to the response.
• Subsequent directions are found by taking residuals and
then repeating the above prescription.
56 / 57
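(A PLS sketch using scikit-learn's PLSRegression, again an assumed tool; the number of directions M would be chosen by cross-validation in practice.)

```python
from sklearn.cross_decomposition import PLSRegression

# Partial least squares with M supervised directions; unlike PCR, the response y
# is used when the directions Z_1, ..., Z_M are constructed.
# PLSRegression standardizes the predictors internally by default (scale=True).
M = 2
pls = PLSRegression(n_components=M)
# pls.fit(X, y); pls.predict(X_new)
```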
Summary
• Model selection methods are an essential tool for data
analysis, especially for big datasets involving many
predictors.
• Research into methods that give sparsity, such as the lasso,
is an especially hot area.
• Later, we will return to sparsity in more detail, and will
describe related approaches such as the elastic net.
57 / 57
More Related Content

What's hot

Logistic regression
Logistic regressionLogistic regression
Logistic regressionDrZahid Khan
 
Support vector machine
Support vector machineSupport vector machine
Support vector machineRishabh Gupta
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithmRashid Ansari
 
Linear regression without tears
Linear regression without tearsLinear regression without tears
Linear regression without tearsAnkit Sharma
 
Logistic regression : Use Case | Background | Advantages | Disadvantages
Logistic regression : Use Case | Background | Advantages | DisadvantagesLogistic regression : Use Case | Background | Advantages | Disadvantages
Logistic regression : Use Case | Background | Advantages | DisadvantagesRajat Sharma
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Mohammed Musah
 
Machine Learning-Linear regression
Machine Learning-Linear regressionMachine Learning-Linear regression
Machine Learning-Linear regressionkishanthkumaar
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regressionAkhilesh Joshi
 
Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentationAyanaRukasar
 
Introduction to Principle Component Analysis
Introduction to Principle Component AnalysisIntroduction to Principle Component Analysis
Introduction to Principle Component AnalysisSunjeet Jena
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...Simplilearn
 
Ensemble methods in machine learning
Ensemble methods in machine learningEnsemble methods in machine learning
Ensemble methods in machine learningSANTHOSH RAJA M G
 
multiple linear regression
multiple linear regressionmultiple linear regression
multiple linear regressionAkhilesh Joshi
 
ML - Multiple Linear Regression
ML - Multiple Linear RegressionML - Multiple Linear Regression
ML - Multiple Linear RegressionAndrew Ferlitsch
 

What's hot (20)

Multiple Linear Regression
Multiple Linear Regression Multiple Linear Regression
Multiple Linear Regression
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Support vector machine
Support vector machineSupport vector machine
Support vector machine
 
Random forest algorithm
Random forest algorithmRandom forest algorithm
Random forest algorithm
 
Logistic Regression Analysis
Logistic Regression AnalysisLogistic Regression Analysis
Logistic Regression Analysis
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
Linear regression without tears
Linear regression without tearsLinear regression without tears
Linear regression without tears
 
Logistic regression : Use Case | Background | Advantages | Disadvantages
Logistic regression : Use Case | Background | Advantages | DisadvantagesLogistic regression : Use Case | Background | Advantages | Disadvantages
Logistic regression : Use Case | Background | Advantages | Disadvantages
 
Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)Introduction to principal component analysis (pca)
Introduction to principal component analysis (pca)
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Machine Learning-Linear regression
Machine Learning-Linear regressionMachine Learning-Linear regression
Machine Learning-Linear regression
 
Regression analysis
Regression analysisRegression analysis
Regression analysis
 
decision tree regression
decision tree regressiondecision tree regression
decision tree regression
 
Support Vector Machine ppt presentation
Support Vector Machine ppt presentationSupport Vector Machine ppt presentation
Support Vector Machine ppt presentation
 
Introduction to Principle Component Analysis
Introduction to Principle Component AnalysisIntroduction to Principle Component Analysis
Introduction to Principle Component Analysis
 
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
K Means Clustering Algorithm | K Means Clustering Example | Machine Learning ...
 
Support Vector machine
Support Vector machineSupport Vector machine
Support Vector machine
 
Ensemble methods in machine learning
Ensemble methods in machine learningEnsemble methods in machine learning
Ensemble methods in machine learning
 
multiple linear regression
multiple linear regressionmultiple linear regression
multiple linear regression
 
ML - Multiple Linear Regression
ML - Multiple Linear RegressionML - Multiple Linear Regression
ML - Multiple Linear Regression
 

Similar to Model selection

604_multiplee.ppt
604_multiplee.ppt604_multiplee.ppt
604_multiplee.pptRufesh
 
variableselectionmodelBuilding.ppt
variableselectionmodelBuilding.pptvariableselectionmodelBuilding.ppt
variableselectionmodelBuilding.pptrajalakshmi5921
 
Stepwise Selection Choosing the Optimal Model .ppt
Stepwise Selection  Choosing the Optimal Model .pptStepwise Selection  Choosing the Optimal Model .ppt
Stepwise Selection Choosing the Optimal Model .pptneelamsanjeevkumar
 
Sequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_modelsSequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_modelsYoussefKitane
 
AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012OptiModel
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakashShivaram Prakash
 
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...Amir Ziai
 
A parsimonious SVM model selection criterion for classification of real-world ...
A parsimonious SVM model selection criterion for classification of real-world ...A parsimonious SVM model selection criterion for classification of real-world ...
A parsimonious SVM model selection criterion for classification of real-world ...o_almasi
 
ASS_SDM2012_Ali
ASS_SDM2012_AliASS_SDM2012_Ali
ASS_SDM2012_AliMDO_Lab
 
Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4YoussefKitane
 
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionAdapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionIJECEIAES
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Universitat Politècnica de Catalunya
 
Machine learning (5)
Machine learning (5)Machine learning (5)
Machine learning (5)NYversity
 

Similar to Model selection (20)

604_multiplee.ppt
604_multiplee.ppt604_multiplee.ppt
604_multiplee.ppt
 
variableselectionmodelBuilding.ppt
variableselectionmodelBuilding.pptvariableselectionmodelBuilding.ppt
variableselectionmodelBuilding.ppt
 
Stepwise Selection Choosing the Optimal Model .ppt
Stepwise Selection  Choosing the Optimal Model .pptStepwise Selection  Choosing the Optimal Model .ppt
Stepwise Selection Choosing the Optimal Model .ppt
 
Sequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_modelsSequential estimation of_discrete_choice_models
Sequential estimation of_discrete_choice_models
 
report
reportreport
report
 
15303589.ppt
15303589.ppt15303589.ppt
15303589.ppt
 
OTTO-Report
OTTO-ReportOTTO-Report
OTTO-Report
 
AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakash
 
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
 
Module-2_ML.pdf
Module-2_ML.pdfModule-2_ML.pdf
Module-2_ML.pdf
 
A parsimonious SVM model selection criterion for classification of real-world ...
A parsimonious SVM model selection criterion for classification of real-world ...A parsimonious SVM model selection criterion for classification of real-world ...
A parsimonious SVM model selection criterion for classification of real-world ...
 
ASS_SDM2012_Ali
ASS_SDM2012_AliASS_SDM2012_Ali
ASS_SDM2012_Ali
 
Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4
 
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionAdapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Dm
DmDm
Dm
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 
Machine learning (5)
Machine learning (5)Machine learning (5)
Machine learning (5)
 
SEM
SEMSEM
SEM
 

More from Animesh Kumar

10th class weaker students case
10th class weaker students case10th class weaker students case
10th class weaker students caseAnimesh Kumar
 
Portfolios, mutual funds class
Portfolios, mutual funds  classPortfolios, mutual funds  class
Portfolios, mutual funds classAnimesh Kumar
 
Spss tutorial-cluster-analysis
Spss tutorial-cluster-analysisSpss tutorial-cluster-analysis
Spss tutorial-cluster-analysisAnimesh Kumar
 
Mit18 05 s14_class4slides
Mit18 05 s14_class4slidesMit18 05 s14_class4slides
Mit18 05 s14_class4slidesAnimesh Kumar
 
Mit18 05 s14_class3slides
Mit18 05 s14_class3slidesMit18 05 s14_class3slides
Mit18 05 s14_class3slidesAnimesh Kumar
 
Mit18 05 s14_class2slides
Mit18 05 s14_class2slidesMit18 05 s14_class2slides
Mit18 05 s14_class2slidesAnimesh Kumar
 
Mit18 05 s14_class2slides
Mit18 05 s14_class2slidesMit18 05 s14_class2slides
Mit18 05 s14_class2slidesAnimesh Kumar
 

More from Animesh Kumar (13)

10th class weaker students case
10th class weaker students case10th class weaker students case
10th class weaker students case
 
Portfolios, mutual funds class
Portfolios, mutual funds  classPortfolios, mutual funds  class
Portfolios, mutual funds class
 
2016 may cover
2016 may cover2016 may cover
2016 may cover
 
2016 june cover
2016 june cover2016 june cover
2016 june cover
 
2016 april cover
2016 april cover2016 april cover
2016 april cover
 
2016 march cover
2016 march cover2016 march cover
2016 march cover
 
Spss tutorial-cluster-analysis
Spss tutorial-cluster-analysisSpss tutorial-cluster-analysis
Spss tutorial-cluster-analysis
 
Cjc tutorial
Cjc tutorialCjc tutorial
Cjc tutorial
 
Pen drive bootable
Pen drive bootablePen drive bootable
Pen drive bootable
 
Mit18 05 s14_class4slides
Mit18 05 s14_class4slidesMit18 05 s14_class4slides
Mit18 05 s14_class4slides
 
Mit18 05 s14_class3slides
Mit18 05 s14_class3slidesMit18 05 s14_class3slides
Mit18 05 s14_class3slides
 
Mit18 05 s14_class2slides
Mit18 05 s14_class2slidesMit18 05 s14_class2slides
Mit18 05 s14_class2slides
 
Mit18 05 s14_class2slides
Mit18 05 s14_class2slidesMit18 05 s14_class2slides
Mit18 05 s14_class2slides
 

Recently uploaded

Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Quarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayQuarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayMakMakNepo
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersSabitha Banu
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPCeline George
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxLigayaBacuel1
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceSamikshaHamane
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 

Recently uploaded (20)

Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Quarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up FridayQuarter 4 Peace-education.pptx Catch Up Friday
Quarter 4 Peace-education.pptx Catch Up Friday
 
DATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginnersDATA STRUCTURE AND ALGORITHM for beginners
DATA STRUCTURE AND ALGORITHM for beginners
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
How to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERPHow to do quick user assign in kanban in Odoo 17 ERP
How to do quick user assign in kanban in Odoo 17 ERP
 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Roles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in PharmacovigilanceRoles & Responsibilities in Pharmacovigilance
Roles & Responsibilities in Pharmacovigilance
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 

Model selection

  • 1. Linear Model Selection and Regularization • Recall the linear model Y = β0 + β1X1 + · · · + βpXp + . • In the lectures that follow, we consider some approaches for extending the linear model framework. In the lectures covering Chapter 7 of the text, we generalize the linear model in order to accommodate non-linear, but still additive, relationships. • In the lectures covering Chapter 8 we consider even more general non-linear models. 1 / 57
  • 2. In praise of linear models! • Despite its simplicity, the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance. • Hence we discuss in this lecture some ways in which the simple linear model can be improved, by replacing ordinary least squares fitting with some alternative fitting procedures. 2 / 57
  • 3. Why consider alternatives to least squares? • Prediction Accuracy: especially when p > n, to control the variance. • Model Interpretability: By removing irrelevant features — that is, by setting the corresponding coefficient estimates to zero — we can obtain a model that is more easily interpreted. We will present some approaches for automatically performing feature selection. 3 / 57
  • 4. Three classes of methods • Subset Selection. We identify a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables. • Shrinkage. We fit a model involving all p predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection. • Dimension Reduction. We project the p predictors into a M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares. 4 / 57
  • 5. Subset Selection Best subset and stepwise model selection procedures Best Subset Selection 1. Let M0 denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation. 2. For k = 1, 2, . . . p: (a) Fit all p k models that contain exactly k predictors. (b) Pick the best among these p k models, and call it Mk. Here best is defined as having the smallest RSS, or equivalently largest R2 . 3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2. 5 / 57
  • 6. Example- Credit data set 2 4 6 8 10 2e+074e+076e+078e+07 Number of Predictors ResidualSumofSquares 2 4 6 8 10 0.00.20.40.60.81.0 Number of Predictors R2 For each possible model containing a subset of the ten predictors in the Credit data set, the RSS and R2 are displayed. The red frontier tracks the best model for a given number of predictors, according to RSS and R2. Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables 6 / 57
  • 7. Extensions to other models • Although we have presented best subset selection here for least squares regression, the same ideas apply to other types of models, such as logistic regression. • The deviance— negative two times the maximized log-likelihood— plays the role of RSS for a broader class of models. 7 / 57
  • 8. Stepwise Selection • For computational reasons, best subset selection cannot be applied with very large p. Why not? • Best subset selection may also suffer from statistical problems when p is large: larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data. • Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates. • For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection. 8 / 57
  • 9. Forward Stepwise Selection • Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model. • In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model. 9 / 57
  • 10. In Detail Forward Stepwise Selection 1. Let M0 denote the null model, which contains no predictors. 2. For k = 0, . . . , p − 1: 2.1 Consider all p − k models that augment the predictors in Mk with one additional predictor. 2.2 Choose the best among these p − k models, and call it Mk+1. Here best is defined as having smallest RSS or highest R2 . 3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2. 10 / 57
  • 11. More on Forward Stepwise Selection • Computational advantage over best subset selection is clear. • It is not guaranteed to find the best possible model out of all 2p models containing subsets of the p predictors. Why not? Give an example. 11 / 57
  • 12. Credit data example # Variables Best subset Forward stepwise One rating rating Two rating, income rating, income Three rating, income, student rating, income, student Four cards, income rating, income, student, limit student, limit The first four selected models for best subset selection and forward stepwise selection on the Credit data set. The first three models are identical but the fourth models differ. 12 / 57
  • 13. Backward Stepwise Selection • Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. • However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one-at-a-time. 13 / 57
  • 14. Backward Stepwise Selection: details Backward Stepwise Selection 1. Let Mp denote the full model, which contains all p predictors. 2. For k = p, p − 1, . . . , 1: 2.1 Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors. 2.2 Choose the best among these k models, and call it Mk−1. Here best is defined as having smallest RSS or highest R2 . 3. Select a single best model from among M0, . . . , Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2. 14 / 57
  • 15. More on Backward Stepwise Selection • Like forward stepwise selection, the backward selection approach searches through only 1 + p(p + 1)/2 models, and so can be applied in settings where p is too large to apply best subset selection • Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of the p predictors. • Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large. 15 / 57
  • 16. Choosing the Optimal Model • The model containing all of the predictors will always have the smallest RSS and the largest R2, since these quantities are related to the training error. • We wish to choose a model with low test error, not a model with low training error. Recall that training error is usually a poor estimate of test error. • Therefore, RSS and R2 are not suitable for selecting the best model among a collection of models with different numbers of predictors. 16 / 57
Estimating test error: two approaches
• We can indirectly estimate test error by making an adjustment to the training error to account for the bias due to overfitting.
• We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in previous lectures.
• We illustrate both approaches next.
17 / 57
Cp, AIC, BIC, and Adjusted R2
• These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.
• The next figure displays Cp, BIC, and adjusted R2 for the best model of each size produced by best subset selection on the Credit data set.
18 / 57
Credit data example
[Figure: Cp, BIC, and adjusted R2 for the best model of each size on the Credit data set, each plotted against the number of predictors.]
19 / 57
Now for some details
• Mallow’s Cp:
  Cp = (1/n) (RSS + 2 d σ̂²),
  where d is the total # of parameters used and σ̂² is an estimate of the variance of the error ε associated with each response measurement.
• The AIC criterion is defined for a large class of models fit by maximum likelihood:
  AIC = −2 log L + 2 d,
  where L is the maximized value of the likelihood function for the estimated model.
• In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and Cp and AIC are equivalent. Prove this.
20 / 57
Details on BIC
  BIC = (1/n) (RSS + log(n) d σ̂²).
• Like Cp, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.
• Notice that BIC replaces the 2 d σ̂² used by Cp with a log(n) d σ̂² term, where n is the number of observations.
• Since log n > 2 for any n > 7, the BIC statistic generally places a heavier penalty on models with many variables, and hence results in the selection of smaller models than Cp. See Figure on slide 19.
21 / 57
Adjusted R2
• For a least squares model with d variables, the adjusted R2 statistic is calculated as
  Adjusted R2 = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)],
  where TSS is the total sum of squares.
• Unlike Cp, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted R2 indicates a model with a small test error.
• Maximizing the adjusted R2 is equivalent to minimizing RSS/(n − d − 1). While RSS always decreases as the number of variables in the model increases, RSS/(n − d − 1) may increase or decrease, due to the presence of d in the denominator.
• Unlike the R2 statistic, the adjusted R2 statistic pays a price for the inclusion of unnecessary variables in the model. See Figure on slide 19.
22 / 57
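As a quick sketch, the three adjusted criteria above can be computed from fitted values, the model size d, and an estimate σ̂² of the error variance (typically taken from the residuals of the full model); AIC is omitted since, for Gaussian linear models, it is equivalent to Cp. The function name and arguments are illustrative.

```python
# Cp, BIC, and adjusted R2 for a fitted least squares model with d predictors.
import numpy as np

def selection_criteria(y, y_hat, d, sigma2_hat):
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, bic, adj_r2
```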
Validation and Cross-Validation
• Each of the procedures returns a sequence of models Mk indexed by model size k = 0, 1, 2, . . .. Our job here is to select k̂. Once selected, we will return model Mk̂.
• We compute the validation set error or the cross-validation error for each model Mk under consideration, and then select the k for which the resulting estimated test error is smallest.
• This procedure has an advantage relative to AIC, BIC, Cp, and adjusted R2, in that it provides a direct estimate of the test error, and doesn’t require an estimate of the error variance σ².
• It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance σ².
23 / 57
Credit data example
[Figure: square root of BIC, validation set error, and cross-validation error for the Credit data set, each plotted against the number of predictors.]
24 / 57
Details of Previous Figure
• The validation errors were calculated by randomly selecting three-quarters of the observations as the training set, and the remainder as the validation set.
• The cross-validation errors were computed using k = 10 folds. In this case, the validation and cross-validation methods both result in a six-variable model.
• However, all three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors.
• In this setting, we can select a model using the one-standard-error rule. We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. What is the rationale for this?
25 / 57
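The one-standard-error rule is easy to sketch in code. The array cv_errors below is an assumed (model size × folds) matrix of per-fold cross-validation MSEs, with rows ordered from the smallest to the largest model; the function returns the row index of the selected model size.

```python
# One-standard-error rule applied to a grid of per-fold CV errors.
import numpy as np

def one_standard_error_rule(cv_errors):
    mean_err = cv_errors.mean(axis=1)
    se_err = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])
    best = np.argmin(mean_err)
    threshold = mean_err[best] + se_err[best]
    # smallest model whose estimated test error is within one SE of the minimum
    return int(np.argmax(mean_err <= threshold))
```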
Shrinkage Methods
Ridge regression and Lasso
• The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
• As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
• It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.
26 / 57
Ridge regression
• Recall that the least squares fitting procedure estimates β0, β1, . . . , βp using the values that minimize
  RSS = Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )².
• In contrast, the ridge regression coefficient estimates β̂R are the values that minimize
  Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )² + λ Σ_{j=1}^p βj² = RSS + λ Σ_{j=1}^p βj²,
  where λ ≥ 0 is a tuning parameter, to be determined separately.
27 / 57
Ridge regression: continued
• As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small.
• However, the second term, λ Σ_j βj², called a shrinkage penalty, is small when β1, . . . , βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
• The tuning parameter λ serves to control the relative impact of these two terms on the regression coefficient estimates.
• Selecting a good value for λ is critical; cross-validation is used for this.
28 / 57
Credit data example
[Figure: standardized ridge coefficient estimates for Income, Limit, Rating, and Student on the Credit data set, plotted against λ (left) and against ‖β̂R_λ‖₂ / ‖β̂‖₂ (right).]
29 / 57
Details of Previous Figure
• In the left-hand panel, each curve corresponds to the ridge regression coefficient estimate for one of the ten variables, plotted as a function of λ.
• The right-hand panel displays the same ridge coefficient estimates as the left-hand panel, but instead of displaying λ on the x-axis, we now display ‖β̂R_λ‖₂ / ‖β̂‖₂, where β̂ denotes the vector of least squares coefficient estimates.
• The notation ‖β‖₂ denotes the ℓ2 norm (pronounced “ell 2”) of a vector, and is defined as ‖β‖₂ = √( Σ_{j=1}^p βj² ).
30 / 57
Ridge regression: scaling of predictors
• The standard least squares coefficient estimates are scale equivariant: multiplying Xj by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c. In other words, regardless of how the jth predictor is scaled, Xj β̂j will remain the same.
• In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
• Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
  x̃ij = xij / √( (1/n) Σ_{i=1}^n (xij − x̄j)² ).
31 / 57
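A minimal sketch of ridge regression on standardized predictors, using scikit-learn: Ridge's alpha plays the role of λ, and the toy arrays X and y below are illustrative stand-ins for a data set such as Credit.

```python
# Ridge regression after standardizing the predictors; coefficients shrink
# toward zero as alpha (lambda) grows.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # stand-in for 10 Credit-style predictors
y = 5 * X[:, 0] + rng.normal(size=100)  # toy response

for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X, y)
    coefs = model.named_steps["ridge"].coef_
    print(alpha, np.round(coefs, 3))    # estimates shrink as alpha increases
```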
Why Does Ridge Regression Improve Over Least Squares? The Bias-Variance tradeoff
[Figure: squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of λ (left) and of ‖β̂R_λ‖₂ / ‖β̂‖₂ (right).]
Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients. The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest.
32 / 57
The Lasso
• Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
• The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, β̂L_λ, minimize the quantity
  Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )² + λ Σ_{j=1}^p |βj| = RSS + λ Σ_{j=1}^p |βj|.
• In statistical parlance, the lasso uses an ℓ1 (pronounced “ell 1”) penalty instead of an ℓ2 penalty. The ℓ1 norm of a coefficient vector β is given by ‖β‖₁ = Σ_j |βj|.
33 / 57
The Lasso: continued
• As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
• However, in the case of the lasso, the ℓ1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large.
• Hence, much like best subset selection, the lasso performs variable selection.
• We say that the lasso yields sparse models — that is, models that involve only a subset of the variables.
• As in ridge regression, selecting a good value of λ for the lasso is critical; cross-validation is again the method of choice.
34 / 57
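The sparsity property is easy to see in a short sketch. Here scikit-learn's Lasso is used on toy data (its alpha plays the role of λ, up to scikit-learn's scaling of the RSS term); the data and grid of alphas are illustrative assumptions.

```python
# The lasso sets some coefficients exactly to zero as the penalty grows.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.normal(size=(200, 20)))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only 2 relevant predictors

for alpha in [0.01, 0.1, 1.0]:
    fit = Lasso(alpha=alpha).fit(X, y)
    n_nonzero = np.sum(fit.coef_ != 0)
    print(f"alpha={alpha}: {n_nonzero} nonzero coefficients")
# larger alpha forces more coefficients to be exactly zero (a sparser model)
```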
Example: Credit dataset
[Figure: standardized lasso coefficient estimates for Income, Limit, Rating, and Student on the Credit data set, plotted against λ (left) and against ‖β̂L_λ‖₁ / ‖β̂‖₁ (right).]
35 / 57
The Variable Selection Property of the Lasso
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?
One can show that the lasso and ridge regression coefficient estimates solve the problems
  minimize_β Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )²  subject to  Σ_{j=1}^p |βj| ≤ s
and
  minimize_β Σ_{i=1}^n ( yi − β0 − Σ_{j=1}^p βj xij )²  subject to  Σ_{j=1}^p βj² ≤ s,
respectively.
36 / 57
Comparing the Lasso and Ridge Regression
[Figure] Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on the simulated data set of Slide 32. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R2 on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
38 / 57
Comparing the Lasso and Ridge Regression: continued
[Figure] Left: Plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. The simulated data is similar to that in Slide 38, except that now only two predictors are related to the response. Right: Comparison of squared bias, variance and test MSE between lasso (solid) and ridge (dashed). Both are plotted against their R2 on the training data, as a common form of indexing. The crosses in both plots indicate the lasso model for which the MSE is smallest.
39 / 57
Conclusions
• These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other.
• In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
• However, the number of predictors that is related to the response is never known a priori for real data sets.
• A technique such as cross-validation can be used in order to determine which approach is better on a particular data set.
40 / 57
Selecting the Tuning Parameter for Ridge Regression and Lasso
• As for subset selection, for ridge regression and the lasso we require a method to determine which of the models under consideration is best.
• That is, we require a method for selecting a value of the tuning parameter λ, or equivalently, of the constraint s.
• Cross-validation provides a simple way to tackle this problem. We choose a grid of λ values, and compute the cross-validation error rate for each value of λ.
• We then select the tuning parameter value for which the cross-validation error is smallest.
• Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter.
41 / 57
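This procedure is what scikit-learn's LassoCV (and, analogously, RidgeCV) automates; a minimal sketch on toy data is below. The grid of alphas, the fold count, and the data are illustrative choices.

```python
# Choosing the lasso penalty by 10-fold cross-validation over a grid of alphas.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

alphas = np.logspace(-3, 1, 50)
cv_fit = LassoCV(alphas=alphas, cv=10).fit(X, y)
print("selected alpha:", cv_fit.alpha_)
# After cross-validation, LassoCV refits on all observations at the selected
# alpha, matching the procedure described on the slide above.
```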
Credit data example
[Figure] Left: Cross-validation errors that result from applying ridge regression to the Credit data set with various values of λ. Right: The coefficient estimates as a function of λ. The vertical dashed lines indicate the value of λ selected by cross-validation.
42 / 57
Simulated data example
[Figure] Left: Ten-fold cross-validation MSE for the lasso, applied to the sparse simulated data set from Slide 39, plotted against ‖β̂L_λ‖₁ / ‖β̂‖₁. Right: The corresponding lasso coefficient estimates are displayed. The vertical dashed lines indicate the lasso fit for which the cross-validation error is smallest.
43 / 57
Dimension Reduction Methods
• The methods that we have discussed so far in this chapter have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors, X1, X2, . . . , Xp.
• We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.
44 / 57
Dimension Reduction Methods: details
• Let Z1, Z2, . . . , ZM represent M < p linear combinations of our original p predictors. That is,
  Zm = Σ_{j=1}^p φmj Xj    (1)
  for some constants φm1, . . . , φmp.
• We can then fit the linear regression model
  yi = θ0 + Σ_{m=1}^M θm zim + εi,   i = 1, . . . , n,    (2)
  using ordinary least squares.
• Note that in model (2), the regression coefficients are given by θ0, θ1, . . . , θM. If the constants φm1, . . . , φmp are chosen wisely, then such dimension reduction approaches can often outperform OLS regression.
45 / 57
• Notice that from definition (1),
  Σ_{m=1}^M θm zim = Σ_{m=1}^M θm Σ_{j=1}^p φmj xij = Σ_{j=1}^p ( Σ_{m=1}^M θm φmj ) xij = Σ_{j=1}^p βj xij,
  where
  βj = Σ_{m=1}^M θm φmj.    (3)
• Hence model (2) can be thought of as a special case of the original linear regression model.
• Dimension reduction serves to constrain the estimated βj coefficients, since now they must take the form (3).
• Can win in the bias-variance tradeoff.
46 / 57
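Identity (3) can be checked numerically: regressing y on Z = XΦᵀ and mapping the θ estimates back through Φ gives coefficients βj on the original predictors that reproduce the same fitted values. The projection matrix Phi below is an arbitrary illustrative choice, not a principal-component or PLS direction.

```python
# Numerical check of identity (3): a regression on M linear combinations is a
# constrained regression on the original p predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, p, M = 100, 6, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
Phi = rng.normal(size=(M, p))        # rows hold phi_m1, ..., phi_mp

Z = X @ Phi.T                        # z_im = sum_j phi_mj x_ij
fit = LinearRegression().fit(Z, y)
beta = Phi.T @ fit.coef_             # beta_j = sum_m theta_m phi_mj
print(np.allclose(fit.predict(Z), X @ beta + fit.intercept_))  # True
```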
Principal Components Regression
• Here we apply principal components analysis (PCA) (discussed in Chapter 10 of the text) to define the linear combinations of the predictors, for use in our regression.
• The first principal component is that (normalized) linear combination of the variables with the largest variance.
• The second principal component has largest variance, subject to being uncorrelated with the first.
• And so on.
• Hence with many correlated original variables, we replace them with a small set of principal components that capture their joint variation.
47 / 57
Pictures of PCA
[Figure] The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component, and the blue dashed line indicates the second principal component.
48 / 57
Pictures of PCA: continued
[Figure] A subset of the advertising data. Left: The first principal component, chosen to minimize the sum of the squared perpendicular distances to each point, is shown in green. These distances are represented using the black dashed line segments. Right: The left-hand panel has been rotated so that the first principal component lies on the x-axis.
49 / 57
Pictures of PCA: continued
[Figure] Plots of the first principal component scores zi1 versus pop and ad. The relationships are strong.
50 / 57
Pictures of PCA: continued
[Figure] Plots of the second principal component scores zi2 versus pop and ad. The relationships are weak.
51 / 57
Application to Principal Components Regression
[Figure] PCR was applied to two simulated data sets. The black, green, and purple lines correspond to squared bias, variance, and test mean squared error, respectively. Left: Simulated data from slide 32. Right: Simulated data from slide 39.
52 / 57
Choosing the number of directions M
[Figure] Left: PCR standardized coefficient estimates on the Credit data set for different values of M. Right: The 10-fold cross-validation MSE obtained using PCR, as a function of M.
53 / 57
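A minimal sketch of PCR with M chosen by cross-validation, assuming standardized predictors and toy data in place of the Credit set: take the first M principal components of X and regress y on them by least squares.

```python
# Principal components regression: standardize, project onto M components,
# regress by OLS; M is chosen by 10-fold cross-validated MSE.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

cv_mse = []
for M in range(1, X.shape[1] + 1):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
    scores = cross_val_score(pcr, X, y, cv=10, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())
best_M = int(np.argmin(cv_mse)) + 1
print("best number of components:", best_M)
```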
Partial Least Squares
• PCR identifies linear combinations, or directions, that best represent the predictors X1, . . . , Xp.
• These directions are identified in an unsupervised way, since the response Y is not used to help determine the principal component directions.
• That is, the response does not supervise the identification of the principal components.
• Consequently, PCR suffers from a potentially serious drawback: there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.
54 / 57
Partial Least Squares: continued
• Like PCR, PLS is a dimension reduction method, which first identifies a new set of features Z1, . . . , ZM that are linear combinations of the original features, and then fits a linear model via OLS using these M new features.
• But unlike PCR, PLS identifies these new features in a supervised way – that is, it makes use of the response Y in order to identify new features that not only approximate the old features well, but also that are related to the response.
• Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.
55 / 57
Details of Partial Least Squares
• After standardizing the p predictors, PLS computes the first direction Z1 by setting each φ1j in (1) equal to the coefficient from the simple linear regression of Y onto Xj.
• One can show that this coefficient is proportional to the correlation between Y and Xj.
• Hence, in computing Z1 = Σ_{j=1}^p φ1j Xj, PLS places the highest weight on the variables that are most strongly related to the response.
• Subsequent directions are found by taking residuals and then repeating the above prescription.
56 / 57
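A minimal PLS sketch with scikit-learn's PLSRegression on toy data; n_components plays the role of M, and the data are illustrative. With standardized predictors, the first weight vector is proportional to the vector of simple regression coefficients of Y on each Xj, as described above.

```python
# Partial least squares with two directions; PLSRegression standardizes the
# predictors internally (scale=True is the default).
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
y = X[:, 0] - X[:, 1] + rng.normal(size=200)

pls = PLSRegression(n_components=2).fit(X, y)
print(pls.x_weights_[:, 0])  # first PLS direction: weights proportional (up to
                             # normalization) to each predictor's correlation with y
```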
Summary
• Model selection methods are an essential tool for data analysis, especially for big datasets involving many predictors.
• Research into methods that give sparsity, such as the lasso, is an especially hot area.
• Later, we will return to sparsity in more detail, and will describe related approaches such as the elastic net.
57 / 57