Chapter 9 Variable Selection and Model Building
Ray-Bing Chen
Institute of Statistics
National University of Kaohsiung
9.1 Introduction
9.1.1 The Model-Building Problem
• Ensure that the functional form of the model is
correct and that the underlying assumptions are
not violated.
• A pool of candidate regressors
• Variable selection problem
• Two conflicting objectives:
– Include as many regressors as possible: the
information content in these factors can
influence the predicted value of y
– Include as few regressors as possible: the
variance of the prediction increases as the
number of regressors increases
• “Best” regression equation???
• Several algorithms can be used for variable
selection, but these procedures frequently specify
different subsets of the candidate regressors as
best.
• An idealized setting:
– The correct functional forms of regressors are
known.
– No outliers or influential observations
• Residual analysis
• Iterative approach:
1. Apply a variable selection strategy
2. Check for correct functional forms, outliers,
and influential observations
• None of the variable selection procedures are
guaranteed to produce the best regression
equation for a given data set.
9.1.2 Consequences of Model Misspecification
• The full model
• The subset model
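In the usual notation (a sketch of the standard partitioned setup; the symbols below follow common textbook convention rather than the missing slides), the K + 1 parameters of the full model are split into the p that are retained and the r = K + 1 - p that are deleted:

    y = X\beta + \varepsilon = X_p\beta_p + X_r\beta_r + \varepsilon    (full model, K + 1 parameters)
    y = X_p\beta_p + \varepsilon                                        (subset model, p parameters)

Here X_p contains a column of ones plus the retained regressors, and X_r contains the r deleted regressors.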
• Motivation for variable selection:
– Deleting variables from the model can improve
the precision of the parameter estimates. This is
also true for the variance of the predicted response.
– Deleting variables from the model introduces
bias into the estimates.
– However, if the deleted variables have small
effects, the MSE of the biased estimates will be
less than the variance of the unbiased estimates
(summarized below).
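In outline, the standard results for the partitioned model sketched above, with A = (X_p'X_p)^{-1}X_p'X_r:

    E(\hat{\beta}_p^*) = \beta_p + A\beta_r                              (subset estimates are biased unless \beta_r = 0)
    Var(\hat{\beta}_p) - Var(\hat{\beta}_p^*)  is positive semidefinite  (deleting variables never increases the variances)
    MSE(\hat{\beta}_p^*) = \sigma^2 (X_p'X_p)^{-1} + A\beta_r\beta_r'A'
    MSE(\hat{\beta}_p^*) \le Var(\hat{\beta}_p)  whenever  Var(\hat{\beta}_r) - \beta_r\beta_r'  is positive semidefinite,

i.e., whenever the deleted coefficients are small relative to their standard errors.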
9.1.3 Criteria for Evaluating Subset Regression Models
• Coefficient of Multiple Determination:
– Aitkin (1974): R²-adequate subset: the subset
regressor variables produce R² > R²_0
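For reference, the usual definitions behind this criterion (a sketch in standard notation, not taken from the missing slides): for a subset model with p terms,

    R_p^2 = 1 - \frac{SS_{Res}(p)}{SS_T},

which never decreases as regressors are added, and Aitkin's cutoff is usually quoted as

    R_0^2 = 1 - (1 - R_{K+1}^2)(1 + d_{\alpha,n,K}),  where  d_{\alpha,n,K} = \frac{K\, F_{\alpha,K,n-K-1}}{n-K-1}

and R_{K+1}^2 is the R² of the full model with all K + 1 parameters. Any subset whose R_p^2 exceeds R_0^2 is R²-adequate at level α.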
• Uses of Regression and Model Evaluation Criteria
– Data description: Minimize SS_Res with as few
regressors as possible
– Prediction and estimation: Minimize the mean
square error of prediction; use the PRESS statistic
(see the sketch after this list)
– Parameter estimation: Chapter 10
– Control: minimize the standard errors of the
regression coefficients.
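To make the PRESS idea concrete, here is a minimal NumPy sketch (the function name and the simulated demo data are illustrative, not from the slides). It uses the standard identity e_(i) = e_i / (1 - h_ii), so the n leave-one-out fits are obtained without refitting:

    import numpy as np

    def press_statistic(X, y):
        # X is an n x p design matrix that already includes a column of ones.
        # PRESS = sum_i [ e_i / (1 - h_ii) ]^2, the sum of squared
        # leave-one-out prediction errors, computed without refitting.
        XtX_inv = np.linalg.inv(X.T @ X)
        h = np.diag(X @ XtX_inv @ X.T)          # leverages h_ii
        beta_hat = XtX_inv @ X.T @ y
        resid = y - X @ beta_hat                # ordinary residuals e_i
        return np.sum((resid / (1.0 - h)) ** 2)

    # demo with arbitrary simulated data: x2 is an irrelevant regressor
    rng = np.random.default_rng(0)
    n = 20
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = 3 + 2 * x1 + rng.normal(scale=0.5, size=n)
    X_full = np.column_stack([np.ones(n), x1, x2])
    X_sub = np.column_stack([np.ones(n), x1])
    print(press_statistic(X_full, y), press_statistic(X_sub, y))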
9.2 Computational Techniques for Variable Selection
9.2.1 All Possible Regressions
• Fit all possible regression equations, and then
select the best one by some suitable criterion.
• Assume the model includes the intercept term
• If there are K candidate regressors, there are 2^K
total equations to be estimated and examined.
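A minimal NumPy sketch of this enumeration (the function name and return format are my own choices, not from the slides): it fits all 2^K subsets, always keeping the intercept, and reports R²_p and MS_Res for each.

    from itertools import combinations
    import numpy as np

    def all_possible_regressions(X, y):
        # X: n x K matrix of candidate regressors (no intercept column).
        # Fits every subset (intercept always included) and returns a list of
        # (subset, p, R2_p, MS_Res) tuples sorted by R2_p, largest first.
        n, K = X.shape
        ss_t = np.sum((y - y.mean()) ** 2)
        results = []
        for size in range(K + 1):
            for subset in combinations(range(K), size):
                Xp = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
                beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
                ss_res = np.sum((y - Xp @ beta) ** 2)
                p = size + 1                    # parameters incl. intercept
                results.append((subset, p, 1 - ss_res / ss_t, ss_res / (n - p)))
        return sorted(results, key=lambda r: r[2], reverse=True)

With the four regressors of the Hald cement example below, this is 2^4 = 16 equations.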
Example 9.1 The Hald Cement Data
• R²_p criterion:
9.2.2 Stepwise Regression Methods
• Three broad categories:
1. Forward selection
2. Backward elimination
3. Stepwise regression
Forward selection
– Start with no regressors in the model.
– At each step, add the candidate regressor with the
largest partial F-statistic, provided it exceeds F_IN.
– Stop when no remaining candidate has a partial
F-statistic > F_IN.
Backward elimination
– Start with a model with all K candidate
regressors.
– Compute the partial F-statistic for each regressor,
and drop the regressor with the smallest partial
F-statistic, provided it is less than F_OUT.
– Stop when all partial F-statistics are > F_OUT.
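A minimal NumPy sketch of this procedure (the helper names and the default F_OUT value are illustrative, not from the slides). The partial F-statistic for each regressor is the extra-sum-of-squares F test for dropping it from the current model:

    import numpy as np

    def _ss_res(cols, y):
        # Residual SS and parameter count for an OLS fit on an intercept plus cols.
        n = len(y)
        Xp = np.column_stack([np.ones(n)] + cols)
        beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        return np.sum((y - Xp @ beta) ** 2), Xp.shape[1]

    def backward_elimination(X, y, f_out=4.0):
        # Start with all K candidate regressors; repeatedly drop the regressor
        # with the smallest partial F-statistic, as long as that F < F_OUT.
        active = list(range(X.shape[1]))
        n = len(y)
        while active:
            ss_full, p = _ss_res([X[:, j] for j in active], y)
            ms_res = ss_full / (n - p)
            # partial F for regressor j: extra SS from entering it last, over MS_Res
            f_stats = {j: (_ss_res([X[:, k] for k in active if k != j], y)[0]
                           - ss_full) / ms_res
                       for j in active}
            worst = min(f_stats, key=f_stats.get)
            if f_stats[worst] >= f_out:     # every partial F exceeds F_OUT: stop
                break
            active.remove(worst)            # drop the weakest regressor
        return active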
Stepwise Regression
• A modification of forward selection.
• A regressor added at an earlier step may become
redundant after other regressors enter; such a
variable should then be dropped from the model.
• Two cutoff values: F_IN and F_OUT
• Usually choose F_IN > F_OUT: it is more difficult to
add a regressor than to delete one.
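A sketch of the stepwise loop under the same setup (it reuses the _ss_res helper from the backward-elimination sketch above; the default F_IN and F_OUT values are illustrative):

    def stepwise_regression(X, y, f_in=4.0, f_out=3.9):
        # Forward selection with a deletion check after each addition.
        # F_IN > F_OUT: harder to enter a regressor than to remove one.
        n, K = X.shape
        active = []
        for _ in range(2 * K):              # cap iterations to keep the sketch simple
            ss_cur, _p = _ss_res([X[:, j] for j in active], y)
            # addition step: best remaining candidate, if its partial F > F_IN
            best_j, best_f = None, 0.0
            for j in range(K):
                if j in active:
                    continue
                ss_new, p_new = _ss_res([X[:, k] for k in active] + [X[:, j]], y)
                f = (ss_cur - ss_new) / (ss_new / (n - p_new))
                if f > best_f:
                    best_j, best_f = j, f
            if best_j is None or best_f < f_in:
                break                       # nothing clears F_IN: stop
            active.append(best_j)
            # deletion step: drop the weakest earlier regressor if its partial F < F_OUT
            ss_full, p_full = _ss_res([X[:, j] for j in active], y)
            ms_res = ss_full / (n - p_full)
            f_stats = {j: (_ss_res([X[:, k] for k in active if k != j], y)[0]
                           - ss_full) / ms_res
                       for j in active if j != best_j}
            if f_stats:
                worst = min(f_stats, key=f_stats.get)
                if f_stats[worst] < f_out:
                    active.remove(worst)
        return active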
