Linear Regression.pdf

Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-1
9/8/2019

Leonardo Auslender –Ch. 1 Copyright 2
9/8/2019
Contents
Types of Regression modeling.
Linear Regression.
Goodness of fit.
Fraud Data quick example
Homework and interview questions.
Non-Linear Regression (quickly).
Problem Areas
Heteroskedasticity
Co-linearity
Outliers
Leverage
Example of problem areas.
QQ Plots
Logging Dependent Variable
Residuals.
Interactions.
Model and Variable Selections.
Full example.
Problem if selected away too many variables.
SAS Programs
References.

9/8/2019
New Applicants,
Fraudsters,
Mortgage
Delinquents,
Diseased,
Etc.
Historical
Data
Predictive
Model
Exploratory
/ Interpretive
Model

9/8/2019
Exploratory/Interpretive and Predictive Models.
Typically, businesses interested in both aspects. By way of
example, Phone company interested in stopping attrition,
interested in understanding characteristics of past attriters
vs. non-attriters, and to predict who may be future ones to
implement anti-attrition plans. Marketing plans typically
dependent on profiling, not directly on predictive scores.
Model purported to provide tools to prevent attrition.
Typically, both aspects of modeling are not easily
achievable.
Typically, data mining, data science, etc. search for better
predictive models, leaving interpretation as secondary aim.
Thus, black-box models (e.g., neural networks) are more
readily employed.

9/8/2019
Types of Regression models.
Dependent on type of dependent or target variable (single):
Continuous:
Linear Regression: standard and most prevalent method.
Ridge Regression: Specialized for the case of co-linearity.
Lasso Regression: least absolute shrinkage and selection operator:
performs variable selection in addition to avoiding co-linearity.
Partial Least Squares: Useful when n << p, reduces # predictor to
small subset of components like PCA, then Linear regression on components.
Non-linear regression: allows for more flexible functional form.
Categorical:
Binary logistic: estimates probability of occurrence of “event”.
Ordinal logistic: Dep var has at least 3 levels, that can be ranked
(e.g.., worse, fair, better).
Nominal (or Polychotomous): More than 2 levels.
Counts (e.g., number of visits to doctors, etc).
Poisson, Negative Binomial, Zero-inflated Regression:

9/8/2019
All models are wrong, but some are useful – George Box (1976)
(right, which ones are useful?)
Linear models: area that has received the most attention in
statistical modeling. Part of supervised learning in ML, Data
Mining, etc. Unsupervised methods refer to those without
target or dependent variable (e.g., clustering, PCA).
Given dependent (or target) variable Y and set of non-random predictors X,
aim at a) finding linear function f(X) of predictors to predict Y, given some
conditioning criterion, b) interpreting function and c) once f(X) is estimated
and predictions for Y found, “loss” function is used to evaluate prediction
errors, typically squared error function.
This presentation is about Linear Regression, and its estimation
method is called Ordinary Least Squares (OLS), or Least Squares Estimation
(LSE).. Impossible to review all methods well in a single lecture.

9/8/2019
Tendency to just plunge into whatever data are available and obtain a
model. But …
Gladwell (2008, p.1 and subsequent) mentions that mortality in town of Roseto, Pa,
was far lower than in other towns. Heart attacks below age 55 were unknown while they
were prevalent all over the US. Scientists analyzed genetic information (Rosetans were
mostly originals from a small southern Italian town), exercise, diet, but could not find any
true answers at the individual level.
The difference was the social interconnections that the Rosetans had in Roseto, the
friendliness in the streets, the chatting in the old Italian dialect, 3 generations living under
the same roof, the calming effect of the church services, the egalitarian ethos that allowed
people to enrich themselves without flaunting it to others not so fortunate, in short a
protective social structure. They counted 22 civic organizations in a town of 2,000.
Spatial databases will capture the effect but not the reason unless societal
interconnections are first understood, conceptualized and measured. Typical databases
may not include them ➔
UNDERSTAND that raw data may not readily model reality and that NOT everything
is modelable.

9/8/2019

Leonardo Auslender –Ch. 1 Copyright
9/8/2019
Model: Y = 0 + 1*X1 + 2*X2 +… + p*Xp + ε =
= X + ε,
Y continuous, dependent or target variable,
X, set of predictors, either binary or continuous, all Numeric.
Linearity is assumed because the function f(X) = X’β, and β are fixed and
not random coefficients, unknown and in need of estimation.
Criterion: Minimize sum of squared errors to find Betas.
Ch. 2.2-9
* * ....
* * ....
.
.
* * ....
1 1 11 2 12 p 1p 1
2 1 21 2 22 p 2p 2
n 1 n1 2 n2 p np n
Y X X X
Y X X X
Y X X X
=  +  + +  + 
=  +  + +  + 
=  +  + +  + 

9/8/2019 Ch. 2.2-10
or succintly
:
. .
. .
. . . . .
.
. . . . .
.
. .
,
1 11 12 1p
21 22 2p
2
n1 n2 np p p
n
Matrix representation
Y X X X
X X X
Y
X X X
Y
Y X
 
 
 
       
       
 
       
       
 
= +
       
 
       
       
 
     
 
=  + 

9/8/2019
Idealized view with single predictor X.
At every
Value of
Total
length
There are
Many
values
Of body
Depth
Symmetric
ally
Distributed
.

9/8/2019
Example (Horst, 1941):
Y = pilot performance
X1 = mechanical ability of the pilot
X2 = verbal ability of the pilot
N: undetermined number of observations, p = 2.
Required: N > p (vital).
From the standpoint of computation, no assumptions on
error , because only betas are estimated. Residuals
(estimates of , are obtained by subtracting predicted from
actual.
Ch. 2.2-12
1 1 2 2
1 1 2 2
ˆ ˆ
Model: constant
Estimate constant
Y X X
Y b X b X
  
= + + +
= + +


9/8/2019
Requested Analyses: Names & Descriptions.
Model #
Model Name Model Description
***
Overall Models
-1
M1 Multivariate regression TOTAL_SPEND
1
M1_TRN_REGR_NONE Regr TRN NONE
2

9/8/2019
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 6 17605721712 2934286952 23.90 <.0001 (1)
Error 5953 7.308901E11 122776763
Corrected Total 5959 7.484958E11
Root MSE 11080 R-Square 0.0235 (2)
Dependent Mean 18608 Adj R-Sq 0.0225
Coeff Var 59.54689 (5)
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 15664 501.50886 31.23 <.0001
NUM_MEMBERS Number of members covered 1 -43.86759 144.12951 -0.30 0.7609 (3)
DOCTOR_VISITS Total visits to a doctor 1 138.06114 20.29884 6.80 <.0001 (4)
FRAUD Fraudulent Activity yes/no 1 -1813.84488 394.55243 -4.60 <.0001
MEMBER_DURATION Membership duration 1 9.29226 1.81950 5.11 <.0001
NO_CLAIMS No of claims made recently 1 -172.63492 142.61993 -1.21 0.2262
OPTOM_PRESC Number of opticals claimed 1 477.92230 88.49029 5.40 <.0001

9/8/2019

Linear Regression: geometrical projections.
Y
X
Z
Projection of Y
On X.
Projection of Y
On Z.
Y hat = a X + b Z = inner
Product of optimal projections
Of Y on the plane spanned by
X and Z.
9/8/2019 Ch. 2.2-16
Plane spanned
By X and Z

9/8/2019

9/8/2019
Assumptions beyond sheer computation.
1) Nonrandom X matrix of predictors.
2) X matrix is of full rank (vector components linearly independent).
3) ε i uncorrelated across observations, common mean 0 and
unknown positive and constant variance σ2 (E.g., not using height and weight as
predictors)..
4) ε ~ N (0, σ2).
Inferential Problems.
1) Point and CI Estimation of  and c’  (linear combination).
2) Estimation of σ2.
3) Hypothesis testing on  and c’  (linear combination of parameters), c
vector of constants.
4) Prediction (point and interval) of new Y.
Regardless of assumptions, always possible to fit a linear
regression to data. Resulting equation may not be USEFUL,
or may be misleading ➔ use your brains.
Ch. 3-18

9/8/2019
Ordinary Least Squares (OLS): ***
Estimate value that minimizes the residual sum of squares
criterion: (usually also called min SSE, we’ll use either one term):
−
= − = − −
=
  2
i ij j
1
RSS (Y ( x )) (Y X )'(Y X ),
from which:
OLS estimate : (X'X) X' Y
  

OLS estimates are B(est) L(inear) U(nbiased) E(stimators): BLUE,
that is, minimum variance, and are also the MLE (maximum
likelihood estimators). .
−
− −
 = 
= 

 =
−
 =   
 −  −


2 1
2
0 A
1 1
2
Var( ) (X'X)
Fitted or predicted Y: Y X
RSS( )
n p
To test H : C d,H : C d at level of significance :
(C d)'[C(X'X) C'] (C d)
nrow(C)
Compare to upper % point of F (nrow C, n - p)
Ch. 3-19

9/8/2019
To test whether at least one coefficient is NOT ZERO,
Use F-test, where “0” refers to null model and “F” to full.
~ ( , )
0 F
F
SSE SSE
p
F test F p n p
MSE
−
− = −
 =  
=
0 A
H : C d,H : C d
typically, d 0

9/8/2019

9/8/2019
➔Corr (X,Y) = if SD(Y) = SD(X). E.g., if both
Standardized, otherwise same sign at least, and
interpretation from correlation holds in simple regression
case.
Notice that regression of X on Y is NOT inverse of
regression of Y on X because of SD(X) and SD(Y).
ˆ

/
Confusion on signs of coefficients
and interpretation.
( )
ˆ {
( )
} ˆ
( ) ( )
y
i
xy xy
x
i
xy
Y X
s
Y Y
r r
s
X X
sg r sg
  
 
= + +
−
= = 
−
=


2
1 2
2

9/8/2019
In multiple linear regression, previous relationship does
not hold because predictors can be correlated (rxz)
weighted by ryz, hinting at co-linearity and/or relationships
of supression/enhancement.
. .
. 2
2
But in multivariate, e.g.: ,
estimated equation (emphasizing "partial")
and for example:
ˆ ˆ ˆ ,
ˆ
1
ˆ
( ) ( )
( ) ( ) and 1
YX Z YZ X
Y YX YZ XZ
YX Z
X XZ
YX
YX YZ XZ XZ
Y X Z
Y a X Z
s r r r
s r
sg sg r
abs r abs r r r
   
 
 
= + + +

= + +
−
=
−
= 
  

9/8/2019
Illustrating
Beta coefficients
Issue,
with example
From BEDA

9/8/2019
Zero X Y slope > 0, partial Y X / Z slope < 0.

9/8/2019
Note coeff (var_x) < 0 while corr (Var_Y, Var_x) > 0.
Note that non-intercept p-values are significant. .
Estimates
DF
Paramet
er
Estimate
Standar
d Error t Value Pr > |t|
Variable
1 -0.05 0.07 -0.78 0.44
Intercept
var_z
1 0.80 0.11 7.51 0.00
var_x
1 -0.49 0.11 -4.64 0.00

9/8/2019

Leonardo Auslender –Ch. 1 Copyright 9/8/2019
Goodness of fit
Some Necessary Definitions in linear models.
Correlation, R2, etc. (Y-hat = predicted Y).
ˆ
| |
ˆ ˆ
( , ) cos( , )
| |
ˆ ˆ ˆ
| | | | cos( , ) | | | |
( , 1995, .36)
1 cos 1
Y
R Corr Y Y Y Y
Y
Y Y Y Y Y Y
Wickens p
= = = 
= 
−  

Length of predicted vector never larger than original
length of Y ➔ Regression to the mean.
Ch. 2.2-28

Some Necessary Definitions in linear models, Goodness
of Fit measures: R2 (Coeff. of Determination, r2 is for
simple regr.)
0  R2  1 = Model SS (Regresion SS) / Total SS = 1 – SSE/SST
(computational formula).
With just Y and X in regression, r2 = corr2 (Y, X) (from previous formula).
n
i
i = 1
n
i
i = 1
n
i i
i = 1
2
2
2
Regression Sum of Squares =
ˆ
RSS = (y - y)
Total Sum of Squares =
TSS = (y - y)
Sum of Squares of Error =
ˆ
SSE = (y - y )




9/8/2019
Y
X
Z
90o
90o
90o

=
Corr(X, Y) cos( ),
zero-order (X,Y ) corr.

ˆ ˆ
Corr(Y, Y) cos( ) Y / Y

= =

2 2 2
R cos ( ) 1 sin ( )
 
= = −
Ŷ
ˆ ˆ
sin( ) (Y Y) / Y e / Y
 = − =

 
−
2 ˆ
R (angle between Y and Y),
and vice versa.
Ch. 3-30
Geometric appreciation.

9/8/2019
Other measures of goodness of fit.
Press residuals.
The prediction error (PRESS residuals) are defined as:
, where is the fitted value of the ith response based on all observations except
the ith one. It can be shown that (and hii element of H matrix (X’X)-1X’)
The Press Statistic.
PRESS is generally regarded as a measure of how well a regression model will
perform in predicting new data.
Small PRESS
An R2- like statistic for prediction (based on PRESS
We expect this model to explain about R2 (prediction)% of variability in predicting
new observations.
Use PRESS to compare models: A model with small PRESS is preferable to one
with large PRESS.
i
(i)
ii
e
e
1 h
=
−

9/8/2019
Evaluation of measures of goodness of fit.
2
2
1
k
Mean Square Error
[0,1], unitless, measures prop. Var(Y) fitted by preds.
ˆ
( )
1
= standard error of fit for =
S y
estimate of error Var
, .
=RM
Root MSE SE=
. ( )
n
i i
i
R
Y Y
MSE
n p
MS
Coeff Variation CV
E
RMSE
Y
=

−
= =
− −
=

( / 2 , 2 )
2
2
/ , 2
k
95%
Prediction interval for k-th observation (already in data set):
s (prediction) = se of prediction or fit for ne
ˆ
:
ˆ
w x
* { },
(
( )
1
,
( )
[ ]
k n
k
i
h s n
and y
y t s prediction
s p ed
CI
r i
x x
n x x
t S
MSE


−
−

−
+
−


2
2
1
( )
1
) [1
( )
]
k
n
i
i
X X
ction MSE
n
X X
=
−
= + +
−


9/8/2019
Evaluation of measures of goodness of fit (cont.).
1) Model better when R2 higher and/or RMSE lower. RMSE called standard error of
regression S by Minitab, but we don’t use that terminology in here.
2) RMSE: average distance between Y and predictions, R2 does not tell this. RMSE can
be used also for non-linear models, R2 cannot. Roughly, -+ 2 * RMSE produces 95%
CI for residuals.
3) 95% Prediction Interval line can be built (above and below regression line) shows
where 95% of data lie in reference to prediction line.
4) 95% Mean Fitted values can be built (above and below regression line) shows where
95% of data lie in reference to fitted line.
5) CV evaluates relative closeness of the predictions to actual values; R2 evaluates how
much of variability in Y is fitted by the model. CV cannot be used when mean(Y) = 0
or when Y has mixtures of negative and positive values in its range.
6) Usefulness: if know that predictions must be within specific interval from data points
(model precision), RMSE provides the info, R2 does not.

9/8/2019

9/8/2019
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 6 17605721712 2934286952 23.90 <.0001 (1)
Error 5953 7.308901E11 122776763
Corrected Total 5959 7.484958E11
Root MSE 11080 R-Square 0.0235 (2)
Dependent Mean 18608 Adj R-Sq 0.0225
Coeff Var 59.54689 (5)
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 15664 501.50886 31.23 <.0001
NUM_MEMBERS Number of members covered 1 -43.86759 144.12951 -0.30 0.7609 (3)
DOCTOR_VISITS Total visits to a doctor 1 138.06114 20.29884 6.80 <.0001 (4)
FRAUD Fraudulent Activity yes/no 1 -1813.84488 394.55243 -4.60 <.0001
MEMBER_DURATION Membership duration 1 9.29226 1.81950 5.11 <.0001
NO_CLAIMS No of claims made recently 1 -172.63492 142.61993 -1.21 0.2262
OPTOM_PRESC Number of opticals claimed 1 477.92230 88.49029 5.40 <.0001

9/8/2019
Variance
Variable Label DF Inflation 95% Confidence Limits
Intercept Intercept 1 0 14681 16647
NUM_MEMBERS Number of members covered 1 1.00194 -326.41369 238.67852
DOCTOR_VISITS Total visits to a doctor 1 1.04606 98.26805 177.85423
FRAUD Fraudulent Activity yes/no 1 1.20681 -2587.31068 -1040.37907
MEMBER_DURATION Membership duration 1 1.08243 5.72537 12.85914
NO_CLAIMS No of claims made recently 1 1.14992 -452.22170 106.95185
OPTOM_PRESC Number of opticals claimed 1 1.03956 304.44924 651.39535
Interpretation:
(1) Implies that the variables in the model do fit the dependent variable significantly.
(2) Is the R-square value, that is very low since it is close to 0 but still significant at alpha = 5%.
(3) The coefficient for Num_members is not significantly different from 0.
(4) The coefficient for doctor_visits is significantly different from 0. An increase in doctor_visits by 1 unit,
keeping the rest of the variables constant, raises total_spending by around $138.
(5) Notice that the 95% confidence limits for variables deemed not to be significant, overlap 0.
(6) Parameter estimates are also called “main effects” of corresponding variables and are constant.
(7) The coefficient of variation (5) is defined as RMSE / mean dep var, and is unitless, can be used to
compared different models, smaller better.

9/8/2019
Caveat about coefficient interpretation:
Classical interpretation of regression coefficient ‘b1’ for variable X1 is: For
given values of all remaining predictors, a change in one unit of X1,
changes the predicted value by ‘b1’.
In a non-experimental setting (as this one), in which data sets are collected
opportunistically, predictors/variables are not necessarily orthogonal.
As a matter of fact, it is better to at least suspect that the predictors are
correlated. In this case, it is not possible to state “keeping the rest of the
variables constant” categorically, because raising the variable of interest
by one unit, affects the values that are supposed to be left constant.
Question:
The coefficient for fraud is -1813. Since fraud = 1 implies that there was
fraud activity, can you interpret it in relation to total_spend alone? Is it
possible to immediately interpret coefficients?

9/8/2019

9/8/2019
From Fitted …..

9/8/2019
To Predicted…..

9/8/2019
Model is seriously bad.

9/8/2019
QQ plot of residuals indicates problem with dependent
Variable. Residuals supposed to be normally distributed.

9/8/2019
No comments.

9/8/2019
End of first
Modeling attempt,
Will try to
improve soon.

9/8/2019
Using categorical variables in linear or logistic regression.
Given variable X with ‘p’ levels (e.g. colors, white, black, yellow, p = 3), LR
can use this information by creating ‘p – 1’ dummy (binary) variables:
Dummy1 = 1 if color = white, else 0
Dummy2 = 1 if color = black, else 0.
If both dummy1 and dummy2 = 0 then color is yellow. Method is called
DUMMY CODING. If we created 3 Dummies for a model, we would be
creating co-linearity among the Predictors.
When just dummy coded variables are our predictors, constant of linear
regression is mean of reference group. Coefficient of dummy predictor is
mean (dummy k) – mean (reference dummy).
In EFFECT CODING, we again create p-1 binary variables, but dummy1 =
dummy2 = -1 when color is yellow.

9/8/2019
Using categorical variables in linear or logistic regression.
In LR with only effect coding predictors, constant is overall mean of dep
var. Coefficient for effect1 is mean of predictor 1 – overall mean, etc.
In case of linear regression with variable selection, ‘p’ dummies can be
constructed, because variable selection will select at most p – 1 of them.
The same rule applies to logistic regression, but not to tree based
methods. Binary Tree based models searches over all possible 2
subgroups of p-levels to find optimal splits (to be reviewed at later
lecture).

9/8/2019

9/8/2019
Constant RMSEs, different R_squares.

9/8/2019

9/8/2019
2) Grocery store investigates relationship between cardboard
bags provided freely for each purchase versus revenue and
number of customers per visit. Data on more than 10,000
purchases. Quick glance revealed that number of bags utilized
ranged from 0 to 8 with mode of 3, the distribution of revenue is
skewed leftwards and the number of customers per visit ranged
from 1 to 6 with a mode of 1.
Analyst estimated linear regression with following information,
where information in parenthesis corresponds to standard
errors. All diagnostic measures of goodness of fit were
significant and very good.

9/8/2019
a) Provide interpretation to the coefficients.
b) Is it possible to provide an interpretation to the constant?
Did you expect the constant to be 0?
c) Have any model assumptions been violated with just this
information?
d) Would you implement this model? Why? Or why not?
e) Would you consider adding a product interaction to the
equation? What would it mean?
f) Should the dependent variable “Bags” be transformed?
Why? Or why not?
g) Assume that bags are a very costly item, and that it is
important to obtain a very good model. Discuss possible ways (variables,
transformations, model searches) to improve the present one. For
instance, could baggers and bag carriers (into the car) be a solution?
Would there be an interaction with customer gender?

9/8/2019

9/8/2019
Non-linear regressions.
Any regression beyond not linear in β parameters.
Examples (θ, theta, is our previous β).
3 4
1 2
( * ** )
1 2 1
1 4 2 4 3
: * * *
: ( ) *
: * cos( ) ( * cos(2 * )
x
Powerfunction X
Weibull Growth e
Fourier X X
 
 
  
    
−
+ −
+ + + +
For linear models, SSR + SSE = SS Total, from which R2 is
derived.
In nonlinear regression, SSR + SSE ≠ SS Total! This
completely invalidates R2 for nonlinear models, and it is not
bounded between 0 and 1. Still incorrectly used in many
fields (Spiess, Neumeyer, 2010)

9/8/2019

9/8/2019
Problem area: Heterokedasticity.
Error term in the original linear equation that links Y to X is assumed to be normal
with constant variance. If this assumption is violated, then the variance (and
consequently inference based on p-values) is affected.
E(ei
2)  2
Heteroskedasticity can be visually detected in a univariate fashion by graphing each
individual predictor versus the residuals, and looking for non-random patterns. In
the context of large data bases, this method is infeasible. Analytically, there are a
serious of tests, such as:
The Breusch-Pagan LM Test (used in the table below)
The Glesjer LM Test
The Harvey-Godfrey LM Test
The Park LM Test
The Goldfeld-Quandt Tests
White’s Test

9/8/2019
Problem area: Heterokedasticity.
Resolving hetero:
(a) Generalized Least Squares
(b) Weighted Least Squares
(c) Heteroskedasticity-Consistent Estimation Methods
Modeling Heteroskedasticity: Estimated Generalized Least Squares (EGLS).
1. Obtain the original regression results and keep the residuals ei, i = 1 ,,,, n.
2. Obtain log ( )
3. Run a regression with log ( ) as dependent variable on the original predictors, and
obtain predicted values Pred.i.
4. Exponentiate the predicted values Pred.i to obtain
5. Estimate the original equation of Y on X by using weighted least squares, using as
weights.
2
i
e
2
i

1/ i


9/8/2019
Problem area: Co-linearity (wrongly called multi-colinearity).
Co-linearity: existence of linear relationships among predictors (independently of Ys)
such that estimated coefficients become unstable when those relationships among Xs
have high r-squares. In extreme case of perfectly linear relationship among predictors, one
predictor can be perfectly modeled from others, ➔ it is irrelevant in regression against Y.
For instance, if regression contains Xs that denote components of wealth and one of those predictors
is merely sum of all others, the use of all the Xs to model Y would yield an unstable equation. (because
covariance matrix is then singular).
Present way to find out about co-linearity is NOT by correlation matrices but by
calculation of variance inflation factor (VIF, we omit other methods in this presentation).
This factor is calculated for every predictor and is the transformation of the r-square of a
regression of every predictor on the rest of the predictors (notice that the dependent
variable is not part of this). The present rule of thumb is that VIF values at or above 10
should be considered too high, and probably the variable in question (or part of the other
predictors) should be studied in more detail and ultimately removed from the analysis.

9/8/2019
Co-linearity: left panel, any beta2 minimizes SS1 along
bottom ridge., beta2 is not unique,
Ridge regression penalizes ‘ridge’ and thus optimal
beta2 is unique on right panel.

9/8/2019
Consequences of Heteroskedasticity and Co-linearity
1) Beta estimators are still linear
2) Beta estimators are still unbiased
3) Beta estimators are not efficient - the minimum variance
property no longer holds, ➔ tend to vary more with data.
4) Estimates of the coefficient variances are biased
5) The standard estimated variance is not an unbiased
estimator.
6) Confidence intervals are too narrow and inference can
be misleading.

9/8/2019
Problem Area: Unusual data points: Outliers, leverage and Influence.
Outlier: data point in which value of dependent variable does not follow
general trend of rest of data. Data point is influential (has leverage) if it unduly
influences any part of regression analysis, such as predicted responses,
estimated slope coefficients, or hypothesis test results. Thus, outlier
represents extreme value of Y w.r.t. other values of Y, not to values of X.
From bi-variate standpoint, an observation could be
1) an outlier if Y is extreme,
2) influential if X is extreme,
3) both outlier and influential if X and Y are extremes,
4) & also outlier and influential if distant from rest of data w/o being
extreme,
Definitions are not precise; from bi-variate standpoint, tendency to analyze
these effects by deletion of suspected observations. However, if
percentage of suspected observations is large, then the issue as to what
constitute the population of reference is in question.

9/8/2019
Working with unusual data points.
Not without further analysis. If the unusual nature is due to data error, then
resolve the issue but in reference to the model at hand. If you are
modeling height of titans in history, then you must expect heights beyond
10ft. If not, heights beyond 7ft are suspect. That is, condition on model at
hand.
A quick way to manage unusual data points is, once a regression is
obtained, to separate those points the absolute standardized residual
value of which is higher than 2. Standardized residuals are commonly
known as studentized residuals. The studentized residuals are the
residuals divided by the standard error of a regression where the i-th
observation has been deleted.

9/8/2019

9/8/2019
Different Regressions for Total_spend: “M1” general
Name for models for Fraud data. TRN: train data.
NONE: No var sel. LT_2 and GT_2: using studentized
Residuals to split data set. No_hetero after using Weighted
Least squares.
Requested Analyses: Names & Descriptions.
Model #
***
Overall Models
-1
M1 Multivariate regression TOTAL_SPEND
1
M1_TRN_NONE_GT_2 Regr TRN NONE abs(rstud) >=2
2
M1_TRN_NONE_GT_2_NO_HETERO Regr TRN WLS abs(rstud) >=2
3
M1_TRN_NONE_LT_2 Regr TRN NONE ABS(rstud) < 2
4
M1_TRN_NONE_LT_2_NO_HETERO Regr TRN WLS ABS(rstud) < 2
5
6
M1_TRN_REGR_NONE_NO_HETERO Regr TRN WLS
7

Leonardo Auslender –Ch. 1 Copyright Ch. 3-67
9/8/2019

9/8/2019
Notice variation in p-value significance.

9/8/2019
VIFs – Co-linearity: none if accept VIF > 10 criterion.

9/8/2019
R-Squares by models: typical jump when using studentized residuals.

Checking Hetero, still persistent and also using stpw
(stepwise variable selection).
Heterokedasticity test.
Breusch-Pagan Test for Heteroskedasticity
Model Name CHI-SQUARE DF P-VALUE
Num
Obs Hetero?
M1_TRN_REGR_NONE 42.50672 5 0.000 5,960 YES
M1_TRN_NONE_LT_2 49.405089 5 0.000 5,703 YES
M1_TRN_NONE_GT_2 26.637793 5 0.000 257 YES

9/8/2019

9/8/2019
QQ-plots: Need To transform depvar
In these models.

9/8/2019

Requested Analyses: Names & Descriptions. Model
#
***
Overall Models
-1
M2 Log total spend
1
M2_TRN_NONE_GT_2 Regr TRN NONE abs(rstud) >=2
2
M2_TRN_NONE_GT_2_NO_HETERO Regr TRN WLS abs(rstud) >=2
3
M2_TRN_NONE_LT_2 Regr TRN NONE ABS(rstud) < 2
4
M2_TRN_NONE_LT_2_NO_HETERO Regr TRN WLS ABS(rstud) < 2
5
6
M2_TRN_REGR_NONE_NO_HETER
O
Regr TRN WLS
7

9/8/2019
0.0235 0.038 0.1976 0.038 0.0381 0.0236 0.0388 0.0424 0.0995 0.0424 0.0429 0.0401
M
1_NO
NE
M
1_NO
NE_G
T_2
M
1_NO
NE_G
T_2_NO
_HETERO
M
1_NO
NE_LT_2
M
1_NO
NE_LT_2_NO
_HETERO
M
1_NO
NE_NO
_HETERO
M
2_NO
NE
M
2_NO
NE_G
T_2
M
2_NO
NE_G
T_2_NO
_HETERO
M
2_NO
NE_LT_2
M
2_NO
NE_LT_2_NO
_HETERO
M
2_NO
NE_NO
_HETERO
All models
0.00
0.05
0.10
0.15
0.20
Mean
_RSQ_
Avg TRN
Dep_interval_system
_RSQ_ by Model Name
Comparing initial M1 with logged M2.

9/8/2019
Log transform works for original model.

9/8/2019

9/8/2019
Log-transformed residuals and -+2 * RMSE as Cis: most of residuals contained in
CI..
More Info in model comparison section.

9/8/2019
M1 models: original total_spend (dep var) variable.
Notice: separate scales for heteroskedastic corrected models next slide.

9/8/2019

9/8/2019
M2 models: logged total_spend (dep var) variable.

9/8/2019
Interview Question:
Is there a problem,
Possible
Improvement?
1. X typically observed,
Y predicted.
2. Predicted more
Concentrated than
observed (look at ranges).
3. A lowess curve
(smoother)
Would provide much info
(there’s linear relationship
in residuals not accounted
for by the original
regression).

9/8/2019

9/8/2019
Interaction effects in regression analysis (moderated regressions).
The typical linear model assumes a constant effect of each predictor on Y
regardless of the level(s) of the other variable(s). The usual way of introducing
interactions is by multiplication: I.e., X1 * X2, but there are infinite functional ways
of representing interactions of X1 and X2. The product of the variables is chosen due
to its ease and because the product is somewhat sensitive to diverse functional
forms.
Arguments against the use of interactions.
A) Difficult to interpret and alter original coefficients of main effects (if
present).
B) Can generate co-linearity because product of X’s is nonlinear function of
the original variables.
C) From A) and B), coefficients of main effects can become insignificant.
Arguments in favor of the use of interactions.
The multiplicative term turns the additive relationship of X1 and X2 into a
conditional one. An example will better illustrate it.

9/8/2019
Let Y = b0 + b1X1 + b2D + b3DX1 + , where X1 is continuous and D a dummy
variable. The presence of D implies that there are in effect two regressions of Y
on X1, according to the values of D: (b1 and b2 typically called main effects of X1
and X2 respectively).
1) D = 0 ➔ Y = b0 + b1X1 + 
2) D = 1 ➔ Y = (b0 + b2) + (b1 + b3) X1 + 
If D is a continuous or a ratio variable instead, which we will call X2, and
rearranging the previous formulae we obtain:
a) For given value of X2,
Y = (b0 + b2 X2) + (b1 + b3 X2) X1 + ……… (1)
b) For given value of X1,
Y = (b0 + b1X1) + (b2 + b3 X1) X2 + ……. (2)
When X1 = X2 = 0, b0 is the intercept for the two equations. The coefficient b3
has different meanings: in equation (1), it is the change in the slope of Y due to a
change in X1 for a given value of X2: Y / X1 | X2. Conversely, in equation (2),
b3 is Y / X2 | X1.

9/8/2019
Y = (b0 + b2 X2) + (b1 + b3 X2) X1 + …… (1)
Y = (b0 + b1 X1) + (b2 + b3 X1) X2 + ……. (2)
b1 and b2 are the baseline slopes. That is, b1 is the slope of Y on X1
when X2 = 0 and the roles are reversed for b2. If X2 (or X1) were not
0, the corresponding conditional slope can be extrapolated to 0.
The two regressions emerge regardless of the actual values
accorded to D. It is D’s binary nature that implies two regressions,
as shown, and two regressions as well in the continuous case.
From this discussion, b1 and b2 do not describe additive main
effects but conditional ones. The resulting surface is not a plane,
but a warped regression one. We illustrate these points in the
graphs below that represent a regression of duration of a loan until
fully paid regressed on age of the customer and amount lent (from
a different data set, not discussed in here).

9/8/2019

9/8/2019
Change in previously additive coefficients: interactive ones describe
particular conditional, not additive, effects.
Interpretation and example.
Y = f (original without X1 * X2 additive effect) (base)
Y = 90 – 2.5 * X1 - .1 *X2 + .5 * X1*X2 + . (original)
Where 5  X1  10 10  X2  100.
At the extreme points of X1, we get: (omitting the error for simplicity)
Y = (90 – 12.5) + (-.1 + 2.5) X2 .. (3) X1 = 5
Y = (90 – 25) + (-1. + .50 ) X2 … (4) X1 = 10
The same exercise can be done over the range of X2. The coefficients in (3)
and (4) cover the corresponding coefficients in the additive model,
dependent on the distribution of X1.
Next: simulation results for original equation and (3) and (4).

9/8/2019
Interaction Models
Variable
Intercept x1 x2 x1_x2
Parame
ter
Estimat
e Pr > |t|
Parame
ter
Estimat
e Pr > |t|
Parame
ter
Estimat
e Pr > |t|
Parame
ter
Estimat
e Pr > |t|
Model Name
-116 .000 24.91 .000 3.657 .000
base
Original
90.07 .000 -2.51 .000 -.103 .000 0.500 .000
Model3
77.51 .000 2.400 .000
Model4
64.99 .000 0.400 .000
Base model estimates are far off “real values” and model
is biased (true model contains interactions). In case of 3
other models, note how close parameters correspond to
previous equations, N = 10000. See residual and prediction
plots next.

9/8/2019
Seriously bad model.

9/8/2019
Better model.

9/8/2019
Original model shows dependence of pred on x2 and on interaction, not captured in
base model. Other two models show different slopes of X2 incorporating x1 effects.

9/8/2019
Interactions and Significance of Coefficients.
Similarly, standard errors become “conditional” standard errors ,
which turns the t-tests “conditional” as well: the effect of one variable
is conditioned on a particular value of a third variable. Repeating the
previous equation:
a) For given X2, Y = (b0 + b2X2) + (b1 + b3X2)X1 +  …… (1)
Conditional t-test for X1 is: t = (b1 + b3X2) / s(b1 + b3X2).
Previously significant “additive” coefficients and present “interaction”
coefficient may be insignificant. But, for a specific value of X2 (within
the observed range), the t-test may prove significant, which entails
that the regression and its ‘main’ effects are valid within specific
ranges of the conditioning variable.

9/8/2019
Interpreting main effects in regression models with
interactions
In models without interaction terms (i.e., without terms constructed as product of
other terms), regression coefficient for a variable is the slope of the regression
surface in the direction of that variable. It is constant, regardless of the values of
other variables, and therefore can be said to measure overall effect of variable.
In models with product interactions, this interpretation can be made without further
qualification only for those variables that are not involved in any interactions.
For variables that are involved in interactions, "main effect" regression coefficient
is the slope of the regression surface in the direction of that variable WHEN ALL
OTHER VARIABLES THAT INTERACT WITH THAT VARIABLE HAVE VALUES OF
ZERO, and significance test of the coefficient refers to the slope of the regression
surface ONLY IN THAT REGION OF THE PREDICTOR SPACE, which may sometimes
be far from the region in which the data lie. In Anova terms, the coefficient is
measuring a simple main effect, not an overall main effect.

9/8/2019
Main and Overall Effects.
Analogous measure of overall effect of variable: average slope of regression
surface in direction of variable, averaging over all N cases in data.
Expressed as weighted sum of regression coefficients of all the terms in the
model that involve that variable.
Weights are awkward to describe but easy to get. A variable's main- effect
coefficient always gets a weight of 1. For each other coefficient of a term
involving that variable, the weight is the mean of the product of the other
variables in that term. For example, if the model is
y = b0 + b1 * x1 + b2 * x2 +
b3 * x3 + b4 * x4 + b5 * x5 +
b12 * x1 * x2 + b13 * x1 * x3 +
b23 * x2 * x3 + b45 * x4 * x5 + b123 * x1 * x2 * x3 +
error

9/8/2019
y = b0 + b1 * x1 + b2 * x2 +
b3 * x3 + b4 * x4 + b5 * x5 +
b12 * x1 * x2 + b13 * x1 * x3 +
b23 * x2 * x3 + b45 * x4 * x5 + b123 * x1 * x2 * x3 +
error
Overall main effects are
B1 = b1 + b12 * M [x2] + b13 * M[x3] + b123 * M [x2 * x3],
B2 = b2 + b12 * M[x1] + b23 * M [x3] + b123*M[x1*x3],
B3 = b3 + b13*M[x1] + b23 * M [x2] + b123 * M [x1 * x2],
B4 = b4 + b45*M[x5],
B5 = b5 + b45 * M[x4],
where M [.] denotes the sample mean of the quantity inside the brackets. All
the product terms inside the brackets are among those that were constructed in
order to do the regression.

9/8/2019
Two-way interactions and centering.
In models that have only main effects and two-way interactions, there is a
simpler way to get the overall effects: center (at their means) all variables that
are involved in interactions. Then all the M[.] expressions will become 0, and
regression coefficients will be interpretable as overall effects. (The values of
the b's will change; the values of the B's will not.)
So, why not always center at the means, routinely? Because means are
sample dependent and model may not be replicable in different studies. I.e.,
coefficients will be different.
Of course, then their main effects will not be overall main effects unless their
means are exactly the same as yours, which is unlikely. Moreover, centering
will convert simple effects to overall effects only for models with no three-way
or higher interactions; if there are such interactions, centering will simplify the b
--> B computations but will not eliminate them.

9/8/2019
Interactions significance.
Significance of the overall effects may be tested by the usual procedures for
testing linear combinations of regression coefficients (see earlier slides on
testing linear combinations of parameters).
However, the results must be interpreted with care because overall effects are
not structural parameters but are design-dependent. The structural
parameters -- the regression coefficients (uncentered, or with rational centering)
and the error variance -- may be expected to remain invariant under changes in
the distribution of the predictors, but the overall effects will generally change.
Overall effects are specific to particular sample and should not be expected to
carry over to other samples with different distributions on predictors.
If an overall effect is significant in one study and not in another, it may reflect
nothing more than a difference in the distribution of the predictors. In particular,
it should not be taken as evidence that the relation of the d.v. to the predictors is
different in the two groups.

9/8/2019

9/8/2019
Added product interactions to Fraud data set.
doc_membd = "Doc Visits * member_dur“
Doc_numm = “Doc Visiits * Num_members”
doc_membd_numm = "Doc_visits * member_dur * num_members“

9/8/2019
Notice that doctor_visits overall effect is smaller than Corresponding
main effect dependent on 2 or 3 way interactions, etc.
doc
doc_m
em
bd
doc_m
em
bd_num
m
doc_num
m
m
em
bd
m
em
bd_num
m
no_claim
s
num
m
optom
_presc
variable
0.0
0.2
0.4
0.6
0.8
1.0
R_estimate
(Sum)
M2_3_way_skip_member_d
M2_3_way_full_model
M1_2_way_full_model
M2_3_way_full_model
M1_2_way_full_model
Rescaled Coefficients (skipping FRAUD) for all models
Narrower bar width for overall effects

9/8/2019
Changing significance levels dependent on model.
F
R
A
U
D
d
o
c
d
o
c
_
m
e
m
b
d
d
o
c
_
m
e
m
b
d
_
n
u
m
m
d
o
c
_
n
u
m
m
m
e
m
b
d
m
e
m
b
d
_
n
u
m
m
n
o
_
c
l
a
i
m
s
n
u
m
m
o
p
t
o
m
_
p
r
e
s
c
variable
-0.01
0.00
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.10
p_value2
(Sum)
0.05
M2_3_way_full_model
M1_2_way_full_model
M2_3_way_full_model
M1_2_way_full_model
, P_values (skipping FRAUD) for all models cut at 0.1
Narrower bar width for overall effects

9/8/2019
0.0292 0.0254 0.0299 0.0293
M1_2_way_full_model M1_2_way_skip_member_d M2_3_way_full_model M2_3_way_skip_member_d
model
0.0
0.2
0.4
0.6
0.8
1.0
R-squared
R-squared
R square values for models

9/8/2019
0.1001
1
0
0.1013
M1_2_way_full_model M1_2_way_skip_member_d M2_3_way_full_model M2_3_way_skip_member_d
model
0.0
0.2
0.4
0.6
0.8
1.0
Rescaled
Press
Rescaled Press
Rescaled PRESS values for models
Smaller, better

9/8/2019 Ch. 2.2-106

9/8/2019
Crass but still used Approaches
1) Fit equation with all variables and keep those the
coefficients of which are significant at 25% or so.
2) Re-fit equation with remaining variables.
3) Alternative to 1) above if number of variables too
large: Find all zero-order correlations, and choose
top K (dependent on computing resources, or time
…).
4) Redo 1) above with K variables just found.
Ch. 2.2-107

9/8/2019
Crass but still used Approaches (2 of 2).
(appeared on web, sci.stat.math.group, 11/27/06):
- I heard / read (on the Internet) that the number of variables to put as
explanatory characters must not be greater than n/10 or n/ 20 (n is the sample
size). Does someone can provide me with a serious bibliographic reference
about this ?
- Second, I've got a first set of about 15 variables (for around 150 patients),
interactions excluded. I categorized some into binary characters to yield odds-
ratios. What is the risk ? lack of power ?
- Third, performing a second selection via the stepwise algorithm, is there a
consensus about the significance cut-off (alpha). to use ? I read 20% instead of
5%. Is this usual ?
- Regarding SAS programming (I run the version 8.2 under Windows OS),
which procedure, between Proc Logistic and Proc GLM, should I choose, and
according to which criteria ?
Ch. 2.2-108

Variable Selection practice.
Mostly two approaches:
1) Sequential Inferential: such as Stepwise Family. P-values play critical
role in stopping/dropping/entry mechanism. Variable selection path also given
by partial correlations for forward and stepwise methods.
2) Stopping Mechanisms, e.g., AIC, BIC. May use variable searched
inferentially but typically searches over wider subsets. The mechanisms are
used both in Frequentist and Bayesian statistics. Shtatland et al (2000) apply
BIC and AIC for a Bayesian application of variable selection.
3) Some combination thereof.
9/8/2019 109

Forward Selection (FS, Miller (2002)).
Let Y (n,1) = X (n,p)  (p,1) + ,,,,,,,,,, , usual model, n > p.
FS minimizes (Y – Xi i)’(Y – Xi i ), (SSE) i = 1 …p
where i = Xi’Y / Xi’Xi (OLS estimate, and expanding and replacing in
previous expression) ➔
variable selected maximizes (Xi’Y)2 / Xi’Xi,
which when divided by Y’Y becomes (cos (Xi,Y))2 ➔
Choose Xi variable with smallest angle with Y (most co-linear or most
correlated).
Y
Xi i
Xi
Y - Xi i
orthogonal to Xi.
Xj
9/8/2019 110

Forward Selection.
Assume X(1) was the largest absolute correlated variable. What next?
For all other variables, calculate:
Xj,(1) = Xj – Bj,(1) X(1),,,,,,,,, where Bj,(1) is LS estimator of X(1) on Xj .
Replace Y with Y - Xi Bi,,,, and Xj with Xj,(1) for all j and proceed again as in
previous slide. Y and remaining X variables are orthogonalized to X2.
If vars are centered, second variable chosen has absolute largest partial
correlation with Y given X(1), etc.
Method minimizes RSS (= ESS) at every step ➔ maximizes R2 at every
step.
Method stops when decrease in RSS is not significant at specified level (.5):
(RSSk – RSSk+1) / (RSSk+1 / n – k – 2) compared to ‘F-to-enter’ value.
Entered variable always remains in model.
9/8/2019 Ch. 2.2-111

9/8/2019
Step Action Y X1 X2 … Xp
1 Find most correlated X
variable with Y, say X2.
Y X1 X2 … Xp
2 Regress Y, X1,X3.. and
Xp on X2.
3 Replace Y, X1, X3…Xp
by corresponding
residuals of regression
with X2.
Y -
aX1
X2 – bX1 Xp –
c X1
4 Find most correlated
variable of transformed
Y with transformed X’s,
also called partial
correlations.
5 Repeat process until
change in ESS is not
significant.
Forward Selection: Summary of steps
Ch. 2.2-112

9/8/2019 Ch. 2.2-113
Z
Y
X
Z-bX
Y-aX
After X is selected and orthogonalized
away from Z and Y, in the next step the
actual variables are Z – bX and Y – aX,
where “a” and “b” are OLS coefficients.
The first three variables (Y, X, Z) do not
play any further roles.
Starting
Axis
system.

Do zero order correlations give you
Information as to which variables will appear in the final
Model?
Hardly.
Selection path is given by partial correlations, and since path
depends on selected variables, illusory to obtain all potential
paths for typical large number of variables of Giga-bases.
Indirect remark: Model and variable interpretation typically
conceptualized from zero order correlations. But models involve partial (and
semi-partial) relations, i.e., conditional relationships ➔ model interpretation is
ALWAYS possible, it is just far more difficult when number of variables is large
and variables are related.
Semi-partial correlation is correlation between ORIGINAL Y and partialled Xs.
Partial, semipartial R2s and coefficients are counterparts of traditional R2
concepts.
9/8/2019 Ch. 2.2-114

Nomenclature
Zero order correlations: Original correlations.
First order: First partial.
Second Order: Second partial.
(First-order semipartial, etc).
Big note: Zero order correlations are always
unconditional. All others are conditional on
sequence of orthogonalized predictors.
9/8/2019 Ch. 2.2-115

Stepwise variable selection (Efroymson (1960).
1. Select most correlated variable with Y, say Z1, find linear equation, and
test for significance against F-to-entry (F distribution related to chi-sq).
2. If no significance, stop. Else, examine partial correlations (or semi-partial,
dependent on software) with Y of all Z’s not in regression.
3. Choose largest (say Z2), and regress Y on Z1 and Z2. Check for overall
significance, R2 improvement, and obtain partial F-values for both
variables (F-values = squared t-values).
4. Lowest F-value is compared to F-threshold (F-to-delete) and kept or
rejected. Stop when no more removal or additions is possible.
5. General form of F test: Add/delete X to model already containing Z:
( mod ) ( ) / # var
( )
( ( ) ( , )) / 1
( ( , ) / ( 1)
SSE reduced el SSE full diff s
F
MSE full
SSE Z SSE Z X
F
SSE X Z n p
−
= =
−
=
− −
9/8/2019 Ch. 2.2-116

Stepwise Family considerations
1. In all stepwise selections, true p-values are larger than reported
(Rencher and Pun, 1980; Freedman, 1983), and do not have proper
meaning. Correction very difficult problem.
2. If true variables are deleted, model is biased (omission bias); if redundant
variables kept, variable selection has not been attained. No insurance
about this. All possible regressions tends to produce “small” models,
which is impossible (at present) with large p, because there are 2 p
possible regressions (for p = 10 ➔ 1024, p = 20 ➔ 1,048,576, p=30 ➔
more than 1B..
3. Selection bias occurs when variable selection is not done independently
of coefficient estimation (symptomatic of tree regression).
4. Further, important subsets of variables may be omitted. E.g., if true
relationship is given by ‘distance’ between X and Y, or any linear
combination thereof, no assurance that this subset will be found. In
general, can’t find transformations / nonlinearities.
9/8/2019 Ch. 2.2-117

5. Methods are Discrete: variables either retained or discarded, and often
exhibit high variance and don’t reduce prediction error of full model
(Tibshirani likes to say this).
6. Stepwise yields models with upwardly biased R2s. Model needs
reevaluation in independent data.
7. Severe problems with co-linearity, but debatable.
8. Gives biased regression coefficients that need shrinkage (coefficients are
too large, Tibshirani, 1996).
9. Based on methods (F tests for nested models) that were intended to be
used to test pre-specified hypotheses.
9/8/2019 Ch. 2.2-118

Stepwise Family considerations.
10. Increasing sample size does not improve selection by much (Derksen and
Kesselman, 1992).
11. Induces comfort of not thinking about problem at hand. But thinkers could
be clueless or contradictory. Should be considered exploratory tool.
12. Stepwise alternatives:
1. Replacing 1, 2, 3 … variables at a time.
2. Branch and bound techniques.
3. Sequential Subsets,
4. Ridge Regression.
5. Nonnegative Garrote and Lasso, LARS…
6. Foster/Stine (See Auslender 2005)
7. Stepwise but starting from full model.
9/8/2019 Ch. 2.2-119

13. Freedman (1983) showed by simulation in the case of n/p not very
large that stepwise methods select about 15% of noise variables.
Even if all the variables are noise, a high R2 can be obtained. If
seemingly insignificant variables are dropped, the R2 will still remain high.
➔
BE VERY CAREFUL ABOUT JUMPING in joy or despair WITH
SIGNIFICANCE FINDINGS IN THE CONTEXT OF VARIABLE
SELECTION.
9/8/2019 Ch. 2.2-120

9/8/2019

9/8/2019
Tested models. Too many? .
Requested Analyses: Names & Descriptions. Model #
***
Overall Models
-1
M1 Multivariate regression TOTAL_SPEND 1
M1_TRN_NONE_GT_2 Regr TRN NONE abs(rstud) >=2 2
M1_TRN_NONE_GT_2_NO_HETERO Regr TRN WLS abs(rstud) >=2 3
M1_TRN_NONE_LT_2 Regr TRN NONE ABS(rstud) < 2 4
M1_TRN_NONE_LT_2_NO_HETERO Regr TRN WLS ABS(rstud) < 2 5
M1_TRN_REGR_NONE Regr TRN NONE 6
M1_TRN_REGR_NONE_NO_HETERO Regr TRN WLS 7
M1_TRN_REGR_STPW Regr TRN STEPWISE
8
M1_TRN_REGR_STPW_NO_HETERO Regr TRN WLS 9
M1_TRN_STPW_GT_2 Regr TRN STEPWISE abs(rstud) >=2 10
M1_TRN_STPW_GT_2_NO_HETERO Regr TRN WLS abs(rstud) >=2 11
M1_TRN_STPW_LT_2 Regr TRN STEPWISE ABS(rstud) < 2 12
M1_TRN_STPW_LT_2_NO_HETERO Regr TRN WLS ABS(rstud) < 2 13
M2 Log total spend 14
M2_TRN_NONE_GT_2 Regr TRN NONE abs(rstud) >=2 15
M2_TRN_NONE_GT_2_NO_HETERO Regr TRN WLS abs(rstud) >=2 16
M2_TRN_NONE_LT_2 Regr TRN NONE ABS(rstud) < 2 17
M2_TRN_NONE_LT_2_NO_HETERO Regr TRN WLS ABS(rstud) < 2 18
M2_TRN_REGR_NONE Regr TRN NONE 19
M2_TRN_REGR_NONE_NO_HETERO Regr TRN WLS 20
M2_TRN_REGR_STPW Regr TRN STEPWISE 21
Requested Analyses: Names & Descriptions. Model #
M2_TRN_REGR_STPW_NO_HETERO Regr TRN WLS 22
M2_TRN_STPW_GT_2 Regr TRN STEPWISE abs(rstud) >=2 23
M2_TRN_STPW_GT_2_NO_HETERO Regr TRN WLS abs(rstud) >=2 24
M2_TRN_STPW_LT_2 Regr TRN STEPWISE ABS(rstud) < 2 25
M2_TRN_STPW_LT_2_NO_HETERO Regr TRN WLS ABS(rstud) < 2 26

9/8/2019
M2 model in logs.
M1 model in original
units.

9/8/2019

9/8/2019
Breusch-Pagan Test for Heteroskedasticity
Model Name CHI-SQUARE DF P-VALUE
Num
Obs Hetero?
M1_TRN_NONE_LT_2 49.405089 5 0.000 5,703 YES
M1_TRN_NONE_GT_2 26.637793 5 0.000 257 YES
M1_TRN_REGR_STPW 40.96904 4 0.000 5,960 YES
M1_TRN_STPW_LT_2 48.606669 4 0.000 5,703 YES
M1_TRN_STPW_GT_2 25.054416 3 0.000 257 YES
M2_TRN_NONE_LT_2 15.9612 5 0.007 5,660 YES
M2_TRN_NONE_GT_2 19.9851 5 0.001 300 YES
M2_TRN_REGR_STPW 39.0082 4 0.000 5,960 YES
M2_TRN_STPW_LT_2 14.27328 4 0.006 5,664 YES
M2_TRN_STPW_GT_2 3.991856 1 0.046 296 YES

9/8/2019
VIFs omitted,
None above 2.

9/8/2019

9/8/2019
Models and ranks by Req. GOFs
Requested Gof
_PRESS_ _RMSE_ _RSQ_
Rank Rank Rank
Model Name
1 1 1
M1_NONE_GT_2
M1_NONE_GT_2_NO_HETERO
2 2 2
M1_NONE_LT_2
3 3 3
M1_NONE_LT_2_NO_HETERO
4 4 4
M1_REGR_NONE
5 5 5
M1_REGR_NONE_NO_HETERO
6 6 6
M1_REGR_STPW
7 7 7
M1_REGR_STPW_NO_HETERO
8 8 8
M1_STPW_GT_2
9 9 9
M1_STPW_GT_2_NO_HETERO
10 10 10
M1_STPW_LT_2
11 11 11
M1_STPW_LT_2_NO_HETERO
12 12 12
M2_NONE_GT_2
1 1 1
M2_NONE_GT_2_NO_HETERO
2 2 2
M2_NONE_LT_2
3 3 3
M2_NONE_LT_2_NO_HETERO
4 4 4
M2_REGR_NONE
5 5 5
M2_REGR_NONE_NO_HETERO
6 6 6
M2_REGR_STPW
7 7 7

9/8/2019
Models and ranks by Req. GOFs
Requested Gof
_AIC_ _CP_ _PRESS_ _RMSE_ _RSQADJ_ _RSQ_ _SBC_
Rank Rank Rank Rank Rank Rank Rank
Model Name
4 4 4 4 4 3 4
M1_REGR_NONE
M1_REGR_NONE_NO_HETERO 2 3 2 1 3 2 2
M1_REGR_STPW 3 1 3 3 2 4 3
M1_REGR_STPW_NO_HETERO 1 2 1 2 1 1 1
M2_REGR_BCKW 3 2 3 4 4 4 3
M2_REGR_BCKW_NO_HETERO 1 1 1 1 2 2 1
M2_REGR_NONE 4 3 4 3 3 3 4
M2_REGR_NONE_NO_HETERO 2 4 2 2 1 1 2
Removing LT_2 and GT_2 and adding other GOFs ➔ not
easy to pick up a winner.

9/8/2019
Some observations.
1) Log transformed dep var is binomial, raw-units depvar is certainly not
normal. If transforms wants normality ➔ mixture of two normals in
this case.
2) P-values are quite disparate when comparing LT_2 and GT_2 to non-
outlier models. Still, possible to see some regularity, such as
significant number of visits and claims.
3) Co-linearity is not an issue if 10 is the critical value of VIF.
4) If using RMSE as fit, NONE_LT_2_no hetero seems best performer,
more inconclusive using PRESS. When looking at the comparative
ranks table, the different methods offer different rankings.
5) Possible that alternative transformations may enhance model, and
even transforming predictors and searching for interactions. This is
especially true since all non hetero corrected showed hetero disease.
6) It is also possible to use non-regression models, such as Tree or
Neural network methods that can then be compared in terms of fit,
prediction and possible interpretability.
7) Data set has no missing values, unreal in real world.
8) Very poor model.
9) Model contains NO INTERACTIONS or predictor TRANSFORMATIONS.

9/8/2019
Modeler’s attitude
Statistical inference (Null – Alternative H) generates
attitude that model is homoskedastic, has no interactions,
etc.
Be bold and assume the following and aim at disproving:
Interactions are present
Model is heteroskedastic
Transformations are useful
Outliers are present
Co-linearity is likely
….

9/8/2019

9/8/2019
Omitted Variables Bias.
( ' ) '
( ' ) '( )
( ' ) ( '
2 2
1
1
2 2
1
2
Assume true model is: Y X X where
X is set of variables (with intercept) but X2 is incorrecly
omitted. Correct model should have both sets. By OLS:
X X X Y
X X X X X
X X X X
−
−
−
=  +  + 
 = =
=  +  + 
=  + ) ( ' ) '
ˆ ˆ
( / , ) ( / , )
( ' ) ( ' ) ( ' ) ' ( / , )
( ' ) ( ' )
ˆ
( / , ) (
1
2
2
1
2 2 2
2 2
1 2
0 1 2 2
X X X
Bias of final estimator is:
Bias X X E X X2
X X X X X X 1X E X X
X X X X
For X (entered) and X (omitted) single
variables, bias is:
Bias X X x x
−
−
−
 +  
 =  − 
=  + − 
= 
 = − ,
,
)
ˆ
( / , )
ˆ ˆ ˆ
2
1 1 2 2
1
1 1 2 2
1,2 0 1 0
s
s
s2
Bias X1 X2
s1
If ρ 0, bias in b and b . If = 0, bias in b .
 
 =  


9/8/2019
Un-reviewed topics of importance
Time Series
Spatial regression
Longitudinal Data Analysis
Simultaneous Equation Models
Mixtures
….

9/8/2019

9/8/2019
Proc Reg Data = fraud.fraud PLOTS (MAXPOINTS = NONE) OUTEST =
_estim_ (where = (_type_ = "PARMS")) TABLEOUT
PRESS /* outputs _PRESS in OUTEST data set */
COVOUT outvif RSQUARE ADJRSQ AIC BIC CP MSE SSE
EDF clb;
/* clb: confidence limits for betas */
M1_TRN_NONE : MODEL total_spend = NUM_MEMBERS DOCTOR_VISITS
FRAUD MEMBER_DURATION NO_CLAIMS OPTOM_PRESC / selection = NONE
dw r vif MSE JP CP AIC BIC SBC
ADJRSQ spec influence; /* influence produces all influence statistics
*/
OUTPUT OUT = OUTP (KEEP = p_total_spend total_spend
r_p_total_spend student lower_p_total_spend upper_p_total_spend ) P =
p_total_spend r =
r_p_total_spend student = student lcl = lower_p_total_spend ucl =
upper_p_total_spend ;
TITLE2 "First Regr depvar total_spend" ;
RUN;

9/8/2019

9/8/2019
data lessthan morethan; merge fraud.fraud
OUTP (keep = rstudent);
if abs ( rstudent ) < 2 then output lessthan;
else output morethan;
drop rstudent;
RUN;
Using studentized residual to obtain two data sets.

9/8/2019

9/8/2019
ods output heterotest = hetero_test; /*
REQUIRES ETS PACKAGE */
proc model data = dataset;
parms beta0 beta1 … betap;
depvar =
beta0 + beta1 * var1 + beta2 * var2 ...../ Breusch;
run;

9/8/2019
SAS Programs.
Original linear Regression, influence, variable selection.
Proc Reg Data = fraud.fraud PLOTS (MAXPOINTS = NONE) OUTEST =
_estim_ (where = (_type_ = "PARMS")) TABLEOUT COVOUT outvif RSQUARE ADJRSQ
AIC BIC CP MSE SSE
EDF clb;
/* clb: confidence limits for betas */
M1_TRN_NONE : MODEL total_spend = NUM_MEMBERS DOCTOR_VISITS
FRAUD MEMBER_DURATION NO_CLAIMS OPTOM_PRESC
/ selection = NONE dw r vif MSE JP CP AIC BIC SBC
ADJRSQ spec influence; /* influence produces all influence statistics */
OUTPUT OUT = OUTP (KEEP = p_total_spend total_spend
r_p_total_spend student lower_p_total_spend upper_p_total_spend ) P = p_total_spend r
=
r_p_total_spend student = student lcl = lower_p_total_spend ucl = upper_p_total_spend ;
TITLE2 "First Regr depvar total_spend" ;
RUN;
run;

9/8/2019
splitting the data according to studentized residual
data lessthan morethan;
merge fraud.fraud OUTP (keep = rstudent);
if abs ( rstudent ) < 2 then output lessthan;
else output morethan;
drop rstudent;
RUN;
Breusch-Pagan test
ods output heterotest = hetero_test; /* REQUIRES ETS PACKAGE */
proc model data = dataset;
parms beta0 beta1 ..... betap;
depvar = beta0 + beta1 * var1 + beta2 * var2 .....
/ breusch ;
Run;

9/8/2019
Graphing confidence intervals of fitted values and predictions.
Proc sgplot data = .... ;
Reg x = xvar y = yvar / cli /* or clm for fitted values*/;
Run;
For variable selection using proc reg.
Proc reg …. ;
…
Model Y = ….. / selection = forward. …. ;
Run;
Options: none, forward, backward, stepwise, etc.

9/8/2019

9/8/2019
1) You fit a multiple regression to examine the effect
of a particular variable a worker in another department is interested in. The
variable comes back insignificant, but your co-worker says that this is
impossible as it is known to have an effect. What would you say/do?
2) You have 1000 variables and 100 observations. You
would like to find the significant variables for a particular
response. What would you do (this happens in genomics for instance)?
3) What is your plan for dealing with outliers (if you have a plan)?
How about missing values? How about transformations?
4) How do you prevent over-fitting when you are creating a statistical
model (i.e., model does not fit well other samples).
5) Define/explain what forecasting is, as opposed to prediction, If
possible?

9/8/2019
1) We have a regression problem where the response is a count
variable. Which would you choose in this context, ordinary
least squares or Poisson regression (or maybe some other)?
Explain your choice, what is the main differences among these models?
2) Describe strategies to create a regression model with a very large
number of predictors, and not a lot of time.
3) Explain intuitively why a covariance matrix is positive (semi)
definite, and what that means. How can that fact be used?
4) Explain concept of interaction effects in regression models. Specifically,
can an interaction be significant while corresponding main effects are not?
Is there some difference in interpretation of interaction between OLS and
logistic regression (future chapter)?
5) Is there a problem if regression residuals are not normal?
6) When would you transform a dependent variable? An Independent one?

9/8/2019
(from the Analysis factor, 2018-NOV, theanalysisfactor.com)
1. When you add an interaction to a regression model, you can still
evaluate the main effects of the terms that make up the interaction,
just like in ANOVA.
2. The intercept is usually meaningless in a regression model.
3. In Analysis of Covariance, the covariate is a nuisance variable, and the
real point of the analysis is to evaluate the means after controlling for the
covariate.
4. Standardized regression coefficients are meaningful for dummy-coded
predictors.
5. The only way to evaluate an interaction between two independent
variables is to categorize one or both of them.
All of these are False.

9/8/2019
References
Box, G. (1976): Science and Statistics, Journal of the American Statistical
Association, 71: 791–799
Gladwell M. (2008): Outliers: The Story of Success, Little, Brown and
Company.
Horst P. (1941): The prediction of personnel adjustment. Social Science
Research and Council Bulletin, 48, 431-436.
Spiess A., Newmeyer N. (2010): An evaluation of R2 as an inadequate
measure for nonlinear models in pharmacological and biochemical
research: a Monte Carlo approach, BMC Pharmacology, 10: 6.

9/8/2019

Linear Regression.pdf

Recommended

Recommended

More Related Content

Similar to Linear Regression.pdf

Similar to Linear Regression.pdf (20)

More from Leonardo Auslender

More from Leonardo Auslender (20)

Recently uploaded

Recently uploaded (20)

Linear Regression.pdf