Contents
Types of Regression modeling.
Linear Regression.
Goodness of fit.
Fraud Data quick example
Homework and interview questions.
Non-Linear Regression (quickly).
Problem Areas
Heteroskedasticity
Co-linearity
Outliers
Leverage
Example of problem areas.
QQ Plots
Logging Dependent Variable
Residuals.
Interactions.
Model and Variable Selections.
Full example.
Problems when too many variables are selected away.
SAS Programs
References.
[Diagram: Historical Data (new applicants, fraudsters, mortgage delinquents, diseased, etc.) feeds both a Predictive Model and an Exploratory/Interpretive Model.]
Exploratory/Interpretive and Predictive Models.

Typically, businesses are interested in both aspects. By way of example, a phone company interested in stopping attrition wants to understand the characteristics of past attriters vs. non-attriters, and to predict who may be future ones in order to implement anti-attrition plans. Marketing plans typically depend on profiling, not directly on predictive scores. The model is purported to provide tools to prevent attrition.

Typically, both aspects of modeling are not easily achievable at once.

Typically, data mining, data science, etc. search for better predictive models, leaving interpretation as a secondary aim. Thus, black-box models (e.g., neural networks) are more readily employed.
Types of Regression models.

Dependent on the type of dependent or target variable (single):

Continuous:
Linear Regression: standard and most prevalent method.
Ridge Regression: specialized for the case of co-linearity.
Lasso Regression (least absolute shrinkage and selection operator): performs variable selection in addition to avoiding co-linearity.
Partial Least Squares: useful when n << p; reduces the # of predictors to a small subset of components like PCA, then linear regression on the components.
Non-linear regression: allows for more flexible functional forms.

Categorical:
Binary logistic: estimates probability of occurrence of "event".
Ordinal logistic: dep var has at least 3 levels that can be ranked (e.g., worse, fair, better).
Nominal (or polychotomous) logistic: more than 2 unordered levels.

Counts (e.g., number of visits to doctors, etc.): Poisson, Negative Binomial, Zero-inflated Regression.
All models are wrong, but some are useful – George Box (1976).
(Right, but which ones are useful?)

Linear models: the area that has received the most attention in statistical modeling. Part of supervised learning in ML, Data Mining, etc. Unsupervised methods refer to those without a target or dependent variable (e.g., clustering, PCA).

Given a dependent (or target) variable Y and a set of non-random predictors X, the aims are a) finding a linear function f(X) of the predictors to predict Y, given some conditioning criterion, b) interpreting the function, and c) once f(X) is estimated and predictions for Y are found, using a "loss" function to evaluate prediction errors, typically the squared error function.

This presentation is about Linear Regression, whose estimation method is called Ordinary Least Squares (OLS), or Least Squares Estimation (LSE). Impossible to review all methods well in a single lecture.
Tendency to just plunge into whatever data are available and obtain a model. But …

Gladwell (2008, p.1 and subsequent) mentions that mortality in the town of Roseto, PA, was far lower than in other towns. Heart attacks below age 55 were unknown there, while they were prevalent all over the US. Scientists analyzed genetic information (Rosetans mostly originated from a small southern Italian town), exercise, diet, but could not find any true answers at the individual level.

The difference was the social interconnections that the Rosetans had in Roseto: the friendliness in the streets, the chatting in the old Italian dialect, 3 generations living under the same roof, the calming effect of the church services, the egalitarian ethos that allowed people to enrich themselves without flaunting it to others not so fortunate; in short, a protective social structure. They counted 22 civic organizations in a town of 2,000.

Spatial databases will capture the effect but not the reason unless societal interconnections are first understood, conceptualized and measured. Typical databases may not include them ➔

UNDERSTAND that raw data may not readily model reality and that NOT everything is modelable.
Model: Y = 0 + 1*X1 + 2*X2 +… + p*Xp + ε =
= X + ε,
Y continuous, dependent or target variable,
X, set of predictors, either binary or continuous, all Numeric.
Linearity is assumed because the function f(X) = X’β, and β are fixed and
not random coefficients, unknown and in need of estimation.
Criterion: Minimize sum of squared errors to find Betas.
Written observation by observation:

Y_1 = β_1 X_11 + β_2 X_12 + … + β_p X_1p + ε_1
Y_2 = β_1 X_21 + β_2 X_22 + … + β_p X_2p + ε_2
 ⋮
Y_n = β_1 X_n1 + β_2 X_n2 + … + β_p X_np + ε_n
Matrix representation:

[Y_1]   [X_11 X_12 … X_1p] [β_1]   [ε_1]
[Y_2] = [X_21 X_22 … X_2p] [β_2] + [ε_2]
[ ⋮ ]   [  ⋮    ⋮      ⋮ ] [ ⋮ ]   [ ⋮ ]
[Y_n]   [X_n1 X_n2 … X_np] [β_p]   [ε_n]

or succinctly: Y = Xβ + ε.
Idealized view with a single predictor X: at every value of total length there are many values of body depth, symmetrically distributed.
Example (Horst, 1941):
Y = pilot performance
X1 = mechanical ability of the pilot
X2 = verbal ability of the pilot
N: undetermined number of observations, p = 2.
Required: N > p (vital).

From the standpoint of computation, no assumptions on the error ε are needed, because only the betas are estimated. Residuals (estimates of ε) are obtained by subtracting predicted from actual values.
Model: Y = β_0 + β_1 X_1 + β_2 X_2 + ε   (β_0 constant)
Estimate: Ŷ = b_0 + b_1 X_1 + b_2 X_2   (b_0 estimated constant)
Requested Analyses: Names & Descriptions.

Model #  Model Name        Model Description
 -1      ***               Overall Models
  1      M1                Multivariate regression TOTAL_SPEND
  2      M1_TRN_REGR_NONE  Regr TRN NONE
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 6 17605721712 2934286952 23.90 <.0001 (1)
Error 5953 7.308901E11 122776763
Corrected Total 5959 7.484958E11
Root MSE 11080 R-Square 0.0235 (2)
Dependent Mean 18608 Adj R-Sq 0.0225
Coeff Var 59.54689 (5)
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 15664 501.50886 31.23 <.0001
NUM_MEMBERS Number of members covered 1 -43.86759 144.12951 -0.30 0.7609 (3)
DOCTOR_VISITS Total visits to a doctor 1 138.06114 20.29884 6.80 <.0001 (4)
FRAUD Fraudulent Activity yes/no 1 -1813.84488 394.55243 -4.60 <.0001
MEMBER_DURATION Membership duration 1 9.29226 1.81950 5.11 <.0001
NO_CLAIMS No of claims made recently 1 -172.63492 142.61993 -1.21 0.2262
OPTOM_PRESC Number of opticals claimed 1 477.92230 88.49029 5.40 <.0001
Linear Regression: geometrical projections.

[Figure: Y, X, Z axes; projection of Y on X, projection of Y on Z, and the plane spanned by X and Z.]

Ŷ = aX + bZ: the combination of the optimal projections of Y on the plane spanned by X and Z.
Assumptions beyond sheer computation.

1) Nonrandom X matrix of predictors.
2) X matrix is of full rank (vector components linearly independent; e.g., not using height and weight as predictors).
3) ε_i uncorrelated across observations, common mean 0 and unknown positive and constant variance σ².
4) ε ~ N(0, σ²).

Inferential problems.

1) Point and CI estimation of β and c'β (linear combination).
2) Estimation of σ².
3) Hypothesis testing on β and c'β (linear combination of parameters), c a vector of constants.
4) Prediction (point and interval) of a new Y.

Regardless of assumptions, it is always possible to fit a linear regression to data. The resulting equation may not be USEFUL, or may be misleading ➔ use your brains.
Ordinary Least Squares (OLS):

Estimate the β value that minimizes the residual sum of squares criterion (usually also called min SSE; we'll use either term):

RSS(β) = Σ_i (Y_i − Σ_j x_ij β_j)² = (Y − Xβ)'(Y − Xβ),

from which the OLS estimate is: β̂ = (X'X)⁻¹X'Y.

OLS estimates are B(est) L(inear) U(nbiased) E(stimators): BLUE, that is, minimum variance among linear unbiased estimators, and are also the MLE (maximum likelihood estimators) under normality.

Var(β̂) = σ²(X'X)⁻¹.
Fitted or predicted Y: Ŷ = Xβ̂.
σ̂² = RSS(β̂) / (n − p).

To test H_0: Cβ = d vs. H_A: Cβ ≠ d at level of significance α:

F = (Cβ̂ − d)'[C(X'X)⁻¹C']⁻¹(Cβ̂ − d) / (nrow(C) · σ̂²).

Compare to the upper α% point of F(nrow(C), n − p).
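A minimal SAS/IML sketch of these formulas, not part of the original deck; the data set work.train and variables y, x1–x3 are hypothetical placeholders (requires SAS/IML):

proc iml;
  use work.train;                         /* hypothetical training data set */
  read all var {x1 x2 x3} into X0;
  read all var {y} into Y;
  close work.train;
  n = nrow(X0);
  X = j(n, 1, 1) || X0;                   /* prepend intercept column */
  beta_hat  = inv(X`*X) * X` * Y;         /* OLS: (X'X)^-1 X'Y */
  resid     = Y - X*beta_hat;             /* residuals */
  p         = ncol(X);
  sigma2hat = ssq(resid) / (n - p);       /* RSS(beta_hat) / (n - p) */
  var_beta  = sigma2hat * inv(X`*X);      /* Var(beta_hat) = sigma^2 (X'X)^-1 */
  print beta_hat sigma2hat;
quit;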
To test whether at least one coefficient is NOT ZERO, use the F-test, where "0" refers to the null model and "F" to the full one:

F = [(SSE_0 − SSE_F) / p] / MSE_F ~ F(p, n − p).

H_0: Cβ = d, H_A: Cβ ≠ d; typically, d = 0.
Confusion on signs of coefficients and interpretation.

Model: Y = β_1 + β_2 X + ε. The OLS estimate is

β̂_2 = r_xy · (s_y / s_x),   where r_xy = Σ(X_i − X̄)(Y_i − Ȳ) / √[Σ(X_i − X̄)² Σ(Y_i − Ȳ)²],

so sg(β̂_2) = sg(r_xy).

➔ Corr(X,Y) = β̂_2 if SD(Y) = SD(X), e.g., if both are standardized; otherwise they have the same sign at least, and the interpretation from the correlation holds in the simple regression case.

Notice that the regression of X on Y is NOT the inverse of the regression of Y on X, because of SD(X) and SD(Y).
In multiple linear regression, the previous relationship does not hold because predictors can be correlated (r_XZ), weighted by r_YZ, hinting at co-linearity and/or relationships of suppression/enhancement.

In multivariate, e.g.: Y = β_1 + β_2 X + β_3 Z + ε, the estimated equation (emphasizing "partial") is

Ŷ = a + β̂_YX.Z X + β̂_YZ.X Z,   and, for example,

β̂_YX.Z = (s_Y / s_X) · (r_YX − r_YZ r_XZ) / (1 − r²_XZ),

so sg(β̂_YX.Z) = sg(r_YX − r_YZ r_XZ), which can differ from sg(r_YX) when abs(r_YX) < abs(r_YZ r_XZ), with r²_XZ < 1.
Illustrating the beta-coefficients issue, with an example from BEDA.
Zero-order X–Y slope > 0; partial Y–X slope given Z < 0.
Note coeff(var_x) < 0 while corr(var_y, var_x) > 0. Note that the non-intercept p-values are significant.

Estimates:
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1               -0.05            0.07    -0.78      0.44
var_z       1                0.80            0.11     7.51      0.00
var_x       1               -0.49            0.11    -4.64      0.00
Goodness of fit.

Some necessary definitions in linear models: correlation, R², etc. (Ŷ = predicted Y).

R = Corr(Y, Ŷ) = cos(Y, Ŷ) = |Ŷ| / |Y|,
|Ŷ| = |Y| · cos(Y, Ŷ)   (Wickens, 1995, p. 36),
−1 ≤ cos ≤ 1.

The length of the predicted vector is never larger than the original length of Y ➔ regression to the mean.
Some necessary definitions in linear models, goodness-of-fit measures: R² (coefficient of determination; r² is for simple regression).

0 ≤ R² ≤ 1; R² = Model SS (Regression SS) / Total SS = 1 − SSE/SST (computational formula).

With just Y and X in the regression, r² = corr²(Y, X) (from the previous formula).

Regression Sum of Squares: RSS = Σ_{i=1}^n (ŷ_i − ȳ)²
Total Sum of Squares: TSS = Σ_{i=1}^n (y_i − ȳ)²
Sum of Squares of Error: SSE = Σ_{i=1}^n (y_i − ŷ_i)²
Geometric appreciation.

[Figure: Y, Ŷ, and the plane spanned by X and Z, with the right angles marked and θ the angle between Y and Ŷ.]

Corr(X, Y) = cos(angle between X and Y): the zero-order (X, Y) correlation.
Corr(Y, Ŷ) = cos(θ) = |Ŷ| / |Y|.
R² = cos²(θ) = 1 − sin²(θ), with sin(θ) = |Y − Ŷ| / |Y| = |e| / |Y|.
R² ↑ ⇒ the angle between Y and Ŷ ↓, and vice versa.
Other measures of goodness of fit.

PRESS residuals.
The prediction errors (PRESS residuals) are defined as e_(i) = y_i − ŷ_(i), where ŷ_(i) is the fitted value of the i-th response based on all observations except the i-th one. It can be shown that e_(i) = e_i / (1 − h_ii), where h_ii is the i-th diagonal element of the hat matrix H = X(X'X)⁻¹X'.

The PRESS statistic.
PRESS = Σ_i e_(i)² is generally regarded as a measure of how well a regression model will perform in predicting new data. Small PRESS values are desirable.

An R²-like statistic for prediction (based on PRESS): R²(prediction) = 1 − PRESS/TSS. We expect the model to explain about R²(prediction)% of the variability in predicting new observations.

Use PRESS to compare models: a model with small PRESS is preferable to one with large PRESS.
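In SAS, PRESS comes directly from PROC REG (the deck's own program later uses the same option); a minimal sketch, with data set and variable names hypothetical:

proc reg data=work.train outest=_est_ press;  /* PRESS option writes _PRESS_ to OUTEST */
  model y = x1 x2 x3;
run; quit;
/* _est_ now carries _PRESS_; R2(prediction) = 1 - PRESS/TSS can be computed from it. */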
Evaluation of measures of goodness of fit.

R² ∈ [0,1], unitless; measures the proportion of Var(Y) fitted by the predictors.

Mean Square Error: MSE = Σ_{i=1}^n (Y_i − Ŷ_i)² / (n − p − 1), the estimate of the error variance.
Root MSE: RMSE = √MSE, the standard error of the fit.
Coeff. of Variation: CV = RMSE / Ȳ.

95% prediction interval for the k-th observation (already in the data set): ŷ_k ± t(α/2, n−2) · s(prediction), where s(prediction) is the standard error of prediction or fit for a new x_k:

s²(prediction) = MSE · [1 + 1/n + (x_k − x̄)² / Σ_{i=1}^n (x_i − x̄)²].

The 95% CI for the mean fitted value uses the same expression without the leading 1 inside the brackets.
Evaluation of measures of goodness of fit (cont.).

1) Model better when R² is higher and/or RMSE is lower. RMSE is called the standard error of the regression, S, by Minitab, but we don't use that terminology here.
2) RMSE: average distance between Y and the predictions; R² does not tell this. RMSE can also be used for non-linear models, R² cannot. Roughly, ±2·RMSE produces a 95% CI for the residuals.
3) A 95% prediction-interval band can be built (above and below the regression line); it shows where 95% of the data lie in reference to the prediction line.
4) A 95% band for the mean fitted values can be built (above and below the regression line); it shows where 95% of the fitted means lie in reference to the fitted line.
5) CV evaluates the relative closeness of the predictions to the actual values; R² evaluates how much of the variability in Y is fitted by the model. CV cannot be used when mean(Y) = 0 or when Y has mixtures of negative and positive values in its range.
6) Usefulness: if we know that predictions must be within a specific interval from the data points (model precision), RMSE provides the info; R² does not.
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 6 17605721712 2934286952 23.90 <.0001 (1)
Error 5953 7.308901E11 122776763
Corrected Total 5959 7.484958E11
Root MSE 11080 R-Square 0.0235 (2)
Dependent Mean 18608 Adj R-Sq 0.0225
Coeff Var 59.54689 (5)
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 15664 501.50886 31.23 <.0001
NUM_MEMBERS Number of members covered 1 -43.86759 144.12951 -0.30 0.7609 (3)
DOCTOR_VISITS Total visits to a doctor 1 138.06114 20.29884 6.80 <.0001 (4)
FRAUD Fraudulent Activity yes/no 1 -1813.84488 394.55243 -4.60 <.0001
MEMBER_DURATION Membership duration 1 9.29226 1.81950 5.11 <.0001
NO_CLAIMS No of claims made recently 1 -172.63492 142.61993 -1.21 0.2262
OPTOM_PRESC Number of opticals claimed 1 477.92230 88.49029 5.40 <.0001
Variance
Variable Label DF Inflation 95% Confidence Limits
Intercept Intercept 1 0 14681 16647
NUM_MEMBERS Number of members covered 1 1.00194 -326.41369 238.67852
DOCTOR_VISITS Total visits to a doctor 1 1.04606 98.26805 177.85423
FRAUD Fraudulent Activity yes/no 1 1.20681 -2587.31068 -1040.37907
MEMBER_DURATION Membership duration 1 1.08243 5.72537 12.85914
NO_CLAIMS No of claims made recently 1 1.14992 -452.22170 106.95185
OPTOM_PRESC Number of opticals claimed 1 1.03956 304.44924 651.39535
Interpretation:
(1) Implies that the variables in the model do fit the dependent variable significantly.
(2) Is the R-square value, which is very low (close to 0) but still significant at alpha = 5%.
(3) The coefficient for Num_members is not significantly different from 0.
(4) The coefficient for doctor_visits is significantly different from 0. An increase in doctor_visits by 1 unit, keeping the rest of the variables constant, raises total_spend by around $138.
(5) Notice that the 95% confidence limits for variables deemed not to be significant overlap 0.
(6) Parameter estimates are also called "main effects" of the corresponding variables and are constant.
(7) The coefficient of variation (5) is defined as RMSE / mean of the dep var; it is unitless and can be used to compare different models, smaller being better.
Caveat about coefficient interpretation:

The classical interpretation of regression coefficient 'b1' for variable X1 is: for given values of all remaining predictors, a change of one unit in X1 changes the predicted value by 'b1'.

In a non-experimental setting (as this one), in which data sets are collected opportunistically, predictors/variables are not necessarily orthogonal. As a matter of fact, it is better to at least suspect that the predictors are correlated. In this case, it is not possible to state "keeping the rest of the variables constant" categorically, because raising the variable of interest by one unit affects the values that are supposed to be left constant.

Question:
The coefficient for fraud is -1813. Since fraud = 1 implies that there was fraud activity, can you interpret it in relation to total_spend alone? Is it possible to immediately interpret coefficients?
From Fitted …..
To Predicted…..
Model is seriously bad.
The QQ plot of residuals indicates a problem with the dependent variable. Residuals are supposed to be normally distributed.
No comments.
End of first modeling attempt; will try to improve soon.
Using categorical variables in linear or logistic regression.

Given variable X with 'p' levels (e.g. colors: white, black, yellow; p = 3), LR can use this information by creating 'p − 1' dummy (binary) variables:

Dummy1 = 1 if color = white, else 0.
Dummy2 = 1 if color = black, else 0.

If both dummy1 and dummy2 = 0, then the color is yellow. The method is called DUMMY CODING. If we created 3 dummies for a model, we would be creating co-linearity among the predictors.

When just dummy-coded variables are our predictors, the constant of the linear regression is the mean of the reference group. The coefficient of dummy predictor k is mean(dummy k) − mean(reference group).

In EFFECT CODING, we again create p − 1 binary variables, but dummy1 = dummy2 = -1 when the color is yellow.
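A minimal SAS DATA-step sketch of both coding schemes; the input data set work.mydata and the variable color are hypothetical (yellow is the reference level):

data coded;
  set work.mydata;                /* hypothetical input */
  /* dummy coding: p - 1 = 2 binaries; yellow is the reference */
  dummy1 = (color = "white");
  dummy2 = (color = "black");
  /* effect coding: same binaries, but the reference level gets -1 on both */
  effect1 = dummy1;
  effect2 = dummy2;
  if color = "yellow" then do; effect1 = -1; effect2 = -1; end;
run;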
Using categorical variables in linear or logistic regression.

In LR with only effect-coded predictors, the constant is the overall mean of the dep var. The coefficient for effect1 is the mean of group 1 minus the overall mean, etc.

In the case of linear regression with variable selection, 'p' dummies can be constructed, because variable selection will select at most p − 1 of them. The same rule applies to logistic regression, but not to tree-based methods. Binary-tree-based models search over all possible splits of the p levels into 2 subgroups to find optimal splits (to be reviewed at a later lecture).
Constant RMSEs, different R_squares.
2) A grocery store investigates the relationship between cardboard bags provided freely with each purchase versus revenue and number of customers per visit. Data on more than 10,000 purchases. A quick glance revealed that the number of bags utilized ranged from 0 to 8 with a mode of 3, the distribution of revenue is skewed leftwards, and the number of customers per visit ranged from 1 to 6 with a mode of 1.

The analyst estimated a linear regression with the following information, where the information in parentheses corresponds to standard errors. All diagnostic measures of goodness of fit were significant and very good.
a) Provide interpretation to the coefficients.
b) Is it possible to provide an interpretation to the constant?
Did you expect the constant to be 0?
c) Have any model assumptions been violated with just this
information?
d) Would you implement this model? Why? Or why not?
e) Would you consider adding a product interaction to the
equation? What would it mean?
f) Should the dependent variable “Bags” be transformed?
Why? Or why not?
g) Assume that bags are a very costly item, and that it is
important to obtain a very good model. Discuss possible ways (variables,
transformations, model searches) to improve the present one. For
instance, could baggers and bag carriers (into the car) be a solution?
Would there be an interaction with customer gender?
Non-linear regressions.

Any regression that is not linear in the β parameters. Examples (θ, theta, is our previous β):

Power function: θ_1 · X^θ_2
Weibull growth: θ_1 + (θ_2 − θ_1) · exp(−θ_3 · X^θ_4)
Fourier: θ_1 · cos(X + θ_4) + θ_2 · cos(2X + θ_4) + θ_3

For linear models, SSR + SSE = SS Total, from which R² is derived.
In nonlinear regression, SSR + SSE ≠ SS Total! This completely invalidates R² for nonlinear models, and it is not bounded between 0 and 1. Still incorrectly used in many fields (Spiess, Neumeyer, 2010).
Problem area: Heteroskedasticity.

The error term in the original linear equation that links Y to X is assumed to be normal with constant variance. If this assumption is violated, then the variance (and consequently inference based on p-values) is affected:

E(e_i²) = σ_i² ≠ σ².

Heteroskedasticity can be visually detected in a univariate fashion by graphing each individual predictor versus the residuals and looking for non-random patterns. In the context of large databases, this method is infeasible. Analytically, there is a series of tests, such as:

The Breusch-Pagan LM Test (used in the tables below)
The Glesjer LM Test
The Harvey-Godfrey LM Test
The Park LM Test
The Goldfeld-Quandt Tests
White's Test
Problem area: Heteroskedasticity.

Resolving hetero:
(a) Generalized Least Squares
(b) Weighted Least Squares
(c) Heteroskedasticity-Consistent Estimation Methods

Modeling heteroskedasticity: Estimated Generalized Least Squares (EGLS).
1. Obtain the original regression results and keep the residuals e_i, i = 1, …, n.
2. Obtain log(e_i²).
3. Run a regression with log(e_i²) as dependent variable on the original predictors, and obtain the predicted values Pred_i.
4. Exponentiate the predicted values Pred_i to obtain σ̂_i².
5. Estimate the original equation of Y on X by weighted least squares, using 1/σ̂_i² as weights.
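A sketch of steps 1–5 in SAS; the data set work.train and variables y, x1–x3 are hypothetical placeholders:

proc reg data=work.train noprint;
  model y = x1 x2 x3;
  output out=_r_ r=e;                /* step 1: keep residuals */
run; quit;
data _r_; set _r_;
  log_e2 = log(e*e);                 /* step 2: log of squared residuals */
run;
proc reg data=_r_ noprint;
  model log_e2 = x1 x2 x3;           /* step 3: regress log(e^2) on the predictors */
  output out=_p_ p=pred;
run; quit;
data _p_; set _p_;
  sig2 = exp(pred);                  /* step 4: estimated sigma_i^2 */
  w = 1 / sig2;                      /* step 5: WLS weights 1/sigma_i^2 */
run;
proc reg data=_p_;
  model y = x1 x2 x3;
  weight w;                          /* weighted least squares fit */
run; quit;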
Problem area: Co-linearity (wrongly called multi-colinearity).

Co-linearity: the existence of linear relationships among predictors (independently of Y) such that estimated coefficients become unstable when those relationships among the Xs have high R-squares. In the extreme case of a perfectly linear relationship among predictors, one predictor can be perfectly modeled from the others ➔ it is irrelevant in the regression against Y. For instance, if the regression contains Xs that denote components of wealth and one of those predictors is merely the sum of all the others, using all the Xs to model Y would yield an unstable equation (because the covariance matrix is then singular).

The present way to find out about co-linearity is NOT correlation matrices but the calculation of the variance inflation factor (VIF; we omit other methods in this presentation). This factor is calculated for every predictor and is a transformation, VIF = 1 / (1 − R²), of the R-square of a regression of that predictor on the rest of the predictors (notice that the dependent variable is not part of this). The present rule of thumb is that VIF values at or above 10 should be considered too high, and the variable in question (or some of the other predictors) should probably be studied in more detail and ultimately removed from the analysis.
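In SAS, VIFs are a MODEL-statement option in PROC REG (the same option appears in the deck's program at the end); a minimal sketch, names hypothetical:

proc reg data=work.train;
  model y = x1 x2 x3 / vif;   /* prints VIF = 1/(1 - R2_j) for each predictor */
run; quit;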
Co-linearity: in the left panel, any beta2 minimizes the error sum of squares along the bottom ridge; beta2 is not unique. Ridge regression penalizes the 'ridge', and thus the optimal beta2 is unique in the right panel.
Consequences of Heteroskedasticity and Co-linearity.

1) Beta estimators are still linear.
2) Beta estimators are still unbiased.
3) Beta estimators are no longer efficient: the minimum-variance property no longer holds ➔ they tend to vary more with the data.
4) Estimates of the coefficient variances are biased.
5) The standard estimated variance is not an unbiased estimator.
6) Confidence intervals are too narrow and inference can be misleading.
Problem area: unusual data points — outliers, leverage and influence.

Outlier: a data point in which the value of the dependent variable does not follow the general trend of the rest of the data. A data point is influential (has leverage) if it unduly influences any part of the regression analysis, such as the predicted responses, the estimated slope coefficients, or hypothesis test results. Thus, an outlier represents an extreme value of Y w.r.t. other values of Y, not w.r.t. the values of X.

From a bi-variate standpoint, an observation could be
1) an outlier if Y is extreme,
2) influential if X is extreme,
3) both outlier and influential if X and Y are extreme,
4) and also outlier and influential if distant from the rest of the data without being extreme.

Definitions are not precise; from a bi-variate standpoint, the tendency is to analyze these effects by deletion of the suspected observations. However, if the percentage of suspected observations is large, then the issue of what constitutes the population of reference is in question.
Working with unusual data points.

Do not act without further analysis. If the unusual nature is due to a data error, then resolve the issue, but in reference to the model at hand. If you are modeling the height of titans in history, then you must expect heights beyond 10 ft; if not, heights beyond 7 ft are suspect. That is, condition on the model at hand.

A quick way to manage unusual data points is, once a regression is obtained, to separate those points whose absolute standardized residual value is higher than 2. Standardized residuals of this kind are commonly known as studentized residuals: the residuals divided by the standard error of a regression where the i-th observation has been deleted.
Different regressions for Total_spend. "M1": general name for the models on the Fraud data. TRN: train data. NONE: no var sel. LT_2 and GT_2: using studentized residuals to split the data set. NO_HETERO: after using weighted least squares.

Requested Analyses: Names & Descriptions.

Model #  Model Name                   Model Description
 -1      ***                          Overall Models
  1      M1                           Multivariate regression TOTAL_SPEND
  2      M1_TRN_NONE_GT_2             Regr TRN NONE abs(rstud) >= 2
  3      M1_TRN_NONE_GT_2_NO_HETERO   Regr TRN WLS abs(rstud) >= 2
  4      M1_TRN_NONE_LT_2             Regr TRN NONE abs(rstud) < 2
  5      M1_TRN_NONE_LT_2_NO_HETERO   Regr TRN WLS abs(rstud) < 2
  6      M1_TRN_REGR_NONE             Regr TRN NONE
  7      M1_TRN_REGR_NONE_NO_HETERO   Regr TRN WLS
Notice variation in p-value significance.
VIFs – co-linearity: none, if we accept the VIF > 10 criterion.
R-Squares by models: typical jump when using studentized residuals.
Checking hetero: still persistent, also when using STPW (stepwise variable selection).

Heteroskedasticity test.

Breusch-Pagan Test for Heteroskedasticity
Model Name          CHI-SQUARE  DF  P-VALUE  Num Obs  Hetero?
M1_TRN_REGR_NONE    42.50672     5    0.000    5,960  YES
M1_TRN_NONE_LT_2    49.405089    5    0.000    5,703  YES
M1_TRN_NONE_GT_2    26.637793    5    0.000      257  YES
QQ-plots: need to transform the dep var in these models.
Requested Analyses: Names & Descriptions.

Model #  Model Name                   Model Description
 -1      ***                          Overall Models
  1      M2                           Log total spend
  2      M2_TRN_NONE_GT_2             Regr TRN NONE abs(rstud) >= 2
  3      M2_TRN_NONE_GT_2_NO_HETERO   Regr TRN WLS abs(rstud) >= 2
  4      M2_TRN_NONE_LT_2             Regr TRN NONE abs(rstud) < 2
  5      M2_TRN_NONE_LT_2_NO_HETERO   Regr TRN WLS abs(rstud) < 2
  6      M2_TRN_REGR_NONE             Regr TRN NONE
  7      M2_TRN_REGR_NONE_NO_HETERO   Regr TRN WLS
Comparing initial M1 with logged M2.

[Bar chart: mean _RSQ_ (Avg TRN) by model name:
M1_NONE 0.0235, M1_NONE_GT_2 0.038, M1_NONE_GT_2_NO_HETERO 0.1976, M1_NONE_LT_2 0.038, M1_NONE_LT_2_NO_HETERO 0.0381, M1_NONE_NO_HETERO 0.0236, M2_NONE 0.0388, M2_NONE_GT_2 0.0424, M2_NONE_GT_2_NO_HETERO 0.0995, M2_NONE_LT_2 0.0424, M2_NONE_LT_2_NO_HETERO 0.0429, M2_NONE_NO_HETERO 0.0401.]
Log transform works for original model.
Log-transformed residuals with ±2·RMSE as CIs: most of the residuals are contained in the CI. More info in the model comparison section.
M1 models: original total_spend (dep var) variable. Notice: separate scales for the heteroskedasticity-corrected models on the next slide.
M2 models: logged total_spend (dep var) variable.
Interview question: is there a problem? Possible improvement?

1. X typically observed, Y predicted.
2. Predicted more concentrated than observed (look at the ranges).
3. A lowess curve (smoother) would provide much info (there's a linear relationship in the residuals not accounted for by the original regression).
Interaction effects in regression analysis (moderated regressions).

The typical linear model assumes a constant effect of each predictor on Y regardless of the level(s) of the other variable(s). The usual way of introducing interactions is by multiplication, i.e., X1 * X2, but there are infinitely many functional ways of representing interactions of X1 and X2. The product of the variables is chosen due to its ease and because the product is somewhat sensitive to diverse functional forms.

Arguments against the use of interactions.
A) Difficult to interpret, and they alter the original coefficients of the main effects (if present).
B) Can generate co-linearity, because the product of Xs is a nonlinear function of the original variables.
C) From A) and B), coefficients of main effects can become insignificant.

Arguments in favor of the use of interactions.
The multiplicative term turns the additive relationship of X1 and X2 into a conditional one. An example will better illustrate it.
Let Y = b0 + b1X1 + b2D + b3DX1 + ε, where X1 is continuous and D a dummy variable. The presence of D implies that there are in effect two regressions of Y on X1, according to the values of D (b1 and b2 are typically called the main effects of X1 and D respectively):

1) D = 0 ➔ Y = b0 + b1X1 + ε
2) D = 1 ➔ Y = (b0 + b2) + (b1 + b3) X1 + ε

If D is instead a continuous or ratio variable, which we will call X2, rearranging the previous formulae we obtain:

a) For a given value of X2: Y = (b0 + b2X2) + (b1 + b3X2) X1 + ε ……… (1)
b) For a given value of X1: Y = (b0 + b1X1) + (b2 + b3X1) X2 + ε ……… (2)

When X1 = X2 = 0, b0 is the intercept for the two equations. The coefficient b3 has different meanings: in equation (1), it is the change in the slope of Y on X1 due to a change in X2, i.e., ∂Y/∂X1 | X2. Conversely, in equation (2), b3 enters ∂Y/∂X2 | X1.
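A minimal SAS sketch of fitting the moderated model; the data set and variable names are hypothetical (PROC REG needs the product term built beforehand):

data _int_;
  set work.mydata;
  dx1 = d * x1;              /* product interaction term D*X1 */
run;
proc reg data=_int_;
  model y = x1 d dx1;        /* Y = b0 + b1*X1 + b2*D + b3*D*X1 */
run; quit;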
Y = (b0 + b2X2) + (b1 + b3X2) X1 + ε …… (1)
Y = (b0 + b1X1) + (b2 + b3X1) X2 + ε ……. (2)

b1 and b2 are the baseline slopes. That is, b1 is the slope of Y on X1 when X2 = 0, and the roles are reversed for b2. If X2 (or X1) were not 0, the corresponding conditional slope can be extrapolated to 0.

The two regressions emerge regardless of the actual values accorded to D. It is D's binary nature that implies two regressions, as shown, and two regressions emerge as well in the continuous case. From this discussion, b1 and b2 do not describe additive main effects but conditional ones. The resulting surface is not a plane, but a warped regression surface. We illustrate these points in the graphs below, which represent a regression of the duration of a loan until fully paid on the age of the customer and the amount lent (from a different data set, not discussed here).
Change in previously additive coefficients: interactive ones describe particular conditional, not additive, effects.

Interpretation and example.
Y = f(original without X1 * X2, additive effect) (base)
Y = 90 − 2.5·X1 − .1·X2 + .5·X1·X2 + ε (original)
where 5 ≤ X1 ≤ 10 and 10 ≤ X2 ≤ 100.

At the extreme points of X1 we get (omitting the error for simplicity):
Y = (90 − 12.5) + (−.1 + 2.5) X2 … (3) X1 = 5
Y = (90 − 25) + (−.1 + 5.0) X2 … (4) X1 = 10

The same exercise can be done over the range of X2. The coefficients in (3) and (4) cover the corresponding coefficients in the additive model, dependent on the distribution of X1.

Next: simulation results for the original equation and for (3) and (4).
Interaction Models: parameter estimates (Pr > |t| in parentheses).

Model Name  Intercept     x1            x2            x1_x2
base        -116  (.000)  24.91 (.000)  3.657 (.000)
Original    90.07 (.000)  -2.51 (.000)  -.103 (.000)  0.500 (.000)
Model3      77.51 (.000)                2.400 (.000)
Model4      64.99 (.000)                0.400 (.000)

Base model estimates are far off the "real values," and the model is biased (the true model contains interactions). In the case of the 3 other models, note how closely the parameters correspond to the previous equations; N = 10000. See the residual and prediction plots next.
Seriously bad model.
Better model.
The original model shows a dependence of the predictions on x2 and on the interaction, not captured in the base model. The other two models show different slopes on X2, incorporating the x1 effects.
Interactions and significance of coefficients.

Similarly, standard errors become "conditional" standard errors, which turns the t-tests "conditional" as well: the effect of one variable is conditioned on a particular value of a third variable. Repeating the previous equation:

a) For given X2: Y = (b0 + b2X2) + (b1 + b3X2)X1 + ε …… (1)

The conditional t-test for X1 is: t = (b1 + b3X2) / s(b1 + b3X2).

Previously significant "additive" coefficients and the present "interaction" coefficient may be insignificant. But, for a specific value of X2 (within the observed range), the t-test may prove significant, which entails that the regression and its 'main' effects are valid within specific ranges of the conditioning variable.
Interpreting main effects in regression models with
interactions
In models without interaction terms (i.e., without terms constructed as product of
other terms), regression coefficient for a variable is the slope of the regression
surface in the direction of that variable. It is constant, regardless of the values of
other variables, and therefore can be said to measure overall effect of variable.
In models with product interactions, this interpretation can be made without further
qualification only for those variables that are not involved in any interactions.
For variables that are involved in interactions, "main effect" regression coefficient
is the slope of the regression surface in the direction of that variable WHEN ALL
OTHER VARIABLES THAT INTERACT WITH THAT VARIABLE HAVE VALUES OF
ZERO, and significance test of the coefficient refers to the slope of the regression
surface ONLY IN THAT REGION OF THE PREDICTOR SPACE, which may sometimes
be far from the region in which the data lie. In Anova terms, the coefficient is
measuring a simple main effect, not an overall main effect.
Main and Overall Effects.

An analogous measure of the overall effect of a variable: the average slope of the regression surface in the direction of the variable, averaging over all N cases in the data. It can be expressed as a weighted sum of the regression coefficients of all the terms in the model that involve that variable.

The weights are awkward to describe but easy to get. A variable's main-effect coefficient always gets a weight of 1. For each other coefficient of a term involving that variable, the weight is the mean of the product of the other variables in that term. For example, if the model is

y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5 +
    b12*x1*x2 + b13*x1*x3 + b23*x2*x3 + b45*x4*x5 + b123*x1*x2*x3 + error
y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5 +
    b12*x1*x2 + b13*x1*x3 + b23*x2*x3 + b45*x4*x5 + b123*x1*x2*x3 + error

The overall main effects are:

B1 = b1 + b12*M[x2] + b13*M[x3] + b123*M[x2*x3],
B2 = b2 + b12*M[x1] + b23*M[x3] + b123*M[x1*x3],
B3 = b3 + b13*M[x1] + b23*M[x2] + b123*M[x1*x2],
B4 = b4 + b45*M[x5],
B5 = b5 + b45*M[x4],

where M[.] denotes the sample mean of the quantity inside the brackets. All the product terms inside the brackets are among those that were constructed in order to do the regression.
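A sketch of the B1 computation in SAS; the data set, variable names and coefficient values are hypothetical placeholders:

data _prod_;
  set work.mydata;
  x2x3 = x2 * x3;                          /* product whose mean M[x2*x3] is needed */
run;
proc means data=_prod_ noprint;
  var x2 x3 x2x3;
  output out=_m_ mean= m_x2 m_x3 m_x2x3;   /* sample means M[x2], M[x3], M[x2*x3] */
run;
data _overall_;
  set _m_;
  /* hypothetical coefficients, as if read off a fitted model */
  b1 = 1.2; b12 = 0.3; b13 = -0.4; b123 = 0.1;
  B1 = b1 + b12*m_x2 + b13*m_x3 + b123*m_x2x3;  /* overall main effect of x1 */
run;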
Two-way interactions and centering.

In models that have only main effects and two-way interactions, there is a simpler way to get the overall effects: center (at their means) all variables that are involved in interactions. Then all the M[.] expressions become 0, and the regression coefficients are interpretable as overall effects. (The values of the b's will change; the values of the B's will not.)

So, why not always center at the means, routinely? Because means are sample dependent, and the model may not be replicable in different studies; i.e., the coefficients will be different. Of course, other studies' main effects will then not be overall main effects unless their means are exactly the same as yours, which is unlikely. Moreover, centering converts simple effects to overall effects only for models with no three-way or higher interactions; if there are such interactions, centering will simplify the b --> B computations but will not eliminate them.
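Centering in SAS, sketched with PROC STANDARD (MEAN=0 centers without rescaling); data set and variables hypothetical:

proc standard data=work.mydata mean=0 out=centered;
  var x1 x2;                 /* center only the variables involved in interactions */
run;
data centered;
  set centered;
  x1x2 = x1 * x2;            /* build the interaction from the centered variables */
run;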
Interactions significance.
Significance of the overall effects may be tested by the usual procedures for
testing linear combinations of regression coefficients (see earlier slides on
testing linear combinations of parameters).
However, the results must be interpreted with care because overall effects are
not structural parameters but are design-dependent. The structural
parameters -- the regression coefficients (uncentered, or with rational centering)
and the error variance -- may be expected to remain invariant under changes in
the distribution of the predictors, but the overall effects will generally change.
Overall effects are specific to the particular sample and should not be expected to carry over to other samples with different distributions of the predictors.
If an overall effect is significant in one study and not in another, it may reflect
nothing more than a difference in the distribution of the predictors. In particular,
it should not be taken as evidence that the relation of the d.v. to the predictors is
different in the two groups.
Added product interactions to the Fraud data set:

doc_membd = "Doc Visits * member_dur"
doc_numm = "Doc Visits * Num_members"
doc_membd_numm = "Doc_visits * member_dur * num_members"
Notice that the doctor_visits overall effect is smaller than the corresponding main effect, depending on the 2- and 3-way interactions, etc.

[Bar chart: rescaled coefficients (skipping FRAUD) for the models M1_2_way_full_model, M1_2_way_skip_member_d, M2_3_way_full_model and M2_3_way_skip_member_d, by variable (doc, doc_membd, doc_membd_numm, doc_numm, membd, membd_numm, no_claims, numm, optom_presc); narrower bar width for overall effects.]
Changing significance levels dependent on model.

[Bar chart: p-values (skipping FRAUD) for the same four models, cut at 0.1 with a reference line at 0.05, by variable (FRAUD, doc, doc_membd, doc_membd_numm, doc_numm, membd, membd_numm, no_claims, numm, optom_presc); narrower bar width for overall effects.]
[Bar chart: R-square values for the models — M1_2_way_full_model 0.0292, M1_2_way_skip_member_d 0.0254, M2_3_way_full_model 0.0299, M2_3_way_skip_member_d 0.0293.]
[Bar chart: rescaled PRESS values for the models (smaller, better) — M1_2_way_full_model 0.1001, M1_2_way_skip_member_d 1, M2_3_way_full_model 0, M2_3_way_skip_member_d 0.1013.]
Crass but still used approaches.

1) Fit the equation with all variables and keep those whose coefficients are significant at 25% or so.
2) Re-fit the equation with the remaining variables.
3) Alternative to 1) above if the number of variables is too large: find all zero-order correlations, and choose the top K (dependent on computing resources, or time …).
4) Redo 1) above with the K variables just found.
Crass but still used approaches (2 of 2).
(Appeared on the web, sci.stat.math group, 11/27/06):

- I heard / read (on the Internet) that the number of variables to put in as explanatory characters must not be greater than n/10 or n/20 (n is the sample size). Can someone provide me with a serious bibliographic reference about this?
- Second, I've got a first set of about 15 variables (for around 150 patients), interactions excluded. I categorized some into binary characters to yield odds-ratios. What is the risk? Lack of power?
- Third, performing a second selection via the stepwise algorithm, is there a consensus about the significance cut-off (alpha) to use? I read 20% instead of 5%. Is this usual?
- Regarding SAS programming (I run version 8.2 under Windows OS), which procedure, between Proc Logistic and Proc GLM, should I choose, and according to which criteria?
Variable selection practice.

Mostly two approaches:

1) Sequential inferential, such as the stepwise family. P-values play the critical role in the stopping/dropping/entry mechanism. The variable selection path is also given by partial correlations for the forward and stepwise methods.
2) Stopping mechanisms, e.g., AIC, BIC. May use variables searched inferentially but typically searches over wider subsets. These mechanisms are used both in frequentist and Bayesian statistics. Shtatland et al. (2000) apply BIC and AIC for a Bayesian application of variable selection.
3) Some combination thereof.
Forward Selection (FS, Miller (2002)).

Let Y (n×1) = X (n×p) β (p×1) + ε, the usual model, n > p.

FS minimizes (Y − X_i β_i)'(Y − X_i β_i) (SSE), i = 1 … p, where β̂_i = X_i'Y / X_i'X_i (the OLS estimate; expanding and replacing in the previous expression) ➔ the selected variable maximizes (X_i'Y)² / X_i'X_i, which, when divided by Y'Y, becomes (cos(X_i, Y))² ➔ choose the X_i with the smallest angle with Y (most co-linear or most correlated).

[Figure: Y, X_i and X_j; X_i β̂_i is the projection of Y on X_i, and Y − X_i β̂_i is orthogonal to X_i.]
Forward Selection.

Assume X(1) was the most absolutely correlated variable. What next? For all other variables, calculate:

Xj,(1) = Xj − Bj,(1) X(1), where Bj,(1) is the LS coefficient of the regression of Xj on X(1).

Replace Y with Y − X(1)B(1), and Xj with Xj,(1) for all j, and proceed again as in the previous slide: Y and the remaining X variables are orthogonalized to X(1). If the variables are centered, the second variable chosen has the absolutely largest partial correlation with Y given X(1), etc.

The method minimizes RSS (= SSE) at every step ➔ maximizes R² at every step. The method stops when the decrease in RSS is not significant at a specified level (.5):

(RSSk − RSSk+1) / (RSSk+1 / (n − k − 2)) is compared to the 'F-to-enter' value.

An entered variable always remains in the model.
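Forward selection as described maps directly onto PROC REG; SLENTRY=0.5 is the significance level for the F-to-enter mentioned above (sketch; data set and variables hypothetical):

proc reg data=work.train;
  model y = x1-x10 / selection=forward slentry=0.5;
run; quit;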
Forward Selection: summary of steps.

Step  Action
1     Find the most correlated X variable with Y, say X2.
2     Regress Y, X1, X3, …, Xp on X2.
3     Replace Y, X1, X3, …, Xp by the corresponding residuals of the regressions on X2 (Y − aX2, X1 − bX2, …, Xp − cX2).
4     Find the most correlated variable of the transformed Y with the transformed X's (these are the partial correlations).
5     Repeat the process until the change in ESS is not significant.
[Figure: starting axis system Y, X, Z; after X is selected, the working variables become Y − aX and Z − bX.]

After X is selected and orthogonalized away from Z and Y, in the next step the actual variables are Z − bX and Y − aX, where "a" and "b" are OLS coefficients. The first three variables (Y, X, Z) do not play any further roles.
Do zero-order correlations give you information as to which variables will appear in the final model? Hardly.

The selection path is given by partial correlations, and since the path depends on the selected variables, it is illusory to obtain all potential paths for the typically large number of variables of giga-bases.

Indirect remark: model and variable interpretation are typically conceptualized from zero-order correlations. But models involve partial (and semi-partial) relations, i.e., conditional relationships ➔ model interpretation is ALWAYS possible; it is just far more difficult when the number of variables is large and the variables are related.

Semi-partial correlation is the correlation between the ORIGINAL Y and the partialled Xs. Partial and semipartial R²s and coefficients are counterparts of the traditional R² concepts.
Nomenclature.

Zero-order correlations: original correlations.
First order: first partial.
Second order: second partial.
(First-order semipartial, etc.)

Big note: zero-order correlations are always unconditional. All others are conditional on the sequence of orthogonalized predictors.
Stepwise variable selection (Efroymson (1960)).

1. Select the most correlated variable with Y, say Z1, find the linear equation, and test for significance against F-to-enter (the F distribution is related to chi-sq).
2. If not significant, stop. Else, examine the partial correlations (or semi-partial, depending on software) with Y of all Z's not in the regression.
3. Choose the largest (say Z2), and regress Y on Z1 and Z2. Check for overall significance, R² improvement, and obtain partial F-values for both variables (F-values = squared t-values).
4. The lowest F-value is compared to an F-threshold (F-to-delete) and the variable kept or rejected. Stop when no more removals or additions are possible.
5. General form of the F test — add/delete X to a model already containing Z:

F = [SSE(reduced model) − SSE(full)] / (# of differing vars) / MSE(full)
  = [SSE(Z) − SSE(Z, X)] / 1 / [SSE(X, Z) / (n − p − 1)].
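The corresponding PROC REG sketch for stepwise, where SLENTRY and SLSTAY are the significance levels for the F-to-enter and F-to-delete tests (data set and variables hypothetical):

proc reg data=work.train;
  model y = x1-x10 / selection=stepwise slentry=0.15 slstay=0.15;
run; quit;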
Stepwise family considerations.

1. In all stepwise selections, true p-values are larger than reported (Rencher and Pun, 1980; Freedman, 1983) and do not have their proper meaning. Correction is a very difficult problem.
2. If true variables are deleted, the model is biased (omission bias); if redundant variables are kept, variable selection has not been attained. No insurance about this. All-possible-regressions tends to produce "small" models, which is impossible (at present) with large p, because there are 2^p possible regressions (for p = 10 ➔ 1,024; p = 20 ➔ 1,048,576; p = 30 ➔ more than 1B).
3. Selection bias occurs when variable selection is not done independently of coefficient estimation (symptomatic of tree regression).
4. Further, important subsets of variables may be omitted. E.g., if the true relationship is given by the 'distance' between X and Y, or any linear combination thereof, there is no assurance that this subset will be found. In general, it can't find transformations / nonlinearities.
Stepwise family considerations.

5. Methods are discrete: variables are either retained or discarded; they often exhibit high variance and don't reduce the prediction error of the full model (Tibshirani likes to say this).
6. Stepwise yields models with upwardly biased R²s. The model needs reevaluation in independent data.
7. Severe problems with co-linearity, but debatable.
8. Gives biased regression coefficients that need shrinkage (coefficients are too large; Tibshirani, 1996).
9. Based on methods (F tests for nested models) that were intended to test pre-specified hypotheses.
Stepwise family considerations.

10. Increasing the sample size does not improve selection by much (Derksen and Kesselman, 1992).
11. Induces the comfort of not thinking about the problem at hand. But thinkers could be clueless or contradictory. Should be considered an exploratory tool.
12. Stepwise alternatives:
1. Replacing 1, 2, 3 … variables at a time.
2. Branch-and-bound techniques.
3. Sequential subsets.
4. Ridge regression.
5. Nonnegative garrote and Lasso, LARS…
6. Foster/Stine (see Auslender 2005).
7. Stepwise but starting from the full model.
Stepwise family considerations.

13. Freedman (1983) showed by simulation, for the case of n/p not very large, that stepwise methods select about 15% of noise variables. Even if all the variables are noise, a high R² can be obtained. If seemingly insignificant variables are dropped, the R² will still remain high. ➔

BE VERY CAREFUL ABOUT JUMPING, in joy or despair, ON SIGNIFICANCE FINDINGS IN THE CONTEXT OF VARIABLE SELECTION.
Tested models. Too many?
Requested Analyses: Names & Descriptions.

Model #  Model Name                   Model Description
 -1      ***                          Overall Models
  1      M1                           Multivariate regression TOTAL_SPEND
  2      M1_TRN_NONE_GT_2             Regr TRN NONE abs(rstud) >= 2
  3      M1_TRN_NONE_GT_2_NO_HETERO   Regr TRN WLS abs(rstud) >= 2
  4      M1_TRN_NONE_LT_2             Regr TRN NONE abs(rstud) < 2
  5      M1_TRN_NONE_LT_2_NO_HETERO   Regr TRN WLS abs(rstud) < 2
  6      M1_TRN_REGR_NONE             Regr TRN NONE
  7      M1_TRN_REGR_NONE_NO_HETERO   Regr TRN WLS
  8      M1_TRN_REGR_STPW             Regr TRN STEPWISE
  9      M1_TRN_REGR_STPW_NO_HETERO   Regr TRN WLS
 10      M1_TRN_STPW_GT_2             Regr TRN STEPWISE abs(rstud) >= 2
 11      M1_TRN_STPW_GT_2_NO_HETERO   Regr TRN WLS abs(rstud) >= 2
 12      M1_TRN_STPW_LT_2             Regr TRN STEPWISE abs(rstud) < 2
 13      M1_TRN_STPW_LT_2_NO_HETERO   Regr TRN WLS abs(rstud) < 2
 14      M2                           Log total spend
 15      M2_TRN_NONE_GT_2             Regr TRN NONE abs(rstud) >= 2
 16      M2_TRN_NONE_GT_2_NO_HETERO   Regr TRN WLS abs(rstud) >= 2
 17      M2_TRN_NONE_LT_2             Regr TRN NONE abs(rstud) < 2
 18      M2_TRN_NONE_LT_2_NO_HETERO   Regr TRN WLS abs(rstud) < 2
 19      M2_TRN_REGR_NONE             Regr TRN NONE
 20      M2_TRN_REGR_NONE_NO_HETERO   Regr TRN WLS
 21      M2_TRN_REGR_STPW             Regr TRN STEPWISE
 22      M2_TRN_REGR_STPW_NO_HETERO   Regr TRN WLS
 23      M2_TRN_STPW_GT_2             Regr TRN STEPWISE abs(rstud) >= 2
 24      M2_TRN_STPW_GT_2_NO_HETERO   Regr TRN WLS abs(rstud) >= 2
 25      M2_TRN_STPW_LT_2             Regr TRN STEPWISE abs(rstud) < 2
 26      M2_TRN_STPW_LT_2_NO_HETERO   Regr TRN WLS abs(rstud) < 2
M2: model in logs. M1: model in original units.
Breusch-Pagan Test for Heteroskedasticity
Model Name          CHI-SQUARE  DF  P-VALUE  Num Obs  Hetero?
M1_TRN_REGR_NONE    42.50672     5    0.000    5,960  YES
M1_TRN_NONE_LT_2    49.405089    5    0.000    5,703  YES
M1_TRN_NONE_GT_2    26.637793    5    0.000      257  YES
M1_TRN_REGR_STPW    40.96904     4    0.000    5,960  YES
M1_TRN_STPW_LT_2    48.606669    4    0.000    5,703  YES
M1_TRN_STPW_GT_2    25.054416    3    0.000      257  YES
M2_TRN_REGR_NONE    40.99884     5    0.000    5,960  YES
M2_TRN_NONE_LT_2    15.9612      5    0.007    5,660  YES
M2_TRN_NONE_GT_2    19.9851      5    0.001      300  YES
M2_TRN_REGR_STPW    39.0082      4    0.000    5,960  YES
M2_TRN_STPW_LT_2    14.27328     4    0.006    5,664  YES
M2_TRN_STPW_GT_2    3.991856     1    0.046      296  YES
VIFs omitted; none above 2.
Models and ranks by requested GOFs.

                          _PRESS_  _RMSE_  _RSQ_
Model Name                   Rank    Rank   Rank
M1_NONE_GT_2                    1       1      1
M1_NONE_GT_2_NO_HETERO          2       2      2
M1_NONE_LT_2                    3       3      3
M1_NONE_LT_2_NO_HETERO          4       4      4
M1_REGR_NONE                    5       5      5
M1_REGR_NONE_NO_HETERO          6       6      6
M1_REGR_STPW                    7       7      7
M1_REGR_STPW_NO_HETERO          8       8      8
M1_STPW_GT_2                    9       9      9
M1_STPW_GT_2_NO_HETERO         10      10     10
M1_STPW_LT_2                   11      11     11
M1_STPW_LT_2_NO_HETERO         12      12     12
M2_NONE_GT_2                    1       1      1
M2_NONE_GT_2_NO_HETERO          2       2      2
M2_NONE_LT_2                    3       3      3
M2_NONE_LT_2_NO_HETERO          4       4      4
M2_REGR_NONE                    5       5      5
M2_REGR_NONE_NO_HETERO          6       6      6
M2_REGR_STPW                    7       7      7
Models and ranks by requested GOFs.

                          _AIC_  _CP_  _PRESS_  _RMSE_  _RSQADJ_  _RSQ_  _SBC_
Model Name                 Rank  Rank     Rank    Rank      Rank   Rank   Rank
M1_REGR_NONE                  4     4        4       4         4      3      4
M1_REGR_NONE_NO_HETERO        2     3        2       1         3      2      2
M1_REGR_STPW                  3     1        3       3         2      4      3
M1_REGR_STPW_NO_HETERO        1     2        1       2         1      1      1
M2_REGR_BCKW                  3     2        3       4         4      4      3
M2_REGR_BCKW_NO_HETERO        1     1        1       1         2      2      1
M2_REGR_NONE                  4     3        4       3         3      3      4
M2_REGR_NONE_NO_HETERO        2     4        2       2         1      1      2
Removing LT_2 and GT_2 and adding other GOFs ➔ not easy to pick a winner.
Some observations.
1) The log-transformed dep var is bimodal; the raw-units dep var is certainly not
normal. If the transformation aims at normality ➔ we face a mixture of two
normals in this case.
2) P-values are quite disparate when comparing the LT_2 and GT_2 models to the
non-outlier models. Still, some regularity is visible, such as the significance of
number of visits and claims.
3) Co-linearity is not an issue if 10 is the critical value of VIF.
4) Using RMSE as the fit criterion, NONE_LT_2_NO_HETERO seems to be the best
performer; PRESS is more inconclusive. Looking at the comparative ranks table,
the different criteria offer different rankings.
5) Alternative transformations may enhance the model, and so may transforming
predictors and searching for interactions. This is especially true since all
heteroskedasticity-corrected models still showed the hetero disease.
6) It is also possible to use non-regression models, such as tree or neural
network methods, which can then be compared in terms of fit, prediction and
possible interpretability.
7) The data set has no missing values, which is unrealistic in the real world.
8) Very poor model.
9) The model contains NO INTERACTIONS or predictor TRANSFORMATIONS (a minimal
sketch of adding them follows below).
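As a sketch of point 9, one interaction and one predictor transformation can be created and refit. This is illustrative only: the derived names visits_x_claims and log_duration are made up for the example, and the +1 guard in the log is an assumption about zero durations.

data fraud_x;
   set fraud.fraud;
   visits_x_claims = doctor_visits * no_claims;   /* interaction term */
   log_duration    = log(member_duration + 1);    /* predictor transformation */
run;

proc reg data = fraud_x;
   model total_spend = doctor_visits no_claims visits_x_claims
                       log_duration optom_presc fraud;
run;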
Modeler’s attitude

Statistical inference (null vs. alternative hypotheses) fosters the attitude
that the model is homoskedastic, has no interactions, etc.

Be bold: assume the following instead, and aim at disproving each one:
Interactions are present
Model is heteroskedastic
Transformations are useful
Outliers are present
Co-linearity is likely
….
Omitted Variables Bias.

Assume the true model is

$$ Y = X_1\beta_1 + X_2\beta_2 + \varepsilon, $$

where $X_1$ is a set of variables (with intercept) but $X_2$ is incorrectly
omitted. The correct model should have both sets. Running OLS on $X_1$ alone:

$$
\hat\beta_1 = (X_1'X_1)^{-1}X_1'Y
            = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \varepsilon)
            = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\,\beta_2
              + (X_1'X_1)^{-1}X_1'\varepsilon.
$$

The bias of the final estimator is:

$$
\operatorname{Bias}(\hat\beta_1 \mid X_1, X_2)
  = E(\hat\beta_1 \mid X_1, X_2) - \beta_1
  = (X_1'X_1)^{-1}X_1'X_2\,\beta_2.
$$

For $X_1$ (entered) and $X_2$ (omitted) single variables, the bias is:

$$
\operatorname{Bias}(\hat\beta_1 \mid X_1, X_2) = \beta_2\,\rho_{1,2}\,\frac{s_2}{s_1},
$$

where $\rho_{1,2}$ is the correlation between the two predictors and $s_1$, $s_2$
their standard deviations. If $\rho_{1,2} \neq 0$, there is bias in $b_0$ and
$b_1$; if $\rho_{1,2} = 0$, bias in $b_0$.
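A small simulation makes the result concrete. This is an illustrative sketch, not part of the fraud example: the data set name ovb, the coefficients, and the 0.6 dependence are all made up. Since Var(x1) = 1 and Cov(x1, x2) = 0.6, omitting x2 should inflate the x1 coefficient by roughly 3 * 0.6 = 1.8.

data ovb;
   call streaminit(2019);
   do i = 1 to 10000;
      x1 = rand('normal');
      x2 = 0.6*x1 + rand('normal');          /* omitted variable, correlated with x1 */
      y  = 1 + 2*x1 + 3*x2 + rand('normal');
      output;
   end;
run;

proc reg data = ovb;
   full    : model y = x1 x2;   /* b1 estimate near 2 */
   omitted : model y = x1;      /* b1 estimate near 2 + 1.8 = 3.8, biased upward */
run;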
Un-reviewed topics of importance
Time Series
Spatial regression
Longitudinal Data Analysis
Simultaneous Equation Models
Mixtures
….
Proc Reg Data = fraud.fraud PLOTS (MAXPOINTS = NONE)
         OUTEST = _estim_ (where = (_type_ = "PARMS")) TABLEOUT
         PRESS   /* outputs _PRESS_ in the OUTEST data set */
         COVOUT OUTVIF RSQUARE ADJRSQ AIC BIC CP MSE SSE EDF CLB;
         /* clb: confidence limits for betas */
   M1_TRN_NONE : MODEL total_spend = NUM_MEMBERS DOCTOR_VISITS FRAUD
                 MEMBER_DURATION NO_CLAIMS OPTOM_PRESC
                 / selection = NONE dw r vif MSE JP CP AIC BIC SBC
                   ADJRSQ spec influence;   /* influence produces all influence statistics */
   OUTPUT OUT = OUTP (KEEP = total_spend p_total_spend r_p_total_spend
                             student rstudent
                             lower_p_total_spend upper_p_total_spend)
          P = p_total_spend   R = r_p_total_spend
          STUDENT = student   RSTUDENT = rstudent   /* rstudent feeds the residual split below */
          LCL = lower_p_total_spend   UCL = upper_p_total_spend;
   TITLE2 "First Regr depvar total_spend";
RUN;
data lessthan morethan;
   merge fraud.fraud OUTP (keep = rstudent);
   if abs(rstudent) < 2 then output lessthan;
   else output morethan;
   drop rstudent;
run;

Using the studentized residual to obtain two data sets.
ods output heterotest = hetero_test;   /* REQUIRES SAS/ETS PACKAGE */
proc model data = dataset;
   parms beta0 beta1 … betap;
   depvar = beta0 + beta1 * var1 + beta2 * var2 …;
   fit depvar / breusch = (1 var1 var2 …);   /* Breusch-Pagan test; the option goes on the FIT statement */
run;
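If SAS/ETS is unavailable, PROC REG itself already provides a related check: the SPEC option requested on the MODEL statement earlier performs White's test of the joint first- and second-moment specification. A minimal stand-alone sketch using the deck's predictors:

proc reg data = fraud.fraud;
   model total_spend = num_members doctor_visits fraud
                       member_duration no_claims optom_presc / spec;   /* White's test */
run;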
SAS Programs.

Original linear regression, influence, variable selection.

Proc Reg Data = fraud.fraud PLOTS (MAXPOINTS = NONE)
         OUTEST = _estim_ (where = (_type_ = "PARMS")) TABLEOUT
         COVOUT OUTVIF RSQUARE ADJRSQ AIC BIC CP MSE SSE EDF CLB;
         /* clb: confidence limits for betas */
   M1_TRN_NONE : MODEL total_spend = NUM_MEMBERS DOCTOR_VISITS FRAUD
                 MEMBER_DURATION NO_CLAIMS OPTOM_PRESC
                 / selection = NONE dw r vif MSE JP CP AIC BIC SBC
                   ADJRSQ spec influence;   /* influence produces all influence statistics */
   OUTPUT OUT = OUTP (KEEP = total_spend p_total_spend r_p_total_spend
                             student rstudent
                             lower_p_total_spend upper_p_total_spend)
          P = p_total_spend   R = r_p_total_spend
          STUDENT = student   RSTUDENT = rstudent
          LCL = lower_p_total_spend   UCL = upper_p_total_spend;
   TITLE2 "First Regr depvar total_spend";
RUN;
Splitting the data according to the studentized residual.

data lessthan morethan;
   merge fraud.fraud OUTP (keep = rstudent);
   if abs(rstudent) < 2 then output lessthan;
   else output morethan;
   drop rstudent;
run;
Breusch-Pagan test.

ods output heterotest = hetero_test;   /* REQUIRES SAS/ETS PACKAGE */
proc model data = dataset;
   parms beta0 beta1 ..... betap;
   depvar = beta0 + beta1 * var1 + beta2 * var2 .....;
   fit depvar / breusch = (1 var1 var2 .....);   /* option goes on the FIT statement */
run;
Graphing confidence intervals of fitted values and predictions.

proc sgplot data = .... ;
   reg x = xvar y = yvar / cli;   /* or clm for fitted values */
run;
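For instance, on the fraud data (a hedged usage sketch: the variable pairing is illustrative; CLI draws prediction limits for individual observations, CLM limits for the mean fitted values, and both can be combined):

proc sgplot data = fraud.fraud;
   reg x = doctor_visits y = total_spend / cli clm;
run;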
For variable selection using PROC REG:

proc reg .... ;
   ...
   model Y = ..... / selection = forward ..... ;
run;

Options: none, forward, backward, stepwise, etc.
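A concrete sketch with the deck's predictors, using stepwise selection with explicit entry/stay significance levels (the 0.05 thresholds are illustrative, not the deck's settings):

proc reg data = fraud.fraud;
   model total_spend = num_members doctor_visits fraud
                       member_duration no_claims optom_presc
         / selection = stepwise slentry = 0.05 slstay = 0.05;
run;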
1) You fit a multiple regression to examine the effect of a particular variable
that a worker in another department is interested in. The variable comes back
insignificant, but your co-worker says that this is impossible, as it is known
to have an effect. What would you say or do?
2) You have 1,000 variables and 100 observations. You would like to find the
significant variables for a particular response. What would you do (this
happens in genomics, for instance)?
3) What is your plan for dealing with outliers (if you have a plan)?
How about missing values? How about transformations?
4) How do you prevent over-fitting when creating a statistical model (i.e., a
model that does not fit other samples well)?
5) Define/explain what forecasting is, as opposed to prediction, if possible.
1) We have a regression problem where the response is a count variable. Which
would you choose in this context: ordinary least squares, Poisson regression,
or maybe some other model? Explain your choice; what are the main differences
among these models?
2) Describe strategies to create a regression model with a very large number
of predictors and not a lot of time.
3) Explain intuitively why a covariance matrix is positive (semi-)definite,
and what that means. How can that fact be used?
4) Explain the concept of interaction effects in regression models.
Specifically, can an interaction be significant while the corresponding main
effects are not? Is there some difference in the interpretation of interactions
between OLS and logistic regression (future chapter)?
5) Is there a problem if regression residuals are not normal?
6) When would you transform a dependent variable? An independent one?
(from The Analysis Factor, 2018-NOV, theanalysisfactor.com)
1. When you add an interaction to a regression model, you can still
evaluate the main effects of the terms that make up the interaction,
just like in ANOVA.
2. The intercept is usually meaningless in a regression model.
3. In Analysis of Covariance, the covariate is a nuisance variable, and the
real point of the analysis is to evaluate the means after controlling for the
covariate.
4. Standardized regression coefficients are meaningful for dummy-coded
predictors.
5. The only way to evaluate an interaction between two independent
variables is to categorize one or both of them.
All of these are False.
References

Box, G. (1976): Science and Statistics, Journal of the American Statistical
Association, 71, 791-799.
Gladwell, M. (2008): Outliers: The Story of Success, Little, Brown and
Company.
Horst, P. (1941): The Prediction of Personal Adjustment, Social Science
Research Council Bulletin, 48, 431-436.
Spiess, A., Neumeyer, N. (2010): An evaluation of R2 as an inadequate measure
for nonlinear models in pharmacological and biochemical research: a Monte
Carlo approach, BMC Pharmacology, 10: 6.
Linear Regression.pdf

  • 1. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-1 9/8/2019
  • 2. Leonardo Auslender –Ch. 1 Copyright 2 9/8/2019 Contents Types of Regression modeling. Linear Regression. Goodness of fit. Fraud Data quick example Homework and interview questions. Non-Linear Regression (quickly). Problem Areas Heteroskedasticity Co-linearity Outliers Leverage Example of problem areas. QQ Plots Logging Dependent Variable Residuals. Interactions. Model and Variable Selections. Full example. Problem if selected away too many variables. SAS Programs References.
  • 3. Leonardo Auslender –Ch. 1 Copyright 3 9/8/2019 New Applicants, Fraudsters, Mortgage Delinquents, Diseased, Etc. Historical Data Predictive Model Exploratory / Interpretive Model
  • 4. Leonardo Auslender –Ch. 1 Copyright 4 9/8/2019 Exploratory/Interpretive and Predictive Models. Typically, businesses interested in both aspects. By way of example, Phone company interested in stopping attrition, interested in understanding characteristics of past attriters vs. non-attriters, and to predict who may be future ones to implement anti-attrition plans. Marketing plans typically dependent on profiling, not directly on predictive scores. Model purported to provide tools to prevent attrition. Typically, both aspects of modeling are not easily achievable. Typically, data mining, data science, etc. search for better predictive models, leaving interpretation as secondary aim. Thus, black-box models (e.g., neural networks) are more readily employed.
  • 5. Leonardo Auslender –Ch. 1 Copyright 5 9/8/2019 Types of Regression models. Dependent on type of dependent or target variable (single): Continuous: Linear Regression: standard and most prevalent method. Ridge Regression: Specialized for the case of co-linearity. Lasso Regression: least absolute shrinkage and selection operator: performs variable selection in addition to avoiding co-linearity. Partial Least Squares: Useful when n << p, reduces # predictor to small subset of components like PCA, then Linear regression on components. Non-linear regression: allows for more flexible functional form. Categorical: Binary logistic: estimates probability of occurrence of “event”. Ordinal logistic: Dep var has at least 3 levels, that can be ranked (e.g.., worse, fair, better). Nominal (or Polychotomous): More than 2 levels. Counts (e.g., number of visits to doctors, etc). Poisson, Negative Binomial, Zero-inflated Regression:
  • 6. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-6 9/8/2019 All models are wrong, but some are useful – George Box (1976) (right, which ones are useful?) Linear models: area that has received the most attention in statistical modeling. Part of supervised learning in ML, Data Mining, etc. Unsupervised methods refer to those without target or dependent variable (e.g., clustering, PCA). Given dependent (or target) variable Y and set of non-random predictors X, aim at a) finding linear function f(X) of predictors to predict Y, given some conditioning criterion, b) interpreting function and c) once f(X) is estimated and predictions for Y found, “loss” function is used to evaluate prediction errors, typically squared error function. This presentation is about Linear Regression, and its estimation method is called Ordinary Least Squares (OLS), or Least Squares Estimation (LSE).. Impossible to review all methods well in a single lecture.
  • 7. Leonardo Auslender –Ch. 1 Copyright 7 9/8/2019 Tendency to just plunge into whatever data are available and obtain a model. But … Gladwell (2008, p.1 and subsequent) mentions that mortality in town of Roseto, Pa, was far lower than in other towns. Heart attacks below age 55 were unknown while they were prevalent all over the US. Scientists analyzed genetic information (Rosetans were mostly originals from a small southern Italian town), exercise, diet, but could not find any true answers at the individual level. The difference was the social interconnections that the Rosetans had in Roseto, the friendliness in the streets, the chatting in the old Italian dialect, 3 generations living under the same roof, the calming effect of the church services, the egalitarian ethos that allowed people to enrich themselves without flaunting it to others not so fortunate, in short a protective social structure. They counted 22 civic organizations in a town of 2,000. Spatial databases will capture the effect but not the reason unless societal interconnections are first understood, conceptualized and measured. Typical databases may not include them ➔ UNDERSTAND that raw data may not readily model reality and that NOT everything is modelable.
  • 8. Leonardo Auslender –Ch. 1 Copyright 8 9/8/2019
  • 9. Leonardo Auslender –Ch. 1 Copyright 9/8/2019 Model: Y = 0 + 1*X1 + 2*X2 +… + p*Xp + ε = = X + ε, Y continuous, dependent or target variable, X, set of predictors, either binary or continuous, all Numeric. Linearity is assumed because the function f(X) = X’β, and β are fixed and not random coefficients, unknown and in need of estimation. Criterion: Minimize sum of squared errors to find Betas. Ch. 2.2-9 * * .... * * .... . . * * .... 1 1 11 2 12 p 1p 1 2 1 21 2 22 p 2p 2 n 1 n1 2 n2 p np n Y X X X Y X X X Y X X X =  +  + +  +  =  +  + +  +  =  +  + +  + 
  • 10. Leonardo Auslender –Ch. 1 Copyright 9/8/2019 Ch. 2.2-10 or succintly : . . . . . . . . . . . . . . . . . . , 1 11 12 1p 21 22 2p 2 n1 n2 np p p n Matrix representation Y X X X X X X Y X X X Y Y X                                           = +                                     =  + 
  • 11. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-11 9/8/2019 Idealized view with single predictor X. At every Value of Total length There are Many values Of body Depth Symmetric ally Distributed .
  • 12. Leonardo Auslender –Ch. 1 Copyright 9/8/2019 Example (Horst, 1941): Y = pilot performance X1 = mechanical ability of the pilot X2 = verbal ability of the pilot N: undetermined number of observations, p = 2. Required: N > p (vital). From the standpoint of computation, no assumptions on error , because only betas are estimated. Residuals (estimates of , are obtained by subtracting predicted from actual. Ch. 2.2-12 1 1 2 2 1 1 2 2 ˆ ˆ Model: constant Estimate constant Y X X Y b X b X    = + + + = + + 
  • 13. Leonardo Auslender –Ch. 1 Copyright 13 9/8/2019 Requested Analyses: Names & Descriptions. Model # Model Name Model Description *** Overall Models -1 M1 Multivariate regression TOTAL_SPEND 1 M1_TRN_REGR_NONE Regr TRN NONE 2
  • 14. Leonardo Auslender –Ch. 1 Copyright 14 9/8/2019 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 6 17605721712 2934286952 23.90 <.0001 (1) Error 5953 7.308901E11 122776763 Corrected Total 5959 7.484958E11 Root MSE 11080 R-Square 0.0235 (2) Dependent Mean 18608 Adj R-Sq 0.0225 Coeff Var 59.54689 (5) Parameter Estimates Parameter Standard Variable Label DF Estimate Error t Value Pr > |t| Intercept Intercept 1 15664 501.50886 31.23 <.0001 NUM_MEMBERS Number of members covered 1 -43.86759 144.12951 -0.30 0.7609 (3) DOCTOR_VISITS Total visits to a doctor 1 138.06114 20.29884 6.80 <.0001 (4) FRAUD Fraudulent Activity yes/no 1 -1813.84488 394.55243 -4.60 <.0001 MEMBER_DURATION Membership duration 1 9.29226 1.81950 5.11 <.0001 NO_CLAIMS No of claims made recently 1 -172.63492 142.61993 -1.21 0.2262 OPTOM_PRESC Number of opticals claimed 1 477.92230 88.49029 5.40 <.0001
  • 15. Leonardo Auslender –Ch. 1 Copyright 15 9/8/2019
  • 16. Leonardo Auslender –Ch. 1 Copyright Linear Regression: geometrical projections. Y X Z Projection of Y On X. Projection of Y On Z. Y hat = a X + b Z = inner Product of optimal projections Of Y on the plane spanned by X and Z. 9/8/2019 Ch. 2.2-16 Plane spanned By X and Z
  • 17. Leonardo Auslender –Ch. 1 Copyright 17 9/8/2019
  • 18. Leonardo Auslender –Ch. 1 Copyright 9/8/2019 Assumptions beyond sheer computation. 1) Nonrandom X matrix of predictors. 2) X matrix is of full rank (vector components linearly independent). 3) ε i uncorrelated across observations, common mean 0 and unknown positive and constant variance σ2 (E.g., not using height and weight as predictors).. 4) ε ~ N (0, σ2). Inferential Problems. 1) Point and CI Estimation of  and c’  (linear combination). 2) Estimation of σ2. 3) Hypothesis testing on  and c’  (linear combination of parameters), c vector of constants. 4) Prediction (point and interval) of new Y. Regardless of assumptions, always possible to fit a linear regression to data. Resulting equation may not be USEFUL, or may be misleading ➔ use your brains. Ch. 3-18
  • 19. Leonardo Auslender –Ch. 1 Copyright 9/8/2019 Ordinary Least Squares (OLS): *** Estimate value that minimizes the residual sum of squares criterion: (usually also called min SSE, we’ll use either one term): − = − = − − =   2 i ij j 1 RSS (Y ( x )) (Y X )'(Y X ), from which: OLS estimate : (X'X) X' Y     OLS estimates are B(est) L(inear) U(nbiased) E(stimators): BLUE, that is, minimum variance, and are also the MLE (maximum likelihood estimators). . − − −  =  =    = −  =     −  −   2 1 2 0 A 1 1 2 Var( ) (X'X) Fitted or predicted Y: Y X RSS( ) n p To test H : C d,H : C d at level of significance : (C d)'[C(X'X) C'] (C d) nrow(C) Compare to upper % point of F (nrow C, n - p) Ch. 3-19
  • 20. Leonardo Auslender –Ch. 1 Copyright 20 9/8/2019 To test whether at least one coefficient is NOT ZERO, Use F-test, where “0” refers to null model and “F” to full. ~ ( , ) 0 F F SSE SSE p F test F p n p MSE − − = −  =   = 0 A H : C d,H : C d typically, d 0
  • 21. Leonardo Auslender –Ch. 1 Copyright 21 9/8/2019
  • 22. Leonardo Auslender –Ch. 1 Copyright 22 9/8/2019 ➔Corr (X,Y) = if SD(Y) = SD(X). E.g., if both Standardized, otherwise same sign at least, and interpretation from correlation holds in simple regression case. Notice that regression of X on Y is NOT inverse of regression of Y on X because of SD(X) and SD(Y). ˆ  / Confusion on signs of coefficients and interpretation. ( ) ˆ { ( ) } ˆ ( ) ( ) y i xy xy x i xy Y X s Y Y r r s X X sg r sg      = + + − = =  − =   2 1 2 2
  • 23. Leonardo Auslender –Ch. 1 Copyright 23 9/8/2019 In multiple linear regression, previous relationship does not hold because predictors can be correlated (rxz) weighted by ryz, hinting at co-linearity and/or relationships of supression/enhancement. . . . 2 2 But in multivariate, e.g.: , estimated equation (emphasizing "partial") and for example: ˆ ˆ ˆ , ˆ 1 ˆ ( ) ( ) ( ) ( ) and 1 YX Z YZ X Y YX YZ XZ YX Z X XZ YX YX YZ XZ XZ Y X Z Y a X Z s r r r s r sg sg r abs r abs r r r         = + + +  = + + − = − =    
  • 24. Leonardo Auslender –Ch. 1 Copyright 24 9/8/2019 Illustrating Beta coefficients Issue, with example From BEDA
  • 25. Leonardo Auslender –Ch. 1 Copyright 25 9/8/2019 Zero X Y slope > 0, partial Y X / Z slope < 0.
  • 26. Leonardo Auslender –Ch. 1 Copyright 26 9/8/2019 Note coeff (var_x) < 0 while corr (Var_Y, Var_x) > 0. Note that non-intercept p-values are significant. . Estimates DF Paramet er Estimate Standar d Error t Value Pr > |t| Variable 1 -0.05 0.07 -0.78 0.44 Intercept var_z 1 0.80 0.11 7.51 0.00 var_x 1 -0.49 0.11 -4.64 0.00
  • 27. Leonardo Auslender –Ch. 1 Copyright 27 9/8/2019
  • 28. Leonardo Auslender –Ch. 1 Copyright 9/8/2019 Goodness of fit Some Necessary Definitions in linear models. Correlation, R2, etc. (Y-hat = predicted Y). ˆ | | ˆ ˆ ( , ) cos( , ) | | ˆ ˆ ˆ | | | | cos( , ) | | | | ( , 1995, .36) 1 cos 1 Y R Corr Y Y Y Y Y Y Y Y Y Y Y Wickens p = = =  =  −    Length of predicted vector never larger than original length of Y ➔ Regression to the mean. Ch. 2.2-28
  • 29. Leonardo Auslender –Ch. 1 Copyright Some Necessary Definitions in linear models, Goodness of Fit measures: R2 (Coeff. of Determination, r2 is for simple regr.) 0  R2  1 = Model SS (Regresion SS) / Total SS = 1 – SSE/SST (computational formula). With just Y and X in regression, r2 = corr2 (Y, X) (from previous formula). n i i = 1 n i i = 1 n i i i = 1 2 2 2 Regression Sum of Squares = ˆ RSS = (y - y) Total Sum of Squares = TSS = (y - y) Sum of Squares of Error = ˆ SSE = (y - y )   
  • 30. Leonardo Auslender –Ch. 1 Copyright 9/8/2019 Y X Z 90o 90o 90o  = Corr(X, Y) cos( ), zero-order (X,Y ) corr.  ˆ ˆ Corr(Y, Y) cos( ) Y / Y  = =  2 2 2 R cos ( ) 1 sin ( )   = = − Ŷ ˆ ˆ sin( ) (Y Y) / Y e / Y  = − =    − 2 ˆ R (angle between Y and Y), and vice versa. Ch. 3-30 Geometric appreciation.
  • 31. Leonardo Auslender –Ch. 1 Copyright 31 9/8/2019 Other measures of goodness of fit. Press residuals. The prediction error (PRESS residuals) are defined as: , where is the fitted value of the ith response based on all observations except the ith one. It can be shown that (and hii element of H matrix (X’X)-1X’) The Press Statistic. PRESS is generally regarded as a measure of how well a regression model will perform in predicting new data. Small PRESS An R2- like statistic for prediction (based on PRESS We expect this model to explain about R2 (prediction)% of variability in predicting new observations. Use PRESS to compare models: A model with small PRESS is preferable to one with large PRESS. i (i) ii e e 1 h = −
  • 32. Leonardo Auslender –Ch. 1 Copyright 32 9/8/2019 Evaluation of measures of goodness of fit. 2 2 1 k Mean Square Error [0,1], unitless, measures prop. Var(Y) fitted by preds. ˆ ( ) 1 = standard error of fit for = S y estimate of error Var , . =RM Root MSE SE= . ( ) n i i i R Y Y MSE n p MS Coeff Variation CV E RMSE Y =  − = = − − =  ( / 2 , 2 ) 2 2 / , 2 k 95% Prediction interval for k-th observation (already in data set): s (prediction) = se of prediction or fit for ne ˆ : ˆ w x * { }, ( ( ) 1 , ( ) [ ] k n k i h s n and y y t s prediction s p ed CI r i x x n x x t S MSE   − −  − + −   2 2 1 ( ) 1 ) [1 ( ) ] k n i i X X ction MSE n X X = − = + + − 
  • 33. Leonardo Auslender –Ch. 1 Copyright 33 9/8/2019 Evaluation of measures of goodness of fit (cont.). 1) Model better when R2 higher and/or RMSE lower. RMSE called standard error of regression S by Minitab, but we don’t use that terminology in here. 2) RMSE: average distance between Y and predictions, R2 does not tell this. RMSE can be used also for non-linear models, R2 cannot. Roughly, -+ 2 * RMSE produces 95% CI for residuals. 3) 95% Prediction Interval line can be built (above and below regression line) shows where 95% of data lie in reference to prediction line. 4) 95% Mean Fitted values can be built (above and below regression line) shows where 95% of data lie in reference to fitted line. 5) CV evaluates relative closeness of the predictions to actual values; R2 evaluates how much of variability in Y is fitted by the model. CV cannot be used when mean(Y) = 0 or when Y has mixtures of negative and positive values in its range. 6) Usefulness: if know that predictions must be within specific interval from data points (model precision), RMSE provides the info, R2 does not.
  • 34. Leonardo Auslender –Ch. 1 Copyright 34 9/8/2019
  • 35. Leonardo Auslender –Ch. 1 Copyright 35 9/8/2019 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 6 17605721712 2934286952 23.90 <.0001 (1) Error 5953 7.308901E11 122776763 Corrected Total 5959 7.484958E11 Root MSE 11080 R-Square 0.0235 (2) Dependent Mean 18608 Adj R-Sq 0.0225 Coeff Var 59.54689 (5) Parameter Estimates Parameter Standard Variable Label DF Estimate Error t Value Pr > |t| Intercept Intercept 1 15664 501.50886 31.23 <.0001 NUM_MEMBERS Number of members covered 1 -43.86759 144.12951 -0.30 0.7609 (3) DOCTOR_VISITS Total visits to a doctor 1 138.06114 20.29884 6.80 <.0001 (4) FRAUD Fraudulent Activity yes/no 1 -1813.84488 394.55243 -4.60 <.0001 MEMBER_DURATION Membership duration 1 9.29226 1.81950 5.11 <.0001 NO_CLAIMS No of claims made recently 1 -172.63492 142.61993 -1.21 0.2262 OPTOM_PRESC Number of opticals claimed 1 477.92230 88.49029 5.40 <.0001
  • 36. Leonardo Auslender –Ch. 1 Copyright 36 9/8/2019 Variance Variable Label DF Inflation 95% Confidence Limits Intercept Intercept 1 0 14681 16647 NUM_MEMBERS Number of members covered 1 1.00194 -326.41369 238.67852 DOCTOR_VISITS Total visits to a doctor 1 1.04606 98.26805 177.85423 FRAUD Fraudulent Activity yes/no 1 1.20681 -2587.31068 -1040.37907 MEMBER_DURATION Membership duration 1 1.08243 5.72537 12.85914 NO_CLAIMS No of claims made recently 1 1.14992 -452.22170 106.95185 OPTOM_PRESC Number of opticals claimed 1 1.03956 304.44924 651.39535 Interpretation: (1) Implies that the variables in the model do fit the dependent variable significantly. (2) Is the R-square value, that is very low since it is close to 0 but still significant at alpha = 5%. (3) The coefficient for Num_members is not significantly different from 0. (4) The coefficient for doctor_visits is significantly different from 0. An increase in doctor_visits by 1 unit, keeping the rest of the variables constant, raises total_spending by around $138. (5) Notice that the 95% confidence limits for variables deemed not to be significant, overlap 0. (6) Parameter estimates are also called “main effects” of corresponding variables and are constant. (7) The coefficient of variation (5) is defined as RMSE / mean dep var, and is unitless, can be used to compared different models, smaller better.
  • 37. Leonardo Auslender –Ch. 1 Copyright 37 9/8/2019 Caveat about coefficient interpretation: Classical interpretation of regression coefficient ‘b1’ for variable X1 is: For given values of all remaining predictors, a change in one unit of X1, changes the predicted value by ‘b1’. In a non-experimental setting (as this one), in which data sets are collected opportunistically, predictors/variables are not necessarily orthogonal. As a matter of fact, it is better to at least suspect that the predictors are correlated. In this case, it is not possible to state “keeping the rest of the variables constant” categorically, because raising the variable of interest by one unit, affects the values that are supposed to be left constant. Question: The coefficient for fraud is -1813. Since fraud = 1 implies that there was fraud activity, can you interpret it in relation to total_spend alone? Is it possible to immediately interpret coefficients?
  • 38. Leonardo Auslender –Ch. 1 Copyright 38 9/8/2019
  • 39. Leonardo Auslender –Ch. 1 Copyright 39 9/8/2019 From Fitted …..
  • 40. Leonardo Auslender –Ch. 1 Copyright 40 9/8/2019 To Predicted…..
  • 41. Leonardo Auslender –Ch. 1 Copyright 41 9/8/2019 Model is seriously bad.
  • 42. Leonardo Auslender –Ch. 1 Copyright 42 9/8/2019 QQ plot of residuals indicates problem with dependent Variable. Residuals supposed to be normally distributed.
  • 43. Leonardo Auslender –Ch. 1 Copyright 43 9/8/2019 No comments.
  • 44. Leonardo Auslender –Ch. 1 Copyright 44 9/8/2019 End of first Modeling attempt, Will try to improve soon.
  • 45. Leonardo Auslender –Ch. 1 Copyright 45 9/8/2019 Using categorical variables in linear or logistic regression. Given variable X with ‘p’ levels (e.g. colors, white, black, yellow, p = 3), LR can use this information by creating ‘p – 1’ dummy (binary) variables: Dummy1 = 1 if color = white, else 0 Dummy2 = 1 if color = black, else 0. If both dummy1 and dummy2 = 0 then color is yellow. Method is called DUMMY CODING. If we created 3 Dummies for a model, we would be creating co-linearity among the Predictors. When just dummy coded variables are our predictors, constant of linear regression is mean of reference group. Coefficient of dummy predictor is mean (dummy k) – mean (reference dummy). In EFFECT CODING, we again create p-1 binary variables, but dummy1 = dummy2 = -1 when color is yellow.
  • 46. Leonardo Auslender –Ch. 1 Copyright 46 9/8/2019 Using categorical variables in linear or logistic regression. In LR with only effect coding predictors, constant is overall mean of dep var. Coefficient for effect1 is mean of predictor 1 – overall mean, etc. In case of linear regression with variable selection, ‘p’ dummies can be constructed, because variable selection will select at most p – 1 of them. The same rule applies to logistic regression, but not to tree based methods. Binary Tree based models searches over all possible 2 subgroups of p-levels to find optimal splits (to be reviewed at later lecture).
  • 47. Leonardo Auslender –Ch. 1 Copyright 47 9/8/2019
  • 48. Leonardo Auslender –Ch. 1 Copyright 48 9/8/2019
  • 49. Leonardo Auslender –Ch. 1 Copyright 49 9/8/2019 Constant RMSEs, different R_squares.
  • 50. Leonardo Auslender –Ch. 1 Copyright 50 9/8/2019
  • 51. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-51 9/8/2019
  • 52. Leonardo Auslender –Ch. 1 Copyright 52 9/8/2019
  • 53. Leonardo Auslender –Ch. 1 Copyright 53 9/8/2019 2) Grocery store investigates relationship between cardboard bags provided freely for each purchase versus revenue and number of customers per visit. Data on more than 10,000 purchases. Quick glance revealed that number of bags utilized ranged from 0 to 8 with mode of 3, the distribution of revenue is skewed leftwards and the number of customers per visit ranged from 1 to 6 with a mode of 1. Analyst estimated linear regression with following information, where information in parenthesis corresponds to standard errors. All diagnostic measures of goodness of fit were significant and very good.
  • 54. Leonardo Auslender –Ch. 1 Copyright 54 9/8/2019 a) Provide interpretation to the coefficients. b) Is it possible to provide an interpretation to the constant? Did you expect the constant to be 0? c) Have any model assumptions been violated with just this information? d) Would you implement this model? Why? Or why not? e) Would you consider adding a product interaction to the equation? What would it mean? f) Should the dependent variable “Bags” be transformed? Why? Or why not? g) Assume that bags are a very costly item, and that it is important to obtain a very good model. Discuss possible ways (variables, transformations, model searches) to improve the present one. For instance, could baggers and bag carriers (into the car) be a solution? Would there be an interaction with customer gender?
  • 55. Leonardo Auslender –Ch. 1 Copyright 55 9/8/2019
  • 56. Leonardo Auslender –Ch. 1 Copyright 56 9/8/2019 Non-linear regressions. Any regression beyond not linear in β parameters. Examples (θ, theta, is our previous β). 3 4 1 2 ( * ** ) 1 2 1 1 4 2 4 3 : * * * : ( ) * : * cos( ) ( * cos(2 * ) x Powerfunction X Weibull Growth e Fourier X X             − + − + + + + For linear models, SSR + SSE = SS Total, from which R2 is derived. In nonlinear regression, SSR + SSE ≠ SS Total! This completely invalidates R2 for nonlinear models, and it is not bounded between 0 and 1. Still incorrectly used in many fields (Spiess, Neumeyer, 2010)
  • 57. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-57 9/8/2019
  • 58. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-58 9/8/2019 Problem area: Heterokedasticity. Error term in the original linear equation that links Y to X is assumed to be normal with constant variance. If this assumption is violated, then the variance (and consequently inference based on p-values) is affected. E(ei 2)  2 Heteroskedasticity can be visually detected in a univariate fashion by graphing each individual predictor versus the residuals, and looking for non-random patterns. In the context of large data bases, this method is infeasible. Analytically, there are a serious of tests, such as: The Breusch-Pagan LM Test (used in the table below) The Glesjer LM Test The Harvey-Godfrey LM Test The Park LM Test The Goldfeld-Quandt Tests White’s Test
  • 59. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-59 9/8/2019 Problem area: Heterokedasticity. Resolving hetero: (a) Generalized Least Squares (b) Weighted Least Squares (c) Heteroskedasticity-Consistent Estimation Methods Modeling Heteroskedasticity: Estimated Generalized Least Squares (EGLS). 1. Obtain the original regression results and keep the residuals ei, i = 1 ,,,, n. 2. Obtain log ( ) 3. Run a regression with log ( ) as dependent variable on the original predictors, and obtain predicted values Pred.i. 4. Exponentiate the predicted values Pred.i to obtain 5. Estimate the original equation of Y on X by using weighted least squares, using as weights. 2 i e 2 i  1/ i 
  • 60. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-60 9/8/2019 Problem area: Co-linearity (wrongly called multi-colinearity). Co-linearity: existence of linear relationships among predictors (independently of Ys) such that estimated coefficients become unstable when those relationships among Xs have high r-squares. In extreme case of perfectly linear relationship among predictors, one predictor can be perfectly modeled from others, ➔ it is irrelevant in regression against Y. For instance, if regression contains Xs that denote components of wealth and one of those predictors is merely sum of all others, the use of all the Xs to model Y would yield an unstable equation. (because covariance matrix is then singular). Present way to find out about co-linearity is NOT by correlation matrices but by calculation of variance inflation factor (VIF, we omit other methods in this presentation). This factor is calculated for every predictor and is the transformation of the r-square of a regression of every predictor on the rest of the predictors (notice that the dependent variable is not part of this). The present rule of thumb is that VIF values at or above 10 should be considered too high, and probably the variable in question (or part of the other predictors) should be studied in more detail and ultimately removed from the analysis.
  • 61. Leonardo Auslender –Ch. 1 Copyright 61 9/8/2019 Co-linearity: left panel, any beta2 minimizes SS1 along bottom ridge., beta2 is not unique, Ridge regression penalizes ‘ridge’ and thus optimal beta2 is unique on right panel.
  • 62. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-62 9/8/2019 Consequences of Heteroskedasticity and Co-linearity 1) Beta estimators are still linear 2) Beta estimators are still unbiased 3) Beta estimators are not efficient - the minimum variance property no longer holds, ➔ tend to vary more with data. 4) Estimates of the coefficient variances are biased 5) The standard estimated variance is not an unbiased estimator. 6) Confidence intervals are too narrow and inference can be misleading.
  • 63. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-63 9/8/2019 Problem Area: Unusual data points: Outliers, leverage and Influence. Outlier: data point in which value of dependent variable does not follow general trend of rest of data. Data point is influential (has leverage) if it unduly influences any part of regression analysis, such as predicted responses, estimated slope coefficients, or hypothesis test results. Thus, outlier represents extreme value of Y w.r.t. other values of Y, not to values of X. From bi-variate standpoint, an observation could be 1) an outlier if Y is extreme, 2) influential if X is extreme, 3) both outlier and influential if X and Y are extremes, 4) & also outlier and influential if distant from rest of data w/o being extreme, Definitions are not precise; from bi-variate standpoint, tendency to analyze these effects by deletion of suspected observations. However, if percentage of suspected observations is large, then the issue as to what constitute the population of reference is in question.
  • 64. Leonardo Auslender –Ch. 1 Copyright Ch. 1.1-64 9/8/2019 Working with unusual data points. Not without further analysis. If the unusual nature is due to data error, then resolve the issue but in reference to the model at hand. If you are modeling height of titans in history, then you must expect heights beyond 10ft. If not, heights beyond 7ft are suspect. That is, condition on model at hand. A quick way to manage unusual data points is, once a regression is obtained, to separate those points the absolute standardized residual value of which is higher than 2. Standardized residuals are commonly known as studentized residuals. The studentized residuals are the residuals divided by the standard error of a regression where the i-th observation has been deleted.
• 66. Different regressions for Total_spend. "M1" is the generic name for models on the Fraud data; TRN: training data; NONE: no variable selection; LT_2 and GT_2: data split using studentized residuals; NO_HETERO: after weighted least squares.
Requested Analyses: Names & Descriptions
Model # | Model Name | Model Description
-1 | M1 | Multivariate regression TOTAL_SPEND
1 | M1_TRN_NONE_GT_2 | Regr TRN NONE abs(rstud) >= 2
2 | M1_TRN_NONE_GT_2_NO_HETERO | Regr TRN WLS abs(rstud) >= 2
3 | M1_TRN_NONE_LT_2 | Regr TRN NONE abs(rstud) < 2
4 | M1_TRN_NONE_LT_2_NO_HETERO | Regr TRN WLS abs(rstud) < 2
5 | M1_TRN_REGR_NONE | Regr TRN NONE
6 | M1_TRN_REGR_NONE_NO_HETERO | Regr TRN WLS
• 68. Notice the variation in p-value significance.
• 69. VIFs and co-linearity: none, if one accepts the VIF > 10 criterion.
• 70. R-squares by model: the typical jump when splitting on studentized residuals.
• 71. Checking heteroskedasticity: still persistent, also when using STPW (stepwise variable selection). A sketch of the WLS correction follows.
Breusch-Pagan Test for Heteroskedasticity
Model Name | Chi-square | DF | p-value | Num Obs | Hetero?
M1_TRN_REGR_NONE | 42.50672 | 5 | 0.000 | 5,960 | YES
M1_TRN_NONE_LT_2 | 49.405089 | 5 | 0.000 | 5,703 | YES
M1_TRN_NONE_GT_2 | 26.637793 | 5 | 0.000 | 257 | YES
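The NO_HETERO models re-fit by weighted least squares. A minimal sketch, assuming one common recipe for the weights (inversely proportional to a fitted variance function; the deck does not spell out its exact choice):

   /* Merge predictions from the unweighted fit (OUTP, created earlier) */
   /* back onto the raw data; one-to-one merge, as in the split step.   */
   data wls;
      merge fraud.fraud
            outp (keep = p_total_spend);
      w = 1 / max(p_total_spend**2, 1e-8);   /* assumed variance model */
   run;

   proc reg data = wls;
      model total_spend = num_members doctor_visits fraud
                          member_duration no_claims optom_presc;
      weight w;                              /* WLS via the WEIGHT statement */
   run;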
• 73. QQ-plots: need to transform the dependent variable in these models.
• 75. Requested Analyses: Names & Descriptions
Model # | Model Name | Model Description
-1 | M2 | Log total spend
1 | M2_TRN_NONE_GT_2 | Regr TRN NONE abs(rstud) >= 2
2 | M2_TRN_NONE_GT_2_NO_HETERO | Regr TRN WLS abs(rstud) >= 2
3 | M2_TRN_NONE_LT_2 | Regr TRN NONE abs(rstud) < 2
4 | M2_TRN_NONE_LT_2_NO_HETERO | Regr TRN WLS abs(rstud) < 2
5 | M2_TRN_REGR_NONE | Regr TRN NONE
6 | M2_TRN_REGR_NONE_NO_HETERO | Regr TRN WLS
• 76. [Bar chart] Mean TRN R-square by model, comparing the initial M1 with the logged M2. R-squares range from 0.0235 to 0.1976 across the M1 variants and from 0.0388 to 0.0995 across the M2 variants; the GT_2_NO_HETERO variants score highest in each family.
• 77. The log transform works for the original model.
• 79. Log-transformed residuals with ±2·RMSE bands as CIs: most of the residuals are contained in the CI. More info in the model comparison section.
• 80. M1 models: original total_spend (dep var). Notice: separate scales for the heteroskedasticity-corrected models on the next slide.
• 81. [Residual plots for the heteroskedasticity-corrected M1 models, on their own scales.]
• 82. M2 models: logged total_spend (dep var).
• 83. Interview question: is there a problem? A possible improvement?
1. X is typically the observed variable, Y the predicted.
2. Predictions are more concentrated than the observed values (look at the ranges).
3. A lowess curve (smoother) would provide much information: there is a linear relationship in the residuals not accounted for by the original regression.
• 85. Interaction effects in regression analysis (moderated regressions).
The typical linear model assumes a constant effect of each predictor on Y regardless of the level(s) of the other variable(s). The usual way of introducing interactions is by multiplication, i.e., X1 * X2, but there are infinitely many functional ways of representing an interaction of X1 and X2. The product is chosen for its ease and because it is somewhat sensitive to diverse functional forms.
Arguments against the use of interactions: A) they are difficult to interpret and they alter the original coefficients of the main effects (if present); B) they can generate co-linearity, because the product of Xs is a nonlinear function of the original variables; C) from A) and B), coefficients of main effects can become insignificant.
Argument in favor of interactions: the multiplicative term turns the additive relationship of X1 and X2 into a conditional one. An example will illustrate this better.
• 86. Let Y = b0 + b1X1 + b2D + b3DX1 + e, where X1 is continuous and D a dummy variable. The presence of D implies that there are in effect two regressions of Y on X1, according to the value of D (b1 and b2 are typically called the main effects of X1 and D respectively):
1) D = 0 ➔ Y = b0 + b1X1 + e
2) D = 1 ➔ Y = (b0 + b2) + (b1 + b3)X1 + e
If D is instead a continuous or ratio variable, which we will call X2, rearranging the previous formulae we obtain:
a) for a given value of X2: Y = (b0 + b2X2) + (b1 + b3X2)X1 + e … (1)
b) for a given value of X1: Y = (b0 + b1X1) + (b2 + b3X1)X2 + e … (2)
When X1 = X2 = 0, b0 is the intercept of both equations. The coefficient b3 has different meanings: in equation (1) it is the change in the slope of Y on X1 at a given value of X2, i.e., ∂Y/∂X1 | X2; conversely, in equation (2) it is ∂Y/∂X2 | X1.
• 87. Y = (b0 + b2X2) + (b1 + b3X2)X1 + e … (1)
Y = (b0 + b1X1) + (b2 + b3X1)X2 + e … (2)
b1 and b2 are the baseline slopes: b1 is the slope of Y on X1 when X2 = 0, and the roles are reversed for b2. If X2 (or X1) is never 0, the corresponding conditional slope is an extrapolation to 0.
The two regressions emerge regardless of the actual values accorded to D: it is D's binary nature that implies two regressions, as shown, and likewise two regressions in the continuous case. From this discussion, b1 and b2 do not describe additive main effects but conditional ones. The resulting surface is not a plane but a warped regression surface. We illustrate these points in the graphs below, which represent a regression of the duration of a loan until fully paid on the age of the customer and the amount lent (from a different data set, not discussed here).
• 88. [3-D plots: the warped regression surface of loan duration on customer age and amount lent.]
• 89. Change in previously additive coefficients: interactive ones describe particular conditional, not additive, effects. Interpretation and example.
Y = f(original without X1*X2, additive effect) … (base)
Y = 90 − 2.5X1 − 0.1X2 + 0.5X1X2 + e … (original), where 5 ≤ X1 ≤ 10 and 10 ≤ X2 ≤ 100.
At the extreme points of X1 (omitting the error for simplicity):
Y = (90 − 12.5) + (−0.1 + 2.5)X2 … (3), at X1 = 5
Y = (90 − 25) + (−0.1 + 5.0)X2 … (4), at X1 = 10
The same exercise can be done over the range of X2. The coefficients in (3) and (4) cover the corresponding coefficients of the additive model, dependent on the distribution of X1. Next: simulation results for the original equation and for (3) and (4); a sketch of such a simulation follows.
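A minimal sketch of how such a simulation might be set up (uniform draws, a unit-variance error and the seed are assumptions; the slide does not specify them):

   data sim;
      call streaminit(1234);
      do i = 1 to 10000;
         x1 = 5 + 5*rand("uniform");       /* 5 <= X1 <= 10   */
         x2 = 10 + 90*rand("uniform");     /* 10 <= X2 <= 100 */
         x1_x2 = x1*x2;
         y = 90 - 2.5*x1 - 0.1*x2 + 0.5*x1_x2 + rand("normal");
         output;
      end;
   run;

   proc reg data = sim;
      base:     model y = x1 x2;           /* omits the interaction: biased */
      original: model y = x1 x2 x1_x2;     /* the true specification        */
   run;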
• 90. Interaction Models: parameter estimates (p-values in parentheses), N = 10,000.
Model    | Intercept     | x1            | x2            | x1_x2
base     | −116 (.000)   | 24.91 (.000)  | 3.657 (.000)  | —
Original | 90.07 (.000)  | −2.51 (.000)  | −.103 (.000)  | 0.500 (.000)
Model3   | 77.51 (.000)  | —             | 2.400 (.000)  | —
Model4   | 64.99 (.000)  | —             | 0.400 (.000)  | —
Base model estimates are far off the "real values" and the model is biased (the true model contains interactions). In the case of the three other models, note how closely the parameters correspond to the previous equations. See residual and prediction plots next.
• 91. Seriously bad model.
• 92. Better model.
• 93. The original model shows the dependence of the predictions on x2 and on the interaction, which is not captured in the base model. The other two models show the different slopes of X2 incorporating the x1 effects.
• 94. Interactions and significance of coefficients.
Similarly, standard errors become "conditional" standard errors, which turns the t-tests "conditional" as well: the effect of one variable is conditioned on a particular value of a third variable. Repeating the previous equation:
a) for a given X2: Y = (b0 + b2X2) + (b1 + b3X2)X1 + e … (1)
The conditional t-test for X1 is t = (b1 + b3X2) / s(b1 + b3X2), with the standard error given below.
Previously significant "additive" coefficients and the present "interaction" coefficient may be insignificant. But for a specific value of X2 (within the observed range), the t-test may prove significant, which entails that the regression and its "main" effects are valid within specific ranges of the conditioning variable.
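The denominator is the usual standard error of a linear combination of estimates, with variances and the covariance taken from the fitted coefficient covariance matrix (e.g., from PROC REG's COVOUT option, used in the SAS Programs slides):

\[
s^2(\hat b_1 + \hat b_3 X_2) \;=\; \widehat{\operatorname{Var}}(\hat b_1)
 \;+\; X_2^2\,\widehat{\operatorname{Var}}(\hat b_3)
 \;+\; 2X_2\,\widehat{\operatorname{Cov}}(\hat b_1,\hat b_3).
\]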
• 95. Interpreting main effects in regression models with interactions.
In models without interaction terms (i.e., without terms constructed as products of other terms), the regression coefficient for a variable is the slope of the regression surface in the direction of that variable. It is constant regardless of the values of the other variables, and therefore can be said to measure the overall effect of the variable.
In models with product interactions, this interpretation holds without further qualification only for variables not involved in any interaction. For variables that are involved in interactions, the "main effect" regression coefficient is the slope of the regression surface in the direction of that variable WHEN ALL OTHER VARIABLES THAT INTERACT WITH IT HAVE VALUES OF ZERO, and the significance test of the coefficient refers to the slope of the regression surface ONLY IN THAT REGION OF THE PREDICTOR SPACE, which may sometimes be far from the region in which the data lie. In ANOVA terms, the coefficient measures a simple main effect, not an overall main effect.
• 96. Main and overall effects.
An analogous measure of the overall effect of a variable is the average slope of the regression surface in the direction of that variable, averaging over all N cases in the data. It can be expressed as a weighted sum of the regression coefficients of all the terms in the model that involve that variable. The weights are awkward to describe but easy to get: a variable's main-effect coefficient always gets a weight of 1; for each other coefficient of a term involving that variable, the weight is the mean of the product of the other variables in that term. For example, if the model is
y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5 + b12*x1*x2 + b13*x1*x3 + b23*x2*x3 + b45*x4*x5 + b123*x1*x2*x3 + error
• 97. For the model above, the overall main effects are:
B1 = b1 + b12*M[x2] + b13*M[x3] + b123*M[x2*x3]
B2 = b2 + b12*M[x1] + b23*M[x3] + b123*M[x1*x3]
B3 = b3 + b13*M[x1] + b23*M[x2] + b123*M[x1*x2]
B4 = b4 + b45*M[x5]
B5 = b5 + b45*M[x4]
where M[.] denotes the sample mean of the quantity inside the brackets. All the product terms inside the brackets are among those that were constructed in order to run the regression. A sketch of the computation follows.
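A minimal sketch of the B1 computation, assuming a hypothetical data set D and a one-row data set EST holding the fitted coefficients under the names b1, b12, b13, b123 (e.g., extracted from an OUTEST= data set):

   data prods;                    /* build the product whose mean is needed */
      set d;
      x2x3 = x2*x3;
   run;

   proc means data = prods noprint;
      var x2 x3 x2x3;
      output out = m (drop = _type_ _freq_)
             mean = m_x2 m_x3 m_x2x3;
   run;

   data overall;                  /* weighted sum of coefficients */
      merge est m;
      B1 = b1 + b12*m_x2 + b13*m_x3 + b123*m_x2x3;
   run;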
• 98. Two-way interactions and centering.
In models that have only main effects and two-way interactions, there is a simpler way to get the overall effects: center (at their means) all variables involved in interactions. Then all the M[.] expressions become 0, and the regression coefficients are interpretable as overall effects. (The values of the b's will change; the values of the B's will not.)
So why not always center at the means, routinely? Because means are sample dependent, and the model may not be replicable in different studies; i.e., the coefficients will differ, and others' main effects will not be overall main effects unless their means are exactly the same as yours, which is unlikely. Moreover, centering converts simple effects to overall effects only in models with no three-way or higher interactions; if there are such interactions, centering will simplify the b ➔ B computations but will not eliminate them. A centering sketch follows.
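A minimal centering sketch, assuming a hypothetical data set D whose moderators are x1 and x2; note that the product must be rebuilt from the centered variables:

   proc standard data = d mean = 0 out = d_c;
      var x1 x2;          /* center only the variables involved in interactions */
   run;

   data d_c;
      set d_c;
      x1_x2 = x1 * x2;    /* interaction term built from the CENTERED variables */
   run;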
• 99. Interactions significance.
Significance of the overall effects may be tested by the usual procedures for testing linear combinations of regression coefficients (see the earlier slides on testing linear combinations of parameters). However, the results must be interpreted with care, because overall effects are not structural parameters but are design-dependent. The structural parameters (the regression coefficients, uncentered or with rational centering, and the error variance) may be expected to remain invariant under changes in the distribution of the predictors, but the overall effects will generally change. Overall effects are specific to the particular sample and should not be expected to carry over to other samples with different distributions of the predictors. If an overall effect is significant in one study and not in another, it may reflect nothing more than a difference in the distribution of the predictors; in particular, it should not be taken as evidence that the relation of the d.v. to the predictors differs between the two groups.
• 101. Added product interactions to the Fraud data set:
doc_membd = doctor_visits * member_duration
doc_numm = doctor_visits * num_members
doc_membd_numm = doctor_visits * member_duration * num_members
• 102. [Bar chart] Rescaled coefficients (skipping FRAUD) for all models, with narrower bars for overall effects. Variables: doc, doc_membd, doc_membd_numm, doc_numm, membd, membd_numm, no_claims, numm, optom_presc; models: M1_2_way and M2_3_way, full and skip_member_d versions. Notice that the doctor_visits overall effect is smaller than the corresponding main effect, dependent on the 2- and 3-way interactions.
• 103. [Bar chart] P-values (skipping FRAUD) for all models, cut at 0.1, with the 0.05 line marked; narrower bars for overall effects. Changing significance levels dependent on the model.
• 104. R-square values for the models: M1_2_way_full_model 0.0292, M1_2_way_skip_member_d 0.0254, M2_3_way_full_model 0.0299, M2_3_way_skip_member_d 0.0293.
• 105. Rescaled PRESS values for the models (smaller is better): M1_2_way_full_model 0.1001, M1_2_way_skip_member_d 1, M2_3_way_full_model 0, M2_3_way_skip_member_d 0.1013.
• 107. Crass but still used approaches.
1) Fit the equation with all variables and keep those whose coefficients are significant at 25% or so.
2) Re-fit the equation with the remaining variables.
3) Alternative to 1) if the number of variables is too large: find all zero-order correlations and choose the top K (dependent on computing resources, or time …).
4) Redo 1) with the K variables just found. (A sketch of the screening in 3) follows.)
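A minimal sketch of the screening step, with hypothetical names (data set D, target Y, predictors x1-x500); picking K is left to the analyst:

   proc corr data = d noprint
             outp = zc (where = (_type_ = "CORR"));
      var y;
      with x1-x500;                /* one output row per predictor */
   run;

   data zc;
      set zc;
      abs_r = abs(y);              /* rank by |zero-order correlation| */
   run;

   proc sort data = zc;
      by descending abs_r;         /* top K rows = screened variables */
   run;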
• 108. Crass but still used approaches (2 of 2). (Appeared on the web, sci.stat.math group, 11/27/06:)
- I heard / read (on the Internet) that the number of variables to put as explanatory characters must not be greater than n/10 or n/20 (n is the sample size). Can someone provide me with a serious bibliographic reference about this?
- Second, I've got a first set of about 15 variables (for around 150 patients), interactions excluded. I categorized some into binary characters to yield odds-ratios. What is the risk? Lack of power?
- Third, performing a second selection via the stepwise algorithm, is there a consensus about the significance cut-off (alpha) to use? I read 20% instead of 5%. Is this usual?
- Regarding SAS programming (I run version 8.2 under Windows OS), which procedure, between Proc Logistic and Proc GLM, should I choose, and according to which criteria?
• 109. Variable selection practice. Mostly two approaches, plus combinations:
1) Sequential inferential, such as the stepwise family: p-values play the critical role in the entry/dropping/stopping mechanism; the variable selection path is given by partial correlations for the forward and stepwise methods.
2) Stopping mechanisms, e.g., AIC, BIC: may search variables inferentially but typically searches over wider subsets. These mechanisms are used in both frequentist and Bayesian statistics; Shtatland et al. (2000) apply BIC and AIC in a Bayesian application of variable selection.
3) Some combination thereof.
• 110. Forward Selection (FS; Miller (2002)).
Let Y(n,1) = X(n,p)β(p,1) + ε, the usual model, n > p. FS minimizes over i = 1 … p the error sum of squares (Y − Xiβi)'(Y − Xiβi), where β̂i = Xi'Y / Xi'Xi (the OLS estimate). Expanding and substituting, the selected variable maximizes (Xi'Y)² / Xi'Xi, which when divided by Y'Y becomes cos²(Xi, Y) ➔ choose the Xi with the smallest angle with Y (the most co-linear, or most correlated).
[Diagram: Y projected on Xi; Y − Xiβ̂i is orthogonal to Xi.]
• 111. Forward Selection (continued).
Assume X(1) was the variable with the largest absolute correlation. What next? For all other variables, calculate X_{j,(1)} = Xj − B_{j,(1)}X(1), where B_{j,(1)} is the OLS coefficient of the regression of Xj on X(1). Replace Y with its residual on X(1) as well, and Xj with X_{j,(1)} for all j, and proceed as in the previous slide: Y and the remaining X variables are orthogonalized to X(1). If the variables are centered, the second variable chosen has the largest absolute partial correlation with Y given X(1), etc.
The method minimizes RSS (= ESS) at every step ➔ maximizes R² at every step. It stops when the decrease in RSS is not significant at the specified level (e.g., 0.5):
F = (RSS_k − RSS_{k+1}) / [RSS_{k+1} / (n − k − 2)], compared to the "F-to-enter" value. An entered variable always remains in the model.
• 112. Forward Selection: summary of steps.
1. Find the X variable most correlated with Y, say X2.
2. Regress Y, X1, X3, …, Xp on X2.
3. Replace Y, X1, X3, …, Xp by the corresponding residuals of their regressions on X2 (Y − aX2, X1 − bX2, …, Xp − cX2).
4. Find the transformed X most correlated with the transformed Y; these are the partial correlations.
5. Repeat the process until the change in ESS is not significant. (A sketch of one step follows.)
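A sketch of one residualization step with hypothetical names (Y, X1, X3 remaining; X2 already selected), showing that the next pick is the largest partial correlation:

   proc reg data = d noprint;
      model y x1 x3 = x2;                  /* regress Y and remaining Xs on X2 */
      output out = step1 r = ry rx1 rx3;   /* residual names, one per dep var  */
   run;

   proc corr data = step1;
      var ry;
      with rx1 rx3;      /* largest |correlation| = next variable to enter */
   run;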
• 113. [Diagram: starting axis system with Y, X, Z.] After X is selected and orthogonalized away from Z and Y, in the next step the actual variables are Z − bX and Y − aX, where "a" and "b" are OLS coefficients. The first three variables (Y, X, Z) play no further role.
• 114. Do zero-order correlations give you information as to which variables will appear in the final model? Hardly. The selection path is given by partial correlations, and since the path depends on the variables already selected, it is illusory to enumerate all potential paths for the typically large number of variables in giga-sized databases.
Indirect remark: model and variable interpretation is typically conceptualized from zero-order correlations. But models involve partial (and semi-partial), i.e., conditional, relationships ➔ model interpretation is ALWAYS possible; it is just far more difficult when the number of variables is large and the variables are related. The semi-partial correlation is the correlation between the ORIGINAL Y and the partialled Xs. Partial and semi-partial R²s and coefficients are the counterparts of the traditional R² concepts.
• 115. Nomenclature.
Zero-order correlations: the original correlations.
First order: first partial. Second order: second partial. (First-order semi-partial, etc.)
Big note: zero-order correlations are always unconditional; all others are conditional on the sequence of orthogonalized predictors.
• 116. Stepwise variable selection (Efroymson (1960)).
1. Select the variable most correlated with Y, say Z1, find the linear equation, and test for significance against F-to-enter (the F distribution is related to the chi-square).
2. If not significant, stop. Else, examine the partial correlations (or semi-partial, depending on the software) with Y of all Z's not in the regression.
3. Choose the largest (say Z2) and regress Y on Z1 and Z2. Check overall significance and R² improvement, and obtain partial F-values for both variables (F-values = squared t-values).
4. The lowest F-value is compared to an F-threshold (F-to-delete), and the variable is kept or rejected. Stop when no more removals or additions are possible.
5. General form of the F test for adding/deleting X to a model already containing Z:
F = [SSE(reduced) − SSE(full)] / (# vars differing) ÷ MSE(full)
  = [SSE(Z) − SSE(Z, X)] / 1 ÷ [SSE(Z, X) / (n − p − 1)].
(An illustrative PROC REG call follows.)
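An illustrative call on the Fraud data; the SLENTRY/SLSTAY thresholds shown are PROC REG's customary stepwise defaults made explicit, not settings stated in the deck:

   proc reg data = fraud.fraud;
      model total_spend = num_members doctor_visits fraud
                          member_duration no_claims optom_presc
            / selection = stepwise
              slentry = 0.15 slstay = 0.15;  /* F-to-enter / F-to-delete cutoffs */
   run;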
• 117. Stepwise family considerations.
1. In all stepwise selections, true p-values are larger than reported (Rencher and Pun, 1980; Freedman, 1983) and do not have their proper meaning; correcting them is a very difficult problem.
2. If true variables are deleted, the model is biased (omission bias); if redundant variables are kept, variable selection has not been attained. No insurance about either. All-possible-regressions tends to produce "small" models but is impossible (at present) for large p, because there are 2^p possible regressions (p = 10 ➔ 1,024; p = 20 ➔ 1,048,576; p = 30 ➔ more than 1B).
3. Selection bias occurs when variable selection is not done independently of coefficient estimation (symptomatic of tree regression).
4. Further, important subsets of variables may be omitted. E.g., if the true relationship is given by a "distance" between X and Y, or any linear combination thereof, there is no assurance that this subset will be found. In general, these methods cannot find transformations / nonlinearities.
• 118. Stepwise family considerations (continued).
5. The methods are discrete: variables are either retained or discarded; they often exhibit high variance and do not reduce the prediction error of the full model (as Tibshirani likes to say).
6. Stepwise yields models with upwardly biased R²s; the model needs re-evaluation on independent data.
7. Severe problems with co-linearity, though this is debatable.
8. Gives biased regression coefficients that need shrinkage (coefficients are too large; Tibshirani, 1996).
9. Based on methods (F tests for nested models) that were intended for testing pre-specified hypotheses.
• 119. Stepwise family considerations (continued).
10. Increasing the sample size does not improve selection by much (Derksen and Keselman, 1992).
11. Induces the comfort of not thinking about the problem at hand; but thinkers can be clueless or contradictory too. Should be considered an exploratory tool.
12. Stepwise alternatives: 1. replacing 1, 2, 3, … variables at a time; 2. branch-and-bound techniques; 3. sequential subsets; 4. ridge regression; 5. nonnegative garrote, Lasso, LARS, …; 6. Foster/Stine (see Auslender 2005); 7. stepwise starting from the full model.
• 120. Stepwise family considerations (continued).
13. Freedman (1983) showed by simulation, for n/p not very large, that stepwise methods select about 15% of noise variables. Even if all the variables are noise, a high R² can be obtained, and if seemingly insignificant variables are dropped, the R² will still remain high. ➔ BE VERY CAREFUL ABOUT JUMPING, in joy or despair, ON SIGNIFICANCE FINDINGS IN THE CONTEXT OF VARIABLE SELECTION.
• 122. Tested models. Too many?
Requested Analyses: Names & Descriptions
Model # | Model Name | Model Description
-1 | M1 | Multivariate regression TOTAL_SPEND
1 | M1_TRN_NONE_GT_2 | Regr TRN NONE abs(rstud) >= 2
2 | M1_TRN_NONE_GT_2_NO_HETERO | Regr TRN WLS abs(rstud) >= 2
3 | M1_TRN_NONE_LT_2 | Regr TRN NONE abs(rstud) < 2
4 | M1_TRN_NONE_LT_2_NO_HETERO | Regr TRN WLS abs(rstud) < 2
5 | M1_TRN_REGR_NONE | Regr TRN NONE
6 | M1_TRN_REGR_NONE_NO_HETERO | Regr TRN WLS
7 | M1_TRN_REGR_STPW | Regr TRN STEPWISE
8 | M1_TRN_REGR_STPW_NO_HETERO | Regr TRN WLS
9 | M1_TRN_STPW_GT_2 | Regr TRN STEPWISE abs(rstud) >= 2
10 | M1_TRN_STPW_GT_2_NO_HETERO | Regr TRN WLS abs(rstud) >= 2
11 | M1_TRN_STPW_LT_2 | Regr TRN STEPWISE abs(rstud) < 2
12 | M1_TRN_STPW_LT_2_NO_HETERO | Regr TRN WLS abs(rstud) < 2
13 | M2 | Log total spend
14 | M2_TRN_NONE_GT_2 | Regr TRN NONE abs(rstud) >= 2
15 | M2_TRN_NONE_GT_2_NO_HETERO | Regr TRN WLS abs(rstud) >= 2
16 | M2_TRN_NONE_LT_2 | Regr TRN NONE abs(rstud) < 2
17 | M2_TRN_NONE_LT_2_NO_HETERO | Regr TRN WLS abs(rstud) < 2
18 | M2_TRN_REGR_NONE | Regr TRN NONE
19 | M2_TRN_REGR_NONE_NO_HETERO | Regr TRN WLS
20 | M2_TRN_REGR_STPW | Regr TRN STEPWISE
21 | M2_TRN_REGR_STPW_NO_HETERO | Regr TRN WLS
22 | M2_TRN_STPW_GT_2 | Regr TRN STEPWISE abs(rstud) >= 2
23 | M2_TRN_STPW_GT_2_NO_HETERO | Regr TRN WLS abs(rstud) >= 2
24 | M2_TRN_STPW_LT_2 | Regr TRN STEPWISE abs(rstud) < 2
25 | M2_TRN_STPW_LT_2_NO_HETERO | Regr TRN WLS abs(rstud) < 2
• 123. M2: model in logs. M1: model in original units.
• 128. Breusch-Pagan Test for Heteroskedasticity
Model Name | Chi-square | DF | p-value | Num Obs | Hetero?
M1_TRN_REGR_NONE | 42.50672 | 5 | 0.000 | 5,960 | YES
M1_TRN_NONE_LT_2 | 49.405089 | 5 | 0.000 | 5,703 | YES
M1_TRN_NONE_GT_2 | 26.637793 | 5 | 0.000 | 257 | YES
M1_TRN_REGR_STPW | 40.96904 | 4 | 0.000 | 5,960 | YES
M1_TRN_STPW_LT_2 | 48.606669 | 4 | 0.000 | 5,703 | YES
M1_TRN_STPW_GT_2 | 25.054416 | 3 | 0.000 | 257 | YES
M2_TRN_REGR_NONE | 40.99884 | 5 | 0.000 | 5,960 | YES
M2_TRN_NONE_LT_2 | 15.9612 | 5 | 0.007 | 5,660 | YES
M2_TRN_NONE_GT_2 | 19.9851 | 5 | 0.001 | 300 | YES
M2_TRN_REGR_STPW | 39.0082 | 4 | 0.000 | 5,960 | YES
M2_TRN_STPW_LT_2 | 14.27328 | 4 | 0.006 | 5,664 | YES
M2_TRN_STPW_GT_2 | 3.991856 | 1 | 0.046 | 296 | YES
• 129. VIFs omitted; none above 2.
• 135. Models and ranks by requested GOFs (_PRESS_, _RMSE_, _RSQ_); ranks agree across the three measures within each family:
Model Name | PRESS rank | RMSE rank | RSQ rank
M1_NONE_GT_2 | 1 | 1 | 1
M1_NONE_GT_2_NO_HETERO | 2 | 2 | 2
M1_NONE_LT_2 | 3 | 3 | 3
M1_NONE_LT_2_NO_HETERO | 4 | 4 | 4
M1_REGR_NONE | 5 | 5 | 5
M1_REGR_NONE_NO_HETERO | 6 | 6 | 6
M1_REGR_STPW | 7 | 7 | 7
M1_REGR_STPW_NO_HETERO | 8 | 8 | 8
M1_STPW_GT_2 | 9 | 9 | 9
M1_STPW_GT_2_NO_HETERO | 10 | 10 | 10
M1_STPW_LT_2 | 11 | 11 | 11
M1_STPW_LT_2_NO_HETERO | 12 | 12 | 12
M2_NONE_GT_2 | 1 | 1 | 1
M2_NONE_GT_2_NO_HETERO | 2 | 2 | 2
M2_NONE_LT_2 | 3 | 3 | 3
M2_NONE_LT_2_NO_HETERO | 4 | 4 | 4
M2_REGR_NONE | 5 | 5 | 5
M2_REGR_NONE_NO_HETERO | 6 | 6 | 6
M2_REGR_STPW | 7 | 7 | 7
• 136. Models and ranks by requested GOFs, removing LT_2 and GT_2 and adding other GOFs ➔ not easy to pick a winner:
Model Name | AIC | CP | PRESS | RMSE | RSQADJ | RSQ | SBC (ranks)
M1_REGR_NONE | 4 | 4 | 4 | 4 | 4 | 3 | 4
M1_REGR_NONE_NO_HETERO | 2 | 3 | 2 | 1 | 3 | 2 | 2
M1_REGR_STPW | 3 | 1 | 3 | 3 | 2 | 4 | 3
M1_REGR_STPW_NO_HETERO | 1 | 2 | 1 | 2 | 1 | 1 | 1
M2_REGR_BCKW | 3 | 2 | 3 | 4 | 4 | 4 | 3
M2_REGR_BCKW_NO_HETERO | 1 | 1 | 1 | 1 | 2 | 2 | 1
M2_REGR_NONE | 4 | 3 | 4 | 3 | 3 | 3 | 4
M2_REGR_NONE_NO_HETERO | 2 | 4 | 2 | 2 | 1 | 1 | 2
• 137. Some observations.
1) The log-transformed dep var is bimodal, and the raw-units dep var is certainly not normal. If the transform aims at normality ➔ a mixture of two normals in this case.
2) P-values are quite disparate when comparing LT_2 and GT_2 to the non-outlier models. Still, it is possible to see some regularity, such as the significance of number of visits and claims.
3) Co-linearity is not an issue if 10 is the critical VIF value.
4) By RMSE, NONE_LT_2_NO_HETERO seems the best performer; PRESS is more inconclusive. In the comparative ranks table, the different criteria offer different rankings.
5) Alternative transformations may enhance the model, as may transforming predictors and searching for interactions; this is especially true since all non-hetero-corrected models showed the hetero disease.
6) Non-regression models, such as trees or neural networks, could also be used and compared in terms of fit, prediction and possible interpretability.
7) The data set has no missing values: unreal in the real world.
8) Very poor model.
9) The model contains NO INTERACTIONS or predictor TRANSFORMATIONS.
• 138. Modeler's attitude.
Statistical inference (null vs. alternative hypotheses) fosters the attitude that the model is homoskedastic, has no interactions, etc. Be bold: assume the following and aim at disproving them:
Interactions are present.
The model is heteroskedastic.
Transformations are useful.
Outliers are present.
Co-linearity is likely.
…
• 140. Omitted Variables Bias.
Assume the true model is \(Y = X_1\beta_1 + X_2\beta_2 + \varepsilon\), where \(X_1\) is a set of variables (with intercept) but \(X_2\) is incorrectly omitted; the correct model should have both sets. Fitting Y on \(X_1\) alone by OLS:
\[
\hat\beta_1 = (X_1'X_1)^{-1}X_1'Y
            = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + X_2\beta_2 + \varepsilon)
            = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\,\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon.
\]
The bias of the estimator is therefore
\[
\operatorname{Bias}(\hat\beta_1 \mid X_1, X_2) = E(\hat\beta_1) - \beta_1
  = (X_1'X_1)^{-1}X_1'X_2\,\beta_2.
\]
For single variables \(x_1\) (entered) and \(x_2\) (omitted), the bias of the slope is
\[
\operatorname{Bias}(\hat b_1 \mid x_1, x_2) = \beta_2\,\frac{s_{1,2}}{s_1^2}
  = \beta_2\,\rho_{1,2}\,\frac{s_2}{s_1}.
\]
If \(\rho_{1,2} \neq 0\), there is bias in \(\hat b_0\) and \(\hat b_1\); if \(\rho_{1,2} = 0\), the bias is only in \(\hat b_0\). A simulation sketch follows.
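A simulation sketch of the result (all names and parameter values are illustrative, not from the deck):

   data ovb;
      call streaminit(42);
      do i = 1 to 10000;
         x2 = rand("normal");                   /* the omitted variable */
         x1 = 0.6*x2 + rand("normal");          /* correlated with x2   */
         y  = 1 + 2*x1 + 3*x2 + rand("normal");
         output;
      end;
   run;

   proc reg data = ovb;
      full:    model y = x1 x2;   /* recovers beta1 = 2                     */
      omitted: model y = x1;      /* x1 slope shifts by beta2 * s12 / s1**2 */
   run;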
• 141. Un-reviewed topics of importance: Time Series; Spatial Regression; Longitudinal Data Analysis; Simultaneous Equation Models; Mixtures; …
• 143. SAS Programs: original linear regression, influence statistics, variable selection.
proc reg data = fraud.fraud plots (maxpoints = none)
     outest = _estim_ (where = (_type_ = "PARMS")) tableout
     press                 /* outputs _PRESS_ in the OUTEST data set */
     covout outvif rsquare adjrsq aic bic cp mse sse edf
     clb;                  /* clb: confidence limits for betas */
   M1_TRN_NONE: model total_spend = num_members doctor_visits fraud
                      member_duration no_claims optom_presc
      / selection = none dw r vif mse jp cp aic bic sbc adjrsq spec
        influence;         /* influence produces all influence statistics */
   output out = outp (keep = p_total_spend total_spend r_p_total_spend
                             student rstudent
                             lower_p_total_spend upper_p_total_spend)
      p        = p_total_spend
      r        = r_p_total_spend
      student  = student
      rstudent = rstudent  /* needed for the data split on the next slides */
      lcl      = lower_p_total_spend
      ucl      = upper_p_total_spend;
   title2 "First Regr depvar total_spend";
run;
• 145. Using the studentized residual to obtain two data sets:
data lessthan morethan;
   merge fraud.fraud
         outp (keep = rstudent);   /* one-to-one merge, no BY needed */
   if abs(rstudent) < 2 then output lessthan;
   else output morethan;
   drop rstudent;
run;
• 147. Breusch-Pagan test via PROC MODEL:
ods output heterotest = hetero_test;   /* requires SAS/ETS */
proc model data = dataset;
   parms beta0 beta1 ... betap;
   depvar = beta0 + beta1*var1 + beta2*var2 ...;
   fit depvar / breusch = (1 var1 var2 ...);   /* Breusch-Pagan; "1" is the intercept */
run;
• 150. Graphing confidence intervals of fitted values and predictions:
proc sgplot data = ... ;
   reg x = xvar y = yvar / cli;   /* cli: prediction limits; clm: limits for fitted values */
run;
For variable selection using proc reg:
proc reg ... ;
   model Y = ... / selection = forward;
run;
Options: none, forward, backward, stepwise, etc.
• 152. 1) You fit a multiple regression to examine the effect of a particular variable that a worker in another department is interested in. The variable comes back insignificant, but your co-worker says this is impossible, as it is known to have an effect. What would you say/do?
2) You have 1,000 variables and 100 observations, and you would like to find the significant variables for a particular response. What would you do? (This happens in genomics, for instance.)
3) What is your plan for dealing with outliers (if you have a plan)? How about missing values? How about transformations?
4) How do you prevent over-fitting when creating a statistical model (i.e., a model that does not fit other samples well)?
5) Define/explain what forecasting is, as opposed to prediction, if possible.
• 153. 1) We have a regression problem where the response is a count variable. Which would you choose in this context: ordinary least squares or Poisson regression (or maybe some other)? Explain your choice; what are the main differences among these models?
2) Describe strategies to create a regression model with a very large number of predictors and not a lot of time.
3) Explain intuitively why a covariance matrix is positive (semi-) definite, and what that means. How can that fact be used?
4) Explain the concept of interaction effects in regression models. Specifically, can an interaction be significant while the corresponding main effects are not? Is there some difference in the interpretation of interactions between OLS and logistic regression (future chapter)?
5) Is there a problem if the regression residuals are not normal?
6) When would you transform a dependent variable? An independent one?
• 154. (From The Analysis Factor, Nov. 2018, theanalysisfactor.com.)
1. When you add an interaction to a regression model, you can still evaluate the main effects of the terms that make up the interaction, just like in ANOVA.
2. The intercept is usually meaningless in a regression model.
3. In analysis of covariance, the covariate is a nuisance variable, and the real point of the analysis is to evaluate the means after controlling for the covariate.
4. Standardized regression coefficients are meaningful for dummy-coded predictors.
5. The only way to evaluate an interaction between two independent variables is to categorize one or both of them.
All of these are False.
• 155. References
Box, G. (1976): Science and Statistics, Journal of the American Statistical Association, 71: 791–799.
Gladwell, M. (2008): Outliers: The Story of Success, Little, Brown and Company.
Horst, P. (1941): The prediction of personnel adjustment, Social Science Research Council Bulletin, 48: 431–436.
Spiess, A., Neumeyer, N. (2010): An evaluation of R2 as an inadequate measure for nonlinear models in pharmacological and biochemical research: a Monte Carlo approach, BMC Pharmacology, 10: 6.