In this presentation, we analyze multicollinearity and inference with interaction terms in regression analysis. We also cover partial correlation and procedures for interpreting statistical data.
1. Regression Analysis
Dr. S. Shivendu
2. Objectives
01 Analyze multicollinearity and inference with interaction terms in regression analysis.
02 Analyze partial correlation and procedures for interpreting statistical data.
03 Conduct appropriate model selection based on statistical data.
3. Agenda
Regression Analysis
Regression Diagnostics and Advanced Regression topics: Multicollinearity, Interaction, Partial Regression (concepts)
Model Selection: concepts and decision-making
Working with data: SAS procedures
4. Multiple Regression Analysis
Method for studying the relationship between a dependent
variable and two or more independent variables.
Purposes: prediction, explanation, theory building.
5. Assumptions
Independence
The scores of any subject are
independent of the scores of
all other subjects
Homoscedasticity
In the population, the variance of the dependent variable Y is equal at each level of the X variables.
Normality
In the population, the scores
on the dependent variable are
normally distributed
Linearity
The relation between the
dependent and independent
variables is linear when all the
others are held constant.
6. Regression: Simple vs. Multiple
Simple Regression
One dependent variable Y predicted from one independent variable X
One regression coefficient
r2: proportion of variation in the dependent variable Y predictable from X
Multiple Regression
One dependent variable Y predicted from a set of independent variables (X1, X2, …, Xk)
One regression coefficient for each independent variable
R2: proportion of variation in the dependent variable Y predictable from the set of independent variables (X's)
7. Differences: R vs. R2
Multiple Correlation Coefficient (R)
R = the magnitude of the relationship between the dependent variable and the best linear combination of the predictor variables
Coefficient of Multiple Determination (R2)
R2 = the proportion of variation in Y accounted for by the set of independent variables (X's)
8. Self Concept and Academic Achievement (N = 103)
9. The Model
Y' = a + b1X1 + b2X2 + … + bkXk
The b's are called partial regression coefficients
Our example, predicting AA:
Y' = 36.83 + (3.52)XASC + (-.44)XGSC
Predicted AA for a person with GSC of 4 and ASC of 6:
Y' = 36.83 + (3.52)(6) + (-.44)(4) = 56.23
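A minimal SAS sketch of estimating this model, assuming a dataset named study with variables AA, ASC, and GSC (hypothetical names; the slides do not show the code):

proc reg data=study;
   model AA = ASC GSC;         * partial regression coefficients b for ASC and GSC;
   output out=preds p=AA_hat;  * save the predicted value Y' for each person;
run;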
10. Variation: How Much?
Total variation in Y = predictable variation (by the combination of independent variables) + unpredictable variation
11. Proportion of Predictable and Unpredictable Variation
Where: Y = AA, X1 = ASC, X2 = GSC
R2 = predictable (explained) variation in Y
(1 - R2) = unpredictable (unexplained) variation in Y
[Venn diagram of shared variance among Y, X1, and X2]
12. Various Significance Tests
Testing R2
Test R2 through an F test
Test of competing models (difference between R2)
through an F test of difference of R2s
Testing b
Test each partial regression coefficient (b) by t-tests
Comparison of partial regression coefficients with each other: t-test of the difference between standardized partial regression coefficients (β's)
13. Testing R2
Example
What proportion of variation in AA can be
predicted from GSC and ASC?
Compute R2: R2 = .16 (R = .41) : 16%
of the variance in AA can be
accounted for by the composite of
GSC and ASC
Is R2 significantly different from 0?
F test: Fobserved = 9.52, Fcrit(.05; 2, 100) = 3.09
Reject H0: in the population there is
a significant relationship between AA
and the linear composite of GSC and
ASC
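The observed F can be reproduced from R2 itself. With n = 103 cases and k = 2 predictors:

F = (R2 / k) / ((1 - R2) / (n - k - 1)) = (.16 / 2) / (.84 / 100) = 9.52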
14. Comparing Models - Testing R2
Example
Comparing models
Model 1: Y' = 35.37 + (3.38)XASC
Model 2: Y' = 36.83 + (3.52)XASC + (-.44)XGSC
Compute R2 for each model: Model 1: R2 = r2 = .160; Model 2: R2 = .161
Test the difference between R2s: Fobs = .119, Fcrit(.05; 1, 100) = 3.94
Conclude that GSC does not add significantly to ASC in predicting AA
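The same result follows from the incremental-F formula, with q = 1 variable added and n - k - 1 = 100:

F = ((.161 - .160) / 1) / ((1 - .161) / 100) = .119

In SAS, a TEST statement on the full model gives an equivalent test (the dataset name study is an assumption):

proc reg data=study;
   model AA = ASC GSC;
   test GSC = 0;   * F test that GSC adds nothing beyond ASC;
run;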
15. Residual Analysis
The residual for observation i, ei, is the difference between its observed and
predicted value
Check the assumptions of regression by examining the residuals
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X (homoscedasticity)
ei = Yi - Ŷi
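A sketch of obtaining and plotting the residuals in SAS, again assuming the hypothetical study dataset:

proc reg data=study;
   model AA = ASC GSC;
   output out=resids p=pred r=resid;  * predicted values and residuals e_i;
run;

proc sgplot data=resids;
   scatter x=pred y=resid;   * look for curvature, trends, or funnel shapes;
   refline 0 / axis=y;
run;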
16. Residual Analysis for Linearity
[Paired plots of Y vs. x and residuals vs. x: a curved residual pattern indicates a non-linear relation; a patternless band around zero indicates linearity]
17. Residual Analysis for Independence
[Plots of residuals vs. X: a systematic pattern indicates non-independence; a random scatter indicates independence]
18. Check for Normality
Examine the Stem-and-Leaf Display of the Residuals
Examine the Boxplot of the Residuals
Examine the Histogram of the Residuals
Construct a Normal Probability Plot of the Residuals
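A minimal sketch of these checks in SAS, assuming the resids output dataset from the residual-analysis sketch above:

proc univariate data=resids normal plot;   * NORMAL adds normality tests; PLOT adds stem-and-leaf and boxplot;
   var resid;
   histogram resid / normal;                 * histogram with a normal curve overlay;
   qqplot resid / normal(mu=est sigma=est);  * normal probability plot;
run;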
19. Residual Analysis for Normality
When using a normal probability plot (percent vs. residual), normal errors will display approximately in a straight line.
[Normal probability plot: percent (0-100) against residual (-3 to 3)]
20. Residual Analysis for Equal Variance
[Plots of residuals vs. x: a funnel shape (spread changing with x) indicates non-constant variance; an even band indicates constant variance]
21. Multicollinearity
Collinear = highly correlated
Multicollinearity = inclusion of highly correlated
independent variables in a single regression model
High correlation of X variables causes problems for
estimation of slopes (b’s)
The denominators in the slope-estimate formulas approach zero, so coefficients may be wrong or implausibly large.
22. Multicollinearity Symptoms
Unusually large standard errors and betas
Betas often exceed 1.0
Two variables have the same large effect when included separately (i.e., when both collinear variables aren't in the model together)
When both are put in together, the effects of the two variables shrink, or one remains positive and the other flips sign
23. Multicollinearity
• What does multicollinearity do to models?
• Note: It does not violate regression assumptions
• But it can mess things up anyway
• Multicollinearity can inflate standard error estimates
• Large standard errors = small t-values = no rejected null hypotheses
• Note: Only collinear variables are affected. The rest of the model results are
OK.
• It leads to instability of coefficient estimates
• Variable coefficients may fluctuate wildly when a collinear variable is added
• These fluctuations may not be “real”, but may just reflect amplification of
“noise” and “error”
• One variable may only be slightly better at predicting Y… but SPSS will
give it a MUCH higher coefficient.
24. Multicollinearity
Look at correlations of all
independent vars
Correlation >.8 is a
concern
Watch out for the “symptoms”
Problems aren't always bivariate… and don't show up in bivariate correlations
Compute diagnostic statistics
VIF (Variance Inflation
Factor).
25. Multicollinearity
If you have 3 independent variables X1, X2, X3, and a variable (X1) is highly correlated with all the others (X2, X3), then they will do a good job of predicting it in a regression.
Tolerance is based on doing that regression: X1 is dependent; X2 and X3 are independent. Tolerance for X1 is simply 1 minus the regression R-square.
The regression R-square will be high… so 1 minus R-square will be low… indicating a problem.
26. Multicollinearity
Variance Inflation Factor (VIF) is the reciprocal of
tolerance: 1/tolerance
High VIF indicates multicollinearity
Gives an indication of how much the Standard Error of a
variable grows due to the presence of other variables.
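In SAS, tolerance and VIF are available as MODEL-statement options in PROC REG (a sketch with the hypothetical study dataset; a common rule of thumb flags VIF above 10, i.e., tolerance below .10):

proc reg data=study;
   model AA = ASC GSC / tol vif;   * prints tolerance and VIF for each predictor;
run;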
27. Multicollinearity
Solutions to multicollinearity can be difficult if a fully specified model requires several collinear variables. Options:
Drop unnecessary variables
If two collinear variables are really measuring the same thing, drop one or make an index
Use advanced techniques such as ridge regression, which uses a more efficient estimator (but not BLUE; it may introduce bias)
28. Dummy Variables
How can we incorporate nominal variables (e.g., race,
gender) into regression?
Option 1: Analyze each sub-group separately. Generates different
slopes, constant for each group
Option 2: Dummy variables, a dichotomous variable coded to indicate
the presence or absence of something. Absence coded as zero, presence
coded as 1.
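A minimal SAS sketch of Option 2, assuming a hypothetical dataset gss with a character variable sex coded 'F'/'M':

data gss2;
   set gss;
   dfemale = (sex = 'F');   * logical comparison returns 1 (present) or 0 (absent);
run;

proc reg data=gss2;
   model happy = income dfemale;   * the dummy shifts the constant, not the slope;
run;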
29. Dummy Variables: Interpretation
[Scatter plot of HAPPY (0-10) vs. INCOME (0-100000); women = blue, men = red; one line per group plus the overall slope for all data points]
Note: The lines for men and women have the same slope, but one is higher and the other is lower. The constant differs!
If women = 1, men = 0: the constant (a) reflects men only. The dummy coefficient (b) reflects an increase for women (relative to men).
30. Dummy Variables
Dummy coefficients shouldn’t be called slopes
Referring to the “slope” of gender doesn’t make sense
Rather, it is the difference in the constant (or “level”)
The contrast is always with the nominal category that was left out of the equation
If DFEMALE is included, the contrast is with males
If DBLACK, DOTHER are included, coefficients reflect difference in constant compared to whites.
31. Interaction Terms
Example: income and happiness. What if you suspect that a variable has a totally different slope for two different sub-groups in your data?
Perhaps men are more materialistic: an extra dollar increases their happiness a lot
If women are less materialistic, each dollar has a smaller effect on happiness (compared to men)
The issue isn't that men are "more" or "less" happy than women; rather, the slope of a variable (income) differs across groups
32. Interaction Terms
[Scatter plot of HAPPY (0-10) vs. INCOME (0-100000); women = blue, men = red; group lines plus the overall slope for all data points]
Note: Here, the slope for men and women differs.
The effect of income on happiness (X1 on Y) varies with gender (X2). This is called an "interaction effect".
33. Interaction Terms
Examples of interaction:
Effect of education on income may interact with type of school attended (public vs.
private)
Private schooling has bigger effect on income
Effect of aspirations on educational attainment interacts with poverty
Aspirations matter less if you don’t have money to pay for college
Question: Can you think of examples of two variables that might interact?
From your final project? Or anything else?
34. Interaction Terms
Interaction effects: Differences in the relationship (slope) between two
variables for each category of a third variable
Option #1: Analyze each group separately
Look for different sized slope in each group
Option #2: Multiply the two variables of interest: (DFEMALE, INCOME)
to create a new variable
Called: DFEMALE*INCOME
Add that variable to the multiple regression model.
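PROC REG does not accept product terms in the MODEL statement, so the interaction is built in a DATA step first (a sketch using the hypothetical gss2 dataset from the dummy-variable example above):

data gss3;
   set gss2;
   dfem_inc = dfemale * income;   * the DFEMALE*INCOME interaction term;
run;

proc reg data=gss3;
   model happy = income dfemale dfem_inc;   * components plus interaction;
run;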
35. Interaction Terms
Consider the following regression equation:
Yi = a + b1(INCOME)i + b2(DFEM*INC)i + ei
Question: What if the case is male?
Answer: DFEMALE is 0, so b2(DFEM*INC) drops out of the
equation
Result: Males are modeled using the ordinary regression
equation: a + b1X + e.
36. Interaction Terms
Consider the same regression equation:
Yi = a + b1(INCOME)i + b2(DFEM*INC)i + ei
Question: What if the case is female?
Answer: DFEMALE is 1, so b2(DFEM*INC) becomes b2*INCOME, which is added to b1
Result: Females are modeled using a different regression line: a + (b1+b2) X + e
Thus, the coefficient b2 reflects the difference in the slope of INCOME for women.
37. Interpreting Interaction Terms
• Interpreting interaction terms:
• A positive b for DFEMALE*INCOME indicates the slope for income is
higher for women vs. men
• A negative effect indicates the slope is lower
• Size of coefficient indicates actual difference in slope
• Example: DFEMALE*INCOME. Observed b’s:
• Income: b = .5
• DFEMALE * INCOME: b = -.2
• Interpretation: Slope is .5 for men, .3 for women.
38. Interaction Terms
Two continuous variables can also interact
Example: Effect of education and income on happiness
Perhaps highly educated people are less materialistic
As education increases, the slope between income and happiness would decrease
Simply multiply Education and Income to create the interaction term “EDUCATION*INCOME”
And add it to the model.
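The continuous-by-continuous case is built the same way (a sketch; educ and income are hypothetical variable names):

data gss4;
   set gss;
   educ_inc = educ * income;   * the EDUCATION*INCOME interaction term;
run;

proc reg data=gss4;
   model happy = educ income educ_inc;
run;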
39. Interpreting Interaction Terms
How do you interpret continuous variable interactions?
Example: EDUCATION*INCOME: Coefficient = 2.0
Answer: For each unit change in education, the slope of income vs. happiness increases by 2
Note: coefficient is symmetrical: For each unit change in income, education slope
increases by 2
Dummy interactions effectively estimate 2 slopes: one for each group
Continuous interactions result in many slopes: each value of education*income yields a different slope.
40. Dummy Interactions
It is also possible to construct interaction terms based on two dummy variables
Instead of a “slope” interaction, dummy interactions show difference in constants
Constant (not slope) differs across values of a third variable
Example: Effect of race on school success varies by gender
African Americans do less well in school; but the difference is much larger for black
males.
41. Interaction Terms
If you make an interaction, you should also include the component variables in the model:
A model with “DFEMALE * INCOME” should also include DFEMALE and INCOME
There are rare exceptions. But when in doubt, include them
Sometimes interaction terms are highly correlated with their components
That can cause problems (multicollinearity – which we’ll discuss next week).
Make sure you have enough cases in each group for your interaction terms
Interaction terms involve estimating slopes based on sub-groups in your data (e.g., black
females).
If there are hardly any black females in the dataset, you can have problems.
42. Partial Correlation
A partial correlation measures the relationship between two variables (X and Y) while
eliminating the influence of a third variable (Z).
Partial correlations are used to reveal the real, underlying relationship between two
variables when researchers suspect that the apparent relation may be distorted by a
third variable.
43. Partial Correlation
For example, there probably is no underlying relationship between weight and mathematics skill for
elementary school children.
However, both variables are positively related to age: Older children weigh more and, because they have
spent more years in school, have higher mathematics skills.
44. Partial Correlation
As a result, weight and mathematics skill will show a positive correlation for a sample of
children that includes several different ages.
A partial correlation between weight and mathematics skill, holding age constant, would eliminate the influence of age and show the true correlation, which is near zero.
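In SAS, PROC CORR computes this directly with a PARTIAL statement (a sketch, assuming a hypothetical dataset kids with variables weight, mathskill, and age):

proc corr data=kids;
   var weight mathskill;   * the correlation of interest;
   partial age;            * holds age constant;
run;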
45. Properties of Partial Correlation
The partial correlation r(yx1.x2):
Falls between -1 and +1
The larger the absolute value, the stronger the association, controlling for the other variables
Does not depend on units of measurement
Has the same sign as the corresponding partial slope in the prediction equation
Can be regarded as approximating the ordinary correlation between Y and X1 at a fixed value of X2
Equals the ordinary correlation found for data points in the corresponding partial regression plot
The squared partial correlation has a proportional reduction in error (PRE) interpretation for predicting Y using that predictor, controlling for the other explanatory variables in the model
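For reference, in the three-variable case the partial correlation can be computed from the ordinary correlations by the standard identity:

r(yx1.x2) = (r(yx1) - r(yx2) * r(x1x2)) / sqrt[(1 - r(yx2)^2) * (1 - r(x1x2)^2)]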
46. Model Selection: Variable Selection Procedures
Stepwise Regression, Forward Selection, Backward Elimination: iterative; one independent variable at a time is added or deleted based on the F statistic
Best-Subsets Regression: different subsets of the independent variables are evaluated
The first 3 procedures are heuristics; there is no guarantee that the best model will be found.
47. Variable Selection: Stepwise Regression
At each iteration, the first consideration is whether the least significant variable currently in the model can be removed, because its F value is less than the user-specified or default alpha-to-remove.
If no variable can be removed, the procedure checks whether the most significant variable not in the model can be added, because its F value is greater than the user-specified or default alpha-to-enter.
If no variable can be removed and no variable can be added, the procedure stops.
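A sketch of stepwise selection in PROC REG (the dataset and extra predictors x3, x4 are hypothetical; SLENTRY and SLSTAY are the alpha-to-enter and alpha-to-remove):

proc reg data=study;
   model AA = ASC GSC x3 x4 / selection=stepwise slentry=0.05 slstay=0.05;
run;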
48. Variable Selection: Forward Selection
This procedure is like stepwise regression but does not permit a
variable to be deleted.
This forward-selection procedure starts with no independent variables.
It adds variables one at a time as long as a significant reduction in the
error sum of squares (SSE) can be achieved.
49. Variable Selection: Backward Elimination
This procedure begins with a model that includes all the
independent variables the modeler wants to be considered.
It then attempts to delete one variable at a time by determining
whether the least significant variable currently in the model can be
removed because its p-value is less than the user-specified or
default value.
Once a variable has been removed from the model it cannot
reenter at a subsequent step.
50. Variable Selection: Best-Subsets Regression
Some software packages include best-subsets regression that enables
the user to find, given a specified number of independent variables, the
best regression model.
The three preceding procedures are one-variable-at-a-time methods
offering no guarantee that the best model for a given number of
variables will be found.
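Forward, backward, and best-subsets runs differ only in the SELECTION option (same hypothetical variables as in the stepwise sketch):

proc reg data=study;
   model AA = ASC GSC x3 x4 / selection=forward slentry=0.05;    * forward selection;
   model AA = ASC GSC x3 x4 / selection=backward slstay=0.05;    * backward elimination;
   model AA = ASC GSC x3 x4 / selection=adjrsq best=3;           * best subsets by adjusted R2;
run;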
51. Regression with Time Series Data: Autocorrelation and the Durbin-Watson Test
With positive autocorrelation, we expect a positive residual in one period to be followed by a positive residual in the next period, and a negative residual to be followed by a negative residual.
With negative autocorrelation, we expect a positive residual in one period to be followed by a negative residual in the next period, then a positive residual, and so on.
52. Autocorrelation and the Durbin-Watson Test
When autocorrelation is present, one of the regression assumptions is violated: the error terms are not independent.
When autocorrelation is present, serious errors can be made in performing tests of significance based upon the assumed regression model.
The Durbin-Watson statistic can be used to detect first-order autocorrelation.
53. Autocorrelation and the Durbin-Watson Test
The statistic ranges in value from zero to four.
A value of two indicates no autocorrelation.
If successive values of the residuals are close together (positive autocorrelation is present), the statistic will be small.
If successive values are far apart (negative autocorrelation is present), the statistic will be large.
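A minimal sketch of requesting the statistic in SAS, assuming a time-series dataset ts with response y and predictor x (hypothetical names):

proc reg data=ts;
   model y = x / dw;   * prints the Durbin-Watson statistic (0 to 4; 2 = no autocorrelation);
run;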
54. Key Takeaway
Checking for assumptions of the regression model is key to interpreting the
results.
Even if the regression model assumptions are met, the presence of multicollinearity can lead to bad inferences.
Dummy variables and interaction terms are powerful tools for building
insightful models.
Model selection is key for inferential purposes.
55. You have reached the end of the presentation.