U N I V E R S I T Y O F S O U T H F L O R I D A //
Regression Analysis
Dr. S. Shivendu
U N I V E R S I T Y O F S O U T H F L O R I D A // 2
Objectives
Regression Analysis
Analyze the multicollinearity and inference
with interaction terms in regression analysis.
01
Analyze the partial correlation and
interpretation procedures of statistical data.
02
Conduct appropriate model selection based
on statistical data.
03
U N I V E R S I T Y O F S O U T H F L O R I D A // 3
Agenda
Regression Analysis
Regression Analysis
Regression Diagnostics and Advanced Regression topics
Multicollinearity
Interaction
Partial Regression
Concepts
Model Selection
Concepts and decision-making
Working with data
SAS procedures
U N I V E R S I T Y O F S O U T H F L O R I D A // 4
Multiple Regression Analysis
Method for studying the relationship between a dependent
variable and two or more independent variables.
Prediction Explanation
Theory building
Purposes
U N I V E R S I T Y O F S O U T H F L O R I D A // 5
Assumptions
Independence
The scores of any subject are
independent of the scores of
all other subjects
Homoscedasticity
In the population, the variance of the
dependent variable is the same at all
levels of the X variables.
Normality
In the population, the scores
on the dependent variable are
normally distributed
Linearity
The relation between the
dependent and independent
variables is linear when all the
others are held constant.
U N I V E R S I T Y O F S O U T H F L O R I D A // 6
VS
 One dependent variable Y predicted from one independent
variable X
 One regression coefficient
 r2: proportion of variation in the dependent variable Y
predictable from X
Simple Regression
 One dependent variable Y predicted from a set of
independent variables (X1, X2 ….Xk)
 One regression coefficient for each independent variable
 R2: proportion of variation in the dependent variable Y
predictable by a set of independent variables (X’s)
Multiple Regression
Regression
U N I V E R S I T Y O F S O U T H F L O R I D A // 7
VS
R = the magnitude of the relationship between the
dependent variable and the best linear combination
of the predictor variables
Multiple Correlation Coefficient (R)
R2 = the proportion of variation in Y accounted for
by the set of independent variables (X’s).
Coefficient of Multiple Determination
(R2)
Differences
U N I V E R S I T Y O F S O U T H F L O R I D A // 8
Self Concept and Academic Achievement (N=103)
U N I V E R S I T Y O F S O U T H F L O R I D A // 9
The Model
 The b’s are called partial regression coefficients
 Our example-Predicting AA:
 Y’= 36.83 + (3.52)XASC + (-.44)XGSC
 Predicted AA for person with GSC of 4 and ASC of 6
 Y’= 36.83 + (3.52)(6) + (-.44)(4) = 56.19
Y’ = a + b1x1 + b2x2 + … + bkxk
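As a quick check, the prediction equation can be evaluated directly. A minimal Python sketch (the function and argument names are ours, not from the slides):

# Evaluate the fitted prediction equation from the slide.
def predict_aa(asc, gsc, a=36.83, b_asc=3.52, b_gsc=-0.44):
    """Predicted academic achievement (AA) from ASC and GSC."""
    return a + b_asc * asc + b_gsc * gsc

print(predict_aa(asc=6, gsc=4))  # 56.19 with the rounded coefficients shown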
U N I V E R S I T Y O F S O U T H F L O R I D A // 10
Variation: How much?
Total Variation in Y
Unpredictable Variation
Predictable variation by the combination of
independent variables
U N I V E R S I T Y O F S O U T H F L O R I D A // 11
Proportion of Predictable and Unpredictable Variation
 Where:
 Y= AA
 X1 = ASC
 X2 =GSC
R2 = Predictable (explained)
variation in Y
(1-R2) = Unpredictable (unexplained)
variation in Y
(Venn diagram: overlapping circles for Y, X1, and X2, showing shared and unique variance.)
U N I V E R S I T Y O F S O U T H F L O R I D A // 12
Various Significance Tests
Testing R2
 Test R2 through an F test
 Test of competing models (difference between R2)
through an F test of difference of R2s
Testing b
 Test each partial regression coefficient (b) by t-tests
 Comparison of partial regression coefficients with each
other - t-test of difference between standardized
partial regression coefficients (β)
U N I V E R S I T Y O F S O U T H F L O R I D A // 13
Testing R2
Example
 What proportion of variation in AA can be
predicted from GSC and ASC?
 Compute R2: R2 = .16 (R = .41) : 16%
of the variance in AA can be
accounted for by the composite of
GSC and ASC
 Is R2 significantly different from 0?
 F test: Fobserved = 9.52, Fcrit(.05; 2, 100) = 3.09
 Reject H0: in the population there is
a significant relationship between AA
and the linear composite of GSC and
ASC
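This F statistic can be reproduced from R2 alone. A short Python sketch, plugging in the slide's values (n = 103, k = 2 predictors):

from scipy import stats

n, k, r2 = 103, 2, 0.16
f_obs = (r2 / k) / ((1 - r2) / (n - k - 1))   # F = (R2/k) / ((1-R2)/(n-k-1))
f_crit = stats.f.ppf(0.95, k, n - k - 1)      # critical value at alpha = .05
print(round(f_obs, 2), round(f_crit, 2))      # ~9.52 and ~3.09, as on the slide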
U N I V E R S I T Y O F S O U T H F L O R I D A // 14
Comparing Models - Testing R2
Example
Comparing models
 Model 1: Y’= 35.37 +
(3.38)XASC
 Model 2: Y’= 36.83 +
(3.52)XASC + (-.44)XGSC
Compute R2 for each model
 Model 1: R2 = r2 = .160
 Model 2: R2 = .161
Test difference between R2s
 Fobs = .119, Fcrit(.05; 1, 100) = 3.94
 Conclude that GSC does
not add significantly to
ASC in predicting AA
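This comparison uses the standard F test for a change in R2. Plugging in the numbers above (n = 103, adding one predictor to a one-predictor model):

F = [(R2full − R2reduced) / (kfull − kreduced)] / [(1 − R2full) / (n − kfull − 1)]
  = [(.161 − .160) / 1] / [(1 − .161) / 100] ≈ .119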
U N I V E R S I T Y O F S O U T H F L O R I D A // 15
Residual Analysis
 The residual for observation i, ei, is the difference between its observed and
predicted value
 Check the assumptions of regression by examining the residuals
 Examine for linearity assumption
 Evaluate independence assumption
 Evaluate normal distribution assumption
 Examine for constant variance for all levels of X (homoscedasticity)
ei = Yi − Ŷi
U N I V E R S I T Y O F S O U T H F L O R I D A // 16
Residual Analysis for Linearity
(Plots: for a nonlinear relationship, Y vs. x shows curvature and the residuals vs. x show a systematic pattern; for a linear relationship, the residuals scatter randomly around zero.)
U N I V E R S I T Y O F S O U T H F L O R I D A // 17
Residual Analysis for Independence
(Plots: residuals vs. X show a systematic pattern when the errors are not independent, and a random scatter when they are.)
U N I V E R S I T Y O F S O U T H F L O R I D A // 18
Check for Normality
Examine the Stem-and-Leaf Display of the Residuals
Examine the Boxplot of the Residuals
Examine the Histogram of the Residuals
Construct a Normal Probability Plot of the Residuals
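A runnable Python sketch of these normality checks (the data are simulated stand-ins for the AA/ASC/GSC example; the generating coefficients are illustrative):

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Simulated stand-in for the AA/ASC/GSC data (n = 103).
rng = np.random.default_rng(0)
df = pd.DataFrame({"ASC": rng.normal(5, 1, 103), "GSC": rng.normal(5, 1, 103)})
df["AA"] = 36.8 + 3.5 * df["ASC"] - 0.4 * df["GSC"] + rng.normal(0, 5, 103)

resid = smf.ols("AA ~ ASC + GSC", data=df).fit().resid
plt.hist(resid, bins=15)        # histogram of the residuals
sm.qqplot(resid, line="s")      # normal probability plot
plt.show()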
U N I V E R S I T Y O F S O U T H F L O R I D A // 19
Residual Analysis for Normality
(Normal probability plot: percent vs. residual.)
When using a normal probability plot, normally distributed errors will fall approximately along a straight line.
U N I V E R S I T Y O F S O U T H F L O R I D A // 20
Residual Analysis for Equal Variance
(Plots: residuals vs. x fan out as x grows under non-constant variance, and form a uniform band under constant variance.)
U N I V E R S I T Y O F S O U T H F L O R I D A // 21
Multicollinearity
Collinear = highly correlated
Multicollinearity = inclusion of highly correlated
independent variables in a single regression model
High correlation of X variables causes problems for
estimation of slopes (b’s)
The denominators used in estimating the slopes approach
zero, so coefficients may be wrong or implausibly large.
U N I V E R S I T Y O F S O U T H F L O R I D A // 22
Multicollinearity Symptoms
Unusually large standard
errors and betas
 Compared to when the collinear
variables aren’t both included
 Betas often exceed 1.0
Two variables have the
same large effect when
included separately
 When put together, the
effects of both variables
shrink
 Or one remains positive, and the
other flips sign
U N I V E R S I T Y O F S O U T H F L O R I D A // 23
Multicollinearity
• What does multicollinearity do to models?
• Note: It does not violate regression assumptions
• But it can mess things up anyway
• Multicollinearity can inflate standard error estimates
• Large standard errors = small t-values = no rejected null hypotheses
• Note: Only collinear variables are affected. The rest of the model results are
OK.
• It leads to instability of coefficient estimates
• Variable coefficients may fluctuate wildly when a collinear variable is added
• These fluctuations may not be “real”, but may just reflect amplification of
“noise” and “error”
• One variable may only be slightly better at predicting Y… but SPSS will
give it a MUCH higher coefficient.
U N I V E R S I T Y O F S O U T H F L O R I D A // 24
Multicollinearity
Look at correlations of all
independent vars
 Correlation >.8 is a
concern
Watch out for the “symptoms”
Problems aren’t always
bivariate… and may not
show up in bivariate
correlations
Compute diagnostic statistics
 VIF (Variance Inflation
Factor).
U N I V E R S I T Y O F S O U T H F L O R I D A // 25
Multicollinearity
If you have 3 independent variables, X1, X2, X3:
If a variable (X1) is highly correlated with all the
others (X2, X3), then they will do a good job of
predicting it in a regression.
Tolerance is based on doing that regression: X1 is
dependent; X2 and X3 are independent.
Tolerance for X1 is simply 1 minus the regression
R-square.
If X1 is collinear, the regression R-square will be
high… so 1 minus R-square will be low… indicating
a problem.
U N I V E R S I T Y O F S O U T H F L O R I D A // 26
Multicollinearity
Variance Inflation Factor (VIF) is the reciprocal of
tolerance: 1/tolerance
High VIF indicates multicollinearity
It indicates how much the standard error of a coefficient
grows due to the presence of the other variables.
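A Python sketch of computing tolerance and VIF (the near-collinear data are simulated for illustration):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),
})

exog = sm.add_constant(X)
for i, name in enumerate(X.columns, start=1):    # column 0 is the constant
    vif = variance_inflation_factor(exog.values, i)
    print(f"{name}: tolerance = {1 / vif:.3f}, VIF = {vif:.1f}")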
U N I V E R S I T Y O F S O U T H F L O R I D A // 27
Multicollinearity
Solutions to multicollinearity:
Drop unnecessary variables.
If two collinear variables are really measuring the same
thing, drop one or combine them into an index.
Use advanced techniques like ridge regression, which uses
a more efficient estimator (but not BLUE – it may introduce
bias).
Solutions can be difficult if a fully specified model requires
several collinear variables.
U N I V E R S I T Y O F S O U T H F L O R I D A // 28
Dummy Variables
How can we incorporate nominal variables (e.g., race,
gender) into regression?
Option 1: Analyze each sub-group separately. This generates a
different slope and constant for each group.
Option 2: Dummy variables, a dichotomous variable coded to indicate
the presence or absence of something. Absence coded as zero, presence
coded as 1.
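A minimal Python sketch of Option 2 (the HAPPY/INCOME/SEX data are hypothetical, echoing the running example):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "HAPPY":  [5, 7, 6, 8, 4, 9],
    "INCOME": [20_000, 60_000, 40_000, 80_000, 30_000, 90_000],
    "SEX":    ["F", "M", "F", "M", "F", "M"],
})
df["DFEMALE"] = (df["SEX"] == "F").astype(int)   # presence = 1, absence = 0

fit = smf.ols("HAPPY ~ INCOME + DFEMALE", data=df).fit()
print(fit.params)   # DFEMALE shifts the constant, not the slope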
U N I V E R S I T Y O F S O U T H F L O R I D A // 29
Dummy Variables: Interpretation
(Scatterplot: HAPPY (0–10) on the vertical axis vs. INCOME (0–100,000) on the horizontal axis; women = blue, men = red.)
One line shows the overall slope for all
data points.
Note: The lines for men and women
have the same slope… but one is higher
and the other lower. The constant differs!
If women = 1, men = 0: the constant (a)
reflects men only. The dummy coefficient
(b) reflects the increase for women
(relative to men).
U N I V E R S I T Y O F S O U T H F L O R I D A // 30
Dummy Variables
Dummy coefficients shouldn’t be called slopes
 Referring to the “slope” of gender doesn’t make sense
 Rather, it is the difference in the constant (or “level”)
The contrast is always with the nominal category that was left out of the equation
 If DFEMALE is included, the contrast is with males
 If DBLACK, DOTHER are included, coefficients reflect difference in constant compared to whites.
U N I V E R S I T Y O F S O U T H F L O R I D A // 31
Interaction Terms
What if you suspect that a variable has a totally different slope for
two different sub-groups in your data?
Perhaps men are more materialistic -- an extra dollar increases
their happiness a lot
If women are less materialistic, each dollar has a smaller effect
on income (compared to men)
Rather, the slope of a variable (income) differs across groups
The issue isn’t men = “more” or “less” than women
Example: Income and Happiness
U N I V E R S I T Y O F S O U T H F L O R I D A // 32
Interaction Terms
Visually: Women = blue, Men = red
(Scatterplot: HAPPY (0–10) vs. INCOME (0–100,000), with a line of different slope for each group and one line showing the overall slope for all data points.)
Note: Here, the slope for men
and women differs.
The effect of income on
happiness (X1 on Y) varies with
gender (X2). This is called an
“interaction effect”
U N I V E R S I T Y O F S O U T H F L O R I D A // 33
Interaction Terms
 Examples of interaction:
 Effect of education on income may interact with type of school attended (public vs.
private)
 Private schooling has bigger effect on income
 Effect of aspirations on educational attainment interacts with poverty
 Aspirations matter less if you don’t have money to pay for college
 Question: Can you think of examples of two variables that might interact?
 From your final project? Or anything else?
U N I V E R S I T Y O F S O U T H F L O R I D A // 34
Interaction Terms
 Interaction effects: Differences in the relationship (slope) between two
variables for each category of a third variable
 Option #1: Analyze each group separately
 Look for different sized slope in each group
 Option #2: Multiply the two variables of interest: (DFEMALE, INCOME)
to create a new variable
 Called: DFEMALE*INCOME
 Add that variable to the multiple regression model.
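Continuing the hypothetical HAPPY/INCOME sketch from the dummy-variable slide, Option #2 looks like this:

# Explicit product term, as described on the slide.
df["DFEM_INC"] = df["DFEMALE"] * df["INCOME"]
fit = smf.ols("HAPPY ~ INCOME + DFEMALE + DFEM_INC", data=df).fit()

# Equivalent shortcut: '*' in a formula expands to main effects + interaction.
fit2 = smf.ols("HAPPY ~ INCOME * DFEMALE", data=df).fit()
print(fit.params)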
U N I V E R S I T Y O F S O U T H F L O R I D A // 35
Interaction Terms
Consider the following regression equation:
Yi = a + b1(INCOMEi) + b2(DFEMi*INCi) + ei
 Question: What if the case is male?
 Answer: DFEMALE is 0, so b2(DFEM*INC) drops out of the
equation
 Result: Males are modeled using the ordinary regression
equation: a + b1X + e.
U N I V E R S I T Y O F S O U T H F L O R I D A // 36
Interaction Terms
Consider the following regression equation:
Yi = a + b1(INCOMEi) + b2(DFEMi*INCi) + ei
 Question: What if the case is female?
 Answer: DFEMALE is 1, so b2(DFEM*INC) becomes b2*INCOME, which is added to b1
 Result: Females are modeled using a different regression line: a + (b1+b2) X + e
 Thus, the coefficient b2 reflects the difference in the slope of INCOME for women.
U N I V E R S I T Y O F S O U T H F L O R I D A // 37
Interpreting Interaction Terms
• Interpreting interaction terms:
• A positive b for DFEMALE*INCOME indicates the slope for income is
higher for women vs. men
• A negative effect indicates the slope is lower
• Size of coefficient indicates actual difference in slope
• Example: DFEMALE*INCOME. Observed b’s:
• Income: b = .5
• DFEMALE * INCOME: b = -.2
• Interpretation: Slope is .5 for men, .3 for women.
U N I V E R S I T Y O F S O U T H F L O R I D A // 38
Interaction Terms
 Two continuous variables can also interact
 Example: Effect of education and income on happiness
 Perhaps highly educated people are less materialistic
 As education increases, the slope between income and happiness would decrease
 Simply multiply Education and Income to create the interaction term “EDUCATION*INCOME”
 And add it to the model.
U N I V E R S I T Y O F S O U T H F L O R I D A // 39
Interpreting Interaction Terms
 How do you interpret continuous variable interactions?
 Example: EDUCATION*INCOME: Coefficient = 2.0
 Answer: For each unit change in education, the slope of income vs. happiness increases by 2
 Note: coefficient is symmetrical: For each unit change in income, education slope
increases by 2
 Dummy interactions effectively estimate 2 slopes: one for each group
 Continuous interactions result in many slopes: each value of education implies a
different slope for income.
U N I V E R S I T Y O F S O U T H F L O R I D A // 40
Dummy Interactions
 It is also possible to construct interaction terms based on two dummy variables
 Instead of a “slope” interaction, dummy interactions show difference in constants
 Constant (not slope) differs across values of a third variable
 Example: Effect of race on school success varies by gender
 African Americans do less well in school; but the difference is much larger for black
males.
U N I V E R S I T Y O F S O U T H F L O R I D A // 41
Interaction Terms
 If you make an interaction, you should also include the component variables in the model:
 A model with “DFEMALE * INCOME” should also include DFEMALE and INCOME
 There are rare exceptions. But when in doubt, include them
 Sometimes interaction terms are highly correlated with their components
 That can cause problems (multicollinearity – which we’ll discuss next week).
 Make sure you have enough cases in each group for your interaction terms
 Interaction terms involve estimating slopes based on sub-groups in your data (e.g., black
females).
 If there are hardly any black females in the dataset, you can have problems.
U N I V E R S I T Y O F S O U T H F L O R I D A // 42
Partial Correlation
 A partial correlation measures the relationship between two variables (X and Y) while
eliminating the influence of a third variable (Z).
 Partial correlations are used to reveal the real, underlying relationship between two
variables when researchers suspect that the apparent relation may be distorted by a
third variable.
U N I V E R S I T Y O F S O U T H F L O R I D A // 43
Partial Correlation
 For example, there probably is no underlying relationship between weight and mathematics skill for
elementary school children.
 However, both variables are positively related to age: Older children weigh more and, because they have
spent more years in school, have higher mathematics skills.
U N I V E R S I T Y O F S O U T H F L O R I D A // 44
Partial Correlation
 As a result, weight and mathematics skill will show a positive correlation for a sample of
children that includes several different ages.
 A partial correlation between weight and mathematics skill, holding age constant, would
eliminate the influence of age and show the true correlation which is near zero.
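A runnable sketch of this example: the partial correlation is obtained by correlating the residuals after regressing each variable on age (data simulated so that age drives both variables):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
age = rng.uniform(6, 12, 300)
weight = 2.5 * age + rng.normal(0, 4, 300)   # driven by age only
math = 10.0 * age + rng.normal(0, 8, 300)    # driven by age only

Z = sm.add_constant(age)
res_w = sm.OLS(weight, Z).fit().resid
res_m = sm.OLS(math, Z).fit().resid

print(np.corrcoef(weight, math)[0, 1])  # sizable: age confounds the relation
print(np.corrcoef(res_w, res_m)[0, 1])  # partial correlation: near zero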
U N I V E R S I T Y O F S O U T H F L O R I D A // 45
Properties of Partial Correlation
 Falls between -1 and +1.
 The larger the absolute value, the stronger the association, controlling for the other variables
 Does not depend on units of measurement
 Has the same sign as the corresponding partial slope in the prediction equation
 Can be regarded as approximating the ordinary correlation between y and x1 at a fixed value
of x2
 Equals the ordinary correlation found for the data points in the corresponding partial regression plot
 The squared partial correlation has a proportional reduction in error (PRE) interpretation for
predicting y using that predictor, controlling for the other explanatory variables in the model
Notation: ryx1·x2 denotes the partial correlation between y and x1, controlling for x2.
U N I V E R S I T Y O F S O U T H F L O R I D A // 46
Model Selection: Variable Selection Procedures
 Stepwise Regression
 Forward Selection
 Backward Elimination
These three are iterative: one independent
variable at a time is added or
deleted based on the F statistic.
 Best-Subsets Regression
Different subsets of the
independent variables
are evaluated.
The first three procedures are heuristics;
there is no guarantee that the
best model will be found.
U N I V E R S I T Y O F S O U T H F L O R I D A // 47
Variable Selection: Stepwise Regression
 At each iteration, the first consideration is whether the least significant
variable currently in the model can be removed because its p-value is greater
than the user-specified or default Alpha to remove.
 If no variable can be removed, the procedure checks whether the most
significant variable not in the model can be added because its p-value is
less than the user-specified or default Alpha to enter.
 If no variable can be removed and no variable can be added, the procedure stops.
U N I V E R S I T Y O F S O U T H F L O R I D A // 48
Variable Selection: Forward Selection
 This procedure is like stepwise regression but does not permit a
variable to be deleted.
 This forward-selection procedure starts with no independent variables.
 It adds variables one at a time as long as a significant reduction in the
error sum of squares (SSE) can be achieved.
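A toy sketch of the forward-selection loop, written from scratch (it uses t-test p-values, which for a single added variable are equivalent to the partial F test; this is not any particular package's procedure):

import statsmodels.api as sm

def forward_select(y, X, alpha_enter=0.05):
    """Greedily add the most significant remaining column of DataFrame X."""
    selected, remaining = [], list(X.columns)
    while remaining:
        # p-value of each candidate when added to the current model
        pvals = {c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                 for c in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break                      # nothing left meets the entry criterion
        selected.append(best)
        remaining.remove(best)
    return selected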
U N I V E R S I T Y O F S O U T H F L O R I D A // 49
Variable Selection: Backward Elimination
 This procedure begins with a model that includes all the
independent variables the modeler wants to be considered.
 It then attempts to delete one variable at a time by determining
whether the least significant variable currently in the model can be
removed because its p-value is greater than the user-specified or
default value.
 Once a variable has been removed from the model it cannot
reenter at a subsequent step.
U N I V E R S I T Y O F S O U T H F L O R I D A // 50
Variable Selection: Best-Subsets Regression
 Some software packages include best-subsets regression that enables
the user to find, given a specified number of independent variables, the
best regression model.
 The three preceding procedures are one-variable-at-a-time methods
offering no guarantee that the best model for a given number of
variables will be found.
U N I V E R S I T Y O F S O U T H F L O R I D A // 51
Regression with Time Series Data:
Autocorrelation and the Durbin-Watson Test
 With positive autocorrelation, we expect a positive residual in one
period to be followed by a positive residual in the next period.
 With positive autocorrelation, we likewise expect a negative residual in one
period to be followed by a negative residual in the next period.
 With negative autocorrelation, we expect a positive residual in one
period to be followed by a negative residual in the next period, then a
positive residual, and so on.
U N I V E R S I T Y O F S O U T H F L O R I D A // 52
Autocorrelation and the Durbin-Watson Test
 When autocorrelation is present, one of the regression assumptions is
violated: the error terms are not independent.
 When autocorrelation is present, serious errors can be made in
performing tests of significance based upon the assumed regression
model.
 The Durbin-Watson statistic can be used to detect first-order
autocorrelation.
U N I V E R S I T Y O F S O U T H F L O R I D A // 53
Autocorrelation and the Durbin-Watson Test
 The statistic ranges in value from zero to four.
 A value of two indicates no autocorrelation.
 If successive values of the residuals are close together (positive
autocorrelation is present), the statistic will be small.
 If successive values are far apart (negative autocorrelation is
present), the statistic will be large.
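A short sketch of computing the statistic (the residual series is made up for illustration):

import numpy as np
from statsmodels.stats.stattools import durbin_watson

resid = np.array([1.2, 0.8, 1.1, -0.3, -0.9, -1.1, 0.4, 0.7, 1.0, -0.5])
print(durbin_watson(resid))

# The same value by hand: the sum of squared successive differences
# divided by the sum of squared residuals; values near 2 suggest no
# first-order autocorrelation.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(dw)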
U N I V E R S I T Y O F S O U T H F L O R I D A // 54
Key Takeaway
 Checking for assumptions of the regression model is key to interpreting the
results.
 Even if the regression model assumptions are met, the presence of
multicollinearity can lead to bad inference.
 Dummy variables and interaction terms are powerful tools for building
insightful models.
 Model selection is key for inferential purposes.
U N I V E R S I T Y O F S O U T H F L O R I D A //
You have reached the end
of the presentation.