Multiple regression analyzes the relationship between one dependent variable and multiple independent variables. It assumes:
1) Normal distribution of the dependent variable for each value of the independent variables
2) Equal variance of the dependent variable for each value of the independent variables (homoscedasticity)
3) A linear relationship between the dependent and independent variables
4) Independence of errors, with no autocorrelation between errors over time
Violations can be addressed by modifying the variables or the model specification. The Durbin-Watson test evaluates autocorrelation.
2. Multiple regression generally explains the relationship between multiple independent (predictor) variables and one dependent (criterion) variable.
y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ + u
where β₀ is the intercept, β₁ is the parameter associated with x₁, β₂ is the parameter associated with x₂, and so on; u is the error term.
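A minimal sketch of fitting such a model with statsmodels; the data are simulated and the variable names and coefficient values are purely illustrative:

```python
# Fit y = b0 + b1*x1 + b2*x2 + u by ordinary least squares (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
u = rng.normal(size=n)                  # error term
y = 1.0 + 2.0 * x1 - 0.5 * x2 + u      # illustrative "true" model

X = sm.add_constant(np.column_stack([x1, x2]))  # prepend the intercept column
model = sm.OLS(y, X).fit()
print(model.params)                     # estimates of beta0, beta1, beta2
```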
3. Assumptions for multiple regression
1) Normality assumption: for any specific value of the independent variable, the values of the y variable are normally distributed.
2) Equal-variance assumption (assumption of homoscedasticity): the variances (or standard deviations) of the y variable are the same for each value of the independent variable. If the errors do not have a constant variance, they are said to be heteroscedastic. There are a number of formal statistical tests for heteroscedasticity, e.g. the Goldfeld-Quandt test and White's general test.
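A minimal sketch of both tests, reusing the `model`, `X`, and `y` objects from the fitting snippet above; both tests are available in `statsmodels.stats.diagnostic`:

```python
from statsmodels.stats.diagnostic import het_white, het_goldfeldquandt

# White's general test regresses the squared residuals on the regressors,
# their squares, and cross-products; H0 is homoscedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(model.resid, X)
print("White test p-value:", lm_pvalue)

# The Goldfeld-Quandt test compares residual variances across two
# subsamples; H0 is equal variance in both parts.
gq_stat, gq_pvalue, ordering = het_goldfeldquandt(y, X)
print("Goldfeld-Quandt p-value:", gq_pvalue)
```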
4. Assumptions for multiple regression
3) Linearity assumption: there is a linear relationship between the dependent variable and the independent variables.
The covariance between the error terms over time is zero; in other words, the errors are assumed to be uncorrelated with one another. Formal methods to detect autocorrelation are the Durbin-Watson test and the Breusch-Godfrey test.
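A minimal sketch of both diagnostics, again reusing the fitted `model` from the earlier snippet:

```python
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# DW near 2 suggests little evidence of first-order autocorrelation.
print("DW statistic:", durbin_watson(model.resid))

# Breusch-Godfrey tests for autocorrelation up to a chosen lag order;
# H0 is no serial correlation.
lm_stat, lm_pvalue, f_stat, f_pvalue = acorr_breusch_godfrey(model, nlags=2)
print("Breusch-Godfrey p-value:", lm_pvalue)
```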
5. Durbin-Watson (DW) is a test for first-order autocorrelation: it tests only for a relationship between an error and its immediately previous value. The DW test statistic is approximately equal to 2(1−ρ), where ρ is the correlation between successive errors. Since ρ is a correlation, it is bounded to lie between −1 and +1.
6. The corresponding limits for DW are 0 ≤ DW ≤ 4. Consider now the implication of DW taking one of three important values (0, 2, and 4):
ρ = 0, DW = 2: there is no autocorrelation in the residuals. Roughly speaking, the null hypothesis would not be rejected if DW is near 2, i.e. there is little evidence of autocorrelation.
ρ = 1, DW = 0: this corresponds to the case of perfect positive autocorrelation in the residuals.
ρ = −1, DW = 4: this corresponds to the case of perfect negative autocorrelation in the residuals.
[Figure: the rejection, non-rejection, and inconclusive regions for the DW test.]
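To make the DW ≈ 2(1−ρ) relationship concrete, a small sketch that computes the DW statistic directly from the residuals of the earlier `model` and compares it with 2(1−ρ):

```python
import numpy as np

e = np.asarray(model.resid)             # residuals from the fitted model
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # DW definition
rho = np.corrcoef(e[:-1], e[1:])[0, 1]  # first-order residual correlation
print(dw, 2 * (1 - rho))                # the two values should be close
```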
7. The assumption of uncorrelated errors can be violated in two ways:
Model misspecification: if an important independent variable is omitted or if an incorrect functional form is used, the residuals may not be independent. The solution to this dilemma is to find the proper functional form or to include the proper independent variables.
Time-sequenced data: when regression analysis is performed on data taken over time, the residuals are often correlated (serial correlation or autocorrelation). Positive autocorrelation means that the residual in time period j tends to have the same sign as the residual in time period (j−k), where k is the lag in time periods. Negative autocorrelation means that the residual in time period j tends to have the opposite sign to the residual in time period (j−k).
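A small simulation sketch of the positive case, with illustrative AR(1) errors (ρ = 0.8, lag k = 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho = 500, 0.8
e = np.zeros(n)
for t in range(1, n):
    e[t] = rho * e[t - 1] + rng.normal()    # AR(1) error process

# Fraction of periods where e_j has the same sign as e_{j-1}:
same_sign = np.mean(np.sign(e[1:]) == np.sign(e[:-1]))
print(same_sign)    # well above 0.5 under positive autocorrelation
```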
8. Assumptions for multiple regression
4) Non-multicollinearity assumption: the independent variables are not correlated.
5) Independence assumption: the values of the y variable are independent.
The strength of the relationship between the independent variables and the dependent variable is measured by a correlation coefficient. This multiple correlation coefficient is symbolized by R. The value of R can range from 0 to +1: the closer to +1, the stronger the relationship; the closer to 0, the weaker the relationship.
9. Multicollinearity occurs when there are high correlations between two or more predictor variables.
Data-based multicollinearity: caused by poorly designed experiments, data that is 100% observational, or data collection methods that cannot be manipulated.
Structural multicollinearity: caused by the researcher, by creating new predictor variables.
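In practice, multicollinearity is often detected with variance inflation factors (VIFs); a minimal sketch, reusing the regressor matrix `X` from the fitting snippet above (the VIF > 10 cut-off is a common rule of thumb, not part of this text):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

for i in range(1, X.shape[1]):          # skip the constant column
    vif = variance_inflation_factor(X, i)
    print(f"x{i}: VIF = {vif:.2f}")     # VIF > 10 is often flagged as serious
```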
10. Causes of multicollinearity
Insufficient data. In some cases, collecting more data can resolve the issue.
Incorrectly used dummy variables. For example, the researcher may fail to exclude one category and add a dummy variable for every category (e.g. spring, summer, autumn, winter); see the sketch after this list.
Including a variable in the regression that is actually a combination of two other variables. For example, including "total investment income" when total investment income = income from stocks and bonds + income from savings interest.
Including two identical (or almost identical) variables. For example, weight in pounds and weight in kilos, or investment income and savings/bond income.
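The dummy-variable cause is easy to demonstrate; a minimal sketch with illustrative season data, showing that keeping all four dummies alongside an intercept makes the regressor matrix rank-deficient:

```python
import numpy as np
import pandas as pd

seasons = pd.Series(["spring", "summer", "autumn", "winter"] * 25)
all_dummies = pd.get_dummies(seasons, dtype=float)                  # 4 columns: the trap
ok_dummies = pd.get_dummies(seasons, drop_first=True, dtype=float)  # 3 columns

X_trap = np.column_stack([np.ones(len(seasons)), all_dummies])
X_ok = np.column_stack([np.ones(len(seasons)), ok_dummies])
print(np.linalg.matrix_rank(X_trap), X_trap.shape[1])  # rank 4 < 5 columns
print(np.linalg.matrix_rank(X_ok), X_ok.shape[1])      # rank 4 == 4 columns
```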
11. Solutions to the problem of multicollinearity
Ignore it, if the model is otherwise adequate, i.e. statistically satisfactory and with each coefficient being of a plausible magnitude and having an appropriate sign.
Drop one of the collinear variables, so that the problem disappears.
Transform the highly correlated variables into a ratio and include only the ratio, not the individual variables, in the regression (sketched after this list).
Recognize that it may be more a problem with the data than with the model, so that there is simply insufficient information in the sample to obtain precise estimates of all the coefficients.
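A small sketch of the ratio transformation, with hypothetical variables `income_stocks` and `income_savings` that are nearly collinear by construction:

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(10, 1, size=300)                   # shared component
income_stocks = base + rng.normal(0, 0.1, size=300)
income_savings = base + rng.normal(0, 0.1, size=300)

print(np.corrcoef(income_stocks, income_savings)[0, 1])  # near 1: collinear
ratio = income_stocks / income_savings  # enter only the ratio in the model
```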