Quantitative Research Methods Lecture Regression Diagnostics

Quantitative Research Methods
Lecture 9 Model Building
1. Regression Diagnostics I
2. Regression Diagnostics II Multicollinearity
3. Regression Diagnostics II Time series
4. Polynomial Models
5. Nominal variable in Multiple Regression
6. Stepwise Multiple Regression

Statistical analyses
• Group differences (nominal variable) on one interval variable:
▫ T-tests (2 groups)
▫ ANOVA (3 or more groups)
 One factor: one way ANOVA
 Two factor: two way/factor ANOVA
• The relationship between two nominal variable:
▫ Chi-square test
• The relationship between two interval variable:
▫ Correlation, simple linear regression
• The relationship between multiple interval variable on one
interval variable
▫ Multiple regression
• The relationship between multiple interval variable on one
nominal variable (yes/no)
▫ Logistic regression

Regression
• Single Linear Regression (interval)
▫ one independent, one dependent
• Multiple Regression (all interval)
▫ Multiple independent, one dependent
• Logistic Regression
▫ Multiple interval independent, one nominal
dependent (Yes/No)
▫ Check example: https://youtu.be/H_48AcV0qlY
▫

16.4
Simple Linear Regression Model…
A straight line model with one independent
variable is called a simple linear regression
model. Its is written as:
error variable
dependent
variable
independent
variable
y-intercept slope of the line

16.5
Simple Linear Regression Model…
Note that both and are population
parameters which are usually unknown and
hence estimated from the data.
y
x
run
rise
=slope (=rise/run)
=y-intercept

16.6
Estimating the Coefficients…
In much the same way we base estimates of µ on x , we
estimate β0 using b0 and β1 using b1, the y-intercept
and slope (respectively) of the least squares or
regression line given by:
(Recall: this is an application of the least squares
method and it produces a straight line that
minimizes the sum of the squared differences
between the points and the line)

16.7
Least Squares Line…
these differences are
called residuals
Example 16.1

16.8
Example 16.2…
Car dealers across North America use the "Red Book" to
help them determine the value of used cars that their
customers trade in when purchasing new cars.
The book, which is published monthly, lists the trade-in
values for all basic models of cars.
It provides alternative values for each car model according
to its condition and optional features.
The values are determined on the basis of the average paid
at recent used-car auctions, the source of supply for many
used-car dealers.

16.9
Example 16.2…
However, the Red Book does not indicate the value
determined by the odometer reading, despite the fact that a
critical factor for used-car buyers is how far the car has
been driven.
To examine this issue, a used-car dealer randomly selected
100 three-year old Toyota Camrys that were sold at auction
during the past month.
The dealer recorded the price ($1,000) and the number of
miles (thousands) on the odometer. (Xm16-02).
The dealer wants to find the regression line.

16.10
Using SPSS
Analyze > Regression > Linear
Simple Linear Regression
SPSS Steps: Analyze > Regression > Linear

16.11
SPSS Output: check three tables
R2 strength of the linear relationship
Model
significance
/fit
b1 b0

16.12
Example 16.2…
As you might expect with used cars…
The slope coefficient, b1, is –0.0669, that is, each
additional mile on the odometer decreases the price by
$.0669 or 6.69¢
The intercept, b0, is 17,250. One interpretation would
be that when x = 0 (no miles on the car) the selling
price is $17,250. However, we have no data for cars
with less than 19,100 miles on them so this isn’t a
correct assessment.

16.13
Testing the Slope…
If no linear relationship exists between the two
variables, we would expect the regression line to be
horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e.
we want to see if the slope (β1) is something other
than zero. Our research hypothesis becomes:
H1: β1 ≠ 0
Thus the null hypothesis becomes:
H0: β1 = 0

16.14
Coefficient of Determination…
Tests thus far have shown if a linear relationship
exists; it is also useful to measure the strength
of the relationship. This is done by calculating
the coefficient of determination – R2.
The coefficient of determination is the square of
the coefficient of correlation (r), hence R2 = (r)2

16.15
As we did with analysis of variance, we can partition
the variation in y into two parts:
Variation in y = SSE + SSR
SSE – Sum of Squares Error – measures the amount of
variation in y that remains unexplained (i.e. due to
error)
SSR – Sum of Squares Regression – measures the
amount of variation in y explained by variation in the
independent variable x.

16.16
Coefficient of Determination
R2 has a value of .6483. This means 64.83% of the variation
in the auction selling prices (y) is explained by the variation
in the odometer readings (x). The remaining 35.17% is
unexplained, i.e. due to error.
Unlike the value of a test statistic, the coefficient of
determination does not have a critical value that
enables us to draw conclusions.
In general the higher the value of R2, the better the model
fits the data.
R2 = 1: Perfect match between the line and the data points.
R2 = 0: There are no linear relationship between x and y.

16.17
Using the Regression Equation…
We could use our regression equation:
y = 17.250 – .0669x
to predict the selling price of a car with 40 (,000) miles
on it:
y = 17.250 – .0669x = 17.250 – .0669(40) = 14,574
We call this value ($14,574) a point prediction.
Chances are though the actual selling price will be
different, hence we can estimate the selling price in terms
of an interval.

16.18
Prediction Interval
The prediction interval is used when we want to
predict one particular value of the dependent
variable, given a specific value of the independent
variable:
(xg is the given value of x we’re interested in)

16.19
Confidence Interval Estimator…
…of the expected value of y. In this case, we are
estimating the mean of y given a value of x:
(Technically this formula is used for infinitely large
populations. However, we can interpret our
problem as attempting to determine the average
selling price of all Toyota Camrys, all with 40,000
miles on the odometer)

16.20
What’s the Difference?
1 no 1
The confidence interval estimate of the expected value of y will be narrower than
the prediction interval for the same given value of x and confidence level. This is
because there is less error in estimating a mean value as opposed to predicting an
individual value.
Prediction Interval Confidence Interval
Used to estimate the value of
one value of y (at given x)
Used to estimate the mean
value of y (at given x)

16.23
Regression Diagnostics…
There are three conditions that are required in order to
perform a regression analysis. These are:
• The error variable must be normally distributed,
• The error variable must have a constant variance,
• The errors must be independent of each other.
How can we diagnose violations of these conditions?
 Residual Analysis, that is, examine the
differences between the actual data points and those
predicted by the linear equation…

16.24
Nonnormality…
We can take the residuals and put them into a histogram
to visually check for normality…
…we’re looking for a bell shaped histogram with the
mean close to zero. 

SPSS: Regression>Linear>Save>check Residuals >
unstandardized & standardized

SPSS: Test of normality
Analyze>descriptive statistics >explore>plots

16.28
Heteroscedasticity…
When the requirement of a constant variance is violated,
we have a condition of heteroscedasticity.
We can diagnose heteroscedasticity by plotting the
residual against the predicted y.

16.29
Heteroscedasticity…
If the variance of the error variable ( ) is not constant,
then we have “heteroscedasticity”. Here’s the plot of
the residual against the predicted value of y:
there doesn’t appear to be a
change in the spread of the
plotted points, therefore no
heteroscedasticity


SPSS: Regression>linear>Save>check Predicted Values &
Residuals

SPSS: Graphs>scatter> y-Residual; x-
Predicted Price

16.32
Nonindependence of the Error Variable
If we were to observe the auction price of cars every week
for, say, a year, that would constitute a time series.
When the data are time series, the errors often are
correlated.
Error terms that are correlated over time are said to be
autocorrelated or serially correlated.
We can often detect autocorrelation by graphing the
residuals against the time periods. If a pattern
emerges, it is likely that the independence requirement is
violated.

16.33
Nonindependence of the Error Variable
Patterns in the appearance of the residuals over time
indicates that autocorrelation exists:
Note the runs of positive residuals,
replaced by runs of negative residuals
Note the oscillating behavior of the
residuals around zero.
Durbin-Watson test, one way to test autocorrelation

16.34
Outliers…
An outlier is an observation that is unusually
small or unusually large.
E.g. our used car example had odometer readings
from 19.1 to 49.2 thousand miles. Suppose we
have a value of only 5,000 miles (i.e. a car driven
by an old person only on Sundays  ) — this point
is an outlier.

16.35
Outliers…
Possible reasons for the existence of outliers include:
▫ There was an error in recording the value
▫ The point should not have been included in the sample
▫ Perhaps the observation is indeed valid.
Outliers can be easily identified from a scatter plot.
If the absolute value of the standard residual is > 2, we
suspect the point may be an outlier and investigate further.
They need to be dealt with since they can easily
influence the least squares line…

Example 16.2
SPSS: Graph>scatter>x: odomter>y:price

Procedure for Regression Diagnostics
1. Develop a model that has a theoretical basis; that
is, for the dependent variable in question, find an
independent variable that you believe is linearly
related to it.
2. Gather data for the two variables.
3. Draw the scatter diagram to determine whether a
linear model appears to be appropriate. Identify
possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check slides for the
required conditions.
6. Assess the model fit: Check slides SPSS output
7. If the model fits the data, use the regression
equation to predict a particular value of the
dependent variable or estimate its mean (or both)

From simple linear regression to
multiple regression
• Simple linear regression
Education
Income

17.40
Multiple Regression…
The simple linear regression model was used to
analyze how one interval variable (the dependent
variable y) is related to one other interval variable (the
independent variable x).
Multiple regression allows for any number of
independent variables.
We expect to develop models that fit the data better
than would a simple linear regression model.

Multiple regression
Variable A
Variable D
Variable B
Variable C

Multiple regression
Age
Income
Education
Number of
Family member
earn money
Number of
Children
Year
With current
employer
Occupation
Prestige score
Work hours

Example: GSS2008
• How is income affected by
▫ Age (AGE)
▫ Education (EDUC)
▫ Work hours (HRS)
▫ Spouse work hours (SPHRS)
▫ Occupation prestige score (PRESTG80)
▫ Number of children (CHILDS)
▫ Number of family members earn money (EARNS)
▫ Years with current employer (CUREMPYR)

17.44
The Model…
We now assume we have k independent variables potentially
related to the one dependent variable. This relationship is
represented in this first order linear equation:
In the one variable, two dimensional case we drew a regression
line; here we imagine a response surface.
error variable
dependent
variable
independent variables
coefficients

17.45
Estimating the Coefficients…
The sample regression equation is expressed as:
We will use computer output to:
Assess the model…
How well it fits the data?
Is it useful?
Are any required conditions violated?
Employ the model…
Interpreting the coefficients
Predictions using the regression model.

17.46
Regression Analysis Steps…
u Use a computer and software to generate the
coefficients and the statistics used to assess the model.
v Diagnose violations of required conditions. If there
are problems, attempt to remedy them.
w Assess the model’s fit.
coefficient of determination,
F-test of the analysis of variance.
x If u, v, and w are OK, use the model for prediction.

17.47
Transformation…
Can we transform this data into a mathematical
model that looks like this:
income
education Year with current employ…age

17.48
Using SPSS
• Analyze > Regression > Linear

Using SPSS
• Dependent/Independent

The mathematical model
ŷ= -51785.243 +460.87 x1+4100.9 x2+…+329.771 x8

17.52
The Model…
Although we haven’t done any assessment of the model yet,
at first pass:
ŷ= -51785.243 +460.87 x1+4100.9 x2+ 620 x3-862.201 x4…+329.771 x8
it suggests that increases in AGE, EDUC, HRS,
PRESTG80, EARNRS, CUREMPYR, will positively
impact the income.
Likewise, increases in the SPHRS, CHILDS will
negatively impact the operating margin…
INTERPRET

17.53
Model Assessment…
We will assess the model in two ways:
Coefficient of determination, and
F-test of the analysis of variance.

17.54
• Again, the coefficient of determination is defined
as:
This means that 33.7% of the variation in income is
explained by the six independent variables, but
66.3% remains unexplained.

17.55
Adjusted R2 value…
The adjusted” R2 is:
the coefficient of determination adjusted
for the number of explanatory variables.
It takes into account the sample size n, and k, the
number of independent variables, and is given by:

17.56
Testing the Validity of the Model…
In a multiple regression model (i.e. more than one
independent variable), we utilize an analysis of
variance technique to test the overall validity of the
model. Here’s the idea:
H0:
H1: At least one is not equal to zero.
If the null hypothesis is true, none of the independent
variables is linearly related to y, and so the model is
invalid.
If at least one is not equal to 0, the model does have
some validity.

17.57
ANOVA table for regression analysis…
Source of
Variation
degrees of
freedom
Sums of
Squares
Mean Squares F-Statistic
Regression k SSR MSR = SSR/k F=MSR/MSE
Error n–k–1 SSE MSE = SSE/(n–k-1)
Total n–1
A large value of F indicates that most of the variation in y is explained by
the regression equation and that the model is valid. A small value of F
indicates that most of the variation in y is unexplained.

P<.o5, at least one is not 0,
Reject H0, accept H1
the the model is valid

17.59
Interpreting the Coefficients*
Intercept (b0) -51785.243 • This is the average income when
all of the independent variables are zero. It’s meaningless to try
and interpret this value, particularly if 0 is outside the range of
the values of the independent variables (as is the case here).
Age (b1) 460.87 • Each 1 year increase in age will increase
$460.87 in the income.
Education (b2) 4100.9• For each additional year of
education, the annual income will increase $4100.9.
Hours of work (b3) 620 • each additional hour of work per
week, the annual income will increase $620.
*in each case we assume all other variables are held constant…

17.60
Interpreting the Coefficients*
Spouse hours of work (b4) -862.201• For each additional
hour the spouse work per week, the average annual income will
decrease $862.201 .
Occupation Prestige Score (b5) 641• For each additional
unit of score, the average annual income increases by $641
Number of Children (b6) -331 • For each additional child,
the average income decrease by -331
Number of family members earn money (b7) 687 • For
each additional family member earn money, the income
increase by $687
Number of years with current job (b8) 330• For each
additional year with current job, the income increase by
$330.
*in each case we assume all other variables are held constant…

17.61
Testing the Coefficients…
For each independent variable, we can test to
determine whether there is enough evidence of a linear
relationship between it and the dependent variable for
the entire population…
H0: = 0
H1: ≠ 0
(for i = 1, 2, …, k) and using:
as our test statistic (with n–k–1 degrees of freedom).

17.62
Testing the Coefficients
We can use SPSS output to quickly test each of the
8 coefficients in our model…
Thus, EDUC, HRS, SPHRS, PRESTG80, are linearly related to the
operating margin. There is no evidence to infer that AGE, CHILDS,
EARNS, CUREMPYR are linearly related to operating margin.

17.63
Using the Regression Equation
Much like we did with simple linear regression, we
can produce a prediction interval for a
particular value of y.
As well, we can produce the confidence interval
estimate of the expected value of y.

17.64
Using the Regression Equation
Exercise GSS2008:
We add one row (our given values for the independent
variables) to the bottom of our data set, please produce
▫ prediction interval
▫ confidence interval estimate
For the dependent variable y.

17.65
Regression Diagnostics I
Exercise GSS2008
• Calculate the residuals and check the following:
▫ Is the error variable nonnormal?
▫ Perform a normality test
• Is the error variance constant?
▫ Plot the residuals versus the predicted values of y.
• Are the errors independent (time-series data)?
▫ Plot the residuals versus the time periods.
• Are there observations that are inaccurate or do
not belong to the target population?
▫ Double-check the accuracy of outliers and influential
observations.

17.66
Regression Diagnostics II
• Multiple regression models have a problem that
simple regressions do not, namely
multicollinearity.
• It happens when the independent variables
are highly correlated.
• We’ll explore this concept through the following
example…

17.67
Example GSS2008
• AGE and CUREMPYR are not significant
predictor for INCOME in multiple regression
model, but when we run correlation between
AGE and INCOME, CUREMPYR and INCOME.
They are both significantly correlated.
• How to account for this apparent contradiction?
• The answer is that the AGE and CUREMPYR are
correlated, all three independent variables
are correlated with each other !
• The is the problem of multicollinearity.

How to deal with multicollinearity
problem
• Multicollinearity exits in virtually all multiple
regression models.
• To minimize the effect:
▫ Try to include independent variables that are
independent of each other.
▫ Develop a model that has a theoretical basis and
include IVs that are necessary.

17.71
Regression Diagnostics III – Time Series
• The Durbin-Watson test allows us to determine
whether there is evidence of first-order
autocorrelation — a condition in which a
relationship exists between consecutive
residuals, i.e. ei-1 and ei (i is the time period). The
statistic for this test is defined as:
• d has a range of values: 0 ≤ d ≤ 4.

17.72
Durbin–Watson (two-tail test)
• To test for first-order autocorrelation:
• If d < dL or d > 4 – dL , first-order
autocorrelation exists.
• If d falls between dL and dU or between 4 – dU
and 4 – dU , the test is inconclusive.
• If d falls between dU and 4 – dU there is no
evidence of first order autocorrelation.
4-dU 4-dL
exists existsinconclusive
dUdL 2 40
inconclusive doesn’t exist

17.73
Example 17.1 Xm17-01
Can we create a model that will predict lift ticket
sales at a ski hill based on two weather
parameters?
Variables:
y - lift ticket sales during Christmas week,
x1 - total snowfall (inches), and
x2 - average temperature (degrees Fahrenheit)
Our ski hill manager collected 20 years of data.

17.74
Example 17.1
Both the coefficient of determination
and the p-value of the F-test indicate
the model is poor…
Neither variable is linearly related
to ticket sale…

17.75
Example 17.1
• The histogram of residuals…
• reveals the errors may be normally distributed…

17.76
Example 17.1
• In the plot of residuals versus predicted values
(testing for heteroscedasticity) — the error
variance appears to be constant…

17.77
Example 17.1 Durbin-Watson
• Apply the Durbin-Watson Statistic from to the entire list of
residuals.
• Regression>Linear>Statistics>check Durbin-Watson

17.78
Example 17.1
To test for first-order autocorrelation with α = .05, we
find in Table 8(a) in Appendix B
dL = 1.10 and dU = 1.54
The null and alternative hypotheses are
H0 : There is no first-order autocorrelation.
H1 : There is first-order autocorrelation.
The rejection region includes d < dL = 1.10. Since d =
.593, we reject the null hypothesis and conclude that
there is enough evidence to infer that first-order
autocorrelation exists.

17.79
Example 17.1
Autocorrelation usually indicates that the model needs to
include an independent variable that has a time-ordered
effect on the dependent variable.
The simplest such independent variable represents the
time periods. We included a third independent variable
that records the number of years since the year the data
were gathered. Thus, x3 = 1, 2,..., 20. The new model is
y = β0 + β1x1 + β2x2 + β3x3 + ε

17.80
Example 17.1
The fit of the model is high,
The model is valid…
Snowfall and time are linearly related to
ticket sales; temperature is not…
our new
variable
dL = 1.10 and dU = 1.54
dU <d<4- dU, first-order
autocorrelation doesn't exit

17.81
Example 17.1
• The Durbin-Watson statistic against the residuals
from our Regression analysis is equal to 1.885.
• we can conclude that there is not enough evidence
to infer the presence of first-order
autocorrelation. (Determining dL is left as an
exercise for the reader…)
• Hence, we have improved out model dramatically!

17.82
Example 17.1
Notice that the model is improved dramatically.
The F-test tells us that the model is valid. The t-tests tell us that
both the amount of snowfall and time are significantly linearly
related to the number of lift tickets.
This information could prove useful in advertising for the resort.
For example, if there has been a recent snowfall, the resort
could emphasize that in its advertising.
If no new snow has fallen, it may emphasize their snow-making
facilities.

18.83
Model Selection
Regression analysis can also be used for:
• non-linear (polynomial) models, and
• models that include nominal independent
variables.

18.84
Polynomial Models
Previously we looked at this multiple regression
model:
(its considered linear or first-order since the
exponent on each of the xi’s is 1)
The independent variables may be functions of a
smaller number of predictor variables; polynomial
models fall into this category. If there is one
predictor value (x) we have:

18.85
Polynomial Models
u
v
Technically, equation vis a multiple regression model
with p independent variables (x1, x2, …, xp). Since x1 =
x, x2 = x2, x3 = x3, …, xp = xp, its based on one predictor
value (x).
p is the order of the equation; we’ll focus equations of
order p = 1, 2, and 3.

18.86
First Order Model
When p = 1, we have our simple linear regression model:
That is, we believe there is a straight-line relationship
between the dependent and independent variables over the
range of the values of x:

18.87
Second Order Model
When p = 2, the polynomial model is a parabola:

18.88
Third Order Model
When p = 3, our third order model looks like:

18.89
Polynomial Models: 2 Predictor
Variables
Perhaps we suspect that there are two predictor
variables (x1 & x2) which influence the dependent
variable:
First order model (no interaction):
First order model (with interaction):

18.90
Polynomial Models: 2 Predictor Variables
First order models, 2 predictors, without & with interaction:

18.91
Polynomial Models: 2 Predictor Variables
If we believe that a quadratic relationship exists between y
and each of x1 and x2, and that the predictor variables
interact in their effect on y, we can use this model:
Second order model (in two variables) WITH interaction:

18.92
Polynomial Models: 2 Predictor
Variables
2nd order models, 2 predictors, without & with interaction:

18.93
Selecting a Model
One predictor variable, or two (or more)?
First order? Second order? Higher order?
With interaction? Without?
How do we choose the right model??
Use our knowledge of the variables involved to
build an initial model.
Test that model using statistical techniques.
If required, modify our model and re-test…

18.94
Example 18.1
We’ve been asked to come up with a regression model
for a fast food restaurant. We know our primary
market is middle-income adults and their children,
particularly those between the ages of 5 and 12.
Dependent variable —restaurant revenue (gross or net)
Predictor variables — family income, age of children
Is the relationship first order? quadratic?…

18.95
Example 18.1
The relationship between the dependent variable (revenue)
and each predictor variable is probably quadratic.
Members of low or high income households are less likely to eat at this chain’s
restaurants, since the restaurants attract mostly middle-income customers.
Neighborhoods where the mean age of children is either quite low or quite high
are also less likely to eat there vs. the families with children in the 5-to-12 year
range.
Seems reasonable?

18.96
Example 18.1
Should we include the interaction term in our model?
When in doubt, it is probably best to include
it.
Our model then, is:
Where y = annual gross sales
x1 = median annual household income*
x2 = mean age of children*
*in the neighborhood

18.97
Example 18.2 Xm18-02
Our fast food restaurant research department
selected 25 locations at random and gathered data
on revenues, household income, and ages of
neighborhood children.
Collected Data Calculated Data

18.98
Example 18.2
You can take the original data collected (revenues,
household income, and age) and plot y vs. x1 and y
vs. x2 to get a feel for the data; trend lines were
added for clarity…

18.99
Example 18.2
Checking the regression tool’s output…
The model fits the data well
and its valid…
INTERPRET

18.100
Nominal Independent Variables
Thus far in our regression analysis, we’ve only
considered variables that are interval. Often
however, we need to consider nominal data in
our analysis.
For example, our earlier example regarding the
market for used cars focused only on mileage.
Perhaps color is an important factor. How can we
model this new variable?

18.101
Indicator Variables
An indicator variable (also called a dummy
variable) is a variable that can assume either one
of only two values (usually 0 and 1).
A value of 1 usually indicates the existence of a certain
condition, while a value of 0 usually indicates that the
condition does not hold.
I1 =
I2 =
0 if color not white
1 if color is white
0 if color not silver
1 if color is silver
Car Color I1 I2
white 1 0
silver 0 1
other 0 0
two tone! 1 1
to represent m categories…
we need m–1 indicator variables

18.102
Interpreting Indicator Variable Coefficients
After performing our regression analysis:
we have this regression equation…
Thus, the price diminishes with additional mileage (x)
a white car sells for $91.10 more than other colors (I1)
a silver car fetches $330.40 more than other colors (I2)

18.104
To test the coefficient of I1, we use these
hypotheses…
H0: = 0
H1: ≠ 0
There is insufficient evidence to infer that in the
population of 3-year-old white Tauruses with the same
odometer reading have a different selling price than
do Tauruses in the “other” color category…

18.105
To test the coefficient of I2, we use these
hypotheses…
H0: = 0
H1: ≠ 0
We can conclude that there are differences in
auction selling prices between all 3-year-old
silver-colored Tauruses and the “other” color
category with the same odometer readings

Stepwise Regression
• Stepwise Regression is an iterative procedure
that adds and deletes one independent variable
at a time. The decision to add or delete a variable
is made on the basis of whether that variable
improves the model.
• It is a procedure that can eliminate correlated
independent variables.

Step 1: do simultaneous regression and
rank all the significant variables
No.1
No.4
No.2
No.3

Step 2
• Analyze
• Regression
• Linear
• Stepwise
• Dependent variable
• Independent variables (1st round: the top
predictor; 2nd round: the top predictor & the 2nd
top predictor…until the nth round; n = number
of predictors
• Statistics
• R square change & Descriptives

• Stepwise output
• What to read?
• R2 , R2 change, F of R2 change, significance level
of F of R2 change in each round

• The regression equation
• Simulaneous: ŷ= −51785.243 +460.87 AGE+4100.9 EDUC+ 620 HRS−862.201
SPHRS… … +329.771 CUREMPRY
• Stepwise: ŷ= -44703.12 +3944.7 EDUS-617.37SPHRS+526.493PRESTG80+956.933HRS

Multiple regression
• Multiple regression examines the predictability
of a set of predictors on a dependent variable
(criterion)
• Why don’t we just throw in all the predictors and
let the MR determine which ones are good
predictors then?
• Reason 1: Theoretical consideration
• Reason 2: Concern of sample size

Concern of sample size
• The desired level is 20 observations for each
independent variable
• For instance, if you have 6 predictors, you’ve got
to have at least 120 subjects in your data
• However, if a stepwise procedure is employed,
the recommended level increases to 50 to 1
• That is, you’ve got to have at least 300 subjects
in order to run stepwise MR

18.114
Model Building
Here is a procedure for building a regression model:
uIdentify the dependent variable; what is it we
wish to predict? Don’t forget the variable’s unit of
measure.
vList potential predictors; how would changes in
predictors change the dependent variable? Be
selective; go with the fewest independent variables
required. Be aware of the effects of multicollinearity.
w Gather the data; at least six? observations for
each independent variable used in the equation.

18.115
Model Building
x Identify several possible models; formulate
first- and second- order models with and without
interaction. Draw scatter diagrams.
y Use statistical software to estimate the
models.
z Determine whether the required conditions
are satisfied; if not, attempt to correct the problem.
{ Use your judgment and the statistical output
to select the best model!

Quantitative Research Methods Lecture Regression Diagnostics

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Quantitative Research Methods Lecture Regression Diagnostics

Similar to Quantitative Research Methods Lecture Regression Diagnostics (20)

More from Penny Jiang

More from Penny Jiang (13)

Recently uploaded

Recently uploaded (20)

Quantitative Research Methods Lecture Regression Diagnostics