Quantitative Research Methods
Lecture 9 Model Building
1. Regression Diagnostics I
2. Regression Diagnostics II: Multicollinearity
3. Regression Diagnostics III: Time series
4. Polynomial Models
5. Nominal variable in Multiple Regression
6. Stepwise Multiple Regression
Statistical analyses
• Group differences (nominal variable) on one interval variable:
▫ T-tests (2 groups)
▫ ANOVA (3 or more groups)
- One factor: one-way ANOVA
- Two factors: two-way/factor ANOVA
• The relationship between two nominal variables:
▫ Chi-square test
• The relationship between two interval variables:
▫ Correlation, simple linear regression
• The relationship of multiple interval variables to one
interval variable:
▫ Multiple regression
• The relationship of multiple interval variables to one
nominal variable (yes/no):
▫ Logistic regression
Regression
• Simple Linear Regression (interval)
▫ one independent, one dependent
• Multiple Regression (all interval)
▫ multiple independent, one dependent
• Logistic Regression
▫ multiple interval independent, one nominal
dependent (Yes/No)
▫ Check example: https://youtu.be/H_48AcV0qlY
16.4
Simple Linear Regression Model…
A straight-line model with one independent
variable is called a simple linear regression
model. It is written as:

y = β0 + β1x + ε

where y is the dependent variable, x the
independent variable, β0 the y-intercept, β1 the
slope of the line, and ε the error variable.
16.5
Simple Linear Regression Model…
Note that both β0 and β1 are population
parameters which are usually unknown and
hence estimated from the data.
[Figure: the regression line in the (x, y) plane; β1 = slope (rise/run), β0 = y-intercept.]
16.6
Estimating the Coefficients…
In much the same way we base estimates of µ on x̄, we
estimate β0 using b0 and β1 using b1, the y-intercept
and slope (respectively) of the least squares or
regression line given by:

ŷ = b0 + b1x

(Recall: this is an application of the least squares
method, and it produces the straight line that
minimizes the sum of the squared differences
between the points and the line.)
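Below is a minimal Python sketch of these computations. The ten (x, y) pairs are made up for illustration (odometer in 1,000s of miles, price in $1,000s); they are not the Xm16-02 data.

import numpy as np

# made-up data: x = odometer (1,000s of miles), y = price ($1,000s)
x = np.array([37.4, 44.8, 45.8, 30.9, 31.7, 34.0, 45.9, 19.1, 40.1, 40.2])
y = np.array([14.6, 14.1, 14.0, 15.6, 15.6, 14.7, 14.5, 15.7, 15.1, 14.8])

# least squares estimates of the slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)   # the differences the next slide calls residuals
sse = np.sum(residuals ** 2)    # the quantity least squares minimizes
print(b0, b1, sse)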
16.7
Least Squares Line…
these differences are
called residuals
Example 16.1
16.8
Example 16.2…
Car dealers across North America use the "Red Book" to
help them determine the value of used cars that their
customers trade in when purchasing new cars.
The book, which is published monthly, lists the trade-in
values for all basic models of cars.
It provides alternative values for each car model according
to its condition and optional features.
The values are determined on the basis of the average paid
at recent used-car auctions, the source of supply for many
used-car dealers.
16.9
Example 16.2…
However, the Red Book does not indicate the value
determined by the odometer reading, despite the fact that a
critical factor for used-car buyers is how far the car has
been driven.
To examine this issue, a used-car dealer randomly selected
100 three-year old Toyota Camrys that were sold at auction
during the past month.
The dealer recorded the price ($1,000) and the number of
miles (thousands) on the odometer. (Xm16-02).
The dealer wants to find the regression line.
16.10
Using SPSS
Analyze > Regression > Linear
Simple Linear Regression
SPSS Steps: Analyze > Regression > Linear
16.11
SPSS Output: check three tables:
▫ Model Summary: R2, the strength of the linear relationship
▫ ANOVA: model significance/fit
▫ Coefficients: b0 and b1
16.12
Example 16.2…
As you might expect with used cars…
The slope coefficient, b1, is –0.0669, that is, each
additional mile on the odometer decreases the price by
$.0669 or 6.69¢
The intercept, b0, is 17.250 (in $1,000s). One
interpretation would be that when x = 0 (no miles on
the car) the selling price is $17,250. However, we have
no data for cars with fewer than 19,100 miles on them,
so this isn't a correct assessment.
16.13
Testing the Slope…
If no linear relationship exists between the two
variables, we would expect the regression line to be
horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e.
we want to see if the slope (β1) is something other
than zero. Our research hypothesis becomes:
H1: β1 ≠ 0
Thus the null hypothesis becomes:
H0: β1 = 0
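A small sketch of this test, reusing the made-up x and y arrays from the earlier sketch; statsmodels reports the t statistic and p-value for H0: β1 = 0 directly.

import statsmodels.api as sm

X = sm.add_constant(x)                      # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.params)                         # b0 and b1
print(model.tvalues[1], model.pvalues[1])   # t statistic and p-value for the slope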
16.14
Coefficient of Determination…
Tests thus far have shown if a linear relationship
exists; it is also useful to measure the strength
of the relationship. This is done by calculating
the coefficient of determination – R2.
The coefficient of determination is the square of
the coefficient of correlation (r), hence R2 = (r)2
16.15
Coefficient of Determination…
As we did with analysis of variance, we can partition
the variation in y into two parts:
Variation in y = SSE + SSR
SSE – Sum of Squares Error – measures the amount of
variation in y that remains unexplained (i.e. due to
error)
SSR – Sum of Squares Regression – measures the
amount of variation in y explained by variation in the
independent variable x.
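Continuing the same sketch, the partition can be verified numerically; sse comes from the first sketch and model from the second.

sst = np.sum((y - y.mean()) ** 2)   # total variation in y
ssr = sst - sse                     # variation explained by the regression
r_squared = ssr / sst               # coefficient of determination
print(r_squared, model.rsquared)    # the two values agree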
16.16
Coefficient of Determination
R2 has a value of .6483. This means 64.83% of the variation
in the auction selling prices (y) is explained by the variation
in the odometer readings (x). The remaining 35.17% is
unexplained, i.e. due to error.
Unlike the value of a test statistic, the coefficient of
determination does not have a critical value that
enables us to draw conclusions.
In general the higher the value of R2, the better the model
fits the data.
R2 = 1: perfect match between the line and the data points.
R2 = 0: no linear relationship between x and y.
16.17
Using the Regression Equation…
We could use our regression equation:
ŷ = 17.250 − .0669x
to predict the selling price of a car with 40 (,000) miles
on it:
ŷ = 17.250 − .0669(40) = 14.574
We call this value ($14,574) a point prediction.
Chances are though the actual selling price will be
different, hence we can estimate the selling price in terms
of an interval.
16.18
Prediction Interval
The prediction interval is used when we want to
predict one particular value of the dependent
variable, given a specific value of the independent
variable:

ŷ ± t(α/2, n−2) · sε · √[1 + 1/n + (xg − x̄)² / Σ(xi − x̄)²]

(xg is the given value of x we're interested in)
16.19
Confidence Interval Estimator…
…of the expected value of y. In this case, we are
estimating the mean of y given a value of x:

ŷ ± t(α/2, n−2) · sε · √[1/n + (xg − x̄)² / Σ(xi − x̄)²]

(Technically this formula is used for infinitely large
populations. However, we can interpret our
problem as attempting to determine the average
selling price of all Toyota Camrys, all with 40,000
miles on the odometer.)
16.20
What's the Difference?
The prediction interval formula contains an extra "1" under the square root; the
confidence interval formula does not.
The confidence interval estimate of the expected value of y will be narrower than
the prediction interval for the same given value of x and confidence level. This is
because there is less error in estimating a mean value as opposed to predicting an
individual value.
Prediction Interval: used to estimate one particular value of y (at a given x)
Confidence Interval: used to estimate the mean value of y (at a given x)
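A sketch of both intervals with statsmodels, reusing the fitted model from the earlier sketches; the exog row [1.0, 40.0] supplies the intercept term and xg = 40.

pred = model.get_prediction([[1.0, 40.0]])
frame = pred.summary_frame(alpha=0.05)
print(frame[['mean_ci_lower', 'mean_ci_upper']])   # confidence interval for E(y | x = 40)
print(frame[['obs_ci_lower', 'obs_ci_upper']])     # (wider) prediction interval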
16.21
Intervals with SPSS
Output
17.22
16.23
Regression Diagnostics…
There are three conditions that are required in order to
perform a regression analysis. These are:
• The error variable must be normally distributed,
• The error variable must have a constant variance,
• The errors must be independent of each other.
How can we diagnose violations of these conditions?
▫ Residual analysis, that is, examine the
differences between the actual data points and those
predicted by the linear equation…
16.24
Nonnormality…
We can take the residuals and put them into a histogram
to visually check for normality…
…we're looking for a bell-shaped histogram with the
mean close to zero.
SPSS: Regression > Linear > Save > check Residuals >
unstandardized & standardized
SPSS: Test of normality:
Analyze > Descriptive Statistics > Explore > Plots
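Outside SPSS, the same check can be sketched in Python with the residuals from the earlier fit; the Shapiro-Wilk test is one common normality test.

from scipy import stats
import matplotlib.pyplot as plt

stat, p = stats.shapiro(model.resid)   # H0: the residuals are normally distributed
print(p)                               # a large p-value gives no evidence against normality

plt.hist(model.resid, bins=10)         # look for a bell shape centred near zero
plt.show()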
Example 16.2
16.28
Heteroscedasticity…
When the requirement of a constant variance is violated,
we have a condition of heteroscedasticity.
We can diagnose heteroscedasticity by plotting the
residual against the predicted y.
16.29
Heteroscedasticity…
If the variance of the error variable (σε²) is not constant,
then we have "heteroscedasticity". Here's the plot of
the residual against the predicted value of y:
there doesn't appear to be a
change in the spread of the
plotted points, therefore no
heteroscedasticity
SPSS: Regression > Linear > Save > check Predicted Values &
Residuals
SPSS: Graphs > Scatter > y: Residual; x:
Predicted Price
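The same plot can be sketched in Python with the earlier fit; a funnel or fan shape in this plot would suggest heteroscedasticity.

plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0)
plt.xlabel('Predicted price')
plt.ylabel('Residual')
plt.show()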
16.32
Nonindependence of the Error Variable
If we were to observe the auction price of cars every week
for, say, a year, that would constitute a time series.
When the data are time series, the errors often are
correlated.
Error terms that are correlated over time are said to be
autocorrelated or serially correlated.
We can often detect autocorrelation by graphing the
residuals against the time periods. If a pattern
emerges, it is likely that the independence requirement is
violated.
16.33
Nonindependence of the Error Variable
Patterns in the appearance of the residuals over time
indicate that autocorrelation exists:
Note the runs of positive residuals,
replaced by runs of negative residuals
Note the oscillating behavior of the
residuals around zero.
The Durbin-Watson test is one way to test for autocorrelation.
16.34
Outliers…
An outlier is an observation that is unusually
small or unusually large.
E.g. our used car example had odometer readings
from 19.1 to 49.2 thousand miles. Suppose we
have a value of only 5,000 miles (i.e. a car driven
by an old person only on Sundays); this point
is an outlier.
16.35
Outliers…
Possible reasons for the existence of outliers include:
▫ There was an error in recording the value
▫ The point should not have been included in the sample
▫ Perhaps the observation is indeed valid.
Outliers can be easily identified from a scatter plot.
If the absolute value of the standardized residual is > 2, we
suspect the point may be an outlier and investigate further.
They need to be dealt with since they can easily
influence the least squares line…
Example 16.2
SPSS: Graphs > Scatter > x: odometer; y: price
Procedure for Regression Diagnostics
1. Develop a model that has a theoretical basis; that
is, for the dependent variable in question, find an
independent variable that you believe is linearly
related to it.
2. Gather data for the two variables.
3. Draw the scatter diagram to determine whether a
linear model appears to be appropriate. Identify
possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the
required conditions (see the diagnostics slides).
6. Assess the model's fit: check the SPSS output.
7. If the model fits the data, use the regression
equation to predict a particular value of the
dependent variable or estimate its mean (or both)
From simple linear regression to
multiple regression
• Simple linear regression: one predictor, one
outcome, e.g. Education → Income
17.40
Multiple Regression…
The simple linear regression model was used to
analyze how one interval variable (the dependent
variable y) is related to one other interval variable (the
independent variable x).
Multiple regression allows for any number of
independent variables.
We expect to develop models that fit the data better
than would a simple linear regression model.
Multiple regression
Variables A, B, and C jointly predict Variable D.
Multiple regression
Example: Age, Education, Work hours, Number of
children, Number of family members who earn money,
Years with current employer, and Occupation prestige
score jointly predict Income.
Example: GSS2008
• How is income affected by
▫ Age (AGE)
▫ Education (EDUC)
▫ Work hours (HRS)
▫ Spouse work hours (SPHRS)
▫ Occupation prestige score (PRESTG80)
▫ Number of children (CHILDS)
▫ Number of family members who earn money (EARNRS)
▫ Years with current employer (CUREMPYR)
(A Python sketch of this regression follows below.)
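A hedged sketch of this model in Python. The column names mirror the slide, but gss2008.csv is an assumed file name and INCOME an assumed income column; the real GSS file would need its own preparation.

import pandas as pd
import statsmodels.formula.api as smf

gss = pd.read_csv('gss2008.csv')   # assumed file with the columns below
fit = smf.ols('INCOME ~ AGE + EDUC + HRS + SPHRS + PRESTG80 + CHILDS'
              ' + EARNRS + CUREMPYR', data=gss).fit()
print(fit.summary())               # coefficients, t-tests, R2, and the F-test in one table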
17.44
The Model…
We now assume we have k independent variables potentially
related to the one dependent variable. This relationship is
represented in this first-order linear equation:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

where y is the dependent variable, x1, …, xk the
independent variables, β0, …, βk the coefficients, and ε
the error variable.
In the one-variable, two-dimensional case we drew a regression
line; here we imagine a response surface.
17.45
Estimating the Coefficients…
The sample regression equation is expressed as:

ŷ = b0 + b1x1 + b2x2 + … + bkxk
We will use computer output to:
Assess the model…
How well does it fit the data?
Is it useful?
Are any required conditions violated?
Employ the model…
Interpreting the coefficients
Predictions using the regression model.
17.46
Regression Analysis Steps…
1. Use a computer and software to generate the
coefficients and the statistics used to assess the model.
2. Diagnose violations of required conditions. If there
are problems, attempt to remedy them.
3. Assess the model's fit:
coefficient of determination,
F-test of the analysis of variance.
4. If steps 1, 2, and 3 are OK, use the model for prediction.
17.47
Transformation…
Can we transform this data into a mathematical
model that looks like this:
income = f(education, years with current employer, …, age)?
17.48
Using SPSS
• Analyze > Regression > Linear
Using SPSS
• Dependent/Independent
Output
The mathematical model
ŷ = −51785.243 + 460.87x1 + 4100.9x2 + … + 329.771x8
17.52
The Model…
Although we haven’t done any assessment of the model yet,
at first pass:
ŷ = −51785.243 + 460.87x1 + 4100.9x2 + 620x3 − 862.201x4 + … + 329.771x8
it suggests that increases in AGE, EDUC, HRS,
PRESTG80, EARNRS, and CUREMPYR will positively
impact income.
Likewise, increases in SPHRS and CHILDS will
negatively impact income…
INTERPRET
17.53
Model Assessment…
We will assess the model in two ways:
Coefficient of determination, and
F-test of the analysis of variance.
17.54
Coefficient of Determination…
• Again, the coefficient of determination is defined
as:

R² = SSR / (SSR + SSE)

This means that 33.7% of the variation in income is
explained by the eight independent variables, but
66.3% remains unexplained.
17.55
Adjusted R2 value…
The "adjusted" R2 is
the coefficient of determination adjusted
for the number of explanatory variables.
It takes into account the sample size n and k, the
number of independent variables, and is given by:

Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1)
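A one-line check of the formula against the fitted model from the earlier GSS sketch; statsmodels reports the same quantity as rsquared_adj.

n, k = fit.nobs, fit.df_model     # sample size and number of predictors
adj_r2 = 1 - (1 - fit.rsquared) * (n - 1) / (n - k - 1)
print(adj_r2, fit.rsquared_adj)   # the two values agree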
17.56
Testing the Validity of the Model…
In a multiple regression model (i.e. more than one
independent variable), we utilize an analysis of
variance technique to test the overall validity of the
model. Here’s the idea:
H0: β1 = β2 = … = βk = 0
H1: At least one βi is not equal to zero.
If the null hypothesis is true, none of the independent
variables is linearly related to y, and so the model is
invalid.
If at least one βi is not equal to 0, the model does have
some validity.
17.57
Testing the Validity of the Model…
ANOVA table for regression analysis…
Source of Variation | degrees of freedom | Sums of Squares | Mean Squares        | F-Statistic
Regression          | k                  | SSR             | MSR = SSR/k         | F = MSR/MSE
Error               | n − k − 1          | SSE             | MSE = SSE/(n−k−1)   |
Total               | n − 1              |                 |                     |
A large value of F indicates that most of the variation in y is explained by
the regression equation and that the model is valid. A small value of F
indicates that most of the variation in y is unexplained.
Testing the Validity of the Model…
p < .05: at least one βi is not 0;
reject H0, accept H1;
the model is valid.
17.59
Interpreting the Coefficients*
Intercept (b0) −51785.243 • This is the average income when
all of the independent variables are zero. It's meaningless to try
to interpret this value, particularly if 0 is outside the range of
the values of the independent variables (as is the case here).
Age (b1) 460.87 • Each one-year increase in age increases
annual income by $460.87.
Education (b2) 4100.9 • For each additional year of
education, annual income increases by $4,100.90.
Hours of work (b3) 620 • For each additional hour of work per
week, annual income increases by $620.
*in each case we assume all other variables are held constant…
17.60
Interpreting the Coefficients*
Spouse hours of work (b4) −862.201 • For each additional
hour the spouse works per week, average annual income
decreases by $862.20.
Occupation prestige score (b5) 641 • For each additional
unit of score, average annual income increases by $641.
Number of children (b6) −331 • For each additional child,
average income decreases by $331.
Number of family members who earn money (b7) 687 • For
each additional family member who earns money, income
increases by $687.
Number of years with current job (b8) 330 • For each
additional year with the current job, income increases by
$330.
*in each case we assume all other variables are held constant…
17.61
Testing the Coefficients…
For each independent variable, we can test to
determine whether there is enough evidence of a linear
relationship between it and the dependent variable for
the entire population…
H0: βi = 0
H1: βi ≠ 0
(for i = 1, 2, …, k) and using:

t = bi / s(bi)

as our test statistic (with n − k − 1 degrees of freedom).
17.62
Testing the Coefficients
We can use SPSS output to quickly test each of the
8 coefficients in our model…
Thus, EDUC, HRS, SPHRS, and PRESTG80 are linearly related to
income. There is no evidence to infer that AGE, CHILDS,
EARNRS, and CUREMPYR are linearly related to income.
17.63
Using the Regression Equation
Much like we did with simple linear regression, we
can produce a prediction interval for a
particular value of y.
As well, we can produce the confidence interval
estimate of the expected value of y.
17.64
Using the Regression Equation
Exercise GSS2008:
We add one row (our given values for the independent
variables) to the bottom of our data set, then produce
▫ a prediction interval
▫ a confidence interval estimate
for the dependent variable y.
17.65
Regression Diagnostics I
Exercise GSS2008
• Calculate the residuals and check the following:
▫ Is the error variable nonnormal?
▫ Perform a normality test
• Is the error variance constant?
▫ Plot the residuals versus the predicted values of y.
• Are the errors independent (time-series data)?
▫ Plot the residuals versus the time periods.
• Are there observations that are inaccurate or do
not belong to the target population?
▫ Double-check the accuracy of outliers and influential
observations.
17.66
Regression Diagnostics II
• Multiple regression models have a problem that
simple regressions do not, namely
multicollinearity.
• It happens when the independent variables
are highly correlated.
• We’ll explore this concept through the following
example…
17.67
Example GSS2008
• AGE and CUREMPYR are not significant
predictors of INCOME in the multiple regression
model, but when we run correlations between
AGE and INCOME and between CUREMPYR and
INCOME, both are significantly correlated.
• How do we account for this apparent contradiction?
• The answer is that AGE and CUREMPYR are
correlated with each other as well as with INCOME!
• This is the problem of multicollinearity.
Multiple Regression Output
How to deal with multicollinearity
problem
• Multicollinearity exists in virtually all multiple
regression models.
• To minimize the effect:
▫ Try to include independent variables that are
independent of each other.
▫ Develop a model that has a theoretical basis and
include only the IVs that are necessary.
(A sketch of a VIF check follows below.)
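A sketch of that check using variance inflation factors, reusing the fitted model from the GSS sketch; a common rule of thumb treats VIF above about 10 (sometimes 5) as a sign of troublesome multicollinearity.

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = fit.model.exog   # the design matrix, including the constant column
for j, name in enumerate(fit.model.exog_names):
    print(name, variance_inflation_factor(X, j))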
17.71
Regression Diagnostics III – Time Series
• The Durbin-Watson test allows us to determine
whether there is evidence of first-order
autocorrelation, a condition in which a
relationship exists between consecutive
residuals, i.e. ei−1 and ei (i is the time period). The
statistic for this test is defined as:

d = Σ(ei − ei−1)² / Σ ei²  (the numerator sums over i = 2, …, n; the denominator over i = 1, …, n)

• d has a range of values: 0 ≤ d ≤ 4.
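A sketch of the statistic, computed from the formula and via statsmodels; e is assumed to hold residuals in time order (for a cross-section like the GSS this is purely illustrative).

import numpy as np
from statsmodels.stats.stattools import durbin_watson

e = fit.resid.to_numpy()                       # residuals, assumed to be in time order
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # the Durbin-Watson formula
print(d, durbin_watson(e))                     # the two agree; values near 2 suggest no autocorrelation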
17.72
Durbin–Watson (two-tail test)
• To test for first-order autocorrelation:
• If d < dL or d > 4 – dL , first-order
autocorrelation exists.
• If d falls between dL and dU or between 4 − dU
and 4 − dL, the test is inconclusive.
• If d falls between dU and 4 − dU, there is no
evidence of first-order autocorrelation.
[Number line from 0 to 4: autocorrelation exists below dL and above 4 − dL; inconclusive between dL and dU and between 4 − dU and 4 − dL; no evidence of autocorrelation between dU and 4 − dU, around 2.]
17.73
Example 17.1 Xm17-01
Can we create a model that will predict lift ticket
sales at a ski hill based on two weather
parameters?
Variables:
y - lift ticket sales during Christmas week,
x1 - total snowfall (inches), and
x2 - average temperature (degrees Fahrenheit)
Our ski hill manager collected 20 years of data.
17.74
Example 17.1
Both the coefficient of determination
and the p-value of the F-test indicate
the model is poor…
Neither variable is linearly related
to ticket sales…
17.75
Example 17.1
• The histogram of residuals…
• reveals the errors may be normally distributed…
17.76
Example 17.1
• In the plot of residuals versus predicted values
(testing for heteroscedasticity) — the error
variance appears to be constant…
17.77
Example 17.1 Durbin-Watson
• Apply the Durbin-Watson statistic to the entire list of
residuals.
• Regression>Linear>Statistics>check Durbin-Watson
17.78
Example 17.1
To test for first-order autocorrelation with α = .05, we
find in Table 8(a) in Appendix B
dL = 1.10 and dU = 1.54
The null and alternative hypotheses are
H0 : There is no first-order autocorrelation.
H1 : There is first-order autocorrelation.
The rejection region includes d < dL = 1.10. Since d =
.593, we reject the null hypothesis and conclude that
there is enough evidence to infer that first-order
autocorrelation exists.
17.79
Example 17.1
Autocorrelation usually indicates that the model needs to
include an independent variable that has a time-ordered
effect on the dependent variable.
The simplest such independent variable represents the
time periods. We included a third independent variable
that records the number of years since the year the data
were gathered. Thus, x3 = 1, 2,..., 20. The new model is
y = β0 + β1x1 + β2x2 + β3x3 + ε
17.80
Example 17.1
The fit of the model is high,
and the model is valid…
Snowfall and time are linearly related to
ticket sales; temperature is not…
our new
variable
dL = 1.10 and dU = 1.54
dU < d < 4 − dU, so first-order
autocorrelation doesn't exist
17.81
Example 17.1
• The Durbin-Watson statistic against the residuals
from our Regression analysis is equal to 1.885.
• we can conclude that there is not enough evidence
to infer the presence of first-order
autocorrelation. (Determining dL is left as an
exercise for the reader…)
• Hence, we have improved our model dramatically!
17.82
Example 17.1
Notice that the model is improved dramatically.
The F-test tells us that the model is valid. The t-tests tell us that
both the amount of snowfall and time are significantly linearly
related to the number of lift tickets.
This information could prove useful in advertising for the resort.
For example, if there has been a recent snowfall, the resort
could emphasize that in its advertising.
If no new snow has fallen, it may emphasize its snow-making
facilities.
18.83
Model Selection
Regression analysis can also be used for:
• non-linear (polynomial) models, and
• models that include nominal independent
variables.
18.84
Polynomial Models
Previously we looked at this multiple regression
model:
(it's considered linear or first-order since the
exponent on each of the xi's is 1)
The independent variables may be functions of a
smaller number of predictor variables; polynomial
models fall into this category. If there is one
predictor variable (x) we have:
18.85
Polynomial Models
① y = β0 + β1x1 + β2x2 + … + βkxk + ε
② y = β0 + β1x + β2x² + … + βpx^p + ε
Technically, equation ② is a multiple regression model
with p independent variables (x1, x2, …, xp). Since x1 =
x, x2 = x², x3 = x³, …, xp = x^p, it's based on one predictor
variable (x).
p is the order of the equation; we'll focus on equations of
order p = 1, 2, and 3.
18.86
First Order Model
When p = 1, we have our simple linear regression model:

y = β0 + β1x + ε

That is, we believe there is a straight-line relationship
between the dependent and independent variables over the
range of the values of x.
18.87
Second Order Model
When p = 2, the polynomial model is a parabola:

y = β0 + β1x + β2x² + ε
18.88
Third Order Model
When p = 3, our third-order model looks like:

y = β0 + β1x + β2x² + β3x³ + ε
18.89
Polynomial Models: 2 Predictor
Variables
Perhaps we suspect that there are two predictor
variables (x1 & x2) which influence the dependent
variable:
First-order model (no interaction):
y = β0 + β1x1 + β2x2 + ε
First-order model (with interaction):
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
18.90
Polynomial Models: 2 Predictor Variables
First order models, 2 predictors, without & with interaction:
18.91
Polynomial Models: 2 Predictor Variables
If we believe that a quadratic relationship exists between y
and each of x1 and x2, and that the predictor variables
interact in their effect on y, we can use this model:
Second-order model (in two variables) WITH interaction:

y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
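A hedged sketch of fitting this model with the statsmodels formula API on synthetic data; I() wraps the squares so they are computed as arithmetic, and x1:x2 adds the interaction term.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(60, 10, 200),   # e.g. household income
                   'x2': rng.normal(8, 3, 200)})    # e.g. mean age of children
df['y'] = (20 - 0.01 * (df['x1'] - 60) ** 2
              - 0.3 * (df['x2'] - 8) ** 2
              + rng.normal(0, 1, 200))              # quadratic in both predictors

quad = smf.ols('y ~ x1 + x2 + I(x1**2) + I(x2**2) + x1:x2', data=df).fit()
print(quad.params)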
18.92
Polynomial Models: 2 Predictor
Variables
2nd order models, 2 predictors, without & with interaction:
18.93
Selecting a Model
One predictor variable, or two (or more)?
First order? Second order? Higher order?
With interaction? Without?
How do we choose the right model??
Use our knowledge of the variables involved to
build an initial model.
Test that model using statistical techniques.
If required, modify our model and re-test…
18.94
Example 18.1
We’ve been asked to come up with a regression model
for a fast food restaurant. We know our primary
market is middle-income adults and their children,
particularly those between the ages of 5 and 12.
Dependent variable —restaurant revenue (gross or net)
Predictor variables — family income, age of children
Is the relationship first order? quadratic?…
18.95
Example 18.1
The relationship between the dependent variable (revenue)
and each predictor variable is probably quadratic.
Members of low or high income households are less likely to eat at this chain’s
restaurants, since the restaurants attract mostly middle-income customers.
Neighborhoods where the mean age of children is either quite low or quite high
are also less likely to eat there, compared with families with children in the
5-to-12-year range.
Seems reasonable?
18.96
Example 18.1
Should we include the interaction term in our model?
When in doubt, it is probably best to include
it.
Our model, then, is:

y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
Where y = annual gross sales
x1 = median annual household income*
x2 = mean age of children*
*in the neighborhood
18.97
Example 18.2 Xm18-02
Our fast food restaurant research department
selected 25 locations at random and gathered data
on revenues, household income, and ages of
neighborhood children.
[Data table: collected values (revenue, income, age) and calculated values (their squares and interaction).]
18.98
Example 18.2
You can take the original data collected (revenues,
household income, and age) and plot y vs. x1 and y
vs. x2 to get a feel for the data; trend lines were
added for clarity…
18.99
Example 18.2
Checking the regression tool’s output…
The model fits the data well,
and it's valid…
INTERPRET
18.100
Nominal Independent Variables
Thus far in our regression analysis, we’ve only
considered variables that are interval. Often
however, we need to consider nominal data in
our analysis.
For example, our earlier example regarding the
market for used cars focused only on mileage.
Perhaps color is an important factor. How can we
model this new variable?
18.101
Indicator Variables
An indicator variable (also called a dummy
variable) is a variable that can assume either one
of only two values (usually 0 and 1).
A value of 1 usually indicates the existence of a certain
condition, while a value of 0 usually indicates that the
condition does not hold.
I1 = 1 if color is white; 0 if color is not white
I2 = 1 if color is silver; 0 if color is not silver

Car Color | I1 | I2
white     | 1  | 0
silver    | 0  | 1
other     | 0  | 0
two-tone! | 1  | 1

To represent m categories…
we need m − 1 indicator variables.
18.102
Interpreting Indicator Variable Coefficients
After performing our regression analysis,
we have this regression equation…
Thus, the price diminishes with additional mileage (x),
a white car sells for $91.10 more than other colors (I1),
and a silver car fetches $330.40 more than other colors (I2).
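A sketch of building the indicator variables and refitting; cars.csv with columns price, odometer, and color is a hypothetical file, not the textbook data set.

import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv('cars.csv')                         # hypothetical data file
cars['I1'] = (cars['color'] == 'white').astype(int)
cars['I2'] = (cars['color'] == 'silver').astype(int)   # 'other' is the 0/0 baseline

fit = smf.ols('price ~ odometer + I1 + I2', data=cars).fit()
print(fit.params)   # the I1/I2 coefficients are premiums over the 'other' category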
18.103
Graphically
18.104
Testing the Coefficients
To test the coefficient of I1, we use these
hypotheses…
H0: β2 = 0
H1: β2 ≠ 0
There is insufficient evidence to infer that, in the
population, 3-year-old white Tauruses with the same
odometer reading have a different selling price than
Tauruses in the "other" color category…
18.105
Testing the Coefficients
To test the coefficient of I2, we use these
hypotheses…
H0: β3 = 0
H1: β3 ≠ 0
We can conclude that there are differences in
auction selling prices between 3-year-old
silver-colored Tauruses and the "other" color
category with the same odometer readings.
Stepwise Regression
• Stepwise Regression is an iterative procedure
that adds and deletes one independent variable
at a time. The decision to add or delete a variable
is made on the basis of whether that variable
improves the model.
• It is a procedure that can eliminate correlated
independent variables.
Step 1: run a simultaneous regression and
rank all the significant variables
(No. 1, No. 2, No. 3, No. 4, …)
Step 2
• Analyze
• Regression
• Linear
• Stepwise
• Dependent variable
• Independent variables (1st round: the top
predictor; 2nd round: the top predictor & the 2nd-
best predictor; … until the nth round, where n = number
of predictors)
• Statistics
• R square change & Descriptives
• Stepwise output
• What to read?
• R2 , R2 change, F of R2 change, significance level
of F of R2 change in each round
Stepwise output
• The regression equation
• Simultaneous: ŷ = −51785.243 + 460.87 AGE + 4100.9 EDUC + 620 HRS − 862.201
SPHRS + … + 329.771 CUREMPYR
• Stepwise: ŷ = −44703.12 + 3944.7 EDUC − 617.37 SPHRS + 526.493 PRESTG80 + 956.933 HRS
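A hedged sketch of a forward-stepwise pass in the spirit of SPSS's procedure: each round adds the candidate whose term is significant and yields the largest R2, stopping when nothing qualifies. This is an illustration, not SPSS's exact algorithm (which also re-tests entered variables for removal).

import statsmodels.formula.api as smf

def forward_stepwise(data, dv, candidates, alpha=0.05):
    selected = []
    while candidates:
        best = None
        for c in candidates:
            formula = f'{dv} ~ ' + ' + '.join(selected + [c])
            fit = smf.ols(formula, data=data).fit()
            if fit.pvalues[c] < alpha and (best is None or fit.rsquared > best[1]):
                best = (c, fit.rsquared)
        if best is None:
            break                     # no remaining candidate improves the model
        selected.append(best[0])
        candidates.remove(best[0])
    return selected

# e.g. forward_stepwise(gss, 'INCOME', ['AGE', 'EDUC', 'HRS', 'SPHRS',
#                                       'PRESTG80', 'CHILDS', 'EARNRS', 'CUREMPYR'])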
Multiple regression
• Multiple regression examines the predictability
of a set of predictors on a dependent variable
(criterion)
• Why don’t we just throw in all the predictors and
let the MR determine which ones are good
predictors then?
• Reason 1: Theoretical consideration
• Reason 2: Concern of sample size
Concern of sample size
• The desired level is 20 observations for each
independent variable.
• For instance, if you have 6 predictors, you've got
to have at least 120 subjects in your data.
• However, if a stepwise procedure is employed,
the recommended level increases to 50 to 1.
• That is, with 6 predictors you've got to have at
least 300 subjects in order to run stepwise MR.
18.114
Model Building
Here is a procedure for building a regression model:
1. Identify the dependent variable; what is it we
wish to predict? Don't forget the variable's unit of
measure.
2. List potential predictors; how would changes in
predictors change the dependent variable? Be
selective; go with the fewest independent variables
required. Be aware of the effects of multicollinearity.
3. Gather the data: at least six observations for
each independent variable used in the equation.
18.115
Model Building
4. Identify several possible models; formulate
first- and second-order models with and without
interaction. Draw scatter diagrams.
5. Use statistical software to estimate the
models.
6. Determine whether the required conditions
are satisfied; if not, attempt to correct the problem.
7. Use your judgment and the statistical output
to select the best model!

More Related Content

What's hot

Linear Regression | Machine Learning | Data Science
Linear Regression | Machine Learning | Data ScienceLinear Regression | Machine Learning | Data Science
Linear Regression | Machine Learning | Data Science
Sumit Pandey
 
Regression
RegressionRegression
Regression
SAURABH KUMAR
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
Muhammad Fazeel
 
Correlation Statistics
Correlation StatisticsCorrelation Statistics
Correlation Statistics
tahmid rashid
 
Regression analysis in excel
Regression analysis in excelRegression analysis in excel
Regression analysis in excel
Thilina Rathnayaka
 
Simple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-StepSimple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-Step
Dan Wellisch
 
Simple Linier Regression
Simple Linier RegressionSimple Linier Regression
Simple Linier Regressiondessybudiyanti
 
Simple regression and correlation
Simple regression and correlationSimple regression and correlation
Simple regression and correlation
Mary Grace
 
Correlation & Regression
Correlation & RegressionCorrelation & Regression
Correlation & RegressionGrant Heller
 
Simple lin regress_inference
Simple lin regress_inferenceSimple lin regress_inference
Simple lin regress_inferenceKemal İnciroğlu
 
Correlation 2
Correlation 2Correlation 2
Correlation 2
KanishkJaiswal6
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
Abdelaziz Tayoun
 
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn LottierRegression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
Al Arizmendez
 
Chapter 16: Correlation (enhanced by VisualBee)
Chapter 16: Correlation  
(enhanced by VisualBee)Chapter 16: Correlation  
(enhanced by VisualBee)
Chapter 16: Correlation (enhanced by VisualBee)nunngera
 
Correlation Analysis
Correlation AnalysisCorrelation Analysis
Correlation Analysis
Birinder Singh Gulati
 
Correlation analysis
Correlation analysisCorrelation analysis
Correlation analysis
Shiela Vinarao
 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression ppt
Santosh Bhaskar
 
multiple regression
multiple regressionmultiple regression
multiple regression
Priya Sharma
 
Statr session 23 and 24
Statr session 23 and 24Statr session 23 and 24
Statr session 23 and 24
Ruru Chowdhury
 

What's hot (20)

Linear Regression | Machine Learning | Data Science
Linear Regression | Machine Learning | Data ScienceLinear Regression | Machine Learning | Data Science
Linear Regression | Machine Learning | Data Science
 
Regression
RegressionRegression
Regression
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Correlation Statistics
Correlation StatisticsCorrelation Statistics
Correlation Statistics
 
Regression analysis in excel
Regression analysis in excelRegression analysis in excel
Regression analysis in excel
 
Simple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-StepSimple Linear Regression: Step-By-Step
Simple Linear Regression: Step-By-Step
 
Simple Linier Regression
Simple Linier RegressionSimple Linier Regression
Simple Linier Regression
 
Simple regression and correlation
Simple regression and correlationSimple regression and correlation
Simple regression and correlation
 
Correlation & Regression
Correlation & RegressionCorrelation & Regression
Correlation & Regression
 
Correlation continued
Correlation continuedCorrelation continued
Correlation continued
 
Simple lin regress_inference
Simple lin regress_inferenceSimple lin regress_inference
Simple lin regress_inference
 
Correlation 2
Correlation 2Correlation 2
Correlation 2
 
Correlation and regression
Correlation and regressionCorrelation and regression
Correlation and regression
 
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn LottierRegression Analysis presentation by Al Arizmendez and Cathryn Lottier
Regression Analysis presentation by Al Arizmendez and Cathryn Lottier
 
Chapter 16: Correlation (enhanced by VisualBee)
Chapter 16: Correlation  
(enhanced by VisualBee)Chapter 16: Correlation  
(enhanced by VisualBee)
Chapter 16: Correlation (enhanced by VisualBee)
 
Correlation Analysis
Correlation AnalysisCorrelation Analysis
Correlation Analysis
 
Correlation analysis
Correlation analysisCorrelation analysis
Correlation analysis
 
Correlation and Regression ppt
Correlation and Regression pptCorrelation and Regression ppt
Correlation and Regression ppt
 
multiple regression
multiple regressionmultiple regression
multiple regression
 
Statr session 23 and 24
Statr session 23 and 24Statr session 23 and 24
Statr session 23 and 24
 

Similar to 9 model building

Simple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptxSimple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptx
SoumyaBansal7
 
Linear and Logistics Regression
Linear and Logistics RegressionLinear and Logistics Regression
Linear and Logistics Regression
Mukul Kumar Singh Chauhan
 
regression-linearandlogisitics-220524024037-4221a176 (1).pdf
regression-linearandlogisitics-220524024037-4221a176 (1).pdfregression-linearandlogisitics-220524024037-4221a176 (1).pdf
regression-linearandlogisitics-220524024037-4221a176 (1).pdf
lisow86669
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
nszakir
 
Chap013.ppt
Chap013.pptChap013.ppt
Chap013.ppt
najwalyaa
 
Business Analytics Foundation with R Tools Part 1
Business Analytics Foundation with R Tools Part 1Business Analytics Foundation with R Tools Part 1
Business Analytics Foundation with R Tools Part 1
Beamsync
 
IPPTCh013.pptx
IPPTCh013.pptxIPPTCh013.pptx
IPPTCh013.pptx
ManoloTaquire
 
Regression Analysis.pptx
Regression Analysis.pptxRegression Analysis.pptx
Regression Analysis.pptx
arsh260174
 
Regression Analysis Techniques.pptx
Regression Analysis Techniques.pptxRegression Analysis Techniques.pptx
Regression Analysis Techniques.pptx
YutaItadori
 
Chap013.ppt
Chap013.pptChap013.ppt
Chap013.ppt
ManoloTaquire
 
LINEAR REGRESSION.pptx
LINEAR REGRESSION.pptxLINEAR REGRESSION.pptx
LINEAR REGRESSION.pptx
neelamsanjeevkumar
 
The 10 Algorithms Machine Learning Engineers Need to Know.pptx
The 10 Algorithms Machine Learning Engineers Need to Know.pptxThe 10 Algorithms Machine Learning Engineers Need to Know.pptx
The 10 Algorithms Machine Learning Engineers Need to Know.pptx
Chode Amarnath
 
Unit 03 - Consolidated.pptx
Unit 03 - Consolidated.pptxUnit 03 - Consolidated.pptx
Unit 03 - Consolidated.pptx
ChristopherDevakumar1
 
Distribution of EstimatesLinear Regression ModelAssume (yt,.docx
Distribution of EstimatesLinear Regression ModelAssume (yt,.docxDistribution of EstimatesLinear Regression ModelAssume (yt,.docx
Distribution of EstimatesLinear Regression ModelAssume (yt,.docx
madlynplamondon
 
Quantity Demand Analysis
Quantity Demand AnalysisQuantity Demand Analysis
Quantity Demand Analysis
Joseph Winthrop Godoy
 
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docxFSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
budbarber38650
 
Econometrics project
Econometrics projectEconometrics project
Econometrics project
Shubham Joon
 
10 Nonparamatric statistics
10 Nonparamatric statistics10 Nonparamatric statistics
10 Nonparamatric statistics
Penny Jiang
 
Module 3_ Classification.pptx
Module 3_ Classification.pptxModule 3_ Classification.pptx
Module 3_ Classification.pptx
nikshaikh786
 

Similar to 9 model building (20)

Simple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptxSimple Regression Analysis ch12.pptx
Simple Regression Analysis ch12.pptx
 
Linear and Logistics Regression
Linear and Logistics RegressionLinear and Logistics Regression
Linear and Logistics Regression
 
regression-linearandlogisitics-220524024037-4221a176 (1).pdf
regression-linearandlogisitics-220524024037-4221a176 (1).pdfregression-linearandlogisitics-220524024037-4221a176 (1).pdf
regression-linearandlogisitics-220524024037-4221a176 (1).pdf
 
Chapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares RegressionChapter 2 part3-Least-Squares Regression
Chapter 2 part3-Least-Squares Regression
 
Chap013.ppt
Chap013.pptChap013.ppt
Chap013.ppt
 
Business Analytics Foundation with R Tools Part 1
Business Analytics Foundation with R Tools Part 1Business Analytics Foundation with R Tools Part 1
Business Analytics Foundation with R Tools Part 1
 
Chapter05
Chapter05Chapter05
Chapter05
 
IPPTCh013.pptx
IPPTCh013.pptxIPPTCh013.pptx
IPPTCh013.pptx
 
Regression Analysis.pptx
Regression Analysis.pptxRegression Analysis.pptx
Regression Analysis.pptx
 
Regression Analysis Techniques.pptx
Regression Analysis Techniques.pptxRegression Analysis Techniques.pptx
Regression Analysis Techniques.pptx
 
Chap013.ppt
Chap013.pptChap013.ppt
Chap013.ppt
 
LINEAR REGRESSION.pptx
LINEAR REGRESSION.pptxLINEAR REGRESSION.pptx
LINEAR REGRESSION.pptx
 
The 10 Algorithms Machine Learning Engineers Need to Know.pptx
The 10 Algorithms Machine Learning Engineers Need to Know.pptxThe 10 Algorithms Machine Learning Engineers Need to Know.pptx
The 10 Algorithms Machine Learning Engineers Need to Know.pptx
 
Unit 03 - Consolidated.pptx
Unit 03 - Consolidated.pptxUnit 03 - Consolidated.pptx
Unit 03 - Consolidated.pptx
 
Distribution of EstimatesLinear Regression ModelAssume (yt,.docx
Distribution of EstimatesLinear Regression ModelAssume (yt,.docxDistribution of EstimatesLinear Regression ModelAssume (yt,.docx
Distribution of EstimatesLinear Regression ModelAssume (yt,.docx
 
Quantity Demand Analysis
Quantity Demand AnalysisQuantity Demand Analysis
Quantity Demand Analysis
 
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docxFSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
FSE 200AdkinsPage 1 of 10Simple Linear Regression Corr.docx
 
Econometrics project
Econometrics projectEconometrics project
Econometrics project
 
10 Nonparamatric statistics
10 Nonparamatric statistics10 Nonparamatric statistics
10 Nonparamatric statistics
 
Module 3_ Classification.pptx
Module 3_ Classification.pptxModule 3_ Classification.pptx
Module 3_ Classification.pptx
 

More from Penny Jiang

Step 9 Evaluating the Strategic Plan
Step 9 Evaluating the Strategic PlanStep 9 Evaluating the Strategic Plan
Step 9 Evaluating the Strategic Plan
Penny Jiang
 
Step 8 Implementing the Strategic Plan
Step 8 Implementing the Strategic PlanStep 8 Implementing the Strategic Plan
Step 8 Implementing the Strategic Plan
Penny Jiang
 
Step 7 Selecting Tactics
Step 7 Selecting TacticsStep 7 Selecting Tactics
Step 7 Selecting Tactics
Penny Jiang
 
Step 6 Developing the Message Strategy
Step 6 Developing the Message StrategyStep 6 Developing the Message Strategy
Step 6 Developing the Message Strategy
Penny Jiang
 
Chinese calligraphy
Chinese calligraphyChinese calligraphy
Chinese calligraphy
Penny Jiang
 
7 anova chi square test
 7 anova chi square test 7 anova chi square test
7 anova chi square test
Penny Jiang
 
6 estimation hypothesis testing t test
6 estimation hypothesis testing t test6 estimation hypothesis testing t test
6 estimation hypothesis testing t test
Penny Jiang
 
5 numerical descriptive statitics
5 numerical descriptive statitics5 numerical descriptive statitics
5 numerical descriptive statitics
Penny Jiang
 
4 sampling
4 sampling4 sampling
4 sampling
Penny Jiang
 
3 survey, questionaire, graphic techniques
3 survey, questionaire, graphic techniques3 survey, questionaire, graphic techniques
3 survey, questionaire, graphic techniques
Penny Jiang
 
2 statistics, measurement, graphical techniques
2 statistics, measurement, graphical techniques2 statistics, measurement, graphical techniques
2 statistics, measurement, graphical techniques
Penny Jiang
 
1 introduction
1 introduction1 introduction
1 introduction
Penny Jiang
 
2 elements of design line
2 elements of design line2 elements of design line
2 elements of design line
Penny Jiang
 

More from Penny Jiang (13)

Step 9 Evaluating the Strategic Plan
Step 9 Evaluating the Strategic PlanStep 9 Evaluating the Strategic Plan
Step 9 Evaluating the Strategic Plan
 
Step 8 Implementing the Strategic Plan
Step 8 Implementing the Strategic PlanStep 8 Implementing the Strategic Plan
Step 8 Implementing the Strategic Plan
 
Step 7 Selecting Tactics
Step 7 Selecting TacticsStep 7 Selecting Tactics
Step 7 Selecting Tactics
 
Step 6 Developing the Message Strategy
Step 6 Developing the Message StrategyStep 6 Developing the Message Strategy
Step 6 Developing the Message Strategy
 
Chinese calligraphy
Chinese calligraphyChinese calligraphy
Chinese calligraphy
 
7 anova chi square test
 7 anova chi square test 7 anova chi square test
7 anova chi square test
 
6 estimation hypothesis testing t test
6 estimation hypothesis testing t test6 estimation hypothesis testing t test
6 estimation hypothesis testing t test
 
5 numerical descriptive statitics
5 numerical descriptive statitics5 numerical descriptive statitics
5 numerical descriptive statitics
 
4 sampling
4 sampling4 sampling
4 sampling
 
3 survey, questionaire, graphic techniques
3 survey, questionaire, graphic techniques3 survey, questionaire, graphic techniques
3 survey, questionaire, graphic techniques
 
2 statistics, measurement, graphical techniques
2 statistics, measurement, graphical techniques2 statistics, measurement, graphical techniques
2 statistics, measurement, graphical techniques
 
1 introduction
1 introduction1 introduction
1 introduction
 
2 elements of design line
2 elements of design line2 elements of design line
2 elements of design line
 

Recently uploaded

"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
RitikBhardwaj56
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
IreneSebastianRueco1
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
Celine George
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
MysoreMuleSoftMeetup
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
Bisnar Chase Personal Injury Attorneys
 
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Ashish Kohli
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
goswamiyash170123
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
ArianaBusciglio
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
Krisztián Száraz
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
Priyankaranawat4
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
chanes7
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
Dr. Shivangi Singh Parihar
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
EverAndrsGuerraGuerr
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
TechSoup
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
David Douglas School District
 

Recently uploaded (20)

"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...The simplified electron and muon model, Oscillating Spacetime: The Foundation...
The simplified electron and muon model, Oscillating Spacetime: The Foundation...
 
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
 
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
RPMS TEMPLATE FOR SCHOOL YEAR 2023-2024 FOR TEACHER 1 TO TEACHER 3
 
How to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP ModuleHow to Add Chatter in the odoo 17 ERP Module
How to Add Chatter in the odoo 17 ERP Module
 
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
 
Top five deadliest dog breeds in America
Top five deadliest dog breeds in AmericaTop five deadliest dog breeds in America
Top five deadliest dog breeds in America
 
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
Aficamten in HCM (SEQUOIA HCM TRIAL 2024)
 
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdfMASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
MASS MEDIA STUDIES-835-CLASS XI Resource Material.pdf
 
Group Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana BuscigliopptxGroup Presentation 2 Economics.Ariana Buscigliopptx
Group Presentation 2 Economics.Ariana Buscigliopptx
 
Advantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO PerspectiveAdvantages and Disadvantages of CMS from an SEO Perspective
Advantages and Disadvantages of CMS from an SEO Perspective
 
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdfANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
ANATOMY AND BIOMECHANICS OF HIP JOINT.pdf
 
Digital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion DesignsDigital Artifact 2 - Investigating Pavilion Designs
Digital Artifact 2 - Investigating Pavilion Designs
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.PCOS corelations and management through Ayurveda.
PCOS corelations and management through Ayurveda.
 
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat  Leveraging AI for Diversity, Equity, and InclusionExecutive Directors Chat  Leveraging AI for Diversity, Equity, and Inclusion
Executive Directors Chat Leveraging AI for Diversity, Equity, and Inclusion
 
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptxChapter 4 - Islamic Financial Institutions in Malaysia.pptx
Chapter 4 - Islamic Financial Institutions in Malaysia.pptx
 
Pride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School DistrictPride Month Slides 2024 David Douglas School District
Pride Month Slides 2024 David Douglas School District
 

9 model building

  • 1. Quantitative Research Methods Lecture 9 Model Building 1. Regression Diagnostics I 2. Regression Diagnostics II Multicollinearity 3. Regression Diagnostics II Time series 4. Polynomial Models 5. Nominal variable in Multiple Regression 6. Stepwise Multiple Regression
  • 2. Statistical analyses • Group differences (nominal variable) on one interval variable: ▫ T-tests (2 groups) ▫ ANOVA (3 or more groups)  One factor: one way ANOVA  Two factor: two way/factor ANOVA • The relationship between two nominal variable: ▫ Chi-square test • The relationship between two interval variable: ▫ Correlation, simple linear regression • The relationship between multiple interval variable on one interval variable ▫ Multiple regression • The relationship between multiple interval variable on one nominal variable (yes/no) ▫ Logistic regression
  • 3. Regression • Single Linear Regression (interval) ▫ one independent, one dependent • Multiple Regression (all interval) ▫ Multiple independent, one dependent • Logistic Regression ▫ Multiple interval independent, one nominal dependent (Yes/No) ▫ Check example: https://youtu.be/H_48AcV0qlY ▫
  • 4. 16.4 Simple Linear Regression Model… A straight line model with one independent variable is called a simple linear regression model. Its is written as: error variable dependent variable independent variable y-intercept slope of the line
  • 5. 16.5 Simple Linear Regression Model… Note that both and are population parameters which are usually unknown and hence estimated from the data. y x run rise =slope (=rise/run) =y-intercept
  • 6. 16.6 Estimating the Coefficients… In much the same way we base estimates of µ on x , we estimate β0 using b0 and β1 using b1, the y-intercept and slope (respectively) of the least squares or regression line given by: (Recall: this is an application of the least squares method and it produces a straight line that minimizes the sum of the squared differences between the points and the line)
  • 7. 16.7 Least Squares Line… these differences are called residuals Example 16.1
  • 8. 16.8 Example 16.2… Car dealers across North America use the "Red Book" to help them determine the value of used cars that their customers trade in when purchasing new cars. The book, which is published monthly, lists the trade-in values for all basic models of cars. It provides alternative values for each car model according to its condition and optional features. The values are determined on the basis of the average paid at recent used-car auctions, the source of supply for many used-car dealers.
  • 9. 16.9 Example 16.2… However, the Red Book does not indicate the value determined by the odometer reading, despite the fact that a critical factor for used-car buyers is how far the car has been driven. To examine this issue, a used-car dealer randomly selected 100 three-year old Toyota Camrys that were sold at auction during the past month. The dealer recorded the price ($1,000) and the number of miles (thousands) on the odometer. (Xm16-02). The dealer wants to find the regression line.
  • 10. 16.10 Simple Linear Regression Using SPSS: Analyze > Regression > Linear
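If SPSS is not at hand, the same output can be reproduced in code; a sketch with statsmodels, continuing the arrays defined above:

```python
import statsmodels.api as sm

X = sm.add_constant(x)        # prepend a column of 1s for the intercept
model = sm.OLS(y, X).fit()    # ordinary least squares fit
print(model.summary())        # reports R-squared, the ANOVA F-test, b0 and b1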
  • 11. 16.11 SPSS Output: check three tables: Model Summary (R², the strength of the linear relationship); ANOVA (model significance/fit); Coefficients (b0 and b1).
  • 12. 16.12 Example 16.2… As you might expect with used cars… The slope coefficient, b1, is –0.0669; that is, each additional mile on the odometer decreases the price by $.0669, or 6.69¢. The intercept, b0, is 17.250, i.e. $17,250. One interpretation would be that when x = 0 (no miles on the car) the selling price is $17,250. However, we have no data for cars with less than 19,100 miles on them, so this isn't a correct assessment.
  • 13. 16.13 Testing the Slope… If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero. We want to see if there is a linear relationship, i.e. we want to see if the slope (β1) is something other than zero. Our research hypothesis becomes: H1: β1 ≠ 0 Thus the null hypothesis becomes: H0: β1 = 0
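A sketch of this t-test in code, continuing the example above; the test statistic is t = b1 / s(b1) with n − 2 degrees of freedom, where s(b1) is the standard error of the slope:

```python
from scipy import stats

n = len(x)
resid = y - (b0 + b1 * x)                              # residuals
s_eps = np.sqrt((resid ** 2).sum() / (n - 2))          # standard error of estimate
s_b1 = s_eps / np.sqrt((n - 1) * np.var(x, ddof=1))    # standard error of the slope
t = b1 / s_b1                                          # test statistic under H0: beta1 = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)                   # two-tailed p-value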
  • 14. 16.14 Coefficient of Determination… Tests thus far have shown if a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination – R2. The coefficient of determination is the square of the coefficient of correlation (r), hence R2 = (r)2
  • 15. 16.15 Coefficient of Determination… As we did with analysis of variance, we can partition the variation in y into two parts: Variation in y = SSE + SSR SSE – Sum of Squares Error – measures the amount of variation in y that remains unexplained (i.e. due to error) SSR – Sum of Squares Regression – measures the amount of variation in y explained by variation in the independent variable x.
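The partition and R² follow directly from the fitted values; continuing the sketch:

```python
yhat = b0 + b1 * x
sse = ((y - yhat) ** 2).sum()         # unexplained variation (error)
ssr = ((yhat - y.mean()) ** 2).sum()  # variation explained by x
r2 = ssr / (ssr + sse)                # coefficient of determination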
  • 16. 16.16 Coefficient of Determination R2 has a value of .6483. This means 64.83% of the variation in the auction selling prices (y) is explained by the variation in the odometer readings (x). The remaining 35.17% is unexplained, i.e. due to error. Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions. In general, the higher the value of R2, the better the model fits the data. R2 = 1: perfect match between the line and the data points. R2 = 0: there is no linear relationship between x and y.
  • 17. 16.17 Using the Regression Equation… We could use our regression equation, ŷ = 17.250 – .0669x, to predict the selling price of a car with 40 (thousand) miles on it: ŷ = 17.250 – .0669(40) = 14.574, i.e. $14,574. We call this value a point prediction. Chances are, though, that the actual selling price will be different, hence we can estimate the selling price in terms of an interval.
  • 18. 16.18 Prediction Interval The prediction interval is used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable: ŷ ± t(α/2, n−2) · sε · √(1 + 1/n + (xg − x̄)² / ((n − 1)sx²)), where xg is the given value of x we're interested in.
  • 19. 16.19 Confidence Interval Estimator… …of the expected value of y. In this case, we are estimating the mean of y given a value of x: ŷ ± t(α/2, n−2) · sε · √(1/n + (xg − x̄)² / ((n − 1)sx²)). (Technically this formula is used for infinitely large populations. However, we can interpret our problem as attempting to determine the average selling price of all Toyota Camrys with 40,000 miles on the odometer.)
  • 20. 16.20 What's the Difference? The prediction interval formula carries an extra "1" under the square root; the confidence interval formula does not. The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value than in predicting an individual value. Prediction interval: used to estimate one value of y (at a given x). Confidence interval: used to estimate the mean value of y (at a given x).
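Both intervals use the same ingredients and differ only by that extra 1 under the square root. A sketch for a given value xg = 40, continuing the earlier code:

```python
xg = 40.0
tcrit = stats.t.ppf(0.975, df=n - 2)                   # 95% critical value
yhat_g = b0 + b1 * xg                                  # point prediction
core = 1 / n + (xg - x.mean()) ** 2 / ((n - 1) * np.var(x, ddof=1))
pi_half = tcrit * s_eps * np.sqrt(1 + core)            # prediction interval (extra "1")
ci_half = tcrit * s_eps * np.sqrt(core)                # confidence interval for the mean
print(f"PI: {yhat_g:.3f} ± {pi_half:.3f}   CI: {yhat_g:.3f} ± {ci_half:.3f}")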
  • 23. 16.23 Regression Diagnostics… There are three conditions that are required in order to perform a regression analysis: • The error variable must be normally distributed, • The error variable must have a constant variance, • The errors must be independent of each other. How can we diagnose violations of these conditions? Residual analysis: examine the differences between the actual data points and those predicted by the linear equation…
  • 24. 16.24 Nonnormality… We can take the residuals and put them into a histogram to visually check for normality… we're looking for a bell-shaped histogram with the mean close to zero.
  • 25. SPSS: Regression > Linear > Save > check Residuals: Unstandardized & Standardized
  • 26. SPSS test of normality: Analyze > Descriptive Statistics > Explore > Plots
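Outside SPSS, the same check takes a histogram for the visual inspection plus one formal test; Shapiro-Wilk is a common choice (SPSS's Explore route reports it too). Continuing the sketch:

```python
import matplotlib.pyplot as plt

plt.hist(resid, bins=10)                     # look for a bell shape centred near zero
plt.xlabel("Residual"); plt.ylabel("Frequency")
plt.show()

w, pval = stats.shapiro(resid)               # H0: the residuals are normally distributed
print(f"Shapiro-Wilk p = {pval:.3f}")        # a small p-value suggests nonnormality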
  • 28. 16.28 Heteroscedasticity… When the requirement of a constant variance is violated, we have a condition of heteroscedasticity. We can diagnose heteroscedasticity by plotting the residual against the predicted y.
  • 29. 16.29 Heteroscedasticity… If the variance of the error variable (σε²) is not constant, then we have "heteroscedasticity". Here's the plot of the residual against the predicted value of y: there doesn't appear to be a change in the spread of the plotted points, therefore no heteroscedasticity.
  • 31. SPSS: Graphs > Scatter; y = Residual, x = Predicted Price
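The same plot in code, continuing the sketch; a fan or funnel shape, where the spread of the residuals grows or shrinks with the predicted value, is the visual signature of heteroscedasticity:

```python
plt.scatter(yhat, resid)                     # residuals vs predicted values
plt.axhline(0, linestyle="--")               # reference line at zero
plt.xlabel("Predicted price"); plt.ylabel("Residual")
plt.show()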
  • 32. 16.32 Nonindependence of the Error Variable If we were to observe the auction price of cars every week for, say, a year, that would constitute a time series. When the data are time series, the errors often are correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated. We can often detect autocorrelation by graphing the residuals against the time periods. If a pattern emerges, it is likely that the independence requirement is violated.
  • 33. 16.33 Nonindependence of the Error Variable Patterns in the appearance of the residuals over time indicate that autocorrelation exists: note the runs of positive residuals replaced by runs of negative residuals, or the oscillating behavior of the residuals around zero. The Durbin-Watson test is one way to test for autocorrelation.
  • 34. 16.34 Outliers… An outlier is an observation that is unusually small or unusually large. E.g., our used car example had odometer readings from 19.1 to 49.2 thousand miles. Suppose we have a value of only 5,000 miles (i.e. a car driven by an old person only on Sundays); this point is an outlier.
  • 35. 16.35 Outliers… Possible reasons for the existence of outliers include: ▫ There was an error in recording the value ▫ The point should not have been included in the sample ▫ Perhaps the observation is indeed valid. Outliers can be easily identified from a scatter plot. If the absolute value of the standardized residual is > 2, we suspect the point may be an outlier and investigate further. Outliers need to be dealt with, since they can easily influence the least squares line…
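A sketch of that screening rule, continuing the earlier code; dividing each residual by the standard error of estimate gives a rough standardized residual (SPSS's saved standardized residuals are computed in a similar spirit):

```python
std_resid = resid / s_eps                     # rough standardized residuals
suspects = np.where(np.abs(std_resid) > 2)[0]
print("Investigate observations:", suspects)  # candidates for a closer look, not automatic deletions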
  • 38. Procedure for Regression Diagnostics 1. Develop a model that has a theoretical basis; that is, for the dependent variable in question, find an independent variable that you believe is linearly related to it. 2. Gather data for the two variables. 3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers. 4. Determine the regression equation. 5. Calculate the residuals and check the required conditions (see the earlier slides). 6. Assess the model's fit using the SPSS output (see the earlier slides). 7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable or estimate its mean (or both).
  • 39. From simple linear regression to multiple regression • Simple linear regression (Diagram: Education → Income)
  • 40. 17.40 Multiple Regression… The simple linear regression model was used to analyze how one interval variable (the dependent variable y) is related to one other interval variable (the independent variable x). Multiple regression allows for any number of independent variables. We expect to develop models that fit the data better than would a simple linear regression model.
  • 41. Multiple regression (Diagram: several predictor variables, A, B, and C, pointing to one dependent variable, D)
  • 42. Multiple regression (Diagram: Age, Education, Number of family members earning money, Number of children, Years with current employer, Occupational prestige score, and Work hours all pointing to Income)
  • 43. Example: GSS2008 • How is income affected by ▫ Age (AGE) ▫ Education (EDUC) ▫ Work hours (HRS) ▫ Spouse work hours (SPHRS) ▫ Occupational prestige score (PRESTG80) ▫ Number of children (CHILDS) ▫ Number of family members earning money (EARNRS) ▫ Years with current employer (CUREMPYR)
  • 44. 17.44 The Model… We now assume we have k independent variables potentially related to the one dependent variable. This relationship is represented in the first-order linear equation: y = β0 + β1x1 + β2x2 + … + βkxk + ε, where y is the dependent variable, x1, …, xk are the independent variables, β0, …, βk are the coefficients, and ε is the error variable. In the one-variable, two-dimensional case we drew a regression line; here we imagine a response surface.
  • 45. 17.45 Estimating the Coefficients… The sample regression equation is expressed as: ŷ = b0 + b1x1 + b2x2 + … + bkxk. We will use computer output to: Assess the model… How well does it fit the data? Is it useful? Are any required conditions violated? Employ the model… Interpreting the coefficients. Predictions using the regression model.
  • 46. 17.46 Regression Analysis Steps… 1. Use a computer and software to generate the coefficients and the statistics used to assess the model. 2. Diagnose violations of required conditions. If there are problems, attempt to remedy them. 3. Assess the model's fit: coefficient of determination, F-test of the analysis of variance. 4. If steps 1–3 are OK, use the model for prediction.
  • 47. 17.47 Transformation… Can we transform this data into a mathematical model that looks like this? (Diagram: income as a function of education, years with current employer, …, age.)
  • 48. 17.48 Using SPSS • Analyze > Regression > Linear
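A code analogue of this SPSS run, assuming the GSS2008 extract is saved as a CSV with the variable names from the slides (the file name and the INCOME column name are assumptions for illustration):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("gss2008.csv")              # hypothetical file holding the GSS2008 extract
ivs = ["AGE", "EDUC", "HRS", "SPHRS", "PRESTG80", "CHILDS", "EARNRS", "CUREMPYR"]
X = sm.add_constant(df[ivs])
fit = sm.OLS(df["INCOME"], X, missing="drop").fit()   # drop rows with missing values
print(fit.summary())                         # coefficients, R², adjusted R², F-test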
  • 51. The mathematical model: ŷ = −51785.243 + 460.87x1 + 4100.9x2 + … + 329.771x8
  • 52. 17.52 The Model… Although we haven't done any assessment of the model yet, at first pass ŷ = −51785.243 + 460.87x1 + 4100.9x2 + 620x3 − 862.201x4 … + 329.771x8 suggests that increases in AGE, EDUC, HRS, PRESTG80, EARNRS, and CUREMPYR will positively impact income, while increases in SPHRS and CHILDS will negatively impact income.
  • 53. 17.53 Model Assessment… We will assess the model in two ways: Coefficient of determination, and F-test of the analysis of variance.
  • 54. 17.54 Coefficient of Determination… • Again, the coefficient of determination is defined as R² = SSR / (SSR + SSE). Here R² = .337: 33.7% of the variation in income is explained by the eight independent variables, but 66.3% remains unexplained.
  • 55. 17.55 Adjusted R2 value… The "adjusted" R2 is the coefficient of determination adjusted for the number of explanatory variables. It takes into account the sample size n and k, the number of independent variables, and is given by: Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1).
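As a quick check, the adjustment is a one-line function (the sample size in the example call is made up, since the slides do not report n):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R²: penalizes R² for the number of predictors k, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.337, 200, 8))   # n = 200 is illustrative only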
  • 56. 17.56 Testing the Validity of the Model… In a multiple regression model (i.e. more than one independent variable), we utilize an analysis of variance technique to test the overall validity of the model. Here's the idea: H0: β1 = β2 = … = βk = 0; H1: at least one βi is not equal to zero. If the null hypothesis is true, none of the independent variables is linearly related to y, and so the model is invalid. If at least one βi is not equal to 0, the model does have some validity.
  • 57. 17.57 Testing the Validity of the Model… ANOVA table for regression analysis:
Source of Variation | degrees of freedom | Sums of Squares | Mean Squares | F-Statistic
Regression | k | SSR | MSR = SSR/k | F = MSR/MSE
Error | n−k−1 | SSE | MSE = SSE/(n−k−1) |
Total | n−1 | | |
A large value of F indicates that most of the variation in y is explained by the regression equation and that the model is valid. A small value of F indicates that most of the variation in y is unexplained.
  • 58. Testing the Validity of the Model… p < .05, so at least one βi is not 0: reject H0 and accept H1; the model is valid.
  • 59. 17.59 Interpreting the Coefficients* Intercept (b0) = −51785.243 • This is the average income when all of the independent variables are zero. It's meaningless to try to interpret this value, particularly if 0 is outside the range of the values of the independent variables (as is the case here). Age (b1) = 460.87 • Each 1-year increase in age increases income by $460.87. Education (b2) = 4100.9 • For each additional year of education, annual income increases by $4,100.90. Hours of work (b3) = 620 • For each additional hour of work per week, annual income increases by $620. *In each case we assume all other variables are held constant…
  • 60. 17.60 Interpreting the Coefficients* Spouse hours of work (b4) = −862.201 • For each additional hour the spouse works per week, average annual income decreases by $862.20. Occupational prestige score (b5) = 641 • For each additional unit of score, average annual income increases by $641. Number of children (b6) = −331 • For each additional child, average income decreases by $331. Number of family members earning money (b7) = 687 • For each additional family member earning money, income increases by $687. Number of years with current job (b8) = 330 • For each additional year with the current job, income increases by $330. *In each case we assume all other variables are held constant…
  • 61. 17.61 Testing the Coefficients… For each independent variable, we can test to determine whether there is enough evidence of a linear relationship between it and the dependent variable for the entire population: H0: βi = 0, H1: βi ≠ 0 (for i = 1, 2, …, k), using t = (bi − βi) / s(bi) as our test statistic (with n−k−1 degrees of freedom).
  • 62. 17.62 Testing the Coefficients We can use SPSS output to quickly test each of the 8 coefficients in our model… Thus, EDUC, HRS, SPHRS, and PRESTG80 are linearly related to income. There is no evidence to infer that AGE, CHILDS, EARNRS, and CUREMPYR are linearly related to income.
  • 63. 17.63 Using the Regression Equation Much like we did with simple linear regression, we can produce a prediction interval for a particular value of y. As well, we can produce the confidence interval estimate of the expected value of y.
  • 64. 17.64 Using the Regression Equation Exercise GSS2008: We add one row (our given values for the independent variables) to the bottom of our data set; please produce ▫ the prediction interval ▫ the confidence interval estimate for the dependent variable y.
  • 65. 17.65 Regression Diagnostics I Exercise GSS2008 • Calculate the residuals and check the following: ▫ Is the error variable nonnormal? ▫ Perform a normality test • Is the error variance constant? ▫ Plot the residuals versus the predicted values of y. • Are the errors independent (time-series data)? ▫ Plot the residuals versus the time periods. • Are there observations that are inaccurate or do not belong to the target population? ▫ Double-check the accuracy of outliers and influential observations.
  • 66. 17.66 Regression Diagnostics II • Multiple regression models have a problem that simple regressions do not, namely multicollinearity. • It happens when the independent variables are highly correlated. • We’ll explore this concept through the following example…
  • 67. 17.67 Example GSS2008 • AGE and CUREMPYR are not significant predictors of INCOME in the multiple regression model, but when we run correlations between AGE and INCOME and between CUREMPYR and INCOME, both are significant. • How do we account for this apparent contradiction? • The answer is that AGE and CUREMPYR are correlated with each other and with the other independent variables! • This is the problem of multicollinearity.
  • 70. How to deal with the multicollinearity problem • Multicollinearity exists in virtually all multiple regression models. • To minimize the effect: ▫ Try to include independent variables that are independent of each other. ▫ Develop a model that has a theoretical basis and include only the IVs that are necessary.
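A diagnostic worth adding here is the variance inflation factor (VIF), which quantifies how much each coefficient's variance is inflated by correlation among the predictors; values above roughly 10 are commonly treated as a warning sign. A sketch continuing the GSS example above:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Skip column 0, the constant, whose VIF is not meaningful
for i, name in enumerate(X.columns[1:], start=1):
    print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.2f}")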
  • 71. 17.71 Regression Diagnostics III – Time Series • The Durbin-Watson test allows us to determine whether there is evidence of first-order autocorrelation, a condition in which a relationship exists between consecutive residuals ei−1 and ei (i is the time period). The statistic for this test is defined as: d = Σ (ei − ei−1)² / Σ ei², where the numerator sums over i = 2, …, n and the denominator over i = 1, …, n. • d has a range of values: 0 ≤ d ≤ 4.
  • 72. 17.72 Durbin–Watson (two-tail test) • To test for first-order autocorrelation: • If d < dL or d > 4 − dL, first-order autocorrelation exists. • If d falls between dL and dU or between 4 − dU and 4 − dL, the test is inconclusive. • If d falls between dU and 4 − dU, there is no evidence of first-order autocorrelation. (Number line: 0 … dL … dU … 2 … 4 − dU … 4 − dL … 4.)
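The statistic is easy to compute from the residuals taken in time order; a sketch (statsmodels also ships a durbin_watson function in statsmodels.stats.stattools):

```python
import numpy as np

def durbin_watson_stat(e):
    """Durbin-Watson d = sum (e_i - e_{i-1})^2 / sum e_i^2.
    d near 2: no first-order autocorrelation; near 0: positive; near 4: negative."""
    e = np.asarray(e)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)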
  • 73. 17.73 Example 17.1 Xm17-01 Can we create a model that will predict lift ticket sales at a ski hill based on two weather parameters? Variables: y - lift ticket sales during Christmas week, x1 - total snowfall (inches), and x2 - average temperature (degrees Fahrenheit) Our ski hill manager collected 20 years of data.
  • 74. 17.74 Example 17.1 Both the coefficient of determination and the p-value of the F-test indicate the model is poor… Neither variable is linearly related to ticket sales…
  • 75. 17.75 Example 17.1 • The histogram of residuals… • reveals the errors may be normally distributed…
  • 76. 17.76 Example 17.1 • In the plot of residuals versus predicted values (testing for heteroscedasticity) — the error variance appears to be constant…
  • 77. 17.77 Example 17.1 Durbin-Watson • Apply the Durbin-Watson statistic to the entire list of residuals. • Regression > Linear > Statistics > check Durbin-Watson
  • 78. 17.78 Example 17.1 To test for first-order autocorrelation with α = .05, we find in Table 8(a) in Appendix B dL = 1.10 and dU = 1.54 The null and alternative hypotheses are H0 : There is no first-order autocorrelation. H1 : There is first-order autocorrelation. The rejection region includes d < dL = 1.10. Since d = .593, we reject the null hypothesis and conclude that there is enough evidence to infer that first-order autocorrelation exists.
  • 79. 17.79 Example 17.1 Autocorrelation usually indicates that the model needs to include an independent variable that has a time-ordered effect on the dependent variable. The simplest such independent variable represents the time periods. We included a third independent variable that records the number of years since the year the data were gathered. Thus, x3 = 1, 2,..., 20. The new model is y = β0 + β1x1 + β2x2 + β3x3 + ε
  • 80. 17.80 Example 17.1 The fit of the model is high and the model is valid… Snowfall and time (our new variable) are linearly related to ticket sales; temperature is not… With dL = 1.10 and dU = 1.54, we have dU < d < 4 − dU, so first-order autocorrelation doesn't exist.
  • 81. 17.81 Example 17.1 • The Durbin-Watson statistic computed from the residuals of our new regression analysis is equal to 1.885. • We can conclude that there is not enough evidence to infer the presence of first-order autocorrelation. (Determining dL is left as an exercise for the reader…) • Hence, we have improved our model dramatically!
  • 82. 17.82 Example 17.1 Notice that the model is improved dramatically. The F-test tells us that the model is valid. The t-tests tell us that both the amount of snowfall and time are significantly linearly related to the number of lift tickets. This information could prove useful in advertising for the resort. For example, if there has been a recent snowfall, the resort could emphasize that in its advertising. If no new snow has fallen, it may emphasize its snow-making facilities.
  • 83. 18.83 Model Selection Regression analysis can also be used for: • non-linear (polynomial) models, and • models that include nominal independent variables.
  • 84. 18.84 Polynomial Models Previously we looked at the multiple regression model y = β0 + β1x1 + β2x2 + … + βkxk + ε (it's considered linear or first-order since the exponent on each of the xi's is 1). The independent variables may be functions of a smaller number of predictor variables; polynomial models fall into this category. If there is one predictor variable (x) we have: y = β0 + β1x + β2x² + … + βpx^p + ε.
  • 85. 18.85 Polynomial Models Technically, this is a multiple regression model with p independent variables (x1, x2, …, xp). Since x1 = x, x2 = x², x3 = x³, …, xp = x^p, it's based on one predictor variable (x). p is the order of the equation; we'll focus on equations of order p = 1, 2, and 3.
  • 86. 18.86 First Order Model When p = 1, we have our simple linear regression model: y = β0 + β1x + ε. That is, we believe there is a straight-line relationship between the dependent and independent variables over the range of the values of x.
  • 87. 18.87 Second Order Model When p = 2, the polynomial model y = β0 + β1x + β2x² + ε is a parabola.
  • 88. 18.88 Third Order Model When p = 3, our third-order model looks like: y = β0 + β1x + β2x² + β3x³ + ε.
  • 89. 18.89 Polynomial Models: 2 Predictor Variables Perhaps we suspect that there are two predictor variables (x1 & x2) which influence the dependent variable. First-order model (no interaction): y = β0 + β1x1 + β2x2 + ε. First-order model (with interaction): y = β0 + β1x1 + β2x2 + β3x1x2 + ε.
  • 90. 18.90 Polynomial Models: 2 Predictor Variables (Figure: response surfaces for first-order models with two predictors, without and with interaction.)
  • 91. 18.91 Polynomial Models: 2 Predictor Variables If we believe that a quadratic relationship exists between y and each of x1 and x2, and that the predictor variables interact in their effect on y, we can use the second-order model (in two variables) with interaction: y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε.
  • 92. 18.92 Polynomial Models: 2 Predictor Variables (Figure: response surfaces for second-order models with two predictors, without and with interaction.)
  • 93. 18.93 Selecting a Model One predictor variable, or two (or more)? First order? Second order? Higher order? With interaction? Without? How do we choose the right model? Use our knowledge of the variables involved to build an initial model. Test that model using statistical techniques. If required, modify our model and re-test…
  • 94. 18.94 Example 18.1 We've been asked to come up with a regression model for a fast food restaurant. We know our primary market is middle-income adults and their children, particularly those between the ages of 5 and 12. Dependent variable: restaurant revenue (gross or net). Predictor variables: family income, age of children. Is the relationship first order? Quadratic?…
  • 95. 18.95 Example 18.1 The relationship between the dependent variable (revenue) and each predictor variable is probably quadratic. Members of low- or high-income households are less likely to eat at this chain's restaurants, since the restaurants attract mostly middle-income customers. Families in neighborhoods where the mean age of children is either quite low or quite high are also less likely to eat there than families with children in the 5-to-12-year range. Seems reasonable?
  • 96. 18.96 Example 18.1 Should we include the interaction term in our model? When in doubt, it is probably best to include it. Our model, then, is: y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε, where y = annual gross sales, x1 = median annual household income*, x2 = mean age of children*. (*In the neighborhood.)
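In code, the second-order and interaction terms are just derived columns appended before fitting; a sketch with assumed file and column names:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("xm18-02.csv")                 # hypothetical file name for Xm18-02
df["income_sq"] = df["income"] ** 2             # x1^2   (column names are assumptions)
df["age_sq"] = df["age"] ** 2                   # x2^2
df["income_x_age"] = df["income"] * df["age"]   # interaction x1 * x2
X = sm.add_constant(df[["income", "age", "income_sq", "age_sq", "income_x_age"]])
fit = sm.OLS(df["revenue"], X).fit()
print(fit.summary())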
  • 97. 18.97 Example 18.2 Xm18-02 Our fast food restaurant research department selected 25 locations at random and gathered data on revenues, household income, and ages of neighborhood children. (Spreadsheet: the collected data plus the calculated second-order and interaction terms.)
  • 98. 18.98 Example 18.2 You can take the original data collected (revenues, household income, and age) and plot y vs. x1 and y vs. x2 to get a feel for the data; trend lines were added for clarity…
  • 99. 18.99 Example 18.2 Checking the regression tool's output… The model fits the data well and it's valid…
  • 100. 18.100 Nominal Independent Variables Thus far in our regression analysis, we’ve only considered variables that are interval. Often however, we need to consider nominal data in our analysis. For example, our earlier example regarding the market for used cars focused only on mileage. Perhaps color is an important factor. How can we model this new variable?
  • 101. 18.101 Indicator Variables An indicator variable (also called a dummy variable) is a variable that can assume either one of only two values (usually 0 and 1). A value of 1 usually indicates the existence of a certain condition, while a value of 0 usually indicates that the condition does not hold. I1 = 1 if the color is white, 0 if not; I2 = 1 if the color is silver, 0 if not.
Car Color | I1 | I2
white | 1 | 0
silver | 0 | 1
other | 0 | 0
two-tone! | 1 | 1
To represent m categories we need m−1 indicator variables.
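A sketch of this coding in pandas (the color values are made up); get_dummies with drop_first=True yields exactly m − 1 indicators, with the dropped category serving as the baseline:

```python
import pandas as pd

cars = pd.DataFrame({"color": ["white", "silver", "other", "white", "other"]})
cars["I1"] = (cars["color"] == "white").astype(int)    # 1 if white, else 0
cars["I2"] = (cars["color"] == "silver").astype(int)   # 1 if silver, else 0
# Equivalent shortcut; the first category in sorted order ("other") becomes the baseline:
dummies = pd.get_dummies(cars["color"], drop_first=True)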
  • 102. 18.102 Interpreting Indicator Variable Coefficients After performing our regression analysis, we obtain a regression equation telling us that: the price diminishes with additional mileage (x); a white car sells for $91.10 more than other colors (I1); a silver car fetches $330.40 more than other colors (I2).
  • 104. 18.104 Testing the Coefficients To test the coefficient of I1, we use these hypotheses: H0: the coefficient of I1 is 0; H1: it is not 0. There is insufficient evidence to infer that, in the population, 3-year-old white Tauruses with the same odometer reading have a different selling price than do Tauruses in the "other" color category…
  • 105. 18.105 Testing the Coefficients To test the coefficient of I2, we use these hypotheses: H0: the coefficient of I2 is 0; H1: it is not 0. We can conclude that there are differences in auction selling prices between 3-year-old silver-colored Tauruses and the "other" color category with the same odometer readings.
  • 106. Stepwise Regression • Stepwise Regression is an iterative procedure that adds and deletes one independent variable at a time. The decision to add or delete a variable is made on the basis of whether that variable improves the model. • It is a procedure that can eliminate correlated independent variables.
  • 107. Step 1: run a simultaneous regression and rank all the significant variables. (The output callouts rank the predictors No. 1 through No. 4.)
  • 108. Step 2 • Analyze > Regression > Linear > Stepwise • Dependent variable • Independent variables (1st round: the top predictor; 2nd round: the top predictor & the 2nd-ranked predictor; … until the nth round, where n = number of predictors) • Statistics: R-square change & Descriptives
  • 109. • Stepwise output • What to read? R², the R² change, the F of the R² change, and the significance level of the F of the R² change in each round
  • 111. • The regression equations • Simultaneous: ŷ = −51785.243 + 460.87 AGE + 4100.9 EDUC + 620 HRS − 862.201 SPHRS … + 329.771 CUREMPYR • Stepwise: ŷ = −44703.12 + 3944.7 EDUC − 617.37 SPHRS + 526.493 PRESTG80 + 956.933 HRS
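SPSS's stepwise procedure adds and removes variables using F-to-enter and F-to-remove criteria. The rough sketch below is a simplified stand-in rather than a faithful reproduction: forward selection only, using adjusted R² as the improvement criterion:

```python
import statsmodels.api as sm

def forward_select(df, response, candidates):
    """Greedy forward selection: repeatedly add the candidate predictor that most
    improves adjusted R-squared; stop when no candidate improves it. Note that
    `candidates` (a list of column names) is consumed as variables are chosen."""
    chosen, best = [], float("-inf")
    improved = True
    while improved and candidates:
        improved = False
        for c in candidates:
            X = sm.add_constant(df[chosen + [c]])
            r2a = sm.OLS(df[response], X, missing="drop").fit().rsquared_adj
            if r2a > best:
                best, pick, improved = r2a, c, True
        if improved:
            chosen.append(pick)
            candidates.remove(pick)
    return chosen

# e.g. forward_select(df, "INCOME", ivs.copy()) with the GSS variables from earlier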
  • 112. Multiple regression • Multiple regression examines the predictability of a set of predictors on a dependent variable (criterion) • Why don’t we just throw in all the predictors and let the MR determine which ones are good predictors then? • Reason 1: Theoretical consideration • Reason 2: Concern of sample size
  • 113. Concern of sample size • The desired level is 20 observations for each independent variable • For instance, if you have 6 predictors, you’ve got to have at least 120 subjects in your data • However, if a stepwise procedure is employed, the recommended level increases to 50 to 1 • That is, you’ve got to have at least 300 subjects in order to run stepwise MR
  • 114. 18.114 Model Building Here is a procedure for building a regression model: 1. Identify the dependent variable; what is it we wish to predict? Don't forget the variable's unit of measure. 2. List potential predictors; how would changes in each predictor change the dependent variable? Be selective; go with the fewest independent variables required. Be aware of the effects of multicollinearity. 3. Gather the data: at least six observations for each independent variable used in the equation.
  • 115. 18.115 Model Building 4. Identify several possible models; formulate first- and second-order models with and without interaction. Draw scatter diagrams. 5. Use statistical software to estimate the models. 6. Determine whether the required conditions are satisfied; if not, attempt to correct the problem. 7. Use your judgment and the statistical output to select the best model!