Chapter Outline Continued
15.7 The Sales Representative Case: Evaluating Employee
Performance
15.8 Using Dummy Variables to Model Qualitative Independent
Variables (Optional)
15.9 Using Squared and Interaction Variables (Optional)
15.10 Multicollinearity, Model Building and Model
Validation (Optional)
15.11 Residual Analysis and Outlier Detection in Multiple
Regression (Optional)
15.1 The Multiple Regression Model and the Least Squares
Point Estimate
Simple linear regression used one independent variable to
explain the dependent variable
Some relationships are too complex to be described using a
single independent variable
Multiple regression uses two or more independent variables to
describe the dependent variable
This allows multiple regression models to handle more complex
situations
There is no limit to the number of independent variables a
model can use
Multiple regression has only one dependent variable
LO15-1: Explain the multiple regression model and the related
least squares point estimates.
The Multiple Regression Model
The multiple regression model relating y to x1, x2,…, xk is
y = β0 + β1x1 + β2x2 +…+ βkxk + ε
µy = β0 + β1x1 + β2x2 +…+ βkxk is the mean value of the
dependent variable y when the values of the independent
variables are x1, x2,…, xk
β0, β1, β2,… βk are the unknown regression parameters relating
the mean value of y to x1, x2,…, xk
ε is an error term that describes the effects on y of all factors
other than the independent variables x1, x2,…, xk
LO15-1
The Least Squares Estimates and Point Estimation and
Prediction
Estimation/prediction equation
ŷ = b0 + b1x1 + b2x2 + … + bkxk
is the point estimate of the mean value of the dependent
variable when the values of the independent variables are x1,
x2,…, xk
It is also the point prediction of an individual value of the
dependent variable when the values of the independent variables
are x1, x2,…, xk
b0, b1, b2,…, bk are the least squares point estimates of the
parameters β0, β1, β2,…, βk
x1, x2,…, xk are specified values of the independent predictor
variables x1, x2,…, xk
LO15-1
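As a sketch of how the least squares point estimates can be computed, the following uses NumPy on a small made-up data set (all variable values here are hypothetical, not from the text):

```python
import numpy as np

# Hypothetical data: dependent variable y and two independent variables x1, x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = np.array([4.1, 5.9, 9.2, 10.8, 14.1])

# Design matrix: a column of 1s (for b0) followed by the independent variables
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares point estimates b0, b1, b2 minimize the sum of squared residuals
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Point estimate of the mean of y (and point prediction of an individual y)
# when x1 = 2.5 and x2 = 3.0
y_hat = b[0] + b[1] * 2.5 + b[2] * 3.0
```

In practice the same estimates come from any regression routine; the point is that one fitted equation serves as both the estimate of the mean value and the prediction of an individual value.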
LO15-1
Example 15.1 The Tasty Sub Shop Case
Figure 15.4 (a)
15.2 R2 and Adjusted R2
Total variation is Σ(yi - ȳ)2
Explained variation is Σ(ŷi - ȳ)2
Unexplained variation is Σ(yi - ŷi)2
Total variation is the sum of explained and unexplained
variation
LO15-2: Calculate and interpret the multiple and adjusted
multiple coefficients of determination.
R2 and Adjusted R2 Continued
The multiple coefficient of determination is the ratio of
explained variation to total variation
R2 is the proportion of the total variation that is explained by
the overall regression model
Multiple correlation coefficient R is the square root of R2
LO15-2
Multiple Correlation Coefficient R
The multiple correlation coefficient R is just the square root of
R2
With simple linear regression, r would take on the sign of b1
There are multiple bi’s with multiple regression
For this reason, R is always positive
To interpret the direction of the relationship between the x’s
and y, you must look to the sign of the appropriate bi
coefficient
LO15-2
Adjusted R2
Adding an independent variable to multiple regression will raise
R2
R2 will rise slightly even if the new variable has no relationship
to y
The adjusted R2 corrects this tendency in R2
As a result, it gives a better estimate of the importance of the
independent variables
LO15-2
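A minimal sketch of these calculations, using hypothetical observed and fitted values and assuming k = 2 independent variables (the adjusted R2 formula below is the standard 1 - (1 - R2)(n - 1)/(n - (k + 1)) form):

```python
import numpy as np

# Hypothetical observed values and fitted values from some regression with k = 2
y = np.array([10.0, 12.0, 15.0, 11.0, 14.0, 18.0])
y_hat = np.array([10.5, 11.8, 14.2, 11.4, 14.6, 17.5])
n, k = len(y), 2

total_var = ((y - y.mean()) ** 2).sum()        # total variation
unexplained_var = ((y - y_hat) ** 2).sum()     # unexplained variation
explained_var = total_var - unexplained_var    # total = explained + unexplained

r2 = explained_var / total_var                 # multiple coefficient of determination
r2_adj = 1 - (1 - r2) * (n - 1) / (n - (k + 1))  # adjusted R2
multiple_r = r2 ** 0.5                         # multiple correlation coefficient, always positive
```

Note that r2_adj is always below r2, reflecting the penalty for each added independent variable.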
15.3 Model Assumptions and the Standard Error
Mean of Zero Assumption: The mean of the error terms is equal
to 0
Constant Variance Assumption: The variance of the error terms,
σ2, is the same for every combination of values of x1, x2,…, xk
Normality Assumption: The error terms follow a normal
distribution for every combination of values of
x1, x2,…, xk
Independence Assumption: The values of the error terms are
statistically independent of each other
LO15-3: Explain the assumptions behind multiple regression
and calculate the standard error.
The Mean Square Error and the Standard Error
Sum of squared errors: SSE = Σ(yi - ŷi)2
Mean squared error: s2 = SSE / (n - (k + 1)), the point estimate
of the residual variance σ2
Standard error: s = √s2, the point estimate of the residual
standard deviation σ
LO15-3
15.4 The Overall F Test
To test
H0: β1= β2 = …= βk = 0 versus
Ha: At least one of β1, β2,…, βk ≠ 0
Test statistic: F = (Explained variation / k) / (Unexplained
variation / (n - (k + 1)))
Reject H0 in favor of Ha if F > Fα or the p-value is less than α
Fα is based on k numerator and n - (k + 1) denominator
degrees of freedom
LO15-4: Test the significance of a multiple regression model by
using an F test.
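The F statistic can be sketched directly from the variation breakdown; the explained/unexplained values, n, and k below are hypothetical:

```python
# Overall F test of H0: beta1 = ... = betak = 0
# (hypothetical values for the variation breakdown and sample size)
explained_var, unexplained_var = 920.0, 80.0
n, k = 25, 3

f_stat = (explained_var / k) / (unexplained_var / (n - (k + 1)))
# Reject H0 at level alpha if f_stat exceeds the F point with
# k numerator and n - (k + 1) denominator degrees of freedom
```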
15.5 Testing the Significance of an Independent Variable
A variable in a multiple regression model is not likely to be
useful unless there is a significant relationship between it and y
To test significance, we use the null hypothesis H0: βj = 0
Versus the alternative hypothesis
Ha: βj ≠ 0
LO15-5: Test the significance of a single independent variable.
Testing the Significance of the Independent Variable xj
Test statistic: t = bj / sbj, where sbj is the standard error of bj;
if H0: βj = 0 is true, this t is based on n - (k + 1) degrees of freedom
Reject H0 if |t| > tα/2 or the p-value is less than α
LO15-5
Testing the Significance of an Independent Variable Continued
It is customary to test the significance of every independent
variable in a regression model
Rejecting H0: βj = 0 gives evidence that the independent
variable xj is significantly related to y
The smaller the level of significance at which H0 can be
rejected, the stronger the evidence that xj is significantly
related to y
LO15-5
A Confidence Interval for the Regression Parameter βj
If the regression assumptions hold, a
100(1 - α) percent confidence interval for βj is [bj ± tα/2 sbj]
tα/2 is based on n - (k + 1) degrees of freedom
LO15-5
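A short sketch of the t test and confidence interval for a single βj; the estimate, standard error, and sample sizes are hypothetical, and the t point is taken from a t table under an assumed α = 0.05:

```python
# t test and confidence interval for one regression parameter beta_j
# (hypothetical values: point estimate b_j and its standard error s_bj)
b_j, s_bj = 12.5, 4.0
n, k = 30, 4

t_stat = b_j / s_bj            # compare |t_stat| with t_{alpha/2}
df = n - (k + 1)               # degrees of freedom for the t point

# 95% confidence interval [b_j +/- t_{alpha/2} * s_bj]
t_point = 2.060                # t_{0.025} with 25 df, from a t table (alpha = 0.05 assumed)
ci = (b_j - t_point * s_bj, b_j + t_point * s_bj)
```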
15.6 Confidence and Prediction Intervals
The point on the regression line corresponding to particular
values x1, x2,…, xk of the independent variables is
ŷ = b0 + b1x1 + b2x2 + … + bkxk
It is unlikely that this value will equal the mean value of y for
these x values
Therefore, we need to place bounds on how far away the
predicted value might be
We can do this by calculating a confidence interval for the mean
value of y and a prediction interval for an individual value of y
LO15-6: Find and interpret a confidence interval for a mean
value and a prediction interval for an individual value.
Distance Value
Both the confidence interval for the mean value of y and the
prediction interval for an individual value of y employ a
quantity called the distance value
With simple regression, we were able to calculate the distance
value fairly easily
However, for multiple regression, calculating the distance value
requires matrix algebra
LO15-6
A Confidence Interval and a Prediction Interval
Assume the regression assumptions hold
Confidence interval for the mean value of y:
[ŷ ± tα/2 s √(distance value)], where the distance value is
computed using matrix algebra
Prediction interval for an individual value of y:
[ŷ ± tα/2 s √(1 + distance value)]
These are based on n - (k + 1) degrees of freedom
LO15-6
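The matrix-algebra distance value and both intervals can be sketched as follows; the data are hypothetical, and the t point is taken from a t table for the resulting 3 degrees of freedom with α = 0.05 assumed:

```python
import numpy as np

# Hypothetical design matrix (column of 1s, then x1, x2) and responses
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 6.0, 5.0],
              [1.0, 8.0, 2.0],
              [1.0, 3.0, 7.0],
              [1.0, 5.0, 4.0]])
y = np.array([7.0, 9.0, 15.0, 14.0, 13.0, 12.0])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
n, kp1 = X.shape
s = np.sqrt(((y - X @ b) ** 2).sum() / (n - kp1))   # standard error

x0 = np.array([1.0, 5.0, 3.0])                      # point of interest, with leading 1
dist = x0 @ np.linalg.inv(X.T @ X) @ x0             # distance value x0'(X'X)^-1 x0
y_hat = x0 @ b

t_point = 3.182                                     # t_{0.025} with n-(k+1)=3 df, from a table
ci = (y_hat - t_point * s * np.sqrt(dist),          # CI for the mean value of y
      y_hat + t_point * s * np.sqrt(dist))
pi = (y_hat - t_point * s * np.sqrt(1 + dist),      # PI for an individual value of y
      y_hat + t_point * s * np.sqrt(1 + dist))
```

Because the prediction interval uses √(1 + distance value) rather than √(distance value), it is always wider than the confidence interval at the same point.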
15.7 The Sales Representative Case: Evaluating Employee
Performance
y = Yearly sales of the company's product
x1 = Number of months the representative has been employed
x2 = Sales of products in the sales territory
x3 = Dollar advertising expenditure in the territory
x4 = Weighted average of the company's market share in the
territory for the previous four years
x5 = Change in the company's market share in the territory over
the previous four years
Partial Excel Output of a Regression Analysis of the Sales
Territory Performance Data
Figure 15.10a
Time = 85.42
MktPoten = 35,182.73
Adver = 7,281.65
MktShare = 9.64
Change = .28
Predicted Sales = 4,181.74
95% Prediction Interval
[3,233.59 to 5,129.89]
15.8 Using Dummy Variables to Model Qualitative Independent
Variables (Optional)
So far, we have only looked at including quantitative data in a
regression model
However, we may wish to include descriptive qualitative data as
well
For example, might want to include the gender of respondents
We can model the effects of different levels of a qualitative
variable by using what are called dummy variables
Also known as indicator variables
LO15-7: Use dummy variables to model qualitative independent
variables (Optional).
Constructing Dummy Variables
A dummy variable always has a value of either 0 or 1
For example, to model sales at two locations, would code the
first location as a zero and the second as a 1
Operationally, it does not matter which is coded 0 and which is
coded 1
LO15-7
What If We Have More Than Two Categories?
Consider having three categories, say A, B and C
Cannot code this using one dummy variable
A=0, B=1 and C=2 would be invalid
Assumes the difference between A and B is the same as B and C
We must use multiple dummy variables
Specifically, k categories require k - 1 dummy variables
LO15-7
What If We Have Three Categories?
For A, B, and C, would need two dummy variables
x1 is 1 for A, zero otherwise
x2 is 1 for B, zero otherwise
If x1 and x2 are zero, must be C
This is why the third dummy variable is not needed
LO15-7
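The k - 1 coding for three categories can be sketched with plain Python; the category labels below are hypothetical:

```python
# Coding a three-category qualitative variable (A, B, C) with k - 1 = 2 dummies
# (hypothetical labels)
locations = ["A", "B", "C", "A", "C", "B"]

rows = []
for loc in locations:
    x1 = 1 if loc == "A" else 0   # dummy for category A
    x2 = 1 if loc == "B" else 0   # dummy for category B
    rows.append((x1, x2))         # C is the baseline: x1 = x2 = 0
```

Each category gets its own (x1, x2) pattern, so the model never implies an ordering like A < B < C the way a single 0/1/2 code would.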
Interaction Models
So far, we have only considered dummy variables as stand-alone
variables, as in y = β0 + β1x + β2D + ε
where D is the dummy variable
However, we can also look at the interaction between a dummy
variable and other variables
That model takes the form y = β0 + β1x + β2D + β3xD + ε
With an interaction term, both the intercept and slope are
shifted
LO15-7
15.9 Using Squared and Interaction Variables (Optional)
The quadratic regression model is:
y = β0 + β1x + β2x2 + ε
where
β0 + β1x + β2x2 is μy
β0, β1, and β2 are the regression parameters
ε is an error term
LO15-8: Use squared and interaction variables.
Using Interaction Variables
Regression models often contain interaction variables
Formed by multiplying two independent variables together
Consider a model where x3 and x4 interact
and x3 is used as a quadratic
y = β0 + β1x4 + β2x3 + β3x32 + β4x4x3 + ε
LO15-8
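Squared and interaction terms are just new columns computed from the original variables; a minimal sketch with hypothetical x3 and x4 values:

```python
# Building the derived columns for a model like
# y = b0 + b1*x4 + b2*x3 + b3*x3^2 + b4*x4*x3   (hypothetical data)
x3 = [1.0, 2.0, 3.0, 4.0]
x4 = [2.0, 5.0, 4.0, 7.0]

x3_sq = [v ** 2 for v in x3]               # squared (quadratic) term
x4_x3 = [a * b for a, b in zip(x4, x3)]    # interaction term

# These derived columns then enter the fit as ordinary independent variables
```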
15.10 Multicollinearity, Model Building, and Model Validation
(Optional)
Multicollinearity: when “independent” variables are related to
one another
Considered severe when the simple correlation exceeds 0.9
Even moderate multicollinearity can be a problem
Another measurement is variance inflation factors
Multicollinearity considered
Severe when VIF > 10
Moderately strong for VIF > 5
LO15-9: Describe multicollinearity and build and validate a
multiple regression model (Optional).
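A variance inflation factor is found by regressing one independent variable on the others and setting VIF = 1 / (1 - R2). A sketch with hypothetical, deliberately near-collinear data:

```python
import numpy as np

# Hypothetical independent variables; x2 is almost a copy of x1
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.1, 2.0, 3.2, 3.9, 5.1, 6.0])
x3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

# VIF for x1: regress x1 on the remaining independent variables
X_others = np.column_stack([np.ones_like(x2), x2, x3])
b, *_ = np.linalg.lstsq(X_others, x1, rcond=None)
sse = ((x1 - X_others @ b) ** 2).sum()
sst = ((x1 - x1.mean()) ** 2).sum()
r2_j = 1 - sse / sst
vif = 1 / (1 - r2_j)
# vif > 10 signals severe multicollinearity; vif > 5 moderately strong
```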
Effect of Adding Independent Variable
Adding any independent variable will increase R²
Even adding an unimportant independent variable
Thus, R² cannot tell us that adding an independent variable is
undesirable
LO15-9
A Better Criterion is the Standard Error
A better criterion is the size of the standard error s
If s increases when an independent variable is added, we should
not add that variable
However, a decrease in s alone is not enough
An independent variable should be included only if it reduces s
enough to offset the larger t point (caused by the loss of a
degree of freedom) and thus reduces the length of the desired
prediction interval for y
LO15-9
C Statistic
Another quantity for comparing regression models is the
C (a.k.a. Cp) statistic
First, calculate the mean square error for the model containing
all p potential independent variables, denoted s2p
Next, calculate the SSE for a reduced model with k independent
variables
Then C = SSE / s2p - [n - 2(k + 1)]
LO15-9
C Statistic Continued
We want the value of C to be small
Adding unimportant independent variables will raise the value
of C
While we want C to be small, we also wish to find a model for
which C roughly equals k + 1
A model with C substantially greater than k + 1 has substantial
bias and is undesirable
If a model has a small value of C and C for this model is less
than k + 1, then it is not biased and the model should be
considered desirable
LO15-9
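The C statistic calculation can be sketched with hypothetical inputs (s2p from a hypothetical full model, SSE from a hypothetical reduced model):

```python
# C (Cp) statistic for a reduced model with k independent variables
# (hypothetical values)
n = 30
s2_p = 4.0       # mean square error of the model with all p potential variables
sse_k = 120.0    # SSE of the reduced model with k variables
k = 3

c = sse_k / s2_p - (n - 2 * (k + 1))
# Here C = 8 exceeds k + 1 = 4, which would suggest this reduced model is biased
```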
The Partial F Test: An F Test for a Portion of a Regression
Model
To test
H0: All of the βj coefficients corresponding to the independent
variables in the subset are zero
Ha: At least one of the βj coefficients is not equal to zero
Test statistic: F = [(SSE of the reduced model - SSE of the
complete model) / g] / [SSE of the complete model / (n - (k + 1))],
where g is the number of independent variables in the subset
Reject H0 in favor of Ha if F > Fα or the p-value is less than α
Fα is based on g numerator and n - (k + 1) denominator
degrees of freedom
LO15-9
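The partial F statistic is a one-line calculation once the two SSE values are known; the numbers below are hypothetical:

```python
# Partial F statistic for testing a subset of g coefficients
# (hypothetical SSE values for the reduced and complete models)
sse_reduced, sse_complete = 250.0, 160.0
n, k, g = 28, 5, 2    # complete model has k variables; the tested subset has g

f_partial = ((sse_reduced - sse_complete) / g) / (sse_complete / (n - (k + 1)))
# Compare with the F point having g numerator and n - (k + 1) denominator df
```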
15.11 Residual Analysis and Outlier Detection in Multiple
Regression (Optional)
For an observed value of yi, the residual is
ei = yi - ŷi = yi - (b0 + b1xi1 + … + bkxik)
If the assumptions hold, the residuals should look like a random
sample from a normal distribution with mean 0 and variance σ2
Residual plots
Residuals versus each independent variable
Residuals versus predicted y’s
Residuals in time order (if the response is a time series)
LO15-10: Use residual analysis and outlier detection to check
the assumptions of multiple regression (Optional).
Figure 15.35
LO15-10
Outliers
Figure 15.37 c, d and e
Chapter Outline Continued
14.6 Testing the Significance of the Population Correlation
Coefficient (Optional)
14.7 Residual Analysis
14.1 The Simple Linear Regression Model and the Least
Squares Point Estimates
The dependent (or response) variable is the variable we wish to
understand or predict
The independent (or predictor) variable is the variable we will
use to understand or predict the dependent variable
Regression analysis is a statistical technique that uses observed
data to relate the dependent variable to one or more independent
variables
The objective is to build a regression model that can describe,
predict and control the dependent variable based on the
independent variable
LO14-1: Explain the simple linear regression model.
Form of the Simple Linear Regression Model
y = β0 + β1x + ε
µy = β0 + β1x is the mean value of the dependent variable y
when the value of the independent variable is x
β0 is the y-intercept; the mean of y when x is zero
β1 is the slope; the change in the mean of y per unit change in x
ε is an error term that describes the effect on y of all factors
other than x
LO14-1
Regression Terms
β0 and β1 are called regression parameters
β0 is the y-intercept
β1 is the slope
We do not know the true values of these parameters
So, we must use sample data to estimate them
b0 is the estimate of β0
b1 is the estimate of β1
LO14-1
LO14-1
The Simple Linear Regression Model Illustrated
Figure 14.3
The Least Squares Point Estimates
Slope: b1 = SSxy / SSxx, where SSxy = Σxiyi - (Σxi)(Σyi)/n
and SSxx = Σxi2 - (Σxi)2/n
y-intercept: b0 = ȳ - b1x̄
LO14-2: Find the least squares point estimates of the slope and
y-intercept.
Example 14.2 The Tasty Sub Shop Case: The Least Squares
Estimates
LO14-2
Example 14.2 The Tasty Sub Shop Case: The Least Squares
Estimates
From last slide,
Σyi = 8,603.1
Σxi = 434.1
Σxi2 = 20,757.41
Σxiyi = 403,296.96
Once we have these values, we no longer need the raw data
Calculation of b0 and b1 uses these totals
LO14-2
Example 14.2 The Tasty Sub Shop Case (Slope b1)
LO14-2
Example 14.2 The Tasty Sub Shop Case (y-Intercept b0)
Prediction (x = 20.8)
ŷ = b0 + b1x = 183.31 + (15.59)(20.8)
ŷ = 507.69
Residual is 527.1 – 507.69 = 19.41
LO14-2
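The slide's totals are enough to recompute b1, b0, and the prediction. A sketch using those totals (n = 10 restaurants is an assumption; the totals shown are consistent with it):

```python
# Recomputing the Tasty Sub Shop least squares estimates from the slide totals
# (n = 10 is assumed, not stated on the slide)
n = 10
sum_y = 8_603.1
sum_x = 434.1
sum_x2 = 20_757.41
sum_xy = 403_296.96

sxy = sum_xy - sum_x * sum_y / n     # SSxy
sxx = sum_x2 - sum_x ** 2 / n        # SSxx
b1 = sxy / sxx                       # slope
b0 = sum_y / n - b1 * sum_x / n      # y-intercept = y-bar - b1 * x-bar

y_hat = b0 + b1 * 20.8               # prediction at x = 20.8
residual = 527.1 - y_hat             # residual for the observed y = 527.1
```

The results reproduce the slide's values (b1 about 15.6, b0 about 183.31, prediction about 507.69) up to rounding of the slope.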
Figure 14.5
14.2 Simple Coefficients of Determination and Correlation
How useful is a particular regression model?
One measure of usefulness is the simple coefficient of
determination
It is represented by the symbol r2
LO14-3: Calculate and interpret the simple coefficients of
determination and correlation.
The Simple Coefficient of Determination, r2
Total variation is Σ(yi - ȳ)2
Explained variation is Σ(ŷi - ȳ)2
Unexplained variation is Σ(yi - ŷi)2
Total variation is the sum of explained and unexplained
variation
The simple coefficient of determination is
r2 = explained variation / total variation
r2 is the proportion of the total variation that is explained
LO14-3
The Simple Correlation Coefficient, r
The simple correlation coefficient between y and x is denoted
by r
It is…
r = +√r2 if b1 is positive
r = -√r2 if b1 is negative
where b1 is the slope of the least squares line
The simple correlation coefficient measures the strength of the
linear relationship between y and x
LO14-3
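The sign rule above is a one-liner; r2 and b1 below are hypothetical values:

```python
# Simple correlation coefficient from r^2 and the sign of the slope b1
# (hypothetical values)
r_squared = 0.81
b1 = -15.6

r = (r_squared ** 0.5) if b1 > 0 else -(r_squared ** 0.5)
# With a negative slope, r takes the negative square root
```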
LO14-3
Different Values of the Correlation Coefficient
Figure 14.8
14.3 Model Assumptions and the Standard Error
Mean of Zero: At any given value of x, the population of
potential error term values has a mean equal to zero
Constant Variance Assumption: At any value of x, the
population of potential error term values has a variance that
does not depend on the value of x
Normality Assumption: At any given value of x, the population
of potential error term values has a normal distribution
Independence Assumption: Any one value of the error term ε is
statistically independent of any other value of ε
LO14-4: Describe the assumptions behind simple linear
regression and calculate the standard error.
Figure 14.9
LO14-4
The Mean Square Error and the Standard Error
Sum of squared errors: SSE = Σ(yi - ŷi)2
Mean square error: s2 = SSE / (n - 2), the point estimate of the
residual variance σ2
Standard error: s = √s2, the point estimate of the residual
standard deviation σ
14.4 Testing the Significance of the Slope and y-Intercept
A regression model is not likely to be useful unless there is a
significant relationship between x and y
To test significance, we use the null hypothesis:
H0: β1 = 0
Versus the alternative hypothesis:
Ha: β1 ≠ 0
LO14-5: Test the significance of the slope and y-intercept.
Testing the Significance of the Slope and y-Intercept Continued
Test statistic: t = b1 / sb1, where sb1 is the standard error of b1;
if H0: β1 = 0 is true, this t is based on n - 2 degrees of freedom
Reject H0 if |t| > tα/2 or the p-value is less than α
LO14-5
An F Test for the Significance of the Slope (Optional)
H0: β1 = 0
Test statistic: F = Explained variation / (Unexplained
variation / (n - 2))
Reject H0 if F > Fα or the p-value is less than α, where Fα is
based on 1 numerator and n - 2 denominator degrees of
freedom
LO14-6: Test the significance of a simple linear regression
model by using an F test (Optional).
14.5 Confidence and Prediction Intervals
The point on the regression line corresponding to a particular
value of x0 of the independent variable x is ŷ = b0 + b1x0
It is unlikely that this value will equal the mean value of y
when x equals x0
Therefore, we need to place bounds on how far the predicted
value might be from the actual value
We can do this by calculating a confidence interval for the
mean value of y and a prediction interval for an individual value of y
LO14-7: Calculate and interpret a confidence interval for a
mean value and a prediction interval for an individual value.
Distance Value
Both the confidence interval for the mean value of y and the
prediction interval for an individual value of y employ a
quantity called the distance value
The distance value is a measure of the distance between the
value x0 of x and x̄, the average of the observed x values
For simple regression, distance value = 1/n + (x0 - x̄)2 / SSxx
Notice that the further x0 is from x̄, the larger the distance value
LO14-7
A Confidence Interval and Prediction Interval
Assume that the regression assumptions hold
The formula for a 100(1 - α) percent confidence interval for the
mean value of y is [ŷ ± tα/2 s √(distance value)]
The formula for a 100(1 - α) percent prediction interval for an
individual value of y is [ŷ ± tα/2 s √(1 + distance value)]
These are based on n - 2 degrees of freedom
LO14-7
Which to Use?
The prediction interval is useful if it is important to predict an
individual value of the dependent variable
A confidence interval is useful if it is important to estimate the
mean value
The prediction interval will always be wider than the confidence
interval
LO14-7
14.6 Testing the Significance of the Population Correlation
Coefficient (Optional)
The simple correlation coefficient (r) measures the linear
relationship between the observed values of x and y from the
sample
The population correlation coefficient (ρ) measures the linear
relationship between all possible combinations of observed
values of x and y
r is an estimate of ρ
LO14-8: Test hypotheses about the population correlation
coefficient (Optional).
Testing ρ
We can test to see if the correlation is significant using the
hypotheses
H0: ρ = 0
Ha: ρ ≠ 0
The test statistic is t = r √(n - 2) / √(1 - r2), based on n - 2
degrees of freedom
This test will give the same results as the test for significance
on the slope coefficient b1
LO14-8
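The test statistic for ρ needs only r and n; both values below are hypothetical:

```python
# t statistic for testing H0: rho = 0 (hypothetical r and n)
r, n = 0.6, 27

t_stat = r * ((n - 2) ** 0.5) / ((1 - r ** 2) ** 0.5)
# Compare |t_stat| with t_{alpha/2} on n - 2 degrees of freedom
```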
14.7 Residual Analysis
Checks of regression assumptions are performed by analyzing
the regression residuals
Residuals (e) are defined as the difference between the observed
value of y and the predicted value of y: e = y - ŷ
Note that e is the point estimate of ε
If regression assumptions valid, the population of potential
error terms will be normally distributed with mean zero and
variance σ2
Different error terms will be statistically independent
LO14-9: Use residual analysis to check the assumptions of
simple linear regression.
Residual Analysis Continued
Residuals are randomly and independently selected from normal
populations with mean zero and variance σ2
With any real data, assumptions will not hold exactly
Mild departures do not affect our ability to make statistical
inferences
37. In checking assumptions, we are looking for pronounced
departures from the assumptions
So, only require residuals to approximately fit the description
above
LO14-9
LO14-9
Example 14.9 The QHIC Case: Constructing Residual Plots
Figure 14.18b
Quality Home Improvement Center (QHIC) operates five stores
Studies the relationship between home value and yearly
expenditure on home upkeep
Random sample of 40 homeowners
Intercept = –348.3921
Slope = 7.2583
Residual Plots
Residuals versus independent variable
Residuals versus predicted y’s
Residuals in time order (if the response is a time series)
LO14-9
Constant Variance Assumption
To check the validity of the constant variance assumption,
examine residual plots against
The x values
The predicted y values
Time (when data is time series)
A pattern that fans out says the variance is increasing rather
than staying constant
A pattern that funnels in says the variance is decreasing rather
than staying constant
A pattern that is evenly spread within a band says the
assumption has been met
LO14-9
LO14-9
Constant Variance Visually
Figure 14.19
Assumption of Correct Functional Form
If the relationship between x and y is something other than a
linear one, the residual plot will often suggest a form more
appropriate for the model
For example, if there is a curved relationship between x and y, a
plot of residuals will often show a curved relationship
LO14-9
Normality Assumption
If the normality assumption holds, a histogram or stem-and-leaf
display of residuals should look bell-shaped and symmetric
Another way to check is a normal plot of residuals
Order residuals from smallest to largest
Plot the ordered residuals e(i) on the vertical axis against the
points z(i)
z(i) is the point on the horizontal axis under the standard
normal curve such that the area under the curve to its left is
(3i - 1)/(3n + 1)
If the normality assumption holds, the plot should have a
straight-line appearance
LO14-9
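Ordering the residuals and computing the (3i - 1)/(3n + 1) plotting positions can be sketched as follows; the residuals are hypothetical, and the z points themselves would come from the standard normal table for each area:

```python
# Ordered residuals and normal-plot positions (3i - 1) / (3n + 1)
# (hypothetical residuals)
residuals = [1.2, -0.8, 0.3, -1.5, 0.9]
n = len(residuals)

ordered = sorted(residuals)                       # e_(1) <= ... <= e_(n)
areas = [(3 * i - 1) / (3 * n + 1) for i in range(1, n + 1)]
# Plot ordered[i] against the z point whose left-tail area is areas[i];
# a roughly straight line supports the normality assumption
```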
Independence Assumption
Independence assumption most likely violated by time-series
data
If the data is not time series, it can be reordered without
affecting the analysis
For time-series data, the time-ordered error terms can be
autocorrelated
Positive autocorrelation is when a positive error term in time
period i tends to be followed by another positive value in i + k
Negative autocorrelation is when a positive error term tends to
be followed by a negative value
Either one will cause a cyclical error term over time
LO14-9