2. Slide-2
Learning Objectives
How to use regression analysis to predict the value of
a dependent variable based on an independent
variable
The meaning of the regression coefficients b0 and b1
How to evaluate the assumptions of regression
analysis and know what to do if the assumptions are
violated
How to make inferences about the slope and correlation
coefficient
How to estimate mean values and predict individual values
3. Slide-3
Correlation vs. Regression
A scatter diagram can be used to show the
relationship between two variables
Correlation analysis is used to measure
strength of the association (linear relationship)
between two variables
Correlation is only concerned with strength of the
relationship
No causal effect is implied with correlation
4. Slide-4
Introduction to
Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable based on the
value of at least one independent variable
Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to predict
or explain
Independent variable: the variable used to explain
the dependent variable
5. Slide-5
Simple Linear Regression
Model
Only one independent variable, X
Relationship between X and Y is
described by a linear function
Changes in Y are assumed to be caused
by changes in X
9. Department of Statistics, ITS Surabaya Slide-9
Simple Linear Regression
Model

Yi = β0 + β1Xi + εi

where:
Yi = dependent variable
Xi = independent variable
β0 = population Y intercept
β1 = population slope coefficient
εi = random error term

The linear component is β0 + β1Xi; the random error component is εi.
10. Slide-10
Simple Linear Regression
Model (continued)

Yi = β0 + β1Xi + εi

[Figure: scatter of Y against X with the population regression line,
intercept β0 and slope β1. For a given Xi, the random error εi is the
vertical distance between the observed value of Y for Xi and the
predicted value of Y for Xi on the line.]
11. Slide-11
Simple Linear Regression
Equation (Prediction Line)

The simple linear regression equation provides an
estimate of the population regression line:

Ŷi = b0 + b1Xi

where:
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i

The individual random error terms ei have a mean of zero
12. Slide-12
Simple Linear Regression
Example
A real estate agent wishes to examine the
relationship between the selling price of a home
and its size (measured in square feet)
A random sample of 10 houses is selected
Dependent variable (Y) = house price in $1000s
Independent variable (X) = square feet
13. Slide-13
Sample Data for House Price
Model
House Price in $1000s (Y)    Square Feet (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
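As a quick check on the sample, the least-squares coefficients for these ten houses can be computed in plain Python (a minimal sketch; the variable names are my own):

```python
# House-price sample from the slide: price in $1000s (y), size in sq. ft (x)
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares: b1 = S_xy / S_xx, b0 = ybar - b1 * xbar
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")  # b0 = 98.24833, b1 = 0.10977
```

These values agree with the regression output shown on the later slides.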
15. Slide-15
Graphical Presentation
House price model: scatter plot and
regression line

[Figure: scatter plot of house price ($1000s, axis 0 to 450) against
square feet (axis 0 to 3000), with the fitted regression line]

house price = 98.248 + 0.10977 (square feet)

Slope = 0.10977
Intercept = 98.248
16. Slide-16
Interpretation of the
Intercept, b0
b0 is the estimated average value of Y when the
value of X is zero (if X = 0 is in the range of
observed X values)
Here, no houses had 0 square feet, so b0 = 98.24833
just indicates that, for houses within the range of
sizes observed, $98,248.33 is the portion of the
house price not explained by square feet
house price = 98.24833 + 0.10977 (square feet)
17. Slide-17
Interpretation of the
Slope Coefficient, b1
b1 measures the estimated change in the
average value of Y as a result of a one-
unit change in X
Here, b1 = .10977 tells us that the average value of a
house increases by .10977($1000) = $109.77 for each
additional square foot of size
house price = 98.24833 + 0.10977 (square feet)
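The fitted equation can be used directly for prediction (a sketch; the function name and the 2000 sq. ft example house are my own, the coefficients are the slide's):

```python
# Coefficients from the slide's fitted regression
b0, b1 = 98.24833, 0.10977

def predict_price(square_feet):
    """Predicted house price in $1000s for a house of the given size."""
    return b0 + b1 * square_feet

# A hypothetical 2000 sq. ft house:
print(round(predict_price(2000), 2))  # 317.79, i.e. about $317,790
```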
19. Slide-19
Measures of Variation
Total variation is made up of two parts:

SST = SSR + SSE

SST = Total Sum of Squares
SSR = Regression Sum of Squares
SSE = Error Sum of Squares

SST = Σ(Yi − Ȳ)²
SSR = Σ(Ŷi − Ȳ)²
SSE = Σ(Yi − Ŷi)²

where:
Ȳ = average value of the dependent variable
Yi = observed values of the dependent variable
Ŷi = predicted value of Y for the given Xi value
20. Slide-20
SST = total sum of squares
Measures the variation of the Yi values around their
mean Y
SSR = regression sum of squares
Explained variation attributable to the relationship
between X and Y
SSE = error sum of squares
Variation attributable to factors other than the
relationship between X and Y
(continued)
Measures of Variation
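The decomposition SST = SSR + SSE can be verified numerically on the house-price sample (a sketch in plain Python; variable names are my own):

```python
# House-price sample from the earlier slide
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares fit
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)          # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error sum of squares

print(round(sst, 4), round(ssr, 4), round(sse, 4))     # SST equals SSR + SSE
```

The three sums match the ANOVA table in the Excel output shown later in the deck (SST = 32600.5, SSR = 18934.9348, SSE = 13665.5652).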
23. Multiple Regression
In general, the regression estimates are more
reliable if:
i) n is large (large dataset)
ii) the sample variance of the explanatory
variable is high
iii) the variance of the error term is small
iv) the explanatory variables are not closely
related to each other
24. Multiple Regression
The constant and slope parameters are derived in the
same way as in the bivariate model, by
minimising the sum of the squared error terms.
The equation for the slope parameters
contains an expression for the covariance
between the explanatory variables.
When a new variable is added it affects the
coefficients of the existing variables
25. Regression
In the regression below, a unit rise in x produces a 0.4 unit rise in y,
with z held constant.
Interpretation of the t-statistic remains the same: t = (0.4 − 0)/0.4 = 1
(critical value 2.02), so we fail to reject the null and x is not
significant.
The R-squared statistic indicates that 30% of the variance of y is
explained.
The DW statistic lies in the zone of indecision (dL = 1.43, dU = 1.62),
so we cannot be sure whether there is autocorrelation.
ŷt = 0.6 + 0.4xt + 0.9zt
     (0.1)  (0.4)  (0.3)

R² = 0.3, DW = 1.56, 45 observations (standard errors in brackets)
26. Adjusted R-squared Statistic
This statistic is used in a multiple regression
analysis, because it does not automatically
rise when an extra explanatory variable is
added.
Its value depends on the number of
explanatory variables
It is usually written as R̄² (R-bar squared)
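The usual formula is R̄² = 1 − (1 − R²)(n − 1)/(n − k − 1). A quick sketch using the house-price example's figures (R² = 0.58082, n = 10, k = 1, as reported in the Excel output later in the deck):

```python
# Adjusted R-squared: penalises R-squared for the number of regressors k
r2, n, k = 0.58082, 10, 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 5))  # 0.52842, matching the Excel output
```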
27. ANOVA, or Analysis of
Variance
It is a statistical method used to compare the
means of two or more groups to determine if
there are any significant differences between
them.
It is commonly used in research studies to
analyze the effects of different variables on a
particular outcome
28. The F-test
The F-test is an analysis of the variance of a regression
It can be used to test for the significance of a group of
variables or for a restriction
It has a different distribution to the t-test, but can be
used to test at different levels of significance
When determining the F-statistic we need to collect
either the residual sum of squares (RSS) or the R-
squared statistic
The formula for the F-test of a group of variables can be
expressed in terms of either the residual sum of
squares (RSS) or explained sum of squares (ESS)
29. F-test of explanatory power
This is the F-test for the goodness of fit of a
regression and in effect tests for the joint
significance of the explanatory variables.
It is based on the R-squared statistic
It is routinely produced by most computer
software packages
It follows the F-distribution, which is quite
different to the t-distribution
30. F-test formula
The formula for the F-test of the goodness of
fit is:
F = (R²/k) / [(1 − R²)/(n − k − 1)]

with k and (n − k − 1) degrees of freedom
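Applying this formula to the house-price example (R² = 0.58082, n = 10, k = 1, taken from the Excel output later in the deck) gives a quick numerical check:

```python
# F-test of goodness of fit: F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
r2, n, k = 0.58082, 10, 1
f = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f, 2))  # 11.08, agreeing with the F in the ANOVA table
```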
31. F-distribution
To find the critical value of the F-distribution, in
general you need to know the number of
parameters and the degrees of freedom
The number of parameters is read across
the top of the table, the degrees of freedom down the side.
Where these two values intersect, we find the
critical value.
32. F-statistic
When testing for the significance of the
goodness of fit, our null hypothesis is that the
explanatory variables jointly equal 0.
If our F-statistic is below the critical value we fail
to reject the null and therefore we say the
goodness of fit is not significant.
33. Slide-33
House Price
in $1000s
(y)
Square Feet
(x)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
house price = 98.25 + 0.1098 (sq.ft.)
Simple Linear Regression Equation:
The slope of this model is 0.1098
Does square footage of the house
affect its sales price?
Inference about the Slope:
t Test
(continued)
34. Slide-34
Inference about the Slope:
t Test
t test for a population slope
Is there a linear relationship between X and Y?
Null and alternative hypotheses
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
Test statistic:

t = (b1 − β1) / Sb1 ,  d.f. = n − 2

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
35. Slide-35
Inferences about the Slope:
t Test Example
H0: β1 = 0
H1: β1 ≠ 0
From Excel output:
Coefficients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039
t = (b1 − β1) / Sb1 = (0.10977 − 0) / 0.03297 = 3.32938
36. Slide-36
Inferences about the Slope:
t Test Example
H0: β1 = 0
H1: β1 ≠ 0
Test Statistic: t = 3.329
d.f. = 10 − 2 = 8

From Excel output:
Coefficients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039

Decision: Reject H0
Conclusion: There is sufficient evidence
that square footage affects
house price

[Figure: two-tail t rejection regions with α/2 = .025 in each tail;
critical values −2.3060 and 2.3060; the test statistic t = 3.329
falls in the upper rejection region]
(continued)
37. Slide-37
Inferences about the Slope:
t Test Example
H0: β1 = 0
H1: β1 ≠ 0
P-value = 0.01039

From Excel output:
Coefficients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039

This is a two-tail test, so
the p-value is
P(t > 3.329) + P(t < −3.329)
= 0.01039
(for 8 d.f.)

Decision: P-value < α, so reject H0
Conclusion: There is sufficient evidence
that square footage affects
house price
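The t-test decision above can be reproduced from the Excel output values (a sketch in plain Python; variable names are my own):

```python
# Two-tail t test for the slope, values from the slide's Excel output
b1, sb1 = 0.10977, 0.03297   # slope estimate and its standard error
beta1_h0 = 0                  # hypothesized slope under H0
t_stat = (b1 - beta1_h0) / sb1
t_crit = 2.3060               # two-tail critical value, alpha = .05, d.f. = 8

print(round(t_stat, 3))       # 3.329
print(abs(t_stat) > t_crit)   # True, so reject H0
```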
38. Slide-38
F Test for Significance
F Test statistic:

F = MSR / MSE

where:
MSR = SSR / k
MSE = SSE / (n − k − 1)
where F follows an F distribution with k numerator and (n – k - 1)
denominator degrees of freedom
(k = the number of independent variables in the regression model)
39. Slide-39
Excel Output
Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10
ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
F = MSR / MSE = 18934.9348 / 1708.1957 = 11.0848
With 1 and 8 degrees
of freedom
P-value for
the F Test
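The F statistic in the ANOVA table can be rebuilt from the sums of squares; and with a single independent variable, F is simply the square of the slope's t statistic (a sketch, values from the Excel output above):

```python
# F test from the ANOVA table: F = MSR / MSE
ssr, sse = 18934.9348, 13665.5652
n, k = 10, 1
msr = ssr / k            # mean square regression
mse = sse / (n - k - 1)  # mean square error
f = msr / mse
print(round(f, 4))       # 11.0848

# With one regressor, F equals the square of the slope's t statistic
print(round(3.32938 ** 2, 2))  # 11.08
```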
40. Slide-40
F Test for Significance
(continued)

H0: β1 = 0
H1: β1 ≠ 0
α = .05
df1 = 1, df2 = 8

Critical Value: F.05 = 5.32
Test Statistic: F = MSR / MSE = 11.08

Decision: Reject H0 at α = 0.05
Conclusion: There is sufficient evidence that
house size affects selling price

[Figure: F distribution with the α = .05 rejection region beyond
F.05 = 5.32; the test statistic F = 11.08 falls in the rejection region]