In this presentation, we will discuss the mathematical basis of linear regression and analyze the concepts of p-value, hypothesis testing, and confidence intervals, and their interpretation.
1. U N I V E R S I T Y O F S O U T H F L O R I D A //
Linear Regression Concepts
Dr. S. Shivendu
2. U N I V E R S I T Y O F S O U T H F L O R I D A // 2
Objectives
Linear Regression Concepts
Identify the mathematical basis of linear
regression.
01
Differentiate statistical inferences about
relationships based on regression output.
02
Analyze the concepts of p-value, hypothesis
testing, and confidence intervals, and their
interpretation.
03
3. U N I V E R S I T Y O F S O U T H F L O R I D A // 3
Agenda
Linear Regression Concepts
Regression Analysis
Introduction
Linear Regression
Concepts
Assumptions
Concepts
Coefficient Confidence Intervals
Concepts
Prediction Confidence Intervals
Concepts
4. U N I V E R S I T Y O F S O U T H F L O R I D A // 4
Models
A mathematical model is a mathematical expression of some phenomenon
Describe relationships between variables
Deterministic
Models
Probabilistic
Models
5. U N I V E R S I T Y O F S O U T H F L O R I D A // 5
Deterministic Models
Hypothesize exact relationships.
Suitable when the relationship is certain and known.
Example: Force is exactly mass times acceleration
F = m·a
6. U N I V E R S I T Y O F S O U T H F L O R I D A // 6
The relationship is not certain and all factors that impact
the outcome are not known
Hypothesize two components
Probabilistic Models
Deterministic and random error
Example: Sales volume (y) is 10 times advertising
spending (x) + random error
y = 10x +
The random error may be due to factors
other than advertising
7. U N I V E R S I T Y O F S O U T H F L O R I D A // 7
Regression Models
Answers: “What is the relationship between the variables?”
Equations used:
One numerical dependent (response) variable
Used mainly for estimating the strength of the relationship and
for prediction
One or more numerical or categorical independent
(explanatory) variables
8. U N I V E R S I T Y O F S O U T H F L O R I D A // 8
Regression Modeling Steps
Hypothesize the
deterministic
relationship
between the
response variable
(dependent
variable) and one
or more
explanatory
(independent
variables) in the
Population
Specify
probability
distribution of
random error
term. Estimate
the standard
deviation of the
error
Estimate
unknown model
parameters
Interpret the
estimated
parameters?
What is a
parameter?
9. U N I V E R S I T Y O F S O U T H F L O R I D A // 9
Model Specification is Based on Theory
Theory of field
(e.g., Sociology)
Mathematical
theory
Previous research
“Common sense”
10. U N I V E R S I T Y O F S O U T H F L O R I D A // 10
Types of Regression Models
Simple
1 Explanatory
Variable
Regression
Models
2+ Explanatory
Variables
Multiple
Linear Linear Non- Linear
Non- Linear
11. U N I V E R S I T Y O F S O U T H F L O R I D A // 11
Linear Regression Models
Relationship between variables is a linear function
y
Dependent (Response)
Variable
x
= + +
Population y - intercept Participation Slope Random Error
Independent (Explanatory)
Variable
0 1
12. U N I V E R S I T Y O F S O U T H F L O R I D A // 12
Population Linear Regression Model
y
x
0 1
i i i
y x
0 1
E y x
Observed value
Observed value
i = Random error
13. U N I V E R S I T Y O F S O U T H F L O R I D A // 13
Sample Linear Regression Model
y
x
0 1
ˆ ˆ ˆ
i i i
y x
0 1
ˆ ˆ
ˆi i
y x
Unsampled observation
i = Random error
Observed value
^
14. U N I V E R S I T Y O F S O U T H F L O R I D A // 14
Estimating Parameters: Least Squares Method
Hypothesize deterministic component
Estimate unknown model parameters
Regression Modeling Steps
Specify probability distribution of random error term
Evaluate model
Use model for prediction and estimation
15. U N I V E R S I T Y O F S O U T H F L O R I D A // 15
Scattergram
0
20
40
60
0 20 40 60
x
y
Plot of all (xi, yi) pairs
Suggests how well the model will fit
16. U N I V E R S I T Y O F S O U T H F L O R I D A // 16
Thinking Challenge
How would you draw a line
through the points?
0
20
40
60
0 20 40 60
x
y
How would you determine
which line fits best?
17. U N I V E R S I T Y O F S O U T H F L O R I D A // 17
Least Squares
“Best fit’ means the
difference between
actual y values and
estimated or predicted y
values are a minimum
2 2
1 1
ˆ ˆ
n n
i
i i
i i
y y
Positive differences off-set
negative
Least Squares minimizes
the Sum of the Squared
Differences (SSE)
18. U N I V E R S I T Y O F S O U T H F L O R I D A // 18
Least Squares Graphically
e2
y
x
e1 e3
e4
^
^
^
^
2 0 1 2 2
ˆ ˆ ˆ
y x
0 1
ˆ ˆ
ˆi i
y x
2 2 2 2 2
1 2 3 4
1
ˆ ˆ ˆ ˆ ˆ
LS minimizes
n
i
i
19. U N I V E R S I T Y O F S O U T H F L O R I D A // 19
Coefficient Equations
Prediction Equation
0 1
ˆ ˆ
ŷ x
1 1
1
1 2
1
2
1
ˆ
n n
i i
n
i i
i i
xy i
n
xx
i
n
i
i
i
x y
x y
SS n
SS
x
x
n
Slope
0 1
ˆ ˆ
y x
y-intercept
20. U N I V E R S I T Y O F S O U T H F L O R I D A // 20
Estimated y changes by 1 for each 1unit increase in x
Interpretation of Coefficients
If 1 = 2, then Sales (y) is expected to increase by 2 for each
1 unit increase in Advertising (x)
The average value of y when x = 0
If 0 = 4, then Average Sales (y) is expected to be 4 when
Advertising (x) is 0
Slope (1)
Y-Intercept (0)
^
^
^
^
21. U N I V E R S I T Y O F S O U T H F L O R I D A // 21
Parameter Estimation Computer Output
Parameter Estimates
Parameter Standard T for H0:
Variable DF Estimate Error Param=0 Prob>|T|
INTERCEP 1 -0.1000 0.6350 -0.157 0.8849
ADVERT 1 0.7000 0.1914 3.656 0.0354
0
^
1
^
ˆ .1 .7
y x
22. U N I V E R S I T Y O F S O U T H F L O R I D A // 22
Sales Volume (y) is expected to increase by .7 units for
each $1 increase in Advertising (x)
Coefficient Interpretation Solution
Average value of Sales Volume (y) is -.10 units when
Advertising (x) is 0
Difficult to explain to marketing manager
Expect some sales without advertising
Slope (1)
Y-Intercept (0)
^
^
^
^
23. U N I V E R S I T Y O F S O U T H F L O R I D A // 23
Probability Distribution of Random Error
Hypothesize deterministic component
Estimate unknown model parameters
Regression Modeling Steps
Specify probability distribution of random error term
Evaluate model
Use model for prediction and estimation
24. U N I V E R S I T Y O F S O U T H F L O R I D A // 24
Linear Regression Assumptions
The mean probability
distribution of error, ε, is
0
The probability
distribution of error, ε, is
approximately normally
distributed
The probability
distribution of error has
a constant variance
Errors are independent
25. U N I V E R S I T Y O F S O U T H F L O R I D A // 25
Error Probability Distribution
x1 x2 x3
y
E(y) = β0 + β1x
x
26. Variation of actual y from
predicted y, y
Random Error Variation
Measured by standard error of
regression model. Sample
standard deviation of : s
Affects several factors like
parameter significance and
prediction accuracy
27. U N I V E R S I T Y O F S O U T H F L O R I D A // 27
Variation Measures
y
x
xi
0 1
ˆ ˆ
ˆi i
y x
yi
2
ˆ
( )
i i
y y
Unexplained sum of
squares or SSE
2
( )
i
y y
Total sum of squares
2
ˆ
( )
i
y y
Explained sum of
squares
y
28. U N I V E R S I T Y O F S O U T H F L O R I D A // 28
Estimation of Variance of Error σ2
2
2
ˆ
2
i i
SSE
s where SSE y y
n
2
2
SSE
s s
n
29. U N I V E R S I T Y O F S O U T H F L O R I D A // 29
Residual Analysis
e Y Y
= -
i i
ˆ
Check the assumptions of regression by examining the residuals
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X (homoscedasticity)
The residual for observation i, ei, is the difference between its
observed and predicted value
30. U N I V E R S I T Y O F S O U T H F L O R I D A // 30
Residual Analysis for Linearity
Not Linear Linear
x
residuals
x
Y
x
Y
x
residuals
31. U N I V E R S I T Y O F S O U T H F L O R I D A // 31
Residual Analysis for Independence
Not Independent Independent
X
X
residuals
residuals
X
residuals
32. U N I V E R S I T Y O F S O U T H F L O R I D A // 32
Check for Normality
Examine the Sem-and-Leaf Display of the Residuals
Examine the Boxplot of the Residuals
Examine the Histogram of the Residuals
Construct a Normal Probability Plot of the Residuals
33. U N I V E R S I T Y O F S O U T H F L O R I D A // 33
Residual Analysis for Normality
Percent
Residual
When using a normal probability plot, normal errors
will approximately display in a straight line
-3 -2 -1 0 1 2 3
0
100
34. U N I V E R S I T Y O F S O U T H F L O R I D A // 34
Residual Analysis for Equal Variance
Non-constant variance Constant variance
x x
Y
x x
Y
residuals
residuals
35. U N I V E R S I T Y O F S O U T H F L O R I D A // 35
Interpreting the Model - Testing for Significance
Hypothesize deterministic component
Estimate unknown model parameters
Regression Modeling Steps
Specify probability distribution of random error term
Interpret model
36. U N I V E R S I T Y O F S O U T H F L O R I D A // 36
Test of Slope Coefficient
Shows if there is a linear
relationship between x
and y
Hypotheses:
Involves population
slope 1
Theoretical basis is
sampling distribution of
slope
H0: 1 = 0 (No Linear Relationship)
Ha: 1 0 (Linear Relationship)
37. U N I V E R S I T Y O F S O U T H F L O R I D A // 37
Sampling Distribution of Sample Slopes
y
Population Line
x
Sample 1 Line
Sample 2 Line
1
Sampling Distribution
1
1
S
^
^
All Possible
Sample Slopes
Sample 1: 2.5
Sample 2: 1.6
Sample 3: 1.8
Sample 4: 2.1
: :
Very large number of
sample slopes
38. U N I V E R S I T Y O F S O U T H F L O R I D A // 38
Slope Coefficient Test Statistic
1
1 1
ˆ
2
1
2
1
ˆ ˆ
2
where
xx
n
i
n
i
xx i
i
t df n
s
S
SS
x
SS x
n
39. U N I V E R S I T Y O F S O U T H F L O R I D A // 39
Test of Slope Coefficient Computer Output
Parameter Estimates
Parameter Standard T for H0:
Variable DF Estimate Error Param=0 Prob>|T|
INTERCEP 1 -0.1000 0.6350 -0.157 0.8849
ADVERT 1 0.7000 0.1914 3.656 0.0354
t = 1 / S
P-Value
S
1 1 1
^
^
^
^
40. U N I V E R S I T Y O F S O U T H F L O R I D A // 40
Prediction with Regression Models
Types of predictions
What is predicted?
Point estimates
Interval estimates
Population mean response E (y) for given x
Point on population regression line
Individual response (y) for given x
41. U N I V E R S I T Y O F S O U T H F L O R I D A // 41
Confidence Interval Estimate for Mean Value of y at x = x
xx
p
SS
x
x
n
S
t
y
2
2
/
1
ˆ
df = n – 2
p
42. U N I V E R S I T Y O F S O U T H F L O R I D A // 42
Factors Affecting Interval Width
Level of confidence (1 – )
Width increases as confidence increases
Data dispersion (s)
Width increases as variation increases
Sample size
Width decreases as sample size increases
Distance of x from mean x
Width increases as distance increases
p
-
43. U N I V E R S I T Y O F S O U T H F L O R I D A // 43
Prediction Interval of Individual Value of y at x = x
df = n – 2
p
2
/2
1
ˆ 1
p
xx
x x
y t S
n SS
44. U N I V E R S I T Y O F S O U T H F L O R I D A // 44
Key Takeaway
The statistical
interpretation is the
value proposition of
the linear
regression model
The statistical
interpretation
depends on
assumptions of the
linear model being
met
Understanding
outliers is critical for
drawing meaningful
inferences from the
linear regression
model
45. U N I V E R S I T Y O F S O U T H F L O R I D A //
You have reached the end
of the presentation.