3. Regression
What is Regression
A statistical technique used to relate two or more variables.
Objective
Use the independent variable(s) to predict the value of the dependent variable.
Example
For a given advertisement expenditure, how much in sales will be generated?
With a given diet plan, how much weight will an individual be able to lose?
4. Regression Understanding
A Layman Question
Suppose we want to find out how much the age of a car helps determine its price.
A Layman Answer
The older the car, the ______ will be the price.
Regression in Simple Words
As the age of the car increases by one year, the price of the car is estimated to decrease by a certain amount.
Regression in Statistical Terms
Y(Estimated) = b0 + b1 X
5. Regression Understanding
Data Set: Age & Price of the Cars

Age   | 1  | 2  | 1  | 2  | 3  | 4  | 3  | 4  | 3
Price | 90 | 85 | 93 | 84 | 80 | 74 | 81 | 76 | 79

What Relation Do You See?
A negative relationship.
A Convenient Way to Look (What is this tool called?)
[Scatter plot: Price (70-90) on the vertical axis against Age (1-4) on the horizontal axis]
6. How to Show It Statistically
[Same scatter plot: Price (70-90) against Age (1-4)]
Y(E) = b0 + b1 X
Y(E) = 97 - 5 X
Y = 97 - 5 X + E

Term | What it is!
Y(E) | Dependent variable whose behavior is to be determined (the estimated value)
X    | Independent variable whose effect is to be determined
b0   | Intercept: value of Y(E) when X = 0
b1   | Estimated change in Y in response to a unit change in X
E    | Difference between the actual and estimated values (the residual)
7. Assessing the Goodness of Fit: Graphical Way
Goodness of Fit Means
How well the model fits the actual data: smaller residuals mean a good fit, larger residuals a bad fit.
[Three illustrative plots: Bad Fit, Good Fit, Perfect Fit]
12. Assessing the Goodness of Fit: Statistical Way (R2)
SST = Σ (Actual - Mean)^2, the total sum of squares
SSR = Σ (Estimated - Mean)^2, the regression (explained) sum of squares
SSE = Σ (Actual - Estimated)^2, the error (residual) sum of squares
A good model is one in which SSE is the lowest; a perfect fit has SSE = 0.
SST = SSR + SSE, so R2 = SSR/SST = 1 - SSE/SST
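A minimal pure-Python sketch (not from the slides) that fits the Age/Price data from slide 5 by least squares and computes these sums of squares; the slide's line Y(E) = 97 - 5X is a rounded version of the fit below:

```python
# Least-squares fit and R^2 for the Age/Price data (slide 5).
# The slide rounds the fitted line to Y(E) = 97 - 5X.
ages = [1, 2, 1, 2, 3, 4, 3, 4, 3]
prices = [90, 85, 93, 84, 80, 74, 81, 76, 79]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n

# Slope and intercept from the normal equations
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices))
sxx = sum((x - x_bar) ** 2 for x in ages)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

estimated = [b0 + b1 * x for x in ages]
sse = sum((y - e) ** 2 for y, e in zip(prices, estimated))  # unexplained
sst = sum((y - y_bar) ** 2 for y in prices)                 # total
ssr = sst - sse                                             # explained
r2 = 1 - sse / sst

print(f"Y(E) = {b0:.2f} + ({b1:.2f})X, R2 = {r2:.3f}")
```

With these nine observations the fit comes out to roughly Y(E) = 96.25 - 5.40X with R2 about 0.96, which the slide rounds to 97 - 5X.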
13. Inferring About the Population
Assumptions

Assumption                  | Statement              | What it means / How to check it
Expected value of residual  | E(ei) = 0              | No apparent pattern in the residual plot
Variance of residual        | σe1 = σe2 = ... = σei  | Residual plot has a consistent spread
Distribution of residual    | Normal                 | Histogram is symmetric or normal (histogram & probability plot of residuals)
Dependency of residuals     | Independent            | No pattern in the residuals over time
Relationship b/w IndV & DV  | Linear                 | Linear scatter plot
14. The Three Conditions Shown Together
As the distribution is symmetric, the mean of the error term's distribution will be zero.
The distribution of the error term is shown to be normally distributed.
The variance of the error term for different values of x appears to be the same.
15. Analysis of Residuals
If the assumptions of regression hold, the following two conditions are met:
Cond 1: A plot of the residuals (e) against the predictor (x) should fall roughly in a horizontal band and be symmetric about the x-axis.
Cond 2: A normal probability plot of the residuals should be roughly linear.
16. Residual Analysis
Examining the residuals (or standardized residuals) helps detect violations of the required conditions.
Example - continued:
Non-normality:
Use Excel to obtain the standardized residual histogram.
Examine the histogram and look for a bell-shaped diagram with a mean close to zero.
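As a rough numeric stand-in for the histogram inspection (a sketch, not the slide's Excel workflow; the residual values below are illustrative, not from the slides), a near-zero mean and small skewness are consistent with a symmetric, bell-shaped distribution:

```python
# Crude numeric check mirroring the histogram inspection: residuals from a
# well-behaved model should have a mean near 0 and low skewness.
def skewness(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum(((v - mean) / sd) ** 3 for v in values) / n

residuals = [-1.2, 0.4, 0.9, -0.3, 0.1, -0.6, 0.7, 0.0]  # illustrative values
mean_resid = sum(residuals) / len(residuals)
print(f"mean = {mean_resid:.3f}, skew = {skewness(residuals):.3f}")
```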
17. Residual Analysis
For each residual we calculate the standard deviation as follows:
s(ri) = s(e) * sqrt(1 - hi), where hi = 1/n + (xi - x_bar)^2 / ((n - 1) * sx^2)

A partial list of standard residuals:

Observation | Predicted Price | Residuals | Standard Residuals
1           | 14736.91        | -100.91   | -0.33
2           | 14277.65        | -155.65   | -0.52
3           | 14210.66        | -194.66   | -0.65
4           | 15143.59        | 446.41    | 1.48
5           | 15091.05        | 476.95    | 1.58

Standardized residual i = Residual i / Standard deviation
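The slide's table comes from a larger car-price data set that is not reproduced here, so as a hedged sketch the same leverage formula is applied to the small Age/Price data from slide 5:

```python
# Standardized residuals via the leverage formula on this slide:
#   h_i = 1/n + (x_i - x_bar)^2 / ((n - 1) * s_x^2)
#   standardized residual_i = e_i / (s_e * sqrt(1 - h_i))
ages = [1, 2, 1, 2, 3, 4, 3, 4, 3]
prices = [90, 85, 93, 84, 80, 74, 81, 76, 79]
n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n

sxx = sum((x - x_bar) ** 2 for x in ages)  # equals (n - 1) * s_x^2
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(ages, prices)]
s_e = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5  # residual std error

leverages = [1 / n + (x - x_bar) ** 2 / sxx for x in ages]
std_resid = [e / (s_e * (1 - h) ** 0.5) for e, h in zip(residuals, leverages)]
```

A quick sanity check on this construction: the residuals of a least-squares fit sum to zero, and the leverages of a simple regression sum to 2 (one per estimated coefficient).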
19. Heteroscedasticity
When the requirement of a constant variance is violated, we have a condition of heteroscedasticity.
Diagnose heteroscedasticity by plotting the residuals against the predicted y.
[Residual vs. y-hat plot: the spread of the residuals increases with y-hat]
20. Homoscedasticity
When the requirement of a constant variance is not violated, we have a condition of homoscedasticity.
Example - continued
[Residual plot: residuals (-1000 to 1000) against predicted price (13500 to 16000) show a consistent spread]
21. Non-Independence of Error Variables
The data constitute a time series if they were collected over time.
If the errors are independent, no pattern should be observed when the residuals are examined over time.
When a pattern is detected, the errors are said to be autocorrelated.
Autocorrelation can be detected by graphing the residuals against time.
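The slides detect autocorrelation graphically; as a numerical companion (not from the slides), the Durbin-Watson statistic summarizes the same time patterns:

```python
# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# DW near 2 suggests independent errors; near 0 suggests positive
# autocorrelation (runs of same-sign residuals); near 4 suggests negative
# autocorrelation (oscillation around zero).
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    return num / sum(e ** 2 for e in residuals)

runs = [1, 1, 1, 1, -1, -1, -1, -1]          # runs of same sign
oscillating = [1, -1, 1, -1, 1, -1, 1, -1]   # alternating signs
print(durbin_watson(runs), durbin_watson(oscillating))  # 0.5 3.5
```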
22. Non-Independence of Error Variables
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Two residual-vs-time plots. Left: note the runs of positive residuals, replaced by runs of negative residuals. Right: note the oscillating behavior of the residuals around zero.]
23. Outliers
An outlier is an observation that is unusually small or large.
Several possibilities need to be investigated when an outlier
is observed:
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.
Identify outliers from the scatter diagram.
It is customary to suspect an observation is an outlier if its
|standard residual| > 2
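The |standard residual| > 2 rule is easy to automate; a small sketch (the helper name is hypothetical) applied to the standard residuals from the partial list on slide 17:

```python
# Flag suspected outliers by the |standardized residual| > 2 rule of thumb.
def suspect_outliers(standardized_residuals, threshold=2.0):
    return [i for i, sr in enumerate(standardized_residuals)
            if abs(sr) > threshold]

# Standard residuals from the partial list on slide 17: none exceed 2.
srs = [-0.33, -0.52, -0.65, 1.48, 1.58]
print(suspect_outliers(srs))          # []
print(suspect_outliers(srs + [2.4]))  # [5]
```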
25. Sequence of Entering Variables
Which Variables to Enter First
The one which is theoretically more important.
If Variables Are Uncorrelated
The sequence of entering the variables does not have any effect.
But
Real life has more correlated variables than uncorrelated ones.
Some Methods
Hierarchical: first the known, then the unknown.
Forced/Enter: all together; the only method for testing theory.
Stepwise: the order is selected mathematically by the software.
26. Stepwise Methods
Forward: start with the constant and then add the variable that explains the most variation. Prone to the suppression effect.
Backward: start with all the variables and then remove the one with the least significance. No suppression problem.
The suppression effect means that a variable has a significant effect only when other variables are held constant. Forward selection is more prone to exclude a variable because of the suppression effect.
Cross-Validation
When stepwise methods are used, it is advised to divide the sample into two groups: one is used to develop the model and the other to test it.
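A minimal holdout sketch of that advice (synthetic data; the helper names are illustrative, not from the slides): fit the line on one group, then score it on the other.

```python
# Develop the model on one half of the sample, test it on the other half.
def fit_line(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1

def r_squared(xs, ys, b0, b1):
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

x = list(range(20))
y = [3 + 2 * xi + (-1) ** xi * 0.5 for xi in x]  # near-linear synthetic data

b0, b1 = fit_line(x[:10], y[:10])               # develop on the first group
holdout_r2 = r_squared(x[10:], y[10:], b0, b1)  # test on the second group
```

A model that fits well only on the development group (overfitting, which stepwise selection invites) would show a much lower R2 on the holdout group.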
28. Diagnostics: Outliers
How to identify outliers: residuals.
Unstandardized residuals.
Standardized residuals (SR). There is an outlier problem if:
- any case has SR > 3.29
- more than 1% of sample cases have SR > 2.58
- more than 5% of sample cases have SR > 1.96
Studentized residuals: the unstandardized residual divided by a changing standard deviation.
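The three rule-of-thumb checks can be sketched as one function (the helper name and the residual lists are illustrative, not from the slides):

```python
# The three standardized-residual checks from this slide.
def residual_warnings(srs):
    n = len(srs)
    warnings = []
    if any(abs(sr) > 3.29 for sr in srs):
        warnings.append("a case has |SR| > 3.29")
    if sum(abs(sr) > 2.58 for sr in srs) / n > 0.01:
        warnings.append("more than 1% of cases have |SR| > 2.58")
    if sum(abs(sr) > 1.96 for sr in srs) / n > 0.05:
        warnings.append("more than 5% of cases have |SR| > 1.96")
    return warnings

clean = [0.1] * 100                                     # no large residuals
noisy = [0.1] * 89 + [2.0] * 8 + [2.7] * 2 + [3.5]      # trips all three rules
print(residual_warnings(clean))  # []
```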
30. Diagnostics: Influential Cases
Influential case: a case with undue influence on the coefficients.
Measuring the effect of a case:
Adjusted Predicted Value (APV): the predicted value when that particular case is excluded while developing the model.
DFFIT: APV - original predicted value.
Deleted residual: APV - original observed value.
Studentized deleted residual: deleted residual / its standard deviation.
31. Cook’s Distance, Leverage & Mahalanobis Distance
Cook’s Distance (CD)
Effect on model: overall influence of a single case on the model.
Cause for concern: CD > 1.
Leverage (L)
Effect on model: influence of the observed value of the outcome variable over the predicted values. Ranges from 0 to 1; average leverage = (K+1)/n, where K = number of predictors and n = sample size.
Cause for concern: L > 2(K+1)/n or L > 3(K+1)/n.
Mahalanobis Distance
Effect on model: distance of cases from the mean(s) of the predictor variable(s).
Cause for concern: with N = 500 and 5 predictors, values above 25; with N = 100 and 3 predictors, values above 15; otherwise use the Barnett & Lewis table.
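The leverage screening rule can be sketched directly (the helper name and the leverage values are illustrative, not from the slides):

```python
# Flag cases whose leverage exceeds a multiple of the average (K+1)/n.
def leverage_flags(leverages, k, n, multiplier=2):
    avg = (k + 1) / n
    return [h for h in leverages if h > multiplier * avg]

# e.g. K = 2 predictors, n = 30 cases -> average leverage = 0.1
flagged = leverage_flags([0.05, 0.08, 0.12, 0.25, 0.31], k=2, n=30)
print(flagged)  # [0.25, 0.31]
```

Passing multiplier=3 applies the stricter 3(K+1)/n cutoff instead.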
32. Assumptions
Variable type: quantitative or categorical.
Variance positive: variance > 0.
No perfect multicollinearity: predictor variables should not correlate highly.
Homoscedasticity: the variance of the residual terms should be constant.
Independent errors.
Predictors are uncorrelated with external variables.
33. Multicollinearity
Perfect Collinearity
Perfect collinearity exists when at least one predictor is a perfect linear combination of the others (the simplest example being two predictors that are perfectly correlated: they have a correlation coefficient of 1).
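That simplest case can be demonstrated with a pairwise correlation check (a sketch with a hypothetical helper and made-up predictors, not from the slides):

```python
# Two predictors where one is an exact linear combination of the other
# have a Pearson correlation coefficient of exactly 1 (or -1).
def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = sum((x - x_bar) ** 2 for x in xs) ** 0.5
    sy = sum((y - y_bar) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x1 = [1, 2, 3, 4, 5]
x2 = [2 * v + 3 for v in x1]  # perfect linear combination of x1
print(pearson_r(x1, x2))
```

Including both x1 and x2 as predictors would make the regression coefficients impossible to estimate uniquely, which is why perfect multicollinearity is ruled out in the assumptions on slide 32.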