3. Regression
What is Regression?
A statistical technique used to relate two or more variables: the independent variable(s) are used to predict the value of the dependent variable.
Objective, by Example
For a given advertisement expenditure, how much in sales will be generated?
With a given diet plan, how much weight will an individual be able to lose?
With a unit increase in greenhouse gases, how much will the temperature rise?
4. Regression Understanding
A Layman Question
Suppose we want to find out how much the age of a car helps us determine its price.
A Layman Answer
The older the car, the ______ will be the price.
Regression in Simple Words
As the age of the car increases by one year, the price of the car is estimated to decrease by a certain amount.
Regression in Statistical Terms
Y(Estimated) = b0 + b1 X
5. Regression Understanding
Data Set: Age & Price of the Cars
Age:   1   2   1   2   3   4   3   4   3
Price: 90  85  93  84  80  74  81  76  79
What Relation Do You See?
A negative relationship.
A Convenient Way to Look (What is this tool called?)
[Scatter plot of Price (70-90) against Age (1-4), showing the downward trend]
6. How to Show It Statistically
[The same scatter plot of Price against Age, as on the previous slide]
Y(E) = b0 + b1 X
Y(E) = 97 - 5 X
Y = 97 - 5 X + E
Term: What it is!
Y(E): Dependent variable whose behavior is to be determined
X: Independent variable whose effect is to be determined
b0: Intercept; the value of Y(E) when X = 0
b1: Estimated change in Y in response to a unit change in X
E: Difference between the actual and estimated values
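The estimates b0 and b1 come from least squares. A minimal sketch in Python/NumPy (the deck itself works in SPSS), reusing the Age/Price data from the previous slide:

```python
# A minimal sketch of estimating b0 and b1 by least squares
# on the Age/Price data from the previous slide.
import numpy as np

age = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3])             # X: age of the car
price = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79])  # Y: price

b1, b0 = np.polyfit(age, price, 1)      # returns [slope, intercept] for degree 1
print(f"Y(E) = {b0:.1f} + {b1:.1f} X")  # close to the slide's rounded Y(E) = 97 - 5X
```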
7. Assessing the Goodness of Fit: Graphical Way
Goodness of Fit Means
How well the model fits the actual data: less residual means a good fit, more residual means a bad fit.
[Three panels: Bad Fit, Good Fit, Perfect Fit]
12. Assessing the Goodness of Fit: Statistical Way: R²
SST = Σ (Actual - Mean)²
SSR = Σ (Estimated - Mean)²
SSE = Σ (Actual - Estimated)²
A good model is one in which SSE is low; in a perfect fit, SSE = 0.
SST = SSR + SSE
R² = SSR/SST = 1 - SSE/SST
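In the deck's SPSS workflow these sums appear in the ANOVA table; here is a minimal NumPy sketch of the same decomposition, continuing from the Age/Price fit:

```python
# A minimal sketch of the SST = SSR + SSE decomposition and R².
import numpy as np

x = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3])
y = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation around the mean
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the line
sse = np.sum((y - y_hat) ** 2)         # residual (unexplained) variation

assert np.isclose(sst, ssr + sse)      # SST = SSR + SSE
print(f"R² = {ssr / sst:.3f}")         # equivalently 1 - sse/sst
```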
13. Residual Analysis
Why?
The purpose of modeling is to predict (interpolate), and the interpolation can only be correct when the assumptions about the behavior of the data hold true.
Assumptions: the response variable
is independent,
is normally distributed,
has constant variance, and
has a straight-line relation with the IV.
14. Residual Analysis
Assumption          In Terms of Response Variable           In Terms of Residuals
Independence        Is independent                          Random errors are independent
Normality           Is normally distributed                 Are normally distributed
Constant variance   Has constant variance                   Have constant variance
Linearity           Has a straight-line relation with IV    Have a straight-line relation with IV
15. Inferring About the Population
Assumption                   What it means                 How to Check It
Expected value of residual   E(ei) = 0                     No apparent pattern in the residual plot
Variance of residual         σe1 = σe2 = … = σei           Residual plot has a consistent spread
Distribution of residual     Normal                        Histogram is symmetric or normal (histogram & probability plot of residuals)
Dependency of residuals      Independent                   No apparent pattern in the residual plot
Relationship b/w IndV & DV   Linear                        Linear scatter plot
16. The Three Conditions Shown Together
As the distribution is symmetric, the mean of the error term will be zero.
The distribution of the error term is shown to be normally distributed.
The variance of the error term appears to be the same for different values of x.
17. Residual Analysis: Types of Residuals
Normal or raw residual (RESID):    Y - Y(Estimated)
Standardized residual (ZRESID):    {Y - Y(Estimated)} / standard error of residuals
Studentized residual (SRESID):     {Y - Y(Estimated)} / a standard error that varies case by case
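The deck obtains these through SPSS; a minimal statsmodels sketch of the same three quantities, where the mapping to the SPSS names is approximate:

```python
# A minimal sketch of the three residual types; the SPSS names
# RESID / ZRESID / SRESID map roughly to the quantities below.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

x = np.array([1, 2, 1, 2, 3, 4, 3, 4, 3])
y = np.array([90, 85, 93, 84, 80, 74, 81, 76, 79])
model = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(model)

raw = model.resid                         # RESID: Y - Y(Estimated)
zresid = raw / np.sqrt(model.mse_resid)   # ZRESID: one overall standard error
sresid = infl.resid_studentized_internal  # SRESID: error varies with leverage
```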
18. Influential Observation
Outliers: observations with a large error.
Leverage points: observations distinct from the other values on the basis of their independent-variable values.
Influential observation: a value whose inclusion can affect the coefficients of the regression line.
Any value can be an influential observation.
19. Outliers With Residuals
Unstandardized residuals: cannot tell how big a residual should be considered big.
Standardized residuals: using the properties of the normal distribution lets us make a rule for deciding large or small.
Model is unacceptable when:
Rule of 3.29: any |SR| > 3.29
Rule of 2.58: 1% or more of the |SR| > 2.58
Rule of 1.96: 5% or more of the |SR| > 1.96
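A minimal sketch of these three rules applied to `zresid`, the standardized-residual array computed earlier:

```python
# A minimal sketch of the slide's three standardized-residual rules.
import numpy as np

def check_rules(zresid):
    z = np.abs(np.asarray(zresid))
    flags = []
    if np.any(z > 3.29):           # rule of 3.29
        flags.append("at least one |SR| > 3.29")
    if np.mean(z > 2.58) >= 0.01:  # rule of 2.58
        flags.append("1% or more of |SR| > 2.58")
    if np.mean(z > 1.96) >= 0.05:  # rule of 1.96
        flags.append("5% or more of |SR| > 1.96")
    return flags or ["acceptable by these rules"]

print(check_rules(zresid))
```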
20. Identifying Influential Cases
"I Will Look at the World Without You"
How it works: the regression is run with a particular case removed, and that case's value is then predicted.
How it looks: if this adjusted predicted value is similar to the original predicted value, the case is not an influential observation.
21. Identifying Influential Cases
Adjusted predicted value:       the predicted value of a case computed without including that case in the fit
DFFit:                          original predicted - adjusted predicted
Deleted residual:               original observed - adjusted predicted
Studentized deleted residual:   deleted residual / its standard error
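A minimal sketch of these leave-one-out quantities via statsmodels' OLSInfluence, reusing `model` and `y` from the earlier residual sketch (PRESS residuals are the deleted residuals):

```python
# A minimal sketch of the leave-one-out quantities in the table above.
from statsmodels.stats.outliers_influence import OLSInfluence

infl = OLSInfluence(model)

deleted_resid = infl.resid_press       # OV - APV (PRESS residuals)
apv = y - deleted_resid                # adjusted predicted values
dffit = model.fittedvalues - apv       # DFFit: original predicted - APV
sdr = infl.resid_studentized_external  # studentized deleted residual
```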
22. Influential Cases
Compare the coefficient with (xa, ya) included against the coefficient with (xa, ya) not included:
Large change in coefficient: influential observation.
No large change in coefficient: not an influential observation.
23. Influential Cases (Adjusted Predicted Value)
DFFit = Difference = PV - APV, where PV is the predicted value and APV the adjusted predicted value.
Large difference: influential observation.
Small difference: not an influential observation.
24. Influential Cases (Adjusted Predicted Value)
Deleted residual (DR) = original value (OV) - adjusted predicted value (APV)
Studentized deleted residual (SDR) = DR / SE
SDR can be compared across different regression models.
25. Identifying Influential Cases
Cook's Distance
What is it? A measure of the overall influence of the case on the model.
Influential if: CD > 1.
Leverage
What is it? The influence of the observed value on the predicted value; average leverage (AL) = (k+1)/n, where k is the number of predictors and n the sample size.
Influential if: leverage > 2(k+1)/n, or > 3(k+1)/n for a stricter cutoff.
Mahalanobis Distance
What is it? The distance of a case from the means of the predictor variables.
Influential if: beyond the critical value in the Barnett & Lewis table.
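A minimal sketch of the three measures on the `model` fitted earlier, with the cutoffs from this slide; the Mahalanobis line uses the standard identity linking it to leverage in a model with an intercept:

```python
# A minimal sketch of Cook's distance, leverage, and Mahalanobis distance.
import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

infl = OLSInfluence(model)
k = int(model.df_model)  # number of predictors
n = int(model.nobs)

cooks_d = infl.cooks_distance[0]
print("CD > 1:", np.where(cooks_d > 1)[0])

leverage = infl.hat_matrix_diag
avg_lev = (k + 1) / n                                         # average leverage
print("high leverage:", np.where(leverage > 2 * avg_lev)[0])  # or 3 * avg_lev

mahalanobis = (n - 1) * (leverage - 1 / n)  # compare with Barnett & Lewis values
```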
26. Identifying Influential Cases
DfBeta
What is it? The difference between a parameter estimated with and without a particular case.
Influential if: it is scale sensitive, so it does not provide a good critical value.
Standardized DfBeta
What is it? DfBeta / its standard error.
Influential if: absolute value > 1 (or > 2 by a looser rule).
Covariance Ratio (CVR)
What is it? It measures whether the case affects the variance of the regression parameters.
Rule: delete the case if CVR < 1 - 3(k+1)/n; don't delete if CVR > 1 + 3(k+1)/n, where k is the number of predictors.
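A minimal sketch of standardized DfBeta and the covariance ratio, with the CVR bounds quoted on this slide, again on the `model` fitted earlier:

```python
# A minimal sketch of standardized DfBeta and the covariance ratio.
import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

infl = OLSInfluence(model)
k = int(model.df_model)
n = int(model.nobs)

sdfbeta = infl.dfbetas  # DfBeta / standard error, per case and parameter
print("influential by DfBeta:", np.where(np.abs(sdfbeta).max(axis=1) > 1)[0])

cvr = infl.cov_ratio
lo, hi = 1 - 3 * (k + 1) / n, 1 + 3 * (k + 1) / n
print("CVR outside bounds:", np.where((cvr < lo) | (cvr > hi))[0])
```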
28. Heteroscedasticity
What is it? Changing variance at different levels of the predictor.
Measure: plot the residuals against ŷ.
[Scatter plot of residuals against ŷ: the spread increases with ŷ]
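A minimal sketch of the residual-vs-ŷ plot from this slide, plus the Breusch-Pagan test as one common numeric check (the test is my addition, not named in the deck):

```python
# A minimal sketch: eyeballing and testing the constant-variance assumption.
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan

plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="grey")
plt.xlabel("y-hat (predicted)")
plt.ylabel("residual")  # a widening funnel signals heteroscedasticity
plt.show()

lm, lm_p, f, f_p = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_p)  # small p suggests non-constant variance
```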
29. Multicollinearity
What is it? Strong correlation between the predictor variables.
Effects:
Untrustworthy bs
Restricted R²
Difficulty in picking the right variable
Inflated standard errors
Non-significant bs
bs that vary from sample to sample
The inclusion of a new variable that is strongly correlated with the first one will not increase R².
30. Multicollinearity: Measures
VIF (Variance Inflation Factor)
What is it? VIF = 1/(1 - R²), where R² comes from regressing one predictor on the others.
Interpretation: the lower the value the better; VIF < 10 is desired.
Durbin-Watson (a check on residual independence rather than collinearity)
Range of values is between 0 and 4:
0 = positive autocorrelation
4 = negative autocorrelation
2 = no autocorrelation
The desired value is 2 or near it.
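A minimal sketch of both measures; `X` here is a hypothetical design matrix with a constant column plus two or more predictors, since VIF is only meaningful with multiple predictors:

```python
# A minimal sketch of VIF and Durbin-Watson via statsmodels.
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

# X is assumed: a design matrix with a constant column plus predictors
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]
print("VIFs:", vifs)  # each should be < 10

dw = durbin_watson(model.resid)  # desired value is 2 or near it
print("Durbin-Watson:", dw)
```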
31. Measures of Multicollinearity
Measure               Desired Behavior                                 Critical Value
VIF                   The lower the better                             VIF > 10
Tolerance             The higher the better                            T < 0.1
Eigenvalue            The higher the better                            Very small eigenvalues signal collinearity
Variance proportion   Each dimension related to a separate variable    High proportions of several variables on one dimension
33. Transformation of a Variable
Reason: a nonlinear relation is translated into a linear one, and the methods of explanation for linear relations are well known.
How: transform X, Y, or both.
Justified by: theory, and by diagnostic plots.
34. Transformation of a Variable
Function      Model               Transform                  Linear Form
Reciprocal    Y = α + β/x         X' = 1/x                   Y = α + βX'
Exponential   Y = αe^(βx)         Y' = ln(Y)                 Y' = ln α + βx
Power         Y = αx^β            Y' = log(Y), X' = log(X)   Y' = log α + βX'
Log           Y = α + β log(x)    X' = log(X)                Y = α + βX'
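A minimal sketch of the log row of the table, on hypothetical data: transform X' = log(x), then run an ordinary linear fit:

```python
# A minimal sketch: fitting Y = α + β·log(x) via the X' = log(X) transform.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])  # hypothetical predictor
y = np.array([3.0, 4.1, 5.0, 6.2, 7.1])   # hypothetical response

x_prime = np.log(x)                      # X' = log(X)
beta, alpha = np.polyfit(x_prime, y, 1)  # linear fit in X'
print(f"Y = {alpha:.2f} + {beta:.2f} log(x)")
```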
35. Regression through SPSS
Coefficients: b0 & b1, tested with t.
Model fit: SST = SSR + SSE, tested with F = MSR/MSE.
Assumptions:
e is independent
e is normally distributed
e has constant variance
e has a straight-line relation with the IV
No multicollinearity
43. Assumptions
e is independent
e is normally distributed
e has constant variance
e has a straight-line relation with the IV
No multicollinearity
44. Normality
Normal probability plot of the standardized residuals
Histogram of the standardized residuals
K-S (Kolmogorov-Smirnov) and Shapiro-Wilk tests
45. Normality
Getting the residuals & standardized residuals in SPSS.
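The deck walks these checks through SPSS screenshots; a minimal SciPy/Matplotlib sketch of the same three checks, applied to the standardized residuals `zresid` computed earlier:

```python
# A minimal sketch of the normality checks named above.
import matplotlib.pyplot as plt
from scipy import stats

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(zresid, bins=10)                      # histogram: symmetric, bell-shaped?
stats.probplot(zresid, dist="norm", plot=ax2)  # normal probability plot
plt.show()

w, p = stats.shapiro(zresid)  # Shapiro-Wilk: small p suggests non-normality
print("Shapiro-Wilk p-value:", p)
```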
52. Assumptions
e is independent
e is normally distributed
e has constant variance
e has a straight-line relation with the IV
No multicollinearity
[Scatter plot of Z Residual against Z Predicted, both axes from -3 to 3]
55. Assumptions
e is independent
e is normally distributed
e has constant variance
e has a straight-line relation with the IV
No multicollinearity