2. INTRODUCTION
The objective of this project is to study the
effect of Horsepower, Displacement, Cylinders,
Acceleration and Weight on Miles Per Gallon (MPG).
The dataset was obtained from the UCI Machine
Learning Repository and regression analysis was
conducted on it.
We chose this dataset because of its practical
applications: miles per gallon (mpg) is a useful
figure to consider when purchasing a car.
3. METHODOLOGY
The model used for the regression analysis has more than
one independent variable, so Multiple Regression Analysis
is conducted. The variable being predicted is called the
dependent variable. The variables used to predict it are
called the independent variables.
4. Data Sourcing
The data is taken from the University of
California, Irvine (UCI) Machine Learning Repository.
It has been used extensively by students, educators
and researchers all over the world and is a popular
source of datasets for regression analysis.
Link to the Dataset -
http://archive.ics.uci.edu/ml/datasets/Auto+MPG
5. VARIABLES
DEPENDENT VARIABLE:
Miles per Gallon(mpg) – Continuous
INDEPENDENT VARIABLES:
Cylinders - Multi-valued discrete - Number of
cylinders in the car (3, 4, 5, 6, 8)
Displacement - Continuous - Engine displacement,
i.e. the volume swept by the pistons
Horsepower - Continuous - Power of an Engine in a
car
Weight - Continuous - Weight of the car in lbs
Acceleration - Continuous - Acceleration of a car
6. MODEL 1 - Multiple Regression Analysis
Miles Per Gallon (MPG) is regressed on the four independent
variables (Displacement, Horsepower, Acceleration and Weight);
this is the first model of our regression analysis. The R-squared
shows that the model explains 70.70% of the variation in the
dependent variable (MPG).
7. MODEL 2 - DEPENDENT VARIABLE TRANSFORMATION
After applying a log transformation to the dependent
variable (MPG), the R-squared improved from 70.70%
to 78.98%. The log transformation also made the data
closer to normally distributed, as can be seen from
the histogram. The model is given below, where L_mpg
is the log of MPG:
L_mpg = β0 + β1Displacement + β2Horsepower + β3Acceleration + β4Weight
8. CORRELATION ANALYSIS
Here we found high correlation between:
1) Displacement and Horsepower
2) Weight and Horsepower
3) Weight and Displacement
10. HISTOGRAM & SCATTER PLOT FOR LOG-LIN MODEL
[Figures: scatter plot and histogram for the Log-Lin model]
As the graphs show, the Log-Lin model appears to be the better model
because the log-transformed data is closer to a normal distribution.
11. Hypothesis Testing - Paired sample t test
A hypothesis test is performed to determine whether the coefficients of
two variables are equal.
12. MODEL 3 - DUMMY VARIABLE ANALYSIS (STEP 1)
The first step in building the dummy variables is to
identify the number of categories in the variable.
As seen from the table, the Cylinders variable has 5
categories, with Eight having the highest frequency.
13. STEP 2 - DUMMY VARIABLE ANALYSIS
Multiple regression is performed using the
dummy-encoded Cylinders variable with Cylinder 5
as the base category. Cylinder category 5 is Three
(three-cylinder cars), which has a frequency of 3.
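The dummy encoding described above can be sketched in Python. This is only an illustration: the slides do not show the software steps, so the function name, the `cyl_` column naming and the sample values are assumptions, not the authors' procedure. One category is left out as the base so that the remaining dummies are measured relative to it.

```python
def encode_dummies(values, base):
    """Return one 0/1 column per category, skipping the base category."""
    columns = {}
    for category in sorted(set(values)):
        if category == base:
            continue  # the base category is absorbed by the intercept
        columns[f"cyl_{category}"] = [1 if v == category else 0 for v in values]
    return columns

# Hypothetical cylinder counts for seven cars (not rows from the dataset).
cylinders = [8, 4, 6, 4, 3, 5, 8]
dummies = encode_dummies(cylinders, base=3)
for name, column in dummies.items():
    print(name, column)
```

With five categories and one base, four dummy columns enter the regression, and a car in the base category has zeros in all of them.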
14. MODEL 4 - INTERACTION TERMS & REGRESSION ANALYSIS
Regression is performed on the interaction term (Displacement × Horsepower) together
with the other independent variables. Displacement and Horsepower were chosen
because of their high correlation.
15. MODEL 5 - Regression Analysis on Dummy Variables & Interaction Terms
Regression analysis is performed on the dummy variables and the interaction term to check whether the
R-squared value increases. The equation for the model is given below:
L_mpg = β0 + β1Displacement + β2Horsepower + β3Acceleration + β4Weight + β5CYLINDER_COUNT4 + β6CYLINDER_COUNT2 + β7CYLINDER_COUNT3 + β8CYLINDER_COUNT5 + β9disp_horse
16. OBSERVATIONS FROM MODEL 5
Here CYLINDER_COUNT1 is kept as the base
category, so the coefficients of the other
cylinder dummies are measured relative to it.
We can see that CYLINDER_COUNT4 has about 3.3%
lower mpg than CYLINDER_COUNT1.
CYLINDER_COUNT2 is therefore predicted to have
11.3 – (-3.3) = 14.6% higher mpg than
CYLINDER_COUNT4.
To check whether this difference is
significant, we fit another model with
CYLINDER_COUNT4 as the base category.
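Because the dependent variable is log(mpg), reading a coefficient difference directly as a percentage is an approximation. The Python sketch below compares that reading with the exponentiated version. It assumes the quoted figures correspond to coefficients of 0.113 and -0.033 on the log scale; the function and variable names are illustrative and not part of the original analysis.

```python
import math

def pct_change(log_diff):
    """Exact percentage change implied by a difference in log(mpg)."""
    return 100.0 * (math.exp(log_diff) - 1.0)

# Assumed log-scale coefficients behind the 11.3% and -3.3% quoted on the slide.
b_count2 = 0.113
b_count4 = -0.033

approx = 100.0 * (b_count2 - b_count4)   # direct reading of the coefficients
exact = pct_change(b_count2 - b_count4)  # exponentiated version
print(f"approximate: {approx:.1f}%  exact: {exact:.1f}%")
```

The two readings are close for small coefficients and drift apart as the log difference grows.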
17. Test For Significant Difference
Here CYLINDER_COUNT4 is kept as the base category, and the regression shows that
CYLINDER_COUNT2 has 14.6% higher mpg than CYLINDER_COUNT4 (which agrees with
our previous model).
18. Testing Differences Between Groups (F-Test)
L_mpg = β0 + 𝛿0 CYLINDER_COUNT1 + β1 displacement + 𝛿1 c1_disp + β2 horsepower + 𝛿2 c1_horse + β3 weight + 𝛿3 c1_weight + β4 acceleration + 𝛿4 c1_acc
Null hypothesis:
𝛿0 = 𝛿1 = 𝛿2 = 𝛿3 = 𝛿4 = 0, i.e. there is no difference between the groups
Alternate hypothesis:
The null hypothesis is false, i.e. there is a difference between the groups
The F-statistic, computed from the restricted and unrestricted models, is used to determine whether the groups differ.
19. UNRESTRICTED MODEL
The unrestricted model contains the independent variables, the dummy
variable (CYLINDER_COUNT1) and the products of the dummy variable with
each independent variable.
21. F-Test to Determine Difference between Groups
F = [(R²_u - R²_r) / q] / [(1 - R²_u) / (n - k - 1)]
  = [(0.8154 - 0.7898) / 5] / [(1 - 0.8154) / 382]
  = 10.59
Since 10.59 is greater than the F-table value for (5, 382) degrees of freedom,
which is 2.2141, we reject the null hypothesis and conclude that there
are differences between the groups.
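The arithmetic of this F-test can be checked with a short Python sketch. Python is not used in the original analysis; the function and argument names are illustrative assumptions, and the inputs are the values quoted on the slide.

```python
def f_stat_restricted(r2_u, r2_r, q, df_resid):
    """F statistic for q restrictions, from unrestricted and restricted R-squared."""
    return ((r2_u - r2_r) / q) / ((1.0 - r2_u) / df_resid)

# Values quoted on the slide: unrestricted and restricted R-squared,
# q = 5 restrictions, n - k - 1 = 382 residual degrees of freedom.
f_groups = f_stat_restricted(r2_u=0.8154, r2_r=0.7898, q=5, df_resid=382)
f_critical = 2.2141  # table value quoted on the slide for (5, 382)
print(f"F = {f_groups:.3f}, reject null: {f_groups > f_critical}")
```

The computed value agrees with the slide's 10.59 up to rounding in the last digit.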
22. Test for Heteroskedasticity - Breusch Pagan Test
Multiple Regression is done using Log-Lin Model to
check for heteroskedasticity.
23. As seen from the table, the residuals (error term) are predicted, and
their squares are regressed on the regressors.
24. Hypothesis Testing for Heteroskedasticity (Continued)
Null Hypothesis - βdisplacement = βhorsepower = βweight = βacceleration = 0 in the regression of the squared residuals (no heteroskedasticity)
Alternate Hypothesis - There is heteroskedasticity
F = (R²_u / k) / [(1 - R²_u) / (n - k - 1)]
  = (0.05 / 4) / [(1 - 0.05) / 387]
  = 5.092
Since 5.092 is greater than the F-table value for (4, 387) degrees of freedom, which is 2.3719, the null
hypothesis is rejected. So our model exhibits heteroskedasticity.
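The Breusch-Pagan F statistic can be reproduced with the Python sketch below. The function name is illustrative and not from the original analysis, and n = 392 is an assumption inferred from n - k - 1 = 387 with k = 4.

```python
def f_stat_overall(r2, k, n):
    """Overall F statistic of a regression with k regressors and n observations."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# R-squared of the squared-residual regression as quoted on the slide.
# n = 392 is inferred from n - k - 1 = 387 with k = 4 (an assumption).
f_bp = f_stat_overall(r2=0.05, k=4, n=392)
f_critical_bp = 2.3719  # table value quoted on the slide for (4, 387)
print(f"F = {f_bp:.3f}, reject null: {f_bp > f_critical_bp}")
```

Even a small R-squared in the auxiliary regression can give a significant F statistic when the sample is this large.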
25. White Test for Heteroskedasticity
Multiple Regression is done using the Log-Lin Model.
26. Regression on Cross Products of Regressors and their Squares
* Squares of the regressors (assumes the variable names are stored in lowercase)
gen disp2 = displacement^2
gen horsepower2 = horsepower^2
gen acceleration2 = acceleration^2
gen weight2 = weight^2
* Cross products of the regressors
gen disp_acceleration = displacement * acceleration
gen horse_acc = horsepower * acceleration
gen weight_acc = weight * acceleration
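As a rough guide to how many terms a White-type auxiliary regression can involve, the Python sketch below lists every square and every pairwise cross product of the four regressors. It is an illustration only: the function name and the string labels are assumptions, and it enumerates the general case rather than the exact set of terms used in this analysis.

```python
from itertools import combinations

def white_terms(regressors):
    """List the squared terms and all pairwise cross products of the regressors."""
    squares = [f"{name}^2" for name in regressors]
    crosses = [f"{a}*{b}" for a, b in combinations(regressors, 2)]
    return squares, crosses

regressors = ["displacement", "horsepower", "weight", "acceleration"]
squares, crosses = white_terms(regressors)
print("squares:", squares)
print("cross products:", crosses)
```

For four regressors this gives four squares and six cross products, in addition to the four original variables.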
28. Hypothesis Testing
Null Hypothesis - βdisplacement = βhorsepower = βweight = βacceleration = 0
Alternate Hypothesis - There is heteroskedasticity
The F statistic (90.44482) is greater than the F-table value (8.08), therefore we
reject the null hypothesis and confirm that there is heteroskedasticity.
30. HETEROSKEDASTICITY ROBUST STANDARD ERRORS (HRSE)
Due to the presence of heteroskedasticity, the usual
OLS variance and standard error estimates are not
valid. Therefore we need to compute
heteroskedasticity-robust standard errors.
When a model exhibits heteroskedasticity, it is better
to rely on the robust standard errors than on the OLS
standard errors.
33. Summary
Model No   R-Squared   Adjusted R-Squared
Model 1    0.7070      0.7040
Model 2    0.7898      0.7876
Model 3    0.8112      0.8073
Model 4    0.8134      0.8110
Model 5    0.8286      0.8184
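The link between R-squared and adjusted R-squared in the summary can be illustrated with the standard formula, sketched here in Python for Models 1 and 2. The function name is illustrative, and n = 392 and k = 4 are assumptions inferred from the degrees of freedom (n - k - 1 = 387) quoted in the Breusch-Pagan test.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k regressors fitted on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

n = 392  # assumption: implied by n - k - 1 = 387 with k = 4
for label, r2 in [("Model 1", 0.7070), ("Model 2", 0.7898)]:
    print(label, round(adjusted_r2(r2, n, k=4), 4))
```

The adjustment penalises extra regressors, which is why adjusted R-squared is the fairer basis for comparing models with different numbers of variables.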