Auto MPG Regression Analysis

ANIRUDH SRINATH
SHANKAR PRASAAD
MATHU BALAN

INTRODUCTION
The objective of this project is to study the effect of Horsepower,
Displacement, Cylinders, Acceleration and Weight on Miles Per Gallon (MPG).
The dataset was obtained from the UCI Machine Learning Repository, and
regression analysis was conducted on it.
We chose this dataset because of its practical applications: miles per
gallon (mpg) is a useful figure to know when purchasing a car.

METHODOLOGY
The model used for the regression analysis is a multiple regression model:
it has more than one independent variable, so Multiple Regression Analysis
is conducted. The variable being predicted is called the dependent variable.
The variables used to predict it are called the independent variables.

Data Sourcing
The data is taken from the University of California-Irvine (UCI) website.
The repository has been used extensively by students, educators and
researchers all over the world and is a widely used source of datasets
for regression analysis.
Link to the Dataset - http://archive.ics.uci.edu/ml/datasets/Auto+MPG

VARIABLES
DEPENDENT VARIABLE:
Miles per Gallon(mpg) – Continuous
INDEPENDENT VARIABLES:
Cylinders - Multi-valued discrete - Number of cylinders in the car (3, 4, 5, 6, 8)
Displacement - Continuous - Total volume swept by the pistons in the engine's cylinders
Horsepower - Continuous - Power of the car's engine
Weight - Continuous - Weight of the car in lbs
Acceleration - Continuous - Acceleration of the car

MODEL 1 - Multiple Regression Analysis
Miles Per Gallon (MPG) is regressed on four independent variables
(Displacement, Horsepower, Acceleration and Weight); this is the first model
of our regression analysis. The R-squared shows that the model explains
70.70% of the variation in the dependent variable (MPG).
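
As a rough illustration of what fitting a multiple regression and reading off its R-squared involves, here is a minimal sketch using only the Python standard library. The toy numbers, the two-regressor setup and the function names `solve_linear_system` and `fit_ols` are assumptions made for this example; they are not the project's data or the software used for the slides.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations update both sides together.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Return (coefficients, r_squared); an intercept column is added first."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(xi[a] * xi[b] for xi in x) for b in range(k)] for a in range(k)]
    xty = [sum(xi[a] * yi for xi, yi in zip(x, y)) for a in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, xi)) for xi in x]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Made-up toy rows: (weight in 1000 lbs, horsepower in 100 hp) and mpg.
toy_rows = [(3.5, 1.30), (3.7, 1.65), (2.4, 0.95),
            (2.1, 0.70), (4.3, 2.00), (2.8, 0.90)]
toy_mpg = [18.0, 15.0, 24.0, 30.0, 13.0, 22.0]

coefficients, r_squared = fit_ols(toy_rows, toy_mpg)
print("coefficients:", coefficients)
print("R-squared:", r_squared)
```

R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, which is why it can be read as the share of variation in MPG explained by the model.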

MODEL 2 - DEPENDENT VARIABLE TRANSFORMATION
After applying a log transformation to the dependent variable (MPG), we
found the R-squared to improve from 70.70% to 78.98%. The histogram also
shows that, after the log transformation, the data is closer to a normal
distribution. The formula is given below, where L_mpg is the log of MPG:
L_mpg = β0 + β1Displacement + β2Horsepower + β3Acceleration + β4Weight

CORRELATION ANALYSIS
Here we found a high correlation between
1) Displacement and Horsepower
2) Weight and Horsepower
3) Weight and Displacement
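
A pairwise correlation check of this kind can be sketched as follows. This is a minimal standard-library example; the column values are made-up toy numbers and the helper names `pearson` and `correlation_pairs` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_pairs(columns):
    """Correlation for every unordered pair of named columns."""
    names = list(columns)
    return {
        (a, b): pearson(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Made-up toy values, for illustration only.
toy_columns = {
    "displacement": [300.0, 350.0, 120.0, 98.0, 400.0, 150.0],
    "horsepower": [130.0, 165.0, 95.0, 70.0, 200.0, 90.0],
    "weight": [3500.0, 3700.0, 2400.0, 2100.0, 4300.0, 2800.0],
}

for pair, r in correlation_pairs(toy_columns).items():
    print(pair, round(r, 3))
```

Pairs with a correlation close to 1 or -1 are the ones that signal strongly related regressors.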

HISTOGRAM & SCATTER PLOT FOR LIN-LIN MODEL
[Figure: scatter plot and histogram for the lin-lin model]

HISTOGRAM & SCATTER PLOT FOR LOG-LIN MODEL
[Figure: scatter plot and histogram for the log-lin model]
As the graphs show, the Log-Lin Model appears to be the better model because
its histogram is closer to a normal distribution.

Hypothesis Testing - Paired sample t test
A hypothesis test is performed to identify whether the coefficients of two
variables are equal.

MODEL 3 - DUMMY VARIABLE ANALYSIS (STEP 1)
The first step in building the dummy variables for the model is to identify
the number of categories in the variable. As seen from the table, Cylinders
has 5 categories, with Eight having the highest frequency.

STEP 2 - DUMMY VARIABLE ANALYSIS
Multiple Regression is performed using the dummy-encoded Cylinders variable,
with Cylinder category 5 as the base category. Cylinder category 5 is Three,
which has a frequency of 3.
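
Dummy encoding with a base category can be sketched as below. This is a minimal illustration: the column names `cyl_<n>` are hypothetical and do not reproduce the CYLINDER_COUNT numbering used in the slides, and the sample values are made up.

```python
def dummy_encode(values, base):
    """Return one 0/1 column per category, leaving out the base category.

    Rows that belong to the base category get 0 in every column, so the
    base acts as the reference group in the regression.
    """
    categories = sorted(set(values))
    if base not in categories:
        raise ValueError(f"base category {base!r} not found in the data")
    return {
        f"cyl_{c}": [1 if v == c else 0 for v in values]
        for c in categories
        if c != base
    }


# Made-up sample of cylinder counts, with 3 chosen as the base category.
cylinders = [8, 4, 6, 3, 5, 4]
dummies = dummy_encode(cylinders, base=3)
for name, column in dummies.items():
    print(name, column)
```

With 5 categories this yields 4 dummy columns; including all 5 alongside an intercept would make the regressors perfectly collinear.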

MODEL 4 - INTERACTION TERMS & REGRESSION ANALYSIS
Regression is done on an interaction term (Displacement × Horsepower)
together with the other independent variables. Displacement and Horsepower
were chosen because of their high correlation.

MODEL 5 - Regression Analysis on Dummy Variables & Interaction Terms
Regression Analysis is done on the Dummy Variables and the Interaction Term to check whether the
R-Squared value increases. The equation for the model is given below:
L_mpg = β0 + β1Displacement + β2Horsepower + β3Acceleration + β4Weight + β5CYLINDER_COUNT4 + β6CYLINDER_COUNT2 + β7CYLINDER_COUNT3 + β8CYLINDER_COUNT5 + β9disp_horse
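
One consequence of including the interaction term is worth spelling out. Assuming disp_horse is the product Displacement × Horsepower (as the Model 4 description suggests), the effect of one regressor on L_mpg is no longer a single coefficient but depends on the level of the other:

```latex
\[
\frac{\partial\, \mathrm{L\_mpg}}{\partial\, \mathrm{Displacement}}
  = \beta_1 + \beta_9 \cdot \mathrm{Horsepower},
\qquad
\frac{\partial\, \mathrm{L\_mpg}}{\partial\, \mathrm{Horsepower}}
  = \beta_2 + \beta_9 \cdot \mathrm{Displacement}
\]
```

So β1 on its own describes the effect of Displacement only when Horsepower is zero, which is why such coefficients are usually evaluated at a representative value of the other variable.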

OBSERVATIONS FROM MODEL 5
- Here CYLINDER_COUNT1 is kept as the base category, and L_mpg is regressed
  on the remaining dummy variables and the other independent variables.
- We can see that CYLINDER_COUNT4 has 3.3% lower mpg than CYLINDER_COUNT1.
- CYLINDER_COUNT2 is predicted to have 11.3 - (-3.3) = 14.6% more mpg than
  CYLINDER_COUNT4.
- To check whether this difference is significant, we fit another model with
  CYLINDER_COUNT4 kept as the base category.
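
Because the dependent variable is in logs, multiplying a coefficient difference by 100 gives only an approximate percentage difference. The sketch below compares that approximation with the exact figure. It assumes the 11.3 and -3.3 quoted above correspond to coefficients of 0.113 and -0.033; the function names are hypothetical.

```python
import math


def approx_percent_diff(coef_a, coef_b):
    """Approximate % difference in mpg between two groups in a log-lin model."""
    return 100.0 * (coef_a - coef_b)


def exact_percent_diff(coef_a, coef_b):
    """Exact % difference implied by the same two log-lin coefficients."""
    return 100.0 * (math.exp(coef_a - coef_b) - 1.0)


# Assumed coefficients, read off the percentages quoted in the slides.
coef_count2 = 0.113
coef_count4 = -0.033

print("approximate:", approx_percent_diff(coef_count2, coef_count4))
print("exact:", exact_percent_diff(coef_count2, coef_count4))
```

Under these assumptions the approximation gives 14.6% and the exact calculation roughly 15.7%; the gap widens as the coefficient difference grows.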

Test For Significant Difference
Here CYLINDER_COUNT4 is kept as the base category, and the fitted model shows that
CYLINDER_COUNT2 has 14.6% more mpg than CYLINDER_COUNT4 (which agrees with
our previous model).

Testing Differences Between Groups (F-Test)
L_mpg = β0 + δ0 CYLINDER_COUNT1 + β1 displacement + δ1 c1_disp + β2 horsepower + δ2 c1_horse + β3 weight + δ3 c1_weight + β4 acceleration + δ4 c1_acc

Null hypothesis:
δ0 = δ1 = δ2 = δ3 = δ4 = 0, i.e. there is no difference between the groups
Alternate:
The null hypothesis is false, i.e. there is a difference between the groups
An F-statistic comparing the restricted and unrestricted models is used to determine whether the groups differ.

UNRESTRICTED MODEL
The unrestricted model contains the independent variables, the dummy
variable (CYLINDER_COUNT1), and the products of the dummy variable with
each independent variable.

RESTRICTED MODEL
The restricted model is the base model, i.e. the regression without the
dummy variable and its products.

F-Test to Determine Difference between Groups
\[
F = \frac{(R_u^2 - R_r^2)/q}{(1 - R_u^2)/(n-k-1)}
  = \frac{(0.8154 - 0.7898)/5}{(1 - 0.8154)/382}
  = 10.59
\]
Since 10.59 is greater than the F-table value for (5, 382) degrees of freedom,
which is 2.2141, we reject the null hypothesis and conclude that there are
differences between the groups.
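
The arithmetic of this F-test can be checked with a short sketch. The function name `f_stat_from_r2` is hypothetical; the inputs are the R-squared values and degrees of freedom reported above, and the critical value is the one quoted from the F-table rather than one computed here.

```python
def f_stat_from_r2(r2_unrestricted, r2_restricted, q, df_resid):
    """F statistic for q restrictions, from the R-squared form of the test.

    df_resid is n - k - 1 for the unrestricted model.
    """
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1.0 - r2_unrestricted) / df_resid
    return numerator / denominator


f_groups = f_stat_from_r2(0.8154, 0.7898, q=5, df_resid=382)
f_table_value = 2.2141  # critical value as reported in the slides

print("F statistic:", f_groups)
print("reject null:", f_groups > f_table_value)
```

The computed statistic is about 10.6, which agrees with the reported 10.59 up to rounding and lies well above the quoted critical value.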

Test for Heteroskedasticity - Breusch Pagan Test
Multiple Regression is done using the Log-Lin Model to check for
heteroskedasticity.

As seen from the table, the residuals (the estimated error term) are
predicted, and their squares are regressed on the regressors.
Hypothesis Testing for Heteroskedasticity

Null Hypothesis - βdisplacement = βhorsepower = βweight = βacceleration = 0 in the regression of the squared residuals (no heteroskedasticity)
Alternate Hypothesis - There is heteroskedasticity
\[
F = \frac{R_u^2 / k}{(1 - R_u^2)/(n-k-1)}
  = \frac{0.05/4}{(1 - 0.05)/387}
  = 5.092
\]
Since 5.092 is greater than the F-table value for (4, 387) degrees of freedom, which is 2.3719, the null is
rejected. So our model exhibits heteroskedasticity.
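
The Breusch-Pagan statistic can be reproduced from the R-squared of the auxiliary regression (squared residuals on the regressors). The sketch below computes the F form used above and, for comparison, the LM form n·R², which is compared against a chi-square distribution with k degrees of freedom. The sample size n = 392 is an assumption inferred from n - k - 1 = 387 with k = 4; the function names are hypothetical.

```python
def breusch_pagan_f(r2_aux, k, n):
    """F form of the Breusch-Pagan test from the auxiliary R-squared."""
    return (r2_aux / k) / ((1.0 - r2_aux) / (n - k - 1))


def breusch_pagan_lm(r2_aux, n):
    """LM form of the Breusch-Pagan test: n times the auxiliary R-squared."""
    return n * r2_aux


r2_aux = 0.05   # auxiliary R-squared as reported in the slides
k = 4           # displacement, horsepower, weight, acceleration
n = 392         # assumed: inferred from n - k - 1 = 387

print("F statistic:", breusch_pagan_f(r2_aux, k, n))
print("LM statistic:", breusch_pagan_lm(r2_aux, n))
```

Under these assumptions the LM statistic is 19.6, above the 5% chi-square critical value of 9.488 for 4 degrees of freedom, so both forms point to the same conclusion.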

White Test for Heteroskedasticity
Multiple Regression is done using the Log-Lin Model.

Regression on Cross Products of Regressors and their Squares
gen disp2 = displacement^2
gen horsepower2 = horsepower^2
gen Acceleration2 = acceleration^2
gen Weight2 = weight^2
gen disp_acceleration = displacement * acceleration
gen horse_acc = horsepower * acceleration
gen weight_acc = weight * acceleration
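
The commands above generate seven terms. In its general form, the White test uses every regressor, every square and every pairwise cross product. The sketch below, written in Python rather than the Stata-style commands of the slides, enumerates that full set for one observation. The function name `white_terms`, the `_sq` and `_x_` naming, and the sample values are illustrative assumptions.

```python
from itertools import combinations


def white_terms(row):
    """Levels, squares and pairwise cross products for one observation.

    `row` maps regressor names to values; the result maps term names to values.
    """
    terms = dict(row)
    for name, value in row.items():
        terms[f"{name}_sq"] = value ** 2
    for (name_a, value_a), (name_b, value_b) in combinations(row.items(), 2):
        terms[f"{name_a}_x_{name_b}"] = value_a * value_b
    return terms


# Illustrative values for a single car.
sample_row = {
    "displacement": 307.0,
    "horsepower": 130.0,
    "weight": 3504.0,
    "acceleration": 12.0,
}

terms = white_terms(sample_row)
print(len(terms), "terms")
for name in terms:
    print(name)
```

With four regressors this gives 4 levels, 4 squares and 6 cross products, 14 terms in total, which is why the White test uses up degrees of freedom quickly.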

Hypothesis Testing
Null Hypothesis - The coefficients on the regressors, their squares and their cross products are all zero in the regression of the squared residuals (no heteroskedasticity)
Alternate Hypothesis - There is heteroskedasticity
The F statistic (90.44482) is greater than the F-table value (8.08); therefore we
reject the null and conclude that there is heteroskedasticity.

Conclusion for Heteroskedasticity
As seen from the graph and the two tests, we conclude that the model exhibits
heteroskedasticity.

HETEROSKEDASTICITY ROBUST STANDARD ERRORS (HRSE)
Due to the presence of heteroskedasticity, the usual OLS variance and
standard error estimates are not valid. Therefore we need to find
heteroskedasticity-robust standard errors.
When a model exhibits heteroskedasticity, it is better to look at the robust
standard errors than the OLS standard errors.
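
To show how a robust standard error differs from the usual OLS one, here is a minimal sketch for the simplest case: a regression with a single regressor, using the basic White (HC0) formula for the slope. It is a simplified illustration, not the multi-regressor computation behind the slides; the function name and the toy data are assumptions.

```python
import math


def slope_standard_errors(xs, ys):
    """Return (slope, classical_se, robust_se) for y = b0 + b1 * x.

    classical_se assumes constant error variance; robust_se is the
    HC0 (White) estimate, which weights each squared residual by that
    observation's squared deviation of x from its mean.
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - intercept - slope * x for x, y in zip(xs, ys)]
    classical_se = math.sqrt(sum(e * e for e in residuals) / (n - 2) / sxx)
    robust_se = math.sqrt(
        sum(((x - x_bar) ** 2) * e * e for x, e in zip(xs, residuals))
    ) / sxx
    return slope, classical_se, robust_se


# Made-up toy data, for illustration only.
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [0.0, 1.0, 1.0, 4.0]

slope, classical_se, robust_se = slope_standard_errors(toy_x, toy_y)
print("slope:", slope)
print("classical SE:", classical_se)
print("robust SE (HC0):", robust_se)
```

The coefficient estimate is the same either way; only the standard error, and therefore the t statistic and confidence interval, changes.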

Summary
Model No   R-Squared   Adjusted R-Squared
Model 1    0.7070      0.7040
Model 2    0.7898      0.7876
Model 3    0.8112      0.8073
Model 4    0.8134      0.8110
Model 5    0.8286      0.8184
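
The adjusted R-squared penalises R-squared for the number of regressors. A minimal sketch of the formula follows, checked against the first four rows of the table. The sample size n = 392 is an assumption inferred from the degrees of freedom quoted earlier (387 = n - 4 - 1). The regressor counts are also assumptions: 4 for Models 1 and 2, 8 for Model 3 (four dummies added) and 5 for Model 4 (one interaction term added).

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k regressors and n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


n_obs = 392  # assumed sample size, inferred from n - k - 1 = 387 with k = 4

# (model name, reported R-squared, assumed number of regressors)
models = [
    ("Model 1", 0.7070, 4),
    ("Model 2", 0.7898, 4),
    ("Model 3", 0.8112, 8),
    ("Model 4", 0.8134, 5),
]

for name, r2, k in models:
    print(name, round(adjusted_r2(r2, n_obs, k), 4))
```

Under these assumptions the formula reproduces the adjusted values listed for Models 1 to 4.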
