2. INTRODUCTION
The objective of this project is to study the relationship
between gasoline expenditure and the gasoline price index,
per capita disposable income, the price index for new cars,
the price index for used cars, and the price index for
public transportation in the US. Other factors may also
affect gasoline expenditure, but the factors considered
here represent it reasonably well.
3. METHODOLOGY
The model considered here is multivariate, i.e. it involves more than two variables,
so we use the multiple linear regression technique.
Multiple regression is an extension of simple linear regression. It is used
when we want to predict the value of a variable based on the values of two
or more other variables.
The variable we want to predict is called the dependent variable (or
sometimes the outcome, target, or criterion variable). The variables we
use to predict the value of the dependent variable are called the
independent variables (or sometimes the predictor, explanatory, or
regressor variables). Multiple regression analysis is applied here to study
the relationship between the dependent variable and all the factors
involved. The data under consideration are time series data (1953-
2004).
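As a rough sketch of the technique (the project itself used SPSS), a multiple linear regression with five predictors can be fitted by ordinary least squares, e.g. in Python with NumPy. The numbers below are synthetic stand-ins, not the 1953-2004 gasoline series:

```python
import numpy as np

# Minimal OLS sketch with five predictors, mirroring the structure of the
# gasoline model. The data are synthetic stand-ins, NOT the gasoline series.
rng = np.random.default_rng(0)
n = 52                                    # 52 annual observations, as in the project
X = rng.normal(size=(n, 5))               # five hypothetical predictors
beta_true = np.array([2.0, 1.5, -0.5, -0.3, 0.8])
y = 10.0 + X @ beta_true + rng.normal(scale=0.1, size=n)

# Append an intercept column and solve the least-squares problem.
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
```

The fitted vector beta_hat holds the intercept followed by the five slope coefficients, the same quantities SPSS reports in its coefficients table.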
4. DATA SOURCING
The data taken into consideration (U.S. Gasoline
market) is a time series data (1953-2004).
The data were compiled by Prof. Chris Bell, Department of
Economics, University of North Carolina, Asheville,
from www.bea.gov and www.bls.gov.
5. VARIABLES
The variables we considered for the gasoline market are defined below:
GasExp - total gasoline expenditure in the U.S., in billions of dollars
(dependent variable)
Gasp - price index for gasoline (independent variable)
Income - per capita disposable income (independent variable)
PNC - price index for new cars (independent variable)
PUC - price index for used cars (independent variable)
PPT - price index for public transportation (independent variable)
6. REGRESSION STATISTICS
We fitted the regression model to our data and obtained the following
results:
These are the "goodness of fit" measures. They tell us how well the
estimated linear regression equation fits the data.
The coefficient of determination of the model comes out to be 0.996, i.e.
99.6% of the variation in gasoline expenditure is explained by the
factors taken into consideration.
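For illustration, the coefficient of determination can be computed by hand from the residual and total sums of squares. A small sketch with toy numbers (the 0.996 above comes from the actual gasoline data):

```python
import numpy as np

# R^2 = 1 - SS_res / SS_tot, computed from a simple least-squares fit.
# Toy data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
ss_res = float(resid @ resid)               # residual sum of squares
ss_tot = float(np.sum((y - y.mean())**2))   # total sum of squares
r2 = 1.0 - ss_res / ss_tot
```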
7. ANOVA TABLE
The linear regression's F-test has the null
hypothesis that there is no linear relationship
between the variables, i.e. that all slope coefficients are zero:
H0: β1 = β2 = β3 = β4 = β5 = 0
H1: at least one of the βi is not 0.
8. Here, the significance value for the F-test is 0.000, which is
less than 0.05, so we reject the null hypothesis
that there is no linear relationship between
the variables.
Thus we can conclude that there is a linear relationship
between the variables in our model.
This also indicates that, overall, the regression model
statistically significantly predicts the outcome variable
(i.e., it is a good fit for the data).
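The F statistic behind this test is F = (SSR/k) / (SSE/(n − k − 1)), where SSR is the regression sum of squares, SSE the residual sum of squares, and k the number of slope coefficients. A minimal sketch on toy data (not the project's SPSS output):

```python
import numpy as np

# Overall F-test sketch: F = (SSR / k) / (SSE / (n - k - 1)).
# Toy data, not the gasoline series. A large F rejects H0.
rng = np.random.default_rng(1)
n, k = 52, 5
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, 1.5, -0.5, -0.3, 0.8]) + rng.normal(scale=0.5, size=n)

Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
fitted = Xd @ beta
sse = np.sum((y - fitted)**2)           # residual sum of squares
ssr = np.sum((fitted - y.mean())**2)    # regression sum of squares
F = (ssr / k) / (sse / (n - k - 1))
```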
9. COEFFICIENTS
The coefficient of every variable is statistically significant because its p-
value is smaller than 0.05. So the model becomes:
In multiple regression, each coefficient is interpreted as the
estimated change in Y corresponding to a unit change in one
variable, when all other variables are held constant.
10. In line with our a priori expectations, the signs of the regression coefficients
are as anticipated:
The coefficient of the gasoline price index is positive: since the demand for
gasoline is relatively inelastic, a higher price raises total expenditure on
gasoline.
As per capita income increases, so should expenditure on gasoline. The
coefficient obtained here shows that with increasing income people tend to
spend more on gasoline, directly or indirectly.
The coefficient of the price index for new cars is negative, as expected:
higher new-car prices discourage car purchases and hence driving. Even if
improved technology in new cars were to raise gasoline use, the negative
effect of a new-car price increase on gasoline expenditure and any such
positive effect exist at the same time, and the former is much larger than
the latter.
The coefficient of the price index for used cars is also negative, meaning that
rising used-car prices decrease the demand for used cars and thus
expenditure on gasoline.
As the price index of public transportation increases, total gasoline expenditure
should also increase, because travelling by public transportation becomes
costlier and people shift toward private vehicles. This is also borne out by the results.
11. AUTOCORRELATION
A key assumption in regression is that the error terms are independent of
each other. In this section, we present a simple test to determine whether
there is autocorrelation, i.e. whether there is a (linear) correlation
between the error term for one observation and the next.
We detect the presence of autocorrelation using the Durbin-Watson
D test, which uses the statistic
d = Σ_{t=2..n} (e_t − e_{t−1})² / Σ_{t=1..n} e_t²,
where e_t is the residual for observation t.
Since most regression problems involving time series data show positive
autocorrelation, we usually test the null hypothesis
H0: no autocorrelation (ρ = 0) against the alternative hypothesis H1: ρ > 0.
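A minimal sketch of computing the Durbin-Watson statistic from a residual series (the residuals below are hypothetical, not the project's):

```python
import numpy as np

# Durbin-Watson statistic: d = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# Values near 2 suggest no autocorrelation; values near 0 suggest
# positive autocorrelation. Hypothetical residuals for illustration.
e = np.array([1.0, 0.9, 0.7, 0.6, 0.4, 0.1, -0.2, -0.5, -0.6, -0.8])
d = np.sum(np.diff(e)**2) / np.sum(e**2)
```

Because these residuals drift smoothly (each close to its predecessor), the successive differences are small and d comes out well below 2, signalling positive autocorrelation.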
12. Using SPSS we calculated the Durbin-Watson value, which
came out to be 0.733.
We checked the Durbin-Watson table for 52 observations
and k = 6 and got dL = 1.35124 and dU = 1.76942. The
Durbin-Watson value of 0.733 lies between 0 and dL.
13. Since d lies below dL, we conclude that positive
autocorrelation is present.
We can remove this positive autocorrelation by using the
Cochrane-Orcutt iterative method.
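A minimal sketch of the Cochrane-Orcutt iteration under an assumed AR(1) error structure (the function and setup are illustrative, not the project's SPSS procedure):

```python
import numpy as np

def cochrane_orcutt(y, X, n_iter=10):
    """Illustrative Cochrane-Orcutt iteration assuming AR(1) errors.

    y : (n,) response; X : (n, k) predictors (without an intercept column).
    Returns the coefficient vector (intercept first) and the estimated rho.
    """
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rho = 0.0
    for _ in range(n_iter):
        e = y - Xd @ beta                                  # current residuals
        rho = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1]**2)   # AR(1) estimate of rho
        y_star = y[1:] - rho * y[:-1]                      # quasi-differencing
        X_star = Xd[1:] - rho * Xd[:-1]
        beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta, rho
```

Each pass estimates ρ from the lagged residuals, quasi-differences the data, and refits until the estimates settle, which is the essence of the iterative method named above.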
14. HOMOSCEDASTICITY
Heteroscedasticity is the violation of the assumption of homoscedasticity
(equally spread variance), i.e. it is the problem of unequal variance of the
error term.
On plotting the standardized residuals (y-axis) against the standardized
predicted values, we find that the error terms are evenly spread over
all values, suggesting homogeneity of variance in the data.
15. To confirm this, we use the Spearman rank correlation test to detect
whether heteroscedasticity is present in the data.
For that we calculate the unstandardized predicted values and
unstandardized residuals and then check the correlation between them.
We test the null hypothesis H0: no heteroscedasticity against the
alternative hypothesis H1: heteroscedasticity is present.
Here, the significance value is 0.388, which is greater than 0.05; therefore we
fail to reject the null hypothesis, i.e. there is no evidence of heteroscedasticity
in the data.
That means we can say that the data are homoscedastic.
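A sketch of this kind of check in Python with SciPy. The common form of the test ranks the absolute residuals against the fitted values; the data below are toy numbers, not the SPSS output above:

```python
import numpy as np
from scipy.stats import spearmanr

# Spearman rank-correlation check for heteroscedasticity: correlate the
# absolute residuals with the predicted values. A small |rho| and a large
# p-value give no evidence of heteroscedasticity.
# Toy, homoscedastic-by-construction data.
rng = np.random.default_rng(0)
pred = rng.normal(size=52)    # stand-in predicted values
resid = rng.normal(size=52)   # stand-in residuals, variance unrelated to pred
rho, pval = spearmanr(pred, np.abs(resid))
```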
16. MULTICOLLINEARITY
In statistics, Multicollinearity (also collinearity) is a
phenomenon in which two or more predictor variables in a
multiple regression model are highly correlated, meaning that
one can be linearly predicted from the others with a
substantial degree of accuracy. In this situation the coefficient
estimates of the multiple regression may change erratically in
response to small changes in the model or the data.
We are checking collinearity statistics i.e. variance inflation
factor (VIF) and tolerance level (ToL) to detect the presence
of multicollinearity in our data.
A VIF value greater than 2.5 indicates that multicollinearity is present in the
data.
Here, every independent variable has a VIF value greater than 2.5, which
means there is strong multicollinearity in the data.
Likewise, a ToL value below 0.4 indicates that multicollinearity is present in
the data.
Every variable here has a ToL value below 0.4, again indicating the
presence of strong multicollinearity in the data.
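The VIF used here can be sketched by hand: regress each predictor on the others and set VIF_j = 1/(1 − R_j²); tolerance is its reciprocal. A minimal illustration on made-up data with two deliberately collinear columns:

```python
import numpy as np

def vif(X):
    """Variance inflation factor VIF_j = 1 / (1 - R_j^2) for each column of X,
    where R_j^2 comes from regressing column j on the remaining columns.
    Tolerance (ToL) is simply 1 / VIF."""
    n, k = X.shape
    out = []
    for j in range(k):
        yj = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, yj, rcond=None)
        resid = yj - Z @ beta
        r2 = 1.0 - resid @ resid / np.sum((yj - yj.mean())**2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Made-up example: the third column is almost a copy of the first, so both
# get VIF values far above the 2.5 threshold used in the text.
rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([a, b, a + 0.05 * rng.normal(size=100)])
```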
18. REMOVAL OF MULTICOLLINEARITY
We can reduce or remove multicollinearity from the data by the following methods:
Collecting additional data
If multicollinearity is present in the data, adding more
observations can reduce or remove it.
Removing redundant variables
Removing redundant variables reduces or removes
multicollinearity.
Combining variables
Define a new variable that is a combination of the
variables causing multicollinearity.
19. CONCLUSION
As discussed earlier, the objective of the project is to study the
relationship between gasoline expenditure and the gasoline price
index, income, and the price indices for new cars, used cars and public
transport in the US.
We calculated the goodness-of-fit measures for the model, i.e.
R = 0.998, R² = 0.996, adjusted R² = 0.996 and
standard error = 3.7758 for N = 52.
From the ANOVA table, the statistical significance of the
regression model, i.e. our p-value of 0.000 (which is less than
0.05), indicates that the overall regression model is statistically
significant and predicts the dependent variable (i.e., it is a good
fit for the data).
20. Finally, after running the regression we tested for autocorrelation,
heteroscedasticity and multicollinearity in the model. We found that
strong multicollinearity and positive autocorrelation are present in
the data.
Since multicollinearity does not reduce the model's overall fit or
predictive power, we may ignore it here and consider all the variables
important, as each has a significant effect on gasoline expenditure.
Heteroscedasticity is not present in the data, so it does not affect our
model in any case.
Thus we conclude that our best (fitted) regression model is: