Multiple Regression

Statistical Analysis
Introduction
Statistical analysis is a component of data analytics. It involves collecting,
summarizing and interpreting data from a sample. A sample, in statistics, is
a representative selection drawn from a total population.

Objectives
The objectives of the analysis are:
• To check whether area and consumption of fertilizer have a significant
impact on the production of crops.
• If so, to estimate how much production increases with an increase in area
and in consumption of fertilizer.
Software
The SPSS software is used for statistical analysis.

Methodology
A multiple regression technique was applied to the data on production
of crops, area of crops and consumption of fertilizers. Multiple linear
regression attempts to model the relationship between two or more
explanatory variables and a response variable. Every observed combination
of values of the independent variables x is associated with a value of the
dependent variable y. The technique rests on the following assumptions:
• Assumption of linearity.
• Assumption of normality.
• Assumption of no multicollinearity.
• Assumption of homoscedasticity.
• Assumption of no autocorrelation.
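As a sketch, the model behind these assumptions can be written as follows, using the symbols that appear in the fitted model of this analysis (Y for production, X1 for area, X2 for consumption of fertilizer) and e_i for the disturbance term. The intercept symbol β0 and the observation index i are notation assumed here for illustration.

```latex
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i, \qquad i = 1, 2, \ldots, n
```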

Assumption of linearity
The assumption of linearity states that the multiple regression model is linear in the
parameters. The classical model further assumes that the values of the regressors are fixed
in repeated sampling and that there is sufficient variability in the values of the regressors.

Interpretation
Both scatter plots of the residuals against the independent variables show a random
pattern, so we conclude that the relationship between production and the two predictors,
area and consumption of fertilizer, is linear.

Assumption of normality
• The assumption of normality says that the stochastic (disturbance) term e_i is normally distributed.
Different methods can be used to check whether the residuals are normal.
• Normal Q-Q plots are made to verify the normality of the residuals. The Kolmogorov-Smirnov and
Shapiro-Wilk tests are also applied.

Interpretation
In both Q-Q plots the data points lie close to the diagonal line, which indicates that
the residuals are normally distributed. The table also shows that, for both standardized and
unstandardized residuals, the tests are not significant: the p-values (0.200 and 0.876) are greater
than 0.05, so we do not reject the hypothesis of normality and conclude that the residuals are
normally distributed.
                          Kolmogorov-Smirnov         Shapiro-Wilk
                          Statistic   df   Sig.      Statistic   df   Sig.
Unstandardized Residual   .133        15   .200      .971        15   .876
Standardized Residual     .133        15   .200      .971        15   .876
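The Kolmogorov-Smirnov statistic reported above measures the largest gap between the empirical distribution of the residuals and a normal distribution. A minimal sketch of that calculation is given below, assuming the normal curve is fitted with the sample mean and standard deviation. The function name `ks_statistic` and the residual values are hypothetical, since the raw data are not part of this document, and the significance value would still need reference tables that are not computed here.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic(residuals):
    """Largest gap between the empirical CDF of the residuals and a normal
    CDF whose mean and standard deviation are estimated from the sample."""
    n = len(residuals)
    fitted = NormalDist(mean(residuals), stdev(residuals))
    gaps = []
    for i, value in enumerate(sorted(residuals), start=1):
        cdf = fitted.cdf(value)
        # The empirical CDF jumps from (i - 1)/n to i/n at each ordered value.
        gaps.append(max(i / n - cdf, cdf - (i - 1) / n))
    return max(gaps)


# Hypothetical residuals, for illustration only (not the crop data).
example_residuals = [-1.2, -0.7, -0.3, 0.0, 0.2, 0.5, 1.5]
d_stat = ks_statistic(example_residuals)
print(round(d_stat, 3))
```

A small statistic means the residuals track the normal curve closely, which is what the Q-Q plots show visually.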

Assumption of multicollinearity
• Multicollinearity in regression occurs when the predictor (independent)
variables in the regression model are highly correlated with one another.
A good regression model should not have strong correlation among
the independent variables, that is, it should be free of multicollinearity.
• Multicollinearity can be assessed by examining the tolerance and the
Variance Inflation Factor (VIF).

Interpretation
The table shows that the VIF for both predictors is 1.754, which lies between 1 and 10
(1 < 1.754 < 10), and that the tolerance is .570. So we conclude that there is no serious
multicollinearity in our data.
Collinearity Statistics
Model         Tolerance   VIF
Area          .570        1.754
Fertilizer    .570        1.754
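The two columns of the collinearity table are linked, because VIF is the reciprocal of tolerance. The short check below reproduces the reported VIF from the reported tolerance. It also uses the fact that with exactly two predictors the tolerance equals 1 − r², where r is the correlation between them, to back out the implied absolute correlation. The function name is a hypothetical helper.

```python
import math


def vif_from_tolerance(tolerance):
    """VIF is defined as 1 / tolerance."""
    return 1.0 / tolerance


tolerance = 0.570  # reported for both area and fertilizer
vif = vif_from_tolerance(tolerance)

# With two predictors only: tolerance = 1 - r**2, so |r| = sqrt(1 - tolerance).
implied_abs_r = math.sqrt(1.0 - tolerance)

print(round(vif, 3), round(implied_abs_r, 3))
```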

Assumption of homoscedasticity
• One of the important assumptions of the linear regression model is that the variance of
each disturbance term e_i, conditional on the chosen values of the explanatory variables,
is some constant number equal to σ². This is the assumption of homoscedasticity, or
equal (homo) spread (scedasticity), that is, equal variance.
Symbolically,
$E(e_i^2) = \sigma^2, \quad i = 1, 2, \ldots, n$
• I used the Glejser test and a scatter plot to check whether the data are homoscedastic.

Interpretation
The table shows that the significance values of the Glejser test for area and consumption of fertilizer are
0.806 and 0.933 respectively. Both are greater than α = 0.05, which indicates that there is no problem of
heteroscedasticity. The scatter plot also shows a random pattern, which likewise indicates that the variances
are equal.
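The idea behind the Glejser test is to regress the absolute residuals on a predictor and check whether the slope is significant. A slope close to zero means the spread of the residuals does not change with that predictor. The sketch below is a simplified illustration that uses one predictor at a time and returns only the t-statistic of the slope. The function name `glejser_t` and all data values are hypothetical, and how the SPSS run behind the reported significance values was set up is not stated in this document.

```python
import math


def glejser_t(residuals, x):
    """t-statistic of the slope when |residual| is regressed on one predictor x."""
    n = len(x)
    y = [abs(e) for e in residuals]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares of the auxiliary regression.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope / se_slope


# Hypothetical values, for illustration only (not the crop data).
x_demo = [1, 2, 3, 4, 5]
t_flat = glejser_t([1, -2, 1, -2, 1], x_demo)       # spread unrelated to x
t_growing = glejser_t([1, -2, 3, -4, 5.5], x_demo)  # spread grows with x

print(round(t_flat, 3), round(t_growing, 3))
```

A t-statistic near zero, as in the first call, corresponds to the large significance values reported above. A large t-statistic, as in the second call, would point to heteroscedasticity.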

Assumption of autocorrelation
• The term autocorrelation may be defined as “correlation between members of series of
observations ordered in time [as in time series data] or space [as in cross-sectional data].” In
the regression context, the linear regression model assumes that such autocorrelation does
not exist in the disturbances e_i.
• The problem of autocorrelation can be detected by using the runs test or the Durbin-Watson
test.
• I used the Durbin-Watson d test to detect whether there is a problem of autocorrelation.

Interpretation
In this analysis I used the hypotheses H0: ρ = 0 versus H1: ρ ≠ 0. Reject H0 at the 2α level if d < dU or (4 − d) <
dU, that is, if there is statistically significant evidence of autocorrelation, positive or negative.
The critical values used are dL = 0.700 and dU = 1.252. These are the tabulated bounds for n = 15 and two
regressors at α = 0.01, so the test level is 2α = 0.02 rather than 2α = 0.1.
The table shows that d = 2.732 is greater than dU = 1.252, and 4 − d = 1.268 is also greater than dU, so by the
above decision rule we do not reject H0 and conclude that there is no autocorrelation at this level.
Model Durbin-Watson
1 2.732
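The decision rule quoted in the interpretation can be written out as a short check. The sketch below applies it to the reported d and dU, and also shows the standard formula for the Durbin-Watson statistic on a set of residuals. The function names and the alternating residual series are hypothetical and only illustrate the mechanics.

```python
def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def no_autocorrelation(d, d_upper):
    """Two-sided rule from the text: reject H0 if d < dU or (4 - d) < dU."""
    return not (d < d_upper or (4 - d) < d_upper)


d_reported = 2.732  # from the Durbin-Watson table
d_upper = 1.252     # dU as stated in the interpretation

print(round(4 - d_reported, 3), no_autocorrelation(d_reported, d_upper))

# Hypothetical residuals that flip sign every period (negative autocorrelation).
demo_d = durbin_watson([1.0, -1.0, 1.0, -1.0])
print(demo_d, no_autocorrelation(demo_d, d_upper))
```

Values of d near 2 indicate no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.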

Model fitting
Model          Unstandardized Coefficients   Standardized Coefficients   t        Sig.
               B            Std. Error       Beta
1 (Constant)   -32782.283   7282.849                                     -4.501   .001
  fertilizer   4.257        .606             .595                        7.025    .000
  area         3.724        .663             .476                        5.619    .000
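Each t value in the coefficient table is the unstandardized coefficient B divided by its standard error. The check below recomputes them from the reported, rounded figures, so the results agree with the table only up to rounding. The dictionary layout is an illustrative choice.

```python
# (B, Std. Error) pairs as reported in the coefficient table.
coefficients = {
    "(Constant)": (-32782.283, 7282.849),
    "fertilizer": (4.257, 0.606),
    "area": (3.724, 0.663),
}

# t = B / Std. Error for each term.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```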

Fitted Model
Y = -32782.283 + 3.724*X1 + 4.257*X2
Production = -32782.283 + 3.724*area + 4.257*consumption of fertilizer
Interpretation
The p-value for both β1 and β2 is reported as 0.000, that is, less than 0.001, which is far below 0.05. Such a low p-value
means that we can reject the null hypothesis that the coefficient is zero. In other words, a predictor with a low p-value is
likely to be a meaningful addition to the model, because changes in the predictor's value are related to changes in the
response variable.
The fitted model gives β1 = 3.724, which indicates that production increases by an average of 3.724 (000 tons) for
every additional unit (000 hectares) of area, holding consumption of fertilizer constant. Likewise, β2 = 4.257 indicates
that production increases by an average of 4.257 (000 tons) for every additional unit (000 nutrient/ton) of fertilizer
consumption, holding area constant.
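The "holding the other variable constant" reading of the coefficients can be illustrated by evaluating the fitted equation directly. In the sketch below the coefficients are those of the fitted model, while the function name and the input values for area and fertilizer are hypothetical and chosen only to show the effect of a one-unit change.

```python
INTERCEPT = -32782.283
B_AREA = 3.724
B_FERTILIZER = 4.257


def predict_production(area, fertilizer):
    """Fitted model: production = intercept + 3.724*area + 4.257*fertilizer."""
    return INTERCEPT + B_AREA * area + B_FERTILIZER * fertilizer


# Hypothetical inputs, in the same units as the data (not actual observations).
base = predict_production(6000.0, 3000.0)
more_area = predict_production(6001.0, 3000.0)        # +1 unit of area
more_fertilizer = predict_production(6000.0, 3001.0)  # +1 unit of fertilizer

print(round(more_area - base, 3), round(more_fertilizer - base, 3))
```

The two differences equal the coefficients, whatever starting values are chosen, because the model is linear.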

R-Squared
• R-squared is a statistical measure of how close the data are to the fitted regression line. It is also
known as the coefficient of determination, or the coefficient of multiple determination for multiple
regression.
• The definition of R-squared is fairly straightforward: it is the percentage of the variation in the
response variable that is explained by a linear model.
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .975   .951       .943                916.5336
Interpretation
The tabulated value of R² is 0.951, that is 95.1%, which is very close to 100%. It shows that the model explains
95.1% of the variability of the response variable around its mean, so the model fits our data well.
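The Adjusted R Square in the model summary can be checked from R Square with the usual adjustment for the number of predictors. The sketch below takes n = 15 observations (the df shown for the normality tests) and two predictors. The function name is a hypothetical helper.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


adj = adjusted_r_squared(0.951, n_obs=15, n_predictors=2)
print(round(adj, 3))
```

The small drop from R Square to the adjusted value reflects the penalty for using two predictors with only 15 observations.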

ANOVA for Regression
• Analysis of Variance (ANOVA) consists of calculations that provide information about levels of variability
within a regression model and form a basis for tests of significance. It rests on the basic regression concept
DATA = FIT + RESIDUAL: the total variation is split into a part explained by the fitted model and a residual part.
Model          Sum of Squares   df   Mean Square   F         Sig.
1 Regression   1.955E8          2    9.777E7       116.391   .000
  Residual     1.008E7          12   840033.786
  Total        2.056E8          14
Interpretation
The significance value of the regression in the table is 0.000 (p < 0.001), which is less than 0.05, indicating that
the fitted model is statistically significant.
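The entries of the ANOVA table are tied together by simple arithmetic: each mean square is a sum of squares divided by its df, and F is the ratio of the two mean squares. The sketch below recomputes F, R² and the standard error of the estimate from the reported sums of squares. Because those inputs are rounded to four significant figures, the results match the SPSS output only approximately.

```python
import math

ss_regression = 1.955e8  # Sum of Squares, Regression
ss_residual = 1.008e7    # Sum of Squares, Residual
df_regression, df_residual = 2, 12

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual                       # F = MSR / MSE
r_squared = ss_regression / (ss_regression + ss_residual)  # FIT / DATA
std_error = math.sqrt(ms_residual)                         # Std. Error of the Estimate

print(round(f_stat, 1), round(r_squared, 3), round(std_error, 1))
```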

Conclusion
• The model is significant, with F = 116.391 and p-value = 0.000 (p < 0.001) at the 5% level of significance. This indicates
that the multiple regression model of production of crops on area and consumption of fertilizer is significant.
• Both regression coefficients have p-value = 0.000 (p < 0.001), so they are also significant at the 5% level of significance.
When area is increased by one unit (000 hectares), production increases by 3.724 (000 tons), holding consumption of
fertilizer constant. When consumption of fertilizer is increased by one unit (000 nutrient/ton), production increases
by 4.257 (000 tons), holding area constant.
• The value of R² is 0.951, which means that 95.1% of the variation in production of crops is explained by its linear
relationship with area and consumption of fertilizer. The remaining 4.9% is left unexplained by the model and is
attributable to variables not included in it and to random variation.
• So we conclude that area and consumption of fertilizer have a significant effect on the production of crops.
