X18145922 statistics ca2 final

NATIONAL COLLEGE OF IRELAND
STATISTICS FOR DATA ANALYTICS
CA 2 - PROJECT
Analysis of statistical models
Submitted by,
SRIVATSAV KATTUKOTTAI MANI
X18145922
MSc in Data Analytics ‘B’
(MSCDAD_B)

MULTIPLE REGRESSION MODEL
Multiple regression is a method for analysing or predicting a dependent variable (also called
the outcome) using two or more independent variables (also called predictors). It can be used
to estimate how much of the variance in the dependent variable the model explains and how much
each independent variable contributes to that explained variance.
Objective of Analysis:
The main objective of performing multiple regression analysis on the collected data is to predict
the Average daily traffic rate in various regions of New Zealand using other factors such as peak
traffic rate and the percentage of cars/light commercial/medium commercial/heavy commercial
vehicles.
Context of data being analysed:
The cleaned data contains 7 columns: Average daily traffic rate, peak traffic rate, and the
percentage of cars, light commercial, medium commercial, heavy commercial1 and heavy commercial2
vehicles. All the measures are taken at various co-ordinates and peak hours across New Zealand.
Data Source used:
The dataset used in this model has been taken from the New Zealand Government data
depository:
https://www.data.govt.nz/
Fig.1 below shows a sample of the cleaned data from the dataset. The raw data collected from the
depository contains about 9359 rows and 15 columns. To make the data suitable for analysis using
multiple regression, it has been cleaned by removing the null values and further reduced to
around 500 rows with 7 columns for making reliable predictions.
Fig.1: Sample of data used for multiple regression analysis.

Measurement levels of all variables:
One dependent and six independent variables have been used in our dataset. All the variables are
continuous (measured on an interval/ratio scale); no categorical variables are used for analysis.
According to Tabachnick and Fidell (2007, p.123), the required sample size, taking the number of
independent variables into account, is N > 50 + 8m (where m = number of independent variables).
In our case m = 6, hence N > 98. Since the data obtained from the depository is large, it has been
cleaned and 500 samples have been used for the analysis.
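The sample-size rule above can be checked with a few lines of Python. This is only a sketch of the N > 50 + 8m rule of thumb; the function names are illustrative and not part of the SPSS workflow.

```python
def minimum_sample_size(num_predictors: int) -> int:
    """Smallest whole N that satisfies N > 50 + 8m."""
    return 50 + 8 * num_predictors + 1


def sample_size_is_adequate(n: int, num_predictors: int) -> bool:
    """True when n exceeds the 50 + 8m threshold."""
    return n > 50 + 8 * num_predictors


if __name__ == "__main__":
    m = 6    # independent variables in this analysis
    n = 500  # cleaned sample size used in this analysis
    print("threshold:", 50 + 8 * m)
    print("smallest adequate N:", minimum_sample_size(m))
    print("500 samples adequate:", sample_size_is_adequate(n, m))
```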
Procedures for multiple regression analysis:
• After importing the data into SPSS software, select Analyze > Regression > Linear.
• Drag the dependent variable (Average daily traffic rate) and the independent variables
(peak traffic rate, percent of cars/light commercial/medium commercial/heavy commercial
vehicles) and drop into their respective fields.
• Click Statistics button, tick the Estimates, Confidence intervals set at 95%, Part and partial
correlations, Model fit, Collinearity diagnostics, Descriptives check boxes.
• Click Plots button, move *ZPRED into X-box and *ZRESID into Y-box. Under
Standardized Residual Plots, tick the Histogram and Normal probability plot check boxes.
• Click Continue and OK to view the results.
Assumptions for multiple regression model:
• The dependent variable should be measured on a continuous scale.
• Two or more independent variables should be used, which can be either continuous or
categorical.
• Multicollinearity should not be present. Multicollinearity occurs when two or more
independent variables are highly correlated with each other.
• Homoscedasticity should be present, i.e. the variance of the residuals remains similar as you
move along the line of best fit.
• A linear relationship should be present between the dependent variable and each of the
independent variables, and between the dependent variable and the independent variables
collectively.
Checking that Assumptions used are not violated:
1. Multicollinearity should not be present:
Fig.2 below shows the correlations table of the analysed data. According to Julie Pallant
(2007, p.155), the correlation between the dependent variable and the independent variables
should preferably be above 0.3. From the figure, we can see that the Peak traffic and Percent
heavy commercial2 variables correlate substantially with Average daily traffic rate (0.690 and
-0.313 respectively). The figure also shows that the correlations between the independent
variables are not too high: none of the pairs of independent variables has a correlation above
0.7 (Julie Pallant, 2007, p.155), hence all are retained.

Fig.2: Correlations table of the output
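The two correlation checks described above (predictor-outcome correlation of at least 0.3 in magnitude, predictor-predictor correlation below 0.7) can be sketched in plain Python. The data and the column names `peak` and `heavy` are made up for illustration and are not the traffic dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def screen_correlations(outcome, predictors, min_outcome_r=0.3, max_predictor_r=0.7):
    """Return (weak predictors, predictor pairs that correlate too highly)."""
    names = list(predictors)
    weak = [name for name in names
            if abs(pearson_r(predictors[name], outcome)) < min_outcome_r]
    too_high = [(a, b)
                for i, a in enumerate(names)
                for b in names[i + 1:]
                if abs(pearson_r(predictors[a], predictors[b])) >= max_predictor_r]
    return weak, too_high


if __name__ == "__main__":
    # Hypothetical toy data, only to exercise the functions.
    adt = [10, 20, 30, 40, 50]
    predictors = {"peak": [1, 2, 3, 4, 6], "heavy": [5, 3, 4, 1, 2]}
    weak, too_high = screen_correlations(adt, predictors)
    print("weak predictors:", weak)
    print("highly correlated pairs:", too_high)
```

The thresholds are passed as parameters so that a different cut-off can be tried without changing the function.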
With reference to Julie Pallant (2007, p.156), multicollinearity can also be detected using the
Tolerance and VIF (Variance Inflation Factor) values: multicollinearity is indicated if the
Tolerance value is less than 0.10 or the VIF is above 10. From Fig.3 below, we can see that both
values are within limits (Tolerance > 0.10 and VIF < 10), thus the multicollinearity
assumption is not violated.
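Tolerance and VIF are two views of the same quantity: Tolerance = 1 - R² (where R² comes from regressing one predictor on all the others) and VIF = 1 / Tolerance. A minimal sketch of the cut-offs quoted above follows; the R² inputs are hypothetical, not values from Fig.3.

```python
def tolerance(r_squared: float) -> float:
    """Tolerance of a predictor, given R-squared from regressing it on the other predictors."""
    return 1.0 - r_squared


def vif(r_squared: float) -> float:
    """Variance Inflation Factor, the reciprocal of Tolerance."""
    return 1.0 / tolerance(r_squared)


def multicollinearity_flag(r_squared: float, min_tolerance=0.10, max_vif=10.0) -> bool:
    """True when the predictor breaches the Tolerance or VIF cut-off."""
    tol = tolerance(r_squared)
    if tol == 0:
        # Predictor is fully explained by the others; VIF is undefined.
        return True
    return tol < min_tolerance or 1.0 / tol > max_vif


if __name__ == "__main__":
    for r2 in (0.50, 0.95, 1.0):  # hypothetical R-squared values
        print(r2, multicollinearity_flag(r2))
```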
Fig.3: Coefficients table of the output

2. Homoscedasticity and linear relationship should be present:
Figures 4, 5 & 6 below show the Histogram, Normal P-P Plot and Scatter Plot of the
standardized residuals respectively. The Histogram shows that the residuals are approximately
normally distributed, and the P-P Plot shows no major deviations from normality, which supports
a linear relationship between the dependent and independent variables. Finally, the Scatter Plot
indicates the presence of homoscedasticity, as the residuals are concentrated around the centre.
Fig.4: Histogram Plot of the output
Fig.5: Normal P-P Plot of the output

Fig.6: Scatter Plot of the output
3. Analysing the Results or output of the model:
According to Julie Pallant (2007, p.158), the R-square value gives the proportion of variance in
the dependent variable (Average daily traffic) that is explained by the independent variables.
From Fig.7 below, we can see that R-square is 0.531 (53.1% of the variance explained) and the
Adjusted R-square is 0.526 (52.6%), which gives a better estimate of the true population value.
The quality of prediction of the dependent variable (Average daily traffic) is given by the
multiple correlation coefficient R, which is 0.729, indicating a good level of prediction.
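The relationship between R, R-square and Adjusted R-square can be verified numerically. The sketch below assumes n = 499 cases and k = 5 retained predictors, which is inferred from the degrees of freedom reported in the ANOVA table (5 and 493) rather than read from the Model Summary itself.

```python
from math import sqrt


def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R-squared for n cases and k predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


if __name__ == "__main__":
    r_squared = 0.531  # reported R-square
    n = 499            # assumption: inferred from residual df = n - k - 1 = 493
    k = 5              # predictors retained in the model
    print("R:", round(sqrt(r_squared), 3))
    print("Adjusted R-square:", round(adjusted_r_squared(r_squared, n, k), 3))
```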
Fig.7: Model Summary of the model
4. Evaluation of independent variables:
From Fig.8 (ANOVA table), we can see that the Significance (p-value) is less than 0.05, with 5
and 493 degrees of freedom and an F value of 111.724 (i.e. F(5, 493) = 111.724). The model is
therefore statistically significant.
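The F value can be cross-checked against R-square, since F = (R² / df_model) / ((1 - R²) / df_residual). The sketch below uses the rounded R-square of 0.531, so it reproduces the reported F only approximately.

```python
def f_statistic(r_squared: float, df_model: int, df_residual: int) -> float:
    """Overall F statistic of a regression model computed from its R-squared."""
    explained = r_squared / df_model
    unexplained = (1 - r_squared) / df_residual
    return explained / unexplained


if __name__ == "__main__":
    # R-square is rounded to three decimals, so a small gap to 111.724 is expected.
    print(round(f_statistic(0.531, 5, 493), 2))
```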
With reference to Julie Pallant (2007, p.159), the beta values under Standardized coefficients
and their Significance (p-value) show whether a particular variable makes a significant
contribution to explaining the dependent variable. From Fig.8.1, it is clear that three variables
(Peak traffic, percent heavy commercial 1 and percent heavy commercial 2) make a unique,
significant contribution to predicting the dependent variable (Average daily traffic), with
p-values < 0.05. To form the regression equation, we can use the Unstandardized B values as
below.
Here ADT denotes Average Daily Traffic.
ADT = 58.181 + (3.186 * Peak traffic) - (0.813 * percent light commercial)
+ (0.221 * percent medium commercial) - (2.740 * percent heavy commercial1)
- (9.281 * percent heavy commercial2)
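The regression equation can be turned into a small prediction function. The coefficients are the Unstandardized B values quoted above; the parameter names and the input values in the example are illustrative assumptions, not rows from the dataset.

```python
INTERCEPT = 58.181

# Unstandardized B values from the regression equation.
COEFFICIENTS = {
    "peak_traffic": 3.186,
    "pct_light_commercial": -0.813,
    "pct_medium_commercial": 0.221,
    "pct_heavy_commercial1": -2.740,
    "pct_heavy_commercial2": -9.281,
}


def predict_adt(peak_traffic, pct_light_commercial, pct_medium_commercial,
                pct_heavy_commercial1, pct_heavy_commercial2):
    """Predicted Average Daily Traffic for one site."""
    values = {
        "peak_traffic": peak_traffic,
        "pct_light_commercial": pct_light_commercial,
        "pct_medium_commercial": pct_medium_commercial,
        "pct_heavy_commercial1": pct_heavy_commercial1,
        "pct_heavy_commercial2": pct_heavy_commercial2,
    }
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in values.items())


if __name__ == "__main__":
    # Hypothetical site: peak traffic 100, then the four percentage columns.
    print(round(predict_adt(100, 10, 5, 2, 1), 3))
```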
Fig.8: ANOVA table
Fig.8.1: Coefficients table
The Percent Car variable has been excluded because it has Tolerance = 0, which means that it is
completely explained by the other independent variables and its contribution to the prediction
is redundant.
Fig.9: Excluded Variables table.

Conclusion:
Using the multiple regression model, it can be concluded that three variables (Peak traffic,
percent heavy commercial 1 and percent heavy commercial 2) make a unique, significant
contribution to predicting the dependent variable (Average daily traffic), of which Peak traffic
contributes the most. The overall quality of prediction is R = 0.729 (72.9%).
---------------------------------------------------------------------------------------------------------------------
BINARY LOGISTIC REGRESSION MODEL
Binomial logistic regression is a method of analysis used for predicting the probability that an
observation falls into one of the two categories of a dichotomous dependent variable, based on
one or more independent variables which can be either categorical or continuous.
Data Source used:
The dataset used in this model has been taken from UN data depository:
http://data.un.org/
Fig.10 below shows a sample of the cleaned data from the dataset. The raw data collected from the
depository contains about 3095 rows and 8 columns. To make the data suitable for analysis using
logistic regression, it has been cleaned by removing the null values and further reduced to
around 500 rows with 4 columns for making reliable predictions.
Fig.10: Sample data of logistic regression model

Objective of Analysis:
The main objective of using the binary logistic regression model is to predict whether the growth
rate (dependent variable) of a country has increased or decreased, using the percentage of
employees in the services, industry and agriculture sectors (independent variables).
Context of data being analysed:
The cleaned data contains 4 columns: Growth rate, percent services, percent industry and percent
agriculture for various countries. It is used to examine whether the change in growth rate
depends on the percentage of employees in each sector.
Measurement level of variables:
One dependent and three independent variables have been used in our dataset. The independent
variables (Percent in services, industry and agriculture) are continuous and the dependent
variable (Growth rate) is dichotomous. Since the data obtained from the depository is large, it
has been cleaned and 500 samples have been used for the analysis.
Procedures for binary logistic regression analysis:
• After importing the data into SPSS software, select Analyze > Regression > Binary
Logistic.
• Drag the dependent variable (Growth rate) into dependent field and the independent
variables (percent of services, industry and agriculture) into the Covariates box.
• Click Options button, tick the CI for Exp (B), Casewise listing of residuals, Classification
Plots and Hosmer-Lemeshow goodness of fit check boxes.
• Click Continue and OK to see the output.
Assumptions for Binary logistic regression model:
• The dependent variable should be a dichotomous or binary categorical variable.
• Independent variables should be of continuous or categorical type.
• Categories of the dependent variable should be mutually exclusive.
• There should be a linear relationship between the continuous independent variables and the
logit (log-odds) of the dependent variable.
• High intercorrelation (multicollinearity) between the independent variables (predictors)
should not be present.
Analysing the Results or output of the model:
Fig.11 below shows the total number of cases used in this model, whereas Fig.12 shows how the
dichotomous dependent variable has been encoded in SPSS. In this case, an increase in growth
rate is encoded as 1 and a decrease as 0
(i.e. increase = 1 and decrease = 0).

Fig.11: Case processing summary table
Fig.12: Dependent Variable encoding table
Fig.13 below (Block 0) shows the prediction made by SPSS without including the independent
variables, with an overall percentage of correctly classified cases of 51.8%.
Fig.13: Classification table of output
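The Block 0 figure is simply the accuracy obtained by always predicting the more frequent category. A sketch under one assumption: 51.8% of 500 cases corresponds to 259 cases in the larger category, and the sketch supposes that this category is "increase" (coded 1), which the text does not state.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent category."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


if __name__ == "__main__":
    # Assumed split: 259 cases coded 1 (increase) and 241 coded 0 (decrease).
    labels = [1] * 259 + [0] * 241
    print(f"{baseline_accuracy(labels):.1%}")
```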
Fig.14 below (Block 1) shows the results of the logistic regression model with all the
independent variables (predictors) included. The Omnibus Tests of Model Coefficients check
whether this model performs better than the Block 0 model (without predictors). In this case,
the significance values are all below 0.05, so the model with predictors is better than Block 0.
Fig.14: Omnibus tests output
The Hosmer and Lemeshow test is another test of goodness of fit. According to Julie Pallant
(2007, p.174), a poor fit is indicated by a significance value less than 0.05. In this case, from
Fig.15 below, the significance value is 0.803, which indicates that the model has a good fit.
Fig.15: Hosmer and Lemeshow Test output
Fig.16 below gives two values, the Cox & Snell R-square and the Nagelkerke R-square, which
indicate how much of the variation in the dependent variable is explained by the predictors in
this model. The explained variation lies between 62% (Cox & Snell) and 82.7% (Nagelkerke).
Fig.16: Model summary table of output
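Both pseudo R-square values can be reproduced from the model chi-square. The sketch below uses the chi-square of 483.966 with N = 500 reported in the Conclusion, and assumes a 259/241 split of the dependent variable, inferred from the 51.8% Block 0 accuracy rather than read from the output.

```python
from math import exp, log


def null_log_likelihood(n_ones: int, n_zeros: int) -> float:
    """Log-likelihood of the intercept-only (Block 0) model."""
    n = n_ones + n_zeros
    p = n_ones / n
    return n_ones * log(p) + n_zeros * log(1 - p)


def cox_snell_r2(model_chi_square: float, n: int) -> float:
    """Cox & Snell R-square from the model chi-square (likelihood-ratio statistic)."""
    return 1 - exp(-model_chi_square / n)


def nagelkerke_r2(model_chi_square: float, n: int, ll_null: float) -> float:
    """Nagelkerke R-square: Cox & Snell rescaled by its maximum attainable value."""
    max_cox_snell = 1 - exp(2 * ll_null / n)
    return cox_snell_r2(model_chi_square, n) / max_cox_snell


if __name__ == "__main__":
    chi_square, n = 483.966, 500
    ll_null = null_log_likelihood(259, 241)  # assumed class split
    print("Cox & Snell:", round(cox_snell_r2(chi_square, n), 3))
    print("Nagelkerke:", round(nagelkerke_r2(chi_square, n, ll_null), 3))
```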
From Fig.17 below, we can see that the overall prediction of the model improves with the
inclusion of the predictors compared to Block 0 without predictors. The overall percentage of
correctly classified cases increases from 51.8% to 92.2%, with a sensitivity of 91.1% (cases with
an increase in growth rate correctly predicted) and a specificity of 93.4% (cases with a decrease
in growth rate correctly predicted).

Fig.17: Classification table
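Sensitivity, specificity and overall accuracy all follow from the four cells of the classification table. The cell counts below are an assumption: they were chosen to be consistent with the reported percentages for 500 cases and are not copied from the SPSS output.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # increases correctly predicted
        "specificity": tn / (tn + fp),  # decreases correctly predicted
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


if __name__ == "__main__":
    # Assumed counts: 259 observed increases, 241 observed decreases.
    metrics = classification_metrics(tp=236, fn=23, tn=225, fp=16)
    for name, value in metrics.items():
        print(f"{name}: {value:.1%}")
```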
Considering Fig.18 (Variables in the Equation table), we can see that the percent industry and
percent services variables provide statistically significant results with p < 0.05. The B values
give the direction of the relationship with the dependent variable (Growth rate). Since all the B
values are negative, a higher percentage of employees in services, industry or agriculture lowers
the log-odds (and hence the probability) of an increase in growth rate. The regression equation,
expressed in terms of the log-odds of an increase in growth rate, is as follows:
logit(P(Growth rate = increase)) = 16.409 - (0.075 * percent_services)
- (0.453 * percent_industry) - (0.163 * percent_agriculture)
Fig.18: Variables in the Equation
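In a binary logistic model the linear equation gives the log-odds of the category coded 1 (here, an increase in growth rate), and the logistic function converts that to a probability. A minimal sketch with the B values above; the input percentages are hypothetical, and 0.5 is assumed as the classification cut-off.

```python
from math import exp


def log_odds(pct_services: float, pct_industry: float, pct_agriculture: float) -> float:
    """Log-odds of an increase in growth rate, using the B values from the equation."""
    return (16.409
            - 0.075 * pct_services
            - 0.453 * pct_industry
            - 0.163 * pct_agriculture)


def probability_of_increase(pct_services, pct_industry, pct_agriculture) -> float:
    """Predicted probability that growth rate = 1 (increase)."""
    z = log_odds(pct_services, pct_industry, pct_agriculture)
    return 1.0 / (1.0 + exp(-z))


if __name__ == "__main__":
    # Hypothetical country: 50% services, 25% industry, 25% agriculture.
    p = probability_of_increase(50, 25, 25)
    predicted_class = 1 if p >= 0.5 else 0  # assumed cut-off of 0.5
    print(round(p, 4), predicted_class)
```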
Conclusion:
This model contains 3 independent variables (percent of employees in services, industry and
agriculture) and is statistically significant, χ² (3, N = 500) = 483.966, p < 0.05. Hence we can
conclude that the percentages of employees in industry and agriculture contribute most to
predicting whether the growth rate of a country is increasing or decreasing, with an overall
prediction accuracy of 92.2%.
References:
[1] Pallant, J. (2007). SPSS Survival Manual, 3rd Edition.
[2] https://statistics.laerd.com/spss-tutorials/binomial-logistic-regression-using-spss-statistics.php
[3] https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
[4] Tabachnick, B. G. & Fidell, L. S. (2007). Using Multivariate Statistics, 5th Edition.
