Regression project

Statistics for Data Analytics CA2 – Regression Project
Multiple Regression and Logistic Regression
MSc Data Analytics
Group A
x18134599
Mansi Atul Chowkkar

MULTIPLE LINEAR REGRESSION
Dataset and Analysis:
I have taken four datasets from the http://data.un.org site: birth rate, abortion rate, antenatal care and births attended by a skilled doctor.
1. Total female/male birth rate of all countries from 1995 to 2005, from:
http://data.un.org/Data.aspx?d=GenderStat&f=inID%3a9
2. Total abortion rate for all countries from 1995 to 2005, from:
3. Number of births attended by a skilled doctor for all countries from 1995 to 2005:
4. Antenatal care rate for at least one visit, which shows how many women receive care during pregnancy, from 1995 to 2005:
I merged these four datasets into a single dataset for multiple regression using R. Only the data for the year 2003 is considered for the regression.
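The merge was done in R and that code is not shown in the report. As a rough illustration only, the same idea (keep one year, then keep the countries present in all four datasets) could be sketched in Python as follows. The country names, the values and the choice of an inner join are assumptions made for this sketch, not the author's data or method.

```python
# Hypothetical toy records keyed by (country, year); real values come from data.un.org.
birth_rate = {("CountryA", 2003): 95.0, ("CountryB", 2003): 90.0, ("CountryA", 2002): 94.0}
abortion_rate = {("CountryA", 2003): 8.0, ("CountryB", 2003): 12.0}
birth_attended = {("CountryA", 2003): 99.0, ("CountryB", 2003): 70.0}
antenatal_care = {("CountryA", 2003): 97.0}  # CountryB has no 2003 value here


def merge_for_year(tables, year):
    """Inner-join several {(country, year): value} tables on country for one year."""
    countries_per_table = [
        {country for (country, y) in table if y == year} for table in tables.values()
    ]
    # Assumption: only countries present in every table are kept.
    common = set.intersection(*countries_per_table)
    return {
        country: {name: table[(country, year)] for name, table in tables.items()}
        for country in sorted(common)
    }


merged = merge_for_year(
    {
        "BirtRate": birth_rate,
        "AbortionRate": abortion_rate,
        "BirthAttendendedBySpecialist": birth_attended,
        "AntenatalCare": antenatal_care,
    },
    2003,
)
print(merged)
```

With an inner join, a country that lacks a 2003 value in any one dataset drops out of the merged data, which is one plausible reason the final sample is much smaller than the full list of countries.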
For multiple linear regression, birth rate is the dependent variable, and abortion rate, births attended by a specialist and antenatal care are the independent variables.
The objective of the project is to determine whether the female/male birth ratio (in percent) depends on the overall abortion rate, births attended by a skilled doctor and the antenatal care rate.
Understanding Data
The data consists of 4 variables, of which 3 are independent and 1 is dependent.
a) BirthAttendendedBySpecialist is an independent, continuous variable. Its maximum value is 100, meaning that all births are attended by a skilled doctor, and its minimum value is 43.4, meaning that only 43.4% of all births are attended by a skilled doctor.
b) AntenatalCare is an independent, continuous variable. Its maximum value is 100, meaning that all women received proper care before or during their pregnancy, and its minimum value is 33, meaning that only 33% of women received proper care before or during pregnancy.
c) AbortionRate is an independent, continuous variable. Its maximum value is 25.6, meaning that 25.6% of women aborted their child, and its minimum value is 1.2, meaning that only 1.2% of women aborted their child.
d) BirtRate is the dependent, continuous variable. Its maximum value is 100, meaning a female/male birth ratio of 100% in that country, and its minimum value is 76, meaning a female/male birth ratio of 76%.

Objectives and Assumptions on Which the Data Is Analysed:
Assumption 1:
The dependent variable should be measured on a continuous scale:
BirtRate is a continuous variable: it is a ratio expressed in percent that can take any value within its range, and it contains no null or zero values.
Descriptive Statistics
                               Mean     Std. Deviation   N
BirtRate                       93.692   5.4980           53
BirthAttendendedBySpecialist   86.747   14.6630          53
AntenatalCare                  81.43    15.819           53
AbortionRate                   9.687    6.283            53
Assumption 2: Sample size
• The first output table gives the descriptive statistics (mean, standard deviation and N) for the four variables.
• The mean of BirtRate is 93.692, which means that the mean female/male birth ratio is 93.692%.
• The mean abortion rate is 9.69, which means that on average 9.69% of women abort their child; this value should be low.
• BirthAttendendedBySpecialist has a mean of 86.747 and AntenatalCare has a mean of 81.43, both in percent; both means are expected to be high.
• The standard deviation is larger for BirthAttendendedBySpecialist and AntenatalCare, which means that their values are more widely spread around the mean.
• Here N is 53, which is above 30, so any violation of normality or equality of variance that may exist is unlikely to have much effect.
Assumption 3: Data must not show multicollinearity:
• BirthAttendendedBySpecialist, AntenatalCare and AbortionRate are the three independent variables, and all are continuous. Multicollinearity among them is checked with the correlation matrix and with the Tolerance and VIF values in the SPSS output.
Assumption 4: Independence of observations, checked by the Durbin-Watson method:
• The Durbin-Watson value is 1.903, which lies between 1.5 and 2.5, so the residuals are not autocorrelated.
• The observations for antenatal care, births attended by a specialist and abortion rate can therefore be treated as independent of one another.
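For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below shows that calculation together with the 1.5 to 2.5 rule of thumb used above. The residual values are made up for illustration, since the actual residuals are not listed in the report.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def in_acceptable_range(dw, low=1.5, high=2.5):
    """Rule of thumb used in this report: values between 1.5 and 2.5 are acceptable."""
    return low <= dw <= high


# Hypothetical residuals, not the residuals of the fitted model.
toy_residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]
print(durbin_watson(toy_residuals))

# The value reported by SPSS for this model.
print(in_acceptable_range(1.903))
```

The statistic always lies between 0 and 4, and values near 2 indicate no first-order autocorrelation.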

Assumption 5: No significant outliers, high leverage points or highly influential points
• This can be checked with the Normal Probability Plot (P-P) of the regression standardised residuals and the scatterplot that were requested as part of the analysis.
• Both plots are taken from the SPSS output.
• The points are expected to lie reasonably close to a straight line, but in this case they deviate slightly from it. This indicates a slight deviation from normality.
• In the scatterplot of the standardised residuals (the second plot), most of the points are expected to be scattered in the central area, with very few outliers.
• In this case the points are shifted slightly to the right, that is, the majority are scattered on the right side of the rectangle; this means that a regression line drawn through the scattered points would have a negative slope.

SPSS Output Explanation:
Correlations
                                                     BirtRate   BirthAttendendedBySpecialist   AntenatalCare   AbortionRate
Pearson Correlation   BirtRate                       1.000      0.677                          0.729           -0.498
                      BirthAttendendedBySpecialist   0.677      1.000                          0.740           -0.628
                      AntenatalCare                  0.729      0.740                          1.000           -0.584
                      AbortionRate                   -0.498     -0.628                         -0.584          1.000
Sig. (1-tailed)       BirtRate                       .          0.000                          0.000           0.000
                      BirthAttendendedBySpecialist   0.000      .                              0.000           0.000
                      AntenatalCare                  0.000      0.000                          .               0.000
                      AbortionRate                   0.000      0.000                          0.000           .
N                     BirtRate                       53         53                             53              53
                      BirthAttendendedBySpecialist   53         53                             53              53
                      AntenatalCare                  53         53                             53              53
                      AbortionRate                   53         53                             53              53
• AbortionRate, BirthAttendendedBySpecialist and AntenatalCare correlate substantially with BirtRate (-0.498, 0.677 and 0.729 respectively).
• The correlations between the independent variables are not too high. Two of the three pairs have an absolute correlation below 0.7 (-0.628 and -0.584), which means that these variables are suitable for the model. BirthAttendendedBySpecialist and AntenatalCare correlate at 0.740, which is only slightly greater than 0.7, so I am keeping both in the model.

Coefficients(a)
                                 Unstandardized Coefficients   Standardized Coefficients                     Correlations                    Collinearity Statistics
Model                            B        Std. Error           Beta                        t        Sig.     Zero-order   Partial   Part     Tolerance   VIF
1 (Constant)                     70.213   4.846                                            14.489   0.000
  BirthAttendendedBySpecialist   0.111    0.056                0.296                       1.987    0.052    0.677        0.273     0.185    0.394       2.539
  AntenatalCare                  0.173    0.050                0.497                       3.486    0.001    0.729        0.446     0.325    0.429       2.332
  AbortionRate                   -0.020   0.108                -0.023                      -0.183   0.855    -0.498       -0.026    -0.017   0.574       1.742
a. Dependent Variable: BirtRate
• The results are presented in the table labelled Coefficients. Two collinearity values are given: Tolerance and VIF.
• Tolerance indicates how much of the variability of the specified independent variable is not explained by the other independent variables in the model; it is calculated as 1 - R squared for each variable.
• This value is 0.574 for AbortionRate, 0.394 for BirthAttendendedBySpecialist and 0.429 for AntenatalCare. All are well above 0.10, which indicates that the correlation among the independent variables is not high enough to cause concern.
• The VIF (variance inflation factor) is the inverse of the Tolerance value (1 divided by Tolerance). VIF values should not exceed 10.
• In this example the VIF value for each independent variable is less than 3, well below 10, so the multicollinearity assumption is not violated.
• The equation of the regression line can be written as:
y = 70.213 - 0.020(AbortionRate) + 0.111(BirthAttendedBySpecialist) + 0.173(AntenatalCare)
Each B value tells us how much y changes for a one-unit increase in that x variable when the other variables are held constant.
• The Beta value of 0.497 for AntenatalCare is the highest, so this variable makes the strongest contribution to explaining the y variable (BirtRate), followed by BirthAttendendedBySpecialist (0.296) and AbortionRate (-0.023).
• Not all coefficients are significant at the 0.05 level. From the Sig. values, AntenatalCare (0.001) is significant and BirthAttendendedBySpecialist (0.052) is marginally above the 0.05 threshold. The Sig. value of AbortionRate is 0.855, which is not significant, so that variable could be removed from the model.
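The relation VIF = 1 / Tolerance can be checked directly against the Coefficients table. The short sketch below recomputes the VIF values from the reported Tolerance values; small differences in the third decimal place are expected because SPSS works with unrounded tolerances.

```python
# Tolerance values as reported in the Coefficients table.
tolerance = {
    "BirthAttendendedBySpecialist": 0.394,
    "AntenatalCare": 0.429,
    "AbortionRate": 0.574,
}


def vif_from_tolerance(tol):
    """Variance inflation factor is the reciprocal of the tolerance."""
    return 1.0 / tol


vif = {name: round(vif_from_tolerance(tol), 2) for name, tol in tolerance.items()}
print(vif)

# Rule used in the report: no VIF should exceed 10.
print(all(value < 10 for value in vif.values()))
```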

Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .757(a)   .573       .547                3.6995                       1.903
a. Predictors: (Constant), AbortionRate, AntenatalCare, BirthAttendendedBySpecialist
b. Dependent Variable: BirtRate
• How well the regression line fits the data can be judged from the Model Summary.
• The R value in the table is the multiple correlation coefficient, which gives an idea of how well the model predicts the values of the dependent variable. The value of R, 0.757, indicates a good level of prediction.
• In the Model Summary the value under the heading R Square is 0.573. This tells how much of the variance in the dependent variable (BirtRate) is explained by the model (which includes AbortionRate, AntenatalCare and BirthAttendendedBySpecialist).
• Expressed as a percentage, this means that the model explains 57.3 per cent of the variance in the female/male birth ratio.
• The adjusted R square corrects the R square for the number of predictors, so that it increases only when a newly introduced variable actually improves the model. Here the adjusted R square is 0.547, which is very close to the R square.
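The adjusted R square can be reproduced from the R square, the sample size and the number of predictors with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below applies it to the values in the Model Summary (n = 53 cases, p = 3 predictors).

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Adjusted R squared: penalises R squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


adjusted = adjusted_r_squared(0.573, 53, 3)
print(round(adjusted, 3))  # matches the .547 in the Model Summary
```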
ANOVA(a)
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   901.262          3    300.421       21.951   .000(b)
  Residual     670.615          49   13.686
  Total        1571.877         52
• In the ANOVA table, the Sum of Squares column shows that about 901.26 of the total variation of 1571.877 in the response variable is explained by the predictor variables, which also means that about 670.6 of the variation in y is left unexplained by the x variables.
• 3 of the 52 degrees of freedom are used by the model.
• The significance value is reported as .000, that is p < 0.0005. As this is significant, we can reject the null hypothesis that all slopes are zero: at least one slope of the regression is not zero.
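The quantities in the ANOVA table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, F is the ratio of the two mean squares, and R square is the explained share of the total sum of squares. A minimal check using the reported figures:

```python
# Values as reported in the ANOVA table.
ss_regression, df_regression = 901.262, 3
ss_residual, df_residual = 670.615, 49

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_statistic = ms_regression / ms_residual
r_squared = ss_regression / ss_total

print(round(ms_regression, 3), round(ms_residual, 3))
print(round(f_statistic, 3), round(r_squared, 3))
```

The recomputed R square agrees with the .573 in the Model Summary, which confirms that the two tables describe the same fitted model.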

Results
Based on the regression analysis, the following regression equation was obtained:
BirthRate = 70.213 - 0.020(AbortionRate) + 0.111(BirthAttendedBySpecialist) + 0.173(AntenatalCare)
The coefficient of an independent variable in a multiple regression model is the amount by which the dependent variable changes when that independent variable increases by one unit and the others are held constant.
Here we can see that if AbortionRate increases, BirthRate decreases, as expected, and if births attended by a specialist or antenatal care increase, birth rate increases.
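As a quick sanity check of the fitted equation, it can be evaluated at the sample means from the Descriptive Statistics table; a least-squares line passes through the point of means, so the prediction should land close to the mean BirtRate of 93.692. The function name below is introduced only for this sketch, and the small gap from 93.692 comes from the rounded coefficients.

```python
# Coefficients as reported in the Coefficients table (rounded to three decimals).
INTERCEPT = 70.213
B_ABORTION = -0.020
B_BIRTH_ATTENDED = 0.111
B_ANTENATAL = 0.173


def predict_birth_rate(abortion_rate, birth_attended, antenatal_care):
    """Evaluate the fitted regression equation for one country."""
    return (
        INTERCEPT
        + B_ABORTION * abortion_rate
        + B_BIRTH_ATTENDED * birth_attended
        + B_ANTENATAL * antenatal_care
    )


# Sample means from the Descriptive Statistics table.
at_means = predict_birth_rate(9.687, 86.747, 81.43)
print(round(at_means, 2))

# Effect of one extra percentage point of antenatal care, other inputs unchanged.
print(round(predict_birth_rate(9.687, 86.747, 82.43) - at_means, 3))
```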
This multiple linear regression analyses whether the three independent variables in the model (AbortionRate, BirthAttendedBySpecialist, AntenatalCare) are significantly predictive of BirthRate, the dependent variable.
First, the assumptions required for multiple linear regression were examined, and the analysis was then performed on the data that were judged to satisfy them.
AntenatalCare makes the biggest contribution to the model, with the highest standardized coefficient (Beta = .497). A variable with a significance value below .05 is considered statistically significant. AntenatalCare is significant (Sig. = 0.001) and BirthAttendedBySpecialist is marginally above the threshold (Sig. = 0.052), whereas AbortionRate, although inversely related to BirthRate, is not significant (Sig. = 0.855).

LOGISTIC REGRESSION
Problem Analysis
The data used for the logistic regression are the same as those used for the multiple regression, with the same predictor variables. One independent variable, abortion rate, is converted into a dichotomous variable: an abortion rate above 9.6% is coded 1 and a value below 9.6% is coded 0.
The response variable is again BirthRate, but here it is converted into a dichotomous variable: a birth rate above 93.6% is coded 1 and a value below 93.6% is coded 0.
Objective: To evaluate whether a female/male birth ratio above 93.6% depends on abortion rate, antenatal care and births attended by a skilled doctor.
Understanding Data
BirthRate is the dependent variable, and a threshold of 93.6% is used to convert it into dichotomous data.
Of the three independent variables, one, abortion rate, is converted into a dichotomous variable.
Cleaning and conversion of the data into dichotomous form were done using R code and Excel.
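The conversion itself was done in R and Excel and is not shown. A minimal Python sketch of the same thresholding rule is given below; the sample values are invented, and treating a value exactly equal to the threshold as 0 is an assumption, since the report only defines the codes for values above and below the threshold.

```python
BIRTH_THRESHOLD = 93.6     # threshold for BirthRate stated in the report
ABORTION_THRESHOLD = 9.6   # threshold for abortion rate stated in the report


def dichotomise(value, threshold):
    """Return 1 if the value is above the threshold, otherwise 0.

    Assumption: a value equal to the threshold is coded 0.
    """
    return 1 if value > threshold else 0


# Hypothetical values for four countries, for illustration only.
birth_rates = [97.2, 88.4, 93.6, 100.0]
abortion_rates = [3.1, 15.0, 9.7, 1.2]

birth_codes = [dichotomise(v, BIRTH_THRESHOLD) for v in birth_rates]
abortion_codes = [dichotomise(v, ABORTION_THRESHOLD) for v in abortion_rates]
print(birth_codes, abortion_codes)
```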
Assumptions:
Sample Size:
Case Processing Summary
Unweighted Cases(a)                        N    Percent
Selected Cases     Included in Analysis    53   100.0
                   Missing Cases           0    .0
                   Total                   53   100.0
Unselected Cases                           0    .0
Total                                      53   100.0
a. If weight is in effect, see classification table for the total number of cases.
Here the sample size is 53, which is not particularly small.
Multicollinearity:
The binomial logistic regression procedure has no built-in test for multicollinearity, so I checked it using the multiple regression procedure.

Correlations
                                                     Birth   BirthAttendendedBySpecialist   AntenatalCare   Abortion
Pearson Correlation   Birth                          1.000   .585                           .613            -.184
                      BirthAttendendedBySpecialist   .585    1.000                          .740            -.487
                      AntenatalCare                  .613    .740                           1.000           -.498
                      Abortion                       -.184   -.487                          -.498           1.000
Sig. (1-tailed)       Birth                          .       .000                           .000            .094
                      BirthAttendendedBySpecialist   .000    .                              .000            .000
                      AntenatalCare                  .000    .000                           .               .000
                      Abortion                       .094    .000                           .000            .
N                     Birth                          53      53                             53              53
                      BirthAttendendedBySpecialist   53      53                             53              53
                      AntenatalCare                  53      53                             53              53
                      Abortion                       53      53                             53              53
From the values above, the independent variables are not too strongly related to each other: the highest correlation, 0.740 between BirthAttendendedBySpecialist and AntenatalCare, only slightly exceeds 0.7.
Abortion is treated as a categorical variable after being converted into a dichotomous variable.
Dependent Variable Encoding
Original Value   Internal Value
0                0
1                1
Here a birth rate greater than 93.6, the average value of birth rate, is coded 1 and a value below 93.6 is coded 0.

Outliers:
Casewise List(b)
                            Observed                                 Temporary Variable
Case   Selected Status(a)   Birth      Predicted   Predicted Group   Resid   ZResid   SResid
1      S                    0**        .781        1                 -.781   -1.888   -1.818
7      S                    0**        .779        1                 -.779   -1.875   -1.841
12     S                    0**        .694        1                 -.694   -1.504   -1.765
17     S                    1          .579        1                 .421    .852     1.142
28     S                    1**        .072        0                 .928    3.578    2.474
29     S                    1          .521        1                 .479    .958     1.289
33     S                    1          .521        1                 .479    .959     1.256
41     S                    0**        .684        1                 -.684   -1.470   -1.613
42     S                    0          .328        0                 -.328   -.699    -1.023
43     S                    0**        .580        1                 -.580   -1.174   -1.735
49     S                    0**        .502        1                 -.502   -1.005   -1.279
a. S = Selected, U = Unselected cases, and ** = Misclassified cases.
b. Cases with studentized residuals greater than 1.000 are listed.
Cases with studentized residuals greater than 1 are listed as potential outliers.
Only one case with an observed birth rate of 1 (greater than 93.6) is misclassified: case 28.
Dependent and Independent Variables:
Birth rate is a dichotomous dependent variable, and two continuous independent variables plus one dichotomous independent variable are considered for the logistic regression.
None of the three independent variables is so strongly correlated with another that it needs to be excluded.

SPSS Output Prediction:
Categorical Variables Codings
               Frequency   Parameter coding (1)
Abortion   0   26          1.000
           1   27          .000
Both categories have nearly the same number of cases (26 and 27), so neither category is too small.
Block 0: Beginning Block
Classification Table(a,b)
                                Predicted
                                Birth        Percentage
Observed                        0     1      Correct
Step 0   Birth   0              0     10     0.0
                 1              0     43     100.0
         Overall Percentage                  81.1
a. Constant is included in the model.
b. The cut value is .500
This is the beginning block, which does not contain the independent variables. The overall percentage of correctly classified cases is 81.1. We expect this percentage to increase once all independent variables are included in the model.
Variables in the Equation
                    B       S.E.   Wald     df   Sig.   Exp(B)
Step 0   Constant   1.459   .351   17.261   1    .000   4.300
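The Block 0 figures follow directly from the class counts in the Classification Table (10 cases coded 0 and 43 cases coded 1). In a constant-only model, Exp(B) is the odds of the outcome, B is its natural logarithm, and the overall percentage is the share of the larger class. The sketch below recomputes them; the standard error formula used is the usual one for a log-odds estimated from two counts.

```python
import math

# Observed class counts from the Block 0 Classification Table.
n_zero, n_one = 10, 43

baseline_accuracy = 100 * n_one / (n_zero + n_one)   # every case predicted as 1
odds = n_one / n_zero                                # Exp(B) of the constant
b_constant = math.log(odds)                          # B of the constant
standard_error = math.sqrt(1 / n_one + 1 / n_zero)   # S.E. of a log-odds
wald = (b_constant / standard_error) ** 2            # Wald statistic

print(round(baseline_accuracy, 1), round(odds, 3))
print(round(b_constant, 3), round(standard_error, 3), round(wald, 3))
```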

Variables not in the Equation
                                                    Score    df   Sig.
Step 0   Variables   BirthAttendendedBySpecialist   18.114   1    .000
                     AntenatalCare                  19.948   1    .000
                     Abortion(1)                    1.791    1    .181
         Overall Statistics                         23.727   3    .000
Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    23.209       3    .000
         Block   23.209       3    .000
         Model   23.209       3    .000
• The significance value tells us whether the model with the independent variables fits significantly better than the Block 0 model, which contains only the constant.
• In this case the value is .000 (p < .0005), so the model fit is acceptable.
• The chi-square value, which needs to be reported in the results, is 23.209 with 3 degrees of freedom.
Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      9.981        8    .266
The Hosmer and Lemeshow chi-square value is 9.981 with 8 degrees of freedom, and the significance value is 0.266 (it should be greater than 0.05), which indicates support for the model.
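The significance value in the Hosmer and Lemeshow table is the upper-tail probability of a chi-square distribution with 8 degrees of freedom at 9.981. For an even number of degrees of freedom this tail probability has a closed form, which the sketch below uses to reproduce the reported value without any statistics library. The function name is introduced only for this illustration.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_value = chi2_upper_tail_even_df(9.981, 8)
print(round(p_value, 3))  # agrees with the .266 in the table
```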
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      28.127(a)           .355                   .572
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.
• The Cox & Snell R square of 0.355 and the Nagelkerke R square of 0.572 are analogous to the R square measure of linear regression.
• In this example, the two values suggest that between 35.5% and 57.2% of the variability is explained by this set of variables.
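Both pseudo R square values can be recovered from figures already reported: the -2 log likelihood of the fitted model (28.127), the model chi-square from the omnibus test (23.209) and the sample size (53). The sketch below uses the standard definitions, where the -2 log likelihood of the constant-only model is the sum of the first two figures.

```python
import math

n_cases = 53
neg2ll_model = 28.127       # -2 log likelihood from the Model Summary
model_chi_square = 23.209   # from the Omnibus Tests of Model Coefficients

# -2 log likelihood of the constant-only (Block 0) model.
neg2ll_null = neg2ll_model + model_chi_square

cox_snell = 1 - math.exp(-model_chi_square / n_cases)
max_cox_snell = 1 - math.exp(-neg2ll_null / n_cases)   # upper bound of Cox & Snell
nagelkerke = cox_snell / max_cox_snell                 # rescaled to a 0..1 range

print(round(cox_snell, 3), round(nagelkerke, 3))
```

The results agree with the .355 and .572 shown in the Model Summary to within rounding.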
Classification Table(a)
                                Predicted
                                Birth        Percentage
Observed                        0     1      Correct
Step 1   Birth   0              4     6      40.0
                 1              1     42     97.7
         Overall Percentage                  86.8
a. The cut value is .500
• With the independent variables included, the overall percentage of correctly classified cases is 86.8%, an improvement of 5.7 percentage points over the 81.1% of Block 0.
• The model has a sensitivity of 97.7% and a specificity of 40%.
• Cases with BirthRate 0 (below 93.6%) are correctly predicted 40% of the time, and cases with BirthRate 1 (above 93.6%) are correctly predicted 97.7% of the time.
• Conversely, 60% of the cases with BirthRate 0 and 2.3% of the cases with BirthRate 1 are not correctly predicted.
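The percentages above come straight from the four cells of the Step 1 Classification Table. A short sketch of that arithmetic, with BirthRate = 1 treated as the positive class:

```python
# Cells of the Step 1 Classification Table (observed row, predicted column).
true_negative, false_positive = 4, 6    # observed 0: predicted 0, predicted 1
false_negative, true_positive = 1, 42   # observed 1: predicted 0, predicted 1

total = true_negative + false_positive + false_negative + true_positive

sensitivity = 100 * true_positive / (true_positive + false_negative)
specificity = 100 * true_negative / (true_negative + false_positive)
overall_accuracy = 100 * (true_positive + true_negative) / total
improvement = overall_accuracy - 100 * 43 / 53   # versus the Block 0 baseline

print(round(sensitivity, 1), round(specificity, 1))
print(round(overall_accuracy, 1), round(improvement, 1))
```

The low specificity shows that most of the overall accuracy comes from the majority class, so the gain over the baseline is modest.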
Variables in the Equation
                                                                                         95% C.I. for EXP(B)
                                           B         S.E.    Wald    df   Sig.    Exp(B)   Lower   Upper
Step 1(a)   BirthAttendendedBySpecialist   0.083     0.045   3.380   1    0.066   1.087    0.995   1.187
            AntenatalCare                  0.104     0.048   4.680   1    0.031   1.110    1.010   1.220
            Abortion(1)                    -1.779    1.233   2.080   1    0.149   0.169    0.015   1.893
            Constant                       -12.629   4.652   7.368   1    0.007   0.000
a. Variable(s) entered on step 1: BirthAttendendedBySpecialist, AntenatalCare, Abortion.
• This table provides the values for the variables that contribute to the model.
• The test used here is the Wald test; the Wald column gives the test statistic for each predictor.
• The Sig. column gives the significance value of each variable in the model; a variable contributes significantly when this value is less than 0.05.
• AntenatalCare (Sig. = 0.031) is significant and BirthAttendendedBySpecialist (Sig. = 0.066) is marginally above the 0.05 threshold; both are more significant than Abortion (Sig. = 0.149), which is a categorical variable in the model.
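The Exp(B) column is the exponential of B, that is, the odds ratio for a one-unit increase in the predictor, and the 95% confidence limits are the exponentials of B plus or minus 1.96 standard errors. The sketch below recomputes them from the rounded B and S.E. values in the table, so the last decimal can differ slightly from the SPSS output.

```python
import math

# (B, S.E.) pairs as reported in the Variables in the Equation table.
coefficients = {
    "BirthAttendendedBySpecialist": (0.083, 0.045),
    "AntenatalCare": (0.104, 0.048),
    "Abortion(1)": (-1.779, 1.233),
}


def odds_ratio_with_ci(b, se, z=1.96):
    """Return Exp(B) and its approximate 95% confidence limits."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


results = {name: odds_ratio_with_ci(b, se) for name, (b, se) in coefficients.items()}
for name, (ratio, lower, upper) in results.items():
    print(name, round(ratio, 3), round(lower, 3), round(upper, 3))
```

A confidence interval that contains 1, as for BirthAttendendedBySpecialist and Abortion(1), corresponds to a Sig. value above 0.05.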

Results
Based on the logistic regression analysis, we have the following equation:
logit(p) = -12.629 + 0.083(BirthAttendendedBySpecialist) + 0.104(AntenatalCare) - 1.779(Abortion(1))
Abortion can be removed from this model: its significance value of 0.149 is greater than 0.05, which implies that this categorical variable does not contribute strongly to the model. AntenatalCare contributes significantly, and BirthAttendendedBySpecialist is close to the significance threshold.
Abortion rate would be expected to be inversely related to birth rate; in this logistic regression its coefficient is not significant, so it does not contribute reliably to predicting a high birth rate.
The probability is obtained by substituting the values of the independent variables into the logistic regression equation. If the probability is greater than 0.5, BirthRate is predicted as 1, meaning that it is greater than 93.6%; if the probability is less than 0.5, BirthRate is predicted as 0, meaning that it is less than 93.6%.
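The prediction step described above can be sketched as follows: compute the logit from the fitted coefficients, convert it to a probability with the logistic function, and compare it with the 0.5 cut value. The input values are hypothetical, and the third argument is the Abortion(1) indicator as SPSS coded it in the Categorical Variables Codings table, not the raw abortion rate.

```python
import math


def predict_probability(birth_attended, antenatal_care, abortion_indicator):
    """Probability that BirthRate is coded 1, from the reported coefficients.

    abortion_indicator is the Abortion(1) parameter (1.0 or 0.0) as coded by SPSS.
    """
    logit = (
        -12.629
        + 0.083 * birth_attended
        + 0.104 * antenatal_care
        - 1.779 * abortion_indicator
    )
    return 1 / (1 + math.exp(-logit))


def classify(probability, cut_value=0.5):
    """Apply the cut value used in the Classification Table."""
    return 1 if probability > cut_value else 0


# Hypothetical countries, for illustration only.
high = predict_probability(100.0, 100.0, 0.0)
low = predict_probability(50.0, 50.0, 1.0)
print(round(high, 3), classify(high))
print(round(low, 3), classify(low))
```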

