Statistics for Data Analytics

CA-2
Multiple Regression and Logistic Regression Analysis
MSc Data Analytics
JAN 2019 - B
x18134301
Tushar Shailesh Dalvi

MULTIPLE REGRESSION
Objective of the study: Using the statistical software SPSS, analyse the data with multiple regression
to relate one dependent variable to two independent variables.
Data Source:
All data sets are taken from the Global Health Observatory data repository. Details are listed below.
1. The 1st data set is taken from http://apps.who.int/gho/data/view.main.182?lang=en and contains the
child mortality rate by country for the year 2010.
2. The 2nd data set is taken from http://apps.who.int/gho/data/view.main.ghe2002015-CH6?lang=en
and contains Measles data by country for the year 2010.
3. The 3rd data set is taken from http://apps.who.int/gho/data/view.main.ghe2002015-CH2?lang=en
and contains HIV/AIDS data by country for the year 2010.
Description of Dataset:
• Data is collected from the Global Health Observatory data repository, which contains the child
mortality rate of each country for the year 2010 for both sexes. Child mortality rate is the dependent
variable, i.e. the variable that the software predicts.
• The 2nd and 3rd data sets are collected from the same Global Health Observatory data repository and
contain the death rate for both sexes in each country from Measles and HIV/AIDS for the
year 2010.
• Measles and HIV/AIDS are treated as independent variables, which are entered as predictor variables
in SPSS.
Assumption:
• There is no multicollinearity in data.
• The values of the residuals are independent.
• The values of the residuals are normally distributed.
Description of Analysis:
The aim is to predict child mortality from 2 different independent variables. In this data set the
dependent variable is continuous, hence I used a multiple linear regression model. The data set was
obtained to run a multiple linear regression that predicts child mortality from Measles and HIV/AIDS.
• Beta values (standardized coefficients): A Beta value indicates how many standard deviations the
dependent variable changes for a one standard deviation change in the independent variable.
• Durbin-Watson: The Durbin-Watson value should be close to 2, which indicates that the residuals
are independent of each other.
• R & R square: Both values appear in the Model Summary. R is the multiple correlation between the
observed and the predicted values of the dependent variable. R square tells how much of the variance
in the dependent variable is explained by the model (Pallant, 2016).
• F-Test: F is the ratio of the Model Mean Square to the Error Mean Square. It also shows us
whether the multiple regression model is better at predicting values than simply using the mean.
• Collinearity Statistics: The collinearity test is used to check whether the predictor variables are
closely related to each other or not. It is judged by two results: Tolerance, which should be
close to 1, and VIF, which should not be greater than 10.
• Sig.: Sig. tells whether a variable is making a statistically significant unique contribution to
the equation. If the Sig. value is less than .05 (.01, .0001, etc.), the variable is making a significant
unique contribution to the prediction of the dependent variable (Pallant, 2016).
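To make the Durbin-Watson rule of thumb above concrete, here is a minimal Python sketch of how the statistic is usually defined (sum of squared differences between successive residuals divided by the sum of squared residuals). The function name and both residual lists are hypothetical and only for illustration. They are not taken from the SPSS output of this report.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, for illustration only (not the report's data).
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips every step
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # long runs of the same sign

print(durbin_watson(alternating))  # well above 2: negative autocorrelation
print(durbin_watson(trending))     # well below 2: positive autocorrelation
```

Values far from 2 in either direction point to dependent residuals, which is why a value near 2 is wanted.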
Descriptive Statistics
            Mean     Std. Deviation   N
Both sex    30.124   30.2154          194
Measles     .548     2.6052           194
HIV/AIDS    .583     1.4477           194
Table 1.1
Descriptive Statistics is the initial output, which reports the mean, the standard deviation and the
number of cases (N) for each variable used in the analysis. In Table 1.1 the values of both Measles
and HIV/AIDS are measured in percentages.
Correlations
                                 Both sex   Measles   HIV/AIDS
Pearson Correlation   Both sex   1.000      .431      .526
                      Measles    .431       1.000     .137
                      HIV/AIDS   .526       .137      1.000
Sig. (1-tailed)       Both sex   .          .000      .000
                      Measles    .000       .         .029
                      HIV/AIDS   .000       .029      .
N                     Both sex   194        194       194
                      Measles    194        194       194
                      HIV/AIDS   194        194       194
Table 1.2
• The correlation table shows how the dependent variable is correlated with the independent variables.
Correlation values always lie between -1 and 1.
• In Table 1.2 above, the dependent variable Both sex has a positive correlation with Measles and
HIV/AIDS, with values 0.431 and 0.526 respectively.
• My independent variables have a positive correlation with each other, with value 0.137. Sig.
(1-tailed) is the significance level of each correlation. It shows .000 for the dependent variable with
each independent variable, and .029 for the two independent variables with each other.
• The last parameter, N, is the number of records. It is the same for all variables, with a value of 194.
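The correlation between the two predictors can also be used to cross-check the collinearity statistics. A minimal Python sketch, assuming exactly two predictors (in that case regressing one predictor on the other gives an R square equal to the squared correlation between them):

```python
# Pearson correlation between Measles and HIV/AIDS, read from Table 1.2.
r_predictors = 0.137

# With exactly two predictors, the R^2 of one predictor regressed on the
# other equals r^2, so Tolerance = 1 - r^2 and VIF = 1 / Tolerance.
tolerance = 1 - r_predictors ** 2
vif = 1 / tolerance

print(round(tolerance, 3), round(vif, 3))
```

Rounded to three decimals this gives a Tolerance of .981 and a VIF of 1.019, the same values SPSS reports in Table 1.6.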

Variables Entered/Removed(a)
Model   Variables Entered      Variables Removed   Method
1       HIV/AIDS, Measles(b)   .                   Enter
a. Dependent Variable: Both sex
b. All requested variables entered.
Table 1.3
Table 1.3 above shows which method was used for the regression in SPSS, namely the Enter method.
Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .639(a)   .408       .402                23.3603                      1.946

Change Statistics
Model   R Square Change   F Change   df1   df2   Sig. F Change
1       .408              65.947     2     191   .000
a. Predictors: (Constant), HIV/AIDS, Measles
b. Dependent Variable: Both sex
Table 1.4
Model Summary tells us how well the regression line fits the data.
• In Model Summary, the R value tells us how well our model predicts the values of the dependent
variable. In Table 1.4 above the R value is 0.639, which shows that the prediction level of our model is good.
• The value of R Square explains how close the data are to the fitted regression line. R Square
always lies between 0 and 100%. In Table 1.4 the R Square value is 0.408, i.e. 40.8%, which means
the model explains 40.8% of the variance in child mortality.
• The Durbin-Watson value is 1.946, which is close to 2, so the assumption that the residuals are
independent is met.
This indicates that child mortality in the sample is associated with Measles and HIV/AIDS.
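The link between R, R Square and Adjusted R Square in Table 1.4 can be checked by hand. A minimal Python sketch, assuming the usual adjustment formula 1 - (1 - R²)(n - 1)/(n - k - 1) with n cases and k predictors. The variable names are only illustrative.

```python
import math

r_square = 0.408    # R Square from Table 1.4
n_cases = 194       # N from Table 1.1
n_predictors = 2    # Measles and HIV/AIDS

# R is the square root of R Square.
multiple_r = math.sqrt(r_square)

# Adjusted R Square penalises R Square for the number of predictors.
adjusted_r_square = 1 - (1 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

print(round(multiple_r, 3), round(adjusted_r_square, 3))
```

Both results agree with Table 1.4 to three decimals (.639 and .402). With 194 cases and only 2 predictors the adjustment is small.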
ANOVA(a)
Model          Sum of Squares   df    Mean Square   F        Sig.
1 Regression   71974.599        2     35987.299     65.947   .000(b)
  Residual     104229.117       191   545.702
  Total        176203.716       193
b. Predictors: (Constant), HIV/AIDS, Measles
Table 1.5

• In ANOVA, we can see that the Sig. value is .000. It should be less than 0.05, hence the model is
significant.
• The df values tell us that 2 out of 193 degrees of freedom are used by our model.
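The F value in Table 1.5 follows directly from the sums of squares and degrees of freedom. A minimal Python sketch of that arithmetic, using only the numbers printed in the ANOVA table:

```python
# Sums of squares and degrees of freedom, read from Table 1.5.
ss_regression, df_regression = 71974.599, 2
ss_residual, df_residual = 104229.117, 191

# Mean Square = Sum of Squares / df
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the model mean square to the error mean square.
f_ratio = ms_regression / ms_residual

# R Square is the share of the total sum of squares explained by the model.
r_square = ss_regression / (ss_regression + ss_residual)

print(round(f_ratio, 3), round(r_square, 3))
```

The ratio reproduces F = 65.947, and the explained share reproduces the R Square of .408 from Table 1.4.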
Coefficients(a)
               Unstandardized Coefficients   Standardized Coefficients                    95.0% Confidence Interval for B
Model          B        Std. Error           Beta                        t        Sig.    Lower Bound   Upper Bound
1 (Constant)   22.004   1.828                                            12.039   .000    18.399        25.610
  Measles      4.249    .652                 .366                        6.521    .000    2.964         5.534
  HIV/AIDS     9.935    1.173                .476                        8.473    .000    7.622         12.247

               Correlations                     Collinearity Statistics
Model          Zero-order   Partial   Part      Tolerance   VIF
1 Measles      .431         .427      .363      .981        1.019
  HIV/AIDS     .526         .523      .472      .981        1.019
Table 1.6
Table 1.6 above gives the constant and the slopes, which are used to build the regression line equation.
• Under Standardized Coefficients, the Beta value for HIV/AIDS is 0.476 and for Measles 0.366, which
means HIV/AIDS is the strongest contributor to the Y variable.
• In Table 1.6, the VIF under Collinearity Statistics is 1.019, which is below 10.
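The regression line equation built from Table 1.6 is Child mortality = 22.004 + 4.249 × Measles + 9.935 × HIV/AIDS. A minimal Python sketch of that equation follows. It also shows how the Beta values relate to the B values through the standard deviations in Table 1.1. The function name is hypothetical and only for illustration.

```python
# Unstandardized coefficients from Table 1.6.
B0, B_MEASLES, B_HIV = 22.004, 4.249, 9.935


def predict_child_mortality(measles, hiv_aids):
    """Regression line: constant plus slope times each predictor value."""
    return B0 + B_MEASLES * measles + B_HIV * hiv_aids


# At the predictor means (Table 1.1) the line should return the mean of Y.
at_means = predict_child_mortality(0.548, 0.583)

# Beta = B * (std. deviation of predictor) / (std. deviation of Y), Table 1.1.
SD_Y, SD_MEASLES, SD_HIV = 30.2154, 2.6052, 1.4477
beta_measles = B_MEASLES * SD_MEASLES / SD_Y
beta_hiv = B_HIV * SD_HIV / SD_Y

print(round(at_means, 2), round(beta_measles, 3), round(beta_hiv, 3))
```

The prediction at the means comes out at about 30.12, matching the mean of Both sex, and the two Betas round to .366 and .476 as in Table 1.6.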

Fig 1.1
In the histogram in Fig 1.1 we can see that the values are slightly left skewed, which means they
do not match a normal distribution exactly. There are also many low values and few high
values.
Fig 1.2
• Fig 1.2 illustrates that the residuals are roughly normal, with a little deviation.
• The relationship between the sample percentiles and the theoretical percentiles is not perfectly
linear, so the condition that the error terms are normally distributed is not fully met.

Fig 1.3
• Due to outliers, the scatter plot shows that the majority of the values are at the left side of the
plot.
• In the scatter plot the points overlap each other. From Fig 1.3, it appears that the residuals are
spread roughly evenly around zero across the standardized predicted values.
• We can conclude that there is no systematic pattern left in the residuals, because they seem to be
randomly scattered and overlapping each other around zero.
Conclusion:
Multiple regression shows that the independent and dependent variables are associated with
each other. Even small changes in the independent variables are accompanied by a change in the
dependent variable.

LOGISTIC REGRESSION
Objective of the study: Using the statistical software SPSS, analyse the data with logistic regression
to relate a dichotomous dependent variable to the independent variables.
Data Source:
The data set is taken from the UN Data repository. Details are listed below.
• Data is taken from http://data.un.org/DocumentData.aspx?id=320, which contains data on currently
married men and women from different countries for the year 2010.
Description of Dataset:
• Data is collected from the UN Data repository, which contains data on currently married men and
women of each country for the year 2010 for both sexes.
• The response variable in the data contains Male and Female. It is converted into dichotomous form,
which means 0 is used for Male and 1 for Female.
• The independent variables are the age groups 25-29 and 30-34. Their values are the marriage
rates for the different countries for 2010.
Analysis Description:
Logistic regression analysis was carried out to assess the significance of the independent
variables for predicting the dichotomous dependent variable. The conclusion is drawn at a 95%
confidence level, at which each result is either significant or not, indicating whether our approach
is supported or not.
Assumptions:
• The dependent variable must show a relationship with the predictor variables, and the model must
show improvement on re-estimation.
• If the current model does not show any improvement, a new model will be introduced to replace
it.
• A linear relationship is assumed between the logit of the dependent variable and the predictor
variables.
Case Processing Summary
Unweighted Cases(a)                       N    Percent
Selected Cases     Included in Analysis   69   100.0
                   Missing Cases          0    .0
                   Total                  69   100.0
Unselected Cases                          0    .0
Total                                     69   100.0
a. If weight is in effect, see classification table for the total number of cases.
Table 2.1
Case Processing Summary shows that all cases are used in the analysis.

Block 0: Beginning Block
Classification Table(a,b)
                            Predicted
                            Sex           Percentage
Observed                    0      1      Correct
Step 0   Sex          0     35     0      100.0
                      1     34     0      .0
         Overall Percentage               50.7
a. Constant is included in the model.
b. The cut value is .500
Table 2.2
Classification Table 2.2 above shows how well the model classifies the cases before any predictor is
entered. With only the constant in the model, SPSS predicts every case as 0, so 50.7% of the cases are
classified correctly and the remaining 49.3% are misclassified.
Variables in the Equation
                    B       S.E.   Wald   df   Sig.   Exp(B)
Step 0   Constant   -.029   .241   .014   1    .904   .971
Table 2.3
In Table 2.3 the degrees of freedom is 1 because only the constant is in the model. Exp(B) is 0.971,
which can be calculated from the Classification Table as the odds of 1 against 0, i.e. 34/35 = .971.
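The Block 0 constant can be reproduced from the case counts alone, because with no predictors B is the natural log of the odds. A minimal Python sketch using the counts in Table 2.2:

```python
import math

n_zero, n_one = 35, 34   # observed cases coded 0 and 1 (Table 2.2)

# Odds of an observed 1 against an observed 0.
odds = n_one / n_zero

# In a constant-only logistic model, B = ln(odds) and Exp(B) = odds.
b_constant = math.log(odds)

print(round(odds, 3), round(b_constant, 3))
```

This gives Exp(B) = .971 and B = -.029, the values shown in Table 2.3.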
Block 1: Method = Enter
Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    12.517       2    .002
         Block   12.517       2    .002
         Model   12.517       2    .002
Table 2.4
In Table 2.4, the Omnibus Tests of Model Coefficients show a p value (Sig.) of 0.002, which satisfies the
condition that p should be less than 0.05. The omnibus Chi-square value is 12.517 with 2 degrees of
freedom. The model with the predictors therefore performs better, and we can say that Block 1 is better than Block 0.

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      83.123(a)           .166                   .221
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.
Table 2.5
In Table 2.5, Cox & Snell R Square and Nagelkerke R Square are the two pseudo R-square values (Pallant, 2016).
Their values are 0.166 and 0.221 respectively. They indicate the amount of variation in the dependent
variable that is explained by the predictor variables together, i.e. between 16.6% and 22.1%, and show
that the dependent variable is influenced by the independent variables.
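Both pseudo R-square values can be recomputed from Tables 2.4 and 2.5. A minimal Python sketch, assuming the standard definitions (Cox & Snell as 1 - exp(-chi-square/n), and Nagelkerke as Cox & Snell divided by its maximum possible value). The variable names are only illustrative.

```python
import math

n_cases = 69                # Table 2.1
model_chi_square = 12.517   # Table 2.4
neg2ll_model = 83.123       # -2 Log likelihood, Table 2.5

# -2 Log likelihood of the constant-only model (Block 0).
neg2ll_null = neg2ll_model + model_chi_square

# Cox & Snell R Square.
cox_snell = 1 - math.exp(-model_chi_square / n_cases)

# Nagelkerke rescales Cox & Snell so that its maximum is 1.
max_cox_snell = 1 - math.exp(-neg2ll_null / n_cases)
nagelkerke = cox_snell / max_cox_snell

print(round(cox_snell, 3), round(nagelkerke, 3))
```

The results round to .166 and .221, matching Table 2.5.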
Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 9.900 8 .272
Table 2.6
Table 2.6 shows that the Chi-square and significance values are 9.900 and 0.272 respectively. According to
the Hosmer and Lemeshow Test, for a model to fit properly the significance value should be greater than
0.05. Since our significance value is 0.272, our model fits the data properly (Pallant, 2016).
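The Sig. values of the Chi-square tests can be checked without SPSS. The sketch below uses a closed-form expression for the upper tail of the Chi-square distribution that only holds for even degrees of freedom, which happens to cover both tests here (df = 8 and df = 2). The function name is hypothetical. For odd df a statistics library would be needed instead.

```python
import math


def chi_square_p_value(statistic, df):
    """Upper-tail probability of a Chi-square statistic (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = statistic / 2
    term, total = 1.0, 0.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1
    for k in range(df // 2):
        total += term
        term *= half / (k + 1)
    return math.exp(-half) * total


p_hosmer = chi_square_p_value(9.900, 8)    # Hosmer and Lemeshow, Table 2.6
p_omnibus = chi_square_p_value(12.517, 2)  # Omnibus test, Table 2.4

print(round(p_hosmer, 3), round(p_omnibus, 3))
```

The two p values round to .272 and .002, in line with the Sig. columns of Tables 2.6 and 2.4.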
Classification Table(a)
                            Predicted
                            Sex           Percentage
Observed                    0      1      Correct
Step 1   Sex          0     22     13     62.9
                      1     12     22     64.7
         Overall Percentage               63.8
a. The cut value is .500
Table 2.7
In the Classification Table above (Table 2.7) the predictor variables are included in the model, hence the
values differ from those of Block 0. In this table 62.9% of the cases observed as 0 and 64.7% of the
cases observed as 1 are classified correctly. Overall 63.8% of the cases are classified correctly and
the remaining 36.2% are misclassified.
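The percentages in both classification tables come straight from the case counts. A minimal Python sketch, with a hypothetical helper function that takes the counts as rows of observed values and columns of predicted values:

```python
def classification_rates(table):
    """Return (percent correct per observed class, overall percent correct).

    table[observed][predicted] holds the case counts of a 2x2 table.
    """
    total = sum(sum(row) for row in table)
    correct = table[0][0] + table[1][1]
    per_class = [100 * table[i][i] / sum(table[i]) for i in range(2)]
    return per_class, 100 * correct / total


block0 = [[35, 0], [34, 0]]    # counts from Table 2.2
block1 = [[22, 13], [12, 22]]  # counts from Table 2.7

for name, table in (("Block 0", block0), ("Block 1", block1)):
    per_class, overall = classification_rates(table)
    print(name, [round(p, 1) for p in per_class], round(overall, 1))
```

For Block 1 this gives 62.9% and 64.7% per class and 63.8% overall, compared with 50.7% overall in Block 0.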

Variables in the Equation
                                                                     95% C.I. for EXP(B)
                       B        S.E.    Wald    df   Sig.   Exp(B)   Lower    Upper
Step 1(a)   25-29      -.119    .040    8.574   1    .003   .888     .821     .962
            30-34      .126     .050    6.355   1    .012   1.135    1.029    1.252
            Constant   -2.509   1.534   2.675   1    .102   .081
a. Variable(s) entered on step 1: 25-29, 30-34.
Table 2.8
• Table 2.8 above, Variables in the Equation, provides information about the contribution of each of
our predictor variables. The test statistic for each predictor is given in the Wald column. A Sig. value
of less than 0.05 means that the variable contributes significantly to the predictive ability of the
model. In our case the age groups 25-29 and 30-34 have Sig. values of 0.003 and 0.012.
• Variables in the Equation will also tell you about the direction of the relationship (which factors
increase the likelihood of a yes answer and which factors decrease it). If you have coded all your
dependent and independent categorical variables correctly (with 0=no, or lack of the
characteristic; 1=yes, or the presence of the characteristic), negative B values indicate that an
increase in the independent variable score will result in a decreased probability of the case
recording a score of 1 in the dependent variable (Pallant, 2016).
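The direction and size of each effect can be read from Exp(B). A minimal Python sketch that turns the B and S.E. values of Table 2.8 into odds ratios with approximate 95% confidence intervals, and into a predicted probability of the case being coded 1. The function names and the example marriage rates (50 and 60) are hypothetical. Because the inputs are rounded to three decimals, the results can differ from the SPSS output in the third decimal.

```python
import math

# (B, S.E.) per predictor and the constant, read from Table 2.8.
COEFFICIENTS = {"25-29": (-0.119, 0.040), "30-34": (0.126, 0.050)}
CONSTANT = -2.509


def odds_ratio_with_ci(b, se, z=1.96):
    """Exp(B) with an approximate 95% interval exp(B -/+ z * S.E.)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


def predicted_probability(rate_25_29, rate_30_34):
    """Probability of the dependent variable being 1 (Female)."""
    logit = (
        CONSTANT
        + COEFFICIENTS["25-29"][0] * rate_25_29
        + COEFFICIENTS["30-34"][0] * rate_30_34
    )
    return 1 / (1 + math.exp(-logit))


for name, (b, se) in COEFFICIENTS.items():
    print(name, [round(v, 3) for v in odds_ratio_with_ci(b, se)])

# Hypothetical marriage rates, for illustration only.
print(round(predicted_probability(50.0, 60.0), 3))
```

An Exp(B) below 1 (age group 25-29) lowers the odds of the case being coded 1 as the rate rises, while an Exp(B) above 1 (age group 30-34) raises them.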
Conclusion:
Our conclusion from all the tests is that the independent and dependent variables are related to each
other, so a data model that covers all variables can be built. The age group 25-29 has
a lower marriage rate than the age group 30-34. There is a fair amount of change in the
dependent variable due to the influence of the independent variables. The results show how closely the
dependent and independent variables are related to each other: an increase or decrease in an
independent variable causes a change in the dependent variable.
