Statistics_Regression_Project

Statistics Report on Multiple Regression and Logistic Regression
National College of Ireland
STATISTICS REPORT
ON
MULTIPLE REGRESSION AND
LOGISTIC REGRESSION
By
Alekhya Bhupati
(x18132634)
MSc in Data Analytics
(MSCDAD_A)

MULTIPLE REGRESSION ANALYSIS
Multiple regression is an extension of simple linear regression in which the value of a dependent
variable is predicted from two or more independent variables.
Assumptions:
• The dependent variable is continuous in nature.
• Two or more independent variables can be either continuous or categorical.
• The dependent variable has a linear relationship with each of the independent variables.
Data source
This analysis uses the Gender Inequality Index (GII) data set. The data source link is as follows:
http://data.un.org/DocumentData.aspx?id=391
Objective
In this analysis we use multiple regression on our data source to
• Study the various factors affecting GII.
• Study the relationships among the factors affecting GII.
Data information
In this project I have considered “Gender Inequality Index (GII)” as the dependent variable and
the following as independent variables:
1) Maternal mortality ratio
2) Adolescent birth rate
3) Share of seats in parliament
4) Population with at least some secondary education
5) Labor force participation rate
Figure 1: Screenshot of data used for Multiple Regression

Software
R is a simple, effective, open-source language that is widely used for data manipulation,
data handling, data visualization, statistical analysis and graphics.
In RStudio, we use the ‘read.csv’ command to load the data as shown below:
mg <- read.csv("Gender_Inequality_Index.csv", header = TRUE, sep = ",")
Data cleaning
The raw data consists of 228 rows and 10 columns. To obtain high-quality data, rows with
insignificant or missing values were eliminated using the R code shown in Figure 2, making the
data suitable for multiple regression. After cleaning, the data set consists of 159 rows
and 7 columns.
Figure 2: R code for data cleaning
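Because Figure 2 is a screenshot, the cleaning step can be sketched roughly as follows. This is only a minimal sketch and not the code in the figure: the toy data frame `raw`, its values, the extra `Rank` column and the use of ".." as a missing-value marker are all assumptions made for illustration.

```r
# Toy stand-in for the raw file (values are made up for illustration).
raw <- data.frame(
  Country = c("A", "B", "C", "D"),
  Gender_Inequality_Index = c("0.12", "..", "0.45", "0.30"),
  Maternal_Mortality_Ratio = c("5", "120", NA, "40"),
  Rank = c(1, 2, 3, 4),   # hypothetical column that the model does not need
  stringsAsFactors = FALSE
)

# Keep only the columns used in the regression.
keep <- c("Country", "Gender_Inequality_Index", "Maternal_Mortality_Ratio")
GII <- raw[, keep]

# Treat the assumed placeholder ".." as missing and convert measures to numeric.
num_cols <- c("Gender_Inequality_Index", "Maternal_Mortality_Ratio")
for (col in num_cols) {
  x <- raw[[col]]
  x[x %in% ".."] <- NA
  GII[[col]] <- as.numeric(x)
}

# Drop every row that still contains a missing value.
GII <- na.omit(GII)
print(dim(GII))
```

In this toy example only the rows with complete information survive, which mirrors the reduction from 228 to 159 rows described above.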
Output of multiple regression data summary
This table shows the summary of the data in terms of the minimum, 1st quartile, median, mean,
3rd quartile and maximum of each factor.
> summary(GII)

Figure 3: summary(GII)
Correlation matrix
Correlation measures the strength and direction of the linear relationship between two variables.
The correlation matrix shows whether each independent variable is positively or negatively
correlated with the dependent variable.
> cor(GII[2:7])
Figure 4: Correlation Matrix
From this correlation matrix, we can say that two variables, ‘Maternal Mortality Ratio’ and
‘Adolescent Birth Rate’, follow a positive trend with GII, and the remaining three variables,
‘Share of seats by Women in Parliament’, ‘Population with at least some secondary education’ and
‘Labor force Participation rate’, follow a negative trend.
Pairwise matrix of scatter plot
Using the command below, we can easily examine the relationship between each pair of variables.
> pairs(GII[2:7])

Figure 5: Pairwise matrix of scatter plot
Figure 5 shows that as ‘Maternal Mortality Ratio’ increases, ‘Adolescent Birth Rate’ also
increases, so the correlation between these two variables is positive. As ‘Maternal Mortality
Ratio’ increases, the ‘Share of seats by Women in Parliament’ percentage decreases, so the
correlation between these two variables is negative. The relationship between every other pair
of variables can be analyzed in the same way.
Linear model
> GII.final <- lm(Gender_Inequality_Index ~ Maternal_Mortality_Ratio +
    Adolescent_Birth_Rate + Share_of_Seats_by_Women_in_Parliment +
    Population_with_at_least_some_secondary_education +
    Labour_force_participation_rate, data = GII)
> GII.final
Figure 6: Coefficients

Formula for the multiple regression model
Y = b0 + b1X1 + b2X2 + ... + bnXn
where Y is the predicted value of the dependent variable, b0 is the intercept, and b1, b2, ..., bn
are the estimated coefficients of the independent variables X1, X2, ..., Xn.
In Figure 6 (Coefficients), we have obtained the unstandardized coefficients in the form of B values.
Each B value is assigned to one independent variable. Substituting these values, the equation that
predicts our dependent variable (GII) from our independent variables is:
Gender_Inequality_Index = 0.1797501
    + 0.0003497 × Maternal_mortality_Ratio
    + 0.0024407 × Adolescent_Birth_Rate
    - 0.0030140 × Share_of_seats_by_women_in_Parliament
    - 0.0015625 × Population_with_at_least_some_secondary_education
    - 0.0033249 × Labor_force_Participation_rate
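To make the use of this equation concrete, the sketch below plugs one set of predictor values into the reported coefficients. The coefficients are taken from the equation above; the country profile in `x` is hypothetical and chosen only for illustration, not drawn from the data set.

```r
# Unstandardized coefficients as reported in the fitted equation.
b <- c(
  intercept = 0.1797501,
  Maternal_Mortality_Ratio = 0.0003497,
  Adolescent_Birth_Rate = 0.0024407,
  Share_of_Seats_by_Women_in_Parliment = -0.0030140,
  Population_with_at_least_some_secondary_education = -0.0015625,
  Labour_force_participation_rate = -0.0033249
)

# Hypothetical country profile (illustrative values, not from the data set).
x <- c(
  Maternal_Mortality_Ratio = 100,
  Adolescent_Birth_Rate = 50,
  Share_of_Seats_by_Women_in_Parliment = 20,
  Population_with_at_least_some_secondary_education = 50,
  Labour_force_participation_rate = 50
)

# Predicted GII = intercept + sum of (coefficient * predictor value).
predicted_gii <- unname(b["intercept"] + sum(b[names(x)] * x))
print(predicted_gii)
```

For this made-up profile the predicted GII is roughly 0.032. With a fitted model object, `predict()` performs the same calculation for every row of the data.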
> summary(GII.final)
Figure 7: summary(GII.final)
From Figure 7, we can determine whether the independent variables used in the model are
statistically significant. ‘Maternal Mortality Ratio’, ‘Adolescent Birth Rate’, ‘Share of
seats by Women in Parliament’ and ‘Labor force Participation rate’ are statistically
significant predictors of GII. However, ‘Population with at least some secondary education’
falls slightly short of statistical significance.
An R² value close to 1 indicates that the model explains a large portion of the variance in the
outcome variable, which suggests that the variables used in our model are relevant. The coefficient
of determination (R Square) of 0.8286 shows that our independent variables explain 82.86% of the
variability in our dependent variable.
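As a reminder of what R Square measures, the sketch below computes it by hand as 1 minus the ratio of residual to total sum of squares and compares it with the value reported by `summary()`. The data here are simulated purely for illustration and have nothing to do with the GII data set.

```r
set.seed(1)                          # reproducible toy data
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)     # made-up relationship

fit <- lm(y ~ x1 + x2)

ss_res <- sum(residuals(fit)^2)      # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)       # total variation around the mean
r_squared <- 1 - ss_res / ss_tot

# The hand-computed value should agree with the one reported by summary().
print(all.equal(r_squared, summary(fit)$r.squared))
```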

We have degrees of freedom of 5 and 153 and an F value of 147.9, which can be written as
F(5, 153) = 147.9. The output also gives a p-value of < 2.2e-16, which is less than 0.05 and
indicates that the regression model as a whole is statistically significant.
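The reported F value can be cross-checked from R Square, the sample size and the number of predictors. The sketch below uses only figures quoted in this report (R Square 0.8286, 159 rows, 5 independent variables); the variable names are chosen here for illustration.

```r
r2 <- 0.8286   # R Square reported above
n  <- 159      # rows after cleaning
k  <- 5        # number of independent variables

df1 <- k
df2 <- n - k - 1                       # residual degrees of freedom

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_value <- (r2 / df1) / ((1 - r2) / df2)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df1, df2, lower.tail = FALSE)

print(c(df1 = df1, df2 = df2))
print(round(f_value, 1))
print(p_value < 2.2e-16)
```

The degrees of freedom come out as 5 and 153 and the F value rounds to 147.9, which matches the output described above up to the rounding of R Square.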
Analysis of variance for individual terms
> library("car")
> Anova(GII.final)
Figure 8: Anova
Simple plot of predicted values with 1-to-1 line
> GII.Predict <- GII
> GII.Predict$Predict_Value <- predict(GII.final)
> plot(Predict_Value ~ Gender_Inequality_Index, data = GII.Predict,
    main = "Predicted vs Actual",
    sub = "Dependent Variable: Gender Inequality Index",
    xlab = "Actual response value", ylab = "Predicted response value")
> abline(0, 1, col = "blue", lwd = 2)
Figure 9: Predicted Response vs Actual Response

Histogram
> hist(residuals(GII.final), col="darkgray")
Figure 10: Histogram
Residual plot
> plot(GII.final,which = 1 )
Figure 11: Residual Plot
Figures 9, 10 and 11 indicate that the residuals are approximately normally distributed and that
the independent variables have a linear relationship with the dependent variable (GII).
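A visual check of the histogram and residual plot can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test to the residuals of a model fitted on simulated data; it is an optional addition that this report did not run, and the data are made up for illustration.

```r
set.seed(42)                             # reproducible toy data
n <- 100
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x + rnorm(n, sd = 1)      # linear relationship with normal noise

fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value would suggest non-normal residuals.
sw <- shapiro.test(res)
print(sw$p.value)

# Residuals of a least-squares fit with an intercept average to zero.
print(mean(res))
```

On the real model the same call would be `shapiro.test(residuals(GII.final))`.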
Conclusion
In this multiple regression analysis, the model predicting the Gender Inequality Index is
statistically significant, F(5, 153) = 147.9, and explains 82.86% of the variance in GII. Four out
of the five independent variables used in our model are statistically significant.

LOGISTIC REGRESSION ANALYSIS
Logistic regression is a statistical method that models the probability of an observation falling
into one of the two categories of a dichotomous dependent variable, based on one or more independent
variables that can be either continuous or categorical.
Assumptions:
1. The dependent variable is dichotomous or binary in nature.
2. There must be one or more independent variables, or predictors, for a logistic regression.
3. There should be some relationship between the dependent variable and each of the independent
variables used for analysis.
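Logistic regression in mathematical terms:
Logit(p_i) = ln(p_i / (1 - p_i)) = b0 + b1X1i + b2X2i + ... + bkXki
for i = 1 to n, where p_i is the probability that observation i falls in the category coded ‘1’.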
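The logit is the natural logarithm of the odds p / (1 - p), and its inverse turns a linear predictor back into a probability. The sketch below defines both; the function names `logit` and `inv_logit` are chosen here for illustration, and base R offers the same pair as `qlogis()` and `plogis()`.

```r
# Log-odds of a probability p (0 < p < 1).
logit <- function(p) log(p / (1 - p))

# Inverse of the logit: maps any real number back to a probability.
inv_logit <- function(z) 1 / (1 + exp(-z))

print(logit(0.5))                          # even odds give a logit of zero
print(inv_logit(logit(0.8)))               # the round trip recovers the probability
print(all.equal(logit(0.3), qlogis(0.3)))  # agrees with the base R function
```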
Data source
http://data.un.org/Data.aspx?d=GenderStat&f=inID%3a101 – Employment-to-Population ratio
Objective
In this analysis, we use binary logistic regression to estimate the probability of a high
employability rate (the dependent variable) from gender, age, year and country.
Data information
Measurement levels of variables:
In our filtered data, we have one dependent variable and four independent variables. In this statistical
analysis, the variables were grouped into the following categories.
Nominal variables: Employability_Rate, Gender
Ordinal variables: Country
Interval variables: Year, Age
Ratio variables: None
Context of Data used:
Dependent variable:
Employability_Rate (percentage of employment-to-population ratio) coded as ‘1’ if the percentage is
higher than 55.13 else ‘0’

Independent variable:
Gender coded as ‘0’ for Female and ‘1’ for Male.
Age coded as ‘0’ for 15+ yr and ‘1’ for 15-24 yr
Country coded as '1' for Australia, '2' for Canada, '3' for China, '4' for Germany, '5' for India,
'6' for Ireland, '7' for New Zealand, '8' for South Africa, '9' for United Kingdom, '10' for United States of
America
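The coding scheme above could be applied in R along the following lines. This is a sketch under assumptions: the data frame `emp`, its column names and its three rows are hypothetical, and only the threshold of 55.13 and the code values come from the description above.

```r
# Hypothetical rows in the shape of the filtered data (values are made up).
emp <- data.frame(
  Country = c("Ireland", "India", "Canada"),
  Gender  = c("Female", "Male", "Male"),
  Age     = c("15+", "15-24", "15+"),
  Value   = c(52.1, 60.4, 55.13),
  stringsAsFactors = FALSE
)

# Countries in the order of their codes 1 to 10.
countries <- c("Australia", "Canada", "China", "Germany", "India",
               "Ireland", "New Zealand", "South Africa",
               "United Kingdom", "United States of America")

# Dependent variable: 1 if the ratio is higher than 55.13, else 0.
emp$Employability_Rate <- ifelse(emp$Value > 55.13, 1, 0)

# Independent variables, coded as described above.
emp$Gender_code  <- ifelse(emp$Gender == "Male", 1, 0)
emp$Age_code     <- ifelse(emp$Age == "15-24", 1, 0)
emp$Country_code <- match(emp$Country, countries)

print(emp)
```

A value exactly equal to 55.13 is coded as 0 here, because the rule requires the percentage to be strictly higher than the threshold.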
Software
SPSS software is used for analyzing the output of logistic regression.
R studio is used to clean the data.
Data cleaning
The raw data consists of 11265 rows and 7 columns. I have selected data for 10 countries over
selected years to keep the analysis clear and meaningful. To obtain high-quality data, rows with
insignificant or missing values were eliminated using R code to make the data suitable for our
analysis; the resulting data is shown in Figure 12. After cleaning, the data set consists of 641
rows and 5 columns.
Figure 12: Screenshot of data used for Logistic Regression
Analysis Method:
To perform this statistical analysis, we run the binary logistic regression in SPSS by
following the steps below:
• Import the cleaned and transformed data into SPSS and set the proper measure for each variable.
• Click Analyze – Regression – Binary Logistic in the menu; a dialog box will appear.
• In the dialog box, set Employability_Rate as the dependent variable and Country,
Year, Gender and Age as the independent variables.
• In the Options menu, set the confidence interval to 95% and make sure the residual statistics,
goodness of fit and classification plots checkboxes are selected.
• Click Continue, verify the details and click OK to run the analysis in SPSS and
interpret the results from the data set.
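For readers without SPSS, a comparable model can be fitted in R with `glm()`. The sketch below is not the procedure used in this report: it runs on simulated data, the data frame `toy` and its effect sizes are invented, and treating Country as a categorical factor is an assumption.

```r
set.seed(123)                                   # reproducible toy data
n <- 200
toy <- data.frame(
  Gender  = rbinom(n, 1, 0.5),                  # 0 = Female, 1 = Male
  Age     = rbinom(n, 1, 0.5),                  # 0 = 15+, 1 = 15-24
  Year    = sample(2000:2010, n, replace = TRUE),
  Country = factor(sample(1:3, n, replace = TRUE))
)

# Simulate a binary outcome from made-up effects of gender and age.
linear_part <- -0.5 + 1.2 * toy$Gender - 1.0 * toy$Age
toy$Employability_Rate <- rbinom(n, 1, plogis(linear_part))

# Binary logistic regression: family = binomial uses the logit link by default.
fit <- glm(Employability_Rate ~ Gender + Age + Year + Country,
           data = toy, family = binomial)

print(summary(fit)$coefficients)
```

The coefficient table printed here plays the same role as the Variables Table produced by SPSS.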

Results:
Several tables are generated as a result of our binary logistic regression model. We’ll go through
each one and interpret our findings.
Figure 13: Model Summary
In Figure 13 (Model Summary), we have obtained two R Square values, 0.281 and 0.377, based on
two different scales. For convenience, we will ignore the Cox & Snell R Square value and consider
the Nagelkerke value. The Nagelkerke R Square value of 0.377 indicates that our independent variables
account for 37.7% of the variability in the dependent variable.
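Both pseudo R Square values can be derived from the null and model deviances of a logistic regression. The sketch below shows the calculation on simulated data; it illustrates the standard formulas rather than reproducing the SPSS output, and the data are made up.

```r
set.seed(7)                                   # reproducible toy data
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + 1.1 * x))     # made-up relationship

fit <- glm(y ~ x, family = binomial)

# For ungrouped binary data the deviance equals -2 * log-likelihood.
d_null  <- fit$null.deviance                  # intercept-only model
d_model <- fit$deviance                       # fitted model

# Cox & Snell: 1 - (L_null / L_model)^(2 / n)
cox_snell <- 1 - exp((d_model - d_null) / n)

# Nagelkerke rescales Cox & Snell by its maximum attainable value.
nagelkerke <- cox_snell / (1 - exp(-d_null / n))

print(c(cox_snell = cox_snell, nagelkerke = nagelkerke))
```

Because of the rescaling, the Nagelkerke value is always at least as large as the Cox & Snell value, which is consistent with the pair 0.281 and 0.377 reported above.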
Figure 14: Classification Table
The overall percentage of correct classifications for the binary logistic regression is 76.3 percent.
This indicates that the model classifies most cases correctly and can be used to predict the
employability rate from the independent variables. The cut value in the table is the probability
threshold used for classification. If the predicted probability is less than the cut value, the case
is assigned to the first group; otherwise, it is assigned to the second group.
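The way a classification table and its overall percentage are built from predicted probabilities can be sketched as follows. The data are simulated and the cut value of 0.5 is an assumption (a common default), so the resulting percentage is unrelated to the 76.3 percent reported above.

```r
set.seed(11)                                   # reproducible toy data
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.2 + 1.5 * x))       # made-up relationship

fit <- glm(y ~ x, family = binomial)

cut_value <- 0.5                               # assumed cut value
p_hat <- fitted(fit)                           # predicted probability of group 1

# Below the cut value -> first group (0); otherwise -> second group (1).
predicted <- ifelse(p_hat >= cut_value, 1, 0)

class_table <- table(
  Observed  = factor(y, levels = c(0, 1)),
  Predicted = factor(predicted, levels = c(0, 1))
)

# Overall percentage: share of cases on the diagonal of the table.
overall_pct <- 100 * sum(diag(class_table)) / n

print(class_table)
print(overall_pct)
```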
Figure 15: Variables Table

Figure 15 (Variables Table) shows how the predicted odds of the event change when one independent
variable changes by 1 unit while the others are kept unchanged. The table also reports the Wald test,
which is used to assess the significance of each predictor variable. In this table we look for
significance values less than .05. The results of the test show that all the variables are
statistically significant.
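The columns of such a variables table can be reproduced from a fitted model: the Wald statistic is the squared ratio of a coefficient to its standard error, and the exponentiated coefficient gives the odds ratio for a 1 unit change. The sketch below uses simulated data with a single made-up predictor, so its numbers are illustrative only.

```r
set.seed(21)                                   # reproducible toy data
n <- 400
gender <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, plogis(-0.4 + 0.9 * gender)) # made-up effect of gender

fit <- glm(y ~ gender, family = binomial)

coefs <- summary(fit)$coefficients
B  <- coefs[, "Estimate"]
SE <- coefs[, "Std. Error"]

# Wald statistic: (B / SE)^2, compared with a chi-square distribution on 1 df.
wald <- (B / SE)^2
sig  <- pchisq(wald, df = 1, lower.tail = FALSE)

# Exp(B): multiplicative change in the odds for a 1 unit increase.
exp_B <- exp(B)

print(data.frame(B, SE, Wald = wald, Sig = sig, Exp_B = exp_B))
```

A predictor is treated as significant when its Sig value is below .05, which is the criterion applied to Figure 15.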
Conclusion:
On executing the binary logistic regression analysis to predict the employability rate for selected
countries over the years, we observed that all four independent variables are statistically significant.
Our model has a Nagelkerke R Square value of 0.377 (37.7%) and classifies about 76.3% of cases
correctly.
