The document provides an overview of multiple regression and logistic regression analyses conducted on gender inequality data. For multiple regression, five factors were examined as predictors of the gender inequality index. The analysis found the factors of maternal mortality ratio, adolescent birth rate, and labor force participation rate to be statistically significant predictors. For logistic regression, employment rate was predicted based on gender, age, country, and year, with the full model accounting for 37.7% of variability in employment rate.
Statistics_Regression_Project
Statistics Report on Multiple Regression and Logistic Regression
National College of Ireland
STATISTICS REPORT
ON
MULTIPLE REGRESSION AND
LOGISTIC REGRESSION
By
Alekhya Bhupati
(x18132634)
MSc in Data Analytics
(MSCDAD_A)
MULTIPLE REGRESSION ANALYSIS
Multiple regression is an extension of simple linear regression in which the value of a
dependent variable is predicted from two or more independent variables.
Assumptions:
• The dependent variable is continuous in nature.
• There are two or more independent variables, which can be either continuous or categorical.
• The dependent variable has a linear relationship with each of the independent variables.
Data source
This analysis has been done on Gender Inequality Index (GII). Data source link is as follows:
http://data.un.org/DocumentData.aspx?id=391
Objective
In this analysis we use multiple regression on our data source to
• Study the various factors affecting GII.
• Study the relationships among the factors affecting GII.
Data information
In this project I have considered “Gender Inequality Index (GII)” as the dependent variable and
the following as independent variables:
1) Maternal mortality ratio
2) Adolescent birth rate
3) Share of seats in parliament
4) Population with at least some secondary education
5) Labor force participation rate
Figure 1: Screenshot of data used for Multiple Regression
Software
R is a simple, effective, open-source language that is widely used for data manipulation,
data handling, data visualization, statistical analysis and graphics.
In RStudio, we use the ‘read.csv’ command to load the data as shown below:
mg <- read.csv("Gender_Inequality_Index.csv", header = TRUE, sep = ",")
Data cleaning
The raw data consists of 228 rows and 10 columns. To obtain high-quality data suitable for
multiple regression, rows with insignificant or missing values were removed using the R code
shown in Figure 2. After cleaning, the data set consists of 159 rows and 7 columns.
Figure 2: R code for data cleaning
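The cleaning code itself is only available as a screenshot, so the following is a minimal sketch of the kind of cleaning described above, not the author's actual code. The toy data frame, its values and the column `Notes` are hypothetical stand-ins.

```r
# Toy stand-in for the raw GII table (hypothetical values and columns).
raw <- data.frame(
  Country = c("A", "B", "C", "D"),
  Gender_Inequality_Index = c(0.45, NA, 0.12, 0.60),
  Maternal_Mortality_Ratio = c(120, 80, 5, NA),
  Notes = c("x", "y", "z", "w"),  # a column not needed for the model
  stringsAsFactors = FALSE
)

# Keep only the columns needed for the analysis.
keep <- c("Country", "Gender_Inequality_Index", "Maternal_Mortality_Ratio")
subset_data <- raw[, keep]

# Drop every row that has a missing value in any remaining column.
clean <- na.omit(subset_data)

dim(clean)  # rows and columns left after cleaning
```

Here `na.omit` removes whole rows, which matches the row-wise elimination described in the report; imputing missing values would be a different design choice.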
Output of multiple regression data summary
This table shows the summary of the data in terms of the minimum, 1st quartile, median, mean,
3rd quartile and maximum of each factor.
> summary(GII)
Figure 3: summary(GII)
Correlation matrix
Correlation measures the strength of the relationship between two variables and indicates whether
each independent variable is positively or negatively correlated with the dependent variable.
> cor(GII[2:7])
Figure 4: Correlation Matrix
From this correlation matrix, we can say that two variables, ‘Maternal Mortality Ratio’ and
‘Adolescent Birth Rate’, follow a positive trend with GII, while the remaining three variables,
‘Share of seats by Women in Parliament’, ‘Population with at least some secondary education’ and
‘Labor force Participation rate’, follow a negative trend.
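As a small illustration of reading the sign of each correlation from a matrix, the sketch below uses simulated data. The column names `outcome`, `rises_with_outcome` and `falls_with_outcome` are hypothetical and do not come from the GII data set.

```r
set.seed(1)
n <- 50
x <- rnorm(n)

# Simulated variables: one moves with the outcome, one moves against it.
toy <- data.frame(
  outcome = x + rnorm(n, sd = 0.3),
  rises_with_outcome = x + rnorm(n, sd = 0.3),
  falls_with_outcome = -x + rnorm(n, sd = 0.3)
)

corr <- cor(toy)

# The "outcome" column holds each variable's correlation with the outcome;
# sign() reduces it to +1 (positive trend) or -1 (negative trend).
sign(corr[-1, "outcome"])
```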
Pairwise matrix of scatter plot
Using the command below, we can easily examine the relationship between each pair of variables.
> pairs(GII[2:7])
Figure 5: Pairwise matrix of scatter plot
Figure 5 shows that as ‘Maternal Mortality Ratio’ increases, ‘Adolescent Birth Rate’ also
increases, so the correlation between these two variables is positive. As ‘Maternal Mortality
Ratio’ increases, the ‘Share of seats by Women in Parliament’ percentage decreases, so the
correlation between these two variables is negative. In the same way we can examine the
correlation for every pair of variables.
Linear model
> GII.final <- lm(Gender_Inequality_Index ~ Maternal_Mortality_Ratio +
    Adolescent_Birth_Rate + Share_of_Seats_by_Women_in_Parliment +
    Population_with_at_least_some_secondary_education +
    Labour_force_participation_rate, data = GII)
> GII.final
Figure 6: Coefficients
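Formula for the multiple regression model
Y = b0 + b1X1 + b2X2 + ... + bnXn
where Y is the predicted value of the dependent variable, b0 is the intercept, and b1, b2, ..., bn
are the estimated coefficients of the independent variables X1, X2, ..., Xn.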
In Figure 6 (Coefficients), we have obtained unstandardized coefficients in the form of B values.
One B value is assigned to each independent variable. Substituting the B values gives the
equation for predicting our dependent variable (GII) from the independent variables:
Gender_Inequality_Index = 0.1797501 + 0.0003497 × Maternal_mortality_Ratio +
0.0024407 × Adolescent_Birth_Rate - 0.0030140 × Share_of_seats_by_women_in_Parliament -
0.0015625 × Population_with_at_least_some_secondary_education -
0.0033249 × Labor_force_Participation_rate
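To make the substitution concrete, the fitted equation can be written as a small R function. The coefficients are the ones reported above. The function name `predict_gii`, its argument names and the input values in the usage line are hypothetical and chosen only for illustration.

```r
# Coefficients as reported in the fitted equation.
coefs <- c(
  intercept = 0.1797501,
  maternal_mortality_ratio = 0.0003497,
  adolescent_birth_rate = 0.0024407,
  share_of_seats = -0.0030140,
  secondary_education = -0.0015625,
  labour_force_participation = -0.0033249
)

# Hypothetical helper: plug predictor values into the fitted equation.
predict_gii <- function(mmr, abr, seats, educ, lfp) {
  unname(
    coefs["intercept"] +
      coefs["maternal_mortality_ratio"] * mmr +
      coefs["adolescent_birth_rate"] * abr +
      coefs["share_of_seats"] * seats +
      coefs["secondary_education"] * educ +
      coefs["labour_force_participation"] * lfp
  )
}

# Illustrative inputs only; these are not values from the data set.
predict_gii(mmr = 200, abr = 60, seats = 20, educ = 40, lfp = 50)
```

With a fitted model object available, `predict(GII.final, newdata = ...)` would give the same result without copying coefficients by hand.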
> summary(GII.final)
Figure 7: summary(GII.final)
In Figure 7, we can determine whether the independent variables used in the model are
statistically significant. ‘Maternal Mortality Ratio’, ‘Adolescent Birth Rate’, ‘Share of seats
by Women in Parliament’ and ‘Labor force Participation rate’ are statistically significant
predictors of GII. However, ‘Population with at least some secondary education’ falls just short
of statistical significance.
An R² value close to 1 indicates that the model explains a large portion of the variance in the
outcome variable. The coefficient of determination (R Square) of 0.8286 shows that our independent
variables explain 82.86% of the variability in our dependent variable.
The model has 5 and 153 degrees of freedom and an F value of 147.9, which can be reported as
F(5, 153) = 147.9. The output also gives a p-value of < 2.2e-16, which is less than 0.05 and
indicates that the regression model as a whole is statistically significant.
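The quantities discussed above can be pulled out of a `summary()` object directly. The sketch below uses simulated data with 159 rows and five predictors (hypothetical names `x1` to `x5`) to show where the 5 and 153 degrees of freedom come from: 5 predictors, and 159 - 5 - 1 = 153 residual degrees of freedom. The simulated R² and F value will not match the report.

```r
set.seed(42)
n <- 159

# Simulated predictors and outcome; not the GII data.
toy <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
names(toy) <- paste0("x", 1:5)
toy$y <- 0.5 + 0.8 * toy$x1 - 0.6 * toy$x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = toy)
s <- summary(fit)

r2 <- s$r.squared   # coefficient of determination
f <- s$fstatistic   # named vector: value, numdf, dendf

# Overall p-value of the F test (upper tail of the F distribution).
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

c(r2 = r2, f["numdf"], f["dendf"])
```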
Analysis of variance for individual terms
> library("car")
> Anova(GII.final)
Figure 8: Anova
Simple plot of predicted values with 1-to-1 line
> GII.Predict <- GII
> GII.Predict$Predict_Value <- predict(GII.final)
> plot(Predict_Value ~ Gender_Inequality_Index, data = GII.Predict,
    main = "Predicted vs Actual",
    sub = "Dependent Variable: Gender Inequality Index",
    xlab = "Actual response value", ylab = "Predicted response value")
> abline(0, 1, col = "blue", lwd = 2)
Figure 9: Predicted Response vs Actual Response
Histogram
> hist(residuals(GII.final), col="darkgray")
Figure 10: Histogram
Residual plot
> plot(GII.final,which = 1 )
Figure 11: Residual Plot
Figures 9, 10 and 11 suggest that the residuals are approximately normally distributed and that
the independent variables have an approximately linear relationship with the dependent variable (GII).
Conclusion
In this multiple regression analysis, the model predicting the Gender Inequality Index is
statistically significant, F(5, 153) = 147.9, and explains 82.86% of the variability in GII.
Four out of five variables used in our model are statistically significant.
LOGISTIC REGRESSION ANALYSIS
Logistic regression is a statistical method for modelling the probability that an observation
falls into one of the two categories of a dichotomous dependent variable, based on one or more
independent variables that can be either continuous or categorical.
Assumptions:
1. The dependent variable is dichotomous or binary in nature.
2. There must be one or more independent variables, or predictors, for a logistic regression.
3. There should be some relationship between the dependent and each of the independent variables
used for analysis.
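Logistic regression in mathematical terms:
Logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + ... + bnXn
where p is the probability of the event and Xi, for i = 1 to n, are the independent variables.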
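The logit named above is the log of the odds, and its inverse maps a linear predictor back to a probability. A minimal sketch follows; the helper names `logit` and `inv_logit` are hypothetical, while base R provides the same functions as `qlogis` and `plogis`.

```r
# Log-odds of a probability p.
logit <- function(p) log(p / (1 - p))

# Inverse logit: turns a linear predictor z back into a probability.
inv_logit <- function(z) 1 / (1 + exp(-z))

p <- c(0.1, 0.5, 0.9)
z <- logit(p)

# Round trip: applying the inverse recovers the original probabilities.
inv_logit(z)
```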
Data source
http://data.un.org/Data.aspx?d=GenderStat&f=inID%3a101 – Employment-to-Population ratio
Objective
In this analysis, we use binary logistic regression to estimate the probability of a high
employability rate (dependent variable) from gender, age, year and country.
Data information
Measurement levels of variables:
In our filtered data, we have one dependent variable and four independent variables. In this
statistical analysis, the variables were grouped into the following categories.
Nominal variables: Employability_Rate, Gender
Ordinal variables: Country
Interval variables: Year, Age
Ratio variables: None
Context of Data used:
Dependent variable:
Employability_Rate (percentage of employment-to-population ratio) coded as ‘1’ if the percentage is
higher than 55.13 else ‘0’
Independent variable:
Gender coded as ‘0’ for Female and ‘1’ for Male.
Age coded as ‘0’ for 15+ yr and ‘1’ for 15-24 yr
Country coded as '1' for Australia, '2' for Canada, '3' for China, '4' for Germany, '5' for India,
'6' for Ireland, '7' for New Zealand, '8' for South Africa, '9' for United Kingdom, '10' for United States of
America
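The coding scheme described above could be applied in R along the following lines. This is a sketch on a toy data frame: the raw column names `rate`, `sex` and `age_group` and all values are hypothetical, and only the 55.13 threshold and the 0/1 codes come from the report.

```r
# Toy raw data (hypothetical column names and values).
toy <- data.frame(
  rate = c(40.2, 55.13, 61.7, 70.0),
  sex = c("Female", "Male", "Female", "Male"),
  age_group = c("15+", "15-24", "15+", "15-24"),
  stringsAsFactors = FALSE
)

# Dependent variable: 1 if the ratio is higher than 55.13, else 0.
toy$Employability_Rate <- ifelse(toy$rate > 55.13, 1, 0)

# Independent variables: 0 = Female, 1 = Male; 0 = 15+, 1 = 15-24.
toy$Gender <- ifelse(toy$sex == "Male", 1, 0)
toy$Age <- ifelse(toy$age_group == "15-24", 1, 0)

toy[, c("Employability_Rate", "Gender", "Age")]
```

Note that a ratio of exactly 55.13 is coded 0, because the rule is "higher than" rather than "at least".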
Software
SPSS is used to run the logistic regression and analyze its output.
RStudio is used to clean the data.
Data cleaning
The raw data consists of 11265 rows and 7 columns. I have selected data for 10 countries over
selected years for a significant and clear analysis. To obtain high-quality data suitable for our
analysis, rows with insignificant or missing values were removed using R code, as shown in the
diagram below. After cleaning, the data set consists of 641 rows and 5 columns.
Figure 12: Screenshot of data used for Logistic Regression
Analysis Method:
To perform this statistical analysis, we run the binary logistic regression in SPSS by
following the steps below:
• Import the cleaned and transformed data into SPSS and set the proper measure for each variable.
• Click Analyze – Regression – Binary Logistic from the menu; a dialog box will appear.
• In the dialog box, move Employability_Rate to the dependent variable field and Country,
Year, Gender and Age to the independent variables.
• In the Options menu, set the confidence interval to 95% and make sure the residual statistics,
goodness of fit and classification plots checkboxes are selected.
• Click Continue, verify the details and click OK to run the analysis in SPSS and
interpret the results from the data set.
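The report fits this model through the SPSS menus. For readers working in R, a roughly analogous fit can be sketched with `glm`. This is a substitute technique, not the author's procedure. The data are simulated, the year range is assumed, and Country is entered as a numeric code as in the coding above, so the estimates will not match the SPSS output.

```r
set.seed(7)
n <- 641

# Simulated stand-in for the cleaned data (hypothetical values).
toy <- data.frame(
  Gender = rbinom(n, 1, 0.5),
  Age = rbinom(n, 1, 0.5),
  Year = sample(2000:2015, n, replace = TRUE),  # assumed year range
  Country = sample(1:10, n, replace = TRUE)
)

# Simulate a binary outcome that depends on Gender and Age.
linear_part <- -0.5 + 1.2 * toy$Gender - 1.0 * toy$Age
toy$Employability_Rate <- rbinom(n, 1, plogis(linear_part))

# Binary logistic regression with the four predictors.
fit <- glm(Employability_Rate ~ Gender + Age + Year + Country,
           data = toy, family = binomial)

coef(fit)  # estimated log-odds coefficients
```

If Country should be treated as a set of categories rather than a number, wrapping it as `factor(Country)` in the formula would be the usual alternative.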
Results:
We have several tables generated as a result of our logistic regression model. We’ll go through each
table and interpret our findings.
Figure 13: Model Summary
In Figure 13 (Model Summary), we have obtained two R Square values, 0.281 and 0.377, based on
two different scales. For convenience, we will ignore the Cox & Snell R Square value and consider
Nagelkerke’s value. The Nagelkerke R Square value of 0.377 shows that our independent variables
account for 37.7% of the dependent variable’s variability.
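The two R Square values differ because Nagelkerke's version rescales Cox & Snell's so that its maximum is 1. A minimal sketch of both, computed from the deviances of a `glm` fit on simulated data, follows; the numbers it produces are unrelated to the 0.281 and 0.377 in the report.

```r
set.seed(11)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.2 + 1.5 * x))

fit <- glm(y ~ x, family = binomial)

# For 0/1 outcomes the deviance equals -2 times the log-likelihood.
dev_model <- fit$deviance
dev_null <- fit$null.deviance

# Cox & Snell: 1 - (L_null / L_model)^(2/n)
cox_snell <- 1 - exp((dev_model - dev_null) / n)

# Nagelkerke: Cox & Snell divided by its maximum attainable value.
nagelkerke <- cox_snell / (1 - exp(-dev_null / n))

c(cox_snell = cox_snell, nagelkerke = nagelkerke)
```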
Figure 14: Classification Table
The overall percentage correctly classified by the binary logistic regression is 76.3 percent. This
indicates that the model predicts reasonably well and can be used to predict the employability rate
from the independent variables. The cut value in the table is the probability threshold used for
classification. If the predicted probability is less than the cut value, the case is categorized in
the first group; otherwise, it falls in the second group.
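The way a classification table and its overall percentage are built from predicted probabilities and a cut value can be sketched as follows. The probabilities and observed outcomes are made up for illustration, and the cut value of 0.5 is an assumption, since the report does not state the value used.

```r
# Made-up predicted probabilities and observed outcomes.
predicted_prob <- c(0.10, 0.40, 0.55, 0.80, 0.30, 0.90)
observed <- c(0, 0, 1, 1, 1, 1)

cut_value <- 0.5  # assumed threshold

# Below the cut value -> first group (0); otherwise -> second group (1).
predicted_class <- ifelse(predicted_prob >= cut_value, 1, 0)

# Classification table: rows are observed, columns are predicted.
tab <- table(Observed = observed, Predicted = predicted_class)

# Overall percentage correct: the diagonal holds the correct predictions.
accuracy <- sum(diag(tab)) / sum(tab)
tab
```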
Figure 15: Variables Table
From Figure 15 (Variables Table), we can see how the odds of the event change when one independent
variable is varied by 1 unit while the others are kept unchanged. The table also reports the Wald
test, which assesses the significance of each predictor variable. In this table we look for
significance values less than .05. The test shows that all the variables are statistically significant.
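The Wald statistic for a coefficient is its estimate divided by its standard error, squared, compared against a chi-square distribution with 1 degree of freedom. A minimal sketch on simulated data follows; the variable names and values are hypothetical, and `exp()` of a coefficient gives the odds ratio for a 1-unit change in that predictor.

```r
set.seed(3)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 + 1.0 * x))

fit <- glm(y ~ x, family = binomial)
est <- summary(fit)$coefficients  # Estimate, Std. Error, z value, Pr(>|z|)

# Wald chi-square statistic and its p-value (1 degree of freedom).
wald <- (est[, "Estimate"] / est[, "Std. Error"])^2
p_wald <- pchisq(wald, df = 1, lower.tail = FALSE)

# Odds ratio for a 1-unit increase in each predictor.
odds_ratio <- exp(est[, "Estimate"])

cbind(wald, p_wald, odds_ratio)
```

The p-values agree with the `Pr(>|z|)` column that R prints, because the squared z statistic follows the same chi-square distribution.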
Conclusion:
On executing the binary logistic regression analysis to predict the employability rate for selected
countries over the years, we observed that all four independent variables are statistically significant.
Also, our model obtained a Nagelkerke R Square value of 0.377 (37.7%) and the regression achieves
about 73.3% accuracy.
References:
[1] PALLANT, J. SPSS Survival Manual. 6th Edition. McGraw Hill, 2016.
[2] IBM SPSS 25 https://www.ibm.com/analytics/spss-statistics-software.
[3] Brett Lantz (2013) Machine learning with R. Second Edition.