STATISTICS FOR DATA ANALYTICS PROJECT
Multiple and Logistic Regression
Sarthak Khare
18180485
Multiple Linear Regression Analysis
Objective: The objective of this analysis is to apply multiple linear regression to our life expectancy
dataset to predict life expectancy from predictors such as population, pollution and alcohol consumption,
and to run diagnostic tests to check whether all these predictors are significant in predicting life
expectancy. We also need to check whether our model satisfies all the assumptions of a multiple linear
regression model, such as linearity and homoscedasticity.
Background on Data:
For the multiple linear regression analysis, multiple datasets were sourced from the ‘who.int’ website’s public
health and environment data1 and then pre-processed and merged in R into a single file.
- Data has been merged by country.
- ‘life_exp’ is the dependent variable we are trying to predict using the independent variables
defined in the data dictionary below.
- The data has 4 independent variables and 1 dependent variable.
- After merging and cleaning the data, we are left with a sample size of 182 unique observations.
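The merge step was performed in R; as a rough illustration only, a pure-Python sketch of an inner join by country (with made-up mini-tables standing in for the WHO indicator files) could look like:

```python
# Hypothetical sketch of the merge-by-country step (done in R for the report).
# Each WHO indicator arrives as its own country-keyed table; we keep only
# countries that appear in every table (an inner join).
def merge_by_country(*datasets):
    """Each dataset is a dict mapping country -> {column: value}."""
    common = set(datasets[0])
    for d in datasets[1:]:
        common &= set(d)  # drop countries missing from any table
    merged = {}
    for country in sorted(common):
        row = {}
        for d in datasets:
            row.update(d[country])
        merged[country] = row
    return merged

# Illustrative fragments (values taken from the sample table below)
life = {"Albania": {"life_exp": 20.8}, "Armenia": {"life_exp": 19.6}}
uhc = {"Albania": {"uhc": 58}, "Armenia": {"uhc": 66}, "Atlantis": {"uhc": 1}}
print(merge_by_country(life, uhc))
```

Countries absent from any indicator file (like the fictional "Atlantis" above) are dropped, which is how the merged sample ends up at 182 unique observations.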
Data dictionary:
- life_exp (scale, dependent variable): Life expectancy at age 60 years. This is the predicted variable. Source: http://apps.who.int/gho/data/node.main.SDG2016LEX?lang=en
- alc_consumption (scale, independent variable): Alcohol consumption per capita. Source: http://apps.who.int/gho/data/node.main.SDG35?lang=en
- pm25 (scale, independent variable): Concentration of fine particulate matter (PM2.5) in the country. Source: http://apps.who.int/gho/data/node.main.SDG116?lang=en
- population (scale, independent variable): Population of the country, in thousands. Source: http://apps.who.int/gho/data/node.main.SDGPOP?lang=en
- uhc (scale, independent variable): Universal health coverage index of the country. Source: http://apps.who.int/gho/data/node.main.SDG38?lang=en
Below is a sample of the data:
Country alc_consumption life_exp pm25 population uhc
Afghanistan 0.2 16.3 59.9 34 656 34
Albania 7.5 20.8 18.2 2926 58
Algeria 0.9 21.9 34.5 40 606 76
Angola 6.4 17.3 28.4 28 813 38
Antigua and Barbuda 7 19.7 18 101 73
Argentina 9.8 21.8 11.7 43 847 76
Armenia 5.5 19.6 32.9 2925 66
Australia 10.6 25.6 7.3 24 126 86
1 http://apps.who.int/gho/data/node.main.1?lang=en
Assumptions of Multiple Linear Regression Analysis:
1. Linearity: In a multiple linear regression analysis, we need to check whether our dependent
variable has a linear relationship with our independent variables. We can do this by looking at
scatterplots of the DV against the IVs. Graph 1.1 shows that our outcome variable (life_exp) has a strong
linear relationship with our predictors (uhc, pm25 & alc_consumption). We can also validate this from
the residuals vs. predicted values graph (Graph 1.2), which shows no evidence of a systematic
relationship. Hence, we can assume our model is linear.
Graph 1.1
Graph 1.2
2. Homoscedasticity: Homoscedasticity means that the errors have constant variance. This can be
validated by plotting the residuals against the fitted values: if the residuals look like noise, i.e.
show no obvious pattern, then we can say we have homoscedasticity.
In graph 1.2, we can see there is no obvious pattern between the residuals and the fitted values, and
hence conclude that our model satisfies homoscedasticity.
3. Autocorrelation between errors: We can check for autocorrelation, i.e. the independence of the
error terms, using the Durbin-Watson statistic. A Durbin-Watson value close to 2 indicates independence
of errors.
In Table 3.1, we can see the Durbin-Watson statistic for our model is 1.954; hence we can assume there
is no autocorrelation between the errors in our model.
Table 3.1
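The Durbin-Watson statistic can be computed directly from the residual series. A minimal sketch, using illustrative residuals rather than the model's actual ones:

```python
# Durbin-Watson statistic: ratio of the sum of squared successive residual
# differences to the sum of squared residuals. Values near 2 suggest
# independent errors.
def durbin_watson(residuals):
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals (negative autocorrelation) push DW above 2;
# runs of same-signed residuals (positive autocorrelation) pull it below 2.
print(durbin_watson([1, -1, 1, -1, 1, -1]))
print(durbin_watson([1, 1, 1, -1, -1, -1]))
```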
4. Normally distributed errors: One of the assumptions of a linear model is that the residuals are
normally distributed with a mean of 0. To check this, we can plot a histogram or a probability plot of
the residuals: the histogram should look normal, and the points in the probability plot should lie on
the straight 45-degree line.
We can check this for our model in graphs 4.1 and 4.2 below, which support this assumption.
Graph 4.1
Graph 4.2
5. Multicollinearity (absence): Multicollinearity occurs when two or more independent variables have a strong
relationship/collinearity with each other. In a linear regression model, there should be no
multicollinearity between the independent variables.
A Pearson correlation matrix can flag multicollinearity: an absolute correlation of 0.8 or above
between two predictors indicates a problem. From table 5.1 we can see that no independent variable has
|r| > 0.8 with any other independent variable; therefore, we can assume they are not strongly correlated.
Another test for multicollinearity is the VIF test: a VIF greater than 10 indicates that a predictor is
collinear with the other predictors. From table 5.2 we can see that none of the predictor variables
has a VIF > 10, so we conclude there is no multicollinearity in our model.
Table 5.1
Table 5.2
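The Pearson screen in table 5.1 amounts to computing pairwise correlations and flagging any |r| of 0.8 or above. A sketch in Python, using the uhc and pm25 values from the eight sample rows above (the full analysis used all 182 rows in SPSS):

```python
import math

# Pearson correlation between two predictors; |r| >= 0.8 is the
# multicollinearity red flag used in the report.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# uhc and pm25 columns from the sample table
uhc = [34, 58, 76, 38, 73, 76, 66, 86]
pm25 = [59.9, 18.2, 34.5, 28.4, 18.0, 11.7, 32.9, 7.3]
r = pearson_r(uhc, pm25)
print(r, abs(r) >= 0.8)
```

For these eight rows the two predictors are clearly negatively related but stay under the |0.8| flag, consistent with the full-sample finding.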
6. Influential data points: On its own, a data point may be an outlier but not necessarily affect
the regression line; similarly, a data point may have leverage but, on its own, not influence the
regression line. However, a data point that both has leverage and is an outlier becomes an influential
data point. We check for influential data points by measuring Cook's distance: if Cook's distance
is 1 or greater, the data point is considered influential.
For our model, we can check the Cook's distance in the residual statistics in table 6.2 below and see its
maximum value is 0.205. Hence, we do not have any influential points in our data that need to be
removed. I have also individually checked the Cook's distances of all the independent variables and
have not found any point of significant influence (table 6.1).
Table 6.1
Table 6.2
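Cook's distance combines the residual with the leverage. For a one-predictor model it can be computed by hand, as in this sketch with synthetic data containing one planted high-leverage point; the report's four-predictor figures come from SPSS:

```python
# Cook's distance for simple (one-predictor) linear regression, as a sketch
# of the influence check. The data below is synthetic: the last point has
# extreme leverage and pulls the fitted line.
def cooks_distance(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx  # slope
    b0 = my - b1 * mx                                          # intercept
    resid = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    p = 2                                   # parameters: intercept + slope
    mse = sum(e ** 2 for e in resid) / (n - p)
    lev = [1 / n + (a - mx) ** 2 / sxx for a in x]  # hat values h_ii
    return [(e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
            for e, h in zip(resid, lev)]

d = cooks_distance([1, 2, 3, 4, 10], [1, 2, 3, 4, 20])
print(d)  # only the planted high-leverage point exceeds the D >= 1 cutoff
```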
Model Evaluation and Selection
To evaluate our regression model, we will have to look at the summary output of the model.
Table 7
Here, we look at the R-square value of the model, which is 0.689. R-square is the proportion of variance in
the predicted variable that is explained by the model; our model explains 68.9% of the variance in the
predicted values. Adjusted R-square is a modified version of R-square that penalizes the model for
introducing more independent variables. Both R-square and adjusted R-square lie between 0 and 1, and the
closer the value is to 1, the better the model is at predicting actual values.
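The adjustment can be reproduced from the reported figures: with R-square = 0.689, n = 182 observations and k = 4 predictors, the standard formula recovers the 0.682 adjusted value quoted in the summary.

```python
# Adjusted R-square = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# penalizing R^2 for the number of predictors k.
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.689, 182, 4), 3))  # 0.682
```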
ANOVA
The F statistic tells us whether our model is better at predicting values than simply using the mean. The
significance value tests the null hypothesis that all the coefficients are equal to zero. As the significance
value is <0.001, we can reject the null hypothesis that all coefficients are zero.
Table 8
Evaluating independent variables:
Regression coefficients (β) give the amount by which the dependent variable changes for a 1-unit change
in one independent variable, with all the other independent variables held constant. From table 9, we can
read the unstandardized coefficients (β) of all the independent variables, and the y-intercept, from the
column ‘Unstandardized B’.
We can also see from the table that only “pm25” and “uhc” are statistically significant at the 95% confidence
level (Sig. < 0.05). Based on this, we can remove “alc_consumption” and “population” from our regression equation.
Our regression equation for predicting life expectancy(Y) would thus be:
Y = 11.176 - 0.024*pm25 + 0.147*uhc
Table 9
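The final equation is easy to apply; for example, plugging in Australia's row from the sample table (pm25 = 7.3, uhc = 86):

```python
# Point prediction from the fitted equation, keeping only the two
# significant predictors (pm25 and uhc).
def predict_life_exp(pm25, uhc):
    return 11.176 - 0.024 * pm25 + 0.147 * uhc

# Australia's sample row: pm25 = 7.3, uhc = 86
print(round(predict_life_exp(7.3, 86), 2))  # 23.64
```

The observed value for Australia in the sample table is 25.6, so the two-predictor model underestimates this row by about two years.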
Summary: A multiple regression model was applied to our dataset of 182 records to predict ‘life
expectancy’ (life_exp) from the independent variables ‘alcohol consumption’ (alc_consumption),
‘pollution’ (pm25), ‘population’, and ‘universal health coverage’ (uhc).
A preliminary analysis checked the assumptions of multiple linear regression, such as multicollinearity
and homoscedasticity, and all assumptions were found to be satisfied.
The model produced an adjusted R-square of 0.682, and at a 95% confidence level only 2 of the
variables, uhc & pm25, were found to be significant, with coefficients of 0.147 and -0.024 respectively.
Logistic Regression Analysis
Objective:
The objective of this analysis is to apply binary logistic regression to predict the binary outcome variable
‘life_exp_binary’ (full information on the data below), to check whether our model satisfies all the
assumptions of the model, and to perform diagnostics if it does not.
Based on the results obtained, we will further evaluate the model using methods such as the
Hosmer-Lemeshow test and the classification matrix.
Background on data:
For the logistic regression analysis, we use the same dataset sourced from the ‘who.int’ website that was
earlier used for the multiple linear regression. The outcome variable ‘life_exp’ has been
converted in R to a binary variable ‘life_exp_binary’ based on the median of ‘life_exp’, as below:
- Life_exp_binary (>median(life_exp)) = 1 (indicates a high life expectancy)
- Life_exp_binary (<=median(life_exp)) = 0 (indicates a low life expectancy)
All the other independent variables are again being used to predict the outcome variable ‘life_exp_binary’.
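The median split described above can be sketched as follows, with the life_exp values from the sample table standing in for the full 182-country column:

```python
from statistics import median

# Derive life_exp_binary: 1 for values above the median (high life
# expectancy), 0 for values at or below it (low life expectancy).
def to_binary(values):
    m = median(values)
    return [1 if v > m else 0 for v in values]

# life_exp values from the sample table rows
life_exp = [16.3, 20.8, 21.9, 17.3, 19.7, 21.8, 19.6, 25.6]
print(to_binary(life_exp))  # [0, 1, 1, 0, 0, 1, 0, 1]
```

Note the split is near-balanced by construction, which is why the null model's accuracy later sits close to 50%.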
Assumptions of Logistic Regression Analysis:
1. Sample size: Logistic regression assumes an adequate sample: at least 60 cases overall and 20 cases per
predictor variable. With 4 predictor variables, our model requires a minimum sample size of 80,
which is met, as our sample size is 182.
2. Multicollinearity: As we are working with the same dataset which was used in Multiple linear
regression analysis, we can say there is no multicollinearity in the data, based on our earlier analysis.
3. Outliers: We can again say there are no outliers in the data, based on the analysis conducted in the
multiple regression tests on the same data.
Model Evaluation:
To evaluate our logistic regression model, we will look at the following factors:
1. Block 0: Block 0 is our null model, a baseline against which our final model may be compared.
The null model contains no independent variables. In table 10 below, we can see the null model has an
accuracy of 52.2% when no predictor variables are used.
Table 10
2. Omnibus test (Block 1): Block 1 is our model with all the independent variables. The omnibus test
tells us whether the full model improves on the null model. Here p < 0.001 (Sig.) indicates that the full
model is an improvement over the null model, so adding the predictors enhances the model.
Table 11
3. Model Summary: From the model summary, the Cox & Snell and Nagelkerke R-square statistics, which are
analogous to the R-square used in linear regression, suggest the model explains between 50.5% and 67.4%
of the variance in the outcome variable.
Table 12
4. Hosmer-Lemeshow test: this is an indicator of the goodness of fit of the model. For the Hosmer-Lemeshow
test, non-significance indicates a good fit, so for our model p (Sig.) should be greater than 0.05, which
can be seen in the table below. With a sig. value of 0.3, our model proves to be a good fit.
Table 13
5. Classification Table: From the classification table, we can check the accuracy, specificity, sensitivity, etc.
of the model. From the table below, we can see that our full model (block 1) has an improved
accuracy (correctly predicted values) of 79.7%, a considerable improvement over the null model's
(block 0) value of 52.2%.
Table 13
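The table's headline numbers reduce to simple ratios over the 2x2 counts. The tp/tn/fp/fn values below are hypothetical, chosen only so the accuracy reproduces the reported 79.7% on n = 182:

```python
# Accuracy, sensitivity, and specificity from the four cells of a binary
# classification table. The counts here are illustrative, not the report's.
def classification_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }

print(classification_metrics(tp=75, tn=70, fp=17, fn=20))
```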
6. Interpretation of variables in the model: Table 14 shows the influence and importance of each
variable in the logistic regression model. Here, we can use the Wald statistic, which is analogous to the
t-statistic used in linear regression, to check the significance of the independent variables. If Sig.
is < 0.05, a predictor variable is significant at the 95% confidence level.
From table 14, we can see that only 2 predictors, uhc and pm25, are significant, so we can drop the
other two variables, population and alc_consumption, from our model.
The column Exp(B) in the table gives the odds ratio of each predictor. If the odds ratio is > 1, the odds
of the outcome occurring increase as the value of the predictor increases; if the odds ratio is < 1, the
odds decrease as the predictor increases. For example, the odds of having a high life expectancy increase
by a factor of 1.207 for each one-unit increase in the ‘universal health coverage’ (uhc) index of the country.
Table 14
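The Exp(B) column is just the exponentiated coefficient; taking the uhc coefficient B = 0.188 from table 14:

```python
import math

# Odds ratio Exp(B) = e^B, here for the uhc coefficient reported in table 14
print(round(math.exp(0.188), 3))  # 1.207, the odds multiplier per unit of uhc
```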
Based on the above, we can form the following equation for our model:
Y = e^(-10.86 + 0.188*uhc - 0.046*pm25) / (1 + e^(-10.86 + 0.188*uhc - 0.046*pm25))
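The fitted logistic equation in code form; evaluating it for Australia's sample row (uhc = 86, pm25 = 7.3) gives a probability well above 0.5, i.e. the model would classify the country as high life expectancy:

```python
import math

# Predicted probability of high life expectancy from the fitted logistic
# model (the logistic equation above, rearranged as 1 / (1 + e^-z)).
def prob_high_life_exp(uhc, pm25):
    z = -10.86 + 0.188 * uhc - 0.046 * pm25
    return 1 / (1 + math.exp(-z))

# Australia's sample row: uhc = 86, pm25 = 7.3
print(round(prob_high_life_exp(86, 7.3), 3))
```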
Summary: By applying a binary logistic regression model to our dataset to predict the life_exp_binary variable,
we were able to correctly classify 79.7% of the cases. We were also able to conclude, with 95% confidence,
that only the uhc and pm25 variables were significant in predicting our outcome variable.
More Related Content

What's hot

Wisconsin hospital - Healthcare Cost Prediction
Wisconsin hospital - Healthcare Cost PredictionWisconsin hospital - Healthcare Cost Prediction
Wisconsin hospital - Healthcare Cost Prediction
Prasann Prem
 
7 classical assumptions of ordinary least squares
7 classical assumptions of ordinary least squares7 classical assumptions of ordinary least squares
7 classical assumptions of ordinary least squares
Yugesh Dutt Panday
 
Multicolinearity
MulticolinearityMulticolinearity
Multicolinearity
Pawan Kawan
 
Multiple Regression and Logistic Regression
Multiple Regression and Logistic RegressionMultiple Regression and Logistic Regression
Multiple Regression and Logistic Regression
Kaushik Rajan
 
Multicollinearity
MulticollinearityMulticollinearity
Multicollinearity
Bernard Asia
 
Gc3611111116
Gc3611111116Gc3611111116
Gc3611111116
IJERA Editor
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control Study
Satish Gupta
 
Multicollinearity1
Multicollinearity1Multicollinearity1
Multicollinearity1Muhammad Ali
 
Qualitative Analysis of a Discrete SIR Epidemic Model
Qualitative Analysis of a Discrete SIR Epidemic ModelQualitative Analysis of a Discrete SIR Epidemic Model
Qualitative Analysis of a Discrete SIR Epidemic Model
ijceronline
 
Multiple regression
Multiple regressionMultiple regression
Multiple regression
Antoine De Henau
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
DrZahid Khan
 
Chapter 9 Regression
Chapter 9 RegressionChapter 9 Regression
Chapter 9 Regressionghalan
 
Multinomial logisticregression basicrelationships
Multinomial logisticregression basicrelationshipsMultinomial logisticregression basicrelationships
Multinomial logisticregression basicrelationships
Anirudha si
 
Heteroscedasticity
HeteroscedasticityHeteroscedasticity
Heteroscedasticity
Geethu Rangan
 
Heteroscedasticity
HeteroscedasticityHeteroscedasticity
Heteroscedasticity
Muhammad Ali
 
Multiple Linear Regression
Multiple Linear RegressionMultiple Linear Regression
Multiple Linear Regression
Indus University
 
Ch8 Regression Revby Rao
Ch8 Regression Revby RaoCh8 Regression Revby Rao
Ch8 Regression Revby RaoSumit Prajapati
 
Logistic regression
Logistic regressionLogistic regression
Logistic regressionsaba khan
 

What's hot (20)

Wisconsin hospital - Healthcare Cost Prediction
Wisconsin hospital - Healthcare Cost PredictionWisconsin hospital - Healthcare Cost Prediction
Wisconsin hospital - Healthcare Cost Prediction
 
7 classical assumptions of ordinary least squares
7 classical assumptions of ordinary least squares7 classical assumptions of ordinary least squares
7 classical assumptions of ordinary least squares
 
Multicolinearity
MulticolinearityMulticolinearity
Multicolinearity
 
Multiple Regression and Logistic Regression
Multiple Regression and Logistic RegressionMultiple Regression and Logistic Regression
Multiple Regression and Logistic Regression
 
Multicollinearity
MulticollinearityMulticollinearity
Multicollinearity
 
Gc3611111116
Gc3611111116Gc3611111116
Gc3611111116
 
Logistic Regression in Case-Control Study
Logistic Regression in Case-Control StudyLogistic Regression in Case-Control Study
Logistic Regression in Case-Control Study
 
Multiple Linear Regression
Multiple Linear Regression Multiple Linear Regression
Multiple Linear Regression
 
Multicollinearity1
Multicollinearity1Multicollinearity1
Multicollinearity1
 
Qualitative Analysis of a Discrete SIR Epidemic Model
Qualitative Analysis of a Discrete SIR Epidemic ModelQualitative Analysis of a Discrete SIR Epidemic Model
Qualitative Analysis of a Discrete SIR Epidemic Model
 
Multiple regression
Multiple regressionMultiple regression
Multiple regression
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 
Chapter 9 Regression
Chapter 9 RegressionChapter 9 Regression
Chapter 9 Regression
 
Multinomial logisticregression basicrelationships
Multinomial logisticregression basicrelationshipsMultinomial logisticregression basicrelationships
Multinomial logisticregression basicrelationships
 
Heteroscedasticity
HeteroscedasticityHeteroscedasticity
Heteroscedasticity
 
Ols by hiron
Ols by hironOls by hiron
Ols by hiron
 
Heteroscedasticity
HeteroscedasticityHeteroscedasticity
Heteroscedasticity
 
Multiple Linear Regression
Multiple Linear RegressionMultiple Linear Regression
Multiple Linear Regression
 
Ch8 Regression Revby Rao
Ch8 Regression Revby RaoCh8 Regression Revby Rao
Ch8 Regression Revby Rao
 
Logistic regression
Logistic regressionLogistic regression
Logistic regression
 

Similar to Stats ca report_18180485

Statistics - Multiple Regression and Two Way Anova
Statistics - Multiple Regression and Two Way AnovaStatistics - Multiple Regression and Two Way Anova
Statistics - Multiple Regression and Two Way Anova
Nisheet Mahajan
 
Applications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationshipApplications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationship
Rithish Kumar
 
Regression and Classification Analysis
Regression and Classification AnalysisRegression and Classification Analysis
Regression and Classification Analysis
YashIyengar
 
X18145922 statistics ca2 final
X18145922   statistics ca2 finalX18145922   statistics ca2 final
X18145922 statistics ca2 final
SRIVATSAV KATTUKOTTAI MANI
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
Saleesh Satheeshchandran
 
30REGRESSION Regression is a statistical tool that a.docx
30REGRESSION  Regression is a statistical tool that a.docx30REGRESSION  Regression is a statistical tool that a.docx
30REGRESSION Regression is a statistical tool that a.docx
tarifarmarie
 
Logistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationLogistic regression and analysis using statistical information
Logistic regression and analysis using statistical information
AsadJaved304231
 
linear regression PDF.pdf
linear regression PDF.pdflinear regression PDF.pdf
linear regression PDF.pdf
JoshuaLau29
 
Sem with amos ii
Sem with amos iiSem with amos ii
Sem with amos ii
Jordan Sitorus
 
Statistical analysis of Multiple and Logistic Regression
Statistical analysis of Multiple and Logistic RegressionStatistical analysis of Multiple and Logistic Regression
Statistical analysis of Multiple and Logistic Regression
SindhujanDhayalan
 
X18136931 statistics ca2_updated
X18136931 statistics ca2_updatedX18136931 statistics ca2_updated
X18136931 statistics ca2_updated
KarthikSundaresanSub
 
Multiple Regression.ppt
Multiple Regression.pptMultiple Regression.ppt
Multiple Regression.ppt
TanyaWadhwani4
 
Solving stepwise regression problems
Solving stepwise regression problemsSolving stepwise regression problems
Solving stepwise regression problemsSoma Sinha Roy
 
Ders 2 ols .ppt
Ders 2 ols .pptDers 2 ols .ppt
Ders 2 ols .ppt
Ergin Akalpler
 
Regression analysis ppt
Regression analysis pptRegression analysis ppt
Regression analysis pptElkana Rorio
 
Data Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVAData Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVA
Derek Kane
 
Multinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdfMultinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdf
AlemAyahu
 
Econometrics project
Econometrics projectEconometrics project
Econometrics project
Shubham Joon
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakash
Shivaram Prakash
 

Similar to Stats ca report_18180485 (20)

Statistics - Multiple Regression and Two Way Anova
Statistics - Multiple Regression and Two Way AnovaStatistics - Multiple Regression and Two Way Anova
Statistics - Multiple Regression and Two Way Anova
 
Applications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationshipApplications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationship
 
Regression and Classification Analysis
Regression and Classification AnalysisRegression and Classification Analysis
Regression and Classification Analysis
 
X18145922 statistics ca2 final
X18145922   statistics ca2 finalX18145922   statistics ca2 final
X18145922 statistics ca2 final
 
Telecom customer churn prediction
Telecom customer churn predictionTelecom customer churn prediction
Telecom customer churn prediction
 
30REGRESSION Regression is a statistical tool that a.docx
30REGRESSION  Regression is a statistical tool that a.docx30REGRESSION  Regression is a statistical tool that a.docx
30REGRESSION Regression is a statistical tool that a.docx
 
Logistic regression and analysis using statistical information
Logistic regression and analysis using statistical informationLogistic regression and analysis using statistical information
Logistic regression and analysis using statistical information
 
linear regression PDF.pdf
linear regression PDF.pdflinear regression PDF.pdf
linear regression PDF.pdf
 
Sem with amos ii
Sem with amos iiSem with amos ii
Sem with amos ii
 
Statistical analysis of Multiple and Logistic Regression
Statistical analysis of Multiple and Logistic RegressionStatistical analysis of Multiple and Logistic Regression
Statistical analysis of Multiple and Logistic Regression
 
X18136931 statistics ca2_updated
X18136931 statistics ca2_updatedX18136931 statistics ca2_updated
X18136931 statistics ca2_updated
 
Multiple Regression.ppt
Multiple Regression.pptMultiple Regression.ppt
Multiple Regression.ppt
 
Wine.Final.Project.MJv3
Wine.Final.Project.MJv3Wine.Final.Project.MJv3
Wine.Final.Project.MJv3
 
Solving stepwise regression problems
Solving stepwise regression problemsSolving stepwise regression problems
Solving stepwise regression problems
 
Ders 2 ols .ppt
Ders 2 ols .pptDers 2 ols .ppt
Ders 2 ols .ppt
 
Regression analysis ppt
Regression analysis pptRegression analysis ppt
Regression analysis ppt
 
Data Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVAData Science - Part IV - Regression Analysis & ANOVA
Data Science - Part IV - Regression Analysis & ANOVA
 
Multinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdfMultinomial Logistic Regression.pdf
Multinomial Logistic Regression.pdf
 
Econometrics project
Econometrics projectEconometrics project
Econometrics project
 
German credit score shivaram prakash
German credit score shivaram prakashGerman credit score shivaram prakash
German credit score shivaram prakash
 

Recently uploaded

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Boston Institute of Analytics
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
nscud
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
ukgaet
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 

Recently uploaded (20)

一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf

Stats ca report_18180485

  • 1. STATISTICS FOR DATA ANALYTICS PROJECT
Multiple and Logistic Regression
Sarthak Khare
18180485
  • 2. Multiple Linear Regression Analysis

Objective: The objective of this analysis is to apply multiple linear regression to our life expectancy dataset, predicting life expectancy from predictors such as population, pollution, and alcohol consumption, and to run diagnostic tests to check whether each predictor contributes significantly to the prediction. We also need to check whether the model satisfies all the assumptions of multiple linear regression, such as linearity and homoscedasticity.

Background on Data: For the multiple linear regression analysis, several datasets were sourced from the 'who.int' website's public health and environment data,1 then pre-processed and merged in R into a single file.
- Data has been merged by country.
- 'life_exp' is the dependent variable we are trying to predict using the independent variables defined in the data dictionary below.
- The data has 4 independent variables and 1 dependent variable.
- After merging and cleaning the data, we are left with a sample of 182 unique observations.

Data dictionary:

| Variable        | Measure | Type                 | Description                                                      | URL                                                        |
| life_exp        | scale   | Dependent variable   | Life expectancy at age 60; this is the predicted variable        | http://apps.who.int/gho/data/node.main.SDG2016LEX?lang=en  |
| alc_consumption | scale   | Independent variable | Alcohol consumption per capita                                   | http://apps.who.int/gho/data/node.main.SDG35?lang=en       |
| pm25            | scale   | Independent variable | Concentration of fine particulate matter (PM2.5) in the country  | http://apps.who.int/gho/data/node.main.SDG116?lang=en      |
| population      | scale   | Independent variable | Population of the country, in thousands                          | http://apps.who.int/gho/data/node.main.SDGPOP?lang=en      |
| uhc             | scale   | Independent variable | Universal health coverage index of the country                   | http://apps.who.int/gho/data/node.main.SDG38?lang=en       |

Below is a sample of the data:

| Country             | alc_consumption | life_exp | pm25 | population | uhc |
| Afghanistan         | 0.2             | 16.3     | 59.9 | 34 656     | 34  |
| Albania             | 7.5             | 20.8     | 18.2 | 2 926      | 58  |
| Algeria             | 0.9             | 21.9     | 34.5 | 40 606     | 76  |
| Angola              | 6.4             | 17.3     | 28.4 | 28 813     | 38  |
| Antigua and Barbuda | 7               | 19.7     | 18   | 101        | 73  |
| Argentina           | 9.8             | 21.8     | 11.7 | 43 847     | 76  |
| Armenia             | 5.5             | 19.6     | 32.9 | 2 925      | 66  |
| Australia           | 10.6            | 25.6     | 7.3  | 24 126     | 86  |

1 http://apps.who.int/gho/data/node.main.1?lang=en
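The merging described above was done in R. As an illustrative sketch only (not the author's actual script), the same country-keyed inner join can be expressed in Python with plain dicts; the dataset fragments below are tiny hypothetical examples, not the report's data files.

```python
# Hypothetical sketch of the country-keyed merge described above.
# The report's actual pre-processing was done in R; this is an illustration only.

life_exp = {"Albania": 20.8, "Algeria": 21.9, "Angola": 17.3}
pm25 = {"Albania": 18.2, "Algeria": 34.5}          # Angola missing on purpose
uhc = {"Albania": 58, "Algeria": 76, "Angola": 38}

def merge_by_country(*datasets):
    """Inner-join several {country: value} dicts, keeping only countries
    present in every dataset (mirrors dropping incomplete rows)."""
    common = set(datasets[0])
    for d in datasets[1:]:
        common &= set(d)
    return {c: tuple(d[c] for d in datasets) for c in sorted(common)}

merged = merge_by_country(life_exp, pm25, uhc)
# Angola is dropped because it has no pm25 value.
```

Countries missing from any source file are dropped, which is how merging and cleaning can reduce the data to the 182 complete observations mentioned above.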
  • 3. Assumptions of Multiple Linear Regression Analysis:

1. Linearity: In a multiple linear regression analysis, we need to check whether the dependent variable has a linear relationship with the independent variables. We can do this by looking at scatterplots of the DV against the IVs. Graph 1.1 shows that our outcome variable (life_exp) has a strong linear relationship with the predictors (uhc, pm25 & alc_consumption). We can also validate this with the residuals vs. predicted values plot (Graph 1.2), which shows no evidence of a systematic relationship. Hence, we can assume our model is linear.

Graph 1.1
Graph 1.2

2. Homoscedasticity: Homoscedasticity requires the errors to have constant variance. This can be validated by plotting the residuals against the fitted values: if the residuals look like noise, i.e. show no obvious pattern, we have homoscedasticity. In Graph 1.2 there is no obvious pattern between the residuals and the fitted values, so we conclude that our model is homoscedastic.
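Alongside scatterplots, the strength of each linear relationship can be quantified with Pearson's r. The report itself reads these off SPSS output; as a minimal pure-Python sketch (the data points below are made up):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A perfectly linear pair gives r = 1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

Values of |r| near 1 correspond to the "strong linear relationship" the scatterplots show; values near 0 would suggest no linear association.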
  • 4. 3. Autocorrelation between errors: We can check for autocorrelation (independence of error terms) using the Durbin-Watson statistic; a value close to 2 indicates independence of errors. In Table 3.1, the Durbin-Watson statistic for our model is 1.954, so we can assume there is no autocorrelation between the errors in our model.

Table 3.1

4. Normally distributed errors: One of the assumptions of a linear model is that the residuals are normally distributed with a mean of 0. To check this, we can plot a histogram or a probability plot of the residuals: the histogram should look normal, and the points on the probability plot should lie on the straight 45-degree line. Graphs 4.1 and 4.2 below show that both hold for our model, confirming the assumption.

Graph 4.1
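The Durbin-Watson statistic reported in Table 3.1 is computed from the ordered residuals as DW = Σ(e_t − e_{t−1})² / Σe_t². A small sketch of the formula (the residuals below are invented, not the model's):

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: ~2 means no first-order autocorrelation;
    values toward 0 indicate positive, toward 4 negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Perfectly alternating residuals give a DW of 3.0 here,
# signalling negative autocorrelation:
print(durbin_watson([1, -1, 1, -1]))
```

The model's reported 1.954 sits close to 2, which is why independence of errors is accepted.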
  • 5. Graph 4.2

5. Multicollinearity (absence): Multicollinearity occurs when two or more independent variables are strongly correlated with each other. In linear regression models there should be no multicollinearity between the independent variables. A Pearson correlation matrix gives a first estimate of multicollinearity: any pair of variables with an absolute correlation of 0.8 or above is suspect. From Table 5.1, no independent variable has |r| > 0.8 with any other independent variable, so we can assume they are not collinear. Another test for multicollinearity is the VIF test: a predictor with a VIF greater than 10 is considered collinear with the other predictors. From Table 5.2, none of the predictor variables has a VIF > 10, so we conclude there is no multicollinearity in our model.

Table 5.1
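The two checks above are related: for a model with exactly two predictors, the VIF reduces to 1/(1 − r²), where r is the Pearson correlation between them (in general, r² is replaced by the R² from regressing one predictor on all the others). A small illustrative sketch, not taken from the report's tables:

```python
def vif_from_r(r):
    """Variance inflation factor from the correlation r between two predictors
    (or from sqrt(R^2) of regressing one predictor on the rest, in general)."""
    return 1.0 / (1.0 - r ** 2)

# The |r| = 0.8 correlation-matrix cutoff used above corresponds to a
# VIF of about 2.78 -- still well below the VIF > 10 rule of thumb.
print(vif_from_r(0.8))
```

This shows why both thresholds can be satisfied at once: predictors can be moderately correlated without inflating coefficient variances enough to matter.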
  • 6. Table 5.2

6. Influential data points: On its own, a data point may be an outlier yet have no effect on the regression line; similarly, a point may have leverage but, on its own, not influence the regression line. However, a point that is both an outlier and has leverage is an influential data point. We check for influential points by measuring Cook's distance: a value of 1 or greater marks a point as influential. For our model, the residual statistics in Table 6.2 below show a maximum Cook's distance of 0.205, so no influential points need to be removed from the data. I have also individually checked the Cook's distances for all the independent variables and found no point of significant influence (Table 6.1).

Table 6.1
Table 6.2
  • 7. Model Evaluation and Selection

To evaluate our regression model, we look at its summary output.

Table 7

Here, the R square of the model is 0.689. R square is the proportion of variance in the predicted variable that is explained by the model; our model explains 68.9% of the variance. Adjusted R square is a modified version of R square that penalizes the model for introducing more independent variables. Both R square and adjusted R square lie between 0 and 1, and the closer the value is to 1, the better the model predicts the actual values.

The ANOVA F statistic tells us whether our model predicts better than simply using the mean. Its significance value tests the null hypothesis that all the coefficients are zero; as the significance value is <0.001, we can reject that null hypothesis.

Table 8

Evaluating independent variables: A regression coefficient (β) is the amount by which the dependent variable changes for a 1-unit change in one independent variable, holding all the other independent variables constant. From Table 9 we can read the unstandardized coefficients (β) of all the independent variables and the y-intercept from the column 'Unstandardized B'. We can also see that only 'pm25' and 'uhc' are statistically significant at the 95% level (Sig. < 0.05). Based on this, we can remove 'alc_consumption' and 'population' from our regression equation. The regression equation for predicting life expectancy (Y) is thus:

Y = 11.176 - 0.024*pm25 + 0.147*uhc
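The reduced equation above can be applied directly to new inputs. A sketch using the reported coefficients (the pm25 and uhc values fed in here are hypothetical, not from the dataset):

```python
def predict_life_exp(pm25, uhc):
    """Life expectancy at 60 predicted by the reduced model reported above:
    Y = 11.176 - 0.024*pm25 + 0.147*uhc."""
    return 11.176 - 0.024 * pm25 + 0.147 * uhc

# e.g. a hypothetical country with pm25 = 20 and uhc = 70:
print(round(predict_life_exp(20, 70), 3))  # -> 20.986
```

Note the signs: higher particulate pollution lowers the prediction, while broader health coverage raises it, matching the coefficient interpretation above.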
  • 8. Table 9

Summary: A multiple regression model was applied to our dataset of 182 records to predict 'life expectancy' (life_exp) from the independent variables 'alcohol consumption' (alc_consumption), 'pollution' (pm25), 'population', and 'universal health coverage' (uhc). A preliminary analysis checking the assumptions of multiple linear regression, such as multicollinearity and homoscedasticity, found all assumptions satisfied. The model produced an adjusted R squared of 0.682, and at a 95% confidence level only 2 of the variables, uhc & pm25, were found to be significant, with coefficients of 0.147 and -0.024 respectively.
  • 9. Logistic Regression Analysis

Objective: The objective of this analysis is to apply binary logistic regression to predict the binary outcome variable 'life_exp_binary' (full information on the data below), to check whether our model satisfies all the assumptions of the model, and to perform diagnostics if it does not. Based on the results obtained, we will further evaluate the model using methods such as the Hosmer-Lemeshow test and the classification matrix.

Background on data: For the logistic regression analysis, we use the same dataset sourced from the 'who.int' website that was used for the multiple linear regression. The outcome variable 'life_exp' has been converted in R to a binary variable 'life_exp_binary' based on the median of 'life_exp', as follows:
- life_exp_binary = 1 if life_exp > median(life_exp) (indicates high life expectancy)
- life_exp_binary = 0 if life_exp <= median(life_exp) (indicates low life expectancy)
All the other independent variables are again used to predict the outcome variable 'life_exp_binary'.

Assumptions of Logistic Regression Analysis:
1. Sample size: Logistic regression assumes a sample of at least 60 cases, with 20 cases per predictor variable. With 4 predictor variables, our model needs a minimum sample of 80, which is met since our sample size is 182.
2. Multicollinearity: As we are working with the same dataset used in the multiple linear regression analysis, our earlier analysis shows there is no multicollinearity in the data.
3. Outliers: For the same reason, the earlier analysis shows there are no influential outliers in the data.

Model Evaluation: To evaluate our logistic regression model, we look at the following factors:

1. Block 0: Block 0 is the null model, a baseline against which the final model is compared. The null model contains no independent variables. In Table 10 below, the null model has an accuracy of 52.2% when no predictor variables are used.
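The median split described above was performed in R by the author; as a sketch of the same rule (the values below are illustrative):

```python
def binarize_by_median(values):
    """Map each value to 1 if strictly above the sample median, else 0,
    mirroring the life_exp -> life_exp_binary conversion described above."""
    s = sorted(values)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return [1 if v > median else 0 for v in values]

print(binarize_by_median([16.3, 20.8, 21.9, 17.3, 25.6]))  # -> [0, 0, 1, 0, 1]
```

Splitting at the median also explains the near-50/50 null-model accuracy (52.2%): guessing the majority class can only be slightly better than chance.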
  • 10. Table 10

2. Omnibus test (Block 1): Block 1 is the model with all the independent variables. The omnibus test tells us whether the full model has improved over the null model. Here p < 0.001 (Sig.) indicates that the full model is an improvement over the null model, i.e. adding the predictors enhances the model.

Table 11

3. Model Summary: From the model summary, the Cox & Snell and Nagelkerke R square statistics, which are analogous to the R square used in linear regression, indicate that the model explains between 50.5% and 67.4% of the variance in the predicted variable.
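The two pseudo-R² statistics quoted above are derived from the log-likelihoods of the null model (LL0) and the full model (LL1). A sketch of the standard formulas; the log-likelihood values below are invented for illustration and are not the report's:

```python
import math

def pseudo_r2(ll_null, ll_full, n):
    """Cox & Snell and Nagelkerke pseudo-R^2 from model log-likelihoods.
    Nagelkerke rescales Cox & Snell so that its maximum is 1."""
    cox_snell = 1.0 - math.exp((2.0 / n) * (ll_null - ll_full))
    nagelkerke = cox_snell / (1.0 - math.exp((2.0 / n) * ll_null))
    return cox_snell, nagelkerke

# Invented log-likelihoods for a sample of n = 182:
cs, nk = pseudo_r2(ll_null=-120.0, ll_full=-60.0, n=182)
# Nagelkerke is always the larger of the two, which is why SPSS reports
# a range such as the 50.5%-67.4% quoted above.
```

Because Cox & Snell cannot reach 1 even for a perfect model, the Nagelkerke correction is conventionally reported alongside it.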
  • 11. Table 12

4. Hosmer-Lemeshow test: This is an indicator of the goodness of fit of the model; for the Hosmer-Lemeshow test, non-significance indicates a good fit. For our model, p (Sig.) should therefore be greater than 0.05, which can be seen in the table below: with a Sig. value of 0.3, our model proves to be a good fit.

Table 13

5. Classification table: From the classification table, we can check the accuracy, specificity, sensitivity, etc. of the model. The table below shows that our full model (Block 1) has an improved accuracy (correctly predicted values) of 79.7%, a considerable improvement over the null model (Block 0) value of 52.2%.

Table 13

6. Interpretation of variables in the model: Table 14 shows the influence and importance of each variable in the logistic regression model. Here we use the Wald statistic, which is analogous to the t-statistic in linear regression, to check the significance of the independent variables: if Sig. < 0.05, the predictor is significant at the 95% confidence level. From Table 14, only two predictors, uhc and pm25, are significant, so we can drop the other two variables, population and alc_consumption, from our model. The column Exp(B) gives the odds ratio of each predictor: if the odds ratio is > 1, the odds of the outcome increase as the predictor increases; if it is < 1, the odds decrease as the predictor increases. For example, the odds of having a high life expectancy increase by a factor of 1.207 for each unit increase in the 'universal health coverage' (uhc) index of the country.
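The Exp(B) column is simply the exponential of the logit coefficient B. Applied to the model's uhc coefficient (B = 0.188), this reproduces the 1.207 odds ratio quoted above:

```python
import math

def odds_ratio(b):
    """Convert a logistic regression coefficient B to an odds ratio Exp(B)."""
    return math.exp(b)

# uhc coefficient from the fitted model:
print(round(odds_ratio(0.188), 3))  # -> 1.207
```

Coefficients below zero give odds ratios below 1; the negative pm25 coefficient therefore reduces the odds of high life expectancy as pollution rises.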
  • 12. Table 14

Based on the above, we can form the following equation for our model:

Y = e^(-10.86 + 0.188*uhc - 0.046*pm25) / (1 + e^(-10.86 + 0.188*uhc - 0.046*pm25))

Summary: By applying a binary logistic regression model to our dataset to predict the life_exp_binary variable, we were able to correctly predict 79.7% of the values. We were also able to conclude with 95% confidence that only the uhc and pm25 variables were significant in predicting our outcome variable.

References:
1. Grande, D. T., n.d. Interpreting output for multiple linear regression in SPSS. [Online] Available at: https://www.youtube.com/watch?v=WQeAsZxsXdQ
2. Anon., n.d. Logistic regression using SPSS (Level 1 MASH). [Online] Available at: https://www.sheffield.ac.uk/polopoly_fs/1.233565!/file/logistic_regression_using_SPSS_level1_MASH.pdf