The document describes applying multiple linear regression and logistic regression to predict life expectancy from various predictor variables. The multiple linear regression model explained 68.9% of the variance in life expectancy, with only pollution (pm25) and universal health coverage (uhc) statistically significant. The logistic regression model correctly predicted the binary life-expectancy outcome for 79.7% of cases, again with only uhc and pm25 as significant predictors. Model diagnostics and evaluations indicated that both models satisfied the required assumptions and fit the data well.
Multiple Linear Regression Analysis
Objective: The objective of this analysis is to apply multiple linear regression to our life expectancy dataset to predict life expectancy from predictors such as population, pollution, and alcohol consumption, and to run diagnostic tests to check whether each of these predictors contributes significantly to the prediction of life expectancy. We also need to check that our model satisfies the assumptions of a multiple linear regression model, such as linearity and homoscedasticity.
Background on Data:
For the multiple linear regression analysis, several datasets were sourced from the public health and environment section of the 'who.int' website¹ and then pre-processed and merged in R into a single file (a sketch of this merge appears after the list below).
- Data has been merged by country.
- 'life_exp' is the dependent variable we are trying to predict using the independent variables defined in the data dictionary below.
- The data has 4 independent variables and 1 dependent variable.
- After merging and cleaning the data, we are left with a sample of 182 unique observations.
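The original R scripts are not included in this report, so the following is only a minimal sketch of how such a merge could look in base R; the file names and the 'country' column name are assumptions made for illustration.

# Read the individual WHO extracts (file names are assumed for illustration).
life <- read.csv("who_life_expectancy_60.csv")   # country, life_exp
alc  <- read.csv("who_alcohol_consumption.csv")  # country, alc_consumption
pm   <- read.csv("who_pm25.csv")                 # country, pm25
pop  <- read.csv("who_population.csv")           # country, population
uhc  <- read.csv("who_uhc_index.csv")            # country, uhc

# Merge everything on the common 'country' key and drop incomplete rows.
merged <- Reduce(function(x, y) merge(x, y, by = "country"),
                 list(life, alc, pm, pop, uhc))
merged <- na.omit(merged)

write.csv(merged, "life_expectancy_merged.csv", row.names = FALSE)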
Data dictionary:

Variable        | Measure | Type                 | Description                                                      | URL
life_exp        | scale   | Dependent variable   | Life expectancy at age 60 years (the predicted variable)        | http://apps.who.int/gho/data/node.main.SDG2016LEX?lang=en
alc_consumption | scale   | Independent variable | Alcohol consumption per capita                                   | http://apps.who.int/gho/data/node.main.SDG35?lang=en
pm25            | scale   | Independent variable | Concentration of fine particulate matter (PM2.5) in the country | http://apps.who.int/gho/data/node.main.SDG116?lang=en
population      | scale   | Independent variable | Population of the country, in thousands                          | http://apps.who.int/gho/data/node.main.SDGPOP?lang=en
uhc             | scale   | Independent variable | Universal health coverage index of the country                   | http://apps.who.int/gho/data/node.main.SDG38?lang=en
Below is a sample of the data:

Country             | alc_consumption | life_exp | pm25 | population | uhc
Afghanistan         | 0.2             | 16.3     | 59.9 | 34,656     | 34
Albania             | 7.5             | 20.8     | 18.2 | 2,926      | 58
Algeria             | 0.9             | 21.9     | 34.5 | 40,606     | 76
Angola              | 6.4             | 17.3     | 28.4 | 28,813     | 38
Antigua and Barbuda | 7               | 19.7     | 18   | 101        | 73
Argentina           | 9.8             | 21.8     | 11.7 | 43,847     | 76
Armenia             | 5.5             | 19.6     | 32.9 | 2,925      | 66
Australia           | 10.6            | 25.6     | 7.3  | 24,126     | 86
¹ http://apps.who.int/gho/data/node.main.1?lang=en
Assumptions of Multiple Linear Regression Analysis:
1. Linearity: In a multiple linear regression analysis, we need to check whether the dependent variable has a linear relationship with the independent variables. We can do this by looking at scatterplots of the DV against each IV. Graph 1.1 shows that our outcome variable (life_exp) has a strong linear relationship with the predictors (uhc, pm25 and alc_consumption). We can also validate this from the residuals vs. predicted values graph (Graph 1.2), which shows no evidence of a systematic pattern. Hence, we can treat our model as linear.
Graph 1.1
Graph 1.2
2. Homoscedasticity: Homoscedasticity means that the errors have constant variance. This can be checked by plotting the residuals against the fitted values: if the residuals look like noise, i.e. show no obvious pattern, we can say we have homoscedasticity. In Graph 1.2 there is no obvious pattern between the residuals and the fitted values, so we conclude that our model satisfies homoscedasticity; an R sketch of how these diagnostic plots can be produced follows this item.
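The graphs in this report come from SPSS, but the same checks can be reproduced with a few lines of R; the sketch below assumes the merged file produced earlier and the variable names from the data dictionary.

# Fit the full model and inspect linearity / homoscedasticity visually.
data  <- read.csv("life_expectancy_merged.csv")
model <- lm(life_exp ~ alc_consumption + pm25 + population + uhc, data = data)

# Scatterplots of the outcome against each predictor (linearity check).
pairs(data[, c("life_exp", "alc_consumption", "pm25", "population", "uhc")])

# Residuals vs. fitted values (linearity and constant-variance check).
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. fitted")
abline(h = 0, lty = 2)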
3. Autocorrelation between errors: We can check for autocorrelation, i.e. lack of independence of the error terms, using the Durbin-Watson statistic. A Durbin-Watson value close to 2 indicates independence of errors. In Table 3.1 the Durbin-Watson statistic for our model is 1.954, so we can assume there is no autocorrelation between the errors in our model.
Table 3.1
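The Durbin-Watson statistic in Table 3.1 comes from SPSS; as a hedged alternative, the car package in R provides the same test for a fitted lm object.

# Durbin-Watson test for autocorrelation of the residuals (values near 2 are good).
library(car)
durbinWatsonTest(model)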
4. Normally distributed errors: Another assumption of a linear model is that the residuals are normally distributed with a mean of 0. To check this, we can plot a histogram of the residuals or examine their normal probability plot: the histogram should look approximately normal, and the points of the probability plot should lie on the straight 45-degree line. Graphs 4.1 and 4.2 below show that this holds for our model, confirming the assumption.
Graph 4.1
Graph 4.2
5. Multicollinearity (absence): Multicollinearity arises when two or more independent variables are strongly correlated with each other. In a linear regression model there should be no multicollinearity among the independent variables.
A Pearson correlation matrix gives a first indication of multicollinearity: any pair of predictors with a correlation of |0.8| or above is a warning sign. Table 5.1 shows that none of the independent variables has r > |0.8| with any other independent variable, so we can assume they are not strongly correlated.
Another check is the variance inflation factor (VIF): a VIF greater than 10 for a predictor indicates that it is collinear with the other predictors. Table 5.2 shows that none of the predictor variables has a VIF > 10, so we conclude there is no multicollinearity in our model; a small R sketch of both checks follows Table 5.2.
Table 5.1
Table 5.2
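Both checks can also be reproduced in R (a sketch only; the vif() call assumes the car package, while the report reads the values from the SPSS output in Tables 5.1 and 5.2).

# Pearson correlations between the predictors (flag anything at or above |0.8|).
round(cor(data[, c("alc_consumption", "pm25", "population", "uhc")]), 2)

# Variance inflation factors for the fitted model (flag anything above 10).
library(car)
vif(model)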
6. Influential data points: A data point may be an outlier without affecting the regression line, and it may have leverage without, on its own, influencing the regression line. A point that is both an outlier and has high leverage, however, becomes an influential data point. We check for influential points using Cook's distance: a Cook's distance of 1 or greater marks a point as influential.
For our model, the residual statistics in Table 6.2 below show a maximum Cook's distance of 0.205, so there are no influential points in the data that need to be removed. I have also individually checked the Cook's distances for each of the independent variables and found no point of significant influence (Table 6.1).
Table 6.1
Table 6.2
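A minimal R equivalent of this check, assuming the model fitted earlier:

# Cook's distance for every observation; values of 1 or more would be influential.
cd <- cooks.distance(model)
max(cd)          # reported maximum is about 0.205
which(cd >= 1)   # indices of influential points (expected to be empty here)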
Model Evaluation and Selection
To evaluate our regression model, we look at the summary output of the model.
Table 7
Here we look at the R square value of the model, which is 0.689. R square measures the proportion of variance in the predicted variable that is explained by the model, so our model explains 68.9% of the variance in the predicted values. Adjusted R square is a modified version of R square that penalizes the model for introducing additional independent variables. Both R square and adjusted R square lie between 0 and 1, and the closer the value is to 1, the better the model is at predicting the actual values.
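These fit statistics can be read directly off the fitted model object in R (a sketch; the report takes them from the SPSS summary in Table 7).

# Overall fit statistics for the full four-predictor model.
s <- summary(model)
s$r.squared        # R square, about 0.689
s$adj.r.squared    # adjusted R square, about 0.682
s$fstatistic       # F statistic tested in the ANOVA table below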
ANOVA
The F statistic tells us whether our model predicts values better than simply using the mean. The significance value tests the null hypothesis that all the coefficients are equal to zero. As the significance value is < 0.001, we can reject the null hypothesis that all coefficients are zero.
Table 8
Evaluating independent variables:
A regression coefficient (β) gives the amount by which the dependent variable changes for a one-unit change in one independent variable, with all other independent variables held constant. From Table 9 we can read the unstandardized coefficients (B) of all the independent variables, and the y-intercept, from the column 'Unstandardized B'.
The table also shows that only 'pm25' and 'uhc' are statistically significant at the 95% confidence level (Sig. < 0.05). Based on this, we can remove 'alc_consumption' and 'population' from our regression equation.
Our regression equation for predicting life expectancy (Y) is therefore:
Y = 11.176 - 0.024*pm25 + 0.147*uhc
Table 9
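As a worked example of applying this equation, the sketch below wraps the reported coefficients in a small R function; the pm25 and uhc values passed to it are hypothetical.

# Predicted life expectancy at age 60 from the reduced model.
predict_life_exp <- function(pm25, uhc) {
  11.176 - 0.024 * pm25 + 0.147 * uhc
}
predict_life_exp(pm25 = 20, uhc = 70)   # 11.176 - 0.48 + 10.29 = 20.986 years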
Summary: A multiple regression model was applied to our dataset of 182 records to predict life expectancy (life_exp) from the independent variables alcohol consumption (alc_consumption), pollution (pm25), population, and universal health coverage (uhc).
A preliminary analysis checking the assumptions of multiple linear regression, such as multicollinearity and homoscedasticity, found all of the assumptions to be satisfied.
The model produced an adjusted R squared value of 0.682, and at the 95% confidence level only two of the variables, uhc and pm25, were found to be significant, with coefficients of 0.147 and -0.024 respectively.
Logistic Regression Analysis
Objective:
The objective of this analysis is to apply binary logistic regression to predict the binary outcome variable 'life_exp_binary' (full information on the data below), to check whether our model satisfies the assumptions of the method, and to perform diagnostics if it does not.
Based on the results obtained, we will further evaluate the model using methods such as the Hosmer-Lemeshow test and the classification table.
Background on data:
For the logistic regression analysis we reuse the same dataset, sourced from the 'who.int' website, that was used for the multiple linear regression. The outcome variable 'life_exp' has been converted in R into a binary variable 'life_exp_binary' based on the median of 'life_exp', as below (see the sketch after this list):
- life_exp_binary = 1 if life_exp > median(life_exp) (indicates a high life expectancy)
- life_exp_binary = 0 if life_exp <= median(life_exp) (indicates a low life expectancy)
All of the other independent variables are again used to predict the outcome variable 'life_exp_binary'.
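A minimal R sketch of this conversion and of fitting the corresponding binomial model (the report itself builds the model in SPSS, so this is for illustration only):

# Binary outcome: 1 = above-median life expectancy, 0 = at or below the median.
data$life_exp_binary <- as.integer(data$life_exp > median(data$life_exp))

# Binomial logistic regression with the same four predictors.
logit_model <- glm(life_exp_binary ~ alc_consumption + pm25 + population + uhc,
                   data = data, family = binomial)
summary(logit_model)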
Assumptions of Logistic Regression Analysis:
1. Sample size: Logistic regression assumes a sample of at least 60 cases and at least 20 cases per predictor variable. With 4 predictor variables our model needs a minimum sample size of 80, which is met, since our sample size is 182.
2. Multicollinearity: As we are working with the same dataset used in the multiple linear regression analysis, we can say there is no multicollinearity in the data, based on the earlier analysis.
3. Outliers: For the same reason, we can say there are no influential outliers in the data, based on the checks conducted for the multiple regression model.
Model Evaluation:
To evaluate our logistic regression model, we look at the following:
1. Block 0: Block 0 is the null (baseline) model against which our final model is compared; it contains no independent variables. Table 10 below shows that this null model has a classification accuracy of 52.2% when no predictor variables are used.
Table 10
2. Omnibus test (Block 1): Block 1 is the model containing all the independent variables. The omnibus test indicates whether the full model improves on the null model. Here p < 0.001 (Sig.), which means the full model is an improvement over the null model, i.e. adding the predictors enhances the model.
Table 11
3. Model Summary: From the model summary we can estimate that the model explains between 50.5% and 67.4% of the variance in the predicted variable, using the Cox & Snell and Nagelkerke R square statistics, which are analogous to the R square statistic used in linear regression.
Table 12
4. Hosmer-Lemeshow test: This is an indicator of the goodness of fit of the model. For the Hosmer-Lemeshow test, non-significance indicates a good fit, so for our model p (Sig.) should be greater than 0.05, as can be seen in the table below. With a Sig. value of 0.3, our model proves to be a good fit.
Table 13
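An equivalent test is available in R, for example through the ResourceSelection package (an assumption on my part; the report reads the result from the SPSS output in the table above).

# Hosmer-Lemeshow goodness-of-fit test on the fitted probabilities (10 groups).
library(ResourceSelection)
hoslem.test(logit_model$y, fitted(logit_model), g = 10)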
5. Classification Table: From the classification table we can check the accuracy, specificity, sensitivity, etc. of the model. Table 13 below shows that our full model (Block 1) has an improved accuracy (proportion of correctly predicted values) of 79.7%, a considerable improvement over the null model (Block 0) value of 52.2%.
Table 13
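The same classification accuracy can be recomputed in R from the fitted probabilities, using the standard 0.5 cut-off:

# Classification table and overall accuracy at the 0.5 probability cut-off.
pred <- as.integer(fitted(logit_model) >= 0.5)
table(Observed = data$life_exp_binary, Predicted = pred)
mean(pred == data$life_exp_binary)   # about 0.797, as reported in the text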
6. Interpretation of variables in the model: Table 14 shows the influence and importance of each variable in the logistic regression model. Here we use the Wald statistic, which plays a role similar to the t-statistic in linear regression, to check the significance of the independent variables: if Sig. < 0.05, a predictor is significant at the 95% confidence level.
From Table 14 we can see that uhc and pm25 are the only significant predictors, so we can drop the other two variables, population and alc_consumption, from our model.
The column Exp(B) in the table gives the odds ratio of each predictor. If the odds ratio is greater than 1, the odds of the outcome occurring increase as the value of the predictor increases; if the odds ratio is less than 1, the odds of the outcome occurring decrease as the predictor increases. For example, the odds of having a high life expectancy are multiplied by 1.207 for each one-unit increase in the country's universal health coverage (uhc).
Table 14
Based on the above, we can form the following equation for our model, where Y is the predicted probability of a high life expectancy:
Y = e^(-10.86 + 0.188*uhc - 0.046*pm25) / (1 + e^(-10.86 + 0.188*uhc - 0.046*pm25))
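The odds ratios in Table 14 are simply the exponentiated coefficients, and the equation above can be evaluated directly; the sketch below does both in R, with hypothetical uhc and pm25 values for the worked prediction.

# Odds ratios: exponentiate the logistic regression coefficients.
exp(coef(logit_model))

# Predicted probability of a high life expectancy from the reported equation.
p_high <- function(uhc, pm25) {
  eta <- -10.86 + 0.188 * uhc - 0.046 * pm25
  exp(eta) / (1 + exp(eta))
}
p_high(uhc = 70, pm25 = 20)   # about 0.80 for this hypothetical country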
Summary: By applying a binary logistic regression model to our dataset to predict the life_exp_binary variable, we were able to correctly classify 79.7% of the cases. We were also able to conclude, with 95% confidence, that only the uhc and pm25 variables were significant predictors of the outcome variable.