[M3A4] Data Analysis and Interpretation Specialization

DATA ANALYSIS COLLECTION
ASSIGNMENT
Data Analysis And Interpretation Specialization
Test A Logistic Regression Model
Andrea Rubio Amorós
June 15, 2017
Module 3
Assignment 4

1 Introduction
This assignment covers some points to keep in mind when continuing to use data analysis in the future, shows how to
test a categorical explanatory variable with more than two categories in a multiple regression analysis, and introduces
logistic regression analysis for a binary response variable with multiple explanatory variables. Logistic regression is
closely related to the linear regression model: it models the log odds of the response as a linear function of the
explanatory variables, so the basic idea is the same as in a multiple regression analysis.
Unlike the multiple regression model, however, the logistic regression model is designed for binary response variables.
I will gain experience testing and interpreting a logistic regression model, including using odds ratios and confidence
intervals to determine the magnitude of the association between the explanatory variables and the response variable.
The same explanatory variables used to test the multiple regression model with a quantitative outcome can be used
again, but the response variable needs to be binary (categorical with 2 categories). A quantitative response variable
has to be binned into 2 categories. Alternatively, a different binary response variable from the data set can be chosen
to test the logistic regression model. A categorical response variable with more than two categories has to be
collapsed into two categories.
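As an illustration of the two preparation steps described above, the following minimal sketch bins a quantitative variable at a cut-off and collapses a multi-category variable into two categories. The column names, the example values and the cut-off of 5.0 are hypothetical and are not taken from the AddHealth data.

```python
import pandas as pd

# Hypothetical example data (not from AddHealth)
df = pd.DataFrame({
    "score": [2.0, 7.5, 5.0, None, 9.1],
    "frequency": ["never", "rarely", "often", "always", "never"],
})

# Bin a quantitative variable into 2 categories at an assumed cut-off.
# Missing values stay missing instead of silently becoming 0.
threshold = 5.0
df["score_high"] = (df["score"] >= threshold).astype(float).where(df["score"].notna())

# Collapse a categorical variable with more than two categories into two.
collapse = {"never": 0, "rarely": 1, "often": 1, "always": 1}
df["any_trouble"] = df["frequency"].map(collapse)

print(df)
```

Keeping missing values as NaN leaves the decision about excluding them (for example with dropna) explicit rather than hiding them in the "no" category.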

2 Python Code
For the last assignment of this module, I will use the AddHealth dataset to test a logistic regression model with multiple
explanatory variables and a categorical, binary response variable.
First of all, I import all required libraries and use pandas to read in the data set. Then, I set all the variables to numeric
and recode them to binary (1 = yes and 0 = no).
To reduce the loading time, I create a new dataset called mydata, only with the variables that I’m going to work with.
# import libraries
import pandas
import numpy
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
# reading in the data set we want to work with
# (working_folder is assumed to be defined beforehand as the path to the data directory)
data = pandas.read_csv(working_folder+"M3A4data_addhealth_pds.csv",low_memory=False)
# setting variables to numeric
data['H1GH23J'] = pandas.to_numeric(data['H1GH23J'], errors='coerce')
data['H1DA8'] = pandas.to_numeric(data['H1DA8'], errors='coerce')
data['H1DA5'] = pandas.to_numeric(data['H1DA5'], errors='coerce')
data['H1GH52'] = pandas.to_numeric(data['H1GH52'], errors='coerce')
data['H1ED16'] = pandas.to_numeric(data['H1ED16'], errors='coerce')
# recode variable observations to 0=no, 1=yes
# note: a missing value (NaN) fails every comparison below and is therefore recoded to 0
def NOBREAKFAST(x):
    if x['H1GH23J'] == 1:
        return 1
    else:
        return 0
data['NOBREAKFAST'] = data.apply(lambda x: NOBREAKFAST(x), axis = 1)
def WATCHTV(x):
    if x['H1DA8'] >= 1:
        return 1
    else:
        return 0
data['WATCHTV'] = data.apply(lambda x: WATCHTV(x), axis = 1)
def PLAYSPORT(x):
    if x['H1DA5'] >= 1:
        return 1
    else:
        return 0
data['PLAYSPORT'] = data.apply(lambda x: PLAYSPORT(x), axis = 1)
def ENOUGHSLEEP(x):
    if x['H1GH52'] == 1:
        return 1
    else:
        return 0
data['ENOUGHSLEEP'] = data.apply(lambda x: ENOUGHSLEEP(x), axis = 1)
def TROUBLEPAYATT(x):
    if x['H1ED16'] >= 1:
        return 1
    else:
        return 0
data['TROUBLEPAYATT'] = data.apply(lambda x: TROUBLEPAYATT(x), axis = 1)
# create a personalized dataset only with the chosen variables for this research
mydata = data[['NOBREAKFAST','WATCHTV','PLAYSPORT','ENOUGHSLEEP','TROUBLEPAYATT']].dropna()
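Because the row-wise recoding functions return 0 for missing values, the final dropna() has nothing left to drop. A possible alternative is sketched below: a vectorized helper that keeps missing values as NaN. The helper name recode_binary, the example values and the codes 6 and 8 used as "refused / don't know" markers are assumptions for illustration only; the actual missing-value codes have to be taken from the codebook.

```python
import pandas as pd

def recode_binary(series, is_yes, missing_codes=()):
    """Return 1.0 where is_yes(values) holds, 0.0 otherwise, and NaN for missing values."""
    values = pd.to_numeric(series, errors="coerce")
    # Treat the given survey codes as missing (hypothetical codes, check the codebook)
    values = values.where(~values.isin(list(missing_codes)))
    recoded = is_yes(values).astype(float)
    # Restore NaN where the original answer was missing
    return recoded.where(values.notna())

# Hypothetical raw answers: 1 = yes, 0 = no, 6 = refused, None = no answer
raw = pd.Series([1, 0, 6, None, 1])
nobreakfast = recode_binary(raw, lambda v: v == 1, missing_codes=(6, 8))
print(nobreakfast)
```

With this kind of recoding, a later dropna() removes the respondents without a valid answer, so the number of observations in the model may be smaller than the full sample.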

Explanatory variables:
• NOBREAKFAST = Have nothing for breakfast (1 = yes and 0 = no)
• WATCHTV = Watch TV (1 = yes and 0 = no)
• PLAYSPORT = Play an active sport (1 = yes and 0 = no)
• ENOUGHSLEEP = Have enough sleep hours (1 = yes and 0 = no)
Response variable:
• TROUBLEPAYATT = Have trouble paying attention at school (1 = yes and 0 = no)
Research question: Are those having nothing for breakfast more or less likely to have trouble paying attention at
school? To answer that question, I will use the logit function, setting "TROUBLEPAYATT" as my response variable
and "NOBREAKFAST" as explanatory variable. In addition, I will use odds ratios to compare the odds of having
trouble paying attention at school for students having nothing for breakfast vs. students having breakfast.
# logistic regression model
lreg1 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST', data=mydata).fit()
print(lreg1.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
                           Logit Regression Results
==============================================================================
Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
Model:                          Logit   Df Residuals:                     6502
Method:                           MLE   Df Model:                            1
Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                0.002436
Time:                        17:34:11   Log-Likelihood:                -3564.0
converged:                       True   LL-Null:                       -3572.7
                                        LLR p-value:                 3.014e-05
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       1.1030      0.032     34.485      0.000       1.040       1.166
NOBREAKFAST     0.3169      0.078      4.088      0.000       0.165       0.469
===============================================================================
Odds Ratios
             Lower CI  Upper CI        OR
Intercept    2.829976  3.207982  3.013057
NOBREAKFAST  1.179349  1.598154  1.372874
The generated output indicates:
• Number of observations: 6504
• The p-value of "NOBREAKFAST" is lower than the α-level of 0.05: its association with "TROUBLEPAYATT" is
statistically significant.
• The coefficient of "NOBREAKFAST" is positive: having nothing for breakfast is associated with higher odds of
having trouble paying attention.
• Interpretation of the odds ratio: the odds of having trouble paying attention at school are 1.37 times higher for
students having nothing for breakfast than for students having breakfast.
• Confidence interval: we can be 95% confident that the population odds ratio falls between 1.18 and 1.60.
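The relation between the coefficients, the odds ratio and the predicted probabilities can be checked by hand. The short sketch below uses the rounded coefficients from the summary table above (so the results differ slightly from the unrounded output) and the standard logistic function; the function name probability is my own choice.

```python
import math

# Rounded coefficients from the summary table of lreg1
intercept = 1.1030
b_nobreakfast = 0.3169

# The odds ratio is the exponential of the coefficient (about 1.37)
odds_ratio = math.exp(b_nobreakfast)

def probability(log_odds):
    """Convert log odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Predicted probability of trouble paying attention
p_breakfast = probability(intercept)                     # NOBREAKFAST = 0, about 0.75
p_no_breakfast = probability(intercept + b_nobreakfast)  # NOBREAKFAST = 1, about 0.81

print(round(odds_ratio, 3), round(p_breakfast, 3), round(p_no_breakfast, 3))
```

The odds differ by a factor of about 1.37, while the predicted probabilities differ much less (roughly 0.75 vs. 0.81). This is why the odds ratio is read as a ratio of odds and not as a ratio of probabilities.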

I will now add a second explanatory variable, "ENOUGHSLEEP", to the model and study the results.
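lreg2 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP', data=mydata).fit()
print(lreg2.summary())
# odds ratios with 95% confidence intervals, computed as for lreg1
params, conf = lreg2.params, lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))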
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       1.7497      0.072     24.392      0.000       1.609       1.890
NOBREAKFAST     0.2243      0.079      2.852      0.004       0.070       0.378
ENOUGHSLEEP    -0.8107      0.077    -10.571      0.000      -0.961      -0.660
===============================================================================
Odds Ratios
             Lower CI  Upper CI        OR
Intercept    4.998263  6.621168  5.752768
NOBREAKFAST  1.072676  1.459978  1.251432
ENOUGHSLEEP  0.382490  0.516639  0.444533
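• The p-value of "ENOUGHSLEEP" is lower than the α-level of 0.05: its association with "TROUBLEPAYATT" is
statistically significant.
• The coefficient of "ENOUGHSLEEP" is negative.
• Interpretation of the odds ratio: after adjusting for "NOBREAKFAST", the odds of having trouble paying attention
at school for students having enough sleep hours are 0.44 times the odds for students not having enough sleep
hours, i.e. about 56% lower.
• Confidence interval: we can be 95% confident that the population odds ratio falls between 0.38 and 0.52.
• "NOBREAKFAST" remains significant (p = 0.004) after adjusting for "ENOUGHSLEEP", but its odds ratio drops
from 1.37 to 1.25.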
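Since the odds ratio table is computed the same way for every model, the steps can be wrapped in a small helper. The sketch below is one possible way to do it; the function name odds_ratio_table and the extra "% change in odds" column are my own additions. To keep the example self-contained, it is fed with the rounded coefficients and confidence limits from the table above instead of a fitted model.

```python
import numpy as np
import pandas as pd

def odds_ratio_table(params, conf_int):
    """Exponentiate coefficients and confidence limits into an odds ratio table."""
    table = conf_int.copy()
    table.columns = ["Lower CI", "Upper CI"]
    table["OR"] = params
    table = np.exp(table)
    # Percentage change in the odds for a one-unit increase of the variable
    table["% change in odds"] = (table["OR"] - 1.0) * 100.0
    return table

# Rounded values copied from the summary table of the second model
params = pd.Series({"Intercept": 1.7497, "NOBREAKFAST": 0.2243, "ENOUGHSLEEP": -0.8107})
conf_int = pd.DataFrame(
    {0: [1.609, 0.070, -0.961], 1: [1.890, 0.378, -0.660]},
    index=params.index,
)

table = odds_ratio_table(params, conf_int)
print(table.round(3))
```

With a fitted statsmodels model, the same helper would be called as odds_ratio_table(lreg2.params, lreg2.conf_int()). An odds ratio below 1, such as the one for "ENOUGHSLEEP", shows up as a negative percentage change in the odds.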

To conclude, in order to check for possible confounding, I will add more explanatory variables to the model.
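lreg3 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP + WATCHTV + PLAYSPORT',
                  data=mydata).fit()
print(lreg3.summary())
# odds ratios with 95% confidence intervals, computed as for lreg1
params, conf = lreg3.params, lreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))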
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       1.3389      0.205      6.547      0.000       0.938       1.740
NOBREAKFAST     0.2338      0.079      2.963      0.003       0.079       0.388
ENOUGHSLEEP    -0.8226      0.077    -10.688      0.000      -0.973      -0.672
WATCHTV         0.3679      0.195      1.882      0.060      -0.015       0.751
PLAYSPORT       0.0829      0.065      1.277      0.202      -0.044       0.210
===============================================================================
Odds Ratios
             Lower CI  Upper CI        OR
Intercept    2.555077  5.695756  3.814852
NOBREAKFAST  1.082392  1.474696  1.263408
ENOUGHSLEEP  0.377780  0.510817  0.439291
WATCHTV      0.984947  2.119010  1.444684
PLAYSPORT    0.956615  1.233856  1.086428
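• The p-values of "WATCHTV" (0.060) and "PLAYSPORT" (0.202) exceed the α-level of 0.05: their associations with
"TROUBLEPAYATT" are not statistically significant, and their 95% confidence intervals for the odds ratio include 1.
• A non-significant p-value does not by itself make a variable a confounder. The odds ratio of "NOBREAKFAST" barely
changes when both variables are added (1.25 vs. 1.26) and remains significant, so there is no evidence that
"WATCHTV" or "PLAYSPORT" confound the association between "NOBREAKFAST" and "TROUBLEPAYATT".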
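One common way to look for confounding is to compare the coefficient of the main explanatory variable before and after adding a further variable. The sketch below does this for "NOBREAKFAST" using the rounded coefficients of the three models reported above. The helper name percent_change is my own, and the 10% change threshold mentioned afterwards is only a rule of thumb, not a formal test.

```python
# Rounded NOBREAKFAST coefficients from the three models
coef_model1 = 0.3169  # NOBREAKFAST only
coef_model2 = 0.2243  # + ENOUGHSLEEP
coef_model3 = 0.2338  # + ENOUGHSLEEP + WATCHTV + PLAYSPORT

def percent_change(before, after):
    """Relative change of a coefficient in percent."""
    return (after - before) / before * 100.0

# Effect of adding ENOUGHSLEEP: about -29%
change_sleep = percent_change(coef_model1, coef_model2)
# Effect of adding WATCHTV and PLAYSPORT: about +4%
change_tv_sport = percent_change(coef_model2, coef_model3)

print(round(change_sleep, 1), round(change_tv_sport, 1))
```

Under the 10% rule of thumb, the drop of about 29% after adding "ENOUGHSLEEP" suggests that sleep accounts for part of the association between "NOBREAKFAST" and "TROUBLEPAYATT", whereas the change of about 4% after adding "WATCHTV" and "PLAYSPORT" gives little indication of confounding by those two variables.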

3 Codebook