DATA ANALYSIS COLLECTION
ASSIGNMENT
Data Analysis And Interpretation Specialization
Test A Logistic Regression Model
Andrea Rubio Amorós
June 15, 2017
Module 3
Assignment 4
Data Analysis And Interpretation Specialization
Test A Logistic Regression Model M3A4
1 Introduction
This module discusses some points to keep in mind when continuing to use data analysis in the future, shows how to
test a categorical explanatory variable with more than two categories in a multiple regression analysis, and introduces
logistic regression analysis for a binary response variable with multiple explanatory variables. Logistic regression is
closely related to the linear regression model: the explanatory variables enter through the same kind of linear
combination, so the basic idea is the same as in a multiple regression analysis.
Unlike the multiple regression model, however, the logistic regression model is designed for binary response variables,
and it models the log-odds of the response rather than the response itself.
In this assignment, I will gain experience testing and interpreting a logistic regression model, including using odds
ratios and confidence intervals to determine the magnitude of the association between the explanatory variables and
the response variable.
The same explanatory variables used to test a multiple regression model with a quantitative outcome can be used
here, but the response variable needs to be binary (categorical with 2 categories). A quantitative response variable
has to be binned into 2 categories, and a categorical response variable with more than two categories has to be
collapsed into two categories. Alternatively, a different binary response variable can be chosen from the data set.
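The binning and collapsing described above can be sketched as follows. This is a minimal illustration, assuming pandas is available; the column names hours_tv and attention, the cut-off of 10 hours, and the sample values are hypothetical and are not taken from the AddHealth data.

```python
import pandas as pd

# Hypothetical sample data (not from the AddHealth data set)
df = pd.DataFrame({
    "hours_tv": [0, 3, 12, 25, 8],   # quantitative variable
    "attention": [0, 1, 4, 2, 0],    # categorical variable, 0 = never ... 4 = every day
})

# Bin a quantitative variable into 2 categories using an illustrative cut-off
df["HEAVYTV"] = (df["hours_tv"] > 10).astype(int)

# Collapse a multi-category variable into 2 categories:
# 0 = never, 1 = at least sometimes
df["ANYTROUBLE"] = (df["attention"] >= 1).astype(int)

print(df)
```

The choice of cut-off changes what the binary variable means, so it should be motivated by the research question rather than picked after looking at the results.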
Document written in LATEX
template_version_01.tex
2 Python Code
For the last assignment of this module, I will use the AddHealth dataset to test a logistic regression model with multiple
explanatory variables and a categorical, binary response variable.
First of all, I import all required libraries and use pandas to read in the data set. Then, I set all the variables to numeric
and recode them to binary (1 = yes and 0 = no).
To reduce the loading time, I create a new dataset called mydata that contains only the variables I'm going to work with.
# import libraries
import pandas
import numpy
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
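# reading in the data set we want to work with
# working_folder is the path to the folder holding the CSV file and must be defined first;
# the empty string below is an assumption meaning "current directory"
working_folder = ""
data = pandas.read_csv(working_folder + "M3A4data_addhealth_pds.csv", low_memory=False)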
# setting variables to numeric
data['H1GH23J'] = pandas.to_numeric(data['H1GH23J'], errors='coerce')
data['H1DA8'] = pandas.to_numeric(data['H1DA8'], errors='coerce')
data['H1DA5'] = pandas.to_numeric(data['H1DA5'], errors='coerce')
data['H1GH52'] = pandas.to_numeric(data['H1GH52'], errors='coerce')
data['H1ED16'] = pandas.to_numeric(data['H1ED16'], errors='coerce')
# recode variable observations to 0=no, 1=yes
# note: a missing value (NaN) fails every comparison below, so it is recoded to 0
def NOBREAKFAST(x):
    if x['H1GH23J'] == 1:
        return 1
    else:
        return 0
data['NOBREAKFAST'] = data.apply(lambda x: NOBREAKFAST(x), axis = 1)
def WATCHTV(x):
    if x['H1DA8'] >= 1:
        return 1
    else:
        return 0
data['WATCHTV'] = data.apply(lambda x: WATCHTV(x), axis = 1)
def PLAYSPORT(x):
    if x['H1DA5'] >= 1:
        return 1
    else:
        return 0
data['PLAYSPORT'] = data.apply(lambda x: PLAYSPORT(x), axis = 1)
def ENOUGHSLEEP(x):
    if x['H1GH52'] == 1:
        return 1
    else:
        return 0
data['ENOUGHSLEEP'] = data.apply(lambda x: ENOUGHSLEEP(x), axis = 1)
def TROUBLEPAYATT(x):
    if x['H1ED16'] >= 1:
        return 1
    else:
        return 0
data['TROUBLEPAYATT'] = data.apply(lambda x: TROUBLEPAYATT(x), axis = 1)
# create a personalized dataset only with the chosen variables for this research
mydata = data[['NOBREAKFAST','WATCHTV','PLAYSPORT','ENOUGHSLEEP','TROUBLEPAYATT']].dropna()
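Because the row-wise recoding functions return 0 whenever a comparison fails, missing values end up coded as 0 and the final dropna() has nothing to remove. The sketch below shows one possible vectorized alternative that keeps missing inputs missing. It is not the method used in this assignment and would change the number of observations; the helper name recode_binary and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def recode_binary(series, is_yes):
    """Return 1.0 where is_yes holds, 0.0 where it does not, NaN where the input is missing."""
    numeric = pd.to_numeric(series, errors="coerce")
    coded = is_yes(numeric).astype(float)
    return coded.where(numeric.notna())  # keep missing inputs missing

# Hypothetical sample values, including one non-numeric entry and one missing entry
raw = pd.DataFrame({"H1GH52": [1, 0, " ", 1], "H1DA8": [0, 14, 3, None]})

recoded = pd.DataFrame({
    "ENOUGHSLEEP": recode_binary(raw["H1GH52"], lambda s: s == 1),
    "WATCHTV": recode_binary(raw["H1DA8"], lambda s: s >= 1),
})

# Now dropna() actually removes the rows with a missing answer
complete = recoded.dropna()
print(complete)
```

Whether missing answers should count as "no" or be excluded is an analysis decision; the sketch only makes that decision explicit.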
Explanatory variables:
• NOBREAKFAST = Have nothing for breakfast (1 = yes and 0 = no)
• WATCHTV = Watch TV (1 = yes and 0 = no)
• PLAYSPORT = Play an active sport (1 = yes and 0 = no)
• ENOUGHSLEEP = Have enough sleep hours (1 = yes and 0 = no)
Response variable:
• TROUBLEPAYATT = Have trouble paying attention at school (1 = yes and 0 = no)
Research question: Are those having nothing for breakfast more or less likely to have trouble paying attention at
school? To answer that question, I will use the logit function, setting "TROUBLEPAYATT" as my response variable
and "NOBREAKFAST" as explanatory variable. In addition, I will use odds ratios to compare the odds of having
trouble paying attention at school when having vs. not having breakfast.
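For reference, the model fitted in this step can be written as below, where p is the probability that TROUBLEPAYATT = 1, and β0 and β1 denote the intercept and the NOBREAKFAST coefficient. The symbols are notation introduced here for illustration and are not taken from the course material.

```latex
\[
\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\mathrm{NOBREAKFAST},
\qquad
\mathrm{OR} = \frac{\mathrm{odds}(\mathrm{NOBREAKFAST}=1)}{\mathrm{odds}(\mathrm{NOBREAKFAST}=0)} = e^{\beta_1}
\]
```

This is why the odds ratios are obtained by exponentiating the estimated coefficients and their confidence limits.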
# logistic regression model
lreg1 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST', data=mydata).fit()
print(lreg1.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
                           Logit Regression Results
==============================================================================
Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
Model:                          Logit   Df Residuals:                     6502
Method:                           MLE   Df Model:                            1
Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                0.002436
Time:                        17:34:11   Log-Likelihood:                -3564.0
converged:                       True   LL-Null:                       -3572.7
                                        LLR p-value:                 3.014e-05
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       1.1030      0.032     34.485      0.000       1.040       1.166
NOBREAKFAST     0.3169      0.078      4.088      0.000       0.165       0.469
===============================================================================
Odds Ratios
             Lower CI  Upper CI        OR
Intercept    2.829976  3.207982  3.013057
NOBREAKFAST  1.179349  1.598154  1.372874
The generated output indicates:
• Number of observations: 6504
• The p-value of "NOBREAKFAST" is lower than the α-level of 0.05: the association with "TROUBLEPAYATT" is
statistically significant.
• The coefficient of "NOBREAKFAST" is positive: having nothing for breakfast is associated with higher odds of
having trouble paying attention.
• Interpretation of the odds ratio: the odds of having trouble paying attention at school are 1.37 times higher for
students having nothing for breakfast than for students having breakfast.
• Confidence interval: we can be 95% confident that the population odds ratio falls between 1.18 and 1.60.
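An odds ratio compares odds, not probabilities. As a small worked sketch, the rounded coefficients from the summary above can be converted into an odds ratio and into predicted probabilities for both groups. The function name probability is a helper introduced here for illustration.

```python
import math

# Coefficients copied from the summary above (rounded to 4 decimals)
intercept = 1.1030
b_nobreakfast = 0.3169

def probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Odds ratio for NOBREAKFAST = 1 versus NOBREAKFAST = 0
odds_ratio = math.exp(b_nobreakfast)

# Predicted probability of trouble paying attention in each group
p_breakfast = probability(intercept)
p_no_breakfast = probability(intercept + b_nobreakfast)

print(f"odds ratio: {odds_ratio:.2f}")
print(f"P(trouble | breakfast): {p_breakfast:.3f}")
print(f"P(trouble | no breakfast): {p_no_breakfast:.3f}")
```

With these rounded coefficients the predicted probability is about 0.75 with breakfast and about 0.81 without, so the probabilities differ by a factor of roughly 1.07 even though the odds differ by a factor of 1.37.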
I will now add a second explanatory variable, "ENOUGHSLEEP", to the model and study the results.
# logistic regression model
lreg2 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP', data=mydata).fit()
print(lreg2.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
                           Logit Regression Results
==============================================================================
Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
Model:                          Logit   Df Residuals:                     6501
Method:                           MLE   Df Model:                            2
Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                 0.01991
Time:                        17:34:11   Log-Likelihood:                -3501.6
converged:                       True   LL-Null:                       -3572.7
                                        LLR p-value:                 1.272e-31
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       1.7497      0.072     24.392      0.000       1.609       1.890
NOBREAKFAST     0.2243      0.079      2.852      0.004       0.070       0.378
ENOUGHSLEEP    -0.8107      0.077    -10.571      0.000      -0.961      -0.660
===============================================================================
Odds Ratios
             Lower CI  Upper CI        OR
Intercept    4.998263  6.621168  5.752768
NOBREAKFAST  1.072676  1.459978  1.251432
ENOUGHSLEEP  0.382490  0.516639  0.444533
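The generated output indicates:
• The p-value of "ENOUGHSLEEP" is lower than the α-level of 0.05: the association with "TROUBLEPAYATT" is
statistically significant.
• The coefficient of "ENOUGHSLEEP" is negative: having enough sleep is associated with lower odds of having
trouble paying attention.
• Interpretation of the odds ratio: after controlling for "NOBREAKFAST", the odds of having trouble paying
attention at school for students having enough sleep hours are 0.44 times the odds for students not having enough
sleep hours, i.e. about 56% lower.
• Confidence interval: we can be 95% confident that the population odds ratio falls between 0.38 and 0.52.
• "NOBREAKFAST" remains significant (p = 0.004) after adding "ENOUGHSLEEP", with an odds ratio of 1.25.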
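The odds ratio table is built with the same few lines for every model, so it could be wrapped in a small helper. The sketch below is one possible way to do that, assuming pandas and numpy; the function name odds_ratio_table is hypothetical, and the demonstration uses the rounded values printed for lreg2 instead of a fitted model.

```python
import numpy as np
import pandas as pd

def odds_ratio_table(params, conf_int):
    """Exponentiate coefficients and their confidence bounds into an odds ratio table."""
    table = conf_int.copy()
    table.columns = ["Lower CI", "Upper CI"]
    table["OR"] = params  # aligned on the coefficient names
    return np.exp(table)

# Rounded values copied from the lreg2 summary above
params = pd.Series({"Intercept": 1.7497, "NOBREAKFAST": 0.2243, "ENOUGHSLEEP": -0.8107})
conf_int = pd.DataFrame(
    {0: [1.609, 0.070, -0.961], 1: [1.890, 0.378, -0.660]},
    index=params.index,
)

or_table = odds_ratio_table(params, conf_int)
print(or_table.round(2))
```

With a fitted model, the same helper would be called as odds_ratio_table(lreg2.params, lreg2.conf_int()), which mirrors the steps used in the code above.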
To conclude, in order to check for possible confounding, I will add more explanatory variables to the model.
# logistic regression model
lreg3 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP + WATCHTV + PLAYSPORT',
                  data=mydata).fit()
print(lreg3.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg3.params
conf = lreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
                           Logit Regression Results
==============================================================================
Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
Model:                          Logit   Df Residuals:                     6499
Method:                           MLE   Df Model:                            4
Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                 0.02063
Time:                        17:34:11   Log-Likelihood:                -3499.0
converged:                       True   LL-Null:                       -3572.7
                                        LLR p-value:                 7.243e-31
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       1.3389      0.205      6.547      0.000       0.938       1.740
NOBREAKFAST     0.2338      0.079      2.963      0.003       0.079       0.388
ENOUGHSLEEP    -0.8226      0.077    -10.688      0.000      -0.973      -0.672
WATCHTV         0.3679      0.195      1.882      0.060      -0.015       0.751
PLAYSPORT       0.0829      0.065      1.277      0.202      -0.044       0.210
===============================================================================
Odds Ratios
             Lower CI  Upper CI        OR
Intercept    2.555077  5.695756  3.814852
NOBREAKFAST  1.082392  1.474696  1.263408
ENOUGHSLEEP  0.377780  0.510817  0.439291
WATCHTV      0.984947  2.119010  1.444684
PLAYSPORT    0.956615  1.233856  1.086428
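The generated output indicates:
• The p-values of "WATCHTV" (0.060) and "PLAYSPORT" (0.202) exceed the α-level of 0.05: their associations with
"TROUBLEPAYATT" are not statistically significant after controlling for the other variables.
• "NOBREAKFAST" and "ENOUGHSLEEP" remain significant, and their odds ratios barely change (1.25 to 1.26 and
0.44 to 0.44), so there is no evidence that "WATCHTV" or "PLAYSPORT" confound these associations.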
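One way to check for confounding is to compare the odds ratios of the original explanatory variables before and after the extra variables are added, and to test whether the larger model fits better. The sketch below does both with the rounded numbers printed above, so the results are approximate. The 10% change-in-estimate threshold mentioned after the code is a common rule of thumb and an assumption here, not something taken from the course material.

```python
import math

# Odds ratios copied from the lreg2 and lreg3 outputs above
or_before = {"NOBREAKFAST": 1.251432, "ENOUGHSLEEP": 0.444533}
or_after = {"NOBREAKFAST": 1.263408, "ENOUGHSLEEP": 0.439291}

# Change-in-estimate check: percent change in each odds ratio after adjustment
pct_change = {
    name: 100.0 * (or_after[name] - or_before[name]) / or_before[name]
    for name in or_before
}

# Likelihood ratio test for the two added variables
# (log-likelihoods as printed in the summaries, rounded to one decimal)
ll_reduced, ll_full, df_added = -3501.6, -3499.0, 2
lr_stat = 2.0 * (ll_full - ll_reduced)

# For exactly 2 degrees of freedom the chi-square survival function is exp(-x / 2)
p_value = math.exp(-lr_stat / 2.0)

for name, change in pct_change.items():
    print(f"{name}: {change:+.2f}% change in odds ratio")
print(f"LR statistic: {lr_stat:.1f} on {df_added} df, p = {p_value:.3f}")
```

Both odds ratios change by roughly 1%, far below a 10% threshold, and the likelihood ratio test gives p of about 0.074, so adding "WATCHTV" and "PLAYSPORT" neither shifts the original estimates nor improves the fit significantly at the 0.05 level.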
3 Codebook

More Related Content

What's hot

Chapter 4 - multiple regression
Chapter 4  - multiple regressionChapter 4  - multiple regression
Chapter 4 - multiple regression
Tauseef khan
 
Malhotra18
Malhotra18Malhotra18
Logistic Ordinal Regression
Logistic Ordinal RegressionLogistic Ordinal Regression
Logistic Ordinal Regression
Sri Ambati
 
Statistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling DistributionStatistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling Distribution
Dexlab Analytics
 
Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Binary OR Binomial logistic regression
Binary OR Binomial logistic regression
Dr Athar Khan
 
What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?
Smarten Augmented Analytics
 
Econometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse ModelsEconometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse Models
NBER
 
Logistic regression analysis
Logistic regression analysisLogistic regression analysis
Logistic regression analysis
Dhritiman Chakrabarti
 
Multilevel Binary Logistic Regression
Multilevel Binary Logistic RegressionMultilevel Binary Logistic Regression
Multilevel Binary Logistic Regression
Economic Research Forum
 
SEM
SEMSEM
Linear Regression vs Logistic Regression | Edureka
Linear Regression vs Logistic Regression | EdurekaLinear Regression vs Logistic Regression | Edureka
Linear Regression vs Logistic Regression | Edureka
Edureka!
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
Abhimanyu Dwivedi
 
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values  Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Smarten Augmented Analytics
 
Population and sample mean
Population and sample meanPopulation and sample mean
Population and sample mean
Avjinder (Avi) Kaler
 
Les5e ppt 11
Les5e ppt 11Les5e ppt 11
Les5e ppt 11
Subas Nandy
 
Descriptive statistics and graphs
Descriptive statistics and graphsDescriptive statistics and graphs
Descriptive statistics and graphs
Avjinder (Avi) Kaler
 
Machine learning session2
Machine learning   session2Machine learning   session2
Machine learning session2
Abhimanyu Dwivedi
 
Statistical parameters
Statistical parametersStatistical parameters
Statistical parameters
Burdwan University
 

What's hot (20)

Chapter 4 - multiple regression
Chapter 4  - multiple regressionChapter 4  - multiple regression
Chapter 4 - multiple regression
 
Malhotra18
Malhotra18Malhotra18
Malhotra18
 
Logistic Ordinal Regression
Logistic Ordinal RegressionLogistic Ordinal Regression
Logistic Ordinal Regression
 
Statistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling DistributionStatistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling Distribution
 
Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Binary OR Binomial logistic regression
Binary OR Binomial logistic regression
 
What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?
 
Econometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse ModelsEconometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse Models
 
Logistic regression analysis
Logistic regression analysisLogistic regression analysis
Logistic regression analysis
 
Multilevel Binary Logistic Regression
Multilevel Binary Logistic RegressionMultilevel Binary Logistic Regression
Multilevel Binary Logistic Regression
 
SEM
SEMSEM
SEM
 
Linear Regression vs Logistic Regression | Edureka
Linear Regression vs Logistic Regression | EdurekaLinear Regression vs Logistic Regression | Edureka
Linear Regression vs Logistic Regression | Edureka
 
Regression for class teaching
Regression for class teachingRegression for class teaching
Regression for class teaching
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values  Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
 
Population and sample mean
Population and sample meanPopulation and sample mean
Population and sample mean
 
Hypo
HypoHypo
Hypo
 
Les5e ppt 11
Les5e ppt 11Les5e ppt 11
Les5e ppt 11
 
Descriptive statistics and graphs
Descriptive statistics and graphsDescriptive statistics and graphs
Descriptive statistics and graphs
 
Machine learning session2
Machine learning   session2Machine learning   session2
Machine learning session2
 
Statistical parameters
Statistical parametersStatistical parameters
Statistical parameters
 

Similar to [M3A4] Data Analysis and Interpretation Specialization

[M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization [M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization
Andrea Rubio
 
[M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization [M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization
Andrea Rubio
 
[M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization [M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization
Andrea Rubio
 
[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization
Andrea Rubio
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlp
ankit_ppt
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
Dessy Amirudin
 
Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAI
Raouf KESKES
 
Simple rules for building robust machine learning models
Simple rules for building robust machine learning modelsSimple rules for building robust machine learning models
Simple rules for building robust machine learning models
Kyriakos Chatzidimitriou
 
Essentials of machine learning algorithms
Essentials of machine learning algorithmsEssentials of machine learning algorithms
Essentials of machine learning algorithms
Arunangsu Sahu
 
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docxDIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
lynettearnold46882
 
Lecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptxLecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptx
ajondaree
 
INTRODUCTION TO BOOSTING.ppt
INTRODUCTION TO BOOSTING.pptINTRODUCTION TO BOOSTING.ppt
INTRODUCTION TO BOOSTING.ppt
BharatDaiyaBharat
 
Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Matthew Clark
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VII
Max Kleiner
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62
Max Kleiner
 
1624.pptx
1624.pptx1624.pptx
1624.pptx
Jyoti863900
 
[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization
Andrea Rubio
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
gadissaassefa
 
DSE-complete.pptx
DSE-complete.pptxDSE-complete.pptx
DSE-complete.pptx
RanjithKumar888622
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_reportRavi Gupta
 

Similar to [M3A4] Data Analysis and Interpretation Specialization (20)

[M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization [M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization
 
[M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization [M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization
 
[M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization [M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization
 
[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlp
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 
Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAI
 
Simple rules for building robust machine learning models
Simple rules for building robust machine learning modelsSimple rules for building robust machine learning models
Simple rules for building robust machine learning models
 
Essentials of machine learning algorithms
Essentials of machine learning algorithmsEssentials of machine learning algorithms
Essentials of machine learning algorithms
 
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docxDIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
 
Lecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptxLecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptx
 
INTRODUCTION TO BOOSTING.ppt
INTRODUCTION TO BOOSTING.pptINTRODUCTION TO BOOSTING.ppt
INTRODUCTION TO BOOSTING.ppt
 
Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VII
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62
 
1624.pptx
1624.pptx1624.pptx
1624.pptx
 
[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
 
DSE-complete.pptx
DSE-complete.pptxDSE-complete.pptx
DSE-complete.pptx
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_report
 

Recently uploaded

Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
AADYARAJPANDEY1
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
justice-and-fairness-ethics with example
justice-and-fairness-ethics with examplejustice-and-fairness-ethics with example
justice-and-fairness-ethics with example
azzyixes
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditions
muralinath2
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
DiyaBiswas10
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
Large scale production of streptomycin.pptx
Large scale production of streptomycin.pptxLarge scale production of streptomycin.pptx
Large scale production of streptomycin.pptx
Cherry
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
IvanMallco1
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Viksit bharat till 2047 India@2047.pptx
Viksit bharat till 2047  India@2047.pptxViksit bharat till 2047  India@2047.pptx
Viksit bharat till 2047 India@2047.pptx
rakeshsharma20142015
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
Richard Gill
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
muralinath2
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCINGRNA INTERFERENCE: UNRAVELING GENETIC SILENCING
RNA INTERFERENCE: UNRAVELING GENETIC SILENCING
AADYARAJPANDEY1
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
insect morphology and physiology of insect
insect morphology and physiology of insectinsect morphology and physiology of insect
insect morphology and physiology of insect
anitaento25
 

Recently uploaded (20)

Cancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate PathwayCancer cell metabolism: special Reference to Lactate Pathway
Cancer cell metabolism: special Reference to Lactate Pathway
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
justice-and-fairness-ethics with example
justice-and-fairness-ethics with examplejustice-and-fairness-ethics with example
justice-and-fairness-ethics with example
 
Anemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditionsAnemia_ different types_causes_ conditions
Anemia_ different types_causes_ conditions
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
Large scale production of streptomycin.pptx
Large scale production of streptomycin.pptxLarge scale production of streptomycin.pptx
Large scale production of streptomycin.pptx
 
filosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptxfilosofia boliviana introducción jsjdjd.pptx
filosofia boliviana introducción jsjdjd.pptx
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf

[M3A4] Data Analysis and Interpretation Specialization

  • 1. DATA ANALYSIS COLLECTION ASSIGNMENT Data Analysis And Interpretation Specialization Test A Logistic Regression Model Andrea Rubio Amorós June 15, 2017 Modul 3 Assignment 4
  • 2. Data Analysis And Interpretation Specialization Test A Logistic Regression Model M3A4 1 Introduction In this assignment, I will discuss some things that you should keep in mind as you continue to use data analysis in the future. I will also teach you how to test a categorical explanatory variable with more than two categories in a multiple regression analysis. Finally, I introduce you to logistic regression analysis for a binary response variable with multiple explanatory variables. Logistic regression is simply another form of the linear regression model, so the basic idea is the same as a multiple regression analysis. But, unlike the multiple regression model, the logistic regression model is designed to test binary response variables. I will gain experience testing and interpreting a logistic regression model, including using odds ratios and confidence intervals to determine the magnitude of the association between your explanatory variables and response variable. You can use the same explanatory variables that you used to test your multiple regression model with a quantitative outcome, but your response variable needs to be binary (categorical with 2 categories). If you have a quantitative response variable, you will have to bin it into 2 categories. Alternatively, you can choose a different binary response variable from your data set that you can use to test a logistic regression model. If you have a categorical response variable with more than two categories, you will need to collapse it into two categories. Document written in LATEX template_version_01.tex 2
  • 3. 2 Python Code

    For the last assignment of this module, I will use the AddHealth dataset to test a logistic regression model with multiple explanatory variables and a categorical, binary response variable.

    First, I import all required libraries and use pandas to read in the data set. Then I convert the variables to numeric and recode them to binary (1 = yes, 0 = no). To reduce loading time, I create a new dataset called mydata that contains only the variables I am going to work with.

        # import libraries
        import pandas
        import numpy
        import matplotlib.pyplot as plt
        import statsmodels.formula.api as smf

        # reading in the data set we want to work with
        # (working_folder is assumed to be defined beforehand as the path to the data folder)
        data = pandas.read_csv(working_folder+"M3A4data_addhealth_pds.csv",low_memory=False)

        # setting variables to numeric (non-numeric entries become NaN)
        data['H1GH23J'] = pandas.to_numeric(data['H1GH23J'], errors='coerce')
        data['H1DA8'] = pandas.to_numeric(data['H1DA8'], errors='coerce')
        data['H1DA5'] = pandas.to_numeric(data['H1DA5'], errors='coerce')
        data['H1GH52'] = pandas.to_numeric(data['H1GH52'], errors='coerce')
        data['H1ED16'] = pandas.to_numeric(data['H1ED16'], errors='coerce')

        # recode variable observations to 0=no, 1=yes
        # note: comparisons with NaN are False, so missing answers are recoded to 0
        def NOBREAKFAST(x):
            if x['H1GH23J'] == 1:
                return 1
            else:
                return 0
        data['NOBREAKFAST'] = data.apply(lambda x: NOBREAKFAST(x), axis = 1)

        def WATCHTV(x):
            if x['H1DA8'] >= 1:
                return 1
            else:
                return 0
        data['WATCHTV'] = data.apply(lambda x: WATCHTV(x), axis = 1)

        def PLAYSPORT(x):
            if x['H1DA5'] >= 1:
                return 1
            else:
                return 0
        data['PLAYSPORT'] = data.apply(lambda x: PLAYSPORT(x), axis = 1)

        def ENOUGHSLEEP(x):
            if x['H1GH52'] == 1:
                return 1
            else:
                return 0
        data['ENOUGHSLEEP'] = data.apply(lambda x: ENOUGHSLEEP(x), axis = 1)

        def TROUBLEPAYATT(x):
            if x['H1ED16'] >= 1:
                return 1
            else:
                return 0
        data['TROUBLEPAYATT'] = data.apply(lambda x: TROUBLEPAYATT(x), axis = 1)

        # create a personalized dataset only with the chosen variables for this research
        # (the recoded columns contain only 0 and 1, so dropna() removes no rows here)
        mydata = data[['NOBREAKFAST','WATCHTV','PLAYSPORT','ENOUGHSLEEP','TROUBLEPAYATT']].dropna()
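The row-wise recoding functions above return 0 whenever the comparison fails, which includes rows where the answer is missing. As a minimal sketch of an alternative (not the approach used in this assignment), the recoding can be vectorized so that missing answers stay missing and are then dropped by `dropna()`. The helper name `recode_binary` and the toy values are assumptions made for illustration; applying this to the real data would change the sample size and therefore the results reported here.

```python
import numpy as np
import pandas as pd

# Toy data standing in for one AddHealth column (values are made up).
toy = pd.DataFrame({"H1GH52": [1, 0, np.nan, 1]})

def recode_binary(series, is_yes):
    """Hypothetical helper: 1.0 where is_yes(series) holds, 0.0 elsewhere,
    and NaN wherever the original answer is missing."""
    coded = is_yes(series).astype(float)
    return coded.where(series.notna())

# Same rule as ENOUGHSLEEP above (answer == 1 means yes), but NaN is preserved.
toy["ENOUGHSLEEP"] = recode_binary(toy["H1GH52"], lambda s: s == 1)

complete_rows = toy.dropna()
print(toy)
print("rows kept after dropna():", len(complete_rows))
```

With this variant, `dropna()` actually excludes respondents with missing answers instead of counting them as "no".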
  • 4. Explanatory variables:
      • NOBREAKFAST = Have nothing for breakfast (1 = yes, 0 = no)
      • WATCHTV = Watch TV (1 = yes, 0 = no)
      • PLAYSPORT = Play an active sport (1 = yes, 0 = no)
      • ENOUGHSLEEP = Get enough hours of sleep (1 = yes, 0 = no)

    Response variable:
      • TROUBLEPAYATT = Have trouble paying attention at school (1 = yes, 0 = no)

    Research question: Are those having nothing for breakfast more or less likely to have trouble paying attention at school?

    To answer that question, I will use the logit function, setting "TROUBLEPAYATT" as my response variable and "NOBREAKFAST" as the explanatory variable. In addition, I will use odds ratios to compare the odds of having trouble paying attention at school for students who have nothing for breakfast versus students who have breakfast.

        # logistic regression model
        lreg1 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST', data=mydata).fit()
        print(lreg1.summary())

        # odds ratios with 95% confidence intervals
        print("Odds Ratios")
        params = lreg1.params
        conf = lreg1.conf_int()
        conf['OR'] = params
        conf.columns = ['Lower CI','Upper CI','OR']
        print(numpy.exp(conf))

                                   Logit Regression Results
        ==============================================================================
        Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
        Model:                          Logit   Df Residuals:                     6502
        Method:                           MLE   Df Model:                            1
        Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                0.002436
        Time:                        17:34:11   Log-Likelihood:                -3564.0
        converged:                       True   LL-Null:                       -3572.7
                                                LLR p-value:                 3.014e-05
        ===============================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
        -------------------------------------------------------------------------------
        Intercept       1.1030      0.032     34.485      0.000       1.040       1.166
        NOBREAKFAST     0.3169      0.078      4.088      0.000       0.165       0.469
        ===============================================================================
        Odds Ratios
                     Lower CI  Upper CI        OR
        Intercept    2.829976  3.207982  3.013057
        NOBREAKFAST  1.179349  1.598154  1.372874

    The generated output indicates:
      • Number of observations: 6504.
      • The p-value of "NOBREAKFAST" is lower than the α-level of 0.05: the association with "TROUBLEPAYATT" is statistically significant.
      • The coefficient of "NOBREAKFAST" is positive: having nothing for breakfast is associated with higher odds of having trouble paying attention.
      • Interpretation of the odds ratio: the odds of having trouble paying attention at school are 1.37 times higher for students having nothing for breakfast than for students having breakfast.
      • Confidence interval: we can be 95% confident that the population odds ratio falls between 1.18 and 1.60.
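An odds ratio compares odds, not probabilities, so it helps to convert the fitted coefficients into predicted probabilities as well. The following is a small sketch that uses only the rounded intercept and "NOBREAKFAST" coefficient printed in the summary above; the variable names are chosen here for illustration.

```python
import math

# Coefficients copied (rounded) from the first model's summary output.
intercept = 1.1030
b_nobreakfast = 0.3169

def logistic(z):
    """Inverse of the logit link: turns log-odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Odds ratio for NOBREAKFAST = exp(coefficient).
odds_ratio = math.exp(b_nobreakfast)

# Predicted probability of trouble paying attention in each group.
p_breakfast = logistic(intercept)                      # NOBREAKFAST = 0
p_no_breakfast = logistic(intercept + b_nobreakfast)   # NOBREAKFAST = 1

# Ratio of probabilities (relative risk), which is not the odds ratio.
risk_ratio = p_no_breakfast / p_breakfast

print("odds ratio:      %.3f" % odds_ratio)
print("P(trouble | breakfast):    %.3f" % p_breakfast)
print("P(trouble | no breakfast): %.3f" % p_no_breakfast)
print("ratio of probabilities:    %.3f" % risk_ratio)
```

The predicted probabilities are roughly 0.75 and 0.81, so the ratio of probabilities is much closer to 1 than the odds ratio of about 1.37. This is why the odds ratio is best read as "1.37 times the odds" rather than "1.37 times as likely".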
  • 5. I will now add a second explanatory variable, "ENOUGHSLEEP", to the model and study the results.

        # logistic regression model
        lreg2 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP', data=mydata).fit()
        print(lreg2.summary())

        # odds ratios with 95% confidence intervals
        print("Odds Ratios")
        params = lreg2.params
        conf = lreg2.conf_int()
        conf['OR'] = params
        conf.columns = ['Lower CI','Upper CI','OR']
        print(numpy.exp(conf))

                                   Logit Regression Results
        ==============================================================================
        Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
        Model:                          Logit   Df Residuals:                     6501
        Method:                           MLE   Df Model:                            2
        Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                 0.01991
        Time:                        17:34:11   Log-Likelihood:                -3501.6
        converged:                       True   LL-Null:                       -3572.7
                                                LLR p-value:                 1.272e-31
        ===============================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
        -------------------------------------------------------------------------------
        Intercept       1.7497      0.072     24.392      0.000       1.609       1.890
        NOBREAKFAST     0.2243      0.079      2.852      0.004       0.070       0.378
        ENOUGHSLEEP    -0.8107      0.077    -10.571      0.000      -0.961      -0.660
        ===============================================================================
        Odds Ratios
                     Lower CI  Upper CI        OR
        Intercept    4.998263  6.621168  5.752768
        NOBREAKFAST  1.072676  1.459978  1.251432
        ENOUGHSLEEP  0.382490  0.516639  0.444533

    The generated output indicates:
      • The p-value of "ENOUGHSLEEP" is lower than the α-level of 0.05: the association with "TROUBLEPAYATT" is statistically significant.
      • The coefficient of "ENOUGHSLEEP" is negative: getting enough sleep is associated with lower odds of having trouble paying attention.
      • Interpretation of the odds ratio: after adjusting for "NOBREAKFAST", the odds of having trouble paying attention at school for students getting enough sleep are 0.44 times the odds for students not getting enough sleep, that is, about 56% lower.
      • Confidence interval: we can be 95% confident that the population odds ratio falls between 0.38 and 0.52.
      • "NOBREAKFAST" remains significant (p = 0.004) after adjusting for "ENOUGHSLEEP", with its odds ratio dropping from 1.37 to 1.25.
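The "LLR p-value" in each summary compares the fitted model against the intercept-only model. As a hedged sketch of a related check that the assignment does not perform, the one-variable and two-variable models can be compared directly with a likelihood-ratio test, using the log-likelihoods printed in the two summaries. Because those values are rounded to one decimal, the statistic is approximate.

```python
import math

# Log-likelihoods copied (rounded) from the two summary outputs.
ll_model1 = -3564.0   # TROUBLEPAYATT ~ NOBREAKFAST
ll_model2 = -3501.6   # TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP

# Likelihood-ratio statistic for nested models.
lr_stat = 2.0 * (ll_model2 - ll_model1)

# The models differ by one parameter, so the reference distribution is
# chi-square with 1 degree of freedom. For df = 1 the upper-tail
# probability equals erfc(sqrt(x / 2)), which avoids needing scipy.
df_diff = 1
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print("LR statistic: %.1f on %d df" % (lr_stat, df_diff))
print("p-value: %.3g" % p_value)
```

A very small p-value here is consistent with the summary table, where "ENOUGHSLEEP" has a large z statistic: adding it improves the fit well beyond what chance would explain.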
  • 6. To conclude, to check for possible confounding, I will add more explanatory variables to the model.

        # logistic regression model
        lreg3 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP + WATCHTV + PLAYSPORT', data=mydata).fit()
        print(lreg3.summary())

        # odds ratios with 95% confidence intervals
        print("Odds Ratios")
        params = lreg3.params
        conf = lreg3.conf_int()
        conf['OR'] = params
        conf.columns = ['Lower CI','Upper CI','OR']
        print(numpy.exp(conf))

                                   Logit Regression Results
        ==============================================================================
        Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
        Model:                          Logit   Df Residuals:                     6499
        Method:                           MLE   Df Model:                            4
        Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                 0.02063
        Time:                        17:34:11   Log-Likelihood:                -3499.0
        converged:                       True   LL-Null:                       -3572.7
                                                LLR p-value:                 7.243e-31
        ===============================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
        -------------------------------------------------------------------------------
        Intercept       1.3389      0.205      6.547      0.000       0.938       1.740
        NOBREAKFAST     0.2338      0.079      2.963      0.003       0.079       0.388
        ENOUGHSLEEP    -0.8226      0.077    -10.688      0.000      -0.973      -0.672
        WATCHTV         0.3679      0.195      1.882      0.060      -0.015       0.751
        PLAYSPORT       0.0829      0.065      1.277      0.202      -0.044       0.210
        ===============================================================================
        Odds Ratios
                     Lower CI  Upper CI        OR
        Intercept    2.555077  5.695756  3.814852
        NOBREAKFAST  1.082392  1.474696  1.263408
        ENOUGHSLEEP  0.377780  0.510817  0.439291
        WATCHTV      0.984947  2.119010  1.444684
        PLAYSPORT    0.956615  1.233856  1.086428

    The generated output indicates:
      • The p-values of "WATCHTV" (0.060) and "PLAYSPORT" (0.202) exceed the α-level of 0.05, and their 95% confidence intervals for the odds ratio include 1: neither is significantly associated with "TROUBLEPAYATT" after adjusting for the other variables.
      • A non-significant p-value does not by itself make a variable a confounder. The odds ratios of "NOBREAKFAST" (from 1.25 to 1.26) and "ENOUGHSLEEP" (0.44 in both models) barely change when "WATCHTV" and "PLAYSPORT" are added, and both remain significant, so this output gives no evidence that "WATCHTV" or "PLAYSPORT" confound those associations.
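One informal way to look for confounding is to track how much the coefficient of the main explanatory variable changes as other variables enter the model. The sketch below does this for "NOBREAKFAST" using the rounded coefficients from the three summaries. The 10% threshold is a common rule of thumb assumed here for illustration; it is not part of the assignment.

```python
# NOBREAKFAST coefficients (log-odds) copied from the three summary outputs.
coef_model1 = 0.3169   # NOBREAKFAST only
coef_model2 = 0.2243   # + ENOUGHSLEEP
coef_model3 = 0.2338   # + ENOUGHSLEEP + WATCHTV + PLAYSPORT

# Assumed rule-of-thumb threshold (percent change) for flagging confounding.
THRESHOLD_PCT = 10.0

def pct_change(old, new):
    """Percent change of the coefficient when moving from one model to the next."""
    return 100.0 * (new - old) / old

change_sleep = pct_change(coef_model1, coef_model2)
change_tv_sport = pct_change(coef_model2, coef_model3)

for label, change in [("adding ENOUGHSLEEP", change_sleep),
                      ("adding WATCHTV + PLAYSPORT", change_tv_sport)]:
    flag = "notable" if abs(change) > THRESHOLD_PCT else "small"
    print("%s: %+.1f%% (%s)" % (label, change, flag))
```

Under this rule of thumb, adding "ENOUGHSLEEP" shifts the "NOBREAKFAST" coefficient by roughly 29%, which suggests that sleep accounts for part of the breakfast association, while adding "WATCHTV" and "PLAYSPORT" shifts it by only about 4%.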
  • 7. 3 Codebook