DATA ANALYSIS COLLECTION
ASSIGNMENT
Data Analysis And Interpretation Specialization
Test A Logistic Regression Model
Andrea Rubio Amorós
June 15, 2017
Module 3
Assignment 4
Data Analysis And Interpretation Specialization
Test A Logistic Regression Model M3A4
1 Introduction
In this assignment, I discuss some points to keep in mind when continuing to use data analysis in the future, and how to
test a categorical explanatory variable with more than two categories in a multiple regression analysis. I then introduce
logistic regression analysis for a binary response variable with multiple explanatory variables. Logistic regression is
closely related to the linear regression model: the log odds of the response are modeled as a linear function of the
explanatory variables, so the basic idea is the same as in a multiple regression analysis.
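As a brief sketch of this idea in standard textbook notation (the symbols $p$, $\beta_j$ and $x_j$ are introduced here for illustration and do not appear in the original assignment), a logistic regression model with $k$ explanatory variables can be written as:

```latex
\[
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k ,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}} ,
\]
\[
\mathrm{OR}_j = e^{\beta_j} .
\]
```

Here $p$ is the probability that the binary response equals 1, and $e^{\beta_j}$ is the odds ratio associated with a one-unit increase in $x_j$ while the other explanatory variables are held fixed.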
Unlike the multiple regression model, however, the logistic regression model is designed for binary response variables.
I will gain experience testing and interpreting a logistic regression model, including using odds ratios and confidence
intervals to determine the magnitude of the association between the explanatory variables and the response variable.
The same explanatory variables used to test the multiple regression model with a quantitative outcome can be reused,
but the response variable needs to be binary (categorical with 2 categories). A quantitative response variable has to be
binned into 2 categories, and a categorical response variable with more than two categories has to be collapsed into
two categories. Alternatively, a different binary response variable from the data set can be chosen to test the logistic
regression model.
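The binning and collapsing steps described above could look roughly like the following minimal sketch. The toy data, the column names and the 7-hour threshold are hypothetical and chosen only for illustration; they are not taken from the AddHealth data.

```python
import pandas

# Hypothetical toy data (not from AddHealth): hours of sleep per night
toy = pandas.DataFrame({"sleep_hours": [5.0, 7.5, 8.0, None, 6.0]})

# Bin the quantitative variable into two categories:
# 1 if at least 7 hours (assumed threshold), 0 otherwise.
toy["enough_sleep"] = (toy["sleep_hours"] >= 7).astype(float)
# Keep missing values missing instead of silently recoding them to 0.
toy.loc[toy["sleep_hours"].isna(), "enough_sleep"] = float("nan")

# Hypothetical categorical variable with four levels, collapsed into two.
toy["freq"] = ["never", "sometimes", "often", "always", "never"]
toy["any_trouble"] = toy["freq"].map(
    {"never": 0, "sometimes": 1, "often": 1, "always": 1}
)

print(toy)
```

Where the cut point is placed changes the meaning of the binary variable, so it is worth stating the chosen threshold explicitly next to the results.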
Document written in LATEX
template_version_01.tex
2 Python Code
For the last assignment of this module, I will use the AddHealth dataset to test a logistic regression model with multiple
explanatory variables and a categorical, binary response variable.
First of all, I import all required libraries and use pandas to read in the data set. Then, I set all the variables to numeric
and recode them to binary (1 = yes and 0 = no).
To reduce the loading time, I create a new dataset called mydata that contains only the variables I am going to work with.
# import libraries
import pandas
import numpy
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# folder containing the data file (placeholder: not defined in the original listing, adjust to your setup)
working_folder = ""

# reading in the data set we want to work with
data = pandas.read_csv(working_folder + "M3A4data_addhealth_pds.csv", low_memory=False)

# setting variables to numeric (entries that cannot be parsed become NaN)
data['H1GH23J'] = pandas.to_numeric(data['H1GH23J'], errors='coerce')
data['H1DA8'] = pandas.to_numeric(data['H1DA8'], errors='coerce')
data['H1DA5'] = pandas.to_numeric(data['H1DA5'], errors='coerce')
data['H1GH52'] = pandas.to_numeric(data['H1GH52'], errors='coerce')
data['H1ED16'] = pandas.to_numeric(data['H1ED16'], errors='coerce')

# recode variable observations to 0=no, 1=yes
# note: comparisons with NaN are False, so missing values are recoded to 0
def NOBREAKFAST(x):
    if x['H1GH23J'] == 1:
        return 1
    else:
        return 0
data['NOBREAKFAST'] = data.apply(lambda x: NOBREAKFAST(x), axis=1)

def WATCHTV(x):
    if x['H1DA8'] >= 1:
        return 1
    else:
        return 0
data['WATCHTV'] = data.apply(lambda x: WATCHTV(x), axis=1)

def PLAYSPORT(x):
    if x['H1DA5'] >= 1:
        return 1
    else:
        return 0
data['PLAYSPORT'] = data.apply(lambda x: PLAYSPORT(x), axis=1)

def ENOUGHSLEEP(x):
    if x['H1GH52'] == 1:
        return 1
    else:
        return 0
data['ENOUGHSLEEP'] = data.apply(lambda x: ENOUGHSLEEP(x), axis=1)

def TROUBLEPAYATT(x):
    if x['H1ED16'] >= 1:
        return 1
    else:
        return 0
data['TROUBLEPAYATT'] = data.apply(lambda x: TROUBLEPAYATT(x), axis=1)

# create a personalized dataset only with the chosen variables for this research
# (the recoded columns contain no NaN, so dropna() removes no rows here)
mydata = data[['NOBREAKFAST','WATCHTV','PLAYSPORT','ENOUGHSLEEP','TROUBLEPAYATT']].dropna()
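One detail of the recoding functions above is that a comparison with a missing value evaluates to False, so missing answers end up coded as 0 and the final dropna() has nothing to remove. A possible alternative, sketched below, keeps missing values missing. The helper name recode_binary and the toy column are hypothetical, and any special survey codes (for example for refused or unknown answers) would still need to be checked against the codebook.

```python
import numpy
import pandas

def recode_binary(series, is_yes):
    """Return 1.0 where is_yes holds, 0.0 where it does not, NaN where the input is missing."""
    numeric = pandas.to_numeric(series, errors="coerce")
    recoded = is_yes(numeric).astype(float)
    # Restore missing values that the comparison turned into 0
    return recoded.where(numeric.notna(), numpy.nan)

# Hypothetical toy column standing in for a survey item such as H1GH23J
raw = pandas.Series([1, 0, " ", 1, None])
nobreakfast = recode_binary(raw, lambda s: s == 1)

print(nobreakfast.tolist())
print("rows kept after dropna():", nobreakfast.dropna().shape[0])
```

With this variant, dropna() actually excludes incomplete cases, so the number of observations in the fitted models may be smaller than in the output reported in this document.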
Explanatory variables:
• NOBREAKFAST = Have nothing for breakfast (1 = yes and 0 = no)
• WATCHTV = Watch TV (1 = yes and 0 = no)
• PLAYSPORT = Play an active sport (1 = yes and 0 = no)
• ENOUGHSLEEP = Have enough sleep hours (1 = yes and 0 = no)
Response variable:
• TROUBLEPAYATT = Have trouble paying attention at school (1 = yes and 0 = no)
Research question: Are those having nothing for breakfast more or less likely to have trouble paying attention at
school? To answer that question, I will use the logit function, setting "TROUBLEPAYATT" as my response variable
and "NOBREAKFAST" as the explanatory variable. In addition, I will use odds ratios to compare the odds of having
trouble paying attention at school when having vs. not having breakfast.
# logistic regression model
lreg1 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST', data=mydata).fit()
print(lreg1.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
                           Logit Regression Results
==============================================================================
Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
Model:                          Logit   Df Residuals:                     6502
Method:                           MLE   Df Model:                            1
Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                0.002436
Time:                        17:34:11   Log-Likelihood:                -3564.0
converged:                       True   LL-Null:                       -3572.7
                                        LLR p-value:                 3.014e-05
===============================================================================
                    coef    std err          z      P>|z|     [0.025     0.975]
-------------------------------------------------------------------------------
Intercept         1.1030      0.032     34.485      0.000      1.040      1.166
NOBREAKFAST       0.3169      0.078      4.088      0.000      0.165      0.469
===============================================================================
Odds Ratios
             Lower CI  Upper CI        OR
Intercept    2.829976  3.207982  3.013057
NOBREAKFAST  1.179349  1.598154  1.372874
The generated output indicates:
• Number of observations: 6504
• The p-value of "NOBREAKFAST" is lower than the α-level of 0.05: the association is statistically significant.
• The coefficient of "NOBREAKFAST" is positive.
• Interpretation of the odds ratio: the odds of having trouble paying attention at school are 1.37 times higher for
students having nothing for breakfast than for students having breakfast.
• Confidence interval: we can be 95% confident that the population odds ratio falls between 1.18 and 1.60.
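To make the link between the coefficients and the odds ratio explicit, the following minimal sketch recomputes it by hand from the rounded coefficients printed in the summary. The function name predicted_probability is introduced here only for illustration.

```python
import math

# Coefficients copied (rounded) from the summary output of the first model
intercept = 1.1030
b_nobreakfast = 0.3169

# Odds ratio for NOBREAKFAST: exponentiate the coefficient
odds_ratio = math.exp(b_nobreakfast)

def predicted_probability(nobreakfast):
    """Predicted probability of TROUBLEPAYATT = 1 for NOBREAKFAST = 0 or 1."""
    log_odds = intercept + b_nobreakfast * nobreakfast
    return 1.0 / (1.0 + math.exp(-log_odds))

p_breakfast = predicted_probability(0)
p_no_breakfast = predicted_probability(1)

print(f"odds ratio:                {odds_ratio:.3f}")
print(f"P(trouble | breakfast):    {p_breakfast:.3f}")
print(f"P(trouble | no breakfast): {p_no_breakfast:.3f}")
```

The two predicted probabilities come out at roughly 0.75 and 0.81, so their ratio is much closer to 1 than the odds ratio of 1.37. This is why the odds ratio is best read as a statement about odds rather than about probabilities.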
I will now add a second explanatory variable, "ENOUGHSLEEP", to the model and study the results.
# logistic regression model
lreg2 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP', data=mydata).fit()
print(lreg2.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
                           Logit Regression Results
==============================================================================
Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
Model:                          Logit   Df Residuals:                     6501
Method:                           MLE   Df Model:                            2
Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                 0.01991
Time:                        17:34:11   Log-Likelihood:                -3501.6
converged:                       True   LL-Null:                       -3572.7
                                        LLR p-value:                 1.272e-31
===============================================================================
                    coef    std err          z      P>|z|     [0.025     0.975]
-------------------------------------------------------------------------------
Intercept         1.7497      0.072     24.392      0.000      1.609      1.890
NOBREAKFAST       0.2243      0.079      2.852      0.004      0.070      0.378
ENOUGHSLEEP      -0.8107      0.077    -10.571      0.000     -0.961     -0.660
===============================================================================
Odds Ratios
             Lower CI  Upper CI        OR
Intercept    4.998263  6.621168  5.752768
NOBREAKFAST  1.072676  1.459978  1.251432
ENOUGHSLEEP  0.382490  0.516639  0.444533
The generated output indicates:
• The p-value of "ENOUGHSLEEP" is lower than the α-level of 0.05: the association is statistically significant.
• The coefficient of "ENOUGHSLEEP" is negative.
• Interpretation of the odds ratio: the odds of having trouble paying attention at school for students having enough
sleep hours are 0.44 times those of students not having enough sleep hours, i.e. about 56% lower.
• Confidence interval: we can be 95% confident that the population odds ratio falls between 0.38 and 0.52.
• "NOBREAKFAST" remains significant after adjusting for "ENOUGHSLEEP" (p = 0.004), with an odds ratio of 1.25
(95% confidence interval 1.07 to 1.46).
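As a rough cross-check of these results, the sketch below converts the odds ratio of "ENOUGHSLEEP" into a percent change in odds and compares the first and second models with a likelihood-ratio test. The test is an addition for illustration and is not part of the original analysis; the log-likelihoods are the rounded values printed in the two summaries.

```python
import math

# Values copied (rounded) from the summary outputs of the first and second models
or_enoughsleep = 0.444533
llf_model1 = -3564.0   # TROUBLEPAYATT ~ NOBREAKFAST
llf_model2 = -3501.6   # TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP

# Percent change in the odds associated with ENOUGHSLEEP = 1
percent_change = (or_enoughsleep - 1.0) * 100.0

# Likelihood-ratio statistic for adding one parameter to the model
lr_stat = 2.0 * (llf_model2 - llf_model1)

# Chi-square survival function with 1 degree of freedom:
# P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(f"percent change in odds: {percent_change:.1f}%")
print(f"LR statistic: {lr_stat:.1f}, p-value: {p_value:.3g}")
```

A very small p-value here agrees with the Wald test in the summary table: adding "ENOUGHSLEEP" improves the fit of the model substantially.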
To conclude, in order to check for possible confounding, I will add more explanatory variables to the model.
# logistic regression model
lreg3 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP + WATCHTV + PLAYSPORT',
data=mydata).fit()
print(lreg3.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg3.params
conf = lreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
                           Logit Regression Results
==============================================================================
Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
Model:                          Logit   Df Residuals:                     6499
Method:                           MLE   Df Model:                            4
Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                 0.02063
Time:                        17:34:11   Log-Likelihood:                -3499.0
converged:                       True   LL-Null:                       -3572.7
                                        LLR p-value:                 7.243e-31
===============================================================================
                    coef    std err          z      P>|z|     [0.025     0.975]
-------------------------------------------------------------------------------
Intercept         1.3389      0.205      6.547      0.000      0.938      1.740
NOBREAKFAST       0.2338      0.079      2.963      0.003      0.079      0.388
ENOUGHSLEEP      -0.8226      0.077    -10.688      0.000     -0.973     -0.672
WATCHTV           0.3679      0.195      1.882      0.060     -0.015      0.751
PLAYSPORT         0.0829      0.065      1.277      0.202     -0.044      0.210
===============================================================================
Odds Ratios
             Lower CI  Upper CI        OR
Intercept    2.555077  5.695756  3.814852
NOBREAKFAST  1.082392  1.474696  1.263408
ENOUGHSLEEP  0.377780  0.510817  0.439291
WATCHTV      0.984947  2.119010  1.444684
PLAYSPORT    0.956615  1.233856  1.086428
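The generated output indicates:
• The p-values of "WATCHTV" (0.060) and "PLAYSPORT" (0.202) exceed the α-level of 0.05: neither variable is
significantly associated with trouble paying attention at school after adjusting for the other explanatory variables.
• "NOBREAKFAST" (OR = 1.26) and "ENOUGHSLEEP" (OR = 0.44) remain significant, and their odds ratios barely change
compared with the previous model, so there is no evidence that "WATCHTV" or "PLAYSPORT" confound these associations.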
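One way to examine the role of "WATCHTV" and "PLAYSPORT" more closely is to compare the second and third models directly. The sketch below, which is an illustrative addition and not part of the original analysis, runs a likelihood-ratio test for the two added variables and reports how much the odds ratios of the other variables change. All numbers are the rounded values printed in the summaries above.

```python
import math

# Log-likelihoods copied (rounded) from the summaries of the second and third models
llf_model2 = -3501.6   # NOBREAKFAST + ENOUGHSLEEP
llf_model3 = -3499.0   # NOBREAKFAST + ENOUGHSLEEP + WATCHTV + PLAYSPORT

# Likelihood-ratio statistic for adding two parameters
lr_stat = 2.0 * (llf_model3 - llf_model2)

# Chi-square survival function with 2 degrees of freedom: P(X > x) = exp(-x / 2)
p_value = math.exp(-lr_stat / 2.0)

# Odds ratios (second model, third model) copied from the output tables
odds_ratios = {
    "NOBREAKFAST": (1.251432, 1.263408),
    "ENOUGHSLEEP": (0.444533, 0.439291),
}

# Relative change of each odds ratio after adding WATCHTV and PLAYSPORT
changes = {
    name: (after - before) / before * 100.0
    for name, (before, after) in odds_ratios.items()
}

print(f"LR statistic: {lr_stat:.1f}, p-value: {p_value:.3f}")
for name, change in changes.items():
    print(f"{name}: odds ratio changes by {change:+.2f}%")
```

The joint test does not reach the 0.05 level, and both odds ratios move by only about one percent, which gives little indication that the two added variables distort the estimated associations.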
3 Codebook

More Related Content

What's hot

Chapter 4 - multiple regression
Chapter 4  - multiple regressionChapter 4  - multiple regression
Chapter 4 - multiple regressionTauseef khan
 
Logistic Ordinal Regression
Logistic Ordinal RegressionLogistic Ordinal Regression
Logistic Ordinal RegressionSri Ambati
 
Statistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling DistributionStatistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling DistributionDexlab Analytics
 
Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Dr Athar Khan
 
What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?Smarten Augmented Analytics
 
Econometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse ModelsEconometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse ModelsNBER
 
Linear Regression vs Logistic Regression | Edureka
Linear Regression vs Logistic Regression | EdurekaLinear Regression vs Logistic Regression | Edureka
Linear Regression vs Logistic Regression | EdurekaEdureka!
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)Abhimanyu Dwivedi
 
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values  Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values Smarten Augmented Analytics
 

What's hot (20)

Chapter 4 - multiple regression
Chapter 4  - multiple regressionChapter 4  - multiple regression
Chapter 4 - multiple regression
 
Malhotra18
Malhotra18Malhotra18
Malhotra18
 
Logistic Ordinal Regression
Logistic Ordinal RegressionLogistic Ordinal Regression
Logistic Ordinal Regression
 
Statistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling DistributionStatistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling Distribution
 
Binary OR Binomial logistic regression
Binary OR Binomial logistic regression Binary OR Binomial logistic regression
Binary OR Binomial logistic regression
 
What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?What is KNN Classification and How Can This Analysis Help an Enterprise?
What is KNN Classification and How Can This Analysis Help an Enterprise?
 
Econometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse ModelsEconometrics of High-Dimensional Sparse Models
Econometrics of High-Dimensional Sparse Models
 
Logistic regression analysis
Logistic regression analysisLogistic regression analysis
Logistic regression analysis
 
Multilevel Binary Logistic Regression
Multilevel Binary Logistic RegressionMultilevel Binary Logistic Regression
Multilevel Binary Logistic Regression
 
SEM
SEMSEM
SEM
 
Linear Regression vs Logistic Regression | Edureka
Linear Regression vs Logistic Regression | EdurekaLinear Regression vs Logistic Regression | Edureka
Linear Regression vs Logistic Regression | Edureka
 
Regression for class teaching
Regression for class teachingRegression for class teaching
Regression for class teaching
 
Machine learning session4(linear regression)
Machine learning   session4(linear regression)Machine learning   session4(linear regression)
Machine learning session4(linear regression)
 
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values  Random Forest Regression Analysis Reveals Impact of Variables on Target Values
Random Forest Regression Analysis Reveals Impact of Variables on Target Values
 
Population and sample mean
Population and sample meanPopulation and sample mean
Population and sample mean
 
Hypo
HypoHypo
Hypo
 
Les5e ppt 11
Les5e ppt 11Les5e ppt 11
Les5e ppt 11
 
Descriptive statistics and graphs
Descriptive statistics and graphsDescriptive statistics and graphs
Descriptive statistics and graphs
 
Machine learning session2
Machine learning   session2Machine learning   session2
Machine learning session2
 
Statistical parameters
Statistical parametersStatistical parameters
Statistical parameters
 

Similar to [M3A4] Data Analysis and Interpretation Specialization

[M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization [M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization Andrea Rubio
 
[M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization [M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization Andrea Rubio
 
[M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization [M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization Andrea Rubio
 
[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization Andrea Rubio
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlpankit_ppt
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural NetworkDessy Amirudin
 
Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIRaouf KESKES
 
Simple rules for building robust machine learning models
Simple rules for building robust machine learning modelsSimple rules for building robust machine learning models
Simple rules for building robust machine learning modelsKyriakos Chatzidimitriou
 
Essentials of machine learning algorithms
Essentials of machine learning algorithmsEssentials of machine learning algorithms
Essentials of machine learning algorithmsArunangsu Sahu
 
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docxDIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docxlynettearnold46882
 
Lecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptxLecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptxajondaree
 
INTRODUCTION TO BOOSTING.ppt
INTRODUCTION TO BOOSTING.pptINTRODUCTION TO BOOSTING.ppt
INTRODUCTION TO BOOSTING.pptBharatDaiyaBharat
 
Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Matthew Clark
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIIMax Kleiner
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Max Kleiner
 
[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation SpecializationAndrea Rubio
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdfgadissaassefa
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_reportRavi Gupta
 

Similar to [M3A4] Data Analysis and Interpretation Specialization (20)

[M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization [M3A3] Data Analysis and Interpretation Specialization
[M3A3] Data Analysis and Interpretation Specialization
 
[M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization [M2A2] Data Analysis and Interpretation Specialization
[M2A2] Data Analysis and Interpretation Specialization
 
[M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization [M2A3] Data Analysis and Interpretation Specialization
[M2A3] Data Analysis and Interpretation Specialization
 
[M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization [M4A2] Data Analysis and Interpretation Specialization
[M4A2] Data Analysis and Interpretation Specialization
 
Machine learning and_nlp
Machine learning and_nlpMachine learning and_nlp
Machine learning and_nlp
 
Artificial Neural Network
Artificial Neural NetworkArtificial Neural Network
Artificial Neural Network
 
Reinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAIReinforcement learning Research experiments OpenAI
Reinforcement learning Research experiments OpenAI
 
Simple rules for building robust machine learning models
Simple rules for building robust machine learning modelsSimple rules for building robust machine learning models
Simple rules for building robust machine learning models
 
Essentials of machine learning algorithms
Essentials of machine learning algorithmsEssentials of machine learning algorithms
Essentials of machine learning algorithms
 
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docxDIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
DIRECTIONS READ THE FOLLOWING STUDENT POST AND RESPOND EVALUATE I.docx
 
Lecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptxLecture 3.1_ Logistic Regression.pptx
Lecture 3.1_ Logistic Regression.pptx
 
INTRODUCTION TO BOOSTING.ppt
INTRODUCTION TO BOOSTING.pptINTRODUCTION TO BOOSTING.ppt
INTRODUCTION TO BOOSTING.ppt
 
Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016Bioactivity Predictive ModelingMay2016
Bioactivity Predictive ModelingMay2016
 
maXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VIImaXbox starter69 Machine Learning VII
maXbox starter69 Machine Learning VII
 
Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62Machine Learning Guide maXbox Starter62
Machine Learning Guide maXbox Starter62
 
1624.pptx
1624.pptx1624.pptx
1624.pptx
 
[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization[M4A1] Data Analysis and Interpretation Specialization
[M4A1] Data Analysis and Interpretation Specialization
 
Supervised Learning.pdf
Supervised Learning.pdfSupervised Learning.pdf
Supervised Learning.pdf
 
DSE-complete.pptx
DSE-complete.pptxDSE-complete.pptx
DSE-complete.pptx
 
Ai_Project_report
Ai_Project_reportAi_Project_report
Ai_Project_report
 

Recently uploaded

Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptxkhadijarafiq2012
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 

Recently uploaded (20)

Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptx
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 

[M3A4] Data Analysis and Interpretation Specialization

  • 1. DATA ANALYSIS COLLECTION ASSIGNMENT Data Analysis And Interpretation Specialization Test A Logistic Regression Model Andrea Rubio Amorós June 15, 2017 Modul 3 Assignment 4
  • 2. Data Analysis And Interpretation Specialization Test A Logistic Regression Model M3A4 1 Introduction In this assignment, I will discuss some things that you should keep in mind as you continue to use data analysis in the future. I will also teach you how to test a categorical explanatory variable with more than two categories in a multiple regression analysis. Finally, I introduce you to logistic regression analysis for a binary response variable with multiple explanatory variables. Logistic regression is simply another form of the linear regression model, so the basic idea is the same as a multiple regression analysis. But, unlike the multiple regression model, the logistic regression model is designed to test binary response variables. I will gain experience testing and interpreting a logistic regression model, including using odds ratios and confidence intervals to determine the magnitude of the association between your explanatory variables and response variable. You can use the same explanatory variables that you used to test your multiple regression model with a quantitative outcome, but your response variable needs to be binary (categorical with 2 categories). If you have a quantitative response variable, you will have to bin it into 2 categories. Alternatively, you can choose a different binary response variable from your data set that you can use to test a logistic regression model. If you have a categorical response variable with more than two categories, you will need to collapse it into two categories. Document written in LATEX template_version_01.tex 2
  • 3. 2 Python Code
    For the last assignment of this module, I use the AddHealth dataset to test a logistic regression model with multiple explanatory variables and a categorical, binary response variable.
    First, I import all required libraries and use pandas to read in the data set. Then I convert the variables to numeric and recode them to binary (1 = yes and 0 = no). To reduce the loading time, I create a new dataset called mydata that contains only the variables I am going to work with.

        # import libraries
        import pandas
        import numpy
        import matplotlib.pyplot as plt
        import statsmodels.formula.api as smf

        # read in the data set we want to work with
        # (working_folder is assumed to be defined beforehand as the path to the data directory)
        data = pandas.read_csv(working_folder+"M3A4data_addhealth_pds.csv", low_memory=False)

        # set variables to numeric
        data['H1GH23J'] = pandas.to_numeric(data['H1GH23J'], errors='coerce')
        data['H1DA8'] = pandas.to_numeric(data['H1DA8'], errors='coerce')
        data['H1DA5'] = pandas.to_numeric(data['H1DA5'], errors='coerce')
        data['H1GH52'] = pandas.to_numeric(data['H1GH52'], errors='coerce')
        data['H1ED16'] = pandas.to_numeric(data['H1ED16'], errors='coerce')

        # recode variable observations to 0=no, 1=yes
        # note: a missing value (NaN) fails every comparison below and is therefore coded 0
        def NOBREAKFAST(x):
            if x['H1GH23J'] == 1:
                return 1
            else:
                return 0
        data['NOBREAKFAST'] = data.apply(lambda x: NOBREAKFAST(x), axis = 1)

        def WATCHTV(x):
            if x['H1DA8'] >= 1:
                return 1
            else:
                return 0
        data['WATCHTV'] = data.apply(lambda x: WATCHTV(x), axis = 1)

        def PLAYSPORT(x):
            if x['H1DA5'] >= 1:
                return 1
            else:
                return 0
        data['PLAYSPORT'] = data.apply(lambda x: PLAYSPORT(x), axis = 1)

        def ENOUGHSLEEP(x):
            if x['H1GH52'] == 1:
                return 1
            else:
                return 0
        data['ENOUGHSLEEP'] = data.apply(lambda x: ENOUGHSLEEP(x), axis = 1)

        def TROUBLEPAYATT(x):
            if x['H1ED16'] >= 1:
                return 1
            else:
                return 0
        data['TROUBLEPAYATT'] = data.apply(lambda x: TROUBLEPAYATT(x), axis = 1)

        # create a personalized dataset only with the chosen variables for this research
        # (dropna() removes no rows here, because the recoded columns contain only 0 and 1)
        mydata = data[['NOBREAKFAST','WATCHTV','PLAYSPORT','ENOUGHSLEEP','TROUBLEPAYATT']].dropna()
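As a possible alternative to the row-wise recoding functions above, the same 0/1 coding can be written in vectorized form while keeping missing values missing. This is a sketch under assumptions, not the author's method: `recode_binary` is a hypothetical helper, the toy values are invented, and any special survey codes documented in the codebook would still need separate handling.

```python
import pandas as pd

def recode_binary(series, is_yes):
    """Return 1.0 where is_yes(values) holds, 0.0 otherwise, and NaN where missing."""
    numeric = pd.to_numeric(series, errors="coerce")
    coded = is_yes(numeric).astype(float)
    # Restore NaN for rows that were missing or non-numeric in the source column
    return coded.where(numeric.notna())

# Hypothetical toy values standing in for two of the survey columns
toy = pd.DataFrame({"H1GH23J": [1, 0, None, 1], "H1DA8": [0, 5, 12, "x"]})
toy["NOBREAKFAST"] = recode_binary(toy["H1GH23J"], lambda s: s == 1)
toy["WATCHTV"] = recode_binary(toy["H1DA8"], lambda s: s >= 1)

# With NaN preserved, dropna() now actually removes incomplete rows
clean = toy[["NOBREAKFAST", "WATCHTV"]].dropna()
print(clean)
```

Applying this to the real data would change the number of observations, so the fitted models would no longer match the output reported in this document.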
  • 4. Explanatory variables:
      • NOBREAKFAST = Have nothing for breakfast (1 = yes and 0 = no)
      • WATCHTV = Watch TV (1 = yes and 0 = no)
      • PLAYSPORT = Play an active sport (1 = yes and 0 = no)
      • ENOUGHSLEEP = Have enough sleep hours (1 = yes and 0 = no)
    Response variable:
      • TROUBLEPAYATT = Have trouble paying attention at school (1 = yes and 0 = no)
    Research question: Are those having nothing for breakfast more or less likely to have trouble paying attention at school?
    To answer that question, I use the logit function, setting "TROUBLEPAYATT" as my response variable and "NOBREAKFAST" as explanatory variable. In addition, I use odds ratios to compare the odds of having trouble paying attention at school when having vs. not having breakfast.

        # logistic regression model
        lreg1 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST', data=mydata).fit()
        print(lreg1.summary())

        # odds ratios with 95% confidence intervals
        print("Odds Ratios")
        params = lreg1.params
        conf = lreg1.conf_int()
        conf['OR'] = params
        conf.columns = ['Lower CI','Upper CI','OR']
        print(numpy.exp(conf))

                                   Logit Regression Results
        ==============================================================================
        Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
        Model:                          Logit   Df Residuals:                     6502
        Method:                           MLE   Df Model:                            1
        Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                0.002436
        Time:                        17:34:11   Log-Likelihood:                -3564.0
        converged:                       True   LL-Null:                       -3572.7
                                                LLR p-value:                 3.014e-05
        ==============================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
        ------------------------------------------------------------------------------
        Intercept       1.1030      0.032     34.485      0.000       1.040       1.166
        NOBREAKFAST     0.3169      0.078      4.088      0.000       0.165       0.469
        ==============================================================================
        Odds Ratios
                     Lower CI  Upper CI        OR
        Intercept    2.829976  3.207982  3.013057
        NOBREAKFAST  1.179349  1.598154  1.372874

    The generated output indicates:
      • Number of observations: 6504
      • The p-value of "NOBREAKFAST" (printed as 0.000) is lower than the α-level of 0.05: the association is statistically significant.
      • The coefficient of "NOBREAKFAST" is positive.
      • Interpretation of the odds ratio: the odds of having trouble paying attention at school are 1.37 times higher for students having nothing for breakfast than for students having breakfast.
      • Confidence interval: we can be 95% confident that the population odds ratio lies between 1.18 and 1.60.
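To make the link between the coefficients and the odds ratio concrete, the short sketch below recomputes the odds ratio from the printed coefficient and converts the fitted log odds into predicted probabilities. The coefficients are copied from the first summary output above; the helper `inv_logit` and the variable names are illustrative assumptions.

```python
import math

# Coefficients copied from the first model's summary output (rounded as printed)
intercept = 1.1030
b_nobreakfast = 0.3169

def inv_logit(z):
    """Convert log odds z into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# The odds ratio is the exponentiated coefficient
odds_ratio = math.exp(b_nobreakfast)

# Predicted probability of trouble paying attention in each group
p_breakfast = inv_logit(intercept)                     # NOBREAKFAST = 0
p_no_breakfast = inv_logit(intercept + b_nobreakfast)  # NOBREAKFAST = 1

print(f"odds ratio: {odds_ratio:.3f}")
print(f"P(trouble | breakfast):    {p_breakfast:.3f}")
print(f"P(trouble | no breakfast): {p_no_breakfast:.3f}")
```

The odds ratio compares odds, not probabilities: the ratio of the two predicted probabilities is much closer to 1 than the odds ratio, which is why "1.37 times higher odds" should not be read as "1.37 times the probability".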
  • 5. I now add a second explanatory variable, "ENOUGHSLEEP", to the model and study the results.

        # logistic regression model
        lreg2 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP', data=mydata).fit()
        print(lreg2.summary())

        # odds ratios with 95% confidence intervals
        print("Odds Ratios")
        params = lreg2.params
        conf = lreg2.conf_int()
        conf['OR'] = params
        conf.columns = ['Lower CI','Upper CI','OR']
        print(numpy.exp(conf))

                                   Logit Regression Results
        ==============================================================================
        Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
        Model:                          Logit   Df Residuals:                     6501
        Method:                           MLE   Df Model:                            2
        Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                 0.01991
        Time:                        17:34:11   Log-Likelihood:                -3501.6
        converged:                       True   LL-Null:                       -3572.7
                                                LLR p-value:                 1.272e-31
        ==============================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
        ------------------------------------------------------------------------------
        Intercept       1.7497      0.072     24.392      0.000       1.609       1.890
        NOBREAKFAST     0.2243      0.079      2.852      0.004       0.070       0.378
        ENOUGHSLEEP    -0.8107      0.077    -10.571      0.000      -0.961      -0.660
        ==============================================================================
        Odds Ratios
                     Lower CI  Upper CI        OR
        Intercept    4.998263  6.621168  5.752768
        NOBREAKFAST  1.072676  1.459978  1.251432
        ENOUGHSLEEP  0.382490  0.516639  0.444533

    The generated output indicates:
      • The p-value of "ENOUGHSLEEP" (printed as 0.000) is lower than the α-level of 0.05: the association is statistically significant.
      • The coefficient of "ENOUGHSLEEP" is negative.
      • Interpretation of the odds ratio: after controlling for breakfast, the odds of having trouble paying attention at school for students who get enough sleep are 0.44 times the odds for students who do not, i.e. about 56% lower.
      • Confidence interval: we can be 95% confident that the population odds ratio lies between 0.38 and 0.52.
      • "NOBREAKFAST" remains statistically significant (p = 0.004) after controlling for sleep, with an odds ratio of 1.25 (95% CI 1.07 to 1.46).
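One way to check whether adding "ENOUGHSLEEP" improves the model as a whole is a likelihood ratio test between the two nested models. The sketch below is an addition to the original analysis, not part of it: it uses only the log-likelihoods printed in the two summaries above, and relies on the fact that for one degree of freedom the chi-square tail probability equals erfc(sqrt(x / 2)).

```python
import math

# Log-likelihoods copied from the two summary outputs above
llf_reduced = -3564.0  # TROUBLEPAYATT ~ NOBREAKFAST
llf_full = -3501.6     # TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP

# Likelihood ratio statistic; the models differ by one parameter (df = 1)
lr_stat = 2.0 * (llf_full - llf_reduced)

# Chi-square survival function for df = 1: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(f"LR statistic = {lr_stat:.1f}, p-value = {p_value:.3g}")
```

A very small p-value here agrees with the Wald z-test for "ENOUGHSLEEP" in the summary table: the model with sleep fits the data clearly better than the model with breakfast alone.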
  • 6. To conclude, I add more explanatory variables to the model to check for possible confounding.

        # logistic regression model
        lreg3 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP + WATCHTV + PLAYSPORT', data=mydata).fit()
        print(lreg3.summary())

        # odds ratios with 95% confidence intervals
        print("Odds Ratios")
        params = lreg3.params
        conf = lreg3.conf_int()
        conf['OR'] = params
        conf.columns = ['Lower CI','Upper CI','OR']
        print(numpy.exp(conf))

                                   Logit Regression Results
        ==============================================================================
        Dep. Variable:          TROUBLEPAYATT   No. Observations:                 6504
        Model:                          Logit   Df Residuals:                     6499
        Method:                           MLE   Df Model:                            4
        Date:                Tue, 13 Jun 2017   Pseudo R-squ.:                 0.02063
        Time:                        17:34:11   Log-Likelihood:                -3499.0
        converged:                       True   LL-Null:                       -3572.7
                                                LLR p-value:                 7.243e-31
        ==============================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
        ------------------------------------------------------------------------------
        Intercept       1.3389      0.205      6.547      0.000       0.938       1.740
        NOBREAKFAST     0.2338      0.079      2.963      0.003       0.079       0.388
        ENOUGHSLEEP    -0.8226      0.077    -10.688      0.000      -0.973      -0.672
        WATCHTV         0.3679      0.195      1.882      0.060      -0.015       0.751
        PLAYSPORT       0.0829      0.065      1.277      0.202      -0.044       0.210
        ==============================================================================
        Odds Ratios
                     Lower CI  Upper CI        OR
        Intercept    2.555077  5.695756  3.814852
        NOBREAKFAST  1.082392  1.474696  1.263408
        ENOUGHSLEEP  0.377780  0.510817  0.439291
        WATCHTV      0.984947  2.119010  1.444684
        PLAYSPORT    0.956615  1.233856  1.086428

    The generated output indicates:
      • The p-values of "WATCHTV" (0.060) and "PLAYSPORT" (0.202) exceed the α-level of 0.05, and their 95% confidence intervals for the odds ratio include 1 (0.98 to 2.12 and 0.96 to 1.23): neither variable is significantly associated with trouble paying attention after controlling for the other variables.
      • Adding them barely changes the odds ratios of "NOBREAKFAST" (1.25 to 1.26) and "ENOUGHSLEEP" (0.44 to 0.44), and both remain significant, so "WATCHTV" and "PLAYSPORT" do not appear to confound these associations. A non-significant p-value by itself does not identify a confounder.
      • By contrast, adding "ENOUGHSLEEP" in the previous model lowered the odds ratio of "NOBREAKFAST" from 1.37 to 1.25, which suggests that sleep accounts for part of the breakfast association.
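A common way to look for confounding is to see how much the coefficient of the main explanatory variable changes when other variables are added. The sketch below applies this to the "NOBREAKFAST" coefficients printed in the three summaries above. The helper `pct_change` and the 10% threshold mentioned afterwards are assumptions (a widely used rule of thumb, not something the assignment specifies).

```python
# NOBREAKFAST coefficients copied from the three summary outputs above
coef = {
    "model 1": 0.3169,  # NOBREAKFAST only
    "model 2": 0.2243,  # + ENOUGHSLEEP
    "model 3": 0.2338,  # + ENOUGHSLEEP + WATCHTV + PLAYSPORT
}

def pct_change(old, new):
    """Percent change of a coefficient between two models."""
    return 100.0 * (new - old) / old

change_sleep = pct_change(coef["model 1"], coef["model 2"])
change_tv_sport = pct_change(coef["model 2"], coef["model 3"])

print(f"adding ENOUGHSLEEP:         {change_sleep:+.1f}%")
print(f"adding WATCHTV + PLAYSPORT: {change_tv_sport:+.1f}%")
```

Under a 10% change-in-estimate rule of thumb, the drop after adding "ENOUGHSLEEP" would count as evidence of confounding, while the small change after adding "WATCHTV" and "PLAYSPORT" would not.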
  • 7. 3 Codebook