- The document discusses testing a logistic regression model with a binary response variable (trouble paying attention in school) and multiple explanatory variables using data from the AddHealth dataset.
- A logistic regression model is created with "NOBREAKFAST" as the single explanatory variable, finding students with no breakfast are 1.37 times more likely to have trouble paying attention.
- A second model adds the variable "ENOUGHSLEEP", finding enough sleep reduces the likelihood by a factor of 0.44.
- A third full model is created to check for confounding, but findings remain consistent with no breakfast increasing the likelihood of trouble paying attention.
[M3A4] Data Analysis and Interpretation Specialization
1. DATA ANALYSIS COLLECTION
ASSIGNMENT
Data Analysis And Interpretation Specialization
Test A Logistic Regression Model
Andrea Rubio Amorós
June 15, 2017
Modul 3
Assignment 4
2. Data Analysis And Interpretation Specialization
Test A Logistic Regression Model M3A4
1 Introduction
In this assignment, I test a logistic regression model for a binary response variable with multiple explanatory variables. Logistic regression is simply another form of the linear regression model, so the basic idea is the same as in a multiple regression analysis. Unlike the multiple regression model, however, the logistic regression model is designed to test binary response variables. I will gain experience testing and interpreting a logistic regression model, including using odds ratios and confidence intervals to determine the magnitude of the association between the explanatory variables and the response variable. The same explanatory variables used to test a multiple regression model with a quantitative outcome can be reused here, but the response variable needs to be binary (categorical with 2 categories): a quantitative response variable has to be binned into 2 categories, and a categorical response variable with more than two categories has to be collapsed into two.
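Binning a quantitative response into two categories is a one-line operation in pandas. A minimal sketch on a made-up variable (the "hours of sleep" values and the 7-hour cutoff are illustrative, not from the assignment data):

```python
import pandas

# hypothetical quantitative response: hours of sleep per night
hours = pandas.Series([4.5, 6.0, 7.5, 8.0, 5.0, 9.0])

# collapse into a binary variable: 1 = at least 7 hours, 0 = fewer
binary = (hours >= 7).astype(int)
print(binary.tolist())  # [0, 0, 1, 1, 0, 1]
```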
Document written in LaTeX
template_version_01.tex
2 Python Code
For the last assignment of this module, I will use the AddHealth dataset to test a logistic regression model with multiple explanatory variables and a categorical, binary response variable.
First of all, I import all required libraries and use pandas to read in the data set. Then, I set all the variables to numeric and recode them to binary (1 = yes and 0 = no).
To reduce the loading time, I create a new dataset called mydata, containing only the variables that I'm going to work with.
# import libraries
import pandas
import numpy
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# reading in the data set we want to work with
working_folder = "./"  # adjust to the folder that holds the CSV file
data = pandas.read_csv(working_folder + "M3A4data_addhealth_pds.csv", low_memory=False)

# setting variables to numeric
data['H1GH23J'] = pandas.to_numeric(data['H1GH23J'], errors='coerce')
data['H1DA8'] = pandas.to_numeric(data['H1DA8'], errors='coerce')
data['H1DA5'] = pandas.to_numeric(data['H1DA5'], errors='coerce')
data['H1GH52'] = pandas.to_numeric(data['H1GH52'], errors='coerce')
data['H1ED16'] = pandas.to_numeric(data['H1ED16'], errors='coerce')

# recode variable observations to 0=no, 1=yes
# (note: any value other than the "yes" code, including refused/missing
# codes, is mapped to 0 here)
def NOBREAKFAST(x):
    if x['H1GH23J'] == 1:
        return 1
    else:
        return 0
data['NOBREAKFAST'] = data.apply(lambda x: NOBREAKFAST(x), axis=1)

def WATCHTV(x):
    if x['H1DA8'] >= 1:
        return 1
    else:
        return 0
data['WATCHTV'] = data.apply(lambda x: WATCHTV(x), axis=1)

def PLAYSPORT(x):
    if x['H1DA5'] >= 1:
        return 1
    else:
        return 0
data['PLAYSPORT'] = data.apply(lambda x: PLAYSPORT(x), axis=1)

def ENOUGHSLEEP(x):
    if x['H1GH52'] == 1:
        return 1
    else:
        return 0
data['ENOUGHSLEEP'] = data.apply(lambda x: ENOUGHSLEEP(x), axis=1)

def TROUBLEPAYATT(x):
    if x['H1ED16'] >= 1:
        return 1
    else:
        return 0
data['TROUBLEPAYATT'] = data.apply(lambda x: TROUBLEPAYATT(x), axis=1)

# create a personalized dataset only with the chosen variables for this research
mydata = data[['NOBREAKFAST','WATCHTV','PLAYSPORT','ENOUGHSLEEP','TROUBLEPAYATT']].dropna()
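As an aside, the row-wise apply calls can also be written as vectorized comparisons, which is the more idiomatic pandas form. A minimal sketch on a toy DataFrame (the column values below are made up for illustration):

```python
import pandas

# toy stand-in for two of the AddHealth columns (values are illustrative)
data = pandas.DataFrame({'H1GH23J': [1, 0, 1, 7],
                         'H1GH52': [1, 1, 0, 0]})

# a comparison yields a boolean Series; astype(int) maps it to 0/1,
# reproducing the 1 = yes / 0 = no recode without row-wise function calls
data['NOBREAKFAST'] = (data['H1GH23J'] == 1).astype(int)
data['ENOUGHSLEEP'] = (data['H1GH52'] == 1).astype(int)
print(data['NOBREAKFAST'].tolist())  # [1, 0, 1, 0]
print(data['ENOUGHSLEEP'].tolist())  # [1, 1, 0, 0]
```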
Explanatory variables:
• NOBREAKFAST = Has nothing for breakfast (1 = yes and 0 = no)
• WATCHTV = Watches TV (1 = yes and 0 = no)
• PLAYSPORT = Plays an active sport (1 = yes and 0 = no)
• ENOUGHSLEEP = Gets enough hours of sleep (1 = yes and 0 = no)
Response variable:
• TROUBLEPAYATT = Has trouble paying attention at school (1 = yes and 0 = no)
Research question: Are those having nothing for breakfast more or less likely to have trouble paying attention at school? To answer that question, I will use the logit function, setting "TROUBLEPAYATT" as my response variable and "NOBREAKFAST" as the explanatory variable. In addition, I will use odds ratios to explain the probability of having trouble paying attention at school when having vs. not having breakfast.
# logistic regression model
lreg1 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST', data=mydata).fit()
print(lreg1.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg1.params
conf = lreg1.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
Logit Regression Results
==============================================================================
Dep. Variable: TROUBLEPAYATT No. Observations: 6504
Model: Logit Df Residuals: 6502
Method: MLE Df Model: 1
Date: Tue, 13 Jun 2017 Pseudo R-squ.: 0.002436
Time: 17:34:11 Log-Likelihood: -3564.0
converged: True LL-Null: -3572.7
LLR p-value: 3.014e-05
===============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 1.1030 0.032 34.485 0.000 1.040 1.166
NOBREAKFAST 0.3169 0.078 4.088 0.000 0.165 0.469
===============================================================================
Odds Ratios
Lower CI Upper CI OR
Intercept 2.829976 3.207982 3.013057
NOBREAKFAST 1.179349 1.598154 1.372874
The generated output indicates:
• Number of observations: 6504
• The p-value of "NOBREAKFAST" is lower than the α-level of 0.05: the association is significant.
• The coefficient of "NOBREAKFAST" is positive.
• Interpretation of the odds ratio: the odds of having trouble paying attention at school are 1.37 times higher for
students who have nothing for breakfast than for students who have breakfast.
• Confidence interval: we are 95% confident that the population odds ratio falls between 1.18 and 1.60.
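As a sanity check, the reported odds ratio and its confidence interval can be reproduced by exponentiating the logit coefficient and its approximate 95% bounds. A minimal sketch, using only the estimate and standard error from the summary table above:

```python
import math

# NOBREAKFAST coefficient and standard error from the summary table
coef, se = 0.3169, 0.078

odds_ratio = math.exp(coef)            # point estimate
ci_lower = math.exp(coef - 1.96 * se)  # lower 95% bound
ci_upper = math.exp(coef + 1.96 * se)  # upper 95% bound
print(round(odds_ratio, 2), round(ci_lower, 2), round(ci_upper, 2))  # 1.37 1.18 1.6
```

This matches the "Odds Ratios" output, since exponentiating the coefficient CI [0.165, 0.469] gives the odds-ratio CI [1.18, 1.60].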
Document written in LaTeX
template_version_01.tex
I will now add a second explanatory variable, "ENOUGHSLEEP", to the model and study the results.
# logistic regression model
lreg2 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP', data=mydata).fit()
print(lreg2.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg2.params
conf = lreg2.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
Logit Regression Results
==============================================================================
Dep. Variable: TROUBLEPAYATT No. Observations: 6504
Model: Logit Df Residuals: 6501
Method: MLE Df Model: 2
Date: Tue, 13 Jun 2017 Pseudo R-squ.: 0.01991
Time: 17:34:11 Log-Likelihood: -3501.6
converged: True LL-Null: -3572.7
LLR p-value: 1.272e-31
===============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 1.7497 0.072 24.392 0.000 1.609 1.890
NOBREAKFAST 0.2243 0.079 2.852 0.004 0.070 0.378
ENOUGHSLEEP -0.8107 0.077 -10.571 0.000 -0.961 -0.660
===============================================================================
Odds Ratios
Lower CI Upper CI OR
Intercept 4.998263 6.621168 5.752768
NOBREAKFAST 1.072676 1.459978 1.251432
ENOUGHSLEEP 0.382490 0.516639 0.444533
The generated output indicates:
• The p-value of "ENOUGHSLEEP" is lower than the α-level of 0.05: the association is significant.
• The coefficient of "ENOUGHSLEEP" is negative.
• Interpretation of the odds ratio: the odds of having trouble paying attention at school for students who get enough
sleep are 0.44 times the odds for students who do not; that is, enough sleep is associated with lower odds of trouble.
• Confidence interval: we are 95% confident that the population odds ratio falls between 0.38 and 0.52.
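To make the odds ratios more concrete, the fitted coefficients can be converted into predicted probabilities via the inverse logit, p = 1/(1 + e^(−η)). A minimal sketch with the coefficients hard-coded from the two-variable summary above (not re-fitted from the data):

```python
import math

# Coefficients from the two-variable model summary above
b0, b_nobreakfast, b_enoughsleep = 1.7497, 0.2243, -0.8107

def predicted_prob(nobreakfast, enoughsleep):
    """Inverse logit of the linear predictor for a given covariate pattern."""
    eta = b0 + b_nobreakfast * nobreakfast + b_enoughsleep * enoughsleep
    return 1 / (1 + math.exp(-eta))

for nb in (0, 1):
    for sl in (0, 1):
        print(f"NOBREAKFAST={nb} ENOUGHSLEEP={sl}: p={predicted_prob(nb, sl):.3f}")
```

The probabilities are highest for students who skip breakfast and lowest for students who get enough sleep, in line with the signs of the two coefficients.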
To conclude, in order to check for possible confounding, I will add the remaining explanatory variables to the model.
# logistic regression model
lreg3 = smf.logit(formula = 'TROUBLEPAYATT ~ NOBREAKFAST + ENOUGHSLEEP + WATCHTV + PLAYSPORT', data=mydata).fit()
print(lreg3.summary())
# odds ratios with 95% confidence intervals
print("Odds Ratios")
params = lreg3.params
conf = lreg3.conf_int()
conf['OR'] = params
conf.columns = ['Lower CI','Upper CI','OR']
print(numpy.exp(conf))
Logit Regression Results
==============================================================================
Dep. Variable: TROUBLEPAYATT No. Observations: 6504
Model: Logit Df Residuals: 6499
Method: MLE Df Model: 4
Date: Tue, 13 Jun 2017 Pseudo R-squ.: 0.02063
Time: 17:34:11 Log-Likelihood: -3499.0
converged: True LL-Null: -3572.7
LLR p-value: 7.243e-31
===============================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 1.3389 0.205 6.547 0.000 0.938 1.740
NOBREAKFAST 0.2338 0.079 2.963 0.003 0.079 0.388
ENOUGHSLEEP -0.8226 0.077 -10.688 0.000 -0.973 -0.672
WATCHTV 0.3679 0.195 1.882 0.060 -0.015 0.751
PLAYSPORT 0.0829 0.065 1.277 0.202 -0.044 0.210
===============================================================================
Odds Ratios
Lower CI Upper CI OR
Intercept 2.555077 5.695756 3.814852
NOBREAKFAST 1.082392 1.474696 1.263408
ENOUGHSLEEP 0.377780 0.510817 0.439291
WATCHTV 0.984947 2.119010 1.444684
PLAYSPORT 0.956615 1.233856 1.086428
The generated output indicates:
• The p-values of "WATCHTV" and "PLAYSPORT" exceed the α-level of 0.05, so neither is a significant predictor.
• "NOBREAKFAST" and "ENOUGHSLEEP" remain significant with essentially unchanged coefficients after adjustment,
so there is no evidence that "WATCHTV" or "PLAYSPORT" confound the association between skipping breakfast and
trouble paying attention.
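The contribution of the two extra variables can also be checked with a likelihood-ratio test of the full model against the two-variable model, using only the log-likelihoods reported in the summaries above. A quick sketch (for df = 2, the chi-square survival function simplifies to e^(−x/2)):

```python
import math

# Log-likelihoods from the statsmodels summaries above
llf_reduced = -3501.6  # NOBREAKFAST + ENOUGHSLEEP
llf_full = -3499.0     # + WATCHTV + PLAYSPORT

lr_stat = 2 * (llf_full - llf_reduced)  # chi-square statistic with df = 2
p_value = math.exp(-lr_stat / 2)        # chi-square survival function for df = 2
print(round(lr_stat, 1), round(p_value, 3))  # 5.2 0.074 -> fail to reject at 0.05
```

The joint test (p ≈ 0.074) is consistent with the individual Wald tests: adding WATCHTV and PLAYSPORT does not significantly improve the fit at the 0.05 level.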
3 Codebook