Basic Biostatistics and Data Management Using
Stata
Tadesse Awoke Ayele (PhD, Associate Professor)
University of Gondar
College of Medicine and Health Sciences
Institute of Public Health
November 22, 2019
Tadesse A. Ayele (Universities of Gondar) November 22, 2019 1 / 294
Contact Detail
Instructor: Tadesse Awoke (PhD, Associate Professor)
Department: Epidemiology and Biostatistics
Qualification 1: Biostatistics
Qualification 2: Public Health
Email: tawoke7@gmail.com
Phone: +251910173308
Time: Nov 20-27, 2019
Office No: 10
Office hours: Available upon request
We are drowning in information and starving for
knowledge!
Course Description
This course gives students an overview of descriptive statistics,
probability, sampling, inference, tests of association, comparison of
means, correlation and regression, and Generalized Linear Models
(GLMs).
The sessions introduce the definitions and basic concepts of
descriptive statistics, sampling, sample size determination,
statistical inference, the t-test, ANOVA, correlation, linear
regression, logistic regression, Poisson regression, negative binomial
regression, and zero-inflated Poisson regression.
The overall emphasis will be placed on understanding the language of
statistics and the art of statistical investigation.
Course Objectives
The aim of a biostatistical study is to provide numbers that carry
information about a given situation and to present them in such a way
that valid interpretations are possible.
Specific objectives
Describe the process involved in data processing
Explain the different types of statistical models
Identify statistical models which fit the data well, based on the
research question
Choose and fit an appropriate statistical model for given
medical/health data
Identify assumptions and limitations of selected statistical models
Interpret model parameter estimates
Draw conclusions based on the information obtained from the data
Course Outline
Chapter 1: General Introduction
Research and Biostatistics
Concept of Biostatistics
Variable, Data, Information, knowledge, and wisdom
Sample Datasets
Introduction to STATA
General and Generalized Linear Models (GLMs)
Chapter 2: Analysis of Continuous Variable
Comparison of means
Student t-test
Analysis of Variance (ANOVA)
Analysis of Covariance (ANCOVA)
Correlation and Linear Regression
Chapter 3: Analysis of Categorical Data
Categorical Data Analysis
Logistic Regression
Conditional Logistic Regression
Ordinal Logistic Regression
Multinomial Logistic Regression
Multilevel Logistic Models
Chapter 4: Models for Count data
Poisson Regression
Negative Binomial Regression
Zero-inflated Poisson Regression
Multilevel Poisson Models
Chapter 5: Survival Data analysis
Time to event data
Life Table
Kaplan-Meier Survival curve
Cox-proportional hazards regression
Proportional Hazard Assumption
References
Bland M. An Introduction to Medical Statistics
Colton T. Statistics in Medicine
Daniel WW. Biostatistics: A Foundation for Analysis in the Health
Sciences
Kirkwood BR. Essentials of Medical Statistics
Knapp RG, Miller MC. Clinical Epidemiology and Biostatistics.
Baltimore: Williams and Wilkins, 1992
Armitage P, Berry G. Statistical Methods in Medical Research
Pagano M, Gauvreau K. Principles of Biostatistics
Teaching Methods and Evaluation
Teaching Methods
Lecture
Exercise
Assignment
Evaluation
Participation 5%
Assignment 15%
Project 20%
Final Exam: 60%
Tentative Schedule
Chapter                       Day
General Introduction          Day 1
Models for Continuous data    Day 1
Models for Categorical data   Day 2
Models for count data         Day 2
Survival Analysis             Day 3
Non-parametric Methods        Day 3
Chapter 1
Introduction
What is Research?
Research is defined as the systematic
Collection
Organization
Analysis and
Interpretation
of data for the purpose of
Answering a question
Solving a problem or
Adding to the body of knowledge
The ultimate objective of research is to generate valid evidence for
Program planning
Policy Making
Intervention
Evaluation
Decision making
...
Industrial revolutions
The first revolution (1765):
Going from hand to machine production, with increasing use of steam
and water power
The second revolution (1870):
Expansion of electricity, petroleum and steel
The third revolution (1969):
Digital technology: clever software, robots, and a whole range of
web-based services
The fourth industrial revolution?
The raw material powering the fourth revolution is data
Data is the new oil!
Unlike oil, data is created by people, not geology
Data can help solve many problems in:
Health
Economy
Lifestyle, climate, ...
Quantitative Methods
Data are not just numbers, they are numbers with a context
In data analysis, context provides meaning
Context includes who collected the data, how and why the data were
collected, when and where they were collected, and what exactly was
collected
The who, how and why questions refer to the pragmatics of data
collection, whereas the when, where and what questions refer to data
semantics
Pragmatics is the study of language from the point of view of usage
Semantics is concerned with the study of meaning and is related to
both philosophy and logic
Quantitative Methods
Quantitative research is defined as the systematic investigation of
phenomena by gathering quantifiable data and applying statistical,
mathematical or computational techniques to explain a particular
phenomenon
Quantitative methods include epidemiology and biostatistics
Illustration
Figure 1: Schematic presentation
Misuse of Statistics
The statistical conclusion is used with other knowledge to reach a
substantive conclusion
Statistics has several limitations
The statistical conclusion refers to groups and not individuals
It only summarizes but does not interpret data
Statistics can be misused by selective presentation of desired results
It is a tool that can be used well or can be misused
The human must be able to intelligently interpret the output from the
computer
Misuse of Statistics
When examining statistical information consider the following:
Was the sample used to gather the statistical data unbiased and of
sufficient size?
Is the methodology (design) used appropriate?
Is the analysis technique appropriate for the data?
Is the statistical statement ambiguous? Could it be interpreted in more
than one way?
Examples
1 Probability of surviving surgery
A patient is examined and told he needs surgery. He asks the doctor
for the probability of surviving, and the doctor tells him it is 100%.
The patient asks how. The doctor says, "Of every 10 patients who go
through this operation, 9 die. The 9th patient died yesterday, and
you are the 10th."
2 Harvard University example
There were three female students at Harvard University. Of the three,
one married her professor. Hence the claim that 33% of female
students at Harvard married their professor.
3 Soldiers' height and crossing the river
Type I and Type II error
Figure 2: Types of error in hypothesis testing.
Introduction(1)
The 21st century is the period in which information becomes:
Money
Power
Everything
Biostatistics provides the most fundamental tools and techniques of
the scientific method for generating information
Used for:
Forming hypotheses
Designing experiments and observational studies
Gathering data
Summarizing data
Drawing inferences from data (e.g. testing hypotheses)
Introduction(2)
The field of statistics can be divided into two:
1 Mathematical Statistics: study and develop statistical theory and
methods in the abstract
2 Applied Statistics: application of statistical methods to solve real
problems involving randomly generated data and the development of
new statistical methodology motivated by real problems
Applied statistics can be further subdivided into two branches
Descriptive statistics: describes what we have in hand (the sample), e.g. the
sample mean, sample variance, sample proportion, ...
Inferential statistics: generalizes the findings from the sample to the
population (estimation and hypothesis testing)
Introduction(3)
Biostatistics is the branch of applied statistics directed toward
applications in the health sciences and biology
Why biostatistics?
Because some statistical methods are more heavily used in health
applications than elsewhere (eg. survival analysis and longitudinal
data analysis)
Because examples are drawn from health sciences:
Makes subject more appealing to those interested in health
Illustrates how to apply methodology to similar problems encountered
in real life
Variable, Data and Information
Variable: a characteristic which takes different values
Quantitative variables
Discrete, e.g. number of children in a family, ...
Continuous, e.g. weight, height, BP, VL, ...
Qualitative variables
Nominal, e.g. marital status, religion, ...
Ordinal, e.g. education status, patient satisfaction
Types of Variable
Variables can also be classified into two broad categories
Outcome variable
Also called the response or dependent variable
It is the focus of the research
Affected by other (independent) variables
Predictor variables
Also called explanatory or independent variables
Affect the outcome variable
Example 1
In a study to determine whether surgery or chemotherapy results in
higher survival rates for a certain type of cancer, whether or not the
patient survived is one variable, and whether they received surgery or
chemotherapy is the other.
1 Identify the outcome variable
2 Identify the predictor variable
Solution
Outcome variable: whether or not the patient survived
Predictor variable: treatment received (surgery or chemotherapy)
Example 2
Global burden of noncommunicable diseases (NCDs) and risk factors:
NCDs are by far the leading cause of death in the Region, representing
62% of all annual deaths.
NCD risk factors include:
Tobacco
Harmful use of Alcohol
Sedentary behaviour and physical inactivity
Obesity
Unhealthy diet.
Schematic presentation
Figure 3: WHO NCD burden.
Data: measurements (observations) taken on a variable
A collection of data is often called a dataset
Data can be quantitative or
qualitative
However, data are raw: the required evidence cannot be easily obtained
from them without processing.
Illustration
Data is raw, unorganized facts that need to be processed.
When data is processed, organized, structured or presented in a given
context so as to make it useful, it is called information.
Data and information
Figure 4: Relationship between data and information
Information: Data that is
specific and organized for a purpose
presented within a context that gives it meaning and relevance and
can lead to an increase in understanding and decrease in uncertainty
Biostatistics/Statistics is the tool which converts data to information.
Example Datasets
Example 1: Jimma Child Mortality Data
The data were collected to establish risk factors affecting infant survival.
Children born in Jimma, Kaffa, and Illubabor were examined for their first-year
growth characteristics. There were 8050 infants enrolled in the study.
The variables included were:
ID    TotPrg  TotBrth  Abortion  StillB  Gravida  Event  Durtaion  parity  MamAge
1     5       5        2         0       0        0      360       0       30
2     3       3        2         0       0        0      62        0       23
3     8       8        2         0       0        0      360       0       32
4     5       5        2         0       0        0      356       0       30
5     4       2        1         1       0        0      362       1       23
...   ...     ...      ...       ...     ...      ...    ...       ...     ...
8050  0       4        4         2       0        0      361       2       35
Objective: To determine the survival probability of newborns
Statistical approach: ...?
Example 2: Treatment Effect data
The data were collected on ulcer patients. The patients were randomized
into two arms (placebo and treatment).
The variables in the data are age, duration, treatment, time and
result.
Data
ID   Age  Duration  Treatment  Time  Result
1    48   2         1          7     1
2    73   1         1          12    0
3    54   1         1          12    0
4    58   2         1          12    0
5    56   1         0          12    0
6    49   2         0          12    0
...  ...  ...       ...        ...   ...
42   61   1         0          12    0
43   33   2         1          12    0
Objective of the study: To determine the effect of treatment on ulcer
Statistical approach: ...?
Example 3: HIV/AIDS data
Follow-up data on HIV/AIDS patients at Gondar Hospital. The variables
time, age, sex, residence, WHO stage and CD4 cell count were
included.
Data
ID    Months  Sex  Age  Residence  Stage  FCD4
1     1       1    20   1          2      100
1     2       1    20   1          2      64
1     3       1    20   1          2      611
1     4       1    20   1          2      744
2     1       0    39   2          3      166
2     2       0    39   2          3      363
2     3       0    39   2          3      263
2     4       0    39   2          3      164
3     1       0    48   1          2      361
...   ...     ...  ...  ...        ...    ...
2550  1       1    30   0          4      441
Objective: To assess the change in CD4 count over time.
Statistical approach: ...?
Infant growth data
Children were followed for 12 months and measurements were taken
every two months. Part of the data is given below:
Obs  child  age (months)  weight (grams)  CatBMI
1    1      0             2900            1
2    1      2             3100            0
3    1      4             3180            1
...  ...    ...           ...             ...
7    1      12            8000            1
8    2      0             3200            0
9    2      2             3340            0
...  ...    ...           ...             ...
Objective: To assess the change in weight over time.
Statistical approach: ...?
Gilgel-Gibe Mosquito Data
A study was conducted around the Gilgel-Gibe dam for three years to assess
the influence of the dam on mosquito abundance and species composition.
Eight 'at risk' and eight 'control' villages were selected based on distance.
     +--------------------------------------+
     | ID   Village   Time   Gamb   Season  |
     |--------------------------------------|
  1. |  1   at risk      0      0   wet     |
  2. |  1   at risk      1      0   wet     |
  3. |  1   at risk      2      9   wet     |
  4. |  1   at risk      3     30   wet     |
     |--------------------------------------|
  6. |  1   at risk      5      0   dry     |
     ...
Objectives: To compare mosquito abundance between the village groups and to
characterize the species composition.
Statistical approach: ...?
In all the above examples, the variables are presented in the first row
The body of the table represents the data
Every dataset is collected for a purpose (research question)
Data need to be analyzed to generate information
Introduction to Stata
Introduction
It is a multi-purpose statistical package to help you explore,
summarize and analyze datasets
A dataset is a collection of several pieces of information called
variables (usually arranged by columns)
A variable can have one or several values (information for one or
several cases)
Other statistical packages are SPSS, SAS and R
Stata is widely used in social science research and the most used
statistical software on campus.
Data Management Software
There are several possible data management software packages
The most common are Stata, SPSS, SAS and R
They all have different features
For this course, we will focus on Stata
Introduction to Stata
Stata is a complete, integrated statistical package that provides
everything you need for data analysis, data management, and
graphics
It is fast and easy to use
The current version is Stata 16 (released in 2019)
Stata/IC (the standard edition) can handle up to 2,048 variables
Stata/SE can handle up to 32,767 variables
The number of observations is limited mainly by your computer's memory
(up to about 2 billion!)
Stata has 4 main windows:
Command: where commands are entered. Command and variable names
are case sensitive.
Results: where results appear
Review: where past commands are listed
Clicking a past command in the Review window brings it to the Command
window, where it can be modified and re-executed
Variables: where variables in the current dataset are listed
Stata Windows
The four windows of Stata are given here
Other windows
Graphics
Data editor
Data viewer
Do-file editor
Log
Important details
Stata is case sensitive!
In Stata 11 and earlier, first set the memory: set memory 100m
(Stata 12 and later manage memory automatically)
You can only have one dataset active at a time
Abbreviations work for commands and variable names!
Help is a very important part of Stata
Two interactive options:
help commandname
search keyword
There are lots of things you can do in Stata without data!
* lower and upper bounds of a 95% confidence interval
display 4.1-1.96*0.3
display 4.1+1.96*0.3
* 2x2 contingency table (rows are separated by a backslash)
tabi 100 34 \ 17 294
* with column percentages
tabi 100 34 \ 17 294, col
* with column, row and cell percentages and the chi-squared test
tabi 100 34 \ 17 294, col row cell chi
* odds ratio with confidence interval
cci 100 34 17 294
cci 100 34 17 294, exact
* sample size for comparing two proportions
sampsi 0.2 0.5
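As a rough check on what cci reports, the odds ratio for the 2x2 table can also be computed by hand with display. This is only a sketch; it assumes the cell order that cci expects (exposed cases, unexposed cases, exposed controls, unexposed controls).

```stata
* odds ratio by hand: (a*d)/(b*c) with a=100, b=34, c=17, d=294
display (100*294)/(34*17)
* the same odds ratio, with a confidence interval, from the immediate command
cci 100 34 17 294
* cci leaves the point estimate in r(or)
display r(or)
```

The two displayed values should agree, which is a quick way to confirm that the cells were typed in the intended order.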
Inputting Data
Many Options:
Manually enter data into the Stata Data Editor
Copy data into the Data Editor from another source (ex.: Excel)
Import an ASCII (text) file.
Read in an Excel spreadsheet, or a tab- or comma-delimited text file
exported from it.
Open an existing Stata data file (file extension .dta)
Use a conversion package (Stat/Transfer) to read in data from another
package (e.g. SPSS, SAS, ...).
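The command equivalents of these options might look as follows. This is a minimal sketch, not part of the original slides: it uses the auto dataset shipped with Stata, and the file names auto_copy.* are made up for illustration. The import and export commands shown require Stata 13 or later.

```stata
* build a small example: export a bundled dataset to CSV, Excel and .dta
sysuse auto, clear
export delimited using "auto_copy.csv", replace
export excel using "auto_copy.xlsx", firstrow(variables) replace
save "auto_copy.dta", replace

* read a comma-delimited text file
import delimited using "auto_copy.csv", clear
* read an Excel sheet whose first row holds the variable names
import excel using "auto_copy.xlsx", firstrow clear
* open an existing Stata data file
use "auto_copy.dta", clear
describe
```

Each reading command uses the clear option because only one dataset can be in memory at a time.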
Manual Data entry
First open the data editor window to enter data manually
Second, define variable names:
Note that variables are automatically named var1, var2, ...
Double-click on the top of a column to view/edit "Variable Properties" and
change the name
Via command: rename oldvarname newvarname
E.g. rename var1 id
Third, enter the observations for each variable.
To move to the next observation, press "Enter"
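Small datasets can also be typed from the command line or a do-file with the input command. The sketch below is an illustration only; it enters the ID and age of the first three ulcer patients from the example data and then renames the default variable names.

```stata
clear
* type three observations directly; the columns are called var1 and var2 for now
input var1 var2
1 48
2 73
3 54
end
* give the variables meaningful names
rename var1 id
rename var2 age
list
```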
Data entry
Variables can be defined in this window
Importing Data
Import via drop-down menu: File → Import → ASCII data created by a
spreadsheet.
Open via drop-down menu: File → Open → browse to find the data file
The clear option clears the memory before loading the requested
data file.
This is necessary because Stata can have only one dataset in memory
at a time!
Interactive command line
Do-file
The commands used to produce the output can be saved in a "do-file"
Commands can also be executed by highlighting the relevant lines and
running the selection
Any time we want to rerun the commands, we can open the do-file
Log file
A log file is used to save the output produced
This can be done by clicking Begin under the Log menu
File → Log → Begin
If you want to stop saving the output, suspend or close the log using
File → Log → Suspend (or Close)
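The same logging steps can be written as commands and kept in a do-file. A minimal sketch, assuming the illustrative file name session1.log and the bundled auto dataset:

```stata
* start saving output to a plain-text log (the file name is illustrative)
log using "session1.log", replace text
sysuse auto, clear
summarize mpg
* pause and resume logging
log off
log on
* stop logging and close the file
log close
```

Saving these lines in a do-file means the whole session, including its log, can be reproduced later with a single run.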
Data Management and Analysis
Stata works in two ways:
1 Menu driven
2 Command driven
The latter is used in this material
Data Management
Once the data have been entered or imported, data management is
the next step
Data should be cleaned, edited, and recoded before analysis starts
Cleaning can be done through visual inspection, running frequencies
and cross-tabulation
When an error is detected, it can be corrected by checking the hard copy
and, if necessary, the data collector
Some variables may need recoding to make interpretation clearer
or when a model assumption is violated
Data Management
Basic commands for generating new variables or recoding existing
variables
gen (generate): to create a new variable, often from existing variables
replace: to change the values of an existing variable, e.g. to fill in the
remaining categories of a new variable
recode: to recode the values of a variable, e.g. to group a continuous
variable into categories
egen: extended generate, e.g. to create a new variable that counts the
number of "yes" responses
merge: to add variables from a second dataset to the dataset you're currently
working with
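The commands listed above could be combined as in the sketch below. The tiny dataset and all variable names (q1 to q3, age, sex, older, agegrp, n_yes) are invented for illustration. Because the example uses a temporary file stored in a local macro, it should be run as a whole from a do-file.

```stata
clear
input id q1 q2 q3 age
1 1 0 1 23
2 0 0 1 35
3 1 1 1 47
end

* gen + replace: build a 0/1 indicator in two steps
generate older = 0
replace older = 1 if age >= 40 & !missing(age)

* recode: group a continuous variable into categories
recode age (min/29 = 1) (30/39 = 2) (40/max = 3), generate(agegrp)

* egen: count the number of "yes" (coded 1) answers across q1-q3
egen n_yes = anycount(q1 q2 q3), values(1)

* merge: add a variable held in a second dataset, matching on id
tempfile main
save `main'
clear
input id sex
1 0
2 1
3 1
end
merge 1:1 id using `main'
list
```

After merge, the variable _merge records whether each observation was found in one or both datasets, which is worth tabulating before continuing.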
Data Management
list, describe, keep, drop, rename and so on.
Data Management
Data management using menu driven
Data Analysis
There are two basic analysis procedures
Descriptive versus Analytic
In Stata, these can be done either
Menu driven or command driven
Part II
Statistical Models for Continuous variable
Models
Statistics includes the process of finding out about patterns in the
real world using data
When solving statistical problems it is often helpful to make models
of real world situations based on observations of data, assumptions
about the context, and on theoretical probability
The model can then be used to make predictions, test assumptions,
and solve problems
These models can be either deterministic or stochastic
A deterministic model does not include elements of randomness
A stochastic (probabilistic) model includes elements of randomness.
Components of the model
There are three components in a GLM
$g(\mu) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$, where $\mu = E(y)$
Random component
$y \sim \text{dist}(\cdot)$
Systematic component
$\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$
Link function
The function which links the mean of the random component to the systematic
component
$g(\mu)$
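In Stata the three components map onto the glm command: the variable list gives the systematic component, family() the random component, and link() the link function. The sketch below is an illustration on the bundled auto dataset, not an analysis from the course data.

```stata
sysuse auto, clear
* binary outcome: binomial random component, logit link
glm foreign mpg weight, family(binomial) link(logit)
* continuous outcome: Gaussian random component, identity link
* (the same model as: regress price mpg weight)
glm price mpg weight, family(gaussian) link(identity)
```

Changing only family() and link() while keeping the same predictors is what makes the different regression models in this course members of one family.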
Deterministic Versus Stochastic Models
Deterministic
Model parameters are exact
$E = mc^2$
The output of the model is fully determined by the parameter values
and the initial conditions
Stochastic
Models possess some inherent randomness
$g(y) = X\beta + \varepsilon_{ij}$
The same set of parameter values and initial conditions will lead to an
ensemble of different outputs
Output has variation
The natural world is buffeted by stochasticity.
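The contrast can be made concrete with a small simulation. This is a sketch under made-up assumptions: the intercept 2, slope 3, error standard deviation 0.5 and the seed are arbitrary choices for illustration.

```stata
clear
set obs 100
set seed 2019
generate double x = runiform()
* deterministic: the same x always gives the same y
generate double y_det = 2 + 3*x
* stochastic: a random error term is added, so y varies for a given x
generate double y_sto = 2 + 3*x + rnormal(0, 0.5)
summarize y_det y_sto
correlate x y_det y_sto
```

Rerunning the block with a different seed leaves y_det unchanged for a given x but produces a new set of y_sto values.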
Task 1
Review the literature in your field of study
Locate studies and read the methods section
List the dependent variable
List the independent variables
What statistical methods were used?
Is the model appropriate for the outcome variable?
Submission date:
Chapter 2
Analysis of Continuous Variable
Introduction
A continuous variable is one which can take on infinitely many
(uncountable) possible values in a range of real numbers.
The probability distribution of a continuous variable can be expressed
in terms of a probability density function.
Figure 5: Probability Density Function.
Location and Scale Parameters
A probability distribution is characterized by location and scale
parameters, and its shape is described by skewness and kurtosis
The location parameter determines where the distribution is centred
(e.g. the mean)
The scale parameter determines how spread out it is (e.g. the standard
deviation)
Skewness measures asymmetry
Negatively skewed
Symmetric
Positively skewed
Kurtosis measures the heaviness of the tails
Mesokurtic: normal distribution
Leptokurtic: distribution with heavy tails on either side
Platykurtic: distribution with light tails
Location and scale parameters are typically used in modeling
applications
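These quantities can be inspected in Stata with summarize and its detail option. A minimal sketch on the bundled auto dataset (an assumption, not the course data); note that Stata reports kurtosis on a scale where the normal distribution has the value 3.

```stata
sysuse auto, clear
* the detail option adds percentiles, skewness and kurtosis
summarize price, detail
display "mean (location)   = " r(mean)
display "std. dev. (scale) = " r(sd)
display "skewness          = " r(skewness)
display "kurtosis          = " r(kurtosis)
```

A positive skewness value indicates a longer right tail, and a kurtosis above 3 indicates heavier tails than the normal distribution.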
Modeling Continuous Variable
The choice of model depends on the research question
If we are interested in comparing the means of one or two populations
The t-test is appropriate
There are three different t-tests
One sample t-test
Two independent samples t-test
Paired sample t-test
If we are interested in comparing the means of more than two populations
Analysis of Variance (ANOVA)
Different extensions of ANOVA
One-way ANOVA
Two-way and higher-way ANOVA
ANCOVA, MANOVA, MANCOVA
Repeated measures ANOVA
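The three t-tests and one-way ANOVA listed above correspond to the following commands. This is a sketch only; it assumes the auto and bpwide example datasets that ship with Stata, and the test value 20 is arbitrary.

```stata
sysuse auto, clear
* one-sample t-test: is mean mpg equal to 20?
ttest mpg == 20
* two independent samples: domestic versus foreign cars
ttest mpg, by(foreign)
* one-way ANOVA: mean mpg across the repair-record groups
oneway mpg rep78, tabulate

* paired t-test: two measurements on the same patients
sysuse bpwide, clear
ttest bp_before == bp_after
```

The form of the ttest command follows the study design: a constant on the right for one sample, by() for independent groups, and two variables for paired measurements.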
Modeling Continuous Variable
If we want to determine the relationship between two continuous
variables
Correlation
If we want to estimate the effect of one variable on the dependent
variable
Simple Linear Regression
If we want to estimate the effect of multiple variables on the dependent
variable
Multiple Linear Regression
Correlation and Linear Regressions
Correlation
As OR and RR are used to quantify risk between two dichotomous
variables, correlation is used to quantify the degree to which two
random continuous variables are related, provided the relationship is
linear.
If we are interested in looking at relationship between two continuous
random variables, before we conduct any type of analysis we should
always create a two way scatter plot for our visual understanding of
the relationship
We now consider ways of describing relationship between numeric
variables from a single population.
Consider the relationship between the weight and length (height) of
children in the Jimma data.
The correlation coefficient
Before we conduct any statistical analysis, we should always create a
scatter plot of the data.
The weight and length data can be presented in a scatter plot.
One can see from the scatter plot that weight tends to increase as
length increases.
The correlation coefficient may be used to measure the degree of
association between the two variables.
Alternatively, if we are interested in predicting the weight of children
from their length, we could fit a linear regression model (a straight
line) to the data.
Scatter plot of weight against length
[Figure: scatter plot of weight (vertical axis, 1000 to 6000) against length (horizontal axis, 30 to 70), with fitted values]
Another way of looking at correlation is that it is a measure of the
scatter of the points around the underlying linear trend (the greater
the spread of points, the lower the correlation)
STATA code for scatter plot and lines:
twoway (scatter weight length) (lfit weight length)
Correlation coefficient
In the underlying population from which the sample of points $(x_i, y_i)$
is selected, the correlation coefficient between $X$ and $Y$ is denoted by
$\rho$ and is given by
$$\rho = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
It can be thought of as the average of the product of the standardized
deviates of $X$ and $Y$
Estimation of the correlation coefficient
The estimator of the population correlation coefficient is given by
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
$r$ is called the product moment correlation coefficient or Pearson
correlation coefficient, and $-1 \le r \le 1$
$r$ close to 1 implies a strong direct relationship.
$r$ close to $-1$ implies a strong inverse relationship.
$r = 0$ when there is no linear relationship at all
The Pearson correlation coefficient is a measure of the degree of
straight-line relationship
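The formula for r can be checked numerically by building the sums of products and squares step by step and comparing the result with the built-in correlate command. This is a sketch on the bundled auto dataset, which happens to contain variables named weight and length; it is not the Jimma data, so the value will differ from the one reported in the slides.

```stata
sysuse auto, clear
quietly summarize weight
scalar ybar = r(mean)
quietly summarize length
scalar xbar = r(mean)
* cross-products and squared deviations for each observation
generate double dxy = (length - xbar)*(weight - ybar)
generate double dxx = (length - xbar)^2
generate double dyy = (weight - ybar)^2
quietly summarize dxy
scalar sxy = r(sum)
quietly summarize dxx
scalar sxx = r(sum)
quietly summarize dyy
scalar syy = r(sum)
* Pearson correlation from the definition
scalar r_hand = sxy / sqrt(sxx*syy)
display "r by hand     = " r_hand
* built-in estimate for comparison
quietly correlate weight length
display "r (correlate) = " r(rho)
```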
Interpretation
The correlation was found to be 0.5648.
. pwcorr weight length
             |   weight   length
-------------+------------------
      weight |   1.0000
      length |   0.5648   1.0000
This means that weight and length are positively correlated
The p-value (displayed when the sig option is added to pwcorr) is less
than 0.05, which indicates that the correlation is statistically
significant.
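To see the p-value alongside the coefficient, pwcorr can be run with the sig option. The sketch below again uses the bundled auto dataset as a stand-in for the Jimma data, so the numbers will not match the output above.

```stata
sysuse auto, clear
* sig prints the p-value under each coefficient; obs prints the sample size
pwcorr weight length, sig obs
* visual check of linearity with a scatter plot and fitted line
twoway (scatter weight length) (lfit weight length)
```

Looking at the plot before interpreting the coefficient guards against reading a strong but non-linear pattern as a weak association.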
Limitations of the correlation coefficient
It quantifies only the strength of the linear relationship between two
variables.
If the two variables have a non-linear relationship, it will not provide a
valid measure of this association.
Care must be taken when the data contain outliers; the correlation
coefficient is highly sensitive to extreme values
The estimated correlation coefficient should never be extrapolated
beyond the observed ranges of the variables; the relationship between
two variables may change outside of this region
It must be kept in mind that a high correlation coefficient between
two variables does not in itself imply a cause and effect relationship.
Introduction to Linear Regression
We frequently measure two or more variables on the same individual
(case, object, etc)
We do this to explore the nature of the relationship among these
variables. There are two basic types of relationships.
Relationship
Cause and effect
Functional
Function: a mathematical relationship enabling us to predict what
values of one variable (Y ) correspond to given values of another
variable (X)
Y : is referred to as the dependent variable, the response variable or
the predicted variable
X : is referred to as the independent variable, the explanatory variable
or the predictor variable.
Linear Regression
Questions to be answered
What is the association between Y and X?
How can changes in Y be explained by changes in X?
What are the functional relationships between Y and X?
A functional relationship is symbolically written as:
$Y = f(X)$
Example: Y=cholesterol versus X=age
Two kinds of explanatory variables:
Those we can control
Those over which we have little or no control
Hypertension Data
We want to study the effect of age ($X_1$) on blood pressure ($Y$).
For this purpose, consider linear regression.
Model: $Y = \alpha + \beta X_1 + \varepsilon$.
$\alpha$ and $\beta$ are unknown constants to be estimated.
$\varepsilon$ stands for the random error (including measurement error).
$\alpha$ is the expected value of $Y$ when $X_1$ is 0.
$\beta$ is the expected change in the mean of $Y$ for a unit change in $X_1$.
$\varepsilon$ is assumed to be normally distributed with mean 0 and constant variance.
Commonly, least squares (LS) or maximum likelihood (ML) techniques are used for estimation.
$\alpha$ and $\beta$ are estimated by $\hat{\alpha}$ and $\hat{\beta}$, respectively.
Estimated model: $\hat{Y} = \hat{\alpha} + \hat{\beta} X_1$.
Least squares estimators of $\alpha$ and $\beta$:
\[ \hat{\beta} = \frac{SS_{xy}}{SS_{xx}}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \]
As before, $SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.
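The least squares formulas above can be checked by hand on a small data set. The sketch below is written in Python rather than Stata so that it runs without the course data; the five (x, y) pairs and the function name `least_squares` are made up for illustration and are not the hypertension data.

```python
def least_squares(x, y):
    """Return (alpha_hat, beta_hat) for the simple model y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products around the means
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = ss_xy / ss_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical data, chosen only so the arithmetic is easy to follow
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
alpha_hat, beta_hat = least_squares(x, y)
print(alpha_hat, beta_hat)
```

For these made-up numbers SSxx = 10 and SSxy = 6, so the slope is 0.6 and the intercept is 4 - 0.6 x 3 = 2.2.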
Hypertension data: blood pressure ($Y$) and age ($X_1$):
Figure 6: Mean arterial blood pressure versus age (scatter plot with fitted values; x-axis: age in years).
Source | SS df MS Number of obs = 20
------------------------------------- F( 1, 18) = 13.82
Model | 243.265993 1 243.265993 Prob > F = 0.0016
Residual | 316.734007 18 17.5963337 R-squared = 0.4344
------------------------------------- Adj R-squared = 0.4030
Total | 560 19 29.4736842 Root MSE = 4.1948
-----------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf.Interval]
----------------------------------------------------------------
x1 | 1.430976 .3848601 3.72 0.002 .6224154 2.239537
_cons | 44.45455 18.7277 2.37 0.029 5.109098 83.79999
----------------------------------------------------------------
ANOVA table: it tells the story of how the regression equation accounts for variability in the response variable.
STATA CODE:
regress y x1
Estimate of intercept: $\hat{\alpha} = 44.45455$.
Estimate of slope (coefficient of age): $\hat{\beta} = 1.43098$.
STATA code for scatter plot and fitted line:
twoway (scatter y x1) (lfit y x1)
Conclusion
The ANOVA (F-test) part shows that the regression is significant (p = 0.0016).
Adj R-squared implies that about 40% of the variation in $Y$ is explained by the regression.
$\hat{\beta}$ is significant (p = 0.0016) and implies that for a unit (one-year) increase in age, the expected blood pressure increases, on average, by 1.43 mmHg.
The studentized residuals can be used for identifying outliers.
We use the ‘predict’ command with the ‘rstudent’ option to generate studentized residuals.
Test of Normality
We can use the ‘predict’ command to create residuals.
‘kdensity’, ‘qnorm’ and ‘pnorm’ can then be used to check the normality of the residuals.
kdensity stands for kernel density estimate.
It can be thought of as a histogram with narrow bins and a moving average.
[Figure: Kernel density estimate of the studentized residuals with a normal density overlaid (kernel = epanechnikov, bandwidth = 0.5256).]
STATA CODE:
predict r, resid
kdensity r, normal
The ‘pnorm’ command graphs a standardized normal probability (P-P) plot.
‘qnorm’ plots the quantiles of a variable against the quantiles of a normal distribution.
‘pnorm’ is sensitive to non-normality in the middle range of the data.
‘qnorm’ is sensitive to non-normality near the tails.
Figure 8: Quantiles of residuals (studentized residuals against the inverse normal).
STATA CODE:
pnorm r
qnorm r
Homoscedasticity of Residuals
Assumption: homogeneity of variance of the residuals.
If the model is well fitted, there should be no pattern in the residuals plotted against the fitted values.
If the variance of the residuals is non-constant, the residuals are heteroscedastic.
Figure 9: Residuals versus fitted values.
STATA CODE:
rvfplot, yline(0)
Formal tests: ‘estat imtest’ gives Cameron and Trivedi’s decomposition of the IM-test, and ‘estat hettest’ gives the Breusch-Pagan test.
If the p-value < 0.05, reject the null hypothesis that the variance is homogeneous.
Cameron & Trivedi’s decomposition of IM-test
---------------------------------------------------
              Source |       chi2     df      p
---------------------+-----------------------------
  Heteroskedasticity |       0.33      2    0.8458
            Skewness |       1.22      1    0.2697
            Kurtosis |       1.14      1    0.2848
---------------------+-----------------------------
               Total |       2.70      4    0.6097
---------------------------------------------------
Breusch-Pagan / Cook-Weisberg test for
heteroskedasticity
Ho: Constant variance
Variables: fitted values of y
chi2(1) = 0.01
Prob > chi2 = 0.9324
STATA CODE:
estat imtest
estat hettest
One categorical predictor
Relationship between a continuous outcome and a single categorical variable.
Dummy variables are needed for the categories.
Objective: estimation/prediction of the effect of the predictor on the response.
We want to study whether birth weight ($Y$) varies with sex ($X_1$; female = 0, male = 1).
For this purpose, consider linear regression.
Model: $Y = \alpha + \beta X_1 + \varepsilon$
$\alpha$ and $\beta$ are unknown constants to be estimated.
$\varepsilon$ stands for measurement error.
$\beta$ is the expected difference in the mean of $Y$ between the $X_1$ categories.
Source | SS df MS Number of obs = 7873
---------------------------------------- F( 1, 7871) = 97.91
Model | 25750334.5 1 25750334.5 Prob > F = 0.0000
Residual | 2.0700e+09 7871 262990.174 R-squared = 0.0123
-------------+------------------------- Adj R-squared = 0.0122
Total | 2.0957e+09 7872 266227.896 Root MSE = 512.83
------------------------------------------------------------------
weight | Coef. Std. Err. t P>|t| [95% Conf.Interval]
------------------------------------------------------------------
1.sex | 114.4014 11.56138 9.90 0.000 91.738 137.0647
_cons | 3055.323 8.253152 370.20 0.000 3039.144 3071.501
------------------------------------------------------------------
STATA CODE:
regress weight i.sex
‘i.sex’ is used in the code because sex is an indicator variable.
‘1.sex’ stands for males, and hence female is the reference group.
Estimate of intercept: $\hat{\alpha} = 3055.32$.
Estimate of slope (coefficient of sex): $\hat{\beta} = 114.40$.
Interpretation?
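One way to approach the interpretation question: with a single 0/1 predictor, the least squares intercept equals the mean of the reference group and the slope equals the difference between the two group means. The Python sketch below (not Stata) illustrates this on six invented birth weights; the data and the function name `fit_binary_predictor` are assumptions for illustration only.

```python
def fit_binary_predictor(y, x):
    """Least squares fit of y on a 0/1 indicator x; returns (alpha_hat, beta_hat)."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = ss_xy / ss_xx
    return y_bar - beta_hat * x_bar, beta_hat


# Hypothetical birth weights; sex coded female = 0, male = 1
weight = [3000, 3100, 2900, 3200, 3300, 3100]
sex = [0, 0, 0, 1, 1, 1]

alpha_hat, beta_hat = fit_binary_predictor(weight, sex)
mean_female = sum(w for w, s in zip(weight, sex) if s == 0) / sex.count(0)
mean_male = sum(w for w, s in zip(weight, sex) if s == 1) / sex.count(1)
print(alpha_hat, beta_hat, mean_female, mean_male - mean_female)
```

Read the same way, the Stata output above says that the mean birth weight of females is 3055.32 and that males are on average 114.40 heavier.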
Multiple linear regression
Extension of simple linear regression to more than one predictor.
Each predictor can be either a continuous or a categorical variable.
Objective: estimation/prediction of the effect of the predictors on the response.
Hypertension Data
We want to study the effect of $X_1$, $X_2$, $X_3$, $X_4$, $X_5$ and $X_6$ on blood pressure ($Y$).
Predictors: $X_1$ = age (years); $X_2$ = weight (kg); $X_3$ = body surface area (sq m); $X_4$ = duration of hypertension; $X_5$ = basal pulse (beats/min); $X_6$ = measure of stress.
For this purpose, consider multiple linear regression.
Figure 10: Scatter plot matrix of mean arterial blood pressure, age in years, weight in kg, body surface area, duration of hypertension, basal pulse (per minute) and measure of stress.
STATA CODE:
graph matrix y x1 x2 x3 x4 x5 x6
What do you observe from the scatter plot matrix?
Which variables seem to be correlated with which other variables?
Coefficient of determination ($R^2$):
\[ R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}. \]
The adjusted $R^2$ statistic is an estimate of the population $R^2$ taking account of the fact that the parameters were estimated from the data:
\[ \text{Adj-}R^2 = 1 - \frac{(n - 1)(1 - R^2)}{n - p}. \]
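The two formulas can be checked against the age-only regression output shown earlier (Model SS = 243.265993, Total SS = 560, n = 20). The Python sketch below assumes that p counts every estimated coefficient including the intercept, so p = 2 for that model; this reading is an assumption, but it reproduces the Adj R-squared that Stata reports.

```python
def r_squared(ss_model, ss_total):
    """R^2 as the share of the total sum of squares explained by the model."""
    return ss_model / ss_total


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2; p is assumed to include the intercept."""
    return 1 - (n - 1) * (1 - r2) / (n - p)


# Sums of squares copied from the age-only regression output
r2 = r_squared(243.265993, 560)
adj_r2 = adjusted_r_squared(r2, n=20, p=2)
print(round(r2, 4), round(adj_r2, 4))
```

The results round to 0.4344 and 0.4030, matching the R-squared and Adj R-squared lines of that output.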
$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \varepsilon$
Response: $Y$ = mean arterial blood pressure (mm Hg);
Predictors: $X_1$ = age (years); $X_2$ = weight (kg); $X_3$ = body surface area (sq m); $X_4$ = duration of hypertension; $X_5$ = basal pulse (beats/min); $X_6$ = measure of stress.
Source | SS df MS Number of obs = 20
------------------------------------- F( 6, 13) = 560.64
Model | 557.844135 6 92.9740225 Prob > F = 0.0000
Residual | 2.1558651 13 .165835777 R-squared = 0.9962
------------------------------------- Adj R-squared = 0.9944
Total | 560 19 29.4736842 Root MSE = .40723
This is a considerable improvement over the model with age only, judging by Adj R-squared.
---------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf.Interval]
-------------+-------------------------------------------------
x1 | .7032596 .0496059 14.18 0.000 .5960926 .8104266
x2 | .9699192 .0631086 15.37 0.000 .8335815 1.106257
x3 | 3.776502 1.580154 2.39 0.033 .3627878 7.190217
x4 | .0683829 .0484416 1.41 0.182 -.0362687 .1730346
x5 | -.0844846 .0516091 -1.64 0.126 -.1959792 .0270101
x6 | .0055715 .0034123 1.63 0.126 -.0018003 .0129433
_cons | -12.87047 2.556654 -5.03 0.000 -18.39378 -7.34715
---------------------------------------------------------------
STATA CODE:
regress y x1 x2 x3 x4 x5 x6
Variable Selection
Forward selection,
Backward elimination,
Stepwise regression.
Multicollinearity
When there is a perfect linear relationship among the predictors, the estimates cannot be uniquely computed.
The term collinearity implies that two variables are near-perfect linear combinations of one another.
When this happens, the regression estimates of the coefficients become unstable.
The standard errors for the coefficients can get wildly inflated.
We can use the vif command after the regression to check for multicollinearity.
As a rule of thumb, a variable whose VIF value is greater than 10 may need further investigation.
Tolerance, defined as 1/VIF, is used by many researchers to check on the degree of collinearity.
\[ \mathrm{VIF}(x_k) = \frac{1}{1 - R_k^2}. \]
$\mathrm{VIF}(x_k)$ is the variance inflation factor for explanatory variable $x_k$.
$R_k^2$ is the square of the multiple correlation coefficient obtained from regressing $x_k$ on the remaining explanatory variables.
    Variable |       VIF       1/VIF
-------------+----------------------
          x2 |      8.42    0.118807
          x3 |      5.33    0.187661
          x5 |      4.41    0.226574
          x6 |      1.83    0.545005
          x1 |      1.76    0.567277
          x4 |      1.24    0.808205
-------------+----------------------
    Mean VIF |      3.83
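The link between $R_k^2$, VIF and tolerance can be made concrete with a few lines of Python (not Stata). The sketch below takes the tolerance reported for x2 in the table, backs out the implied $R_k^2$, and recomputes the VIF; the function names are illustrative.

```python
def vif(r2_k):
    """Variance inflation factor from the R^2 of x_k regressed on the other predictors."""
    return 1.0 / (1.0 - r2_k)


def tolerance(r2_k):
    """Tolerance is the reciprocal of the VIF, that is 1 - R^2_k."""
    return 1.0 - r2_k


# R^2_k for x2 implied by its tolerance (1/VIF = 0.118807) in the table
r2_x2 = 1 - 0.118807
print(round(r2_x2, 3), round(vif(r2_x2), 2))
```

So the VIF of 8.42 for x2 means that about 88% of the variation in x2 is explained by the other five predictors. A VIF of 10 corresponds to $R_k^2 = 0.9$ and a VIF of 5 to $R_k^2 = 0.8$.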
STATA CODE:
regress y x1 x2 x3 x4 x5 x6
vif
There are different suggestions for the cut-off point.
Some say 5 and some say 10. If we use 10, multicollinearity does not seem to be a major problem.
If we use 5, there are two variables (x2 and x3) that need a decision.
One way of dealing with multicollinearity is to remove some of the offending predictors from the model.
Jimma Infant Data
We want to study the effect of $X_1$, $X_2$, $X_3$ and $X_4$ on birth weight ($Y$).
Predictors: $X_1$ = sex (female = 0, male = 1); $X_2$ = place (semi-urban); $X_3$ = place (rural); $X_4$ = age of mother.
Originally, place was coded as urban = 1, semi-urban = 2 and rural = 3. By default, Stata takes the first category (urban = 1) as the reference.
For this purpose, consider multiple linear regression.
$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$
Response: $Y$ = birth weight;
Predictors: $X_1$ = sex (female = 0, male = 1); $X_2$ = place (semi-urban); $X_3$ = place (rural); $X_4$ = age of mother (years).
Source | SS df MS Number of obs = 7873
---------------------------------------- F( 4, 7868) = 169.34
Model | 166126585 4 41531646.3 Prob > F = 0.0000
Residual | 1.9296e+09 7868 245249.036 R-squared = 0.0793
---------------------------------------- Adj R-squared = 0.0788
Total | 2.0957e+09 7872 266227.896 Root MSE = 495.23
-------------------------------------------------------------------
weight | Coef. Std. Err. t P>|t| [95% Conf.Interval]
-------------------------------------------------------------------
1.sex | 109.2883 11.16942 9.78 0.000 87.39329 131.1833
|
place |
2 | -44.07271 16.25057 -2.71 0.007 -75.92814 -12.21729
3 | -279.7563 13.86682 -20.17 0.000 -306.939 -252.5737
|
momage | 7.809375 .8910886 8.76 0.000 6.062605 9.556145
_cons | 3010.102 26.26505 114.60 0.000 2958.615 3061.588
-------------------------------------------------------------------
$\hat{\alpha} = 3010.102$, $\hat{\beta}_1 = 109.2883$, $\hat{\beta}_2 = -44.07271$, $\hat{\beta}_3 = -279.7563$, $\hat{\beta}_4 = 7.809375$.
Interpretation?
STATA CODE:
. regress weight i.sex i.place momage
. estat vif
Variable | VIF 1/VIF
-------------+----------------------
1.sex | 1.00 0.999139
place |
2 | 1.53 0.655694
3 | 1.54 0.650037
momage | 1.01 0.988233
-------------+----------------------
Mean VIF | 1.27
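One way to read the fitted coefficients is to plug in covariate values and compare predictions. The Python sketch below copies the estimates from the regression output for the Jimma infant data; the function name `predicted_weight` and the example mother's age of 25 are illustrative choices, not part of the original analysis.

```python
# Coefficients copied from the regression output for the Jimma infant data
COEF = {
    "_cons": 3010.102,
    "male": 109.2883,
    "semi-urban": -44.07271,
    "rural": -279.7563,
    "momage": 7.809375,
}


def predicted_weight(male, place, momage):
    """Predicted birth weight; place is 'urban' (reference), 'semi-urban' or 'rural'."""
    if place not in ("urban", "semi-urban", "rural"):
        raise ValueError("unknown place: " + place)
    value = COEF["_cons"] + COEF["male"] * male + COEF["momage"] * momage
    if place != "urban":
        # Urban is the reference category, so it adds nothing
        value += COEF[place]
    return value


girl_urban = predicted_weight(0, "urban", 25)
boy_rural = predicted_weight(1, "rural", 25)
print(round(girl_urban, 1), round(boy_rural, 1))
```

Holding the other predictors fixed, each coefficient is the expected difference in birth weight: for example, switching from urban to rural changes the prediction by -279.76, and each extra year of mother's age adds about 7.81.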
Multivariate methods
Multivariate analysis is the statistical study of the dependence (covariance) between different variables.
Multivariate analysis includes many statistical methods that are designed to allow you to include multiple variables and examine the contribution of each.
There are different methods:
Principal component analysis
Factor analysis
Cluster analysis
Discriminant analysis
Structural equation modeling
Principal component analysis
PCA is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
PCA is a mathematical procedure that transforms a number of correlated variables into a (smaller) number of uncorrelated variables called principal components.
An eigenvalue conceptually represents the amount of variance accounted for by a factor.
When there is correlation (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions.
Objectives of PCA
Its general purpose is to summarize the information contained in a
large number of variables into a smaller number of factors
Reading Assignment
Form five groups
Assign numbers from 1 to 5
Let each group pick one number at random
Then give the topic corresponding to the number
1 Principal component analysis
2 Factor analysis
3 Cluster analysis
4 Discriminant analysis
5 Structural equation modeling
Assignment 2
Consider the infant growth data. The outcome variable is the birth height of the newborn; the others are independent variables.
Determine the correlation between mother’s age and the height of the newborn.
Test whether the correlation is significant.
Fit a linear regression for birth weight with sex, residence, family income and mother’s age.
Test the assumptions.
Interpret the result.
Chapter 3
Categorical Data Analysis
Categorical Data
Variables that are measured on a nominal or an ordinal scale.
A variable with only two levels is called dichotomous (e.g. sex).
A variable with more than two levels is called polytomous (e.g. blood group).
Continuous (numeric) variables can be condensed to categorical variables if recoded:
BMI (underweight, normal, and overweight)
Income (very poor, poor, rich, very rich)
Age (< 15 years, 16-24 years, 25-34 years, and ≥ 35 years)
Contingency Tables
When working with categorical variables, we often arrange the counts in a tabular format called a contingency table.
If a contingency table involves two dichotomous variables, then it is a 2x2 (two-way) table.
                 Disease Status
Exposure       Positive   Negative   Total
Exposed        a          b          a+b
Not Exposed    c          d          c+d
Total          a+c        b+d        n
Contingency Tables
It can be generalized to an r x c contingency table (r rows and c columns).
                      Column Variable
Row Variable   1      2      . . .   c      Total
1              R1C1   R1C2   . . .   R1Cc   R1
2              R2C1   R2C2   . . .   R2Cc   R2
3              R3C1   R3C2   . . .   R3Cc   R3
.              .      .      . . .   .      .
.              .      .      . . .   .      .
r              RrC1   RrC2   . . .   RrCc   Rr
Total          C1     C2     . . .   Cc     N
Steps in Testing Association
The chi-square test statistic is used to test the association between two categorical variables.
Hypothesis
1 The hypothesis to be tested can be stated as:
$H_0$: There is no association
$H_1$: There is an association
2 Compute the calculated and tabulated values of the test statistic
3 Compare the two (tabulated and calculated values)
4 Decision
If the contingency table is 2x2, then
\[ \chi^2 = \frac{n(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)} \]
If the contingency table is r x c
The formula for the calculated value of the test statistic is different when the contingency table is r x c:
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
Where
$O_{ij}$ is the observed value
$E_{ij}$ is the expected value
\[ E_{ij} = \frac{R_i \times C_j}{N} \]
The chi-squared test measures the disparity between the observed frequencies (data from the sample) and the expected frequencies (probability distribution).
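The r x c formula translates directly into a double loop over the cells. The Python sketch below is a minimal illustration on an invented 2 x 3 table; the function name `chi_square` is hypothetical, and the degrees of freedom (r - 1)(c - 1) are the standard value, which the slides do not state explicitly.

```python
def chi_square(table):
    """Pearson chi-square for an r x c table of counts; returns (statistic, df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under no association: row total x column total / N
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical 2 x 3 table of counts
stat, df = chi_square([[10, 20, 30], [20, 20, 20]])
print(round(stat, 3), df)
```

Here the expected counts are 15, 20 and 25 in each row, so the statistic is 25/15 + 0 + 25/25 for each row, about 5.33 in total on 2 degrees of freedom.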
The Chi-squared test is valid
If no observed cell has the value 0
If the values are counts, not percentages or proportions
And if no more than 20% of the expected cell counts are less than 5
Example
A cross-sectional survey was conducted to assess the association between head injury and wearing a helmet. A total of 793 individuals were included in the study and the data are given below:
                 Wearing Helmet
Head injury    Yes    No     Total
Yes            17     218    235
No             130    428    558
Total          147    646    793
Test for the presence of an association between head injury and wearing a helmet at the 0.05 level of significance.
Solution
Hypothesis
$H_0$: There is no association between helmet use and head injury
$H_A$: There is an association between helmet use and head injury
Test statistic
\[ \chi^2 = \frac{n(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)} = \frac{793\,(17 \times 428 - 218 \times 130)^2}{147 \times 646 \times 235 \times 558} = 28.26 \]
The critical value of the test statistic:
$\chi^2_{1,\,0.05} = 3.84$
Since 28.26 > 3.84, reject the null hypothesis.
Example 2
A sample of 263 students who bought lunch at a school canteen were asked whether or not they developed gastroenteritis. The responses are given below:
                  Gastroenteritis
Ate Sandwich    Yes    No     Total
Yes             109    116    225
No              4      34     38
Total           113    150    263
Test for the presence of an association between eating the sandwich and gastroenteritis at the 0.05 level of significance.
Solution
Step 1. $H_0$: There is no association
Step 2. $H_A$: There is an association
Step 3. Test statistic: $\chi^2 = 17.6$ (computed with the continuity correction; the uncorrected 2x2 formula gives 19.07)
Step 4. Critical value: $\chi^2_{1}(0.05) = 3.84$
Step 5. Decision: since 17.6 > 3.84, reject the null hypothesis and conclude that there is an association between eating the sandwich and gastroenteritis.
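Both worked examples can be reproduced with the 2x2 shortcut formula. The Python sketch below also includes Yates' continuity correction as an option, because the value 17.6 quoted for the sandwich example matches the corrected statistic rather than the uncorrected one; the function name and the `continuity` flag are illustrative.

```python
def chi_square_2x2(a, b, c, d, continuity=False):
    """Chi-square for a 2x2 table with cells a, b (first row) and c, d (second row)."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if continuity:
        # Yates' correction: shrink |ad - bc| by n/2 before squaring
        diff = max(diff - n / 2, 0)
    return n * diff ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


helmet = chi_square_2x2(17, 218, 130, 428)
sandwich = chi_square_2x2(109, 116, 4, 34)
sandwich_corrected = chi_square_2x2(109, 116, 4, 34, continuity=True)
print(round(helmet, 2), round(sandwich, 2), round(sandwich_corrected, 2))
```

The helmet table gives 28.26, as in the first solution. The sandwich table gives 19.07 without the correction and 17.56 with it; either way the statistic exceeds 3.84 and the conclusion is unchanged.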
Probability Distribution
Binomial Distribution
Under the assumption of $n$ independent, identical trials, $Y$ has the binomial distribution with index $n$ and parameter $\pi$.
The probability of outcome $y$ for $Y$ equals
\[ P(Y = y) = \frac{n!}{y!\,(n - y)!}\, \pi^{y} (1 - \pi)^{n - y} \]
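As a quick numerical illustration of the binomial formula, the Python sketch below evaluates it for an invented case (2 successes in 5 trials with success probability 0.5) and confirms that the probabilities over all possible outcomes add up to 1. The function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(y, n, pi):
    """P(Y = y) for a binomial distribution with index n and parameter pi."""
    return comb(n, y) * pi ** y * (1 - pi) ** (n - y)


# Hypothetical case: 2 successes in 5 trials with pi = 0.5
prob = binomial_pmf(2, 5, 0.5)
total = sum(binomial_pmf(y, 5, 0.5) for y in range(6))
print(prob, total)
```

Here 5!/(2! 3!) = 10 and 0.5^5 = 1/32, so the probability is 10/32 = 0.3125.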
Maximum Likelihood Estimation
In statistics, maximum-likelihood estimation (MLE) is a method of
estimating the parameters of a statistical model given the data
MLE begins with writing a mathematical expression known as the
Likelihood Function of the sample data
The values of these parameters that maximize the sample likelihood
are known as the Maximum Likelihood Estimates
Suppose we have a random sample $X_1, X_2, \ldots, X_n$ where:
$X_i = 0$ if a randomly selected subject does not experience the event, and
$X_i = 1$ if a randomly selected subject does experience the event.
Assuming that the $X_i$ are independent Bernoulli random variables with unknown parameter $p$, find the maximum likelihood estimator of $p$, the proportion of subjects who experience the event.
Solution
If the $X_i$ are independent Bernoulli random variables with unknown parameter $p$, then the probability mass function of each $X_i$ is:
\[ f(x_i; p) = p^{x_i} (1 - p)^{1 - x_i} \]
for $x_i = 0$ or $1$ and $0 < p < 1$.
Therefore, the likelihood function $L(p)$ is, by definition:
\[ L(p) = \prod_{i=1}^{n} f(x_i; p) = p^{x_1} (1 - p)^{1 - x_1} \times \cdots \times p^{x_n} (1 - p)^{1 - x_n} \]
Simplifying, by summing up the exponents, we get:
\[ L(p) = p^{\sum x_i} (1 - p)^{n - \sum x_i} \]
Taking the derivative of the log-likelihood with respect to $p$ and setting it to zero gives
\[ \hat{p} = \frac{\sum x_i}{n} \]
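The log-transformation step that the solution compresses into one line can be written out as follows. This is the standard derivation, added as a worked sketch; the symbol $\ell(p)$ for the log-likelihood is notation introduced here and does not appear in the original slides.

```latex
\begin{align*}
\ell(p) &= \ln L(p)
         = \Big(\sum_{i=1}^{n} x_i\Big)\ln p
         + \Big(n - \sum_{i=1}^{n} x_i\Big)\ln(1 - p), \\
\frac{d\ell}{dp} &= \frac{\sum_{i=1}^{n} x_i}{p}
         - \frac{n - \sum_{i=1}^{n} x_i}{1 - p} = 0
\;\Longrightarrow\;
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} x_i .
\end{align*}
```

In words, the maximum likelihood estimate of $p$ is the observed proportion of subjects who experienced the event.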
Inference for proportion
Point estimate: the sample proportion $p$, with $E(p) = \pi$
Confidence interval for a proportion:
\[ \big( p - z_{\alpha/2}\, SE(p),\; p + z_{\alpha/2}\, SE(p) \big) \]
Where
\[ SE(p) = \sqrt{\frac{p(1 - p)}{n}} \]
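The interval above is straightforward to compute. The Python sketch below is a minimal illustration for an invented sample of 40 events among 100 subjects; the function name `proportion_ci` is hypothetical, and the normal quantile comes from the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(events, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = events / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


# Hypothetical sample: 40 events among 100 subjects
lower, upper = proportion_ci(40, 100)
print(round(lower, 3), round(upper, 3))
```

With p = 0.4 and SE = 0.049, the 95% interval runs from about 0.304 to 0.496.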
Example: Logistic regression
Let $p_j$ be the probability that a subject in stratum $j$ has a success.
Then the probability of observing $y_j$ events in stratum $j$ is
\[ P(Y_j = y_j) = \binom{n_j}{y_j}\, p_j^{y_j} (1 - p_j)^{n_j - y_j} \]
We know $p_j$ is a function of covariates.
Assume we are interested in a covariate $x_j$ such that
\[ p_j = \frac{e^{\alpha_0 + \beta x_j}}{1 + e^{\alpha_0 + \beta x_j}} \]
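Example: Logistic regression
Then, the likelihood function can be written as
\[
\begin{aligned}
L(y) &= \prod_{j=1}^{J} \binom{n_j}{y_j}\, p_j^{y_j} (1 - p_j)^{n_j - y_j} \\
     &= \prod_{j=1}^{J} \left( \frac{e^{\alpha_0 + \beta x_j}}{1 + e^{\alpha_0 + \beta x_j}} \right)^{y_j} \left( \frac{1}{1 + e^{\alpha_0 + \beta x_j}} \right)^{n_j - y_j} \times \binom{n_j}{y_j} \\
     &= \prod_{j=1}^{J} \frac{(e^{\alpha_0 + \beta x_j})^{y_j}}{(1 + e^{\alpha_0 + \beta x_j})^{n_j}} \times \binom{n_j}{y_j} \\
     &= \frac{\prod_{j=1}^{J} (e^{\alpha_0 + \beta x_j})^{y_j}}{\prod_{j=1}^{J} (1 + e^{\alpha_0 + \beta x_j})^{n_j}} \times \prod_{j=1}^{J} \binom{n_j}{y_j}
\end{aligned}
\]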
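To see the likelihood at work, the Python sketch below evaluates its logarithm for grouped data at two candidate parameter values. The three strata are invented for illustration, as are the function names; in practice Stata maximizes this quantity numerically rather than comparing hand-picked values.

```python
from math import comb, exp, log


def logistic_p(alpha0, beta, x):
    """Success probability p_j under the logistic model."""
    return exp(alpha0 + beta * x) / (1 + exp(alpha0 + beta * x))


def log_likelihood(alpha0, beta, strata):
    """Log of the grouped-data likelihood; strata is a list of (n_j, y_j, x_j)."""
    total = 0.0
    for n_j, y_j, x_j in strata:
        p_j = logistic_p(alpha0, beta, x_j)
        total += log(comb(n_j, y_j)) + y_j * log(p_j) + (n_j - y_j) * log(1 - p_j)
    return total


# Hypothetical grouped data: (n_j, y_j, x_j)
strata = [(10, 2, 0), (10, 5, 1), (10, 8, 2)]
ll_flat = log_likelihood(0.0, 0.0, strata)    # p_j = 0.5 in every stratum
ll_trend = log_likelihood(-1.4, 1.4, strata)  # p_j rises with x_j
print(round(ll_flat, 3), round(ll_trend, 3))
```

The second parameter pair gives fitted probabilities close to the observed proportions (0.2, 0.5, 0.8), so its log-likelihood is higher; maximum likelihood estimation searches for the pair that makes this value as large as possible.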
Contingency Tables 2
Partial tables: two-way cross-sectional slices of the three-way table, in which $X$ and $Y$ are cross-classified at separate levels of the control variable $Z$.
A partial table shows the effect of $X$ on $Y$ while controlling for $Z$.
Marginal table: a two-way contingency table obtained by combining the partial tables.
The marginal table ignores $Z$ rather than controlling for it.
Example
Prevalence study: a total of 715 births, cross-classified by clinic ($Z$), prenatal care ($X$) and outcome ($Y$).
Clinic   Prenatal Care   Died   Lived
1        less            3      176
         more            4      293
2        less            17     197
         more            2      23
Construct the corresponding partial tables and the marginal table.
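One possible way to carry out the construction is sketched below in Python, using the counts from the example: the partial tables are the two clinic-specific tables, and the marginal table is obtained by adding them cell by cell. The dictionary layout and the helper `death_proportion` are illustrative choices.

```python
# Partial tables from the example: {clinic: {care: (died, lived)}}
partial = {
    1: {"less": (3, 176), "more": (4, 293)},
    2: {"less": (17, 197), "more": (2, 23)},
}

# Marginal table: add the partial tables cell by cell, ignoring clinic
marginal = {
    care: tuple(sum(cells) for cells in zip(*(partial[z][care] for z in partial)))
    for care in ("less", "more")
}


def death_proportion(cells):
    """Proportion of births that died, given a (died, lived) pair."""
    died, lived = cells
    return died / (died + lived)


for clinic, table in partial.items():
    print(clinic, {care: round(death_proportion(c), 3) for care, c in table.items()})
print("marginal", {care: round(death_proportion(c), 3) for care, c in marginal.items()})
```

Within each clinic the death proportions for less and more care are close (about 1.7% versus 1.3% in clinic 1, and 7.9% versus 8.0% in clinic 2), whereas the marginal table shows about 5.1% versus 1.9%. The partial and marginal associations therefore differ.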
Confounding
Confounding occurs when a factor is associated with both the exposure and the outcome but does not lie on the causal pathway.
Figure 11: Confounding variable
When the partial and marginal associations are different, there is said to be CONFOUNDING.
Interaction Effect
An interaction effect occurs when one explanatory variable interacts with another explanatory variable in its effect on a response variable.
This is opposed to the “main effect”, which is the action of a single independent variable on the dependent variable.
Two independent variables interact if the effect of one of the variables differs depending on the level of the other variable.
That is, it occurs when the effect of one variable depends on the value of another variable.
Joint, Marginal, and Conditional Probabilities
Probabilities for contingency tables can be of three types:
joint,
marginal, or
conditional
Consider the variables $X$ and $Y$ from randomly chosen subjects.
Let $\pi_{ij} = P(X = i, Y = j)$ denote the probability that $(X, Y)$ falls in the cell in row $i$ and column $j$.
The probabilities $\pi_{ij}$ form the joint distribution of $X$ and $Y$.
Marginal Probabilities
The marginal distributions are the row and column totals of the joint probabilities.
Variable 1     Diseased   Non-Diseased   Total
Exposed        n11        n12            n1+
Not Exposed    n21        n22            n2+
Total          n+1        n+2            n++
$\pi_{1+} = \pi_{11} + \pi_{12}$
The joint probabilities can be calculated as $p_{ij} = n_{ij}/n$
Conditional Probabilities
It is informative to construct a separate probability distribution for $Y$ and $X$ at each level of $Z$.
Consider the variables sex, smoking and lung cancer:
Sex      Smoking   Diseased   Non-Diseased   Total
Male     Yes       n11        n12            n1+
         No        n21        n22            n2+
Female   Yes       n31        n32            n3+
         No        n41        n42            n4+
Total              n+1        n+2            n++
Simpson’s Paradox: the result that a marginal association can have a different direction from the conditional associations.
Application for screening test
Sensitivity: the proportion of diseased individuals detected as positive by the test.
Specificity: the proportion of healthy individuals detected as negative by the test.
Positive predictivity: given that the test result is positive, what is the probability that the disease is in fact present?
Negative predictivity: given that the test result is negative, what is the probability that the disease is in fact not present?
Example
Cytological test to screen women for cervical cancer
                  Test result
Disease     Negative   Positive   Total
Negative    23362      362        23724
Positive    225        154        379
Total       23587      516        24103
Example
Sensitivity = 154/379 = 40.6%
Specificity = 23362/23724 = 98.5%
Positive predictivity = 154/516 = 29.8%
Negative predictivity = 23362/23587 = 99.0%
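The four measures follow directly from the cells of the table. The Python sketch below recomputes them from the cytology counts; the variable names (tp, fp, fn, tn for true and false positives and negatives) are conventional labels introduced here, not terms used in the slides.

```python
# Counts from the cytology example: rows are disease status, columns are test result
tn, fp = 23362, 362  # disease negative: test negative, test positive
fn, tp = 225, 154    # disease positive: test negative, test positive

sensitivity = tp / (tp + fn)           # diseased who test positive
specificity = tn / (tn + fp)           # healthy who test negative
positive_predictivity = tp / (tp + fp)  # test positives who are diseased
negative_predictivity = tn / (tn + fn)  # test negatives who are healthy

for name, value in [
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Positive predictivity", positive_predictivity),
    ("Negative predictivity", negative_predictivity),
]:
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity are computed within rows (disease status), while the predictive values are computed within columns (test result), which is why the same cell count 154 appears over two different denominators.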
Analysis of Categorical Data
Descriptive;
Summary statistics including percentages, prevalence,
Graphical techniques: bar chart, pie-chart.
Inferential Statistics
Risk ratios
Relative risk
Odds ratios,
Jimma Infant Data...
birth |
weight | Freq. Percent Cum.
------------+-----------------------------------
0 | 7,207 91.54 91.54
1 | 666 8.46 100.00
------------+-----------------------------------
Total | 7,873 100.00
STATA CODE:
tab bwt
Cross tabulation of bwt and gender (male = 1).
The option “row” can be used to obtain the row percentages.
birth | sex
weight | 0 1 | Total
-----------+----------------------+----------
0 | 3,466 3,741 | 7,207
| 48.09 51.91 | 100.00
-----------+----------------------+----------
1 | 395 271 | 666
| 59.31 40.69 | 100.00
-----------+----------------------+----------
Total | 3,861 4,012 | 7,873
| 49.04 50.96 | 100.00
STATA CODE:
tab bwt sex, row
Consider the following contingency table:
                      Diseased
                 Yes      No    |  Total
----------------------------------------
          Yes |   a       b     |  a+b
Exposed   ------------------------------
          No  |   c       d     |  c+d
----------------------------------------
Total         |  a+c     b+d    |  a+b+c+d
Odds
The odds of being diseased among the exposed is:
\[ \frac{a}{b}. \]
The odds of being diseased among the unexposed is:
\[ \frac{c}{d}. \]
Odds Ratio
The ratio of the odds of being diseased among the exposed to that among the unexposed is:
\[ \frac{a}{b} \div \frac{c}{d} = \frac{ad}{bc}. \]
The odds ratio compares the odds of being diseased in the exposed versus the unexposed.
\[ \mathrm{Var}(\ln \widehat{OR}) \approx \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}. \]
Confidence interval for $\ln(OR)$: $\ln(\widehat{OR}) \pm Z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\ln \widehat{OR})}$.
Confidence interval for $OR$: $\exp\big( \ln(\widehat{OR}) \pm Z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\ln \widehat{OR})} \big)$.
Jimma Infant Data:
Outcome: bwt (‘low bwt’ versus ‘normal’).
Exposure: gender (‘female’ versus ‘male’).
Proportion
| Exposed Unexposed | Total Exposed
-----------------+------------------------+------------------------
Cases | 395 271 | 666 0.5931
Controls | 3466 3741 | 7207 0.4809
-----------------+------------------------+------------------------
Total | 3861 4012 | 7873 0.4904
| |
| Point estimate | [95% Conf. Interval]
|------------------------+------------------------
Odds ratio | 1.573211 | 1.334496 1.854786
+-------------------------------------------------
STATA CODE:
cci 395 271 3466 3741
Measures of Association
How do we determine whether a certain disease is associated with a
certain exposure?
If an association between disease and exposure exists,
How strong is it?
How do we know whether the association is just by chance or due to
other factors?
Another commonly used measure of association is the odds ratio.
The odds in favor of an event happening (such as getting a disease) is
the probability of the event happening relative to the probability of
the event not happening.
In case-control studies, we know the proportion of cases who were exposed and the proportion of controls who were exposed from the 2x2 table.
Odds Ratio
Consider the 2x2 contingency table:
                 Disease Status
Exposure       Positive   Negative   Total
Exposed        a          b          a+b
Not Exposed    c          d          c+d
Total          a+c        b+d        n
\[ OR = \frac{ad}{bc} \]
For example, suppose the probability ($p$) of a certain team winning a football game is 60%; then the probability of not winning ($1 - p$) is 40%, so the odds of the team winning are $p/(1 - p) = 0.6/0.4 = 1.5 : 1$.
Odds Ratio
Odds Ratio (OR) is a good estimate of the relative risk when the
disease is rare
If OR = 1: no association between disease and exposure
If OR > 1: positive association between disease and exposure
If OR < 1: negative association between disease and exposure (exposure may be protective against the disease)
Example 3
A case-control study was conducted on 200 cases of heart disease and 400 controls to determine whether heart disease is associated with smoking status. The results are shown in the table below.
                  CHD
Smoking    Positive   Negative   Total
Yes        112        176        288
No         88         224        312
Total      200        400        600
\[ OR = \frac{112 \times 224}{176 \times 88} = 1.62 \]
Confidence interval for OR
Since the sampling distribution of the OR is often skewed, we work
with logarithms to obtain confidence intervals.
$SE(\ln OR) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$
The 95% CI for $\ln OR$ is defined as
$\ln OR \pm z_{\alpha/2} \times SE(\ln OR)$
The 95% CI for OR is defined as
$\exp\left[\ln OR \pm z_{\alpha/2} \times SE(\ln OR)\right]$
The 95% CI for OR = 1.62 is (1.13; 2.31)
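As a rough check of the formulas above, the following Python sketch recomputes the odds ratio and a log-based 95% interval from the Example 3 counts. It assumes the standard error takes the square-root form of the summed reciprocal cell counts and uses z = 1.96. Under these assumptions the interval comes out at roughly (1.15, 2.28), slightly narrower than the (1.13; 2.31) quoted on the slide; the slide's interval may have been obtained with a different method (for example an exact interval), which is an assumption and not something the slides state.

```python
import math

# Cell counts from the Example 3 table (a, b = smokers; c, d = non-smokers).
a, b, c, d = 112, 176, 88, 224

odds_ratio = (a * d) / (b * c)
log_or = math.log(odds_ratio)

# Standard error of ln(OR): square root of the summed reciprocal cell counts.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

z = 1.96  # standard normal quantile for a two-sided 95% interval
ci_low = math.exp(log_or - z * se_log_or)
ci_high = math.exp(log_or + z * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```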
Interpretation
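The odds of smoking were 1.62 times as high among cases as among
controls; equivalently, the odds of CHD were 1.62 times as high among
smokers as among non-smokers.
We are 95% confident that the true odds ratio lies between 1.13 and
2.31.
The 95% CI does not include 1; we therefore conclude that the odds of
smoking were significantly higher among cases than among controls.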
Introduction to Generalized Linear Models (GLM)
GLMs are often employed for modeling non-Gaussian data
They extend ordinary regression models to non-Gaussian data
They are statistical models that relate outcome variables such as
binary outcomes, counts, rates, etc. to a linear combination of
predictor variables
They use different link functions to connect the outcome variable to
the predictor variables
Components of GLM
Three components specify a GLM:
The random component identifies a vector of observations of Y and its
probability distribution;
The systematic component is a specification for the predictor variables;
And the link function specifies the function of E(Y) that the model
equates to the systematic component
Modeling the outcome variable
The outcome variable is binary.
The objective is to study the effect of explanatory variables on it.
It has two possible categorical outcomes
(1 = success; 0 = failure).
The events are independent from subject to subject.
Explanatory variables can be categorical and/or continuous.
We need a way to estimate the probability p of the success event of
the outcome variable.
Measuring the Probability of an outcome
The probability of the outcome is expressed through the odds of
occurrence of an event.
If p is the probability of an event, then (1 − p) is the probability of
it not occurring.
Odds of success: $\frac{p}{1-p}$.
The effect of an explanatory variable, say x, on the probability of the
disease can be put on the odds as:
$\text{Odds} = \frac{p}{1-p} = \exp(\alpha + \beta x)$
Logistic Regression
Taking the natural logarithm on both sides:
$\ln\left(\frac{p}{1-p}\right) = \ln[\exp(\alpha + \beta_1 x_1)]$.
$\Rightarrow \text{logit}(p) = \alpha + \beta_1 x_1$.
Consider an outcome as 'diseased' or 'not diseased'.
$\alpha$ represents the baseline log odds of disease, i.e. when $x_1 = 0$.
$\beta_1$ represents the change in the log odds of disease for a
unit change in $x_1$.
Multiple Logistic Regression Model Cont'd
The joint effect of all explanatory variables, say $x_1, x_2, \ldots, x_p$,
put together on the odds is:
$\text{Odds} = \frac{p}{1-p} = \exp(\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)$.
Taking the natural logarithm on both sides:
$\ln\left(\frac{p}{1-p}\right) = \ln[\exp(\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)]$.
$\Rightarrow \text{logit}(p) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$.
Consider an outcome as 'diseased' or 'not diseased'.
$\alpha$ represents the baseline log odds of disease, i.e. when all $x$ are 0.
$\beta_1$ represents the change in the log odds of disease for a
unit change in $x_1$, holding the other variables fixed.
Multiple Logistic Regression Model Cont'd
$\beta_2$ represents the change in the log odds of disease for a
unit change in $x_2$, and so on.
What changes is the log odds. The odds themselves are multiplied by
$\exp(\beta)$.
With only one predictor variable:
$\text{logit}(p) = \alpha + \beta x$.
Jimma Infant Data
Outcome: birth weight status (bwt, 1 = 'low').
Factor: sex (1 = 'male').
Logistic regression Number of obs = 7873
LR chi2(1) = 30.82
Prob > chi2 = 0.0000
Log likelihood = -2266.5465 Pseudo R2 = 0.0068
----------------------------------------------------------------
bwt | Coef. Std. Err. z P>|z| [95% Conf.Interval]
-------------+--------------------------------------------------
1.sex | -.4531187 .0823256 -5.50 0.000 -.6144739 -.2917634
_cons | -2.171871 .0531052 -40.90 0.000 -2.275955 -2.067786
----------------------------------------------------------------
The fitted logistic regression model is:
$\text{logit}(p) = \hat{\alpha} + \hat{\beta}\,\text{sex}$.
Substituting the estimated values yields:
$\text{logit}(p) = -2.172 - 0.453\,\text{sex}$.
STATA CODE:
logit bwt i.sex
'i.' indicates that sex is a categorical variable.
The option 'or' can be added to obtain the odds ratio estimates.
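To make the link between the fitted logit and the quantities of interest concrete, the Python sketch below converts the two coefficients from the Stata output into an odds ratio and into predicted probabilities of low birth weight. The helper name `inv_logit` is introduced here for illustration only; the coefficients are copied from the output, and females (sex = 0) are treated as the baseline, as the coding on the slide implies.

```python
import math

# Coefficients copied from the Stata output for `logit bwt i.sex`.
alpha_hat = -2.171871   # _cons: log odds of low birth weight when sex = 0
beta_hat = -0.4531187   # 1.sex: change in log odds for males (sex = 1)

def inv_logit(x):
    """Map a log odds value back to a probability."""
    return 1 / (1 + math.exp(-x))

odds_ratio_male = math.exp(beta_hat)       # odds ratio, males vs females
p_female = inv_logit(alpha_hat)            # predicted probability, sex = 0
p_male = inv_logit(alpha_hat + beta_hat)   # predicted probability, sex = 1

print(f"OR = {odds_ratio_male:.4f}, p(female) = {p_female:.4f}, p(male) = {p_male:.4f}")
```

The exponentiated coefficient is what Stata reports directly when the `or` option is added.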
Logistic regression Number of obs = 7873
LR chi2(1) = 30.82
Prob > chi2 = 0.0000
Log likelihood = -2266.5465 Pseudo R2 = 0.0068
--------------------------------------------------------------
bwt | Odds Ratio Std. Err. z P>|z| [95% Conf.Interval]
--------------------------------------------------------------
1.sex | .6356427 .0523297 -5.50 0.000 .5409254 .7469452
_cons | .1139642 .0060521 -40.90 0.000 .1026988 .1264654
--------------------------------------------------------------
STATA CODE:
recode sex (1=0) (0=1), generate(sex2)
The estimate stands for males, i.e., sex = 1 and females are
considered as a reference group.
Interpretation of $\exp(\hat{\beta})$?
Interpretation of $\exp(\hat{\alpha})$?
Conclusion? Why?
Take males as the reference group.
The estimate now stands for females.
This requires recoding sex into a new variable, say sex2, and fitting
the model again using the following code:
recode sex (1=0) (0=1), generate(sex2)
logit bwt i.sex2, or
Logistic regression Number of obs = 7873
LR chi2(1) = 30.82
Prob > chi2 = 0.0000
Log likelihood = -2266.5465 Pseudo R2 = 0.0068
----------------------------------------------------------------
bwt | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+--------------------------------------------------
1.sex2 | 1.573211 .1295156 5.50 0.000 1.338786 1.848684
_cons | .0724405 .004557 -41.73 0.000 .0640375 .0819461
----------------------------------------------------------------
Compare $\exp(\hat{\beta})$ and $\exp(\hat{\alpha})$ in the above table with the previous ones.
Interpretation and conclusion?
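One way to approach the comparison is to note that switching the reference group inverts the odds ratio and moves the baseline odds to the other group. The sketch below checks this against the two Stata outputs; the variable names are chosen here for illustration.

```python
# Estimates copied from the output with females as the reference group.
or_male_vs_female = 0.6356427      # 1.sex
baseline_odds_female = 0.1139642   # _cons

# With males as the reference group the odds ratio is the reciprocal,
# and the new constant is the odds of low birth weight among males.
or_female_vs_male = 1 / or_male_vs_female
baseline_odds_male = baseline_odds_female * or_male_vs_female

print(f"OR (female vs male) = {or_female_vs_male:.4f}")
print(f"baseline odds (male) = {baseline_odds_male:.4f}")
```

Both values agree with the `1.sex2` and `_cons` rows of the refitted model, so the two parameterizations describe the same fit (note the identical log likelihood).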
Multiple Logistic Regression Model
Logistic regression with more than one predictor variable.
It is also referred to as multivariable logistic regression (often
loosely called multivariate logistic regression).
It estimates the effect of each independent variable while controlling
for the other independent variables.
Jimma Infant Data
Outcome: birth weight status (bwt, 1 = 'low').
Factor1: sex2 (1 = 'female').
Factor2: place of residence (1 = 'urban', 2 = 'semi-urban', 3 = 'rural').
Logistic regression Number of obs = 7873
LR chi2(3) = 258.69
Prob > chi2 = 0.0000
Log likelihood = -2152.613 Pseudo R2 = 0.0567
----------------------------------------------------------------
bwt | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
----------------------------------------------------------------
1.sex2 | 1.54585 .1289271 5.22 0.000 1.31273 1.820368
|
place |
2 | 1.935937 .3516486 3.64 0.000 1.356053 2.763794
3 | 5.436301 .835046 11.02 0.000 4.023039 7.346031
|
_cons | .0210955 .003245 -25.09 0.000 .0156048 .0285184
----------------------------------------------------------------
$\text{logit}(p) = \hat{\alpha} + \hat{\beta}_1\,\text{sex2} + \hat{\beta}_2\,SU + \hat{\beta}_3\,R$.
SU and R refer to semi-urban and rural, respectively.
STATA CODE:
logit bwt i.sex2 i.place, or
Interpretation of $\exp(\hat{\beta}_1) = 1.55$?
Interpretation of $\exp(\hat{\beta}_2) = 1.94$?
Interpretation of $\exp(\hat{\beta}_3) = 5.44$?
Conclusion?
Jimma Infant Data
Add a third variable:
Outcome: birth weight status (bwt, 1 = 'low').
Factor1: sex2 (1 = 'female').
Factor2: place of residence
(1 = 'urban', 2 = 'semi-urban', 3 = 'rural').
Factor3: maternal age, 'momage', measured in years (continuous).
Logistic regression Number of obs = 7873
LR chi2(4) = 271.06
Prob > chi2 = 0.0000
Log likelihood = -2146.4273 Pseudo R2 = 0.0594
----------------------------------------------------------------
bwt | Coef. Std. Err. z P>|z| [95% Conf.Interval]
----------------------------------------------------------------
1.sex2 | .4378871 .0834884 5.24 0.000 .2742529 .6015213
|
place |
2 | .661642 .1817 3.64 0.000 .3055165 1.017767
3 | 1.725587 .1539474 11.21 0.000 1.423855 2.027318
|
momage | -.0231749 .0066597 -3.48 0.001 -.0362277 -.0101222
_cons | -3.274631 .2257413 -14.51 0.000 -3.717076 -2.832186
----------------------------------------------------------------
$\text{logit}(p) = \hat{\alpha} + \hat{\beta}_1\,\text{sex2} + \hat{\beta}_2\,SU + \hat{\beta}_3\,R + \hat{\beta}_4\,A$.
$\text{logit}(p) = -3.275 + 0.438\,\text{sex2} + 0.662\,SU + 1.726\,R - 0.023\,A$.
SU, R and A refer to semi-urban, rural and maternal age, respectively.
Odds ratio estimates:
Logistic regression Number of obs = 7873
LR chi2(4) = 271.06
Prob > chi2 = 0.0000
Log likelihood = -2146.4273 Pseudo R2 = 0.0594
---------------------------------------------------------------
bwt | Odds Ratio Std. Err. z P>|z| [95% Conf.Interval]
---------------------------------------------------------------
1.sex2 | 1.54943 .1293594 5.24 0.000 1.315547 1.824893
|
place |
2 | 1.937972 .3521295 3.64 0.000 1.357326 2.76701
3 | 5.615815 .8645402 11.21 0.000 4.153101 7.593693
|
momage | .9770915 .0065071 -3.48 0.001 .9644206 .9899289
_cons | .0378308 .00854 -14.51 0.000 .0243049 .058884
---------------------------------------------------------------
Interpretation of $\exp(\hat{\beta}_4)$?
Interpretation of, say, $\exp(\hat{\beta}_1)$ with and without maternal age in the
model?
Conclusion?
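A small Python sketch may help with these questions. It converts the maternal age coefficient into an odds ratio per year and per ten years, and compares the sex2 odds ratio before and after maternal age enters the model. All inputs are copied from the Stata outputs above; the ten-year scaling is an illustrative choice, not something the slides compute.

```python
import math

beta_momage = -0.0231749  # momage coefficient from the log odds output

or_per_year = math.exp(beta_momage)
pct_change_per_year = (or_per_year - 1) * 100   # percent change in odds per year
or_per_10_years = math.exp(10 * beta_momage)    # effect of a ten-year difference

# Odds ratio for sex2 without and with maternal age in the model.
or_sex2_without_age = 1.54585
or_sex2_with_age = 1.54943
relative_change_pct = (or_sex2_with_age - or_sex2_without_age) / or_sex2_without_age * 100

print(f"OR per year = {or_per_year:.4f} ({pct_change_per_year:.2f}% per year)")
print(f"OR per 10 years = {or_per_10_years:.3f}")
print(f"Change in sex2 OR after adjustment = {relative_change_pct:.2f}%")
```

The sex2 odds ratio barely moves when maternal age is added, which suggests that maternal age does little to confound the sex effect in these data.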
Model Comparison
Compare the model fit statistics, say the −2 log-likelihood, of the three
models considered so far, i.e.,
Simple logistic regression: only 'sex' as a factor,
Multiple logistic regression: 'sex' and 'place' as two factor variables,
and
Multiple logistic regression: 'sex', 'place' and 'maternal age'.
Which one fits the data better? Why?
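The comparison can be sketched numerically from the log likelihoods printed in the three outputs. The Python code below computes likelihood ratio statistics for the nested models and, as an additional summary that the slides do not use, the AIC. The parameter counts (2, 4 and 5) are read off the coefficient tables, and the chi-square tail probabilities use the closed forms for 1 and 2 degrees of freedom so that no external library is needed.

```python
import math

# (log likelihood, number of estimated parameters) from the three outputs.
models = {
    "sex": (-2266.5465, 2),
    "sex + place": (-2152.613, 4),
    "sex + place + momage": (-2146.4273, 5),
}

def aic(loglik, k):
    """Akaike information criterion; smaller values indicate better fit."""
    return -2 * loglik + 2 * k

def lr_statistic(loglik_small, loglik_large):
    """Likelihood ratio statistic for a nested pair of models."""
    return 2 * (loglik_large - loglik_small)

lr_place = lr_statistic(models["sex"][0], models["sex + place"][0])                    # 2 df
lr_momage = lr_statistic(models["sex + place"][0], models["sex + place + momage"][0])  # 1 df

# Upper tail probabilities of the chi-square distribution.
p_place = math.exp(-lr_place / 2)                  # exact for 2 degrees of freedom
p_momage = math.erfc(math.sqrt(lr_momage / 2))     # exact for 1 degree of freedom

aic_values = {name: aic(ll, k) for name, (ll, k) in models.items()}
best_model = min(aic_values, key=aic_values.get)
print(lr_place, lr_momage, p_momage, best_model)
```

In Stata the same tests are usually obtained by storing each fit with `estimates store` and calling `lrtest`.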
Ordinal Logistic Regression
Many categorical dependent variables are ordered
Level of agreement (strongly disagree, disagree, agree, strongly agree)
Social class
Satisfaction
Pain Score
The numerical values assigned to categories may not accurately reflect
the true distance between them
Thus linear regression may not be appropriate
Strategies to deal with ordered categorical variables
Use linear regression anyway
Commonly done, but can give incorrect results
Possibly check robustness by varying the coding of the intervals
between outcomes
Collapse the variable to a dichotomy and use a binary logistic
regression model
Works fine, but throws away useful information
If you are not confident about the ordering, use multinomial logistic
regression
Use ordered logit / ordered probit
Proportional Odds Assumption
The fact that you can calculate odds ratios highlights a key
assumption of ordered logit
Proportional odds assumption
Also called the parallel regression assumption
The model assumes that variable effects on the odds of lower vs. higher
outcomes are consistent
The effect on the odds of "too little" vs "about right" is the same as
for "about right" vs "too much"
Controlling for all other variables in the model
If this assumption doesn't seem reasonable, consider multinomial logit
Interpretation strategies are similar to logit
Test for Proportional odds
A user-written command, omodel, can be used after downloading (type
search omodel)
The brant command performs a Brant test; it can be obtained by
typing search spost and is run after fitting the model with ologit
The former provides an overall test of proportional odds for all
independent variables, and the latter provides a specific test for each
covariate
A non-significant p-value suggests that the assumption is satisfied
omodel logit dependent independent1 independent2, ...
brant, detail
Multinomial Logistic Regression
What if you have a dependent variable with more than two
outcomes?
A polytomous outcome
Contrast outcomes with a common "reference point"
Similar to conducting a series of 2-outcome logit models comparing
pairs of categories
The "reference category" is like the reference group when using
dummy variables in regression
It serves as the contrast point for all analyses
Imagine a dependent variable with M categories
Conduct binomial logit models for all possible combinations of
outcomes
This will produce results fairly similar to a multinomial output
Multinomial logistic regression
Choose one category as "reference"
Output will include two tables if the variable has three categories
Choice of "reference" category drives interpretation of multinomial
logit results
Choose the contrast(s) that makes most sense
Be aware of the reference category when interpreting results
Example: Mode of travel
There are different modes of transport for vacation travel (car, bus,
and plane). We want to model the probability of the choice of
transport mode
The independent variables are income and family size
If the outcome variable has k categories, then we expect
k − 1 sets of estimates (equations)
In this example, the outcome variable (mode of transport) has 3
categories, so the mlogit command in Stata will give two estimates
for each covariate
Parameter estimation:
mlogit mode income familysize, base(3) rrr
Multinomial logistic regression Number of obs = 152
LR chi2(4) = 42.63
Prob > chi2 = 0.0000
Log likelihood = -138.68742 Pseudo R2 = 0.1332
------------------------------------------------------------------------------
mode | Odds ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
car |
income | .9444061 .0118194 -4.57 0.000 .9215224 .9678581
familysize | .8204706 .1632009 -0.99 0.320 .5555836 1.211648
-------------+----------------------------------------------------------------
Bus |
income | .9743237 .0136232 -1.86 0.063 .9479852 1.001394
familysize | .4185063 .1370806 -2.66 0.008 .2202385 .7952627
------------------------------------------------------------------------------
(mode==plane is the base outcome)
Interpretation
Holding family size constant, a unit increase in income multiplies the
odds of traveling by car rather than plane by 0.944 (a decrease of
5.6%)
Holding income constant, a one-person increase in family size multiplies
the odds of traveling by bus rather than plane by 0.419 (a decrease of
58.1%)
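The percentage changes follow directly from the ratio estimates as (ratio − 1) × 100. The Python sketch below applies this to all four estimates in the mlogit output; the dictionary layout and the helper name `percent_change` are illustrative choices.

```python
# Ratio estimates copied from the mlogit output (base outcome: plane).
rrr = {
    ("car", "income"): 0.9444061,
    ("car", "familysize"): 0.8204706,
    ("bus", "income"): 0.9743237,
    ("bus", "familysize"): 0.4185063,
}

def percent_change(ratio):
    """Percentage change in the odds implied by a ratio estimate."""
    return (ratio - 1) * 100

changes = {key: percent_change(value) for key, value in rrr.items()}
for (mode, covariate), change in changes.items():
    print(f"{mode} vs plane, per unit of {covariate}: {change:+.1f}%")
```

Only the income effect for car and the family size effect for bus are statistically significant at the 5% level in the output, so the other two changes should be read with that in mind.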
Assignment 3
Consider the maternal depression data. The outcome variable is
depression among pregnant mothers.
Explore the data using chi-square and odds ratio
Fit Simple binary logistic regression for each of the independent
variables and compute crude odds ratio
Fit Multiple logistic regression and compute adjusted odds ratio
Compare the crude and adjusted odds ratios and explain the difference
Check the goodness of fit of the model
Interpret all the results
Count Data Analysis
Poisson Regression Model
The Poisson Regression Model
Count data are very common in many applications.
Examples include:
Number of patients visiting a certain hospital per day,
Number of ANC visits per period of pregnancy,
Number of live births in a given district per year, etc.
Count data are commonly analyzed using Poisson regression model.
The Poisson Regression Model Cont'd
The Poisson model assumes that $y_i$, given the vector of covariates $x_i$,
is independently Poisson-distributed with
$f(y) = \frac{e^{-\lambda}\lambda^{y}}{y!}$.
The mean is given by $\mu = E(Y) = \lambda$ and the variance by $\text{Var}(Y) = \lambda$.
Suppose $Y_1, \ldots, Y_N$ is a set of independent count outcomes, and let
$x_1, \ldots, x_N$ represent the corresponding p-dimensional vectors of
covariate values.
The Poisson regression model with $\beta$ a vector of p fixed, unknown
regression coefficients is given by $\log(\lambda_i) = \beta_0 + x_i^{T}\beta$.
In count and binary data, we don't have the error term, why?
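The equality of mean and variance can be verified numerically from the probability function. The sketch below evaluates the Poisson probabilities for a hypothetical rate of 0.46 (a value picked only for illustration) and sums over a range wide enough that the omitted tail is negligible.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

lam = 0.46           # hypothetical rate, chosen only for illustration
support = range(60)  # the probability mass beyond 60 is negligible here

total = sum(poisson_pmf(y, lam) for y in support)
mean = sum(y * poisson_pmf(y, lam) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, lam) for y in support)

print(f"total = {total:.6f}, mean = {mean:.6f}, variance = {variance:.6f}")
```

Because the distribution itself fixes the variance once the mean is known, the randomness is built into the distribution and no separate additive error term appears in the model.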
Jimma Infant Data
Outcome: Total child deaths ('deaths').
Factor1: place of residence
(1 = 'urban', 2 = 'semi-urban', 3 = 'rural').
Factor2: monthly family income, 'faminc', in birr (continuous).
$\log(\lambda) = \beta_0 + \beta_1\,SU + \beta_2\,R + \beta_3\,IN$.
'IN' refers to monthly family income; 'SU' and 'R' are defined as
before.
Poisson regression Number of obs = 7556
LR chi2(3) = 203.15
Prob > chi2 = 0.0000
Log likelihood = -6849.1816 Pseudo R2 = 0.0146
------------------------------------------------------------------------
deaths | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------
place |
2 | .035657 .0560403 0.64 0.525 -.07418 .145494
3 | .2299316 .0500235 4.60 0.000 .1318874 .3279758
|
faminc | -.0013383 .0001642 -8.15 0.000 -.0016601 -.0010164
_cons | -.7445872 .0510623 -14.58 0.000 -.8446676 -.6445069
------------------------------------------------------------------------
$\log(\lambda) = -0.745 + 0.036\,SU + 0.230\,R - 0.001\,IN$.
STATA CODE:
poisson deaths i.place faminc
The option 'irr' displays the incidence rate ratio estimates.
Poisson regression Number of obs = 7556
LR chi2(3) = 203.15
Prob > chi2 = 0.0000
Log likelihood = -6849.1816 Pseudo R2 = 0.0146
------------------------------------------------------------------------
deaths | IRR Std. Err. z P>|z| [95% Conf. Interval]
------------------------------------------------------------------------
place |
2 | 1.0363 .0580746 0.64 0.525 .9285045 1.156611
3 | 1.258514 .0629552 4.60 0.000 1.14098 1.388155
|
faminc | .9986626 .000164 -8.15 0.000 .9983413 .9989841
_cons | .4749303 .024251 -14.58 0.000 .4297002 .5249213
------------------------------------------------------------------------
STATA CODE:
poisson deaths i.place faminc, irr
$\beta_i$ measures the change in the expected log count; $\exp(\beta_i)$ is the
corresponding incidence rate ratio.
$\exp(\beta_2) = \exp(0.230) = 1.26$ implies that the expected count of
child deaths in rural areas is 1.26 times that in urban areas, i.e.
the incidence rate of child death in rural areas is higher by nearly 26%
as compared to that of urban areas.
$\exp(\beta_3) = \exp(-0.0013) = 0.9987$ implies that as family income
increases by one birr, the incidence rate of child death decreases by
nearly 0.13%.
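The incidence rate ratios can be reproduced from the coefficient table by exponentiation. The sketch below does this in Python and also rescales the income effect to a 100 birr increase, since a one-birr change is too small to interpret easily; the 100 birr step is an illustrative choice and does not appear in the slides.

```python
import math

# Coefficients copied from the Poisson regression output.
coef = {"semi_urban": 0.035657, "rural": 0.2299316, "faminc": -0.0013383}

irr = {name: math.exp(value) for name, value in coef.items()}
pct_change = {name: (ratio - 1) * 100 for name, ratio in irr.items()}

# Effect of a 100 birr increase in monthly family income.
irr_100_birr = math.exp(100 * coef["faminc"])

for name in coef:
    print(f"{name}: IRR = {irr[name]:.4f} ({pct_change[name]:+.2f}%)")
print(f"IRR per 100 birr = {irr_100_birr:.4f}")
```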
Negative-Binomial Regression I
Recall: the Poisson distribution assumes that its 'mean' and 'variance'
are equal.
However, count data in practice often violate this assumption as a
result of unobserved heterogeneity.
Usually, the sample variance is higher than the sample mean, a
phenomenon referred to as 'overdispersion'.
The overdispersion in the data can't be fully captured by observed
covariates.
Hence, there is a need for models having more parameters than the
Poisson distribution.
The negative-binomial is an important extension of the Poisson in this
regard.
Negative-Binomial Regression II
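The negative-binomial distribution is given by
$P(y) = \frac{\Gamma(\alpha^{-1} + y)}{\Gamma(\alpha^{-1})\, y!} \left(\frac{\alpha\mu}{1 + \alpha\mu}\right)^{y} \left(\frac{1}{1 + \alpha\mu}\right)^{1/\alpha}$.
$E(y) = \mu$, $\text{Var}(y) = \mu + \alpha\mu^{2}$.
For regression purposes, we assume
$y_i \sim \text{Negbin}(\mu_i, \alpha)$.
Applying a log link,
$\log \mu_i = x_i^{T}\beta$.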
Jimma Infant Data
Outcome: Total child deaths ('deaths').
Factor1: place of residence
(1 = 'urban', 2 = 'semi-urban', 3 = 'rural').
Factor2: monthly family income, 'faminc', in birr (continuous).
Summary statistics of the outcome ('deaths'):
Variable | Obs Mean Std. Dev. Min Max
-----------------------------------------------------------------
deaths | 8040 .460199 .7191223 0 2
The sample variance (0.719² ≈ 0.517) is higher than the sample mean (0.460), suggesting evidence of overdispersion.
A negative-binomial model might therefore be a better choice than the Poisson model which was
fitted earlier to these data.
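The comparison rests on squaring the reported standard deviation. A minimal Python check, using only the two summary statistics from the table, is given below. It is a crude marginal check: the Poisson assumption concerns the variance given the covariates, so a ratio above 1 here is suggestive and not conclusive.

```python
mean_deaths = 0.460199   # sample mean from the summary table
sd_deaths = 0.7191223    # sample standard deviation from the summary table

variance_deaths = sd_deaths ** 2
dispersion_ratio = variance_deaths / mean_deaths  # equals 1 under a Poisson model

print(f"variance = {variance_deaths:.4f}, variance / mean = {dispersion_ratio:.3f}")
```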
Negative binomial regression Number of obs = 7556
LR chi2(3) = 176.10
Dispersion = mean Prob > chi2 = 0.0000
Log likelihood = -6824.794 Pseudo R2 = 0.0127
--------------------------------------------------------------------------
deaths | Coef. Std. Err. z P>|z| [95% Conf. Interval]
--------------------------------------------------------------------------
place |
2 | .0352622 .0595191 0.59 0.554 -.0813931 .1519174
3 | .2328639 .0532109 4.38 0.000 .1285724 .3371554
|
faminc | -.0013315 .0001712 -7.78 0.000 -.0016671 -.0009959
_cons | -.7470684 .0540689 -13.82 0.000 -.8530415 -.6410953
--------------------------------------------------------------------------
/lnalpha | -1.124534 .1619651 -1.441979 -.807088
--------------------------------------------------------------------------
alpha | .3248039 .0526069 .2364592 .4461554
--------------------------------------------------------------------------
Likelihood-ratio test of alpha=0 chibar2(01) 48.78 Prob>=chibar2 = 0.000
Observations ...
STATA CODE:
nbreg deaths i.place faminc
The dispersion parameter alpha in the negative-binomial model
captures the extra variability in the data.
The negative-binomial model is an improvement in model fit over the
Poisson model.
Furthermore, the likelihood ratio test of the parameter alpha
(p-value < 0.001) confirms the above result.
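The likelihood ratio statistic reported at the foot of the negative-binomial output can be reproduced from the two log likelihoods. The sketch below does this in Python. Because alpha = 0 lies on the boundary of the parameter space, Stata reports a chibar2(01) test; the sketch assumes the usual convention of halving the 1 degree of freedom chi-square tail probability.

```python
import math

ll_poisson = -6849.1816  # log likelihood of the Poisson model
ll_negbin = -6824.794    # log likelihood of the negative-binomial model

lr_alpha = 2 * (ll_negbin - ll_poisson)

# Half of the chi-square(1) upper tail, as alpha = 0 is a boundary value.
p_value = 0.5 * math.erfc(math.sqrt(lr_alpha / 2))

print(f"LR statistic = {lr_alpha:.2f}, p-value = {p_value:.2e}")
```

Both models were fitted to the same 7556 observations, which is what makes the two log likelihoods comparable.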
Zero-Inflated Models I
Many count data sets are characterized by excess zeros beyond what the
common count distributions can predict.
This is often because of heterogeneity between subjects,
or the omission of important covariates.
Not appropriately accounting for this feature leads to biased
estimates,
and hence to erroneous conclusions.
Hence, models that extend the standard count models are needed.
Examples
Example 1: Number of child deaths. Each woman was asked how many of
her children had died.
In the first scenario, a woman may have given birth to many children
and report no child deaths.
In the second scenario, a woman may have no children at all and
therefore also report zero deaths.
Both scenarios yield a zero, and together they may lead to inflated zeros.
Example 2: Number of fish caught by a fisherman.
There may be two different ponds: one without fish and one with fish.
The number of fish caught will be zero if the fisherman tries the pond
without fish.
That is the always-zero group. A fisherman at the pond with fish may
still catch zero for reasons of his own.
Zero-Inflated Models II
Commonly used models in this regard:
Zero-inflated Poisson model (ZIP).
Zero-inflated negative-binomial (ZINB).
ZINB is considered when data are characterized by both
overdispersion and zero-inflation.
Zero-Inflated Models III
The zero-inflated Poisson model (ZIP) is a commonly used model in this
regard.
This model allows for overdispersion by assuming that there are two
different types of individuals in the data:
(1) Those who have a zero count with a probability of 1 (Always-0
group), and
(2) those who have counts predicted by the standard Poisson (Not
always-0 group)
An observed zero could be from either group; if the zero is from the
Always-0 group, the observation has no chance of having a positive
outcome
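The two-group description corresponds to a mixture distribution in which a zero arises either from the always-0 group or from the Poisson part. The Python sketch below writes this out; the mixing probability `pi` and the rate `lam` are hypothetical values and notation introduced here for illustration, not estimates from the course data.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def zip_pmf(y, pi, lam):
    """Zero-inflated Poisson: always zero with probability pi, otherwise Poisson(lam)."""
    base = (1 - pi) * poisson_pmf(y, lam)
    return pi + base if y == 0 else base

pi, lam = 0.3, 1.5  # hypothetical always-0 share and Poisson rate

p_zero_poisson = poisson_pmf(0, lam)   # zeros expected from a plain Poisson
p_zero_zip = zip_pmf(0, pi, lam)       # zeros from both groups combined
total = sum(zip_pmf(y, pi, lam) for y in range(80))
zip_mean = sum(y * zip_pmf(y, pi, lam) for y in range(80))

print(f"P(0) Poisson = {p_zero_poisson:.4f}, P(0) ZIP = {p_zero_zip:.4f}, mean = {zip_mean:.4f}")
```

In Stata a model of this kind is fitted with the `zip` command, where the `inflate()` option names the covariates for the always-0 part.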
Assignment 4
Consider the EDHS2016 datasets
The outcome variable is the number of children a woman has
Fit a Poisson regression
Test the assumptions of the model
Interpret the results
Write a full report
Chapter 4
Survival Data Analysis
Introduction
Survival Analysis refers to statistical methods for analyzing survival
data.
Survival data could be derived from laboratory, clinical,
epidemiological studies, etc.
Response of interest is the time from an initial observation until
occurrence of a subsequent event.
Examples
In many biomedical studies, the primary endpoint is the time until an
event occurs
Examples of such endpoints include
Time to death, remission
Time to adverse events
Time to relapse
Time to recovery from disease
Time to new symptom
Time to discharge
Time until a child starts to talk, etc.
Time from transplant surgery until the new organ fails
Marriage duration
The time interval between a starting point and a subsequent event is
known as the survival time.
Example
Which of the following data sets is likely to lend itself to survival
analysis?
A case-control study of caffeine intake and breast cancer
A randomized controlled trial where the outcome was whether or not
women developed breast cancer in the study period
A cohort study where the outcome was the time it took women to
develop breast cancer.
A cross-sectional study which identified both whether or not women
have ever had breast cancer and their date of diagnosis.
Event and Censoring
We may not observe all the events of interest within the defined
follow-up period
Thus, data are typically subject to censoring when a study ends
before the event occurs
Censoring: Subjects are said to be censored if they are lost to follow
up or drop out of the study, or if the study ends before they die or
have an outcome of interest
Jimma Infant Data
Infants in Southwest Ethiopia were studied for one year.
Measurements were taken approximately every two months.
Nearly 8000 infants at baseline.
Data on infant, maternal and household characteristics were collected.
Death of infants within their first year is an important problem with
these data.
We will therefore analyze the time from birth to death (in days).
For infants still alive when these data were collected, time is the time
from birth to the time of data collection.
Jimma Infant Data II
The variable event is an indicator for whether time refers to death (1)
or end of study (0).
Possible explanatory variables for time-to-death could be place of
residence and sex of infants, among others.
These data can be described as survival data.
Duration or survival data can generally not be analyzed by
conventional methods such as linear regression.
The main reason for this is that some durations are usually
right-censored.
That is, the endpoint of interest has not occurred during the period
of observation.
Another reason is that survival times tend to have positively skewed
distributions.
Survivor Function
The survival time T may be regarded as a random variable with
cumulative distribution function F(t) and probability density function f(t).
An obvious quantity of interest is the probability of surviving to time
t or beyond, the survivor function or survival curve S(t), which is
given by
$S(t) = P(T \ge t) = 1 - F(t)$.
Hazard Function
A further function which is of interest for survival data is the hazard
function.
This represents the instantaneous death rate, that is, the rate at
which an individual experiences the event of interest at a time point,
given that the event has not yet occurred.
The hazard function is given by
$h(t) = \frac{f(t)}{S(t)}$.
That is, the probability density of death at time t divided by the
probability of surviving up to time t.
The hazard function can be viewed as an instantaneous incidence rate.
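The ratio f(t)/S(t) can be illustrated with the simplest survival distribution, the exponential, for which the hazard is constant and equal to the event rate. The sketch below is only an illustration: the rate of 0.02 events per day is hypothetical, and the exponential model is an assumption made here, not one the slides apply to the Jimma data.

```python
import math

rate = 0.02  # hypothetical constant event rate per day

def density(t):
    """f(t) for an exponential survival time."""
    return rate * math.exp(-rate * t)

def survival(t):
    """S(t) = P(T >= t) for an exponential survival time."""
    return math.exp(-rate * t)

def hazard(t):
    """h(t) = f(t) / S(t)."""
    return density(t) / survival(t)

hazards = [hazard(t) for t in (0, 30, 180, 365)]
print(hazards)
```

The hazard stays at the same value at every time point, which is the sense in which it behaves like an incidence rate.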
Kaplan-Meier Estimator
The Kaplan-Meier estimator is a nonparametric estimator of the
survivor function S(t).
If all the failure times, or times at which the event occurs in the
sample, are ordered
and labeled $t_{(j)}$ such that $t_{(1)} \le t_{(2)} \le \ldots \le t_{(n)}$, the estimator is
given by
$\hat{S}(t) = \prod_{j \mid t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right)$.
Kaplan-Meier Estimator II
Recall
$\hat{S}(t) = \prod_{j \mid t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right)$.
$d_j$ is the number of individuals who experience the event at time $t_{(j)}$,
$n_j$ is the number of individuals who have not yet experienced the
event at that time
and are therefore still 'at risk' of experiencing it (including those
censored at $t_{(j)}$).
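The product formula can be traced by hand on a small example. The Python sketch below implements it directly; the six survival times and event indicators are made-up values for illustration only (1 = event observed, 0 = censored), and the function name `kaplan_meier` is introduced here.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose recorded time equals t (events and censored).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Subjects censored at t were still at risk at t and leave afterwards.
        n_at_risk -= leaving
    return curve

times = [2, 3, 3, 5, 7, 8]    # hypothetical follow-up times
events = [1, 1, 0, 1, 0, 1]   # 1 = event, 0 = censored
curve = kaplan_meier(times, events)
print(curve)
```

In Stata the corresponding steps are declaring the data with `stset` and then listing or plotting the estimate with `sts list` or `sts graph`.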

Basic Biostatistics and Data managment

  • 1. Basic Biostatistics and Data Management Using Stata Tadesse Awoke Ayele (PhD, Associate Professor) University of Gonder Collage of Medicine and Health Sciences Institute of Public Health November 22, 2019 Tadesse A. Ayele (Universities of Gondar) November 22, 2019 1 / 294
  • 2. Contact Detail Instructor: Tadesse Awoke (PhD, Associate Professor) Department: Epidemiology and Biostatistics Qualiļ¬cation 1: Biostatistics Qualiļ¬cation 2: Public Health Email: tawoke7@gmail.com Phone: +251910173308 Time: Nov 20-27, 2019 Oļ¬ƒce No: 10 Oļ¬ƒce hours: Available upon request Tadesse A. Ayele (Universities of Gondar) November 22, 2019 2 / 294
  • 3. We are drowning in information and starving for knowledge! Tadesse A. Ayele (Universities of Gondar) November 22, 2019 3 / 294
  • 4. Course Description This course is designed to give students an overview of descriptive statistics, probability, sampling, inference, test of association, comparisons of means, Correlation and regression, Generalized Linear Models (GLM). The sessions are designed to introduce the deļ¬nitions and basic concepts of descriptive statistics, sampling, sample size determination, statistical inference, t-test, ANOVA, Correlation, Linear regression, logistic regression, Poisson regression, Negative binomial regression, and Zero inļ¬‚ated poisson regression. The overall emphasis will be placed on understanding the language of statistics and the art of statistical investigation. Tadesse A. Ayele (Universities of Gondar) November 22, 2019 4 / 294
  • 5. Course Objectives The aim of biostatistical study is to provide the numbers that contain information about certain situations and present them in such a way that valid interpretations are possible Tadesse A. Ayele (Universities of Gondar) November 22, 2019 5 / 294
  • 6. Speciļ¬c objectives Describe the process involved in data processing Explain the diļ¬€erent types of statistical models Identify statistical models which ļ¬ts the data well based on the research question Choose and perform appropriate statistical model for a given medical/health data Identify assumptions and limitations of selected statistical models. Interprate model parameter estimates Conclude based on the information obtained from the data Tadesse A. Ayele (Universities of Gondar) November 22, 2019 6 / 294
  • 7. Course Outline Tadesse A. Ayele (Universities of Gondar) November 22, 2019 7 / 294
  • 8. Chapter 1: General Introduction Research and Biostatistics Concept of Biostatistics Variable, Data, Information, knowledge, and wisdom Sample Datasets Introduction to STATA Tadesse A. Ayele (Universities of Gondar) November 22, 2019 8 / 294
  • 9. General and Generalized Linear Models (GLMs) Tadesse A. Ayele (Universities of Gondar) November 22, 2019 9 / 294
  • 10. Chapter 2: Analysis of Continuous Variable Comparison of means Student t-test Analysis of Variance (ANOVA) Analysis of Covariance (ANCOVA) Correlation and Linear Regression Tadesse A. Ayele (Universities of Gondar) November 22, 2019 10 / 294
  • 11. Chapter 3: Analysis of Categorical Data Categorical Data Analysis Logistic Regression Conditional Logistic Regression Ordinal Logistic Regression Multinomial Logistic Regression Multilevel Logistic Models Tadesse A. Ayele (Universities of Gondar) November 22, 2019 11 / 294
  • 12. Chapter 4: Models for Count data Poisson Regression Negative Binomial Poisson Regression Zero inļ¬‚ated Poisson Regression Multilevel Poisson Models Tadesse A. Ayele (Universities of Gondar) November 22, 2019 12 / 294
  • 13. Chapter 5: Survival Data analysis Time to event data Life Table Kaplan-Meier Survival curve Cox-proportional hazards regression Proportional Hazard Assumption Tadesse A. Ayele (Universities of Gondar) November 22, 2019 13 / 294
  • 14. References Martin Bland. An introduction to Medical Statistics Colton T. Statistics in Medicine Daniel W. Biostatistics a foundation for analysis in the Health Sciences Kirkwood BR. Essentials of Medical Statistics Knapp RG, Miller MC. Clinical epidemiology and Biostatistics. Baltimore Williams and Wilkins, 1992 P. Armitage and G. Berry. Statistical Methods in Medical Research Pagano and Gauvereau. Principles of Biostatistics Tadesse A. Ayele (Universities of Gondar) November 22, 2019 14 / 294
  • 15. Teaching Methods and Evaluation. Teaching methods: lecture, exercise, assignment. Evaluation: participation 5%, assignment 15%, project 20%, final exam 60%.
  • 16. Tentative Schedule
      Chapter                        Date
      General Introduction           Day 1
      Models for continuous data     Day 1
      Models for categorical data    Day 2
      Models for count data          Day 2
      Survival analysis              Day 3
      Non-parametric methods         Day 3
  • 17. Chapter 1: Introduction
  • 18. What is Research? Research is defined as the systematic collection, organization, analysis, and interpretation of data for the purpose of answering a question, solving a problem, or adding to the body of knowledge. The ultimate objective of research is to generate valid evidence for program planning, policy making, intervention, evaluation, decision making, ...
  • 19. Industrial revolutions: The first revolution (1765): moving from hand to machine production, with increasing use of steam and water power. The second revolution (1870): expansion of electricity, petroleum, and steel. The third revolution (1920), digital technology: clever software, robots, and a whole range of web-based services. The fourth industrial revolution? The raw power for the fourth revolution is data. Data is the new oil! Unlike oil, data is created by people, not geology. It can help solve many problems in health, the economy, lifestyle, climate, ...
  • 20. Quantitative Methods: Data are not just numbers; they are numbers with a context. In data analysis, context provides meaning. Context includes who collected the data, how the data were collected, why the data were collected, when and where the data were collected, and what exactly was collected. We can think of the who, how, and why questions as referring to the pragmatics of data collection, while the when, where, and what questions refer to data semantics. Pragmatics is the study of language from the point of view of usage. Semantics is concerned with the study of meaning and is related to both philosophy and logic.
  • 21. Quantitative Methods: A quantitative method is defined as a systematic investigation of phenomena by gathering quantifiable data and applying statistical, mathematical, or computational techniques to explain a particular phenomenon. Quantitative methods include epidemiology and biostatistics.
  • 22. Illustration. Figure 1: Schematic presentation.
  • 23. Misuse of Statistics: The statistical conclusion is used with other knowledge to reach a substantive conclusion. Statistics has several limitations. The statistical conclusion refers to groups and not individuals. It only summarizes but does not interpret data. Statistics can be misused by selective presentation of desired results. It is a tool that can be used well or can be misused. The human must be able to intelligently interpret the output from the computer.
  • 24. Misuse of Statistics: When examining statistical information, consider the following. Was the sample used to gather the statistical data unbiased and of sufficient size? Is the methodology (design) used appropriate? Is the analysis technique appropriate for the data? Is the statistical statement ambiguous, that is, could it be interpreted in more than one way?
  • 25. Examples. 1. Probability of surviving surgery: A patient is examined and told he needs surgery. He asks the doctor for the probability of surviving, and the doctor tells him it is 100%. The patient asks how. The doctor says, "Of 10 patients who go through this operation, 9 die. The 9th patient died yesterday and you are the 10th patient." 2. Harvard University example: There were three female students at Harvard University. Of the three, one married her professor. Hence the claim that 33% of female students at Harvard married their professor. 3. Soldiers' height and crossing the river.
  • 26. Type I and Type II error. Figure 2: Types of error in hypothesis testing.
  • 27. Introduction (1): The 21st century is the period in which information becomes money, power, everything. Biostatistics provides the most fundamental tools and techniques of the scientific method for generating information. It is used for: forming hypotheses; designing experiments and observational studies; gathering data; summarizing data; drawing inferences from data (e.g., testing hypotheses).
  • 28. Introduction (2): The field of statistics can be divided into two. 1. Mathematical statistics: the study and development of statistical theory and methods in the abstract. 2. Applied statistics: the application of statistical methods to solve real problems involving randomly generated data, and the development of new statistical methodology motivated by real problems. Applied statistics can be further subdivided into two branches. Descriptive statistics: describes what we have in hand, the sample (sample mean, sample variance, sample proportion, ...). Inferential statistics: generalizes the findings from the sample to the population (estimation and hypothesis testing).
  • 29. Introduction (3): Biostatistics is the branch of applied statistics directed toward applications in the health sciences and biology. Why biostatistics? Because some statistical methods are more heavily used in health applications than elsewhere (e.g., survival analysis and longitudinal data analysis), and because examples are drawn from the health sciences, which makes the subject more appealing to those interested in health and illustrates how to apply the methodology to similar problems encountered in real life.
  • 30. Variable, Data and Information. Variable: a characteristic that takes different values. Quantitative variables: e.g., number of children in a family (discrete); e.g., weight, height, BP, VL (continuous). Qualitative variables: e.g., marital status, religion (nominal); e.g., education status, patient satisfaction (ordinal).
  • 31. Types of Variable: Variables can also be classified into two broad categories. Outcome variable: also called the response or dependent variable; it is the focus of the research and is affected by other (independent) variables. Predictor variable: also called the explanatory or independent variable; it affects the outcome variable.
  • 32. Example 1: In a study to determine whether surgery or chemotherapy results in higher survival rates for a certain type of cancer, whether or not the patient survived is one variable, and whether the patient received surgery or chemotherapy is the other. 1. Identify the outcome variable. 2. Identify the predictor variable. Solution. Outcome variable: whether or not the patient survived. Predictor variable: the treatment received (surgery or chemotherapy).
  • 33. Example 2 Global Burden of Noncommunicable diseases and risk factors. They are by far the leading cause of death in the Region, representing 62% of all annual deaths. NCD risk factors include: Tobacco Harmful use of Alcohol Sedentary behaviour and physical inactivity Obesity Unhealthy diet.
  • 34. Schematic presentation Figure 3: WHO NCD burden.
  • 35. Data: A measurement (observation) taken on a variable. A collection of data is often called a dataset. Data can be quantitative or qualitative. However, data are raw, and the required evidence cannot easily be obtained from them directly.
  • 36. Illustration: Data are raw, unorganized facts that need to be processed. When data are processed, organized, structured, or presented in a given context so as to make them useful, the result is called information. Figure 4: Relationship between data and information.
  • 37. Information: Data that are specific and organized for a purpose, presented within a context that gives them meaning and relevance, and that can lead to an increase in understanding and a decrease in uncertainty. Biostatistics/statistics is the tool that converts data into information.
  • 38. Example Datasets
  • 39. Example 1: Jimma Child Mortality Data. The data were collected to establish risk factors affecting infant survival. Children born in Jimma, Kaffa, and Illubabor were examined for their first-year growth characteristics. There were 8050 infants enrolled in the study. The variables included were:
      ID    TotPrg  TotBrth  Abortion  StillB  Gravida  Event  Durtaion  parity  MamAge
      1     5       5        2         0       0        0      360       0       30
      2     3       3        2         0       0        0      62        0       23
      3     8       8        2         0       0        0      360       0       32
      4     5       5        2         0       0        0      356       0       30
      5     4       2        1         1       0        0      362       1       23
      ...   ...     ...      ...       ...     ...      ...    ...       ...     ...
      8050  0       4        4         2       0        0      361       2       35
    Objective: To determine the survival probability of newborns. Statistical approach: ............?
  • 40. Example 2: Treatment Effect Data. The data were collected on ulcer patients. The patients were randomized into two arms (placebo and treatment). The variables in the data are age, duration, treatment, time, and result.
      ID    Age   Duration  Treatment  Time  Result
      1     48    2         1          7     1
      2     73    1         1          12    0
      3     54    1         1          12    0
      4     58    2         1          12    0
      5     56    1         0          12    0
      6     49    2         0          12    0
      ...   ...   ...       ...        ...   ...
      42    61    1         0          12    0
      43    33    2         1          12    0
    Objective of the study: To determine the effect of treatment on ulcer. Statistical approach: .............
  • 41. Example 3: HIV/AIDS Data. Follow-up data among HIV/AIDS patients in Gondar Hospital. The variables time, age, sex, residence, WHO stage, and CD4 cell count were included.
      ID    Months  Sex  Age  Residence  Stage  FCD4
      1     1       1    20   1          2      100
      1     2       1    20   1          2      64
      1     3       1    20   1          2      611
      1     4       1    20   1          2      744
      2     1       0    39   2          3      166
      2     2       0    39   2          3      363
      2     3       0    39   2          3      263
      2     4       0    39   2          3      164
      3     1       0    48   1          2      361
      ...   ...     ...  ...  ...        ...    ...
      2550  1       1    30   0          4      441
    Objective: To assess the change in CD4 count over time. Statistical approach: .........
  • 42. Infant Growth Data. Children were followed for 12 months and measurements were taken every two months. Part of the data is given below:
      Obs  child  age(months)  weight(grams)  CatBMI
      1    1      0            2900           1
      2    1      2            3100           0
      3    1      4            3180           1
      ...  ...    ...          ...            ...
      7    1      12           8000           1
      8    2      0            3200           0
      9    2      2            3340           0
      ...  ...    ...          ...            ...
    Objective: To assess the change in weight over time. Statistical approach: .........
  • 43. Gilgel-Gibe Mosquito Data. A study was conducted around the Gilgel-Gibe dam for three years on the influence of the dam on mosquito abundance and species composition. Eight 'At risk' and eight 'Control' villages were selected based on distance.
         +--------------------------------------+
         | ID   Village   Time   Gamb   Season  |
         |--------------------------------------|
      1. |  1   at risk      0      0      wet  |
      2. |  1   at risk      1      0      wet  |
      3. |  1   at risk      2      9      wet  |
      4. |  1   at risk      3     30      wet  |
         |--------------------------------------|
      6. |  1   at risk      5      0      dry  |
         ...
    Objective: To compare the incidence of mosquitoes and characterize the type of species. Statistical approach: .........
  • 44. In all the above examples, the variables are listed in the first row and the body of the table represents the data. Every dataset is collected for a purpose (a research question) and needs to be analyzed to generate information.
  • 45. Introduction to Stata
  • 46. Introduction: Stata is a multi-purpose statistical package that helps you explore, summarize, and analyze datasets. A dataset is a collection of several pieces of information called variables (usually arranged in columns). A variable can have one or several values (information for one or several cases). Other statistical packages are SPSS, SAS, and R. Stata is widely used in social science research and is the most used statistical software on campus.
  • 47. Data Management Software: Several data management packages are available. The most common are Stata, SPSS, SAS, and R. They all have different features. For this course, we will focus on Stata.
  • 48. Introduction to Stata: Stata is a complete, integrated statistical package that provides everything you need for data analysis, data management, and graphics. It is fast and easy to use. The current version is Stata 14 (?). Standard Stata can handle up to 2047 variables; Stata/SE can handle 32766 variables. The number of observations is limited by your computer (up to 2 billion!).
  • 49. Stata has four windows. Command: where commands are entered; all commands and variable names are case sensitive. Results: where results appear. Review: where past commands are listed; clicking a past command in the Review window brings it to the Command window, where it can be modified and re-executed. Variables: where the variables in the current dataset are listed.
  • 50. Stata Windows: The four windows of Stata are shown here.
  • 51. Other windows: Graphics, Data Editor, Data Viewer, 'do' (do-file editor), Log.
  • 52. Important details: Stata is case sensitive! In older versions (Stata 11 and earlier), first set the memory: set memory 100m (Stata 12 and later manage memory automatically, so this step is no longer needed). You can have only one dataset active at a time. Abbreviations work for commands and variable names! Help is a very important part of Stata. Two interactive options: help followed by a command name (e.g., help regress), and search followed by a keyword.
  • 53. There are lots of things you can do in Stata without data!
      * lower and upper bound of a confidence interval
      display 4.1-1.96*0.3
      display 4.1+1.96*0.3
      * 2x2 contingency table (rows are separated by a backslash)
      tabi 100 34 \ 17 294
      * column percentages
      tabi 100 34 \ 17 294, col
      * column, row, and cell percentages with the chi-squared test
      tabi 100 34 \ 17 294, col row cell chi2
      * odds ratio with confidence interval
      cci 100 34 17 294
      cci 100 34 17 294, exact
      * sample size for comparing two proportions
      * (sampsi is the older command; Stata 13 and later also provide power twoproportions)
      sampsi 0.2 0.5
  • 54. Inputting Data. There are many options: Manually enter data into the Stata Data Editor. Copy data into the Data Editor from another source (e.g., Excel). Import an ASCII (text) file. Read in an Excel spreadsheet saved as a tab- or comma-delimited text file. Open an existing Stata data file with the file extension .dta. Use a conversion package (Stat/Transfer) to read in data from another package (e.g., SPSS, SAS, ...).
  • 55. Manual Data Entry: First, open the Data Editor window to enter data manually. Second, define the variable names. Note that variables are automatically named var1, var2, ... Double-click on the top of a column to view/edit "Variable Properties" and change the name, or use the command rename oldvarname newvarname (e.g., rename var1 id). Third, type the observations for each variable. To move to the next observation, press "Enter".
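The same manual entry can also be done from the Command window or a do-file. The following is a minimal sketch with made-up values; the names id, age, and sex are illustrative and do not come from the course datasets. It uses Stata's built-in input command and then renames the variables as described above.

```stata
clear
* type three observations by hand (values are made up for illustration)
input var1 var2 var3
1 48 1
2 73 0
3 54 1
end
* replace the placeholder names with meaningful ones
rename var1 id
rename var2 age
rename var3 sex
* check what was entered
list
```

For more than a handful of observations, entering the data in a spreadsheet and importing it is usually less error-prone.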
  • 56. Data entry: Variables can be defined in this window.
  • 57. Importing Data: Via the drop-down menu: File → Import → ASCII data created by a spreadsheet. To open an existing file: File → Open, then browse to find the data. The clear option clears the memory before loading the requested data file. This is necessary because Stata can hold only one dataset in memory at a time!
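The menu steps above have command equivalents. The sketch below assumes hypothetical files named mydata.csv, mydata.xlsx, and mydata.dta in the current working directory, and a worksheet named Sheet1; substitute your own paths. The import delimited command requires Stata 13 or later and import excel requires Stata 12 or later.

```stata
* comma- or tab-delimited text file (hypothetical file name)
import delimited "mydata.csv", clear

* Excel workbook, using the first row as variable names (hypothetical file and sheet)
import excel "mydata.xlsx", sheet("Sheet1") firstrow clear

* existing Stata dataset (hypothetical file name)
use "mydata.dta", clear
```

Each command carries the clear option, so whatever dataset is currently in memory is dropped before the new one is loaded.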
  • 58. Interactive command line. Do-file: The commands used to produce the output can be saved in a "do-file". They can be executed by highlighting the command lines and running them. Any time we want to rerun the commands, we can open the do-file. Log file: A log file is used to save the output produced. This can be done by clicking Begin under the Log menu: File → Log → Begin. To stop saving the output, use File → Log → Suspend (or End).
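Logging can also be controlled by commands, which is convenient inside a do-file. This is a small sketch that assumes a hypothetical log file name, session1.log, and uses the auto dataset shipped with Stata purely as stand-in data.

```stata
* start a plain-text log, overwriting any earlier file of the same name
log using "session1.log", text replace
sysuse auto, clear
summarize price mpg
* suspend logging: the next command is not recorded
log off
describe
* resume logging, then close the log file
log on
log close
```

Saving these lines in a do-file and running it reproduces both the analysis and its log in one step.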
  • 59. Data Management and Analysis: Stata works in two ways: 1. menu driven; 2. command driven. The latter is used in this material.
  • 60. Data Management: Once the data have been entered or imported, data management is the next step. Data should be cleaned, edited, and recoded before analysis starts. Cleaning can be done through visual inspection and by running frequencies and cross-tabulations. When an error is detected, it can be corrected by checking the hard copy and, if necessary, the data collector. Some variables may need recoding to make interpretation clearer or when a model assumption is violated.
  • 61. Data Management: Basic commands for generating new variables or recoding existing variables. gen: generates a new variable, usually from existing variables. replace: changes the values of an existing variable, for example to fill in the remaining categories of a newly generated variable. recode: recodes the values of a variable, for example to turn a continuous variable into a categorical one. egen: extended generate, for example to create a new variable that counts the number of "yes" responses. merge: adds variables from a second dataset to the dataset you are currently working with.
  • 62. Data Management: Other useful commands include list, describe, keep, drop, rename, and so on.
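The commands listed above can be tried on a toy dataset. Everything in this sketch is made up for illustration (the variables id, age, q1 to q3, and cd4, and the age cut-offs); it is not taken from the course data. Because it uses a temporary file held in a local macro, it should be run as a whole from a do-file.

```stata
clear
input id age q1 q2 q3
1 23 1 0 1
2 35 0 0 1
3 41 1 1 1
4 58 0 0 0
end

* gen + replace: build an age group step by step
generate agegrp = 1 if age < 30
replace agegrp = 2 if age >= 30 & age < 50
replace agegrp = 3 if age >= 50 & !missing(age)

* recode: the same grouping in a single command
recode age (min/29 = 1) (30/49 = 2) (50/max = 3), generate(agecat)

* egen: count the number of "yes" (coded 1) answers across q1-q3
egen nyes = anycount(q1 q2 q3), values(1)

* merge: add a variable from a second (temporary) dataset by id
tempfile labs
preserve
clear
input id cd4
1 350
2 410
3 120
4 275
end
save `labs'
restore
merge 1:1 id using `labs'

describe
list
```

After the merge, Stata adds a variable named _merge that records whether each observation was found in one or both datasets; tabulating it is a quick check that the match worked as intended.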
  • 63. Data Management: Data management using the menus.
  • 64. Data Analysis: There are two basic types of analysis, descriptive and analytic. In Stata, both can be carried out either through the menus or through commands.
  • 65. Part II: Statistical Models for Continuous Variables
  • 66. Models: Statistics includes the process of finding out about patterns in the real world using data. When solving statistical problems, it is often helpful to build models of real-world situations based on observations of data, assumptions about the context, and theoretical probability. The model can then be used to make predictions, test assumptions, and solve problems. These models can be either deterministic or stochastic. A deterministic model does not include elements of randomness. A stochastic (probabilistic) model includes elements of randomness.
  • 67. Components of the model: There are three components in a GLM, $g(y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$. Random component: $y \sim \text{dist}()$. Systematic component: $\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k$. Link function: the function $g(y)$ that links the random component with the systematic component.
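In Stata, the three components map onto the glm command: the outcome's distribution goes in family(), the link function in link(), and the systematic component is the list of predictors. The sketch below uses the auto dataset shipped with Stata only to show the syntax; the choice of variables is an assumption made for illustration, and treating rep78 as a count is purely for demonstration.

```stata
sysuse auto, clear

* normal outcome, identity link: the same fit as linear regression
glm mpg weight, family(gaussian) link(identity)

* binary outcome, logit link: logistic regression
glm foreign weight, family(binomial) link(logit)

* count-type outcome, log link: Poisson regression
glm rep78 weight, family(poisson) link(log)
```

Changing only family() and link() while keeping the right-hand side fixed is what makes these models members of one family.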
  • 68. Deterministic Versus Stochastic Models. Deterministic: Model parameters are exact, e.g., $E = mc^2$. The output of the model is fully determined by the parameter values and the initial conditions. Stochastic: Models possess some inherent randomness, e.g., $g(y) = X\beta + \varepsilon_{ij}$. The same set of parameter values and initial conditions will lead to an ensemble of different outputs, so the output has variation. The natural world is buffeted by stochasticity.
  • 69. Task 1: Review the literature in your field of study. Locate studies and read the methods section. List the dependent variable. List the independent variables. What statistical methods were used? Is the model appropriate for the outcome variable? Submission date:
  • 70. Chapter 2: Analysis of Continuous Variables
  • 71. Introduction: A continuous variable is one that can take on infinitely many, uncountable possible values in the range of real numbers. The probability distribution of a continuous variable can be expressed in terms of a probability density function. Figure 5: Probability Density Function.
  • 72. Location and Scale Parameters: A probability distribution is characterized by location and scale parameters, together with measures of shape. The location parameter determines where the distribution is centered, and the scale parameter determines its spread. Skewness describes the asymmetry of the distribution: negatively skewed, symmetric, or positively skewed. Kurtosis describes the weight of the tails. Mesokurtic: the normal distribution. Leptokurtic: a distribution with heavy tails on either side. Platykurtic: a distribution with light tails and a flatter peak. Location and scale parameters are typically used in modeling applications.
  • 73. Modeling Continuous Variables: The model depends on the research question. If we are interested in comparing the means of two populations, the t-test is appropriate. There are three different t-tests: the one-sample t-test, the two-independent-samples t-test, and the paired-sample t-test. If we are interested in comparing the means of more than two populations, we use analysis of variance (ANOVA). Extensions of ANOVA include one-way ANOVA, multi-way ANOVA, ANCOVA, MANOVA, MANCOVA, and repeated-measures ANOVA.
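The three t-tests and one-way ANOVA listed above each have a one-line Stata command. This sketch assumes the fictional blood-pressure dataset bpwide that ships with Stata (variables bp_before, bp_after, sex, agegrp); the reference value 150 in the one-sample test is an arbitrary number chosen for illustration.

```stata
sysuse bpwide, clear

* one-sample t-test: is the mean baseline blood pressure equal to 150?
ttest bp_before == 150

* two independent samples: baseline blood pressure by sex
ttest bp_before, by(sex)

* paired samples: before versus after in the same patients
ttest bp_before == bp_after

* one-way ANOVA: baseline blood pressure across the age groups
oneway bp_before agegrp, tabulate
```

The choice among these follows the research question: one group against a fixed value, two unrelated groups, two measurements on the same subjects, or more than two groups.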
  • 74. Modeling Continuous Variables: If we want to determine the relationship between two continuous variables, we use correlation. If we want to predict the effect of one variable on the dependent variable, we use simple linear regression. If we want to predict the effect of multiple variables on the dependent variable, we use multiple linear regression.
  • 75. Correlation and Linear Regression
  • 76. Correlation: Just as OR and RR are used to quantify risk between two dichotomous variables, correlation is used to quantify the degree to which two continuous random variables are related, provided the relationship is linear. If we are interested in the relationship between two continuous random variables, we should always create a two-way scatter plot before conducting any analysis, to get a visual understanding of the relationship. We now consider ways of describing the relationship between numeric variables from a single population. Consider the relationship between the weight and height of children from the Jimma data.
  • 77. The correlation coefficient: Before we conduct any statistical analysis, we should always create a scatter plot of the data. These data can be presented in a scatter plot, as shown below. One can see from the scatter plot that weight tends to increase as length increases. The correlation coefficient may be used to measure the degree of association between the two variables. Alternatively, if we are interested in predicting the weight of children from their height, we could fit a linear regression model (a straight line) to the data.
  • 78. Scatter plot showing the above table. [Figure: scatter plot of weight (y-axis) against length (x-axis) with the fitted values line.] Another way of looking at correlation is that it is a measure of the scatter of the points around the underlying linear trend (the greater the spread of the points, the lower the correlation).
  • 79. STATA code for scatter plot and lines:
      twoway (scatter weight length) (lfit weight length)
  • 80. Correlation coefficient: In the underlying population from which the sample of points $(x_i, y_i)$ is selected, the correlation coefficient between $X$ and $Y$ is denoted by $\rho$ and is given by $\rho = \dfrac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$. It can be thought of as the average of the product of the standard normal deviates of $X$ and $Y$.
  • 81. Estimation of the correlation coefficient: The estimator of the population correlation coefficient is given by $r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. $r$ is called the product-moment correlation coefficient or Pearson correlation coefficient, and $-1 \le r \le 1$. $r$ close to 1 implies a strong direct relationship; $r$ close to $-1$ implies a strong inverse relationship; $r = 0$ when there is no linear relationship at all. The Pearson correlation coefficient is a measure of the degree of straight-line relationship.
  • 82. Interpretation: The correlation was found to be 0.5648.
      . pwcorr weight length

                   |   weight   length
      -------------+------------------
            weight |   1.0000
            length |   0.5648   1.0000
    This means that weight and length are positively correlated. The p-value is less than 0.05, which indicates that the correlation is statistically significant (pwcorr displays the p-value when the sig option is added).
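The p-value referred to above is not part of the default pwcorr output. As a hedged illustration, the sketch below uses the auto dataset shipped with Stata, which happens to contain variables named weight and length; these are car measurements, not the Jimma child data, so the numbers will differ from the 0.5648 reported above.

```stata
sysuse auto, clear

* correlation with its p-value (sig) and the number of observations (obs)
pwcorr weight length, sig obs

* correlate reports the coefficient only, using complete cases
correlate weight length
```

With the sig option, the second line under each coefficient is the p-value for the test that the population correlation is zero.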
  • 83. Limitations of the correlation coefficient: It quantifies only the strength of the linear relationship between two variables. If the two variables have a non-linear relationship, it will not provide a valid measure of the association. Care must be taken when the data contain outliers, because the correlation coefficient is highly sensitive to extreme values. The estimated correlation coefficient should never be extrapolated beyond the observed ranges of the variables; the relationship between two variables may change outside this region. It must be kept in mind that a high correlation coefficient between two variables does not in itself imply a cause-and-effect relationship.
  • 84. Introduction to Linear Regression: We frequently measure two or more variables on the same individual (case, object, etc.). We do this to explore the nature of the relationship among these variables. There are two basic types of relationship: cause-and-effect and functional. Function: a mathematical relationship enabling us to predict what values of one variable ($Y$) correspond to given values of another variable ($X$). $Y$ is referred to as the dependent variable, the response variable, or the predicted variable. $X$ is referred to as the independent variable, the explanatory variable, or the predictor variable.
  • 85. Linear Regression: Questions to be answered. What is the association between $Y$ and $X$? How can changes in $Y$ be explained by changes in $X$? What are the functional relationships between $Y$ and $X$? A functional relationship is written symbolically as $Y = f(X)$. Example: $Y$ = cholesterol versus $X$ = age. There are two kinds of explanatory variables: those we can control, and those over which we have little or no control.
  • 86. Hypertension Data: We want to study the effect of age ($X_1$) on blood pressure ($Y$). For this purpose, consider linear regression. Model: $Y = \alpha + \beta X_1 + \varepsilon$. $\alpha$ and $\beta$ are unknown constants to be estimated. $\varepsilon$ stands for measurement error. $\alpha$ is the mean value of $Y$ when $X_1$ is 0. $\beta$ is the expected change in the mean of $Y$ for a unit change in $X_1$. $\varepsilon$ is assumed to be normally distributed with mean 0 and constant variance.
  • 87. Commonly, least squares (LS) or maximum likelihood (ML) techniques are used for estimation. $\alpha$ and $\beta$ are estimated by $\hat{\alpha}$ and $\hat{\beta}$, respectively. Estimated model: $\hat{Y} = \hat{\alpha} + \hat{\beta} X_1$. Least squares estimators: $\hat{\beta} = \dfrac{SS_{xy}}{SS_{xx}}$ and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$, where $SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.
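The least squares formulas above can be checked by hand in Stata. This is a sketch only: it uses the auto dataset shipped with Stata as stand-in data (weight as the outcome, length as the predictor), not the hypertension data, and the scalar and variable names are chosen for illustration.

```stata
sysuse auto, clear

* sample means of x (length) and y (weight)
quietly summarize length
scalar xbar = r(mean)
quietly summarize weight
scalar ybar = r(mean)

* cross-products and squared deviations, then their sums
generate double sxy = (length - xbar)*(weight - ybar)
generate double sxx = (length - xbar)^2
quietly summarize sxy
scalar SSxy = r(sum)
quietly summarize sxx
scalar SSxx = r(sum)

* slope and intercept from the formulas
scalar b = SSxy/SSxx
scalar a = ybar - b*xbar
display "slope = " scalar(b) "   intercept = " scalar(a)

* compare with the coefficients reported by regress
regress weight length
```

The slope and intercept displayed by hand should agree with the Coef. column of the regress output up to rounding.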
  • 88. Hypertension data: Blood pressure ($Y$) and age ($X_1$). [Figure 6: Mean arterial blood pressure versus age in years, with fitted values.]
  • 89. Regression output:
            Source |       SS       df       MS              Number of obs =      20
      -------------+------------------------------           F(  1,    18) =   13.82
             Model |  243.265993     1  243.265993           Prob > F      =  0.0016
          Residual |  316.734007    18  17.5963337           R-squared     =  0.4344
      -------------+------------------------------           Adj R-squared =  0.4030
             Total |         560    19  29.4736842           Root MSE      =  4.1948

      ------------------------------------------------------------------------------
                 y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
                x1 |   1.430976   .3848601     3.72   0.002     .6224154    2.239537
             _cons |   44.45455    18.7277     2.37   0.029     5.109098    83.79999
      ------------------------------------------------------------------------------
    ANOVA table: It tells the story of how the regression equation accounts for variability in the response variable.
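The summary statistics in the upper right of the output follow directly from the ANOVA table. As a quick check, the arithmetic can be reproduced with display, using only the numbers printed above; the results should match the reported values up to rounding.

```stata
* R-squared = Model SS / Total SS
display 243.265993/560

* adjusted R-squared = 1 - (Residual MS / Total MS)
display 1 - 17.5963337/29.4736842

* F = Model MS / Residual MS
display 243.265993/17.5963337

* Root MSE = square root of the Residual MS
display sqrt(17.5963337)

* t for x1 = coefficient / standard error
display 1.430976/.3848601

* two-sided p-value for that t on 18 degrees of freedom
display 2*ttail(18, 1.430976/.3848601)

* p-value for the F statistic on (1, 18) degrees of freedom
display Ftail(1, 18, 243.265993/17.5963337)
```

With a single predictor, the F statistic is the square of the t statistic for the slope, so the two p-values coincide.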
  • 90. STATA CODE:
      regress y x1
    The estimate of the intercept is $\hat{\alpha} = 44.45455$. The estimate of the slope (the coefficient of age) is $\hat{\beta} = 1.43098$. STATA code for scatter plot and lines:
      twoway (scatter y x1) (lfit y x1)
  • 91. Conclusion: The ANOVA (F-test) part shows that the regression is significant (p = 0.0016). The adjusted R-squared implies that nearly 40% of the variation in $Y$ is explained by the regression. $\hat{\beta}$ is significant (p = 0.0016) and implies that for a unit (one year) increase in age, the expected blood pressure increases, on average, by 1.43 mmHg. The studentized residuals can be used for identifying outliers. We use the 'predict' command with the 'rstudent' option to generate studentized residuals.
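A minimal sketch of the outlier check described above follows. It assumes the auto dataset shipped with Stata as stand-in data, a new variable name rstu, and a cut-off of 2 in absolute value, which is a common rule of thumb rather than something stated in the slides.

```stata
sysuse auto, clear
regress weight length

* studentized residuals from the model just fitted
predict rstu, rstudent

* list observations whose studentized residual exceeds 2 in absolute value
list make weight length rstu if abs(rstu) > 2 & !missing(rstu)
```

Flagged observations are candidates for checking against the original records, not automatic deletions.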
  • 92. Test of Normality: We can use the 'predict' command to create residuals, and 'kdensity', 'qnorm', and 'pnorm' to check the normality of the residuals. kdensity stands for kernel density estimate. It can be thought of as a histogram with narrow bins and a moving average. [Figure: kernel density estimate of the studentized residuals with a normal density overlay; kernel = epanechnikov, bandwidth = 0.5256.]
  • 93. STATA CODE:
      predict r, resid
      kdensity r, normal
  • 94. The ā€˜pnormā€™ command graphs a standardized normal probability (P-P) plot against the quantiles of a normal distribution. ā€˜qnormā€™ plots the quantiles of a variable against the quantiles of a normal distribution. ā€˜pnormā€™ is sensitive to non-normality in the middle range of data. ā€˜qnormā€™ is sensitive to non-normality near the tails. Tadesse A. Ayele (Universities of Gondar) November 22, 2019 94 / 294
  • 95. [Figure 8: Quantiles of residuals. Studentized residuals plotted against the inverse normal.]
    STATA CODE:
        pnorm r
        qnorm r
  • 96. Homoscedasticity of Residuals
    Assumption: homogeneity of variance of the residuals.
    If the model is well fitted, there should be no pattern in the residuals plotted against the fitted values.
    If the variance of the residuals is non-constant, it is heteroscedastic.
    [Figure 9: Residuals versus fitted values.]
  • 97. STATA CODE:
        rvfplot, yline(0)
    Two formal tests are available: Cameron and Trivedi’s decomposition of the IM-test, or hettest, which is the Breusch-Pagan test.
    If the p-value < 0.05, reject the hypothesis that the variance is homogeneous.
  • 98. Cameron & Trivedi’s decomposition of IM-test
        ---------------------------------------------------
                      Source |       chi2     df      p
        ---------------------+-----------------------------
          Heteroskedasticity |       0.33      2    0.8458
                    Skewness |       1.22      1    0.2697
                    Kurtosis |       1.14      1    0.2848
        ---------------------+-----------------------------
                       Total |       2.70      4    0.6097
        ---------------------------------------------------
  • 99. Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
        Ho: Constant variance
        Variables: fitted values of y
        chi2(1)     = 0.01
        Prob > chi2 = 0.9324
    STATA CODE:
        estat imtest
        estat hettest
  • 100. One categorical predictor
    Relationship between a continuous outcome and a single categorical variable.
    Dummy variables are needed for the categories.
    Objective: estimation/prediction of the effect of the predictor on the response.
    We want to study whether birth weight (Y) varies with sex (X1; female = 0, male = 1).
    For this purpose, consider linear regression.
    Model: $Y = \alpha + \beta X_1 + \varepsilon$
    $\alpha$ and $\beta$ are unknown constants to be estimated.
    $\varepsilon$ stands for the random (measurement) error.
    $\beta$ is the expected difference in the mean of Y between the X1 categories.
  • 101.
          Source |       SS         df       MS          Number of obs =    7873
        ---------+--------------------------------       F(1, 7871)    =   97.91
           Model | 25750334.5        1  25750334.5       Prob > F      =  0.0000
        Residual | 2.0700e+09     7871  262990.174       R-squared     =  0.0123
        ---------+--------------------------------       Adj R-squared =  0.0122
           Total | 2.0957e+09     7872  266227.896       Root MSE      =  512.83

        ---------------------------------------------------------------------
          weight |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
        ---------+-----------------------------------------------------------
           1.sex |  114.4014   11.56138     9.90   0.000      91.738   137.0647
           _cons |  3055.323   8.253152   370.20   0.000    3039.144   3071.501
        ---------------------------------------------------------------------
  • 102. STATA CODE:
        regress weight i.sex
    ‘i.sex’ is used in the code because sex is an indicator variable.
    ‘1.sex’ stands for males, and hence female is the reference group.
    Estimate of the intercept: $\hat{\alpha} = 3055.32$.
    Estimate of the slope (coefficient of sex): $\hat{\beta} = 114.40$.
    Interpretation?
  • 103. Multiple linear regression
    Extension of simple linear regression to more than one predictor.
    Each predictor can be either a continuous or a categorical variable.
    Objective: estimation/prediction of the effect of the predictors on the response.
    Hypertension Data
    We want to study the effect of (X1), (X2), (X3), (X4), (X5), (X6) on blood pressure (Y).
    Predictors: X1 = age (years); X2 = weight (kg); X3 = body surface area (sq m); X4 = duration of hypertension; X5 = basal pulse (beats/min); X6 = measure of stress.
    For this purpose, consider multiple linear regression.
  • 104. [Figure 10: Scatter plot matrix. Panels: mean arterial blood pressure, age in years, weight in kg, body surface area, duration of hypertension, basal pulse (beats per minute), measure of stress.]
  • 105. STATA CODE:
        graph matrix y x1 x2 x3 x4 x5 x6
    What do you observe from the scatter plot matrix?
    Which variables seem to be correlated with which other variables?
  • 106. Coefficient of determination ($R^2$):
        $$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}.$$
    The adjusted $R^2$ statistic is an estimate of the population $R^2$ taking account of the fact that the parameters were estimated from the data:
        $$\text{Adj-}R^2 = 1 - \frac{(n - 1)(1 - R^2)}{n - p}.$$
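    As a check on the adjusted $R^2$ formula, the sketch below applies it to the two hypertension models in these slides. It assumes that p counts all estimated parameters including the intercept (p = 2 for the age-only model, p = 7 for the six-predictor model); that reading is an assumption, chosen because it reproduces the values Stata reports. The function name is illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2, with p = number of estimated parameters (intercept included)."""
    return 1 - (n - 1) * (1 - r2) / (n - p)

# R^2 recomputed from the model and total sums of squares in the Stata output.
r2_simple = 243.265993 / 560   # model with age only
r2_full = 557.844135 / 560     # model with all six predictors

adj_simple = adjusted_r2(r2_simple, n=20, p=2)
adj_full = adjusted_r2(r2_full, n=20, p=7)

print(f"age only:       R2 = {r2_simple:.4f}, Adj R2 = {adj_simple:.4f}")
print(f"six predictors: R2 = {r2_full:.4f}, Adj R2 = {adj_full:.4f}")
```

    Under that assumption the results match the reported Adj R-squared values of 0.4030 and 0.9944.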
  • 107. $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \varepsilon$
    Response: Y = mean arterial blood pressure (mm Hg);
    Predictors: X1 = age (years); X2 = weight (kg); X3 = body surface area (sq m); X4 = duration of hypertension; X5 = basal pulse (beats/min); X6 = measure of stress.
          Source |       SS       df       MS        Number of obs =      20
        ---------+------------------------------     F(6, 13)      =  560.64
           Model | 557.844135      6  92.9740225     Prob > F      =  0.0000
        Residual |  2.1558651     13  .165835777     R-squared     =  0.9962
        ---------+------------------------------     Adj R-squared =  0.9944
           Total |        560     19  29.4736842     Root MSE      =  .40723
    Much improvement compared to the model with only age, say, based on Adj R-sq.
  • 108.
        ---------------------------------------------------------------------
               y |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
        ---------+-----------------------------------------------------------
              x1 |  .7032596   .0496059    14.18   0.000    .5960926   .8104266
              x2 |  .9699192   .0631086    15.37   0.000    .8335815   1.106257
              x3 |  3.776502   1.580154     2.39   0.033    .3627878   7.190217
              x4 |  .0683829   .0484416     1.41   0.182   -.0362687   .1730346
              x5 | -.0844846   .0516091    -1.64   0.126   -.1959792   .0270101
              x6 |  .0055715   .0034123     1.63   0.126   -.0018003   .0129433
           _cons | -12.87047   2.556654    -5.03   0.000   -18.39378   -7.34715
        ---------------------------------------------------------------------
    STATA CODE:
        regress y x1 x2 x3 x4 x5 x6
  • 109. Variable Selection
        Forward selection
        Backward elimination
        Stepwise regression
  • 110. Multicollinearity
    When there is a perfect linear relationship among the predictors, the estimates cannot be uniquely computed.
    The term collinearity implies that two variables are near-perfect linear combinations of one another.
    The regression model estimates of the coefficients become unstable.
    The standard errors for the coefficients can get wildly inflated.
    We can use the vif command after the regression to check for multicollinearity.
    As a rule of thumb, a variable whose VIF value is greater than 10 may need further investigation.
    Tolerance, defined as 1/VIF, is used by many researchers to check on the degree of collinearity.
  • 111.
        $$\mathrm{VIF}(x_k) = \frac{1}{1 - R_k^2}.$$
    $\mathrm{VIF}(x_k)$ is the variance inflation factor for explanatory variable $x_k$.
    $R_k^2$ is the square of the multiple correlation coefficient obtained from regressing $x_k$ on the remaining explanatory variables.
          Variable |       VIF       1/VIF
        -----------+----------------------
                x2 |      8.42    0.118807
                x3 |      5.33    0.187661
                x5 |      4.41    0.226574
                x6 |      1.83    0.545005
                x1 |      1.76    0.567277
                x4 |      1.24    0.808205
        -----------+----------------------
          Mean VIF |      3.83
  • 112. STATA CODE:
        regress y x1 x2 x3 x4 x5 x6
        vif
    There are different suggestions for the cut-off point: some say 5 and some say even 10.
    So, if we consider 10, multicollinearity does not seem to be a major problem.
    If we take 5, then there are two variables which need a decision.
    One solution for dealing with multicollinearity is to remove some of the violating predictors from the model.
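    The cut-off discussion can be made concrete with a short sketch that takes the VIF values from the table in these slides and derives the tolerance (1/VIF), the implied $R_k^2 = 1 - 1/\mathrm{VIF}$, and the variables flagged at each cut-off. The dictionary and variable names are illustrative; because the VIFs are entered as the rounded table values, the tolerances differ slightly from Stata's 1/VIF column in the later decimals.

```python
# VIF values copied (rounded to two decimals) from the Stata table.
vif = {"x2": 8.42, "x3": 5.33, "x5": 4.41, "x6": 1.83, "x1": 1.76, "x4": 1.24}

for name, value in vif.items():
    tolerance = 1 / value   # tolerance = 1/VIF
    r2_k = 1 - 1 / value    # R^2 from regressing x_k on the other predictors
    print(f"{name}: VIF = {value:.2f}, tolerance = {tolerance:.3f}, R2_k = {r2_k:.3f}")

flagged_at_5 = [name for name, value in vif.items() if value > 5]
flagged_at_10 = [name for name, value in vif.items() if value > 10]
mean_vif = sum(vif.values()) / len(vif)

print("VIF > 5: ", flagged_at_5)
print("VIF > 10:", flagged_at_10)
print(f"Mean VIF = {mean_vif:.2f}")
```

    With a cut-off of 5 the sketch flags weight (x2) and body surface area (x3); with a cut-off of 10 it flags nothing, which is the reading given on the slide.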
  • 113. Jimma Infant Data
    We want to study the effect of (X1), (X2), (X3), (X4) on birth weight (Y).
    Originally, place was coded as urban = 1, semi-urban = 2, and rural = 3.
    By default, Stata considers the first category (urban = 1) as the reference.
    For this purpose, consider multiple linear regression.
        $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$
    Response: Y = birth weight;
    Predictors: X1 = sex (female = 0, male = 1); X2 = place (semi-urban); X3 = place (rural); X4 = age of mother (years).
  • 114.
          Source |       SS         df       MS          Number of obs =    7873
        ---------+--------------------------------       F(4, 7868)    =  169.34
           Model |  166126585        4  41531646.3       Prob > F      =  0.0000
        Residual | 1.9296e+09     7868  245249.036       R-squared     =  0.0793
        ---------+--------------------------------       Adj R-squared =  0.0788
           Total | 2.0957e+09     7872  266227.896       Root MSE      =  495.23

        ---------------------------------------------------------------------
          weight |     Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
        ---------+-----------------------------------------------------------
           1.sex |  109.2883   11.16942     9.78   0.000    87.39329   131.1833
                 |
           place |
               2 | -44.07271   16.25057    -2.71   0.007   -75.92814  -12.21729
               3 | -279.7563   13.86682   -20.17   0.000    -306.939  -252.5737
                 |
          momage |  7.809375   .8910886     8.76   0.000    6.062605   9.556145
           _cons |  3010.102   26.26505   114.60   0.000    2958.615   3061.588
        ---------------------------------------------------------------------
  • 115. $\hat{\alpha} = 3010.102$, $\hat{\beta}_1 = 109.2883$, $\hat{\beta}_2 = -44.07271$, $\hat{\beta}_3 = -279.7563$, $\hat{\beta}_4 = 7.809375$.
    Interpretation?
    STATA CODE:
        regress weight i.sex i.place momage
        estat vif
            Variable |       VIF       1/VIF
        -------------+----------------------
               1.sex |      1.00    0.999139
               place |
                   2 |      1.53    0.655694
                   3 |      1.54    0.650037
              momage |      1.01    0.988233
        -------------+----------------------
            Mean VIF |      1.27
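    One way to approach the interpretation question is to plug covariate values into the fitted equation. The sketch below does this with the estimated coefficients; the function name and the example profile (a male infant, rural residence, mother aged 25) are hypothetical and chosen only for illustration. Results are in the units of the weight variable, which the slides do not state.

```python
# Estimated coefficients copied from the Stata output.
coef = {
    "_cons": 3010.102,
    "male": 109.2883,        # 1.sex (reference: female)
    "semi_urban": -44.07271, # place = 2 (reference: urban)
    "rural": -279.7563,      # place = 3 (reference: urban)
    "momage": 7.809375,      # per year of mother's age
}

def predict_weight(male, place, momage):
    """Fitted mean birth weight; place: 1 = urban, 2 = semi-urban, 3 = rural."""
    y = coef["_cons"] + coef["male"] * male + coef["momage"] * momage
    if place == 2:
        y += coef["semi_urban"]
    elif place == 3:
        y += coef["rural"]
    return y

# Hypothetical profile: male infant, rural residence, mother aged 25.
example = predict_weight(male=1, place=3, momage=25)
# Difference between sexes, holding place and mother's age fixed.
sex_gap = predict_weight(1, 1, 25) - predict_weight(0, 1, 25)

print(f"Predicted weight: {example:.1f}")
print(f"Male minus female, other covariates fixed: {sex_gap:.1f}")
```

    The second quantity equals $\hat{\beta}_1$: each coefficient is the expected difference in mean birth weight for that predictor with the others held fixed.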
  • 116. Multivariate methods
    Multivariate analysis is the statistical study of the dependence (covariance) between different variables.
    Multivariate analysis includes many statistical methods that are designed to allow you to include multiple variables and examine the contribution of each.
    There are different methods:
        Principal component analysis
        Factor analysis
        Cluster analysis
        Discriminant analysis
        Structural equation modeling
  • 117. Principal component analysis
    PCA is a dimension-reduction tool that can be used to reduce a large set of variables to a small set that still contains most of the information in the large set.
    PCA is a mathematical procedure that transforms a number of correlated variables into a (smaller) number of uncorrelated variables called principal components.
    An eigenvalue conceptually represents the amount of variance accounted for by a factor.
    When we have correlation (multicollinearity) between the x-variables, the data may more or less fall on a line or plane in a lower number of dimensions.
  • 118. Objectives of PCA
    Its general purpose is to summarize the information contained in a large number of variables into a smaller number of factors.
  • 119. Reading Assignment
    Form five groups.
    Assign numbers from 1 to 5.
    Randomly let each group pick one number.
    Then give the topic corresponding to the number:
        1 Principal component analysis
        2 Factor analysis
        3 Cluster analysis
        4 Discriminant analysis
        5 Structural equation modeling
  • 120. Assignment 2
    Consider the infant growth data. The outcome variable is the birth height of the newborn and the others are independent variables.
        Determine the correlation between mother's age and the height of the newborn
        Test if the correlation is significant
        Fit a linear regression for birth weight with sex, residence, family income and mother's age
        Test the assumptions
        Interpret the result
  • 121. Chapter 3
    Categorical Data Analysis
  • 122. Categorical Data
    Variables that are measured using a nominal or an ordinal scale.
    A variable with only two levels is called dichotomous (e.g. sex).
    A variable with more than two levels is called polytomous (e.g. blood group).
    Continuous (numeric) variables can be condensed to categorical variables if recoded:
        BMI (underweight, normal, and overweight)
        Income (very poor, poor, rich, very rich)
        Age (< 15 years, 16-24 years, 25-34 years, and ≥ 35 years)
  • 123. Contingency Tables
    When working with categorical variables, we often arrange the counts in a tabular format called a contingency table.
    If a contingency table involves two dichotomous variables, then it is a 2x2 (two-way) table.
                           Disease status
        Exposure       Positive   Negative   Total
        Exposed            a          b       a+b
        Not exposed        c          d       c+d
        Total             a+c        b+d       n
  • 124. Contingency Tables
    It can be generalized to an rxc contingency table (r rows and c columns).
                            Column variable
        Row variable    1       2      ...    c      Total
        1              R1C1    R1C2    ...   R1Cc     R1
        2              R2C1    R2C2    ...   R2Cc     R2
        3              R3C1    R3C2    ...   R3Cc     R3
        ...            ...     ...     ...   ...      ...
        r              RrC1    RrC2    ...   RrCc     Rr
        Total          C1      C2      ...   Cc       N
  • 125. Steps in Testing Association
    The chi-square test statistic is used to test the association between two categorical variables.
        1 Hypothesis: the hypothesis to be tested can be stated as
              H0: There is no association
              H1: There is association
        2 Compute the calculated and tabulated values of the test statistic
        3 Compare the two (tabulated and calculated values)
        4 Decision
    If the contingency table is 2x2, then
        $$\chi^2 = \frac{n(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)}$$
  • 126. If the contingency table is rxc
    The formula for the calculated value of the test statistic is different when the contingency table is rxc:
        $$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
    where $O_{ij}$ is the observed value and $E_{ij}$ is the expected value,
        $$E_{ij} = \frac{R_i \times C_j}{N}$$
    The chi-squared test measures the disparity between the observed frequencies (data from the sample) and the expected frequencies (probability distribution).
  • 127. The chi-squared test is valid if:
        No observed cell has the value 0
        The values are counts, not percentages or proportions
        No more than 20% of the expected cell counts are less than 5
  • 128. Example
    A cross-sectional survey was conducted to assess the association between head injury and wearing a helmet. A total of 793 individuals were included in the study and the data are given below:
                         Wearing helmet
        Head injury     Yes     No    Total
        Yes              17    218     235
        No              130    428     558
        Total           147    646     793
    Test the presence of an association between head injury and wearing a helmet at the 0.05 level of significance.
  • 129. Solution
    Hypothesis
        H0: There is no association between helmet and head injury
        HA: There is association between helmet and head injury
    Test statistic
        $$\chi^2 = \frac{n(ad - bc)^2}{(a + c)(b + d)(a + b)(c + d)} = \frac{793\,(17 \times 428 - 218 \times 130)^2}{235 \times 558 \times 147 \times 646} = 28.26$$
    The critical value of the test statistic is $\chi^2_{0.05} = 3.84$.
    Since 28.26 > 3.84, reject the null hypothesis.
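    The helmet calculation can be checked in two ways: with the 2x2 shortcut formula, and with the general observed-versus-expected formula, which should give the same number. The following is a minimal sketch with illustrative function names, not code from the course.

```python
def pearson_chi2(table):
    """Pearson chi-square for an r x c table of counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # E_ij = R_i * C_j / N
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def chi2_2x2(a, b, c, d):
    """Shortcut formula for a 2x2 table with cells a, b (first row), c, d (second row)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Rows: head injury yes / no; columns: wearing helmet yes / no.
helmet = [[17, 218], [130, 428]]

general = pearson_chi2(helmet)
shortcut = chi2_2x2(17, 218, 130, 428)
print(f"general formula:  {general:.2f}")
print(f"shortcut formula: {shortcut:.2f}")
print("reject H0 at 0.05:", shortcut > 3.84)
```

    Both routes give about 28.26, well above the critical value of 3.84 for one degree of freedom.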
  • 130. Example 2
    A sample of 263 students who bought lunch at a school canteen were asked whether or not they developed gastroenteritis. The responses are given below:
                          Gastroenteritis
        Ate sandwich     Yes     No    Total
        Yes              109    116     225
        No                 4     34      38
        Total            113    150     263
    Test the presence of an association between eating a sandwich and gastroenteritis at the 0.05 level of significance.
  • 131. Solution
    Step 1. H0: There is no association
    Step 2. Ha: There is association
    Step 3. Test statistic: $\chi^2 = 17.6$
    Step 4. Critical value: $\chi^2_{1}(0.05) = 3.84$
    Step 5. Decision: since 17.6 > 3.84, reject the null hypothesis and conclude that there is an association between eating a sandwich and gastroenteritis.
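    Recomputing the sandwich example by hand suggests how the reported 17.6 was obtained. In the sketch below (illustrative function name), the uncorrected 2x2 formula gives about 19.07, while the version with Yates' continuity correction gives about 17.56. It therefore seems likely, though the slides do not say so, that the reported statistic is the continuity-corrected one. The decision is the same either way.

```python
def chi2_2x2(a, b, c, d, correction=False):
    """2x2 chi-square; correction=True applies Yates' continuity correction."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if correction:
        diff = max(diff - n / 2, 0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: ate sandwich yes / no; columns: gastroenteritis yes / no.
a, b, c, d = 109, 116, 4, 34

uncorrected = chi2_2x2(a, b, c, d)
corrected = chi2_2x2(a, b, c, d, correction=True)

print(f"without correction: {uncorrected:.2f}")
print(f"with Yates' correction: {corrected:.2f}")
```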
  • 132. Probability Distribution
    Binomial Distribution
    Under the assumption of n independent, identical trials, Y has the binomial distribution with index n and parameter $\pi$.
    The probability of outcome y for Y equals
        $$P(Y = y) = \frac{n!}{y!\,(n - y)!}\,\pi^{y}(1 - \pi)^{n - y}$$
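    A small numerical illustration of the binomial probability formula follows. The values n = 10 and success probability 0.3 are hypothetical, chosen only to show the calculation; the function name is illustrative.

```python
from math import comb

def binomial_pmf(y, n, pi):
    """P(Y = y) for a binomial with index n and success probability pi."""
    return comb(n, y) * pi ** y * (1 - pi) ** (n - y)

n, pi = 10, 0.3   # hypothetical values, for illustration only
probs = [binomial_pmf(y, n, pi) for y in range(n + 1)]

for y, prob in enumerate(probs):
    print(f"P(Y = {y:2d}) = {prob:.4f}")
print(f"total probability = {sum(probs):.4f}")
```

    The probabilities over y = 0, ..., n sum to 1, as they must for a probability distribution.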
  • 133. Maximum Likelihood Estimation
    In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given the data.
    MLE begins with writing a mathematical expression known as the likelihood function of the sample data.
    The values of the parameters that maximize the sample likelihood are known as the maximum likelihood estimates.
  • 134. Suppose we have a random sample $X_1, X_2, \ldots, X_n$ where:
        $X_i = 0$ if a randomly selected subject does not experience the event, and
        $X_i = 1$ if a randomly selected subject does experience the event.
    Assuming that the $X_i$ are independent Bernoulli random variables with unknown parameter p, find the maximum likelihood estimator of p, the proportion of subjects who experience the event.
  • 135. Solution
    If the $X_i$ are independent Bernoulli random variables with unknown parameter p, then the probability mass function of each $X_i$ is:
        $$f(x_i; p) = p^{x_i}(1 - p)^{1 - x_i}, \quad x_i = 0 \text{ or } 1,\; 0 < p < 1$$
    Therefore, the likelihood function L(p) is, by definition:
        $$L(p) = \prod_{i=1}^{n} f(x_i; p) = p^{x_1}(1 - p)^{1 - x_1} \times p^{x_2}(1 - p)^{1 - x_2} \times \cdots \times p^{x_n}(1 - p)^{1 - x_n}$$
    Simplifying, by summing up the exponents, we get:
        $$L(p) = p^{\sum x_i}(1 - p)^{n - \sum x_i}$$
    Taking the log, differentiating with respect to p and setting the derivative to zero gives
        $$\hat{p} = \frac{\sum x_i}{n}$$
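    The closed-form result can be checked numerically: evaluate the Bernoulli log-likelihood over a grid of candidate values of p and confirm that the maximizer is the sample proportion. The data vector below is hypothetical (4 events among 10 subjects), and the grid search is just a simple illustration, not how Stata maximizes likelihoods.

```python
import math

def log_likelihood(p, xs):
    """Bernoulli log-likelihood: sum(x) * log(p) + (n - sum(x)) * log(1 - p)."""
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)

# Hypothetical sample: 1 = event, 0 = no event.
xs = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

grid = [i / 1000 for i in range(1, 1000)]   # candidate p values in (0, 1)
p_hat_grid = max(grid, key=lambda p: log_likelihood(p, xs))
p_hat_closed = sum(xs) / len(xs)            # closed-form MLE

print(f"grid maximizer:    {p_hat_grid:.3f}")
print(f"sample proportion: {p_hat_closed:.3f}")
```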
  • 136. Inference for proportion
    The point estimate: $E(p) = \pi$
    Confidence interval for the proportion:
        $$\left(p - z_{\alpha/2}\,SE(p),\;\; p + z_{\alpha/2}\,SE(p)\right)$$
    where
        $$SE(p) = \sqrt{\frac{p(1 - p)}{n}}$$
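    As a worked instance of the interval, the sketch below uses the low-birth-weight counts from the Jimma infant data tabulated in these slides (666 of 7,873 infants). The function name is illustrative, and the interval is the simple normal-approximation interval defined above.

```python
import math
from statistics import NormalDist

def proportion_ci(events, n, level=0.95):
    """Point estimate and normal-approximation CI for a proportion."""
    p = events / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # z_{alpha/2}
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

# Low birth weight in the Jimma infant data: 666 of 7,873 infants.
p, lower, upper = proportion_ci(666, 7873)
print(f"p = {p:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

    With these counts the estimated proportion is about 8.5%, with a 95% interval of roughly 7.8% to 9.1%.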
  • 137. Example: Logistic regression
    Let $p_j$ be the probability that a subject in stratum j has a success.
    Then the probability of observing $y_j$ events in stratum j is
        $$P(Y_j = y_j) = \binom{n_j}{y_j}\, p_j^{y_j}(1 - p_j)^{n_j - y_j}$$
    We know $p_j$ is a function of covariates. Assume we are interested in a single covariate, $x_j$, such that
        $$p_j = \frac{e^{\alpha_0 + \beta x_j}}{1 + e^{\alpha_0 + \beta x_j}}$$
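  • 138. Example: Logistic regression
    Then, the likelihood function can be written as
        $$L(y) = \prod_{j=1}^{J} \binom{n_j}{y_j}\, p_j^{y_j}(1 - p_j)^{n_j - y_j}$$
        $$= \prod_{j=1}^{J} \left(\frac{e^{\alpha_0 + \beta x_j}}{1 + e^{\alpha_0 + \beta x_j}}\right)^{y_j} \left(\frac{1}{1 + e^{\alpha_0 + \beta x_j}}\right)^{n_j - y_j} \binom{n_j}{y_j}$$
        $$= \prod_{j=1}^{J} \frac{\left(e^{\alpha_0 + \beta x_j}\right)^{y_j}}{\left(1 + e^{\alpha_0 + \beta x_j}\right)^{n_j}} \binom{n_j}{y_j}$$
        $$= \frac{\prod_{j=1}^{J} \left(e^{\alpha_0 + \beta x_j}\right)^{y_j}}{\prod_{j=1}^{J} \left(1 + e^{\alpha_0 + \beta x_j}\right)^{n_j}} \times \prod_{j=1}^{J} \binom{n_j}{y_j}$$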
  • 139. Contingency Tables 2
    Partial tables: a two-way cross-sectional slice of the three-way table, where X and Y are cross-classified at separate levels of the control variable Z. It shows the effect of X on Y while controlling for Z.
    Marginal table: a two-way contingency table obtained by combining the partial tables.
    The marginal table ignores Z rather than controlling for it.
  • 140. Example
    Prevalence study: a total of 715 births, cross-classified by clinic (Z), prenatal care (X) and outcome (Y).
        Clinic   Prenatal care   Died   Lived
        1        less               3     176
                 more               4     293
        2        less              17     197
                 more               2      23
    Construct the corresponding partial tables and the marginal table.
  • 141. Confounding
    Confounding occurs when a factor is associated with both the exposure and the outcome but does not lie on the causative pathway.
    [Figure 11: Confounding variable]
    When the partial and marginal associations are different, there is said to be CONFOUNDING.
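    The clinic and prenatal-care counts given in the example on the previous slide illustrate this difference between partial and marginal association. The sketch below (illustrative names; the odds ratio is used here as the measure of association, which is a choice made for this example) computes the odds ratio of death for less versus more prenatal care within each clinic and then in the marginal table that ignores clinic.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio ad/(bc) for a 2x2 table with rows (a, b) and (c, d)."""
    return (a * d) / (b * c)

# (died, lived) counts by prenatal care level within each clinic.
clinics = {
    1: {"less": (3, 176), "more": (4, 293)},
    2: {"less": (17, 197), "more": (2, 23)},
}

# Partial (clinic-specific) odds ratios of death, less vs more care.
partial = {z: odds_ratio(*t["less"], *t["more"]) for z, t in clinics.items()}

# Marginal table: add the two clinics together, ignoring clinic.
marg_less = tuple(sum(clinics[z]["less"][i] for z in clinics) for i in range(2))
marg_more = tuple(sum(clinics[z]["more"][i] for z in clinics) for i in range(2))
marginal = odds_ratio(*marg_less, *marg_more)

for z, value in partial.items():
    print(f"clinic {z}: OR = {value:.2f}")
print(f"marginal: less = {marg_less}, more = {marg_more}, OR = {marginal:.2f}")
```

    Within each clinic the odds ratio is close to 1 (about 1.25 and 0.99), while the marginal odds ratio is about 2.82, so ignoring clinic suggests a much stronger association than either partial table shows.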
  • 142. Interaction Effect
    An interaction effect happens when one explanatory variable interacts with another explanatory variable in its effect on a response variable.
    This is opposed to the “main effect”, which is the action of a single independent variable on the dependent variable.
    Two independent variables interact if the effect of one of the variables differs depending on the level of the other variable.
  • 143. Joint, Marginal, and Conditional Probabilities
    Probabilities for contingency tables can be of three types: joint, marginal, or conditional.
    Consider the variables X and Y from randomly chosen subjects.
    Let $\pi_{ij} = P(X = i, Y = j)$ denote the probability that (X, Y) falls in the cell in row i and column j.
    The probabilities $\pi_{ij}$ form the joint distribution of X and Y.
  • 144. Marginal Probabilities
    The marginal distributions are the row and column totals of the joint probabilities.
        Variable 1     Diseased   Non-diseased   Total
        Exposed          n11         n12          n1+
        Not exposed      n21         n22          n2+
        Total            n+1         n+2          n++
    $\pi_{1+} = \pi_{11} + \pi_{12}$
    It can be calculated as $p_{ij} = n_{ij}/n$.
  • 145. Conditional Probabilities
    It is informative to construct a separate probability distribution for Y and X at each level of Z.
    Consider the variables sex, smoking and lung cancer:
        Sex      Smoking   Diseased   Non-diseased   Total
        Male     Yes         n11         n12          n1+
                 No          n21         n22          n2+
        Female   Yes         n31         n32          n3+
                 No          n41         n42          n4+
        Total                n+1         n+2          n++
    Simpson’s Paradox: the result that a marginal association can have a different direction from the conditional associations.
  • 146. Application for screening test
    Sensitivity: the proportion of diseased individuals detected as positive by the test.
    Specificity: the proportion of healthy individuals detected as negative by the test.
    Positive predictivity: given that the test result is positive, what is the probability that, in fact, the disease is present?
    Negative predictivity: given that the test result is negative, what is the probability that, in fact, the disease is not present?
  • 147. Example
    Cytological test to screen women for cervical cancer
        Diseases    Negative   Positive   Total
        Negative      23362       362     23724
        Positive        225       154       379
        Total         23587       516     24103
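    The four screening measures can be computed from this table once its orientation is fixed. The extracted table does not make the orientation fully clear, so the sketch below ASSUMES that rows give the true disease status and columns give the test result (154 true positives, 225 false negatives, 362 false positives, 23,362 true negatives). If the table is oriented the other way, sensitivity and positive predictivity swap, as do specificity and negative predictivity.

```python
# ASSUMPTION: rows = true disease status, columns = test result.
tn, fp = 23362, 362   # disease negative: test negative, test positive
fn, tp = 225, 154     # disease positive: test negative, test positive

sensitivity = tp / (tp + fn)   # diseased detected as positive
specificity = tn / (tn + fp)   # healthy detected as negative
ppv = tp / (tp + fp)           # P(disease present | test positive)
npv = tn / (tn + fn)           # P(disease absent | test negative)

print(f"sensitivity            = {sensitivity:.3f}")
print(f"specificity            = {specificity:.3f}")
print(f"positive predictivity  = {ppv:.3f}")
print(f"negative predictivity  = {npv:.3f}")
```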
  • 149. Analysis of Categorical Data
    Descriptive:
        Summary statistics including percentages, prevalence
        Graphical techniques: bar chart, pie chart
    Inferential statistics:
        Risk ratios
        Relative risk
        Odds ratios
  • 150. Jimma Infant Data...
              birth |
             weight |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |      7,207       91.54       91.54
                  1 |        666        8.46      100.00
        ------------+-----------------------------------
              Total |      7,873      100.00
    STATA CODE:
        tab bwt
  • 151. Cross tabulation of bwt and gender (male = 1). The option “row” can be used to obtain the row percentages.
             birth |          sex
            weight |         0          1 |     Total
        -----------+----------------------+----------
                 0 |     3,466      3,741 |     7,207
                   |     48.09      51.91 |    100.00
        -----------+----------------------+----------
                 1 |       395        271 |       666
                   |     59.31      40.69 |    100.00
        -----------+----------------------+----------
             Total |     3,861      4,012 |     7,873
                   |     49.04      50.96 |    100.00
  • 152. STATA CODE:
        tab bwt sex, row
  • 153. Consider the following contingency table:
                              Diseased
                           Yes       No   |  Total
        ----------------------------------+----------
        Exposed    Yes |    a         b   |  a+b
                   No  |    c         d   |  c+d
        ----------------------------------+----------
                 Total |   a+c       b+d  |  a+b+c+d
  • 154. Odds
    The odds of being diseased among the exposed is $\dfrac{a}{b}$.
    The odds of being diseased among the unexposed is $\dfrac{c}{d}$.
  • 155. Odds Ratio
    The ratio of the odds of being diseased among the exposed to that among the unexposed is:
        $$\frac{a}{b} \div \frac{c}{d} = \frac{ad}{bc}.$$
    The odds ratio compares the odds of being diseased in the exposed versus the unexposed.
        $$\mathrm{Var}(\ln \widehat{OR}) \approx \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}.$$
    Confidence interval for ln(OR):
        $$\ln(\widehat{OR}) \pm Z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\ln \widehat{OR})}.$$
    Confidence interval for OR:
        $$\exp\!\left(\ln(\widehat{OR}) \pm Z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\ln \widehat{OR})}\right).$$
  • 156. Jimma Infant Data: Outcome: bwt, ‘low bwt’ versus ‘normal’. Exposure: gender, ‘female’ versus ‘male’.
                     |                        |              Proportion
                     |   Exposed   Unexposed  |      Total      Exposed
        -------------+------------------------+------------------------
               Cases |       395         271  |        666       0.5931
            Controls |      3466        3741  |       7207       0.4809
        -------------+------------------------+------------------------
               Total |      3861        4012  |       7873       0.4904
                     |                        |
                     |      Point estimate    |    [95% Conf. Interval]
                     |------------------------+------------------------
          Odds ratio |         1.573211       |    1.334496    1.854786
                     +-------------------------------------------------
  • 157. STATA CODE:
        cci 395 271 3466 3741
  • 158. Measures of Association
    How do we determine whether a certain disease is associated with a certain exposure?
    If an association between disease and exposure exists, how strong is it?
    How do we know whether the association is just by chance or due to other factors?
    Another commonly used measure of association is the odds ratio.
    The odds in favor of an event happening (such as getting a disease) is the probability of the event happening relative to the probability of the event not happening.
    In case-control studies, we know the proportion of cases who were exposed and the proportion of controls who were exposed from the 2x2 table.
  • 159. Odds Ratio
    Consider the 2x2 contingency table:
                           Disease status
        Exposure       Positive   Negative   Total
        Exposed            a          b       a+b
        Not exposed        c          d       c+d
        Total             a+c        b+d       n
        $$OR = \frac{ad}{bc}$$
    For example, suppose the probability (p) of a certain team winning a football game is 60%. Then the probability of not winning (1 - p) is 40%, so the odds of the team winning are p/(1 - p) = 0.6/0.4 = 1.5 : 1.
  • 160. Odds Ratio
    The odds ratio (OR) is a good estimate of the relative risk when the disease is rare.
        If OR = 1: no association between disease and exposure
        If OR > 1: positive association between disease and exposure
        If OR < 1: negative association between disease and exposure (exposure may be protective against the disease)
  • 161. Example 3
    A case-control study was conducted on 200 cases of heart disease and 400 controls to determine whether heart disease is associated with smoking status. The results are shown in the table below.
                          CHD
        Smoking     Positive   Negative   Total
        Yes            112        176      288
        No              88        224      312
        Total          200        400      600
        $$OR = \frac{112 \times 224}{176 \times 88} = 1.62$$
  • 162. Confidence interval for OR
    Since the sampling distribution of the OR is often skewed, we work with logarithms to obtain confidence intervals.
        $$SE(\ln OR) = \sqrt{1/a + 1/b + 1/c + 1/d}$$
    The 95% CI for ln OR is defined as
        $$\ln OR \pm Z_{\alpha/2} \times SE(\ln OR)$$
    The 95% CI for OR is defined as
        $$\exp\left[\ln OR \pm Z_{\alpha/2} \times SE(\ln OR)\right]$$
    The 95% CI for OR = 1.62 is (1.13; 2.31).
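    Applying the log-based formula to the smoking and CHD table from Example 3 can be sketched as follows (illustrative function name). Note that this calculation gives an interval of about (1.15, 2.28), close to but not identical with the (1.13; 2.31) quoted on the slide; the quoted interval may have been produced by a different method (for example an exact interval), which the slides do not specify. The conclusion is the same, since both intervals exclude 1.

```python
import math
from statistics import NormalDist

def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio with a log-based (Woolf) confidence interval."""
    or_hat = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = math.exp(math.log(or_hat) - z * se)
    upper = math.exp(math.log(or_hat) + z * se)
    return or_hat, se, lower, upper

# Smokers: 112 cases, 176 controls; non-smokers: 88 cases, 224 controls.
or_hat, se, lower, upper = odds_ratio_ci(112, 176, 88, 224)
print(f"OR = {or_hat:.2f}, SE(ln OR) = {se:.4f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```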
  • 163. Interpretation
    The odds of smoking were 1.62 times higher among cases than among controls (equivalently, the odds of CHD were 1.62 times higher among smokers than among non-smokers).
    We are 95% confident that the true odds ratio is between 1.13 and 2.31.
    The 95% CI does not include 1; we therefore conclude that smoking was significantly more common among cases than controls.
  • 164. Introduction to Generalized Linear Models (GLM)
    GLMs are often employed for modeling non-Gaussian data.
    They extend ordinary regression models to non-Gaussian data.
    They are statistical models that relate outcome variables such as binary outcomes, counts, rates, etc., to a linear combination of predictor variables.
    They use different link functions to connect the outcome variable to the predictor variables.
  • 165. Components of GLM
    Three components specify a GLM:
        The random component identifies a vector of observations of Y and its probability distribution;
        The systematic component is a specification for the predictor variables;
        The link function specifies the function of E(Y) that the model equates to the systematic component.
  • 166. Modeling the outcome variable
    The outcome variable is binary.
    The objective is to study the effect of explanatory variables.
    The outcome variable has two possible categories (1 = success; 0 = failure).
    The events are independent from subject to subject.
    Explanatory variables can be categorical and/or continuous.
    We need a way to estimate the probability p of the success event of the outcome variable.
  • 167. Measuring the Probability of an outcome
    The probability of the outcome is measured by the odds of occurrence of an event.
    If p is the probability of an event, then (1 - p) is the probability of it not occurring.
    Odds of success = $\dfrac{p}{1 - p}$.
    The effect of an explanatory variable, say x, on the probability of the disease can be put on the odds as:
        $$\text{Odds} = \frac{p}{1 - p} = \exp(\alpha + \beta x)$$
  • 168. Logistic Regression
    Taking the natural logarithm on both sides:
        $$\ln\left(\frac{p}{1 - p}\right) = \ln[\exp(\alpha + \beta_1 x_1)] \;\Rightarrow\; \text{logit}(p) = \alpha + \beta_1 x_1.$$
    Consider an outcome as ‘diseased’ or ‘not diseased’.
    $\alpha$ represents the overall disease risk.
    $\beta_1$ represents the fraction by which the disease risk is altered by a unit change in $x_1$.
  • 169. Multiple Logistic Regression Model Cont’d
    The joint effect of all explanatory variables, say $x_1, x_2, \ldots, x_p$, put together on the odds is:
        $$\text{Odds} = \frac{p}{1 - p} = \exp(\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p).$$
    Taking the natural logarithm on both sides:
        $$\ln\left(\frac{p}{1 - p}\right) = \ln[\exp(\alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p)] \;\Rightarrow\; \text{logit}(p) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p.$$
    Consider an outcome as ‘diseased’ or ‘not diseased’.
    $\alpha$ represents the overall disease risk.
    $\beta_1$ represents the fraction by which the disease risk is altered by a unit change in $x_1$.
  • 170. Multiple Logistic Regression Model Cont’d. $\beta_2$ represents the amount by which the log odds of disease change for a unit change in $x_2$, and so on. What changes additively is the log odds; the odds themselves are multiplied by $\exp(\beta)$. With only one predictor variable, the model reduces to $\text{logit}(p) = \alpha + \beta x$.
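The relationship between probability, odds and log odds described above can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the values of `alpha` and `beta` are hypothetical and chosen only for illustration.

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Probability corresponding to a log-odds value eta."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients (assumed, not estimated from any data).
alpha, beta = -2.0, 0.5

p0 = inv_logit(alpha)          # probability when x = 0
p1 = inv_logit(alpha + beta)   # probability when x = 1

odds0 = p0 / (1 - p0)
odds1 = p1 / (1 - p1)
odds_ratio = odds1 / odds0     # equals exp(beta)

print(f"p(x=0) = {p0:.4f}, p(x=1) = {p1:.4f}")
print(f"odds ratio = {odds_ratio:.4f}, exp(beta) = {math.exp(beta):.4f}")
```

A one-unit change in x adds `beta` to the log odds and multiplies the odds by `exp(beta)`; the change in the probability itself depends on the starting value.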
  • 171. Jimma Infant Data. Outcome: birth weight status (bwt, 1 = ‘low’). Factor: sex (1 = ‘male’).
    Logistic regression                      Number of obs =   7873
                                             LR chi2(1)    =  30.82
                                             Prob > chi2   = 0.0000
    Log likelihood = -2266.5465              Pseudo R2     = 0.0068
    ---------------------------------------------------------------------------
             bwt |      Coef.  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
           1.sex |  -.4531187   .0823256   -5.50   0.000   -.6144739  -.2917634
           _cons |  -2.171871   .0531052  -40.90   0.000   -2.275955  -2.067786
    ---------------------------------------------------------------------------
  • 172. The fitted logistic regression model is $\text{logit}(p) = \hat{\alpha} + \hat{\beta}\,\text{sex}$. Substituting the estimated values yields $\text{logit}(p) = -2.172 - 0.453\,\text{sex}$. STATA CODE: logit bwt i.sex. The prefix ‘i.’ indicates that sex is a categorical variable. The option ‘or’ can be added to obtain the odds ratio estimates.
  • 173.
    Logistic regression                      Number of obs =   7873
                                             LR chi2(1)    =  30.82
                                             Prob > chi2   = 0.0000
    Log likelihood = -2266.5465              Pseudo R2     = 0.0068
    ---------------------------------------------------------------------------
             bwt | Odds Ratio  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
           1.sex |   .6356427   .0523297   -5.50   0.000    .5409254   .7469452
           _cons |   .1139642   .0060521  -40.90   0.000    .1026988   .1264654
    ---------------------------------------------------------------------------
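The odds-ratio table is the coefficient table passed through the exponential function, including the confidence limits. As a rough check, the short Python sketch below (not from the slides) exponentiates the coefficients reported for the model of bwt on sex; the variable names are illustrative.

```python
import math

# Estimates copied from the coefficient table (logit bwt i.sex).
b_sex, lo_sex, hi_sex = -0.4531187, -0.6144739, -0.2917634
b_cons = -2.171871

or_sex = math.exp(b_sex)                      # odds ratio, males vs females
or_ci = (math.exp(lo_sex), math.exp(hi_sex))  # CI limits transform the same way
baseline_odds = math.exp(b_cons)              # odds of low birth weight for females

print(f"OR (male vs female): {or_sex:.4f}")
print(f"95% CI: {or_ci[0]:.4f} to {or_ci[1]:.4f}")
print(f"Baseline odds (_cons): {baseline_odds:.4f}")
```

Because the whole interval lies below 1, the odds of low birth weight are lower for males than for females in these data.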
  • 174. STATA CODE: recode sex (1=0) (0=1), generate(sex2). The estimate stands for males, i.e., sex = 1, and females are considered the reference group. Interpretation of $\exp(\hat{\beta})$? Interpretation of $\exp(\hat{\alpha})$? Conclusion? Why?
  • 175. Take males as the reference group, so that the estimate now stands for females. This requires recoding sex to, say, sex2 and fitting the model again using the following code:
    recode sex (1=0) (0=1), generate(sex2)
    logit bwt i.sex2, or
  • 176.
    Logistic regression                      Number of obs =   7873
                                             LR chi2(1)    =  30.82
                                             Prob > chi2   = 0.0000
    Log likelihood = -2266.5465              Pseudo R2     = 0.0068
    ---------------------------------------------------------------------------
             bwt | Odds Ratio  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
          1.sex2 |   1.573211   .1295156    5.50   0.000    1.338786   1.848684
           _cons |   .0724405    .004557  -41.73   0.000    .0640375   .0819461
    ---------------------------------------------------------------------------
    Compare $\exp(\hat{\beta})$ and $\exp(\hat{\alpha})$ in the above table with the previous ones. Interpretation and conclusion?
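Switching the reference group does not change the fit; it only re-expresses the same two odds. The Python sketch below (an illustration, not part of the slides) derives the sex2 results from the coefficients of the original model with females as reference.

```python
import math

# Coefficients from the first model (reference group: females).
b_sex, b_cons = -0.4531187, -2.171871

or_male_vs_female = math.exp(b_sex)
# With males as reference, the odds ratio is simply inverted.
or_female_vs_male = 1 / or_male_vs_female
# The new constant is the odds for males: exp(alpha + beta).
odds_male = math.exp(b_cons + b_sex)

print(f"OR female vs male: {or_female_vs_male:.4f}")
print(f"Odds of low birth weight, males: {odds_male:.4f}")
```

These reproduce, up to rounding, the odds ratio for 1.sex2 and the _cons entry in the recoded model.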
  • 177. Multiple Logistic Regression Model. Logistic regression with more than one predictor variable. It is also referred to as multivariable (or, loosely, multivariate) logistic regression. It estimates the effect of each independent variable while controlling for the others. Jimma Infant Data. Outcome: birth weight status (bwt, 1 = ‘low’). Factor1: sex2 (1 = ‘female’). Factor2: place of residence (1 = ‘urban’, 2 = ‘semi-urban’, 3 = ‘rural’).
  • 178.
    Logistic regression                      Number of obs =   7873
                                             LR chi2(3)    = 258.69
                                             Prob > chi2   = 0.0000
    Log likelihood = -2152.613               Pseudo R2     = 0.0567
    ---------------------------------------------------------------------------
             bwt | Odds Ratio  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
          1.sex2 |    1.54585   .1289271    5.22   0.000     1.31273   1.820368
                 |
           place |
               2 |   1.935937   .3516486    3.64   0.000    1.356053   2.763794
               3 |   5.436301    .835046   11.02   0.000    4.023039   7.346031
                 |
           _cons |   .0210955    .003245  -25.09   0.000    .0156048   .0285184
    ---------------------------------------------------------------------------
    $\text{logit}(p) = \hat{\alpha} + \hat{\beta}_1\,\text{sex2} + \hat{\beta}_2\,\text{SU} + \hat{\beta}_3\,\text{R}$. SU and R refer to semi-urban and rural, respectively.
  • 179. STATA CODE: logit bwt i.sex2 i.place, or. Interpretation of $\exp(\hat{\beta}_1) = 1.55$? Interpretation of $\exp(\hat{\beta}_2) = 1.94$? Interpretation of $\exp(\hat{\beta}_3) = 5.44$? Conclusion?
  • 180. Jimma Infant Data. Add a third variable. Outcome: birth weight status (bwt, 1 = ‘low’). Factor1: sex2 (1 = ‘female’). Factor2: place of residence (1 = ‘urban’, 2 = ‘semi-urban’, 3 = ‘rural’). Factor3: maternal age, ‘momage’, measured in years (continuous).
  • 181.
    Logistic regression                      Number of obs =   7873
                                             LR chi2(4)    = 271.06
                                             Prob > chi2   = 0.0000
    Log likelihood = -2146.4273              Pseudo R2     = 0.0594
    ---------------------------------------------------------------------------
             bwt |      Coef.  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
          1.sex2 |   .4378871   .0834884    5.24   0.000    .2742529   .6015213
                 |
           place |
               2 |    .661642      .1817    3.64   0.000    .3055165   1.017767
               3 |   1.725587   .1539474   11.21   0.000    1.423855   2.027318
                 |
          momage |  -.0231749   .0066597   -3.48   0.001   -.0362277  -.0101222
           _cons |  -3.274631   .2257413  -14.51   0.000   -3.717076  -2.832186
    ---------------------------------------------------------------------------
    $\text{logit}(p) = \hat{\alpha} + \hat{\beta}_1\,\text{sex2} + \hat{\beta}_2\,\text{SU} + \hat{\beta}_3\,\text{R} + \hat{\beta}_4\,\text{A}$.
    $\text{logit}(p) = -3.275 + 0.438\,\text{sex2} + 0.662\,\text{SU} + 1.726\,\text{R} - 0.023\,\text{A}$.
    SU, R and A refer to semi-urban, rural and maternal age, respectively.
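The fitted equation can be turned into a predicted probability by plugging in covariate values and inverting the logit. The Python sketch below is an illustration only: the coefficients are those reported in the table, while the two covariate profiles (a maternal age of 25 years, for instance) are hypothetical choices, and the function name is not from the slides.

```python
import math

# Coefficients from the model with sex2, place and momage.
coef = {
    "_cons": -3.274631,
    "sex2": 0.4378871,        # 1 = female
    "semi_urban": 0.661642,   # place == 2
    "rural": 1.725587,        # place == 3
    "momage": -0.0231749,     # per year of maternal age
}

def predicted_probability(sex2, semi_urban, rural, momage):
    """Predicted probability of low birth weight for one covariate profile."""
    eta = (coef["_cons"] + coef["sex2"] * sex2
           + coef["semi_urban"] * semi_urban
           + coef["rural"] * rural
           + coef["momage"] * momage)
    return 1 / (1 + math.exp(-eta))

# Hypothetical profiles: rural female infant vs urban male infant, mother aged 25.
p_rural_girl = predicted_probability(sex2=1, semi_urban=0, rural=1, momage=25)
p_urban_boy = predicted_probability(sex2=0, semi_urban=0, rural=0, momage=25)

print(f"rural female infant: {p_rural_girl:.3f}")
print(f"urban male infant:   {p_urban_boy:.3f}")
```

In Stata, the corresponding probabilities for the observed data would normally be obtained with `predict` after `logit`.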
  • 182. Odds ratio estimates:
    Logistic regression                      Number of obs =   7873
                                             LR chi2(4)    = 271.06
                                             Prob > chi2   = 0.0000
    Log likelihood = -2146.4273              Pseudo R2     = 0.0594
    ---------------------------------------------------------------------------
             bwt | Odds Ratio  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
          1.sex2 |    1.54943   .1293594    5.24   0.000    1.315547   1.824893
                 |
           place |
               2 |   1.937972   .3521295    3.64   0.000    1.357326    2.76701
               3 |   5.615815   .8645402   11.21   0.000    4.153101   7.593693
                 |
          momage |   .9770915   .0065071   -3.48   0.001    .9644206   .9899289
           _cons |   .0378308     .00854  -14.51   0.000    .0243049    .058884
    ---------------------------------------------------------------------------
  • 183. Interpretation of $\exp(\hat{\beta}_4)$? Interpretation of, say, $\exp(\hat{\beta}_1)$ with and without maternal age in the model? Conclusion?
  • 184. Model Comparison. Compare the model fit statistics, say the $-2$ log-likelihood, of the three models considered so far: (1) simple logistic regression with only ‘sex’ as a factor; (2) multiple logistic regression with ‘sex’ and ‘place’ as two factor variables; and (3) multiple logistic regression with ‘sex’, ‘place’ and ‘maternal age’. Which one fits the data better? Why?
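One way to carry out this comparison is a likelihood-ratio test between successive nested models, all of which were fitted to the same 7873 observations. The Python sketch below is an illustration, not the slides' own method; it uses the log likelihoods reported in the three outputs and closed-form chi-square tail probabilities that hold for 1 and 2 degrees of freedom only.

```python
import math

# Log likelihoods reported in the three logistic regression outputs.
ll = {
    "sex": -2266.5465,
    "sex+place": -2152.613,
    "sex+place+momage": -2146.4273,
}

# -2 log-likelihood: smaller values indicate a better fit.
neg2ll = {name: -2 * value for name, value in ll.items()}

# Likelihood-ratio statistics for adding place (2 df) and then momage (1 df).
lr_place = 2 * (ll["sex+place"] - ll["sex"])
lr_momage = 2 * (ll["sex+place+momage"] - ll["sex+place"])

# Chi-square upper-tail probabilities (closed forms valid for df = 2 and df = 1).
p_place = math.exp(-lr_place / 2)                 # df = 2
p_momage = math.erfc(math.sqrt(lr_momage / 2))    # df = 1

for name, value in neg2ll.items():
    print(f"{name:>18}: -2LL = {value:.3f}")
print(f"adding place:  LR = {lr_place:.2f}, p = {p_place:.3g}")
print(f"adding momage: LR = {lr_momage:.2f}, p = {p_momage:.3g}")
```

In Stata the same comparison is usually done with `estimates store` followed by `lrtest`.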
  • 185. Ordinal Logistic Regression. Many categorical dependent variables are ordered: level of agreement (strongly disagree, disagree, agree, strongly agree), social class, satisfaction, pain score. The numerical values assigned to the categories often do not accurately reflect the true distance between them, so linear regression may not be appropriate.
  • 186. Strategies to deal with ordered categorical variables. (1) Use linear regression anyway: commonly done, but it can give incorrect results; possibly check robustness by varying the coding of the intervals between outcomes. (2) Collapse the variable to a dichotomy and use a binary logistic regression model: works fine, but throws away useful information. (3) If you are not confident about the ordering, use multinomial logistic regression. (4) Use ordered logit / ordinal probit.
  • 187. Proportional Odds Assumption. The fact that you can calculate odds ratios highlights a key assumption of ordered logit: the proportional odds assumption, also called the parallel regression assumption. The model assumes that variable effects on the odds of lower vs. higher outcomes are consistent: the effect on the odds of “too little” vs. “about right” is the same as for “about right” vs. “too much”, controlling for all other variables in the model. If this assumption doesn’t seem reasonable, consider multinomial logit. Interpretation strategies are similar to logit.
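The assumption can be written down explicitly. The block below is a sketch in one common parameterization of the ordered logit model; the cut-point symbols $\kappa_j$ and the sign convention are assumptions, since the slides do not give the formula.

```latex
% Ordered outcome Y with categories 1, ..., J and a single covariate x.
% One cut point kappa_j per split, but a single slope beta shared by all splits.
\[
  \operatorname{logit} P(Y \le j \mid x)
    = \ln\frac{P(Y \le j \mid x)}{P(Y > j \mid x)}
    = \kappa_j - \beta x ,
  \qquad j = 1, \ldots, J-1 .
\]
% Cumulative odds ratio for a one-unit increase in x:
\[
  \frac{\operatorname{odds}(Y \le j \mid x+1)}{\operatorname{odds}(Y \le j \mid x)}
    = \exp(-\beta)
  \quad \text{for every } j .
\]
```

Because the ratio does not depend on $j$, the same odds ratio applies to every lower-vs-higher split of the outcome, which is exactly what the proportional odds (parallel regression) assumption states.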
  • 188. Test for Proportional Odds. The user-written command omodel can be used after downloading (type search omodel). The brant command performs a Brant test and can be obtained by typing search spost. The former provides an overall test of proportional odds for all independent variables, and the latter provides a specific test for each covariate. A non-significant p-value indicates no evidence against the assumption.
    omodel logit dependent independent1 independent2, ...
    brant, detail
  • 189. Multinomial Logistic Regression. What if the dependent variable has more than two outcomes (a polytomous outcome)? Contrast the outcomes with a common “reference point”. This is similar to conducting a series of 2-outcome logit models comparing pairs of categories. The “reference category” is like the reference group when using dummy variables in regression; it serves as the contrast point for all analyses. Imagine a dependent variable with $M$ categories: conducting binomial logit models for all possible combinations of outcomes will produce results fairly similar to a multinomial output.
  • 190. Multinomial logistic regression. Choose one category as the “reference”. The output will include two tables if the variable has three categories. The choice of “reference” category drives the interpretation of multinomial logit results, so choose the contrast(s) that make the most sense, and be aware of the reference category when interpreting results.
  • 191. Example: Mode of travel. There are different modes of transport for vacation travel (car, bus, and plane). We want to model the probability of the choice of transport mode. The independent variables are income and family size. If the outcome variable has $k$ categories, then we expect $k - 1$ models (sets of estimates). In this example, the outcome variable (mode of transport) has 3 categories, so the mlogit command in Stata will give two estimates for each covariate.
  • 192. Parameter estimation: mlogit mode income familysize, base(3) rrr
    Multinomial logistic regression          Number of obs =    152
                                             LR chi2(4)    =  42.63
                                             Prob > chi2   = 0.0000
    Log likelihood = -138.68742              Pseudo R2     = 0.1332
    ---------------------------------------------------------------------------
            mode | Odds ratio  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
    car          |
          income |   .9444061   .0118194   -4.57   0.000    .9215224   .9678581
      familysize |   .8204706   .1632009   -0.99   0.320    .5555836   1.211648
    -------------+-------------------------------------------------------------
    Bus          |
          income |   .9743237   .0136232   -1.86   0.063    .9479852   1.001394
      familysize |   .4185063   .1370806   -2.66   0.008    .2202385   .7952627
    ---------------------------------------------------------------------------
    (mode==plane is the base outcome)
  • 193. mlogit mode income familysize, base(3) rrr
  • 194. Interpretation. Holding family size constant, a unit increase in income multiplies the odds of traveling by car rather than by plane by 0.944 (a decrease of 5.6%). Holding income constant, a one-person increase in family size multiplies the odds of traveling by bus rather than by plane by 0.419 (a decrease of 58.1%).
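The percentage changes quoted in such interpretations follow directly from the reported ratios as (ratio − 1) × 100. The Python sketch below (an illustration, not from the slides) recomputes them from the mlogit output and also shows how the effect compounds over a larger change in the covariate; the helper name and the 10-unit step are illustrative choices.

```python
# Ratios copied from the mlogit output (base outcome: plane).
rrr_income_car = 0.9444061       # income, car vs plane
rrr_familysize_bus = 0.4185063   # family size, bus vs plane

def percent_change(ratio):
    """Percentage change in the relative odds implied by a ratio."""
    return (ratio - 1) * 100

pct_income_car = percent_change(rrr_income_car)
pct_familysize_bus = percent_change(rrr_familysize_bus)

# Effects are multiplicative: a 10-unit income increase raises the ratio to the 10th power.
rrr_income_car_10 = rrr_income_car ** 10

print(f"income, car vs plane:      {pct_income_car:.1f}% per unit")
print(f"family size, bus vs plane: {pct_familysize_bus:.1f}% per person")
print(f"income, car vs plane, 10 units: ratio = {rrr_income_car_10:.3f}")
```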
  • 195. Assignment 3. Consider the maternal depression data. The outcome variable is depression among pregnant mothers. (1) Explore the data using chi-square tests and odds ratios. (2) Fit a simple binary logistic regression for each of the independent variables and compute crude odds ratios. (3) Fit a multiple logistic regression and compute adjusted odds ratios. (4) Compare the crude and adjusted odds ratios and explain the difference. (5) Check the goodness of fit of the model. (6) Interpret all the results.
  • 196. Count Data Analysis: Poisson Regression Model
  • 197. The Poisson Regression Model. Count data are very common in many applications. Examples include: number of patients visiting a certain hospital per day, number of ANC visits per period of pregnancy, number of live births in a given district per year, etc. Count data are commonly analyzed using the Poisson regression model.
  • 198. The Poisson Regression Model Cont’d. The Poisson model assumes that $y_i$, given the vector of covariates $x_i$, is independently Poisson-distributed with $f(y) = \frac{e^{-\lambda}\lambda^{y}}{y!}$. The mean is $\mu = E(Y) = \lambda$ and the variance is $\mathrm{Var}(Y) = \lambda$. Suppose $Y_1, \ldots, Y_N$ is a set of independent count outcomes, and let $x_1, \ldots, x_N$ represent the corresponding $p$-dimensional vectors of covariate values. The Poisson regression model with $\beta$ a vector of $p$ fixed, unknown regression coefficients is given by $\log(\lambda_i) = \beta_0 + x_i^{T}\beta$. In count and binary data, we don’t have the error term, why?
  • 199. Jimma Infant Data. Outcome: total child deaths (‘deaths’). Factor1: place of residence (1 = ‘urban’, 2 = ‘semi-urban’, 3 = ‘rural’). Factor2: monthly family income, ‘faminc’, in birr (continuous). $\log(\lambda) = \beta_0 + \beta_1\,\text{SU} + \beta_2\,\text{R} + \beta_3\,\text{IN}$. ‘IN’ refers to monthly family income; ‘SU’ and ‘R’ are defined as before.
  • 200.
    Poisson regression                       Number of obs =   7556
                                             LR chi2(3)    = 203.15
                                             Prob > chi2   = 0.0000
    Log likelihood = -6849.1816              Pseudo R2     = 0.0146
    ---------------------------------------------------------------------------
          deaths |      Coef.  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
           place |
               2 |    .035657   .0560403    0.64   0.525     -.07418    .145494
               3 |   .2299316   .0500235    4.60   0.000    .1318874   .3279758
                 |
          faminc |  -.0013383   .0001642   -8.15   0.000   -.0016601  -.0010164
           _cons |  -.7445872   .0510623  -14.58   0.000   -.8446676  -.6445069
    ---------------------------------------------------------------------------
    $\log(\lambda) = -0.745 + 0.036\,\text{SU} + 0.230\,\text{R} - 0.001\,\text{IN}$.
    STATA CODE: poisson deaths i.place faminc
  • 201. The option ‘irr’ displays the incidence rate ratio estimates.
    Poisson regression                       Number of obs =   7556
                                             LR chi2(3)    = 203.15
                                             Prob > chi2   = 0.0000
    Log likelihood = -6849.1816              Pseudo R2     = 0.0146
    ---------------------------------------------------------------------------
          deaths |        IRR  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
           place |
               2 |     1.0363   .0580746    0.64   0.525    .9285045   1.156611
               3 |   1.258514   .0629552    4.60   0.000     1.14098   1.388155
                 |
          faminc |   .9986626    .000164   -8.15   0.000    .9983413   .9989841
           _cons |   .4749303    .024251  -14.58   0.000    .4297002   .5249213
    ---------------------------------------------------------------------------
    STATA CODE: poisson deaths i.place faminc, irr
  • 202. $\exp(\beta_i)$ measures the multiplicative change in the expected count (the incidence rate ratio); $\beta_i$ itself is the change in the expected log count. $\exp(\beta_2) = \exp(0.230) \approx 1.26$ implies that the expected number of child deaths in rural areas is 1.26 times that in urban areas, i.e., the incidence rate of child death in rural areas is about 26% higher than in urban areas. $\exp(\beta_3) = \exp(-0.0013) = 0.9987$ implies that as family income increases by one Birr, the incidence rate of child death decreases by about 0.13%.
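These figures can be reproduced from the Poisson coefficients. The Python sketch below is an illustration rather than part of the slides; it exponentiates the reported estimates and, as an added example, rescales the income effect to a 100 Birr increase, which is often easier to interpret than a one-Birr change.

```python
import math

# Coefficients from the Poisson regression of deaths on place and faminc.
b_rural = 0.2299316     # place == 3 (rural) vs urban
b_faminc = -0.0013383   # per Birr of monthly family income

irr_rural = math.exp(b_rural)
irr_per_birr = math.exp(b_faminc)
irr_per_100_birr = math.exp(100 * b_faminc)   # illustrative rescaling

print(f"IRR rural vs urban: {irr_rural:.4f} ({(irr_rural - 1) * 100:.1f}% higher)")
print(f"IRR per Birr:       {irr_per_birr:.6f} ({(1 - irr_per_birr) * 100:.2f}% lower)")
print(f"IRR per 100 Birr:   {irr_per_100_birr:.4f}")
```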
  • 203. Negative-Binomial Regression I. Recall: the Poisson distribution assumes that its mean and variance are equal. However, count data in practice often violate this assumption as a result of unobserved heterogeneity. Usually, the sample variance is higher than the sample mean, a phenomenon referred to as ‘overdispersion’. The overdispersion in the data can’t be fully captured by observed covariates. Hence, there is a need for models having more parameters than the Poisson distribution. The negative-binomial is an important extension of the Poisson in this regard.
  • 204. Negative-Binomial Regression II. The negative-binomial distribution is given by $P(y) = \frac{\Gamma(\alpha^{-1} + y)}{\Gamma(\alpha^{-1})\, y!} \left(\frac{\alpha\mu}{1 + \alpha\mu}\right)^{y} \left(\frac{1}{1 + \alpha\mu}\right)^{1/\alpha}$, with $E(y) = \mu$ and $\mathrm{Var}(y) = \mu + \alpha\mu^{2}$. For regression purposes, we assume $y_i \sim \text{Negbin}(\mu_i, \alpha)$. Applying a log link, $\log \mu_i = x_i^{T}\beta$.
  • 205. Jimma Infant Data. Outcome: total child deaths (‘deaths’). Factor1: place of residence (1 = ‘urban’, 2 = ‘semi-urban’, 3 = ‘rural’). Factor2: monthly family income, ‘faminc’, in birr (continuous). Summary statistics of the outcome (‘deaths’):
        Variable |    Obs       Mean   Std. Dev.   Min   Max
    -------------+------------------------------------------
          deaths |   8040    .460199    .7191223     0     2
    The sample variance ($0.7191^2 \approx 0.517$) is higher than the sample mean (0.460), suggesting evidence of overdispersion. A negative-binomial model might be a good idea instead of the Poisson model which was fitted earlier to these data.
  • 206.
    Negative binomial regression             Number of obs =   7556
                                             LR chi2(3)    = 176.10
    Dispersion     = mean                    Prob > chi2   = 0.0000
    Log likelihood = -6824.794               Pseudo R2     = 0.0127
    ---------------------------------------------------------------------------
          deaths |      Coef.  Std. Err.       z   P>|z|   [95% Conf. Interval]
    -------------+-------------------------------------------------------------
           place |
               2 |   .0352622   .0595191    0.59   0.554   -.0813931   .1519174
               3 |   .2328639   .0532109    4.38   0.000    .1285724   .3371554
                 |
          faminc |  -.0013315   .0001712   -7.78   0.000   -.0016671  -.0009959
           _cons |  -.7470684   .0540689  -13.82   0.000   -.8530415  -.6410953
    -------------+-------------------------------------------------------------
        /lnalpha |  -1.124534   .1619651                   -1.441979   -.807088
    -------------+-------------------------------------------------------------
           alpha |   .3248039   .0526069                    .2364592   .4461554
    ---------------------------------------------------------------------------
    Likelihood-ratio test of alpha=0: chibar2(01) = 48.78   Prob>=chibar2 = 0.000
  • 207. Observations. STATA CODE: nbreg deaths i.place faminc. The dispersion parameter alpha in the negative-binomial model captures the extra variability in the data. The negative-binomial model is an improvement in model fit over the Poisson model. Furthermore, the likelihood-ratio test of the parameter alpha (p-value < 0.001) confirms the above result.
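The evidence for overdispersion can be retraced from the reported numbers. The Python sketch below is an illustration, not part of the slides: it compares the sample variance with the mean and recomputes the likelihood-ratio statistic for alpha = 0 from the Poisson and negative-binomial log likelihoods. The halved chi-square(1) tail probability reflects the boundary nature of the test (Stata's chibar2(01)).

```python
import math

# Summary statistics of 'deaths' from the slides.
mean_deaths, sd_deaths = 0.460199, 0.7191223
var_deaths = sd_deaths ** 2
dispersion_ratio = var_deaths / mean_deaths   # > 1 suggests overdispersion

# Log likelihoods of the two fitted models (same 7556 observations).
ll_poisson = -6849.1816
ll_nbreg = -6824.794

# LR statistic for H0: alpha = 0 (the Poisson model is the special case).
lr_alpha = 2 * (ll_nbreg - ll_poisson)
# alpha = 0 lies on the boundary of the parameter space, so the p-value is
# half the chi-square(1) upper-tail probability.
p_alpha = 0.5 * math.erfc(math.sqrt(lr_alpha / 2))

print(f"variance = {var_deaths:.4f}, mean = {mean_deaths:.4f}, "
      f"ratio = {dispersion_ratio:.3f}")
print(f"LR test of alpha = 0: chibar2(01) = {lr_alpha:.2f}, p = {p_alpha:.3g}")
```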
  • 208. Zero-Inflated Models I. Many count data sets are characterized by excessive zeros beyond what the common count distributions can predict. This is often because of heterogeneity between subjects or the omission of important covariates. Not appropriately accounting for this feature leads to biased estimates and hence to erroneous conclusions. Hence, models that extend the standard count models are needed.
  • 209. Example 1: Number of child deaths. Each woman was asked how many of her children had died. In the first scenario, a woman may have given birth to many children and report no child deaths. In the second scenario, a woman may have no children and therefore also report zero deaths. Both scenarios yield a zero, which inflates the number of zeros. Example 2: Number of fish caught by a fisherman. There may be two different ponds, one without fish and one with fish. The number of fish caught is always zero if the fisherman tries the first pond; that is the always-zero group. In the pond with fish, a fisherman may still catch zero, due to his own failure.
  • 210. Zero-Inflated Models II. Commonly used models in this regard: the zero-inflated Poisson model (ZIP) and the zero-inflated negative-binomial model (ZINB). ZINB is considered when data are characterized by both overdispersion and zero-inflation.
  • 211. Zero-Inflated Models III. The zero-inflated Poisson (ZIP) model is commonly used in this regard. This model allows for overdispersion by assuming that there are two different types of individuals in the data: (1) those who have a zero count with probability 1 (the Always-0 group), and (2) those who have counts predicted by the standard Poisson (the Not-always-0 group). An observed zero could be from either group; if the zero is from the Always-0 group, the observation has no chance of a positive outcome.
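The two-group idea translates into a simple mixture probability function. The Python sketch below is an illustration under assumed notation: `pi` (the probability of belonging to the Always-0 group) and `lam` (the Poisson mean in the other group) are hypothetical values, not estimates from any data in this course.

```python
import math

def poisson_pmf(y, lam):
    """Standard Poisson probability of observing count y."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def zip_pmf(y, lam, pi):
    """Zero-inflated Poisson: with probability pi the count is a structural zero,
    otherwise it is drawn from a Poisson distribution with mean lam."""
    structural_zero = pi if y == 0 else 0.0
    return structural_zero + (1 - pi) * poisson_pmf(y, lam)

# Hypothetical parameter values for illustration.
lam, pi = 1.5, 0.3

p_zero_poisson = poisson_pmf(0, lam)
p_zero_zip = zip_pmf(0, lam, pi)

# Mean and variance of the ZIP distribution, computed numerically.
support = range(60)
zip_mean = sum(y * zip_pmf(y, lam, pi) for y in support)
zip_var = sum((y - zip_mean) ** 2 * zip_pmf(y, lam, pi) for y in support)

print(f"P(Y = 0): Poisson = {p_zero_poisson:.4f}, ZIP = {p_zero_zip:.4f}")
print(f"ZIP mean = {zip_mean:.4f}, variance = {zip_var:.4f}")
```

The variance exceeds the mean, which shows how the extra zeros alone generate overdispersion relative to a plain Poisson model. In Stata, models of this kind are fitted with the `zip` and `zinb` commands.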
  • 212. Assignment 4. Consider the EDHS 2016 datasets. The outcome variable is the number of children a woman has. (1) Fit a Poisson regression. (2) Test the assumptions of the model. (3) Interpret the results. (4) Write a full report.
  • 213. Chapter 4: Survival Data Analysis
  • 214. Introduction. Survival analysis refers to statistical methods for analyzing survival data. Survival data could be derived from laboratory, clinical, epidemiological studies, etc. The response of interest is the time from an initial observation until occurrence of a subsequent event.
  • 215. Examples. In many biomedical studies, the primary endpoint is the time until an event occurs. Examples include: time to death or remission; time to adverse events; time to relapse; time to recovery from disease; time to a new symptom; time to discharge; time until a child starts to talk; time from transplant surgery until the new organ fails; marriage duration. This time interval between a starting point and a subsequent event is known as the survival time.
  • 216. Example. Which of the following data sets is likely to lend itself to survival analysis? (a) A case-control study of caffeine intake and breast cancer. (b) A randomized controlled trial where the outcome was whether or not women developed breast cancer in the study period. (c) A cohort study where the outcome was the time it took women to develop breast cancer. (d) A cross-sectional study which identified both whether or not women have ever had breast cancer and their date of diagnosis.
  • 217. Event and Censoring. We may not observe all the events of interest within the defined follow-up period. Thus, data are typically subject to censoring when a study ends before the event occurs. Censoring: subjects are said to be censored if they are lost to follow-up or drop out of the study, or if the study ends before they die or have an outcome of interest.
  • 218. Jimma Infant Data. Infants in Southwest Ethiopia were studied for one year. Measurements were taken approximately every two months. There were nearly 8000 infants at baseline. Data on infant, maternal and household characteristics were collected. Death of infants within their first year is an important problem with these data. We will therefore analyze the time from birth to death (in days). For infants still alive when these data were collected, time is the time from birth to the time of data collection.
  • 219. Jimma Infant Data II. The variable event is an indicator for whether time refers to death (1) or end of study (0). Possible explanatory variables for time-to-death could be place of residence and sex of infants, among others. These data can be described as survival data. Duration or survival data can generally not be analyzed by conventional methods such as linear regression. The main reason for this is that some durations are usually right-censored. That is, the endpoint of interest has not occurred during the period of observation. Another reason is that survival times tend to have positively skewed distributions.
  • 220. Survivor Function. The survival time $T$ may be regarded as a random variable with cumulative distribution function $F(t)$ and probability density function $f(t)$. An obvious quantity of interest is the probability of surviving to time $t$ or beyond, the survivor function or survival curve $S(t)$, which is given by $S(t) = P(T \ge t) = 1 - F(t)$.
  • 221. Hazard Function. A further function which is of interest for survival data is the hazard function. This represents the instantaneous death rate, that is, the rate at which an individual experiences the event of interest at a time point, given that the event has not yet occurred. The hazard function is given by $h(t) = \frac{f(t)}{S(t)}$: the instantaneous rate of death at time $t$ divided by the probability of surviving up to time $t$. The hazard function is essentially the (instantaneous) incidence rate.
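The ratio form of the hazard follows from its definition as a conditional rate. The derivation below is a standard sketch for a continuous survival time $T$, using the $F$, $f$ and $S$ of the survivor-function slide; the limit notation is added here and does not appear in the slides.

```latex
\begin{align*}
  h(t) &= \lim_{\Delta t \to 0}
          \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \\
       &= \lim_{\Delta t \to 0}
          \frac{F(t + \Delta t) - F(t)}{\Delta t} \cdot \frac{1}{S(t)} \\
       &= \frac{f(t)}{S(t)}
        = -\frac{d}{dt} \ln S(t).
\end{align*}
```

The last equality shows that the hazard and the survivor function determine each other: $S(t) = \exp\left(-\int_0^t h(u)\,du\right)$.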
  • 222. Kaplan-Meier Estimator. The Kaplan-Meier estimator is a nonparametric estimator of the survivor function $S(t)$. If all the failure times, or times at which the event occurs in the sample, are ordered and labeled $t_{(j)}$ such that $t_{(1)} \le t_{(2)} \le \ldots \le t_{(n)}$, the estimator is given by $\hat{S}(t) = \prod_{j \mid t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right)$.
  • 223. Kaplan-Meier Estimator II. Recall $\hat{S}(t) = \prod_{j \mid t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right)$. Here $d_j$ is the number of individuals who experience the event at time $t_{(j)}$, and $n_j$ is the number of individuals who have not yet experienced the event at that time and are therefore still ‘at risk’ of experiencing it (including those censored at $t_{(j)}$).
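The product formula can be traced by hand on a very small data set. The Python sketch below is an illustration only: the six survival times and event indicators are made up (they are not the Jimma infant data), and the function name is hypothetical. Exact fractions are used so that each factor $(1 - d_j / n_j)$ is visible.

```python
from fractions import Fraction

def kaplan_meier(times, events):
    """Kaplan-Meier estimate at each distinct event time.

    times  : observed times (event or censoring)
    events : 1 if the event occurred at that time, 0 if censored
    Returns a list of (t_j, S(t_j)) pairs.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    survival = Fraction(1)
    curve = []
    for t_j in event_times:
        # n_j: still at risk at t_j (includes subjects censored exactly at t_j)
        n_j = sum(1 for t in times if t >= t_j)
        # d_j: number of events at t_j
        d_j = sum(1 for t, e in zip(times, events) if t == t_j and e == 1)
        survival *= Fraction(n_j - d_j, n_j)
        curve.append((t_j, survival))
    return curve

# Made-up example: six subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 6, 8]
events = [1, 1, 0, 1, 0, 1]

km_curve = kaplan_meier(times, events)
for t_j, s in km_curve:
    print(f"t = {t_j}: S(t) = {s} = {float(s):.3f}")
```

In Stata, the same estimate is obtained after declaring the data with `stset`, using `sts list` or `sts graph`.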