CATEGORICAL DATA ANALYSIS
1
Introduction
 Analysis of categorical data, or categorical data analysis (CDA), refers
to methods for discrete response variables.
 CDA in practice is largely the analysis of count response variables.
 Statistical computations and analyses assume that the variables
have a specific level of measurement
2
3
 Categorical/Discrete/Qualitative data: Measures on categorical or discrete
variables consist of assigning observations to one of a number of
categories in terms of counts or proportions
Counts are variables representing frequency of occurrence of an event:
Example: number of graduate students in the department of public
health, SPHMMC.
 Proportions or “bounded counts” are ratios of counts:
Example: number of graduate students in the department of public
health divided by the total number of graduate students, SPHMMC.
Discretely measured responses
Discretely measured responses can be:
Nominal (unordered) variables: e.g., gender, ethnic background,
religious or political affiliation
Ordinal (ordered) variables, e.g., grade levels, income levels, school
grades
Discrete interval variables with only a few values,
e.g., number of times married
Continuous variables grouped into small number of categories,
e.g., income grouped into subsets, blood pressure levels (normal,
high-normal etc)
4
What is a categorical (qualitative) variable?
 A categorical variable has a measurement consisting of a set of
categories. For example:
 Field goal result – success or failure
 Patient survival – yes or no
 Criminal offense convictions – murder, robbery, assault, …
 Highest attained education level –HS, BSc, MSc, PhD
 Monthly income < 650, 651-1200,…., > 10,999 ETB
• Religious affiliation
∴ We live in a categorical world.
5
Types of Categorical Variables
1. Ordinal –categories have an ordering
 Education level (none, BSc., MSc., PhD, Professor)
 Social class (upper, middle, lower)
 Patient condition (good, fair, serious, critical)
2. Nominal –categories do not have an ordering
 Religious affiliation (Orthodox, Catholic, Protestant, Muslim, other)
 Mode of transportation to work (walk, bike, automobile, bus)
 Favorite type of music (classical, country, folk, jazz, rock)
 Choice of residence (apartment, condominium, house, other)
6
Note: The way that a variable is measured determines its classification.
 For example, “education” is
only nominal when measured as public school or private school;
it is ordinal when measured by highest degree attained, using the
categories none, high school, bachelor’s, master’s, and doctorate;
it is interval when measured by number of years of education,
using the integers 0,1, 2, … .
7
Where do Categorical Data occur?
 Social sciences: opinions on issues
 Health sciences: response to treatments/drugs
 Behavioral sciences: type of mental illness
 Public health: AIDS awareness
 Zoology: animals’ food preferences
 Education: students' response to exams
 Marketing: consumer preferences
Categorical data occur almost everywhere.
8
Variables and types of data for CDA
 Response variable(s) is categorical
 Explanatory/predictor variable(s) may be categorical or continuous; they can be
of any type
Discrete Distributions
 Statistical inference requires assumptions about the probability distribution (i.e.,
random mechanism, sampling model) that generated the data.
 For example, for a t-test we assume that the random variable follows a normal
distribution.
 For discrete data key distributions are: Bernoulli, Binomial, Poisson and
Multinomial.
9
Bernoulli Probability Distributions
10
 Suppose Y = 1 is a success, where the probability of a success is π.
Also, suppose Y = 0 is a failure, with probability of a failure 1 − π.
Bernoulli probability mass function (pmf): P(Y = y) = π^y (1 − π)^(1−y), y = 0, 1
Notice that P(Y = 1) = π and P(Y = 0) = 1 − π
Since Yi² = Yi, we have E(Yi) = E(Yi²) = 1 ∗ π + 0 ∗ (1 − π) = π
So E(Yi) = π and Var(Yi) = E(Yi²) − [E(Yi)]² = π(1 − π)
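The mean and variance above can be checked numerically. A minimal Python sketch (π = 0.3 is an arbitrary illustration value) computes both moments directly from the pmf:

```python
# Bernoulli pmf: P(Y = 1) = pi, P(Y = 0) = 1 - pi.
# Direct check that E(Y) = pi and Var(Y) = pi * (1 - pi).

def bernoulli_moments(pi):
    """Return (mean, variance) computed from the Bernoulli pmf."""
    pmf = {1: pi, 0: 1 - pi}                     # support and probabilities
    mean = sum(y * p for y, p in pmf.items())
    ey2 = sum(y**2 * p for y, p in pmf.items())  # E(Y^2) = E(Y), since Y^2 = Y
    return mean, ey2 - mean**2

mean, var = bernoulli_moments(0.3)
print(mean, var)   # 0.3 and approximately 0.21 = 0.3 * 0.7
```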
The Binomial distribution
• It is one of the most widely encountered discrete distributions.
• The origin of binomial distribution lies in Bernoulli’s trials.
• When a single trial of some experiment can result in only one of two
mutually exclusive outcomes (success or failure; dead or alive; sick or
well; male or female), the trial is called a Bernoulli trial.
• Suppose an event can have only binary outcomes A and B. Let the
probability of A be π and that of B be 1 − π. The probability π stays the
same each time the event occurs.
11
• If the experiment is repeated n times and the outcome is
independent from one trial to another, the probability that
outcome A occurs exactly y times is
P(Y = y) = (n choose y) π^y (1 − π)^(n−y), y = 0, 1, …, n
• We write Y ∼ B(n, π)
12
Characteristics of a Binomial Distribution
• The experiment consists of n identical trials.
• There are only two possible outcomes on each trial.
• The probability of A remains the same from trial to trial. This probability is
denoted by π, and the probability of B is denoted by 1 − π.
• The trials are independent.
• The binomial random variable Y is the number of A’s in n trials.
• n and π are the parameters of the binomial distribution.
• The mean is nπ and the variance is nπ(1- π)
13
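These characteristics can be verified numerically. A short Python sketch (n = 10 and π = 0.4 are arbitrary illustration values) computes the pmf and checks that the mean and variance match nπ and nπ(1 − π):

```python
from math import comb

def binom_pmf(y, n, pi):
    """P(Y = y) for Y ~ B(n, pi)."""
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

n, pi = 10, 0.4
pmf = [binom_pmf(y, n, pi) for y in range(n + 1)]

mean = sum(y * p for y, p in zip(range(n + 1), pmf))
var = sum((y - mean)**2 * p for y, p in zip(range(n + 1), pmf))
print(round(mean, 6), round(var, 6))   # n*pi = 4.0 and n*pi*(1 - pi) = 2.4
```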
Poisson distribution
• A different kind of discrete data arises when we count the number of
occurrences of an event, perhaps for different subjects or for units of time.
Examples:
- Daily number of new cases of breast cancer notified to a cancer registry
- Number of abnormal cells in a fixed area of histological slides from a
series of liver biopsies
• Suppose events happen randomly and independently in time at a constant
rate.
- If events happen with rate λ events per unit time, the probability of y
events happening in unit time is
P(Y = y) = e^(−λ) λ^y / y!, for y = 0, 1, 2, …; λ > 0
14
Characteristics of a Poisson distribution
• The Poisson distribution is very asymmetric when its mean is small
• With large means it becomes nearly symmetric
• It has no theoretical maximum value, but the probabilities tail off
towards zero very quickly
• λ is the parameter of the Poisson distribution
• The mean is λ and the variance is also λ
15
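A numerical check of these characteristics in Python (λ = 3.5 is an arbitrary illustration value; the infinite support is truncated, which is safe because the probabilities tail off quickly):

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    """P(Y = y) for Y ~ Poisson(lam)."""
    return exp(-lam) * lam**y / factorial(y)

lam = 3.5
ys = range(60)   # truncated support; remaining mass is negligible
mean = sum(y * poisson_pmf(y, lam) for y in ys)
var = sum((y - mean)**2 * poisson_pmf(y, lam) for y in ys)
print(round(mean, 6), round(var, 6))   # both approximately lam = 3.5
```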
THE CHI-SQUARE DISTRIBUTION AND
THE ANALYSIS OF FREQUENCIES
16
• The chi-square test is the most frequently employed statistical
technique for the analysis of count or frequency data.
• For example, we may know for a sample of hospitalized patients how many
are male and how many are female.
• For the same sample we may also know how many have private insurance
coverage, how many have Medicare insurance, and how many are on
Medicaid assistance.
• We may wish to know, for the population from which the sample was
drawn, if the type of insurance coverage differs according to gender.
• Chi-square analysis provides a solution to such questions.
• The chi-square distribution may be derived from normal distributions.
17
Chi square distribution
1. The chi-square distribution is a nonsymmetrical distribution
2. Chi-square distributions are determined by their degrees of freedom
18
Chi-square test statistic: χ² = Σ (O − E)²/E
 Cannot be negative because all discrepancies are squared.
 Will be zero only in the unusual event that each observed
frequency exactly equals the corresponding expected frequency.
 The larger the discrepancy between the expected frequencies and
their corresponding observed frequencies, the larger the
observed value of chi-square.
19
Types of Chi-Square Tests
i. Tests of goodness-of-fit
ii. Tests of independence
iii. Tests of homogeneity
20
Tests of goodness-of-fit
• All of the chi-square tests that we employ may be thought of as goodness-of-
fit tests.
• We use the phrase “goodness-of-fit” in a more restricted sense.
• We use it to refer to a comparison of a sample distribution to some theoretical
distribution that it is assumed describes the population from which the
sample came.
• Karl Pearson showed that the chi-square distribution may be used as a
test of the agreement between observation and hypothesis whenever the data
are in the form of frequencies.
21
Observed versus Expected Frequencies
• Observed frequencies: are the number of subjects or objects in our sample
that fall into the various categories of the variable of interest.
For example: if we have a sample of 100 hospital patients, we may observe that
50 are married, 30 are single, 15 are widowed, and 5 are divorced.
• Expected frequencies: are the number of subjects or objects in our sample
that we would expect to observe if some null hypothesis about the variable is
true.
For example, our null hypothesis might be that the four categories of marital
status are equally represented in the population from which we drew our
sample. In that case we would expect our sample to contain 25 married, 25
single, 25 widowed, and 25 divorced patients.
22
Chi-Square Test Statistic
• When the null hypothesis is true, 𝜒2
is distributed approximately as 𝜒2
with k-r
degrees of freedom. Where;
k- is equal to the number of groups for which observed and expected frequencies
are available
r - is the number of restrictions or constraints imposed on the given comparison
Oi is the observed frequency for the ith category of the variable of interest
Ei is the expected frequency
• This test is a right-tailed test, since when the O - E values are squared, the
answer will be positive or zero.
23
Two assumptions are needed for the goodness-of-fit test.
1. The data are obtained from a random sample.
2. The expected frequency for each category must be 5 or more.
- The steps for the chi-square goodness-of-fit test are summarized in the
procedure table.
24
• When there is perfect agreement between the observed and the expected values,
χ2= 0. Also, 𝜒2 can never be negative.
• Finally, the test is right-tailed because “H0: Good fit” and “H1: Not a good fit”
mean that 𝜒2 will be small in the first case and large in the second case.
For example, suppose as a market analyst you wished to see whether consumers
have any preference among five flavors of a new fruit soda. A sample of 100 people
provided these data:
If there were no preference, you would expect each flavor to be selected with equal
frequency, i.e. 100/5 =20.
25
- Is there enough evidence to reject the claim that there is no preference in the
selection of fruit soda flavors? Let 𝛼 = 0.05.
Solution
- Step 1: State the hypotheses and identify the claim.
H0 : Consumers show no preference for flavors (claim).
H1: Consumers show a preference.
- Step 2: Find the critical value. The degrees of freedom are 5 -1= 4, and 𝛼 = 0.05.
Hence, the critical value from Chi-square table is 9.488.
- Step 3: Compute the test value by subtracting the expected value from the
corresponding observed value, squaring the result and dividing by the expected
value, and finding the sum. The expected value for each category is 20.
26
Step 4: Make the decision. The decision is to reject the null hypothesis,
since 18.0 > 9.488
Step 5: Summarize the results. There is enough evidence to reject the
claim that consumers show no preference for the flavors.
27
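The five steps above can be sketched in Python. The observed counts below are hypothetical (the original data table is not reproduced here); they sum to 100 and were chosen to reproduce the reported test value of 18.0:

```python
# Chi-square goodness-of-fit test for the fruit-soda flavor example.
# Hypothetical observed counts (the slides' table is not shown here).
observed = [32, 28, 16, 14, 10]
expected = [sum(observed) / len(observed)] * len(observed)   # 20 each under H0

chi2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
print(chi2)                     # 18.0 (up to floating-point rounding)

critical = 9.488                # chi-square table, df = 5 - 1 = 4, alpha = 0.05
print("reject H0" if chi2 > critical else "fail to reject H0")
```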
• The chi-square goodness-of-fit test can be used to test a variable to see if it
is normally distributed. The hypotheses are
H0: The variable is normally distributed.
H1: The variable is not normally distributed.
28
TESTS OF INDEPENDENCE
• The chi-square independence test can be used to test the independence of two
variables.
• To test the null hypothesis by using the chi-square independence test,
expected frequencies must be computed
• When data are arranged in table form for the chi-square independence test,
the table is called a contingency table.
• The table is made up of R rows and C columns.
• The degrees of freedom for any contingency table are (rows − 1) times (columns
− 1); that is, d.f. = (R − 1)(C − 1).
• The reason for this formula for d.f. is that all the expected values except one
are free to vary in each row and in each column.
29
• For example, suppose a new postoperative procedure is administered to a
number of patients in a large hospital.
• The researcher can ask the question, do the doctors feel differently about this
procedure from the nurses, or do they feel basically the same way?
• Note that the question is not whether they prefer the procedure but whether
there is a difference of opinion between the two groups.
• To answer this question, a researcher selects a sample of nurses and doctors
and tabulates the data in table form, as shown.
30
H0:The opinion about the procedure is independent of the profession.
H1: The opinion about the procedure is dependent on the profession
• The degree of freedom for this case is (2-1)(3-1)= (1)(2) =2
31
32
33
• The final steps are to make the decision and summarize the results.
• This test is always a right-tailed test, and the degrees of freedom are (R − 1)(C − 1)
= (2 − 1)(3 − 1) = 2.
• If 𝛼=0.05,the critical value from Chi-square table is 5.991. Hence, the decision is
to reject the null hypothesis, since 26.67 > 5.991
• The conclusion is that there is enough evidence to support the claim that
opinion is related to (dependent on) profession—that is, that the doctors and
nurses differ in their opinions about the procedure.
34
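The expected-frequency rule Eij = (row total × column total)/n and the resulting chi-square statistic can be sketched in Python. The 2 × 3 counts below are hypothetical (the doctors/nurses table is not reproduced in the slides):

```python
# Chi-square test of independence for an R x C contingency table.
# Hypothetical 2 x 3 counts (profession x opinion), for illustration only.
table = [[40, 30, 10],
         [10, 20, 30]]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / n   # E_ij = (row tot)(col tot)/n
        chi2 += (obs - exp)**2 / exp

df = (len(table) - 1) * (len(col_totals) - 1)     # (R-1)(C-1)
print(round(chi2, 2), df)
```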
The 2 X 2 Contingency Table
• Sometimes each of two criteria of classification may be broken down into only
two categories, or levels.
• When data are cross classified in this manner, the result is a contingency table
consisting of two rows and two columns.
• Such a table is commonly referred to as a 2X2 table.
• In the case of a 2X2 contingency table, however, χ² may be calculated by the
following shortcut formula:
χ² = n(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]
35
• Where a, b, c, and d are the observed cell frequencies as shown in the
following table.
• When we apply the (r-1)(c-1) rule for finding degrees of freedom to a 2X2
table, the result is 1 degree of freedom.
A 2X2 Contingency Table
36
Example:
• According to findings from a study by Silver and Aiello, falls are a major
concern among polio survivors.
• Researchers wanted to determine the impact of a fall on lifestyle changes.
• The following table shows the results of a study of 233 polio survivors on
whether fear of falling resulted in lifestyle changes.
37
• Solution:
1. Data. From the information given we may construct the 2X2 contingency
table
2. Assumptions. We assume that the sample is equivalent to a simple random
sample.
3. Hypotheses.
H0: Fall status and lifestyle change because of fear of falling are independent.
H1: The two variables are not independent.
Let 𝛼 =.05
4. Test statistic. The test statistic is χ² = Σ (O − E)²/E, or equivalently the
2X2 shortcut formula.
Answer: χ²cal = 31.74
38
Small Expected Frequencies
• The problems of how to handle small expected frequencies and small total sample
sizes may arise in the analysis of 2X2 contingency tables.
• Cochran suggests that the 𝜒2 test should not be used if n<20 or if 20<n<40 and any
expected frequency is less than 5.
• When n ≥ 40, an expected cell frequency as small as 1 can be tolerated.
39
• Yates’s Correction
• The observed frequencies in a contingency table are discrete and thereby give
rise to a discrete statistic, χ², which is approximated by the chi-square
distribution, which is continuous.
• Yates proposed a procedure for correcting for this in the case of 2X2 tables.
• No correction is necessary for larger contingency tables
40
When n ≥ 40 and every E ≥ 5, use the ordinary chi-square statistic:
χ² = Σ (O − E)²/E, or for a 2X2 table, χ² = n(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]
When n ≥ 40 and 1 ≤ E < 5, use Yates’s corrected statistic:
χ² = Σ (|O − E| − 0.5)²/E, or for a 2X2 table, χ² = n(|ad − bc| − n/2)² / [(a + b)(c + d)(a + c)(b + d)]
When n < 40 or any E < 1, use Fisher’s exact test:
P = (a + b)! (c + d)! (a + c)! (b + d)! / (n! a! b! c! d!)
41
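The three 2X2 procedures from the decision table above, written out in Python. The cell counts in the usage line are hypothetical (the polio-survivor table is not reproduced here):

```python
from math import factorial

# a, b, c, d are the four cell counts of a 2 x 2 table; n = a + b + c + d.

def chi2_shortcut(a, b, c, d):
    """Ordinary shortcut chi-square (use when n >= 40 and all E >= 5)."""
    n = a + b + c + d
    return n * (a * d - b * c)**2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi2_yates(a, b, c, d):
    """Yates-corrected chi-square (use when n >= 40 and 1 <= E < 5)."""
    n = a + b + c + d
    return n * (abs(a * d - b * c) - n / 2)**2 / ((a + b) * (c + d) * (a + c) * (b + d))

def fisher_point_prob(a, b, c, d):
    """Fisher exact point probability (use when n < 40 or E < 1)."""
    n = a + b + c + d
    num = factorial(a + b) * factorial(c + d) * factorial(a + c) * factorial(b + d)
    den = factorial(n) * factorial(a) * factorial(b) * factorial(c) * factorial(d)
    return num / den

# Hypothetical 2 x 2 counts, for illustration only:
print(round(chi2_shortcut(30, 10, 15, 45), 2))
```

Note that the Yates statistic is always smaller than the uncorrected one for the same table, reflecting the continuity correction.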
Logistic Regression
42
Logistic Regression
• Much research in the health sciences is motivated by a desire to
understand and describe the relationship between independent
variables and a categorical dependent variable.
• Particularly plentiful are circumstances in which the outcome
variable is dichotomous (a variable that can assume only one of two
mutually exclusive values).
• These values are usually coded as Y = 1 for a success and Y = 0 for a
failure
43
• Logistic regression is the type of regression analysis that is usually
employed when the dependent variable is categorical.
• There can be many predictor variables (x’s) that could be categorical
or continuous.
44
Types of Logistic Regression
• Binary logistic regression: a regression analysis used to model outcome
variable with two categories
• Multinomial logistic regression: a regression analysis used to model
outcome variable of nominal scale with more than two categories
• Ordinal Logistic regression: a regression analysis used to model
outcome variable of ordinal scale with more than two categories
45
Linear vs. Logistic Regression
• What distinguishes the logistic regression model from the linear
regression model is the type of outcome variable.
• Linear regression: Outcome variable y is continuous
• Logistic regression: Outcome variable y is categorical
• The question a researcher needs to ask when choosing a regression
method is:
o What does my outcome look like?
46
• The difference is reflected both in
o the choice of a parametric model and
o the assumptions.
• However, the methods employed in an analysis using logistic regression follow the
same general principles used in linear regression.
• Why not use a linear regression model for categorical outcome variables?
o Because having a categorical outcome variable violates the linearity
assumption of linear regression.
o The error terms are heteroskedastic and not normally distributed, because Y
takes on only two values (0 and 1).
47
• The predicted probabilities can be greater than 1 or less than 0 which can
be a problem if the predicted values are used in a subsequent analysis.
• Some people try to solve this problem by setting predicted probabilities
greater than 1 equal to 1, and those less than 0 equal to 0.
• This amounts to an interpretation that a high probability of the Event
(Nonevent) occurring is considered a sure thing.
48
Objectives of Logistic Regression
• Estimating the magnitude of the outcome/exposure relationship
o To evaluate the association of a binary outcome with a set of predictors
• Prediction
o Develop an equation to determine the probability or likelihood that an
individual has the condition (y = 1), which depends on the independent
variables (the x’s)
49
Assumptions of Logistic regression
• The outcome must be categorical
• Requires enough responses in each category of a given variable
• Groups should be mutually exclusive; complete separation of the groups (or
perfect multicollinearity) makes maximum likelihood estimation impossible
• There is no assumption that the predictors are linearly related to each other
• There should not be multicollinearity
• There should not be outliers or influential observations
• Independence of errors – assumes a between-subjects design.
50
Logistic Regression Model
• The probability of the outcome is measured by the odds of occurrence of an event.
• If P is the probability of an event, then (1 − P) is the probability of it not occurring.
o Odds of event = P/(1 − P)
• In linear regression, the estimates of effect are directly quantified by the mean value
of the response variable
• In logistic regression, the estimates of effect are instead quantified by “odds
ratios”
51
Odds and Probability
52
53
• Starting from the logistic model P = e^(β0+β1x) / (1 + e^(β0+β1x)) and taking
the logarithms of both sides,
• the model can be transformed as follows:
ln(P/(1 − P)) = β0 + β1x
• Sometimes written as: logit(P) = β0 + β1x
• Where ln (or log) is the natural logarithm (base e)
54
Cont’d…
Logistic Vs. Linear Regression Equation
Logistic Regression: ln(p/(1 − p)) = β0 + β1x
Linear Regression: y = β0 + β1x + ε
• The other difference between linear and logistic regression models
concerns the conditional distribution of error.
55
Cont’d…
• In the linear regression model we assume that an observation of
the outcome variable may be expressed as y = E(𝒀𝒊|𝑿𝒊) + 𝜺.
• The error (𝜺) is an observation's deviation from the conditional
mean of y.
• The errors 𝜺 are normally distributed with mean 0 and constant
variance σ² (equal variance). That is: 𝜀 ~ N(0, σ²)
56
Cont’d…
• With a dichotomous outcome variable the conditional distribution of error term is
different.
• In this situation we may express the value of the outcome variable given x as y =
P(x)+ 𝜺.
• Here the quantity 𝜺 may assume one of two possible values.
o If y = 1 then 𝜺 =1-P(x)
o If y = 0 then 𝜺 = -P(x)
• Thus, 𝜺 are distributed with mean zero and variance equal to P(x)[1-P(x)].
• The conditional distribution of the outcome variable follows a binomial
distribution with probability given by the conditional mean, P(x).
57
Cont’d…
58
Why log transformation?
• The odds have a range of 0 to ∞
• Odds > 1 are associated with an event being more likely to occur than
not
• Odds < 1 are associated with an event being less likely to occur than
not
• The log transformation is useful because it yields a variable that ranges
from −∞ to +∞
• Hence, it solves the problem we encountered in fitting a linear
model to probabilities
59
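The logit transformation and its inverse can be sketched in a few lines of Python, showing that probabilities in (0, 1) map onto the whole real line and back:

```python
from math import log, exp

def logit(p):
    """Log odds: maps a probability in (0, 1) to (-inf, +inf)."""
    return log(p / (1 - p))

def inv_logit(x):
    """Inverse transform: maps any real number back into (0, 1)."""
    return 1 / (1 + exp(-x))

for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 3))   # 0.5 maps to 0; logit is symmetric about 0
```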
Estimating Logistic regression
Simple Logistic regression
• The logistic model with a single independent variable X, where the effects of
other variables are uncontrolled:
logit(p) = β0 + β1X
Multiple Logistic regression
oThe logistic model with a single predictor variable X can be extended to
two or more predictor variables.
60
Interpretation of slope
• 𝛽1 is the estimated change in the log odds of the outcome for a one unit
increase in 𝑥1
• It estimates the log odds ratio for comparing two groups of observations
• This estimated slope can be exponentiated to get the corresponding estimated
odds ratio.
What about the Intercept?
• The intercept is mathematically necessary to specify the entire equation.
61
Maximum Likelihood Estimation
• The method used to estimate the regression coefficients in logistic regression
is called Maximum Likelihood Estimation (MLE)
• Ordinary least square(OLS) is method used to estimate the regression
coefficients in linear regression
• MLE yields values for the unknown parameters which maximize the
probability of obtaining the observed set of data.
62
Cont’d…
• Basically, the resulting estimates of the slope and intercept are the
values that make the observed data most likely among all choices
of values for 𝛽0and 𝛽1.
• Along with the estimates of 𝛽0and 𝛽1this method yields estimates
of the standard error for each: that can be used to create confidence
intervals and do hypothesis tests
63
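A minimal sketch of how MLE is carried out numerically for simple logistic regression, using Newton-Raphson iteration on the log-likelihood. The tiny data set is made up for illustration only (it is not from the slides):

```python
from math import exp

# Hypothetical data: one predictor x and a binary outcome y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 1, 0, 1, 1]

b0, b1 = 0.0, 0.0                        # starting values
for _ in range(25):
    p = [1 / (1 + exp(-(b0 + b1 * xi))) for xi in x]
    # Gradient (score) of the log-likelihood.
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
    # Observed information matrix (negative Hessian).
    w = [pi * (1 - pi) for pi in p]
    i00 = sum(w)
    i01 = sum(wi * xi for wi, xi in zip(w, x))
    i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = i00 * i11 - i01 * i01
    # Newton step: (b0, b1) += inverse(information) * gradient.
    b0 += ( i11 * g0 - i01 * g1) / det
    b1 += (-i01 * g0 + i00 * g1) / det

print(round(b0, 3), round(b1, 3))   # fitted intercept and slope
```

At convergence the gradient is (numerically) zero, which is exactly the condition that the observed data are most likely under the fitted β0 and β1.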
Test of Significance of Coefficients
• The fitted relationship i.e. the estimated value of 𝛽0 & 𝛽1 may simply be
the result of chance phenomena.
• We need to test whether or not the sample data set exhibits sufficient
evidence to indicate that X actually contributes significantly to the
prediction of the log odds of Y for a given value of X
• The test statistic is: z = β1 / SE(β1)
64
Example: Coronary Heart Disease (CD) and Age: In this study sampled individuals
were examined for signs of CD (present = 1 / absent = 0) and the potential
relationship between this outcome and their age (yrs.) was considered.
65
• For the CHD-age data set, we could try to estimate the following:
• p = probability of CHD evidence (proportion of persons with CHD evidence), 𝑥1 = age
• β0 and β1 are called regression coefficients
• Another way to write the above equation:
• Recall, the higher the odds of an event, the larger the probability of an event
• A predictor 𝑥1 that is positively associated with the odds will also be positively associated
with the probability of the event (i.e. estimated slope 𝛽1will be positive)
66
• A predictor 𝑥1 that is negatively associated with the odds will also be negatively
associated with the probability of the event (i.e. estimated slope 𝛽1will be
negative)
• Results from logistic regression of log odds of CHD evidence on age:
• The resulting equation: ln(p/(1 − p)) = −5.34 + 0.11 × Age
67
Cont’d..
• Where p is the estimated probability of CHD among persons of a
given age
• The estimated coefficient (𝛽1) of age (𝑥1) is positive; hence we have
o Estimated a positive association between age and log odds of CHD
o Estimated a positive association between age and probability of CHD
• How can we actually interpret the value 0.11?
• Lets write out the equation comparing two groups of individuals who differ in
age by one year:
• Group 1, age = k years; Group 2, age = k + 1 years
68
Cont’d…
• The resulting equations estimating the ln odds of CHD evidence in each age
group
• Multiplying out, and taking the difference (subtracting)
69
Cont’d…
• So, when the dust settles:
ln(odds at age k + 1) − ln(odds at age k) = β1
• Reversing one of the famous properties of logarithms:
ln[(odds at age k + 1)/(odds at age k)] = β1
• So β1, the estimated slope for 𝑥1, is the natural log of an estimated odds
ratio:
β1 = ln(OR)
• To get the estimated odds ratio, exponentiate β1, i.e.: OR = e^β1
70
Cont’d…
• In our example, recall β1 = 0.11
• Here, OR = e^β1 = e^0.11 ≈ 1.116
• The estimated odds ratio of CHD evidence for a one year age difference is
1.116, older compared to younger.
o 60 year olds compared to 59 year olds
o 45 year olds compared to 44 year olds
71
Interpretation of slope
• Change in the log odds of CHD for a one year increase in age
• One group with 𝑥1 one unit higher than the other
The Intercept?
• The resulting equation:
ln(p/(1 − p)) = −5.34 + 0.11 × Age
• Here, the intercept estimate β0 is the estimated ln odds of CHD evidence
for persons of age 0
72
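Working with the fitted CHD equation from the slides, ln(odds) = −5.34 + 0.11 × age, a short Python sketch recovers the odds ratio and the estimated probabilities at a few illustrative ages (the ages 40, 50, 60 are chosen for illustration):

```python
from math import exp

b0, b1 = -5.34, 0.11          # fitted intercept and slope from the slides

odds_ratio = exp(b1)          # per one-year age difference
print(round(odds_ratio, 3))   # 1.116, matching the slides

def p_chd(age):
    """Estimated probability of CHD evidence at a given age."""
    log_odds = b0 + b1 * age
    return 1 / (1 + exp(-log_odds))   # invert the logit

for age in (40, 50, 60):
    print(age, round(p_chd(age), 3))  # probability rises with age (b1 > 0)
```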
Test of Significance of Coefficients
• Hypothesis: H0: β1 = 0 vs. HA: β1 ≠ 0
• Assume the null is true, and calculate the standardized “distance” of β1 from 0:
z = β1 / SE(β1) = 0.11 / 0.03 ≈ 3.67
• The p-value is the probability of being 3.67 or more standard errors away from 0 on a
normal curve: very low in this example, p < 0.001
73
Multiple Logistic Regression
• Multiple logistic regression allows us to model the relationships of several independent
variables to a response variable.
• These independent variables may be either continuous or discrete or a combination of
the two
• We can also estimate the association between each predictor and Pr(y = 1) controlling
for all other predictors
• In the previous example we found a statistically significant positive association
between CHD and age
ln(p/(1 − p)) = −5.34 + 0.11 × Age
74
Cont’d…
• Smoking status of study participants was also included in the model to assess
whether it has a relationship with CHD
• What if smoking is also associated with age?
• Age could be a confounder of the smoking and CHD relationship (and vice-
versa)
• Can we estimate the age adjusted relationship between CHD and smoking?
• Even if smoking and age not related, and hence there is no confounding, both
predictors may tell more about CHD evidence than either alone.
75
Cont’d…
• Here, we need a logistic regression model with 2 predictors (𝑋s):
ln(p/(1 − p)) = β0 + β1𝑋1 + β2𝑋2
• Where p = Pr(CHD evidence), 𝑋1 = age, 𝑋2 = smoking status (1 = yes)
• How would we interpret the coefficients from a multiple logistic regression?
And the resulting odds ratio estimates?
76
Cont’d…
• 𝛽1is the estimated regression coefficient associated with age:
• It estimates the ln odds ratio for comparing two individuals (groups) who
differ by one year in age and are either both smokers or both non-smokers
• 𝛽1is the estimated smoking-adjusted log odds ratio for age
• Just to demonstrate: Write out 2 equations for two groups of persons who
differ by one year in age and are all smokers
77
Cont’d…
78
Cont’d…
• 𝑿𝟏 is the age variable
• 𝛽1 is the estimated adjusted ln OR of CHD associated with age, after adjusting
for smoking status
• 𝑒𝛽1 is the estimated adjusted OR of CHD associated with age, after adjusting
for smoking status
• This 𝑂𝑅 compares two groups of individuals of the same smoking status but
who differ by one year in age (older to younger)
79
Cont’d…
• 𝑿𝟐 is the smoking variable
• 𝛽2 is the estimated regression coefficient associated with smoking:
• It estimates the ln odds ratio for comparing two groups of individuals of the
same age, where one group is smokers and the other is non-smokers
• 𝑒𝛽2 estimates the odds ratio for comparing two groups of individuals of the
same age, where one group is smokers and the other is non-smokers
80
Inference in Multiple Logistic Regression
• We can estimate each regression coefficient and OR by constructing a
range of plausible values, i.e., CIs
• We can also test the statistical significance of regression coefficients and
ORs using magnitude of test statistics or corresponding p-values or CI
• Each coefficient estimate has its own associated standard error
• Approach very similar to approach from simple logistic regression
81
Cont’d…
82
Model Development
• The approach to model development in multiple logistic regression analysis
is similar to the approach in normal theory multiple linear regression.
• Models are compared to assess the statistical significance of the extra
predictors in the larger model, controlling for the predictors in the smaller
model.
• This is done using the likelihood ratio test.
• If the likelihood ratio statistic is significant, we say that the added variables
are significant in adjusted analysis.
83
Logistic Regression Using SPSS
84
85
86
87
88
89
SPSS output
90
SPSS output
91
92
93
94
95
96
97
98
99
Multicategory Logit Models
100
Multicategory response
• Binary logistic regression provided analysis methods for binary
responses.
• What about more than two response categories?
Examples:
• Canadian political party affiliation – Conservative, New Democratic, Liberal
• Chemical compounds in drug discovery experiments – Positive, blocker, or
neither
• Five-level Likert scale – Strongly disagree, disagree, neutral, agree, or strongly
agree.
101
Cont’d…
• For these examples, some responses are ordinal (e.g., Likert scale) and
some are not (e.g., chemical compounds).
• We will investigate both nominal (unordered) and ordinal multicategory
responses.
Multinomial Probability Distribution
• The multinomial probability distribution is the extension of the binomial
distribution to situations where there are more than two categories for a
response.
• The probability mass function for observing particular values of n1, …, nc
is
P(N1 = n1, …, Nc = nc) = [n!/(n1! ⋯ nc!)] 𝜋1^n1 ⋯ 𝜋c^nc
102
Where
• Y denotes the response category, with levels j = 1, …, c
• Each category has a probability of 𝜋𝑗 = P(Y = j)
• n denotes the number of trials
• nj denotes the response count for category j, with n1 + … + nc = n
103
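The multinomial pmf above can be written directly in Python. As a sanity check, with c = 2 categories it reduces to the binomial pmf:

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """P(N1 = n1, ..., Nc = nc) = n!/(n1!...nc!) * prod(pi_j ** n_j)."""
    n = sum(counts)
    coef = factorial(n)
    for nj in counts:
        coef //= factorial(nj)      # multinomial coefficient
    p = 1.0
    for nj, pj in zip(counts, probs):
        p *= pj ** nj
    return coef * p

# With c = 2 this equals the binomial pmf C(10, 3) * 0.4^3 * 0.6^7:
print(multinomial_pmf([3, 7], [0.4, 0.6]))
```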
NOMINAL RESPONSES: BASELINE-CATEGORY LOGIT MODELS
• Multinomial logistic regression is an extension of the (binary) logistic
regression model when the categorical response variable has more than two
levels.
• One possible way to handle such situations is to split the categorical response
variable and apply binary logistic regression to each dichotomous variable.
• However, this will result in several different analyses for only one categorical
response.
• A more structured approach is to formulate one model for the categorical
response by means of so-called generalized logits.
104
Cont’d…
• Suppose there are J categories for the response variable with corresponding
probabilities 𝜋1, 𝜋2, …, 𝜋𝐽.
• Using the first category as a “baseline”, we can form “baseline-category logits” as
log(𝜋𝑗/𝜋1) for j = 2, …, J, which are simply log odds.
• When J = 2, we have log(𝜋2/𝜋1) = log(𝜋2/(1 − 𝜋2)), which is equivalent to
log(𝜋/(1 − 𝜋)) in binary logistic regression with 𝜋 = 𝜋2.
• When there is only one explanatory variable x, we can form the multinomial
logistic regression model of
log(𝜋𝑗/𝜋1) = 𝛽𝑗0 + 𝛽𝑗1x for j = 2, …, J
105
Cont’d…
• One can easily compare other categories so that category 1 is not always
used.
• For example, suppose you would like to compare category 2 to 3 for J ≥ 3.
Then
log(𝜋2/𝜋3) = log(𝜋2/𝜋1) − log(𝜋3/𝜋1) = (𝛽20 − 𝛽30) + (𝛽21 − 𝛽31)x
• For more than one explanatory variable, the model becomes:
log(𝜋𝑗/𝜋1) = 𝛽𝑗0 + 𝛽𝑗1x1 + … + 𝛽𝑗𝑝x𝑝
Odds ratios
• Because the log-odds are being modeled directly in a
multinomial regression model, odds ratios are useful for
interpreting an explanatory variable's relationship with the
response.
• Consider the model again of log(𝜋𝑗/𝜋1) = 𝛽𝑗0 + 𝛽𝑗1x
• The odds of a category j response vs. a category 1 response are
exp(𝛽𝑗0 + 𝛽𝑗1x). This directly leads to using odds ratios as a way
to understand the explanatory variable in the model.
107
Cont’d…
• Thus, the odds of a category j vs. a category 1 response change by 𝑒^(𝑐𝛽𝑗1) times for
every c-unit increase in x.
• In a similar manner, we could also compare category j to j′ (j ≠ j′, j > 1, j′ > 1): the
odds change by 𝑒^(𝑐(𝛽𝑗1 − 𝛽𝑗′1)) times.
• Wald and LR-based inference methods for odds ratios are performed as usual.
108
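Given baseline-category logit coefficients, the category probabilities can be recovered by exponentiating and normalizing (a softmax with the baseline's exponent fixed at 0). The coefficients below are hypothetical, for illustration only:

```python
from math import exp

def category_probs(x, coefs):
    """Category probabilities from log(pi_j / pi_1) = b_j0 + b_j1 * x.

    coefs: list of (b_j0, b_j1) for j = 2..J; category 1 is the baseline.
    """
    # pi_j is proportional to exp(b_j0 + b_j1 * x); the baseline term is exp(0) = 1.
    terms = [1.0] + [exp(b0 + b1 * x) for b0, b1 in coefs]
    total = sum(terms)
    return [t / total for t in terms]

# Hypothetical J = 3 model at x = 2.0:
probs = category_probs(2.0, [(0.5, -0.2), (-1.0, 0.8)])
print([round(p, 3) for p in probs])   # probabilities sum to 1
```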
Ordinal response models
• Suppose that the response categories are ordered in the following way:
category 1 < category 2 <….< category J
• For example, a response variable may be measured using a Likert scale with
categories strongly disagree, disagree, neutral, agree, or strongly agree.
• Logit transformations of the probabilities can incorporate these orderings in a
variety of ways.
• In this section, we focus on one way where probabilities are cumulated based on
these orderings.
109
Cont’d…
• The cumulative probability for Y is
P(Y ≤ j) = π1 + … + πj for j = 1, …, J.
• Note that: P(Y ≤ J) = 1.
• The logit of the cumulative probabilities can be written as
logit P(Y ≤ j) = log[P(Y ≤ j)/(1 − P(Y ≤ j))] = log[(π1 + … + πj)/(πj+1 + … + πJ)]
for j = 1, …, J – 1. For each j, we are computing the log odds of being in
categories 1 through j vs. categories j + 1 through J.
110
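A short sketch, with made-up category probabilities, showing the cumulative probabilities and the J − 1 cumulative logits:

```python
import math

# Hypothetical category probabilities for a J = 4 ordinal response
pi = [0.1, 0.3, 0.4, 0.2]

# Cumulative probabilities P(Y <= j) for j = 1, ..., J
cum = [sum(pi[:j]) for j in range(1, len(pi) + 1)]
print([round(c, 3) for c in cum])  # the last entry is always 1

# Cumulative logits log[P(Y <= j) / (1 - P(Y <= j))] for j = 1, ..., J - 1
logits = [math.log(c / (1 - c)) for c in cum[:-1]]
print([round(l, 3) for l in logits])  # strictly increasing in j
```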
Cont’d…
• When there is only one explanatory variable x, we can allow the log odds to vary
with x by using a proportional odds model:
logit P(Y ≤ j) = βj0 + β1x for j = 1, …, J – 1.
• The proportional odds name comes from there being no j subscript on the
β1 parameter, which means this slope is the same for each possible log-
odds that can be formed. This leads to each odds exp(βj0 + β1x) being a multiple,
exp(βj0), of the common factor exp(β1x).
111
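A sketch (all parameter values hypothetical) of how a proportional odds model produces the cumulative probabilities, and how the individual category probabilities are recovered as successive differences:

```python
import math

def cumulative_probs(x, intercepts, beta1):
    """Proportional odds model: logit P(Y <= j) = b_j0 + beta1 * x
    for j = 1, ..., J - 1. Returns P(Y <= 1), ..., P(Y <= J)."""
    def expit(z):
        return 1.0 / (1.0 + math.exp(-z))
    cum = [expit(b0 + beta1 * x) for b0 in intercepts]
    return cum + [1.0]  # P(Y <= J) = 1 by definition

# Hypothetical parameters for J = 4; the intercepts must be increasing
intercepts = [-2.0, -0.5, 1.0]
cum = cumulative_probs(x=0.5, intercepts=intercepts, beta1=0.8)
print([round(c, 3) for c in cum])

# Individual category probabilities are differences of cumulative ones
pi = [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]
assert abs(sum(pi) - 1.0) < 1e-12
```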
Cont’d…
• Notes:
• β10 < β20 < … < βJ−1,0 due to the ordering of the cumulative probabilities. Thus, the
odds of Y ≤ j increasingly become larger for j = 1, …, J – 1.
• A proportional odds model actually is a special case of a cumulative probability
model, which allows the parameter coefficient on each explanatory variable to vary
as a function of j.
112
Cont’d…
• For more than one explanatory variable, the model becomes:
logit P(Y ≤ j) = βj0 + β1x1 + … + βpxp for j = 1, …, J – 1.
• Consider the case of one explanatory variable x again:
logit P(Y ≤ j) = βj0 + β1x.
113
Odds ratio
• Odds ratios are easily formed because the proportional odds model
equates log-odds to the linear predictor.
• The main difference now is the odds involve cumulative probabilities.
• Consider again the model logit P(Y ≤ j) = βj0 + β1x.
• The odds ratio for a c-unit increase in x is
OR = Oddsx+c(Y ≤ j) / Oddsx(Y ≤ j) = exp(cβ1),
where Oddsx(Y ≤ j) denotes the odds of observing category j or smaller for
Y when the explanatory variable equals x.
114
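A numeric check, with hypothetical parameters, that the cumulative odds ratio equals exp(cβ1) at every cutpoint j:

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical proportional odds parameters for J = 4
intercepts = [-2.0, -0.5, 1.0]
beta1, x, c = 0.8, 0.5, 1.0

for b0 in intercepts:
    p_x  = expit(b0 + beta1 * x)
    p_xc = expit(b0 + beta1 * (x + c))
    odds_x  = p_x / (1 - p_x)
    odds_xc = p_xc / (1 - p_xc)
    # The ratio is exp(c * beta1), the same for every cutpoint j
    assert math.isclose(odds_xc / odds_x, math.exp(c * beta1))
print(round(math.exp(c * beta1), 3))
```

This is exactly the "proportional" property: changing x multiplies all J − 1 cumulative odds by the same factor.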
Cont’d…
• The formal interpretation of the odds ratio is:
- The odds of Y ≤ j vs. Y > j change by exp(cβ1) times for a c-unit increase in x.
Notes:
• When there is more than one explanatory variable, we will need to include a
statement like “holding the other variables in the model constant”.
• Adjustments need to be made to an odds ratio interpretation when
interactions or transformations are present in the model.
• Wald and LR-based inference methods for odds ratios can be performed as usual.
115
Cont’d…
Reading Assignment
• Models for ordinal categories (adjacent-category logits)
116
Logistic Regression - Multiple Dependent Variables
• Is it possible to list multiple dependent variables (DVs) in a single SPSS
logistic regression procedure?
• The Logistic Regression procedure does not allow you to list more than one
dependent variable, even in a syntax command.
• However, it is possible to write a short macro that loops through a list of
dependent variables.
• The list is an argument in the macro call, and the LOGISTIC REGRESSION command
is embedded in the macro.
117
* Compute a set of binary dependent variables to illustrate the macro.
do repeat y = y1 to yn.
compute y = (uniform(1) > .6).
end repeat.
exe.

define lrdef (!pos !charend('/') )
!do !i !in ( !1 )
logistic regression !i
 /method = enter v1 v2 v3 ... vn
 /contrast (v1) = indicator /contrast (v2) = indicator
 /save = pred
 /criteria = pin(.05) pout(.10) iterate(20) cut(.5) .
!doend
!enddefine.

lrdef y1 y2 y3 ... yn /.
118
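For readers working outside SPSS, the same loop-over-dependent-variables idea can be sketched in Python. Everything below is illustrative, not part of the original macro: the data are simulated, and the tiny single-predictor Newton-Raphson fitter stands in for the LOGISTIC REGRESSION procedure.

```python
import math
import random

def fit_logistic(x, y, iters=20):
    """Tiny Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x (one predictor)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += yi - p                  # score for the intercept
            g1 += (yi - p) * xi           # score for the slope
            h00 += w; h01 += w * xi; h11 += w * xi * xi  # information matrix
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det  # Newton step (2x2 solve)
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

random.seed(1)
x = [random.gauss(0, 1) for _ in range(200)]
# Several binary dependent variables, as in the SPSS macro
dvs = {"y1": [int(xi + random.gauss(0, 1) > 0) for xi in x],
       "y2": [int(random.random() > 0.6) for _ in x]}

for name, y in dvs.items():  # this loop plays the role of the macro
    b0, b1 = fit_logistic(x, y)
    print(name, round(b0, 2), round(b1, 2))
```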
Categorical data analysis.pptx

  • 2. Introduction  Analysis of discrete data or discrete data analysis (CDA) refers to methods for discrete response variables.  DDA in practice is the analysis of count response variables.  Statistical computations and analyses assume that the variables have a specific levels of measurement 2
  • 3. 3  Categorical/Discrete/Qualitative data: Measures on categorical or discrete variables consist of assigning observations to one of a number of categories in terms of counts or proportions Counts are variables representing frequency of occurrence of an event: Example: number of graduate students in the department of public health, SPHMMC.  Proportions or “bounded counts” are ratios of counts: Example: number of graduate students in the department of public health divided by the total number of graduate students, SPHMMC.
  • 4. Discretely measured responses Discretely measured responses can be: Nominal(unordered)variables:e.g., gender, ethnic background, religious or political affiliation Ordinal (ordered) variables, e.g., grade levels, income levels, school grades Discrete interval variables with only a few values, e.g., number of times married Continuous variables grouped into small number of categories, e.g., income grouped into subsets, blood pressure levels (normal, high-normal etc) 4
  • 5. What is a categorical (qualitative) variable?  A categorical variable has a measurement consisting of a set of categories. For example:  Field goal result – success or failure  Patient survival – yes or no  Criminal offense convictions – murder, robbery, assault, …  Highest attained education level –HS, BSc, MSc, PhD  Monthly income < 650, 651-1200,…., > 10,999 ETB • Religious affiliation ∴We are living in categorical world. 5
  • 6. Types of Categorical Variables 1. Ordinal –categories have an ordering  Education level (none, BSc., MSc., PhD, Professor)  Social class (upper, middle, lower)  Patient condition (good, fair, serious, critical) 2. Nominal –categories do not have an ordering  Religious affiliation (Orthodox, Catholic, Protestant, Muslim, other)  Mode of transportation to work (walk, bike, automobile, bus)  Favorite type of music (classical, country, folk, jazz, rock)  Choice of residence (apartment, condominium, house, other) 6
  • 7. Note: The way that a variable is measured determines its classification.  For example, “education” is only nominal when measured as public school or private school; it is ordinal when measured by highest degree attained, using the categories none, high school, bachelor’s, master’s, and doctorate; it is interval when measured by number of years of education, using the integers 0,1, 2, … . 7
  • 8. Where do Categorical Data occur?  Social sciences: opinions on issues  Health sciences: response to treatments/drugs  Behavioral sciences: type of mental illness  Public health: AIDS awareness  Zoology: animals food preferences  Education: students' response to exams  Marketing: consumer preferences Categorical data occur almost everywhere. 8
  • 9. Variables and types of data for CDA  Response variable(s) is categorical  Explanatory/predictor variable(s) may be categorical or continuous ;they can be of any type Discrete Distributions  Statistical inference requires assumptions about the probability distribution(i.e random mechanism, sampling model) that generated the data.  For example for a t-test, we assume that a random variable follows a normal distribution.  For discrete data key distributions are: Bernoulli, Binomial, Poisson and Multinomial. 9
  • 10. Bernoulli Probability Distributions 10  Suppose Y = 1is a success where the probability of a success is π. Also, suppose Y =0 is a failure with probability of a failure is 1−π. Bernoulli probability mass function (pmf) Notice that; P(Y=1) = π and P(Y=0) = 1 - π Since E Yi = E Y2 i = 1 ∗ π + 0 ∗ 1 − π = π E Yi = π and Var (Yi) =π(1 - π)
  • 11. The Binomial distribution • It is one of the most widely encountered discrete distributions. • The origin of binomial distribution lies in Bernoulli’s trials. • When a single trial of some experiment can result in only one of two mutually exclusive outcomes (success or failure; dead or alive; sick or well, male or female) the trail is called Bernoulli trial. • Suppose an event can have only binary outcomes A and B. Let the probability of A is π and that of B is 1 - π. The probability π stays the same each time the event occurs. 11
  • 12. • If an experiment repeated n times and the outcome is independent from one trial to another, the probability that outcome A occurs exactly y times is • • We write Y ∼ B(n, π) 12
  • 13. Characteristics of a Binomial Distribution • The experiment consist of n identical trials. • There are only two possible outcomes on each trial. • The probability of A remains the same from trial to trial. This probability is denoted by p, and the probability of B is denoted by q. Note that q=1- p. • The trials are independent. • The binomial random variable Y is the number of A’s in n trials. • n and π are the parameters of the binomial distribution. • The mean is nπ and the variance is nπ(1- π) 13
  • 14. Poisson distribution • A different kind of discrete data arise when we count the number of occurrences of an event , perhaps for different subjects or for units of time. Examples: - Daily number of new cases of breast cancer notified to a cancer registry - Number of abnormal cells in a fixed area of histological slides from a series of liver biopsies • Suppose events happen randomly and independently in time at a constant rate. - If events happen with rate λ events per unit time, the probability of y events happening in unit time is ; 𝜆 > 0 14
  • 15. Characteristics of a Poisson distribution • The Poisson distribution is very asymmetric when its mean is small • With large means it becomes nearly symmetric • It has no theoretical maximum value, but the probabilities tail off towards zero very quickly • λ is the parameter of the Poisson distribution • The mean is λ and the variance is also λ 15
  • 16. THE CHI-SQUARE DISTRIBUTION AND THE ANALYSIS OF FREQUENCIES 16
  • 17. • The chi-square distribution is the most frequently employed statistical technique for the analysis of count or frequency data. • For example, we may know for a sample of hospitalized patients how many are male and how many are female. • For the same sample we may also know how many have private insurance coverage, how many have Medicare insurance, and how many are on Medicaid assistance. • We may wish to know, for the population from which the sample was drawn, if the type of insurance coverage differs according to gender. • Chi-square analysis has solution to such type of question. • The chi-square distribution may be derived from normal distributions. 17
  • 18. Chi square distribution 1. Chi-square distribution is a nonsymmetrical distribution 2. Chi-square distributions are determined by degree of freedom 18
  • 19. Chi-square test statistic  Cannot be negative because all discrepancies are squared.  Will be zero only in the unusual event that each observed frequency exactly equals the corresponding expected frequency.  Larger the discrepancy between the expected frequencies and their corresponding observed frequencies, the larger the observed value of chi-square. 19
  • 20. Types of Chi-Square Tests i. Tests of goodness-of-fit ii. Tests of independence iii. Tests of homogeneity 20
  • 21. Tests of goodness-of-fit • All of the chi-square tests that we employ may be thought of as goodness-of- fit tests. • The phrase “goodness-of-fit” for use in a more restricted sense. • We use it to refer to a comparison of a sample distribution to some theoretical distribution that it is assumed describes the population from which the sample came. • Karl Pearson, who showed that the chi-square distribution may be used as a test of the agreement between observation and hypothesis whenever the data are in the form of frequencies. 21
  • 22. Observed versus Expected Frequencies • Observed frequencies: are the number of subjects or objects in our sample that fall into the various categories of the variable of interest. For example: if we have a sample of 100 hospital patients, we may observe that 50 are married, 30 are single, 15 are widowed, and 5 are divorced. • Expected frequencies: are the number of subjects or objects in our sample that we would expect to observe if some null hypothesis about the variable is true. For example, our null hypothesis might be that the four categories of marital status are equally represented in the population from which we drew our sample. In that case we would expect our sample to contain 25 married, 25 single, 25 widowed, and 25 divorced patients. 22
  • 23. Chi-Square Test Statistic • The test statistic is 𝜒2 = Σ (Oi − Ei)2/Ei, summed over the categories. • When the null hypothesis is true, this statistic is distributed approximately as 𝜒2 with k − r degrees of freedom. Where: k is the number of groups for which observed and expected frequencies are available r is the number of restrictions or constraints imposed on the given comparison Oi is the observed frequency for the ith category of the variable of interest Ei is the expected frequency for the ith category • This test is a right-tailed test, since when the O − E values are squared, the result will be positive or zero. 23
  • 24. Two assumptions are needed for the goodness-of-fit test. 1. The data are obtained from a random sample. 2. The expected frequency for each category must be 5 or more. - The steps for the chi-square goodness-of-fit test are summarized in this procedure Table. 24
  • 25. • When there is perfect agreement between the observed and the expected values, χ2= 0. Also, 𝜒2 can never be negative. • Finally, the test is right-tailed because “H0: Good fit” and “H1: Not a good fit” mean that 𝜒2 will be small in the first case and large in the second case. For example, suppose as a market analyst you wished to see whether consumers have any preference among five flavors of a new fruit soda. A sample of 100 people provided these data: If there were no preference, you would expect each flavor to be selected with equal frequency, i.e. 100/5 =20. 25
  • 26. - Is there enough evidence to reject the claim that there is no preference in the selection of fruit soda flavors? Let 𝛼 = 0.05. Solution - Step 1: State the hypotheses and identify the claim. H0 : Consumers show no preference for flavors (claim). H1: Consumers show a preference. - Step 2: Find the critical value. The degrees of freedom are 5 -1= 4, and 𝛼 = 0.05. Hence, the critical value from Chi-square table is 9.488. - Step 3: Compute the test value by subtracting the expected value from the corresponding observed value, squaring the result and dividing by the expected value, and finding the sum. The expected value for each category is 20. 26
  • 27. Step 4: Make the decision. The decision is to reject the null hypothesis, since 18.0 > 9.488 Step 5: Summarize the results. There is enough evidence to reject the claim that consumers show no preference for the flavors. 27
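The computation in Steps 3 and 4 can be sketched in Python. The slide's observed table is not reproduced above, so the counts below (32, 28, 16, 14, 10) are illustrative values chosen to be consistent with the quoted test value of 18.0.

```python
def chi_square_gof(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Five flavors, n = 100; equal preference implies E = 100/5 = 20 per flavor.
# Observed counts are illustrative, chosen to match the slide's value of 18.0.
observed = [32, 28, 16, 14, 10]
expected = [20] * 5
chi2 = chi_square_gof(observed, expected)   # 18.0
# df = 5 - 1 = 4; 18.0 > 9.488 (the 0.05 critical value), so reject H0.
```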
  • 28. • The chi-square goodness-of-fit test can be used to test a variable to see if it is normally distributed. The hypotheses are H0: The variable is normally distributed. H1: The variable is not normally distributed. 28
  • 29. TESTS OF INDEPENDENCE • The chi-square independence test can be used to test the independence of two variables. • To test the null hypothesis by using the chi-square independence test, expected frequencies must be computed • When data are arranged in table form for the chi-square independence test, the table is called a contingency table. • The table is made up of R rows and C columns. • The degrees of freedom for any contingency table are (rows − 1) times (columns − 1); that is, d.f. = (R − 1)(C − 1). • The reason for this formula for d.f. is that all the expected values except one are free to vary in each row and in each column. 29
  • 30. • For example, suppose a new postoperative procedure is administered to a number of patients in a large hospital. • The researcher can ask the question, do the doctors feel differently about this procedure from the nurses, or do they feel basically the same way? • Note that the question is not whether they prefer the procedure but whether there is a difference of opinion between the two groups. • To answer this question, a researcher selects a sample of nurses and doctors and tabulates the data in table form, as shown. 30
  • 31. H0:The opinion about the procedure is independent of the profession. H1: The opinion about the procedure is dependent on the profession • The degree of freedom for this case is (2-1)(3-1)= (1)(2) =2 31
  • 32. 32
  • 33. 33
  • 34. • The final steps are to make the decision and summarize the results. • This test is always a right-tailed test, and the degrees of freedom are (R-1)(C-1) (2-1)(3-1)=2. • If 𝛼=0.05,the critical value from Chi-square table is 5.991. Hence, the decision is to reject the null hypothesis, since 26.67 > 5.991 • The conclusion is that there is enough evidence to support the claim that opinion is related to (dependent on) profession—that is, that the doctors and nurses differ in their opinions about the procedure. 34
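A sketch of the expected-frequency and chi-square computation for a contingency table. The doctor/nurse counts from the slides are not shown above, so the 2 × 3 table below is hypothetical and does not reproduce the 26.67 figure.

```python
def chi_square_independence(table):
    """Chi-square statistic and df for an R x C contingency table.

    Expected frequency for cell (i, j) = row_total_i * col_total_j / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical 2 x 3 opinion table (rows: nurses, doctors; columns:
# prefer / do not prefer / no opinion) -- not the slides' actual counts.
chi2, df = chi_square_independence([[20, 30, 50],
                                    [30, 30, 40]])
# df = (2 - 1)(3 - 1) = 2; compare chi2 to the 5.991 critical value.
```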
  • 35. The 2 X 2 Contingency Table • Sometimes each of two criteria of classification may be broken down into only two categories, or levels. • When data are cross classified in this manner, the result is a contingency table consisting of two rows and two columns. • Such a table is commonly referred to as a 2X2 table. • In the case of a 2X2 contingency table, however, 𝜒2 may be calculated by the following shortcut formula: 𝜒2 = n(ad − bc)2 / [(a + b)(c + d)(a + c)(b + d)] 35
  • 36. • Where a, b, c, and d are the observed cell frequencies as shown in the following table. • When we apply the (r-1)(c-1) rule for finding degrees of freedom to a 2X2 table, the result is 1 degree of freedom. A 2X2 Contingency Table 36
  • 37. Example: • According to the findings of a study by Silver and Aiello, falls are of major concern among polio survivors. • Researchers wanted to determine the impact of a fall on lifestyle changes. • The following table shows the results of a study of 233 polio survivors on whether fear of falling resulted in lifestyle changes. 37
  • 38. • Solution: 1. Data. From the information given we may construct the 2X2 contingency table 2. Assumptions. We assume that the sample is equivalent to a simple random sample. 3. Hypotheses. H0: Fall status and lifestyle change because of fear of falling are independent. H1: The two variables are not independent. Let 𝛼 = .05 4. Test statistic. The test statistic is the 2X2 shortcut chi-square. Answer: 𝜒2 calc = 31.74 38
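The shortcut formula can be sketched as below. The study's cell counts are not reproduced on the slide; the values a = 131, b = 52, c = 14, d = 36 are illustrative counts chosen to be consistent with the quoted 𝜒2 of 31.74 for n = 233.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for a 2x2 table with cells a, b / c, d."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Cell counts chosen to reproduce the slide's chi-square of 31.74 for the
# 233 polio survivors (the actual table is not shown above).
chi2 = chi_square_2x2(131, 52, 14, 36)
# df = 1; 31.74 far exceeds the 3.841 critical value, so reject independence.
```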
  • 39. Small Expected Frequencies • The problems of how to handle small expected frequencies and small total sample sizes may arise in the analysis of 2X2 contingency tables. • Cochran suggests that the 𝜒2 test should not be used if n < 20, or if 20 < n < 40 and any expected frequency is less than 5. • When n ≥ 40, an expected cell frequency as small as 1 can be tolerated. 39
  • 40. • Yates’s Correction • The observed frequencies in a contingency table are discrete and thereby give rise to a discrete statistic, 𝜒2 , which is approximated by the 𝜒2 distribution, which is continuous. • Yates proposed a procedure for correcting for this in the case of 2X2 tables. • No correction is necessary for larger contingency tables 40
  • 41. Choice of test statistic for a 2X2 table: – If n ≥ 40 and every E ≥ 5, use the uncorrected chi-square: 𝜒2 = Σ (O − E)2/E, or equivalently 𝜒2 = n(ad − bc)2 / [(a + b)(c + d)(a + c)(b + d)]. – If n ≥ 40 and 1 ≤ E < 5, use Yates’s corrected chi-square: 𝜒2 = Σ (|O − E| − 0.5)2/E, or equivalently 𝜒2 = n(|ad − bc| − n/2)2 / [(a + b)(c + d)(a + c)(b + d)]. – If n < 40 or E < 1, use Fisher’s exact probability: P = (a + b)! (c + d)! (a + c)! (b + d)! / [n! a! b! c! d!]. 41
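A sketch of the corrected and exact alternatives for small expected frequencies. The function names are ours, and the Fisher formula gives the probability of a single table with the observed margins (a full exact test sums such probabilities over equally or more extreme tables).

```python
import math

def yates_chi_square_2x2(a, b, c, d):
    """Yates-corrected chi-square for a 2x2 table (for n >= 40, 1 <= E < 5)."""
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    return num / ((a + b) * (c + d) * (a + c) * (b + d))

def fisher_table_prob(a, b, c, d):
    """Exact probability of one 2x2 table with fixed margins (n < 40 or E < 1)."""
    n = a + b + c + d
    f = math.factorial
    return (f(a + b) * f(c + d) * f(a + c) * f(b + d)) / (
        f(n) * f(a) * f(b) * f(c) * f(d))

# The Yates correction shrinks |ad - bc| by n/2, so the corrected statistic
# is always smaller than the uncorrected one.
```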
  • 43. Logistic Regression • Much research in the health sciences is motivated by a desire to understand and describe the relationship between independent variables and categorical dependent variable. • Particularly plentiful are circumstances in which the outcome variable is dichotomous (a variable that can assume only one of two mutually exclusive values). • These values are usually coded as Y=1 for a success and Y=0 for a failure 43
  • 44. • Logistic regression is the type of regression analysis that is usually employed when the dependent variable is categorical. • There can be many predictor variables (x’s) that could be categorical or continuous. 44
  • 45. Types of Logistic Regression • Binary logistic regression: a regression analysis used to model outcome variable with two categories • Multinomial logistic regression: a regression analysis used to model outcome variable of nominal scale with more than two categories • Ordinal Logistic regression: a regression analysis used to model outcome variable of ordinal scale with more than two categories 45
  • 46. Linear vs. Logistic Regression • What distinguishes logistic regression from linear regression model is that the type of outcome variable. • Linear regression: Outcome variable y is continuous • Logistic regression: Outcome variable y is categorical • The question a researcher need ask when choosing a regression method is: o What does my outcome look like? 46
  • 47. • The difference is reflected both in o the choice of a parametric model and o the assumptions. • However, the methods employed in an analysis using logistic regression follow the same general principles used in linear regression. • Why not use a linear regression model for categorical outcome variables? o Because a categorical outcome variable violates the linearity assumption of linear regression. o The error terms are heteroskedastic and not normally distributed, because Y takes on only two values (0 and 1). 47
  • 48. • The predicted probabilities can be greater than 1 or less than 0, which can be a problem if the predicted values are used in a subsequent analysis. • Some people try to solve this problem by setting predicted probabilities greater than 1 to 1 and those less than 0 to 0. • This amounts to interpreting a high probability of the event (nonevent) occurring as a sure thing. 48
  • 49. Objectives of Logistic Regression • Estimating magnitude of outcome/exposure relationship oTo evaluate the association of a binary outcome with a set of predictors • Prediction oDevelop an equation to determine the probability or likelihood that individual has the condition (y = 1) that depends on the independent variables (the x’s) 49
  • 50. Assumptions of Logistic regression • The outcome must be categorical • Requires enough responses in each category of a given variable • Outcome groups must be mutually exclusive; complete separation of the outcome by a predictor makes maximum likelihood estimation impossible • There is no assumption that the predictors are linearly related to the outcome • There should be no multicollinearity among predictors • There should be no extreme outliers or influential observations • Independence of errors: assumes a between-subjects design. 50
  • 51. Logistic Regression Model • The probability of the outcome is measured by the odds of occurrence of an event. • If P is the probability of an event, then (1 − P) is the probability of it not occurring. o Odds of event = P/(1 − P) • In linear regression the estimates of effect are directly quantified by the mean value of the response variable • In logistic regression the estimates of effect are instead quantified by “odds ratios” 51
  • 53. 53
  • 54. • The logistic model gives the probability as p = e^(𝛽0 + 𝛽1x)/(1 + e^(𝛽0 + 𝛽1x)) • Taking the logarithms of both sides, it can be transformed as follows: ln(p/(1 − p)) = 𝛽0 + 𝛽1x • Sometimes written as: logit(p) = 𝛽0 + 𝛽1x • Where ln (or log) is the natural logarithm (base e) 54
  • 55. Cont’d… Logistic Vs. Linear Regression Equation Logistic Regression: ln(p/(1 − p)) = 𝛽0 + 𝛽1x Linear Regression: y = 𝛽0 + 𝛽1x + 𝜀 • The other difference between linear and logistic regression models concerns the conditional distribution of the error. 55
  • 56. Cont’d… • In the linear regression model we assume that an observation of the outcome variable may be expressed as y = E(𝒀𝒊|𝑿𝒊) + 𝜺. • The error (𝜺) is an observation's deviation from the conditional mean of y. • The errors 𝜺 are normally distributed with mean 0 and constant variance 𝜎2 (Equal variance). That is: 𝜀 ~N(0, 𝜎2 ) 56
  • 57. Cont’d… • With a dichotomous outcome variable the conditional distribution of error term is different. • In this situation we may express the value of the outcome variable given x as y = P(x)+ 𝜺. • Here the quantity 𝜺 may assume one of two possible values. o If y = 1 then 𝜺 =1-P(x) o If y = 0 then 𝜺 = -P(x) • Thus, 𝜺 are distributed with mean zero and variance equal to P(x)[1-P(x)]. • The conditional distribution of the outcome variable follows a binomial distribution with probability given by the conditional mean, P(x). 57
  • 59. Why log transformation? • The odds has a range of 0 to ∞ • Odds > 1 associated with an event being more likely to occur than to not occur • Odds <1 associated with an event that is less likely to occur than not occur • Transformation is useful because it creates variable ranges from - ∞ to +∞ • Hence, it solves the problem we encountered in fitting a linear model to probabilities 59
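The transformation chain probability → odds → log odds can be sketched as follows; the helper names are ours.

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p), range (0, +inf)."""
    return p / (1 - p)

def logit(p):
    """Log odds: maps probabilities in (0, 1) onto the whole real line."""
    return math.log(odds(p))

def expit(z):
    """Inverse logit: maps any real number back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# p = 0.5 gives odds 1 and log odds 0; p > 0.5 gives positive log odds,
# p < 0.5 gives negative log odds -- the range problem disappears.
```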
  • 60. Estimating Logistic regression Simple Logistic regression • The logistic model with a single independent variable X, where the effects of other variables are uncontrolled. Multiple Logistic regression o The logistic model with a single predictor variable X can be extended to two or more predictor variables. 60
  • 61. Interpretation of slope • 𝛽1 is the estimated change in the log odds of the outcome for a one unit increase in 𝑥1 • It estimates the log odds ratio for comparing two groups of observations • This estimated slope can be exponentiated to get the corresponding estimated odds ratio. What about the Intercept? • The intercept is mathematically necessary to specify the entire equation. 61
  • 62. Maximum Likelihood Estimation • The method used to estimate the regression coefficients in logistic regression is called Maximum Likelihood Estimation (MLE) • Ordinary least square(OLS) is method used to estimate the regression coefficients in linear regression • MLE yields values for the unknown parameters which maximize the probability of obtaining the observed set of data. 62
  • 63. Cont’d… • Basically, the resulting estimates of the slope and intercept are the values that make the observed data most likely among all choices of values for 𝛽0and 𝛽1. • Along with the estimates of 𝛽0and 𝛽1this method yields estimates of the standard error for each: that can be used to create confidence intervals and do hypothesis tests 63
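MLE for logistic regression has no closed form, so software iterates to the maximum. Below is a minimal Newton-Raphson sketch for the one-predictor model, using made-up data; real software adds safeguards (step halving, convergence checks) that are omitted here.

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Maximum likelihood fit of ln(p/(1-p)) = b0 + b1*x via Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0          # score vector (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0  # observed information matrix entries
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det  # Newton step: beta += H^-1 * score
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Illustrative (made-up) data: binary outcome y and a single predictor x.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)  # b1 > 0: odds of y = 1 rise with x
```

The square roots of the diagonal of the inverted information matrix give the standard errors mentioned on the slide.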
  • 64. Test of Significance of Coefficients • The fitted relationship, i.e. the estimated values of 𝛽0 and 𝛽1, may simply be the result of chance phenomena. • We need to test whether or not the sample data set exhibits sufficient evidence to indicate that X actually contributes significantly to the prediction of the log odds of Y for a given value of X • The test statistic is: z = 𝛽1/SE(𝛽1), which is compared to the standard normal distribution 64
  • 65. Example: Coronary Heart Disease (CHD) and Age: In this study sampled individuals were examined for signs of CHD (present = 1 / absent = 0) and the potential relationship between this outcome and their age (yrs.) was considered. 65
  • 66. • For the CHD-age data set, we could try to estimate the following model: ln(p/(1 − p)) = 𝛽0 + 𝛽1𝑥1 • p = probability of CHD evidence (proportion of persons with CHD evidence), 𝑥1 = age • 𝛽0 and 𝛽1 are called regression coefficients • Another way to write the above equation is in terms of odds: p/(1 − p) = e^(𝛽0 + 𝛽1𝑥1) • Recall, the higher the odds of an event, the larger the probability of the event • A predictor 𝑥1 that is positively associated with the odds will also be positively associated with the probability of the event (i.e. the estimated slope 𝛽1 will be positive) 66
  • 67. • A predictor 𝑥1 that is negatively associated with the odds will also be negatively associated with the probability of the event (i.e. the estimated slope 𝛽1 will be negative) • Results from logistic regression of log odds of CHD evidence on age: • The resulting equation: ln(𝑝/(1 − 𝑝)) = −5.34 + 0.11 × Age 67
  • 68. Cont’d.. • Where p is the estimated probability of CHD amongst persons of a given age • The estimated coefficient (𝛽1) of age (𝑥1) is positive; hence we have o Estimated a positive association between age and log odds of CHD o Estimated a positive association between age and probability of CHD • How can we actually interpret the value 0.11? • Let’s write out the equation comparing two groups of individuals who differ in age by one year: • Group 1, age = k years; Group 2, age = k + 1 years 68
  • 69. Cont’d… • The resulting equations estimating the ln odds of CHD evidence in each age group • Multiplying out, and taking the difference (subtracting) 69
  • 70. Cont’d… • So, when the dust settles: ln(odds at age k + 1) − ln(odds at age k) = 𝛽1 • Reversing one of the famous properties of logarithms: ln(odds ratio) = 𝛽1 • So 𝛽1, the estimated slope for 𝑥1, is the natural log of an estimated odds ratio • To get the estimated odds ratio, exponentiate 𝛽1: 𝑂𝑅 = e^𝛽1 70
  • 71. Cont’d… • In our example, recall 𝛽1 = 0.11 • Here, 𝑂𝑅 = e^𝛽1 = e^0.11 = 1.116 • The estimated odds ratio of CHD evidence for a one year age difference is 1.116, older compared to younger. o 60 year olds compared to 59 year olds o 45 year olds compared to 44 year olds 71
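Under the slides' fitted model, the odds ratio and predicted probabilities can be computed directly; note that for a c-year age difference the odds ratio is e^(c × 0.11), not c times 1.116.

```python
import math

# Fitted model from the slides: ln(p/(1 - p)) = -5.34 + 0.11 * age
b0, b1 = -5.34, 0.11

odds_ratio = math.exp(b1)        # ~1.116 per one-year age difference
or_10_years = math.exp(10 * b1)  # ~3.00 for a ten-year age difference

def prob_chd(age):
    """Estimated probability of CHD evidence at a given age."""
    return 1 / (1 + math.exp(-(b0 + b1 * age)))

# prob_chd(40) < prob_chd(60): probability rises with age, since b1 > 0.
```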
  • 72. Interpretation of slope • Change in the log odds of CHD for a one year increase in age • One group with 𝑥1 one unit higher than the other The Intercept? • The resulting equation: ln(𝑝/(1 − 𝑝)) = −5.34 + 0.11 × Age • Here, the intercept estimate 𝛽0 is the estimated ln odds of CHD evidence for persons of age 0 72
  • 73. Test of Significance of Coefficients • Hypothesis: H0: 𝛽1 = 0 vs. H1: 𝛽1 ≠ 0 • Assume the null is true, and calculate the standardized “distance” of 𝛽1 from 0: z = 0.11/0.03 = 3.67 • The p-value is the probability of being 3.67 or more standard errors away from 0 on a normal curve: very low in this example, p < 0.001 73
  • 74. Multiple Logistic Regression • Multiple logistic regression allows us to model the relationships of several independent variables to a response variable. • These independent variables may be either continuous or discrete or a combination of the two • We can also estimate the association between each predictor and Pr(y = 1) controlling for all other predictors • In the previous example we found a statistically significant positive association between CHD and age = -5.34+0.11XAge 74
  • 75. Cont’d… • Smoking status of study participants was also included in the model to assess if it do have a relationship with CHD • What if smoking is also associated with age? • Age could be a confounder of the smoking and CHD relationship (and vice- versa) • Can we estimate the age adjusted relationship between CHD and smoking? • Even if smoking and age not related, and hence there is no confounding, both predictors may tell more about CHD evidence than either alone. 75
  • 76. Cont’d… • Here, we need a logistic regression model with 2 predictors (𝑋s): ln(p/(1 − p)) = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 • Where p = Pr(CHD evidence), 𝑋1 = age, 𝑋2 = smoking status (1 = yes) • How would we interpret the coefficients from a multiple logistic regression? And the resulting odds ratio estimates? 76
  • 77. Cont’d… • 𝛽1is the estimated regression coefficient associated with age: • It estimates the ln odds ratio for comparing two individuals (groups) who differ by one year in age and are either both smokers or non-smokers • 𝛽1is the estimated smoking-adjusted log odds ratio for age • Just to demonstrate: Write out 2 equations for two groups of persons who differ by one year in age and are all smokers 77
  • 79. Cont’d… • 𝑿𝟏 is the age variable • 𝛽1 is the estimated adjusted ln OR of CHD associated with age, after adjusting for smoking status • 𝑒𝛽1 is the estimated adjusted OR of CHD associated with age, after adjusting for smoking status • This 𝑂𝑅 compares two groups of individuals of the same smoking status but who differ by one year in age (older to younger) 79
  • 80. Cont’d… • 𝑿𝟐 is the smoking variable • 𝛽2 is the estimated regression coefficient associated with smoking: • It estimates the ln odds ratio for comparing two groups of individuals of the same age, where one group is smokers and the other is non-smokers • 𝑒𝛽2 estimates the odds ratio for comparing two groups of individuals of the same age, where one group is smokers and the other is non-smokers 80
  • 81. Inference in Multiple Logistic Regression • We can estimate each regression coefficients and ORs by constructing a range of plausible values i.e. CIs • We can also test the statistical significance of regression coefficients and ORs using magnitude of test statistics or corresponding p-values or CI • Each coefficient estimate has its own associated standard error • Approach very similar to approach from simple logistic regression 81
  • 83. Model Development • The approach to model development in multiple logistic regression analysis is similar to the approach in normal theory multiple linear regression. • Models are compared to assess the statistical significance of the extra predictors in the larger model, controlling for the predictors in the smaller model. • This is done using the likelihood ratio test. • If the likelihood ratio statistic is significant, we say that the added variables are significant in adjusted analysis. 83
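A sketch of the likelihood ratio comparison for one added parameter (df = 1), using the identity P(𝜒2 with 1 df > g) = erfc(√(g/2)); the log-likelihood values below are hypothetical.

```python
import math

def lr_test_pvalue_df1(ll_reduced, ll_full):
    """Likelihood ratio statistic G = 2(ll_full - ll_reduced) and its p-value
    for one extra parameter (df = 1), via P(chi2_1 > g) = erfc(sqrt(g / 2))."""
    g = 2.0 * (ll_full - ll_reduced)
    return g, math.erfc(math.sqrt(g / 2.0))

# Hypothetical maximized log-likelihoods for a smaller and a larger model:
g, p = lr_test_pvalue_df1(ll_reduced=-120.9, ll_full=-119.0)
# g = 3.8; p is just above 0.05, so the added predictor is borderline here.
```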
  • 85. 85
  • 86. 86
  • 87. 87
  • 88. 88
  • 89. 89
  • 92. 92
  • 93. 93
  • 94. 94
  • 95. 95
  • 96. 96
  • 97. 97
  • 98. 98
  • 99. 99
  • 101. Multicategory response • The binary logistic regression provided analysis methods when there were binary responses. • What about more than two response categories? Examples: • Canadian political party affiliation – Conservative, New Democratic, Liberal • Chemical compounds in drug discovery experiments – Positive, blocker, or neither • Five-level Likert scale – Strongly disagree, disagree, neutral, agree, or strongly agree. 101
  • 102. Cont’d… • For these examples, some responses are ordinal (e.g., Likert scale) and some are not (e.g., chemical compounds). • We will investigate both nominal (unordered) and ordinal multicategory responses. Multinomial Probability Distribution • The multinomial probability distribution is the extension of the binomial distribution to situations where there are more than two categories for a response. • The probability mass function for observing particular values of n1, …, nc is P(N1 = n1, …, Nc = nc) = [n!/(n1! ⋯ nc!)] 𝜋1^n1 ⋯ 𝜋c^nc 102
  • 103. Where • Y denotes the response category, with levels j = 1, …, c • Each category has probability 𝜋j = P(Y = j) • n denotes the number of trials • nj denotes the response count for category j 103
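The pmf can be evaluated directly; with c = 2 categories it reduces to the binomial distribution.

```python
import math

def multinomial_pmf(counts, probs):
    """P(N1 = n1, ..., Nc = nc) for n = sum(counts) trials with category
    probabilities probs: n! / (n1! ... nc!) * prod(p_j ** n_j)."""
    n = sum(counts)
    coef = math.factorial(n)
    for k in counts:
        coef //= math.factorial(k)  # exact integer multinomial coefficient
    p = 1.0
    for k, pj in zip(counts, probs):
        p *= pj ** k
    return coef * p

# With c = 2 this is just the binomial pmf C(n, n1) * p1^n1 * p2^n2.
```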
  • 104. NOMINAL RESPONSES: BASELINE-CATEGORY LOGIT MODELS • Multinomial logistic regression is an extension of the (binary) logistic regression model when the categorical response variable has more than two levels. • One possible way to handle such situations is to split the categorical response variable and apply binary logistic regression to each dichotomous variable. • However, this will result in several different analyses for only one categorical response. • A more structured approach is to formulate one model for the categorical response by means of so-called generalized logits. 104
  • 105. Cont’d… • Suppose there are J categories for the response variable with corresponding probabilities 𝜋1, 𝜋2, …, 𝜋J. • Using the first category as a “baseline”, we can form “baseline-category logits” as log(𝜋j/𝜋1) for j = 2, …, J, which are simply log odds. • When J = 2, we have log(𝜋2/𝜋1) = log(𝜋2/(1 − 𝜋2)), which is equivalent to log(𝜋/(1 − 𝜋)) in binary logistic regression with 𝜋 = 𝜋2. • When there is only one explanatory variable x, we can form the multinomial logistic regression model log(𝜋j/𝜋1) = 𝛽j0 + 𝛽j1x for j = 2, …, J 105
  • 106. Cont’d… • One can easily compare other categories so that category 1 is not always used. • For example, suppose you would like to compare category 2 to 3 for J ≥ 3. Then log(𝜋2/𝜋3) = log(𝜋2/𝜋1) − log(𝜋3/𝜋1) = (𝛽20 − 𝛽30) + (𝛽21 − 𝛽31)x • For more than one explanatory variable, the model becomes: log(𝜋j/𝜋1) = 𝛽j0 + 𝛽j1x1 + ⋯ + 𝛽jp xp 106
  • 107. Odds ratios • Because the log odds are being modeled directly in a multinomial regression model, odds ratios are useful for interpreting an explanatory variable's relationship with the response. • Consider again the model log(𝜋j/𝜋1) = 𝛽j0 + 𝛽j1x • The odds of a category j response vs. a category 1 response are exp(𝛽j0 + 𝛽j1x). This directly leads to using odds ratios as a way to understand the explanatory variable in the model. 107
  • 108. Cont’d… • Thus, the odds of a category j vs. a category 1 response change by e^(c𝛽j1) times for every c-unit increase in x. • In a similar manner, we could also compare category j to j′ (j ≠ j′, j > 1, j′ > 1): the odds change by e^(c(𝛽j1 − 𝛽j′1)) times for every c-unit increase in x. • Wald and LR-based inference methods for odds ratios are performed as usual. 108
  • 109. Ordinal response models • Suppose that the response categories are ordered in the following way: category 1 < category 2 <….< category J • For example, a response variable may be measured using a Likert scale with categories strongly disagree, disagree, neutral, agree, or strongly agree. • Logit transformations of the probabilities can incorporate these orderings in a variety of ways. • In this section, we focus on one way where probabilities are cumulated based on these orderings. 109
  • 110. Cont’d… • The cumulative probability for Y is P(Y ≤ j) = 𝜋1 + … + 𝜋j for j = 1, …, J. • Note that: P(Y ≤ J) = 1. • The logit of the cumulative probabilities can be written as logit(P(Y ≤ j)) = log[P(Y ≤ j)/(1 − P(Y ≤ j))] for j = 1, …, J − 1. For each j, we are computing the log odds of being in categories 1 through j vs. categories j + 1 through J. 110
  • 111. Cont’d… • When there is only one explanatory variable x, we can allow the log odds to vary by using a proportional odds model: logit(P(Y ≤ j)) = 𝛽j0 + 𝛽1x for j = 1, …, J − 1. • The proportional odds name comes from there being no j subscript on the 𝛽1 parameter, which means this parameter is the same for each possible log odds that can be formed. This leads to each odds being a multiple of exp(𝛽j0). 111
  • 112. Cont’d… • Notes: • 𝛽10 < 𝛽20 < ⋯ < 𝛽J−1,0 due to the cumulative probabilities. Thus, the odds become increasingly larger for j = 1, …, J − 1. • A proportional odds model actually is a special case of a cumulative probability model, which allows the parameter coefficient on each explanatory variable to vary as a function of j. 112
  • 113. Cont’d… • For more than one explanatory variable, the model becomes: logit(P(Y ≤ j)) = 𝛽j0 + 𝛽1x1 + ⋯ + 𝛽pxp • Consider the case of one explanatory variable x again: logit(P(Y ≤ j)) = 𝛽j0 + 𝛽1x 113
  • 114. Odds ratio • Odds ratios are easily formed because the proportional odds model equates log odds to the linear predictor. • The main difference now is that the odds involve cumulative probabilities. • Consider again the model logit(P(Y ≤ j)) = 𝛽j0 + 𝛽1x • The odds ratio for a c-unit increase in x is 𝑂𝑅 = Odds_(x+c)(Y ≤ j)/Odds_x(Y ≤ j) = exp(c𝛽1), where Odds_x(Y ≤ j) denotes the odds of observing category j or smaller for Y. 114
  • 115. Cont’d… • The formal interpretation of the odds ratio is: - The odds of Y ≤ j vs. Y > j change by exp(c𝛽1) times for a c-unit increase in x. Notes: • When there is more than one explanatory variable, we will need to include a statement like “holding the other variables in the model constant”. • Adjustments need to be made to an odds ratio interpretation when interactions or transformations are present in the model. • Wald and LR-based inference methods for odds ratios are performed as usual 115
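The proportional odds property can be illustrated numerically: with hypothetical intercepts and a common slope b, the cumulative odds ratio for a c-unit increase in x is exp(c·b), identically for every j.

```python
import math

# Hypothetical proportional odds model with J = 4 ordered categories:
# logit P(Y <= j) = a[j] + b * x for j = 1, 2, 3 (a must be increasing).
a = [-1.0, 0.5, 1.5]
b = 0.8

def cum_odds(j, x):
    """Odds of Y <= j (vs. Y > j) at covariate value x, for j = 0, 1, 2."""
    return math.exp(a[j] + b * x)

# For a c-unit increase in x the cumulative odds change by exp(c * b),
# and the ratio is the same for every j -- the "proportional odds" property.
c = 2.0
ratios = [cum_odds(j, 1.0 + c) / cum_odds(j, 1.0) for j in range(3)]
```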
  • 116. Cont’d… Reading Assignment • Model for ordinal categories (adjacent category) 116
  • 117. Logistic Regression - Multiple Dependent Variables • Is it possible to list multiple dependent variables (DVs) in a single SPSS logistic regression procedure? • The Logistic Regression procedure does not allow you to list more than one dependent variable, even in a syntax command. • However, it is possible to write a short macro that loops through a list of dependent variables. • The list is an argument in the macro call, and the Logistic Regression command is embedded in the macro. 117
  • 118.
* Compute a set of binary dependent variables to illustrate the macro.
do repeat y = y1 to yn.
compute y = (uniform(1) > .6).
end repeat.
exe.
define lrdef (!pos !charend('/'))
!do !i !in (!1)
logistic regression !i
  /method = enter v1 v2 v3 .. vn
  /contrast (v1) = indicator
  /contrast (v2) = indicator
  /save = pred
  /criteria = pin(.05) pout(.10) iterate(20) cut(.5).
!doend
!enddefine.
lrdef y1 y2 y3 … yn /.
118