Categorical Models: An Introduction to Contingency Tables, Logistic Regression and Generalized Linear Models

9/13/2010
Categorical Models
Presented by: Jeff Skinner, M.S.
Biostatistics Specialist
Bioinformatics and Computational Biosciences Branch
National Institute of Allergy and Infectious Diseases
Office of Cyber Infrastructure and Computational Biology
Introduction
Many biological experiments include categorical response
variables, which need to be analyzed with unfamiliar tests
• Simple contingency table methods
– Pearson vs. Fisher tests, odds ratios & relative risks, sensitivity & specificity
– M x N tables, McNemar’s test for paired data, MHC tests for confounding
• Logistic regression methods
– Odds ratios, estimating LD50, Wald and Likelihood Ratio Tests, …
• Generalized linear model (GLIM) methods
– Choosing distribution and link functions, overdispersion statistics, ...

Contingency Tables
• Used to display relationships among categorical variables
– Responses in the columns
– Predictors in the rows
• Statistical significance tested using Pearson chi-square or
Fisher’s exact tests
• Results interpreted using an odds ratio or relative risk

                              Pregnant?
                              Yes    No    Row Totals
Pregnancy Test?   Positive    27      3    30
                  Negative     4     26    30
Column Totals                 31     29    60
Pearson’s Chi‐Squared Test
• Pearson’s chi-square test assumes that columns and
rows are independent
– Computation of expected values (Exp_ij) assumes independence
• Chi-square tests require large sample sizes with no empty
cells & few small cell counts
• P-values computed from the chi-square distribution

                              Pregnant?
                              Yes      No
Pregnancy Test?   Positive    Obs11    Obs12    R1.
                  Negative    Obs21    Obs22    R2.
                              C.1      C.2      N..
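A minimal Python sketch of the Pearson computation for the 2 x 2 pregnancy table, using only the standard library. The function name is illustrative. Expected counts are taken as Exp_ij = R_i. × C_.j / N.., and the 1-df p-value uses the identity P(X² > x) = erfc(√(x/2)). The Yates continuity correction is an assumption about how the statistic reported in the review slide was computed; the slides do not say.

```python
import math

def pearson_chi_square_2x2(table, yates=False):
    """Pearson chi-square statistic and 1-df p-value for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(table[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)  # continuity correction (assumed)
            stat += diff ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Rows: test positive / negative. Columns: pregnant yes / no.
observed = [[27, 3], [4, 26]]
plain_stat, plain_p = pearson_chi_square_2x2(observed)
yates_stat, yates_p = pearson_chi_square_2x2(observed, yates=True)
print(f"uncorrected: X2 = {plain_stat:.4f}, p = {plain_p:.3e}")
print(f"with Yates:  X2 = {yates_stat:.4f}, p = {yates_p:.3e}")
```

The closed-form p-value only applies to 1 degree of freedom, so this sketch is limited to 2 x 2 tables.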

Fisher’s Exact Test
• Also tests the independence of columns and rows
• Fisher’s test is valid for all sample sizes and cell counts
• Fisher’s test assumes column and row totals are fixed
– Fisher’s exact test may be inappropriate for some tables
• Probability of each table computed using the hypergeometric distribution:
P(table) = C(a+b, a) × C(c+d, c) / C(n, a+c), where C(m, k) is "m choose k"
• P-value is the total probability of the observed table and every table
at least as extreme, among all possible tables with the same row and
column totals and sample size n = a + b + c + d

                              Pregnant?
                              Yes    No
Pregnancy Test?   Positive    a      b      a+b
                  Negative    c      d      c+d
                              a+c    b+d    n
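The hypergeometric calculation can be sketched in a few lines of Python. This is an illustration, not the method of any particular package: the function name is made up, and the two-sided rule used here (sum every table that is no more probable than the observed one) is one common convention that is assumed rather than taken from the slides.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def weight(x):
        # Hypergeometric numerator for a table whose top-left cell is x
        # (row and column totals held fixed). Integers keep ties exact.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Add up every table that is no more probable than the observed one.
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / comb(n, col1)

p_value = fisher_exact_2x2(27, 3, 4, 26)
print(f"Fisher's exact test: p = {p_value:.3e}")
```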
Odds Ratios and Relative Risk
• Pearson’s chi-square and Fisher’s exact tests indicate
whether a relationship is statistically significant
– Did the results occur by chance?
• Odds ratios and relative risk indicate the magnitude of a
relationship or its effect size
– Was there a large difference in the odds or risks among rows?

                              Pregnant?
                              Yes    No
Pregnancy Test?   Positive    a      b      a+b
                  Negative    c      d      c+d
                              a+c    b+d    n

Interpreting OR and RR
• The odds of pregnancy are OR = 58.5 times higher
for women who tested positive than the odds of
pregnancy for women who tested negative
• The risk of pregnancy is RR = 6.75 times higher for
women who tested positive than the risk of pregnancy
for women who tested negative
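As a quick check of the two values above, here is a short Python sketch using the a, b, c, d cell labels from the preceding table and the pregnancy test counts. The function names are illustrative only.

```python
def odds_ratio(a, b, c, d):
    """Odds of the outcome in row 1 divided by the odds in row 2."""
    return (a / b) / (c / d)

def relative_risk(a, b, c, d):
    """Risk of the outcome in row 1 divided by the risk in row 2."""
    return (a / (a + b)) / (c / (c + d))

# Positive test: 27 pregnant, 3 not. Negative test: 4 pregnant, 26 not.
a, b, c, d = 27, 3, 4, 26
or_value = odds_ratio(a, b, c, d)
rr_value = relative_risk(a, b, c, d)
print(f"OR = {or_value:.2f}, RR = {rr_value:.2f}")
```

Both functions divide by cell counts, so a table with a zero in b, c or the second row would need separate handling.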
Sensitivity and Specificity
• Sensitivity and specificity represent the performance
of diagnostic tests
• Sensitivity is the proportion of actual positives correctly
identified by the diagnostic
• Specificity is the proportion of actual negatives correctly
identified by the diagnostic

                              Pregnant?
                              Yes    No
Pregnancy Test?   Positive    TP     FP
                  Negative    FN     TN
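The two definitions translate directly into code. A small Python sketch with the pregnancy test counts (TP = 27, FP = 3, FN = 4, TN = 26); the function names are illustrative.

```python
def sensitivity(tp, fn):
    """Share of actual positives that the test flags as positive."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that the test flags as negative."""
    return tn / (tn + fp)

tp, fp, fn, tn = 27, 3, 4, 26
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2%}, specificity = {spec:.2%}")
```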

Table Formats
Contingency Table format

                              Pregnant?
                              Yes    No
Pregnancy Test?   Positive    27      3    30
                  Negative     4     26    30
                              31     29    60

Summarized Table format

Test        Pregnant?    Count
Positive    Yes          27
Positive    No            3
Negative    Yes           4
Negative    No           26

• You may need to reformat your data table for some software
– Contingency table format for analysis in GraphPad Prism
– Summarized table format for analysis in JMP
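Converting between the two layouts is mechanical. The sketch below is a plain Python illustration with hypothetical helper names; it is not tied to GraphPad Prism or JMP, which have their own import tools.

```python
def to_summarized(table, row_labels, col_labels):
    """Flatten a contingency table into (row level, column level, count) records."""
    return [
        (row_label, col_label, count)
        for row_label, counts in zip(row_labels, table)
        for col_label, count in zip(col_labels, counts)
    ]

def to_contingency(records, row_labels, col_labels):
    """Rebuild the contingency table from summarized records."""
    lookup = {(row, col): count for row, col, count in records}
    return [[lookup[(row, col)] for col in col_labels] for row in row_labels]

table = [[27, 3], [4, 26]]
tests = ["Positive", "Negative"]
pregnant = ["Yes", "No"]

records = to_summarized(table, tests, pregnant)
for record in records:
    print(record)
```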
Review Contingency Table Results
                              Pregnant?
                              Yes    No
Pregnancy Test?   Positive    27      3    30
                  Negative     4     26    30
                              31     29    60

Pearson Chi-Square: X² = 32.3026, p = 1.319e-08
Fisher’s Exact Test: p = 1.975e‐09
Odds of pregnancy are OR = 58.5 times higher after positive pregnancy test
Risk of pregnancy is RR = 6.75 times higher after positive pregnancy test
Pregnancy test has 87.1% sensitivity and 89.66% specificity

More Complicated Models
• What if your contingency table is larger than 2 x 2?
– Pearson chi‐square and Fisher’s exact test for M x N tables
• What if your table contains paired data?
– McNemar’s Test for paired data
• What if your table has three variables?
– Mantel-Haenszel-Cochran (MHC) test
• What if you have a continuous predictor variable?
– Logistic regression models
• What about really complicated models?
– Generalized Linear Models (GLIM)
M x N Contingency Tables
                        Blood Types
                        A     B     AB    O
Ethnicity   Bambara      7     8     5    20     40
            Peul        12     3     3    12     30
            Tuareg      11    13     2     4     30
                        30    24    10    36    100
• Pearson chi‐square tests work the same for larger M x N tables, but
researchers need to remember the assumptions about cell counts
• Fisher’s exact test is difficult to compute for M x N tables, but it
can be computed using simulations in R or other software
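The simulation idea can be sketched in plain Python. This is an assumption-laden illustration, not the algorithm R uses: it shuffles the column labels of the 100 subjects so that row and column totals stay fixed, then counts how often a shuffled table is no more probable than the observed one. The function names, the number of simulations and the seed are all arbitrary choices.

```python
import math
import random

def log_table_weight(table):
    """Log probability of a table, up to a constant fixed by the margins."""
    return -sum(math.lgamma(count + 1) for row in table for count in row)

def simulated_fisher_p(table, n_sim=2000, seed=1):
    """Monte Carlo estimate of Fisher's exact p-value for an M x N table."""
    rng = random.Random(seed)
    n_rows, n_cols = len(table), len(table[0])
    # One row label and one column label per subject, in matching order.
    row_ids = [i for i, row in enumerate(table) for count in row for _ in range(count)]
    col_ids = [j for row in table for j, count in enumerate(row) for _ in range(count)]
    observed = log_table_weight(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(col_ids)  # breaks any association, keeps both margins
        simulated = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(row_ids, col_ids):
            simulated[i][j] += 1
        if log_table_weight(simulated) <= observed + 1e-9:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_sim + 1)

# Rows: Bambara, Peul, Tuareg. Columns: blood types A, B, AB, O.
blood_types = [[7, 8, 5, 20], [12, 3, 3, 12], [11, 13, 2, 4]]
p_estimate = simulated_fisher_p(blood_types)
print(f"simulated Fisher p-value: {p_estimate:.4f}")
```

The estimate varies with the seed and becomes more stable as n_sim grows.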

Ordinal vs. Nominal Variables
• Ordinal variables have outcomes that are ordered
– Drug Dosages: 0 mg, 5 mg, 10 mg and 15 mg
– Symptom Severity: Mild, Moderate and Severe
• Nominal variables have outcomes that are unordered
– Blood Types: A, B, AB and O
– Ethnicity: Bambara, Peul and Tuareg
• Most tests assume nominal variables by default
– Ordinal variables require fewer odds ratio estimates
– Ordinal variables may allow for a simpler model
– E.g. compute odds ratios to compare Mild vs. Moderate and Moderate
vs. Severe, but do not compare Mild vs. Severe
McNemar’s Test
• McNemar’s test should be used if table represents a matched
pairs design experiment
– E.g. Some matched pairs designs arise from repeated sampling of
patients pre- and post-treatment
– E.g. Case-control experiments may use McNemar’s test because case
and control patients have been “matched” using key demographic
variables like age, gender, race, ...

                     Test 2
                     Pos    Neg
Test 1   Positive    a      b      a+b
         Negative    c      d      c+d
                     a+c    b+d    n
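The slide does not show the test statistic, so the following Python sketch assumes the standard uncorrected form, X² = (b - c)² / (b + c) with 1 degree of freedom, which uses only the discordant cells b and c. The counts are hypothetical and do not come from the slides.

```python
import math

def mcnemar_test(b, c):
    """McNemar chi-square statistic (no continuity correction) and 1-df p-value."""
    stat = (b - c) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical discordant pairs: 15 pairs positive on Test 1 only,
# 5 pairs positive on Test 2 only.
b, c = 15, 5
stat, p_value = mcnemar_test(b, c)
print(f"McNemar X2 = {stat:.2f}, p = {p_value:.4f}")
```

With few discordant pairs, an exact binomial version of the test is usually preferred over this chi-square approximation.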

Mantel-Haenszel-Cochran Test
• Mantel-Haenszel-Cochran test determines if the relationship
between two table variables remains the same if the table is
“paneled” or split by a third table variable
• Often used to investigate Simpson’s Paradox

All Ages
                        Heart Attack?
                        Yes    No
Birth Control?   Yes    16     34
                 No     34     16

Age < 40
                        Heart Attack?
                        Yes    No
Birth Control?   Yes     8     32
                 No      2      8

Age > 40
                        Heart Attack?
                        Yes    No
Birth Control?   Yes     8      2
                 No     32      8
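A short Python sketch shows why these tables illustrate Simpson's Paradox. It compares the pooled odds ratio with the odds ratio in each age stratum and with the Mantel-Haenszel common odds ratio, assumed here in its usual form Σ(a·d/n) / Σ(b·c/n). The function names are illustrative, and the sketch computes only the estimates, not the test statistic.

```python
def odds_ratio(table):
    """Cross-product odds ratio of a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across several 2 x 2 strata."""
    numerator = sum(a * d / (a + b + c + d) for (a, b), (c, d) in strata)
    denominator = sum(b * c / (a + b + c + d) for (a, b), (c, d) in strata)
    return numerator / denominator

# Rows: birth control yes / no. Columns: heart attack yes / no.
under_40 = [[8, 32], [2, 8]]
over_40 = [[8, 2], [32, 8]]
pooled = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(under_40, over_40)]

print(f"pooled OR    = {odds_ratio(pooled):.3f}")
print(f"age < 40 OR  = {odds_ratio(under_40):.3f}")
print(f"age > 40 OR  = {odds_ratio(over_40):.3f}")
print(f"MH common OR = {mantel_haenszel_or([under_40, over_40]):.3f}")
```

The pooled table suggests an association, while neither age group shows one once the table is split.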
Logistic Regression
• Logistic regression fits the relationship
between a continuous predictor and a
categorical response variable
– E.g. predict the gender of an unknown
person based on their height
– E.g. predict whether an animal will live or
die based on the dose of a drug
• The logistic regression plot represents
the change in log odds (the log odds ratio) for each one
unit increase in the predictor variable
– E.g. If an unknown person is 61 inches tall,
their odds of being male are near zero
– E.g. if an unknown person is 68 inches tall,
their odds of being male are about 50‐50
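To make the height example concrete, here is a small Python sketch of a logistic curve. The intercept and slope are hypothetical values chosen only so that the curve crosses 50% at 68 inches; they are not estimates from the presentation's data.

```python
import math

def probability(x, intercept, slope):
    """Logistic curve: the log odds are a straight line in x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

# Hypothetical coefficients (assumed for illustration only).
slope = 1.0        # change in log odds of being male per inch
intercept = -68.0  # log odds at a height of 0 inches

for height in (61, 68, 75):
    p = probability(height, intercept, slope)
    print(f"height {height} in: P(male) = {p:.4f}, odds = {p / (1 - p):.4f}")
```

Each extra inch multiplies the odds by exp(slope), which is the odds ratio reported by most software.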

“Long” Data Format
• Each row of data represents one
patient, animal or subject
• Raw data format is useful when
continuous covariates are unique
to each subject or patient
– E.g. Exact weight of each patient
– E.g. Exact blood pressure, ...
“Wide” Data Format
• If each value of the continuous variable
has been replicated, the data can be
formatted as a summarized table
• Summarized tables require less space
and can be used in multiple models
– Logistic regression models
– Log‐linear models
– Probit analysis

Results from Logistic Regression
• Whole model results
– Likelihood Ratio Test (LRT)
– Model fit diagnostics
• Parameter estimates
– Regression coefficients
– Wald tests
• Odds ratios
– The odds of survival are 1.107
times higher after every one
unit increase in log(dose)
– Odds of survival are 12.794
times higher after every one
unit increase in dose
Why Use Both Wald and LRT?
• Likelihood Ratio tests compare the fit of two statistical models
– Most statistical models can be described with a likelihood function
– A likelihood ratio test (LRT) computes the log‐likelihood function under a full
model (dose and intercept) and reduced model (intercept) to test model fit
• Wald tests evaluate the statistical significance of model parameters
– Wald test statistics are constructed much like Student’s t-tests
– Results from Wald test should be consistent with LRT results

Estimate LD50 from Logistic Regression
• You can use interpolated values
or inverse prediction to estimate
LD50 from a logistic regression
• Open the Inverse Prediction menu
and enter Prob = 0.500 to estimate
LD50 by finding X at Y = 0.500
– Enter Prob = 0.90 for LD90, ...,
• You may need to antilog your LD50
estimate if your predictor is on the
log scale (e.g. log10(dose))
Compute LD50 from Parameter Estimates
• Simple logistic regression is defined by the equation
log(p / (1 - p)) = B0 + B1·X
• At p = 0.5 the log odds equal 0, so B0 + B1·X = 0
• Therefore, by simple algebra, we find LD50 = -B0 / B1
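The algebra can be sketched as a small Python helper that performs inverse prediction for any probability, not just 0.5. The coefficients below are hypothetical (they are not estimates from the slides) and the predictor is assumed to be log10(dose), so the result is antilogged at the end.

```python
import math

def dose_at_probability(prob, b0, b1):
    """Inverse prediction: solve log(prob / (1 - prob)) = b0 + b1 * x for x."""
    return (math.log(prob / (1.0 - prob)) - b0) / b1

# Hypothetical coefficients on the log10(dose) scale (assumed values).
b0, b1 = -4.0, 2.0

log_ld50 = dose_at_probability(0.5, b0, b1)  # reduces to -b0 / b1
log_ld90 = dose_at_probability(0.9, b0, b1)
ld50 = 10 ** log_ld50                        # antilog back to the dose scale
ld90 = 10 ** log_ld90
print(f"log10(LD50) = {log_ld50:.3f}, LD50 = {ld50:.1f}")
print(f"log10(LD90) = {log_ld90:.3f}, LD90 = {ld90:.1f}")
```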

Reed‐Muench Method
• Graphical estimate of
LD50 from survival data
• Plot total number of
survivors and total
number dead against
dilution or concentration
• Intersection represents
best estimate of LD50
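The graphical procedure above can be approximated numerically. The Python sketch below follows the slide's description: it accumulates deaths toward higher doses and survivors toward lower doses, then finds where the two cumulative curves cross by linear interpolation. The data are hypothetical, the function name is illustrative, and the published Reed-Muench calculation interpolates on cumulative percentages, so its answer may differ slightly from this crossing point.

```python
def reed_muench_log_ld50(log_doses, dead, survived):
    """Log dose at which cumulative dead and cumulative survivors intersect.

    log_doses must be sorted from lowest to highest dose.
    """
    n = len(log_doses)
    # An animal that died at one dose is counted as dead at every higher dose;
    # an animal that survived is counted as a survivor at every lower dose.
    cum_dead = [sum(dead[: i + 1]) for i in range(n)]
    cum_survived = [sum(survived[i:]) for i in range(n)]
    gap = [d - s for d, s in zip(cum_dead, cum_survived)]
    for i in range(n - 1):
        if gap[i] <= 0 <= gap[i + 1] and gap[i] != gap[i + 1]:
            fraction = -gap[i] / (gap[i + 1] - gap[i])  # linear interpolation
            return log_doses[i] + fraction * (log_doses[i + 1] - log_doses[i])
    raise ValueError("cumulative curves do not cross within the tested doses")

# Hypothetical experiment: 6 animals at each of four tenfold doses.
log_doses = [1.0, 2.0, 3.0, 4.0]  # log10(dose)
dead = [0, 2, 4, 6]
survived = [6, 4, 2, 0]

log_ld50 = reed_muench_log_ld50(log_doses, dead, survived)
print(f"log10(LD50) = {log_ld50:.2f}, LD50 = {10 ** log_ld50:.1f}")
```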

Generalized Linear Models
• Logistic regression, extensions of Pearson chi-square tests and other
models can be defined as generalized linear models (GLIM)
• Each GLIM model is coerced into the form of a linear equation by
choosing the correct statistical distribution and link function
• Excluding logistic regression, most multifactor categorical models
must be specified using the GLIM procedures in your software
• GLIM procedures typically allow analysts to test for overdispersion,
where real data has more variance than expected from the model
Distribution Choices
• Modeling categorical responses directly
– Binomial and multinomial distributions
– Negative binomial distribution
• Modeling contingency table cell counts
– Poisson distribution models all cell counts as rare events
– Normal distribution models cell counts as common events

Link Functions
• Link functions are mathematical transformations
used to coerce models into linear equations
– The identity link function g(y) = y for linear models
– The log link function g(y) = log(y) for log‐linear models
– The logit link function g(y) = log(y / (1 - y)) for logistic regression models
– The probit link function g(y) = Φ⁻¹(y), the inverse standard normal
CDF, for probit analysis models
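The four links and their inverses can be written out with the Python standard library. This is an illustrative sketch: the dictionary layout and names are arbitrary, and the probit link relies on `statistics.NormalDist`, available in Python 3.8 and later.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

# Each entry holds (link, inverse link).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
}

# Map a probability of 0.9 to the linear scale and back again.
for name in ("logit", "probit"):
    link, inverse = LINKS[name]
    eta = link(0.9)
    print(f"{name}: g(0.9) = {eta:.4f}, inverse gives {inverse(eta):.4f}")
```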
Historic Models as GLIM
• Logistic regression
– Binomial distribution with logit link function
• Probit analysis
– Binomial distribution with probit link function
• Log‐linear models
– Poisson distribution with log link function
• Negative Binomial regression
– Negative binomial distribution with log link function

Overdispersion Parameters
• Traditional linear models, like linear regression, use independent
parameters to estimate the variance of the response data
– E.g. linear regression has independent mean μ = Xβ and variance σ²
• Many GLIM models, like logistic regression, have fixed relationships
between the variance and other model parameters
– E.g. logistic regression has mean μ = np and variance σ² = np(1 – p)
– E.g. log-linear models have μ = σ² = λ = np for rare events with small p
• Overdispersion parameters are used to account for extra variability
in the responses, which cannot be explained by the model
– E.g. logistic regression modeled with variance σ² = φnp(1 – p)
– Check whether the multiplier φ > 2 to judge whether the overdispersion
is significant or important
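One common way to estimate the multiplier φ is the Pearson chi-square divided by its residual degrees of freedom. That estimator is an assumption here, since the slides do not say how φ is computed. The Python sketch below applies it to hypothetical replicate groups fitted with a single common probability (one parameter).

```python
def pearson_dispersion(successes, trials, fitted_probs, n_params):
    """Pearson chi-square divided by residual degrees of freedom."""
    chi_square = sum(
        (y - n * p) ** 2 / (n * p * (1 - p))  # squared residual over binomial variance
        for y, n, p in zip(successes, trials, fitted_probs)
    )
    return chi_square / (len(successes) - n_params)

# Hypothetical data: six groups of 10 subjects, fitted with one shared probability.
successes = [2, 9, 4, 10, 1, 8]
trials = [10] * 6
pooled = sum(successes) / sum(trials)

phi = pearson_dispersion(successes, trials, [pooled] * 6, n_params=1)
print(f"estimated dispersion multiplier = {phi:.3f}")
```

A value near 1 is what the binomial model expects; these made-up counts vary far more than that.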
Generalized Linear Mixed Models
• Generalized linear models can be advanced further by
including random effect variables
– These models are called generalized linear mixed models (GLMM)
– Random effect variables are included to account for paired designs,
repeated measures designs, split‐plot designs and other effects
– GLMM are typically fit using linearization (pseudo-likelihood) techniques
(e.g. SAS PROC GLIMMIX); generalized estimating equations (GEE) are a
related approach for marginal models
• Sometimes complicated GLM and GLMM must be fit
using nonlinear modeling procedures in your software
– Probit model with binomial errors or Poisson loss function models in JMP
– Probit‐Normal models and Poisson‐Normal models in SAS PROC NLMIXED

Random vs. Fixed Effects
Subject effects are random; gender effects are fixed
• Subject effects are random because the subjects in an experiment
are a sample from the population of all possible subjects
• Gender effects are fixed because there are only two genders
Split‐plot Design
12 mice: 6 infected, 6 uninfected
3 infected males, 3 infected females, …
4 samples taken from each mouse
Each sample treated with one of 2 different drugs
Whole plot (mouse) EUs: infection, gender
Subplot (sample) EUs: drug treatment
• Split-plot designs model experiments where whole plots and
subplots represent different experimental units (EUs)
– Whole plots are often locations, subjects, objects or factors that
are difficult to change (e.g. temperature in an incubator)
– Subplot effects are typically the effects of highest interest
– Subplot effects are tested with higher power than whole plot effects

References
• Agresti A. 2002. Categorical Data Analysis. Second Ed. Wiley-Interscience.
• Reed LJ and H Muench. 1938. A Simple Method of Estimating Fifty Percent
Endpoints. The American Journal of Hygiene. 27(3):493-497.
• SAS Institute Inc. 2007. SAS 9.1.3 Documentation. Cary, NC. SAS Institute Inc.
• SAS Institute Inc. 2010. JMP Statistics and Graphics Guide. Cary, NC. SAS
Institute Inc.
