Poisson Regression Analysis of Count Data

Research in Nursing & Health, 2005, 28, 408–418
Focus on Research Methods
Analysis of Count Data Using Poisson Regression*
M. Katherine Hutchinson¹†, Matthew C. Holtman²‡
¹University of Pennsylvania School of Nursing, 420 Guardian Drive, Philadelphia,
Pennsylvania 19104-6096
²Fels Institute of Government and Department of Criminology, University of Pennsylvania
Accepted 16 May 2005
Abstract: Nurses and other health researchers are often concerned with
infrequently occurring, repeatable, health-related events such as number of
hospitalizations, pregnancies, or visits to a health care provider. Reports on
the occurrence of such discrete events take the form of non-negative integer
or count data. Because the counts of infrequently occurring events tend to be
non-normally distributed and highly positively skewed, the use of ordinary
least squares (OLS) regression with non-transformed data has several
shortcomings. Techniques such as Poisson regression and negative binomial
regression may provide more appropriate alternatives for analyzing
these data. The purpose of this article is to compare and contrast the use of
these three methods for the analysis of infrequently occurring count data. The
strengths, limitations, and special considerations of each approach are
discussed. Data from the National Longitudinal Study of Adolescent Health
(AddHealth) are used for illustrative purposes. © 2005 Wiley Periodicals, Inc. Res
Nurs Health 28:408–418, 2005
Keywords: Poisson regression; count data; data analysis
*This research uses data from Add Health, a program project designed by J.
Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris, and funded by a grant
P01-HD31921 from the National Institute of Child Health and Human Development,
with cooperative funding from 17 other agencies. Special acknowledgment is due
Ronald R. Rindfuss and Barbara Entwisle for assistance in the original design.
Persons interested in obtaining data files from Add Health should contact Add Health,
Carolina Population Center, 123 W. Franklin Street, Chapel Hill, NC 27516-2524
(www.cpc.unc.edu/addhealth/contract.html).
Contract grant sponsor: National Institute of Mental Health (to MKH); Contract
grant number: R03 MH63659.
Contract grant sponsor: National Institute of Child Health and Human
Development; Contract grant number: P01-HD31921.
Correspondence to M. Katherine Hutchinson.
†Assistant Professor and Associate Director.
‡Lecturer.
Published online in Wiley InterScience (www.interscience.wiley.com)
DOI: 10.1002/nur.20093

Nurses and other health researchers are often concerned with infrequently
occurring, repeatable, health-related events such as number of
hospitalizations, pregnancies, or visits to a health care provider. Reports
on the occurrence of such discrete events take the form of non-negative
integer or count data. Counts of infrequently occurring, repeatable events
tend to cluster around the values of 0 and/or 1 and exhibit low frequencies
at higher values. This type of distribution has a positive skew. Such
distributions are truncated at 0 and gradually trail off toward higher
values; the mean is characteristically low but greater than the median
because of the influence of a few relatively large observations. In a
regression model, the distribution of the error term mirrors the
distribution of the dependent variable itself (Lewis-Beck, 1989). Because
ordinary least squares (OLS) regression assumes normality in the
distribution of error terms and hence in the dependent variable, its use
with this type of data is problematic if the data are not transformed to
address the effects of the positive skew (Lewis-Beck).
Poisson regression and negative binomial
regression may provide more appropriate alter-
natives for the analysis of infrequently occurring,
untransformed count data. Neither of these alter-
native types of regression analysis assumes
normal distribution of the error terms and depen-
dent variables. Poisson regression assumes a
Poisson distribution—a specific type of distribu-
tion in which scores take the form of non-negative
whole number or integer values. The Poisson
distribution is truncated at 0, highly skewed in the
positive direction, and exhibits equidispersion
(i.e., a mean that is equal to the variance; Allison,
1999; Cameron & Trivedi, 1998). For the use of
uncorrected Poisson regression, these characteristics
should be present. When overdispersion is
present (i.e., the variance is greater than the mean),
Poisson regression may still be employed, but
statistical corrections must be incorporated into
the model to correct for the overdispersion.
In contrast, negative binomial regression is
based on the assumption of a Poisson-like dis-
tribution (Allison, 1999) and no assumptions
regarding equidispersion are made. When over-
dispersion is present, negative binomial regression
can be employed without any special corrections.
The purpose of this article is to compare and
contrast the use of these three methods for the
analysis of infrequently occurring count data. The
strengths, limitations, and special considerations
of each approach are discussed. Data from the
National Longitudinal Study of Adolescent
Health (AddHealth) public use dataset are used
for illustrative purposes.
METHODS
In our initial attempts to build comparative
regression models, we used independent variables
from Wave I of the AddHealth public use dataset
to predict number of pregnancies reported at Wave
II. However, in order to demonstrate the range of
options available in Poisson regression, we felt
that it was important to include a varying exposure
variable (e.g., years of sexual activity) and model
an outcome variable that exhibited overdispersion
(i.e., model a dependent variable with a variance
greater than its mean). When we restricted our
sample to sexually experienced young women and
used Wave II reports of numbers of pregnancies as
our outcome, the resultant model did not exhibit
overdispersion. Therefore, although we would
have preferred to use more than one wave of data
in order to more accurately model the longitudinal
dynamics of pregnancy, for illustrative purposes,
we have confined the analyses to a cross-sectional
analysis using only data from Wave I.
National Longitudinal Study of Adolescent Health
The National Longitudinal Study of Adolescent Health (AddHealth) was
mandated to the National Institute of Child Health and Human Development
(NICHD) by a Congressional act. Detailed information about the study can be
obtained on its web site at http://www.cpc.unc.edu/addhealth. The study
includes a school-based sample of adolescents in 7th through 12th grades.
Students who completed the in-school questionnaire (n = 90,000) and those
who were listed on the school roster were used as a sampling frame for a
core random sample of 12,105 adolescents, stratified by gender and grade.
In-home interviews were conducted with this core sample, as well as
oversamples of ethnic minorities and special populations based on
self-report data from the in-school questionnaire. The analyses reported
here were limited to Wave I data from in-home interviews with adolescents
who were included in the public use dataset subsample.
Sample
The AddHealth public use dataset includes a
subsample of approximately 6,500 respondents.
Of the 3,356 female respondents, 36% (n = 1,241)
were sexually experienced, based on self-reports
of ever having sexual intercourse. Analyses were
limited to those respondents who were sexually
experienced and whose years of sexual experience
could be calculated from their current age and
their age at first sexual intercourse. We deleted a
few outlier cases whose reported ages at first
intercourse were so low that their calculated years
of sexual experience exceeded 10. Of those
sexually experienced girls remaining in the analysis
(n = 1,241), most (n = 1,011, 81%) reported no
pregnancies. Of those who had been pregnant, 190
(15%) reported only one pregnancy; 40 (3%)
reported more than one pregnancy (Table 1).
Independent Variables
For our analyses, we included nine important
predictor variables that have been shown to be
related to adolescent pregnancy—age, race (Afri-
can American, Asian, Hispanic/Latino), marital
status, sexual experience, college plans, contra-
ceptive self-efficacy, and consistent contraceptive
use. The descriptive statistics for all variables are
summarized in Table 2.
Age. Age was included in all of our analyses, as
sexual activity, childbearing intentions, and contraceptive
behaviors tend to vary with age. Age, as
self-reported at Wave I, was recorded in years.
Race. Race was included in the analyses as age
at first intercourse and rates of adolescent
pregnancy have been shown to vary by race (Blum
et al., 2000; Kann et al., 2000). In addition, African
American adolescents have been found to be less
compliant with oral contraceptive use than their
White peers (Scher, Emans, & Grace, 1982).
Three dichotomous dummy variables, coded 1
for yes and 0 for no, were included to repre-
sent respondents’ self-identification as African
American, Hispanic/Latino, and/or Asian.
Marital status. Although the vast majority
(98%) of girls in the sample had never been
married, those who had been married may have
had different attitudes and intentions towards
becoming pregnant than girls who were never
married. Marital status was coded as single/never
married (1) or other (0) and included as a
dichotomous dummy variable in our analyses.
Sexual experience. Pregnancy is directly
related to sexual activity. Sexual experience was
calculated as the number of years that had elapsed
between the year when the Wave I interview was
conducted and the year the respondent reported
having sexual intercourse for the first time.
College plans. Because feelings of hopeless-
ness and lack of future plans may act as contri-
buting factors in adolescent sexual risk-taking and
unintended pregnancy, whether or not respondents
had college plans was assessed at Wave I. The
single-item measure was worded: “How much do
you want to go to college?” The item was scored
on a Likert-type scale from 1 to 5; higher scores
indicated a greater desire to attend college. The
mean score was 4.4 (SD = 1.1).
Contraceptive self-efficacy. Greater sexual
self-efficacy has consistently been shown to be
related to both condom use and contraceptive
use (DiClemente, Lodico, & Grinstead, 1996;
Hutchinson, 2002; Hutchinson & Cooney, 1998;
Hutchinson, Jemmott, Jemmott, Braverman, &
Fong, 2003; Jemmott, Jemmott, & Hacker, 1992).
Four items on the Wave I survey assessed self-
efficacy related to sexual behavior and contra-
ceptive use. Items were scored from 1 to 5. Higher
scores indicated greater contraceptive self-efficacy.
The overall contraceptive self-efficacy score was
computed as the simple average of the four items.
Total possible scores ranged from 1 to 5; the mean
contraceptive self-efficacy score was 3.8 (SD = .8).
Consistent contraceptive use. In the AddHealth survey, a number of
questions addressed contraceptive behavior. For the present analysis, we
created a proxy measure for consistent contraceptive use based on two
questions that asked whether respondents used contraception during
intercourse the first time and the most recent time they had sexual
intercourse. Possible values on the scale range from 0 (did not use
contraception either time) to 2 (used contraception both times). The
average score on this measure was 1.29 (SD = .8).

Table 1. Frequency Distribution of Outcome Variable
Pregnancies reported        n       %
0                       1,011    81.5
1                         190    15.3
2                          32     2.6
3                           6     0.5
4                           2     0.2

Table 2. Descriptive Statistics for Sample: Girls Reporting Sexual Experience
Variable                                           n    Mean     SD  Minimum  Maximum
Number of pregnancies                          1,241     .23    .52       .0      4.0
Current age (age)                              1,241   17.01   1.39     12.7     20.7
African American (Black)                       1,241     .30    .46       .0      1.0
Hispanic (Hisp)                                1,241     .10    .30       .0      1.0
Asian (Asian)                                  1,241     .03    .18       .0      1.0
Never married (single) (0 = false; 1 = true)   1,236     .98    .14       .0      1.0
Years of sexual activity (years)               1,169    1.74   1.42       .0      8.0
College plans (college)                        1,237    4.38   1.06      1.0      5.0
Contraceptive self-efficacy (effic)            1,233    3.76    .81      1.0      5.0
Consistent contraceptive use (contrac)         1,241    1.27    .80       .0      2.0

In summary, we designated nine predictor
variables to include in our models—age, race,
marital status, sexual experience, college plans,
contraceptive self-efficacy, and consistent contra-
ceptive use. In all of our models, we expected age,
race, marital status, and sexual experience to be
positively associated with number of pregnancies.
College plans, contraceptive self-efficacy, and
consistent contraceptive use were expected to be
inversely related to number of pregnancies.
Outcome Variable
The outcome variable of interest, number of pre-
gnancies, was taken from respondents’ self-
reports. We did not include information on
whether pregnancies were carried to term or
whether they resulted in live births. Almost 19%
of the sexually experienced female adolescents
who participated in AddHealth reported having
been pregnant at least once at Wave I; slightly less
than 82% reported they had never been pregnant.
The number of reported pregnancies ranged from
0 to 4. The number of pregnancies was non-
normally distributed and highly skewed with a
mean of .22 and a variance of .27.
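
As a quick check on these summary figures, the mean and variance can be recomputed from the frequency distribution in Table 1. The following is a minimal sketch in Python (the article itself uses SAS); the variable names are illustrative, and the sample variance formula with n − 1 in the denominator is an assumption about how the reported figures were obtained.

```python
# Frequency distribution of reported pregnancies, copied from Table 1.
counts = {0: 1011, 1: 190, 2: 32, 3: 6, 4: 2}

n = sum(counts.values())
mean = sum(k * f for k, f in counts.items()) / n
# Sample variance (n - 1 in the denominator); an assumption, see the lead-in.
variance = sum(f * (k - mean) ** 2 for k, f in counts.items()) / (n - 1)
# A ratio above 1 means the raw counts vary more than a Poisson with this mean.
dispersion_ratio = variance / mean

print(f"n = {n}, mean = {mean:.3f}, variance = {variance:.3f}")
print(f"variance / mean = {dispersion_ratio:.2f}")
```

This raw comparison ignores the predictors; whether a fitted model is overdispersed is judged from the model diagnostics discussed later.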
DATA ANALYSES AND RESULTS
Our goal was to compare the use of OLS, Poisson,
and negative binomial regressions for modeling
the effects of nine independent variables on the
number of pregnancies experienced by sexually
experienced adolescent females from the National
Longitudinal Study of Adolescent Health.
The outcome of interest, number of pregnan-
cies, is an infrequently occurring, discrete and
repeatable event. More than 80% of the sample
had a pregnancy count of 0. Of those who had been
pregnant, 190 had had only one pregnancy; 40
reported more than one pregnancy.
OLS Regression
Ignoring the specific characteristics of a Poisson-distributed outcome
variable, a typical OLS regression model to address these effects might look
like the following:
$$\text{Number of pregnancies} = a + b_1(\text{age}) + b_2(\text{African American}) + b_3(\text{Hispanic}) + b_4(\text{Asian}) + b_5(\text{single}) + b_6(\text{years of sexual experience}) + b_7(\text{college}) + b_8(\text{contraceptive self-efficacy}) + b_9(\text{consistent contraceptive use})$$
The results from this model are shown in Table 3.
All but two of the effects are significant at the .05
level. In interpreting the estimated effects, we see
that, controlling for the other variables, each additional
year of age increases the predicted number of
pregnancies by .05. African American girls are
expected to have .18 more pregnancies than White
girls. Being single decreases the expected number
of pregnancies by nearly .72. Each year of sexual
activity increases the predicted number of pregnancies
by .07. Greater aspiration for college
reduces the predicted number of pregnancies
by .04 per point, greater contraceptive self-efficacy
by .05 per point. Each additional point on the
consistent contraceptive use scale decreases the
predicted number of pregnancies by .07.
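
To see what these OLS estimates imply, the fitted equation can be evaluated directly. The sketch below, written in Python rather than the SAS used in the article, plugs in the Table 3 estimates with the Table 2 means, and then a hypothetical profile that is not a case from the data. The dictionary keys and the profile values are illustrative assumptions.

```python
# OLS estimates from Table 3 and sample means from Table 2.
ols = {
    "intercept": 0.37410, "age": 0.04718, "black": 0.17686, "hisp": 0.03127,
    "asian": 0.08816, "single": -0.71671, "years": 0.07221,
    "college": -0.03711, "effic": -0.04573, "contrac": -0.07412,
}
means = {
    "age": 17.01, "black": 0.30, "hisp": 0.10, "asian": 0.03, "single": 0.98,
    "years": 1.74, "college": 4.38, "effic": 3.76, "contrac": 1.27,
}

def ols_predict(coefs, values):
    """Linear prediction: intercept plus the sum of coefficient * value."""
    return coefs["intercept"] + sum(coefs[k] * v for k, v in values.items())

at_means = ols_predict(ols, means)

# Hypothetical profile (an assumption for illustration): a never-married
# 13-year-old in her first year of sexual activity, with the highest scores
# on college plans, self-efficacy, and contraceptive use.
profile = {
    "age": 13.0, "black": 0, "hisp": 0, "asian": 0, "single": 1,
    "years": 0.0, "college": 5, "effic": 5, "contrac": 2,
}
at_profile = ols_predict(ols, profile)

print(f"Predicted pregnancies at the means: {at_means:.2f}")
print(f"Predicted pregnancies for the hypothetical profile: {at_profile:.2f}")
```

Nothing in a linear model keeps its predictions at or above zero, so a profile like this one can receive a negative predicted count.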
So what is wrong with using the OLS approach?
The answer is that OLS is inappropriate for models
in which the dependent variable is highly skewed.
Table 3. OLS Regression Output for Model Predicting Number of Pregnancies
Variable                       DF  Parameter Estimate  Standard Error  t value  p > |t|
Intercept                       1        .37410            .22441        1.67     .0958
Current age                     1        .04718            .01223        3.86     .0001
African American                1        .17686            .03219        5.49    <.0001
Hispanic                        1        .03127            .04971         .63     .5295
Asian                           1        .08816            .08171        1.08     .2808
Never married                   1       −.71671            .09796       −7.32    <.0001
Years of sexual activity        1        .07221            .01092        6.61    <.0001
College plans                   1       −.03711            .01402       −2.65     .0082
Contraceptive self-efficacy     1       −.04573            .02005       −2.28     .0227
Consistent contraceptive use    1       −.07412            .01879       −3.94    <.0001
OLS, ordinary least squares.

Our dependent variable, number of pregnancies, is
in this category. Its range is restricted by the fact
that its lower bound is zero. Furthermore, although
the number of girls who have been pregnant is
quite low, at the same time, a small handful of girls
reported several pregnancies, up to a maximum of
4. The result is a highly skewed distribution.
An OLS model does not perform well under these conditions. OLS requires, as
a basic assumption, that the dependent variable and the error term in the
model be at least approximately normally distributed (Lewis-Beck, 1989).
With an outcome variable this skewed, these assumptions are violated. Using
OLS also risks violating the homoscedasticity assumption, which is that the
error terms have constant variance across values of the independent
variables. When one or both of these violations occur, the standard errors
of the parameters will be estimated incorrectly, and as a result, the
t-tests associated with the parameters will be inaccurate. The user will be
unable to tell whether the effects are statistically significant or not.
Worse, because one can almost always mechanically run such a model (even
when the model is inappropriate), there is a risk of taking the results at
face value, making a Type I error, and reaching the wrong conclusions.
A second reason why a Poisson model is preferred over OLS regression is that
risk exposure to the outcome of interest varies. OLS assumes linear
relationships between each independent variable and the outcome. Although
using OLS allowed controlling for years of sexual activity in the model, in
doing so we were implicitly assuming that the relationship between years and
pregnancies was linear. With a highly skewed outcome variable like
pregnancy, this assumption is not reasonable. As we will see, Poisson
regression has a more natural way of incorporating this information into the
model, producing more realistic estimates of how likely a girl is to get
pregnant, holding constant the number of years she has been sexually active.
Alternative Poisson-Type Models
There are two Poisson-type models (a true Poisson
and a negative binomial model, which is a
generalization of the Poisson) that are better
suited than OLS to analyzing the type of data in
the example. In specifying a Poisson-type model,
there are two initial choices to make: (a) whether
or not to incorporate time-frame adjustments in
the model and (b) whether to specify a true Poisson
approach or a negative binomial approach.
Unlike OLS models, Poisson-type models can be specified in such a way as to
take varying time-frames or levels of exposure into account. Varying
time-frames or levels of exposure can be addressed by including a predictor
variable that adjusts the model to account for the time-frame in which the
observations were made, or by standardizing the outcome variable per time
unit. Investigators can also choose not to include such adjustments. In the
present example, the appropriate variable to think about is the length of
the interval of sexual activity. Although the risk remains constant for each
exposure, the cumulative risk of becoming pregnant increases with the number
of sexual encounters, and those encounters are likely to increase with an
increase in the interval of exposure. Unfortunately, there is no direct
measure of how many times a girl has had sex. Therefore, we used years of
sexual activity to give us an idea of how much exposure each girl is likely
to have experienced. An 18-year-old girl who has been sexually active since
age 14, for example, has probably had many more chances to become pregnant
than a girl who initiated sexual activity at age 17. There may, of course,
be other differences: the girl who started at 14 may be, in many ways, less
responsible or less able to plan than the other girl. The time variable does
not capture this kind of effect; it captures only different levels of
exposure.
A Poisson type regression can incorporate such
an exposure variable in one of two ways: by
including exposure as a standard predictor vari-
able, or incorporating exposure as an offset of the
outcome variable. In the latter case, the researcher
would include the log-transformed value of the
exposure variable (years sexually active) on
the right-hand-side of the equation, and instruct
the computer to assume that its coefficient is equal
to 1. In this kind of model, the outcome variable
predicted will be the log of the rate of pregnancies per
unit of exposure (e.g., per year of sexual activity).
The reasons for these minor adjustments are
algebraic. The other alternative is simply to
include the untransformed (i.e., not logged)
exposure variable as a predictor on the right-hand
side of the equation. This slightly simpler specifi-
cation gives fairly comparable results, but requires
a slight shift in interpretation of the coefficients.
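
The difference between the two ways of entering exposure can be made concrete with a few lines of arithmetic. This is a minimal Python sketch, not the SAS used in the article; the function names and the values of `eta` and `b_years` are hypothetical and chosen only for illustration.

```python
import math

def expected_count_offset(eta, years):
    """Exposure as an offset: log(years) enters with its coefficient fixed at 1."""
    return math.exp(eta + math.log(years))

def expected_count_predictor(eta, b_years, years):
    """Exposure as an ordinary predictor with an estimated coefficient."""
    return math.exp(eta + b_years * years)

eta = -2.0      # hypothetical sum of the intercept and the other effects
b_years = 0.2   # hypothetical coefficient for years as a predictor

for years in (1, 2, 4):
    with_offset = expected_count_offset(eta, years)
    with_predictor = expected_count_predictor(eta, b_years, years)
    print(f"years = {years}: offset {with_offset:.3f}, predictor {with_predictor:.3f}")
```

With the offset, the expected count is exactly proportional to years of exposure (doubling the years doubles the count). With years as a predictor, each additional year multiplies the expected count by a constant factor, exp(b_years).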
The second major option is whether to use a true
Poisson specification, or a negative binomial
specification. A true Poisson model assumes that
the distribution of the outcome variable (number of
pregnancies) has a mean equal to its variance. This
is part of the definition of the Poisson distribution.
The negative binomial (NB) distribution, on the
other hand, makes no such assumption and can
have a variance that is larger than its
mean. The negative binomial distribution is said,
therefore, to be overdispersed compared to the true
Poisson distribution. The models are interpreted in
exactly the same way. The researcher needs to be
aware of the possibility of overdispersion when
estimating a model, and to pick the better of the two
models (Poisson or NB) depending on whether
there is evidence for overdispersion or not. When
overdispersion is present, true Poisson modeling
can still be employed; however, a standardized
correction must be made for the overdispersion.
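
The contrast between the two variance assumptions can be written out in a short sketch. The article does not give a formula for the negative binomial variance; the form used below (variance = mean + alpha × mean²) is the commonly used quadratic parameterization and is an assumption here, as are the function names and the dispersion values tried.

```python
def poisson_variance(mu):
    """Under a true Poisson distribution the variance equals the mean."""
    return mu

def negbin_variance(mu, alpha):
    """Negative binomial variance under the assumed quadratic form, alpha >= 0."""
    return mu + alpha * mu ** 2

mu = 0.23  # mean number of pregnancies from Table 2
for alpha in (0.0, 0.5, 1.0):  # hypothetical dispersion values
    print(f"alpha = {alpha}: Poisson variance {poisson_variance(mu):.3f}, "
          f"NB variance {negbin_variance(mu, alpha):.3f}")
```

When alpha is 0 the two variances coincide, which is the sense in which the negative binomial generalizes the Poisson.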
Poisson modeling. A Poisson model uses a
maximum likelihood estimation technique (much
like a logistic regression), and can be run in SAS
using the GENMOD procedure. A Poisson version
of our model is shown below. In this variation, we
used the simpler approach for the exposure
variable by including it in raw form as a predictor
variable. Notice the line that specifies the log link
function. For technical reasons, Poisson and
negative binomial regression model the natural
log of the expected value of the outcome variable.
Example SAS instructions are provided below
and the output from these instructions is included
in Table 4.
/* Poisson regression with a log link; years enters as an ordinary predictor */
proc genmod data = adhealth.PoissonAnalysis;
model numpreg = age black hisp asian single
years college effic contrac
/link = log
dist = poisson;
run;
The output is similar to that of an OLS regression: a parameter estimate
and a standard error for each predictor variable, and a p value that is
based on a χ² statistic. The upper and lower limits of a 95% confidence
interval for the parameter also are provided. Finally, an extra line labeled
scale, which is set to 1 in this first model, appears. The scale parameter
identifies how much the output was adjusted to take overdispersion into
account. We have not dealt with overdispersion yet in this model, so this
parameter is not yet relevant.
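
For readers curious about what maximum likelihood estimation involves here, the following minimal Python sketch fits a one-predictor Poisson model with a log link by Newton-Raphson. It is not the GENMOD algorithm and does not use the AddHealth data; the function `fit_poisson` and the toy data are hypothetical, written only to illustrate the idea.

```python
import math

def fit_poisson(x, y, iterations=25):
    """Newton-Raphson for log(mu) = b0 + b1 * x under a Poisson likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: first derivatives of the log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix: negative second derivatives.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: counts that tend to rise with x.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 1, 2, 1, 3]

b0, b1 = fit_poisson(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum of observed = {sum(y)}, sum of fitted = {sum(fitted):.3f}")
```

At the maximum likelihood solution the fitted counts sum to the observed counts, which is a convenient check that the iterations have converged.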
There are significant effects (p < .05) for most of the predictors, just as
before. Each of the effects is again in the expected direction. In
interpreting the coefficients, recall that the model is really predicting
the natural log (log base e, the power to which e must be raised to get the
original number, where e is about 2.718) of the expected number of
pregnancies. This is a somewhat unnatural way to think about pregnancies
(“how many logged pregnancies did you have?”), so it may be helpful to
transform the parameter estimates into a slightly more accessible form. This
can be accomplished using spreadsheet software or a scientific calculator.
For years of sexual experience, for example, the estimated coefficient is
.22. Taking e to the .22 power gives $e^{.22} = 1.25$: each additional year
of sexual experience multiplies the expected number of pregnancies by 1.25.
There is a simple way to interpret these estimates. The percentage change in
the expected outcome count (Y) with each one-unit increase in the
independent variable (X) equals 100 times the quantity e raised to the
coefficient, minus one ($\%\Delta Y = 100 \times [e^{B} - 1]$; Allison,
1999). In this example, the percent increase in the expected number of
pregnancies for each additional year of sexual experience would be
$100 \times (1.25 - 1) = 25\%$.
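
The same transformation can be applied to any of the coefficients. The sketch below, in Python rather than SAS, applies the formula to several estimates copied from Table 4; the function name `percent_change` is illustrative.

```python
import math

# Poisson estimates copied from Table 4.
coefficients = {
    "Current age": 0.2278,
    "African American": 0.6958,
    "Never married": -1.1035,
    "Years of sexual activity": 0.2195,
    "College plans": -0.1198,
    "Contraceptive self-efficacy": -0.1939,
    "Consistent contraceptive use": -0.3106,
}

def percent_change(b):
    """Percent change in the expected count per one-unit increase in X."""
    return 100.0 * (math.exp(b) - 1.0)

for name, b in coefficients.items():
    print(f"{name:30s} exp(b) = {math.exp(b):.2f}  change = {percent_change(b):+.1f}%")
```

Negative coefficients give values of exp(b) below 1 and therefore a percentage decrease in the expected count.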
Table 4. Poisson Regression Output for Model Predicting Number of Pregnancies, with Years of Sexual
Activity Included as a Predictor
                                              Standard   Wald 95% Confidence
Parameter                      DF  Estimate   Error      Limits                   χ²   Probability > χ²
Intercept                       1   −3.5182    .9551     −5.3903   −1.6462     13.57    .0002
Current age                     1     .2278    .0556       .1188     .3368     16.78   <.0001
African American                1     .6958    .1297       .4416     .9500     28.78   <.0001
Hispanic                        1     .1498    .2195      −.2804     .5801       .47    .4949
Asian                           1     .5333    .3290      −.1116    1.1782      2.63    .1051
Never married                   1   −1.1035    .2122     −1.5194    −.6876     27.04   <.0001
Years of sexual activity        1     .2195    .0381       .1447     .2942     33.13   <.0001
College plans                   1    −.1198    .0515      −.2208    −.0188      5.41    .0201
Contraceptive self-efficacy     1    −.1939    .0852      −.3609    −.0268      5.17    .0230
Consistent contraceptive use    1    −.3106    .0812      −.4697    −.1514     14.62    .0001
Scale                           0    1         0          1         1

But how reasonable are these estimates? What is the predicted number of
pregnancies for an average girl in the sample, that is, a girl who has
average values on all of the predictor variables? No such girl may actually
exist in the dataset, but applying the averages of all the predictors is an
easy way to see whether the model gives reasonable results. In calculating
the predicted number of pregnancies, it is necessary to take into account
the effects of all the predictor variables (not just years of sexual
experience or exposure), and those effects have to be added up before
exponentiating. Referring to the descriptive statistics in Table 2, multiply
each predictor variable’s average (current age, Black, Hispanic, Asian,
single, years of sexual activity, college plans, contraceptive
self-efficacy, consistent contraceptive use) by its estimated effect from
the model, and include the intercept:
$$-3.52 + (.23 \times 17.01) + (.70 \times .30) + (.15 \times .10) + (.53 \times .03) + (-1.10 \times .98) + (.22 \times 1.74) + (-.12 \times 4.38) + (-.19 \times 3.76) + (-.31 \times 1.27) = -1.75$$
The coefficients are shown here rounded to two decimals; the total is
computed from the unrounded estimates in Table 4. The total obtained is
−1.75 for the log of the predicted number of pregnancies, or .17 for the
predicted number of pregnancies ($e^{-1.75} = .17$). The .17 figure is not
far off the average number of pregnancies in the sample, .20.
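
The calculation for the average girl can be reproduced with the unrounded estimates. This is a minimal sketch in Python (the article used a spreadsheet or calculator); the dictionary keys follow the variable names in Table 2, and the values are copied from Tables 2 and 4.

```python
import math

# Poisson estimates from Table 4 (years of sexual activity as a predictor).
estimates = {
    "intercept": -3.5182, "age": 0.2278, "black": 0.6958, "hisp": 0.1498,
    "asian": 0.5333, "single": -1.1035, "years": 0.2195,
    "college": -0.1198, "effic": -0.1939, "contrac": -0.3106,
}
# Sample means from Table 2.
means = {
    "age": 17.01, "black": 0.30, "hisp": 0.10, "asian": 0.03, "single": 0.98,
    "years": 1.74, "college": 4.38, "effic": 3.76, "contrac": 1.27,
}

# Add up the effects on the log scale first, then exponentiate once.
log_mu = estimates["intercept"] + sum(estimates[k] * m for k, m in means.items())
mu = math.exp(log_mu)

print(f"log of predicted pregnancies = {log_mu:.2f}")
print(f"predicted pregnancies        = {mu:.2f}")
```

Exponentiating each term separately and adding the results would give a different and incorrect answer, which is why the sum is formed on the log scale.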
Model fit can be assessed using the deviance and χ² statistics that are
reported in the model output. Under certain large-sample conditions, both
statistics are approximately distributed as χ² with degrees of freedom equal
to the number of observations minus the number of parameters. In this
example, the deviance is 781.6 and the χ² is 3,359.9, both with 1,144
degrees of freedom. Both statistics are non-significant (p = 1 and .95,
respectively; the p values were calculated using the χ² formula from
standard spreadsheet software). This suggests that the fit between the model
and the data is very good. A significant deviance or χ² value would have
indicated poor correspondence between the model and the data, perhaps due to
inappropriate use of the Poisson specification or due to the omission of an
important predictor variable. In practice, the model fit statistics will
tend to get larger and become statistically significant as the dataset gets
larger, so they are not always useful for assessing a single model by
itself. An alternative is to use the deviance statistic to compare nested
models with each other; for details, refer to Cameron and Trivedi (1998).
The next step is to attempt a better specification
of the Poisson model using the offset method to
adjust for varying lengths of exposure to pregnancy
risk, in this case the interval of being sexually
active. In this approach, the log-transformed value
of years sexually active will be included, calculated
using the following SAS code:
/* years = years of sexual activity (Table 2) */
logyrs = log(years);
In setting up the model, this variable is not
included with the other predictors, but instead a
line is included that says:
/offset = logyrs
The resultant SAS code reads as follows:
proc genmod data = adhealth.PoissonAnalysis;
model numpreg = age black hisp asian single
college effic contrac
/link = log
dist = poisson
offset = logyrs;
run;
This sets the coefficient for logyrs equal to 1, and adjusts the other
estimates accordingly. It is done this way as an algebraic reduction of the
ratio log(pregnancies/years). The logarithm of any ratio is equal to the log
of the numerator minus the log of the denominator:
$\log(a/b) = \log(a) - \log(b)$, therefore
$\log(\text{pregnancies}/\text{years}) = \log(\text{pregnancies}) - \log(\text{years})$.
The model for the rate of pregnancies per year is:
$$\log(\text{pregnancies}/\text{years}) = a + b_1(\text{age}) + b_2(\text{African American}) + b_3(\text{Hispanic}) + b_4(\text{Asian}) + b_5(\text{single}) + b_6(\text{college}) + b_7(\text{effic}) + b_8(\text{contrac})$$
To estimate a model predicting log(pregnancies) alone, add log(years) to
both sides of the regression equation. The log(years) term then cancels out
of the left side, leaving log(pregnancies) by itself:
$$\log(\text{pregnancies}) = a + \log(\text{years}) + b_1(\text{age}) + b_2(\text{African American}) + b_3(\text{Hispanic}) + b_4(\text{Asian}) + b_5(\text{single}) + b_6(\text{college}) + b_7(\text{effic}) + b_8(\text{contrac})$$
The coefficient for log(years) is set to one in
order to maintain the correct scale. The results of
this model are shown in Table 5. In contrast to the
previous models, the effect for college plans is no
longer significant at the .05 level (p = .09); the
effect for age is smaller but remains significant
(p = .02). If the sum of
the products of the means of all the predictor
variables with their coefficients is calculated, the
following is the result:
$$(-2.04) + (.12 \times 17.01) + (.66 \times .30) + (.21 \times .10) + (.49 \times .03) + (-.93 \times .98) + (-.09 \times 4.38) + (-.22 \times 3.76) + (-.27 \times 1.27) = -2.21$$
The coefficients are shown here rounded to two decimals; the total is
computed from the unrounded estimates. The model predicts a log rate of
−2.21 for an average girl, or $e^{-2.21} = .11$ pregnancies per year.
Because the average number of years of sexual activity in the sample is 1.7,
this would give a total of $1.7 \times .11 = .19$ pregnancies for an average
girl, one who has average values on all the predictor variables that we
included in the model. Of the estimates obtained so far, this one is closest
to the known value in our dataset.
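
The offset version of the calculation can be checked in the same way. The Python sketch below uses the unrounded estimates; because the coefficient estimates are unchanged by the overdispersion correction, the values for African American and contraceptive self-efficacy are taken from Table 6. The dictionary keys are illustrative.

```python
import math

# Estimates for the offset model (Tables 5 and 6 report the same estimates).
estimates = {
    "intercept": -2.0420, "age": 0.1220, "black": 0.6604, "hisp": 0.2070,
    "asian": 0.4896, "single": -0.9294, "college": -0.0871,
    "effic": -0.2241, "contrac": -0.2729,
}
# Sample means from Table 2; years is handled separately as the exposure.
means = {
    "age": 17.01, "black": 0.30, "hisp": 0.10, "asian": 0.03, "single": 0.98,
    "college": 4.38, "effic": 3.76, "contrac": 1.27,
}
mean_years = 1.74

log_rate = estimates["intercept"] + sum(estimates[k] * m for k, m in means.items())
rate_per_year = math.exp(log_rate)
# Multiplying by exposure is the same as adding log(years) on the log scale.
expected_pregnancies = mean_years * rate_per_year

print(f"log rate = {log_rate:.2f}")
print(f"pregnancies per year = {rate_per_year:.2f}")
print(f"expected pregnancies over {mean_years} years = {expected_pregnancies:.2f}")
```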
Overdispersion. Overdispersion means having
a Poisson-like distribution that is not quite Pois-
son, because its variance is larger than its mean.
When overdispersion is present, Poisson regres-
sion coefficients are reliable but the variance is
larger than the statistical program would expect
for a Poisson distribution. As a result, the standard

errors calculated by the program are artificially
smaller than the true standard errors, and thus may
lead to more liberal significance test results and a
greater likelihood of Type I errors.
Overdispersion can be detected by comparing the reported model χ² statistic
with its degrees of freedom (divide χ² by the degrees of freedom). An
overdispersed model will have a ratio greater than 2; the greater the ratio,
the greater the overdispersion. An omitted predictor variable can result in
apparent overdispersion or underdispersion; the researcher should
investigate this possibility while examining the diagnostics.
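
The ratio is simple to compute by hand. The Python sketch below uses the χ² value and degrees of freedom quoted earlier in the text, on the assumption that those figures apply to the model being checked; the variable names are illustrative.

```python
import math

pearson_chi_square = 3359.9   # chi-square value reported in the text
degrees_of_freedom = 1144     # degrees of freedom reported in the text

ratio = pearson_chi_square / degrees_of_freedom
# The square root of the ratio is the factor used to correct standard errors.
scale = math.sqrt(ratio)

print(f"chi-square / df = {ratio:.2f}")
print(f"square root     = {scale:.3f}")
```

The square root of this ratio is very close to the scale value of 1.713 that appears in the overdispersion-corrected output.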
Overdispersion is corrected for by multiplying each standard error by the
square root of the model Pearson’s χ² divided by the degrees of freedom. SAS
will perform this adjustment automatically if the PSCALE option is specified
(Allison, 1999). In this example, the code is as follows. Notice it is the
same as before except for the PSCALE option at the end.
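proc genmod data = adhealth.PoissonAnalysis;
model numpreg = age black hisp asian single
college effic contrac
/link = log
dist = poisson
offset = logyrs
pscale;
run;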
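
The effect of the correction on the standard errors and significance tests can be verified from the published tables. The Python sketch below multiplies a few uncorrected standard errors from Table 5 by the scale value reported in Table 6, then recomputes the Wald χ² and its p value with 1 degree of freedom. The function names are illustrative, and only rows that appear in both tables are used.

```python
import math

scale = 1.713  # scale value reported in Table 6

# estimate, uncorrected SE (Table 5), corrected SE as reported (Table 6)
rows = {
    "Intercept": (-2.0420, 0.9607, 1.6456),
    "Current age": (0.1220, 0.0543, 0.0929),
    "Hispanic": (0.2070, 0.2186, 0.3745),
    "Never married": (-0.9294, 0.2080, 0.3563),
}

def wald_chi_square(estimate, se):
    """Wald statistic: the squared ratio of an estimate to its standard error."""
    return (estimate / se) ** 2

def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))

for name, (estimate, se_raw, se_reported) in rows.items():
    se_adjusted = se_raw * scale
    chi2 = wald_chi_square(estimate, se_adjusted)
    print(f"{name:15s} SE {se_raw:.4f} -> {se_adjusted:.4f} "
          f"(reported {se_reported:.4f}), chi-square {chi2:.2f}, p {p_value_1df(chi2):.4f}")
```

Small differences from the published values in the last digit reflect rounding of the inputs.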
The output, presented in Table 6, looks very
similar to the previous Poisson analysis. The
coefficient estimates are exactly the same. The
difference is in the estimated standard errors,
which are now about 1.7 times larger. The scale
parameter has also been adjusted upward from 1
to 1.713, reflecting the degree of adjustment made.
The increase in the standard errors affects the
hypothesis tests, reducing the χ² statistics and
increasing the p values. This means that some
statistically significant effect estimates will no
longer reach significance. In this case, only African
American race and marital status remain as
significant effects. The correction for overdispersion
takes a conservative approach to estimating the
standard errors, which means it tends to err on the
side of Type II errors, that is, saying that an effect
is not significant even if it might be. In order to
gain a little more power to detect significant effects,
an alternative approach would be to use a negative
binomial specification.

Table 5. Poisson Regression Output for Model Predicting Number of Pregnancies, with Years of Sexual
Activity Included as an Offset to Account for Varying Time-Frame of Exposure

                                              Standard      95% Confidence
Parameter                       DF  Estimate     Error              Limits      χ²  Probability > χ²
Intercept                        1   -2.0420     .9607   -3.9248    -.1591    4.52             .0335
Current age                      1     .1220     .0543     .0156     .2283    5.05             .0246
Hispanic                         1     .2070     .2186    -.2215     .6354     .90             .3438
Asian                            1     .4896     .3294    -.1561    1.1352    2.21             .1373
Never married                    1    -.9294     .2080   -1.3371    -.5218   19.97            <.0001
College plans                    1    -.0871     .0515    -.1881     .0139    2.86             .0909
Consistent contraceptive use     1    -.2729     .0825    -.4346    -.1113   10.95             .0009
Scale                            0         1         0         1         1

Table 6. Poisson Regression Output for Model with Correction for Overdispersion

                                              Standard      95% Confidence
Parameter                       DF  Estimate     Error              Limits      χ²  Probability > χ²
Intercept                        1   -2.0420    1.6456   -5.2673    1.1834    1.54             .2147
Current age                      1     .1220     .0929    -.0602     .3041    1.72             .1894
African American                 1     .6604     .2205     .2283    1.0925    8.97             .0027
Hispanic                         1     .2070     .3745    -.5270     .9409     .31             .5805
Asian                            1     .4896     .5643    -.6165    1.5956     .75             .3857
Never married                    1    -.9294     .3563   -1.6277    -.2312    6.81             .0091
College plans                    1    -.0871     .0883    -.2601     .0859     .97             .3237
Contraceptive self-efficacy      1    -.2241     .1447    -.5078     .0596    2.40             .1215
Consistent contraceptive use     1    -.2729     .1413    -.5498     .0039    3.73             .0533
Scale                            0     1.713         0     1.713     1.713
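The relationship between Tables 5 and 6 can be checked by hand. The Python sketch below is only an illustration, not the SAS computation itself: it assumes that each corrected standard error is the uncorrected one multiplied by the scale of 1.713, that the χ² column is the squared ratio of the estimate to its standard error, and that the p value comes from a χ² distribution with 1 degree of freedom. The function name `corrected` is hypothetical; the estimates and standard errors are copied from Table 5.

```python
import math

SCALE = 1.713  # scale parameter reported in Table 6

# (estimate, uncorrected standard error) copied from Table 5
table5 = {
    "Intercept": (-2.0420, 0.9607),
    "Current age": (0.1220, 0.0543),
    "Never married": (-0.9294, 0.2080),
    "Consistent contraceptive use": (-0.2729, 0.0825),
}

def corrected(estimate, se, scale):
    """Return the rescaled SE, the Wald chi-square, and its 1-df p value."""
    se_adj = se * scale
    wald = (estimate / se_adj) ** 2
    # Upper tail of a chi-square with 1 df: P(|Z| > sqrt(wald))
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return se_adj, wald, p_value

for name, (b, se) in table5.items():
    se_adj, wald, p_value = corrected(b, se, SCALE)
    print(f"{name:30s} SE={se_adj:.4f}  chi2={wald:.2f}  p={p_value:.4f}")
```

The printed values should agree with the corresponding rows of Table 6 up to rounding, since the scale itself is reported to only three decimals.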
Negative binomial modeling. Another option
if overdispersion is present is to use the negative
binomial specification rather than the Poisson.
This is easily accomplished by changing the DIST
line in the model statement from POISSON to NB,
as follows:
/link = log
dist = nb
offset = lactvyrs;
run;
The output is very similar and can be interpreted
in the same way. As with the PSCALE adjustment,
the negative binomial specification results in
nearly the same coefficient estimates as in the
main Poisson model, but with slightly larger
standard errors. The negative binomial specifica-
tion takes care of the overdispersion problem, but
is a little less conservative than using corrected
standard errors. When using this model there are
highly significant effects for most of the pre-
dictors, and marginally significant effects for age
and college plans. Table 7 presents the results from
this analytical approach.
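To make the role of the dispersion parameter more concrete, the sketch below compares the variance implied by the two specifications. It assumes the common parameterization in which the negative binomial variance is μ + kμ², where k is the dispersion parameter; the function names and the mean of 2 pregnancies are hypothetical choices for illustration, and only the dispersion value of .0007 is taken from Table 7.

```python
def poisson_variance(mu):
    """Poisson assumption: the variance equals the mean."""
    return mu

def negbin_variance(mu, k):
    """Negative binomial (assumed NB2 form): variance = mu + k * mu**2."""
    return mu + k * mu ** 2

mu = 2.0               # hypothetical expected count
k_table7 = 0.0007      # dispersion estimate reported in Table 7

print("Poisson variance:          ", poisson_variance(mu))
print("Negative binomial variance:", negbin_variance(mu, k_table7))
print("Ratio of the two:          ", negbin_variance(mu, k_table7) / poisson_variance(mu))
```

With a dispersion estimate this close to zero, the two variances are nearly equal under this parameterization, which is consistent with the negative binomial output in Table 7 closely matching the uncorrected Poisson output.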
Table 8 compares the parameter estimates and
standard errors from all four models: OLS,
Poisson, Poisson with the correction for over-
dispersion, and negative binomial. The parameter
estimates for the Poisson-type models are quite
similar, and the standard errors for the negative
binomial model are closer to those of the uncor-
rected than the corrected Poisson. Because it is not
possible to compare directly the OLS and Poisson-
type parameter estimates (because the former deal
with raw pregnancies and the latter with log-
pregnancies), in the last column of the table, the
expected percent change in pregnancies due to a
unit change in each predictor variable for the
Poisson and negative binomial models is provided.
These transformed parameter estimates
were calculated in the way described above.
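The transformation can be sketched in a few lines of Python, assuming that it is 100 × (exp(b) − 1), the usual way to convert a log-link coefficient b into a percent change. The function name `percent_change` is hypothetical; the coefficients are the Poisson estimates reported in Table 6.

```python
import math

def percent_change(coefficient):
    """Expected percent change in the count for a one-unit increase in a predictor."""
    return 100.0 * (math.exp(coefficient) - 1.0)

# Poisson coefficient estimates copied from Table 6
coefficients = {
    "Current age": 0.1220,
    "African American": 0.6604,
    "Hispanic": 0.2070,
    "Asian": 0.4896,
    "Never married": -0.9294,
    "College plans": -0.0871,
    "Contraceptive self-efficacy": -0.2241,
    "Consistent contraceptive use": -0.2729,
}

for name, b in coefficients.items():
    print(f"{name:30s} {percent_change(b):7.1f}")
```

Rounded to one decimal, these values match the last column of Table 8; for example, each additional year of age corresponds to about a 13% increase in expected pregnancies.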
DISCUSSION
When making decisions about which modeling
approach is most appropriate for use with a given
data set, consider the degree of normality or non-
normality in the distribution of the outcome vari-
able. If the dependent variable is fairly normally
distributed, OLS regression may be the simplest
approach and an appropriate choice in many cases.
However, when the outcome variable of interest
takes the form of infrequently occurring count
data with highly skewed distributions, Poisson or
negative binomial regression approaches may be
more appropriate. Through the use of SAS or
similar statistical software, such analyses are also
fairly simple to execute and generate results that
are meaningful and easy to interpret.
In choosing between Poisson and negative
binomial regression, the factors to consider are
overdispersion and power. When overdispersion is
present, and the Poisson assumption of equidis-
persion is violated, either a Poisson model with
corrected standard errors or a negative binomial
model may be used. Negative binomial modeling
may give you a little more statistical power.
Negative binomial regression makes no assumptions
of equidispersion and no adjustments need to
be made when overdispersion is present. If Pois-
son modeling is chosen, the statistical adjustments
described above must be made to correct the
standard errors. Hutchinson et al. (2003) provided
examples of Poisson regressions that incorporate
corrections for overdispersion. In their analysis of
sexual risk behaviors among inner-city adolescent
females, only one of the three sexual risk
outcomes measured (number of male sexual
partners during the past 3 months) showed
equidispersion. The other two outcomes (number
of days had sexual intercourse and number of days
had unprotected sexual intercourse) were overdispersed
and corrected using the procedures
described above.

Table 7. Negative Binomial Regression Output

                                              Standard      95% Confidence
Parameter                       DF  Estimate     Error              Limits      χ²  Probability > χ²
Intercept                        1   -2.0414     .9608   -3.9245    -.1584    4.51             .0336
Current age                      1     .1220     .0543     .0156     .2283    5.05             .0246
Hispanic                         1     .2070     .2187    -.2216     .6355     .90             .3439
Asian                            1     .4897     .3295    -.1561    1.1354    2.21             .1372
Never married                    1    -.9297     .2081   -1.3375    -.5219   19.96            <.0001
College plans                    1    -.0871     .0515    -.1882     .0139    2.86             .0909
Consistent contraceptive use     1    -.2730     .0825    -.4347    -.1113   10.96             .0009
Dispersion                       0     .0007         0     .0007     .0007
As is described above and illustrated in Table 8,
OLS, Poisson, and negative binomial regressions
yield regression coefficients that are similar in direction.
However, because of the non-normality of the
distributions, the size of the standard errors and the
resultant level of significance of the coefficients
vary. The inappropriate use of OLS regression
could lead one to commit Type I errors and erro-
neously conclude that some variables are sig-
nificant predictors of the number of adolescent
pregnancies when in fact their effects are null.
A related problem is that OLS will model
pregnancies as a linear function of the predictor
variables, which can lead to inaccurate predictions
for the number of pregnancies for girls who
measure even moderately high or low on those
variables. Because the Poisson and negative bino-
mial models build nonlinearity into the model by
way of the log transformation, one gets a much
better model fit to the data and more realistic
predicted values.
Finally, the Poisson and negative binomial
models have a natural way of dealing with the
problem of differential exposure among subjects.
In our example, we take into account the number
of years of sexual experience of each girl, an
important predictor of pregnancies. By using years
of experience or exposure as an ‘‘offset’’ variable,
the models are automatically adjusted to give
results that reflect the risk of pregnancy per year of
exposure.
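The effect of the offset can be illustrated with a small Python sketch. With a log link, the model is log(expected count) = log(exposure) + linear predictor, so the expected count is the exposure multiplied by a per-year rate. The function names and the linear predictor value of -1.5 are hypothetical and do not come from the fitted model. The sketch also assumes the offset variable holds the log of years of sexual activity, as the name lactvyrs suggests.

```python
import math

def rate_per_year(linear_predictor):
    """Expected events per year of exposure implied by the linear predictor."""
    return math.exp(linear_predictor)

def expected_count(linear_predictor, years_exposed):
    """log(mu) = log(years) + linear predictor, so mu = years * exp(linear predictor)."""
    log_offset = math.log(years_exposed)  # the offset enters with a coefficient fixed at 1
    return math.exp(log_offset + linear_predictor)

eta = -1.5  # hypothetical linear predictor for one girl

for years in (1, 2, 4):
    print(f"{years} year(s) of exposure: expected count = {expected_count(eta, years):.3f}")
print(f"rate per year = {rate_per_year(eta):.3f}")
```

Doubling the years of exposure doubles the expected count while the per-year rate stays the same, which is what allows the coefficients to be read as effects on the rate of pregnancy per year of exposure.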
In conclusion, Poisson and negative binomial
regression may provide more appropriate means
for modeling infrequently occurring repeatable
events or counts. In addition to being better suited
to the data when the outcome variable is skewed,
these approaches have the additional advantages
of being able to accommodate differential expo-
sure and non-linear effects. Although researchers
may be less familiar with these regression models,
they are no more difficult to execute than tradi-
tional OLS regression when using SAS or similar
statistical software packages.
REFERENCES
Allison, P. (1999). Logistic regression using the SAS
system: Theory and application. Cary, NC: The SAS
Institute.
Blum, R.W., Beuhring, T., Shew, M.L., Bearinger, L.H.,
Sieving, R.E., & Resnick, M.D. (2000). The effects of
race/ethnicity, income, and family structure on
adolescent risk behaviors. American Journal of
Public Health, 90, 1879–1884.
Cameron, A.C., & Trivedi, P.K. (1998). Regression
analysis of count data. New York: Cambridge
University Press.
Table 8. Comparison of Four Regression Models

                                    OLS                    Poisson                Negative Binomial    Predicted % Change
                                                                                                     per Unit for Poisson
Variable                        Estimate    SE    Estimate    SE  Corrected SE    Estimate    SE            and NB Models
Intercept                            .37   .22       -2.04   .96          1.65       -2.04   .96
Current age                          .05   .01         .12   .05           .09         .12   .05                  13.0
African American                     .18   .03         .66   .13           .22         .66   .13                  93.6
Hispanic                             .03   .05         .21   .22           .37         .21   .22                  23.0
Asian                                .09   .08         .49   .33           .56         .49   .33                  63.2
Single                              -.72   .10        -.93   .21           .36        -.93   .21                 -60.5
Years of sexual activity             .07   .01
College plans                       -.04   .01        -.09   .05           .09        -.09   .05                  -8.3
Contraceptive self-efficacy         -.05   .02        -.22   .08           .14        -.22   .08                 -20.1
Consistent contraceptive use        -.07   .02        -.27   .08           .14        -.27   .08                 -23.9

n for all models = 1,154

DiClemente, R.J., Lodico, M., & Grinstead, O.A.
(1996). African American adolescents residing in
high-risk urban environments do use condoms:
Correlates and predictors of condom use among
adolescents in public housing developments. Pedia-
trics, 98, 269–278.
Hutchinson, M.K. (2002). Sexual risk communication
with mothers and fathers: Influence on the sexual risk
behaviors of adolescent daughters. Family Relations,
51, 238–247.
Hutchinson, M.K., & Cooney, T.M. (1998). Parent-teen
sexual risk communication implications for interven-
tion. Family Relations, 47, 185–194.
Hutchinson, M.K., Jemmott, J.B. III, Jemmott, L.S.,
Braverman, P., & Fong, G.T. (2003). The role of
mother–daughter sexual risk communication in
reducing sexual risk behaviors among urban adoles-
cent females: A prospective study. Journal of
Adolescent Health, 33, 98–107.
Jemmott, J.B. III, Jemmott, L.S., & Hacker, C.I. (1992).
Predicting intentions to use condoms among African
American adolescents: The theory of planned beha-
vior as a model of HIV risk-associated behavior.
Ethnicity and Disease, 2, 371–380.
Kann, L., Kinchen, S., Williams, B., Ross, J., Lowery,
R., Grunbaum, J., et al. (2000). Youth risk behavior
surveillance—U.S., 1999. Morbidity and Mortality
Weekly Report, 49, 1–96.
Lewis-Beck, M. (1989). Applied regression: An intro-
duction. Newbury Park, CA: Sage.
Scher, P.W., Emans, S.J., & Grace, E.M. (1982). Factors
associated with compliance to oral contraceptive use
in an adolescent population. Journal of Adolescent
Health Care, 3, 120–123.
