This document describes an analysis of count data from a study on the detection of anthelmintic resistance in gastrointestinal nematodes of small ruminants. The data consists of egg counts from 30 goats and 30 sheep that were grouped into Albendazole, Ivermectin, and control groups. The data was analyzed using Poisson and negative binomial regression models in R software. The Poisson model did not fit the data well due to overdispersion. However, the negative binomial regression model provided a better fit for the overdispersed data. Key findings from the negative binomial regression analysis are summarized.
Analysis of count data using Poisson and negative binomial regression models
ADDIS ABABA UNIVERSITY
COLLEGE OF VETERINARY MEDICINE AND AGRICULTURE
Assignment for the course “Advanced Biostatistics” ON ANALYSIS OF COUNT DATA
By Walkite Furgasa Chala (DVM), ID No. GSR/2792/10
Submitted to:
Samson Leta (DVM, MSc, Assistant Professor)
December, 2017
Bishoftu, Ethiopia
Table of Contents
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
SUMMARY
1. INTRODUCTION
2. STATISTICAL TESTS TO ANALYZE COUNT DATA
2.1 Poisson Regression
2.2 Negative Binomial Regression
2.3 Zero Inflated Regression
3. ANALYSIS OF A COUNT DATA
3.1. Source of Data and Its Description
3.2. Types of Variables of the Data
3.3. Poisson Regression Analysis and Its Interpretation
3.4. Negative Binomial Regression Analysis and Its Interpretation
4. REFERENCES
LIST OF FIGURES
Figure 1. Q-Q plot of Poisson regression analysis
Figure 2. Q-Q plot of negative binomial regression analysis
LIST OF ABBREVIATIONS
AIC Akaike Information Criterion
EPG Eggs per gram of faeces
GLM Generalized linear model
IRR Incidence rate ratio
NBREG Negative binomial regression model
ZINB Zero-inflated negative binomial model
ZIP Zero-inflated Poisson model
SUMMARY
In statistics, count data are a data type in which the observations can take only non-negative
integer values. Count models are a subset of discrete response regression models; count data are
distributed as non-negative integers, are intrinsically heteroskedastic and right skewed, and
have a variance that increases with the mean. An individual piece of count data is often termed
a count variable. When such a variable is treated as a random variable, the Poisson and negative
binomial distributions are commonly used to represent its distribution, and zero-inflated
regression is used when there are excess zeros. The objective of this assignment was to describe
and analyze a count data set using R software. The title of the study that produced the data is
“Detection of Anthelmintic Resistance in Gastrointestinal Nematodes of Small Ruminants in
Haramaya University Farms”. Sheep and goats infected with gastrointestinal nematodes were
selected, and I took 30 goats and 30 sheep. The goats and sheep were each grouped into an
Albendazole group (10), an Ivermectin group (10) and a control group (10). Eggs were counted
before and after treatment in the treated groups, and twice in the control group in parallel with
the treated groups. For this assignment, the change in egg count was taken for the treated groups
and the second egg count for the control group. The data were analyzed in R with Poisson and
negative binomial regression models. The Poisson model did not fit the data: the overdispersion
test gave strong evidence of overdispersion (α estimated at 871.0, so the variance is about 872
times the mean), which speaks strongly against the assumption of equidispersion (α = 0). The
deviance goodness-of-fit test (pchisq) returned a p-value of 0, which is significant and also
indicates lack of fit, and the normal quantile (Q-Q) plot showed that the deviance residuals were
not normally distributed. Because almost all assumptions of the Poisson model were violated, the
negative binomial regression model, which accommodates overdispersion, was fitted, and it fit the
data. The results were interpreted on the basis of the negative binomial regression.
Keywords: Analysis, count data
1. INTRODUCTION
In statistics, count data is a statistical data type in which the observations can take only the
non-negative integer values {0, 1, 2, 3, ...}, and where these integers arise from counting.
The statistical treatment of count data is distinct from that of binary data, in which the
observations can take only two values, usually represented by 0 and 1, and from ordinal data,
which may also consist of integers but where the individual values fall on an arbitrary scale
and only the relative ranking is important (Cameron and Trivedi, 2013).
Count models are a subset of discrete response regression models. Count data are
distributed as non-negative integers, are right skewed, and have a variance that increases
with the mean. Examples of count data include the length of hospital stay, the number of a
certain species of fish per defined area in the ocean, the number of lights displayed by
fireflies over specified time periods, the classic case of the number of deaths, and the
number of occurrences of thunderstorms in a calendar year. An individual piece of count
data is often termed a count variable. When such a variable is treated as a random
variable, the Poisson and negative binomial distributions are commonly used to represent its
distribution (Cameron and Trivedi, 1986).
Graphical examination of count data may be aided by data transformations chosen to
stabilise the sample variance. In particular, the square root transformation may be used
when the data can be approximated by a Poisson distribution (although other
transformations have modestly improved properties), while an inverse sine transformation
is available when a binomial distribution is preferred (Hilbe, 2011b).
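The variance-stabilising effect of the square root transformation can be checked with a small simulation. The sketch below is only an illustration in base R; the Poisson means, the sample size and the seed are assumed values, not taken from the study.

```r
# Sketch: the square root transformation roughly stabilises Poisson variance.
set.seed(1)                              # arbitrary seed, for reproducibility only
means <- c(10, 50, 200, 1000)            # illustrative Poisson means (assumed)

# Simulate 10,000 counts for each mean
sim <- lapply(means, function(m) rpois(10000, lambda = m))

var_raw  <- sapply(sim, var)                        # grows with the mean
var_sqrt <- sapply(sim, function(y) var(sqrt(y)))   # stays close to 1/4

print(data.frame(mean = means,
                 var_raw = round(var_raw, 1),
                 var_sqrt = round(var_sqrt, 3)))
```

On the raw scale the variance tracks the mean, while on the square root scale it stays near 0.25 whatever the mean, which is what makes plots of transformed counts easier to read.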
2. STATISTICAL TESTS TO ANALYZE COUNT DATA
2.1 Poisson Regression
The Poisson distribution can form the basis for some analyses of count data, in which
case Poisson regression may be used. Poisson regression is a special case of the class of
generalized linear models, which also contains models based on the binomial distribution
(binomial regression, logistic regression) or the negative binomial distribution. These
are used where the assumptions of the Poisson model are violated, in particular when the
range of count values is limited or when overdispersion is present (Hilbe, 2011a).
A key feature of the Poisson model is the equality of the mean and variance functions. When
the variance of the response exceeds its mean, the model is termed overdispersed.
Simulation studies have demonstrated that overdispersion is indicated when the Pearson
χ² dispersion statistic is greater than 1.0. The dispersion statistic is defined as the
Pearson χ² divided by the model residual degrees of freedom. Overdispersion, common to most
Poisson models, distorts inference: in particular the standard errors are underestimated,
so effects appear more significant than they are. When Poisson overdispersion is real, and
not merely apparent, a count model other than Poisson is required (Hilbe, 2008).
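The dispersion statistic described above can be computed directly from a fitted Poisson GLM. The following is a minimal sketch on simulated data; the helper name pearson_dispersion, the covariate x and all parameter values are hypothetical and chosen only for illustration.

```r
# Sketch: Pearson chi-square dispersion statistic for a Poisson GLM.
# pearson_dispersion() is a hypothetical helper, not a function from a package.
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

set.seed(42)                              # arbitrary seed
n  <- 200
x  <- rnorm(n)                            # illustrative covariate
mu <- exp(1 + 0.5 * x)                    # assumed log-linear mean

y_pois <- rpois(n, mu)                    # equidispersed counts
y_over <- rnbinom(n, size = 1, mu = mu)   # overdispersed counts (variance mu + mu^2)

disp_pois <- pearson_dispersion(glm(y_pois ~ x, family = poisson))
disp_over <- pearson_dispersion(glm(y_over ~ x, family = poisson))

print(c(poisson_data = disp_pois, overdispersed_data = disp_over))
```

For the equidispersed counts the statistic should be close to 1, while for the overdispersed counts it is clearly larger than 1.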
Poisson regression is the basic model on which a variety of count models are based. It is
derived from the Poisson probability mass function. The Poisson regression model is the
benchmark model for count data in much the same way as the normal linear model is the
benchmark for real-valued continuous data (Cameron and Trivedi, 1986).
The Poisson model is simple, and it is robust. If the only interest of the analysis lies in
estimating the parameters of a log-linear mean function, there is hardly any reason (except
for efficiency) to ever contemplate anything other than the Poisson regression model. In
fact, its applicability extends well beyond the traditional domain of count data. The
Poisson regression model can be used for any constant elasticity mean function, whether
or not the dependent variable is a count, and there are good reasons why it should be
preferred over the more common log transformation of the dependent variable. And yet,
there are instances where the Poisson regression model is unsuited. Essentially, the
Poisson model is always overly restrictive when it comes to estimating features of the
population other than the mean, such as the variance or the probability of single outcomes.
The Poisson distribution has a positive mean µ. Although a GLM can model a positive mean
using the identity link, it is more common to model the log of the mean. Like the linear
predictor α + βx, the log mean can take any real value. The log mean is the natural parameter
for the Poisson distribution, and the log link is the canonical link for a Poisson GLM. A
Poisson loglinear GLM assumes a Poisson distribution for Y and uses the log link. The
Poisson loglinear model with explanatory variable X is $\log\mu = \alpha + \beta x$. For this
model, the mean satisfies the exponential relationship
$\mu = \exp(\alpha + \beta x) = e^{\alpha}(e^{\beta})^{x}$. A one-unit increase in x has a
multiplicative impact of $e^{\beta}$ on µ: the mean at x + 1 equals the mean at x multiplied
by $e^{\beta}$ (Re).
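The multiplicative effect of a one-unit increase in x can be verified numerically. In the sketch below the values of alpha and beta and the function name mu_at are assumptions made only for the example.

```r
# Sketch: under a log link, a one-unit increase in x multiplies the mean by exp(beta).
alpha <- 0.5                      # assumed intercept
beta  <- 0.3                      # assumed slope

mu_at <- function(x) exp(alpha + beta * x)   # hypothetical helper: mean at x

ratio <- mu_at(3) / mu_at(2)      # mean at x + 1 divided by mean at x
print(c(ratio = ratio, exp_beta = exp(beta)))
print(isTRUE(all.equal(ratio, exp(beta))))
```

This is why the exponentiated coefficients of a Poisson or negative binomial model are reported as incidence rate ratios.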
In some contexts, the Poisson distribution describes the number of events that occur in a
given time period, where its mean µ is the average number of events per period. It has the
unusual feature that its mean equals its variance. Its probability mass function is
$\Pr(Y = y) = e^{-\mu}\mu^{y}/y!$, $y = 0, 1, 2, \ldots$, where e is the base of the natural
logarithms and y! is the factorial of y. The skewness of the Poisson distribution is
$1/\sqrt{\mu}$ and the kurtosis is $3 + 1/\mu$, so that for large µ the distribution
approaches the Normal N(µ, µ) with skewness of zero and kurtosis of three
(Christopher, 2010).
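These properties can be checked numerically in base R. The sketch below uses an assumed mean of 4 and a truncated support of 0 to 200, which is wide enough for the omitted tail to be negligible.

```r
# Sketch: check the Poisson probability mass function and its moments numerically.
mu <- 4                                   # assumed mean, for illustration only
y  <- 0:10

pmf_manual <- exp(-mu) * mu^y / factorial(y)          # formula written out
print(isTRUE(all.equal(pmf_manual, dpois(y, lambda = mu))))

# Moments computed over a wide support
k <- 0:200
p <- dpois(k, lambda = mu)
m    <- sum(k * p)                        # mean
v    <- sum((k - m)^2 * p)                # variance
skew <- sum((k - m)^3 * p) / v^1.5        # skewness
kurt <- sum((k - m)^4 * p) / v^2          # kurtosis

print(c(mean = m, variance = v, skewness = skew, kurtosis = kurt))
```

With µ = 4 the mean and variance are both 4, the skewness is 0.5 and the kurtosis is 3.25.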
2.2 Negative Binomial Regression
A limitation of the Poisson distribution is the equality of its mean and variance. We often
observe count data processes where this equality is not reasonable: in particular, where the
conditional variance is larger than the conditional mean. This is termed overdispersion, and its
presence renders the assumption of a Poisson distribution for the error process untenable. It is
particularly likely to occur in the case of unobserved heterogeneity. In this circumstance, a
reasonable alternative is negative binomial regression. The negative binomial is a conjugate
mixture distribution for count data. The negative binomial (NB) distribution is a two-parameter
distribution. For positive integer n, it is the distribution of the number of failures that occur in a
sequence of trials before n successes have occurred, where the probability of success in each trial
is p. The distribution is defined for any positive n. The negative binomial distribution is a
mixture of the Poisson distribution and the Gamma distribution: the counts are Poisson with a
mean that itself follows a Gamma distribution, and the probability mass function involves the
gamma function, the generalized factorial function. Unlike the Poisson, which is fully
characterized by its mean µ, the NB distribution is a function of both µ and α. Its mean is still
µ, but its conditional variance is µ(1 + α) in the linear (NB1) parameterization; in the quadratic
(NB2) parameterization, which is the one fitted by glm.nb in R, the variance is µ(1 + αµ). In
either case, as α approaches 0 the distribution becomes the Poisson distribution
(Christopher, 2010).
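The Poisson-Gamma mixture can be illustrated by simulation. The sketch below assumes the quadratic (NB2) form, in which the variance is µ + µ²/θ with θ = 1/α; the values of µ and θ, the sample size and the seed are arbitrary choices for the example.

```r
# Sketch: a negative binomial count as a Poisson whose mean is Gamma distributed.
set.seed(7)                               # arbitrary seed
n     <- 100000
mu    <- 5                                # assumed mean
theta <- 2                                # assumed NB2 size parameter (alpha = 1 / theta)

# Gamma-distributed means with expectation mu and variance mu^2 / theta
lambda <- rgamma(n, shape = theta, rate = theta / mu)
y <- rpois(n, lambda)                     # Poisson counts given each lambda

nb2_var <- mu + mu^2 / theta              # theoretical NB2 variance: mu * (1 + alpha * mu)
print(c(sample_mean = mean(y), sample_var = var(y), nb2_var = nb2_var))
```

The sample mean stays near µ while the sample variance is near 17.5, far above the value of 5 that a Poisson distribution with the same mean would give.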
2.3 Zero Inflated Regression
In many studies count data contain an excess of zeros. When the data consist of
non-negative, highly skewed counts with a large proportion of zeros, Zero-Inflated
Poisson (ZIP), Zero-Inflated Negative Binomial (ZINB) and hurdle models are useful for
analysing them. Zero counts may not arise from the same process as the positive counts.
Zero-inflated count data may not have equal mean and variance, in which case
over-dispersion (or under-dispersion) needs to be taken into account (Lambert, 1992).
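A simple way to see zero inflation is to compare the observed share of zeros with the share a Poisson distribution with the same mean would produce. The sketch below simulates zero-inflated Poisson counts in base R; the mixing probability pi0, the mean mu and the sample size are assumed values.

```r
# Sketch: excess zeros in simulated zero-inflated Poisson (ZIP) counts.
set.seed(3)                               # arbitrary seed
n   <- 5000
pi0 <- 0.3                                # assumed probability of a structural zero
mu  <- 2                                  # assumed Poisson mean of the count process

structural_zero <- rbinom(n, size = 1, prob = pi0) == 1
y <- ifelse(structural_zero, 0L, rpois(n, mu))

obs_zero      <- mean(y == 0)                         # observed share of zeros
exp_zero_pois <- dpois(0, lambda = mean(y))           # expected under Poisson, same mean
exp_zero_zip  <- pi0 + (1 - pi0) * dpois(0, mu)       # expected under the ZIP mixture

print(c(observed = obs_zero, poisson = exp_zero_pois, zip = exp_zero_zip))
print(c(mean = mean(y), variance = var(y)))
```

The observed share of zeros is well above the Poisson expectation and the variance exceeds the mean, so zero inflation shows up as a form of overdispersion. In the EPG data of this assignment no zero counts were recorded, which is why only the Poisson and negative binomial models were fitted.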
3. ANALYSIS OF A COUNT DATA
3.1. Source of Data and Its Description
The data originally come from my DVM thesis and appear in the East Africa Journal of Veterinary
and Animal Science (03 galley proof; Walkite et al., 2017). The title of the research is
“Detection of Anthelmintic Resistance in Gastrointestinal Nematodes of Small Ruminants in
Haramaya University Farms”. Sheep and goats infected with gastrointestinal nematodes
were selected, and I took 30 goats and 30 sheep. The goats and sheep were each grouped into
an Albendazole group (10), an Ivermectin group (10) and a control group (10). Eggs were
counted before and after treatment in the treated groups, and twice in the control group in
parallel with the treated groups. For this assignment, the change in egg count was taken for
the treated groups and the second egg count for the control group (Walkite et al., 2017).
Table 1: The raw data of the assignment
No.  ID    Age        Species  Sex     Treatment    EPG
1    1546  >3yrs      goat     male    Albendazole  1050
2    1595  <-1yrs     goat     male    Albendazole  2500
3    1612  2yrs-3yrs  goat     male    Albendazole  2800
4    1599  <-1yrs     goat     male    Albendazole  1450
5    1576  2yrs-3yrs  goat     male    Albendazole  9050
6    1593  <-1yrs     goat     male    Albendazole  2300
7    1609  <-1yrs     goat     male    Albendazole  1050
8    1608  <-1yrs     goat     female  Albendazole  650
9    1526  2yrs-3yrs  goat     female  Albendazole  1850
10   1605  <-1yrs     goat     female  Albendazole  2350
11   63    2yrs-3yrs  goat     female  Ivermectin   400
12   42    2yrs-3yrs  goat     female  Ivermectin   3300
13   110   2yrs-3yrs  goat     male    Ivermectin   5750
14   111   2yrs-3yrs  goat     male    Ivermectin   4900
15   28    2yrs-3yrs  goat     male    Ivermectin   1800
16   1425  2yrs-3yrs  goat     male    Ivermectin   1100
17   80    2yrs-3yrs  goat     male    Ivermectin   2200
18   96    2yrs-3yrs  goat     male    Ivermectin   1050
19   72    2yrs-3yrs  goat     female  Ivermectin   350
20   87    2yrs-3yrs  goat     female  Ivermectin   1500
21   1536  >3yrs      goat     female  control      2550
22   1543  >3yrs      goat     female  control      1600
23   1580  >3yrs      goat     male    control      2250
24   13    >3yrs      goat     male    control      350
25   68    >3yrs      goat     male    control      2800
26   6     >3yrs      goat     male    control      3450
27   5     >3yrs      goat     male    control      700
28   21    >3yrs      goat     male    control      600
29   31    >3yrs      goat     female  control      1000
30   259   >3yrs      goat     female  control      700
31   106   <-1yrs     sheep    male    Albendazole  300
32   13    2yrs-3yrs  sheep    female  Albendazole  2050
33   237   2yrs-3yrs  sheep    female  Albendazole  400
34   42    2yrs-3yrs  sheep    female  Albendazole  200
35   95    >1yrs      sheep    male    Albendazole  5100
36   190   <-1yrs     sheep    male    Albendazole  250
37   148   >3yrs      sheep    male    Albendazole  1550
38   89    >3yrs      sheep    female  Albendazole  1150
39   158   >3yrs      sheep    male    Albendazole  1500
40   187   >3yrs      sheep    female  Albendazole  2100
41   109   2yrs-3yrs  sheep    male    Ivermectin   1100
42   5     2yrs-3yrs  sheep    female  Ivermectin   350
43   110   >3yrs      sheep    male    Ivermectin   500
44   168   >1yrs      sheep    female  Ivermectin   1200
45   120   >yrs       sheep    male    Ivermectin   2350
46   20    2yrs-3yrs  sheep    male    Ivermectin   300
47   83    1yrs       sheep    male    Ivermectin   1850
48   60    2yrs-3yrs  sheep    female  Ivermectin   2100
49   14    2yrs-3yrs  sheep    female  Ivermectin   900
50   909   >3yrs      sheep    male    Ivermectin   800
51   6     >3yrs      sheep    female  control      1350
52   218   >3yrs      sheep    female  control      1150
53   11    2yrs-3yrs  sheep    female  control      350
54   86    2yrs-3yrs  sheep    male    control      1200
55   220   2yrs-3yrs  sheep    female  control      1350
56   217   >3yrs      sheep    female  control      150
57   147   2yrs-3yrs  sheep    male    control      550
58   15    2yrs-3yrs  sheep    male    control      1200
59   2     2yrs-3yrs  sheep    female  control      1350
60   9     >3yrs      sheep    female  control      1350
3.2. Types of Variables of the Data
EPG is the count response variable, and sex, species, age and treatment are the
categorical explanatory variables.
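Because all explanatory variables are categorical, R fits them as factors and compares each level with a reference level. The sketch below uses six rows copied from Table 1 to show how a reference level can be chosen explicitly with relevel(); making the control group the reference is only an option shown for illustration, not what the analysis in this assignment did, and the object name dat is hypothetical.

```r
# Sketch: categorical predictors as factors with an explicit reference level.
# The six rows are copied from Table 1 (No. 1, 8, 11, 21, 31 and 51).
dat <- data.frame(
  age       = c(">3yrs", "<-1yrs", "2yrs-3yrs", ">3yrs", "<-1yrs", ">3yrs"),
  species   = c("goat", "goat", "goat", "goat", "sheep", "sheep"),
  sex       = c("male", "female", "female", "female", "male", "female"),
  treatment = c("Albendazole", "Albendazole", "Ivermectin",
                "control", "Albendazole", "control"),
  EPG       = c(1050, 650, 400, 2550, 300, 1350),
  stringsAsFactors = FALSE
)

# By default the first level in sort order becomes the reference;
# relevel() makes the choice explicit (here: the untreated control group).
dat$treatment <- relevel(factor(dat$treatment), ref = "control")

print(levels(dat$treatment))   # the first level listed is the reference
```

With the control group as reference, each treatment coefficient would compare a drug with untreated animals rather than with Albendazole.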
3.3. Poisson Regression Analysis and Its Interpretation
library(readxl)   # read_excel() for importing the spreadsheet
walkite_Assignment_ <- read_excel("~/walkite Assignment .xlsx")
attach(walkite_Assignment_)
names(walkite_Assignment_)
View(walkite_Assignment_)
# Poisson regression of EPG on age, species, sex and treatment
nematode <- glm(EPG ~ factor(age) + factor(species) + factor(sex) + factor(treatment),
                family = "poisson", data = walkite_Assignment_)
nematode
summary(nematode)
coef <- coefficients(nematode)   # coefficients on the log scale
coef
IRR <- exp(coefficients(nematode))   # incidence rate ratios
IRR
# predicted values and residual error
pred <- predict(nematode, type = "response")   # estimate predicted values
pred
res <- residuals(nematode, type = "deviance")   # estimate residuals
res
qqnorm(res, plot.it = TRUE)   # normal Q-Q plot of the deviance residuals
qqline(res)
# Evaluating the fit of the Poisson regression model
?pchisq
# deviance goodness-of-fit test
pchisq(nematode$deviance, df = nematode$df.residual, lower.tail = FALSE)
library(AER)   # dispersiontest()
dispersion <- dispersiontest(nematode, trafo = 1)
dispersion
###################################################
library(readxl)
> walkite_Assignment_ <- read_excel("~/walkite Assignment .xlsx")
> View(walkite_Assignment_)
> attach(walkite_Assignment_)
The following object is masked _by_ .GlobalEnv:
age
The following objects are masked from walkite_Assignment_ (pos = 3):
age, EPG, ID, no,, sex, species, treatment
The following objects are masked from walkite_Assignment_ (pos = 4):
age, EPG, ID, no,, sex, species, treatment
The following objects are masked from walkite_Assignment_ (pos = 12):
age, EPG, ID, no,, sex, species, treatment
> names(walkite_Assignment_)
[1] "no," "ID" "age" "species" "sex" "treatment"
[7] "EPG"
> View(walkite_Assignment_)
> nematode<-glm(EPG~factor(age)+factor(species)+factor(sex)+factor(treatment),family = "poisson",data = walkite_Assignment_)
> nematode
Call: glm(formula = EPG ~ factor(age) + factor(species) + factor(sex) +
factor(treatment), family = "poisson", data = walkite_Assignment_)
Coefficients:
(Intercept) factor(age)>3yrs
7.452148 0.005106
factor(age)2yrs-3yrs factor(species)sheep
0.308401 -0.500520
factor(sex)male factor(treatment)control
0.393118 -0.356036
factor(treatment)Ivermectin
-0.307651
Degrees of Freedom: 59 Total (i.e. Null); 53 Residual
Null Deviance: 64280
Residual Deviance: 49460 AIC: 50000
> summary(nematode)
Call:
glm(formula = EPG ~ factor(age) + factor(species) + factor(sex) +
factor(treatment), family = "poisson", data = walkite_Assignment_)
Deviance Residuals:
Min 1Q Median 3Q Max
-41.835 -28.155 -6.764 14.689 78.557
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.452148 0.009425 790.677 <2e-16 ***
factor(age)>3yrs 0.005106 0.010897 0.469 0.639
factor(age)2yrs-3yrs 0.308401 0.009429 32.708 <2e-16 ***
factor(species)sheep -0.500520 0.006708 -74.620 <2e-16 ***
factor(sex)male 0.393118 0.006936 56.681 <2e-16 ***
factor(treatment)control -0.356036 0.009612 -37.039 <2e-16 ***
factor(treatment)Ivermectin -0.307651 0.008409 -36.586 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 64280 on 59 degrees of freedom
Residual deviance: 49456 on 53 degrees of freedom
AIC: 50005
Number of Fisher Scoring iterations: 5
> coef <- coefficients(nematode)
> coef
(Intercept) factor(age)>3yrs
7.452147624 0.005106036
factor(age)2yrs-3yrs factor(species)sheep
0.308400816 -0.500519597
factor(sex)male factor(treatment)control
0.393117562 -0.356036156
factor(treatment)Ivermectin
-0.307651001
> IRR <- exp(coefficients(nematode))
> IRR
(Intercept) factor(age)>3yrs
1723.5607335 1.0051191
factor(age)2yrs-3yrs factor(species)sheep
1.3612465 0.6062156
factor(sex)male factor(treatment)control
[...]
> pchisq(nematode$deviance,df=nematode$df.residual,lower.tail = FALSE)
[1] 0
Interpretation: The p-value of the deviance goodness-of-fit test is zero (0), which is significant and
indicates lack of fit: the residual deviance is far larger than expected under a Poisson model. This is
consistent with overdispersion and shows that the Poisson model does not fit the data.
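The goodness-of-fit p-value can be reproduced from the residual deviance and degrees of freedom reported in the summary output above. The sketch below only re-uses those two reported numbers; the object names are hypothetical.

```r
# Sketch: deviance goodness-of-fit test from the reported summary values.
resid_dev <- 49456    # residual deviance reported by summary(nematode)
resid_df  <- 53       # residual degrees of freedom reported by summary(nematode)

# Upper-tail chi-square probability; a small value indicates lack of fit
p_value <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)

# Deviance per degree of freedom: values far above 1 suggest overdispersion
dev_ratio <- resid_dev / resid_df

print(c(p_value = p_value, deviance_per_df = dev_ratio))
```

The p-value underflows to 0 and the deviance is roughly 933 times its degrees of freedom, whereas a well-fitting Poisson model would give a ratio close to 1.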
> library(AER)
> dispersion <- dispersiontest(nematode,trafo=1)
> dispersion
Overdispersion test
data: nematode
z = 4.2675, p-value = 9.884e-06
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha
871.0029
The overdispersion test gives strong evidence of overdispersion: α is estimated to be 871.0 (that is,
the variance is about 872 times the mean), which speaks strongly against the assumption of
equidispersion (α = 0). The normal quantile (Q-Q) plot also indicates that the deviance residuals are
not normally distributed. Since almost all assumptions are violated and the goodness-of-fit test
indicates that the Poisson model does not fit, it is better to use
negative binomial regression.
3.4. Negative Binomial Regression Analysis and Its Interpretation
# Negative binomial regression
library(MASS)   # glm.nb()
NBREG <- glm.nb(EPG ~ factor(age) + factor(species) + factor(sex) + factor(treatment),
                data = walkite_Assignment_)
NBREG
summary(NBREG)
# Checking the model assumption: likelihood ratio test of Poisson against negative binomial
library(lmtest)   # lrtest()
lrtest(nematode, NBREG)
coef <- coefficients(NBREG)   # coefficients on the log scale
coef
IRR <- exp(coefficients(NBREG))   # incidence rate ratios
IRR
# predicted values and residual error
pred <- predict(NBREG, type = "response")   # estimate predicted values
pred
res <- residuals(NBREG, type = "deviance")   # estimate residuals
res
qqnorm(res, plot.it = TRUE)   # normal Q-Q plot of the deviance residuals
qqline(res)
################################
> library(MASS)
> NBREG<-glm.nb(EPG~factor(age)+factor(species)+factor(sex)+factor(treatment),data = walkite_Assignment_)
> NBREG
Call: glm.nb(formula = EPG ~ factor(age) + factor(species) + factor(sex) +
factor(treatment), data = walkite_Assignment_, init.theta = 1.923394949,
link = log)
Coefficients:
(Intercept) factor(age)>3yrs
7.59952 -0.08562
factor(age)2yrs-3yrs factor(species)sheep
[...]
-0.22734653 -0.55289232 0.71465995 0.46009582 -1.21473337 -0.11852987
55 56 57 58 59 60
0.47377501 -1.85712317 -1.04996037 -0.11852987 0.47377501 0.71465995
> qqnorm(res, plot.it = TRUE)
> qqline(res)
>
The normal quantile (Q-Q) plot indicates that the deviance residuals are almost normally
distributed. Thus the negative binomial regression model fits the data.
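The comparison between the Poisson and negative binomial models can also be made by hand with a likelihood ratio statistic and AIC. The sketch below uses simulated overdispersed counts, not the EPG data; the group labels, effect sizes, size parameter and seed are assumptions. Halving the chi-square p-value is a common adjustment because the Poisson model sits on the boundary of the negative binomial parameter space.

```r
# Sketch: Poisson versus negative binomial on simulated overdispersed counts.
library(MASS)                             # glm.nb()

set.seed(11)                              # arbitrary seed
n     <- 300
group <- factor(rep(c("a", "b", "c"), each = n / 3))    # hypothetical groups
shift <- c(a = 0, b = -0.3, c = -0.4)[as.character(group)]
mu    <- exp(6 + shift)                   # assumed group means on the log scale
d     <- data.frame(y = rnbinom(n, size = 2, mu = mu), group = group)

pois <- glm(y ~ group, family = poisson, data = d)
nb   <- glm.nb(y ~ group, data = d)

# Likelihood ratio statistic for the extra dispersion parameter
lr   <- as.numeric(2 * (logLik(nb) - logLik(pois)))
p_lr <- pchisq(lr, df = 1, lower.tail = FALSE) / 2      # boundary-adjusted p-value

print(c(LR = lr, p_value = p_lr))
print(AIC(pois, nb))                      # lower AIC indicates the better model
```

A large likelihood ratio statistic and a much lower AIC for the negative binomial model are the numerical counterpart of the improvement seen in the Q-Q plot.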
INTERPRETATION
The interpretation is based on the negative binomial regression analysis because the Poisson
model does not fit the data. In the negative binomial regression above, ‘Albendazole’ for
treatment, ‘female’ for sex, ‘<-1yrs’ for age and ‘goat’ for species were used as reference
levels. Sex and age have no statistically significant effect on the EPG count, and the control
group does not differ significantly from the Albendazole reference group. Species has a
significant effect on the EPG count. The percentage change in EPG count associated with
Ivermectin, relative to Albendazole, is (exp(-0.21070)-1)*100 = -18.998, that is, a reduction of
about 19%. Although the EPG count is reduced, the effect of Ivermectin is not statistically
significant because its p-value is 0.3991. Hence this indicates that the parasites are resistant
or that the efficacy of the drug is poor. In general, since the control group
(p-value = 0.2661) has no significant effect on the EPG count, resistance of the parasites was
detected for both Albendazole and Ivermectin.
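The percentage change quoted above follows directly from the reported coefficient. The sketch below repeats the arithmetic in R; only the coefficient -0.21070 comes from the analysis, and the object names are hypothetical.

```r
# Sketch: converting a log-scale coefficient into an IRR and a percentage change.
beta_ivermectin <- -0.21070               # coefficient quoted in the interpretation

irr        <- exp(beta_ivermectin)        # incidence rate ratio relative to the reference
pct_change <- (irr - 1) * 100             # percentage change in the expected EPG count

print(round(c(IRR = irr, percent_change = pct_change), 3))
```

An IRR of about 0.81 means the expected EPG count in the Ivermectin group is about 81% of that in the reference group, which corresponds to the reduction of roughly 19% reported above.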
4. REFERENCES
Cameron, A.C., Trivedi, P.K., 1986. Econometric models based on count data: Comparisons and
applications of some estimators and tests. Journal of Applied Econometrics 1, 29-53.
Cameron, A.C., Trivedi, P.K., 2013. Regression Analysis of Count Data. Cambridge University
Press.
Christopher, B., 2010. Models for Count Data and Categorical Response Data.
Hilbe, J.M., 2008. Brief overview on interpreting count model risk ratios: An addendum to
Negative Binomial Regression. Cambridge University Press, Cambridge.
Hilbe, J.M., 2011a. Modeling count data. International Encyclopedia of Statistical Science.
Springer, pp. 836-839.
Hilbe, J.M., 2011b. Negative Binomial Regression. Cambridge University Press.
Lambert, D., 1992. Zero-inflated Poisson regression, with an application to defects in
manufacturing. Technometrics 34, 1-14.
Walkite, F., Negesse, M., Anwar, H., 2017. Detection of Antihelmintic resistance in
gastrointestinal nematode parasite in small ruminant in Haramaya University farms. pp. 13-19.