Model For Estimating Diversity Presentation

Model for Estimating Population Diversity as the
Prediction of Sample needed for full Coverage
with Applications in Bioinformatics

Torres, David A., Pericchi, Luis R.
Department of Mathematics
University of Puerto Rico, Rio Piedras.

Abstract
There exist several methods for estimating
community diversity using coverage (Bunge and
Fitzpatrick 1993). The biologist and
environmental scientist challenge the
statisticians in order to solve such problem.
Here we present an approach for the estimation
using coverage model (Good, I. G, 1953) and a
population estimator (Good, I. G. and G. H.
Toulmin, 1956). We apply the method to a data
given from microbial diversity presented in the
crop of the hoatzin by molecular analysis of
cloned 16S RNA genes.

Introduction
• Estimating the number of species in a community is
a classical problem in Ecology, biogeography, and
conservation biology, and parallel problems arise in
many other disciplines. This research topic has been
extensively discussed in the literature; see Bunge
and Fitzpatrick (1993), Seber (1982, 1986, 1992) for a
review of the historical and theoretical development.
• Ecologists and other biologists have long
recognized that there are undiscovered species in
almost every survey or species inventory. A parallel
problem is tried to answer how many words did a
particular author know. Efron, B., Twisted, R. (1975).

• A random sample is taken from a Community. We
will refer to this sample as the basic sample.

• Our intention is calculate an estimator for coverage
of the community using the information provided in
the basic sample and then estimate the number of
species in the community.

• Moreover, we pretend to describe a method that
present an estimator of the number of additional
data needed to get a total coverage of the
community .

• An example will be presented in order to apply the
theory.

Methods
• A random sample of size N is drawn from a
community and let be the n r
numbers of
distinct species represented exactly r times in
the sample, then

∞

∑rn
i=1
i =N

• We shall be concerned with, qr , the community
frequency of an arbitrary species that is
represented r times in the basic sample.
• Let, Ε(q ) , be the expected value of q . A main
r r
result used by Good (1953) is that
* (2)
r
Ε (qr ) =
N
where ( r + 1) nr + 1 .
r =
*
nr

• This can be generalized to give a higher
moment of qr . As a matter of fact

m
 r + m  nr + m
Ε ( qr ) =   (3)
 N  nr

where
r = 1,2,3; m = 1,2,3
and t .
t =
m
∏i
i = m+1

• Recursively, we can rewrite (3) as
r + m −1
Ε (q ) ≈
m
r ∏ Ε (q )
i=r
i .

• Moreover, the variance of qr is approximately:

(r + 1)(r + 2) nr + 2  (r + 1)nr + 1 
V (qr ) = 2
− 
N nr  Nnr 
∞
• Note that, then we have that
nr ≤∑ =
rnr N
i=
( r + 1) nr
1

r *
≤
N

As an estimator of the expected total change of all species that

are each r
represented times
( )
r ≤ 1in the basic sample is
( r + 1) nr
N
Also the expected total chance of all species that are represented

times or more in the sample is approximately
∞
1
∑+1 ini
N i= r

In particular note that the expected total change in the sample is
1 ∞ Ε (n1 )
approximately
∑2 knk = 1 − N
N k= (4)

• Hence, the total coverage of the sample (i.e.
the proportion of community represented in the
sample, which is the sum of the population
frequencies of the species represented) is
approximately.

Ε (n1 ) n1 (5)
1− = 1−
N N

The change that the next member of the community will
belong to a new species is estimated as, n1 .
N
Lets write the total number of distinct species in the
sample as ∞
d = ∑ nx
x =1
and suppose that the total number of distinct species
in the community is a known finite number s. Then the
number of non-represented species in the sample is
given by n .=0 s − d

• Then let pµ ( µ = 1, 2,3,) the
be

population frequencies of the species. As in

Good (1953), equation (10),
(6)
s  N!  r
Ε (nr ) = ∑  p (1 − pµ ) N −r
 r !( N − r ) !  µ
µ =1 



Ε nr (λ =∑
( ))
µ

s

s



2
λN !
µ=  r !( λ
1
 r
N −r ) ! 
λ !
N

µ
For the population, we have similarly,

assuming p ≤ 1 for all .

pµ(1 − pµ)

λN −r

pµ 
− ( λ 1)
N −

=∑!( λ − )! pµ(1 −pµ)
r N r
r N−r
 +

1
1− µ p 

µ1= 
s
λ !
N ∞
−λ 1) N !
( −
=∑!( λ − )! pµ(1 −pµ)
µ1 r
= N r
r N−r
∑( −λ 1) N − )! p
i= i !
0 ( − i
∞
λ !
N −λ 1) N !
( − s
=∑!( λ − )! i !( −λ 1) N − )! ∑µ i (1 −pµ) N −
i= r
0 N r ( − i µ1 =
pr+ (

( λ ) ( − λ 1) N ) ( r + )!
( −
r i
N 1
= r+i
Ε nr + )
( i
r !i ! N
i
∞
( r + )!
i
λ
≈ ∑ 1)
(−
r
( λ 1) Ε nr + )
−
i
( i
i=0 r !i !

• For the case r = 0, we not need to assume the value of s,
since this assumption is not required to write
∞
d ∑
ˆ ( λ ) − d = ( − 1) i ( 1 − λ ) i n = s − n (λ )
i =1
i 0
(8)

• We may be particularly interested in the coverage of the
community, then using equation (5) and (7) with r=1 we
have the expected coverage is approximately

n1 1 (9)
1 − ≈ 1 − [n1 − 2(λ − 1)n2 + 3(λ − 1) n3 − ]
2

N N

• The expected number of distinct species
represented is approximately

d + ( λ − 1) n1 − ( λ − 1) n2 + 
2

• We use the coverage to estimate the value of
and straightforward the population size needed
to get 100% coverage. The equation (9) is the
one that is called Good-Toulmin model by the
fact that is a merge between the two models
proposed by them.

Application

• The hoatzin is a South American leaf-eating bird and the
its uniqueness lies in its particular foregut (crop), the only
known for the avian class.
• Forestomach compartmentalization allows mammal
herbivores to be nourished on microbial fermentation
products and microbial biomass. Bacteria are largely
responsible for fermentation of dietary components, and
bacterial cells are themselves subject to digestion by
gastric lysozyme expressed in the abomasum of
ruminants.

• The evolutionary pressure towards foregut specialization
in herbivores was presumably exerted by indigestible
plant polymers (cellulose), so that production of
microbial biomass at expenses of these indigestible
materials has clear advantages.

• In the hoatzin, a preliminary characterization of the crop
microflora was done by culture (Domínguez-Bello et al.,
1993). In this study we aim to characterize the bacterial
diversity in the crop of the hoatzin by a molecular
analysis of cloned 16S rRNA genes.

Results

• For the 69 O.T.U’s obtained, Good’s method left side of
equation (9)) indicated a coverage of diversity of 77%

• This means that 100% diversity will correspond to 90
O.T.U. Given that, applying the Good and Toulmin’s
model (figure 2), we estimate a λ=1.5 which means that
we need 98 (300-202) additional clones to obtain the 31
O.T.U’s needed to cover 100% diversity.

Conclusions (Application)

• The estimate indicates 300 clones are needed to
represent 100% of sample diversity 99% of the clones
and 88% of OTU analyzed are unidentified species.

• Based on 202 sequences yielding 69 O.T.U, Good and
Toulmin estimator indicates a coverage of 77% of the
total diversity.

Future Research
• There are many models and procedure try to calculate
coverage, instead of using the Good’s estimator of
coverage it will be interesting try another approach.
Perhaps, using Poisson process or an Multinomial
approach it’s possible to get better estimators. Another
approach could be the use of Bayesian inference in the
assumption of a no known distribution in a Metropolis
Hasting procedure.
• The importance of this type of problem is based on the
experimental designs.
• Good stated once that “I don’t believe it is usually
possible to estimate the number of unseen species …
but only an approximate lower bound to that number.”.
We will keep on the road.

Literature cited
• Godoy Filipa1, Gao, Z. 2, Pei Z.2, Zhou M.2 ,Garcia-Amado,
M.A.3,Pericchi, L.R. 4 ,Torres, D. 4 Michelangeli F.3, Blaser M.J 2 ,
Domínguez-Bello, M.G.1High bacterial diversity in the forestomach of
the Hoatzin is revealed by molecular analysis of 16S rRNA Genes.
1Department of Biology, University of Puerto Rico, Rio Piedras, San Juan,
PR 00931. 2 Departments of Medicine, Pathology and Microbiology, New
York University School of Medicine, New York, NY 10016 3Venezuelan
Institute of Scientific Research, CBB, Caracas, Venezuela. 4 Department of
Mathematics University of Puerto Rico, Rio Piedras, San Juan, PR 00931.
• Chao,A.,Lee,S.,1992. Estimating the Number of Classes via Sample
Coverage. Journal of the American Statistical Association,87: 210-217.
• Domínguez-Bello, M. G.M. Lovera, P. Suarez and F. Michelangeli, 1993,
Microbial inhabitants in the crop of the hoatzin (Opisthocomus
hoazin): the only foregut fermented avian. Physiol. Zool. 66: 374-383.
• Good, I. G. and G. H. Toulmin, 1956. The number of new species and the
increase in population coverage when the sample is increase.
Biometrika 43: 45-63.
• Good,I., 1953. The Population Frequencies of Species and the
Estimation of Population Parameters. Biometrika,40: 237-264.

Model For Estimating Diversity Presentation

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to Model For Estimating Diversity Presentation

Similar to Model For Estimating Diversity Presentation (20)

Recently uploaded

Recently uploaded (20)

Model For Estimating Diversity Presentation