A Bayesian approach to identifying high-concern individuals in an infection-bearing population
Anqi Dong
March 9, 2012
1 Background
The delayed diagnosis and treatment of individuals who are carrying infectious diseases can place a large burden on the healthcare system [4, 8] and the general population. Such delays are especially worrisome for infections with prolonged asymptomatic periods, such as HIV, chlamydia, and gonorrhea; in environments where many persons have vulnerable immune systems, such as hospitals and nursing homes; and in environments characterized by high interpersonal contact rates, such as schools and prisons. Public health authorities use methods such as contact tracing to locate contacts of reported infected individuals and determine whether they are infected as well. Because an individual’s contact patterns do not directly reveal that individual’s infection state, contact tracing procedures often need to be fairly exhaustive in order not to miss infected individuals [9, 13].
In certain contexts, a great deal of individual-level data is collected about an infection within a population. One key type of such data describes potentially infection-transmitting person-to-person interactions. These interactions are generally represented using contact networks: graphs with vertices representing persons and edges connecting persons who interact with each other [3]. For large, well-mixed populations, such as those of cities, it is difficult, if not infeasible, to obtain useful sets of data [11]. However, traditional contact tracing provides data on individuals during outbreak investigations. In addition, for smaller, relatively closed settings, such as hospitals, nursing homes, and schools, it is possible to gather detailed epidemiological data by distributing on-body proximity sensors to the entire population to log persons’ contacts [6, 12].
Bayesian statistical approaches are valuable tools for the analysis of epidemiological data. Current literature on using Bayesian techniques in the context of infection spread on a heterogeneously mixing contact network generally focuses on inferring infection parameters such as the average probability of infection [1, 2, 5, 7, 10]. Current techniques assume that data on the structure of contact networks is virtually nonexistent, and therefore either ignore the effects of contact networks on the infection or sample from generated networks of a simple family of graphs (for example, Bernoulli random graphs) as part of the Bayesian approach [2, 5]. As a result, there is very little work on inferring additional individual-level characteristics from epidemiological data, especially when knowledge of the contact network is good.
Analyses of available data on contact networks and individual epidemiological records could better inform healthcare systems about various infection patterns and trends, so that they can better target their limited efforts and resources. Such efficiencies include prioritizing contact tracing and testing to first assess persons who have not reported but who are likely to be infected, and first vaccinating those who are uninfected but who are at high risk of immediate infection.
Coupled with an automated, continuous data-gathering system such as iEpi [6], an inference system could provide quasi-real-time predictions in institutional settings, identifying unknown sources of infection or patients at high risk of becoming infected. The spatial data provided by such a data-gathering system could even allow the inference to account for hidden environmental pathogen reservoirs, such as a contaminated surface at a particular location.
2 Project goals
To implement an inference system that identifies persons of interest or concern in a contact network using an incomplete set of epidemiological data. These persons of interest include likely high spreaders of infection, persons who are probably infected but whose infection statuses are unknown, and persons who probably are not currently infected, but are likely to become infected soon.
To design a general mathematical framework for performing inference on individual-level contact data and infection histories.
3 Methods
3.1 Epidemiological model
We consider a contact network $C$ of $n$ persons. In terms of infection spread, the network is idealized as a closed one, meaning that infection cannot enter $C$ except via a small set of persons, the index infectives (infectious individuals) of the population. This network can be heterogeneous: the persons in $C$ do not necessarily have the same number of contacts or patterns of connection.
Let $C$ be made up of persons $p_1, p_2, \ldots, p_n$. In $C$, each person $p_i$ has a set of contacts, $C(p_i)$. For each person $p_j$ that is an element of $C(p_i)$, we can define $c_{p_j \to p_i}(t)$, the rate of contact from $p_j$ to $p_i$ at time $t$. This rate is not symmetric, so $c_{p_j \to p_i}(t) \neq c_{p_i \to p_j}(t)$ in general. The function $c_{p_j \to p_i}(t)$ reflects the number of potentially infection-transmitting contacts from $p_j$ to $p_i$ per unit time. Depending on the pathogen, a “contact” can be an event such as sneezing, needle-sharing, or sexual contact.
We use the convention that $c_{p_j \to p_i}(t) = 0$ if $p_j$ is not contacting $p_i$ at $t$. This occurs, for example, if $p_j$ and $p_i$ do not contact each other, and also when $p_j$ is not infected. Per contact, an uninfected person has a certain probability $\beta$ of becoming infected as the result of that contact. Infection status is a boolean state: a person is either infected or not, and a person cannot be reinfected when already infected. We refer to the product of $\beta$ and the cumulative number of contacts $p_i$ experiences per unit time at some time as the “infection pressure” felt by $p_i$ at that time.
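The definition of infection pressure can be illustrated with a minimal sketch. The dictionary layout, the function name `infection_pressure`, and the numeric rates below are assumptions made for illustration only; the report does not specify a data structure.

```python
from typing import Callable, Dict, Tuple

# Hypothetical layout (an assumption): rates[(source, target)] is a function
# of time giving the contact rate from source to target, i.e. c_{pj -> pi}(t).
ContactRates = Dict[Tuple[str, str], Callable[[float], float]]


def infection_pressure(person: str, t: float, beta: float,
                       rates: ContactRates) -> float:
    """Return beta times the total contact rate reaching `person` at time t."""
    total_rate = sum(rate(t) for (_, target), rate in rates.items()
                     if target == person)
    return beta * total_rate


# Illustrative values only. By the convention in the text, a rate is zero
# whenever the source person is not infected.
rates: ContactRates = {
    ("A", "B"): lambda t: 2.0,                       # A is infectious throughout
    ("C", "B"): lambda t: 1.0 if t >= 1.0 else 0.0,  # C becomes infected at t = 1
    ("B", "A"): lambda t: 0.0,                       # B is uninfected
}

pressure_early = infection_pressure("B", 0.5, 0.1, rates)
pressure_late = infection_pressure("B", 2.0, 0.1, rates)
```

In this toy network the pressure on B rises once C becomes infected, because a second nonzero contact rate starts contributing to the sum.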
We represent each person, after becoming infected, as presenting with their infection following a second-order delay. In this paper, we make the simplifying assumption that all patients are treated for their infection upon presentation, meaning that they do not continue spreading infection after presentation.
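One common reading of a second-order delay is the sum of two identical exponential stages, which gives an Erlang distribution with shape 2. The sketch below assumes that reading; the parameter `mean_delay` and both function names are hypothetical and do not come from the report.

```python
import math
import random


def sample_presentation_time(infection_time: float, mean_delay: float,
                             rng: random.Random) -> float:
    """Draw a presentation time: infection time plus a two-stage delay."""
    stage_rate = 2.0 / mean_delay  # each stage has mean mean_delay / 2
    delay = rng.expovariate(stage_rate) + rng.expovariate(stage_rate)
    return infection_time + delay


def presentation_density(t_p: float, i_p: float, mean_delay: float) -> float:
    """Erlang-2 density of presenting at t_p given infection at i_p.

    This is an assumed closed form for P(t_p | i_p), not the report's own.
    """
    delay = t_p - i_p
    if delay < 0.0:
        return 0.0  # presentation cannot precede infection
    k = 2.0 / mean_delay
    return k * k * delay * math.exp(-k * delay)


rng = random.Random(0)
samples = [sample_presentation_time(1.0, 3.0, rng) for _ in range(20_000)]
average_delay = sum(samples) / len(samples) - 1.0
```

Under this assumption the density is zero at the moment of infection and peaks at half the mean delay, so very early presentation is unlikely.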
We assume that a patient cannot naturally recover from an infection, that is, recover without healthcare intervention. We also assume herein that each person can become infected at most once.
To keep the inference framework general, this model is not specific to a particular pathogen, but can instead be applied reasonably well to a range of microparasitic infections. Our model parameters are thus not necessarily representative of a specific disease.
3.2 Inference methods
Let $i_p$ and $t_p$ be respectively the infection and presentation times of some arbitrary person $p$. Consider
$$P(t_p, i_p \mid C), \tag{1}$$
the probability density of some presentation and infection times, given a contact network. For brevity, we omit some terms of this equation when discussing it below.
By integrating or summing over some range of values, and comparing the cumulative probability of this subset of values to the probability of the universal set of all permissible values, we can determine how likely the subset of data is to occur. As there are many types of data embedded in (1), the above probability density, with manipulation, provides a rich set of probabilistic information about the sets of infection and presentation times and $C$. Here, we focus on determining probabilities related to the infection time of a certain person $p$.
$P(i_p)$ means “the probability density of person $p$ becoming infected at time $i_p$”. This implies two things: that person $p$ was not infected before time $i_p$, and that person $p$ was infected exactly at time $i_p$. Moreover, $P(i_p)$ is a probability density. To find the probability of person $p$ becoming infected during the interval $[a, b]$, we integrate, finding the value of $\int_a^b P(i_p)\,di_p$. Alternatively, if we know that person $p$ must have been infected somewhere in the interval $[c, d]$, we can use the definition of conditional probability to find that the probability of $i_p$ being in the interval $[a, b]$ is
$$\frac{\int_a^b P(i_p)\,di_p}{P(U)} = \frac{\int_a^b P(i_p)\,di_p}{\int_c^d P(i_p)\,di_p},$$
where $U$ is the universal set.
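The ratio of integrals can be checked numerically with a short sketch. The exponential density used here is a stand-in chosen only because it integrates in closed form; it is not the report's infection-time density, and the helper names are hypothetical.

```python
import math
from typing import Callable


def integrate(f: Callable[[float], float], lo: float, hi: float,
              steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (k + 0.5) * width) for k in range(steps))


def prob_in_interval(density: Callable[[float], float],
                     a: float, b: float, c: float, d: float) -> float:
    """P(a <= i_p <= b) given that i_p is known to lie in [c, d]."""
    if not (c <= a <= b <= d):
        raise ValueError("[a, b] must lie inside the universal interval [c, d]")
    return integrate(density, a, b) / integrate(density, c, d)


def toy_density(x: float) -> float:
    """Stand-in (assumed) density; it need not be normalised."""
    return math.exp(-x)


p_first_half = prob_in_interval(toy_density, 0.0, 1.0, 0.0, 2.0)
exact = (1.0 - math.exp(-1.0)) / (1.0 - math.exp(-2.0))
```

Because the same density appears in the numerator and the denominator, any normalising constant cancels, which is why an unnormalised density suffices.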
If we know the presentation or infection time for a person, we can simply insert that value into (1). However, epidemiological data is often scarcer, and many presentation and infection times are not available. In this case, we can marginalize the probability through integration. For example, for some person $q$, if $i_q$ is known to be in the range $[a, b]$ and $t_q$ is known to be within the range $[c, d]$, the probability that $i_q < k$ for some $k$, where $a < k < b$, can be calculated as
$$\frac{\int_a^k P(i_q)\,di_q}{\int_a^b P(i_q)\,di_q} = \frac{\int_a^k \int_c^d P(i_q, t_q)\,dt_q\,di_q}{\int_a^b \int_c^d P(i_q, t_q)\,dt_q\,di_q}. \tag{2}$$
Equation 2 has some important consequences: the probability that person $p$ is already infected but has not presented is equivalent to the probability that $i_p < T$ and $t_p > T$ (where $T$ is the current time). Also, the probability that $p$ is uninfected but will be infected “soon” is equivalent to the probability that $T < i_p < T + \Delta t$, where $\Delta t$ quantifies the duration of “soon”.
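As a worked illustration of the first consequence, suppose (purely as an assumption for this example) that $i_p$ is known to lie in $[c, d]$ with $c < T < d$, and that $p$ is known not to have presented by time $T$. Applying the same ratio-of-integrals pattern as in Equation 2, the probability that $p$ is already infected could be written as:

```latex
P(i_p < T \mid t_p > T,\; c \le i_p \le d)
  = \frac{\int_c^T \int_T^{\infty} P(i_p, t_p)\,dt_p\,di_p}
         {\int_c^d \int_T^{\infty} P(i_p, t_p)\,dt_p\,di_p}
```

The denominator plays the role of the universal set: it covers every pair of times consistent with what is known about $p$, while the numerator keeps only those pairs in which infection has already occurred.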
In our inference model, we additionally allow for the case where some individuals have been tested in the past for their “infection status” (infected/uninfected) at that time. We incorporate this testing data by enforcing additional bounds on the infection times. For example, if person $p$ tested uninfected at time $x_1$ and tested infected at time $x_2$ (where $x_2 > x_1$), we know that $x_1 < i_p < x_2$. We can perform similar bounding with a presentation time: if $p$ has not yet presented at the present time $T$, we know that $t_p > T$. However, the probability density function itself remains unchanged, as knowledge about infection status does not affect how the infection behaves.
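The bounding step can be sketched as a small helper that turns a person's test history into an interval for the infection time. The record format `(time, infected)` and the function name are assumptions for illustration; the logic relies on the model's stated assumptions that nobody recovers naturally and that each person is infected at most once.

```python
import math
from typing import Iterable, Optional, Tuple


def infection_time_bounds(tests: Iterable[Tuple[float, bool]],
                          presentation_time: Optional[float] = None
                          ) -> Tuple[float, float]:
    """Return (lower, upper) such that lower < i_p < upper.

    tests: hypothetical records (time, infected) for one person.
    presentation_time: t_p if known; infection must precede presentation.
    """
    lower, upper = -math.inf, math.inf
    for time, infected in tests:
        if infected:
            upper = min(upper, time)  # already infected when tested
        else:
            lower = max(lower, time)  # still uninfected when tested
    if presentation_time is not None:
        upper = min(upper, presentation_time)
    if lower >= upper:
        raise ValueError("test history is inconsistent with the model")
    return lower, upper


bounds = infection_time_bounds([(1.0, False), (2.5, True), (0.5, False)])
```

These bounds only restrict the limits of integration; the density being integrated is unchanged, in line with the paragraph above.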
To calculate the numerical value of (1) and related equations, we factor the probability into a product of probabilities, with terms of the general form $P(t_p \mid i_p)\,P\bigl(i_p \mid \{i_q : q \in C(p)\}\bigr)$. Both of these probability terms are expressed in closed form using typical epidemiological representations of infection.
3.2.1 Infection time partial ordering
If, say, both $i_A$ and $i_B$ are unknown, and $A$ and $B$ are connected, we may not be able to determine the direction of infection pressure ($A \to B$ versus $B \to A$). To resolve this ambiguity, we marginalize the probability in Equation 1 as follows:
$$P(t_p, i_p \mid C) = \sum_{d \in D} P(t_p, i_p, d \mid C).$$
The directed acyclic graph (DAG) $d$ imposes a topological ordering on $C$. For each edge, $d$ specifies which person of the pair was infected first, thereby also specifying the directionality of infection pressure. The set $D$ contains all the permissible DAGs that contain all the vertices of $C$ and provide an ordering for all edges of $C$. DAGs that contradict other knowledge about the ordering of infection times are excluded from $D$.
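To make the set of permissible DAGs concrete, the following sketch enumerates every orientation of an undirected contact network by brute force and keeps those that are acyclic and consistent with known orderings. It is an illustration only: the function names and the `known_before` argument are hypothetical, and the report states that its own method rewrites the expressions so that the DAGs never need to be enumerated.

```python
from itertools import product
from typing import Iterator, List, Sequence, Set, Tuple

Edge = Tuple[str, str]


def is_acyclic(vertices: Sequence[str], directed_edges: Sequence[Edge]) -> bool:
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for u, v in directed_edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = [v for v in vertices if indegree[v] == 0]
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return visited == len(vertices)


def permissible_dags(vertices: Sequence[str], edges: Sequence[Edge],
                     known_before: Set[Edge] = frozenset()
                     ) -> Iterator[List[Edge]]:
    """Yield orientations in which (u, v) means u was infected before v.

    known_before: assumed prior knowledge, pairs (a, b) with a infected
    before b. Orientations that contradict it form a cycle and are dropped.
    """
    for flips in product((False, True), repeat=len(edges)):
        oriented = [(v, u) if flip else (u, v)
                    for (u, v), flip in zip(edges, flips)]
        if is_acyclic(vertices, oriented + list(known_before)):
            yield oriented


triangle = [("A", "B"), ("B", "C"), ("A", "C")]
all_dags = list(permissible_dags(["A", "B", "C"], triangle))
constrained = list(permissible_dags(["A", "B", "C"], triangle, {("A", "B")}))
```

The number of orientations doubles with every edge, which suggests why eliminating the DAG from the final integral matters for anything beyond very small networks.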
When computing probabilities considering only infection and presentation times, $d$ is a nuisance parameter, and we mathematically rewrite the expressions to eliminate the use of $d$ and $D$ in the final integral to be evaluated.
3.3 Assessment of inference
Sources in the literature generally assess the accuracy and performance of their inference algorithms by running their inference models on historical datasets and discussing the plausibility of the results. While such demonstrations are valuable in showing the practicality of inference results, it is difficult to validate statistical measures of historical data.
To assess the performance of our inference algorithm, I instead developed and used a simulation model of infection spread. This model represents the infection mechanisms described above on a best-effort basis, simulating infection spread and testing for all individuals contained within a computer-generated contact network.
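A simulation of this kind could be sketched as follows. This is not the report's simulation model: the fixed time step, the constant per-edge contact rate, the two-stage presentation delay, and every name below are assumptions chosen to mirror the mechanisms of Section 3.1 in the simplest way, and testing of individuals is left out.

```python
import math
import random
from typing import Dict, Iterable, Set, Tuple


def two_stage_delay(mean_delay: float, rng: random.Random) -> float:
    """Assumed form of the second-order delay: sum of two exponential stages."""
    rate = 2.0 / mean_delay
    return rng.expovariate(rate) + rng.expovariate(rate)


def simulate_outbreak(contacts: Dict[str, Set[str]], index_cases: Iterable[str],
                      beta: float, contact_rate: float, mean_delay: float,
                      horizon: float, dt: float, rng: random.Random
                      ) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (infection_times, presentation_times) for everyone infected."""
    infection = {p: 0.0 for p in index_cases}
    presentation = {p: two_stage_delay(mean_delay, rng) for p in infection}
    for step in range(int(round(horizon / dt))):
        t = step * dt
        # Infected persons spread only until they present and are treated.
        infectious = {p for p in infection
                      if infection[p] <= t < presentation[p]}
        newly_infected = []
        for person, neighbours in contacts.items():
            if person in infection:
                continue  # each person is infected at most once
            pressure = beta * contact_rate * len(neighbours & infectious)
            if rng.random() < 1.0 - math.exp(-pressure * dt):
                newly_infected.append(person)
        for person in newly_infected:
            infection[person] = (step + 1) * dt
            presentation[person] = (infection[person]
                                    + two_stage_delay(mean_delay, rng))
    return infection, presentation


# Hypothetical four-person chain A - B - C - D with A as the index case.
contacts = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
infection, presentation = simulate_outbreak(
    contacts, ["A"], beta=0.3, contact_rate=2.0, mean_delay=3.0,
    horizon=10.0, dt=0.05, rng=random.Random(1))
```

Because the simulation records every infection time, including those that would be hidden in practice, its output can be compared against the inferred distributions.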
4 Results and discussion
My primary method of assessing the numerical behavior of the inference algorithm was to plot the probability of person $p$ becoming infected before time $x$ (given the contact network, some presentation times, and some patient history as recorded at some time $t$, where $t$ may be less than $x$) as a function of $x$. An example of this can be seen in Figure 1. I plotted this figure by manually splitting the desired integral along $i_p$ into several integrals over mutually exclusive regions.
Generally, the results produced by the inference appear reasonable, considering the additional data produced by the simulation model (data that was not used in the inference). That is, the time at which $p$ was infected in the simulation model is usually at or close to a part of the cumulative probability curve with a high rate of change. However, the results of the simulation model are not necessarily highly probable, and the probability of $i_p < t$ is not a perfect analogue to the probability density that $i_p = t$, so such a comparison is not definitive.
Making use of data on patients’ infection history can lead to the probability densities exhibiting subtle behavior. For example, if person $X$ was tested to be uninfected at time 2.95, inference will usually suggest that there is a low probability that $X$ was infected by time 3.00. However, if it was not known that $X$ was uninfected at time 2.95, the probability that $i_X < 3.00$ may be much higher, for there would then be no restriction that $i_X \geq 2.95$. Here, knowledge that $i_X \geq 2.95$ did not change the probability that $i_X = 3.00$, but it did change the probability that $i_X < 3.00$. This further demonstrates that the probability that $i_p < t$ is not analogous to the probability density that $i_p = t$.
Monte Carlo numerical integration (MCI) techniques were used to evaluate the probabilities required for inference. As MCI is a stochastic technique, it is difficult to properly estimate the technique’s precision without detailed mathematical knowledge of the specific integrand. While error estimators such as the one in Mathematica proved to be inaccurate assessors of the MCI’s precision, testing on smaller, symbolically integrable functions showed that the MCI implementation used was usually within an order of magnitude of the exact integration value. Considering that different intervals of integration routinely differ from each other by ratios of $10^{10}$ or more, order-of-magnitude precision is probably sufficient for most statistical uses.
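The check against symbolically integrable functions can be illustrated with a plain Monte Carlo integrator. The report used Mathematica; this Python sketch is a stand-in, and the test integrand, sample count, and function name are assumptions chosen so that the exact value is known.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def mc_integrate(f: Callable[[Sequence[float]], float],
                 bounds: Sequence[Tuple[float, float]],
                 n_samples: int, rng: random.Random) -> Tuple[float, float]:
    """Uniform-sampling Monte Carlo estimate of an integral over a box.

    Returns (estimate, standard_error). The standard error is itself only
    an estimate and can be unreliable for sharply peaked integrands.
    """
    volume = math.prod(hi - lo for lo, hi in bounds)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        point = [rng.uniform(lo, hi) for lo, hi in bounds]
        value = f(point)
        total += value
        total_sq += value * value
    mean = total / n_samples
    variance = max(total_sq / n_samples - mean * mean, 0.0)
    return volume * mean, volume * math.sqrt(variance / n_samples)


def integrand(x: Sequence[float]) -> float:
    """Smooth three-dimensional test function with a known integral."""
    return math.exp(-(x[0] + x[1] + x[2]))


estimate, std_error = mc_integrate(integrand, [(0.0, 1.0)] * 3,
                                   20_000, random.Random(42))
exact_value = (1.0 - math.exp(-1.0)) ** 3
```

For a smooth integrand like this one the estimate sits close to the exact value. The integrands arising in the inference are much smaller and more sharply varying, which is where uniform sampling tends to struggle.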
Increasing the number of individuals in $C$ leads to higher-dimensional integrals and a smaller integrand (in terms of absolute value). For larger graphs, because of the high variance observed when performing multiple evaluations of the same integral, the simple MCI techniques in Mathematica (the techniques that are currently used) will be inadequate when scaling up the inference algorithm to large contact networks.

[Figure 1: A plot of the cumulative probability of each person in a four-person contact network being already infected, as a function of time. Horizontal axis: model time (0 to 6). Vertical axis: probability that the person has already become infected (0% to 100%). Series: $i_{p_4}$, $i_{p_5}$, $i_{p_6}$, $i_{p_7}$.]
However, using a well-designed Markov chain process for point sampling during the integration would likely lead to better integrand stability and allow for inference to be performed on large contact networks in a reasonable amount of time. Work is under way to implement this feature in the inference algorithm.
Inputting higher-degree contact networks into the inference model may lead to tighter inferred distributions, because larger contact networks generally embed more information and heterogeneity. The increased amount of available data means that better statistics can be inferred. However, some reengineering of the integration mechanism is likely required before large-scale testing of higher-degree contact networks can be performed.
5 Conclusions
The inference algorithms I developed demonstrate that it is possible to infer distributions for the likelihood of becoming infected at a certain time from limited epidemiological data (contact network structure, some presentation times, and some infection testing history), even if this time is in the future. However, when using Bayesian probabilistic techniques, it is important to remember that they are not omniscient or failproof. The inference techniques described herein, while potentially very powerful, can be uninformative or even misleading if used to analyze significantly erroneous data or insufficient sets of data.
Though a discussion of the required procedures is beyond the scope of this report, it is clear that our inference algorithm can be easily adapted mathematically to represent an even wider range of scenarios and to infer more types of data. Possible extensions of this inference model in the near future include representing natural recovery, allowing for dynamic (evolving) contact networks, modeling static sources of infection, and determining the probability of a particular directed edge spreading infection.
6 Acknowledgments
I would like to thank Dr. Michael Horsch and Dr. Nathaniel Osgood of the University of Saskatchewan for their oversight and their suggestions.

References
[1] T. Britton, T. Kypraios, and P. D. O’Neill. Inference for epidemics with three levels of mixing:
Methodology and application to a measles outbreak. Scandinavian Journal of Statistics,
38(3):578–599, 2011.
[2] T. Britton and P. D. O’Neill. Bayesian inference for stochastic epidemics in populations with
random social structure. Scandinavian Journal of Statistics, 29:375–390, 2002.
[3] K. T. D. Eames and M. J. Keeling. Contact tracing and disease control. Proceedings of the
Royal Society of London B, 270:2565–2571, 2003.
[4] J. A. Fleishman, B. R. Yehia, R. D. Moore, K. A. Gebo, and HIV Research Network. The
economic burden of late entry into medical care for patients with HIV infection. Medical Care,
48(12):1071–1079, 2010.
[5] C. Groendyke, D. Welch, and D. R. Hunter. Bayesian inference for contact networks given
epidemic data. Scandinavian Journal of Statistics, 38:600–616, 2011.
[6] M. Hashemian, K. G. Stanley, D. L. Knowles, J. Calver, and N. D. Osgood. Human network
data collection in the wild: The epidemiological utility of micro-contact and location data. In
Proceedings of the ACM SIGHIT International Health Informatics Symposium (IHI 2012),
Miami, FL, January 28–30 2012.
[7] Y. Hosseinkashi. Statistical Inference on Stochastic Graphs. PhD thesis, Department of
Statistics, University of Waterloo, 2011.
[8] H. B. Krentz, M. C. Auld, and M. J. Gill. The high cost of medical care for patients who
present late (CD4 < 200 cells/µl) with HIV infection. HIV Medicine, 5:93–98, 2004.
[9] C. Mulder, C. G. M. Erkens, P. M. Kouw, E. M. Huisman, W. Meijer-Veldman, M. W. Borgdorff,
and F. van Leth. Missed opportunities in tuberculosis control in the Netherlands due to
prioritization of contact investigations. European Journal of Public Health (advance access),
2011.
[10] J. Ray and Y. M. Marzouk. A Bayesian method for inferring transmission chains in a partially
observed epidemic. In Proceedings of the Joint Statistical Meetings, Denver, CO, 2010. Sandia
National Laboratories.
[11] J. Read, K. Eames, and W. Edmunds. Dynamic social networks and the implications for the
spread of infectious disease. Journal of the Royal Society Interface, 5:1001–1007, 2008.
[12] M. Salathé, M. Kazandjieva, J. W. Lee, P. Levis, M. W. Feldman, and J. H. Jones. A
high-resolution human contact network for infectious disease transmission. Proceedings of the
National Academy of Sciences of the USA, 107(51):22020–22025, 2010.
[13] J. Veen. Microepidemics of tuberculosis: the stone-in-the-pond principle. Tubercle and Lung
Disease, 73:73–76, 1992.
