Computational_biology_project_report

Simon Fraser University
Project report for CMPT 441/711
Bioinformatics Algorithms
Illuminating the Diversity of
Pathogenic Bacteria, Borrelia
Burgdorferi in Ticks
Authors:
Stanley Gan
Elijah Willie
Arthur Song
Ruochen Jiang
Supervisor:
Dr. Leonid
Chindelevitch
Co-Supervisor:
Katharine Walter
December 8, 2016

Abstract
In this group project, we assess the diversity of the pathogenic bacterium Borrelia burgdorferi by studying the co-infection patterns in our tick samples. We combine a probabilistic approach with optimization techniques. First, we used several bioinformatics tool kits, including Bowtie, IGV and Samtools, for read mapping and read visualization. Next, based on the read-mapping results, we used Bayes' rule to calculate the proportion of each variant observed at each locus for each sample. Finally, we describe approaches for estimating the proportions of the different co-infecting strain types in each sample while introducing a minimum number of new strain types into the existing reference database. We attempted three approaches during the project: mixed integer linear programming (Mixed ILP), network flow (NF), and a genetic algorithm (GA) model. In this report we only describe these approaches, because the formulations need to be reviewed before an implementation can produce significant results. However, from the calculated proportions we are able to infer that all of our samples are indeed co-infected by multiple strain types.
1 Introduction
Borrelia burgdorferi is the bacterial agent that causes Lyme disease in humans and is spread by Ixodes ticks.[1] Lyme disease is a vector-borne infection; numerous vertebrate species are capable of transmitting it, and humans are one of the dead-end hosts.[1] Over 30,000 cases have been reported in the United States, making it one of the most common vector-borne diseases spreading geographically in North America. Because Lyme disease is so widespread, studying the strain diversity of Borrelia is both interesting and challenging. In our project we address two biological questions. First, how many different strain types are present in each tick sample? Second, how do these strain types compare to the existing reference database, and is a set of new strain types needed to explain them? One motivation behind these questions is co-infection patterns. By understanding these patterns, we can investigate the complexity of a heterogeneous bacterial population. One of the forces that drives bacterial diversity is genetic recombination, which can only happen in samples co-infected by at least two different bacterial strain types.[1] In our context, genetic recombination means the exchange of genetic material between the chromosomes of different bacterial strains.[7] Moreover, different bacterial genotypes have different transmission rates among hosts and/or vectors, which can produce distinct transmission cycles and lead to differences in the disease risk contributed by different host populations.[8] Furthermore, different bacterial strain types are likely to harm humans in different ways. For example, [9] describes arthritis as a symptom that often accompanies B. burgdorferi s.s. infection, whereas neurological symptoms are associated with B. garinii and skin disorders with B. afzelii. Understanding co-infection patterns, which in turn illuminates bacterial strain diversity, is therefore significant for developing more reliable prevention and control protocols.
2 Previous work
Some of the biological aspects of our project were studied previously by Katharine Walter's group.[1] They pointed out that within-host pathogen diversity may have important implications for human health and disease epidemiology, because hosts are frequently co-infected with multiple pathogen species. They chose Borrelia burgdorferi as the model for studying within-host processes and examined its genomic variation in 98 individual field-collected tick vectors. Their experiment shows that 70% of ticks are infected with multiple strains. Their work suggests that disease vectors such as ticks can be studied as epidemiological sentinels.
The method used to capture the genome dataset in our project was studied by Giovanna Carpi's group[2] in 2015. They used custom probes for multiplexed hybrid capture to sequence 30 Borrelia burgdorferi genomes and found that it recovered nearly the complete (99.5%) genome of Borrelia burgdorferi.
In addition, in 2013 Wibke Cramaro's group[3] studied Ixodes ricinus, the most common tick species and the most important vector of human and animal pathogens in western Europe. They sequenced the Ixodes ricinus genome and made all sequence data available in a public database.
3 Data Description
The data we work with are the whole-genome sequences of 30 tick samples, sequenced using hybrid capture methods and Illumina short-read technology (paired-end, 75 bp).[2] We also have a reference database of 679 strain types based on multi-locus sequence typing (MLST) of 8 housekeeping genes. In this context, MLST is a molecular typing method[11] in which the following 8 housekeeping genes are sequenced: clpA, clpX, nifS, pepX, pyrG, recG, rpiB and uvrA. A unique Borrelia strain type is defined by these 8 housekeeping genes (see [10] for a definition based on the ospC gene).
4 Problem Illustration and Description
Figure 1: Illustration of the problem
We are given the genotypes observed at every locus for N samples, together with their respective proportions. For example, in Figure 1 we observe genotypes A and B at locus 1 of sample 1, with proportions 0.75 and 0.25 respectively. From the genotypes observed at all loci, we produce candidate strains by combining the genotypes across loci while preserving the observed proportions. Figure 1 shows two such examples, J = {0.25 BXU, 0.25 AXV, 0.5 AYW} and K = {0.25 BYU, 0.25 AYV, 0.5 AXW}. In J, the proportion of B is 0.25 (from BXU), the proportion of A is 0.25 (from AXV) + 0.5 (from AYW) = 0.75, and so on, so the proportions are preserved. We restate the problem mathematically:
Given a library of known strain types (679 types), we try to explain the observed genotypes using as few new strain types as possible. In other words, we want to minimize the number of new strains introduced into our existing library.
As a simple example, if our library contains {BXU, AXV, ... }, we choose J instead of K, because choosing K would require introducing at least 2 new strain types. There are, of course, other criteria to consider, such as the proportions.
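The J versus K comparison above can be checked mechanically. The following is a minimal sketch, not the project's code: the function names are ours, and the two-entry library is a hypothetical stand-in for the real 679-type database.

```python
from collections import defaultdict
from fractions import Fraction


def locus_proportions(mixture):
    """Return, for each locus, the total proportion carried by each genotype.

    `mixture` maps a strain string (one character per locus) to its proportion.
    """
    n_loci = len(next(iter(mixture)))
    per_locus = [defaultdict(Fraction) for _ in range(n_loci)]
    for strain, weight in mixture.items():
        for locus, genotype in enumerate(strain):
            per_locus[locus][genotype] += weight
    return [dict(d) for d in per_locus]


def new_strains(mixture, library):
    """Strains of the mixture that are absent from the reference library."""
    return {strain for strain in mixture if strain not in library}


quarter, half = Fraction(1, 4), Fraction(1, 2)
J = {"BXU": quarter, "AXV": quarter, "AYW": half}
K = {"BYU": quarter, "AYV": quarter, "AXW": half}
library = {"BXU", "AXV"}  # hypothetical library, for illustration only

# Both mixtures reproduce the same per-locus proportions ...
print(locus_proportions(J) == locus_proportions(K))
# ... but they differ in how many new strain types they require.
print(len(new_strains(J, library)), len(new_strains(K, library)))
```

Exact fractions are used so that the equality check between J and K does not depend on floating-point rounding.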
5 Methods
Before dealing with any optimization, we have to calculate the proportions
as illustrated in Figure 1.
5.1 Calculating Proportions
The first step was to extract the sequences of the 8 housekeeping genes and of their individual variants. These sequences were provided to us, which made advancing to the next step much easier. The next step was to compute, for each of the thirty tick samples whose bacterial whole genome was captured, the proportion of reads that map to each variant of each housekeeping gene. We first used Bowtie[4] to map all the reads from each of the thirty samples to each of the eight housekeeping genes. Less than one percent of the reads in each sample map to any given gene. This is expected, because the whole genome was sequenced and we are only interested in eight of its genes. Next, we used the Integrative Genomics Viewer (IGV)[5] and Samtools "tview"[6] to visualize the reads that mapped to each variant of each gene. Visualization was necessary because it let us check whether any reads mapped to a variant with only a portion of their length. However, over 99% of the reads that mapped to a variant mapped entirely inside it. To compute, for each gene, the proportion of reads that map to each variant, we are interested in computing
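\[ P(\mathrm{var}_i \mid \mathrm{read}_j) \tag{1} \]
That is, we want the probability of a variant given a read. This is very difficult to compute directly, as we do not have any prior information about the distribution of variants for a gene. However, Bayes' rule states that
\[ P(\mathrm{var}_i \mid \mathrm{read}_j) = \frac{P(\mathrm{read}_j \mid \mathrm{var}_i)\, P(\mathrm{var}_i)}{P(\mathrm{read}_j)} \tag{2} \]
Thus we do not need to compute (1) directly; we can compute
\[ P(\mathrm{read}_j \mid \mathrm{var}_i) \tag{3} \]
and multiply it by a proportionality constant to get
\[ P(\mathrm{var}_i \mid \mathrm{read}_j) = P(\mathrm{read}_j \mid \mathrm{var}_i)\, k_j \tag{4} \]
where
\[ k_j = \frac{P(\mathrm{var}_i)}{P(\mathrm{read}_j)} \tag{5} \]
Writing $k_j$ without an index $i$ implicitly assumes that $P(\mathrm{var}_i)$ is the same for every variant of the gene, which is the natural choice when no prior information is available.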
By summing over all variants and equating to 1, we can solve for (5), i.e.
\[ \sum_i k_j\, P(\mathrm{read}_j \mid \mathrm{var}_i) = 1 \tag{6} \]
To compute (3) we can appeal to the binomial distribution, since we are given that a mismatch within a mapping arises from a sequencing error with probability $\frac{1}{100}$ per base. Thus we can use the number of mismatches that Bowtie reports for a mapped read, together with the binomial distribution, to compute (3). Thus we have that
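\[ P(\mathrm{read}_j \mid \mathrm{var}_i) = \binom{m_j}{l_{ij}} \left(\frac{1}{100}\right)^{l_{ij}} \left(\frac{99}{100}\right)^{m_j - l_{ij}} \tag{7} \]
where $m_j$ is the length of $\mathrm{read}_j$ and $l_{ij}$ is the number of mismatches in the mapping between $\mathrm{read}_j$ and $\mathrm{var}_i$. Plugging (7) into (6), we get
\[ \sum_i k_j \binom{m_j}{l_{ij}} \left(\frac{1}{100}\right)^{l_{ij}} \left(\frac{99}{100}\right)^{m_j - l_{ij}} = 1 \tag{8} \]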
Thus we are now able to compute $k_j$ for every $\mathrm{read}_j$, and we are fully equipped to compute the proportions of each variant of a gene. The proportion of $\mathrm{var}_i$ for a gene G is
\[ \frac{\sum_j P(\mathrm{var}_i \mid \mathrm{read}_j)}{\sum_h \sum_j P(\mathrm{var}_h \mid \mathrm{read}_j)} \tag{9} \]
That is, we sum over all reads that map to a particular variant of gene G, divide by the sum over all variants of that gene, and multiply by 100 to express the proportions as percentages.
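The calculation in equations (6) to (9) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the script used in the project: the input layout (a read length plus a dictionary of mismatch counts per variant) and all function names are hypothetical, and a read is assumed to list only the variants it actually mapped to.

```python
from math import comb

ERROR_RATE = 0.01  # the 1/100 per-base error rate used in equation (7)


def read_likelihood(read_length, mismatches, error_rate=ERROR_RATE):
    """Equation (7): binomial probability of the observed mismatch count."""
    return (comb(read_length, mismatches)
            * error_rate ** mismatches
            * (1 - error_rate) ** (read_length - mismatches))


def variant_posteriors(read_length, mismatches_by_variant):
    """Equations (4) and (6): normalise the likelihoods of one read over variants."""
    likelihoods = {variant: read_likelihood(read_length, count)
                   for variant, count in mismatches_by_variant.items()}
    k = 1.0 / sum(likelihoods.values())  # the constant k_j solved from (6)
    return {variant: k * value for variant, value in likelihoods.items()}


def variant_proportions(reads):
    """Equation (9): per-variant share of posterior mass, in percent."""
    totals = {}
    for read_length, mismatches_by_variant in reads:
        posteriors = variant_posteriors(read_length, mismatches_by_variant)
        for variant, p in posteriors.items():
            totals[variant] = totals.get(variant, 0.0) + p
    grand_total = sum(totals.values())
    return {variant: 100.0 * t / grand_total for variant, t in totals.items()}


# Hypothetical reads for one gene: (read length, mismatches against each variant).
reads = [
    (75, {"var1": 0, "var2": 3}),
    (75, {"var1": 2, "var2": 2}),
    (75, {"var2": 0}),
]
print(variant_proportions(reads))
```

Because each read's posteriors sum to 1, the denominator of (9) equals the number of mapped reads, so every read contributes equally to the final percentages.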
5.2 Optimization
For the optimization part, we have 3 approaches that we tried during the
course of our project.
5.2.1 Mixed Integer Linear Programming
The idea of this program is to formulate the problem rigorously and to minimize both the number of new strains and the proportions assigned to new strains, using 0/1 weight indicators. The program also captures the possible errors between the true proportion of a variant and its observed proportion, which may arise from sampling in the lab. These errors are also included in the objective function.
Known Parameters
• Number of loci: L = 8
• Number of samples: 30
• Set of different genotypes observed on sample i at locus j, $G_{ij} = \{g_{ij}^{(1)}, g_{ij}^{(2)}, \ldots\}$.
• $P_{ij} = \{p_{ij}^{(1)}, p_{ij}^{(2)}, \ldots\}$, where $p_{ij}^{(k)}$ corresponds to the proportion of genotype $g_{ij}^{(k)}$ in $G_{ij}$. (Note: $\sum_{k=1}^{|P_{ij}|} p_{ij}^{(k)} = 1$, $\forall\, 1 \le i \le 30$, $\forall\, 1 \le j \le 8$)
• Reference = $\Omega$, where $|\Omega| = 679$
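• Set of possible different strains for sample i (different combinations of the genotypes we observed at all loci), $V_i = \{V_i^{(1)}, V_i^{(2)}, \ldots, V_i^{(H_i)}\}$, where $H_i = \prod_{j=1}^{L} |G_{ij}|$
• A representation of the strain type,
\[
V_i^{(k)} =
\begin{pmatrix}
a^{(k)}_{i1,1} & a^{(k)}_{i2,1} & \cdots & a^{(k)}_{iL,1} \\
a^{(k)}_{i1,2} & a^{(k)}_{i2,2} & \cdots & a^{(k)}_{iL,2} \\
\vdots & \vdots & \ddots & \vdots \\
a^{(k)}_{i1,R_i} & a^{(k)}_{i2,R_i} & \cdots & a^{(k)}_{iL,R_i}
\end{pmatrix},
\]
for the i-th sample and k-th combination, $\forall\, 1 \le i \le 30$, where $a^{(k)}_{ij,r} \in \{0,1\}$ and $R_i = \max_j |G_{ij}|$. The entry $a^{(k)}_{ij,r}$ (row r, column j) is 1 if the k-th combination uses genotype $g_{ij}^{(r)}$ at locus j and 0 otherwise. For those j such that $|G_{ij}| < R_i$, $a^{(k)}_{ij,r} = 0$ for $r = |G_{ij}|+1, \ldots, R_i$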
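• For the example in the problem description, if $V_1 = \{V_1^{(1)} = BXU,\ V_1^{(2)} = AXV,\ V_1^{(3)} = AYW\}$, the matrix representation is as follows:
\[
V_1^{(1)} = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \quad
V_1^{(2)} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}, \quad
V_1^{(3)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\]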
• Weight for each strain type $V_i^{(k)}$: $w_i^{(k)}$, where $w_i^{(k)} = 1$ iff $V_i^{(k)}$ is a new strain type, otherwise $w_i^{(k)} = 0$
• Weight for the proportion of each strain type $V_i^{(k)}$: $c_i^{(k)}$, where $c_i^{(k)} = 1$ iff $V_i^{(k)}$ is a new strain type, otherwise $c_i^{(k)} = 0$
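The matrix representation and the count $H_i$ of candidate strains can be illustrated with the Figure 1 example. This is a small sketch with names of our own choosing; it assumes each genotype is a single character and that the genotypes of a locus are listed in a fixed order.

```python
from itertools import product


def strain_matrix(strain, genotypes_per_locus):
    """Build the 0/1 matrix of a strain: rows are genotype indices, columns are loci."""
    n_rows = max(len(genotypes) for genotypes in genotypes_per_locus)  # R_i
    matrix = [[0] * len(genotypes_per_locus) for _ in range(n_rows)]
    for locus, genotype in enumerate(strain):
        row = genotypes_per_locus[locus].index(genotype)
        matrix[row][locus] = 1  # loci with fewer genotypes keep zero padding
    return matrix


def candidate_strains(genotypes_per_locus):
    """Enumerate all combinations of one genotype per locus (the set V_i)."""
    return ["".join(combo) for combo in product(*genotypes_per_locus)]


# Genotypes observed in the Figure 1 example: A/B, X/Y and U/V/W.
genotypes = [["A", "B"], ["X", "Y"], ["U", "V", "W"]]

print(strain_matrix("BXU", genotypes))
print(len(candidate_strains(genotypes)))  # H_i = 2 * 2 * 3
```

With all 8 loci the number of candidates grows as the product of the per-locus genotype counts, which is why the choice of which candidates to keep matters for the size of the program.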
Decision Variables
• Indicator variable $a_i^{(k)}$, where $a_i^{(k)} = 1$ iff $V_i^{(k)}$ is chosen to explain the samples, otherwise $a_i^{(k)} = 0$
• Proportion of the strain type $V_i^{(k)}$: $\pi_i^{(k)}$
• $E_{ij} = \{e_{ij}^{(1)}, e_{ij}^{(2)}, \ldots\}$, where $e_{ij}^{(k)}$ corresponds to the error of the observed proportion $p_{ij}^{(k)}$ of $g_{ij}^{(k)}$ from its true proportion.
• For convenience, let
\[
\Phi_i =
\begin{pmatrix}
p^{(1)}_{i1} & p^{(1)}_{i2} & \cdots & p^{(1)}_{iL} \\
p^{(2)}_{i1} & p^{(2)}_{i2} & \cdots & p^{(2)}_{iL} \\
\vdots & \vdots & \ddots & \vdots \\
p^{(R_i)}_{i1} & p^{(R_i)}_{i2} & \cdots & p^{(R_i)}_{iL}
\end{pmatrix},
\quad \forall\, 1 \le i \le 30.
\]
For those j such that $|G_{ij}| < R_i$, $p_{ij}^{(k)} = 0$ for $k = |G_{ij}|+1, \ldots, R_i$
Constraints
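• $p_{ij}^{(k)} \in [0, 1]$, $e_{ij}^{(k)} \in [-p_{ij}^{(k)},\, 1 - p_{ij}^{(k)}]$ $\forall i, j, k$, and $\sum_{k=1}^{|P_{ij}|} \left( p_{ij}^{(k)} + e_{ij}^{(k)} \right) = 1$, $\forall\, 1 \le i \le 30$, $\forall\, 1 \le j \le 8$
• $a_i^{(k)} \in \{0, 1\}$ $\forall i, k$
• $\pi_i^{(k)} \in [0, 1]$ and $\sum_{k=1}^{H_i} \pi_i^{(k)} = 1$ for all i
• $e_{ij}^{(k)} \le T_i^{(k)} - p_{ij}^{(k)}$ and $e_{ij}^{(k)} \le p_{ij}^{(k)} - T_i^{(k)}$, where $T_i^{(k)} = \sum_{(i,k):\, g_{ij}^{(k)} \in V_i^{(k)}} \pi_i^{(k)}$
• For every sample i, where $1 \le i \le 30$, $\sum_{k=1}^{H_i} \pi_i^{(k)} \cdot V_i^{(k)} = \Phi_i$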
Objective Function
\[
\min \sum_{i,k} \left( w_i^{(k)} \cdot a_i^{(k)} + c_i^{(k)} \cdot \pi_i^{(k)} \right) + \sum_{i,j,k} e_{ij}^{(k)}
\]
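To make the constraint $\sum_k \pi_i^{(k)} \cdot V_i^{(k)} = \Phi_i$ and the objective concrete, the sketch below evaluates both for the mixtures J and K of Figure 1. It is an evaluator, not a solver, and rests on assumptions of ours: the library {BXU, AXV} is hypothetical, the function names are invented for illustration, and the error terms $e_{ij}^{(k)}$ are taken to be zero because both mixtures reproduce $\Phi$ exactly.

```python
from fractions import Fraction


def strain_matrix(strain, genotypes_per_locus):
    """0/1 matrix of a strain: rows are genotype indices, columns are loci."""
    n_rows = max(len(g) for g in genotypes_per_locus)
    matrix = [[0] * len(genotypes_per_locus) for _ in range(n_rows)]
    for locus, genotype in enumerate(strain):
        matrix[genotypes_per_locus[locus].index(genotype)][locus] = 1
    return matrix


def weighted_sum(mixture, genotypes_per_locus):
    """Left-hand side of the last constraint: sum over k of pi^(k) * V^(k)."""
    n_rows = max(len(g) for g in genotypes_per_locus)
    n_cols = len(genotypes_per_locus)
    total = [[Fraction(0)] * n_cols for _ in range(n_rows)]
    for strain, pi in mixture.items():
        matrix = strain_matrix(strain, genotypes_per_locus)
        for r in range(n_rows):
            for c in range(n_cols):
                total[r][c] += pi * matrix[r][c]
    return total


def objective(mixture, library):
    """Objective value with all error terms assumed to be zero."""
    value = Fraction(0)
    for strain, pi in mixture.items():
        is_new = 0 if strain in library else 1  # plays the role of both w and c
        chosen = 1 if pi > 0 else 0             # plays the role of the indicator a
        value += is_new * chosen + is_new * pi
    return value


genotypes = [["A", "B"], ["X", "Y"], ["U", "V", "W"]]
q, h = Fraction(1, 4), Fraction(1, 2)
PHI = [[3 * q, h, q], [q, h, q], [0, 0, h]]  # observed proportions from Figure 1
J = {"BXU": q, "AXV": q, "AYW": h}
K = {"BYU": q, "AYV": q, "AXW": h}
library = {"BXU", "AXV"}  # hypothetical library

print(weighted_sum(J, genotypes) == PHI, objective(J, library))
print(weighted_sum(K, genotypes) == PHI, objective(K, library))
```

Both mixtures satisfy the constraint, and J attains the smaller objective value, which matches the preference for J stated in the problem description.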
5.2.2 Network Flow
In the NF model, we tackle the problem with a two-step approach. The first step is to maximize the proportions assigned to existing strain types, which is analogous to finding a maximum flow in a network. The second step is to explain the remaining proportions using a minimal number of new strains. Our group formulated the first step and attempted to implement it. However, there are technicalities that must be resolved before the implementation can work. In this report, we introduce the idea that we tried and the reason it does not work.
We create a network for each sample, and we use a simplified example to illustrate it.
Figure 2: Simplified example for a sample
Figure 3: Network for simplified example
Consider the simplified example shown in Figure 2. In this example we have 1 sample and 3 loci. Assume that we have 4 reference strain types: ACE, ADF, ACF and BCE. The proportion of each variant is shown in the table. We build the network shown in Figure 3. It consists of several layers: the Source layer, Locus layer, Reference layer, Merge layer and Sink layer. The path taken by the flow suggests which variants to choose to explain the sample.
Construction:
(1) Build layers: We construct the whole network layer by layer. First we have the Source layer, which contains only the source node required by a network flow structure. Next we build the Locus layer, which contains 3 sub-layers, one per locus. Each sub-layer contains one node per observed variant. Then we have the Reference layer, which contains all reference strain types that contain any of the observed variants. For example, if XYZ is also a reference strain type but we did not observe variants X, Y or Z at the 3 loci respectively, we do not include it in the network for this sample. Finally, we have the Merge layer, which contains a merge node, and the Sink layer, which contains the sink node required by a network flow structure.
(2) Edges: Connect the source node to all nodes in the first Locus sub-layer. The edges between the Locus sub-layers depend on the reference strain types. For example, since ACE is a reference strain type, we connect A → C → E. Each node in the last Locus sub-layer is connected to all relevant reference nodes (those whose last variant is the same as that node). Finally, we connect every reference node to the Merge layer, and the Merge layer to the Sink layer.
(3) Assign capacities: Each edge pointing to a node in the Locus layer has a capacity equal to that node's proportion in the table. The edges pointing to reference nodes, the Merge layer and the Sink layer are not limited, so we simply assign them a maximum capacity.
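The construction steps above can be sketched as follows. This is not the group's implementation: the variant proportions are hypothetical (the table of Figure 2 is not reproduced here) and are given as integer percentages, the function names are ours, and, as a simplification, only references whose variants were all observed are kept. The maximum flow is computed with a plain breadth-first augmenting-path search (Edmonds-Karp).

```python
from collections import defaultdict, deque

UNLIMITED = 10 ** 9  # stands in for the "maximum capacity" of unconstrained edges


def build_network(proportions, references):
    """Build edge capacities; proportions[j] maps each variant of locus j to a percentage."""
    capacity = defaultdict(dict)
    for variant, percent in proportions[0].items():
        capacity["source"][variant] = percent
    for ref in references:
        # Simplifying assumption: skip references that use an unobserved variant.
        if any(v not in proportions[j] for j, v in enumerate(ref)):
            continue
        for j in range(1, len(ref)):
            # An edge into a Locus node carries that node's proportion as capacity.
            capacity[ref[j - 1]][ref[j]] = proportions[j][ref[j]]
        capacity[ref[-1]]["ref:" + ref] = UNLIMITED
        capacity["ref:" + ref]["merge"] = UNLIMITED
    capacity["merge"]["sink"] = UNLIMITED
    return capacity


def max_flow(capacity, source="source", sink="sink"):
    """Edmonds-Karp maximum flow on a nested-dict capacity graph."""
    residual = defaultdict(lambda: defaultdict(int))
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual[u][v] += c
    total = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path is left
        bottleneck, v = UNLIMITED, sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical percentages for the three loci of the simplified example.
proportions = [{"A": 60, "B": 40}, {"C": 70, "D": 30}, {"E": 50, "F": 50}]
references = ["ACE", "ADF", "ACF", "BCE"]
network = build_network(proportions, references)
print(max_flow(network))  # 100 for these capacities
```

The sketch assumes that variant names are unique across loci, as they are in the example. Integer percentages avoid floating-point comparisons inside the augmenting-path loop.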
Figure 4: A flaw in simplified example
Problem: The maximum flow through a particular reference node represents the maximum proportion of that reference type that we can use without violating any constraint. This model is intuitive, but it does not preserve the sequential relationship of the variants in the reference types. As Figure 4 shows, BCF is not a reference strain type and hence has no node in the Reference layer. Nevertheless, a maximum-flow path may go through nodes B, C and F in the Locus layer. One possible way to preserve the sequential relationship of the variants in the reference types is the following:
1. Require that each node in the Locus layer has in-degree=1
2. If in-degree=k > 1, create k copies of the node
3. Given a sequential relationship of a known strain type, for example
uvw, u → v → w, if in-degree(w)=1, connect v to an unmatched copy
of w
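The flaw described above can be reproduced by enumerating the paths that the Locus layer admits for the four reference types of the simplified example. This is a minimal sketch with a function name of our own; it assumes variant names are unique across loci, as in the example.

```python
def locus_paths(references):
    """All variant sequences that can be walked through the Locus layer."""
    edges = set()
    for ref in references:
        for left, right in zip(ref, ref[1:]):
            edges.add((left, right))  # e.g. ACE contributes A -> C and C -> E
    paths = {ref[0] for ref in references}  # nodes of the first sub-layer
    for _ in range(len(references[0]) - 1):
        paths = {path + right
                 for path in paths
                 for left, right in edges
                 if left == path[-1]}
    return paths


references = ["ACE", "ADF", "ACF", "BCE"]
spurious = locus_paths(references) - set(references)
print(sorted(spurious))  # paths the network allows although they are not references
```

The edge B → C comes from BCE and the edge C → F comes from ACF, so the network admits the path B → C → F even though BCF is not in the reference set.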
5.2.3 Genetic Algorithm model
Since it may be difficult to find a correct objective function, we also attempted to design a genetic algorithm (GA) model to solve the problem posed in the network-flow section (we want to use the references in the library as much as possible to explain the sample). A GA uses stochastic mutation to simulate the process of natural evolution, which helps us explain the variations observed.
A common GA model has several parts: the initial population, the environment, and the mutation function. In a GA, we evolve the initial population into a final population that fits the environment best. In our problem, we need one more component to make the computation faster, which we call the torch function. In this problem, the population refers to all variables that need to be computed. For example, if we have three references, BXU, AXV and AYW, we have three variables $prr_1$ (BXU), $prr_2$ (AXV) and $prr_3$ (AYW) to compute.
Initial population: Simply let $prr_i$ be the minimum proportion along its path, because that is the maximum proportion that each reference node could have.
Environment: The environment is a function that judges whether an individual in the population is fit to survive. In this case, we test whether the variables $prr_i$ satisfy our constraints. For example, we require that (1) for every variant j,
\[ \sum_{i\,:\ \text{reference } i \text{ contains variant } j} prr_i \le Pr_{\mathrm{variant}_j} \]
and (2) $prr_i \le$ the minimum proportion among the variants that reference i contains. If an individual satisfies these requirements, we can infer that it possibly yields the good result we expect. Hence, we make a slight perturbation to this result and observe whether we get a better one, which serves as the reason for the survival of this individual.
Mutation: The mutation function determines how much a new result may differ from the previous one. For example, suppose we set the mutation rate to be at most $\frac{3}{26}$ and apply it to the character 'D' of the alphabet; then in the next iteration 'D' can mutate to any character from 'A' to 'G'. In a simple version, we apply a random mutation and control its mutation rate. The mutation rate plays a role similar to the learning rate in stochastic gradient descent (SGD).
Torch: For the torch part, we record previous results and keep only the individuals that are better than them. This serves as guidance that helps the population evolve toward a particular target.
Finally, we combine all these components and let the initial population live in our environment. After running a sufficient number of iterations, we can get a good enough result. However, we may not get the optimal result, because we do not know the actual objective function F(x) and there might be numerous x that satisfy F(x) = 0.
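The four components can be combined into a short sketch. Several choices here are assumptions of ours rather than part of the report: proportions are hypothetical integer percentages, a mutation shifts one reference by at most `step` points, and the torch accepts a mutated individual only if it reduces the total constraint violation or, at equal violation, explains a larger total proportion.

```python
import random


def initial_population(references, proportion):
    """prr_i starts at the minimum proportion along the path of reference i."""
    return {ref: min(proportion[v] for v in ref) for ref in references}


def violation(individual, proportion):
    """Environment: total amount by which constraint (1) is exceeded."""
    excess = 0
    for variant, limit in proportion.items():
        used = sum(p for ref, p in individual.items() if variant in ref)
        excess += max(0, used - limit)
    return excess


def mutate(individual, upper, rng, step=3):
    """Mutation: shift one randomly chosen prr_i by at most `step`, within [0, upper]."""
    child = dict(individual)
    ref = rng.choice(sorted(child))
    child[ref] = min(upper[ref], max(0, child[ref] + rng.randint(-step, step)))
    return child


def evolve(references, proportion, iterations=5000, seed=0):
    rng = random.Random(seed)
    upper = initial_population(references, proportion)  # constraint (2)
    best = dict(upper)

    def rank(individual):
        # Lower violation first, then larger explained proportion.
        return (-violation(individual, proportion), sum(individual.values()))

    for _ in range(iterations):
        child = mutate(best, upper, rng)
        if rank(child) > rank(best):  # torch: keep only improvements
            best = child
    return best


# Hypothetical percentages for the references of the network-flow example.
proportion = {"A": 60, "B": 40, "C": 70, "D": 30, "E": 50, "F": 50}
references = ["ACE", "ADF", "ACF", "BCE"]
start = initial_population(references, proportion)
result = evolve(references, proportion)
print(start, violation(start, proportion))
print(result, violation(result, proportion))
```

Because the torch only ever accepts improvements, the violation of the returned individual can never exceed that of the initial population, but nothing in this sketch guarantees that the result is optimal.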
6 Results
As the optimization problem is still being formulated, we can only present results for the proportions.
Figure 5: A part of the results from computing proportions for genes clpX,
clpA and nifS. Note the proportions are presented as percentages
Figure 5 shows partial results of computing the proportions for three genes (clpX, clpA, and nifS). From these tables, we can see that the proportion percentages are all relatively small, which indicates that no sample has a variant with a dominant proportion. We can thus infer that each sample is infected with a substantial number of different strains.
7 Discussion and Future Work
Most of our group members were unfamiliar with bioinformatics and the biological aspects of this problem, so over the course of this project we learned a great deal about its biological importance and about the terminology used in bioinformatics, especially while using the tool kits. Moreover, we gained experience applying a wide range of knowledge in probability and mathematics, ranging from simple yet powerful theorems such as Bayes' rule to sophisticated algorithmic techniques such as network flows and Mixed ILP.
We faced a few challenges during the project. One of them was understanding a wide range of biological terminology, which was a steep learning curve at the beginning. Furthermore, based on our research, few resources were available on the optimization part of this project. We therefore had to formulate new ideas for tackling the optimization problem, and we encountered numerous failures while trying to produce a good formulation of the problem. This part of the project took a substantial amount of time. Nevertheless, we are glad to have faced these obstacles, as they gave us the chance to study the problem in great depth and helped us formulate better representations of it.
Once there is a clear and robust formulation of the optimization problem, we will have better insight into the co-infection patterns among our tick samples, which can subsequently contribute to more effective ecological control of the spread of the disease. We could also tweak the Mixed ILP model by trying more complex weight functions rather than 0/1 weights, which might capture the biological meaning better. In fact, this approach can also be applied to pathogenic bacteria other than Borrelia burgdorferi. Hence, the approaches explained in this report might serve as a general platform for computational biologists or life scientists in their areas of interest.
8 Supplementary Materials
In our first step, we used a Python script to map reads and compute the proportions of reads that map to each variant. The script is available at http://tiny.cc/wexkhy. Computing the proportions produced a large set of results. Figure 5 in the Results section shows a small part of them, and the complete results are available at http://tiny.cc/ezjkhy.
We also used IGV and Samtools tview to visualize the mapped reads. Figure 6 and Figure 7 show parts of our results.
Figure 6: A part of the results visualized using IGV. Arrows show the read orientation, and colours show the read pairing.
Figure 7: A part of the results visualized using Samtools. The top strand shows the reference sequence, '.' and ',' show the paired sequences, and mismatches are shown as single characters.
15

References
[1] Walter KS, Carpi G, Evans BR, Caccone A, Diuk-Wasser MA (2016) Vectors as Epidemiological Sentinels: Patterns of Within-Tick Borrelia burgdorferi Diversity. PLoS Pathog 12(7): e1005759. doi:10.1371/journal.ppat.1005759
[2] Carpi G, Walter KSK, Bent SJS, Hoen AGA, Diuk-Wasser M, et al. (2015) Whole genome capture of vector-borne pathogens from mixed DNA samples: a case study of Borrelia burgdorferi.
[3] Wibke Jochum, Anna L. Reye and Claude P. Muller (2013) Whole genome sequencing of Ixodes ricinus, the European Lyme disease vector.
[4] Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.
[5] James T. Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S. Lander, Gad Getz, Jill P. Mesirov. (2011) Integrative Genomics Viewer. Nature Biotechnology 29, 24–26
[6] Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9.
[7] Clancy, S., (2008). Genetic recombination. Nature education, 1(1), p.40.
[8] Jacquot, M., Abrial, D., Gasqui, P., Bord, S., Marsot, M., Masseglia, S., Pion, A., Poux, V., Zilliox, L., Chapuis, J.L. and Vourc'h, G., (2016). Multiple independent transmission cycles of a tick-borne pathogen within a local host community. Scientific Reports, 6.
[9] Tilly, K., Rosa, P.A. and Stewart, P.E., (2008). Biology of infection with Borrelia burgdorferi. Infectious disease clinics of North America, 22(2), pp.217-234.
[10] Barbour, A.G. and Travinsky, B., (2010). Evolution and distribution of the ospC gene, a transferable serotype determinant of Borrelia burgdorferi. MBio, 1(4), pp.e00153-10.
[11] Maiden, M.C., Bygraves, J.A., Feil, E., Morelli, G., Russell, J.E., Urwin, R., Zhang, Q., Zhou, J., Zurth, K., Caugant, D.A. and Feavers, I.M., (1998). Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences, 95(6), pp.3140-3145.