Simon Fraser University
Project report for CMPT 441/711
Bioinformatics Algorithms
Illuminating the Diversity of Pathogenic Bacteria, Borrelia Burgdorferi in Ticks
Authors:
Stanley Gan
Elijah Willie
Arthur Song
Ruochen Jiang
Supervisor:
Dr. Leonid Chindelevitch
Co-Supervisor:
Katharine Walter
December 8, 2016
Abstract
In this group project, we assess the diversity of the pathogenic
bacterium Borrelia burgdorferi by studying the co-infection patterns
in our tick samples. We combine a probabilistic approach with
optimization techniques to attack this problem. First, we used several
bioinformatics toolkits, including Bowtie, IGV and Samtools, for read
mapping and read visualization. Next, based on the read mapping
results, we calculated the proportion of each variant observed at each
locus in each sample using Bayes' rule. Finally, we introduce several
approaches for estimating the proportions of the different co-infecting
strain types in each sample while introducing a minimum number of new
strain types into the existing reference database. The three approaches
we attempted during the project are mixed integer linear programming
(Mixed ILP), network flow (NF), and a genetic algorithm (GA) model. In
this report, we only describe the approaches we tried, as the
formulations need to be reviewed before an implementation can yield
significant results. However, from the calculated proportions, we are
able to infer that all of our samples are indeed co-infected by
multiple different strain types.
1 Introduction
Borrelia burgdorferi is the bacterial agent that causes Lyme disease in
humans and is spread by Ixodes ticks.[1] Lyme disease is a vector-borne
infection that numerous vertebrate species are capable of transmitting;
humans are one of the dead-end hosts.[1] With over 30,000 cases
reported in the United States, it is one of the most common
vector-borne diseases and is spreading geographically in North America.
Because Lyme disease is so widespread, studying the strain diversity of
Borrelia is interesting yet challenging. In our project, we address two
biological questions. First, we would like to know the number of
different strain types present in each tick sample. Second, we would
like to compare these strain types to the existing reference database
and suggest a set of new strain types if necessary. One motivation
behind these questions is co-infection patterns. By understanding these
patterns, we can investigate the complexity of heterogeneous bacterial
populations. One of the forces that drive bacterial diversity is
genetic recombination, which can only happen in samples co-infected by
at least two different bacterial strain types.[1] In our context,
genetic recombination means the exchange of genetic material between
the chromosomes of different bacterial strains.[7] Moreover, different
bacterial genotypes have different transmission rates among hosts
and/or vectors, which can produce distinct transmission cycles and lead
to differences in the disease risk contributed by different host
populations.[8] Furthermore, different bacterial strain types are
likely to harm humans in different ways. For example, [9] describes
arthritis as a symptom that often accompanies B. burgdorferi s.s.
infection, whereas neurological symptoms are associated with B. garinii
and skin disorders with B. afzelii. Therefore, understanding
co-infection patterns, which in turn illuminates bacterial strain
diversity, is significant for developing more reliable prevention and
control protocols.
2 Previous work
Some of the biological aspects of our project were previously studied
by Katharine Walter's group [1]. They pointed out that within-host
pathogen diversity may have important implications for human health and
disease epidemiology because hosts are frequently co-infected with
multiple pathogen species. They chose Borrelia burgdorferi as a model
to study within-host processes and examined its genomic variation in 98
individual field-collected tick vectors. Their experiment shows that
70% of ticks are infected with multiple strains. Their work suggests
that disease vectors such as ticks can be studied as epidemiological
sentinels.

The method used to capture the genome dataset used in our project was
studied by Giovanna Carpi's group [2] in 2015. They used custom probes
for multiplexed hybrid capture to sequence 30 Borrelia burgdorferi
genomes and found that it recovered nearly the complete (99.5%) genome
of Borrelia burgdorferi.

In addition, Wibke Cramaro's group [3] studied Ixodes ricinus, the most
common tick species and the most important vector of human and animal
pathogens in western Europe, in 2013. They sequenced the Ixodes ricinus
genome and made all sequence data available in a public database.
3 Data Description
The data we work with are whole-genome sequences of 30 tick samples,
sequenced using hybrid capture methods and Illumina short-read
technology (paired-end, 75 bp) [2]. We also have a reference database
of 679 strain types based on multi-locus sequence typing (MLST) of 8
housekeeping genes. In this context, MLST is a molecular typing method
[11] in which the following 8 housekeeping genes are sequenced: clpA,
clpX, nifS, pepX, pyrG, recG, rpiB and uvrA. A unique Borrelia strain
type is defined by these 8 housekeeping genes (please refer to [10] for
a definition based on the ospC gene).
4 Problem Illustration and Description
Figure 1: Illustration of the problem
We are given the genotypes observed at all loci for N samples, together
with their respective proportions. For example, in Figure 1 we observe
genotypes A and B with proportions 0.75 and 0.25 respectively at locus
1 of sample 1. Given the genotypes observed at all loci for the N
samples, we produce different strains as combinations of genotypes
across all loci while preserving the proportions. Figure 1 shows two
such examples, J={0.25BXU, 0.25AXV, 0.5AYW} and K={0.25BYU, 0.25AYV,
0.5AXW}. In J, the proportion of B is 0.25 (from BXU), the proportion
of A is 0.25 (from AXV) + 0.5 (from AYW) = 0.75, and so on, so the
proportions are preserved. We restate the problem from a mathematical
perspective:

Given a library of known strain types (679 types), we try to explain
the observed genotypes using as few new strain types as possible. That
is, we want to minimize the number of new strains introduced into our
existing library.

As a simple example, if our library contains {BXU, AXV, ... }, we
choose J over K, since K would require introducing at least 2 new
strain types. There are, of course, other criteria to consider, such as
the proportions.
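
The comparison between J and K can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, each strain is written as one character per locus, and the library is assumed to contain only BXU and AXV (the report's example library is elided with "..."), so K needs three new strains here rather than "at least 2".

```python
from collections import defaultdict

def marginal_proportions(strains):
    """Sum strain proportions per genotype at each locus.

    `strains` maps a strain such as "BXU" (one character per locus)
    to its proportion in the sample.
    """
    n_loci = len(next(iter(strains)))
    marginals = [defaultdict(float) for _ in range(n_loci)]
    for strain, proportion in strains.items():
        for locus, genotype in enumerate(strain):
            marginals[locus][genotype] += proportion
    return [dict(m) for m in marginals]

def new_strains(strains, library):
    """Strains used in the explanation that are absent from the library."""
    return sorted(s for s in strains if s not in library)

J = {"BXU": 0.25, "AXV": 0.25, "AYW": 0.5}
K = {"BYU": 0.25, "AYV": 0.25, "AXW": 0.5}
LIBRARY = {"BXU", "AXV"}  # assumption: the only known strains in this toy library

# Both explanations reproduce the same per-locus proportions ...
print(marginal_proportions(J) == marginal_proportions(K))
# ... but J introduces fewer new strains than K.
print(new_strains(J, LIBRARY), new_strains(K, LIBRARY))
```

Both J and K preserve the observed proportions, so the number of new strains is what separates them.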
5 Methods
Before dealing with any optimization, we have to calculate the proportions
as illustrated in Figure 1.
5.1 Calculating Proportions
The first step was to extract the sequences of each of the 8
housekeeping genes and their individual variants. These sequences were
provided to us, which made advancing to the next step much easier. The
next step was to compute, for each of the thirty samples whose whole
genomes were captured, the proportion of reads that map to each variant
of every housekeeping gene. We first used Bowtie [4] to map all the
reads from each of the thirty samples to each of the eight housekeeping
genes. Interestingly, less than one percent of the reads in each sample
map to a gene. This is to be expected, as the whole genome was
sequenced and we are only interested in eight of its genes. Next,
Integrative Genomics Viewer (IGV) [5] and Samtools "tview" [6] were
used to visualize the reads that mapped to each variant of each gene.
Visualization was necessary because it enabled us to check whether any
reads mapped to a variant only partially. However, over 99% of the
reads that mapped to a variant mapped entirely inside it. For each
gene, to compute the proportion of reads that map to a variant, we are
interested in computing
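$$P(\mathrm{var}_i \mid \mathrm{read}_j) \tag{1}$$
That is, we are interested in the probability of a variant given a
read. This is difficult to compute directly, as we do not have any
prior information about the distribution of variants for a gene.
However, Bayes' rule states that
$$P(\mathrm{var}_i \mid \mathrm{read}_j) = \frac{P(\mathrm{read}_j \mid \mathrm{var}_i)\, P(\mathrm{var}_i)}{P(\mathrm{read}_j)} \tag{2}$$
Thus we do not need to compute (1) directly; we can compute
$$P(\mathrm{read}_j \mid \mathrm{var}_i) \tag{3}$$
and multiply it by a proportionality constant to get
$$P(\mathrm{var}_i \mid \mathrm{read}_j) = P(\mathrm{read}_j \mid \mathrm{var}_i)\, k_j \tag{4}$$
where
$$k_j = \frac{P(\mathrm{var}_i)}{P(\mathrm{read}_j)} \tag{5}$$
Treating $k_j$ as independent of $i$ amounts to assuming the same prior
$P(\mathrm{var}_i)$ for every variant. By summing (4) over all variants
and equating to 1, we can solve for (5), i.e.
$$\sum_i k_j\, P(\mathrm{read}_j \mid \mathrm{var}_i) = 1 \tag{6}$$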
To compute (3) we appeal to the binomial distribution, since we are
given a sequencing error rate of $\frac{1}{100}$, so that each base of a
mapped read is taken to be a mismatch with probability $\frac{1}{100}$.
Thus we can use the number of mismatches that Bowtie reports for a
mapped read, together with the binomial distribution, to compute (3):
$$P(\mathrm{read}_j \mid \mathrm{var}_i) = \binom{m_j}{l_j} \left(\frac{1}{100}\right)^{l_j} \left(\frac{99}{100}\right)^{m_j - l_j} \tag{7}$$
where $m_j$ is the length of $\mathrm{read}_j$, and $l_j$ is the number
of mismatches in the mapping between $\mathrm{read}_j$ and
$\mathrm{variant}_i$. Plugging (7) into (6), we get
$$\sum_i k_j \binom{m_j}{l_j} \left(\frac{1}{100}\right)^{l_j} \left(\frac{99}{100}\right)^{m_j - l_j} = 1 \tag{8}$$
Thus we are now able to compute $k_j$ for every $\mathrm{read}_j$, and
we are fully equipped to compute the proportion of each variant of a
gene. The proportion of $\mathrm{variant}_i$ for a gene G is
$$\frac{\sum_j P(\mathrm{var}_i \mid \mathrm{read}_j)}{\sum_h \sum_j P(\mathrm{var}_h \mid \mathrm{read}_j)} \tag{9}$$
That is, we sum over all reads that map to a particular variant of gene
G, divide by the sum over all variants of that gene, and multiply by
100 to express the proportions as percentages.
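
The calculation in equations (1) to (9) can be sketched in Python as follows. This is not the project's script: the function names and the toy reads are hypothetical, and the input is assumed to be, for each read, its length and the mismatch count against every variant it maps to (in practice these would come from the Bowtie output).

```python
from math import comb

ERROR_RATE = 1 / 100  # per-base mismatch probability used in equation (7)

def read_likelihood(read_len, mismatches, err=ERROR_RATE):
    """Equation (7): binomial probability of the observed mismatch count."""
    return comb(read_len, mismatches) * err**mismatches * (1 - err)**(read_len - mismatches)

def posteriors(read_len, mismatches_by_variant):
    """Equations (4)-(6): normalise the likelihoods so they sum to 1 per read."""
    likelihood = {v: read_likelihood(read_len, l) for v, l in mismatches_by_variant.items()}
    k_j = 1 / sum(likelihood.values())  # solves equation (6) for this read
    return {v: k_j * p for v, p in likelihood.items()}

def variant_proportions(reads):
    """Equation (9): per-variant share of posterior mass, in percent."""
    totals = {}
    for read_len, mismatches_by_variant in reads:
        for variant, p in posteriors(read_len, mismatches_by_variant).items():
            totals[variant] = totals.get(variant, 0.0) + p
    grand_total = sum(totals.values())
    return {v: 100 * t / grand_total for v, t in totals.items()}

# Toy input (made up for illustration): 75 bp reads with mismatch counts
# against two hypothetical variants of one gene.
reads = [
    (75, {"variant_1": 0, "variant_2": 3}),
    (75, {"variant_1": 2, "variant_2": 0}),
    (75, {"variant_1": 0, "variant_2": 1}),
]
print(variant_proportions(reads))
```

Normalising per read is what makes the unknown constant $k_j$ drop out, so only the mismatch counts and read lengths are needed.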
5.2 Optimization
For the optimization part, we tried 3 approaches during the course of
our project.
5.2.1 Mixed Integer Linear Programming
The idea of this program is to formulate the problem rigorously and to
minimize both the number of new strains and the proportions assigned to
them, using 0/1 weight indicators. The program also captures the
possible errors between the true proportion of a variant and its
observed proportion, which may arise from sampling in the lab. These
errors are also included in the objective function.
Known Parameters
• Number of loci: 8
• Number of samples: 30
• Set of different genotypes observed on sample i at locus j,
$G_{ij} = \{g_{ij}^{(1)}, g_{ij}^{(2)}, \ldots\}$.
• $P_{ij} = \{p_{ij}^{(1)}, p_{ij}^{(2)}, \ldots\}$ where $p_{ij}^{(k)}$
corresponds to the proportion of genotype $g_{ij}^{(k)}$ in $G_{ij}$.
(Note: $\sum_{k=1}^{|P_{ij}|} p_{ij}^{(k)} = 1$, $\forall\, 1 \le i \le 30$, $\forall\, 1 \le j \le 8$)
• Reference = Ω where |Ω| = 679
• Set of possible different strains for sample i (different
combinations of the genotypes we observed at all loci),
$V_i = \{V_i^{(1)}, V_i^{(2)}, \ldots, V_i^{(H_i)}\}$, where
$H_i = \prod_{j=1}^{L} |G_{ij}|$
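• A representation of the strain type,
$$V_i^{(k)} = \begin{pmatrix}
a_{i1,1}^{(k)} & a_{i2,1}^{(k)} & \cdots & a_{iL,1}^{(k)} \\
a_{i1,2}^{(k)} & \cdots & \cdots & a_{iL,2}^{(k)} \\
\vdots & \vdots & & \vdots \\
a_{i1,|N_{i1}|}^{(k)} & \vdots & & \vdots \\
\vdots & \vdots & & \vdots \\
a_{i1,R_i}^{(k)} & a_{i2,R_i}^{(k)} & \cdots & a_{iL,R_i}^{(k)}
\end{pmatrix},$$
the k-th combination for the i-th sample, $\forall\, 1 \le i \le 30$.
Column j corresponds to locus j and row r to the r-th genotype observed
at that locus; each entry $a_{ij,r}^{(k)} \in \{0, 1\}$ is 1 iff the
strain carries that genotype, and $R_i = \max_j |G_{ij}|$. For those j
such that $|G_{ij}| < R_i$, $a_{ij,r}^{(k)} = 0$ for
$r = |G_{ij}| + 1, \ldots, R_i$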
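• For the example in the problem description, if
$V_1 = \{V_1^{(1)} = BXU,\ V_1^{(2)} = AXV,\ V_1^{(3)} = AYW\}$, the
matrix representation is as follows:
$$V_1^{(1)} = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix},\quad
V_1^{(2)} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix},\quad
V_1^{(3)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$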
• Weight for each strain type $V_i^{(k)}$: $w_i^{(k)}$, where
$w_i^{(k)} = 1$ iff $V_i^{(k)}$ is a new strain type, otherwise
$w_i^{(k)} = 0$
• Weight for the proportion of each strain type $V_i^{(k)}$:
$c_i^{(k)}$, where $c_i^{(k)} = 1$ iff $V_i^{(k)}$ is a new strain
type, otherwise $c_i^{(k)} = 0$
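
The strain matrices and the candidate set $V_i$ can be built mechanically from the observed genotypes. The sketch below is illustrative only: the helper names are hypothetical, strains are written as one character per locus, and the genotype lists are taken from the Figure 1 example (A, B at locus 1; X, Y at locus 2; U, V, W at locus 3).

```python
from itertools import product

def strain_matrix(strain, genotypes_per_locus):
    """0/1 matrix with one column per locus and one row per genotype index.

    Entry [r][j] is 1 iff the strain carries the r-th observed genotype at
    locus j; columns with fewer genotypes than the maximum are zero-padded.
    """
    n_rows = max(len(g) for g in genotypes_per_locus)  # R_i in the text
    matrix = [[0] * len(genotypes_per_locus) for _ in range(n_rows)]
    for locus, genotype in enumerate(strain):
        row = genotypes_per_locus[locus].index(genotype)
        matrix[row][locus] = 1
    return matrix

def candidate_strains(genotypes_per_locus):
    """All H_i combinations of one observed genotype per locus."""
    return ["".join(combo) for combo in product(*genotypes_per_locus)]

GENOTYPES = [["A", "B"], ["X", "Y"], ["U", "V", "W"]]

for strain in ("BXU", "AXV", "AYW"):
    print(strain, strain_matrix(strain, GENOTYPES))
print(len(candidate_strains(GENOTYPES)))  # H_i = 2 * 2 * 3
```

Because $H_i$ is a product over loci, the candidate set grows quickly with 8 loci, which is one reason the choice of formulation matters.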
Decision Variables
• Indicator variable $a_i^{(k)}$, where $a_i^{(k)} = 1$ iff $V_i^{(k)}$
is chosen to explain the samples, otherwise $a_i^{(k)} = 0$
• Proportion of the strain type $V_i^{(k)}$: $\pi_i^{(k)}$
• $E_{ij} = \{e_{ij}^{(1)}, e_{ij}^{(2)}, \ldots\}$ where $e_{ij}^{(k)}$
corresponds to the error of the observed proportion $p_{ij}^{(k)}$ of
$g_{ij}^{(k)}$ from its true proportion.
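• For convenience, let
$$\Phi_i = \begin{pmatrix}
p_{i1}^{(1)} & p_{i2}^{(1)} & \cdots & p_{iL}^{(1)} \\
p_{i1}^{(2)} & \cdots & \cdots & p_{iL}^{(2)} \\
\vdots & \vdots & & \vdots \\
p_{i1}^{(|N_{i1}|)} & \vdots & & \vdots \\
\vdots & \vdots & & \vdots \\
p_{i1}^{(R_i)} & p_{i2}^{(R_i)} & \cdots & p_{iL}^{(R_i)}
\end{pmatrix}, \quad \forall\, 1 \le i \le 30.$$
For those j such that $|G_{ij}| < R_i$, $p_{ij}^{(k)} = 0$ for
$k = |G_{ij}| + 1, \ldots, R_i$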
Constraints
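• $p_{ij}^{(k)} \in [0, 1]$, $e_{ij}^{(k)} \in [-p_{ij}^{(k)},\ 1 - p_{ij}^{(k)}]$ $\forall\, i, j, k$
and $\sum_{k=1}^{|P_{ij}|} \left(p_{ij}^{(k)} + e_{ij}^{(k)}\right) = 1$,
$\forall\, 1 \le i \le 30$, $\forall\, 1 \le j \le 8$
• $a_i^{(k)} \in \{0, 1\}$ $\forall\, i, k$
• $\pi_i^{(k)} \in [0, 1]$ and $\sum_{k=1}^{H_i} \pi_i^{(k)} = 1$ for all i
• $e_{ij}^{(k)} \le T_i^{(k)} - p_{ij}^{(k)}$ and
$e_{ij}^{(k)} \le p_{ij}^{(k)} - T_i^{(k)}$, where
$T_i^{(k)} = \sum_{(i,k)\,:\, g_{ij}^{(k)} \in V_i^{(k)}} \pi_i^{(k)}$
(the total proportion of the strains that contain genotype $g_{ij}^{(k)}$)
• For all samples i where $1 \le i \le 30$,
$\sum_{k=1}^{H_i} \pi_i^{(k)} \cdot V_i^{(k)} = \Phi_i$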
Objective Function
$$\min \sum_{i,k} \left( w_i^{(k)} \cdot a_i^{(k)} + c_i^{(k)} \cdot \pi_i^{(k)} \right) + \sum_{i,j,k} e_{ij}^{(k)}$$
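
To make the objective concrete, the sketch below evaluates it for a fixed candidate explanation rather than solving the program. Several things here are assumptions: the function name is hypothetical, the library is assumed to contain only BXU and AXV, the observed proportions are those of the Figure 1 example, and the error term is taken to be the absolute deviation between the explained and observed proportion of each genotype.

```python
def objective(solution, library, observed):
    """Objective value of one candidate explanation for a single sample.

    solution: dict mapping each chosen strain (one character per locus)
              to its proportion pi.
    library:  set of known strain types (w = c = 0 for these).
    observed: one dict per locus mapping genotype -> observed proportion.
    """
    # New strains pay 1 for being used (w * a) plus their proportion (c * pi).
    new_strain_cost = sum(1 + pi for strain, pi in solution.items()
                          if strain not in library)
    # Assumed error term: |explained proportion - observed proportion|.
    error_cost = 0.0
    for locus, proportions in enumerate(observed):
        for genotype, p in proportions.items():
            explained = sum(pi for strain, pi in solution.items()
                            if strain[locus] == genotype)
            error_cost += abs(explained - p)
    return new_strain_cost + error_cost

OBSERVED = [{"A": 0.75, "B": 0.25},
            {"X": 0.5, "Y": 0.5},
            {"U": 0.25, "V": 0.25, "W": 0.5}]
LIBRARY = {"BXU", "AXV"}  # assumption: toy library
J = {"BXU": 0.25, "AXV": 0.25, "AYW": 0.5}
K = {"BYU": 0.25, "AYV": 0.25, "AXW": 0.5}

print(objective(J, LIBRARY, OBSERVED), objective(K, LIBRARY, OBSERVED))
```

Under these assumptions J scores lower than K, which matches the preference for J stated in the problem description.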
5.2.2 Network Flow
In the NF model, we tackle the problem with a two-step approach. The
first step is to maximize the proportions of existing strain types,
which is analogous to finding a maximum flow in a network. The second
step is to explain the remaining proportions using a minimal number of
new strains. Our group tried to formulate the first step and
implemented it. However, some technicalities must be addressed before
the implementation is usable. In this report, we introduce the idea
that we tried and the reason it does not work.

We create a network for each sample and use a simplified example to
illustrate it.
Figure 2: Simplified example for a sample
Figure 3: Network for simplified example
Consider the simplified example shown in Figure 2. In this example we
have 1 sample and 3 loci, and we assume that we have 4 reference strain
types: ACE, ADF, ACF and BCE. The proportion of each variant is shown
in the table. We build the network shown in Figure 3. In this network,
we construct several independent layers: the Source layer, Locus layer,
Reference layer, Merge layer and Sink layer. The flow paths suggest
which variants to choose to explain the sample.

Construction:

(1) Build layers: We construct the whole network layer by layer. First
we have the Source layer, which contains only the source node required
for a network flow structure. Next we build the Locus layer, which
contains 3 sub-layers, one for each independent locus. Each sub-layer
contains several nodes that represent the observed variants. Then we
have the Reference layer, which contains all reference strain types
that contain any of the observed variants. For example, if XYZ is also
a reference strain type but we did not observe any variants of type X,
Y, Z at the 3 loci respectively, we do not include it in the network
for this sample. Finally, we have the Merge layer, which contains a
merge node, and the Sink layer, which has the sink node required for a
network flow structure.

(2) Edges: Connect the source node to all nodes in the first Locus
sub-layer. The edges between the sub-layers of the Locus layer depend
on the reference strain types. For example, we have ACE as a reference
strain type, hence we connect A → C → E. Each node in the last Locus
sub-layer is connected to all relevant reference nodes (those whose
last variant is the same as the node). Finally, we connect all
reference nodes to the Merge layer, and connect the Merge layer to the
Sink layer.

(3) Assign capacities: For each edge pointing to a node in the Locus
layer, the capacity is the proportion of that node's variant in the
table. The edges pointing to reference nodes, the Merge layer and the
Sink layer are not limited, so we simply assign them a maximum
capacity.
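
The construction above can be sketched in Python with a small Edmonds-Karp maximum-flow routine. This is an illustration under stated assumptions, not the group's implementation: the variant proportions are invented (the table of Figure 2 is not reproduced in the text) and given as integer percentages to avoid floating-point issues, the function names are hypothetical, and only references whose variants were all observed are kept, which is a simplification of the rule described above.

```python
from collections import defaultdict, deque

INF = 10**9  # stands in for "maximum capacity"

def build_network(loci, references):
    """Residual-capacity graph for one sample, built layer by layer.

    loci: one dict per locus mapping variant -> proportion in percent.
    """
    cap = defaultdict(lambda: defaultdict(int))
    for variant, pct in loci[0].items():
        cap["source"][variant] = pct
    for ref in references:
        if any(v not in locus for v, locus in zip(ref, loci)):
            continue  # simplification: skip references with unobserved variants
        for prev, nxt, locus in zip(ref, ref[1:], loci[1:]):
            cap[prev][nxt] = locus[nxt]  # capacity = proportion of target variant
        cap[ref[-1]]["ref:" + ref] = INF
        cap["ref:" + ref]["merge"] = INF
    cap["merge"]["sink"] = INF
    return cap

def max_flow(cap, s="source", t="sink"):
    """Edmonds-Karp: augment along shortest paths (modifies cap in place)."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck  # residual (reverse) edge
            v = u
        flow += bottleneck

# Assumed proportions in percent (illustrative only).
LOCI = [{"A": 60, "B": 40}, {"C": 70, "D": 30}, {"E": 50, "F": 50}]
REFERENCES = ["ACE", "ADF", "ACF", "BCE"]
print(max_flow(build_network(LOCI, REFERENCES)))
```

The maximum flow value is the total proportion that the model attributes to existing strain types; whether that attribution is valid depends on the flow paths corresponding to real references.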
Figure 4: A flaw in simplified example
Problem: The maximum flow passing through a particular reference type
represents the maximum proportion of that reference type that we are
able to use without violating any constraints. This model is intuitive,
but it does not preserve the sequential relationship of the variants in
the reference types. The problem is illustrated as follows. As we can
see in Figure 4, BCF is not a reference strain type and hence it is not
in the Reference layer. Nevertheless, it is possible for a maximum flow
path to go through node B, node C and node F in the Locus layer. One
possible solution to preserve the sequential relationship of the
variants in the reference types is the following:
1. Require each node in the Locus Layer has in-degree=1
2. If in-degree=k > 1, create k copies of the node
3. Given a sequential relationship of a known strain type, for example
uvw, u → v → w, if in-degree(w)=1, connect v to an unmatched copy
of w
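
The flaw described above can be reproduced by enumerating the paths that the Locus layer allows. The sketch below uses a hypothetical helper name and the four reference strain types of the simplified example; it merges the edges contributed by each reference and lists every source-to-last-locus path, then reports those that are not references.

```python
def locus_layer_paths(references):
    """All variant paths permitted by the merged Locus-layer edges."""
    edges = {}
    for ref in references:
        for u, v in zip(ref, ref[1:]):
            edges.setdefault(u, set()).add(v)
    paths = sorted({ref[0] for ref in references})
    for _ in range(len(references[0]) - 1):
        paths = [p + v for p in paths for v in sorted(edges.get(p[-1], ()))]
    return paths

REFERENCES = ["ACE", "ADF", "ACF", "BCE"]
paths = locus_layer_paths(REFERENCES)
spurious = [p for p in paths if p not in REFERENCES]
print(paths)
print(spurious)
```

B → C comes from BCE and C → F comes from ACF, so the merged graph admits the path BCF even though no reference has that sequence. Splitting nodes into copies, as in the steps above, is meant to remove such paths.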
5.2.3 Genetic Algorithm model
Since it may be difficult to find a correct objective function, we also
attempted to design a genetic algorithm (GA) model to solve the problem
posed in the network-flow section (we want to use the references in the
library as much as possible to explain the sample). GA is an algorithm
that uses stochastic mutation to simulate the natural evolutionary
process, which helps us explain the variations observed.

A common GA model has several parts: the initial population, the
environment, and the mutation function. In a GA, we try to evolve the
initial population into a final population that fits the environment
best. In our problem, we need one more component to make the
computation faster, which we call the torch function. In this problem,
the population refers to all variables that need to be computed. For
example, if we have three references, BXU, AXV and AYW, we have three
variables prr1 (BXU), prr2 (AXV), prr3 (AYW) to compute.
Initial population: Simply let $\mathrm{prr}_i$ be the minimum
proportion along its path, because it is the maximum proportion that
each reference node could have.

Environment: The environment is a function that judges whether an
individual in the population is suitable to survive. In this case, we
test whether the variables ($\mathrm{prr}_i$) satisfy our constraints.
For example, we require that: (1) for each variant j,
$\sum_{i\,:\,\text{reference node } i \text{ contains variant } j} \mathrm{prr}_i \le \mathrm{Pr}_{\mathrm{variant}_j}$;
(2) $\mathrm{prr}_i \le$ the minimum proportion of the variants that
this reference contains. If an individual satisfies these requirements,
we can infer that it possibly yields a good result. Hence, we make a
slight perturbation to this result and observe whether we get a better
one, which serves as a reason for the survival of this individual.
Mutation: The mutation function tells us how much a new result may
differ from the previous one. For example, suppose we set the mutation
rate to be at most $\frac{3}{26}$ and apply the mutation to a character
of the alphabet, say 'D'. Then in the next iteration 'D' can mutate to
any character from 'A' to 'G'. In a simple version, we can use a random
mutation and control its mutation rate. The mutation rate is similar to
the learning rate in stochastic gradient descent (SGD).
Torch: For the torch part, we record previous results and only keep
individuals that are better than them. This serves as guidance that
helps the population evolve towards a particular target.

Finally, we combine all these components and let the initial population
live in our environment. After running a sufficient number of
iterations, we can get a good enough result. However, we may not get
the optimal result, because we do not know the actual objective
function F(x) and there might be numerous x that satisfy F(x) = 0.
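
The four components can be put together in a short Python sketch. It is deliberately simplified and rests on several assumptions: it evolves a single individual (a (1+1)-style evolutionary search rather than a full population-based GA), the variant proportions are invented for illustration, the fitness (total proportion explained by references minus a penalty for constraint violations) is an assumed stand-in for the unknown objective, and all names are hypothetical.

```python
import random

# Assumed variant proportions for one sample (illustrative only).
VARIANT_PROPORTION = {"A": 0.6, "B": 0.4, "C": 0.7, "D": 0.3, "E": 0.5, "F": 0.5}
REFERENCES = ["ACE", "ADF", "ACF", "BCE"]

def initial_individual():
    """Initial population: minimum proportion along each reference's path."""
    return [min(VARIANT_PROPORTION[v] for v in ref) for ref in REFERENCES]

def violation(individual):
    """Environment: total excess over constraint (1), summed over variants."""
    excess = 0.0
    for variant, limit in VARIANT_PROPORTION.items():
        used = sum(p for ref, p in zip(REFERENCES, individual) if variant in ref)
        excess += max(0.0, used - limit)
    return excess

def fitness(individual, penalty=10.0):
    """Assumed fitness: proportion explained minus penalised violations."""
    return sum(individual) - penalty * violation(individual)

def mutate(individual, rate, rng):
    """Mutation: bounded random perturbation, clamped to satisfy constraint (2)."""
    limits = initial_individual()
    return [min(limit, max(0.0, p + rng.uniform(-rate, rate)))
            for p, limit in zip(individual, limits)]

def evolve(iterations=5000, rate=0.05, seed=0):
    rng = random.Random(seed)
    best = initial_individual()
    for _ in range(iterations):
        candidate = mutate(best, rate, rng)
        if fitness(candidate) > fitness(best):  # torch: keep only improvements
            best = candidate
    return best

best = evolve()
print([round(p, 3) for p in best], round(violation(best), 3))
```

Because only improvements are kept, the result is never worse than the starting point under the assumed fitness, but as noted above there is no guarantee that it is optimal.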
6 Results
As the optimization problem is still being formulated, we can only
present results for the proportions.
Figure 5: A part of the results from computing proportions for genes clpX,
clpA and nifS. Note the proportions are presented as percentages
Figure 5 shows partial results from computing the proportions for three
genes (clpX, clpA, and nifS). From these tables, we can see that the
percentages are all relatively small, which indicates that no sample
has a variant with a dominant proportion. We can thus infer that each
sample is infected with a substantial number of different strains.
7 Discussion and Future Work
As most of our group members were unfamiliar with bioinformatics and
the biological aspects of this problem, we learned a substantial amount
during this project about its biological importance and the terminology
used in bioinformatics, especially while using the toolkits. Moreover,
we gained experience applying a wide range of knowledge in probability
and mathematics, from simple yet powerful theorems such as Bayes' rule
to sophisticated algorithmic techniques such as network flows and Mixed
ILP.
We faced a few challenges during the project. One was understanding a
wide range of biological terminology, which was a steep learning curve
at the beginning. Furthermore, based on our research, few resources
were available for the optimization part of this project. We therefore
had to formulate new ideas for tackling the optimization problem, and
we encountered numerous failures while trying to produce a good
formulation of the problem. This part of the project took a substantial
amount of time. Nevertheless, we are glad to have faced these
obstacles, as they gave us the chance to study the problem in great
depth and helped us formulate better representations of it.
Once there is a clear and robust formulation of the optimization
problem, we will have better insight into co-infection patterns among
our tick samples, which could subsequently contribute to more effective
ecological control of the spread of the disease. We could also tweak
the Mixed ILP model by trying more complex weight functions rather than
0/1 weights, which might capture the biological meaning better. This
approach can also be applied to pathogenic bacteria other than Borrelia
burgdorferi. Hence, the approaches explained in this report might serve
as a general platform for computational biologists or life scientists
in their areas of interest.
8 Supplementary Materials
In our first step, we used a Python script to map reads and compute the
proportions of reads that map to each variant. The script is available
at http://tiny.cc/wexkhy. Computing the proportions produced a large
set of results; Figure 5 in the Results section shows a small part of
them, and the complete results are available at http://tiny.cc/ezjkhy.

We also used IGV and Samtools tview to visualize mapped reads. Figure 6
and Figure 7 show parts of our results.
Figure 6: A part of the results visualized using IGV. Arrows show the
read orientation, and colours show the read pairing.
Figure 7: A part of the results visualized using Samtools. The top
strand shows the reference sequence, '.' and ',' show the paired
sequences, and mismatches are shown as single characters.
References
[1] Walter KS, Carpi G, Evans BR, Caccone A, Diuk-Wasser MA (2016) Vectors as Epidemiological Sentinels: Patterns of Within-Tick Borrelia burgdorferi Diversity. PLoS Pathog 12(7): e1005759. doi:10.1371/journal.ppat.1005759
[2] Carpi G, Walter KSK, Bent SJS, Hoen AGA, Diuk-Wasser M, et al. (2015) Whole genome capture of vector-borne pathogens from mixed DNA samples: a case study of Borrelia burgdorferi.
[3] Wibke Jochum, Anna L. Reye and Claude P. Muller (2013) Whole genome sequencing of Ixodes ricinus, the European Lyme disease vector.
[4] Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.
[5] James T. Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S. Lander, Gad Getz, Jill P. Mesirov. (2011) Integrative Genomics Viewer. Nature Biotechnology 29, 24–26
[6] Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9.
[7] Clancy, S., (2008). Genetic recombination. Nature education, 1(1), p.40.
[8] Jacquot, M., Abrial, D., Gasqui, P., Bord, S., Marsot, M., Masseglia, S., Pion, A., Poux, V., Zilliox, L., Chapuis, J.L. and Vourc'h, G., (2016). Multiple independent transmission cycles of a tick-borne pathogen within a local host community. Scientific Reports, 6.
[9] Tilly, K., Rosa, P.A. and Stewart, P.E., (2008). Biology of infection with Borrelia burgdorferi. Infectious disease clinics of North America, 22(2), pp.217-234.
[10] Barbour, A.G. and Travinsky, B., (2010). Evolution and distribution of the ospC gene, a transferable serotype determinant of Borrelia burgdorferi. MBio, 1(4), pp.e00153-10.
[11] Maiden, M.C., Bygraves, J.A., Feil, E., Morelli, G., Russell, J.E., Urwin, R., Zhang, Q., Zhou, J., Zurth, K., Caugant, D.A. and Feavers, I.M., (1998). Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences, 95(6), pp.3140-3145.
17

More Related Content

What's hot

In silico approach for viral mutations and sustainability of immunizations
In silico approach for viral mutations and sustainability of immunizationsIn silico approach for viral mutations and sustainability of immunizations
In silico approach for viral mutations and sustainability of immunizations
IJERA Editor
 
Variation analysis of Swine influenza virus (SIV) H1N1 sequences in experimen...
Variation analysis of Swine influenza virus (SIV) H1N1 sequences in experimen...Variation analysis of Swine influenza virus (SIV) H1N1 sequences in experimen...
Variation analysis of Swine influenza virus (SIV) H1N1 sequences in experimen...
Álvaro L. Valiñas
 
Variation analysis of Swine influenza virus (SIV) H1N1 sequences in experimen...
Variation analysis of Swine influenza virus (SIV) H1N1 sequences in experimen...Variation analysis of Swine influenza virus (SIV) H1N1 sequences in experimen...
Variation analysis of Swine influenza virus (SIV) H1N1 sequences in experimen...
Álvaro L. Valiñas
 
SEROLOGICAL ELISAs BASED ON MONOCLONAL ANTIBODIES AS DIAGNOSTIC TOOLS FOR LUM...
SEROLOGICAL ELISAs BASED ON MONOCLONAL ANTIBODIES AS DIAGNOSTIC TOOLS FOR LUM...SEROLOGICAL ELISAs BASED ON MONOCLONAL ANTIBODIES AS DIAGNOSTIC TOOLS FOR LUM...
SEROLOGICAL ELISAs BASED ON MONOCLONAL ANTIBODIES AS DIAGNOSTIC TOOLS FOR LUM...
EuFMD
 
Biomedical Informatics 706: Precision Medicine with exposures
Biomedical Informatics 706: Precision Medicine with exposuresBiomedical Informatics 706: Precision Medicine with exposures
Biomedical Informatics 706: Precision Medicine with exposures
Chirag Patel
 
Infectious bursal disease in Ethiopian village chickens
Infectious bursal disease in Ethiopian village chickensInfectious bursal disease in Ethiopian village chickens
Infectious bursal disease in Ethiopian village chickens
ILRI
 
Graduate Symposium Presentation
Graduate Symposium PresentationGraduate Symposium Presentation
Graduate Symposium Presentation
schonborn
 
Martin's December 2015 CV
Martin's December 2015 CVMartin's December 2015 CV
Martin's December 2015 CV
Martin Grunnill
 
ISPPD_2016_POSTER_413 copy
ISPPD_2016_POSTER_413 copyISPPD_2016_POSTER_413 copy
ISPPD_2016_POSTER_413 copy
Arox Kamng'ona, BSc (hons), MSc, PhD
 
NCI systems epidemiology 03012019
NCI systems epidemiology 03012019NCI systems epidemiology 03012019
NCI systems epidemiology 03012019
Chirag Patel
 
Pcbi.1000660
Pcbi.1000660Pcbi.1000660
Pcbi.1000660
javier
 
art%3A10.1186%2Fs12936-016-1304-8
art%3A10.1186%2Fs12936-016-1304-8art%3A10.1186%2Fs12936-016-1304-8
art%3A10.1186%2Fs12936-016-1304-8
wagatua njoroge
 
Virus vector ppt
Virus vector pptVirus vector ppt
Virus vector ppt
Rameshgutha3852
 
EWAS and the exposome: Mt Sinai in Brescia 052119
EWAS and the exposome: Mt Sinai in Brescia 052119EWAS and the exposome: Mt Sinai in Brescia 052119
EWAS and the exposome: Mt Sinai in Brescia 052119
Chirag Patel
 
malaria_paper
malaria_papermalaria_paper
malaria_paper
John Powers
 
Identification of SNP markers for resistance to Salmonella and IBDV in indige...
Identification of SNP markers for resistance to Salmonella and IBDV in indige...Identification of SNP markers for resistance to Salmonella and IBDV in indige...
Identification of SNP markers for resistance to Salmonella and IBDV in indige...
ILRI
 
Birth date and the flu
Birth date and the fluBirth date and the flu
Birth date and the flu
English Online Inc.
 
Luce-PosterPresentation-26April2016
Luce-PosterPresentation-26April2016Luce-PosterPresentation-26April2016
Luce-PosterPresentation-26April2016
Victoria Mann
 
Mario H. Skiadopoulos Presentation on "Evaluation of the Antibody Threshold o...
Mario H. Skiadopoulos Presentation on "Evaluation of the Antibody Threshold o...Mario H. Skiadopoulos Presentation on "Evaluation of the Antibody Threshold o...
Mario H. Skiadopoulos Presentation on "Evaluation of the Antibody Threshold o...
Matthew Kirkby
 
Ian Has-Enfermedades transmitidas por vectores
Ian Has-Enfermedades transmitidas por vectoresIan Has-Enfermedades transmitidas por vectores
Computational_biology_project_report

  • 1. Simon Fraser University Project report for CMPT 441/711 Bioinformatics Algorithms Illuminating the Diversity of Pathogenic Bacteria, Borrelia Burgdorferi in Ticks Authors: Stanley Gan Elijah Willie Arthur Song Ruochen Jiang Supervisor: Dr. Leonid Chindelevitch Co-Supervisor: Katharine Walter December 8, 2016
  • 2. Abstract In our group project, we try to assess the diversity of pathogenic bacteria, Borrelia Burgdorferi, by understanding the co-infection pat- terns in our samples. We introduced a combination of probabilistic approach and optimization technique in attacking this problem. First of all, we utilized multiple bioinformatics tool kits such as Bowtie, IGV, Samtools etc, for read mapping and read visualization. Next, based on the result from read mapping, we calculated proportions of each variants observed at each loci for each sample using Bayes’ rule. At last, we will introduce a few approaches to answer questions related to proportions of different co-infecting strain types in each sample and introduce a minimum number of new strain types into the existing reference database. The three approaches we attempted during the course of our project are as follows, mixed integer linear programming(Mixed ILP), network flow(NF), and Genetic Algorithm Model(GA). In this report, we only provide the details of the ap- proaches that we tried as the formulation needs to be reviewed before implementation in order to get any significant results. However, from the calculated proportions, we are able to infer that all of our samples are indeed co-infected by multiple different strain types. 1 Introduction Borrelia Burgdorferi is the bacterial agent that causes Lyme disease in hu- mans and is spread by Ixodes ticks.[1] Lyme disease is a vector-borne infection with numerous vertebrate species capable for transmission and humans are one of the dead end hosts.[1] There are over 30,000 incidences reported in the United States and hence it is one of the most common vector-borne dis- ease spreading geospatially in North America. Due to the wide spreading of Lyme disease, it is interesting yet challenging to study the strain diversity of Borrelia. In our project, we would like to address two different biological questions. 
Firstly, we would like to understand the number of different strain types present in each tick sample. Secondly, we would like to compare these strain types to the existing reference database and suggest a set of new strain types if necessary. One of the motivation behind these biological questions is co-infection patterns. By understanding these patterns, we can investigate the complexity of heterogeneous bacterial population. One of the forces that drive bacterial diversity is genetic recombination, which can only happen in 1
  • 3. samples co-infected by at least two different bacterial strain types.[1] In our context, genetic recombination will mean the exchange of genetic materi- als between multiple chromosomes of different bacterial strains.[7] Moreover, different bacterial genotypes have different transmission rates among host and/or vectors, which can produce distinct transmission cycles and lead to discrepancy in disease risks contributed by different host populations.[8] Fur- thermore, different bacterial strain types are likely to have different possible harms towards human. For example in [9], the paper describes arthritis as a symptom that often accompanies B. burgdorferi s.s. infection. On the other hand, neurological symptoms are associated with B. garinii, and skin disor- ders with B. afzelii. Therefore, understanding co-infection patterns which in turn illuminates bacterial strain diversity has a certain degree of significance in developing more reliable prevention and control protocol. 2 Previous work Some related biological aspects in our project were studied by Katharine Wal- ter group[1] previously. In their research, they pointed out that Within-host pathogen diversity may have important implications for human health and disease epidemiology because hosts are frequently co-infected with multiple pathogen species. They chose Borrelia burgdorferi as the model to study within-host processes. In their experiment, they examined genomic variation of Borrelia burgdorferi in 98 individual field-collected tick vectors. And the experiment shows that 70% of ticks are infected with multiple strains. Their work gives an idea that disease vectors like tick can be studied as epidemio- logical sentinels. The method used to capture the genome dataset we used in our project was studied by Giovanna Carpi group[2] in 2015. 
In their work they used cus- tom probes for multiplexed hybrid capture to sequence 30 Borrelia burgdor- feri genomes and found that it nearly sequenced the complete( 99.5 %) genome of Borrelia burgdorferi. In addtion to this, Wibke Cramaro[3] group did a research on lxodes ricinus, which is the most common tick species and most important vector of human and animal pathogens in western Europe in 2013. They sequenced 2
  • 4. lxodes ricinus’ genome and all sequence data was made available to public database. 3 Data Description Before further discussion, the data that we are working on is the whole genome sequence of 30 tick samples sequenced using hybrid capture meth- ods and Illumina short read technology(paired end 75bp)[2]. Also, we have a reference database of 679 strain types based on multi-locus sequence typ- ing(MLST) of 8 housekeeping genes. In this context, MLST is a molecular typing method[11] in which the following 8 housekeeping genes are sequenced: clpA, clpX, nifS, pepX, pyrG, recG, rpiB and uvrA. Furthermore, a unique Borrelia bacterial strain type is defined by these 8 housekeeping genes(please refer to [10] for definition based on ospC gene). 4 Problem Illustration and Description Figure 1: Illustration of the problem We are given the data about the genotypes observed at all loci for N sam- ples and their respective proportions. For example in Figure 1, we observed genotype of type A and B with proportion 0.75 and 0.25 respectively at locus 1 for sample 1. Given the genotypes observed at all loci for N samples, we 3
  • 5. produce different strains based on combinations of genotypes at all loci while preserving proportions. For example in Figure 1, we have shown 2 different examples which are J={0.25BXU, 0.25AXV, 0.5AYW} and K={0.25BYU, 0.25AYV, 0.5AXW}. In J, proportion of B is 0.25(from BXU), proportion of A is 0.25(from AXV) + 0.5(AYW)=0.75, ... and so on. The proportions are preserved. We restate the problem in a mathematical perspective: Given a library of known strain types (679 types), we are trying to explain what we see(genotypes) using as few new strain types as possible. From a mathematical perspective, we want to minimize the number of new strains introduced to our existing library. For example in a simple case, if our library contains {BXU, AXV, ... }, we will choose J instead of K as we have to introduce at least 2 new strain types if we choose K. Definitely, there are other criteria to consider such as the proportions. 5 Methods Before dealing with any optimization, we have to calculate the proportions as illustrated in Figure 1. 5.1 Calculating Proportions The first step was to extract the sequences for each of the 8 house keeping genes and their individual variants. This information was also provided, which made advancing to the next step much easier. After obtaining the sequences for the housekeeping genes and their variants, the next step was to compute the proportions of reads from the thirty bacteria in the ticks for which their whole genome was captured that maps to each variant for all the housekeeping genes. This is accomplished by first using Bowtie[4] to map all the reads from each of the thirty samples to each of the eight housekeeping genes. An interesting observation is to note that about less than one percent of each sample will map to a gene. This is to be expected as the whole genome was sequenced, and we are only interested in eight of the total gene populus. 
Next Integrative Genomics Viewer (IGV) [5], and Samtools ”tview” [6] was used for visualizing reads that mapped to each of the variants for each of gene. Visualization was necessary because it enabled 4
  • 6. us to check if there were any reads that had only a portion that mapped to a variant. However, over 99% of reads that mapped to a variant mapped to the inside of a variant. For each gene, to compute the proportions of reads that maps to a variant. We are interested in computing P(vari | readj) (1) That is we are interested in computing the probability of a variant given a read. This is very difficult to do, as we do not have any prior information about the distribution of variants for a gene. However, Bayes’ rule states that P(vari | readj) = P(readj | vari) P(vari) P(readj) (2) Thus we do not need to directly compute equation (1) we can compute P(readj | vari) (3) and multiply it by a proportionality constant to get P(vari | readj) = P(readj | vari) kj (4) where kj = P(vari) P(readj) (5) By summing over all variants and equating to 1, we can solve for (5) i.e. i kj P(readj | vari) = 1 (6) To compute (3) we can appeal to the binomial distribution since it is given that 1 100 of mismatches within a mapping is due to sequencing errors. Thus we can use the number of mismatches which bowtie reports for a mapped read and the Binomial Distribution to compute a probability distribution over (3). Thus we have that P(readj | vari) = mj lj ( 1 100 )lj ( 99 100 )mj−lj (7) where mj is the length of the readj, and lj is number of mismatches for the mapping between readj and varianti. Plugging (7) into (1), we get that j kj m lj ( 1 100 )l ( 99 100 )m−lj = 1 (8) 5
  • 7. Thus we are now able to compute kj for all readj. Now we are fully equipped to compute the proportions for each variant in a gene. The proportion of a varianti for a gene G will be j P(vari | readj) h i P(varh | readj) (9) We just sum up over all reads that maps to a particular variant for a gene G and divide by the sum over all variants for that gene and multiply by 100 to get proportions in percentages. 5.2 Optimization For the optimization part, we have 3 approaches that we tried during the course of our project. 5.2.1 Mixed Integer Linear Programming The idea in this program is to formulate the problem rigorously and minimize the number of new strains, the proportions of the new strains using 0/1 weights indicator. Besides, this program also captures the possible errors between the true proportion of a variant and its observed proportion, in which these errors may happen due to sampling in the lab. These errors will also be included into the objective function. Known Parameters • Number of loci: 8 • Number of samples: 30 • Set of different genotypes observed on sample i at locus j, Gij = {g (1) ij , g (2) ij , ...}. • Pij = {p (1) ij , p (2) ij , ...} where p (k) ij corresponds to the proportion of geno- type g (k) ij in Gij. (Note: |Pij| k=1 p (k) ij = 1, ∀1≤i≤30, ∀1≤j≤8) • Reference = Ω where |Ω|=679 6
  • 8. • Set of possible different strains for sample i (different combinations of the genotypes we observed at all loci), Vi = {V (1) i , V (2) i , ...,V (Hi) i }, where Hi = L j=1 |Gij| • A representation of the strain type, V (k) i =            a (k) i1,1 a (k) i2,1 . . . a (k) iL,1 a (k) i1,2 . . . . . . a (k) iL,2 ... ... ... a (k) i1,|Ni1| ... ... ... ... a (k) i1,Ri a (k) i2,Ri . . . a (k) iL,Ri            , i-th sample k-th combination, ∀1 ≤ i ≤ 30. a (k) ij = {0,1}, Ri = maxj |Gij|. For those j such that |Gij| < Ri, a (k) ij = 0 for k = |Gi(j+1)|, .., Ri • For the example in the problem description, if V1 = {V (1) 1 = BXU, V (2) 1 = AXV, V (3) 1 = AYW }, the matrix representation is as follows: V (1) 1 =   0 1 1 1 0 0 0 0 0  , V (2) 1 =   1 1 0 0 0 1 0 0 0  , V (3) 1 =   1 0 0 0 1 0 0 0 1   • Weight for each strain type V (k) i , w (k) i where w (k) i =1 iff V (k) i is a new strain type, otherwise w (k) i =0 • Weight for the proportion of each strain type V (k) i , c (k) i where c (k) i =1 iff V (k) i is a new strain type, otherwise c (k) i =0 Decision Variables • Indicator variable a (k) i where a (k) i =1 iff V (k) i is chosen to explain the samples, otherwise a (k) i =0 • Proportion of the strain type V (k) i , π (k) i • Eij = {e (1) ij , e (2) ij , ...} where e (k) ij corresponds to the error of the observed proportion p (k) ij of g (k) ij from its true proportion. 7
  • 9. • For convenience, let Φi =            p (1) i1 p (1) i2 . . . p (1) iL p (2) i1 . . . . . . p (2) iL ... ... ... p (|Ni1|) i1 ... ... ... ... p (Ri) i1 p (Ri) i2 . . . p (Ri) iL            , ∀1 ≤ i ≤ 30. For those j such that |Gij| < Ri, p (k) ij = 0 for k = |Gi(j+1)|, .., Ri Constraints • p (k) ij ∈ [0, 1], e (k) ij ∈ [−p (k) ij , 1 − p (k) ij ] ∀i, j, k and |Pij| k=1 (p (k) ij + e (k) ij ) = 1, ∀1 ≤ i ≤ 30, ∀1 ≤ j ≤ 8 • a (k) i ∈ {0, 1} ∀i, k • π (k) i ∈ [0, 1] and Hi k=1 π (k) i = 1 for all i • e (k) ij ≤ T (k) i − p (k) ij and e (k) ij ≤ p (k) ij − T (k) i where T (k) i = (i,k):g (k) ij ∈V (k) i π (k) i • For all sample i where 1 ≤ i ≤ 30, Hi k=1 π (k) i · V (k) i = Φi Objective Function min i,k (w (k) i · a (k) i + c (k) i · π (k) i ) + i,j,k e (k) ij 8
  • 10. 5.2.2 Network Flow In the NF model, we try to tackle the problem using a 2 step approach. The first step would be to maximize the proportions of existing strain types, which is analogous with finding a maximum flow in the network. The second step is to explain the remaining proportions using a minimal number of new strains. Our group tried to formulate the first step and implemented it. How- ever, there are technicalities that have to be considered before implementing it. In this report, we will introduce the idea that we tried and the reason it does not work. We create a network for each sample and we will use a simplified example to illustrate it. Figure 2: Simplified example for a sample Figure 3: Network for simplified example Consider a simplified example shown in figure 2. In this example we have 1 sample and 3 loci. Besides, assume that we have 4 reference strain types: 9
  • 11. ACE, ADF, ACF and BCE. The proportion of each variant is shown in the table. We build a network as shown in figure 3. In this network, we construct several independent layers: Source layer, Locus layer, Reference layer, Merge layer and Sink Layer. The order of the flow path will suggest which variant to choose to explain the sample. Construction: (1) Build layer: We construct the whole network layer by layer. First we have Source Layer, which only contains a source node required for a network flow structure. Next we build Locus Layer. Locus Layer contains 3 sub-layer representing each independent locus. In each sub- layer, it contains several nodes which represent the observed variants. Then we have Reference layer, it contains all reference strain types that contain any of the observed variants. For example, if XYZ is also a reference strain type but we did not observe any variants of type X, Y, Z in the 3 loci respectively, we do not include it in the network construction for this sample. Finally, we have the Merge Layer which contains a merge node, and a Sink Layer that has a sink node required for a network flow structure. (2) Edges: Connect the source node to all nodes in the first Locus layer. As for the edges between the sub layers in the Locus layer, they are dependent on the reference strain types. For example, we have ACE as a reference strain type, hence we will connect A → C →E. For the last layer in Locus layer, we just connect it to all relevant reference nodes (the last variant in reference is the same as the node in last Locus layer). Finally we connect all reference node with Merge layer, and connect Merge layer to Sink layer (3) Assign capacity: For each edge pointing to a node in Locus layer, the capacity is same as the proportion in the table. For the edge pointing to reference nodes, Merge layer and Sink layer, there is no limitation, we can just assign a maximum capacity to them. Figure 4: A flaw in simplified example 10
  • 12. Problem: The path of maximum flow flowing through a particular ref- erence type represents the maximum proportion of that particular reference type in which we are able to use, without violating any constraints. This model is intuitive but it does not preserve the sequential relationship of the variants in the reference types. The problem is illustrated as follows. As we can see in figure 4, the node BCF is not a reference strain type and hence it is not in the Reference layer. Therefore, there is a possibility that the max- imum flow path goes through node B, node C and node F in Locus layer. One possible solution to preserve the sequential relationship of the variants in the reference type is the following: 1. Require each node in the Locus Layer has in-degree=1 2. If in-degree=k > 1, create k copies of the node 3. Given a sequential relationship of a known strain type, for example uvw, u → v → w, if in-degree(w)=1, connect v to an unmatched copy of w 5.2.3 Genetic Algorithm model Since it may be difficult to find out a correct objective function, we also attempted to design a genetic algorithm (GA) model to solve the problem (we want to use the reference in library as much as possible to explain the sample) in the network-flow section. GA is an algorithm that involves the use of stochastic mutation, which helps to simulate the evolution process of nature to help us explain the variations observed. A common GA model usually has several parts, the initial population, the environment, and the mutation function. In GA algorithm, we try to evolve the initial population to our final population which fits the enviorn- ment best. In our problem, we need another parameter to help make the computation faster. This parameter is the called torch function. In this problem, the population refers to all variables that needs to computed. For example, if we have three references, BXU, AXV and AYW, we have three variables prr1 (BXU ), prr2 (AXV ), prr3 (AYW ) to compute. 
  • 13. Initial population: Simply let prri be the minimum variant proportion along its path, because this is the maximum proportion that each reference node could have.
Environment: The environment is a function that judges whether an individual in the population is fit to survive. In this case, we test whether the variables prri satisfy our constraints. For example, we require that: (1) for every variant j, $\sum_{i \,:\, j \in \text{reference } i} prr_i \le Pr_{\text{variant}_j}$, where the sum runs over all reference nodes i that contain variant j; (2) prri ≤ the minimum proportion among the variants that reference i contains. If an individual satisfies these requirements, we can infer that it possibly yields the kind of result we expect. Hence, we make a slight perturbation to this result and observe whether we get a better one, which serves as the reason for the survival of this individual.
Mutation: The mutation function tells us how much a new result may differ from the previous one. For example, suppose we set the mutation rate to be at most 3/26 and apply it to the character 'D' in the 26-letter alphabet; then 'D' can mutate to any letter from 'A' to 'G' in the next iteration. In a simple version, we use a random mutation and control its mutation rate. The mutation rate plays a role similar to the learning rate in stochastic gradient descent (SGD).
Torch: For the torch part, we record previous results and keep only individuals that are better than those results. This serves as guidance that helps the population evolve towards a particular target.
Finally, we combine all these components and let the initial population live in our environment. After a sufficient number of iterations, we could get a good enough result. However, we may not get the optimal result, because we do not know the actual objective function F(x) and there might be numerous x that satisfy F(x) = 0.
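The four GA components described above (initial population, environment, mutation, torch) can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the project. It makes several assumptions that are not in the report: the variant proportions are made-up numbers, variant labels are single characters that are unique across loci, constraint (1) is read as a sum over all references sharing a variant, the unknown objective is replaced by "total explained proportion", and only one candidate is kept per iteration (no crossover). All function names are hypothetical.

```python
import random

def upper_bound(ref, variant_prop):
    # Constraint (2): a reference cannot exceed its rarest variant.
    return min(variant_prop[v] for v in ref)

def violation(prr, variant_prop, tol=1e-9):
    # Constraint (1), as interpreted here (assumption): references sharing
    # a variant cannot together exceed that variant's proportion.
    total_excess = 0.0
    for v, cap in variant_prop.items():
        used = sum(p for ref, p in prr.items() if v in ref)
        if used - cap > tol:
            total_excess += used - cap
    return total_excess

def score(prr, variant_prop):
    # Assumed stand-in objective: feasibility first, then the total
    # proportion explained by the reference types.
    return (-violation(prr, variant_prop), sum(prr.values()))

def mutate(prr, variant_prop, rate, rng):
    # Perturb one randomly chosen reference by at most `rate`,
    # clipping so that constraint (2) always holds.
    child = dict(prr)
    ref = rng.choice(sorted(child))
    moved = child[ref] + rng.uniform(-rate, rate)
    child[ref] = min(max(moved, 0.0), upper_bound(ref, variant_prop))
    return child

def evolve(references, variant_prop, rate=0.05, iterations=5000, seed=0):
    rng = random.Random(seed)
    # Initial population: minimum variant proportion along each path.
    best = {ref: upper_bound(ref, variant_prop) for ref in references}
    best_score = score(best, variant_prop)
    for _ in range(iterations):
        child = mutate(best, variant_prop, rate, rng)
        child_score = score(child, variant_prop)
        # Torch: keep the child only if it beats the recorded result.
        if child_score > best_score:
            best, best_score = child, child_score
    return best

# Made-up variant proportions for three loci: {A, B}, {X, Y}, {U, V, W}.
variant_prop = {"A": 0.5, "B": 0.5, "X": 0.6, "Y": 0.4,
                "U": 0.5, "V": 0.2, "W": 0.3}
result = evolve(["BXU", "AXV", "AYW"], variant_prop)
print(result, sum(result.values()))
```

In this toy input the initial population violates the constraint on variant X (0.5 + 0.2 > 0.6), so the first accepted mutations reduce that excess, and later ones raise the total explained proportion while staying feasible. Because the true objective is unknown, a different stand-in objective may lead to a different final population.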
  • 14. 6 Results
As the optimization problem is still being formulated, we can only present results for the proportions.
Figure 5: Part of the results from computing proportions for the genes clpX, clpA and nifS. Note that the proportions are presented as percentages.
Figure 5 shows partial results after computing the proportions for three genes (clpX, clpA, and nifS). From these tables, we can see that the proportion percentages are all relatively small. This indicates that no sample has a variant with a dominant proportion. We can thus infer that each sample is infected with a substantial number of different strains.
7 Discussion and Future Work
As most of our group members are unfamiliar with bioinformatics and the biological aspects of this problem, we learned a great deal during the course of this project about its biological importance and the terminology used in bioinformatics, especially when we were using the
  • 15. tool kits. Moreover, we gained experience applying a wide range of knowledge in probability and mathematics, ranging from simple yet powerful theorems such as Bayes' rule to sophisticated algorithmic techniques such as network flows and Mixed ILP.
We faced a few challenges during the course of our project. One of them was understanding a wide range of biological terminology, which was a steep learning curve at the beginning of the project. Furthermore, as far as our research could find, there was a lack of available resources on the optimization part of this project. Therefore, we had to formulate new ideas for tackling the optimization problem, and we encountered numerous failures while trying to produce a good formulation of the problem. This part of the project took a substantial amount of time. Nevertheless, we are glad to have faced these obstacles, as they gave us the chance to study the problem in great depth and helped us formulate better representations of it.
Once there is a clear and robust formulation of the optimization problem, we will have better insight into co-infection patterns among our tick samples, which could subsequently contribute to more effective ecological control of the spread of the disease. We can also try to tweak the Mixed ILP model by using a more complex weight function rather than 0/1 weights, which might capture the biological meaning better. In fact, this approach can also be applied to pathogenic bacteria other than Borrelia burgdorferi. Hence, the approaches explained in this report might serve as a general platform for computational biologists or life scientists in their areas of interest.
8 Supplementary Materials
In our first step, we used a Python script to map reads and compute the proportions of reads that map to each variant. The Python script is available at http://tiny.cc/wexkhy. Computing the proportions produced a large dataset of results.
Figure 5 in the Results section shows a small part of our computed results, and all of the computed results are available at http://tiny.cc/ezjkhy. We also used IGV and Samtools tview to visualize mapped reads. Figure
  • 16. 6 and Figure 7 show parts of our results.
Figure 6: Part of the results visualized using IGV. Arrows show the read orientation, and colours show the read pairing.
Figure 7: Part of the results visualized using Samtools tview. The top strand shows the reference sequence, '.' and ',' mark read bases that match the reference (on the forward and reverse strand, respectively), and mismatches are shown as single characters.
  • 17. References
[1] Walter KS, Carpi G, Evans BR, Caccone A, Diuk-Wasser MA (2016) Vectors as Epidemiological Sentinels: Patterns of Within-Tick Borrelia burgdorferi Diversity. PLoS Pathog 12(7): e1005759. doi:10.1371/journal.ppat.1005759
[2] Carpi G, Walter KSK, Bent SJS, Hoen AGA, Diuk-Wasser M, et al. (2015) Whole genome capture of vector-borne pathogens from mixed DNA samples: a case study of Borrelia burgdorferi.
[3] Jochum W, Reye AL, Muller CP (2013) Whole genome sequencing of Ixodes ricinus, the European Lyme disease vector.
[4] Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25.
[5] Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP (2011) Integrative Genomics Viewer. Nature Biotechnology 29, 24–26.
[6] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9.
[7] Clancy S (2008) Genetic recombination. Nature Education, 1(1), p.40.
[8] Jacquot M, Abrial D, Gasqui P, Bord S, Marsot M, Masseglia S, Pion A, Poux V, Zilliox L, Chapuis JL, Vourc'h G (2016) Multiple independent transmission cycles of a tick-borne pathogen within a local host community. Scientific Reports, 6.
[9] Tilly K, Rosa PA, Stewart PE (2008) Biology of infection with Borrelia burgdorferi. Infectious Disease Clinics of North America, 22(2), pp.217-234.
  • 18. [10] Barbour AG, Travinsky B (2010) Evolution and distribution of the ospC gene, a transferable serotype determinant of Borrelia burgdorferi. MBio, 1(4), pp.e00153-10.
[11] Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang Q, Zhou J, Zurth K, Caugant DA, Feavers IM (1998) Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proceedings of the National Academy of Sciences, 95(6), pp.3140-3145.