1. Genomic Prediction & Comparative Analysis of Pathogenicity of the New “Super Bug”: Clostridium difficile
Debjit Ray*, Kelly Williams*, Hudson Corey*, Christopher Polage†, Joseph S. Schoeniger*
*Sandia National Laboratories, Livermore, CA; †University of California Davis Medical Center, Sacramento, CA
Introduction
Experimental Design and Methods
Conclusions and Future Directions
We have demonstrated that it is possible to rapidly sequence and produce de novo genome assemblies for reagent costs
of around $200 per genome.
• Assembly errors mainly occur at repeat regions, especially rRNA.
• The resulting genomes appear suitable for comparative phylogenetic analysis.
• Improved bioinformatics tools may be able to significantly improve assemblies.
Preliminary data indicate that it is feasible to sequence, assemble, and obtain nearly complete coverage of genomes
from samples composed of mixed gDNA from disparate genera. This intentional strategy of limited metagenomic assembly
may enable library prep costs to be halved. In the near future we will test whether long-read data (e.g., Oxford Nanopore
MinION) can improve our ability to scaffold over repeats and close genomes.
Results
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
debray@sandia.gov
Horizontal gene transfer (HGT) and recombination lead to the emergence of
bacterial antibiotic resistance and pathogenic traits. Genetic changes range from
acquisition of a large plasmid to insertion of a transposon into a regulatory gene.
In-depth comparative phylogenomics can identify subtle genome or plasmid
structural changes or mutations associated with phenotypic changes. Comparative
phylogenomics requires accurately sequenced, complete, and properly
annotated genomes of the organism. Assembling closed genomes requires
additional mate-pair reads or “long read” sequencing data to accompany
short-read paired-end data. Our goal is to improve the understanding of the emergence of
pathogenesis using sequencing, comparative genomics, and machine learning
analysis of ~1000 pathogen genomes.
Machine learning algorithms will be used to digest the diverse features (changes in virulence genes, recombination, horizontal gene transfer, patient diagnostics). Temporal data and evolutionary models can then help determine whether a particular isolate is likely to have originated from the environment. This approach can also be useful for comparing differences in virulence along or across the tree.
Culturing of Microorganisms and Sequencing Library Prep
Peptoclostridium difficile (Cdiff) hypervirulent strains (027 ribotype) were obtained from collections of clinical isolates at UC Davis Medical Center and grown on plates with permissive media at 37 °C for 72 hours under anaerobic conditions. Total genomic DNA (gDNA) was extracted using the Qiagen Blood & Tissue Total DNA Isolation kit. Libraries were prepared for the Illumina NextSeq sequencer following Illumina protocols for kits using transposon-mediated fragmentation, as shown below. Sequencing was performed using a 300-cycle kit to create 150 bp paired-end reads.
Funding was provided by the Laboratory Directed Research and Development program at Sandia National Laboratories
Paired-Ends (90 min / $19)
Sequencing of 10M Reads (2 days / $100)
Mate-Pairs (2 days / $80)
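As a rough check on the "around $200 per genome" figure, the per-step costs above can simply be summed. The sketch below assumes (the poster does not say so explicitly) that all three dollar amounts are per-genome reagent costs, and that pooling gDNA from several organisms into one library divides only the library prep portion. The function and constant names are illustrative.

```python
# Reagent costs in USD, taken from the workflow figures above.
# Assumption: each figure is a per-genome cost.
PAIRED_END_PREP = 19
MATE_PAIR_PREP = 80
SEQUENCING_10M_READS = 100


def cost_per_genome(genomes_per_library: int = 1) -> float:
    """Reagent cost per genome when several genomes share one library prep.

    Assumption: pooling gDNA from `genomes_per_library` organisms splits the
    library prep cost evenly and leaves the sequencing cost unchanged.
    """
    if genomes_per_library < 1:
        raise ValueError("genomes_per_library must be at least 1")
    library_prep = PAIRED_END_PREP + MATE_PAIR_PREP
    return library_prep / genomes_per_library + SEQUENCING_10M_READS


single = cost_per_genome(1)  # one genome per library
pooled = cost_per_genome(2)  # two genera pooled, library prep halved
print(f"single: ${single:.2f}, pooled pair: ${pooled:.2f}")
```

Under these assumptions a single genome costs $199 in reagents, and pooling two genera per library would bring that to about $150, since only the library prep share is halved.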
Sequencing and Sequence Assembly
Both mate pair and paired end libraries were prepared for seventeen Cdiff isolates (S1 through S17). In total, 17 mate pair libraries and 17 paired end libraries were bar-coded and sequenced together in a single NextSeq run with a kit that produced ~150M reads. Standard Illumina mate pair kits support only up to 12 single-end bar codes per sequencer run, but these cannot be easily demultiplexed using standard software such as bcl2fastq (Illumina). SPAdes 3.6.0 is capable, in a few hours, of converting mixes of reads from different library preps into high-quality assemblies with only a few gaps. Remaining breaks in scaffolds are generally due to repeats (e.g., rRNA genes), and we are using gap closure techniques that avoid custom PCR or targeted sequencing. Improvements could be made toward completing the whole genome by developing our own software tools for mate pair guided bridging (Bridger).
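To make the hybrid assembly step concrete, the sketch below builds a SPAdes command line that combines one paired-end and one mate-pair library for a single isolate. The `--pe1-*`, `--mp1-*`, `-t` and `-o` options are standard SPAdes inputs, but the poster does not state which options were actually used; the file naming scheme, directory names and thread count are hypothetical. The command is only constructed and printed, not run.

```python
import shlex
from pathlib import Path


def spades_command(sample: str, read_dir: Path, out_dir: Path, threads: int = 8) -> list[str]:
    """Build (but do not run) a SPAdes command for one isolate.

    Assumption: reads are stored as <sample>_<PE|MP>_<R1|R2>.fastq.gz;
    this naming scheme is hypothetical.
    """
    def fastq(tag: str) -> str:
        return str(read_dir / f"{sample}_{tag}.fastq.gz")

    return [
        "spades.py",
        "--pe1-1", fastq("PE_R1"),  # paired-end library, forward reads
        "--pe1-2", fastq("PE_R2"),  # paired-end library, reverse reads
        "--mp1-1", fastq("MP_R1"),  # mate-pair library, first reads
        "--mp1-2", fastq("MP_R2"),  # mate-pair library, second reads
        "-t", str(threads),
        "-o", str(out_dir / sample),
    ]


cmd = spades_command("S1", Path("reads"), Path("assemblies"))
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to hand to `subprocess.run` later, and makes it easy to loop over S1 through S17.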
Sample     Paired end reads   Mate pair reads   SPAdes scaffolds   Final contigs   Genome size (bp)   Mean GC%
Cdiff 1    7,696,793          5,178,578         17                 2               3957333            28.54
Cdiff 2    8,049,303          2,566,745         19                 5               4182280            28.71
Cdiff 3    9,598,027          4,713,959         13                 3               4154044            28.65
Cdiff 4    8,884,058          3,555,923         20                 2               4145236            28.61
Cdiff 5    7,305,180          4,604,059         20                 3               4169542            28.69
Cdiff 6    7,265,736          4,959,974         23                 3               4120797            28.51
Cdiff 7    7,160,304          3,344,022         18                 4               4201537            28.75
Cdiff 8    6,988,513          6,429,131         13                 4               4169879            28.33
Cdiff 9    6,431,108          6,493,984         11                 5               4178334            28.14
Cdiff 10   8,757,850          9,326,335         17                 3               4227574            28.66
Cdiff 11   6,820,879          6,598,639         21                 3               4175884            28.88
Cdiff 12   5,660,381          6,605,606         19                 2               4175038            28.21
Cdiff 13   6,656,614          6,314,774         33                 3               4271639            28.28
Cdiff 14   5,847,659          9,675,039         13                 3               4151289            28.50
Cdiff 16   6,495,214          6,436,182         12                 3               4172824            28.11
Cdiff 17   4,973,061          6,786,947         11                 2               4171486            28.25
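The read counts and genome sizes in the table can be turned into an approximate sequencing depth, and the GC% column corresponds to a simple base count over the assembled sequence. The sketch below shows both calculations. It assumes the "Paired end reads" column counts individual 150 bp reads; if it counts read pairs instead, the coverage doubles. The function names and the short test sequence are illustrative only.

```python
def estimated_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


# Cdiff 1 from the table above: 7,696,793 paired-end reads, 3,957,333 bp genome.
cov = estimated_coverage(7_696_793, 150, 3_957_333)
print(f"Cdiff 1 paired-end coverage: about {cov:.0f}x")

# Toy sequence, not real data: 2 of 8 bases are G or C.
print(f"GC content of toy sequence: {gc_percent('ATGCATAT'):.1f}%")
```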
Genome   Size   Function
Cd2      170    hypothetical protein
Cd2      1919   Tetracycline resistance protein TetM
Cd16     1466   Prophage LambdaBa04, site-specific recombinase, phage integrase
Cd16     200    hypothetical protein
Cd17     2147   Excisionase from transposon Tn916
Cd17     395    Transposase from transposon Tn916
Cd17     221    Conjugative transposon protein TcpC
Increase Mate Pair Size to Span rDNA Repeats Reliably
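The reasoning behind a larger mate pair size can be sketched as a simple length check: a mate pair can only bridge a repeat if its insert is long enough for both reads to land in unique sequence on either side. The sketch below is an illustration under stated assumptions, not the method used here. The 5,000 bp repeat length is an approximate size for a bacterial rRNA operon, and the insert sizes and the 500 bp anchor are hypothetical values.

```python
def spans_repeat(insert_size: int, repeat_length: int, min_anchor: int = 500) -> bool:
    """True if a mate pair with this insert size can bridge the repeat.

    Assumption: each end of the fragment needs at least `min_anchor` bp of
    unique flanking sequence for its read to be placed unambiguously.
    """
    return insert_size >= repeat_length + 2 * min_anchor


RRNA_OPERON_BP = 5_000  # approximate, assumed repeat length

for insert in (3_000, 8_000):  # hypothetical mate pair insert sizes
    verdict = "spans" if spans_repeat(insert, RRNA_OPERON_BP) else "does not span"
    print(f"{insert} bp insert {verdict} a {RRNA_OPERON_BP} bp repeat")
```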
Comparative Analysis of Genomes
Until recently, sequencing, assembling, and annotating a bacterial genome was a major effort, generally undertaken to establish phylogeny and a basic inventory of genes and metabolic pathways. A large number of well-annotated reference genomes now exist, however, for most pathogens, and there are good tools for standard annotation. It is now feasible to sequence and assemble large numbers of closely related strains in order to understand changes to the genome that occur over short time scales. We are constructing pipelines for assembly, annotation, and comparative analysis of genomes that primarily focus on the identification of mobile elements and of genes and genome features closely associated with virulence and antibiotic resistance.
Island     Genome       % tRNA Identity   Island Length
Island_1   Cd1-Cd16     100               18,965          Cas, Phage_integrase, SmpB
Island_2   Cd2, Cd17    89                82,810          Phage_integrase
Island_3   Cd7, Cd10    98                21,817          Phage_integrase
Phylogenetic Tree
[Figure: phylogenetic tree of the Cdiff isolates (Cd1 to Cd14, Cd16, Cd17) together with strains CD196, CIP_107932, 2007855, R20291, BI1, QCD_76w55, QCD_97b34, QCD_32g58, QCD_66c26 and QCD_37x79; node labels range from 27 to 100; scale bar 5.0E-6.]
Feature Annotation and Machine Learning
Tools such as Mugsy (mugsy.sourceforge.net/) enable multiple whole genome alignment to form a Pan-Genome. Features that are unique to subsets of the genomes can be identified and genome annotation collected for these regions. A preliminary exercise of this strategy on the clinical isolates of Cdiff reveals several putative recent horizontal gene transfer events that may be associated with changes in antibiotic resistance or virulence. Other tools such as Islander (bioinformatics.sandia.gov) enable discovery of new genomic islands. (Phage integration may lead to acquisition of new virulence genes.)
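The idea of finding features unique to subsets of the genomes can be sketched with plain set operations once each genome has been reduced to a set of feature labels. This is a minimal illustration, not the Mugsy-based pipeline itself. The presence/absence data are a toy example that borrows a few feature names from the table of unique features, and `core_block_1` is a made-up placeholder for shared sequence.

```python
from collections import defaultdict


def features_by_genome_subset(features_per_genome: dict[str, set[str]]):
    """Split features into core ones and ones carried by a subset of genomes.

    Returns (core, subsets), where `core` holds features found in every
    genome and `subsets` maps each frozenset of genome names to the features
    carried by exactly those genomes.
    """
    carriers: dict[str, set[str]] = defaultdict(set)
    for genome, features in features_per_genome.items():
        for feature in features:
            carriers[feature].add(genome)

    groups: dict[frozenset, set[str]] = defaultdict(set)
    for feature, genomes in carriers.items():
        groups[frozenset(genomes)].add(feature)

    core = groups.pop(frozenset(features_per_genome), set())
    return core, dict(groups)


# Toy presence/absence data (illustrative only).
toy = {
    "Cd2": {"core_block_1", "TetM"},
    "Cd16": {"core_block_1", "LambdaBa04_integrase"},
    "Cd17": {"core_block_1", "Tn916_transposase", "TcpC"},
}
core, subsets = features_by_genome_subset(toy)
print("core:", sorted(core))
for genomes, features in subsets.items():
    print(sorted(genomes), "->", sorted(features))
```

Keying the result by the exact set of carrier genomes also captures features shared by two or three isolates, not only those found in a single genome.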
[Figure labels: Create Pan-Genome; Conserved and unique blocks; Unique genomic features; Unique HGT / Transposons; Ab resistance]
The unique genomic features across the different clinical samples and the corresponding patient phenotypic features (age, sex, onset time, etc.) would be used to develop a machine learning algorithm that can predict patient outcomes, such as the chance of recurrence and gradual changes in antibiotic resistance. The software tool developed would be suitable for routine clinical pathogenicity detection and drug administration.
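One way to picture how genomic and patient features could feed a predictive model is to encode each isolate as a numeric vector and compare it with labeled examples. The sketch below uses a one-nearest-neighbour rule purely as a stand-in, since the poster does not name a specific algorithm. Every record, feature choice and outcome label is a made-up placeholder, not data or results from this study.

```python
def encode(record: dict, genomic_features: list[str]) -> list[float]:
    """Genomic presence/absence flags followed by age (in decades) and sex."""
    vector = [1.0 if f in record["features"] else 0.0 for f in genomic_features]
    vector.append(record["age"] / 10.0)  # crude scaling so age does not dominate
    vector.append(1.0 if record["sex"] == "F" else 0.0)
    return vector


def predict_1nn(train: list[dict], query: dict, genomic_features: list[str]) -> str:
    """Return the outcome of the training record closest to the query."""
    q = encode(query, genomic_features)

    def squared_distance(record: dict) -> float:
        v = encode(record, genomic_features)
        return sum((a - b) ** 2 for a, b in zip(v, q))

    return min(train, key=squared_distance)["outcome"]


GENOMIC_FEATURES = ["TetM", "Tn916"]  # hypothetical feature choice

# Placeholder training records (not real patients or outcomes).
train = [
    {"features": {"TetM"}, "age": 70, "sex": "F", "outcome": "recurrence"},
    {"features": set(), "age": 35, "sex": "M", "outcome": "no_recurrence"},
]
query = {"features": {"TetM", "Tn916"}, "age": 68, "sex": "F"}

prediction = predict_1nn(train, query, GENOMIC_FEATURES)
print(prediction)
```

In practice the encoding step matters more than the toy classifier: any standard learning method could replace the nearest-neighbour rule once isolates are represented as vectors like these.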
Annotation pipeline (figure labels):
• Assembled Genomes
• Annotation: “RAST” or “Prokka”
• Gene Finder: “Prodigal”
• RNA Genes: “rfind”
• Islands: “Islander”
• Gene Families: “HMMER”
• Virulence DB
• Abx Res DB
• Transposases
• Integrases
• CAS/CRISPR
• Custom (Cdiff)
• Integrons: “Integral”
• Whole Genome Alignment: “Mugsy”