Metagenomic Data Analysis and Microbial Genomics

Metagenomic Data Analysis and Microbial
Genomics
Fabio Gori
Intelligent Systems, Institute for Computing and Information Sciences
in collaboration with
Department of Microbiology
Radboud University Nijmegen
The Netherlands
22nd May 2015

Table of Contents
1 Introduction to Metagenomics
2 Taxonomic-annotation Algorithms
3 Metagenomics to Retrieve a Bacterium
4 Comparative Genomics for Antibiotic Resistance
5 Appendix: construction of a tailored reference

What is Metagenomics?
Metagenomics:
study of genomic information obtained
directly from microbial communities
Why?
99% of microbes cannot be cultured in the laboratory,
so they cannot be sequenced in isolation
Understand interactions between organisms
Human microbiota

What kind of data? A meta... jigsaw puzzle
Reads of multiple microbes
Original pictures are
unknown
Pieces are similar
Biased abundance of pieces

Annotation: discovering the original pictures of the puzzles
Assign each read to an organism or
to a taxonomic identifier

Taxonomy: a biological classification
Linnaean taxonomy:
Formal system for classifying and naming
living things
Based on a simple hierarchical structure
Similar elements are grouped together
Rank: level in the hierarchy (left)
Taxon: unit of the hierarchy
(group of similar living things)

Lowest Common Ancestor (LCA) Algorithm
For each read r of the metagenome:
1 Compare r with reference sequences (e.g. with BLASTX)
2 Assign r to the lowest common taxonomic ancestor
  of the matching species Hi
Example
[Figure: taxonomic tree with leaves H1 to H12; the node marked LCA is the lowest common ancestor of the matching species]
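The assignment step above can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: it assumes each matching species is given as a root-first lineage tuple with the same ranks, and the function name and example lineages are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages (each listed root-first)."""
    if not lineages:
        return None  # no matching species: the read stays unassigned
    lca = None
    for taxa in zip(*lineages):  # walk the ranks from the root downwards
        if len(set(taxa)) != 1:  # the lineages diverge at this rank
            break
        lca = taxa[0]
    return lca


# Hypothetical lineages of the species matched by one read.
hits = [
    ("Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"),
    ("Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"),
    ("Bacteria", "Proteobacteria", "Betaproteobacteria", "Neisseria"),
]
lca_taxon = lowest_common_ancestor(hits)
print(lca_taxon)
```

With these hits the read is assigned at rank Phylum, which illustrates why LCA leaves few reads at low ranks: a single distant match is enough to push the assignment up the tree.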

LCA: Pros and Cons
Pros:
Higher accuracy than BLASTX best hit
Assigning reads to taxa is more realistic
(with short reads)
Cons:
Few reads at low ranks
Many unassigned reads
How can we improve it?

MTR: Multiple Taxonomic Rank based clustering
Goal: taxonomic annotation of short
metagenomic reads (rank-level)
Assign from the highest rank
to the lowest feasible rank
Assignments of reads are
dependent on each other

MTR Algorithm scheme: top-down strategy
1 Compare reads R with reference proteins
  (we used BLASTX and the NCBI-NR database)
2 For each rank j (from the highest to the lowest):
    1 T ← {taxa at rank j of proteins matching R}
    2 Annotate by clustering R into clusters Ci;
      each Ci corresponds to a taxon ti ∈ T
    3 Remove from R reads with incoherent classification
      (w.r.t. higher-rank classifications)
3 For each rank j (from the lowest to the highest):
    1 Majority Vote on clusters' intersections at rank j
    2 Make higher-rank classifications coherent with the Majority
      Vote results

MTR: Annotation via combinatorial optimization
For each rank j: For each taxon ti of rank j:
Create cluster Ci ⊆ R of reads similar to taxon ti
Set Covering Problem
Select collection of clusters (taxa) s.t.
No sequence is left outside
Minimal number of selected clusters
If Ci is selected, sequences of Ci will be assigned to ti
Example:
[Figure: membership matrix of sequences s1 to s10 (rows) against clusters C1 to C6 (columns), followed by the clustering solution, in which a minimal subset of clusters covering every sequence is selected]
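The set covering step can be illustrated with the classic greedy heuristic. This is a sketch under stated assumptions: the slides do not say which solver MTR uses, greedy selection only approximates the minimal cover, and the cluster contents below are hypothetical.

```python
def greedy_set_cover(clusters):
    """clusters: dict mapping taxon -> set of reads. Returns the selected taxa."""
    uncovered = set().union(*clusters.values())
    selected = []
    while uncovered:
        # Pick the cluster covering the most still-uncovered reads
        # (ties are broken by taxon name, because max keeps the first maximum).
        best = max(sorted(clusters), key=lambda t: len(clusters[t] & uncovered))
        selected.append(best)
        uncovered -= clusters[best]
    return selected


# Hypothetical clusters of reads similar to four taxa.
clusters = {
    "t1": {"s1", "s2", "s3"},
    "t2": {"s3", "s4"},
    "t3": {"s4", "s5", "s6"},
    "t4": {"s1", "s6"},
}
selected = greedy_set_cover(clusters)
# Reads of a selected cluster Ci are assigned to its taxon ti.
assignment = {
    read: [t for t in selected if read in clusters[t]]
    for read in sorted(set().union(*clusters.values()))
}
print(selected)
```

Here two clusters are enough to cover all six reads, so the other two taxa are discarded even though some reads match them.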

MTR vs LCA: MTR better w.r.t. quantity
MTR annotates more reads than LCA
Simulated data: MTR annotates 8% to 37% more reads
At rank Genus: 28% to 89% more
Real-life data: MTR annotates 15% to 30% more reads
At rank Species: 120% to 208% more
Experiments: 12 simulated datasets and 3 real-life datasets (100 bp reads)

MTR vs LCA: LCA better accuracy
Accuracy (%) and number of reads assigned, for each rank

Rank      MTR (# of reads)     LCA (# of reads)
Kingdom   100.00 (166,948)     99.99 (155,263)
Phylum     99.86 (166,948)     99.93 (155,258)
Class      99.73 (166,936)     99.81 (141,829)
Order      97.67 (166,148)     98.14 (115,732)
Family     97.62 (165,231)     98.04 (110,488)
Genus      97.42 (140,476)     98.35 (110,139)
Table: dataset M3, coverage 4X, total reads: 166,978

Rank      MTR (# of reads)     LCA (# of reads)
Kingdom    95.07 (88,537)      94.66 (73,176)
Phylum     93.21 (88,537)      92.57 (73,169)
Class      89.25 (87,635)      88.98 (60,294)
Order      89.24 (85,657)      88.44 (57,373)
Family     77.35 (81,366)      81.84 (48,760)
Genus      61.36 (77,307)      74.60 (40,823)
Table: dataset M2, coverage 1X, total reads: 288,730

MTR vs LCA: MTR better population distribution
Population distributions (rank Genus) of M2, coverage 0.1X

Population distributions (rank Genus) of Coral dataset

Genus             MTR reads (%)      LCA reads (%)
Acinetobacter     1031 (9.03%)       944 (20.15%)
Aspergillus        279 (2.44%)        80 (1.71%)
Gibberella        3492 (30.57%)     1804 (38.51%)
Neurospora         133 (1.16%)        76 (1.62%)
Podospora           80 (0.70%)        76 (1.62%)
Chaetomium         146 (1.28%)       105 (2.24%)
T4-like viruses     57 (0.50%)        57 (1.22%)
Porites           4540 (39.75%)      643 (13.72%)
Phaeosphaeria       90 (0.79%)        51 (1.09%)
Magnaporthe        313 (2.74%)       169 (3.61%)
Nitrosopumilus     128 (1.12%)        76 (1.62%)
Others            1133 (9.92%)       604 (12.89%)

Conclusions
MTR outperforms LCA in two ways:
More sequences annotated, especially at low ranks
Better estimate of population distribution
LCA tends to be more accurate

Future Developments
Replace BLASTX with composition-based
similarity measure
Additional constraints of cluster selection
e.g. consistent coverage depth on proteins
or constraints on genome location coverage

Metagenomic sequencing to acquire an organism

Candidatus Brocadia fulgida
Brocadia genome had not been previously sequenced
Sequencing platforms (mean read length):
Sanger shotgun (800 bp)
Sanger fosmid (800 bp)
454 GS20 (200 bp)
First standard annotation:
Reads are assigned to their BLASTX best hit
Reads are assigned to Brocadia if the best hit is Kuenenia
(Kuenenia is a close relative of Brocadia)

Why do FISH analysis and BLASTX annotation not agree?

80% of the cells are Brocadia, but...
Brocadia seems underrepresented
Are we sure?
Can we still extract significant information?

                 Shotgun   Fosmid    454
Brocadia reads   9.68%     13.76%    12.92%
Brocadia bp      9.76%     14.33%    11.34%

Let's do some composition-based analyses...

Different point of view: GC content
[Bernaola-Galvan et al., Gene, 2004]
Different organisms can have different GC content
(16.6% - 74.9%)
If a genome is partitioned into equally sized,
non-overlapping sequences:
GC content has (approximately) a normal distribution
The distribution is centered on the organism's GC content
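The partitioning described above can be sketched in a few lines. This is only an illustration: the helper names, the toy sequence and the window size are assumptions, and real genomes would use windows of hundreds of base pairs or more.

```python
from statistics import mean


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def window_gc(genome, size):
    """GC content of equally sized, non-overlapping windows (last partial one dropped)."""
    return [
        gc_content(genome[i:i + size])
        for i in range(0, len(genome) - size + 1, size)
    ]


# Toy sequence, for illustration only.
genome = "ATGCGCGCATATATGCGGCC"
windows = window_gc(genome, 5)
print(windows, mean(windows))
```

A histogram of `windows` computed separately for each group of reads is the kind of plot used for the composition-based comparison.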

Bias toward high GC-content organisms
[Figure: histograms of read GC content (x-axis: GC content, 0% to 100%; y-axis: frequency), one panel each for 454, Fosmid and Shotgun; series: Raw, Annotated, Brocadia, Alphaproteobacteria, Betaproteobacteria]

We saw that Brocadia is underrepresented. . .
How can we cope with that?

Sets of well-recovered Kuenenia ORFs differ
[Figure: extended Venn diagram of Brocadia Open Reading Frames retrieved for 80% of their length, by technology: Shotgun (Sanger), Fosmid (Sanger), 454]

Depth of coverage: correlation on the same ORF
[Figure: stacked bars (0% to 100%) for the technology pairs Shotgun/Fosmid, Shotgun/454 and Fosmid/454, split by correlation bin: (-1, -0.7), (-0.7, -0.3), (-0.3, 0), (0, 0.3), (0.3, 0.7), (0.7, 1); the legend groups the bins into "Similar" and "Different"]
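One way to obtain such a correlation is sketched below. It assumes, as the slides do not spell out, that the depth of coverage is sampled at the same positions along one ORF for two technologies and compared with the Pearson coefficient; the depth profiles are hypothetical, and only the bin boundaries come from the figure legend.

```python
from bisect import bisect_right
from math import sqrt

# Bin boundaries taken from the figure legend.
EDGES = [-0.7, -0.3, 0.0, 0.3, 0.7]
LABELS = ["-1, -0.7", "-0.7, -0.3", "-0.3, 0", "0, 0.3", "0.3, 0.7", "0.7, 1"]


def pearson(x, y):
    """Pearson correlation of two equally long depth profiles (None if one is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def correlation_bin(r):
    """Label of the legend bin that contains the correlation r."""
    return LABELS[bisect_right(EDGES, r)]


# Hypothetical per-position depths along the same ORF.
shotgun = [2, 4, 6, 8]
fosmid = [1, 2, 3, 4]
pyro454 = [8, 6, 4, 2]
r_similar = pearson(shotgun, fosmid)
r_different = pearson(shotgun, pyro454)
print(correlation_bin(r_similar), "|", correlation_bin(r_different))
```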

Future Developments
Tuning sequencing for the specific community
Integration of composition-based analysis
and BLASTX annotation

The antibiotic alarm, Nature, 14 March 2013
Rise of
resistance
(inevitable)
Decline of
development
(economics)

Waiting for new drugs. . . How can we cope with it?
Multi-drug
treatments
New therapies
(dosage, duration)
Personalised medicine
(e.g. infecting strain,
patient PK/PD,
patient genotype)

Idea: Drug Switching
Experiments:
Treatments:
Sequential: switch drug
50%-50%: cocktail
Control: no drugs
Protocol:
For each season, bacteria grow in liquid medium with drug
1% of the bacteria are transferred between seasons
3 replicates
Duration: 96 hours (8 seasons of 12 hours)
Drugs: Doxycycline, Erythromycin
Sequencing: after 24h and 96h
18 datasets (red border)

My Role
Construct annotated reference genome [custom pipeline]
For each replicate, identify:
Structural Variations (SVs) [Pindel]
Copy Number Variations (CNVs) [CNVnator]
Single Nucleotide Polymorphisms (SNPs) [VarScan]

Results CNV: 412kb duplicated region at 96 hours
[Figure: normalised coverage (1000 bins) along the genome (x-axis: 0 to 4.5 Mb) at 96 hours, one track each for Control, ERY/DOX and 50-50; reference level: Mean +1; labelled loci: ampE, rrnH, paoC, tauA, ybbJ, mdtG, atoB, rrnG, yqhC, rng, rrnD, rrnC, ubiD, rrnA, rhaM, rrnB, rrnE, slt]
Efflux-pump duplications (!)
This region includes the
multidrug efflux pump
AcrRAB-TolC
[Peña-Miller et al, PLOS Biology, 2013]
[Figure: coverage ratio inside/outside the duplicated region (y-axis, 1 to 2) at 24h and 96h, one panel per treatment]
Dox/Ery: p < 0.0001
50%-50%: p < 0.0001
Control: p = 0.303
(p-value is for t-test)
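The two quantities plotted in this section, binned normalised coverage and the inside/outside coverage ratio, can be sketched as follows. This is an assumption-laden illustration rather than the pipeline used here: the per-base depth list, the region coordinates and the function names are hypothetical, and the t-test itself is not reproduced.

```python
def binned_coverage(depth, n_bins):
    """Mean depth of each bin divided by the genome-wide mean depth."""
    n = len(depth)
    genome_mean = sum(depth) / n
    bins = []
    for i in range(n_bins):
        lo, hi = i * n // n_bins, (i + 1) * n // n_bins
        bins.append(sum(depth[lo:hi]) / (hi - lo) / genome_mean)
    return bins


def inside_outside_ratio(depth, start, end):
    """Mean depth inside [start, end) divided by the mean depth outside it."""
    inside = depth[start:end]
    outside = depth[:start] + depth[end:]
    return (sum(inside) / len(inside)) / (sum(outside) / len(outside))


# Hypothetical per-base depth with a duplicated region at positions 4 to 7.
depth = [10] * 4 + [20] * 4 + [10] * 4
binned = binned_coverage(depth, 3)
ratio = inside_outside_ratio(depth, 4, 8)
print(binned, ratio)
```

A ratio near 1 means no copy-number change, while a ratio near 2 is what a single duplication of the region would produce.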

Conclusion
Sequential treatments work well in vitro when cocktails fail
Genomics: antibiotics prevent mutations
Further developments (omics):
Phage role in region duplication
Timing of region duplication
NGS of additional treatments
Transcriptomics

Standard approach: de novo assembly and annotation
Solve the jigsaw puzzle
Functional annotation
Done with software and manual work
Problems (common)
Errors:
Repetitive regions misassembled
Wrong order/orientation
Annotation quality
Fragmentation
Quality depends on time and money
2014: Automated genome assembly for less than $1,000
[Koren and Phillippy, Curr Opin Microbiol, 2015]

The alternative: tailored reference
Take the reference genome
of a close relative
Modify it according to
sequencing data
Import annotation from
reference
Pros
Less fragmentation
Higher quality
Better annotation
Cons
You need a close relative
Visually check steps
Ad hoc scripting
Conservative approach

Our case
Sequenced organism:
E. coli K-12 AG100 growing 24h in M9 medium
Reference genome:
E. coli K-12 MG1655 (available online)
Data (preprocessed):
Reads mapping to reference MG1655: 95.84%
Mean coverage depth: 88.19x (based on MG1655)
Read min/max/mean length (bp): 15 / 99 / 72.17

The pipeline
Read preprocessing (standard)
Mapping to reference MG1655 (standard)
Call Structural Variations (SVs)
Assemble unmapped and mapped data
Make intermediate reference
Check SVs and call SNPs
Functional annotation

Clean and align reads
Reads preprocessing
[fastq-mcf, samtools]
Mapping to reference
[BWA, IGV]

Structural Variations (SVs)
Use Pindel to call SVs
Deletions, Insertions,
Inversions, Translocations
Indels
Break points
Visually checked [IGV]:
Deletions: 5 (total 47kbp)
Indels: 9
Break points: 9

SVs application and assembly of unmapped reads
Take the close relative genome
Break it into sequences by applying SVs
Extract reads around removed regions
Extract reads not mapped to reference
Assemble the union (∪) of the two read sets, then scaffold
[Python and Bash scripting, Samtools, Velvet, SSPACE, GapFiller]
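Breaking the reference at the called deletions can be sketched as below. This is a simplified illustration of one step only: it assumes deletions are given as 0-based, half-open (start, end) intervals on the reference, and the function name and toy sequence are hypothetical rather than taken from the scripts used here.

```python
def split_at_deletions(reference, deletions):
    """Remove [start, end) deletions and return the remaining segments in order."""
    segments, pos = [], 0
    for start, end in sorted(deletions):
        if start > pos:
            segments.append(reference[pos:start])
        pos = max(pos, end)  # also handles overlapping deletions
    if pos < len(reference):
        segments.append(reference[pos:])
    return segments


# Toy reference with two deleted regions.
reference = "AAAACCCCGGGGTTTT"
segments = split_at_deletions(reference, [(4, 8), (12, 14)])
print(segments)
```

The resulting segments are the pieces that would later be combined with the contigs assembled from the extracted and unmapped reads.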

Making intermediate and tailored references
Making intermediate reference
Order scaffolds w.r.t. reference [Mauve]
Concatenate the 13 aligned scaffolds [Bash one-liner]
Making tailored reference
Look for SVs (none should be present)
Call SNPs [VarScan, vcftools]
Annotation
Export annotation from reference [RATT]
Adjust and annotate missing parts [RAST, manual editing]
Make file ENA compatible [Python script]

"In my experience, people do not look at assemblies critically
enough" [Nature Methods, 2012]
Clean results need designed protocols, time, and money
A leap forward has been made recently,
but the sequencing cost is still not very low
[Nature Methods, June 2013; Koren and Phillippy, Curr Opin
Microbiol, 2015]

Combining technologies improved Kuenenia ORFs retrieval
[Figure: number of ORFs (y-axis, 0 to 2000) against the threshold of mapping percentage (x-axis, 0% to 100%); series: Shotgun; Fosmid; 454; Shotgun + Fosmid; Shotgun + 454; Fosmid + 454; All]
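A curve of this kind can be computed as sketched below. The sketch assumes that "mapping percentage" means the fraction of an ORF's length covered by the union of read alignments, which the slides do not define precisely; the data structures, names and toy alignments are hypothetical.

```python
def covered_fraction(orf_length, intervals):
    """Fraction of an ORF covered by the union of [start, end) alignments."""
    covered, pos = 0, 0
    for start, end in sorted(intervals):
        start, end = max(start, pos), min(end, orf_length)
        if end > start:  # count only the part not covered so far
            covered += end - start
            pos = end
    return covered / orf_length


def count_retrieved(orfs, alignments, technologies, threshold):
    """Number of ORFs whose covered fraction reaches the threshold."""
    count = 0
    for orf, length in orfs.items():
        merged = [
            interval
            for tech in technologies
            for interval in alignments[tech].get(orf, [])
        ]
        if covered_fraction(length, merged) >= threshold:
            count += 1
    return count


# Toy data: ORF lengths and alignments per technology.
orfs = {"orf1": 100, "orf2": 200}
alignments = {
    "Shotgun": {"orf1": [(0, 50)], "orf2": [(0, 200)]},
    "454": {"orf1": [(40, 80)]},
}
shotgun_only = count_retrieved(orfs, alignments, ["Shotgun"], 0.8)
combined = count_retrieved(orfs, alignments, ["Shotgun", "454"], 0.8)
print(shotgun_only, combined)
```

In this toy case the first ORF reaches the 80% threshold only when both technologies are pooled, which is the effect the figure reports for the real data.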

SNPs ribosomal: mutations in the control
Hypothesis: antibiotics slow down adaptation for optimal growth
in culture
Heightened ribosomal demand due to rapid growth
[Condon et al., J Bacteriol 1995]
Mean variant frequency in % (number of replicates, if not all)
Original column groups: 50%-50%, Dox/Ery, Control, each at 24h and 96h
(the column of each value is not recoverable from this transcript)

operon   position    relative posn   values
rrnH     226,521     595             5(2)
         227,791     1,865           3(1), 17
rrnG     2,723,624   1,865           3(1), 9
         2,724,894   595             8
rrnD     3,421,431   1,865           4(1), 13
         3,422,701   595             8
rrnC     3,940,810   595             4(1), 17
rrnA     4,034,586   555             7
rrnB     4,165,708   595             4(1), 8
         4,166,978   1,865           10
rrnE     4,207,110   595             3(1), 9
         4,208,380   1,865           5(1), 7

SNPs significantly different in frequency (ANOVA)
Maybe these ribosomal mutations help with α-amino acid
starvation, because...

tauA expressed only under condition of sulfate or cysteine
(α-amino acid) starvation [Eichhorn et al, J Bacteriol, 2000]
yqhC regulates a scavenger of toxic aldehydes produced by lipid
peroxidation [Jarobe et al, Appl Microbiol Biotechnol, 2011]
Mean variant frequency in % (number of replicates, if not all)
Original column groups: 50%-50%, Dox/Ery, Control, each at 24h and 96h
(the column of each value is not recoverable from this transcript)

gene   position    values      annotation
DUPLICATED REGION
tauA   384,897     19(1), 68   taurine transport system
yqhC   3,151,384   45          putative ARAC-type regulatory protein
