Metagenomic Data Analysis: Computational Methods and Applications
1. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Metagenomic Data Analysis:
Computational Methods and Applications
Fabio Gori
Intelligent Systems, Institute for Computing and Information Sciences
in collaboration with
Department of Microbiology
Radboud University Nijmegen
The Netherlands
2. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Table of Contents
Introduction to Metagenomics
Taxonomic-annotation Algorithms
Genomic Signatures for Metagenomics
Metagenomics to Retrieve Anammox Bacteria
4. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
What is Metagenomics?
Metagenomics: the study of genomic information obtained directly from microbial communities
Why?
• 99% of microbes cannot be sequenced
• Understand interactions between organisms
7. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
What kind of data? A meta... jigsaw puzzle
DNA sequences (reads)
• Original pictures are unknown
• Pieces are similar
• Pieces have errors
8. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Annotation: discovering the original pictures of the puzzles
Assign each read to an organism or to a taxonomic identifier
9. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Taxonomy: a biological classification
Linnaean taxonomy:
• Formal system for classifying and naming living things
• Based on a simple hierarchical structure
• Similar elements are grouped together
Rank: level in the hierarchy (left)
Taxon: unit of the hierarchy (group of similar living things)
11. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Similarity-based methods
Algorithm scheme
1. Compare reads to reference sequences
2. Assign each read to a taxon of one of its best-matching sequences
Comparison is performed with sequence alignment or a composition profile
Problems (Lowest Common Ancestor algorithm):
• Few reads at low ranks
• Many unassigned reads
How can we improve it?
Idea: assignments of reads are dependent on each other
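The Lowest Common Ancestor rule mentioned in the scheme can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the toy taxonomy `PARENT` (deliberately simplified), the score tolerance, and the function names are all assumptions made for the example. It shows why ambiguous reads end up at high ranks: when the best matches disagree, the read is pushed up to their common ancestor.

```python
from typing import Dict, List, Optional

# Hypothetical, simplified toy taxonomy: child taxon -> parent taxon.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Planctomycetes": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Kuenenia": "Planctomycetes",
}


def lineage(taxon: str) -> List[str]:
    """Return the path from the root down to `taxon`."""
    path: List[str] = []
    node: Optional[str] = taxon
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def lowest_common_ancestor(taxa: List[str]) -> str:
    """Deepest taxon shared by the lineages of all given taxa."""
    lca = "root"
    for level in zip(*(lineage(t) for t in taxa)):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


def assign_read(hits: Dict[str, float], tolerance: float = 0.1) -> Optional[str]:
    """Assign a read to the LCA of its best-matching taxa.

    `hits` maps taxon -> similarity score (higher is better). Taxa scoring
    within `tolerance` of the best score count as best matches (an assumed
    rule). A read with no hits stays unassigned (None).
    """
    if not hits:
        return None
    best = max(hits.values())
    top = [t for t, score in hits.items() if score >= (1 - tolerance) * best]
    return lowest_common_ancestor(top)


print(assign_read({"Escherichia": 100.0, "Salmonella": 98.0}))  # Proteobacteria
print(assign_read({"Escherichia": 100.0, "Kuenenia": 95.0}))    # Bacteria
print(assign_read({"Escherichia": 100.0, "Kuenenia": 50.0}))    # Escherichia
print(assign_read({}))                                          # None
```

Each read is handled independently here, which is exactly the limitation the slide points at.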
13. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
MTR: Annotation via combinatorial optimization
For each rank j, for each taxon t_i of rank j:
create cluster C_i of sequences similar to taxon t_i
Set Covering Problem
Select a collection of clusters (taxa) such that
• No sequence is left outside
• The number of selected clusters is minimal
If C_i is selected, the sequences of C_i will be assigned to t_i
Example:
[Figure: membership matrix of sequences s1 to s10 (rows) against clusters C1 to C6 (columns), followed by the same matrix labeled "Clustering Solution". The column positions of the marks were lost in extraction; s1, s4 and s7 each belong to three clusters, all other sequences to two.]
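A set covering instance of this kind can be explored with the classic greedy heuristic, sketched below. The slides do not say how MTR solves the problem, so the greedy rule is a stand-in rather than the author's method, and it does not guarantee a minimal cover. The cluster memberships are hypothetical, since the ones in the example figure were not recoverable.

```python
from typing import Dict, List, Set


def greedy_set_cover(clusters: Dict[str, Set[str]]) -> List[str]:
    """Select clusters until every sequence is covered (greedy approximation)."""
    uncovered: Set[str] = set().union(*clusters.values())
    selected: List[str] = []
    while uncovered:
        # Take the cluster covering the most still-uncovered sequences;
        # iterating over sorted names makes ties deterministic.
        best = max(sorted(clusters), key=lambda c: len(clusters[c] & uncovered))
        selected.append(best)
        uncovered -= clusters[best]
    return selected


def assign(clusters: Dict[str, Set[str]], selected: List[str]) -> Dict[str, str]:
    """Map each sequence to the first selected cluster (taxon) that contains it."""
    assignment: Dict[str, str] = {}
    for c in selected:
        for s in sorted(clusters[c]):
            assignment.setdefault(s, c)
    return assignment


# Hypothetical memberships: cluster (taxon) -> sequences similar to it.
clusters = {
    "C1": {"s1", "s2", "s3"},
    "C2": {"s3", "s4"},
    "C3": {"s4", "s5", "s6"},
    "C4": {"s1", "s6"},
}
selected = greedy_set_cover(clusters)
print(selected)                    # ['C1', 'C3']
print(assign(clusters, selected))  # s1-s3 -> C1, s4-s6 -> C3
```

Because the cover is chosen over all sequences at once, the assignment of one read depends on the others, which is the idea motivating the approach.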
15. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Results
Rank      MTR (# of reads)   LCA (# of reads)
Kingdom   95.07 (88,537)     94.66 (73,176)
Phylum    93.21 (88,537)     92.57 (73,169)
Class     89.25 (87,635)     88.98 (60,294)
Order     89.24 (85,657)     88.44 (57,373)
Family    77.35 (81,366)     81.84 (48,760)
Genus     61.36 (77,307)     74.60 (40,823)
Table: Data name: M2, coverage 1X, total reads: 288,730
[Figure: Population distributions (rank Genus) of M2, coverage 0.1X]
• More sequences annotated at low ranks
• Better estimate of population distribution
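The claim that more sequences are annotated at low ranks can be checked with a little arithmetic on the table. The sketch below assumes that the parenthesized counts are the reads annotated at each rank; the variable and function names are illustrative.

```python
# (rank, reads annotated by MTR, reads annotated by LCA), copied from the table.
READS = [
    ("Kingdom", 88_537, 73_176),
    ("Phylum", 88_537, 73_169),
    ("Class", 87_635, 60_294),
    ("Order", 85_657, 57_373),
    ("Family", 81_366, 48_760),
    ("Genus", 77_307, 40_823),
]
TOTAL_READS = 288_730


def summarize(rows, total):
    """Per rank: share of all reads annotated by each method, and the MTR/LCA ratio."""
    return {
        rank: {
            "mtr_share": mtr / total,
            "lca_share": lca / total,
            "mtr_over_lca": mtr / lca,
        }
        for rank, mtr, lca in rows
    }


summary = summarize(READS, TOTAL_READS)
for rank, row in summary.items():
    print(f"{rank:8s} MTR {row['mtr_share']:.1%}  LCA {row['lca_share']:.1%}  "
          f"ratio {row['mtr_over_lca']:.2f}")
```

Under that reading, MTR annotates roughly 1.9 times as many reads as LCA at genus rank, against roughly 1.2 times at kingdom rank.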
21. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Metagenomic annotation in two steps
1. Map DNA sequences (strings of A, C, G, T) to R^n with a representation ρ
2. Classification or clustering in R^n
In this study: focus on ρ, the data representation
26. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Typical ρ's used in binning
ρ^T(s) := frequencies in s of all the k-mers
k-mer := sequence of k nucleotides, an element of {A, C, G, T}^k
ρ^T_i(s) := #w_i (number of occurrences of w_i in s), where w_i is the i-th k-mer, i = 1, ..., 4^k
Usually k = 4 ⟹ 4^k = 256 features: ρ^T(s) ∈ ℕ^256
[Mohammed et al., Bioinformatics, 2011], [Diaz et al., BMC Bioinformatics, 2009]
[Chan et al., J. Biomed. Biotech., 2008], [Teeling et al., Environ. Microb., 2004]
Example:
s = AGCATGCAGCATATGTGGAGCA
ρ^T(s) = (#AAAA = 0, ..., #AGCA = 3, ..., #ATAT = 1, ..., #GCAT = 2, ...)
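The k-mer count vector can be reproduced in a few lines. This is a minimal sketch: the function name `kmer_counts` is illustrative, and overlapping windows are assumed, which is what the counts in the example imply (AGCA occurs at three overlapping positions).

```python
from itertools import product
from typing import Dict


def kmer_counts(s: str, k: int = 4) -> Dict[str, int]:
    """Count every overlapping k-mer of s; k-mers absent from s get 0."""
    counts = {"".join(w): 0 for w in product("ACGT", repeat=k)}
    for i in range(len(s) - k + 1):
        w = s[i:i + k]
        if w in counts:  # skip windows containing other symbols, e.g. N
            counts[w] += 1
    return counts


s = "AGCATGCAGCATATGTGGAGCA"
rho = kmer_counts(s)
print(len(rho), rho["AAAA"], rho["AGCA"], rho["ATAT"], rho["GCAT"])
# prints: 256 0 3 1 2
```

A dictionary keyed by k-mer is used instead of a bare vector so that each of the 256 features stays labeled; sorting the keys gives a fixed feature order when a vector in ℕ^256 is needed.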
30. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
What ρ should do
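ρ needs to be a genomic signature [Karlin et al., Trends Genet., 1995]:
• ρ(s) ≈ ρ(z) for sequences s, z from the same genome
• ρ(s) ≠ ρ(r) for a sequence r from a different genome
However, the literature draws few connections between genomic signatures and metagenomics
[Figure: sequences z, s and r are mapped by ρ to the points ρ(z), ρ(s) and ρ(r) in R^n]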
31. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Results
• Proposed signatures outperformed the standard signature ρ^T used in metagenomics
• Best signatures had fewer features (half the number of dimensions)
34. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Sequencing communities containing anammox bacteria
Anammox: ANaerobic AMMonium OXidation
Why anammox bacteria are important:
• Fixed nitrogen loss
• Wastewater-treatment plants
Metagenomics: the only way to retrieve anammox bacteria
• Difficult to culture
• Not isolable
39. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Different point of view: GC content
[Bernaola-Galvan et al., Gene, 2004]
• Different organisms can have different GC content (16.6% - 74.9%)
• If a genome is partitioned into equally sized, non-overlapping sequences:
  • GC content is approximately normally distributed
  • The distribution is centered on the organism's GC content
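The windowed GC content described above can be computed as in the sketch below. The genome is synthetic and purely hypothetical: bases are drawn independently with an expected GC content of 40%, which real genomes do not strictly follow, so this only illustrates the computation and the centering of the window values. Function names and the window size are assumptions.

```python
import random
from statistics import mean, stdev
from typing import List


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(genome: str, window: int) -> List[float]:
    """GC content of consecutive, equally sized, non-overlapping windows.

    A trailing fragment shorter than `window` is dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(genome) // window
    return [gc_content(genome[i * window:(i + 1) * window]) for i in range(n)]


# Hypothetical genome: independent bases, expected GC content 0.4.
rng = random.Random(0)
genome = "".join(rng.choices("ACGT", weights=[0.3, 0.2, 0.2, 0.3], k=100_000))
values = windowed_gc(genome, window=1_000)
print(len(values), round(mean(values), 3), round(stdev(values), 3))
```

With equally sized windows, the mean of the window values equals the GC content of the covered part of the genome, so the window distribution is centered on the organism's overall value.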
41. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Combining technologies improved protein retrieval
[Figure: Extended Venn diagram of proteins retrieved for 80% of their length]
• Retrieve anammox core genes
Technologies (figure legend): Shotgun (Sanger), Fosmid (Sanger), 454
42. Metagenomics Annotation Algorithms Genomic Signatures Anammox Retrieval
Conclusions
• Proposed new, effective methods for improving metagenomic data analysis
• Studied in detail real-life data of anammox bacteria