Computational Tools for Metagenomics
Surya Saha
Twitter: @SahaSurya / LinkedIn: www.linkedin.com/in/suryasaha/
Magdalen Lindeberg
Plant Pathology & Plant-Microbe Biology
Microbial Friends & Foes, Sep 25, 2012
Temperton, Current Opinion in Microbiology, 2012
Impact of Technology on Metagenomics
Types of “Meta” genomics
16S rRNA survey of bacterial
microbiome
ITS survey of fungal
microbiome
Bellemain, BMC Microbiology 2010
Slide: Julien Tremblay, JGI
Types of “Meta” genomics
Whole genome shotgun
• Varying complexity of microbial communities
• High coverage sequencing
• Sophisticated informatics
• Host associated metagenomes
– Deep sequencing of host meta-genome
– Bioinformatic screening of host sequences
• Environmental metagenomes
– E.g., soil samples
– Requires very high depth of coverage
– Complicated to assemble
Big picture!!
What users see
What users want!!
16S/ITS community surveys
• Multiple target regions in 16S gene and ITS region
• Comparison of results requires amplification of same region
• Advantages
– Fast survey of large communities
– Mature set of tools and statistics for analysis
– Good for first round survey
• 454 16S tags or pyrotags (~700 bp) have been the
preferred method
• Illumina MiSeq (2x150 bp, 2x250 bp) is the next
workhorse
• Depth of sampling
– 2,000-6,000 reads/sample for simple communities
– 20,000 reads/sample for complex soil metagenomes
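Samples sequenced to different depths are commonly subsampled to one shared depth before they are compared. The following is a minimal sketch of that idea, not the procedure of any specific tool; the function name `rarefy` and the OTU counts are hypothetical.

```python
import random
from collections import Counter


def rarefy(otu_counts, depth, seed=0):
    """Subsample `depth` reads without replacement from per-OTU read counts."""
    # Expand the count table into one entry per read.
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return Counter(rng.sample(pool, depth))


# Hypothetical sample with 5,000 reads, subsampled to 2,000 reads.
sample = {"OTU_1": 3500, "OTU_2": 1200, "OTU_3": 300}
subsample = rarefy(sample, 2000)
```

Every sample in a comparison would be subsampled to the same `depth`, typically the read count of the shallowest sample that is kept.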
16S/ITS issues
• Lack of tools for processing ITS/Fungal microbiome data
sets
– RDP classifier targets only ITS
– No ITS reconstruction tools
• Amplification bias affects accuracy and replication
• Use of short reads prevents disambiguation of similar
strains
• 16S or ITS may not differentiate between similar strains
– Clustering is done at 97%
– Regions may be >99% similar
• Sequencing error inflates number of OTUs
• Chloroplast 16S sequences can get amplified in plant
metagenomes
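The arithmetic behind the 97% clustering point can be sketched as follows. The 250 bp region length is an assumed example value, and the helper names are hypothetical.

```python
def identity_percent(mismatches, length):
    """Percent identity of two aligned sequences of the same length."""
    return 100.0 * (length - mismatches) / length


def same_otu(mismatches, length, threshold_percent=97.0):
    """True if two sequences would fall into one OTU at the given threshold."""
    return identity_percent(mismatches, length) >= threshold_percent


def max_mismatches_within_otu(length, threshold_percent=97):
    """Largest number of mismatches still tolerated at the threshold."""
    return length * (100 - threshold_percent) // 100


# Two strains that differ at 2 of 250 positions are 99.2% identical,
# so clustering at 97% places them in the same OTU.
strain_identity = identity_percent(2, 250)
merged = same_otu(2, 250)
tolerated = max_mismatches_within_otu(250)
```

Under these assumptions, up to 7 mismatches in a 250 bp region stay inside one OTU, which is why strains that are more than 99% similar cannot be separated at this threshold.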
16S/ITS sequence processing workflow
Filter for
contaminants and
low quality reads
Assemble
overlapping reads
Reduce datasets
(clustering)
Perform taxonomic
classification and
compute diversity
metrics
• Quality plots and read trimming
– FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
– FASTX
http://hannonlab.cshl.edu/fastx_toolkit/
• Chimera removal
– AmpliconNoise
http://code.google.com/p/ampliconnoise/
– UCHIME
http://www.drive5.com/uchime/
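As a rough illustration of what read trimming does, the sketch below removes low-quality bases from the 3' end of a read and drops reads that become too short. It assumes Phred+33 quality encoding and hypothetical thresholds; it is not how FastQC or FASTX are implemented.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding is an assumption)."""
    return [ord(ch) - offset for ch in quality_line]


def trim_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose quality score is below `min_q`."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_q=20, min_len=50):
    """Yield trimmed (name, seq, qual) records that are still long enough."""
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield name, seq, qual


# Hypothetical reads: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
reads = [
    ("read_1", "ACGTACGT", "IIIII###"),
    ("read_2", "ACGTACGT", "I#######"),
]
kept = list(filter_reads(reads, min_q=20, min_len=4))
```

Here `read_1` is trimmed to its first five bases and kept, while `read_2` shrinks to a single base and is discarded.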
Impact of Sequence Length
Slide: Feng Chen, JGI
16S/ITS sequence processing workflow
• Merge overlapping paired end reads
– FLASH
http://www.genomics.jhu.edu/software/FLASH/index.shtml
– FastqJoin
http://code.google.com/p/ea-utils/wiki/FastqJoin
– CD-HIT read-linker
http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit-auxtools-manual
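The idea behind merging overlapping paired-end reads can be sketched as below: reverse-complement the second read, then look for the longest suffix/prefix overlap with few mismatches. This is a simplified illustration with hypothetical parameter names and thresholds, and it ignores base qualities, which real mergers such as FLASH take into account in their own ways.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a read pair into one sequence, or return None if no overlap fits."""
    rev_rc = reverse_complement(rev)
    # Try the longest possible overlap first and shrink until one is accepted.
    for overlap in range(min(len(fwd), len(rev_rc)), min_overlap - 1, -1):
        tail, head = fwd[-overlap:], rev_rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            return fwd + rev_rc[overlap:]
    return None


# Hypothetical 30 bp fragment sequenced from both ends with 20 bp reads.
fragment = "ACGTTGCAAGCTTAGGCTAACGGATCCGTA"
forward_read = fragment[:20]
reverse_read = reverse_complement(fragment[10:])
merged = merge_pair(forward_read, reverse_read)
```

In this toy case the two reads overlap by 10 bases, so the merged sequence reconstructs the full fragment.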
16S/ITS sequence processing workflow
• Clustering with high stringency
– UCLUST/USEARCH (16S only)
http://www.drive5.com/usearch/
– CD-HIT-OTU (16S only)
http://weizhong-lab.ucsd.edu/cd-hit-otu/
– phylOTU (16S only)
https://github.com/sharpton/PhylOTU
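A toy version of greedy, centroid-based clustering at a 97% identity threshold is sketched below. It assumes pre-aligned sequences of equal length and compares every sequence against every centroid; the tools listed above use their own search heuristics and alignment steps, so this only illustrates the principle.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes aligned, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches, else start a new one."""
    clusters = {}  # centroid sequence -> list of member sequences
    for seq in seqs:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# Hypothetical 100 bp sequences: one close variant (99%) and one distant (95%).
base = "ACGT" * 25
close_variant = "T" + base[1:]
distant_variant = "N" * 5 + base[5:]
otus = greedy_cluster([base, close_variant, distant_variant])
```

The input order matters in a greedy scheme like this one, so sequences are often sorted (for example by abundance) before clustering.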
16S/ITS sequence processing workflow
• Composition based classifiers
– RDP database + classifier
http://rdp.cme.msu.edu/classifier/classifier.jsp
• Homology based classifiers
– ARB + Silva database (16S only)
http://www.arb-home.de/
– GreenGenes database (16S only)
http://greengenes.lbl.gov/cgi-bin/nph-index.cgi
– UNITE database (ITS only)
http://unite.ut.ee/
– FungalITSPipeline (ITS only)
http://www.emerencia.org/fungalitspipeline.html
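To give a feel for composition-based classification, the sketch below assigns a read to the reference with which it shares the largest fraction of its k-mers. This is a deliberately simplified stand-in: the RDP classifier uses a naive Bayesian model over 8-mers rather than a shared-fraction score, and the reference sequences and taxon names here are made up.

```python
def kmers(seq, k=8):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(read, references, k=8):
    """Return (taxon, score) for the reference sharing the most k-mers with the read."""
    read_kmers = kmers(read, k)
    best_taxon, best_score = None, 0.0
    if not read_kmers:
        return best_taxon, best_score
    for taxon, ref_seq in references.items():
        shared = len(read_kmers & kmers(ref_seq, k))
        score = shared / len(read_kmers)
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon, best_score


# Hypothetical reference sequences keyed by made-up taxon names.
references = {
    "Genus_A": "ACGTACGTTAGCTAGCTAGGATCCGATCGATTACG",
    "Genus_B": "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTT",
}
read = references["Genus_A"][5:25]
taxon, score = classify(read, references)
```

A homology-based classifier would instead align the read against the reference database and use the best alignments to decide.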
• http://www.qiime.org/
• Comprehensive suite of tools
– OTU picking
– Taxonomic classification
– Construction of phylogenetic
trees
– Visualization
– Compute diversity statistics
• Available as Amazon EC2
image
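As an example of the kind of diversity statistic such a suite computes, the sketch below calculates observed richness and the Shannon index from per-OTU read counts. It is an independent illustration with hypothetical counts, not QIIME code.

```python
import math


def observed_richness(counts):
    """Number of OTUs with at least one read."""
    return sum(1 for n in counts if n > 0)


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over OTUs with reads."""
    total = sum(counts)
    if total == 0:
        return 0.0
    proportions = [n / total for n in counts if n > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical samples: an even community and one dominated by a single OTU.
even_community = [10, 10, 10, 10]
skewed_community = [37, 1, 1, 1]
h_even = shannon_index(even_community)
h_skewed = shannon_index(skewed_community)
```

Both samples have the same richness, but the even community has the higher Shannon index because its reads are spread equally across OTUs.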
Whole Genome Shotgun (WGS)
Metagenomics
• Better classification with an increasing number of
complete genomes
• Focus on whole genome based phylogeny (whole
genome phylotyping)
• Advantages
– No amplification bias like in 16S/ITS
• Issues
– Poor sampling of fungal diversity
– Assembly of metagenomes is complicated due to
uneven coverage
– Requires high depth of coverage
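The link between uneven abundance and the need for deep sequencing can be illustrated with a back-of-the-envelope coverage estimate. The numbers below are assumed example values, and the sketch treats relative abundance as the fraction of reads that come from each genome.

```python
def expected_coverage(total_reads, read_length, genome_size, fraction_of_reads):
    """Average depth of coverage for one genome in a mixed sample."""
    bases_from_genome = total_reads * fraction_of_reads * read_length
    return bases_from_genome / genome_size


# Assumed run: 10 million reads of 100 bp from a community of two 5 Mb genomes.
total_reads = 10_000_000
read_length = 100
genome_size = 5_000_000

dominant = expected_coverage(total_reads, read_length, genome_size, 0.99)
rare = expected_coverage(total_reads, read_length, genome_size, 0.01)
```

Under these assumptions the dominant genome is covered about 198x while the rare one reaches only about 2x, so the rare member is hard to assemble unless the total depth is raised considerably.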
WGS sequence processing workflow
Filter for low
quality reads
Assemble
reads
Perform taxonomic
classification and
compute diversity
metrics
• Quality plots and read trimming
– FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
– FASTX
http://hannonlab.cshl.edu/fastx_toolkit/
WGS sequence processing workflow
• NGS assembly with uneven depth
– IDBA-UD
http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/
– MIRA
http://www.chevreux.org/projects_mira.html
– Velvet / MetaVelvet
http://www.ebi.ac.uk/~zerbino/velvet/
http://metavelvet.dna.bio.keio.ac.jp/
WGS sequence processing workflow
• Hybrid composition/homology based
classifiers
– FCP
http://kiwi.cs.dal.ca/Software/FCP
– Phymm/PhymmBL
http://www.cbcb.umd.edu/software/phymm/
– AMPHORA2
http://wolbachia.biology.virginia.edu/WuLab/Software.html
– NBC
http://nbc.ece.drexel.edu/
– MEGAN
http://ab.inf.uni-tuebingen.de/software/megan/
WGS sequence processing workflow
• Web based classifiers
– MG-RAST
http://metagenomics.anl.gov/
– CAMERA
http://camera.calit2.net/
– IMG/M
http://img.jgi.doe.gov/cgi-bin/m/main.cgi
MetaPhlAn
• Unique clade-specific markers for sequenced bacteria and archaea
• 400 genera/4000 genomes including HMP genomes
• Species level resolution
• MetaPhlAn 2 in the works
– Eukaryotes including Fungi
– Viruses
– Higher coverage of archaea
• Krona and GraPhlAn for visualization
of output
• Websites
– https://bitbucket.org/nsegata/metaphlan
– http://huttenhower.sph.harvard.edu/metaphlan
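The general idea of estimating community composition from clade-specific markers can be sketched as follows: count the reads that hit each clade's markers, normalize by marker length, and rescale so the values sum to one. This is a simplified illustration with hypothetical clade names and numbers, and it should not be read as the tool's actual algorithm.

```python
def relative_abundance(marker_hits, marker_lengths):
    """Estimate clade proportions from read hits normalized by marker length."""
    # Reads per base of marker sequence, so clades with longer markers are not favored.
    density = {
        clade: marker_hits[clade] / marker_lengths[clade]
        for clade in marker_hits
    }
    total = sum(density.values())
    if total == 0:
        return {clade: 0.0 for clade in density}
    return {clade: value / total for clade, value in density.items()}


# Hypothetical clades: total reads hitting their markers and total marker length (bp).
hits = {"Clade_A": 300, "Clade_B": 100}
lengths = {"Clade_A": 3000, "Clade_B": 1000}
abundance = relative_abundance(hits, lengths)
```

Although `Clade_A` collects three times as many reads, its markers are three times as long, so both clades come out at equal abundance in this example.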
PhyloSift/pplacer
• Reference database of marker genes
• Places reads on tree of life based on homology to
reference protein
• Integration with metAMOS for pre-assembling
next-generation datasets
• Bacterial and Archaeal classification only
• Plant and Fungi marker genes are being added
• Websites
– http://phylosift.wordpress.com/
– https://github.com/gjospin/PhyloSift
Real cost of Sequencing!!
Sboner, Genome Biology, 2011
Acknowledgements
Funding
Magdalen Lindeberg
Cornell University
Dave Schneider
USDA-ARS, Ithaca
Citrus greening / Wolbachia (wACP)
Thank you!
Surya Saha ss2489@cornell.edu
Suggestions
• Plan informatics workflow as early as possible
• Incorporate statistics at different stages in the workflow
