Folker Meyer, Argonne National Laboratory and University of Chicago. June 14th, 1st EMP meeting, Shenzhen, China. Metagen...

Folker Meyer: Metagenomic Data Annotation


Folker Meyer's talk from the 1st Earth Microbiome Project meeting in Shenzhen.


  1. Metagenome Annotation
     Folker Meyer, Argonne National Laboratory and University of Chicago
     June 14th, 1st EMP meeting, Shenzhen, China
  2. Metagenomics needs the magic wand
     • == “shotgun genomics applied directly to various environments”
       ▫ “shotgun metagenomics”
     • != sequencing of BAC clones with env. DNA
       ▫ “functional metagenomics”
     • != sequencing single genes (16S rDNA)
       ▫ “gene surveys”
     Figure text: data; Who are they? What are they doing?
  3. Portals help with computational analysis
     • MG-RAST, IMG/M and CAMERA for metagenomes
       ▫ Provide complete project support including metadata input
       ▫ Systems allow upload of sequence runs and provide QC, feature identification, feature annotation, views and comparison
       ▫ Systems provide lots of public samples to compare to
         - MG-RAST: 4,000+ public samples (June 2011)
       ▫ Google will reveal URLs
     • QIIME for amplicon studies
       ▫ Provides support for amplicon analysis
       ▫ Large number of public amplicon samples
       ▫ Advanced visualization capabilities with rich metadata
       ▫ Integration with other tools including MG-RAST
  4. 2010 state of metagenomics
     • 8492 metagenomes from > 500 groups
     • Over 20GB per week (rapid growth)
     • Many centers produce data
     • This was a few weeks ago
  5. 2011: many small scale projects (V3, 03/2011)
     • ~25,000 data sets, hundreds of groups
     • ~4000 public, with metadata, 45GBp
     • >> 1 Terabase (10^12 basepairs)
  6. Even data upload is hard!
     • Jumploader
     • Thanks to Rob Knight’s team for pointing us there
  7. Part of an emerging digital biology
     • Users (dots) sharing pre-publication metagenomes (edges)
     Source: MG-RAST, 800+ shared metagenomes
  8. Computing costs dominate
     • “Living on the log scale” (Guy Cochrane, EBI, UK)
     Source: Rob Knight, UColorado. From: Wilkening et al., IEEE Cluster09, 2009
     Figure labels: computing; sequencing
  9. Challenges during shotgun metagenome analysis
     • Quality Control
     • Finding features
     • Characterizing features
     • Presentation
  10. Quality control for de-novo sequencing
     • Question is simple: How trustworthy is my data?
       ▫ “rare biosphere debate” → de-noising for amplicon runs
       ▫ No such tool for shotgun data
     • Existing QC approaches rely on:
       ▫ Using reference sequences
       ▫ Using vendor specific scores
         - Includes e.g. phred scores
     • None of those are suitable for what we are doing
     • EMP needs novel quality control to ensure comparisons work
     • Approaches utilizing artifacts of sequencing and library prep processes show promising results
  11. Tell me if my data set is of type A or B
     • A) Lots of error, ~10% at 70bp
     • B) Errors only at tail
     • % duplicates varies also
     Real data sets from MG-RAST. K. Keegan, in preparation
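
The remark that the percentage of duplicates varies between data sets can be made concrete with a small sketch. This is not the MG-RAST QC method (the slides do not describe it in detail); it assumes, purely for illustration, that a read counts as a duplicate when its first `prefix_len` bases exactly match an earlier read. The function name and the prefix length are hypothetical.

```python
from collections import Counter

def duplicate_fraction(reads, prefix_len=50):
    """Fraction of reads whose first prefix_len bases repeat an earlier read.

    Assumption for illustration: exact prefix identity is used as a crude
    proxy for artificial duplicates; real tools use more careful criteria.
    """
    if not reads:
        return 0.0
    # Group reads by their leading bases.
    counts = Counter(read[:prefix_len] for read in reads)
    # Every read beyond the first in a group is counted as a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(reads)
```

Comparing this fraction across data sets gives a reference-free signal, in the spirit of the approaches on the previous slide that use artifacts of sequencing and library preparation rather than vendor scores.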
  12. Finding features
     • Protein coding features
       ▫ Statistics based approaches:
         - Using e.g. codon usage trained on existing genomes
         - MGA, Metagene, FragGeneScan, Prodigal, MetageneMarkHMM
         - Limitation: novel proteins are harder, islands and transferred also
       ▫ Similarity based approaches
         - Blastx search against
         - Limitation: Runtime + Novel proteins will never be found…
     • Running more specialized tools e.g. RFAM is often not feasible for large scale data sets
     • EMP will enable systematic search for novel proteins (think of CRISPRs from AMD)
  13. Performance Analysis on simulated data sets w/ errors
     W. Trimble, in preparation
  14. Characterizing features
     • Describe sequences by comparison to existing databases
       ▫ GenBank, GO, KEGG, COGs, SEED, STRINGS, ..
       ▫ Use sequence similarity to define
       ▫ Function: function string(s), EC number, GO number, …
       ▫ Taxonomic origin
     • Algorithms (not exhaustive)
       ▫ BLAST (default, sensitive, too expensive)
       ▫ BLAT (well tested, no parallel, a bit less sensitive)
       ▫ Suffix array based (fast, limited mis-matches)
       ▫ HMM based (HMMer 3.0 is as fast as BLAST)
       ▫ We haven’t tested RAPsearch2
     • Similarity search cost is high, repeat searches are required
       ▫ Think of Nikos’ MEP (next talk)
  15. Presentation layer
  16. MG-RAST v3 workflow (simplified)
     • SFF, fastq and fasta data
     • find emPCR and BridgePCR artifacts
     • find coding regions/peptides using FragGeneScan (Ye, NAR 2010)
     • Many databases integrated: GSC’s M5nr
     Diagram labels: Upload; QC / normalization; Similarities (Parallel Blat); Metabolic reconstruction; Community reconstruction; Metadata; Feature prediction (FGS); Abundance profiles; Metabolic model
  17. The future
     • “Living on the log scale” (Guy Cochrane, EBI, UK)
     • “Data bonanza” (Dawn Field, Oxford, UK)
     • “Metadata are essential for turning data into knowledge” (Rob Knight, U Colorado, USA)
     Source: Rob Knight, UColorado
     Figure labels: Driving force; 600 GBp / run; 60 GBp / run
  18. Future 1: World is not clonal, study strain/species variation
     • “Pangenome view” allows definition of strains
     Figure labels: new strain? new strain?
  19. Future 2: Expand metadata (1): MIMS/MIMARKS
     • Genomics Standards Consortium (GSC) provides
       ▫ Extensible metadata standards
       ▫ Environmental packages allow domain specific extension
     • Groups starting to build environmental packages
     • MG-RAST v3 supports GSC metadata standards
     • Use metadata
       ▫ Select data sets to compare to based on:
         - Biome, location, sampling procedure, …
     • Capture metadata
       ▫ Extensive metadata questionnaire supporting offline editors // input
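
Selecting data sets to compare against based on metadata can be sketched as a simple filter. The record layout and the field names (`biome`, `location`) are assumptions made for this illustration; they are not the actual MIMS/MIMARKS field names or the MG-RAST interface.

```python
def select_datasets(datasets, **criteria):
    """Return the data sets whose metadata match every given key/value pair.

    Each data set is assumed to be a dict of metadata fields; a data set
    that lacks a requested field is excluded.
    """
    return [
        record
        for record in datasets
        if all(record.get(key) == value for key, value in criteria.items())
    ]
```

A call such as `select_datasets(records, biome="soil")` would then return only the soil samples, which is only possible if the metadata were captured in the first place.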
  20. Expand metadata support (2): Capture metadata early
     • Imagine adding metadata to the plot below: very hard after the fact!
     • Capture metadata early
     Aanensen et al., PLoS ONE, 2009
  21. Many current challenges and pitfalls
     • Assembly (state of the art: hard)
       ▫ Several groups are working actively on metagenome assemblers
       ▫ Quotes from Mihai Pop (UMaryland)
         - “metagenomes can’t be assembled” and “all assemblers are equal”
     • Rare k-mer filtering (state of the art: DO NOT)
       ▫ C. Titus Brown (MSU): “Friends don’t let friends filter rare k-mers”
     • Binning (state of the art: use k-mers)
       ▫ Traditional binning does not work for short reads (Alice C. McHardy)
       ▫ K-mer based binning can produce organism sized bins → Titus’ work
     • Sequence quality
       ▫ Quality really matters and vendors lie all the time
     • Metadata
       ▫ Challenge for the next few years is to add metadata
     • Cloud computing
       ▫ Does not change the cost structure
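
As a generic illustration of what k-mer based binning builds on, the sketch below computes a normalized k-mer composition vector for a sequence and a simple distance between two such vectors. It is not the method referenced on the slide; the function names, the choice of k = 4, and the L1 distance are assumptions for this example.

```python
from collections import Counter
from itertools import product

def kmer_profile(sequence, k=4):
    """Normalized k-mer frequencies over the ACGT alphabet.

    The vector is ordered by itertools.product("ACGT", repeat=k), so it
    always has 4**k entries. K-mers with other characters (e.g. N) are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]

def profile_distance(a, b):
    """L1 distance between two profiles of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Sequences with similar composition give a small `profile_distance`, which a clustering step could then use to group them into bins.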
  22. Metagenome transport format (MTF)
     • Input sequences (“from the machine”)
       ▫ FASTA, FASTQ, SFF (maybe Archive BAM)
     • Transformed sequences (“after QC”)
       ▫ FASTA
     • Feature coordinates (“after genefinding”)
       ▫ GFF3/GTF
     • Similarities (“the BIG computation”)
       ▫ Blast/BLAT/.. results
     • Metadata (context, “the important stuff”)
       ▫ GSC compliant MIMS format
     • Workflow description (“provenance”)
       ▫ What did we do? (not in shell script!)
       ▫ What version of code // databases did we use
       ▫ Who computed where
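
The six components listed for the transport format can be pictured as one bundle with a machine-readable manifest. The sketch below is only an illustration under stated assumptions: the slides do not define an MTF schema, so every class name, field name, and the JSON layout here are hypothetical.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Provenance:
    """Workflow description: what was done, with which versions, by whom, where."""
    steps: list              # ordered, human-readable description of each step
    code_versions: dict      # tool name -> version string
    database_versions: dict  # database name -> version string
    computed_by: str
    computed_at: str         # site or machine where the computation ran

@dataclass
class MetagenomeBundle:
    """One entry per component named on the slide (field names are assumed)."""
    input_sequences: list        # FASTA / FASTQ / SFF files from the machine
    transformed_sequences: list  # FASTA files after QC
    feature_coordinates: list    # GFF3 / GTF files after gene finding
    similarities: list           # Blast / BLAT result files
    metadata: str                # GSC-compliant metadata file
    provenance: Provenance

def to_manifest(bundle):
    """Serialize the bundle description as JSON so it travels with the data."""
    return json.dumps(asdict(bundle), indent=2, sort_keys=True)
```

Keeping provenance as structured data rather than in a shell script is what makes it possible to answer later which code and database versions produced a given set of similarities.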
  23. Acknowledgements
     • MG-RAST team
       ▫ Daniela Bartels
       ▫ Narayan Desai
       ▫ Mark D’Souza
       ▫ Elizabeth M. Glass
       ▫ Travis Harrison
       ▫ Kevin Keegan
       ▫ Tobias Paczian
       ▫ William Trimble
       ▫ Andreas Wilke
       ▫ Jared Wilkening
     • Metadata:
       ▫ Dawn Field, Oxford
       ▫ Renzo Kottmann, MPI Bremen
         - and all of GSC
     • M5/QC collaboration with
       ▫ Nikos Kyrpides, JGI
       ▫ Kostas Konstantinidis, JGI
     • M5 standards
       ▫ Sarah Hunter, EBI
     • CLOVR
       ▫ Sam Anguielo, Owen White (HMP DACC)
     • QIIME
       ▫ Rob Knight (Colorado)
     • INSDC submission/archiving
       ▫ Guy Cochrane/EBI
  24. Thank you for your attention
