
CAMERA Presentation at KNAW ICoMM Colloquium May 2008

From saul.kravitz, 5 months ago

CAMERA Presentation by Saul Kravitz at KNAW ICoMM Colloquium May 2008 in Amsterdam, Netherlands. See http://camera.calit2.net

Slideshow Transcript

  1. Slide 1: CAMERA A Metagenomics Resource for Microbial Ecology Saul A. Kravitz J. Craig Venter Institute Rockville, Maryland USA KNAW Colloquium May 29, 2008
  2. Slide 2: Goals • Introduce you to CAMERA • Encourage you to use CAMERA • What can CAMERA do for you?
  3. Slide 3: Presentation Outline • Introduction to Metagenomics • Global Ocean Sampling (GOS) Expedition • CAMERA Capabilities and Features - Compute Resources - Data Resources - Tools Resources • Looking Forward
  4. Slide 4: Metagenomic Questions • Within an environment - What biological functions are present (absent)? - What organisms are present (absent)? • Compare data from (dis)similar environments - What are the fundamental rules of microbial ecology? • Adapting to environmental conditions? - How? - Evidence and mechanisms for lateral transfer • Search for novel proteins and protein families - And diversity within known families
  5. Slide 5: Genomics vs Metagenomics • Genomics – ‘Old School’ - Study of a single organism's genome - Genome sequence determined using shotgun sequencing and assembly - >1300 microbes sequenced, first in 1995 - DNA usually obtained from pure cultures (<1%) • Metagenomics - Application of genome sequencing methods to environmental samples (no culturing) - Environmental shotgun sequencing is the most widely used approach - Environmental Metadata provides key context
  6. Slide 6: Complexity of Microbial Communities • Simple (e.g., AMD, gutless worm) - Few species present (<10) - Diverse → Variations on standard genomics techniques • Complex (e.g., Soil or Marine) - Many species present (>10, often >1000) - Many closely related → New techniques
  7. Slide 7: Global Ocean Sampling Expedition
  8. Slide 8: Global Ocean Sampling (GOS) • 178 Total Sampling Locations - Phase 1: 7.7M reads, >6M proteins 3/07 - Phase 2-IO: 2.2M reads 3/08 - Phase 2: ~10M reads future • Diverse Environments - Open ocean, estuary, embayment, upwelling, fringing reef, atoll… 4/04 3/07 3/08
  9. Slide 9: GOS: Sequence Diversity in the Ocean Rusch et al (PLoS 2007) • Most sequence reads are unique - Very limited assembly - Most sequences not taxonomically anchored - Relating shotgun data to reference genomes - Annotation challenging • New Techniques Needed - Fragment Recruitment - Extreme Assembly to find pan genomes - Sample to Sample Comparisons
  10. Slide 10: Comparison of Dominant Ribotypes
  11. Slide 11: Comparison of Total Genomic Content
  12. Slide 12: GOS Protein Analysis Yooseph et al (PLoS 2007) • Novel clustering process • Sequence similarity based • Predict proteins and group into related clusters • Include GOS and all known proteins • Findings • GOS proteins • cover ~all existing prokaryotic families • expands diversity of known protein families • ~10% of large clusters are novel • Many are of viral origin • No saturation in the rate of novel protein family discovery
  13. Slide 13: Added Protein Family Diversity Yooseph et al (PLoS 2007) Rubisco homologs Known eukaryotes Known prokaryotes GOS prokaryotes New Groups
  14. Slide 14: GOS Viral Analysis (Williamson et al PLoSOne 2008) • Study of dsDNA viruses from shotgun data - 155k viral proteins identified from 37 GOS I sites (~2.5%) - 59% of viral sequences were bacteriophage • Viral acquisition and retention of host metabolic genes is common and widespread - Viruses have made these genes “their own” - Clade tightly with viral genes • Codistribution of P-SSM4-like cyanophage and the dominant ecotype of Prochlorococcus in GOS samples.
  15. Slide 15: Viral acquisition of host genes talC Gene GOS Viral Public Viral GOS Bacterial Public Bacterial Public Euk
  16. Slide 16: Reference Genomes • Overview - 150+ reference marine microbes (101 released) - Scaffold for GOS - Sequenced, assembled, autoannotated • Isolation Metadata - Incomplete • Bottlenecks - Availability of DNA - Purity of DNA • Status and Data - https://research.venterinstitute.org/moore/
  17. Slide 17: Motivations for CAMERA • Significant investment in sequencing - Only accessible to bioinformatics elite - Diversity of user sophistication and needs • Bioinformatics and Computation Challenges - Assembly, annotation, comparative analysis, visualization - Dedicated compute resources • Importance of Metadata - Metadata required for environmental analysis - Need to drive standards • Compliance with Convention on Biological Diversity
  18. Slide 18: Convention on Biological Diversity • Sample in territorial waters? - Country granted certain rights by CBD - Sampling agreements may contain restrictions • CAMERA users must acknowledge potential restrictions on commercial data use • CAMERA maintains mapping of country-of-origin for all data objects
  19. Slide 19: CAMERA – http://camera.calit2.net • “Convenient acronym for cumbersome name…” - Henry Nicholls, PLoS Biology • Mission - Enable Research in Marine Microbiology • Debuted March 2007 camera-info@calit2.net
  20. Slide 20: CAMERA Capabilities • Compute Resources - 512 node compute grid + 200 TB storage • Data and Metadata Resources - Annotated Metagenomic and genomic data • Tools Resources - Scalable BLAST - Fragment Recruitment - Metagenomic Annotation - Text Search
  21. Slide 21: CAMERA Compute and Storage Complex at UCSD/Calit2 512 Processors ~5 Teraflops ~ 200 Terabytes Storage Source: Larry Smarr, Calit2
  22. Slide 22: CAMERA Metagenomic Data Volume by Project
  23. Slide 23: CAMERA Metagenomic Samples
  24. Slide 24: CAMERA Users >2000 Registered Since March 2007
  25. Slide 25: CAMERA Data Collections • Metagenomic Sequence Collection - Reads and assemblies w/associated metadata - CAMERA-computed annotation • Protein Clusters - Maintaining clusters from Yooseph et al (Yooseph and Li, ’08) • Genomic Data - Viral, Fungal, pico-Eukaryotes, Microbial - Moore Marine Genomes with Metadata • Non-redundant sequence Collection - GenBank, RefSeq, UniProt/Swiss-Prot, PDB etc
  26. Slide 26: Standardizing Contextual Metadata • Genomic Standards Consortium - Led by Dawn Field, NIEeS - Members from EU, UK, US • Goals are to promote - Standardization of genomic descriptions - Exchange & Integration of genomic data • Metadata standardization key enabler - MIMS: Min Info for Metagenomic Sample - GCDML: Standard format
  27. Slide 27: Contextual Metadata Challenges • Researchers Need to Collect and Submit • Relevant metadata depends on study – MIMS - Specification of minimum metadata • Standardize Exchange Format - GCDML - Comprehensive and extensible - Leverages Existing Ontologies, Validatable And… - Easy for a scientist to use... • Need ongoing software support for tools
  28. Slide 28: CAMERA Core Metadata by Project • De facto Core • Latitude and Longitude • Collection date • Habitat and Geographic Location • Missing metadata =
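The de facto core fields on Slide 28 suggest a simple completeness check per sample. The sketch below is only an illustration: the field names, the dictionary layout, and the sample values are assumptions made up for this example, not CAMERA's actual schema or data.

```python
# Hypothetical field names for the "de facto core" listed on the slide.
CORE_FIELDS = (
    "latitude",
    "longitude",
    "collection_date",
    "habitat",
    "geographic_location",
)


def missing_core_fields(sample):
    """Return the core fields that are absent or empty in a sample record."""
    return [f for f in CORE_FIELDS if sample.get(f) in (None, "")]


# Made-up record, used only to exercise the function.
example = {
    "latitude": 10.5,
    "longitude": -20.25,
    "collection_date": "2004-01-01",
    "habitat": "",
}
print(missing_core_fields(example))
```

A latitude or longitude of 0 counts as present here, because only `None` and the empty string are treated as missing.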
  29. Slide 29: CAMERA Contextual Metadata
  30. Slide 30: CAMERA 1.3 http://camera.calit2.net
  31. Slide 31: Scalable BLAST with Metadata • Large searches permitted and encouraged • 454 FLX run vs “All Metagenomic” • Some larger tblastx jobs have run >20 hrs • 10kbp BLASTN vs All Metagenomic – 1 min • BLAST XML or Tabular Export • Searches against NRAA • BLAST XML output feeds MEGAN • Searches against ‘All Metagenomic’ • GUI with metadata • Tabular with metadata
  32. Slide 32: Scalable BLAST with Metadata
  33. Slide 33: Integration of Metadata and Data
  34. Slide 34: Browsing Large Data Collections: Fragment Recruitment Viewer • Microbial Communities vs Reference Genomes - Millions of sequence reads vs Thousands of genomes • Definition: A read is recruited to a sequence if: - End-to-end blastN alignment exists • Rapid Hypothesis Generation and Exploration - How do cultured and wildtype genomes differ? - Insertions, deletions, translocations - Correlation with environmental factors • Export sequence and annotation • Credits: Doug Rusch and Michael Press
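The recruitment rule on Slide 34 (a read is recruited if an end-to-end blastN alignment exists) can be sketched as a filter over parsed BLAST hits. This is a minimal illustration, not the viewer's implementation: the `Hit` record, the `slack` tolerance, and the separate `read_lengths` lookup are assumptions introduced here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One blastN alignment of a read against a reference (hypothetical record)."""
    read_id: str
    ref_id: str
    pident: float  # percent identity of the alignment
    qstart: int    # 1-based alignment start on the read
    qend: int      # 1-based alignment end on the read
    sstart: int    # alignment start on the reference
    send: int      # alignment end on the reference


def is_end_to_end(hit, read_len, slack=0):
    """True if the alignment covers the read from its first to its last base.

    `slack` is an assumed tolerance (in bases) at each end of the read.
    """
    lo, hi = sorted((hit.qstart, hit.qend))
    return lo <= 1 + slack and hi >= read_len - slack


def recruit(hits, read_lengths, slack=0):
    """Return (ref_id, reference position, percent identity) per recruited hit."""
    points = []
    for hit in hits:
        if is_end_to_end(hit, read_lengths[hit.read_id], slack):
            points.append((hit.ref_id, min(hit.sstart, hit.send), hit.pident))
    return points
```

Each returned tuple corresponds to one point in a recruitment plot: genomic position on one axis and sequence similarity on the other.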
  35. Slide 35: Fragment Recruitment Viewer Sequence Similarity Genomic Position Doug Rusch, JCVI
  36. Slide 36: Sequence Similarity Geographic Legend Genomic Position Annotation
  37. Slide 40: Prochlorococcus marinus str. MIT 9312 • Coloring by geography • 80-95% identity cloud • = GOS Indian Ocean • Regions with no coverage • Where? • Real?
  38. Slide 41: Mate Status Highlights Differences • Paired end (mate) sequencing • Coloring by mate status • Highlights cultured vs metagenomic differences • Selective display of - Mates by status - Reads by sample
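Coloring by mate status, as on Slide 41, presupposes that each pair of mated reads is classified by where the two mates recruit. The sketch below shows one plausible classification. The category names, the tuple layout, and the insert-size bounds are assumptions for illustration and are not taken from the Fragment Recruitment Viewer.

```python
def mate_status(a, b, min_insert, max_insert):
    """Classify a mate pair from the recruited placement of each mate.

    a, b: (ref_id, start, end, strand) with start <= end in 1-based
    reference coordinates and strand "+" or "-", or None if the mate
    was not recruited. All labels returned are hypothetical.
    """
    if a is None or b is None:
        return "missing_mate"
    if a[0] != b[0]:
        return "different_reference"
    if a[3] == b[3]:
        return "same_orientation"
    fwd, rev = (a, b) if a[3] == "+" else (b, a)
    if fwd[1] > rev[1]:
        # The forward-strand mate lies downstream of its partner.
        return "outward_facing"
    span = rev[2] - fwd[1] + 1  # outer distance covered by the pair
    if span < min_insert:
        return "too_close"
    if span > max_insert:
        return "too_far"
    return "good"
```

Pairs labeled anything other than "good" are the ones that would highlight differences between the cultured reference and the environmental population, such as insertions or rearrangements.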
  39. Slide 42: Mate Pairs Highlight Variation
  40. Slide 43: What Genes are Involved
  41. Slide 44: View by Sample
  42. Slide 45: View by Sample Filter by mate status
  43. Slide 46: Annotation of Environmental Shotgun Data • Gene Finding - Using Yooseph’s Protein Clusters, and/or - Metagene • Functional Assignment - Variation of JCVI prok annotation pipeline* - Leverages protein cluster annotation -- soon • Quality Nearly Comparable to Prokaryotic Genomic Annotation
  44. Slide 47: Protein Clusters as Gene Finder • Identification and soft mask of ncRNAs • Naïve identification of ORFs (60aa min) • Add peptides to clusters incrementally - Yooseph and Li, 2008 • Predicted Genes based on ORFs in - Clusters of sufficient size - Clusters that satisfy additional filters
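The naive ORF step on Slide 47 (60 aa minimum) might look roughly like the sketch below. It assumes that an ORF is any stop-to-stop stretch in one of the six reading frames and that the standard genetic code applies. It omits the ncRNA soft-masking and the clustering steps, and the function names are illustrative rather than taken from the CAMERA pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate whole codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(seq, min_aa=60):
    """Return peptides of at least min_aa residues found between stop
    codons (or sequence ends) in all six reading frames."""
    seq = seq.upper()
    orfs = []
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for peptide in translate(strand[frame:]).split("*"):
                if len(peptide) >= min_aa:
                    orfs.append(peptide)
    return orfs
```

Because no start codon is required, stretches that run off either end of a read are kept, which suits short shotgun reads where genes are often truncated.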
  45. Slide 48: Protein Clusters Advantages and Disadvantages • Weaknesses - Homology-based - Stateful (also a strength) - Less sensitive (for now) • Strengths - More specific - Transitive Annotation - Learns over time - Easy to maintain
  46. Slide 49: Search for Dehalogenase
  47. Slide 50: Browse Clusters
  48. Slide 51: Near Future • More extensive data collection • Summary views of data sets by - Annotation - Samples - Mate Status - Taxonomy - Habitat and other contextual metadata • 16S datasets?
  49. Slide 52: Credits • JCVI CAMERA Team - Leonid Kagan, Michael Press, Todd Safford, Cristian Goina, Qi Yang, Sean Murphy, Jeff Hoover, Tanja Davidsen, Ramana Madupu, Sree Nampally, Nikhat Zhafar, Prateek Kumar - Doug Rusch, Shibu Yooseph, Aaron Halpern*, Granger Sutton, Shannon Williamson - Marv Frazier and Bob Friedman • Calit2 CAMERA Team - Adam Brust, Michael Chiu, Brian Fox, Adam Dunne, Kayo Arima - Larry Smarr and Paul Gilna http://camera.calit2.net