Jeff Grethe: CAMERA

Published in: Education, Business

Transcript

  • 1. Jeffrey S. Grethe, Ph.D. Center for Research in Biological Systems University of California, San Diego
    • Standards in the Context of a Large-Scale Microbial Ecology Cyberinfrastructure (CAMERA)
  • 2. Global Scientific Research Cyber-Community
  • 3. CAMERA 2.0
  • 4. CAMERA 2.0 Objectives
    • CAMERA serves as one representation of a specific research community’s need for a system to
      • Provide a metadata rich family of scalable databases and make them available to the community
        • Collect and reference increasing metadata relevant to environmental metagenome datasets
        • Exploit the power of querying on metadata across multiple geospatial locations
      • Provide a facility that allows for a diversity of software tools to be easily integrated into the system (and sufficient compute resources to support these analyses)
  • 5. Creating CAMERA 2.0 - Advanced Cyberinfrastructure Service Oriented Architecture
  • 6. CAMERA 2.0 Objectives
    • CAMERA serves as one representation of a specific research community’s need for a system to
      • Provide a metadata rich family of scalable databases and make them available to the community
        • Collect and reference increasing metadata relevant to environmental metagenome datasets
        • Exploit the power of querying on metadata across multiple geospatial locations
      • Provide a facility that allows for a diversity of software tools to be easily integrated into the system (and sufficient compute resources to support these analyses)
  • 7. The Semantically Aware DB Schema
    • Some key features of the semantically aware DB schema
      • Environmental parameters : Modeled more generally, to accommodate any environment and any parameter within an environment
      • Sequence : Separate “registries” for DNA, rRNA, mRNA, viral segments, reference genomes etc. Sequence annotations are independently searchable.
      • Workflow Connection : Every computed property is associated with the workflow instance that created it.
      • Associated Data : Data not produced in CAMERA but often used for analysis and comparison
      • Ontologies : All metadata, measured and observed parameters are connected to ontologies, whenever possible.
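The schema features on slide 7 can be illustrated with a minimal sketch. It is not the CAMERA schema: every class and field name below is hypothetical. It only shows two of the listed ideas, namely environmental parameters stored generically so that any environment can carry any parameter, and computed properties that always record the workflow instance that produced them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class EnvironmentalParameter:
    """One measured or observed value, stored generically as name/value/unit."""
    name: str
    value: float
    unit: str
    ontology_term: Optional[str] = None  # filled in only when a term is available


@dataclass(frozen=True)
class ComputedProperty:
    """A derived value tied to the workflow run that created it."""
    name: str
    value: str
    workflow_instance_id: str


@dataclass
class Sample:
    """Hypothetical sample record; not the actual CAMERA table layout."""
    sample_id: str
    environment: str
    parameters: List[EnvironmentalParameter] = field(default_factory=list)
    computed: List[ComputedProperty] = field(default_factory=list)

    def parameter(self, name: str) -> Optional[EnvironmentalParameter]:
        """Return the first parameter with this name, or None if absent."""
        for p in self.parameters:
            if p.name == name:
                return p
        return None


# Illustrative values only.
sample = Sample("S1", "marine")
sample.parameters.append(EnvironmentalParameter("temperature", 18.5, "degC"))
sample.computed.append(ComputedProperty("gc_content", "0.41", "wf-run-001"))
```

Because parameters are rows rather than fixed columns, a soil or air sample could reuse the same structure with a different set of parameter names.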
  • 8. Integration of External Data
    • Warehousing
      • Reference genomes
      • Homologs, CoG clusters
      • Raster data from slow/complex servers
    • Remote Data
      • KEGG pathways
      • NASA MODIS data
      • World Ocean Atlas
      • Other data that come as “data sets” that do not conform to the schema
  • 9. NASA Aqua-MODIS satellite data
    • Metadata: beyond data collected at sampling site
    • Figure: MODIS images (sea surface temperature and chlorophyll) covering GOS sites #8 – 12, mid-November 2003
  • 10. Integration of Enhanced Metadata
  • 11. Integrate and browse additional sources of microbial data
  • 12. Community Data Requirements
    • A simple submission process (web based entry or template upload)
    • Support from CAMERA staff during process (collaborative environment)
    • Large variety (metadata) and quantity of data should not mean a long submission (choice of interfaces)
    • Compliance with community standards
    • Support pre-registration of samples for sequencing
  • 13. CAMERA 2.0 (Data Submission) Growing the CAMERA Community and Resource…
  • 14. Data Standards
    • Minimum Information about a (Meta)Genome Sequence: MIGS/MIMS
    • A metadata standard, developed by the Genomic Standards Consortium
        • Controlled vocabularies e.g. EnvO, PATO
        • Common language: GCDML
    • Submissions shall comply with a MIMS/MIGS core, but any metadata can be entered via keywords and free text
    • Different metadata submission forms for different habitats: (water, soil, air, hosts)
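The submission rule on slide 14 (a required core, habitat-specific forms, and free keywords or text on top) can be sketched as a small validation step. The field names below are hypothetical placeholders chosen for illustration; the actual MIGS/MIMS checklist is defined by the Genomic Standards Consortium and is not reproduced here.

```python
# Hypothetical core fields, NOT the real MIGS/MIMS checklist.
CORE_FIELDS = {"project_name", "collection_date", "latitude", "longitude", "habitat"}

# Hypothetical extra fields per habitat form (water, soil, air, hosts).
HABITAT_FIELDS = {
    "water": {"depth_m"},
    "soil": {"soil_type"},
    "air": {"altitude_m"},
    "hosts": {"host_name"},
}


def missing_fields(submission: dict) -> set:
    """Return required fields that are absent or empty in the submission.

    Keys outside the required set (keywords, free text) are accepted as-is.
    """
    required = set(CORE_FIELDS)
    required |= HABITAT_FIELDS.get(submission.get("habitat"), set())
    return {name for name in required if submission.get(name) in (None, "")}


# Illustrative submission: the core is complete, the water form is not.
submission = {
    "project_name": "example project",
    "collection_date": "2003-11-15",
    "latitude": 0.0,
    "longitude": -90.5,
    "habitat": "water",
    "keywords": ["coastal", "surface"],
    "notes": "free text is allowed alongside the core",
}
```

Here `missing_fields(submission)` reports only the missing habitat-specific field, while the extra `keywords` and `notes` entries pass through untouched.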
  • 15. CAMERA 2.0 Objectives
    • CAMERA serves as one representation of a specific research community’s need for a system to
      • Provide a metadata rich family of scalable databases and make them available to the community
        • Collect and reference increasing metadata relevant to environmental metagenome datasets
        • Exploit the power of querying on metadata across multiple geospatial locations
      • Provide a facility that allows for a diversity of software tools to be easily integrated into the system (and sufficient compute resources to support these analyses)
  • 16. User Friendly Compute Environment
  • 17. CAMERA 2.0 (Computation) From simple job submission to community developed and published workflows…
  • 18. The Big Picture: Supporting the Scientist
    • Conceptual Workflow: from “Napkin Drawings” …
    • Executable Workflow: … to Executable Workflows
    • Source: Mladen Vouk (NCSU)
  • 19. Scientific Workflow Systems …
    • … and a cross-project collaboration
      • … initiated August 2003
    • 1st release: May 13th, 2008
      • More than 20 thousand downloads!
    www.kepler-project.org
    • Builds upon the open-source Ptolemy II framework
    • Different Scientific Workflows
      • Visual component integration
        • Taverna, Triana
      • Grid-based distributed execution
        • Pegasus, Askalon
      • Visualization
        • Vistrails, SciRUN
      • Transaction-oriented
        • BPEL, mostly industrial
    • Execution Platforms
      • Portals, e.g., GEON, CAMERA
      • Web 2.0, e.g., myExperiment
    • Ptolemy II: A laboratory for investigating design
    • KEPLER: A problem-solving environment for Scientific Workflow
    • KEPLER = “Ptolemy II + X” for Scientific Workflows
  • 20. Personalized (Collaborative) Workflow and Data Spaces
  • 21. Default and Advanced UI
  • 22. RAMMCAP – Rapid clustering and functional annotation for metagenomic sequences
    • RNA finding/filtering
    • DNA Clustering
    • Unique sequence
    • Taxonomy / population analysis
    • ORF clustering
    • ORF calling
    • Unique sequences
    • Protein families
    • ORF and cluster annotation
    • Pfam, Tigrfam, COG, etc.
    • Features
      • Very fast (10-100x) as compared to BLAST-based methods
      • Effective tools: CD-HIT, HMMERHEAD, meta_RNA, and RPS-BLAST
      • Focused functional annotation via curated protein families
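    • Figure: RAMMCAP pipeline diagram, with these labels:
      • Input: metagenomic raw reads
      • Clustering steps: CD-HIT-EST, 95%; CD-HIT, 90-95%; CD-HIT, 60 or 30%
      • ORF calling: 1. ORF_finder 2. Metagene
      • RNA finding: 1. tRNA scan 2. rRNA scan 3. meta_RNA
      • Annotation tools and databases: HMMER, HMMERHEAD, RPS-BLAST; COG, Pfam, Tigrfam
      • Products: DNA clusters, unique DNA sequences, tRNAs, rRNAs, ORFs, non-redundant ORFs, protein clusters, representative sequences, ORF annotation, cluster annotation
      • Final step: more in-depth analysis and further annotation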
  • 23. Annotation workflow
    • A green box is called an ‘actor’, which performs a task.
    • This special actor represents an annotation component, such as a BLAST search.
    • Workflow parameters, which can be specified by users in the portal, are passed to workflow components.
    • Data flow is divided.
  • 24. Run branches within workflow
    • An ORF clustering branch
    • A functional annotation branch
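The actor and branch ideas on slides 23 and 24 can be sketched in plain Python. This is not Kepler's API (Kepler builds on the Java-based Ptolemy II framework); the function names and the placeholder logic are assumptions made for illustration. Each function stands in for an actor, user-supplied parameters are passed to every actor, and the data flow divides into an ORF clustering branch and a functional annotation branch.

```python
from typing import Dict, List


def call_orfs(reads: List[str], params: Dict[str, int]) -> List[str]:
    """Placeholder actor: keep sequences at least min_length long."""
    return [r for r in reads if len(r) >= params["min_length"]]


def cluster_branch(orfs: List[str], params: Dict[str, int]) -> List[str]:
    """Placeholder actor: exact-duplicate removal stands in for clustering."""
    return sorted(set(orfs))


def annotate_branch(orfs: List[str], params: Dict[str, int]) -> List[str]:
    """Placeholder actor: tag each ORF instead of running a real search."""
    return [f"{orf}:unannotated" for orf in orfs]


def run_workflow(reads: List[str], params: Dict[str, int]) -> Dict[str, List[str]]:
    """Run the upstream actor once, then feed its output to both branches."""
    orfs = call_orfs(reads, params)
    return {
        "clusters": cluster_branch(orfs, params),
        "annotations": annotate_branch(orfs, params),
    }


# Illustrative run with toy sequences and one user-set parameter.
result = run_workflow(["ATGAAA", "ATG", "ATGAAA"], {"min_length": 6})
```

The two branches share one upstream result but do not depend on each other, which is what allows a workflow system to run them independently.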
  • 25. Provenance of Workflow Related Data
    • Provenance: A concept from art history and library science
      • Inputs, outputs, intermediate results, workflow design, workflow run
    • Collected information
      • Can be used in a number of ways
        • Validation, reproducibility, fault tolerance, etc…
      • Linked to the semantic database
      • Viewable and searchable from CAMERA 2.0
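A provenance record of the kind listed on slide 25 (inputs, outputs, intermediate results, workflow design, workflow run) might look like the sketch below. It is an illustration under assumptions, not the CAMERA provenance schema: the class, the field names, and the use of SHA-256 digests to identify data are all hypothetical. It shows how collected provenance could support a reproducibility check and a search by input.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


def digest(data: bytes) -> str:
    """Identify a piece of data by its SHA-256 hex digest."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceRecord:
    """Hypothetical record of one workflow run."""
    run_id: str
    workflow_design: str  # name or version of the workflow definition
    inputs: Dict[str, str] = field(default_factory=dict)         # label -> digest
    intermediates: Dict[str, str] = field(default_factory=dict)  # label -> digest
    outputs: Dict[str, str] = field(default_factory=dict)        # label -> digest


def reproduces(a: ProvenanceRecord, b: ProvenanceRecord) -> bool:
    """True when two runs share design, inputs, and outputs."""
    return (
        a.workflow_design == b.workflow_design
        and a.inputs == b.inputs
        and a.outputs == b.outputs
    )


def runs_using(records: List[ProvenanceRecord], input_digest: str) -> List[str]:
    """Search: ids of all runs that consumed the given input."""
    return [r.run_id for r in records if input_digest in r.inputs.values()]


# Illustrative records with toy data.
r1 = ProvenanceRecord("run-1", "annotation-v1",
                      inputs={"reads": digest(b"ACGT")},
                      outputs={"orfs": digest(b"ATG")})
r2 = ProvenanceRecord("run-2", "annotation-v1",
                      inputs={"reads": digest(b"ACGT")},
                      outputs={"orfs": digest(b"ATG")})
r3 = ProvenanceRecord("run-3", "annotation-v1",
                      inputs={"reads": digest(b"TTTT")},
                      outputs={"orfs": digest(b"")})
```

Storing digests rather than the data itself keeps records small while still letting a later run be compared against an earlier one.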
  • 26. Provenance Schema and Viewer in CAMERA 2.0
  • 27. http://camera.calit2.net
