The document summarizes Dr. Larry Smarr's presentation about the CAMERA project, which uses supercomputers and networks to explore microbial genomic data from ocean samples. The CAMERA infrastructure provides researchers worldwide with access to over 1 billion base pairs of metagenomic sequence data through an online portal. Analysis of this data has expanded knowledge of microbial biodiversity and gene families, providing insights into evolution and relationships between microbes and human health.
Using Supercomputers and Supernetworks to Explore the Ocean of Life
1. Using Supercomputers and Supernetworks to Explore the Ocean of Life. Director's Colloquium, Los Alamos National Laboratory, Los Alamos, New Mexico, June 7, 2007. Dr. Larry Smarr, Director, California Institute for Telecommunications and Information Technology; Harry E. Gruber Professor, Dept. of Computer Science and Engineering, Jacobs School of Engineering, UCSD
2. Abstract Calit2, in partnership with J. Craig Venter Institute in Rockville, MD, and UCSD's SDSC and Scripps Institution of Oceanography, is creating a Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA), funded by the Gordon and Betty Moore Foundation. CAMERA collaborates closely with DoE's Joint Genome Institute. The CAMERA computational and storage cluster containing the metagenomic data can be accessed via the web over novel dedicated 10 Gb/s light pipes (termed "lambdas") through the National LambdaRail, providing direct connection to the scalable Linux clusters in individual user laboratories. These clusters are reconfigured as "OptIPortals," providing the end user with local scalable visualization, computing, and storage. Scientists will use CAMERA for metagenomics research -- analyzing microbial genomic sequence data in the context of other microbial species, as well as in relation to the chemical and physical conditions in which microbes are sampled. The CAMERA project contains the results of the Venter Institute's Sorcerer II Expedition, which carried out the first large-scale genomic survey of microbial life in the world's oceans to produce the largest gene catalogue ever assembled, doubling the number of protein sequences currently available in GenBank. In addition to Sorcerer II's ecological genomic data, the CAMERA database will be augmented by the full genomes of more than 150 critical marine microbes enabling new comparative genomics studies. Currently over 1000 users are registered from over 40 countries.
3. Calit2--A Systems Approach to the Future of the Internet and its Transformation of Our Society www.calit2.net Calit2 Has Assembled a Complex Social Network of Over 350 UC San Diego & UC Irvine Faculty Working in Multidisciplinary Teams With Staff, Students, Industry, and the Community Over 130 Companies and 300 Federal Grants in Collaboration with Calit2
9. Moore Microbial Genome Sequencing Project Selected Microbes Throughout the World’s Oceans www.moore.org/microgenome/worldmap.asp Microbes Nominated by Leading Ocean Microbial Biologists
10. Moore Foundation Funded the Venter Institute to Provide the Full Genome Sequence of 155 Marine Microbes Phylogenetic Trees Created by Uli Stingl, Oregon State Blue Means Contains One of the Moore 155 Genomes www.moore.org/microgenome/trees.aspx
11. Moore 155 Marine Microbial Genomes Gives Broad Coverage of Microbial “Tree of Life” www.moore.org/microgenome/alpha-proteobacteria.aspx Phylogenetic Trees Created by Uli Stingl, Oregon State
12. Full Genome Sequencing is Exploding: Most Sequenced Genomes are Bacterial www.genomesonline.org 90 Metagenomes First Genome 1995 6 Genomes/ Year 2000
14. Microbial Metagenomics is a Rapidly Emerging Field of Research. “Despite their ubiquity, relatively little is known about the majority of environmental microorganisms, largely because of their resistance to culture under standard laboratory conditions.” “The application of high-throughput shotgun sequencing environmental samples has recently provided global views of those communities not obtainable from 16S rRNA or BAC clone–sequencing surveys.” Source: Comparative Metagenomics of Microbial Communities. Susannah Green Tringe, Christian von Mering, Arthur Kobayashi, Asaf A. Salamov, Kevin Chen, Hwai W. Chang, Mircea Podar, Jay M. Short, Eric J. Mathur, John C. Detter, Peer Bork, Philip Hugenholtz, Edward M. Rubin. Science, 22 April 2005
15. Enormous Increase in Scale of Known Genes Over Last Decade (~3300x). 1995 First Microbe Genome: 1.8 Million Bases, 1749 Genes. 2001 Human Genome: 3 Billion Bases, 30,000 Genes. 2007 Ocean Metagenomics: 6.3 Billion Bases, 5.6 Million Genes
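The "~3300x" figure on this slide can be checked with simple arithmetic. The slide does not say whether the factor refers to bases or to genes, so the sketch below computes both from the numbers as stated on the slide; the pairing of each figure with its year is an assumption drawn from the slide's labels, and the names `milestones` and `growth_factor` are illustrative only.

```python
# Figures as stated on the slide: (year, label, bases, genes).
# The year-to-figure pairing is inferred from the slide labels.
milestones = [
    (1995, "First Microbe Genome", 1.8e6, 1749),
    (2001, "Human Genome", 3e9, 30_000),
    (2007, "Ocean Metagenomics", 6.3e9, 5.6e6),
]


def growth_factor(first, last):
    """Return (base_ratio, gene_ratio) between two milestone tuples."""
    return last[2] / first[2], last[3] / first[3]


base_ratio, gene_ratio = growth_factor(milestones[0], milestones[-1])
print(f"1995 to 2007: bases ~{base_ratio:.0f}x, genes ~{gene_ratio:.0f}x")
```

Both ratios come out in the low thousands, which is consistent with the rounded factor quoted on the slide.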
16. Microbial Genomics Allow Us to Look Back Nearly 4 Billion Years In the Evolution of Life Falkowski and Vargas Science 304 (5667) 2004
18. Marine Genome Sequencing Project – Measuring the Genetic Diversity of Ocean Microbes Sorcerer II Data Will Double Number of Proteins in GenBank! Specify Ocean Data Each Sample ~2000 Microbial Species
19. Environmental Metadata: Beyond Data Collected at Sampling Site NASA AQUA-MODIS Images covering GOS sites #8 – 12, mid November, 2003 Sea Surface Temp Chlorophyll
20. GOS Predicted Proteins are Largely Bacterial Source: Shibu Yooseph, et al. (PLOS Biology March 2007) ~3 Million Previously Known Proteins ~5.1 Million GOS Predicted Proteins NCBI-nr, PG, TGI-EST, ENS
21. Current Universe of Medium/Large Protein Families. Protein Families Conserved Across Tree of Life; Protein Families Unique to GOS. 17,067 Protein Family Clusters. 1 Million CPU-Hour Computation! Source: Shibu Yooseph, et al. (PLOS Biology March 2007)
24. Self-Organizing Maps Identify Species Using Japanese Earth Simulator. Genomes shown: Human, Fugu, Arabidopsis, Rice, C. elegans, Drosophila. SOM Created from an Unsupervised Neural Network Algorithm to Analyze Tetranucleotide Frequencies in a Wide Range of Genomes. 10kb Moving Window. T. Abe, H. Sugawara, S. Kanaya, T. Ikemura, Journal of the Earth Simulator, Volume 6, October 2006, 17–23. www.es.jamstec.go.jp/publication/journal/jes_vol.6/pdf/JES6_22-Abe.pdf
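The input to a SOM of this kind is a vector of tetranucleotide frequencies per sequence window. The following is a minimal sketch of that feature-extraction step only, not the authors' implementation: the function names, the non-overlapping default step, and the choice to skip 4-mers containing ambiguous bases are all assumptions, and SOM training itself is not shown.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(window):
    """Return a 256-element relative-frequency vector for one window."""
    counts = Counter(window[i:i + 4] for i in range(len(window) - 3))
    # Only count 4-mers made purely of A/C/G/T (assumption: skip N, etc.).
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def window_vectors(sequence, window=10_000, step=10_000):
    """Yield (start, vector) for each full window along the sequence.

    window=10_000 mirrors the slide's 10kb window; step is an assumption.
    """
    sequence = sequence.upper()
    for start in range(0, len(sequence) - window + 1, step):
        yield start, tetranucleotide_frequencies(sequence[start:start + window])


# Toy usage with a tiny window so the example runs instantly.
vectors = list(window_vectors("ACGT" * 10, window=20, step=10))
print(len(vectors), "windows,", len(vectors[0][1]), "features each")
```

Each resulting vector would be one training sample for the map. A real pipeline would likely also decide how to treat the reverse-complement strand, which this sketch ignores.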
25. Using SOM, Sargasso Sea Metagenomic Data Yields 92 Microbial Genera! Categories: Eukaryotes, Prokaryotes, Viruses, Mitochondria, Chloroplasts. Input Genomes: 1500 Microbes, 40 Eukaryotes, 1065 Viruses, 642 Mitochondria, 42 Chloroplasts. 5kb Window. T. Abe, H. Sugawara, S. Kanaya, T. Ikemura, Journal of the Earth Simulator, Volume 6, October 2006, 17–23
26. The Human Kinome: A Protein Family Implicated In Many Human Diseases. Over 500 Protein Kinases, 2% of the Human Genome, Many splice variants. Crystal Structures, EPKs. Manning, et al. (2002) Science 298:1912. Source: Susan Taylor, SOM, UCSD. Organisms shown: Yeast, Mouse, C. elegans, Drosoph., Arabid., Sea Urchin, Dicty., Tetrahy.
28. The OptIPuter Project: Creating High Resolution Portals Over Dedicated Optical Channels to Global Science Data Picture Source: Mark Ellisman, David Lee, Jason Leigh Calit2 (UCSD, UCI) and UIC Lead Campuses—Larry Smarr PI Univ. Partners: SDSC, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent $13.5M Over Five Years Now In the Fifth Year
29. Dedicated Optical Channels Make High Performance Cyberinfrastructure Possible. Parallel Lambdas are Driving Optical Networking The Way Parallel Processors Drove 1990s Computing. 10 Gbps per User ~ 200x Shared Internet Throughput. “Lambdas” (WDM). Source: Steve Wallach, Chiaro Networks
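To give a feel for what "10 Gbps per User ~ 200x Shared Internet Throughput" means in practice, the sketch below compares idealized transfer times. The shared rate of about 50 Mb/s is simply the slide's 10 Gbps divided by its ~200x factor; the 1 TB dataset size is a hypothetical value chosen for illustration, and protocol overhead, disk speed, and congestion are ignored.

```python
def transfer_seconds(size_bytes, rate_bits_per_s):
    """Ideal transfer time: payload bits divided by line rate."""
    return size_bytes * 8 / rate_bits_per_s


LAMBDA_BPS = 10e9               # dedicated 10 Gbps lambda (from the slide)
SHARED_BPS = LAMBDA_BPS / 200   # ~50 Mb/s, implied by the slide's ~200x

dataset_bytes = 1e12            # hypothetical 1 TB dataset (assumption)

lambda_s = transfer_seconds(dataset_bytes, LAMBDA_BPS)
shared_s = transfer_seconds(dataset_bytes, SHARED_BPS)

print(f"dedicated lambda: {lambda_s / 60:.1f} minutes")
print(f"shared Internet:  {shared_s / 3600:.1f} hours")
```

Under these assumptions the same transfer drops from roughly two days to under a quarter of an hour, which is the difference between batch shipping of data and interactive use.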
30. National Lambda Rail (NLR) and TeraGrid Provide Cyberinfrastructure Backbone for U.S. Researchers. NLR 4 x 10Gb Lambdas Initially, Capable of 40 x 10Gb Wavelengths at Buildout. NSF’s TeraGrid Has 4 x 10Gb Lambda Backbone. Links Two Dozen State and Regional Optical Networks. DOE, NSF, & NASA Using NLR. Map labels: San Francisco, Pittsburgh, Cleveland, San Diego, Los Angeles, Portland, Seattle, Pensacola, Baton Rouge, Houston, San Antonio, Las Cruces / El Paso, Phoenix, New York City, Washington, DC, Raleigh, Jacksonville, Dallas, Tulsa, Atlanta, Kansas City, Denver, Ogden/Salt Lake City, Boise, Albuquerque, UC-TeraGrid, UIC/NW-Starlight Chicago, International Collaborators
31. OptIPortal–Termination Device for the Dedicated Gigabit/sec Lightpaths Photo Source: David Lee, Mark Ellisman NCMIR, UCSD Collaborative Analysis of Large Scale Images of Cancer Cells Integration of High Definition Video Streams with Large Scale Image Display Walls
33. PI: Larry Smarr. Executive Director: Paul Gilna. Announced January 17, 2006. $24.5M Over Seven Years
35. The Calit2 CAMERA Microbial Metagenomics Server is Open to the Community PLOS Biology March 2007
36. CAMERA Builds on Cyberinfrastructure Grid, Workflow, and Portal Projects in a Service Oriented Architecture Cyberinfrastructure: Raw Resources, Middleware & Execution Environment NBCR Rocks Clusters Virtual Organizations Web Services KEPLER Workflow Management Vision Telescience Portal Located in Calit2@UCSD Building National Biomedical Computation Resource an NIH supported resource center
38. Calit2 CAMERA Production Compute and Storage Complex 512 Processors ~5 Teraflops ~ 200 Terabytes Storage
39. The Calit2 CAMERA Metagenomics Site is Now Active http://camera.calit2.net/
41. Distribution of CAMERA User Registrations Nearly 1000 Registered Users From 45 Countries
42. Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome Acidobacteria bacterium Ellin345 Soil Bacterium 5.6 Mb
43. Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome Source: Raj Singh, UCSD
44. Use of Tiled Display Wall OptIPortal to Interactively View Microbial Genome Source: Raj Singh, UCSD
45. Interactive Exploration of Marine Genomes Using 100 Million Pixels. Ginger Armbrust (UW), Terry Gaasterland (UCSD SIO)
46. Calit2 is Now OptIPuter Connecting Remote OptIPortals for Moore-Funded Microbial Researchers via NLR. Diagram labels: NW!, CICESE, UW, JCVI, MIT, SIO, UCSD, SDSU, UIC EVL, UCI, OptIPortals, CAMERA Servers
47. Countries are Aggressively Creating Gigabit Services: Interactive Access to CAMERA Data System www.glif.is Created in Reykjavik, Iceland 2003 Visualization courtesy of Bob Patterson, NCSA.