AAAS: Data-Intensive Science and Grid

These slides were presented in a session that we organized at the American Association for the Advancement of Science (AAAS) meeting in Chicago, February 2009.

Abstract: New laboratory devices, sensor networks, high-throughput instruments, and numerical simulation systems are producing data at rates that are both without precedent and rapidly growing. The resulting increases in the size, number, and variety of data are revolutionizing scientific practice. These changes demand new computing infrastructures and tools. Until recently, most laboratories and collaborations managed their own data, operated their own computers, and used remote high-performance computers only when required. We are moving to a paradigm in which data will primarily be located and managed on remote clusters, grids, and data centers. In this symposium, we will examine the computing infrastructure designed to serve this emerging era of data-intensive computing from three perspectives: (1) that of grid computing, which enables the creation of virtual organizations that can share remote and distributed resources over the Internet; (2) that of data centers, which are transitioning to providers of integrated storage, data, compute, and collaboration services (the offering of one or more of these integrated services over the Internet is beginning to be called cloud computing); and (3) that of e-science, in which grids, Web 2.0 technologies, and new collaboration and analysis services are merging and changing the way science is conducted. Each speaker will focus on one perspective but also compare and contrast with the others.

Slide transcript:

    1. New computing platforms for data-intensive science. Ian Foster, Computation Institute, Argonne National Laboratory and the University of Chicago.
    2. Abstract (text as given above).
    4. Growth of GenBank, 1982-2005 (Broad Institute).
    5. 1,070 molecular biology databases (Nucleic Acids Research, January 2008), up from 96 in January 2001, spanning proteomics, genomics, transcriptomics, protein sequence prediction, phenotypic studies, phylogeny, sequence analysis, protein structure prediction, protein-protein interaction, metabolomics, model organism collections, systems biology, health epidemiology, organisms, disease, and more. Slide: Carole Goble.
    6. New problem-solving methodologies. Timeline of approaches: empirical observation (since antiquity), theory (from roughly 1700), simulation (from roughly 1950), and data-intensive methods (from roughly 1990). "Applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework and exploratory apparatus for other sciences." – G. Djorgovski
    8. More data does not always mean more knowledge. (Folker Meyer, "Genome Sequencing vs. Moore's Law: Cyber Challenges for the Next Decade," CTWatch, August 2006.)
    9. Data is: enormous, distributed, noisy. Infrastructure: storage and computing, economies of scale (cloud, grid). Aggregation: of data and software, of people and disciplines. Algorithms: scalable, probabilistic, tolerant of errors and ambiguity.
    10. An incomplete list of process steps: discover, access, integrate, analyze, mine, publish, annotate, validate, curate, share. (Diagram: data, analyses, models, experiments, and literature; artisanal versus industrial data handling.)
    11. SOA as an integrating framework? We expose data and software as services, which others discover, decide to use, and compose to create new functions, which they publish as new services. Technical challenges: complexity, semantics, distribution, scale. Socio-technical challenges: incentives; policy and trust; reproducibility; life cycle. ("Service-Oriented Science," Science, 2005.)
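The expose/discover/compose/republish loop on this slide can be illustrated with a minimal sketch. The endpoints and message formats below are hypothetical placeholders, not the actual caGrid or Globus interfaces; only Python's standard library is used.

```python
# Minimal sketch of service composition in the SOA style described above.
# The URLs and payloads are hypothetical, chosen only to illustrate
# "compose two services, then publish the composition as a new function".
import json
import urllib.request

def call_service(url, payload):
    """POST a JSON payload to a service endpoint and return its JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def composed_analysis(query):
    """Compose two (hypothetical) services: a data service feeds an analysis
    service. This function could itself be exposed as a new service."""
    data = call_service("https://example.org/data-service", {"query": query})
    return call_service("https://example.org/analysis-service", {"data": data})
```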
    12. Grid technology
    13. NAE Grand Challenges
    14. The future of multi-site data integration: an example. Are positive-symptom schizophrenics associated with more severe superior temporal gyrus dysfunction? (Diagram: a clinical portal integrating fMRI, receptor density, ERP, and structural data with web resources such as PubMed, Expasy, and Brain Map.)
    15. caBIG: sharing of infrastructure, applications, and data; aggregation in cancer biology. (Built on Globus.)
    16. As of February 16, 2009: 123 participants; 104 services (65 data, 39 analytical).
    17. Microarray clustering in caBIG (Wei Tan; Taverna workflow):
       - Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub
       - Normalize microarray data using a GenePattern analytical service: node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService
       - Perform hierarchical clustering using a geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage
       The workflow's inputs and outputs link caGrid services, "shim" services, and others (see the sketch below).
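A minimal sketch of that three-step composition, written as a plain Python pipeline. The slide describes a Taverna workflow invoking caGrid (WSRF) services; the functions below are stubs standing in for those invocations, not actual caGrid client code, and the service URLs are copied from the slide.

```python
# Sketch of the microarray-clustering workflow as a linear pipeline.
# Each step is a stub; in the real workflow, Taverna invokes the caGrid
# (WSRF/SOAP) services at the URLs named on the slide.

CAARRAY = "cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub"
GENEPATTERN = "node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService"
GEWORKBENCH = "cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage"

def retrieve_microarray_data(experiment_id):
    """Step 1: query the caArray data service (stub)."""
    raise NotImplementedError(f"would invoke {CAARRAY} for {experiment_id!r}")

def normalize(raw_data):
    """Step 2: normalize via the GenePattern analytical service (stub)."""
    raise NotImplementedError(f"would invoke {GENEPATTERN}")

def hierarchical_clustering(normalized_data):
    """Step 3: cluster via the geWorkbench analytical service (stub)."""
    raise NotImplementedError(f"would invoke {GEWORKBENCH}")

def microarray_clustering_workflow(experiment_id):
    """Chain the three services so each step's output feeds the next,
    mirroring the data links in the Taverna workflow."""
    raw = retrieve_microarray_data(experiment_id)
    normalized = normalize(raw)
    return hierarchical_clustering(normalized)
```

In practice, the "shim" services mentioned on the slide sit between steps to convert data formats; in this sketch that role would fall to small adapter functions inserted between the calls.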
    18. Children's Oncology Group clinical imaging trials (Erberich).
    19. Wide-area medical interface service. Converts local medical workflow actions (image workflow, EHR, ...) into wide-area operations, and transparently manages federation of security, data replication and recovery, and data discovery. (Diagram: an Enterprise/Grid Interface Service bridging DICOM protocols and grid protocols (web services), with plug-in adapters for DICOM, XDS, HL7, and vendor-specific actors; see the sketch below.)
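One way to picture the plug-in adapter idea on this slide is the sketch below: local protocol actions (DICOM, XDS, HL7, vendor-specific) are translated into wide-area grid operations. All class, method, and field names are illustrative assumptions, not the interfaces of the system described.

```python
# Illustrative sketch of plug-in protocol adapters feeding a grid interface
# service. Names and fields are hypothetical; the real system speaks
# DICOM/XDS/HL7 locally and web-service grid protocols over the wide area.
from abc import ABC, abstractmethod

class ProtocolAdapter(ABC):
    """One adapter per local protocol (DICOM, XDS, HL7, vendor-specific)."""

    @abstractmethod
    def to_grid_operation(self, local_action: dict) -> dict:
        """Translate a local workflow action into a wide-area operation."""

class DicomAdapter(ProtocolAdapter):
    def to_grid_operation(self, local_action: dict) -> dict:
        # A local image-store action becomes a replicated, access-controlled
        # wide-area store, reflecting the federation concerns on the slide.
        return {
            "operation": "store_image",
            "study_uid": local_action["study_uid"],
            "replicate": True,               # data replication and recovery
            "access_policy": "federated",    # federated security
            "register_for_discovery": True,  # data discovery
        }

class GridInterfaceService:
    """Routes local actions to the matching adapter; a real implementation
    would then forward the resulting operations to grid services."""

    def __init__(self) -> None:
        self.adapters: dict[str, ProtocolAdapter] = {"DICOM": DicomAdapter()}

    def handle(self, protocol: str, local_action: dict) -> dict:
        return self.adapters[protocol].to_grid_operation(local_action)

# Example: a DICOM store action arriving from a hospital workstation.
service = GridInterfaceService()
operation = service.handle("DICOM", {"study_uid": "1.2.840.10008.example"})
```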
    20. Earth System Grid (www.earthsystemgrid.org).
       - Main ESG portal: 198 TB of data at four locations; 1,150 datasets; 1,032,000 files; includes the past 6 years of joint DOE/NSF climate modeling experiments. Downloads to date: 49 TB, 176,000 files.
       - CMIP3 (IPCC AR4) ESG portal: 35 TB of data at one location; 74,700 files; generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change; data from 13 countries, representing 25 models. Downloads to date: 387 TB, 1,300,000 files, 500 GB/day on average.
       Overall: 8,000 registered users; 1,900 registered projects; 400 scientific papers published to date based on analysis of CMIP3 (IPCC AR4) data; ESG usage at over 500 sites worldwide. (Chart: ESG monthly download volumes. Globus.)
    21. Understanding interactions between human and natural systems. (Diagram of the IPCC process: emissions scenarios drive numerical simulations, which feed the IPCC 4th Assessment (2007) and inform mitigation and adaptation. IPCC process figure: Bill Collins, LBNL.)
    22. A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH). The CIM-EARTH Framework combines dynamics, foresight, uncertainty, resolution, ...; agriculture, transport, taxation, ...; data (global, local, ...); and (super)computers, with a community process and open code and data. www.cim-earth.org
    23. Alleviating poverty in Thailand: modeling entrepreneurship (Rob Townsend, Tibi Stef-Praun, Victor Zhorin). Comparison of model match (high to low) between a model considering only wealth and access to capital and one also considering distance to six major cities.
    24. (Recap of slide 9.) Data is: enormous, distributed, noisy. Infrastructure: storage and computing, economies of scale (cloud, grid). Aggregation: of data and software, of people and disciplines. Algorithms: scalable, probabilistic, tolerant of errors and ambiguity.
    25. Thank you! Computation Institute, www.ci.uchicago.edu
