Computation and Knowledge. Ian Foster, Computation Institute, Argonne National Lab & University of Chicago
Physics Colloquium, University of Chicago
May 2008

Published in: Technology, Education
  • Ken Wilson observed that computational science is a third mode of enquiry, in addition to experiment and theory. My theme is rather how, by taking a systems view of the knowledge generation process, we can identify ways in which computation can accelerate that process.
  • Computation and Knowledge

    1. 1. Computation and Knowledge Ian Foster Computation Institute Argonne National Lab & University of Chicago
    2. 2. Abstract: I speak to the question of how computation can contribute to the generation of new knowledge by accelerating the work of distributed collaborative teams and enabling the extraction of knowledge from large quantities of information produced by many workers. I illustrate my presentation with examples of work being performed within the Computation Institute at the University of Chicago and Argonne National Laboratory.
    3. 4. Knowledge Generation in Astronomy ~1600 30 years ? years 10 years 6 years 2 years
    4. 5. Astronomy, from 1600 to 2000
          • Automation: 10^-1 → 10^8 Hz data capture
          • Community: 10^0 → 10^4 astronomers (10^6 amateur)
          • Data: 10^6 → 10^15 B aggregate
          • Computation: 10^-1 → 10^15 Hz peak
          • Literature: 10^1 → 10^5 pages/year
    5. 6. Astronomy, from 1600 to 2000
          • Automation: 10^-1 → 10^8 Hz data capture
          • Community: 10^0 → 10^4 astronomers (10^6 amateur)
          • Data: 10^6 → 10^15 B aggregate
          • Computation: 10^-1 → 10^15 Hz peak
          • Literature: 10^1 → 10^5 pages/year
          Overlay labels: Text mining; Federation/collaboration; Data analysis; Complex system modeling; Hypothesis generation; Experiment design
    6. 7. FLASH Turbulence Simulation (R. Fisher, D. Lamb, et al.)
          • Largest compressible homogeneous isotropic turbulence simulation
          • LLNL BG/L: 1 week, 65K CPUs, 11M CPU hours
          • 74 million files, 154 TB
          • 23 TB: 3 weeks @ 20 MB/sec (GridFTP)
          • External users access processed dataset
          Globus
    7. 8. An End-to-End Systems Problem That Includes …
          • Massively parallel simulation
          • High-performance parallel I/O
          • Remote visualization
          • High-speed reliable data delivery
          • High-performance local analysis
          • Data access and analysis by external users
          • Troubleshooting
          • Security
          • Orchestration of distributed activities
    8. 9. Data Delivery as a Systems Problem
          • Data complexity
          • Many components
          • Parallelism (in many places)
          • Network heterogeneities (e.g., firewalls)
          • Space (or the lack of it)
          • Protocols
          • Failures at many levels
          • Deadlines
          • Resource contention
          • Policy and priorities
          Figure labels: 74 million files, 154 TB; 23 TB; 2 weeks @ 20 MB/sec; 3 hours @ 2000 MB/sec; 2 mins? @ 200,000 MB/sec
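The transfer times quoted on the slide above appear consistent with simply dividing a 23 TB dataset by each sustained rate. The sketch below illustrates that arithmetic only; the decimal unit convention and the helper names `transfer_seconds` and `humanize` are assumptions made for this example, not something the slides specify.

```python
# Idealized transfer-time estimate: time = size / sustained rate.
# Assumptions (not from the slide): decimal units (1 TB = 1,000,000 MB)
# and a perfectly sustained rate with no protocol or storage overhead.

MB_PER_TB = 1_000_000


def transfer_seconds(size_tb: float, rate_mb_per_s: float) -> float:
    """Seconds needed to move size_tb terabytes at rate_mb_per_s."""
    return size_tb * MB_PER_TB / rate_mb_per_s


def humanize(seconds: float) -> str:
    """Render a duration in the largest convenient unit."""
    if seconds >= 86_400:
        return f"{seconds / 86_400:.1f} days"
    if seconds >= 3_600:
        return f"{seconds / 3_600:.1f} hours"
    return f"{seconds / 60:.1f} minutes"


if __name__ == "__main__":
    for rate in (20, 2_000, 200_000):  # MB/sec, as listed on the slide
        print(f"23 TB @ {rate} MB/sec: {humanize(transfer_seconds(23, rate))}")
```

Under these assumptions the three rates give roughly 13 days, 3 hours, and 2 minutes, in line with the slide's "2 weeks", "3 hours", and "2 mins?".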
    9. 10. Send 1 GB partitioned into equi-sized files over a 60 ms RTT, 1 Gbit/s WAN (16 MB TCP buffer). [Plot axes: Megabit/sec; File size (Kbyte); Number of files.] John Bresnahan et al., Argonne. Globus
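One way to read the link parameters on the slide above is through the bandwidth-delay product (BDP), the amount of data that must be in flight to keep a path full. The sketch below is an illustration under stated assumptions: it treats "16 MB" as 16 × 2^20 bytes and uses a simple window/RTT model for a single long-running TCP stream. The function names are hypothetical, and the sketch does not attempt to reproduce the measured curves in the plot.

```python
# Bandwidth-delay product and window-limited throughput for one TCP stream.
# Assumptions (not from the slide): "16 MB" means 16 * 2**20 bytes, and
# throughput of a long-running stream is capped at window / RTT.


def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> float:
    """Bytes that must be in flight to fill the path."""
    return bandwidth_bits_per_s * rtt_ms / 1000 / 8


def window_limited_bits_per_s(window_bytes: int, rtt_ms: int) -> float:
    """Upper bound on throughput imposed by the TCP window alone."""
    return window_bytes * 8 * 1000 / rtt_ms


if __name__ == "__main__":
    link = 1_000_000_000          # 1 Gbit/s WAN
    rtt = 60                      # 60 ms round-trip time
    buffer_bytes = 16 * 2**20     # 16 MB TCP buffer (assumed binary MB)

    print(f"BDP: {bdp_bytes(link, rtt) / 1e6:.1f} MB")
    cap = window_limited_bits_per_s(buffer_bytes, rtt)
    print(f"Window-limited cap: {cap / 1e9:.2f} Gbit/s")
    print("Buffer covers BDP:", buffer_bytes >= bdp_bytes(link, rtt))
```

In this idealized model the BDP is 7.5 MB, so the 16 MB buffer is large enough that the window alone would not prevent a single sustained stream from filling the 1 Gbit/s link.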
    10. 11. LIGO Gravitational Wave Observatory
          • >1 Terabyte/day to 8 sites
          • 770 TB replicated to date: >120 million replicas
          • MTBF = 1 month
          Map labels: Birmingham; Cardiff; AEI/Golm
          Ann Chervenak et al., ISI; Scott Koranda et al., LIGO. Globus
    11. 12. Lag Plot for Data Transfers to Caltech Credit: Kevin Flasch, LIGO
    12. 15. Cancer Biology Globus
    13. 16. Service-Oriented Science
          • People create services (data, code, instr.) …
          • which I discover (& decide whether to use) …
          • & compose to create a new function ...
          • & then publish as a new service.
          • I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!
          • I hope that this “someone else” can manage security, reliability, scalability, …
          “Service-Oriented Science”, Science, 2005
    14. 17. Discovering Services
          • Assume success
          • Syntax, semantics
          • Permissions
          • Reputation
          Annotations:
          • The ultimate arbiter?
          • Types, ontologies
          • Can I use it?
          • Billions of services
          Diagram labels: A; B
    15. 18. Discovery (1): Registries Duke OSU NCI NCI Globus
    16. 19. Discovery (2): Standardized Vocabularies. Diagram labels: Core Services; Grid Service; Uses Terminology Described In; Cancer Data Standards Repository; Enterprise Vocabulary Services; References Objects Defined in; Service Metadata; Publishes; Subscribes to and Aggregates; Queries Service Metadata Aggregated In; Registers To; Discovery Client API; Index Service. Globus
    17. 21. Text Mining
    18. 22. More Knowledge (?)
          • US papers/year in ApJ+AJ+PASP
    19. 23. GeneWays (Andrey Rzhetsky et al.). Diagram labels: Online Journals; GeneWays; Pathways. Screening 250,000 journal articles; 2.5M reasoning chains; 4M statements
    20. 24. Image: Andrey Rzhetsky
    21. 26. Image: Andrey Rzhetsky
    22. 27. Image: Andrey Rzhetsky
    23. 28. Image: Andrey Rzhetsky
    24. 29. Evidence Integration: Genetics & Disease Susceptibility. Diagram labels: Identify Genes; Phenotype 1; Phenotype 2; Phenotype 3; Phenotype 4; Predictive Disease Susceptibility; Physiology; Metabolism; Endocrine; Proteome; Immune; Transcriptome; Biomarker Signatures; Morphometrics; Pharmacokinetics; Ethnicity; Environment; Age; Gender. Source: Terry Magnuson
    25. 31. The Data Deluge
    26. 32. Images courtesy Mark Ellisman, UCSD
    27. 33. Images courtesy Mark Ellisman, UCSD
    28. 34. Images courtesy Mark Ellisman, UCSD
    29. 35. Images courtesy Mark Ellisman, UCSD. A human brain @ 1 micron voxel = 3.5 peta (10^15) bytes
    30. 36.
          • Understanding increases far more slowly
          • Methodological bottlenecks?
              • Improved technology
          • Human limitations?
              • Computer-assisted discovery
    31. 37. Data Analysis gets Fuzzy
          • Global statistics?
              • Correlation functions: N^2
              • Likelihood methods: N^3
          • Best we can do is N or maybe N log N
          (scale approximate) Haystack: Jim Gray/Alex Szalay
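To make the scaling argument on the slide above concrete, the sketch below compares operation counts for the four growth rates it names. The dataset size and the processing rate are hypothetical values chosen for illustration; neither appears in the slides, and `op_counts` and `seconds_at` are names invented for this example.

```python
import math

# Compare how many operations N, N log N, N^2, and N^3 algorithms need.
# Assumptions (not from the slide): a hypothetical dataset of 10**9 items
# and a hypothetical machine sustaining 10**9 operations per second.


def op_counts(n: int) -> dict:
    """Operation counts for each growth rate named on the slide."""
    return {
        "N": float(n),
        "N log N": n * math.log2(n),
        "N^2": float(n) ** 2,
        "N^3": float(n) ** 3,
    }


def seconds_at(ops: float, ops_per_second: float) -> float:
    """Wall-clock seconds to perform ops at a fixed rate."""
    return ops / ops_per_second


if __name__ == "__main__":
    n = 10**9
    rate = 1e9
    for name, ops in op_counts(n).items():
        secs = seconds_at(ops, rate)
        print(f"{name:8s} {ops:.2e} ops  {secs:.2e} s  ({secs / 3.15e7:.2e} years)")
```

With these illustrative numbers the linear and N log N cases finish in seconds, while the quadratic case already takes decades, which is the sense in which "N or maybe N log N" is the practical limit.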
    32. 38. Towards an Open Analytics Environment
          Diagram labels: Data in; Programs & rules in; Results out
          • “No limits”
          • Storage
          • Computing
          • Format
          • Program
          • Allowing for
          • Provenance
          • Collaboration
          • Annotation
    33. 39. Tagging & Social Networking GLOSS : Generalized Labels Over Scientific data Sources (Foster, Nestorov)
    34. 40. High-Performance Data Analytics: Functional MRI. Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao. Globus
    35. 41. Many Tasks: Identifying Potential Drug Targets 2M+ ligands Protein x target(s) (Mike Kubal, Benoit Roux, and others)
    36. 42. [Workflow diagram, from "start" to "report" and "end"] (Mike Kubal, Benoit Roux, and others). 4 million tasks, 500K cpu-hrs
          • Inputs: PDB protein descriptions, 1 protein (1 MB); ZINC 3-D structures, ligands, 2M structures (6 GB)
          • Manually prep DOCK6 rec file; manually prep FRED rec file
          • DOCK6 Receptor and FRED Receptor (1 per protein: defines pocket to bind to)
          • DOCK6, FRED: ~4M x 60s x 1 cpu, ~60K cpu-hrs; Select best ~5K (DOCK6); Select best ~5K (FRED)
          • Amber prep: 2. AmberizeReceptor; 4. perl: gen nabscript
          • Amber Score: 1. AmberizeLigand; 3. AmberizeComplex; 5. RunNABScript
          • BuildNABScript: NAB Script Template; NAB script parameters (defines flexible residues, #MDsteps); NAB Script
          • Amber: ~10K x 20m x 1 cpu, ~3K cpu-hrs; complexes; Select best ~500
          • GCMC: ~500 x 10hr x 100 cpu, ~500K cpu-hrs
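The per-stage CPU-hour figures on the slide above can be checked by multiplying task count, per-task run time, and CPUs per task. The sketch below does only that arithmetic; the stage labels and the `Stage` structure are an assumed reading of the flattened diagram, not a description of the authors' actual workflow code.

```python
from dataclasses import dataclass

# Rough CPU-hour budget per stage: tasks * seconds per task * CPUs / 3600.
# The stage names and groupings are an assumed reading of the slide.


@dataclass
class Stage:
    name: str
    tasks: int
    seconds_per_task: int
    cpus_per_task: int

    def cpu_hours(self) -> float:
        return self.tasks * self.seconds_per_task * self.cpus_per_task / 3600


STAGES = [
    Stage("DOCK6/FRED", 4_000_000, 60, 1),      # ~4M x 60s x 1 cpu
    Stage("Amber", 10_000, 20 * 60, 1),         # ~10K x 20m x 1 cpu
    Stage("GCMC", 500, 10 * 3600, 100),         # ~500 x 10hr x 100 cpu
]

if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.name:12s} {stage.cpu_hours():>12,.0f} cpu-hrs")
    total = sum(stage.cpu_hours() for stage in STAGES)
    print(f"{'total':12s} {total:>12,.0f} cpu-hrs")
```

The computed values (about 67K, 3.3K, and 500K CPU-hours) match the slide's rounded "~60K", "~3K", and "~500K", and show that the final GCMC stage dominates the total even though it has by far the fewest tasks.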
    37. 43. DOCK on SiCortex
          • CPU cores: 5760
          • Power: 15,000 W
          • Tasks: 92160
          • Elapsed time: 12821 sec
          • Compute time: 1.94 CPU years
          (does not include ~800 sec to stage input data) Ioan Raicu, Zhao Zhang
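The figures on the slide above are enough to estimate how busy the cores were during the run. The sketch below is one possible back-of-the-envelope calculation, assuming a 365-day year for "CPU years" and ignoring the ~800 sec of input staging, as the slide does; the function names are hypothetical.

```python
# Back-of-the-envelope utilization for the DOCK run on SiCortex.
# Assumptions (not from the slide): a CPU year is 365 days, and
# utilization = reported compute time / (cores * elapsed time).

SECONDS_PER_YEAR = 365 * 86_400

CORES = 5760
TASKS = 92_160
ELAPSED_S = 12_821
COMPUTE_CPU_YEARS = 1.94


def utilization() -> float:
    """Fraction of available core-seconds spent computing."""
    used = COMPUTE_CPU_YEARS * SECONDS_PER_YEAR
    available = CORES * ELAPSED_S
    return used / available


def mean_task_seconds() -> float:
    """Average compute time per task."""
    return COMPUTE_CPU_YEARS * SECONDS_PER_YEAR / TASKS


if __name__ == "__main__":
    print(f"Tasks per core:    {TASKS / CORES:.0f}")
    print(f"Mean task length:  {mean_task_seconds():.0f} s")
    print(f"Core utilization:  {utilization():.1%}")
```

Under these assumptions each core handled 16 tasks of roughly 11 minutes each, and a little over four fifths of the available core time went to computation.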
    38. 44. An NSF MRI Proposal: PADS: Petascale Active Data Store. Diagram labels: 500 TB reliable storage (data & metadata); 180 TB, 180 GB/s; 17 Top/s analysis; Data ingest; Dynamic provisioning; Parallel analysis; Remote access; Offload to remote data centers; Diverse users; Diverse data sources; 1000 TB tape backup
    39. 45. Using Computation to Accelerate Science Complex modeling Experiment automation Data analysis Collaboration & federation Hypothesis generation
    40. 46. Integrated View of Simulation, Experiment, & Informatics. Diagram labels: Database; Analysis Tools; Experiment; SIMS*; Problem Specification; Simulation; Browsing & Visualization; LIMS+; Experimental Design; Browsing & Visualization. (*Simulation Information Management System; +Laboratory Information Management System)
    41. 47. Robot Scientist (University of Wales)
    42. 48. A Concluding Thought: “Applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework & exploratory apparatus for other sciences.” (George Djorgovski)
    43. 49. The Computation Institute
          • A joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods.
          • Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three).
          www.ci.uchicago.edu. Faculty, fellows, staff, students, computers, projects.
