
Computation and Knowledge


Physics Colloquium, University of Chicago
May 2008



  1. Computation and Knowledge. Ian Foster, Computation Institute, Argonne National Lab & University of Chicago
  2. Abstract: I speak to the question of how computation can contribute to the generation of new knowledge by accelerating the work of distributed collaborative teams and enabling the extraction of knowledge from large quantities of information produced by many workers. I illustrate my presentation with examples of work being performed within the Computation Institute at the University of Chicago and Argonne National Laboratory.
  4. Knowledge Generation in Astronomy (timeline from ~1600: intervals between advances of 30 years → ? years → 10 years → 6 years → 2 years)
  5. Astronomy, from 1600 to 2000
     • Automation: 10^-1 → 10^8 Hz data capture
     • Community: 10^0 → 10^4 astronomers (10^6 amateur)
     • Data: 10^6 → 10^15 B aggregate
     • Computation: 10^-1 → 10^15 Hz peak
     • Literature: 10^1 → 10^5 pages/year
  6. Astronomy, from 1600 to 2000 (same figure, with emerging links overlaid)
     • Automation: 10^-1 → 10^8 Hz data capture
     • Community: 10^0 → 10^4 astronomers (10^6 amateur)
     • Data: 10^6 → 10^15 B aggregate
     • Computation: 10^-1 → 10^15 Hz peak
     • Literature: 10^1 → 10^5 pages/year
     • Links: text mining; federation/collaboration; data analysis; complex system modeling; hypothesis generation; experiment design
  7. FLASH Turbulence Simulation (R. Fisher, D. Lamb, et al.)
     • Largest compressible homogeneous isotropic turbulence simulation: 1 week on 65K CPUs (LLNL BG/L), 11M CPU-hours
     • Output: 74 million files, 154 TB
     • 23 TB subset moved in 3 weeks @ 20 MB/sec (GridFTP)
     • External users access processed dataset
  8. An End-to-End Systems Problem That Includes …
     • Massively parallel simulation
     • High-performance parallel I/O
     • Remote visualization
     • High-speed reliable data delivery
     • High-performance local analysis
     • Data access and analysis by external users
     • Troubleshooting
     • Security
     • Orchestration of distributed activities
  9. Data Delivery as a Systems Problem
     • Data complexity
     • Many components
     • Parallelism (in many places)
     • Network heterogeneities (e.g., firewalls)
     • Space (or the lack of it)
     • Protocols
     • Failures at many levels
     • Deadlines
     • Resource contention
     • Policy and priorities
     Scale: 74 million files, 154 TB total; the 23 TB subset takes 2 weeks @ 20 MB/sec, 3 hours @ 2,000 MB/sec, 2 mins? @ 200,000 MB/sec
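The times quoted for the 23 TB subset can be sanity-checked with a few lines of arithmetic; a minimal sketch (the dataset size and rates are from the slide, the code itself is illustrative):

```python
# Back-of-envelope check of the transfer times quoted on the slide
# for moving the 23 TB subset at three sustained rates (MB = 10^6 B).
def transfer_time_s(total_bytes, rate_mb_per_s):
    """Seconds to move total_bytes at a sustained rate given in MB/s."""
    return total_bytes / (rate_mb_per_s * 1e6)

DATASET = 23e12  # 23 TB, from the slide

for rate in (20, 2_000, 200_000):
    t = transfer_time_s(DATASET, rate)
    print(f"{rate:>7,} MB/s -> {t:>12,.0f} s (~{t / 86_400:.1f} days)")
```

At 20 MB/s the subset takes about 13 days (the slide's "2 weeks"); at 2,000 MB/s about 3 hours; at 200,000 MB/s about 2 minutes.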
  10. Send 1 GB, partitioned into equal-sized files, over a 60 ms RTT, 1 Gbit/s WAN (plot: achieved Mbit/sec vs. number of files / file size in KB; 16 MB TCP buffer). John Bresnahan et al., Argonne
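One plausible reading of this plot is that each file pays a fixed setup cost, so the achieved rate collapses as the payload is split into more, smaller files. A minimal model; the link rate and RTT are from the slide, while charging exactly one RTT of overhead per file is an illustrative assumption:

```python
# Toy model: achieved rate vs. number of files when each file pays a
# fixed per-file setup cost (assumed here to be one round trip).
LINK_BIT_PER_S = 1e9     # 1 Gbit/s WAN (from the slide)
RTT_S = 0.060            # 60 ms round-trip time (from the slide)
TOTAL_BYTES = 1e9        # 1 GB payload

def goodput_mbit(num_files, per_file_overhead_s=RTT_S):
    """Achieved Mbit/s when the payload is split into num_files files."""
    wire_time_s = TOTAL_BYTES * 8 / LINK_BIT_PER_S
    total_time_s = wire_time_s + num_files * per_file_overhead_s
    return TOTAL_BYTES * 8 / total_time_s / 1e6

for n in (1, 100, 10_000):
    print(f"{n:>6} files -> {goodput_mbit(n):7.1f} Mbit/s")
```

With one file the link runs near its 1 Gbit/s capacity; with 10,000 files the per-file overhead dominates and goodput drops by nearly two orders of magnitude, matching the qualitative shape of the plot.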
  11. LIGO Gravitational Wave Observatory
     • >1 Terabyte/day replicated to 8 sites (including Birmingham, Cardiff, AEI/Golm)
     • 770 TB replicated to date: >120 million replicas
     • MTBF = 1 month
     Ann Chervenak et al., ISI; Scott Koranda et al., LIGO
  12. Lag Plot for Data Transfers to Caltech. Credit: Kevin Flasch, LIGO
  15. Cancer Biology
  16. Service-Oriented Science
     • People create services (data, code, instruments) …
     • which I discover (& decide whether to use) …
     • & compose to create a new function …
     • & then publish as a new service.
     • → I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!
     • → I hope that this “someone else” can manage security, reliability, scalability, …
     “Service-Oriented Science”, Science, 2005
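The create-discover-compose-publish loop above can be sketched in miniature. Everything in the sketch is illustrative: two toy functions stand in for remote data and code services, and the composite is itself exposed behind the same call interface, ready to be "published" in turn:

```python
# Minimal sketch of service composition. All names and data here are
# hypothetical stand-ins for remote services, not a real API.
def sequence_service(gene_id):
    """Stand-in for a data service: gene id -> sequence."""
    return {"BRCA1": "ATGGATTTATCTGCT"}.get(gene_id, "")

def gc_content_service(seq):
    """Stand-in for a code service: sequence -> GC fraction."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def gene_gc_service(gene_id):
    """New composite service: gene id -> GC fraction of its sequence."""
    return gc_content_service(sequence_service(gene_id))

print(f"{gene_gc_service('BRCA1'):.2f}")
```

The point of the pattern is that `gene_gc_service` has the same shape as the services it composes, so a third party can discover and reuse it without knowing it is a composition.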
  17. Discovering Services
     • Assume success → billions of services
     • Syntax, semantics → types, ontologies
     • Permissions → can I use it?
     • Reputation → the ultimate arbiter?
  18. Discovery (1): Registries (Duke, OSU, NCI)
  19. Discovery (2): Standardized Vocabularies (diagram: a grid service uses terminology described in the Cancer Data Standards Repository and Enterprise Vocabulary Services; its service metadata references objects defined there and is registered to an Index Service, which publishes, subscribes to, and aggregates metadata queried through a Discovery Client API)
  21. Text Mining
  22. More Knowledge (?): US papers/year in ApJ + AJ + PASP
  23. GeneWays (Andrey Rzhetsky et al.): mining online journals into pathways; screening 250,000 journal articles yields 2.5M reasoning chains and 4M statements
  24, 26-28. Images: Andrey Rzhetsky
  29. Evidence Integration: Genetics & Disease Susceptibility (identify genes across phenotypes 1-4; integrate physiology, metabolism, endocrine, proteome, immune, transcriptome, biomarker signatures, morphometrics, pharmacokinetics, ethnicity, environment, age, and gender into predictive disease susceptibility). Source: Terry Magnuson
  31. The Data Deluge
  32-34. Images courtesy Mark Ellisman, UCSD
  35. A human brain @ 1 micron voxel = 3.5 peta (10^15) bytes. Images courtesy Mark Ellisman, UCSD
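The 3.5 PB figure can be roughly reproduced. The voxel size is from the slide; the brain volume (~1,200 cm^3) and ~3 bytes per voxel are illustrative assumptions:

```python
# Reproducing the slide's petabyte estimate for a whole human brain
# imaged at 1 micron isotropic voxels. Brain volume and bytes/voxel
# are assumed values for illustration.
brain_cm3 = 1_200
voxels = brain_cm3 * 10_000 ** 3   # 1 cm = 10^4 microns -> 10^12 voxels/cm^3
bytes_total = voxels * 3           # ~3 bytes per voxel
print(f"{voxels:.1e} voxels -> {bytes_total / 1e15:.1f} PB")
# -> about 3.6 PB, close to the slide's 3.5 PB
```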
  36. Understanding increases far more slowly
     • Methodological bottlenecks? → improved technology
     • Human limitations? → computer-assisted discovery
  37. Data Analysis Gets Fuzzy
     • Global statistics?
       - Correlation functions: N^2
       - Likelihood methods: N^3
     • Best we can do is N, or maybe N log N (scale approximate)
     Haystack: Jim Gray / Alex Szalay
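The N^2 versus roughly N distinction can be made concrete with pair counting, the kernel of a two-point correlation function. A minimal 1-D sketch using only the standard library: brute force touches every pair, while binning points into cells of width R restricts comparisons to neighbouring cells (the data and radius are made up for illustration):

```python
# Count pairs of points closer than R: O(N^2) brute force vs. a
# spatial grid that only compares nearby cells (~O(N) for fixed R).
import random
from collections import defaultdict

random.seed(0)
points = [random.random() for _ in range(2_000)]
R = 0.01

# Brute force: all N(N-1)/2 pair distances.
brute = sum(abs(a - b) < R
            for i, a in enumerate(points)
            for b in points[i + 1:])

# Grid: bin into cells of width R; a close pair can only lie in the
# same cell or in adjacent cells, and each pair is counted once.
cells = defaultdict(list)
for p in points:
    cells[int(p / R)].append(p)

fast = 0
for c, members in cells.items():
    for i, a in enumerate(members):
        for b in members[i + 1:]:        # pairs within the same cell
            fast += abs(a - b) < R
        for b in cells.get(c + 1, []):   # pairs with the next cell up
            fast += abs(a - b) < R

assert fast == brute
print(brute, "close pairs found by both methods")
```

The grid does the same work in roughly N distance tests instead of N^2, which is exactly the kind of restructuring the slide says large surveys are forced into.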
  38. Towards an Open Analytics Environment
     • “No limits”: storage, computing, format, program
     • Allowing for: provenance, collaboration, annotation
     • Flow: data in; programs & rules in; results out
  39. Tagging & Social Networking. GLOSS: Generalized Labels Over Scientific data Sources (Foster, Nestorov)
  40. High-Performance Data Analytics: Functional MRI. Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao
  41. Many Tasks: Identifying Potential Drug Targets. 2M+ ligands × protein target(s) (Mike Kubal, Benoit Roux, and others)
  42. Drug-target workflow, start → report (Mike Kubal, Benoit Roux, and others)
     • Inputs: PDB protein descriptions (1 protein, 1 MB); ZINC 3-D structures (2M structures, 6 GB)
     • Manually prep DOCK6 and FRED receptor files (1 per protein: defines pocket to bind to)
     • DOCK6 / FRED docking of ligands into complexes: ~4M tasks × 60 s × 1 CPU ≈ 60K CPU-hrs
     • Select best ~5K for Amber scoring, via BuildNABScript from a NAB script template (parameters define flexible residues, # MD steps): 1. AmberizeLigand, 2. AmberizeReceptor, 3. AmberizeComplex, 4. perl: gen nabscript, 5. RunNABScript; ~10K tasks × 20 min × 1 CPU ≈ 3K CPU-hrs
     • Select best ~500 for GCMC: ~500 × 10 hr × 100 CPUs ≈ 500K CPU-hrs
     • Total: 4 million tasks, 500K CPU-hrs
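The workflow above is a funnel: score millions of candidates cheaply, keep the best few thousand, rescore them with a costlier method, and repeat. A minimal sketch of that pattern; the two scorers are hypothetical stand-ins for the DOCK/FRED and Amber stages, and a thread pool stands in for the real many-task middleware:

```python
# Funnel pattern: cheap scoring of many candidates, expensive
# rescoring of the survivors. Scorers are synthetic stand-ins.
import heapq
from concurrent.futures import ThreadPoolExecutor

def score_cheap(ligand):
    # hypothetical stand-in for a ~60 s DOCK/FRED docking run
    return (ligand * 2654435761) % 100_000

def score_costly(ligand):
    # hypothetical stand-in for a far more expensive Amber rescoring
    return score_cheap(ligand) - ligand % 13

ligands = list(range(20_000))  # stands in for the 2M+ ZINC structures

# Stage 1: farm the cheap scorer over a worker pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    scores = dict(zip(ligands, pool.map(score_cheap, ligands)))

# Stage 2: keep the best ~5K, rescore with the expensive method.
best_5k = heapq.nlargest(5_000, ligands, key=scores.get)
# Stage 3: keep the best ~500 for the final, costliest stage.
best_500 = heapq.nlargest(500, best_5k, key=score_costly)
print(len(best_500), "candidates reach the final stage")
```

Each narrowing step trades a larger candidate set for a more expensive per-candidate computation, which is how ~4M tasks stay within ~500K CPU-hours.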
  43. DOCK on SiCortex
     • CPU cores: 5,760
     • Power: 15,000 W
     • Tasks: 92,160
     • Elapsed time: 12,821 sec (does not include ~800 sec to stage input data)
     • Compute time: 1.94 CPU-years
     Ioan Raicu, Zhao Zhang
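These numbers imply a high sustained utilization, which can be checked directly (a rough calculation; the 365-day CPU-year is an assumption):

```python
# Sanity check: how fully were the 5,760 cores used during the run?
# All inputs are from the slide; a 365-day CPU-year is assumed.
cores = 5_760
elapsed_s = 12_821
cpu_years = 1.94

cpu_seconds = cpu_years * 365 * 24 * 3_600
utilization = cpu_seconds / (cores * elapsed_s)
print(f"utilization ~ {utilization:.0%}")  # roughly 83%
```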
  44. An NSF MRI Proposal. PADS: Petascale Active Data Store
     • 500 TB reliable storage (data & metadata); 180 TB, 180 GB/s; 17 Top/s analysis; 1000 TB tape backup
     • Data ingest, dynamic provisioning, parallel analysis, remote access, offload to remote data centers
     • Diverse users, diverse data sources
  45. Using Computation to Accelerate Science
     • Complex modeling
     • Experiment automation
     • Data analysis
     • Collaboration & federation
     • Hypothesis generation
  46. Integrated View of Simulation, Experiment, & Informatics (diagram: problem specification → simulation → SIMS* → database; experimental design → experiment → LIMS+ → database; database → analysis tools, browsing & visualization). *Simulation Information Management System; +Laboratory Information Management System
  47. Robot Scientist (University of Wales)
  48. A Concluding Thought: “Applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework & exploratory apparatus for other sciences.” George Djorgovski
  49. The Computation Institute
     • A joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods.
     • Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three).
     → Faculty, fellows, staff, students, computers, projects.