Computation and Knowledge

Physics Colloquium, University of Chicago
May 2008


Usage Rights: CC Attribution License
  • Ken Wilson observed that computational science is a third mode of enquiry, alongside experiment and theory. My theme is rather how, by taking a systems view of the knowledge-generation process, we can identify ways in which computation can accelerate it.

Computation and Knowledge Presentation Transcript

  • 1. Computation and Knowledge Ian Foster Computation Institute Argonne National Lab & University of Chicago
  • 2. Abstract
    • I speak to the question of how computation can contribute to the generation of new knowledge by accelerating the work of distributed collaborative teams and enabling the extraction of knowledge from large quantities of information produced by many workers. I illustrate my presentation with examples of work being performed within the Computation Institute at the University of Chicago and Argonne National Laboratory.
  • 3.  
  • 4. Knowledge Generation in Astronomy [timeline figure: from ~1600, with shrinking intervals between milestones of 30 years, ? years, 10 years, 6 years, 2 years]
  • 5. Astronomy, from 1600 to 2000. Automation: 10^-1 → 10^8 Hz data capture. Community: 10^0 → 10^4 astronomers (10^6 amateur). Data: 10^6 → 10^15 B aggregate. Computation: 10^-1 → 10^15 Hz peak. Literature: 10^1 → 10^5 pages/year.
  • 6. Astronomy, from 1600 to 2000 [same data as slide 5, annotated with the roles computation can play]: text mining, federation/collaboration, data analysis, complex-system modeling, hypothesis generation, experiment design.
  • 7. FLASH Turbulence Simulation (R. Fisher, D. Lamb, et al.): the largest compressible homogeneous isotropic turbulence simulation. Run on LLNL BG/L: 1 week on 65K CPUs, 11M CPU-hours, producing 74 million files (154 TB). A 23 TB subset was moved via GridFTP in 3 weeks @ 20 MB/sec; external users access the processed dataset.
  • 8. An End-to-End Systems Problem That Includes …
    • Massively parallel simulation
    • High-performance parallel I/O
    • Remote visualization
    • High-speed reliable data delivery
    • High-performance local analysis
    • Data access and analysis by external users
    • Troubleshooting
    • Security
    • Orchestration of distributed activities
  • 9. Data Delivery as a Systems Problem
    • Data complexity
    • Many components
    • Parallelism (in many places)
    • Network heterogeneities (e.g., firewalls)
    • Space (or the lack of it)
    • Protocols
    • Failures at many levels
    • Deadlines
    • Resource contention
    • Policy and priorities
    [74 million files, 154 TB total; the 23 TB subset takes 2 weeks @ 20 MB/sec, 3 hours @ 2,000 MB/sec, 2 mins? @ 200,000 MB/sec]
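Those rate/time pairs are easy to sanity-check; a minimal back-of-envelope script (only the 23 TB figure and the three rates come from the slide):

```python
# Sanity check of the slide's transfer times for the 23 TB dataset.
def transfer_time(total_bytes: float, rate_bytes_per_s: float) -> str:
    s = total_bytes / rate_bytes_per_s
    if s >= 86_400:
        return f"{s / 86_400:.1f} days"
    if s >= 3_600:
        return f"{s / 3_600:.1f} hours"
    return f"{s / 60:.1f} minutes"

DATASET_BYTES = 23e12  # 23 TB, from the slide
for rate_mb_s in (20, 2_000, 200_000):
    print(f"{rate_mb_s:>7,} MB/s -> {transfer_time(DATASET_BYTES, rate_mb_s * 1e6)}")
# 20 MB/s -> 13.3 days (~2 weeks); 2,000 MB/s -> 3.2 hours;
# 200,000 MB/s -> 1.9 minutes, matching the slide.
```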
  • 10. [Plot: throughput (Mbit/sec) vs. file size (KB) and number of files, for sending 1 GB partitioned into equi-sized files over a 60 ms RTT, 1 Gbit/s WAN with a 16 MB TCP buffer.] John Bresnahan et al., Argonne
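The shape of that curve is mostly latency arithmetic. Below is a toy model that assumes one round trip of overhead per file; this is an illustrative assumption, not GridFTP's actual protocol behavior. The bandwidth-delay product alone shows why small files cannot fill the pipe:

```python
# Toy latency model for the plot's setup: 1 GB split into equal files
# over a 60 ms RTT, 1 Gbit/s WAN. Assumes one round trip of overhead
# per file -- an illustration, not GridFTP's actual protocol.
RTT_S = 0.060               # round-trip time, from the slide
LINK_BYTES_PER_S = 1e9 / 8  # 1 Gbit/s

# Bandwidth-delay product: data that must be in flight to fill the link.
print(f"BDP = {LINK_BYTES_PER_S * RTT_S / 1e6:.1f} MB")  # 7.5 MB

for file_kb in (10, 100, 1_000, 10_000):
    size_b = file_kb * 1e3
    goodput_mbit = size_b / (RTT_S + size_b / LINK_BYTES_PER_S) * 8 / 1e6
    print(f"{file_kb:>6} KB files -> {goodput_mbit:6.1f} Mbit/s")
# Tiny files crawl at ~1 Mbit/s on a 1 Gbit/s link; pipelining many
# files per connection (as GridFTP does) removes the per-file RTT.
```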
  • 11. LIGO Gravitational Wave Observatory: >1 terabyte/day replicated to 8 sites (Birmingham, Cardiff, AEI/Golm, and others); 770 TB replicated to date, >120 million replicas; MTBF = 1 month. Ann Chervenak et al., ISI; Scott Koranda et al., LIGO
  • 12. Lag Plot for Data Transfers to Caltech (credit: Kevin Flasch, LIGO)
  • 13.  
  • 14.  
  • 15. Cancer Biology
  • 16. Service-Oriented Science
    • People create services (data, code, instr.) …
    • which I discover (& decide whether to use) …
    • & compose to create a new function ...
    • & then publish as a new service.
    • I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!
    • I hope that this “someone else” can manage security, reliability, scalability, …
    “Service-Oriented Science,” Science, 2005
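As a concrete illustration of that loop, here is a minimal sketch in Python: call one service, feed its output to a second, and republish the composition as a new service. The endpoints, routes, and payload shapes are hypothetical placeholders, not services from the talk:

```python
# Minimal sketch of "discover, compose, republish" as code. The two
# upstream endpoints, routes, and payloads are hypothetical
# placeholders, not real services.
import requests
from flask import Flask, jsonify

BLAST_SVC = "http://example.org/blast"        # hypothetical code service
ANNOTATE_SVC = "http://example.org/annotate"  # hypothetical data service

app = Flask(__name__)

@app.route("/annotated-hits/<sequence>")
def annotated_hits(sequence: str):
    # Compose: call one discovered service, pipe its output to another...
    hits = requests.get(f"{BLAST_SVC}/{sequence}").json()
    annotations = requests.post(ANNOTATE_SVC, json=hits).json()
    # ...and expose the composition as a new service in its own right.
    return jsonify({"sequence": sequence, "annotations": annotations})

if __name__ == "__main__":
    app.run(port=8080)  # operating this reliably is the "someone else" problem
```

The last line is the slide's point: once the wrapper is itself a service, the hard part becomes finding "someone else" to host and operate it.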
  • 17. Discovering Services
    • Billions of services → assume success
    • Syntax, semantics → types, ontologies
    • Permissions → can I use it?
    • Reputation → the ultimate arbiter?
  • 18. Discovery (1): Registries [diagram: service registries at Duke, OSU, and NCI]
  • 19. Discovery (2): Standardized Vocabularies [diagram: a grid service uses terminology described in the Cancer Data Standards Repository and Enterprise Vocabulary Services, references objects defined in service metadata, and publishes and registers to an Index Service, which subscribes to and aggregates service metadata; discovery client APIs query the aggregated metadata]
  • 20.  
  • 21. Text Mining
  • 22. More Knowledge (?)
    • US papers/year in ApJ+AJ+PASP
  • 23. GeneWays (Andrey Rzhetsky et al.): mining online journals for pathways. Screening 250,000 journal articles yields 4M statements and 2.5M reasoning chains.
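To make the idea tangible, here is a toy version of GeneWays-style statement extraction: a pattern match that pulls "A verb B" relations from text. The verb list, gene-name pattern, and example sentence are invented for illustration; the real system's NLP pipeline is far richer:

```python
# Toy GeneWays-style statement extraction from raw text.
import re

GENE = r"([A-Z][A-Z0-9-]+)"  # crude all-caps gene-symbol pattern
VERB = r"(activates|inhibits|binds|phosphorylates|regulates)"
PATTERN = re.compile(rf"\b{GENE}\s+{VERB}\s+{GENE}\b")

text = ("Our experiments show that BRCA1 regulates TP53, "
        "while MDM2 inhibits TP53 under stress.")

for subj, verb, obj in PATTERN.findall(text):
    print(subj, verb, obj)
# BRCA1 regulates TP53
# MDM2 inhibits TP53
# Chaining such statements across 250,000 articles is what yields
# GeneWays' millions of reasoning chains.
```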
  • 24. Image: Andrey Rzhetsky
  • 25.  
  • 26. Image: Andrey Rzhetsky
  • 27. Image: Andrey Rzhetsky
  • 28. Image: Andrey Rzhetsky
  • 29. Evidence Integration: Genetics & Disease Susceptibility [diagram: identify genes across phenotypes 1–4 by integrating physiology, metabolism, endocrine, proteome, immune, transcriptome, biomarker signatures, morphometrics, pharmacokinetics, ethnicity, environment, age, and gender, leading to predictive disease susceptibility] Source: Terry Magnuson
  • 30.  
  • 31. The Data Deluge
  • 32. Images courtesy Mark Ellisman, UCSD
  • 33. Images courtesy Mark Ellisman, UCSD
  • 34. Images courtesy Mark Ellisman, UCSD
  • 35. Images courtesy Mark Ellisman, UCSD. A human brain @ 1 micron voxel = 3.5 petabytes (10^15 bytes)
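The 3.5 PB figure is consistent with simple arithmetic, under assumptions of my own (a ~1.4 L brain volume and ~2.5 bytes per voxel, neither stated on the slide):

```python
# Arithmetic behind the slide's 3.5 PB estimate. Assumed (not on the
# slide): ~1.4 L brain volume, ~2.5 bytes stored per voxel.
brain_volume_m3 = 1.4e-3            # ~1.4 liters
voxel_m3 = (1e-6) ** 3              # one (1 micron)^3 voxel
voxels = brain_volume_m3 / voxel_m3
print(f"{voxels:.1e} voxels")                  # ~1.4e15
print(f"{voxels * 2.5 / 1e15:.1f} petabytes")  # ~3.5 PB
```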
  • 36.
    • Understanding increases far more slowly
    • Methodological bottlenecks? → improved technology
    • Human limitations? → computer-assisted discovery
  • 37. Data Analysis Gets Fuzzy
    • Global statistics?
      • Correlation functions: N^2
      • Likelihood methods: N^3
    • The best we can do is N, or maybe N log N
    (scale approximate) Haystack: Jim Gray/Alex Szalay
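To make the scaling concrete: naive pair counting for a two-point correlation function touches all N^2 pairs, while a spatial index such as a k-d tree gets close to N log N. A minimal sketch using SciPy (synthetic uniform data, not survey data):

```python
# The N^2 wall vs. the N log N workaround, on synthetic data.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((100_000, 3))  # stand-in for survey object positions
r = 0.01                           # correlation scale of interest

# O(N^2) pair counting -- already slow at 10^5 points, hopeless at 10^9:
# pairs = sum(np.count_nonzero(np.linalg.norm(points - p, axis=1) < r)
#             for p in points)

# ~O(N log N): a k-d tree prunes nearly all candidate pairs.
tree = cKDTree(points)
ordered = tree.count_neighbors(tree, r)  # ordered pairs, incl. self-pairs
print((ordered - len(points)) // 2, "pairs within r")
```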
  • 38. Towards an Open Analytics Environment
    • Data in, programs & rules in, results out
    • “No limits” on storage, computing, format, or program
    • Allowing for provenance, collaboration, annotation
  • 39. Tagging & Social Networking GLOSS : Generalized Labels Over Scientific data Sources (Foster, Nestorov)
  • 40. High-Performance Data Analytics: functional MRI. Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao
  • 41. Many Tasks: Identifying Potential Drug Targets. 2M+ ligands screened against protein target(s). (Mike Kubal, Benoit Roux, and others)
  • 42. [Workflow diagram] Inputs: PDB protein descriptions (1 protein, 1 MB) and ZINC 3-D structures (2M ligands, 6 GB). Manually prep DOCK6 and FRED receptor files (1 per protein; defines the pocket to bind to). DOCK6/FRED docking: ~4M tasks × 60 sec × 1 CPU ≈ 60K CPU-hours; select best ~5K. Amber scoring (1. AmberizeLigand, 2. AmberizeReceptor, 3. AmberizeComplex, 4. generate NAB script from template and parameters defining flexible residues and #MD steps, 5. RunNABScript): ~10K tasks × 20 min × 1 CPU ≈ 3K CPU-hours; select best ~500. GCMC: ~500 × 10 hr × 100 CPUs ≈ 500K CPU-hours. Overall: 4 million tasks, ~500K CPU-hours. (Mike Kubal, Benoit Roux, and others)
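The stage budgets on the slide multiply out as expected; a quick check using only the slide's own numbers:

```python
# Reproducing the slide's stage-by-stage CPU-hour budget.
stages = [
    # (stage, tasks, hours per task, CPUs per task) -- all from the slide
    ("DOCK6/FRED", 4_000_000, 60 / 3600,   1),
    ("Amber",         10_000, 20 / 60,     1),
    ("GCMC",             500, 10,        100),
]
total = 0.0
for name, tasks, hours, cpus in stages:
    cost = tasks * hours * cpus
    total += cost
    print(f"{name:<10} ~{cost:>9,.0f} CPU-hours")
print(f"{'total':<10} ~{total:>9,.0f} CPU-hours")
# -> ~66,667 + ~3,333 + ~500,000, matching the slide's
#    ~60K + ~3K + ~500K breakdown and ~500K+ total.
```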
  • 43. DOCK on SiCortex
    • CPU cores: 5760
    • Power: 15,000 W
    • Tasks: 92160
    • Elapsed time: 12821 sec
    • Compute time: 1.94 CPU years
    (does not include ~800 sec to stage input data) Ioan Raicu, Zhao Zhang
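Those numbers imply high core utilization for such short tasks; a quick check, computed directly from the slide's figures:

```python
# Core utilization implied by the slide: compute time delivered vs.
# cores x elapsed wall-clock capacity (all figures from the slide).
cores = 5_760
elapsed_s = 12_821
tasks = 92_160
compute_cpu_years = 1.94

capacity_cpu_s = cores * elapsed_s
compute_cpu_s = compute_cpu_years * 365 * 24 * 3600
print(f"utilization ~ {compute_cpu_s / capacity_cpu_s:.0%}")   # ~83%
print(f"mean task time ~ {compute_cpu_s / tasks:.0f} s/task")  # ~664 s
```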
  • 44. An NSF MRI Proposal: PADS, the Petascale Active Data Store. 500 TB reliable storage (data & metadata); 180 TB of fast storage at 180 GB/s; 17 Top/s of analysis capability; 1000 TB tape backup. Supports data ingest from diverse data sources, dynamic provisioning, parallel analysis, remote access by diverse users, and offload to remote data centers.
  • 45. Using Computation to Accelerate Science: complex modeling, experiment automation, data analysis, collaboration & federation, hypothesis generation
  • 46. Integrated View of Simulation, Experiment, & Informatics [diagram: a problem specification drives simulation, managed by a SIMS (Simulation Information Management System); experimental design drives experiment, managed by a LIMS (Laboratory Information Management System); both feed a shared database with analysis tools and browsing & visualization]
  • 47. Robot Scientist (University of Wales)
  • 48. A Concluding Thought
    • “Applied computer science is now playing the role that mathematics did from the 17th through the 20th centuries: providing an orderly, formal framework & exploratory apparatus for other sciences.”
    George Djorgovski
  • 49. The Computation Institute
    • A joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods.
    • Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three).
    www.ci.uchicago.edu: faculty, fellows, staff, students, computers, projects.
  • 50.