Computing Outside The Box June 2009

Keynote talk at the International Conference on Supercomputing 2009, at IBM Yorktown in New York. This is a major update of a talk first given in New Zealand last January. The abstract follows.

The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.

Transcript

  • 1. Ian Foster, Computation Institute, Argonne National Lab & University of Chicago 1
  • 2. Abstract The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible. 2
  • 3. 3
  • 4. “I’ve been doing cloud computing since before it was called grid.” 4
  • 5. 1890 5
  • 6. 1953 6
  • 7. “Computation may someday be organized as a public utility … The computing utility could become the basis for a new and important industry.” John McCarthy (1961) 7
  • 8. 8
  • 9. “When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001) [Chart: connectivity (on log scale) vs. time, with Science/Grid marked] 9
  • 10. Application Infrastructure 10
  • 11. Layered grid architecture (“The Anatomy of the Grid,” 2001): Application (“Specialized services”: user- or application-specific distributed services); Collective (“Managing multiple resources”: ubiquitous infrastructure services); Resource (“Sharing single resources”: negotiating access, controlling use); Connectivity (“Talking to things”: communication (Internet protocols) & security); Fabric (“Controlling things locally”: access to, & control of, resources). Shown alongside the Internet Protocol Architecture: Application, Transport, Internet, Link. 11
  • 12. Application Service oriented infrastructure Infrastructure 12
  • 13. 13
  • 14. www.opensciencegrid.org 14
  • 15. www.opensciencegrid.org 15
  • 16. Application Service oriented infrastructure Infrastructure 16
  • 17. Application Service oriented applications Service oriented infrastructure Infrastructure 17
  • 18. 18
  • 19. As of Oct 19, 2008: 122 participants, 105 services (70 data, 35 analytical) 19
  • 20. Microarray clustering using Taverna: (1) Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub; (2) Normalize microarray data using the GenePattern analytical service: node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService; (3) Hierarchical clustering using the geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage. Workflow elements: workflow in/output, caGrid services, “shim” services, others. Wei Tan 20
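
The steps on that slide follow a simple service-composition pattern: each caGrid service is invoked in turn, with small “shim” steps adapting data formats in between. Below is a minimal Python sketch of that pattern; the invoke_cagrid_service and shim helpers are hypothetical stand-ins for the WSRF/SOAP clients that Taverna generates, and only the endpoint addresses come from the slide.

```python
def invoke_cagrid_service(endpoint, operation, payload):
    """Hypothetical placeholder for a caGrid WSRF service invocation."""
    raise NotImplementedError("replace with a real SOAP/WSRF client call")

def shim(data):
    """Hypothetical "shim" step adapting one service's output to the
    format the next service expects."""
    return data

def microarray_clustering(experiment_query):
    # 1. Query and retrieve microarray data from the caArray data service.
    raw = invoke_cagrid_service(
        "cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub",
        "query", experiment_query)

    # 2. Normalize the data using the GenePattern analytical service.
    normalized = invoke_cagrid_service(
        "node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService",
        "preprocess", shim(raw))

    # 3. Hierarchical clustering using the geWorkbench analytical service.
    return invoke_cagrid_service(
        "cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage",
        "cluster", shim(normalized))
```
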
  • 21. Applications Infrastructure 21
  • 22. Energy vs. progress of adoption [chart] 22
  • 23. Energy vs. progress of adoption, with $$ annotations [chart] 23
  • 24. Energy vs. progress of adoption, with $$ annotations [chart] 24
  • 25. “When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001) [Chart: connectivity (on log scale) vs. time, with Science/Grid and Enterprise/Cloud marked] 25
  • 26. 26
  • 27. 27
  • 28. US$3 28
  • 29. Credit: Werner Vogels 29
  • 30. Credit: Werner Vogels 30
  • 31. Animoto EC2 image usage, Day 1 to Day 8, scaling from 0 to ~4,000 [chart] 31
  • 32. Software: Salesforce.com, Google, Animoto, …, caBIG, TeraGrid gateways, …; Platform; Infrastructure 32
  • 33. Software: Salesforce.com, Google, Animoto, …, caBIG, TeraGrid gateways, …; Platform; Infrastructure: Amazon, GoGrid, Sun, Microsoft, … 33
  • 34. Software: Salesforce.com, Google, Animoto, …, caBIG, TeraGrid gateways, …; Platform: Amazon, Google, Microsoft, …; Infrastructure: Amazon, GoGrid, Sun, Microsoft, … 34
  • 35. 35
  • 36. Dynamo: Amazon’s highly available key-value store (DeCandia et al., SOSP’07). Simple query model; weak consistency, no isolation; stringent SLAs (e.g., 300 ms for 99.9% of requests; peak 500 requests/sec); incremental scalability; symmetry; decentralization; heterogeneity. 36
  • 37. Technologies used in Dynamo (problem: technique, advantage). Partitioning: consistent hashing, for incremental scalability. High availability for writes: vector clocks with reconciliation during reads, so version size is decoupled from update rates. Handling temporary failures: sloppy quorum and hinted handoff, providing high availability and durability guarantee when some of the replicas are not available. Recovering from permanent failures: anti-entropy using Merkle trees, which synchronizes divergent replicas in the background. Membership and failure detection: gossip-based membership protocol and failure detection, which preserves symmetry and avoids having a centralized registry for storing membership and node liveness. 37
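
The first row of that table, consistent hashing for partitioning, is easy to illustrate. Here is a minimal Python sketch of consistent hashing with virtual nodes; it shows the general idea only and is not Dynamo's actual implementation.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node owns many positions ("virtual nodes") on the ring.
        self._ring = []
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # A key is stored on the first virtual node clockwise from its hash.
        pos = self._hash(key)
        idx = bisect.bisect(self._ring, (pos,))
        return self._ring[idx % len(self._ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("customer:42"))
```

Adding or removing a node only remaps the keys adjacent to its virtual nodes, which is the incremental scalability the table credits to this technique.
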
  • 38. Application Service oriented applications Service oriented infrastructure Infrastructure 38
  • 39. The Globus-based LIGO data grid: LIGO Gravitational Wave Observatory, Birmingham, Cardiff, AEI/Golm, … Replicating >1 Terabyte/day to 8 sites; >100 million replicas so far; MTBF = 1 month 39
  • 40. Data replication service: pull “missing” files to a storage system. Components: Data Replication Service (takes a list of required data files); Data Location (Replica Location Index, Local Replica Catalog); Data Movement (GridFTP, Reliable File Transfer Service). “Design and Implementation of a Data Replication Service Based on the Lightweight Data Replicator System,” Chervenak et al., 2005 40
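
The replication pattern described there, pull the files that are required but missing, can be sketched in a few lines. In the sketch below, the catalog, index, and transfer arguments are in-memory stand-ins for the Local Replica Catalog, Replica Location Index, and GridFTP components named on the slide; it illustrates the control flow only, not the Globus DRS API.

```python
def replicate(required_files, local_catalog, replica_index, transfer):
    """local_catalog: set of logical file names already held locally.
    replica_index:  dict mapping logical name -> list of remote source URLs.
    transfer:       callable(source_url, logical_name) that moves one file."""
    missing = [f for f in required_files if f not in local_catalog]
    for name in missing:
        sources = replica_index.get(name, [])
        if not sources:
            print(f"no known replica for {name}, skipping")
            continue
        transfer(sources[0], name)   # e.g. a GridFTP third-party transfer
        local_catalog.add(name)      # register the new local replica

# Example with in-memory stand-ins for the catalogs and the mover:
catalog = {"lfn:frame-001"}
index = {"lfn:frame-002": ["gsiftp://ligo.example.org/frames/frame-002"]}
replicate(["lfn:frame-001", "lfn:frame-002"], catalog, index,
          transfer=lambda src, name: print(f"fetching {name} from {src}"))
```
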
  • 41. Specializing further … User to service provider: “Provide access to data D at S1, S2, S3 with performance P.” Service provider (replica catalog, user-level multicast, …) to resource provider: “Provide storage with performance P1, network with P2, …” 41
  • 42. Using IaaS in biomedical informatics [diagram: handle.net, IaaS provider, my servers, BIRN, Chicago] 42
  • 43. Clouds and supercomputers: Conventional wisdom? Clouds/clusters: ✔ for loosely coupled applications, too slow for tightly coupled applications. Supercomputers: too expensive for loosely coupled applications, ✔ for tightly coupled applications. 43
  • 44. Ed Walker, Benchmarking Amazon EC2 for high-performance scientific computing, ;login:, October 2008. 44
  • 45. Ed Walker, Benchmarking Amazon EC2 for high-performance scientific computing, ;login:, October 2008. 45
  • 46. Ed Walker, Benchmarking Amazon EC2 for high-performance scientific computing, ;login:, October 2008. 46
  • 47. Ed Walker, Benchmarking Amazon EC2 for high-performance scientific computing, ;login:, October 2008. 47
  • 48. D. Nurmi, J. Brevik, R. Wolski: QBETS: queue bounds estimation from time series. SIGMETRICS 2007: 379-380 48
  • 49. D. Nurmi, J. Brevik, R. Wolski: QBETS: queue bounds estimation from time series. SIGMETRICS 2007: 379-380 49
  • 50. D. Nurmi, J. Brevik, R. Wolski: QBETS: queue bounds estimation from time series. SIGMETRICS 2007: 379-380 50
  • 51. D. Nurmi, J. Brevik, R. Wolski: QBETS: queue bounds estimation from time series. SIGMETRICS 2007: 379-380 51
  • 52. Clouds and supercomputers: Conventional wisdom? Clouds/clusters: ✔ for loosely coupled applications, good for rapid response for tightly coupled applications. Supercomputers: too expensive for loosely coupled applications, ✔ for tightly coupled applications. 52
  • 53. Loosely coupled problems: ensemble runs to quantify climate model uncertainty; identify potential drug targets by screening a database of ligand structures against target proteins; study economic model sensitivity to parameters; analyze turbulence dataset from many perspectives; perform numerical optimization to determine optimal resource assignment in energy problems; mine collection of data from advanced light sources; construct databases of computed properties of chemical compounds; analyze data from the Large Hadron Collider; analyze log data from 100,000-node parallel … 53
  • 54. Many many tasks: identifying potential drug targets. Protein target(s) x 2M+ ligands (Mike Kubal, Benoit Roux, and others) 54
  • 55. [Workflow diagram: drug-screening pipeline. Inputs: ZINC 3-D structures (2M ligands, 6 GB), PDB protein descriptions (~1 MB, 1 per protein), manually prepared DOCK6 and FRED receptor files (define the pocket to bind to), NAB script template and parameters (flexible residues, #MD steps). Stages: DOCK6/FRED docking, ~4M tasks x 60 s x 1 CPU (~60K CPU-hrs), select best ~5K; Amber prep (AmberizeLigand, AmberizeReceptor, AmberizeComplex, generate NAB script, RunNABScript) and Amber scoring, ~10K tasks x 20 min x 1 CPU (~3K CPU-hrs), select best ~500; GCMC, ~500 x 10 hr x 100 CPUs (~500K CPU-hrs); report ligands and complexes. For 1 target: 4 million tasks, 500,000 CPU-hrs (50 CPU-years).] 55
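
The pipeline above is essentially a bag of millions of independent (ligand, receptor) tasks, which is why it maps well onto large task-parallel systems. The sketch below shows the dispatch pattern using a local process pool; dock_one is a hypothetical placeholder for a single DOCK6/FRED invocation, whereas the actual runs used Swift and Falkon on a Blue Gene/P rather than one machine.

```python
from concurrent.futures import ProcessPoolExecutor

def dock_one(args):
    ligand_file, receptor_file = args
    # Placeholder "docking": the real pipeline would invoke DOCK6 or FRED on
    # this ligand/receptor pair and parse the resulting score.
    score = hash((ligand_file, receptor_file)) % 1000
    return ligand_file, score

def screen(ligands, receptor, top_k=5000):
    """Dock every ligand against the receptor and keep the best top_k
    for the more expensive Amber stage."""
    work = [(ligand, receptor) for ligand in ligands]
    with ProcessPoolExecutor() as pool:
        scores = pool.map(dock_one, work, chunksize=256)
        return sorted(scores, key=lambda s: s[1])[:top_k]

if __name__ == "__main__":
    best = screen([f"ligand-{i}.mol2" for i in range(1000)], "receptor.pdb")
    print(len(best), "ligands kept for the next stage")
```
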
  • 56. 56
  • 57. DOCK on BG/P: ~1M tasks on 118,000 CPUs. CPU cores: 118,784; tasks: 934,803; elapsed time: 7,257 sec; compute time: 21.43 CPU-years; average task time: 667 sec; relative efficiency: 99.7% (from 16 to 32 racks); utilization: 99.6% sustained, 78.3% overall. Per task: 1 script (~5 KB), 2 file reads (~10 KB), 1 file write (~10 KB) via GPFS; 1 binary (~7 MB) and static input data (~45 MB) in RAM (cached from GPFS on first task per node). Ioan Raicu, Zhao Zhang, Mike Wilde 57
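
As a sanity check, the overall utilization quoted above follows from the compute time divided by the core-seconds available during the run, assuming only the numbers on the slide.

```python
# Back-of-the-envelope check relating the figures on the slide above.
cores     = 118_784
elapsed_s = 7_257                       # wall-clock seconds
compute_s = 21.43 * 365.25 * 86_400     # 21.43 CPU-years expressed in seconds

print(f"overall utilization ~ {compute_s / (cores * elapsed_s):.1%}")
# -> about 78%, consistent with the 78.3% quoted on the slide
```
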
  • 58. Managing 160,000 cores with Falkon: high-speed local “disk” plus slower shared storage 58
  • 59. Scaling POSIX to petascale: Chirp global file system (multicast); staging over torus and tree interconnects; CN-striped intermediate file system (MosaStore, striping) for large intermediate datasets; compute nodes with local file systems (local datasets) 59
  • 60. Efficiency for 4 second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors 60
  • 61. Ioan Raicu “Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node 61
  • 62. Same scenario, but with dynamic resource provisioning 62
  • 63. Data diffusion sine-wave workload, summary: GPFS: 5.70 hrs, ~8 Gb/s, 1138 CPU-hrs; DD+SRP: 1.80 hrs, ~25 Gb/s, 361 CPU-hrs; DD+DRP: 1.86 hrs, ~24 Gb/s, 253 CPU-hrs 63
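
A quick calculation of what those numbers imply relative to the GPFS baseline (roughly a 3x reduction in time to solution and a 3 to 4.5x reduction in CPU-hours):

```python
# Compare the three configurations summarized on the slide above.
runs = {           # name: (wall-clock hours, CPU-hours consumed)
    "GPFS":   (5.70, 1138),
    "DD+SRP": (1.80, 361),
    "DD+DRP": (1.86, 253),
}
base_wall, base_cpu = runs["GPFS"]
for name, (wall, cpu) in runs.items():
    print(f"{name:7s} {base_wall / wall:4.1f}x faster, "
          f"{base_cpu / cpu:4.1f}x fewer CPU-hours")
```
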
  • 64. Clouds and supercomputers: Conventional wisdom? Clouds/clusters: ✔ for loosely coupled applications, good for rapid response for tightly coupled applications. Supercomputers: excellent for loosely coupled applications, ✔ for tightly coupled applications. 64
  • 65. “The computer revolution hasn’t happened yet.” Alan Kay, 1997 65
  • 66. “When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001) [Chart: connectivity (on log scale) vs. time, with Science/Grid, Enterprise/Cloud, and Consumer/???? marked] 66
  • 67. The Shape of Grids to Come? Energy Internet 67
  • 68. Thank you! Computation Institute www.ci.uchicago.edu