Computing Outside The Box

Ian Foster
Computation Institute
Argonne National Lab & University of Chicago

Keynote talk at the International Conference on Supercomputing 2009, IBM Yorktown, New York.
Abstract

The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs “outside the box” can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
“I’ve been doing cloud computing since before it was called grid.”
[Image slide: 1890]

[Image slide: 1953]
“Computation may someday be organized as a public utility … The computing utility could become the basis for a new and important industry.”
John McCarthy (1961)
“When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” (George Gilder, 2001)

[Chart: connectivity (on log scale) vs. time; science crosses the threshold first, giving rise to the grid]
[Diagram: application layered directly on infrastructure]
Layered grid architecture (“The Anatomy of the Grid,” 2001)

- Application: “Specialized services” (user- or application-specific distributed services)
- Collective: “Managing multiple resources” (ubiquitous infrastructure services)
- Resource: “Sharing single resources” (negotiating access, controlling use)
- Connectivity: “Talking to things” (communication via Internet protocols, plus security)
- Fabric: “Controlling things locally” (access to, and control of, resources)

The layers parallel the Internet Protocol architecture: Fabric alongside Link, Connectivity alongside Internet/Transport, and the upper grid layers alongside Application.
[Diagram: a service-oriented infrastructure layer introduced between application and infrastructure]
[Two image slides] www.opensciencegrid.org
[Diagram: service-oriented applications added above the service-oriented infrastructure layer]
As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical)
Microarray clustering using Taverna (Wei Tan)

- Query and retrieve microarray data from a caArray data service:
  cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub
- Normalize microarray data using a GenePattern analytical service:
  node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService
- Hierarchical clustering using a geWorkbench analytical service:
  cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage

[Workflow diagram legend: workflow in/outputs, caGrid services, other services, and “shim” services that adapt data between steps]
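Stripped of the workflow engine, the example is three chained service calls. A minimal sketch of that pattern in Python: call_service is a hypothetical stand-in, not the Taverna or caGrid client API, and the operation names and query payload are invented; only the endpoints come from the slide.

```python
# Hypothetical stand-in for a WSRF service call; the real workflow runs
# in Taverna against the caGrid endpoints below (operation names and the
# query payload are invented).
def call_service(endpoint: str, operation: str, payload):
    print(f"invoke {operation} at {endpoint}")
    return payload  # echo so the three-step pipeline runs end to end

raw = call_service(
    "cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub",
    "query", {"experiment": "..."})        # 1. retrieve microarray data
normalized = call_service(
    "node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService",
    "normalize", raw)                      # 2. GenePattern normalization
clusters = call_service(
    "cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage",
    "cluster", normalized)                 # 3. geWorkbench clustering
```

The “shim” services in the legend sit between such calls, converting one service's output format into the next service's input.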
[Diagram: applications and infrastructure]
[Chart, three builds: “Progress of adoption” along an energy timeline, with $$ annotations]
“When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” (George Gilder, 2001)

[Chart, extended: connectivity (on log scale) vs. time; science gave rise to the grid, and now the enterprise to the cloud]
[Image slide: US$3]
[Two image slides. Credit: Werner Vogels]
[Chart: Animoto EC2 image usage, climbing from near zero to ~4,000 instances between Day 1 and Day 8]
Software, platform, and infrastructure layers, with example providers (built up over three slides):

  Software        Salesforce.com, Google, Animoto, …, …, caBIG, TeraGrid gateways
  Platform        Amazon, Google, Microsoft, …
  Infrastructure  Amazon, GoGrid, Sun, Microsoft, …
Dynamo: Amazon’s highly available key-value store (DeCandia et al., SOSP ’07)

- Simple query model
- Weak consistency, no isolation
- Stringent SLAs (e.g., 300 ms for 99.9% of requests, at a peak of 500 requests/sec)
- Incremental scalability
- Symmetry
- Decentralization
- Heterogeneity
Technologies used in Dynamo

  Problem                              Technique                                    Advantage
  Partitioning                         Consistent hashing                           Incremental scalability
  High availability for writes         Vector clocks, reconciliation during reads   Version size decoupled from update rate
  Handling temporary failures          Sloppy quorum and hinted handoff             Availability and durability even when some replicas are down
  Recovering from permanent failures   Anti-entropy using Merkle trees              Synchronizes divergent replicas in the background
  Membership and failure detection     Gossip-based membership protocol             Preserves symmetry; no centralized registry of membership and liveness
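To make the first row concrete, here is a minimal consistent-hashing sketch in Python. It is illustrative only, not Dynamo's code: the node names, the MD5 ring hash, and the replica count are assumptions. Each node owns several points on a hash ring, and a key is stored on the first n distinct nodes clockwise of its hash, so adding a node relocates only neighboring keys; that is the incremental scalability named in the table.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string to a point on the hash ring (MD5 chosen arbitrarily)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=8):
        # Each physical node gets several "virtual" points on the ring,
        # which evens out the share of key space each node owns.
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )

    def preference_list(self, key: str, n: int = 3):
        """The n distinct nodes responsible for key, found by walking
        clockwise around the ring from the key's hash."""
        start = bisect.bisect(self.ring, (ring_hash(key),))
        owners = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n:
                    break
        return owners

ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("customer:42"))   # three replica holders
```

Virtual nodes also let heterogeneous machines take proportionally larger shares of the ring, which is the heterogeneity bullet on the previous slide.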
[Diagram, repeated: service-oriented applications and service-oriented infrastructure between application and infrastructure layers]
The Globus-based LIGO data grid

LIGO Gravitational Wave Observatory: replicating >1 terabyte/day to 8 sites (Birmingham, Cardiff, AEI/Golm, …); >100 million replicas so far; MTBF = 1 month
Data replication service: pull “missing” files to a storage system

- Data location: the Replica Location Index and Local Replica Catalogs record where copies live
- Data movement: GridFTP and the Reliable File Transfer Service move the bytes
- The Data Replication Service takes a list of required files and drives both

“Design and Implementation of a Data Replication Service Based on the Lightweight Data Replicator System,” Chervenak et al., 2005
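Operationally, the service reduces to a reconciliation loop: compare the required file list against the local catalog, locate missing files in the replica index, and transfer them. A toy sketch of that loop, in which plain dicts, a set, and a print stub stand in for the Replica Location Index, Local Replica Catalog, and GridFTP client (hypothetical stand-ins, not the real Globus APIs; the file name and URL are made up):

```python
def gridftp_fetch(source: str, name: str) -> None:
    # Stub for a GridFTP third-party transfer into local storage.
    print(f"transfer {source}/{name} -> local storage")

def replicate(required, local_replicas, replica_index):
    """Pull every required file not yet held locally, then register it."""
    for name in required:
        if name in local_replicas:
            continue                          # already replicated here
        sources = replica_index.get(name, [])
        if not sources:
            raise LookupError(f"no replica of {name} found anywhere")
        gridftp_fetch(sources[0], name)       # move the bytes
        local_replicas.add(name)              # register the new copy

replica_index = {"H1-frame-0001.gwf": ["gsiftp://ligo.example.org"]}
local_replicas = set()
replicate(["H1-frame-0001.gwf"], local_replicas, replica_index)
```

Making this loop restartable and failure-tolerant (the MTBF = 1 month on the previous slide) is what the Reliable File Transfer Service adds over raw transfers.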
Specializing further …

- User to service provider: “Provide access to data D at S1, S2, S3 with performance P”
- Service provider to resource provider: “Provide storage with performance P1, network with P2, …”
- The service provider composes its own machinery (replica catalog, user-level multicast, …) on top of the leased resources
Using IaaS in biomedical informatics

[Diagram: handle.net identifiers resolving to data hosted either at an IaaS provider or on my own servers, spanning BIRN and Chicago sites]
Clouds and supercomputers: conventional wisdom?

                   Loosely coupled apps   Tightly coupled apps
  Clouds/clusters  ✔                      Too slow
  Supercomputers   Too expensive          ✔
[Four slides of EC2 benchmark plots]
Ed Walker, Benchmarking Amazon EC2 for high-performance scientific computing, ;login:, October 2008.
[Four slides of batch-queue wait-time data]
D. Nurmi, J. Brevik, R. Wolski, QBETS: Queue Bounds Estimation from Time Series, SIGMETRICS 2007, 379-380.
Clouds and supercomputers: conventional wisdom?

                   Loosely coupled apps   Tightly coupled apps
  Clouds/clusters  ✔                      Good for rapid response
  Supercomputers   Too expensive          ✔
Loosely coupled problems

- Ensemble runs to quantify climate model uncertainty
- Identify potential drug targets by screening a database of ligand structures against target proteins
- Study economic model sensitivity to parameters
- Analyze turbulence dataset from many perspectives
- Perform numerical optimization to determine optimal resource assignment in energy problems
- Mine collections of data from advanced light sources
- Construct databases of computed properties of chemical compounds
- Analyze data from the Large Hadron Collider
- Analyze log data from 100,000-node parallel systems
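Everything on this list has the same shape: a large bag of small, independent tasks. A minimal sketch of that pattern using Python's standard concurrent.futures (the score function and parameter grid are placeholders, not any of the applications above):

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def score(params):
    """Placeholder for one independent task: one ensemble member,
    one ligand docking, one log shard, etc."""
    a, b = params
    return a * a + b          # stand-in computation

if __name__ == "__main__":
    grid = list(product(range(100), range(100)))   # 10,000 independent tasks
    with ProcessPoolExecutor() as pool:            # or a cluster-wide executor
        results = list(pool.map(score, grid, chunksize=100))
    print(f"{len(results)} tasks complete; best score = {min(results)}")
```

At the scales in the slides that follow, the executor is Falkon on a Blue Gene/P rather than a local process pool, but the programming model is the same bag of tasks.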
Many many tasks: identifying potential drug targets

Protein target(s) × 2M+ ligands
(Mike Kubal, Benoit Roux, and others)
...
Manually prep      Manually prep                         NAB script
                   ZINC             DOCK6 rec file    ...
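A quick sanity check of those stage totals (the numbers come from the slide; only the arithmetic is added). Docking dominates the task count while GCMC dominates the CPU time, and the stage sum lands in the neighborhood of the slide's ~500,000 CPU-hour headline:

```python
# CPU-hours per stage = tasks x hours-per-task x CPUs-per-task
stages = {
    "DOCK6/FRED docking": 4_000_000 * (60 / 3600) * 1,
    "Amber scoring":         10_000 * (20 / 60) * 1,
    "GCMC":                     500 * 10 * 100,
}
for name, hours in stages.items():
    print(f"{name:>20}: {hours:>9,.0f} CPU-hours")
total = sum(stages.values())
print(f"{'total':>20}: {total:>9,.0f} CPU-hours (~{total / 8766:.0f} CPU-years)")
```

This prints roughly 570,000 CPU-hours (~65 CPU-years); the slide's rounder 500,000 CPU-hours / 50 CPU-years reflects the approximate per-stage estimates.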
DOCK on BG/P: ~1M tasks on 118,000 CPUs

- CPU cores: 118,784
- Tasks: 934,803
- Elapsed time: 7,257 s
- Compute time: 21.43 CPU-years
- Average task time: 667 s
- Relative efficiency: 99.7% (scaling from 16 to 32 racks)
- Utilization: 99.6% sustained, 78.3% overall
- Per-task I/O via GPFS: 1 script (~5 KB), 2 file reads (~10 KB), 1 file write (~10 KB); the binary (~7 MB) and static input data (~45 MB) are cached in RAM from GPFS on the first task per node

(Ioan Raicu, Zhao Zhang, Mike Wilde)
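The overall utilization figure follows directly from the headline numbers: compute time divided by available core-seconds. Checking it from the slide's own values:

```python
cores, elapsed_s = 118_784, 7_257          # machine size and wall-clock time
compute_cpu_years = 21.43                  # total task compute time

capacity_cpu_s = cores * elapsed_s                       # core-seconds available
compute_cpu_s = compute_cpu_years * 365.25 * 24 * 3600   # core-seconds used

print(f"overall utilization: {compute_cpu_s / capacity_cpu_s:.1%}")
# -> ~78.5%, matching the slide's 78.3% up to rounding of 21.43 CPU-years
```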
Managing 160,000 cores

[Diagram: the Falkon executor dispatching tasks to high-speed local “disk” on each node, with slower shared storage behind it]
Scaling POSIX I/O to petascale

- Global file system for staging, over torus and tree interconnects
- CN-striped intermediate file system for large datasets (MosaStore striping)
- Chirp (multicast) to local file systems on compute nodes for local datasets

[Plot: efficiency for 4-second tasks and varying data size (1 KB to 1 MB) for CIO and GPFS, up to 32K processors]
[Plots: “sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50 GB cache per node (Ioan Raicu); then the same scenario with dynamic resource provisioning]
Data diffusion sine-wave workload: summary

- GPFS:   5.70 hrs, ~8 Gb/s, 1138 CPU-hrs
- DD+SRP: 1.80 hrs, ~25 Gb/s, 361 CPU-hrs
- DD+DRP: 1.86 hrs, ~24 Gb/s, 253 CPU-hrs
Clouds and supercomputers: conventional wisdom?

                   Loosely coupled apps   Tightly coupled apps
  Clouds/clusters  ✔                      Good for rapid response
  Supercomputers   Excellent              ✔
“The computer revolution hasn’t happened yet.”
Alan Kay, 1997
“When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” (George Gilder, 2001)

[Chart, extended again: after science (grid) and enterprise (cloud), the consumer brings ????]
The Shape of Grids to Come?

[Image: the Energy Internet]
Thank you!



 Computation Institute
www.ci.uchicago.edu