Computing Outside The Box
Ian Foster, Computation Institute, Argonne National Lab & University of Chicago
Abstract: The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
1890
1953
“Computation may someday be organized as a public utility … The computing utility could become the basis for a new and important industry.” John McCarthy (1961)
 
 
 
 
 
I-WAY, 1995
The grid, 1998: “Dependable, consistent, pervasive access to resources”
- Dependable: performance and functionality guarantees
- Consistent: uniform interfaces to a wide variety of resources
- Pervasive: ability to “plug in” from anywhere
Application Infrastructure
Application / Infrastructure (service-oriented infrastructure)
Layered grid architecture (initially custom protocols, later Web Services), shown alongside the Internet Protocol architecture (Link, Internet, Transport, Application):
- Fabric: “controlling things locally”: access to, and control of, resources
- Connectivity: “talking to things”: communication (Internet protocols) and security
- Resource: “sharing single resources”: negotiating access, controlling use
- Collective: “managing multiple resources”: ubiquitous infrastructure services
- User/Application: “specialized services”: user- or application-specific distributed services
 
www.opensciencegrid.org
www.opensciencegrid.org
Bennett Berthenthal et al., www.sidgrid.org
Brian Tieman
Simplified example workflows (www.opensciencegrid.org):
- Genome sequence analysis
- Physics data analysis
- Sloan Digital Sky Survey
“Sine” workload: 2M tasks, 10 MB : 10 ms data-to-compute ratio, 100 nodes, GCC policy, 50 GB cache per node. Credit: Ioan Raicu.
Same scenario, but with dynamic resource provisioning
Data diffusion, sine-wave workload summary:
- GPFS: 5.70 hrs, ~8 Gb/s, 1138 CPU hrs
- Data diffusion + static resource provisioning (SRP): 1.80 hrs, ~25 Gb/s, 361 CPU hrs
- Data diffusion + dynamic resource provisioning (DRP): 1.86 hrs, ~24 Gb/s, 253 CPU hrs
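The gain over GPFS comes from dispatching each task to a node that already caches its input data, so repeated files are read locally rather than from shared storage. A minimal sketch of that cache-aware dispatch idea follows; it is an illustration only, not the actual Falkon/data-diffusion scheduler, and pick_node, dispatch, and caches are hypothetical names.

```python
# Minimal sketch of cache-aware task dispatch (hypothetical API, not Falkon's).
from collections import defaultdict

caches = defaultdict(set)  # node id -> set of files cached on that node

def pick_node(task_files, nodes):
    """Prefer the node that already caches the most of the task's input files."""
    return max(nodes, key=lambda n: len(caches[n] & task_files))

def dispatch(task_files, nodes):
    node = pick_node(task_files, nodes)
    missing = task_files - caches[node]
    # Missing files would be fetched from shared storage (e.g., GPFS) and cached.
    caches[node] |= missing
    return node, missing

# Example: the second task reuses node 0's cache and transfers nothing.
print(dispatch({"a.dat", "b.dat"}, [0, 1]))
print(dispatch({"a.dat", "b.dat"}, [0, 1]))
```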
Application / Infrastructure (service-oriented infrastructure)
Application (service-oriented applications) / Infrastructure (service-oriented infrastructure)
 
Creating services in 2008: Introduce and gRAVI (Grid Remote Application Virtualization Infrastructure), from Ohio State University and Argonne/U. Chicago:
- Introduce: define service, create skeleton, discover types, add operations, configure security
- gRAVI: wrap executables as services
- Lifecycle: create the application service, transfer the GAR and deploy it to a Globus container, advertise it in the index service, store it in the repository service; clients then discover it, invoke it, and get results
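As a rough illustration of the "wrap executables" idea only, the sketch below exposes a command-line tool over plain HTTP. This is not gRAVI's API; gRAVI generates WSRF grid services packaged as GAR files, with security configured through Introduce.

```python
# Illustrative only: expose a command-line tool ("sort" as a stand-in) over HTTP.
# gRAVI itself generates WSRF services packaged as GARs; this is not its API.
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class WrappedExecutableHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        stdin_data = self.rfile.read(length)
        # Run the wrapped executable on the request body and return its stdout.
        result = subprocess.run(["sort"], input=stdin_data, capture_output=True)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(result.stdout)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), WrappedExecutableHandler).serve_forever()
```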
As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical).
Microarray clustering using Taverna (credit: Wei Tan). The workflow chains caGrid services, with “shim” services adapting data between steps:
1. Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub
2. Normalize microarray data using a GenePattern analytical service: node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService
3. Hierarchical clustering using a geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage
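In outline the workflow is a linear pipeline in which each step's output, possibly adapted by a shim, feeds the next. A sketch of that shape; the functions and the experiment identifier are hypothetical local stand-ins, since the real steps are WSRF services invoked from Taverna rather than Python calls.

```python
# Sketch of the three-step pipeline; the functions are hypothetical stand-ins
# for the caArray, GenePattern, and geWorkbench caGrid services.

def query_caarray(experiment_id):
    """caArray data service (stub): return raw microarray values."""
    return {"experiment": experiment_id, "values": [3.2, 0.4, 7.1]}

def normalize_genepattern(dataset):
    """GenePattern preprocessing service (stub): scale values to [0, 1]."""
    peak = max(dataset["values"])
    return [v / peak for v in dataset["values"]]

def cluster_geworkbench(values):
    """geWorkbench hierarchical clustering (stub): placeholder ordering step."""
    return sorted(values)

def run_workflow(experiment_id):
    raw = query_caarray(experiment_id)
    normalized = normalize_genepattern(raw)  # a "shim" would adapt formats here
    return cluster_geworkbench(normalized)

print(run_workflow("example-experiment"))
```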
The Globus-based LIGO Data Grid (LIGO Gravitational Wave Observatory; sites shown include Birmingham, Cardiff, and AEI/Golm):
- Replicating >1 terabyte/day to 8 sites
- >100 million replicas so far
- MTBF = 1 month
Pull “missing” files to a storage system. The Data Replication Service takes a list of required files and coordinates three functions:
- Data replication: the Data Replication Service itself
- Data location: Replica Location Index and Local Replica Catalogs
- Data movement: Reliable File Transfer Service and GridFTP
“Design and Implementation of a Data Replication Service Based on the Lightweight Data Replicator System,” Chervenak et al., 2005.
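The core loop is simple: compare the required list against the local replica catalog, look up a source for anything missing, transfer it, and register the new replica. A schematic sketch with dictionary and print stand-ins; this is not the Globus DRS API, which uses RLS catalogs, the Reliable File Transfer service, and GridFTP.

```python
# Schematic of the "pull missing files" loop (stand-ins, not the Globus DRS API).
local_catalog = {"fileA"}                              # replicas already held locally
replica_index = {"fileB": "gsiftp://siteX/fileB",      # logical name -> remote source
                 "fileC": "gsiftp://siteY/fileC"}

def transfer(source_url):
    """Stand-in for the Reliable File Transfer service / GridFTP."""
    print("transferring", source_url)

def replicate(required_files):
    for name in required_files:
        if name in local_catalog:
            continue                                   # already replicated
        source = replica_index[name]                   # data location step
        transfer(source)                               # data movement step
        local_catalog.add(name)                        # register the new replica

replicate(["fileA", "fileB", "fileC"])
print(local_catalog)
```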
Why not leverage dynamic deployment capabilities? State would be exposed and accessed uniformly at all levels, with provisioning, management, and monitoring at all levels:
- Physical machine: procure hardware
- Hypervisor/OS: deploy hypervisor/OS
- Virtual machine: deploy virtual machine
- JVM: deploy container
- VO services (DRS, GridFTP, Local Replica Catalog): deploy service
Maybe we need to specialize further …
- User to service provider: “Provide access to data D at S1, S2, S3 with performance P”
- Service provider to resource provider: “Provide storage with performance P1, network with P2, …”
- Realized with mechanisms such as a replica catalog, user-level multicast, …
Infrastructure Applications
Energy: progress of adoption
 
US$3
Credit: Werner Vogels
Credit: Werner Vogels
Animoto EC2 image usage, day 1 through day 8 (chart scale: 0 to 4,000).
Software / Platform / Infrastructure stack (built up over three slides):
- Software: Salesforce.com, Google, Animoto, …; science examples: caBIG, TeraGrid gateways
- Platform: Amazon, Google, Microsoft, …
- Infrastructure: Amazon, GoGrid, Sun, Microsoft, …
Dynamo: Amazon’s highly available key-value store (DeCandia et al., SOSP’07):
- Simple query model
- Weak consistency, no isolation
- Stringent SLAs (e.g., 300 ms for 99.9% of requests, at a peak of 500 requests/sec)
- Incremental scalability
- Symmetry
- Decentralization
- Heterogeneity
Technologies used in Dynamo (problem: technique, with the advantage in parentheses):
- Partitioning: consistent hashing (incremental scalability)
- High availability for writes: vector clocks with reconciliation during reads (version size is decoupled from update rates)
- Handling temporary failures: sloppy quorum and hinted handoff (high availability and durability even when some replicas are unavailable)
- Recovering from permanent failures: anti-entropy using Merkle trees (synchronizes divergent replicas in the background)
- Membership and failure detection: gossip-based membership protocol and failure detection (preserves symmetry and avoids a centralized registry for membership and node liveness information)
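Consistent hashing, the partitioning technique in the first row, places nodes and keys on the same hash ring so that each node owns one arc and adding or removing a node moves only the keys on that arc. A minimal sketch of the idea; unlike Dynamo's real ring it omits virtual nodes and the replication of each key to the N-1 successors of its coordinator.

```python
# Minimal consistent-hash ring (no virtual nodes, no replication).
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node sits at the point given by the hash of its name.
        self._points = sorted((_hash(n), n) for n in nodes)

    def owner(self, key):
        """Walk clockwise from the key's position to the first node point."""
        i = bisect.bisect(self._points, (_hash(key), ""))
        return self._points[i % len(self._points)][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("customer:42"))  # adding "node-d" would remap only one arc's keys
```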
Application (service-oriented applications) / Infrastructure (service-oriented infrastructure)
The Shape of Grids to Come? (Energy / Internet)
Killer apps for computing outside the box (COTB)?
- Biomedical informatics / evidence-based medicine
- Human responses to global climate disruption
Using IaaS in biomedical informatics (diagram: my servers in Chicago, handle.net, BIRN, an IaaS provider).
“The computer revolution hasn’t happened yet.” Alan Kay, 1997
Connectivity over time (log scale) for science, enterprise, and consumer users: grid, then cloud, then ????
“When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001)
Thank you! Computation Institute www.ci.uchicago.edu
