Computing Outside The Box
Ian Foster, Computation Institute, Argonne National Lab & University of Chicago
Abstract: The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
1890
1953
“Computation may someday be organized as a public utility … The computing utility could become the basis for a new and important industry.” John McCarthy (1961)
 
 
 
 
 
I-WAY, 1995
The grid, 1998: “Dependable, consistent, pervasive access to resources”
- Dependable: performance and functionality guarantees
- Consistent: uniform interfaces to a wide variety of resources
- Pervasive: ability to “plug in” from anywhere
Application Infrastructure
Application / Infrastructure (service-oriented infrastructure)
Layered grid architecture (initially custom protocols, later Web Services), shown alongside the Internet Protocol architecture (Link, Internet, Transport, Application):
- Fabric: “controlling things locally”: access to, and control of, resources
- Connectivity: “talking to things”: communication (Internet protocols) and security
- Resource: “sharing single resources”: negotiating access, controlling use
- Collective: “managing multiple resources”: ubiquitous infrastructure services
- User/Application: “specialized services”: user- or application-specific distributed services
 
www.opensciencegrid.org
www.opensciencegrid.org
Bennett Berthenthal et al., www.sidgrid.org
Brian Tieman
Simplified example workflows (www.opensciencegrid.org):
- Genome sequence analysis
- Physics data analysis
- Sloan Digital Sky Survey
“Sine” workload: 2M tasks, 10 MB : 10 ms data-to-compute ratio, 100 nodes, GCC policy, 50 GB cache per node. Credit: Ioan Raicu.
Same scenario, but with dynamic resource provisioning
Data diffusion, sine-wave workload summary:
- GPFS: 5.70 hrs, ~8 Gb/s, 1138 CPU hrs
- Data diffusion + static resource provisioning (SRP): 1.80 hrs, ~25 Gb/s, 361 CPU hrs
- Data diffusion + dynamic resource provisioning (DRP): 1.86 hrs, ~24 Gb/s, 253 CPU hrs
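The gain over GPFS comes from dispatching each task to a node that already caches its input data, so repeated files are read locally rather than from shared storage. A minimal sketch of that cache-aware dispatch idea follows; it is an illustration only, not the actual Falkon/data-diffusion scheduler, and pick_node, dispatch, and caches are hypothetical names.

```python
# Minimal sketch of cache-aware task dispatch (hypothetical API, not Falkon's).
from collections import defaultdict

caches = defaultdict(set)  # node id -> set of files cached on that node

def pick_node(task_files, nodes):
    """Prefer the node that already caches the most of the task's input files."""
    return max(nodes, key=lambda n: len(caches[n] & task_files))

def dispatch(task_files, nodes):
    node = pick_node(task_files, nodes)
    missing = task_files - caches[node]
    # Missing files would be fetched from shared storage (e.g., GPFS) and cached.
    caches[node] |= missing
    return node, missing

# Example: the second task reuses node 0's cache and transfers nothing.
print(dispatch({"a.dat", "b.dat"}, [0, 1]))
print(dispatch({"a.dat", "b.dat"}, [0, 1]))
```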
Application / Infrastructure (service-oriented infrastructure)
Application (service-oriented applications) / Infrastructure (service-oriented infrastructure)
 
Creating services in 2008: Introduce and gRAVI (Grid Remote Application Virtualization Infrastructure), from Ohio State University and Argonne/U. Chicago:
- Introduce: define service, create skeleton, discover types, add operations, configure security
- gRAVI: wrap executables as services
- Lifecycle: create the application service, transfer the GAR and deploy it to a Globus container, advertise it in the index service, store it in the repository service; clients then discover it, invoke it, and get results
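As a rough illustration of the "wrap executables" idea only, the sketch below exposes a command-line tool over plain HTTP. This is not gRAVI's API; gRAVI generates WSRF grid services packaged as GAR files, with security configured through Introduce.

```python
# Illustrative only: expose a command-line tool ("sort" as a stand-in) over HTTP.
# gRAVI itself generates WSRF services packaged as GARs; this is not its API.
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class WrappedExecutableHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        stdin_data = self.rfile.read(length)
        # Run the wrapped executable on the request body and return its stdout.
        result = subprocess.run(["sort"], input=stdin_data, capture_output=True)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(result.stdout)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), WrappedExecutableHandler).serve_forever()
```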
As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical).
Microarray clustering using Taverna (credit: Wei Tan). The workflow chains caGrid services, with “shim” services adapting data between steps:
1. Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub
2. Normalize microarray data using a GenePattern analytical service: node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService
3. Hierarchical clustering using a geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage
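In outline the workflow is a linear pipeline in which each step's output, possibly adapted by a shim, feeds the next. A sketch of that shape; the functions and the experiment identifier are hypothetical local stand-ins, since the real steps are WSRF services invoked from Taverna rather than Python calls.

```python
# Sketch of the three-step pipeline; the functions are hypothetical stand-ins
# for the caArray, GenePattern, and geWorkbench caGrid services.

def query_caarray(experiment_id):
    """caArray data service (stub): return raw microarray values."""
    return {"experiment": experiment_id, "values": [3.2, 0.4, 7.1]}

def normalize_genepattern(dataset):
    """GenePattern preprocessing service (stub): scale values to [0, 1]."""
    peak = max(dataset["values"])
    return [v / peak for v in dataset["values"]]

def cluster_geworkbench(values):
    """geWorkbench hierarchical clustering (stub): placeholder ordering step."""
    return sorted(values)

def run_workflow(experiment_id):
    raw = query_caarray(experiment_id)
    normalized = normalize_genepattern(raw)  # a "shim" would adapt formats here
    return cluster_geworkbench(normalized)

print(run_workflow("example-experiment"))
```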
The Globus-based LIGO Data Grid (LIGO Gravitational Wave Observatory; sites shown include Birmingham, Cardiff, and AEI/Golm):
- Replicating >1 terabyte/day to 8 sites
- >100 million replicas so far
- MTBF = 1 month
Pull “missing” files to a storage system. The Data Replication Service takes a list of required files and coordinates three functions:
- Data replication: the Data Replication Service itself
- Data location: Replica Location Index and Local Replica Catalogs
- Data movement: Reliable File Transfer Service and GridFTP
“Design and Implementation of a Data Replication Service Based on the Lightweight Data Replicator System,” Chervenak et al., 2005.
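The core loop is simple: compare the required list against the local replica catalog, look up a source for anything missing, transfer it, and register the new replica. A schematic sketch with dictionary and print stand-ins; this is not the Globus DRS API, which uses RLS catalogs, the Reliable File Transfer service, and GridFTP.

```python
# Schematic of the "pull missing files" loop (stand-ins, not the Globus DRS API).
local_catalog = {"fileA"}                              # replicas already held locally
replica_index = {"fileB": "gsiftp://siteX/fileB",      # logical name -> remote source
                 "fileC": "gsiftp://siteY/fileC"}

def transfer(source_url):
    """Stand-in for the Reliable File Transfer service / GridFTP."""
    print("transferring", source_url)

def replicate(required_files):
    for name in required_files:
        if name in local_catalog:
            continue                                   # already replicated
        source = replica_index[name]                   # data location step
        transfer(source)                               # data movement step
        local_catalog.add(name)                        # register the new replica

replicate(["fileA", "fileB", "fileC"])
print(local_catalog)
```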
Why not leverage dynamic deployment capabilities? State would be exposed and accessed uniformly at all levels, with provisioning, management, and monitoring at all levels:
- Physical machine: procure hardware
- Hypervisor/OS: deploy hypervisor/OS
- Virtual machine: deploy virtual machine
- JVM: deploy container
- VO services (DRS, GridFTP, Local Replica Catalog): deploy service
Maybe we need to specialize further …
- User to service provider: “Provide access to data D at S1, S2, S3 with performance P”
- Service provider to resource provider: “Provide storage with performance P1, network with P2, …”
- Realized with mechanisms such as a replica catalog, user-level multicast, …
Infrastructure Applications
Energy: progress of adoption
 
US$3
Credit: Werner Vogels
Credit: Werner Vogels
Animoto EC2 image usage, day 1 through day 8 (chart scale: 0 to 4,000).
Software / Platform / Infrastructure stack (built up over three slides):
- Software: Salesforce.com, Google, Animoto, …; science examples: caBIG, TeraGrid gateways
- Platform: Amazon, Google, Microsoft, …
- Infrastructure: Amazon, GoGrid, Sun, Microsoft, …
Dynamo: Amazon’s highly available key-value store (DeCandia et al., SOSP’07):
- Simple query model
- Weak consistency, no isolation
- Stringent SLAs (e.g., 300 ms for 99.9% of requests, at a peak of 500 requests/sec)
- Incremental scalability
- Symmetry
- Decentralization
- Heterogeneity
Technologies used in Dynamo (problem: technique, with the advantage in parentheses):
- Partitioning: consistent hashing (incremental scalability)
- High availability for writes: vector clocks with reconciliation during reads (version size is decoupled from update rates)
- Handling temporary failures: sloppy quorum and hinted handoff (high availability and durability even when some replicas are unavailable)
- Recovering from permanent failures: anti-entropy using Merkle trees (synchronizes divergent replicas in the background)
- Membership and failure detection: gossip-based membership protocol and failure detection (preserves symmetry and avoids a centralized registry for membership and node liveness information)
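Consistent hashing, the partitioning technique in the first row, places nodes and keys on the same hash ring so that each node owns one arc and adding or removing a node moves only the keys on that arc. A minimal sketch of the idea; unlike Dynamo's real ring it omits virtual nodes and the replication of each key to the N-1 successors of its coordinator.

```python
# Minimal consistent-hash ring (no virtual nodes, no replication).
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node sits at the point given by the hash of its name.
        self._points = sorted((_hash(n), n) for n in nodes)

    def owner(self, key):
        """Walk clockwise from the key's position to the first node point."""
        i = bisect.bisect(self._points, (_hash(key), ""))
        return self._points[i % len(self._points)][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("customer:42"))  # adding "node-d" would remap only one arc's keys
```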
Application (service-oriented applications) / Infrastructure (service-oriented infrastructure)
The Shape of Grids to Come? (Energy / Internet)
Killer apps for computing outside the box (COTB)?
- Biomedical informatics / evidence-based medicine
- Human responses to global climate disruption
Using IaaS in biomedical informatics (diagram: my servers in Chicago, handle.net, BIRN, an IaaS provider).
“The computer revolution hasn’t happened yet.” Alan Kay, 1997
Connectivity over time (log scale) for science, enterprise, and consumer users: grid, then cloud, then ????
“When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001)
Thank you! Computation Institute www.ci.uchicago.edu
