I presented this keynote talk at the WorldComp conference in Las Vegas, on July 13, 2009. In it, I summarize what grid is about (focusing in particular on the "integration" function, rather than the "outsourcing" function--what people call "cloud" today), using biomedical examples.
Recent Upgrades to ARM Data Transfer and Delivery Using Globus (Globus)
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Giri Prakash from the ARM Data Center at Oak Ridge National Laboratory.
Enabling Secure Data Discoverability (SC21 Tutorial) (Globus)
Major research instruments are generating orders of magnitude more data in relatively short timeframes. As a result, the research enterprise is increasingly challenged by what should be mundane tasks: describing data for discovery and making data securely accessible to the broader research community. The ad hoc methods currently employed place undue burden on scientists and system administrators alike, and it is clear that a more robust, scalable approach is required.
Bespoke data portals (and science gateways/data commons) are becoming more prominent as a means of enabling access to large datasets. In this tutorial, we demonstrate how services for authentication, authorization, metadata management, and search may be integrated with popular web frameworks, and used in combination with fast, well-architected networks to make data discoverable and accessible. Outcomes: build a simple but functional data portal that facilitates flexible data description, faceted data search, and secure data access.
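The faceted-search pattern at the heart of such a portal can be illustrated with a small in-memory index. This is a minimal pure-Python sketch, not the Globus Search API; the record fields and facet names are invented for illustration:

```python
from collections import Counter

# A toy metadata index: each record describes one dataset.
RECORDS = [
    {"title": "Mouse brain scans", "instrument": "CryoEM", "year": 2021},
    {"title": "Protein structures", "instrument": "CryoEM", "year": 2020},
    {"title": "Beamline images", "instrument": "APS", "year": 2021},
]

def faceted_search(records, filters=None, facet_fields=("instrument", "year")):
    """Return matching records plus facet counts for drill-down."""
    filters = filters or {}
    hits = [r for r in records
            if all(r.get(k) == v for k, v in filters.items())]
    facets = {f: Counter(r[f] for r in hits if f in r) for f in facet_fields}
    return hits, facets

# Filter to one year; facet counts tell the UI what drill-downs remain.
hits, facets = faceted_search(RECORDS, {"year": 2021})
print(len(hits))  # 2
```

A real portal would delegate this to a search service over an authenticated API, but the contract is the same: a filtered result set plus per-field counts that drive the faceted UI.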
A Data Ecosystem to Support Machine Learning in Materials Science (Globus)
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Blaiszik from University of Chicago and Argonne National Laboratory Data Science and Learning Division.
We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
Keynote presentation at GlobusWorld 2021. Highlights product updates and roadmap, as well as user success stories in research data management. Presented by Ian Foster, Rachana Ananthakrishnan, Kyle Chard and Vas Vasiliadis.
20160922 Materials Data Facility TMS Webinar (Ben Blaiszik)
Fall 2016 TMS Webinar on Data Curation Tools. Slides for the Materials Data Facility presentation on data services (publish and discover) as described by Ben Blaiszik. See http://www.materialsdatafacility.org for more information.
Screenshots prepared by Ben Blaiszik and Kyle Chard, used in our Globus publication demo at GlobusWorld 2014. See https://www.globus.org/data-publication for more information and the notes on the slides for details.
Gateways 2020 Tutorial - Instrument Data Distribution with Globus (Globus)
We describe the requirements for, and challenges of, distributing datasets at scale, e.g. from instruments such as CryoEM and advanced light sources. We demonstrate a web application that uses Globus to perform large-scale data distribution. We introduce and walk through a Jupyter notebook highlighting the relevant code to incorporate into a science gateway.
Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus (Globus)
We describe the automated data ingest scenario, referencing current and past research teams and their challenges. We demonstrate a web application that uses Globus to perform automated data ingest and present a faceted search interface that can be used by science gateways to simplify data discovery. We also walk through the application's GitHub repository and highlight relevant components.
Automating Research Data Management at Scale with Globus (Globus)
Research computing facilities, such as the national supercomputing centers, and shared instruments, such as cryo electron microscopes and advanced light sources, are generating large volumes of data daily. These growing data volumes make it challenging for researchers to perform what should be mundane tasks: move data reliably, describe data for subsequent discovery, and make data accessible to geographically distributed collaborators. Most employ some set of ad hoc methods, which are not scalable, and it is clear that some level of automation is required for these tasks.
Globus is an established service from the University of Chicago that is widely used for managing research data in national laboratories, campus computing centers, and HPC facilities. While its intuitive web app addresses simple file transfer and sharing scenarios, automation at scale requires integrating Globus data management platform services into custom science gateways, data portals, and other web applications in service of research. Such applications should enable automated ingest of data from diverse sources, launching of analysis runs on diverse computing resources, extraction and addition of metadata for creating search indexes, assignment of persistent identifiers, faceted search for rapid data discovery, and point-and-click downloading of datasets by authorized users — all protected by an authentication and authorization substrate that allows the implementation of flexible data access policies for both metadata and data alike.
We describe current and emerging Globus services that facilitate these automated data flows while ensuring a streamlined user experience. We also demonstrate Petreldata.net, a data management portal and gateway to multiple computing resources that supports large-scale research at the Advanced Photon Source.
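The automation pattern described above — ingest, metadata extraction, identifier assignment, indexing — can be sketched as a minimal pipeline. The functions and the "toy:" identifier scheme below are illustrative stand-ins, not the Globus Flows API or a real persistent-identifier service:

```python
import hashlib

def extract_metadata(filename, contents):
    # Stand-in for a real extractor (e.g. instrument-specific parsers).
    return {"name": filename, "size": len(contents)}

def assign_identifier(metadata):
    # Stand-in for a persistent-identifier service (DOI/ARK/minid).
    digest = hashlib.sha256(metadata["name"].encode()).hexdigest()[:8]
    return f"toy:{digest}"

def ingest(files, index):
    """Run each file through the extract -> identify -> index steps."""
    for filename, contents in files.items():
        md = extract_metadata(filename, contents)
        md["identifier"] = assign_identifier(md)
        index[md["identifier"]] = md
    return index

index = ingest({"scan_001.tif": b"\x00" * 128}, {})
print(list(index.values())[0]["size"])  # 128
```

In a production flow each step would be a separately authorized service invocation; the value of the automation is that the chain runs without a human touching any step.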
Gateways 2020 Tutorial - Large Scale Data Transfer with Globus (Globus)
We describe the large-scale data transfer scenario, referencing current and past research teams and their challenges. We demonstrate a web application that uses Globus to perform large-scale data transfers, and walk through a code repository with the web application’s code.
Talk given at the XSEDE 2012 conference in Chicago. The highlights were Dan Milroy's and Brock Palen's presentations on experiences at Colorado and Michigan.
Paper is at https://www.globusonline.org/files/2012/07/XSEDE12-Globus-Campus-Bridging.pdf
As science becomes more computation and data intensive, computing needs often exceed campus capacity. Thus we see a desire to scale from the local environment to other campuses, to national cyberinfrastructure providers such as XSEDE, and/or to cloud providers—in other words, to “bridge” to the wider world. But given the realities of limited resources, time, and expertise, campus bridging methods must be exceedingly easy to use: as easy, for example, as are Netflix and Amazon movie streaming services. We report here on experiences with a service called Globus Online, which seeks to do for campus bridging what Netflix and Amazon do for movies: that is, use powerful cloud-hosted services and simple, intuitive web interfaces to make it “so easy that your grandparent can do it.” Specifically, we describe Globus Transfer, which addresses the important campus bridging use case of moving or synchronizing data across institutional boundaries. We describe how Globus Transfer achieves both ease of use for researchers and ease of administration for campus IT staff. We provide technical details on the Globus solution; quantitative data on usage by more than 25 early adopter campuses; and experience reports from two early adopters, the University of Michigan and the University of Colorado Boulder.
Simplified Research Data Management with the Globus Platform (Globus)
Overview of the Globus research data management platform, as presented at the Fall 2018 Membership Meeting of the Coalition for Networked Information (CNI), held in Washington, D.C., December 10-11, 2018
An introduction deck for the Web of Data to my team, including basic semantic web, Linked Open Data, primer, and then DBpedia, Linked Data Integration Framework (LDIF), Common Crawl Database, Web Data Commons.
This talk was given at the IIPC General Assembly in Paris in May 2014. It introduces the distributed, parallel extraction framework provided by the Web Data Commons project. The framework is publicly accessible and tailored to the Amazon Web Services stack. The presentation also includes an excerpt of the datasets, extracted from over 100 TB of crawl data, which are likewise available at http://webdatacommons.org.
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ... (Robert Meusel)
Promoted by major search engines, schema.org has become a widely adopted standard for marking up structured data in HTML web pages. In this paper, we use a series of large-scale Web crawls to analyze the evolution and adoption of schema.org over time. The availability of data from different points in time for both the schema and the websites deploying data allows for a new kind of empirical analysis of standards adoption, which has not been possible before. To conduct our analysis, we compare different versions of the schema.org vocabulary to the data that was deployed on hundreds of thousands of Web pages at different points in time. We measure both top-down adoption (i.e., the extent to which changes in the schema are adopted by data providers) as well as bottom-up evolution (i.e., the extent to which the actually deployed data drives changes in the schema). Our empirical analysis shows that both processes can be observed.
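The paper's two measurements can be illustrated with simple set arithmetic over toy snapshots of the vocabulary and the deployed data. The type names and years below are invented for illustration, not the paper's actual figures:

```python
# Types defined by two schema versions, and types observed in two crawls.
schema_2012 = {"Person", "Product", "Review"}
schema_2013 = {"Person", "Product", "Review", "Recipe"}
deployed_2012 = {"Person", "Product", "Breadcrumb"}
deployed_2013 = {"Person", "Product", "Recipe", "Breadcrumb"}

# Top-down adoption: share of newly defined types that providers deploy.
new_types = schema_2013 - schema_2012
top_down = len(new_types & deployed_2013) / len(new_types)

# Bottom-up evolution: deployed-but-undefined types later standardized.
undefined = deployed_2012 - schema_2012
bottom_up = undefined & schema_2013

print(top_down)   # 1.0  (Recipe was adopted)
print(bottom_up)  # set() (Breadcrumb was not standardized in this toy data)
```

The real study runs this comparison over hundreds of thousands of pages and many vocabulary versions, but the two directions of influence reduce to exactly these set differences.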
Grid computing is an emerging technology. These slides cover the fundamentals of grid computing and present its various architectures with straightforward explanations.
Grid Computing - Collection of computer resources from multiple locations (Dibyadip Das)
Grid computing is the collection of computer resources from multiple locations to reach a common goal. The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files.
Grid computing is the application of several computers to a single problem at the same time. This presentation deals with the idea of grid computing, its design considerations, how a grid works, and some of the existing grids in the world today.
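The core idea — several machines attacking one problem simultaneously — can be sketched on a single host with a worker pool standing in for grid nodes. This is a minimal illustration of the divide-and-distribute pattern, not a grid middleware API:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """One 'grid node' sums its assigned slice of the problem."""
    lo, hi = bounds
    return sum(range(lo, hi))

# Split one big job (sum of 0..999999) into four independent chunks,
# farmed out to workers standing in for machines at different sites.
chunks = [(i * 250_000, (i + 1) * 250_000) for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))
print(total)  # 499999500000
```

A real grid adds what this sketch omits: scheduling across administrative domains, data staging, fault tolerance, and security — which is precisely what grid middleware exists to provide.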
68th ICREA Colloquium "The Worldwide LHC Computing Grid: Riding the computing..." (ICREA)
The World Wide Web was invented at CERN in 1991. Construction of CERN's LHC was approved in 1994. Building the data processing system required by the LHC's detectors in 1994 would have cost more than the accelerator itself. CERN and data centres from around the world started collaborating in 1999 to prototype and deploy the LHC Computing Grid, the first planetary-scale high-performance data processing system, which enabled the discovery of the Higgs boson in 2012. A review is made of these developments and their relationship to current areas of interest in data processing, such as "Big Data" and digitally supported collaborative science.
This presentation describes grid computing in depth: what a grid is, what grid computing means, why we need it, and how it works. It also covers the history and architecture of grid computing, along with advantages, disadvantages, and a conclusion.
The Grid means the infrastructure for the Advanced Web, for computing, collaboration and communication.
The goal is to create the illusion of a simple yet large and powerful self managing virtual computer out of a large collection of connected heterogeneous systems sharing various combinations of resources.
“Grid” computing has emerged as an important new field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, a high-performance orientation.
We presented the Grid concept, drawing an analogy with the electrical power grid, along with the Grid vision.
This presentation provides a basic introduction to cloud computing and grid computing, focusing mainly on a comparison of the two. It draws on several research papers.
Carl Kesselman and I (along with our colleagues Stephan Erberich, Jonathan Silverstein, and Steve Tuecke) participated in an interesting workshop at the Institute of Medicine on July 14, 2009. Along with Patrick Soon-Shiong, we presented our views on how grid technologies can help address the challenges inherent in healthcare data integration.
A Framework for Geospatial Web Services for Public Health by Dr. Leslie Lenert (Wansoo Im)
A Framework for Geospatial Web Services for Public Health
by Leslie Lenert, MD, MS, FACMI, Director
National Center for Public Health Informatics, CCHIS, CDC
June 8, 2009, URISA Public Health Conference
uploaded by Wansoo Im, Ph.D.
URISA Membership Committee Chair
http://www.gisinpublichealth.org
What is Data Commons and How Can Your Organization Build One? (Robert Grossman)
This is a talk that I gave at the Molecular Medicine Tri Conference on data commons and data sharing to accelerate research discoveries and improve patient outcomes. It also covers how your organization can build a data commons using the Open Commons Consortium's Data Commons Framework and the University of Chicago's Gen3 data commons platform.
These slides were presented in a session that we organized at the American Association for Advancement of Science (AAAS) meeting in Chicago, February 2009.
Abstract: New laboratory devices, sensor networks, high-throughput instruments, and numerical simulation systems are producing data at rates that are both without precedent and rapidly growing. The resulting increases in the size, number, and variety of data are revolutionizing scientific practice. These changes demand new computing infrastructures and tools. Until recently, most laboratories and collaborations managed their own data, operated their own computers, and used remote high-performance computers only when required. We are moving to a paradigm in which data will primarily be located and managed on remote clusters, grids, and data centers. In this symposium, we will examine the computing infrastructure designed to serve this emerging era of data-intensive computing from three perspectives: (1) that of grid computing, which enables the creation of virtual organizations that can share remote and distributed resources over the Internet; (2) that of data centers, which are transitioning to providers of integrated storage, data, compute, and collaboration services (the offering of one or more of these integrated services over the Internet is beginning to be called cloud computing); and (3) that of e-science, in which grids, Web 2.0 technologies, and new collaboration and analysis services are merging and changing the way science is conducted. Each speaker will focus on one perspective but also compare and contrast with the others.
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S... (Edward Curry)
The Real-time Linked Dataspace (RLD) is an enabling platform for data management for intelligent systems within smart environments that combines the pay-as-you-go paradigm of dataspaces, linked data, and knowledge graphs with entity-centric real-time query capabilities.
The RLD contains all the relevant information within a data ecosystem including things, sensors, and data sources and has the responsibility for managing the relationships among these participants.
It manages sources without presuming pre-existing semantic integration among them, using specialised dataspace support services that provide loose administrative proximity and semantic integration for event and stream systems. Support services leverage approximate and best-effort techniques and operate under a 5-star model for “pay-as-you-go” incremental data management.
"Infrastructure, relationships, trust, and RDA" presentation given by Mark Parsons, RDA Secretary General at the eInfrastructures & RDA for Data Intensive Science Workshop - held prior to the RDA 6th Plenary, Paris, 22 September 2015.
Conceptual Architecture for USDA and NSF Terrestrial Observation Network Inte...Brian Wee
In light of the challenges facing agriculture over the next few decades, USDA and NEON leaders have been exchanging information on strategies for leveraging existing investments. In late 2012, the USDA launched its Long-Term Agro-Ecosystem Research (LTAR) network with an initial configuration of ten sites, three of which are co-located with NEON. Discussions have focused on the establishment of partnerships and the sharing of techniques, protocols, best practices, and physical infrastructure. This poster outlines some of those ideas.
Data Mesh is a decentralized architecture in which the unit of architecture is a domain-driven data set treated as a product. Each data product is owned by the domain or team that knows the data most intimately, whether by creating it or by consuming and re-sharing it, with specific roles given the accountability and responsibility for providing that data as a product. Complexity is abstracted away into a self-serve infrastructure layer so that teams can create these products much more easily.
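The data-as-a-product idea can be sketched as a small interface contract: each domain publishes its data behind a product object with a named owner, a declared schema, and a read API, rather than exposing raw storage. All names and fields below are illustrative, not any particular data-mesh framework:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A domain-owned dataset exposed as a product."""
    name: str
    owner: str         # the accountable domain team
    schema: dict       # the published contract consumers rely on
    _rows: list = field(default_factory=list)

    def publish(self, row):
        # The owning domain validates against its own schema.
        assert set(row) == set(self.schema), "row violates schema"
        self._rows.append(row)

    def read(self):
        # Consumers get data through the product API, not raw storage.
        return list(self._rows)

orders = DataProduct("orders", owner="sales-team",
                     schema={"id": int, "total": float})
orders.publish({"id": 1, "total": 9.99})
print(len(orders.read()))  # 1
```

The self-serve infrastructure layer in a real data mesh would generate the storage, access control, and discovery plumbing behind this interface, so each team only maintains the contract.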
Accelerating Discovery via Science Services (Ian Foster)
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery; Steve Hughes, NASA; Data Publication Repositories
The 2nd Research Data Access and Preservation (RDAP) Summit
An ASIS&T Summit
March 31-April 1, 2011 Denver, CO
In cooperation with the Coalition for Networked Information
http://asist.org/Conferences/RDAP11/index.html
Global Services for Global Science, March 2023 (Ian Foster)
We are on the verge of a global communications revolution based on ubiquitous high-speed 5G, 6G, and free-space optics technologies. The resulting global communications fabric can enable new ultra-collaborative research modalities that pool sensors, data, and computation with unprecedented flexibility and focus. But realizing these modalities requires new services to overcome the tremendous friction currently associated with any actions that traverse institutional boundaries. The solution, I argue, is new global science services to mediate between user intent and infrastructure realities. I describe our experiences building and operating such services and the principles that we have identified as needed for successful deployment and operations.
The Earth System Grid Federation: Origins, Current State, Evolution (Ian Foster)
I describe the origins, current state and potential future directions for the Earth System Grid Federation, an international consortium that develops infrastructure for sharing of climate simulation and related datasets.
Keynote talk at 2022-10-11 ESnet6 launch. A lovely event by a great team. It was a pleasure to talk about how ESnet6 will enable new "smart instruments"--and some of the work that we are doing to that end.
Linking Scientific Instruments and Computation (Ian Foster)
[Talk presented at Monterey Data Conference, August 31, 2022]
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are required for configuring and running distributed computing pipelines—what we call flows—that link instruments, computers (e.g., for analysis, simulation, AI model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. We review common patterns associated with such flows and describe methods for instantiating these patterns. We present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages powerful computers for data inversion, machine learning model training, or other purposes. We also discuss implications of such methods for operators and users of scientific facilities.
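The online-analysis pattern described above — keeping only interesting subsets of a fast data stream — can be sketched as a generator pipeline. The "frames", signal scores, and threshold below are invented; a real deployment would run the filter on edge hardware near the instrument:

```python
import random

def detector_stream(n_frames, seed=0):
    """Stand-in for a detector: yields frames with a signal score."""
    rng = random.Random(seed)
    for i in range(n_frames):
        yield {"frame": i, "signal": rng.random()}

def online_filter(frames, threshold=0.9):
    """Discard uninteresting frames before they ever hit storage."""
    for f in frames:
        if f["signal"] >= threshold:
            yield f

# Only high-signal frames flow downstream to storage and analysis.
kept = list(online_filter(detector_stream(10_000)))
print(len(kept) < 10_000)  # True: most frames are discarded at the edge
```

The flows discussed in the talk generalize this single filter into multi-step pipelines that also trigger model training, simulation, and instrument steering.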
A Global Research Data Platform: How Globus Services Enable Scientific Discovery (Ian Foster)
Talk in the National Science Data Fabric (NSDF) Distinguished Speaker Series
The Globus team has spent more than a decade developing software-as-a-service methods for research data management, available at globus.org. Globus transfer, sharing, search, publication, identity and access management (IAM), automation, and other services enable reliable, secure, and efficient managed access to exabytes of scientific data on tens of thousands of storage systems. For developers, flexible and open platform APIs reduce greatly the cost of developing and operating customized data distribution, sharing, and analysis applications. With 200,000 registered users at more than 2,000 institutions, more than 1.5 exabytes and 100 billion files handled, and 100s of registered applications and services, the services that comprise the Globus platform have become essential infrastructure for many researchers, projects, and institutions. I describe the design of the Globus platform, present illustrative applications, and discuss lessons learned for cyberinfrastructure software architecture, dissemination, and sustainability.
Video is at https://www.youtube.com/watch?v=p8pCHkFFq1E
Daniel Lopresti, Bill Gropp, Mark D. Hill, Katie Schuman, and I put together a white paper on "Building a National Discovery Cloud" for the Computing Community Consortium (http://cra.org/ccc). I presented these slides at a Computing Research Association "Best Practices on using the Cloud for Computing Research Workshop" (https://cra.org/industry/events/cloudworkshop/).
Abstract from White Paper:
The nature of computation and its role in our lives have been transformed in the past two decades by three remarkable developments: the emergence of public cloud utilities as a new computing platform; the ability to extract information from enormous quantities of data via machine learning; and the emergence of computational simulation as a research method on par with experimental science. Each development has major implications for how societies function and compete; together, they represent a change in technological foundations of society as profound as the telegraph or electrification. Societies that embrace these changes will lead in the 21st Century; those that do not will decline in prosperity and influence. Nowhere is this stark choice more evident than in research and education, the two sectors that produce the innovations that power the future and prepare a workforce able to exploit those innovations, respectively. In this article, we introduce these developments and suggest steps that the US government might take to prepare the research and education system for their implications.
Big Data, Big Computing, AI, and Environmental ScienceIan Foster
I presented to the Environmental Data Science group at UChicago, with the goal of getting them excited about the opportunities inherent in big data, big computing, and AI--and to think about how to collaborate with Argonne in those areas. We had a great and long conversation about Takuya Kurihana's work on unsupervised learning for cloud classification. I also mentioned our work making NASA and CMIP data accessible on AI supercomputers.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Data Tribology: Overcoming Data Friction with Cloud AutomationIan Foster
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
Research Automation for Data-Driven DiscoveryIan Foster
Talk presented at Workshop on Maximizing the Scientific Return of NASA Data. Makes the case that automation and outsourcing of data management tasks to cloud services is essential for effective data-driven discovery. Describes how the Globus research data management platform addresses this need.
Scaling collaborative data science with Globus and JupyterIan Foster
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Going Smart and Deep on Materials at ALCFIan Foster
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
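As a toy illustration of the simplest end of the model spectrum mentioned above, a linear surrogate can be fit in closed form to (entirely synthetic) data standing in for expensive TDDFT results, and then queried instead of recomputing. The data and fitted quantities here are made up for illustration; nothing below reflects the actual MDF datasets or models.

```python
# Closed-form least-squares fit of y = a*x + b (pure Python, no ML framework).
# xs/ys are synthetic stand-ins for expensive simulation results.

def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = cov(x, y) / var(x)
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# Pretend each y cost hours of TDDFT; the surrogate answers instantly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 7.9]  # roughly y = 2x
a, b = fit_linear(xs, ys)

def predict(x):
    return a * x + b

print(round(predict(5.0), 2))  # prints 9.9 for these synthetic data
```

The same pattern scales up: replace the closed-form fit with a neural network trained on MDF data, and use its predictions to steer or prune the expensive computations.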
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by Rik Marselis and me at the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in the different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell us all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details of how to best design a sturdy architecture within ODC.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
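One piece of the approach described above, capturing a deployment bill of materials (DBOM) and gating releases against it, can be sketched as follows. This is an illustrative sketch only: the function names and policy are invented, and a real deployment firewall would consult signed metadata and vulnerability feeds rather than a hard-coded set.

```python
# Sketch of a DBOM-based deployment gate (all names invented for illustration):
# record artifact digests at deploy time, and block the deployment if any
# artifact matches a known-vulnerable digest.
import hashlib

def digest(artifact_bytes):
    return hashlib.sha256(artifact_bytes).hexdigest()

def build_dbom(artifacts):
    """artifacts: {name: bytes}. Returns {name: sha256 digest}, the DBOM record."""
    return {name: digest(data) for name, data in artifacts.items()}

def gate_deployment(dbom, vulnerable_digests):
    """Return the blocked artifacts; an empty list means the deploy may proceed."""
    return [name for name, d in dbom.items() if d in vulnerable_digests]

artifacts = {"app.jar": b"app-v1", "lib.so": b"lib-v3"}
dbom = build_dbom(artifacts)
blocked = gate_deployment(dbom, {digest(b"lib-v3")})
print(blocked)  # the vulnerable library is flagged before release
```

The key design point is that the DBOM is captured at deployment time, so the gate reflects what actually ships rather than what the build manifest claimed.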
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
- Create a campaign using Mailchimp with merge tags/fields
- Send an interactive Slack channel message (using buttons)
- Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
- Your campaign sent to target colleagues for approval
- If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
- If the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
- State of global ICS asset and network exposure
- Sectoral targets and attacks, as well as the cost of ransom
- Global APT activity, AI usage, actor and tactic profiles, and implications
- Rise in volumes of AI-powered cyberattacks
- Major cyber events in 2024
- Malware and malicious payload trends
- Cyberattack types and targets
- Vulnerability exploit attempts on CVEs
- Attacks on counties – USA
- Expansion of bot farms – how, where, and why
- In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
- Why are attacks on smart factories rising?
- Cyber risk predictions
- Axis of attacks – Europe
- Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
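For reference, the integration demonstrated in such setups is typically wired up through JMeter's stock Backend Listener with its InfluxDB client. A minimal parameter set looks roughly like the following; the host, database, and application names are placeholders, not values from the webinar.

```text
# Backend Listener implementation (shipped with JMeter):
#   org.apache.jmeter.visualizers.backend.influxdb.InfluxdbBackendListenerClient

influxdbUrl = http://localhost:8086/write?db=jmeter   # InfluxDB 1.x write endpoint
application = my_app        # tag used to filter dashboards in Grafana
measurement = jmeter        # InfluxDB measurement to write into
summaryOnly = false         # send per-sampler metrics, not just totals
percentiles = 90;95;99      # response-time percentiles to record
```

Grafana then reads the same database through an InfluxDB data source, so live test metrics appear on dashboards as the load test runs.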
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Grid Computing July 2009
1. Grid computing Ian Foster Computation Institute Argonne National Lab & University of Chicago
2. “When the network is as fast as the computer’s internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001)
4. “Computation may someday be organized as a public utility … The computing utility could become the basis for a new and important industry.” John McCarthy (1961)
9. We need to function in the zone of complexity (Ralph Stacey, Complexity and Creativity in Organizations, 1996). [Chart: agreement about outcomes vs. certainty about outcomes, each from low to high; “plan and control” at high agreement/certainty, “chaos” at low, and the zone of complexity between them.]
17. The Grid paradigm and information integration. [Diagram: data sources (radiology, medical records, pathology, genomics, labs, RHIO) and platform services that name resources and move data around; make resources usable and useful; make resources accessible over the network; and manage who can do what.]
18. The Grid paradigm and information integration (continued). [Diagram: platform services now comprise management, integration, publication, and security and policy; above them, services transform data into knowledge, enhance user cognitive processes, and incorporate results into business processes.]
19. The Grid paradigm and information integration (continued). [Diagram: data sources; platform services (management, integration, publication, security and policy); value services (analysis, cognitive support); applications.]
22. Identity-based authZ: most simple, not scalable. Unix Access Control Lists (Discretionary Access Control, DAC): groups, directories, simple admin. POSIX ACLs/MS-ACLs: finer-grained admin policy. Role-Based Access Control (RBAC): separation of role/group from rule admin. Mandatory Access Control (MAC): clearance, classification, compartmentalization. Attribute-Based Access Control (ABAC): generalization of attributes. (Axis: increasing policy language abstraction level and expressiveness.)
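The progression on this slide, from identity lists to roles to attributes, can be made concrete with a tiny sketch. All users, roles, and attributes below are invented for illustration.

```python
# Toy contrast between RBAC and ABAC checks (all data invented).

# RBAC: permissions attach to roles, and users are assigned roles.
role_perms = {"clinician": {"read_record"},
              "admin": {"read_record", "delete_record"}}
user_roles = {"alice": {"clinician"}, "bob": {"admin"}}

def rbac_allows(user, perm):
    """A user may act if any of their roles carries the permission."""
    return any(perm in role_perms[r] for r in user_roles.get(user, ()))

# ABAC: a policy is a predicate over arbitrary subject/resource attributes.
def abac_allows(subject, resource):
    """Example policy: same-department clinicians may read non-restricted records."""
    return (subject["dept"] == resource["dept"]
            and "clinician" in subject["roles"]
            and not resource["restricted"])

print(rbac_allows("alice", "read_record"),    # True: clinicians may read
      rbac_allows("alice", "delete_record"))  # False: only admins may delete
```

The ABAC predicate illustrates the slide's point about expressiveness: the policy can reference any attribute of subject and resource, not just a pre-enumerated role list.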
29. Children’s Oncology Group enterprise/grid interface service. [Diagram: DICOM protocols on one side, grid protocols (web services) on the other; plug-in adapters for DICOM, XDS, HL7, and vendor-specific interfaces; a wide-area service actor bridges the two.]
31. As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical).
36. Health Object Identifier (HOI) naming system: uri:hdl://888.us.npi.1234567890.dicom/8A648C33-A5…4939EBE. The HOI’s URI schema identifier is based on Handle. 888 is CHI’s top-level naming authority; the National Provider Id (1234567890) is used in the hierarchical identifier namespace; the application context’s namespace (dicom) is governed by the provider naming authority; the identifier body is a random string, PHI-free and guaranteed unique.
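The structure of such an identifier can be illustrated by splitting a handle-style URI into its naming authority and random body. The field layout below follows the slide, but the example identifier itself is fabricated (the slide's real identifier body is elided), and the parser is a sketch, not any real HOI implementation.

```python
# Split a handle-style Health Object Identifier (HOI) into its parts.
# Layout assumed from the slide: uri:hdl://<authority>/<body>, with
# authority = <prefix>.<country>.<id-scheme>.<provider-id>.<context>.
# The example identifier is fabricated for illustration.

def parse_hoi(hoi):
    scheme_prefix = "uri:hdl://"
    assert hoi.startswith(scheme_prefix), "not a handle-style URI"
    authority, body = hoi[len(scheme_prefix):].split("/", 1)
    prefix, country, id_scheme, provider, context = authority.split(".")
    return {"prefix": prefix,        # top-level naming authority (e.g., CHI's 888)
            "country": country,
            "id_scheme": id_scheme,  # e.g., "npi" = National Provider Id
            "provider": provider,
            "context": context,      # application context, governed by the provider
            "body": body}            # random, PHI-free identifier body

parts = parse_hoi("uri:hdl://888.us.npi.1234567890.dicom/8A648C33A5B14939EBE")
print(parts["provider"], parts["context"])  # prints: 1234567890 dicom
```

Because the body is random, nothing about the patient can be recovered from the name itself; resolution of the handle is where access control applies.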
39. Integration: making information useful. [Chart: degree of communication (0–100%) vs. degree of prior syntactic and semantic agreement (0–100%), with three approaches plotted: rigid standards-based, loosely coupled, and adaptive.]
41. ECOG 5202 integrated sample management ECOG CC ECOG PCO MD Anderson Web portal OGSA-DQP OGSA-DAI OGSA-DAI OGSA-DAI Mediator
44. Many, many tasks: identifying potential drug targets. 2M+ ligands x protein target(s). (Mike Kubal, Benoit Roux, and others)
45. Docking workflow. For one target: ~4 million tasks, ~500,000 CPU-hours (~50 CPU-years). Inputs: PDB protein descriptions (1 protein, 1 MB) and ZINC 3-D structures (2M ligands, 6 GB). Stages: manually prep the DOCK6 and FRED receptor files (one per protein, defining the pocket to bind to); run DOCK6 and FRED screening (~4M tasks x 60 s x 1 CPU, ~60K CPU-hours); select best ~5K; Amber scoring via BuildNABScript and a NAB script template (parameters define flexible residues and #MD steps), with steps 1. AmberizeLigand, 2. AmberizeReceptor, 3. AmberizeComplex, 4. perl: generate NAB script, 5. RunNABScript (~10K tasks x 20 min x 1 CPU, ~3K CPU-hours); select best ~500; GCMC (~500 tasks x 10 h x 100 CPUs, ~500K CPU-hours); report.
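The back-of-the-envelope numbers on this slide can be checked directly:

```python
# Check the slide's order-of-magnitude cost estimates for one protein target.

def cpu_hours(tasks, seconds_per_task, cpus_per_task=1):
    return tasks * seconds_per_task * cpus_per_task / 3600

dock  = cpu_hours(4_000_000, 60)        # DOCK6/FRED: ~4M tasks x 60 s
amber = cpu_hours(10_000, 20 * 60)      # Amber: ~10K tasks x 20 min
gcmc  = cpu_hours(500, 10 * 3600, 100)  # GCMC: ~500 tasks x 10 h x 100 CPUs

# DOCK/FRED lands near the slide's ~60K figure, Amber near ~3K, GCMC at 500K,
# for a total on the order of the slide's ~500,000 CPU-hours (~50 CPU-years).
print(round(dock), round(amber), round(gcmc))  # prints: 66667 3333 500000
```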
47. Scaling POSIX to petascale. [Diagram: a large dataset on the global file system is staged over torus and tree interconnects onto a CN-striped intermediate file system, using Chirp (multicast) and MosaStore (striping), and then to local LFS on compute nodes holding local datasets.]
48. Efficiency for 4-second tasks and varying data size (1 KB to 1 MB) for CIO and GPFS, up to 32K processors.
53. Functioning in the zone of complexity (Ralph Stacey, Complexity and Creativity in Organizations, 1996). [Chart: agreement about outcomes vs. certainty about outcomes, each from low to high; “plan and control” at high agreement/certainty, “chaos” at low.]
54. The Grid paradigm and information integration. [Diagram: data sources (radiology, medical records, pathology, genomics, labs, RHIO); platform services (management, integration, publication, security and policy); value services (analysis, cognitive support); applications.]
55. “The computer revolution hasn’t happened yet.” Alan Kay, 1997
56. [Chart: connectivity (log scale) over time, with curves for science, enterprise, and consumer, annotated “Grid,” “Cloud,” and “????”.] “When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001)
With high-speed networks, the Internet becomes more than a communications device—it becomes a computing device. We can disintegrate the computer, outsourcing computing and storage, for example. And we can aggregate capabilities (data and software; computing and storage) from many places. The outsourcing/on-demand part is what people have called grid, utility computing, and more recently infrastructure as a service or cloud. It seems to be going mainstream, which is very exciting (and about time!). It’s worth remembering that these ideas are old.
What I want to focus on today is the aggregation part, and in particular on the “virtual organization” concept. Let me remind us of another comment made back in 2001.
Early on, people realized that it didn’t make sense for people to travel to computers—that we should be able to compute outside the box. For example, AI pioneer John McCarthy spoke in these terms in 1961, at the launch of Project MAC (?) Here he is a couple of years ago, as such an industry is just emerging. It takes a while.
We cite [Rouse, Health Care as a CAS: Implications for Design…, NAE 2008] for the right-hand side part. Must support: dynamic composition for a specific purpose; an evolving community, function, and environment; messy data, failure, and incomplete knowledge. Nice, but insufficient: data standards, platform standards, federal policies.
Another perspective on the problem. A few words of explanation: if we are deploying a hospital IT system, we have… [Add other regions of agreement.] You can’t achieve success via central planning. (Quoted in Crossing the Quality Chasm, p. 312.)
We could show these things as moving if we wanted to be really clever Over time, things change, these groups evolve. If we are successful, they merge
Foster, Kesselman, and Tuecke claimed that grids were all about “virtual organizations.” The way one should interpret that claim, I would assert, is in the context of Gilder’s comments. Things are distributed, for one reason or another—either via a deliberate disintegration process, via outsourcing, or because they just started out distributed. Now we need to reassemble them in a controlled manner. We gave some examples.
The first encompasses what people are tending to call “cloud” today. The fourth of course we are quite familiar with! Today, I would use some additional examples, taken from healthcare—a field that I believe will be the “killer app” for VO technologies
In particular, the organizational behavior and management community, who have studied virtual organizations for many years. Our VOs have a lot in common with theirs, but also differences—we’re not just about people, and maybe not even particularly about people. Fortunately we were able to speak to a lot of these people a couple of years ago, via some NSF workshops we organized.
The results are online – “a blueprint for advancing the design, development, and evaluation of virtual organizations.” One interesting anecdote: I found that just as computer scientists can resent being brought into collaborative projects to “write code,” so organizational people can resent being brought in to “fix organizations.” One thing I learned was that …
Technology that has been under development for some years. [Include Globus logo.] Examples: caGrid, BIRN, LHC.
Sharing relationships form and devolve dynamically—e.g., temporally. [Picture on left?]
“Make data usable and useful”: initially, I had “Address syntactic, semantic differences.”
Talk about API vs. protocol. Add “ilities” and function benefits to stack.
[Create an image here.] For example, DICOM and HL7 combine messaging and data model in the same interoperability standard. People are contextualizing this problem at the data interoperability level; systems interoperability is often neglected. An area of differentiation: bringing best practice in industry and science into the health care space. Open source platform. Experience with systems interoperability standards: IETF, OASIS, W3C.
Attribute authorities emerge as an important system component. Bridge between local and global: honest broker is an example. Not sure what “policy in the network” means.
List services from
DO SOMETHING INTERESTING ON THE RIGHT Scaling via automating data adapters Representations of those things and semantics of those representations. Talk about how services are published, data modeling, etc. Publish data bases Publish services Name published objects
Why childhood cancer? Rare. Five-year survival rates for all childhood cancers combined increased from 58.1 percent in 1975–77 to 79.6 percent in 1996–2003.
Built using the same mechanisms used to build SOI: PKI, delegation, attribute-based authorization; registries, monitoring. Operating a service is a pain! It would be nice to outsource, but services need to be near the data, which also has privacy concerns. So things become complicated.
Objects are published; they need to be named; then they can be moved around without losing track of them. Bulk data movement. Fine-grain access for data integration.
GridFTP = high-performance data movement, multiple protocols, credential delegation, restart. RLS = P2P system, soft state, Bloom filters. BUT: the services themselves are operated by the LIGO community. Running persistent, reliable, scalable services is expensive and difficult.
Clinical, administrative, research. Issues are often hidden and escalate. Uniqueness: no guaranteed global uniqueness. Name ownership: no ability to prove that a certain entity issued that name. PHI-tainted names: filenames for some images have the patient ID embedded; sharing the name alone may constitute a HIPAA violation.
Talk about handle….
TO PUT IN A SLIDE? Loose coupling and encapsulation. Interoperability through integration based on data mediation. Evolutionary in nature. A set of scalable systems and methods. Explicit in architecture: a data integration layer. Demonstrated in GSI, GridFTP, MDS, ECOG.
This would be a good place for a graphic, perhaps showing top down vs. bottom up.
No coordinated data systems: Excel spreadsheet, web service to application, Oracle database.
Workflows are becoming a widespread mechanism for coordinating the execution of scientific services and linking scientific resources: analytical and data processing pipelines. Is this stuff real? EBI saw 3 million+ web service API submissions in 2007. A lot? We want to publish workflows as services. Think of caBIG services as service providers that in turn invoke grid services to execute work (e.g., via TeraGrid gateways).
"Docking" is the identification of the low-energy binding modes of a small molecule (ligand) within the active site of a macromolecule (receptor) whose structure is known. A compound that interacts strongly with (i.e., binds) a receptor associated with a disease may inhibit its function and thus act as a drug. Typical workload: application size 7 MB (static binary); static input data 35 MB (binary and ASCII text); dynamic input data 10 KB (ASCII text); output data 10 KB (ASCII text); expected execution time 5-5,000 seconds; parameter space 1 billion tasks.
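The workload figures above imply a striking aggregate: even though each task touches only ~20 KB of dynamic data, a billion tasks add up. A quick back-of-envelope check (static data is excluded, since it is cached once per node):

```python
# Per-task figures from the workload description above.
tasks = 1_000_000_000        # parameter space: 1 billion tasks
per_task_io_kb = 10 + 10     # dynamic input (10 KB) + output (10 KB)

# Aggregate dynamic I/O across the whole sweep, in TB (1 TB = 1024**3 KB).
total_tb = tasks * per_task_io_kb / 1024**3
print(f"~{total_tb:.1f} TB of per-task I/O")  # ~18.6 TB
```

So the sweep moves on the order of tens of terabytes of small files, which is exactly the regime where per-task overheads dominate and frameworks like Falkon earn their keep.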
More precisely, step 3 is "GCMC + hydration." Mike Kubal says: "This task is a Free Energy Perturbation computation using the Grand Canonical Monte Carlo algorithm for modeling the transition of the ligand (compound) between different potential states and the General Solvent Boundary Partition to explicitly model the water molecules in the volume around the ligand and pocket of the protein. The result is a binding energy just like the task at the top of the funnel; it is just a more rigorous attempt to model the actual interaction of protein and compound. To refer to the task in shorthand, you can use 'GCMC + hydration'. This is a method that Benoit has pioneered."
Application efficiency was computed between the 16-rack and 32-rack runs. Sustained utilization is the utilization achieved during the part of the experiment in which there was enough work to do, 0 to 5,300 sec. Overall utilization is the number of CPU hours used divided by the total number of CPU hours allocated. The experiment included caching the 36 MB (52 MB uncompressed) archive on each node at first access. We use "dd" to move data to and from GPFS. The application itself had some bad I/O patterns on the write path, which prevented it from scaling well, so we decided to write to RAM and then dd back to GPFS. For this particular run, we had 464 Falkon services running on 464 I/O nodes, 118K workers (256 per Falkon service), and 1 client on a login node. The 32-rack job took 15 minutes to start. It took the client 6 minutes to establish a connection and set up the corresponding state with all 464 Falkon services. It took the client 40 seconds to dispatch 118K tasks to 118K CPUs. The rest can be seen from the graph and slide text.
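The two utilization metrics defined above differ only in the denominator's time window. A tiny sketch with invented numbers (these are illustrative, not the figures from the run described above):

```python
def utilization(cpu_hours_used, cpu_hours_allocated):
    """Fraction of allocated CPU hours actually spent on work."""
    return cpu_hours_used / cpu_hours_allocated

# Hypothetical example: 118,000 CPUs held for 2.0 hours, of which
# 1.6 hours per CPU went to application work. "Sustained" would use
# only the window in which work was available as the denominator.
allocated = 118_000 * 2.0
used = 118_000 * 1.6
print(f"overall utilization = {utilization(used, allocated):.0%}")  # overall utilization = 80%
```

The gap between sustained and overall utilization is thus a direct measure of ramp-up and tail effects (e.g., the 15-minute start and 6-minute connection setup noted above).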
We could show these things as moving if we wanted to be really clever. Over time, things change and these groups evolve. If we are successful, they merge.
Talk about API vs. protocol. Add the "ilities" and functional benefits to the stack.
Because we are still mostly computing inside the box
Why now? The law of unexpected consequences: as with the Web, it was not just Tim Berners-Lee's genius but also disk drive capacity. What will happen when ubiquitous high-speed wireless means we can all reach any service anytime, and powerful tools mean we can author our own services? A fascinating set of challenges: -- What sort of services? Applications? -- What does openness mean in this context? -- How do we address interoperability, portability, composition? -- Accounting, security, audit?