Grid Computing July 2009

I presented this keynote talk at the WorldComp conference in Las Vegas, on July 13, 2009. In it, I summarize what grid is about (focusing in particular on the "integration" function, rather than the "outsourcing" function--what people call "cloud" today), using biomedical examples in particular.

Slide notes
  • With high-speed networks, the Internet becomes more than a communications device—it becomes a computing device. We can disintegrate the computer – outsourcing computing and storage, for example. And we can aggregate capabilities (data and software; computing and storage) from many places. The outsourcing/on-demand part is what people have called grid, utility computing, and more recently infrastructure as a service or cloud. It seems to be going mainstream, which is very exciting (and about time!). It’s worth remembering that these ideas are old.
  • What I want to focus on today is the aggregation part, and in particular on the “virtual organization” concept. Let me remind us of another comment made back in 2001.
  • Early on, people realized that it didn’t make sense for people to travel to computers—that we should be able to compute outside the box. For example, AI pioneer John McCarthy spoke in these terms in 1961, at the launch of Project MAC (?) Here he is a couple of years ago, as such an industry is just emerging. It takes a while.
  • We cite [Rouse, Health Care as a CAS: Implications for Design…, NAE 2008] for the right-hand side part. Must support: dynamic composition for a specific purpose; evolving community, function, environment; messy data, failure, incomplete knowledge. Nice, but insufficient: data standards, platform standards, federal policies.
  • Another perspective on the problem. A few words of explanation. If we are deploying a hospital IT system, we have Add other regions of agreement. You can’t achieve success via central planning. Quoted in Crossing the Quality Chasm, p. 312
  • We could show these things as moving if we wanted to be really clever  Over time, things change, these groups evolve. If we are successful, they merge
  • Foster, Kesselman, and Tuecke claimed that grids were all about “virtual organizations.” The way one should interpret that claim, I would assert, is in the context of Gilder’s comments. Things are distributed, for one reason or another—either via deliberate disintegration process, via outsourcing, or because they just started out distributed. Now we need to reassemble them, in a controlled manner.  We gave some examples
  • The first encompasses what people are tending to call “cloud” today. The fourth of course we are quite familiar with! Today, I would use some additional examples, taken from healthcare—a field that I believe will be the “killer app” for VO technologies
  • In particular, the organizational behavior and management community, who have studied virtual organizations for many years. Our VOs have a lot in common with theirs, but also differences—we’re not just about people, and maybe not even particularly about people. Fortunately we were able to speak to a lot of these people a couple of years ago, via some NSF workshops we organized.
  • The results are online – “a blueprint for advancing the design, development, and evaluation of virtual organizations.” One interesting anecdote: I found that just as CS can resent being brought into collaborative projects to “write code,” so organizational people can resent being brought in to “fix organizations”  One thing I learned was that …
  • Technology that has been under development for some years. Include Globus logo. caGrid, BIRN, LHC.
  • Sharing relationships form and devolve dynamically—e.g., temporally Picture on left?
  • “ Make data usable and useful”  initially, I had “Address syntactic, semantic differences”
  • Talk about API vs Protocol Add “ilities,” function benefits to stack.
  • Talk about API vs Protocol Add “ilities,” function benefits to stack.
  • [Create an image here.] For example DICOM and HL7 combine messaging and data model in the same interoperability standard. People are contextualizing this problem at the data interoperability level.  Systems interoperability often neglected.  An area of differentiation, bringing in best practice in industry and science into health care space. Open source platform.  Experience with systems interoperability standards: IETF, OASIS, W3C, 
  • Attribute authorities emerge as an important system component. Bridge between local and global: honest broker is an example. Not sure what “policy in the network” means.
  • List services from
  • DO SOMETHING INTERESTING ON THE RIGHT Scaling via automating data adapters Representations of those things and semantics of those representations. Talk about how services are published, data modeling, etc. Publish data bases Publish services Name published objects
  • Why childhood cancer? Rare. Five-year survival rates for all childhood cancers combined increased from 58.1 percent in 1975-77 to 79.6 percent in 1996-2003.
  • Built using the same mechanisms used to build SOI: PKI, delegation, attribute-based authorization; registries, monitoring. Operating a service is a pain! Would be nice to outsource. But the services need to be near the data, which also has privacy concerns. So things become complicated.
  • Objects are published, they need to be named, then they can be moved around without losing track of them Bulk data movement Fine grain access for data integration
  • GridFTP = high-perf data movement, multiple protocols, credential delegation, restart RLS = P2P system, soft state, Bloom filters, BUT: the services themselves are operated by the LIGO community. Running persistent, reliable, scalable services is expensive and difficult
  • Clinical, administrative, research. Issues often hidden and escalate. Uniqueness: no guaranteed global uniqueness. Name ownership: no ability to prove that a certain entity issued that name. PHI-tainted names: filenames for some images have the patient ID embedded – sharing of the name alone may constitute a HIPAA violation.
  • Talk about handle….
  • TO PUT IN A SLIDE? Loose coupling and encapsulation Interoperability through integration based on data mediation Evolutionary in nature Set of scalable systems and methods Explicit in architecture – data integration layer Demonstrated in GSI, GridFTP, MDS, ECOG
  • This would be a good place for a graphic, perhaps showing top down vs. bottom up.
  • No coordinated data systems: Excel spreadsheet, web service to application, Oracle database.
  • DO SOMETHING INTERESTING ON THE RIGHT Scaling via automating data adapters Representations of those things and semantics of those representations. Talk about how services are published, data modeling, etc. Publish data bases Publish services Name published objects
  • Workflows are becoming a widespread mechanism for coordinating the execution of scientific services and linking scientific resources. Analytical and data processing pipelines. Is this stuff real? EBI: 3 million+ web service API submissions in 2007. A lot? We want to publish workflows as services. Think of caBIG services as service providers that then invoke grid services to execute services. (E.g., via TeraGrid gateways.)
  • "docking" is the identification of the low-energy binding modes of a small molecule (ligands) within the active site of a macromolecule (receptor) whose structure is known A compound that interacts strongly with (i.e. binds) a receptor associated with a disease may inhibit its function and thus act as a drug Typical Workload: Application Size: 7MB (static binary) Static input data: 35MB (binary and ASCII text) Dynamic input data:10KB (ASCII text) Output data: 10KB (ASCII text) Expected execution time: 5~5000 seconds Parameter space: 1 billion tasks
  • More precisely, step 3 is “GCMC + hydration.” Mike Kubal says: “This task is a Free Energy Perturbation computation using the Grand Canonical Monte Carlo algorithm for modeling the transition of the ligand (compound) between different potential states and the General Solvent Boundary Partition to explicitly model the water molecules in the volume around the ligand and pocket of the protein. The result is a binding energy just like the task at the top of the funnel; it is just a more rigorous attempt to model the actual interaction of protein and compound. To refer to the task in short hand, you can use "GCMC + hydration". This is a method that Benoit has pioneered.”
  • Application efficiency was computed between the 16-rack and 32-rack runs. Sustained utilization is the utilization achieved during the part of the experiment while there was enough work to do, 0 to 5300 sec. Overall utilization is the number of CPU hours used divided by the total number of CPU hours allocated. The experiment included caching the 36 MB (52 MB uncompressed) archive on each node at first access. We use “dd” to move data to and from GPFS. The application itself had some bad I/O patterns in the write, which prevented it from scaling well, so we decided to write to RAM and then dd back to GPFS. For this particular run, we had 464 Falkon services running on 464 I/O nodes, 118K workers (256 per Falkon service), and 1 client on a login node. The 32-rack job took 15 minutes to start. It took the client 6 minutes to establish a connection and set up the corresponding state with all 464 Falkon services. It took the client 40 seconds to dispatch 118K tasks to 118K CPUs. The rest can be seen from the graph and slide text.
  • We could show these things as moving if we wanted to be really clever  Over time, things change, these groups evolve. If we are successful, they merge
  • Talk about API vs Protocol Add “ilities,” function benefits to stack.
  • Because we are still mostly computing inside the box
  • Why now? Law of unexpected consequences—like Web: not just Tim Berners-Lee’s genius, but also disk drive capacity What will happen when ubiquitous high-speed wireless means we can all reach any service anytime—and powerful tools mean we can author our own services? Fascinating set of challenges -- What sort of services? Applications? -- What does openness mean in this context? -- How do we address interoperability, portability, composition? -- Accounting, security, audit?

Transcript

  • 1. Grid computing. Ian Foster, Computation Institute, Argonne National Lab & University of Chicago
  • 2. “When the network is as fast as the computer’s internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001)
  • 3.
    • “I’ve been doing cloud computing since before it was called grid.”
  • 4. “Computation may someday be organized as a public utility … The computing utility could become the basis for a new and important industry.” John McCarthy (1961)
  • 5. Scientific collaboration Scientific collaboration
  • 6. Addressing urban health needs
  • 7. Important characteristics
    • We must integrate systems that may not have worked together before
    • These are human systems, with differing goals, incentives, capabilities
    • All components are dynamic—change is the norm, not the exception
    • Processes evolve rapidly also
    We are not building something simple like a bridge or an airline reservation system
  • 8. We are dealing with complex adaptive systems
    • A complex adaptive system is a collection of individual agents that have the freedom to act in ways that are not always predictable and whose actions are interconnected such that one agent’s actions changes the context for other agents.
    • Crossing the Quality Chasm, IOM, 2001; pp 312-13
    • Non-linear and dynamic
    • Agents are independent and intelligent
    • Goals and behaviors often in conflict
    • Self-organization through adaptation and learning
    • No single point(s) of control
    • Hierarchical decomposition has limited value
  • 9. We need to function in the zone of complexity Ralph Stacey, Complexity and Creativity in Organizations , 1996 Low Low High High Agreement about outcomes Certainty about outcomes Plan and control Chaos Zone of complexity
  • 10. We need to function in the zone of complexity Ralph Stacey, Complexity and Creativity in Organizations , 1996 Low Low High High Agreement about outcomes Certainty about outcomes Plan and control Chaos
  • 11. “The Anatomy of the Grid,” 2001
    • The … problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering. This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization (VO).
  • 12. Examples (from AotG, 2001)
    • “The application service providers, storage service providers, cycle providers, and consultants engaged by a car manufacturer to perform scenario evaluation during planning for a new factory”
    • “Members of an industrial consortium bidding on a new aircraft”
    • “A crisis management team and the databases and simulation systems that they use to plan a response to an emergency situation”
    • “Members of a large, international, multiyear high-energy physics collaboration”
  • 13. From the organizational behavior and management community
    • “[A] group of people who interact through interdependent tasks guided by common purpose [that] works across space, time, and organizational boundaries with links strengthened by webs of communication technologies”
    • — Lipnack & Stamps, 1997
    • Yes—but adding cyber-infrastructure:
      • People → computational agents & services
      • Communication technologies → IT infrastructure
    Collaboration based on rich data & computing capabilities
  • 14. NSF Workshops on Building Effective Virtual Organizations
    • [Search “BEVO 2008”]
  • 15. The Grid paradigm
    • Principles and mechanisms for dynamic VOs
    • Leverage service oriented architecture (SOA)
    • Loose coupling of data and services
    • Open software, architecture
    [Timeline, 1995-2010: computer science, physics, astronomy, engineering, biology, biomedicine, healthcare]
  • 16. We call these groupings virtual organizations (VOs)
    • Healthcare = dynamic, overlapping VOs, linking
      • Patient – primary care
      • Sub-specialist – hospital
      • Pharmacy – laboratory
      • Insurer – …
    A set of individuals and/or institutions engaged in the controlled sharing of resources in pursuit of a common goal. But the U.S. health system is marked by fragmented and inefficient VOs with insufficient mechanisms for controlled sharing
      • “I advocate … a model of virtual integration rather than true vertical integration …” —G. Halvorson, CEO, Kaiser
  • 17. The Grid paradigm and information integration. [Diagram: data sources (radiology, medical records, pathology, genomics, labs, RHIO) and platform services: make resources accessible over the network; name resources, move data around; make resources usable and useful; manage who can do what]
  • 18. The Grid paradigm and information integration. [Diagram: the same data sources and platform services, now labeled Publication, Management, Integration, and Security and policy, plus higher-level goals: transform data into knowledge, enhance user cognitive processes, incorporate into business processes]
  • 19. The Grid paradigm and information integration. [Diagram: data sources; platform services (Publication, Management, Integration, Security and policy); value services (Analysis, Cognitive support, Applications)]
  • 20. We partition the multi-faceted interoperability problem
    • Process interoperability
      • Integrate work across healthcare enterprise
    • Data interoperability
      • Syntactic: move structured data among system elements
      • Semantic: use information across system elements
    • Systems interoperability
      • Communicate securely, reliably among system elements
    Analysis Management Integration Publication Applications
  • 21. Security and policy: Managing who can do what
    • Familiar division of labor
    • Publication level: bridge between local and global
    • Integration level: VO-specific policies, based on attributes
    • → Attribute authorities
  • 22.
    • Identity-based authZ: most simple, not scalable
    • Unix Access Control Lists (Discretionary Access Control: DAC): groups, directories, simple admin
    • POSIX ACLs / MS ACLs: finer-grained admin policy
    • Role-Based Access Control (RBAC): separation of role/group from rule admin
    • Mandatory Access Control (MAC): clearance, classification, compartmentalization
    • Attribute-Based Access Control (ABAC): generalization of attributes
    >>> Policy language abstraction level and expressiveness >>>
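    To make the ABAC end of this spectrum concrete, here is a minimal sketch (illustrative Python, not the GAARDS or caGrid API; all names are hypothetical) of an attribute-based check: an attribute authority asserts a user's attributes, and a VO-level rule decides whether an action on a resource is allowed.

        # Hypothetical ABAC sketch: attributes come from a trusted attribute
        # authority; a VO-specific policy rule evaluates them for a request.

        def get_attributes(user_id):
            # Real deployments would verify signed assertions from an attribute
            # authority; here we return a static example.
            return {
                "alice@example.org": {"role": "radiologist",
                                      "vo": "pediatric-oncology-vo",
                                      "irb_approved": True},
            }.get(user_id, {})

        def policy_allows(attrs, action, resource):
            # VO rule: only IRB-approved radiologists in the VO may read images.
            return (action == "read"
                    and resource.startswith("dicom://")
                    and attrs.get("vo") == "pediatric-oncology-vo"
                    and attrs.get("role") == "radiologist"
                    and attrs.get("irb_approved", False))

        if __name__ == "__main__":
            attrs = get_attributes("alice@example.org")
            print(policy_allows(attrs, "read", "dicom://site-a/study/1"))    # True
            print(policy_allows(attrs, "delete", "dicom://site-a/study/1"))  # False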
  • 23. Globus / caGrid GAARDS
  • 24. Publication: Make information accessible
    • Make data available in a remotely accessible, reusable manner
    • Leave mediation for integration layer
    • Gateway from local policy/protocol into wide area mechanisms (transport, security, …)
  • 25. TeraGrid participants
  • 26. Federating computers for physics data analysis
  • 27.  
  • 28. Earth System Grid
    • Main ESG Portal
      • 198 TB of data at four locations
      • 1,150 datasets
      • 1,032,000 files
      • Includes the past 6 years of joint DOE/NSF climate modeling experiments
      • Downloads to date: 49 TB, 176,000 files
    • CMIP3 (IPCC AR4) ESG Portal
      • 35 TB of data at one location
      • 74,700 files
      • Generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change
      • Data from 13 countries, representing 25 models
      • Downloads to date: 387 TB, 1,300,000 files, 500 GB/day (average)
    • 8,000 registered users, 1,900 registered projects
    • 400 scientific papers published to date based on analysis of CMIP3 (IPCC AR4) data
    • ESG usage: over 500 sites worldwide
    [Chart: ESG monthly download volumes] Globus
  • 29. Children’s Oncology Group: Enterprise/Grid interface service bridging DICOM protocols and Grid protocols (Web services), with plug-in adapters (DICOM, XDS, HL7, vendor-specific) behind a wide-area service actor
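    A minimal sketch of the plug-in adapter idea behind such a gateway (illustrative Python only; the class and function names are hypothetical, not the actual caGrid or IHE interfaces): each enterprise protocol gets an adapter that normalizes messages before the wide-area grid service publishes them.

        # Hypothetical adapter pattern for an enterprise/grid gateway.
        # Each adapter translates one enterprise protocol (DICOM, HL7, ...) into
        # a common record that the wide-area grid service can publish.

        from abc import ABC, abstractmethod

        class EnterpriseAdapter(ABC):
            @abstractmethod
            def to_grid_record(self, raw_message: bytes) -> dict:
                """Normalize a protocol-specific message into a common record."""

        class DicomAdapter(EnterpriseAdapter):
            def to_grid_record(self, raw_message: bytes) -> dict:
                # Real code would parse the DICOM object; this is a placeholder.
                return {"type": "image", "protocol": "DICOM", "payload_size": len(raw_message)}

        class Hl7Adapter(EnterpriseAdapter):
            def to_grid_record(self, raw_message: bytes) -> dict:
                return {"type": "clinical-message", "protocol": "HL7", "payload_size": len(raw_message)}

        ADAPTERS = {"dicom": DicomAdapter(), "hl7": Hl7Adapter()}

        def publish_to_grid(protocol: str, raw_message: bytes) -> dict:
            """Gateway entry point: pick the right adapter, then hand off to the grid service."""
            record = ADAPTERS[protocol].to_grid_record(raw_message)
            # Here the gateway would invoke the wide-area (Web services) actor.
            return record

        if __name__ == "__main__":
            print(publish_to_grid("hl7", b"MSH|^~\\&|..."))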
  • 30. Automating service creation, deployment
    • Introduce
      • Define service
      • Create skeleton
      • Discover types
      • Add operations
      • Configure security
    • Grid Remote Application Virtualization Infrastructure
      • Wrap executables
    [Diagram: Introduce creates an application service, stores it in a repository service, transfers the GAR and deploys it into a container, and advertises it in an index service for discovery and invocation (invoke; get results). caGrid, Introduce, gRAVI: Ohio State, U. Chicago]
  • 31. As of Oct 19, 2008: 122 participants; 105 services (70 data, 35 analytical)
  • 32. Management: Naming and moving information
    • Persistent, uniform global naming of objects, independent of type
    • Orchestration of data movement among services
    [Diagram: data D moved and orchestrated among services S1, S2, S3]
  • 33. LIGO Data Grid (LIGO Gravitational Wave Observatory)
    • Replicating >1 terabyte/day to 8 sites (including Birmingham, Cardiff, AEI/Golm)
    • 770 TB replicated to date: >120 million replicas
    • MTBF = 1 month
    Ann Chervenak et al., ISI; Scott Koranda et al., LIGO
  • 34.
    • Pull “missing” files to a storage system
    [Diagram: Data Replication Service components: list of required files, Local Replica Catalog, Replica Location Index (data location), Reliable File Transfer Service and GridFTP (data movement). “Design and Implementation of a Data Replication Service Based on the Lightweight Data Replicator System,” Chervenak et al., 2005]
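    The pull pattern sketched in that diagram can be written as a simple loop (illustrative Python; local_catalog, lookup_replica_index, and queue_transfer are hypothetical stand-ins for the Local Replica Catalog, Replica Location Index, and RFT/GridFTP services, not the actual Globus APIs): compare the list of required files against the local catalog, find a source replica for anything missing, and queue a transfer.

        # Hypothetical sketch of the "pull missing files" replication loop.

        def replicate_missing(required_files, local_catalog, lookup_replica_index, queue_transfer):
            """Ensure every required logical file has a local replica."""
            for logical_name in required_files:
                if logical_name in local_catalog:
                    continue  # replica already present locally
                sources = lookup_replica_index(logical_name)  # remote sites holding a copy
                if not sources:
                    print(f"no known replica for {logical_name}")
                    continue
                queue_transfer(src=sources[0], dst=f"/data/{logical_name}")
                local_catalog.add(logical_name)  # register the new local replica

        if __name__ == "__main__":
            catalog = {"frame-0001.gwf"}
            index = lambda name: [f"gsiftp://remote.example.org/frames/{name}"]
            transfer = lambda src, dst: print(f"transfer {src} -> {dst}")
            replicate_missing(["frame-0001.gwf", "frame-0002.gwf"], catalog, index, transfer)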
  • 35. Naming objects: A prerequisite to management
    • The naming problem:
    • “ Health objects” = patient information, images, records, etc.
    • “ Names” refer to health objects in records, files, databases, papers, reports, research, emails, etc.
    • Challenges:
    • No systematic way of naming health objects
    • Many health objects, like DICOM images and reports, include references to other objects through non-unique, ambiguous, PHI-tainted identifiers
    A framework for distributed digital object services: Kahn, Wilensky, 1995
  • 36. Health Object Identifier (HOI) naming system
    • Example: uri:hdl://888.us.npi.1234567890.dicom/8A648C33-A5…4939EBE
    • uri:hdl: HOI’s URI schema identifier—based on Handle
    • 888: CHI’s top-level naming authority
    • National Provider Id used in the hierarchical identifier namespace
    • Application context’s namespace governed by the provider naming authority
    • Random string for the identifier body: PHI-free and guaranteed unique
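    A minimal parsing sketch of that layout (illustrative Python; not an actual HOI or Handle System library, and the identifier body used below is a made-up placeholder):

        # Hypothetical parser for the HOI layout shown above:
        #   uri:hdl://<naming-authority-path>/<identifier-body>
        # where the authority path is dot-separated: top-level authority,
        # provider namespace (e.g., an NPI), application context.

        def parse_hoi(hoi: str) -> dict:
            prefix = "uri:hdl://"
            if not hoi.startswith(prefix):
                raise ValueError("not an HOI")
            authority_path, _, identifier_body = hoi[len(prefix):].partition("/")
            parts = authority_path.split(".")
            return {
                "top_level_authority": parts[0],              # e.g., 888
                "provider_namespace": ".".join(parts[1:-1]),  # e.g., us.npi.1234567890
                "application_context": parts[-1],             # e.g., dicom
                "identifier_body": identifier_body,           # opaque, PHI-free random string
            }

        if __name__ == "__main__":
            # Placeholder identifier body; real HOIs carry an opaque random string.
            print(parse_hoi("uri:hdl://888.us.npi.1234567890.dicom/EXAMPLE-BODY"))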
  • 37. Data movement in clinical trials
  • 38. Community public health: Digital retinopathy screening network
  • 39. Integration: Making information useful. [Chart: degree of communication (0-100%) vs. degree of prior syntactic and semantic agreement (0-100%); rigid standards-based approach, loosely coupled approach, adaptive approach]
  • 40. Integration via mediation
    • Map between models
    • Scoped to domain use
      • Multiple concurrent use
    • Bottom up mediation
      • Between standards and versions
      • Between local versions
      • In absence of agreement
    [Diagram: global data model mediator (Levy 2000): query reformulation, query optimization, query execution engine; wrappers translate a query in the union of exported source schemas into queries in each source schema; distributed query execution]
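    As a toy illustration of bottom-up mediation (illustrative Python; the field names and sources are hypothetical, not ECOG or OGSA-DAI schemas), a mediator can hold per-source mappings and reformulate a query expressed against the global model into source-local queries:

        # Hypothetical mediator: each source maps global field names onto its
        # local schema, so one global query fans out to per-source queries.

        SOURCE_MAPPINGS = {
            "site_a": {"patient_id": "PT_ID", "specimen_date": "COLLECTED_ON"},
            "site_b": {"patient_id": "patientIdentifier", "specimen_date": "sampleDate"},
        }

        def reformulate(global_query: dict) -> dict:
            """Rewrite {global_field: value} into one query per source, in local terms."""
            per_source = {}
            for source, mapping in SOURCE_MAPPINGS.items():
                per_source[source] = {mapping[f]: v
                                      for f, v in global_query.items() if f in mapping}
            return per_source

        if __name__ == "__main__":
            print(reformulate({"patient_id": "12345"}))
            # {'site_a': {'PT_ID': '12345'}, 'site_b': {'patientIdentifier': '12345'}}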
  • 41. ECOG 5202 integrated sample management. [Diagram: web portal, mediator, and OGSA-DQP querying OGSA-DAI services at ECOG CC, ECOG PCO, and MD Anderson]
  • 42. Analytics: Transform data into knowledge
    • “ The overwhelming success of genetic and genomic research efforts has created an enormous backlog of data with the potential to improve the quality of patient care and cost effectiveness of treatment.”
      • — US Presidential Council of Advisors on Science and Technology, Personalized Medicine Themes, 2008
  • 43. Microarray clustering using Taverna
    • Query and retrieve microarray data from a caArray data service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub
    • Normalize microarray data using GenePattern analytical service node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService
    • Hierarchical clustering using geWorkbench analytical service: cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage
    Legend: workflow in/output, caGrid services, “shim” services, others. (Wei Tan)
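    A sketch of how such a three-step workflow chains the services listed above (illustrative Python; invoke_cagrid_service and the operation names are hypothetical stand-ins for the Taverna/caGrid client machinery, not real APIs):

        # Hypothetical chaining of the three caGrid services listed above.
        # invoke_cagrid_service stands in for whatever stub actually calls the
        # WSRF service; here it just records the call so the sketch runs.

        def invoke_cagrid_service(endpoint: str, operation: str, payload):
            print(f"calling {operation} at {endpoint}")
            return {"operation": operation, "input": payload}  # placeholder result

        CAARRAY = "cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/CaArrayScrub"
        GENEPATTERN = "node255.broad.mit.edu:6060/wsrf/services/cagrid/PreprocessDatasetMAGEService"
        GEWORKBENCH = "cagridnode.c2b2.columbia.edu:8080/wsrf/services/cagrid/HierarchicalClusteringMage"

        def microarray_clustering(experiment_query: str):
            raw = invoke_cagrid_service(CAARRAY, "retrieveMicroarrayData", experiment_query)
            normalized = invoke_cagrid_service(GENEPATTERN, "preprocessDataset", raw)
            return invoke_cagrid_service(GEWORKBENCH, "hierarchicalClustering", normalized)

        if __name__ == "__main__":
            microarray_clustering("experiment-id: EXAMPLE")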
  • 44. Many many tasks: Identifying potential drug targets 2M+ ligands Protein x target(s) (Mike Kubal, Benoit Roux, and others)
  • 45. [Workflow diagram]
    • Inputs: PDB protein descriptions (1 protein, 1 MB); ZINC 3-D ligand structures (2M structures, 6 GB); DOCK6 and FRED receptor files (1 per protein, defines pocket to bind to; manually prepped); NAB script template and parameters (defines flexible residues, # MD steps)
    • Amber scoring prep and run: 1. AmberizeLigand, 2. AmberizeReceptor, 3. AmberizeComplex, 4. perl: gen nabscript (BuildNABScript), 5. RunNABScript
    • For 1 target: ~4 million tasks, ~500,000 CPU-hours (50 CPU-years)
      • DOCK6 / FRED: ~4M tasks × 60 s × 1 CPU ≈ 60K CPU-hours; select best ~5K
      • Amber: ~10K tasks × 20 min × 1 CPU ≈ 3K CPU-hours; select best ~500
      • GCMC: ~500 tasks × 10 hr × 100 CPUs ≈ 500K CPU-hours
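    A schematic of that screening funnel (illustrative Python; the scoring functions are random placeholders, not the real DOCK6, Amber, or GCMC invocations): score everything cheaply, then run progressively more expensive methods on the best survivors.

        # Hypothetical sketch of the virtual-screening funnel: cheap scoring of
        # all ligands, then costlier rescoring of progressively smaller subsets.
        import random

        def dock_score(ligand):  return random.random()   # stands in for DOCK6/FRED (~60 s/task)
        def amber_score(ligand): return random.random()   # stands in for Amber rescoring (~20 min/task)
        def gcmc_score(ligand):  return random.random()   # stands in for GCMC + hydration (~10 hr/task)

        def top_k(scored, k):
            return [lig for lig, _ in sorted(scored, key=lambda x: x[1])[:k]]

        def screen(ligands, k1=5000, k2=500):
            stage1 = top_k([(l, dock_score(l)) for l in ligands], k1)   # ~2M -> ~5K
            stage2 = top_k([(l, amber_score(l)) for l in stage1], k2)   # ~5K -> ~500
            return sorted(((l, gcmc_score(l)) for l in stage2), key=lambda x: x[1])

        if __name__ == "__main__":
            print(screen([f"ligand-{i}" for i in range(10000)], k1=50, k2=5)[:3])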
  • 46. DOCK on BG/P: ~1M tasks on 118,000 CPUs
    • CPU cores: 118784
    • Tasks: 934803
    • Elapsed time: 7257 sec
    • Compute time: 21.43 CPU years
    • Average task time: 667 sec
    • Relative Efficiency: 99.7% (from 16 to 32 racks)
    • Utilization:
      • Sustained: 99.6%
      • Overall: 78.3%
    Time (secs)
  • 47. Scaling POSIX to petascale. [Diagram: large dataset on the global file system; CN-striped intermediate file system over torus and tree interconnects; Chirp (multicast), MosaStore (striping); staging among global, intermediate, and local (LFS) storage on compute nodes holding local datasets]
  • 48. Efficiency for 4 second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors
  • 49. “Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node. Ioan Raicu
  • 50. Same scenario, but with dynamic resource provisioning
  • 51. Data diffusion sine-wave workload: Summary
    • GPFS: 5.70 hrs, ~8 Gb/s, 1138 CPU hrs
    • DD+SRP: 1.80 hrs, ~25 Gb/s, 361 CPU hrs
    • DD+DRP: 1.86 hrs, ~24 Gb/s, 253 CPU hrs
  • 52. Recap
    • Increased recognition that information systems and data understanding are a limiting factor
      • … much of the promise associated with health IT requires high levels of adoption … and high levels of use of interoperable systems (in which information can be exchanged across unrelated systems) … . RAND COMPARE
    • Health system is complex, adaptive system
      • There is no single point(s) of control. System behaviors are often unpredictable and uncontrollable, and no one is “in charge.” W Rouse, NAE Bridge
    • With diverse and evolving requirements and user communities
      • … I advocate … a model of virtual integration rather than true vertical integration…. G. Halvorson, CEO Kaiser
  • 53. Functioning in the zone of complexity Ralph Stacey, Complexity and Creativity in Organizations , 1996 Low Low High High Agreement about outcomes Certainty about outcomes Plan and control Chaos
  • 54. The Grid paradigm and information integration. [Diagram: data sources; platform services (Publication, Management, Integration, Security and policy); value services (Analysis, Cognitive support, Applications)]
  • 55. “The computer revolution hasn’t happened yet.” Alan Kay, 1997
  • 56. [Chart: connectivity (log scale) vs. time, with science, enterprise, and consumer curves; grid, cloud, ????] “When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances” (George Gilder, 2001)
  • 57. Thank you! Computation Institute www.ci.uchicago.edu