
The Discovery Cloud: Accelerating Science via Outsourcing and Automation


Presented at the Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.

We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate science, thousands of researchers work daily within virtual computing systems of global scope. But we now face a far greater challenge: exploding data volumes and powerful simulation tools mean that many more researchers (ultimately, perhaps most) will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is instead to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, present, and potential future of large-scale outsourcing and automation for science.


  1. The Discovery Cloud! Ian Foster, Argonne National Laboratory and University of Chicago.
  2. The discovery process is iterative and time-consuming: pose question → design experiment → collect data → analyze data → identify patterns → hypothesize explanation → test hypothesis → publish results.
  3. “Civilization advances by extending the number of important operations which we can perform without thinking about them.” Alfred North Whitehead (1911)
  4. “About 85% of my ‘thinking’ time was spent getting into a position to think, to make a decision, to learn something I needed to know.” J. C. R. Licklider (1960)
  5. Automation is required to apply more sophisticated methods at larger scales.
  6. Automation is required to apply more sophisticated methods at larger scales. Outsourcing is needed to achieve economies of scale in the use of automated methods.
  7. Outsourcing and automation: (1) The Grid. “A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to computational capabilities.” Foster and Kesselman (1998)
  8. Higgs discovery “only possible because of the extraordinary achievements of … grid computing” (Rolf Heuer, CERN DG). 10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, Bs of tasks.
  9. Outsourcing and automation: (2) The Cloud. “Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.” NIST (2011)
  11. Tripit exemplifies process automation. I book flights and a hotel; Tripit then records the flights, suggests a hotel, records the hotel, gets weather, prepares maps, shares info, monitors prices, monitors the flight, and connects to other services.
  12. How the “business cloud” works. Platform services: database, analytics, application, deployment, workflow, queuing; auto-scaling, Domain Name Service, content distribution; Elastic MapReduce, streaming data analytics; email, messaging, transcoding; many more. Infrastructure services: computing, storage, networking; elastic capacity; multiple availability zones.
  13. The Intelligence Cloud
  14. Process automation for science: run experiment, collect data, move data, check data, annotate data, share data, find similar data, link to literature, analyze data, publish data. Automate and outsource: the Discovery Cloud.
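The chain of steps on the “process automation for science” slide can be sketched as a simple pipeline that runs each stage in order without human intervention. This is an illustrative sketch only: the `Record` type and all step functions are hypothetical names, not any real Globus or Galaxy API.

```python
# Hypothetical sketch: the slide's data-handling steps chained into an
# automated pipeline. All names here are illustrative, not a real API.
from dataclasses import dataclass, field

@dataclass
class Record:
    data: bytes
    annotations: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

def collect(record):
    record.history.append("collected")
    return record

def check(record):
    # A real pipeline might verify checksums or file formats here.
    record.annotations["valid"] = len(record.data) > 0
    record.history.append("checked")
    return record

def annotate(record):
    record.annotations["size_bytes"] = len(record.data)
    record.history.append("annotated")
    return record

def publish(record):
    record.history.append("published")
    return record

PIPELINE = [collect, check, annotate, publish]

def run_pipeline(record):
    for step in PIPELINE:
        record = step(record)
    return record

result = run_pipeline(Record(data=b"experiment-42"))
print(result.history)  # the order in which the steps ran
```

The point of the sketch is the shape, not the steps: once every stage is a function with the same signature, the whole sequence can be outsourced to a service that runs it on each new dataset.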
  15. Globus research data management services. In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, and burdensome reporting. Data flows from sources (next-gen genome sequencer, telescope, simulation) through staging, ingest, analysis, registry, community repository, archive, and mirror.
  16. “I need to easily, quickly, and reliably mirror [portions of] my data to other places.” Across: personal laptop, desktop workstation, lab server, campus home filesystem, research computing HPC cluster, XSEDE resource, public cloud.
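The mirroring need above boils down to deciding which files at a destination are missing or stale. A minimal sketch, assuming in-memory name→content maps as stand-ins for real filesystems (this is not the Globus API, which handles this via managed transfer tasks):

```python
# Hypothetical sketch of mirroring: compare content checksums to find
# which source files must be (re)copied to the destination.
import hashlib

def checksum(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def files_to_mirror(source: dict, destination: dict) -> list:
    """Return names of files missing or stale at the destination."""
    stale = []
    for name, content in source.items():
        if name not in destination or checksum(destination[name]) != checksum(content):
            stale.append(name)
    return sorted(stale)

source = {"run1.dat": b"abc", "run2.dat": b"def"}
dest = {"run1.dat": b"abc", "run2.dat": b"OLD"}
print(files_to_mirror(source, dest))  # ['run2.dat']
```

A managed service automates exactly this loop, plus the retries, authentication, and high-speed movement that make it reliable across the sites listed on the slide.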
  17. “I need to easily and securely share my data with colleagues.”
  18. “I need to get data from a scientific instrument to my analysis server.” Instruments: next-gen sequencer, MRI, light sheet microscope, Advanced Light Source.
  19. Globus: transfer & sharing; identity & group management; data discovery & publication. 25,000 users; 60 PB and 3B files transferred; 8,000 endpoints.
  20. The Globus Galaxies platform: science as a service. Platform capabilities: tool and workflow execution, publication, discovery, and sharing; identity management; data management; task scheduling. Built on infrastructure services: EC2, EBS, S3, SNS, Spot, Route 53, CloudFormation. Applications: eMatter materials, FACE-IT science, PDACS.
  21. Ravi Madduri, Paul Davé, Dina Sulakhe, Alex Rodriguez
  22. Globus Genomics. Sequencing centers, public data, local cluster/research lab, and cloud storage are linked by Globus, which provides a high-performance, fault-tolerant, secure file transfer service between all data endpoints. Galaxy-based workflow management (web-based UI; drag-and-drop workflow creation; workflows easily modified with new tools; analytical tools run automatically on scalable compute resources when possible), with Globus integrated within Galaxy. A typical pipeline: Fastq + reference genome → Picard alignment → GATK variant calling → Galaxy data libraries. Globus Genomics runs on Amazon EC2, combining data management and data analysis.
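The Fastq → alignment → variant-calling pipeline on the slide is a small dependency graph, and a workflow engine's core job is to run its steps in dependency order. A sketch using the standard library's topological sorter; the step names echo the slide's tools (Picard, GATK), but the execution is a placeholder, not the real Galaxy engine:

```python
# Hypothetical sketch: the slide's genomics pipeline as a dependency graph,
# executed in an order that respects the dependencies.
from graphlib import TopologicalSorter

# step -> set of steps it depends on
WORKFLOW = {
    "fetch_fastq": set(),
    "fetch_reference": set(),
    "align_reads": {"fetch_fastq", "fetch_reference"},  # Picard-style alignment
    "call_variants": {"align_reads"},                   # GATK-style variant calling
}

def run_workflow(workflow):
    """Return the steps in an order that respects dependencies."""
    # A real platform would dispatch each step to compute resources here.
    return list(TopologicalSorter(workflow).static_order())

order = run_workflow(WORKFLOW)
print(order)
```

Because the two fetch steps are independent, a real engine would run them in parallel; the sorter only guarantees that alignment waits for both inputs and variant calling waits for alignment.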
  23. It’s proving popular: Dobyns Lab, Nagarajan Lab, Cox Lab, Volchenboum Lab, Olopade Lab.
  24. 2.5 million core hours used in the first six months of 2014. (Chart: instance hours and cost per month, January–June 2014.)
  25. Costs are remarkably low. Pricing includes estimated compute, storage (one month), Globus Genomics platform usage, and support.
  26. Data service as community resource
  28. Linking simulation and experiment to study disordered structures. A closed loop couples experimental sample scattering with simulated structure and simulated scattering (e.g., for a La 60% / Sr 40% composition): detect errors (seconds–minutes), select experiments (minutes–hours), and run simulations driven by experiments (minutes–days), all informed by a knowledge base of past experiments, simulations, literature, and expert knowledge, with knowledge-driven decision making and evolutionary optimization contributing results back to that knowledge base. Diffuse scattering images from Ray Osborn et al., Argonne.
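The “evolutionary optimization” step in this loop can be sketched as a simple mutate-and-select search: perturb a candidate structure parameter, simulate its scattering, and keep whichever candidate better matches the measurement. The forward model below is a toy stand-in for a real diffuse-scattering simulation, and the 0.40 Sr fraction echoes the slide's sample; everything else is illustrative.

```python
# Hypothetical sketch of the evolutionary-optimization loop: a (1+1)-style
# search over a composition parameter, scored against a "measured" pattern.
import random

random.seed(0)

def simulate_scattering(sr_fraction: float) -> float:
    # Toy forward model: scattering signal as a simple function of composition.
    return 2.0 * sr_fraction + 0.1

MEASURED = simulate_scattering(0.40)  # the slide's La 60% / Sr 40% sample

def mismatch(sr_fraction: float) -> float:
    return (simulate_scattering(sr_fraction) - MEASURED) ** 2

best = 0.9  # deliberately poor initial guess for the Sr fraction
for _ in range(200):
    candidate = min(1.0, max(0.0, best + random.gauss(0, 0.05)))
    if mismatch(candidate) < mismatch(best):
        best = candidate

print(round(best, 2))  # should end close to 0.40
```

In the real loop the expensive part is each `simulate_scattering` call, which is why the slide's timescales run from minutes to days and why the knowledge base is used to avoid redundant simulations.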
  29. New data, computational capabilities, and methods create opportunities and challenges: integrate statistics and machine learning to assess many models and calibrate them against “all” relevant data; integrate data movement, management, workflow, and computation to accelerate data-driven applications; exploit new computer facilities that enable on-demand computing and high-speed analysis of large quantities of data.
  30. A lab-wide data architecture and facility. Users (researchers, system administrators, collaborators, students) reach services via web interfaces, REST APIs, and command-line interfaces. Services: domain portals (PDACS, kBase, eMatter, FACE-IT); registry of metadata and attributes; component and workflow repository; workflow execution; data transfer, sync, and sharing; data publication and discovery. An integration layer (remote access protocols, authentication, authorization) connects these to resources: utility compute system (“cloud”), HPC compute, parallel file system, DISC system, experimental facility, visualization system.
  31. Immediate assessment of alignment quality in near-field high-energy diffraction microscopy. The detector produces a dataset of 360 files (4 GB total), moved via Globus transfer. Step 1: median calculation (MedianImage.c, 75 s, 90% I/O; Swift/K). Step 2: peak search (ImageProcessing.c, 15 s per file; Swift/K), yielding a reduced dataset of 360 files (5 MB total). Step 3: generate parameters (FOP.c, 50 tasks at 25 s/task, ~¼ CPU hour; Swift/K), plus conversion from bin L to network-endian format (2 minutes for all files). Step 4: analysis pass (FitOrientation.c, 60 s/task; 1667 CPU hours on a PC cluster or on Blue Gene/Q; Swift/T). Runs span Orthros (all data in NFS) and Blue Gene/Q; a Bash control script is launched manually via ssh, with workflow progress and scientific metadata tracked in the Globus Catalog. This single workflow can consume up to 2.2 M CPU hours per week. Credit: Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer.
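The CPU-hour figures on this slide are easy to sanity-check: 1667 CPU hours at 60 s/task implies roughly 100,000 analysis tasks. A back-of-envelope helper (all inputs are the slide's numbers; the task count is inferred, not stated on the slide):

```python
# Back-of-envelope check of the slide's CPU-hour accounting.
def cpu_hours(n_tasks: int, seconds_per_task: float) -> float:
    return n_tasks * seconds_per_task / 3600.0

# Step 4: 1667 CPU hours at 60 s/task implies about 100,000 tasks.
implied_tasks = round(1667 * 3600 / 60)
print(implied_tasks)  # 100020

# And 100,000 tasks at 60 s/task lands back on the quoted figure.
print(round(cpu_hours(100_000, 60)))  # 1667

# Step 3: 50 tasks at 25 s/task is about 0.35 CPU hours
# (the slide quotes roughly a quarter CPU hour).
print(round(cpu_hours(50, 25), 2))
```

The same helper explains the weekly ceiling: 2.2 M CPU hours per week corresponds to sustaining on the order of 13,000 cores around the clock.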
  32. One APS data node: 125 destinations.
  33. Same node (1 Gbps link).
  34. The Discovery Cloud! Accelerate discovery via automation and outsourcing, and at the same time: enhance reproducibility; encourage entrepreneurial science; democratize access and contributions; enhance collaboration.