The Discovery Cloud! 
Ian Foster 
Argonne National Laboratory and University of Chicago 
foster@anl.gov 
ianfoster.org
The discovery process:
Iterative and time-consuming
Pose question → Design experiment → Collect data →
Analyze data → Identify patterns → Hypothesize explanation →
Test hypothesis → Publish results
Civilization advances 
by extending the number of important 
operations which 
we can perform without 
thinking about them 
Alfred North Whitehead (1911)
About 85% of my “thinking” time 
was spent getting into a position to think, 
to make a decision, 
to learn something I needed to know 
J. C. R. Licklider, 1960
Automation is required to 
apply more sophisticated 
methods at larger scales 
Outsourcing is needed to 
achieve economies of 
scale in the use of 
automated methods
Outsourcing and automation: 
(1) The Grid 
A computational grid is a hardware and 
software infrastructure that provides 
dependable, consistent, pervasive, and 
inexpensive access to computational 
capabilities 
Foster and Kesselman, 1998
Higgs discovery “only possible because 
of the extraordinary achievements of … 
grid computing”—Rolf Heuer, CERN DG 
10s of PB, 100s of institutions, 1000s of 
scientists, 100Ks of CPUs, Bs of tasks
Outsourcing and automation: 
(2) The Cloud 
Cloud computing is a model for enabling 
ubiquitous, convenient, on-demand network 
access to a shared pool of configurable 
computing resources (e.g., networks, 
servers, storage, applications, and services) 
that can be rapidly provisioned and released 
with minimal management effort or service 
provider interaction 
NIST, 2011
Tripit exemplifies process automation
Me: book flights, book hotel
Tripit and other services: record flights, suggest hotel, record
hotel, get weather, prepare maps, share info, monitor prices,
monitor flight
How the “business cloud” works
Platform services: database, analytics, application, deployment,
workflow, queuing; auto-scaling, Domain Name Service, content
distribution; Elastic MapReduce, streaming data analytics; email,
messaging, transcoding; many more
Infrastructure services: computing, storage, networking; elastic
capacity; multiple availability zones
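To make the infrastructure-services layer concrete, here is a minimal sketch of elastic capacity in practice, using the AWS boto3 Python SDK: provision an instance on demand, then release it. The AMI ID, region, and instance type are placeholders, not details from the talk.

```python
import boto3

# Minimal sketch of elastic capacity: acquire one on-demand
# instance, then release it when idle. All identifiers below
# are illustrative placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-00000000",   # placeholder machine image
    InstanceType="m3.large",  # placeholder instance type
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("provisioned", instance_id)

# Pay only while it runs: terminate as soon as the work is done.
ec2.terminate_instances(InstanceIds=[instance_id])
```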
The Intelligence Cloud
Process automation for science
Run experiment → Collect data → Move data → Check data →
Annotate data → Share data → Find similar data → Link to
literature → Analyze data → Publish data
Automate and outsource: the Discovery Cloud
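As a hedged illustration of what automating this chain might look like, here is a toy Python pipeline in which each step is a stub standing in for an outsourced service; none of these functions name real APIs.

```python
# Toy sketch: each stub stands in for a hosted service that would
# perform the step; the Discovery Cloud idea is that these calls
# go to outsourced, automated services.

def collect_data(exp):
    return {"exp": exp, "files": ["run1.dat"], "meta": {}}

def move_data(d, dest):
    d["location"] = dest  # e.g., instrument -> analysis server
    return d

def check_data(d):
    return bool(d["files"])  # quality gate before analysis

def annotate_data(d, meta):
    d["meta"].update(meta)

def share_data(d, group):
    d["shared_with"] = group

def analyze_data(d):
    return {"exp": d["exp"], "result": 42}

def publish_data(result):
    print("published:", result)  # e.g., assign a DOI, archive

def run_pipeline(experiment_id):
    data = collect_data(experiment_id)
    data = move_data(data, dest="analysis-server")
    if not check_data(data):
        raise ValueError("quality check failed")
    annotate_data(data, {"experiment": experiment_id})
    share_data(data, group="collaborators")
    publish_data(analyze_data(data))

run_pipeline("example-2014-001")
```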
Globus research data management services
In millions of labs worldwide, researchers struggle with massive
data, advanced software, complex protocols, and burdensome
reporting.
Data flows from sources (next-gen genome sequencer, telescope,
simulation) through staging, ingest, analysis, registry, community
repository, archive, and mirror services.
www.globus.org
“I need to easily, quickly, and reliably mirror 
[portions of] my data to other places.” 
Research Computing HPC Cluster 
Campus Home Filesystem 
Lab Server 
Desktop Workstation 
Personal Laptop 
XSEDE Resource 
Public Cloud
“I need to easily and securely 
share my data with colleagues.”
“I need to get data from a scientific 
instrument to my analysis server.” 
Next Gen 
Sequencer 
MRI 
Light Sheet Microscope 
Advanced Light Source
Globus transfer & sharing; identity & group
management; data discovery & publication
25,000 users, 60 PB and 3B files transferred, 8,000 endpoints
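As an illustration of scripting such transfers, here is a minimal sketch using the Globus Python SDK (which postdates this talk). The endpoint IDs, paths, and access token are placeholders; a real script would obtain the token via an OAuth2 flow.

```python
import globus_sdk

# Placeholders: real endpoint UUIDs come from the Globus web app.
SRC = "source-endpoint-uuid"
DST = "destination-endpoint-uuid"

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN")
)

# Mirror a directory between endpoints, verifying by checksum;
# Globus retries and resumes the transfer on faults.
tdata = globus_sdk.TransferData(
    tc, SRC, DST, label="mirror run", sync_level="checksum"
)
tdata.add_item("/data/run42/", "/mirror/run42/", recursive=True)
task = tc.submit_transfer(tdata)
print("task id:", task["task_id"])
```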
The Globus Galaxies platform:
Science as a service
Applications: eMatter materials, FACE-IT science, PDACS
Globus Galaxies platform: tool and workflow execution, publication,
discovery, sharing; identity management; data management; task
scheduling
Infrastructure services: EC2, EBS, S3, SNS, Spot, Route 53,
CloudFormation
Ravi Madduri, Paul Davé, Dina Sulakhe, Alex Rodriguez
Globus Genomics on Amazon EC2
Data management: Globus provides a high-performance, fault-tolerant,
secure file transfer service between all data endpoints: sequencing
centers, public data, local cluster/research lab, cloud sequencing
center, and Globus Genomics storage. Globus is integrated within
Galaxy data libraries.
Data analysis: Galaxy-based workflow management, with a web-based
UI, drag-and-drop workflow creation, and easy modification of
workflows with new tools. Analytical tools are automatically run
on scalable compute resources when possible.
Example pipeline: FASTQ + reference genome → Picard alignment →
GATK variant calling
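For a sense of how a Galaxy-based analysis like this can be driven programmatically, here is a minimal sketch using the BioBlend client for the Galaxy API (not part of the talk); the URL, API key, file name, and workflow selection are placeholders.

```python
from bioblend.galaxy import GalaxyInstance

# Placeholders: point these at a real Galaxy/Globus Genomics server.
gi = GalaxyInstance(url="https://genomics.example.org", key="API_KEY")

# Upload a FASTQ file into a fresh history.
history = gi.histories.create_history(name="variant-calling-run")
upload = gi.tools.upload_file("sample.fastq", history["id"])

# Invoke an existing workflow (e.g., Picard alignment followed by
# GATK variant calling), mapping the upload to its first input.
workflow = gi.workflows.get_workflows()[0]
inputs = {"0": {"id": upload["outputs"][0]["id"], "src": "hda"}}
invocation = gi.workflows.invoke_workflow(
    workflow["id"], inputs=inputs, history_id=history["id"]
)
print("invocation id:", invocation["id"])
```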
It’s proving popular 
Dobyns 
Lab 
Nagarajan Lab 
Cox Lab 
Volchenboum Lab 
Olopade Lab
2.5 million core hours used 
in first six months of 2014 
[Chart: monthly instance hours (0 to 1,200,000) and cost in
dollars (0 to 12,000), January through June 2014]
Costs are remarkably low 
Pricing includes:
• Estimated compute
• Storage (one month)
• Globus Genomics platform usage
• Support
Data service as community resource 
metagenomics.anl.gov
kbase.us
Linking simulation and experiment to study
disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
Experimental scattering from a sample of known material
composition (La 60%, Sr 40%) is compared with scattering
simulated from a candidate structure:
• Detect errors (secs to mins)
• Select experiments (mins to hours)
• Simulations driven by experiments (mins to days)
• Contribute to knowledge base (past experiments, simulations,
literature, expert knowledge)
• Knowledge-driven decision making
• Evolutionary optimization
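The slide names evolutionary optimization as the search strategy for matching simulated to experimental scattering. A minimal sketch of that loop, with a toy fitness function standing in for the real scattering simulation (all values illustrative):

```python
import random

# Toy stand-in for "compare simulated scattering with experiment":
# fitness is the negated squared error against hidden composition
# fractions, e.g. La 0.6 / Sr 0.4.
TARGET = [0.6, 0.4]

def fitness(candidate):
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

def mutate(candidate, scale=0.05):
    return [min(1.0, max(0.0, c + random.gauss(0.0, scale)))
            for c in candidate]

# Simple (mu + lambda) evolutionary loop.
population = [[random.random(), random.random()] for _ in range(20)]
for _ in range(100):
    population.sort(key=fitness, reverse=True)
    parents = population[:5]                      # keep the best
    children = [mutate(random.choice(parents)) for _ in range(15)]
    population = parents + children

print("best candidate:", max(population, key=fitness))
```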
New data, computational capabilities, and 
methods create opportunities and challenges 
Integrate statistics/machine learning to assess 
many models and calibrate them against “all” 
relevant data 
Integrate data movement, management, workflow, 
and computation to accelerate data-driven 
applications 
New computer facilities enable on-demand 
computing and high-speed analysis of large 
quantities of data
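To make the first point concrete, here is a minimal sketch of calibrating a model against data by least squares, using SciPy's curve_fit; the model and the noisy data are synthetic, purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic "relevant data": noisy samples of an exponential decay.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 2.0 * np.exp(-1.3 * x) + rng.normal(0.0, 0.05, x.size)

# Candidate model whose parameters we calibrate against the data.
def model(x, amplitude, rate):
    return amplitude * np.exp(-rate * x)

# Least-squares calibration; the covariance matrix supplies the
# parameter uncertainties used to assess competing models.
params, pcov = curve_fit(model, x, y, p0=(1.0, 1.0))
errors = np.sqrt(np.diag(pcov))
print("amplitude = %.2f +/- %.2f" % (params[0], errors[0]))
print("rate      = %.2f +/- %.2f" % (params[1], errors[1]))
```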
A lab-wide data architecture and facility
Users: researchers, system administrators, collaborators, students, …
Interfaces: web interfaces, REST APIs, command line interfaces
Services: workflow execution; data transfer, sync, and sharing;
registry (metadata, attributes); component & workflow repository;
data publication & discovery
Domain portals: PDACS, kBase, eMatter, FACE-IT
Integration layer: remote access protocols, authentication,
authorization
Resources: utility compute system (“cloud”), HPC compute,
parallel file system, DISC system, experimental facility,
visualization system
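The interface layer above implies every service is scriptable. As a purely hypothetical sketch (no real Argonne endpoint is named here), registering a dataset with the metadata registry via REST might look like:

```python
import requests

# Hypothetical registry URL; nothing here names a real service.
REGISTRY = "https://data.example.org/api/registry"

# Register a dataset with searchable metadata attributes.
record = {
    "name": "diffuse-scattering-run-42",
    "attributes": {"instrument": "detector-1", "sample": "La60-Sr40"},
}
resp = requests.post(REGISTRY, json=record, timeout=30)
resp.raise_for_status()
print("registered id:", resp.json().get("id"))
```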
Immediate assessment of alignment quality in
near-field high-energy diffraction microscopy
A single workflow spanning the detector, the Orthros cluster
(all data in NFS), and a Blue Gene/Q; a manual Bash control
script drives the steps over ssh, and the Globus Catalog records
scientific metadata and workflow progress.
Detector → dataset (360 files, 4 GB total) → GO Transfer to Orthros
1: Median calc: MedianImage.c, 75 s (90% I/O); uses Swift/K
2: Peak search: ImageProcessing.c, 15 s per file; uses Swift/K
→ reduced dataset (360 files, 5 MB total)
3: Generate parameters: FOP.c, 50 tasks, 25 s/task, ¼ CPU hour;
uses Swift/K. Also: convert bin L to N (2 min for all files,
converting files to network-endian format)
4: Analysis pass: FitOrientation.c, 60 s/task, 1667 CPU hours
(PC or BG/Q); uses Swift/T
Up to 2.2 M CPU hours per week!
Before / After alignment images
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
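The real workflow dispatches these steps with Swift/K and Swift/T; as a hedged sketch of its shape only, here is a Python control script with hypothetical paths for the compiled C tools named above.

```python
import glob
import subprocess

def run(cmd):
    """Run one pipeline step, failing loudly on error."""
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

files = sorted(glob.glob("dataset/*.bin"))   # 360 files, ~4 GB total

run(["./MedianImage", "dataset/"])           # 1: median calc (I/O bound)
for f in files:                              # 2: peak search per file
    run(["./ImageProcessing", f])            #    -> reduced 5 MB dataset
run(["./FOP", "reduced/", "params/"])        # 3: generate fit parameters

# 4: analysis pass; shown sequentially here, but Swift/T fans these
# tasks out across the Blue Gene/Q in the actual workflow.
for task in sorted(glob.glob("params/*")):
    run(["./FitOrientation", task])
```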
One APS data 
node: 
125 
destinations
Same 
node 
(1 Gbps 
link)
The Discovery Cloud! 
Accelerate discovery via automation and 
outsourcing 
And at the same time: 
– Enhance reproducibility 
– Encourage entrepreneurial science 
– Democratize access and contributions 
– Enhance collaboration
U.S. Department of Energy

Editor's Notes

  • #3 The basic research process remains essentially unchanged since the emergence of the scientific method in the 17th Century. Collect data, analyze data, identify patterns within data, seek explanations for those patterns, collect new data to test explanations. Speed of discovery depends to a significant degree on the time required for this cycle. Here, new technologies are changing the research process rapidly and dramatically. Data collection time used to dominate research. For example, Janet Rowley took several years to collect data on gross chromosomal abnormalities for a few patients. Today, we can generate genome data at the rate of billions of base pairs per day. So other steps become bottlenecks, like managing and analyzing data—a key issue for Midway. It is important to realize that the vast majority of research is performed within “small and medium labs.” For example, almost all of the ~1000 faculty in BSD and PSD at UChicago work in their own lab. Academic research is a cottage industry—albeit one that is increasingly interconnected—and is likely to stay that way.
  • #5 Mathematics Computers Excel
  • #10 Need for entirely new instruments, computing infrastructure, organizational structures 173 TB/day
  • #11 “Whenever Amazon introduces a new innovation or improvement in cloud services, the IC cloud will evolve. Company officials say AWS made more than 200 such incremental improvements last year, ensuring a sort of built-in innovation to the IC cloud that will help the intelligence community keep pace with commercial advances.” – CIA article
  • #17 Add more analysis
  • #22 Add logos Add IaaS/PaaS/SaaS
  • #24 Our goal is to operationalize key capabilities so researchers can depend on them. Think of Gmail for science.
  • #31 “Most of materials science is bottlenecked by disordered structures”—Littlewood. Solve the inverse problem. How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base. Challenge: it takes months to do a single loop through the cycle. Just as important, it is an incredibly labor-intensive and expensive process.
  • #32 This picture shows the big picture.
  • #33 New types of computer systems enable high-speed data access, high-speed analysis, and on-demand computing Integrated networking, data transfer, and security solutions enable ultra-rapid, secure communication among resources New data and workflow services enable automation and provenance tracking for data-driven applications Simple APIs enable rapid development of domain-specific tools, applications, and portals
  • #34 DS, NF-HEDM, FF-HEDM, PD workflows operational Catalog integrated into workflow, supports rich user interface Workflows use large-scale compute resources outside of APS Data publication service demonstrated Parallel algs for 3-D image reconstruction, structure determination, etc. Globus Galaxies platform integrated with Swift for scalability
  • #37 Rethinking scientific computing infrastructure for the 21st Century
  • #39 Thanks.