re:Invent 2013-foster-madduri

Science as a Service
Ian Foster, The University of Chicago and Argonne National Laboratory
November 14, 2013

Most labs have limited resources
Heidorn: NSF grants in 2007
$1,000,000
$100,000
$10,000

< $350,000
80% of awards
50% of grant $$

$1,000

2000

4000

6000

8000

Automation is required to
apply more sophisticated
methods to far more data

Automation is required to
apply more sophisticated
methods to far more data
Outsourcing is needed to
achieve economies of scale
in the use of automated
methods

Building a discovery cloud
• Identify time-consuming activities amenable to
automation and outsourcing
• Implement as high-quality, low-touch SaaS
• Leverage IaaS for reliability,
Software as a service
economies of scale
Platform as a service
Infrastructure as a service
• Extract common elements as
research automation platform
Bonus question: Sustainability

We aspire (initially) to create a
great user experience for
research data management
What would a “dropbox for
science” look like?

• Collect
• Move
• Sync
• Share
• Analyze

• Annotate
• Publish
• Search
• Backup
• Archive

BIG DATA

It should be trivial to
Collect, Move, Sync, Share, Analyze, Annotate, Publish,
… but in reality it’s often very challenging
Search, Backup, & Archive BIG DATA
!

Staging
Store

! Ingest

Expired
Store
credentials

Registry
Permission
denied
Communit
Community
yStore
Store

!

Analysis
!
Store Quota

Network
failed. Retry.

exceeded

Archive

Mirror

• Collect
• Annotate
Move
• Publish
• Move
Sync
• Search
• Sync
• Share
Share
• Backup
Capabilities delivered using
• Analyze
• Archive

Software-as-Service (SaaS) model

BIG DATA

2
Data
Source

1

Globus
Online
moves/syncs
files

Data
Destination

User
initiates
transfer
request

Globus Online
notifies user

3

2

1

User A selects
file(s) to share;
selects user/group,
sets share
permissions

Globus Online tracks
shared files; no need
to move files to cloud
storage!

Data
Source

3
User B logs in to
Globus Online
and accesses
shared file

Extreme ease of use
•
•
•
•
•
•
•
•

InCommon, Oauth, OpenID, X.509, …
Credential management
Group definition and management
Transfer management and optimization
Reliability via transfer retries
Web interface, REST API, command line
One-click “Globus Connect” install
5-minute Globus Connect Multi User install

Early adoption is encouraging

>12,000 registered users; >150 daily
>27 PB moved; >1B files
10x (or better) performance vs. scp
99.9% availability
Entirely hosted on Amazon

Amazon web services used
• EC2 for hosting Globus services
• ELB to use multiple availability zones for
reliability and uptime
• SES and SNS to send notifications of transfer
status
• S3 to store historical state
• PostgreSQL for active state

K. Heitmann (Argonne)
moves 22 TB of cosmology
data LANL  ANL at 5 Gb/s

B. Winjum (UCLA) moves
900K-file plasma physics
datasets UCLA NERSC

Dan Kozak (Caltech) replicates 1
PB LIGO astronomy data for
resilience

Erin Miller (PNNL)
collects data at
Advanced Photon
Source, renders at
PNNL, and views at ANL
Credit: Kerstin Kleese-van Dam

Sharing Service
Transfer Service
Globus Nexus
(Identity, Group, Profile)

Globus Toolkit

Globus Connect

Globus Online
APIs

Globus Online already does a lot

The identity challenge in science
• Research communities often need to
– Assign identities to their users
– Manage user profiles
– Organize users into groups for authorization

• Obstacles to high-quality implementations
–
–
–
–

Complexity of associated security protocols
Creation of identity silos
Multiple credentials for users
Reliability, availability, scalability, security

Nexus provides four key capabilities
• Identity provisioning

I
I

I

– Create, manage Globus identities
I

I
G

I
V
U
aI b

• Identity hub
– Link with other identities; use
to authenticate to services

• Group hub
– User-managed groups; groups can
be used for authorization

• Profile management
– User-managed attributes;
can use in group admission

Key points:
1) Outsource
identity, group,
profile
management
2) REST API for
flexible integration
3) Intuitive,
customizable
Web interfaces

Branded sites
XSEDE

Open Science Grid

University of Chicago

DOE kBase

Indiana University

University of Exeter

Globus Online

NERSC

NIH BIRN

Data management SaaS (Globus) +
Next-gen sequence analysis pipelines (Galaxy) +
Cloud IaaS (Amazon) =
Flexible, scalable, easy-to-use genomics analysis for
all biologists
globus
genomics

Sharing Service
Transfer Service
Globus Nexus

Globus Toolkit

Globus Connect

Globus Online
APIs

We are adding capabilities

Dataset Services
Sharing Service
Transfer Service
Globus Nexus

Globus Toolkit

Globus Connect

Globus Online
APIs


• Ingest and publication
– Imagine a DropBox that not only replicates, but also extracts
metadata, catalogs, converts

• Cataloging
– Virtual views of data based on user-defined and/or automatically
extracted metadata

• Computation
– Associate computational procedures, orchestrate application,
catalog results, record provenance

Next Gen Sequencing Analysis for Everyone –
No IT Required
Ravi K Madduri, The University of Chicago and Argonne National Laboratory

November 14, 2013

One slide to get your attention

Outline
• Globus Vision
• Challenges in Sequencing Analysis
– Big Data Management
– Analysis at Scale
– Reproducibility

• Proposed Approach Using Globus Genomics
• Example Collaborations
• Q&A

Globus Vision
Goal: Accelerate discovery and innovation worldwide
by providing research IT as a service
Leverage software-as-a-service to:
– provide millions of researchers with unprecedented access to
powerful tools for managing Big Data
– reduce research IT costs dramatically via economies of scale

“Civilization advances by extending the number of important
operations which we can perform without thinking of them”
—Alfred North Whitehead , 1911

Challenges in Sequencing Analysis
Data Movement and Access Challenges
•
•
•
•

Shell scripts to sequentially execute the tools
Manually modify the scripts for any change

•

Public
Data

Manually move the data to the Compute node
Install all the tools required for the Analysis

Difficult to maintain and transfer the knowledge

•

BWA, Picard, GATK, Filtering Scripts, etc.

•

Error Prone, difficult to keep track, messy..

Storage

Sequencing
Centers

Fastq

Ref Genome

Research Lab
Seq
Center

Local Cluster/
Cloud

Modify

Picard
Install

•
•
•
•

Difficult to Data is distributed in different locations
Research labs need access to the data for analysis
Be able to Share data with other researchers/collaborators
•
Inefficient ways of data movement
Data needs to be available on the local and Distributed Compute
Resources
•
Local Clusters, Cloud, Grid and transfer the knowledge

Alignment
(Re)Run
GATK

Script
Variant
Calling

How do we analyze this
Sequence Data

Manual Data Analysis

Globus Genomics

Globus Genomics

Galaxy Based
Workflow
Management System
•

Public
Data
Sequencin
g Centers

Globus Provides a
•
High-performance
Research Lab
•
Fault-tolerant
Seq Secure
•
Center

Storage

•
•

Galaxy
Data Libraries

•
Local Cluster/
Cloud

•

file transfer Service between
all data-endpoints

Globus Integrated
within Galaxy
Web-based UI
Drag-Drop workflow
creations
Easily modify
Workflows with new
tools
Analytical tools are
automatically run on
the scalable compute
resources when
possible

Globus Genomics on
Amazon EC2

Data Management

Data Analysis

Globus Genomics Architecture

Figure 2: Globus Genomics Architecture

mputation Institute, University of Chicago, Chicago, IL, USA. 2Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, U
3 Section Genetic Medicine, University of Chicago, Chicago, IL.

Challenges in Next-Gen Sequencing Analysis

Parallel Workflows on Globus Genomics

High Performance, Reusable Consensus
Calling Pipeline

Globus Genomics
• Computational profiles for
various analysis tools
• Resources can be
provisioned on-demand with
Amazon Web Services cloud
based infrastructure
• Glusterfs as a shared file
system between head nodes
and compute nodes
• Provisioned I/O on EBS

Coming soon!
• Integration with Globus Catalog
– Better data discovery and metadata management

• Integration with Globus Sharing
– Easy and secure method to share large datasets with collaborators

• Integration with Amazon Glacier for data archiving
• Support for high throughput computational
modalities through Apache Mesos
– MapReduce and MPI clusters

• Dynamic Storage Strategies using S3 and/or LVMbased shared file system

Our vision for a 21st century
discovery infrastructure
Provide more capability for
more people at lower cost by
building a “Discovery Cloud”
Delivering “Science as a service”

For more information
• More information on Globus Genomics and to
sign up: www.globus.org/genomics
• More information on Globus:
www.globusonline.org
• Follow us on Twitter:
@ianfoster, @madduri, @globusgenomics, @gl
obusonline

Please give us your feedback on this
presentation

BDT 310
As a thank you, we will select prize
winners daily for completed surveys!

re:Invent 2013-foster-madduri

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to re:Invent 2013-foster-madduri

Similar to re:Invent 2013-foster-madduri (20)

Recently uploaded

Recently uploaded (20)

re:Invent 2013-foster-madduri

Editor's Notes