Curoverse Presentation at ICG-11 (November 2016)

Federated computing on massive biomedical
data sets across multiple data centers
JONATHAN SHEFFI
CEO, CUROVERSE

Problem
Researchers are struggling to analyze large genomic data sets
Data are physically distributed and difficult to move because of:
• Physical size and network constraints
• Regulatory barriers
• Privacy and competitive concerns

Federated Computing
Data holders: Commercial Data Aggregators, Research Institutions, Medical Centers & Hospitals
1. Curated & indexed data stored via open platform on existing infrastructure
2. Secure, distributed queries & management
3. Seamless experience for researchers using data
Diagram labels: Researcher; Workflows; Answers
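One reading of the diagram labels is that workflows travel to where the data live and only answers travel back. The sketch below illustrates that idea under stated assumptions: the `Site` class, the `federated_query` helper, the site names, and the toy variant records are all hypothetical and are not part of any platform described in this deck.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Site:
    """A hypothetical data holder; its records are never copied out."""
    name: str
    records: List[dict]

    def run(self, workflow: Callable[[List[dict]], int]) -> int:
        # The workflow executes next to the data; only its result is returned.
        return workflow(self.records)


def count_carriers(records: List[dict]) -> int:
    """Example workflow: count samples carrying one variant of interest."""
    return sum(1 for r in records if r["variant"] == "chr1:12345:A>G")


def federated_query(sites: List[Site],
                    workflow: Callable[[List[dict]], int]) -> Dict[str, int]:
    """Send the same workflow to every site and collect the per-site answers."""
    return {site.name: site.run(workflow) for site in sites}


# Toy data, invented purely for illustration.
sites = [
    Site("hospital_a", [{"variant": "chr1:12345:A>G"},
                        {"variant": "chr2:500:C>T"}]),
    Site("institute_b", [{"variant": "chr1:12345:A>G"},
                         {"variant": "chr1:12345:A>G"}]),
]

answers = federated_query(sites, count_carriers)
total = sum(answers.values())
print(answers, total)
```

The researcher sees only the aggregate counts, which is the property that makes this pattern attractive when regulation or size prevents moving the underlying records.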

Why not centralized proprietary SaaS?
• Data are getting larger and harder to move
• IT teams continue to choose a variety of IT infrastructure solutions (public cloud, private cloud, HPC) for good reasons
• Proprietary software makes standardization harder

What are the challenges to federation?
• How do you know you are getting the files you request?
• How do you know your pipeline will run properly in the new environment?
• How can you be sure that the pipelines sent to you are secure?
• How do you discover what data sets are available?

An open source platform for managing and
processing massive data sets
Designed for building federations

Federation challenges solved
• Content addressing guarantees you get the data you expect
• Common Workflow Language and Docker give you reliably reproducible pipelines
• Multi-platform architecture lets you layer on top of existing infrastructure
• Security and credentials that can travel with workflows
• Lightning enables complex variant-level queries, machine learning, normalized VCF generation, GA4GH APIs & Beacon
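Content addressing, as named on this slide, means a block of data is identified by a hash of its own bytes, so a reader can check that what arrived is what was requested. The following is a minimal sketch of that idea only. The `ContentStore` class is hypothetical, and the choice of SHA-256 is an assumption of this sketch, not a statement about the hash or locator format the platform uses.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory store keyed by the hash of each block."""

    def __init__(self) -> None:
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself.
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # Re-hash on read: corrupted or substituted data is detected here.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
addr = store.put(b"ACGTACGT")
print(addr, store.get(addr))
```

Because the address is a function of the bytes, two sites holding the same address hold the same data, which is what lets a federation answer "are you getting the files you request?" without trusting file names.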

Common Workflow Language (commonwl.org)
A community-based global standard for workflow description
PROBLEM
• Difficult to use bioinformatics tools
because of poor run-time packaging
• No mechanism to easily discover the
availability and capabilities of tools
• No standard approach for creating
computational workflows
• Workflows are very difficult to reproduce
because of poor definition & design
• Workflows are not portable across
systems because of DIY approaches
SOLUTION
• Standard for packaging bioinformatics and
data science tools and algorithms into Docker
containers with clear interfaces
• Standard for defining computational
workflows built with tools packaged into
Docker containers
• Adopted by many major platforms in the
space, including Arvados, Galaxy, Taverna,
and Seven Bridges
• More than 250 bioinformaticians and data
scientists participating in creating the standard
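To make "tools packaged into Docker containers with clear interfaces" concrete, the sketch below builds a small CWL v1.0 `CommandLineTool` description as a Python dictionary and serializes it to JSON, which CWL accepts alongside YAML. The `make_tool` helper, the wrapped command (`wc -l`), the input and output names, and the `debian:stable-slim` image are illustrative assumptions, and the result is not validated against a CWL runner here.

```python
import json


def make_tool(base_command, docker_image):
    """Return a CWL v1.0 CommandLineTool description as a plain dict."""
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        # The container image is declared with the tool, so it travels with it.
        "hints": {"DockerRequirement": {"dockerPull": docker_image}},
        # Typed inputs and outputs are the tool's interface.
        "inputs": {
            "input_file": {"type": "File", "inputBinding": {"position": 1}},
        },
        "outputs": {"line_count": {"type": "stdout"}},
        "stdout": "count.txt",
    }


tool = make_tool(["wc", "-l"], "debian:stable-slim")
tool_json = json.dumps(tool, indent=2)
print(tool_json)
```

The point of the format is that the command, its container, and its typed inputs and outputs sit in one portable document, so the same description can be handed to any platform that implements the standard.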

Use cases for federations
• Internal collaboration across countries
• Pharma translational research projects
• Large research consortiums
• Rare disease diagnosis search across institutions
• Clinical testing company operating in multiple geographies
• Clinical trial participant identification

What’s next?
• Wider range of platform support
• Adding a layer of brokering capabilities to coordinate a federation
• Pushing industry adoption of CWL
• Getting more tools containerized and described with CWL
• Integrating with directory services such as Repositive
• Building a registry of tools (Dockstore.org)

Get started
Go to Arvados.org to download the code
Platform available for use under the AGPLv3 open source license
Go to Curoverse.com for commercial support options
Cluster Operations Subscriptions and Professional Services available
