Federated computing on massive biomedical
data sets across multiple data centers
JONATHAN SHEFFI
CEO, CUROVERSE
Researchers are struggling
to analyze large genomic
data sets
Problem
Data are physically distributed and difficult to
move because of:
• Physical size and network constraints
• Regulatory barriers
• Privacy and competitive concerns
Federated Computing
[Diagram: Researcher; Commercial Data Aggregators; Research Institutions; Medical Centers & Hospitals. Arrow labels: Workflows, Answers.]
1. Curated & indexed data stored via open platform on existing infrastructure
2. Secure, distributed queries & management
3. Seamless experience for researchers using data
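One way to read the Workflows and Answers labels is that a workflow travels to each data holder and only its result travels back. The sketch below illustrates that reading in plain Python. The `Site` class, the `federated_query` helper, and the variant string are hypothetical names made up for this example; this is not the Arvados API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]
Workflow = Callable[[List[Record]], int]


@dataclass
class Site:
    """Hypothetical data holder (hospital, institution, aggregator)."""
    name: str
    records: List[Record]

    def run(self, workflow: Workflow) -> int:
        # The workflow executes where the data lives; only its result leaves.
        return workflow(self.records)


def count_carriers(records: List[Record]) -> int:
    """Example workflow: count records carrying a made-up variant."""
    return sum(1 for r in records if r["variant"] == "chr1:12345:A>G")


def federated_query(sites: List[Site], workflow: Workflow) -> Dict[str, int]:
    """Send the same workflow to every site and collect one answer per site."""
    return {site.name: site.run(workflow) for site in sites}


sites = [
    Site("hospital", [{"variant": "chr1:12345:A>G"}, {"variant": "chr2:777:C>T"}]),
    Site("institute", [{"variant": "chr1:12345:A>G"}]),
]
answers = federated_query(sites, count_carriers)
```

Here `answers` holds one count per site, and the raw records never appear in the return value.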
Why not centralized proprietary SaaS?
• Data is getting larger and harder to move
• IT teams continue to choose a variety of IT
infrastructure solutions (public cloud, private cloud,
HPC) for good reasons
• Proprietary software makes standardization harder
What are the challenges to federation?
• How do you know you are getting the files you
request?
• How do you know your pipeline will run
properly in the new environment?
• How can you be sure that the pipelines sent to
you are secure?
• How do you discover what data sets are
available?
An open source platform for managing and
processing massive data sets
Designed for building federations
Federation challenges solved
• Content addressing guarantees you get the data you expect
• Common Workflow Language and Docker give you reliably
reproducible pipelines
• Multi-platform architecture lets you layer on top of existing
infrastructure
• Security and credentials that can travel with workflows
• Lightning enables complex variant-level queries, machine
learning, normalized VCF generation, GA4GH APIs & Beacon
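The content addressing claim can be illustrated with a minimal sketch: data is stored under a hash of its own bytes, so a reader can recompute the hash and detect any mismatch. The `ContentStore` class is hypothetical, and SHA-256 is chosen here only for illustration; this is not a description of how the platform computes its addresses.

```python
import hashlib


class ContentStore:
    """Toy in-memory store keyed by the hash of the stored bytes."""

    def __init__(self) -> None:
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself.
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str) -> bytes:
        data = self._blocks[address]
        # Re-hash on read: altered or substituted content is rejected.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


store = ContentStore()
address = store.put(b"ACGTACGT")
same_bytes = store.get(address)
```

Because the address depends only on the bytes, two sites holding the same data derive the same address, and a request by address cannot silently return different content.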
Common Workflow Language (commonwl.org)
A community-based global standard for workflow description
PROBLEM
• Difficult to use bioinformatics tools
because of poor run-time packaging
• No mechanism to easily discover the
availability and capabilities of tools
• No standard approach for creating
computational workflows
• Workflows are very difficult to reproduce
because of poor definition & design
• Workflows are not portable across
systems because of DIY approaches
SOLUTION
• Standard for packaging bioinformatics and
data science tools and algorithms into Docker
containers with clear interfaces
• Standard for defining computational
workflows built with tools packaged into
Docker containers
• Adopted by many major platforms in the
space, including Arvados, Galaxy, Taverna,
and Seven Bridges
• More than 250 bioinformaticians and data
scientists participating in creating the standard
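As a rough illustration of a tool packaged into a Docker container with a clear interface, the sketch below builds a small CWL v1.0 `CommandLineTool` description as a Python dictionary and renders it as JSON, which CWL accepts alongside YAML. The tool (`wc -l`), the image name `debian:stable-slim`, and the input and output names are placeholders chosen for this example, not taken from the presentation.

```python
import json

# A minimal CWL v1.0 tool description: count the lines of one input file.
line_count_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": ["wc", "-l"],
    # The container image is a placeholder; any image providing `wc` would do.
    "hints": {"DockerRequirement": {"dockerPull": "debian:stable-slim"}},
    # Interface: one File input, passed as the first positional argument.
    "inputs": {
        "input_file": {"type": "File", "inputBinding": {"position": 1}},
    },
    # Interface: the captured standard output is the tool's single output.
    "stdout": "line_count.txt",
    "outputs": {"line_count": {"type": "stdout"}},
}


def render_tool(tool: dict) -> str:
    """Serialize a tool description to JSON text."""
    return json.dumps(tool, indent=2)


document = render_tool(line_count_tool)
```

The resulting text could be saved to a file and handed to a CWL runner; the declared inputs and outputs are what make the tool discoverable and composable into larger workflows.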
Use cases for federations
• Internal collaboration across countries
• Pharma translational research projects
• Large research consortiums
• Rare disease diagnosis search across institutions
• Clinical testing company operating in multiple geographies
• Clinical trial participant identification
What’s next?
• Wider range of platform support
• Adding a layer of brokering capabilities to coordinate a
federation
• Pushing industry adoption of CWL
• Getting more tools containerized and described with CWL
• Integrating with directory services such as Repositive
• Building a registry of tools (Dockstore.org)
Get started
Go to Arvados.org to download the code
Platform available for use under the AGPLv3 open source license
Go to Curoverse.com for commercial support options
Cluster Operations Subscriptions and Professional Services available

Curoverse Presentation at ICG-11 (November 2016)
