2 October 2017
Scalable and reproducible workflows with Pachyderm
Jon Ander Novella de Miguel
Pharmaceutical Bioinformatics research group
Uppsala, Sweden
APPROACHES TO TACKLE BIOLOGICAL COMPUTATIONS
• Data growth in biomedicine
• Scalable methods for Big Data analytics enabled by cloud computing
METABOLITE DATA
• Mass spectrometry can offer high metabolite coverage
CHALLENGES
• Workflow definitions
• Isolation of scientific software
• Reproducibility
• Parallelisation
WORKFLOW DEFINITIONS
• Stitching many different software tools together is tedious
• Time-intensive and parameter-heavy steps are involved
• Example workflow tools: Taverna, Nextflow, SciPipe
ISOLATION OF SCIENTIFIC SOFTWARE
• Containers wrap an application with its own operating environment
• Portability and environmental consistency
• Useful in science
• Is Vagrant already old-fashioned?
CONTAINER ORCHESTRATION TOOLS
• Deployment, scaling and management of containers in a cluster
• Kubernetes: big and active community [1]
• Automatic healing and machine decoupling
[1] https://www.kubernetes.io
WHAT IS PACHYDERM?
• Workflow system based on Kubernetes
• A distributed data processing tool based on containers
• Enables reproducibility, provenance, parallelization and isolation
“You can focus on being productive, while Pachyderm will scale up and analyze for you” [2]
[2] https://www.pachyderm.io
PACHYDERM FILE SYSTEM (PFS)
PFS offers version control for data. The main primitives are: [3]
• Repositories: versioned collections of data
• Commits: immutable snapshots that record new data added to a repository
• Files: data storage primitives
[3] https://www.pachyderm.io/pfs.html
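The three PFS primitives can be illustrated with a toy in-memory model. This is only a sketch of the idea of versioned data, not Pachyderm's implementation or API: the `Repo` and `Commit` classes, the repository name and the file names are all hypothetical, and each commit stores a full snapshot for simplicity.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    """A snapshot of the repository: maps file paths to file contents."""
    id: str
    parent: Optional[str]
    files: Dict[str, bytes]


@dataclass
class Repo:
    """A versioned collection of data: an ordered history of commits."""
    name: str
    commits: List[Commit] = field(default_factory=list)

    def commit(self, changes: Dict[str, bytes]) -> Commit:
        """Record new or changed files on top of the latest commit."""
        parent = self.commits[-1] if self.commits else None
        files = dict(parent.files) if parent else {}
        files.update(changes)
        # Derive the commit id from the parent id and the snapshot contents.
        digest = hashlib.sha256((parent.id if parent else "").encode())
        for path in sorted(files):
            digest.update(path.encode())
            digest.update(files[path])
        new = Commit(digest.hexdigest()[:12], parent.id if parent else None, files)
        self.commits.append(new)
        return new

    def get_file(self, commit_id: str, path: str) -> bytes:
        """Read a file as it existed in a given commit."""
        for c in self.commits:
            if c.id == commit_id:
                return c.files[path]
        raise KeyError(commit_id)


# Hypothetical repository and file names, for illustration only.
repo = Repo("raw_spectra")
first = repo.commit({"/sample1.mzML": b"scan data v1"})
second = repo.commit({"/sample1.mzML": b"scan data v2", "/sample2.mzML": b"other scans"})

# Older versions stay readable after new data is committed.
print(repo.get_file(first.id, "/sample1.mzML"))
print(repo.get_file(second.id, "/sample1.mzML"))
```

Because every commit keeps a pointer to its parent, any result can be traced back to the exact version of the data it was computed from.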
PACHYDERM PIPELINE SYSTEM (PPS)
• Tasks executed by Kubernetes pods
• Parallelization: spreading data across workers
• Incrementality and glob patterns
• Directed Acyclic Graph (DAG) of pipelines
[4] https://www.pachyderm.io/pps.html
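How glob patterns and incrementality interact can be sketched with a small simulation. This is a simplified illustration under stated assumptions, not Pachyderm's code: the function names are hypothetical, paths are assumed to be flat top-level files, "/" is taken to mean "treat the whole input as one unit of work", and any other pattern makes each matching file its own unit.

```python
import hashlib
from fnmatch import fnmatch
from typing import Callable, Dict, List, Tuple


def split_into_datums(files: Dict[str, bytes], glob: str) -> List[Tuple[str, ...]]:
    """Group file paths into independent units of work.

    "/" yields a single unit holding every file; any other pattern yields
    one unit per matching path (a simplification for flat paths).
    """
    if glob == "/":
        return [tuple(sorted(files))]
    return [(path,) for path in sorted(files) if fnmatch(path, glob)]


def run_incrementally(
    files: Dict[str, bytes],
    glob: str,
    task: Callable[[Dict[str, bytes]], object],
    cache: Dict[str, object],
) -> List[Tuple[str, ...]]:
    """Run task on each unit, skipping units whose content was already seen.

    Returns the units that actually had to be processed in this run.
    """
    processed = []
    for datum in split_into_datums(files, glob):
        key = hashlib.sha256(
            b"\0".join(p.encode() + b"\0" + files[p] for p in datum)
        ).hexdigest()
        if key not in cache:
            cache[key] = task({p: files[p] for p in datum})
            processed.append(datum)
    return processed


def count_bytes(unit: Dict[str, bytes]) -> int:
    """Stand-in for a real processing step."""
    return sum(len(content) for content in unit.values())


files = {"/a.mzML": b"aaa", "/b.mzML": b"bb"}
cache: Dict[str, object] = {}

first_run = run_incrementally(files, "/*", count_bytes, cache)
files["/c.mzML"] = b"c"  # new data arrives
second_run = run_incrementally(files, "/*", count_bytes, cache)

print(first_run)   # both files are processed
print(second_run)  # only the new file is processed
```

Since the units produced by "/*" are independent, they could also be handed to different workers, which is the sense in which data is spread for parallelization.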
GOALS OF THE DAY
• Reproduce a metabolomics workflow with Pachyderm
• Learn how to distribute processing using containers
• Feel the power of data versioning
• Learn how to use containers in a cloud-like distributed processing environment
AN OPENMS-BASED WORKFLOW
• OpenMS: software for metabolite and proteome data analysis and management
• Detection of mass traces and their aggregation into features
• Four pre-processing steps
[Workflow diagram. Steps: File Filter, Feature Finder, Feature Linker, Text Exporter. File labels: X, CSV]
METHODS
• Kubernetes cluster backed by a Vagrant box (VM)
• https://github.com/CARAMBA-Clinic/COST-CHARME/blob/master/README.md
• Execution of the workflow engine in a cloud-like environment via Jupyter
• Downstream analysis in RStudio
WORKFLOW IN PACHYDERM
• Four interconnected tasks/processes
• Intermediate data handled by repositories
• Results also stored in a repository
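The chaining of four tasks through repositories can be sketched by generating one pipeline specification per task. Everything named here is an assumption for illustration: the stage names, the `raw_data` repository, the container image and the commands are placeholders, and the field layout is only modelled loosely on Pachyderm's JSON pipeline specs, whose exact keys depend on the Pachyderm version in use.

```python
import json
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical names: the repository, image and commands are placeholders.
SOURCE_REPO = "raw_data"
IMAGE = "example/openms:latest"

# (stage name, glob pattern). "/*" processes each file on its own; "/" hands
# the whole upstream repository to one task, which a linking step that
# compares samples would need.
STAGES = [
    ("file_filter", "/*"),
    ("feature_finder", "/*"),
    ("feature_linker", "/"),
    ("text_exporter", "/*"),
]


def build_specs(stages, source_repo, image):
    """Chain stages so that each one reads the previous stage's output repository."""
    specs = []
    upstream = source_repo
    for name, glob in stages:
        specs.append({
            "pipeline": {"name": name},
            "transform": {
                "image": image,
                "cmd": ["/bin/sh", "-c", f"echo running {name}"],
            },
            "input": {"pfs": {"repo": upstream, "glob": glob}},
        })
        upstream = name  # assumed convention: output repo is named after its pipeline
    return specs


specs = build_specs(STAGES, SOURCE_REPO, IMAGE)

# Map each pipeline to the repository it depends on, then order the DAG.
dependencies = {s["pipeline"]["name"]: {s["input"]["pfs"]["repo"]} for s in specs}
order = list(TopologicalSorter(dependencies).static_order())

print(json.dumps(specs[0], indent=2))
print(" -> ".join(order))
```

Each intermediate repository is both the output of one task and the input of the next, so the final results and every intermediate file remain versioned and traceable.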
REPRODUCIBLE RESULT
• Thanks to Pachyderm, we get a reproducible and scalable data processing platform
• Can you write your own container and distribute its computation?
THANKS! ANY QUESTIONS?
“Provenance and reproducibility enable a rigorous and efficient data science”
Jon Ander Novella de Miguel
Department of Pharmaceutical Biosciences
Jon.Novella@farmbio.uu.se
