Scalable and reproducible workflows with Pachyderm

This presentation introduces Pachyderm as a tool for building scalable and reproducible workflows in the life sciences. Pachyderm is an open-source workflow engine and distributed data processing tool that leverages the container ecosystem.

Published in: Software

  1. Scalable and reproducible workflows with Pachyderm. Jon Ander Novella de Miguel, Pharmaceutical Bioinformatics research group, Uppsala, Sweden. 2 October 2017.
  2. APPROACHES TO TACKLE BIOLOGICAL COMPUTATIONS • Data growth in biomedicine • Scalable methods for Big Data Analytics enabled by Cloud Computing
  3. METABOLITE DATA • Mass spectrometry can offer high metabolite coverage
  4. CHALLENGES • Workflow definitions • Isolation of scientific software • Reproducibility • Parallelisation
  5. WORKFLOW DEFINITIONS • Stitching together many different software tools is tedious • Time-intensive and parameter-heavy steps are involved • Examples: Taverna, Nextflow, SciPipe
  6. CHALLENGES • Workflow definitions • Isolation of scientific software • Reproducibility • Parallelisation
  7. ISOLATION OF SCIENTIFIC SOFTWARE • Containers wrap an app with its own operating environment • Portability and environmental consistency • Useful in science • Is Vagrant already old-fashioned?
  8. CONTAINER ORCHESTRATION TOOLS • Deployment, scaling and management of containers in a cluster • Kubernetes: big and active community • Automatic healing and machine decoupling • [1] https://www.kubernetes.io
  9. CHALLENGES • Workflow definitions • Isolation of scientific software • Reproducibility • Parallelisation
  10. WHAT IS PACHYDERM? • Workflow system based on Kubernetes • A distributed data processing tool based on containers • Enables reproducibility, provenance, parallelization and isolation • “You can focus on being productive, while Pachyderm will scale up and analyze for you” • [2] https://www.pachyderm.io
  11. PACHYDERM FILE SYSTEM (PFS) • PFS offers version control for data. The main primitives are: • Repositories: versioned collections of data • Commits: new data • Files: data storage primitives • [3] https://www.pachyderm.io/pfs.html
  12. PACHYDERM PIPELINE SYSTEM (PPS) • Tasks executed by Kubernetes pods • Parallelization: spreading data • Incrementality and glob patterns • Directed Acyclic Graph • [4] https://www.pachyderm.io/pps.html
  13. GOALS OF THE DAY • Reproducing a metabolomics workflow with Pachyderm • Learn how to distribute processing using containers • Feeling the power of data versioning • Learn how we can use containers in a cloud-like distributed processing environment
  14. AN OPENMS-BASED WORKFLOW • OpenMS: software for metabolite and proteome data analysis and management • Detection of mass traces and their aggregation into features • Four pre-processing steps • [Diagram labels: CSV; File Filter, Feature Finder, Feature Linker, Text Exporter]
  15. METHODS • Kubernetes cluster backed by a Vagrant box (VM) • https://github.com/CARAMBA-Clinic/COST-CHARME/blob/master/README.md • Execution of the workflow engine in a cloud-like environment via Jupyter • Downstream analysis in RStudio
  16. WORKFLOW IN PACHYDERM • Four interconnected tasks/processes • Intermediate data are handled by repositories • Results are also stored in a repository
  17. REPRODUCIBLE RESULT • Pachyderm gives us a reproducible and scalable data processing platform • Can you write your own container and distribute its computation?
  18. THANKS! ANY QUESTIONS? • “Provenance and reproducibility enable a rigorous and efficient data science” • Jon Ander Novella de Miguel, Department of Pharmaceutical Biosciences, Jon.Novella@farmbio.uu.se
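
The four-task workflow on slides 14 and 16 could be written down roughly as follows. This is only a sketch under several assumptions: the steps are chained in the order the slide lists them, the field names (`pipeline.name`, `transform.image`, `transform.cmd`, `input.pfs.repo`, `input.pfs.glob`) follow the Pachyderm pipeline specification as commonly documented (key names have changed between releases, so check the documentation for your version), and the image name, script paths, input repository name `raw-data` and glob choices are all hypothetical. It relies on the convention that each pipeline writes its output to a repository named after the pipeline, which is how intermediate data ends up in repositories.

```python
import json


def make_pipeline(name, image, cmd, input_repo, glob):
    """Build one pipeline spec as a plain dict (field names are assumptions, see lead-in)."""
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


# (pipeline name, input repo, glob). All names and globs are hypothetical.
# "/*" treats every top-level file as its own unit of work;
# "/" hands the whole repository to a single task (assumed here for linking).
STEPS = [
    ("file-filter", "raw-data", "/*"),
    ("feature-finder", "file-filter", "/*"),
    ("feature-linker", "feature-finder", "/"),
    ("text-exporter", "feature-linker", "/"),
]

PIPELINES = [
    make_pipeline(
        name,
        image="example/openms:latest",          # hypothetical image
        cmd=["/bin/sh", f"/scripts/{name}.sh"],  # hypothetical wrapper script
        input_repo=repo,
        glob=glob,
    )
    for name, repo, glob in STEPS
]


def execution_order(pipelines):
    """Return pipeline names in dependency order; raise if the graph has a cycle."""
    pending = {p["pipeline"]["name"]: p["input"]["pfs"]["repo"] for p in pipelines}
    order = []
    while pending:
        # A pipeline is ready once its input repo is not produced by a pending pipeline.
        ready = sorted(n for n, repo in pending.items() if repo not in pending)
        if not ready:
            raise ValueError("cycle detected: the pipelines do not form a DAG")
        for n in ready:
            order.append(n)
            del pending[n]
    return order


if __name__ == "__main__":
    for spec in PIPELINES:
        print(json.dumps(spec, indent=2))
    print(execution_order(PIPELINES))
```

Each printed JSON document is the kind of file one would hand to the workflow engine; the ordering function only illustrates why the chain forms a directed acyclic graph.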

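The "spreading data" and "incrementality and glob patterns" points on slide 12 can be illustrated with a toy model. This is not Pachyderm's implementation, only a sketch of the idea: a glob pattern splits the files of a commit into independent units of work (called datums here), and a unit is reprocessed only if its content has not been seen before. The function names and file names are hypothetical.

```python
from fnmatch import fnmatch
import hashlib


def split_into_datums(files, glob):
    """Group {path: bytes} into datums keyed by the top-level path matching `glob`.

    The special glob "/" puts every file into one single datum.
    """
    if glob == "/":
        return {"/": dict(files)}
    datums = {}
    for path, content in files.items():
        top = "/" + path.lstrip("/").split("/")[0]
        if fnmatch(top, glob):
            datums.setdefault(top, {})[path] = content
    return datums


def datum_hash(datum_files):
    """Content hash of a datum: depends on both the paths and the bytes."""
    h = hashlib.sha256()
    for path in sorted(datum_files):
        h.update(path.encode())
        h.update(b"\0")
        h.update(datum_files[path])
        h.update(b"\0")
    return h.hexdigest()


def datums_to_process(files, glob, already_done):
    """Return {datum key: hash} for datums whose hash is not in `already_done`."""
    todo = {}
    for key, datum in split_into_datums(files, glob).items():
        digest = datum_hash(datum)
        if digest not in already_done:
            todo[key] = digest
    return todo


if __name__ == "__main__":
    commit_1 = {"/a.mzML": b"spectrum A", "/b.mzML": b"spectrum B"}
    first = datums_to_process(commit_1, "/*", set())
    print(sorted(first))   # both files are new, so both are processed

    commit_2 = dict(commit_1, **{"/c.mzML": b"spectrum C"})
    second = datums_to_process(commit_2, "/*", set(first.values()))
    print(sorted(second))  # only the newly added file needs processing
```

With the glob `/*` each file can go to a different worker, and a new commit triggers work only for new or changed files; with `/` the whole repository is one unit, so any change reprocesses everything.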