2017-11-03 Scientific Workflow Systems


Presented 2017-11-03 to CESAB workshop on Reproducible Workflows, Aix-en-Provence

  1. Scientific Workflow Systems. Stian Soiland-Reyes (@soilandreyes), eScience Lab, The University of Manchester. 2017-11-03, Aix-en-Provence, CESAB workshop: Reproducible Workflows. This work is licensed under a Creative Commons Attribution 4.0 International License.
  2. What is a Workflow? Orchestrating computational tasks; managing the control and data flow. Homogeneous or heterogeneous tasks:
     – Local / remote
     – Own / third party
     – White, grey or black boxes
     – Reliable / fragile
     – Reserved / dynamic
     – Various underpinning infrastructure
     – Various access controls
     BioExcel: Biomolecular recognition
  3. Not on the agenda: business workflows. Control flow of who has responsibility for what (BPM). Business workflows + computational workflows: IBISBA.
  4. Why use workflows?
     Automation
     – Automate computational aspects
     – Repetitive pipelines, sweep campaigns
     Scaling (compute cycles)
     – Make use of computational infrastructure & handle large data
     Abstraction (people cycles)
     – Shield complexity and incompatibilities
     – Report, re-use, evolve, share, compare
     – Repeat, tweak, repeat
     – First class commodities
     Provenance (reporting)
     – Capture, report and utilize log and data lineage; auto-documentation
     – Traceable evolution, audit, transparency
     – Compare
     Findable, Accessible, Interoperable, Reusable (Reproducible)
     Adapted from Bertram Ludäscher at WORKS 2015.
  5. The humble Makefile
  6. Laser Interferometer Gravitational-Wave Observatory (LIGO): first detection of gravitational waves from colliding black holes
  7. Workflow Environment Ecosystem
  10. Pharmacological queries: target, compound and pathway data
  12. Stop Press! GUIs not essential!
      GUI: canvas, drag-drop blocks, arrows, run button, data visualization.
      Script: textual, command line, view data externally. Scripts are easily run from other apps.
      Scripts can be workflows! Workflow systems ⇆ Scripts
      Scripts on the ASAP meter:
      – Automation: ★★★★★
      – Scaling: ★★
      – Abstraction: ★
      – Provenance: ★★
  13. Script-like, define flow as channels. Streaming; automatic parallelism; checkpoints; virtualization and packaging; portable; reproducibility.
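
The "flow as channels" idea can be sketched in plain Python: each process reads items from an input channel and writes results to an output channel, so downstream stages start working while upstream stages are still producing. This is only an illustration using the standard library; the names `process`, `run_pipeline` and `DONE` are hypothetical and not taken from any workflow system.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel (illustrative convention)


def process(func, inbox, outbox):
    """Start a thread that applies func to each item arriving on inbox."""
    def worker():
        while True:
            item = inbox.get()
            if item is DONE:
                outbox.put(DONE)  # propagate end-of-stream downstream
                return
            outbox.put(func(item))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def run_pipeline(items, *funcs):
    """Connect funcs with channels and stream items through them in order."""
    channels = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [process(f, channels[i], channels[i + 1])
               for i, f in enumerate(funcs)]
    for item in items:
        channels[0].put(item)
    channels[0].put(DONE)

    results = []
    while True:
        item = channels[-1].get()
        if item is DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

With one worker per stage the output order matches the input order; a real system would add several workers per stage, checkpoints and error handling.
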
  14. Snakemake: Makefile + Python ⇝ Snakemake. Filename patterns; shell commands; inline Python, R; scalable to grid/cloud.
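
To illustrate what "filename patterns" plus "shell commands" means, here is a minimal sketch in plain Python (not Snakemake syntax, and not how Snakemake is implemented): a rule declares an output pattern with wildcards, and a requested target file is matched against it to derive the input file and the command. The rule dictionary layout and the names `match_rule` and `plan` are assumptions made for this example.

```python
import re


def match_rule(output_pattern, target):
    """Return the wildcard values if target matches the pattern, else None."""
    # "{sample}.sorted.txt" splits into literal parts (even positions)
    # and wildcard names (odd positions).
    parts = re.split(r"\{(\w+)\}", output_pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None


def plan(rule, target):
    """Work out which input and shell command would produce target."""
    wildcards = match_rule(rule["output"], target)
    if wildcards is None:
        return None
    source = rule["input"].format(**wildcards)
    return {
        "input": source,
        "shell": rule["shell"].format(input=source, output=target),
    }


# Hypothetical rule: sort any text file into a ".sorted.txt" sibling.
sort_rule = {
    "output": "{sample}.sorted.txt",
    "input": "{sample}.txt",
    "shell": "sort {input} > {output}",
}
```

Asking for `plan(sort_rule, "reads/a.sorted.txt")` yields the input `reads/a.txt` and the command `sort reads/a.txt > reads/a.sorted.txt`. The sketch does not handle a wildcard that appears twice in one pattern.
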
  15. YesWorkflow: declare workflow steps as #annotations in existing scripts. Graphical visualization of workflow.
  16. nextgen: distributed workflows for Next-Gen Sequencing analysis. Domain-specific language. Focus on parameters, algorithms. Workflow fixed – no command lines!
  17. Workflow interoperability
      – Common workflow format
      – Community based standards effort
      – Designed for clusters & clouds
      – Use containers (e.g. Docker)
      – Textual YAML files (GUIs available)
      – Workflow: steps with data dependencies
      – Step: command line or inline scripts
      – Scatter/gather on steps
      – Rich annotations
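
The two structural ideas on this slide, steps ordered by their data dependencies and scatter/gather over a list of inputs, can be sketched in plain Python. This is not the YAML format the slide refers to; the step table layout and the names `run_workflow` and `scatter` are assumptions for illustration. It needs Python 3.9+ for `graphlib`.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library since Python 3.9


def run_workflow(steps, inputs):
    """Run steps in dependency order.

    steps maps a step name to (function, [names of steps or inputs it needs]).
    """
    results = dict(inputs)
    # Only dependencies that are themselves steps take part in the ordering;
    # the remaining names are workflow inputs that are already available.
    graph = {name: [d for d in deps if d in steps]
             for name, (_, deps) in steps.items()}
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        results[name] = func(*(results[d] for d in deps))
    return results


def scatter(func, max_workers=4):
    """Wrap func so it runs on every list element in parallel (scatter)
    and the results are collected in input order (gather)."""
    def scattered(items):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(func, items))
    return scattered


# Hypothetical two-step workflow: measure each sequence, then sum the lengths.
steps = {
    "total": (sum, ["lengths"]),
    "lengths": (scatter(len), ["sequences"]),
}
outputs = run_workflow(steps, {"sequences": ["ACGT", "AC", "G"]})
```

The steps are listed out of order on purpose: `total` still runs after `lengths` because the order comes from the declared dependencies, not from the order of declaration.
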
  19. Containers
      – Linux container technology: light-weight "virtual" virtual machine
      – A container is started from an image
      – Images downloaded from Docker Hub
      – Dockerfile: layer-based recipe
      – Philosophy: one service, one image → microservices
      – Cloud's best friend: scalable, reproducible, customizable
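
As a rough sketch of how a workflow step might be wrapped in a container, the helper below only builds the `docker run` command line; it does not start anything. The function name `docker_command`, the `/data` mount point and the example image and paths are assumptions for illustration; `--rm`, `-v` and `-w` are standard `docker run` options.

```python
def docker_command(image, command, host_dir=None, mount_point="/data"):
    """Build the argument list for running command inside a fresh container."""
    argv = ["docker", "run", "--rm"]  # --rm: remove the container afterwards
    if host_dir is not None:
        # Share a host directory with the container and work inside it.
        argv += ["-v", f"{host_dir}:{mount_point}", "-w", mount_point]
    argv.append(image)
    argv += list(command)
    return argv


# Hypothetical step: count lines of a file using a stock image.
argv = docker_command("ubuntu:22.04", ["wc", "-l", "reads.txt"],
                      host_dir="/tmp/run1")
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where Docker is installed.
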
  20. Publish your own container images (Dockerfile)
  21. Find and Share
  23. Running workflows, tracking provenance
  24. Provenance
      – W3C standard: PROV
      – But multiple formats, multiple styles, multiple extensions
      – Best practice for workflow provenance? wfprov (Research Object, Taverna); OPMW/P-Plan (WINGS); ProvONE (DataONE)
      Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved.
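
To make the PROV vocabulary concrete, here is a minimal sketch that runs one step and records what it used and generated. The statements loosely follow PROV-N notation (`entity`, `activity`, `used`, `wasGeneratedBy`, `wasDerivedFrom`) but are plain strings that have not been validated against the standard; the `ex:` prefix, the hash-based identifiers and the function name `run_with_provenance` are assumptions for illustration.

```python
import hashlib


def data_id(data: bytes) -> str:
    """Identify a piece of data by a short content hash (illustrative scheme)."""
    return "ex:data-" + hashlib.sha256(data).hexdigest()[:8]


def run_with_provenance(step_name, func, data: bytes, log: list) -> bytes:
    """Apply func to data and append PROV-style statements to log."""
    result = func(data)
    in_id, out_id, act_id = data_id(data), data_id(result), f"ex:{step_name}"
    log.extend([
        f"entity({in_id})",
        f"activity({act_id})",
        f"used({act_id}, {in_id})",            # the step consumed the input
        f"entity({out_id})",
        f"wasGeneratedBy({out_id}, {act_id})",  # the step produced the output
        f"wasDerivedFrom({out_id}, {in_id})",   # data lineage
    ])
    return result


log = []
output = run_with_provenance("uppercase", bytes.upper, b"acgt", log)
```

A workflow engine would typically also record times, agents and the workflow definition itself, which is where extensions such as those listed on the slide come in.
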
  26. Research Object Bundle. Media type: application/vnd.wf4ever.robundle+zip (bioexcel.eu)
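
A Research Object Bundle is a ZIP archive identified by the media type above. The sketch below writes one with Python's standard `zipfile` module. It assumes, based on my reading of the RO Bundle specification, that the first entry is an uncompressed `mimetype` file and that the manifest lives at `.ro/manifest.json` with an `aggregates` list of `uri` entries; treat those details and the helper name `write_bundle` as assumptions to check against the specification.

```python
import io
import json
import zipfile

MEDIA_TYPE = "application/vnd.wf4ever.robundle+zip"  # from the slide


def write_bundle(target, files):
    """Write files (archive name -> text content) into a bundle at target.

    target may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Assumption: "mimetype" comes first and is stored uncompressed.
        zf.writestr("mimetype", MEDIA_TYPE, compress_type=zipfile.ZIP_STORED)
        for name, content in files.items():
            zf.writestr(name, content)
        # Assumption: minimal manifest listing the aggregated resources.
        manifest = {
            "@context": ["https://w3id.org/bundle/context"],
            "id": "/",
            "aggregates": [{"uri": "/" + name} for name in files],
        }
        zf.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))


# Hypothetical content: a one-line workflow file.
buffer = io.BytesIO()
write_bundle(buffer, {"workflow.cwl": "cwlVersion: v1.0\n"})
```

Because the result is an ordinary ZIP file, it can be inspected with any archive tool.
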
  27. Partners. Funding. Acknowledgements: Carole Goble, Michael R. Crusoe, Apache Taverna, BioExcel, Common Workflow Language, Research Object.