Advances in Scientific Workflow Environments


Advances in Scientific Workflow Environments for BioExcel SIG at European Conference on Computational Biology 2016. Biomolecular Simulation using HPC.

Published in: Science

  1. 1. 2016-09-04 BioExcel SIG, ECCB, Amsterdam Advances in Scientific Workflow Environments Carole Goble, Stian Soiland-Reyes The University of Manchester
  2. 2. What is a Workflow? • Orchestrating multiple computational tasks • Managing the control and data flow between them • In a world that is homogeneous or heterogeneous • Tasks – Local / remote – Local / third party – White, grey or black boxes – Reliable / fragile – Reserved / dynamic – Various underpinning infrastructure – Various access controls BioExcel: Biomolecular recognition
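Slide 2's definition (orchestrating tasks and managing the control and data flow between them) can be sketched in a few lines. This is a minimal illustration, not any particular workflow system: the task names `fetch`, `clean` and `summarise` and the helper `run_workflow` are hypothetical, and the sketch assumes Python 3.9+ for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical three-step pipeline; each task receives its upstream results.
def fetch():
    return [3, 1, 2]

def clean(values):
    return sorted(values)

def summarise(values):
    return {"n": len(values), "max": values[-1]}

# Control flow: task name -> names of the tasks it depends on.
DEPENDS_ON = {"fetch": [], "clean": ["fetch"], "summarise": ["clean"]}
TASKS = {"fetch": fetch, "clean": clean, "summarise": summarise}

def run_workflow(tasks, depends_on):
    """Run every task after its dependencies, passing results downstream."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        # Data flow: outputs of upstream tasks become this task's inputs.
        inputs = [results[dep] for dep in depends_on[name]]
        results[name] = tasks[name](*inputs)
    return results

results = run_workflow(TASKS, DEPENDS_ON)
print(results["summarise"])
```

A real system would add what the slide lists around this core: remote tasks, access control, and handling of fragile steps.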
  3. 3. What is a Workflow? Automation – Automate computational aspects – Repetitive pipelines, sweep campaigns Scaling – compute cycles – Make use of computational infrastructure & handle large data Abstraction – people cycles – Shield complexity and incompatibilities – Report, re-use, evolve, share, compare – Repeat – Tweak – Repeat – First class commodities Provenance – reporting – Capture, report and utilize log and data lineage auto-documentation – Traceable evolution, audit, transparency – Compare With thanks to Bertram Ludascher: WORKS 2015 Keynote Findable Accessible Interoperable Reusable (Reproducible)
  4. 4. Laser Interferometer Gravitational-Wave Observatory – first detection of gravitational waves from colliding black holes
  5. 5. Morphological, hemodynamic and structural analyses linked to aneurysm genesis, growth and rupture. [Susheel Varma]
  6. 6. Galaxy
  7. 7. Marine metagenomics + Bespoke Scripts [Rob Finn]
  8. 8. Open PHACTS BioExcel workflow Targets Pharmacological queries target, compound and pathway data
  9. 9. Scripts, Ensemble toolkit, execution patterns
  10. 10. WF Zoo
  11. 11. Workflow Patterns, templates Data wrangling & analytics Simulations Instrument pipelines + + The Future of Scientific Workflows, Report of DOE Workshop 2015
  12. 12. Workflow Patterns, templates Data wrangling & analytics Simulations Instrument pipelines + + Garijo et al, Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351
  13. 13. Workflow Patterns, templates • Long running and complex code • Tunable parameters and input sets • Simulation sweeps / iterations • Ensembles, comparisons • Tricky set-ups, human-in-the-loop interaction • Computational steering • In situ workflows – multiple tasks, same box, within fixed time – data locality. – human-in-the-loop. – capture provenance. Data wrangling & analytics Simulations Instrument pipelines + +
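The "tunable parameters and input sets" and "simulation sweeps" pattern on slide 13 amounts to running one code over a grid of parameter combinations and comparing the results. A minimal sketch, assuming a stand-in `simulate` function with made-up parameters (`temperature`, `pressure`) in place of a real long-running simulation code:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def simulate(temperature, pressure):
    # Stand-in for a long-running simulation; returns an arbitrary score.
    return temperature * 0.5 + pressure * 2.0

def sweep(grid, max_workers=4):
    """Run simulate() once per combination of the parameter values in grid."""
    names = sorted(grid)
    combos = [dict(zip(names, values))
              for values in product(*(grid[name] for name in names))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as combos.
        scores = list(pool.map(lambda params: simulate(**params), combos))
    return list(zip(combos, scores))

runs = sweep({"temperature": [280, 300], "pressure": [1.0, 2.0]})
best_params, best_score = max(runs, key=lambda run: run[1])
print(best_params, best_score)
```

On a cluster, a workflow system would submit each combination as a job instead of a thread; the sweep logic stays the same.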
  14. 14. Traction + Examples Reuse behaviours Exploratory vs Production Different kinds of user / deployment Developer – User Ratios Biologist Developer Computational Scientist
  15. 15. Existing computational research workflow systems WFMS Zoo
  16. 16. Existing computational research workflow systems
  17. 17. Existing computational research workflow systems s:// Workflow-systems
  18. 18. “Multi-scale” WFMS • Workflow Management System – Its design and reporting environment – Its execution environment • The tasks – tools, codes and services and their execution environments • Stack layer – App level, infrastructure level
  19. 19. Component making Tasks loosely coupled through files, • execute on geographically distributed clusters, clouds, grids across systems • execute on multiple facilities • call host services (web / grid services) DAIC Distributed Area/Instrument Computing “Multi-scale” WFMS Tasks tightly coupled • exchanging info over memory/storage • network of supercomputers • In situ workflows – multiple tasks, same box, within fixed time HPC Interoperability Portability Granularity Maintenance
  20. 20. Workflow Environment Ecosystem
  21. 21. Copernicus workflow engine for parallel adaptive molecular dynamics • Peer-to-peer distributed computing platform – high-level parallelization of statistical sampling problems • Consolidation of heterogeneous compute resources • Automatic resource matching of jobs against compute resources • Automatic fault tolerance of distributed work • Workflow execution engine to define a problem (reporting) and trace its results live (provenance) • Flexible plugin facilities – programs to be integrated to the workflow execution engine Free Energy Workflow using GROMACS
  22. 22. COMPSs/PyCOMPSs: Programmer Productivity framework • Sequential programming – Parallelisation and distribution heavy-lifting – Dependency detection • Infrastructure unaware – Abstract application from underlying infrastructure – Portability • Standard Programming Languages – Java, Python, C/C++ • No (or few!) APIs – Standard Java
  23. 23. Shield the user/programmer Exposure to the infrastructure System Design Manage/minimize data transfers
  24. 24. Stop Press! GUIs not essential! • Canvas, drag-drop blocks, arrows, run button • Command-line & embedding in developer or user applications Scripts can be workflows! • WMS<->Scripts • Script vs Workflows/ASAP: – Automation: ***** – Scaling: ** – Abstraction: * – Provenance: **
  25. 25. Stop Press! GUIs not essential! • Canvas, drag-drop blocks, arrows, run button • Command-line & embedding in developer or user applications Scripts can be workflows! • WMS <-> Scripts • Script vs Workflows/ASAP: – Automation: ***** – Scaling: ** – Abstraction: * – Provenance: ** Work close to a problem-specific ad-hoc data model Domain Specific Language "programming-lite" scripts • wire with declarative "makefile"-like DAG Plus • procedural scripting and expressions in languages like JavaScript and Python Nextflow, Snakemake, Common Workflow Language
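The "declarative makefile-like DAG plus procedural scripting" idea on slide 25 can be illustrated without any of the named tools. The sketch below is not Nextflow, Snakemake or CWL syntax: the rule table, the file names `raw.txt` and `upper.txt`, and the `build` helper are all hypothetical. Rules declare which files they need and produce; ordinary Python functions do the work; a step is re-run only when its output is missing or older than an input.

```python
import os
import tempfile

def make_rules(workdir):
    """Declare rules as: output file -> (input files, action)."""
    raw = os.path.join(workdir, "raw.txt")
    upper = os.path.join(workdir, "upper.txt")

    def write_raw(inputs, output):
        with open(output, "w") as fh:
            fh.write("acgt\n")

    def to_upper(inputs, output):
        with open(inputs[0]) as src, open(output, "w") as dst:
            dst.write(src.read().upper())

    return {raw: ([], write_raw), upper: ([raw], to_upper)}, upper

def build(target, rules, log):
    """Build target's inputs first, then target itself if it is stale."""
    inputs, action = rules[target]
    for dep in inputs:
        build(dep, rules, log)
    stale = (not os.path.exists(target)) or any(
        os.path.getmtime(dep) > os.path.getmtime(target) for dep in inputs)
    if stale:
        action(inputs, target)
        log.append(os.path.basename(target))

with tempfile.TemporaryDirectory() as workdir:
    rules, final = make_rules(workdir)
    first_run, second_run = [], []
    build(final, rules, first_run)   # builds both files
    build(final, rules, second_run)  # nothing to do: outputs are up to date
    with open(final) as fh:
        content = fh.read()
```

The second call doing no work is the property that makes such scripts behave like workflows: the DAG, not the script order, decides what runs.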
  26. 26. GUIs Are Essential: take-up by the user base
  27. 27. Workflowising script software eco-systems prime example: provenance ASAP • common, interoperable provenance recording – W3C PROV ASAP • Annotations in script yield workflow view ASAP • Library profilers – noWorkflow • runtime provenance recorders – Sumatra, RDataTracker
  28. 28. Provenance the link between computation and results W3C PROV model standard record for reporting compare diffs/discrepancies provenance analytics track changes, adapt partial repeat/reproduce carry attributions compute credits compute data quality/trust select data to keep/release optimisation and debugging Metadata propagation –where was the physical sample collected, and who should be attributed? Task-based abstractions: simplifying provenance using motifs and tool annotations “Free energy calculation” rather than 5 steps including preparation of PDB files and GROMACS execution
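Slide 28 treats provenance as the link between computation and results. A minimal sketch of what a runtime recorder captures is shown below. It borrows relation names from the W3C PROV vocabulary (`prov:used`, `prov:generated`, `prov:wasAssociatedWith`, `prov:startedAtTime`) as plain dictionary keys; it is not a conforming PROV serialisation, and the `ProvenanceLog` class is hypothetical. Values are identified by a content hash, so two equal values are treated as the same entity, which is a simplification.

```python
import hashlib
import json
from datetime import datetime, timezone

def digest(value):
    """Short content hash used as an entity identifier (JSON-serialisable values only)."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()[:12]

class ProvenanceLog:
    def __init__(self, agent):
        self.agent = agent
        self.records = []

    def run(self, activity, func, *inputs):
        """Execute func and record which entities it used and generated."""
        started = datetime.now(timezone.utc).isoformat()
        output = func(*inputs)
        self.records.append({
            "activity": activity,
            "prov:used": [digest(item) for item in inputs],
            "prov:generated": digest(output),
            "prov:wasAssociatedWith": self.agent,
            "prov:startedAtTime": started,
        })
        return output

    def lineage(self, value):
        """Return the activities that led to value, most recent first."""
        wanted, chain = {digest(value)}, []
        for record in reversed(self.records):
            if record["prov:generated"] in wanted:
                chain.append(record["activity"])
                wanted.update(record["prov:used"])
        return chain

log = ProvenanceLog(agent="alice")
cleaned = log.run("clean", sorted, [3, 1, 2])
total = log.run("sum", sum, cleaned)
print(log.lineage(total))
```

The `lineage` query is the kind of use the slide lists: tracing a result back for reporting, partial repeats, or attribution.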
  29. 29. Provenance the link workflow variants and workflow reuse and repurpose W3C PROV model standard? record for reporting compare diffs/discrepancies provenance analytics track changes, adapt carry attributions compute design credits versioning, forking, cloning Nested workflows functions by stealth Copy and paste fragmentation Designing for reuse Find and Go Software practices Systematic reuse Guidelines for persistently identifying software using DataCite principles
  30. 30. ASAP Wfms for FAIR Science Automate: workflows, programs and services folks already use or want to use Scale: Enable computational productivity Abstract: Enable human productivity Provenance: Record and use Usability Workflow Plugged in Code Reporting Comparison Thanks to Bertram Ludascher
  31. 31. Dependency Management Codes Behaviours & Reliability
  32. 32. ● Task-specific “mini-workflow” fragments – e.g. using GROMACS, CPMD, HADDOCK ● Packaged – EGI VM images and Docker containers ● Backed by existing registries – ELIXIR’s and EGI App DB ● Instantiated as cloud instances – private (OpenNebula, OpenStack) – public (e.g. Amazon AWS) Application Building Blocks BioExcel Virtualised Software Library “transversal workflow units”, higher level operations
  33. 33. BioExcel Use cases ● Genomics ● Ensembl Molecular simulations ● Free Energy simulations ● Multiscale modelling of molecular basis for odor and taste ● Biomolecular recognition ● Pharmacological queries ● Virtual Screening
  34. 34. Finding valid pathways through free-energy landscapes: implementation of the “string of swarms” method using Copernicus as a workflow manager, and GROMACS as a compute engine.
  35. 35. Workflow Interoperability. • Common format for bioinformatics tool & workflow execution • Community based standards effort • Designed for clusters & clouds • Supports the use of containers (e.g. Docker) • Specify data dependencies between steps • Scatter/gather on steps • Nest workflows in steps • Develop your pipeline on your local computer (optionally with Docker) • Execute on your research cluster or in the cloud • Deliver to users via workbenches • EDAM ontology (ELIXIR-DK) to specify file formats and reason about them: “FASTQ Sanger” encoding is a type of FASTQ file
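The "scatter/gather on steps" feature listed on slide 35 means running one step once per element of an input array and then combining the per-element outputs. The sketch below shows the pattern in plain Python rather than in the workflow format the slide describes; `count_gc`, `scatter_gather` and the sample reads are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def count_gc(sequence):
    """Per-item step: count G and C bases in one sequence."""
    return sum(base in "GC" for base in sequence.upper())

def scatter_gather(step, items, gather, max_workers=4):
    """Scatter: apply step to each item in parallel. Gather: combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(step, items))  # results keep input order
    return gather(partials)

reads = ["ACGT", "GGCC", "atat"]
total_gc = scatter_gather(count_gc, reads, sum)
print(total_gc)
```

In a workflow engine the scattered step would typically run as separate jobs or containers, with the engine handling the data dependencies between the scatter and the gather.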
  36. 36. Workflow Research Object Bundle Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics doi:10.1016/j.websem.2015.01.003 application/vnd.wf4ever.robundle+zip
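The media type on slide 36 identifies a Research Object Bundle as a ZIP-based container. A rough sketch of writing one with the standard library follows. It assumes, from my reading of the RO Bundle specification, that the archive starts with an uncompressed `mimetype` entry and carries its manifest at `.ro/manifest.json`; the manifest content here is a simplified illustration rather than a complete manifest, and `make_bundle` and `workflow.txt` are hypothetical names. Check the specification before relying on any of these details.

```python
import io
import json
import zipfile

MEDIA_TYPE = "application/vnd.wf4ever.robundle+zip"

def make_bundle(files):
    """Return bytes of a ZIP holding a mimetype entry, a manifest and the files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as bundle:
        # Assumption: mimetype comes first and is stored without compression.
        bundle.writestr(zipfile.ZipInfo("mimetype"), MEDIA_TYPE,
                        compress_type=zipfile.ZIP_STORED)
        # Simplified, illustrative manifest listing the aggregated resources.
        manifest = {"aggregates": [{"uri": "/" + name} for name in sorted(files)]}
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2),
                        compress_type=zipfile.ZIP_DEFLATED)
        for name in sorted(files):
            bundle.writestr(name, files[name], compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()

data = make_bundle({"workflow.txt": "placeholder workflow definition\n"})
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
    first = archive.infolist()[0]
    media_type = archive.read("mimetype").decode()
print(names)
```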
  37. 37. Z. Zhao et al., “Workflow bus for e-Science”, in IEEE e-Science 2006, Amsterdam
  38. 38. 2007 2015
  39. 39. research/ Adam Hospital (IRB), Anna Montras (IRB), Stian Soiland-Reyes (UNIMAN), Alexandre Bonvin (UU), Adrien Melquiond (UU), Josep Lluís Gelpí (BSC), Daniele Lezzi (BSC), Steven Newhouse (EBI), Jose A. Dianes (EBI), Mark Abraham (KTH), Rossen Apostolov (KTH), Emiliano Ippoliti (Jülich), Adam Carter (UEDIN), Darren J. White (UEDIN) Slides: Bertram Ludascher, Ewa Deelman, Vasa Curcin, Paolo Missier, Pinar Alper, Susheel Varma, Rob Finn, Michael Crusoe, Rizos Sakellariou Sign up ASAP!
  40. 40. Bonus Slides