FAIR Computational Workflows

Jul. 13, 2021

  1. FAIR Computational Workflows Professor Carole Goble The University of Manchester UK EU Research Infrastructures: ELIXIR, IBISBA, EOSC-Life Centre of Excellence: BioExcel carole.goble@manchester.ac.uk JOBIM 2021, 8th July 2021 https://tinyurl.com/jobim-goble
  2. Computational Workflows for Data intensive Bioscience prepare, analyze, and share increasing volumes of complex data CryoEM Image Analysis Metagenomic Pipelines Drug Discovery Protein Ligand MD Simulation Genome Annotation High Throughput Sequencing Fabrice Allain Romain Dallet
  3. 20 years+ Computational workflows decades in the making…finally coming of age…. doi: 10.1093/gigascience/giaa140 Nature 573, 149-150 (2019) https://doi.org/10.1038/d41586-019-02619-z
  4. What are Data intensive Computational Workflows? Systematic linking together multiple tools and software packages inputs outputs tools, CLI, containers, workflows Scale up Access to computational infrastructure and datasets, tool interoperability, processing portability and optimisation, data wrangling. Specification description Software Execution WfMS Engine Workflow Scale out Flexible workflow composition to construct & run executable control and data flows using heterogeneous software packages, codes, tools, other workflows made by other people.
  5. SARS-CoV-2 allelic-variant surveillance Automated monitoring of structured data from the European COVID-19 Data Portal and national SARS-CoV-2 sequencing datasets, notably COG-UK. Scalable via access to a global distributed compute network • Improved data quality • Uniformly analysed data for downstream analysis & visualisation • Submission of data to public archives • All workflows, data and documentation available https://covid19.galaxyproject.org https://elixir-europe.org/news/covid-19-variants-galaxy https://doi.org/10.1101/2021.03.25.437046 Suite of workflows
  6. Distributed analysis, Pulsar network Managed online hosted Workflow as a Service Platform Designed for direct use by end users - 32K users Experts build workflows that others can use with their own data Researchers build and reuse workflows that are shared End users also use it to access and interact with a tool Workflow and Tool histories and reporting [Björn Grüning]
  7. Those workflows in the WorkflowHub Registry Find, publish and cite workflows and collections. Reuse, recycle, repurpose.
  8. Sharing Accelerates Science Jacques van Helden A digital space for EMERGEN, the French plan for SARS-CoV-2 genomic surveillance and research Adapting and Reusing the ELIXIR Galaxy Workflows Tried and tested transparent methods.
  9. Inter-twingled Workflow System Landscape Scripting environments Interactive Electronic Research Notebooks Workflow Management Systems & execution platforms Repositories Registries Inter-twingling Mix and Matching Interactive & exploratory analysis Production, automated, workflow-integrated software https://s.apache.org/existing-workflow-systems 298 Systems
  10. 10 Handy Properties of Computational Workflows Composition & Abstraction Using the best codes written by 3rd parties Handle heterogeneity Shield complexity & incompatibility Sharable reusable, re-mixable methods Automation Repetitive reproducible pipelines Simulation sweeps Manage data and control flow Optimised monitoring & recovery Automated deployment Scalability & Infrastructure Access Accessing infrastructures, datasets and tools Optimised computation and data handling Parallelisation Secure sensitive data access & management Interoperating datasets & permission handling Reporting & Accreditation Portability Sharing & Adaptability Provenance logging & data lineage Auto-documentation Result comparison Dependency handling Containerisation & packaging Moving between on premise & cloud Shared method, publishable know-how BYOD / parameters Different implementations Changes in execution infrastructure
  11. https://snakemake.github.io/ Workflows are rules: Graph of jobs for automatic parallelisation, DIY package & containerisation installation, auto-documentation from frameworks to web based analysis platforms, hybrid cloud deployment Communities tend to cluster round a few systems. Take up of a WfMS typically depends on the “plugged-in” availability of data type specific codes, skills level of the workflow developers, and popularity. Online portals users build and reuse workflows around publicly available or user-uploaded data and pre-wrapped, pre-installed tools.
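The "workflows are rules" idea on this slide can be sketched in plain Python. This is not Snakemake syntax or its API; the rule names, file names and the `plan` helper are hypothetical, chosen only to show how a requested target file can be resolved backwards into an ordered graph of jobs.

```python
from typing import Dict, List, NamedTuple


class Rule(NamedTuple):
    name: str
    inputs: List[str]
    output: str


# Hypothetical rules: each one declares which files it needs and which it makes.
RULES = [
    Rule("trim", ["reads.fastq"], "trimmed.fastq"),
    Rule("align", ["trimmed.fastq", "genome.fa"], "aligned.bam"),
    Rule("call", ["aligned.bam", "genome.fa"], "variants.vcf"),
]


def plan(target: str, rules: List[Rule]) -> List[str]:
    """Return rule names in an order that produces `target`.

    Files that no rule produces are treated as pre-existing inputs.
    """
    producers: Dict[str, Rule] = {r.output: r for r in rules}
    ordered: List[str] = []
    visiting = set()

    def visit(path: str) -> None:
        rule = producers.get(path)
        if rule is None or rule.name in ordered:
            return
        if rule.name in visiting:
            raise ValueError(f"cyclic dependency at rule {rule.name!r}")
        visiting.add(rule.name)
        for dep in rule.inputs:
            visit(dep)  # dependencies are scheduled before the rule itself
        visiting.discard(rule.name)
        ordered.append(rule.name)

    visit(target)
    return ordered


print(plan("variants.vcf", RULES))
```

Because each rule only names its own inputs and output, jobs with no path between them in the resulting graph could be run in parallel; this sketch only orders them.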
  12. Vive la France! https://galaxy-synbiocad.org/ https://www.biorxiv.org/content/10.1101/2020.06.14.145730v1.full.pdf [Jean-Loup Faulon]
  13. WORKFLOW APPLICATION USER Yes it’s work, Labour saving -> Labour shifting know-how Production platforms & pipelines TOOL DEVELOPER WORKFLOW USER SYS ADMIN WORKFLOW DEVELOPER & CUSTODIAN COMPUTATIONAL USER Workflow System as a Platform Workflow System as a Service Labour Reach need infrastructure & services need tools to be wrapped & maintained need workflows to be developed, tested, run & maintained need to find and understand workflows, with explanations to use properly and safely.
  14. from compounds & genomics to tissue banks, from plants to marine to humans… https://lifescience-ri.eu/ An open collaborative space for digital biology in Europe
  15. A Workflow and Tools Collaboratory A data and method commons Workflows are an entry point to the tools and datasets of EOSC-Life functions for production quality FAIR data processing and access to secure data processing With thanks: Romain Dallet Galaxy Genome Annotation (GGA) environment in the cloud
  16. The EOSC-Life Workflow Collaboratory People -> workflows, services and standards for FAIR Workflows.
  17. Computational Workflow Framings Reproducibility Replication Regulation Labour saving Productivity Reliability Knowledge sharing Adaption Scholarly Objects Democratisation of computational analysis & methods
  18. Computational Workflow Framing: FAIR Principles The EOSC-Life FAIR Workflow Collaboratory A set of guiding principles to enhance the value of all digital resources and their reuse by people and by machines aligning a community around a journey to common data guidelines To help accelerate science so folks can find and reuse and interlink data – and tools and workflows too! Consumers and producers all benefit.
  19. Computational Workflow Framing: FAIR Principles The EOSC-Life FAIR Workflow Collaboratory FAIR is the EOSC glue to federate data and services, to apply to all objects
  20. How the FAIR Principles look RDA FAIR Data Maturity Model. Specification and Guidelines https://zenodo.org/record/3909563#.YORYkUzTX19 https://www.go-fair.org/fair-principles/ Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
  21. FAIR Principles for Data tl;dr https://www.go-fair.org/fair-principles/ Persistent human readable and machine-actionable metadata Linked metadata and community standards Persistent identifiers Clear licensing and access rules Protocols for machine accessibility Register / Index
  22. FAIR for Software Software is a digital object but research software is not (just) data https://www.rd-alliance.org/groups/fair-4-researchsoftware-fair4rs-wg FAIR for Research Software (FAIR4RS) working group Katz et al., 2016; Lamprecht et al., 2019 FAIR4RS First Draft of FAIR4RS principles CodeMeta https://github.com/codemeta/codemeta/
  23. https://www.softwareheritage.org/ https://www.cascad.tech/ puts software on a par with publications and data and announces a number of measures designed to open research software and better recognize software development in research. https://cache.media.enseignementsup-recherche.gouv.fr/file/science_ouverte/20/9/MEN_brochure_PNSO_web_1415209.pdf
  24. Data and software are first class objects and there will be sharing. Primary responsibility aimed at creators and providers for benefit of consumers but consumers need to shoulder responsibility too. Operating in an (open) ecosystem. Adoption at scale in legacy settings. Not a green-field site. EOSC-Life FAIR Workflow Collaboratory FAIR Implicit Assumptions in the Principles
  25. FAIR Principles for Workflows Hybrid Processual Digital Objects Method “Data” Objects Workflows as FAIR Software FAIR+R and FAIR++ The principles can be revised Workflows as FAIR Digital Objects Data-like method objects The principles can be adapted Workflows as FAIR Data Instruments FAIRification of the dataflow The data principles can be supported C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R. Crusoe, K. Peters & D. Schober. FAIR computational workflows. Data Intelligence 2(2020), 108–121. doi: 10.1162/dint_a_000 Workflow Objects Software Objects Composable Usable Reusable FAIR Data
  26. Abstraction & Reporting Separation of the workflow specification from its execution & tools Specification description Software Execution Precise description of a procedure composed of multiple steps coordinated by input/output data relationships. Execution of computational and composite processes with data consumed & produced by each step. WfMS Engine Workflow Sub Workflows Tools and codes Parameters Inputs Outputs Infrastructure Guidance Associated Objects Data Logs / Histories / Provenance Services, e.g. Test engines + Related workflows Checker workflows Contextual Entities Metadata Graphs Sample input parameters, test data Software Management
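A minimal sketch of the separation described on this slide, assuming a toy format of my own: the specification is plain data (steps linked by input/output relationships), and a separate, very small "engine" executes it with whatever tools it has available. The `SPEC` layout, the tool names and the `run` function are all hypothetical and do not correspond to any real WfMS.

```python
from typing import Any, Callable, Dict

# Specification: plain data describing steps and how data flows between them.
# Source values name either a workflow input or "<step>/<output>".
SPEC = {
    "inputs": ["text"],
    "steps": {
        "split": {"tool": "split_words", "in": {"text": "text"}, "out": "words"},
        "count": {"tool": "count_items", "in": {"items": "split/words"}, "out": "n"},
    },
    "outputs": {"word_count": "count/n"},
}

# Execution side: the tools this particular engine happens to provide.
TOOLS: Dict[str, Callable[..., Any]] = {
    "split_words": lambda text: text.split(),
    "count_items": lambda items: len(items),
}


def run(spec: dict, tools: Dict[str, Callable[..., Any]], **inputs: Any) -> Dict[str, Any]:
    """Execute each step once all of its data sources are available."""
    values: Dict[str, Any] = dict(inputs)
    pending = dict(spec["steps"])
    while pending:
        ready = [name for name, step in pending.items()
                 if all(src in values for src in step["in"].values())]
        if not ready:
            raise ValueError(f"unsatisfiable steps: {sorted(pending)}")
        for name in ready:
            step = pending.pop(name)
            kwargs = {param: values[src] for param, src in step["in"].items()}
            values[f"{name}/{step['out']}"] = tools[step["tool"]](**kwargs)
    return {out: values[src] for out, src in spec["outputs"].items()}


print(run(SPEC, TOOLS, text="find access interoperate reuse"))
```

Swapping `TOOLS` for a different set of implementations leaves `SPEC` untouched, which is the property that makes the specification shareable independently of where it runs.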
  27. https://bioexcel.eu/speed-up-your-biomolecular-simulations-with-workflows-using-the-bioexcel-building-blocks-biobb/ Image credit: Bioexcel Centre of excellence Composition & Portability Analysis components - different codes/languages/third parties/compute
  28. FAIR Principles for Workflows coping with Hybrid Processual Digital Objects Composition & agency Usable not just reusable Abstraction forms Living & reusable parts & whole versioned, forked, cloned parts recycled, repurposed, remixed limited lifespans citable credit executability reproducibility, portability testing, maturity quality, maintainability specification implementation instantiation run result FAIR+R FAIR++ modularisation FAIR parts & dependencies propagation of FAIR properties
  29. Findable & Accessible register workflows with assigned PID + metadata in a searchable resource. https://workflowhub.eu Publishing Services Journals Digital Objects of Scholarship published, cited, exchanged, reviewed, validated & reused in new and different ways • Versioned identifiers • DOI assignment (https://doi.org/10.48546/workflowhub.workflow.29.2) • Collections, Canonical workflow libraries scripts Repos Containers Deploys Tools Agnostic and generous with the many WfMSs (with different degrees of support) • Workflows can be in native places • Metadata standards framework that all services can adopt on a spectrum and handles associated objects and links between objects. • Perpetual development by an open community
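The example DOI on this slide appears to end in a workflow number followed by a version number. The sketch below splits such an identifier into its parts. The pattern is an assumption inferred from that single example, not a documented WorkflowHub rule, and `parse_workflow_doi` is a hypothetical helper.

```python
import re
from typing import NamedTuple


class WorkflowDoi(NamedTuple):
    prefix: str
    workflow_id: int
    version: int


# Assumed pattern, inferred only from the single example DOI on the slide.
_PATTERN = re.compile(
    r"^(?:https://doi\.org/)?(10\.\d+)/workflowhub\.workflow\.(\d+)\.(\d+)$"
)


def parse_workflow_doi(doi: str) -> WorkflowDoi:
    """Split a versioned workflow DOI into prefix, workflow number and version."""
    match = _PATTERN.match(doi.strip())
    if match is None:
        raise ValueError(f"not a recognised versioned workflow DOI: {doi!r}")
    prefix, workflow_id, version = match.groups()
    return WorkflowDoi(prefix, int(workflow_id), int(version))


print(parse_workflow_doi("https://doi.org/10.48546/workflowhub.workflow.29.2"))
```

Keeping the version inside the identifier means a citation pins one exact state of a workflow that continues to change.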
  30. licensing authors & credit analytics access search versions & status other workflows Biggest challenge? Metadata of course! Work with WfMS to auto-extract metadata + provide metadata services
  31. More than just a list 3 Spaces, Teams, People Linking up providers and users Building visibility & reputation Reciprocity to close the “Find – Get– Use – Credit” loop Research objects to be cited Build Knowledge Graphs linking out to OpenAIRE, DataCite and other tools
  32. FAIR Workflows are FAIR Software lifecycle support for living objects Indicators of Status Workflow monitoring Register versions Version PIDs Support GitHub Actions Track authors and contributions Incremental metadata and supplementary materials Track & lift out sub-workflows R1.2: (Meta)data and software are associated with detailed provenance
  33. Tool Registry Service API Accessible metadata and workflows are retrievable by their PID using a standardized communication protocol GitHub page: https://github.com/ga4gh/tool-registry-service-schemas
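As a sketch of "retrievable by their PID using a standardized communication protocol", the function below builds the GA4GH TRS v2 URL for the primary descriptor of one tool version (`/tools/{id}/versions/{version_id}/{type}/descriptor`). The base URL, the identifier `29`, version `2` and descriptor type `CWL` are assumptions for illustration; no request is made, so the actual response is not shown.

```python
from urllib.parse import quote

# Assumption: the registry exposes TRS v2 under this base path.
TRS_BASE = "https://workflowhub.eu/ga4gh/trs/v2"


def descriptor_url(tool_id: str, version_id: str, descriptor_type: str,
                   base: str = TRS_BASE) -> str:
    """Build the TRS v2 URL for the primary descriptor of one tool version."""
    return "/".join([
        base.rstrip("/"),
        "tools", quote(tool_id, safe=""),          # ids may contain "/" and must be encoded
        "versions", quote(version_id, safe=""),
        quote(descriptor_type, safe=""), "descriptor",
    ])


# Illustrative values only; whether this entry exists is not checked here.
print(descriptor_url("29", "2", "CWL"))
```

Because the path layout is fixed by the standard, the same client code can point at any registry that implements TRS by changing only `base`.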
  34. FAIR Metadata for Machines Machine and human readable canonical descriptions of the workflow that are WfMS neutral https://www.commonwl.org Canonical description of the workflow Linked to containerised tools Aid collaboration & knowledge transfer Standardise expression of workflow Describe engine neutral portable, reusable workflows Reduce vendor / project lock-in Enable workflow comparisons “Abstract” CWL
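To make "canonical description of the workflow" concrete, here is a small CWL-style workflow document held as a Python dictionary (CWL documents can be serialised as JSON as well as YAML), with a check that every data link resolves. The two steps and the files `trim.cwl` and `assemble.cwl` are hypothetical, and `unresolved_sources` is a simplified check of my own, not a replacement for a CWL validator.

```python
import json
from typing import List

# Hypothetical two-step workflow; the tool descriptions trim.cwl and
# assemble.cwl are assumed to exist alongside this document.
workflow = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "inputs": {"reads": "File"},
    "outputs": {
        "contigs": {"type": "File", "outputSource": "assemble/contigs"},
    },
    "steps": {
        "trim": {
            "run": "trim.cwl",
            "in": {"reads": "reads"},
            "out": ["trimmed"],
        },
        "assemble": {
            "run": "assemble.cwl",
            "in": {"reads": "trim/trimmed"},
            "out": ["contigs"],
        },
    },
}


def unresolved_sources(wf: dict) -> List[str]:
    """List data links that point at no workflow input or step output."""
    known = set(wf["inputs"])
    for name, step in wf["steps"].items():
        known.update(f"{name}/{out}" for out in step["out"])
    wanted = [src for step in wf["steps"].values() for src in step["in"].values()]
    wanted += [out["outputSource"] for out in wf["outputs"].values()]
    return [src for src in wanted if src not in known]


print(json.dumps(workflow, indent=2))
print(unresolved_sources(workflow))
```

The description names steps, their tools and the data passed between them without saying which engine or infrastructure runs them, which is what allows comparison and exchange across systems.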
  35. Design by canonical, modularised workflow blocks Build a library of tested and validated CWL blocks CWL: • Canonical descriptions • Recycle descriptions and sub-workflows • Platform independent pipeline exchange and comparison Rob Finn Folker Meyer AWE MEGAHIT Assembly pipeline [with thanks to Rob Finn]
  36. Extensible Metadata Framework that caters for all those processual FAIR criteria Common metadata about the workflow, tools & parameters Canonical workflow description of the steps of the workflow Type the input and outputs of the steps Run Provenance / Histories / Tests Format for packaging a workflow, its metadata and companion objects (links to containers, data etc) for exchange, archiving, reporting, citing. FAIR Digital Object All Open Communities
  37. Bioschemas lightweight metadata Extensible and Linked metadata in service of the Life Science Community Open community reusing industry de facto standard Computation workflow profile Formal parameter profile https://bioschemas.org Opinionated use of schema.org, the web resource mark-up used by search engines, knowledge graphs and increasingly science as a whole. Computational tool Herve Menager Pasteur Alban Gaignard Nantes
  38. Workflow Digital Objects Lightweight way of packaging everything together regardless where or what it is https://www.researchobject.org/ro-crate/ Format for packaging up scattered resources and self describing the package and its parts to get an integrated view + context, using metadata and PIDs to reference digital and real things - data, workflows & people, places. Web-native, off the shelf - machine and human readable, search engine & developer friendly. Infrastructure independent & self-describing PIDs, JSON-LD, Schema.org, archive formats Extensible and open-ended to cope with diversity and legacy “Duck typing” using profiles + added schema.org and domain ontologies
  39. RO-Crate Profile Variants Galaxy-Workflow-RO-Crate Workflow-RO-Crate Workflow-Testing-RO-Crate Workflow-Run-RO-Crate BioComputeObject-RO-Crate IEEE P2791-2020 https://www.researchobject.org/ro-crate/profiles
  40. BioComputeObject - Regulation why and how to use a workflow IEEE P2791-2020 robust, safe exchange & reuse of HTS computational analytical workflows http://biocomputeobject.org Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al “Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results” PLOS Biology 2018, https://doi.org/10.1371/journal.pbio.3000099 https://biocompute-objects.github.io/bco-ro-crate/ “Sidecar” third party metadata files inside the RO-Crate FAIR has to operate in a legacy ecosystem format
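A minimal sketch of what the packaging metadata can look like: the function below builds the JSON-LD content of an `ro-crate-metadata.json` file following the general RO-Crate 1.1 layout (a metadata descriptor, a root dataset, and one workflow file). The file name, title, licence and date are placeholders, `make_workflow_crate` is a hypothetical helper, and the result is deliberately incomplete with respect to the Workflow RO-Crate profile (for example, it omits `programmingLanguage`).

```python
import json


def make_workflow_crate(workflow_file: str, name: str, description: str,
                        license_url: str, date_published: str) -> dict:
    """Build minimal RO-Crate style metadata for a crate holding one workflow."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # descriptor: says which entity is the root of this crate
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # root dataset: the package as a whole
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "datePublished": date_published,
                "license": {"@id": license_url},
                "mainEntity": {"@id": workflow_file},
                "hasPart": [{"@id": workflow_file}],
            },
            {   # the workflow file, typed so consumers can recognise it
                "@id": workflow_file,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
            },
        ],
    }


crate = make_workflow_crate(
    "workflow.cwl",                      # placeholder file name
    "Example assembly workflow",         # placeholder title
    "Illustrative crate with a single workflow file.",
    "https://spdx.org/licenses/Apache-2.0",
    "2021-07-08",
)
print(json.dumps(crate, indent=2))
```

Entities refer to each other only by `@id`, so further parts (test data, containers, people) can be added to the graph without changing the existing entries.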
  41. FAIR Digital Objects RO-Crate a step towards FAIR Digital Object Middleware “To be FAIR each digital object type has its own metadata requirements, and may have its own repositories and registries” FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units: https://doi.org/10.3390/publications8020021 https://fairdo.org https://fairdo.org/wg/fdo-cwfr/
  42. Lightweight Semantic Workflow Underware is ready! A2. metadata are accessible, even when the workflow is no longer available Metadata preservation...beyond any one service. RO-Crate archive preserves metadata and workflow, republished in a long-term archive Archiving General Executing Testing & Monitoring WfMS R1. workflows are richly described with a plurality of accurate and relevant attributes Automating metadata as much as possible, which means on-boarding WfMS and FAIR services Enough metadata that a workflow is read-reproducible as a method description
  43. FAIR Software - not just Reusable but Usable i.e. can be executed once accessed Multiple wf/test backends: Galaxy Planemo, CWL, Jenkins … Check workflow performance, provenance on containers, memory usage … Testing and monitoring Containers & Packaging FAIR+R FAIR++ Tool Registry Service API UI to start computational tasks based on containerised software https://github.com/inab/WfExS-backend High-level workflow execution service backend, sensitive data analysis & running on private clouds, produces & consumes RO-Crate
  44. Reproducibility – Repeatability Provenance & Preservation Workflow-Run-RO-Crate Some heavy lifting … when is FAIR enough? https://iitdbgroup.github.io/ProvenanceWeek2021/ July 22nd 2021 It’s free!! R1.2: (Meta)data and software are associated with detailed provenance - not just the workflow but the run record associated with the data it produced ….
  45. FAIR Interoperability and Reusability = Composability *Reusable (can be understood, modified, built upon or incorporated into other software) Software interoperates with other software through community standard APIs and community standard (meta)data Software includes qualified references to other objects Richly described Well documented Licensed Sample input parameters and test data Checker workflows Track versions Programmatic access to (meta)data Libraries of canonical workflow blocks Make tools workflow-ready Wrap tools *FAIR4RS Proposed Principles for FAIR Software Design for FAIR Data Design for Reuse Community Review Community Curation Certification Best Practice Licence combinations Access permissions Local -> Global identifiers
  46. FAIR takes a village it's a JOINT responsibility and opportunity! In order for data to be FAIR, you need services that enable FAIR Be a good plug-in tool and data citizen enable programmatic access to datasets make clean tool interface avoid usage restrictions use open community data standards and formats simplify installation code for portability, parallelisation & reproducibility manage versions register! document! Be a good workflow maker......and user use and make FAIR identifiers for data license data outputs use open community data standards and formats validate parameters use a WfMS that tracks data provenance consider secure data processing manage versions design tests and test data credit tool and sub-workflow makers choose FAIR data services register! document! build libraries! use well documented FAIR enabling and FAIR workflows credit the makers!
  47. FAIR takes a village it's a JOINT responsibility and FAIR ≠ FREE Advocate standards & practice Sustain and manage infrastructure Credit and incentives Maturity models & metrics Certification and canonical libraries In order for data to be FAIR, you need services that enable FAIR Training, Stewardship & Sustainability Workflows are an entry point to the tools and datasets of EOSC-Life and functions for FAIR data.
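The point that provenance covers "the run record associated with the data it produced" can be sketched as follows. This is an ad hoc record of my own, not the Workflow-Run-RO-Crate format; the workflow identifier, file names, contents and timestamps are all placeholders. Each input and output is identified by its SHA-256 digest so the record can later be checked against the actual files.

```python
import hashlib
import json
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Content digest used to identify an exact input or output."""
    return hashlib.sha256(data).hexdigest()


def run_record(workflow_pid: str, inputs: Dict[str, bytes],
               outputs: Dict[str, bytes], started: str, ended: str) -> dict:
    """Tie the outputs of one run to the workflow version and inputs that produced them."""
    return {
        "workflow": workflow_pid,
        "started": started,
        "ended": ended,
        "used": {name: sha256_hex(data) for name, data in sorted(inputs.items())},
        "generated": {name: sha256_hex(data) for name, data in sorted(outputs.items())},
    }


# Placeholder identifier, contents and times, for illustration only.
record = run_record(
    "https://example.org/workflows/1?version=2",
    inputs={"reads.fastq": b"@r1\nACGT\n+\nFFFF\n"},
    outputs={"report.txt": b"1 read\n"},
    started="2021-07-08T09:00:00Z",
    ended="2021-07-08T09:00:05Z",
)
print(json.dumps(record, indent=2))
```

Recording the versioned workflow identifier alongside the digests is what lets a later reader tell whether a differing result came from changed data or a changed method.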
  48. FAIR Computational Workflows: TL;DL Modern bioinformatics increasingly leans on computational workflows as production workhorses and transparent, reproducible processing. Workflows democratise access to data and infrastructure and sharing of complex processing. Workflows are hybrid Digital Objects of scholarship that should be FAIR which means defining FAIR, and the necessary standards, services and processes. FAIR is an opportunity and necessity to get wider uptake of workflows FAIR data, workflows and their infrastructure and everything else takes a village where everyone shoulders responsibility for the benefit of all.
  49. Acknowledgements The WorkflowHub Club, Bioschemas Community, RO-Crate Community, CWL Community, Galaxy Europe, EOSC-Life and ELIXIR Tools Platform. Special Thanks Stian Soiland-Reyes (U Manchester / U Amsterdam) Paul Brack, Stuart Owen, Finn Bacall, Alan Williams (U Manchester) Björn Grüning (U Freiburg) Frederik Coppens (VIB) Sarah Jones (GEANT) Herve Menager (Pasteur Institute) Sarah Cohen-Boulakia (U Paris Saclay) Dan Katz (U Illinois Urbana-Champaign) Simone Leo (CRS4) Laura Rodriguez-Navas (BSC) José Mª Fernández (BSC) EOSC-Life https://www.eosc-life.eu/ ELIXIR http://elixir-europe.org RO-Crate https://www.researchobject.org/ro-crate/ WorkflowHub https://workflowhub.eu/ and workflowhub.org Galaxy Europe https://galaxyproject.eu/ Bioschemas https://bioschemas.org Common Workflow Language https://www.commonwl.org/