FAIR Computational Workflows
Professor Carole Goble
The University of Manchester UK
EU Research Infrastructures: ELIXIR, IBISBA, EOSC-Life
Centre of Excellence: BioExcel
Software Sustainability Institute UK
FAIRDOM Consortium
carole.goble@manchester.ac.uk
GCB 2021, 7th September 2021
https://gcb2021.de/
P44, Implementation of a scalable
SARS-CoV-2 NGS data processing,
variant calling and correlation analysis
pipeline with snakemake
[Norma J. Wendel]
GCB2021
Many examples of
multi-step processing and analytics
(and all those ML pipelines!)
Systematic linking together of
multiple tools and software
packages using computational
infrastructure
Some using
Computational Workflow Systems
Computational Workflows for Data-intensive Bioscience
prepare, analyze, and share increasing volumes of complex data
Bioimage analysis with deep learning for
everyone: Visual programming in JIPipe
[Ruman Gerst, Jan-Philipp Praetorius]
APEER: A cloud-based digital microscopy
platform to create image processing
workflows.
[Bernhard Fichti]
Computational Workflows for Data-intensive Bioscience
prepare, analyze, and share increasing volumes of complex data
CryoEM Image Analysis
Metagenomic Pipelines
Drug Discovery
Protein Ligand MD
Simulation
Genome Annotation
High Throughput Sequencing
[Fabrice Allain
JOBIM2021]
[Romain Dallet
JOBIM2021]
[Adam Hospital]
[Rob Finn]
[Carlos Oscar Sorzano Sanchez]
20+ years
Computational workflows
decades in the making… finally coming of age…
doi: 10.1093/gigascience/giaa140
Nature 573, 149-150 (2019)
https://doi.org/10.1038/d41586-019-02619-z
What are Data-intensive Computational Workflows?
Systematic linking together of multiple tools and software packages
inputs
outputs
tools, CLI,
containers,
workflows
Scale up
Access to computational infrastructure
and datasets, tool interoperability,
processing portability and
optimisation, data wrangling.
Specification
description
Software
Execution
WfMS
Engine
Workflow
Scale out
Flexible workflow composition to
construct & run executable control
and data flows using
heterogeneous software packages,
codes, tools, other workflows made
by other people.
SARS-CoV-2 allelic-variant surveillance
Automated monitoring of structured data
from the European COVID-19 Data Portal and
national SARS-CoV-2 sequencing datasets.
Scalable - access to a global distributed
compute network
• Improved data quality
• Uniformly analysed data for downstream
analysis & visualisation
• Submission of data to public archives
• https://covid19.galaxyproject.org
https://elixir-europe.org/news/covid-19-variants-galaxy
https://doi.org/10.1101/2021.03.25.437046
Suite of
workflows
Distributed analysis, Pulsar network
Managed online hosted Workflow as a Service Platform
Direct use by end users - 32K users
Experts build workflows that others can use with their own data
Researchers build and reuse workflows that are shared
End users also use it to access and interact with a tool
Workflow and Tool histories and reporting
Björn Grüning
U of Freiburg
Those workflows in the WorkflowHub Registry
A curated collection: find, publish and cite workflows and
collections. Reuse, recycle, repurpose.
Sharing Accelerates Science
A digital space for
EMERGEN, the
French plan for SARS-
CoV-2 genomic
surveillance and
research
Adapting and Reusing the
ELIXIR Galaxy Workflows
Tried and tested
transparent methods.
Jacques van Helden
Inter-twingled Workflow System Landscape
Scripting
environments
Interactive Electronic
Research Notebooks
Repositories Registries
Inter-twingling
Mix and Matching
Interactive &
exploratory
analysis
Production, automated,
workflow-integrated
software
Workflow
Management
Systems & execution
platforms
https://s.apache.org/existing-workflow-systems
298 Systems
General and Specialised
https://snakemake.github.io/
Workflows are rules:
Graph of jobs for automatic parallelisation,
DIY package & containerisation
installation, auto-documentation
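To make the rule-based style concrete, a minimal sketch (the filenames and two-sample list are hypothetical, not from the talk): Snakemake derives the job graph from rule inputs and outputs, so the two sort jobs below can run in parallel, e.g. via snakemake --cores 4.

    # Minimal Snakemake sketch; filenames are hypothetical.
    # The DAG of jobs is inferred by matching rule inputs to outputs.
    rule all:
        input:
            expand("results/{sample}.sorted.bam", sample=["s1", "s2"])

    rule sort_bam:
        input:
            "data/{sample}.bam"
        output:
            "results/{sample}.sorted.bam"
        shell:
            "samtools sort -o {output} {input}"

Per-rule conda environments or containers can be declared in the same way, which is what gives the DIY packaging and containerised installation noted above.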
From frameworks to web-based analysis platforms, hybrid cloud deployment
Communities tend to cluster around a few systems.
Uptake of a WfMS typically depends on the “plugged-in” availability of data-type-specific
codes, the skill level of the workflow developers, and popularity.
Online portals: users build and reuse
workflows around publicly available or
user-uploaded data and pre-wrapped,
pre-installed tools.
Abstraction property of Computational Workflow Systems
Separation of the workflow specification from its execution & tools
Encoded method less dependent on
implementations
• Sustained as digital environment evolves
- extend investments
• Map to diverse platforms - exploit latest
advances in platforms
• Reused and re-purposed - lower
experimental design cost & share
methods across discipline boundaries
Ten Handy Properties of Computational Workflow Systems
Abstraction & Composition
Using the best codes written by 3rd parties
Handle heterogeneity
Shield complexity & incompatibility
Sharable reusable, re-mixable methods
Automation
Repetitive reproducible pipelines
Simulation sweeps
Manage data and control flow
Optimised monitoring & recovery
Automated deployment
Scalability & Infrastructure Access
Accessing infrastructures, datasets and tools
Optimised computation and data handling
Parallelisation
Secure sensitive data access & management
Interoperating datasets & permission handling
Reporting & Accreditation
Portability
Sharing & Adaptability
Provenance logging & data lineage
Auto-documentation
Result comparison
Dependency handling
Containerisation & packaging
Moving between on premise & cloud
Shared method, publishable know-how
BYOD / parameters
Different implementations
Changes in execution infrastructure
WORKFLOW
APPLICATION USER
Yes, it’s work: Labour saving -> Labour shifting
Production platforms & pipelines, Collective Labour
TOOL
DEVELOPER
WORKFLOW
USER
SYS ADMIN WORKFLOW
DEVELOPER
& CUSTODIAN
COMPUTATIONAL
USER
Workflow System as a Platform Workflow System as a Service
Labour
Reach
need
infrastructure
& services
need tools to be
wrapped &
maintained
need workflows to be
developed, tested,
run & maintained
need to find and understand
workflows, with explanations to
use properly and safely.
The long tail
Common Workflow Language workflows
CWL Viewer, https://view.commonwl.org
GitHub and git repositories of
~2500 unique workflows (~26,700 including versions)
Top 10 repos
70% of workflows
Top 20 repos
80% of workflows
https://lifescience-ri.eu/
An open collaborative
space for digital
biology in Europe
Environment for hosting
& processing data
Workflows are an entry point to the
tools and datasets
• functions for production-quality
FAIR data processing
• access to secure data processing
Figure Credit: Romain Dallet
Galaxy Genome Annotation (GGA) environment in the cloud
A data and method commons
A portable environment of interoperable tools
RIs publish data, methods & services for
management, storage and reuse
The EOSC-Life Workflow Collaboratory
People, workflows, services and standards for FAIR Workflows.
Reflection: Computational Workflows
Reproducibility
Replication
Regulation
Labour saving
Productivity
Reliability
Knowledge sharing
Adaptation
Scholarly Objects
Democratisation
of computational
analysis & methods
Framings for the EOSC-Life FAIR Workflow Collaboratory
The EOSC-Life FAIR Workflow Collaboratory
FAIR
Workflows
FAIR Principles
Findable, Accessible, Interoperable, Reusable
A set of guiding principles to enhance the value
of all digital resources and their reuse by people
and by machines
Assumption – data (& software) are first-class
objects that will be shared
Accelerate science - find and reuse and interlink
data (and tools, workflows, machine
learning….)
A community journey to common guidelines
Consumers and producers all benefit.
What the FAIR Principles look like
RDA FAIR Data Maturity Model. Specification and Guidelines https://zenodo.org/record/3909563#.YORYkUzTX19
https://www.go-fair.org/fair-principles/
Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and
stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
tl;dr FAIR Principles for Data https://www.go-fair.org/fair-principles/
Enhance automation
Persistent human readable and machine-actionable metadata
Linked metadata and community standards
Persistent identifiers
Clear licensing and access rules
Protocols for machine accessibility
Register / Index
Assumption: operate in an ecosystem at scale and in legacy settings.
Fairly AI Ready
FAIR for Software
Software is a digital object but research software is not (just) data
https://www.rd-alliance.org/groups/fair-4-researchsoftware-fair4rs-wg
FAIR for Research Software (FAIR4RS) working group
Lamprecht et al., 2019
FAIR4RS First Draft of FAIR4RS principles
CodeMeta
https://github.com/codemeta/codemeta/
Katz et al., Patterns 2, 2021
FAIR Principles for Workflows
Hybrid Processual Digital Objects
Method “Data” Objects
Workflows as
FAIR Software
FAIR+R and FAIR++
The principles revised
Workflows as
FAIR Digital Objects
Data-like method objects
The principles adapted
Workflows as
FAIR Data Instruments
FAIRification of the dataflow
The data principles supported
C. Goble, S. Cohen-Boulakia, S.
Soiland-Reyes, D. Garijo, Y. Gil, M.R.
Crusoe, K. Peters & D. Schober. FAIR
computational workflows. Data
Intelligence 2 (2020), 108–121.
doi: 10.1162/dint_a_00033
Workflow Objects
Software Objects
FAIR Principles for Workflows
Image credit: BioExcel Centre of Excellence
Composition & Portability
different
components,
codes,
languages,
third parties
FAIR Interoperability and Reusability = Composability
R: Reusable - can be understood, modified,
built upon or incorporated into other software
I: Software interoperates with other software through
community standard APIs and community standard meta(data)
Software includes qualified references to other objects
Richly described
Well documented
Licensed
Sample input parameters and test data
Checker workflows
Track versions
Programmatic access to (meta)data
Libraries of canonical workflow blocks
Make tools workflow-ready
Wrap tools
FAIR4RS Proposed Principles for FAIR Software
Design for FAIR Data
Design for Reuse
Community Review
Community Curation
Certification
Best Practice
Licence combinations
Access permissions
Local -> Global identifiers
Findable & Accessible
register workflows with assigned PID + metadata in a searchable resource.
https://workflowhub.eu
Publishing Services
Journals
Digital Objects of Scholarship
published, cited, exchanged, reviewed, validated & reused in
new and different ways
• Versioned identifiers
• DOI assignment (https://doi.org/10.48546/workflowhub.workflow.29.2)
• Collections, Canonical workflow libraries
scripts
Repos
Containers Deploys
Tools
Agnostic and generous with many WfMSs
with different degrees of support
• Workflows in native places
• Metadata standards framework, handles
associated objects and links between objects.
• Perpetual development by an open community
licensing
authors
& credit
analytics
access
search
versions & status
other
workflows
Biggest challenge?
Metadata of course!
Work with WfMS to auto-extract
metadata + provide metadata services
More than just a list
Spaces, Teams, People
Linking up providers and users
Building visibility & reputation
Reciprocity to close the
“Find – Get– Use – Credit” loop
Citations
Knowledge Graphs linking out to
OpenAIRE, DataCite etc
Customised FAIRDOM-SEEK
https://fair-dom.org, https://fairdom-seek.org, https://fairdomhub.org
Digital asset management platform for Project Hubs
organising, cataloguing, sharing and publishing multiple kinds of research
objects held in multiple repositories for multi-partner projects.
Wolfgang
Müller
Martin
Golebiewski
Ulrike
Wittig
Olga
Krebs
Xiaoming
Hu
FAIR Workflows are FAIR Software
lifecycle support for living objects
Indicators of Status
Workflow
monitoring
Register versions
Version PIDs
Support GitHub Actions
Track authors and contributions
Incremental metadata and
supplementary materials
Track & lift out sub-workflows
R1.2: (Meta)data and software are associated
with detailed provenance
Tool Registry Service API
Accessible
metadata and workflows retrievable by their PID using a standardized
communication protocol
GitHub page: https://github.com/ga4gh/tool-registry-service-schemas
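As a hedged illustration of what that looks like programmatically (the base URL below is an assumption for illustration; consult the registry’s documentation for its actual TRS endpoint):

    # Sketch: listing registered workflows via the GA4GH Tool Registry
    # Service (TRS) v2 API. The base URL is an assumption.
    import requests

    BASE = "https://workflowhub.eu/ga4gh/trs/v2"

    # GET /tools returns a paginated list of registered tools/workflows
    tools = requests.get(f"{BASE}/tools", params={"limit": 5}).json()
    for tool in tools:
        print(tool["id"], tool.get("name"))

    # GET /tools/{id}/versions lists the addressable versions of one entry
    versions = requests.get(f"{BASE}/tools/{tools[0]['id']}/versions").json()
    print([v["id"] for v in versions])

The same client code works against any TRS-compliant registry, which is the point of the standardised protocol.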
FAIR Metadata for Machines and Humans
WfMS neutral canonical descriptions
https://www.commonwl.org
Canonical description of the workflow
Linked to containerised tools
• Aid collaboration & knowledge transfer
• Standardise expression of workflow
• Describe engine-neutral, portable, reusable workflows
• Reduce vendor / project lock-in
• Enable workflow comparisons
• “Abstract” CWL
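A minimal sketch of such a canonical, engine-neutral description (the tool choice, container tag and filenames are hypothetical): one CWL CommandLineTool linked to a container, runnable by any conforming engine.

    # Minimal CWL sketch; container tag and filenames are hypothetical.
    cwlVersion: v1.2
    class: CommandLineTool
    baseCommand: [samtools, sort]
    hints:
      DockerRequirement:
        dockerPull: quay.io/biocontainers/samtools:1.13--h8c37831_0
    inputs:
      input_bam:
        type: File
        inputBinding: {position: 1}
    arguments: ["-o", "sorted.bam"]
    outputs:
      sorted_bam:
        type: File
        outputBinding: {glob: sorted.bam}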
Design by canonical, modularised workflow blocks
Build a library of tested and validated CWL blocks
CWL:
• Canonical descriptions
• Recycle descriptions and sub-workflows
• Platform independent pipeline exchange and comparison
Rob Finn
Folker Meyer
AWE
MEGAHIT
Assembly
pipeline
[with thanks to Rob Finn]
Extensible Metadata Framework
that caters for all those processual FAIR criteria
Common metadata
about the workflow,
tools & parameters
Canonical workflow
description of the
steps of the workflow
Type the inputs and outputs
of the steps
Run Provenance / Histories / Tests
Format for packaging a
workflow, its metadata and
companion objects (links to
containers, data etc) for
exchange, archiving,
reporting, citing.
FAIR Digital Object
All Open Communities
Bioschemas lightweight metadata
Extensible and Linked metadata in service of the Life Science Community
Open community reusing industry de facto standard
Computational workflow profile
Formal
parameter
profile
https://bioschemas.org
Opinionated use of schema.org, the web
resource mark-up used by search engines,
knowledge graphs and increasingly science
as a whole.
Computational tool
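A sketch of the mark-up pattern (the names and values are hypothetical; ComputationalWorkflow and FormalParameter are the Bioschemas-contributed types that the profiles constrain):

    {
      "@context": "https://schema.org",
      "@type": "ComputationalWorkflow",
      "name": "Example variant-calling workflow",
      "creator": {"@type": "Person", "name": "A. Researcher"},
      "license": "https://spdx.org/licenses/Apache-2.0",
      "programmingLanguage": {"@type": "ComputerLanguage",
                              "name": "Common Workflow Language"},
      "input": [{"@type": "FormalParameter", "name": "reads",
                 "encodingFormat": "FASTQ"}],
      "output": [{"@type": "FormalParameter", "name": "variants",
                  "encodingFormat": "VCF"}]
    }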
RO-Crate Digital Objects
Lightweight way of packaging everything together regardless where or what it is
https://www.researchobject.org/ro-crate/
Format for packaging up scattered resources and self-describing
that package and its parts
- integrated view + context
- metadata and PIDs reference digital and real things
- datasets, workflows, services, software & people, places etc.
Web-native, off the
shelf - machine and
human readable,
search engine &
developer friendly.
Infrastructure
independent &
self-describing
PIDs, JSON-LD,
Schema.org,
archive formats
Extensible and open-ended to cope
with diversity and legacy
“Duck typing”
using profiles +
added schema.org
and domain
ontologies
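For concreteness, a skeletal ro-crate-metadata.json (the entity names are hypothetical), using the Workflow RO-Crate convention of pointing mainEntity at the workflow file:

    {
      "@context": "https://w3id.org/ro/crate/1.1/context",
      "@graph": [
        {"@id": "ro-crate-metadata.json",
         "@type": "CreativeWork",
         "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
         "about": {"@id": "./"}},
        {"@id": "./",
         "@type": "Dataset",
         "name": "Example workflow crate",
         "mainEntity": {"@id": "workflow.cwl"},
         "hasPart": [{"@id": "workflow.cwl"}]},
        {"@id": "workflow.cwl",
         "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
         "name": "Example workflow"}
      ]
    }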
RO-Crate Profile Variants
https://www.researchobject.org/ro-crate/profiles
Workflow-RO-Crate
Galaxy-Workflow-RO-Crate
Workflow-Testing-RO-Crate
Workflow-Run-RO-Crate
BioComputeObject-RO-Crate (IEEE P2791-2020)
BioComputeObject - Regulation
why and how to use a workflow IEEE P2791-2020
robust, safe exchange & reuse of HTS
computational analytical workflows
http://biocomputeobject.org
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al “Enabling Precision Medicine via standard
communication of NGS provenance, analysis, and results” PLOS Biology 2018,
https://doi.org/10.1371/journal.pbio.3000099
https://biocompute-objects.github.io/bco-ro-crate/
“Sidecar” third-party metadata files
inside the RO-Crate format
FAIR has to operate in a legacy
ecosystem
Reproducibility – Repeatability
Provenance & Preservation
Workflow-Run-RO-Crate
Some heavy lifting …
when is it FAIR enough?
R1.2: (Meta)data and software are associated with
detailed provenance - not just the workflow but the
run record associated with the data it produced ….
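One plausible shape for that run record, as an entity added to the crate’s @graph (the identifiers, times and filenames are hypothetical), following the schema.org CreateAction provenance pattern that Workflow-Run-RO-Crate builds on:

    {
      "@id": "#run-2021-09-07",
      "@type": "CreateAction",
      "name": "Run of workflow.cwl",
      "instrument": {"@id": "workflow.cwl"},
      "startTime": "2021-09-07T10:00:00Z",
      "object": [{"@id": "inputs/reads.fastq"}],
      "result": [{"@id": "outputs/variants.vcf"}]
    }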
FAIR Digital Objects
RO-Crate, a step towards FAIR Digital Object Middleware
“Each FAIR digital object type has
its own metadata requirements,
and may have its own repositories
and registries”
FAIR Digital Objects for Science: From Data Pieces to Actionable
Knowledge Units: https://doi.org/10.3390/publications8020021
https://fairdo.org
https://fairdo.org/wg/fdo-cwfr/
Our Workflow Metadata Underware is ready!
Archiving
General
Executing
Testing & Monitoring
WfMS
A2. metadata are accessible, even when the workflow is no longer
available
Metadata preservation beyond any one service in RO-Crate archive,
republished in a long-term archive
R1. workflows are richly described with a plurality of accurate and
relevant attributes, R1.3 domain-relevant community standards
Automating metadata by on-boarding WfMS and FAIR services.
Metadata so that a workflow is read-reproducible as a method description
R. The software is usable (it can be executed) and reusable (it can be
understood, modified, built upon, or incorporated into other software).
Services and standards containers, testing & monitoring, execution,
GA4GH TRS API
Reflection: FAIR takes a village
it’s a JOINT responsibility and opportunity!
In order for data to be
FAIR, you need services
that enable FAIR
Be a good plug-in tool and data citizen
enable programmatic access to datasets
make clean tool interfaces
avoid usage restrictions
use open community data standards and formats
simplify installation
code for portability, parallelisation & reproducibility
manage versions
register! document!
Be a good workflow maker......and user
use and make FAIR identifiers for data
license data outputs
use open community data standards and formats
validate parameters
use a WfMS that tracks data provenance
consider secure data processing
manage versions
design tests and test data
credit tool and sub-workflow makers
choose FAIR data services
register! document! build libraries!
use well-documented FAIR-enabling
and FAIR workflows
credit the makers!
Reflection: FAIR takes a village
it’s a JOINT responsibility and FAIR ≠ FREE
Advocate standards & practice
Sustain and manage infrastructure
Credit and incentives
Maturity models & metrics
Certification and canonical libraries
In order for data to be
FAIR, you need services
that enable FAIR
Training,
Stewardship &
Sustainability
Workflows are an entry point to the tools
and datasets of EOSC-Life and functions
for FAIR data.
Summary: FAIR Computational Workflows
EOSC-Life Workflow Collaboratory
Production workhorses, transparent, reproducible
processing & democratised access to data, infrastructure and
complex processing.
Hybrid Digital Objects of scholarship that should be FAIR
themselves and support FAIR Data.
FAIR assumed to operate in an ecosystem at scale and in
legacy settings.
FAIR takes a village where everyone shoulders responsibility,
not just data and service providers
Acknowledgements
The WorkflowHub Club, Bioschemas Community, RO-Crate
Community, CWL Community, Galaxy Europe, EOSC-Life
and ELIXIR Tools Platform.
Special Thanks
Stian Soiland-Reyes (U Manchester / U Amsterdam)
Paul Brack, Stuart Owen, Finn Bacall, Alan Williams (U Manchester)
Björn Grüning (U Freiburg)
Frederik Coppens (VIB)
Sarah Jones (GEANT)
Herve Menager (Pasteur Institute)
Sarah Cohen-Boulakia (U Paris Saclay)
Dan Katz (U Illinois Urbana-Champaign)
Simone Leo (CRS4)
Laura Rodriguez-Navas (BSC)
José Mª Fernández (BSC)
EOSC-Life https://www.eosc-life.eu/
ELIXIR http://elixir-europe.org
RO-Crate https://www.researchobject.org/ro-crate/
WorkflowHub https://workflowhub.eu/ and workflowhub.org
Galaxy Europe https://galaxyproject.eu/
Bioschemas https://bioschemas.org
Common Workflow Language https://www.commonwl.org/
FAIRDOM https://fair-dom.org
WorkflowsRI https://workflowsri.org/
  • 47. Acknowledgements The WorkflowHub Club, Bioschemas Community, RO-Crate Community, CWL Community, Galaxy Europe, EOSC-Life and ELIXIR Tools Platform. Special Thanks Stian Soiland-Reyes (U Manchester / U Amsterdam) Paul Brack, Stuart Owen, Finn Bacall, Alan Williams (U Manchester) Björn Grüning (U Freiburg) Frederik Coppens (VIB) Sarah Jones (GEANT) Herve Menager (Pasteur Institute) Sarah Cohen-Boulakia (U Paris Sacly) Dan Katz (U Illinois Urbana-Champaign) Simone Leo (CRS4) Laura Rodriguez-Navas (BSC) José Mª Fernández (BSC) EOSC-Life https://www.eosc-life.eu/ ELIXIR http://elixir-europe.org RO-Crate https://www.researchobject.org/ro-crate/ WorkflowHub https://workflowhub.eu/ and workflowhub.org Galaxy Europe https://galaxyproject.eu/ Bioschemas https://bioschemas.org Common Workflow Language https://www.commonwl.org/ FAIRDOM https://fair-dom.org WorkflowsRI https://workflowsri.org/