FAIR Computational Workflows
the what, why, how and who
Professor Carole Goble
The University of Manchester UK
EU Research Infrastructures: ELIXIR, IBISBA, EOSC-Life
Centre of Excellence: BioExcel
carole.goble@manchester.ac.uk
ICTeSSH 2021, 30th June 2021
https://ictessh.uns.ac.rs/
The Life Sciences
from compounds &
genomics to tissue banking,
from plants to marine to
humans…
https://lifescience-ri.eu/
Life Science Research
Infrastructures
clustered together
Life Science RIs, European Open Science Cloud
Building a data and method commons
Equivalent to SSHOC
1000s of tools
1000s of datasets
100s of data repositories
10s of registries
800+ metadata formats
Lots of data pre-processing, processing,
post-processing, analysis, simulations…
Data intensive science multi-step tool-chains
to prepare, analyze, and share increasing volumes of complex data
CryoEM Image Analysis
Metagenomic Pipelines
Drug Discovery
Gravitational Wave Analysis
Climate Modelling
Ecological Niche Modelling
“Workflow” is an overloaded term…..
Instructions how to do something
Manual Protocols, SOPs, BPM
https://marketplace.sshopencloud.eu/search?categories=workflow&order=label
Data intensive science multi-step processing
prepare, analyze, and share increasing volumes of complex data
a tool in the toolbox
that links together tools Computational Workflows
inputs
outputs
tools, CLI,
containers,
workflows
Flexible workflow composition
mechanisms to construct & run
executable control and data flows
Access to computational infrastructure
and datasets, interoperability of the
tools, portability of the processing.
An entry point to the cloud / commons
resources.
Specification
description
Software
Execution
WfMS
Engine
Workflow
SARS-CoV-2 pre-processing, variant monitoring, analysis
https://covid19.galaxyproject.org
Automated monitoring of structured
data from the European COVID-19 Data
Portal
Scalable via access to a global
distributed compute network
• Improved data quality
• Uniformly analysed data for
downstream analysis & visualisation
• Submission of data to public
archives
https://elixir-europe.org/news/covid-19-variants-galaxy
Workflows and
Workflow
Management
System
Distributed analysis, Pulsar network
Managed online hosted Workflow as a Service Platform
Designed for direct use by end users - 32K users
Experts build workflows that others can use with their own data
Researchers build and reuse workflows that are shared
End users also use it to access and interact with a tool
Workflow and Tool histories and reporting [Björn Grüning]
Those workflows in the WorkflowHub Registry
Find, publish and cite workflows and
collections. Reuse, recycle, repurpose.
Data pipelines
Simulation & model sweeps
Data analysis,
Combining & integrating data
One-off analysis & prototyping
Pre-cooked workflows using my data, my
models, my configurations
Remixing, repurposing computational Lego
Repetition
Reproduce
Reuse, Recycle
Automated insight?
Meta-analysis?
Hypothesis generation?
Computational tasks
Data pipelines
Simulation & model sweeps
Data analysis,
combining & integrating data
One-off analysis & prototyping
Pre-cooked workflows using
my data, my models, my config.
Adapting remixing, repurposing
Repetition
Reproduce
Reuse, Recycle
adaptable to your
question/data
explicit, reproducible, repeatable,
reviewable transparent method
fast prototyping
shared know-how & tested recipes
scholarly publications &
supplementary materials
using other codes and best
tools, and reusing workflows
Computational resource use
Access to datasets and tools
Secure access to sensitive data
democratising processing
and computational know-how
interoperating datasets
and accessing infrastructures
Sharing & Publishing Tasks: Hybrid Digital Objects of Scholarship
that can be published, cited, exchanged, reviewed, validated & reused in different ways
Registries
Repos Containers Hosts
Running Services
Publishing Services
Journals
FAIR and Open Workflows
Publish the method – description snapshot
Publish the software – containers, hosted
deployments, open source
Living objects – dependencies and versions
Composite objects – FAIR mixable interoperable
tools: CLI, APIs & (meta)data standards
Reusable objects – quality, maintenance, portability
C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R.
Crusoe, K. Peters & D. Schober. FAIR computational workflows. Data
Intelligence 2(2020), 108–121. doi: 10.1162/dint_a_000
Workflow Management System Framings
Computational Reproducibility
Labour saving
Knowledge sharing
Democratisation of computational analysis
Springer Nature, September 2019
But what’s this got to do with
Social Sciences and Humanities?
Humanities are at a pivotal moment
Social Science using new data sources
Computational Social Science
Computational Archival Science
The rise of AI and Machine Learning
Implicit workflows in notebooks and tools
Specimen Data Refinery NH Museum Collections
workflow ensembles, rerunning with new methods
Physical Object
Digital Object
Walton S, Livermore L, Bánki O, Cubey RWN, Drinkwater R, Englund M, Goble C, Groom Q, Kermorvant C,
Rey I, Santos CM, Scott B, Williams AR, Wu Z (2020) Landscape Analysis for the Specimen Data Refinery.
Research Ideas and Outcomes 6: e57602. https://doi.org/10.3897/rio.6.e57602
SSH examples
with thanks - Sean Bechhofer and David De Roure
Max Cycling for music
HathiTrust HTRC
Secure text mining
protecting copyright
Music genre analysis
Music Information Retrieval
Computational Analysis of Live Music Archives
Backroom computational archival science
Vocabulary indexing on
historical documents
Processing and modelling
social media data
It’s not like you don’t have a lot of tools….
[Daniela Duce, SAGE Publishing, Sponsors Session]
Workflows do not only work with non-interactive tools.
Biologists have GUI-based interactive tools too.
They do a lot of text mining.
And a lot of non-consumptive analysis of cohorts
(called secure federated processing of sensitive data).
What about manual steps?
Our data is too messy! Our creativity & intellectual know-how!
Notebooks calling workflows
Workflow as a Service platforms (like Galaxy)
The trick is to automate what can and should be automated.
as a step in a workflow or
between workflows
interacting with a tool that
is a step
Humans in the Loop
Workflow
System
Landscape
Scripting
environments
Electronic Research Notebooks
Workflow
Management Systems
Repositories Registries
296 Systems*
* https://s.apache.org/existing-workflow-systems
Difference between a workflow and a script….
Separation of the workflow specification from its execution
Specification
description
Software
Execution
Precise description of a procedure
composed of multiple steps
coordinated by input/output data
relationships.
Execution of computational and
composite processes with data
consumed & produced by each step.
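The separation described above can be illustrated with a small sketch. This is not how any particular WfMS works; the step names, tool names and the toy `run` engine are all hypothetical. The point is only that the specification is plain data, and a separate engine decides the execution order from the input/output relationships.

```python
from typing import Callable, Dict, List

# Specification: a declarative description of the steps and their data
# relationships. All names here are illustrative. The steps are listed
# out of order on purpose: the engine, not the author, works out the order.
WORKFLOW_SPEC: List[dict] = [
    {"step": "count", "tool": "count_words",
     "inputs": ["clean_text"], "outputs": ["word_count"]},
    {"step": "clean", "tool": "lowercase",
     "inputs": ["raw_text"], "outputs": ["clean_text"]},
]

# Software: the tools that the specification refers to by name.
TOOLS: Dict[str, Callable[..., object]] = {
    "lowercase": lambda text: text.lower(),
    "count_words": lambda text: len(text.split()),
}


def run(spec: List[dict], tools: Dict[str, Callable[..., object]],
        data: Dict[str, object]) -> Dict[str, object]:
    """Toy engine: run each step once all of its inputs are available.

    Assumes every step produces exactly one output.
    """
    data = dict(data)      # do not modify the caller's inputs
    pending = list(spec)
    while pending:
        ready = [s for s in pending
                 if all(name in data for name in s["inputs"])]
        if not ready:
            raise ValueError("no runnable step: missing inputs or a cycle")
        for step in ready:
            args = [data[name] for name in step["inputs"]]
            data[step["outputs"][0]] = tools[step["tool"]](*args)
            pending.remove(step)
    return data


results = run(WORKFLOW_SPEC, TOOLS, {"raw_text": "FAIR Workflows take a village"})
print(results["word_count"])
```

Because the specification is only data, it could be published, versioned and reviewed separately from whichever engine executes it.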
WfMS
Engine
Workflow
Sub
Workflows
Tools and
codes
Parameters
Inputs
Outputs
Infrastructure
Guidance
Associated
Objects
Data
Logs /
Histories /
Provenance
Services,
e.g. Test engines
+
Related workflows
Checker workflows
Contextual Entities
Metadata Graphs
Sample input
parameters, test data
Ad Hoc Scripting
Labour is moved from the workflow maker to the workflow system
https://xkcd.com/2054/
Workflow Management System Zoo
different species with different properties, from DIY to Community Platforms
Level of the Under-Ware
Domain specialisation & tools / datatypes
User target – Command line prog -> GUI
Take up of a WfMS depends on the
“plugged-in” availability of data type
specific codes, optimised processing
and everyone else using it
Light weight frameworks and engines to
fully fledged analysis platforms
Desktop -> Cloud
Interactive, automated
*https://s.apache.org/existing-workflow-systems
Maintaining the Zoo theme….Snake and Whales….
https://snakemake.github.io/
Workflows are rules
Graph of jobs for automatic parallelisation
Containerisation
Documentation Reports
Portable containers
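As a rough illustration of "workflows are rules" and "graph of jobs for automatic parallelisation", the sketch below derives a job graph from rules that declare their input and output files, then groups jobs that could run at the same time. It is not Snakemake syntax or Snakemake's implementation; the rule names and file names are made up, and the standard-library `graphlib` module (Python 3.9+) stands in for a real scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: each declares the files it needs and the files it makes.
RULES = {
    "align": {"inputs": ["reads.fq", "genome.fa"], "outputs": ["aligned.bam"]},
    "index": {"inputs": ["genome.fa"], "outputs": ["genome.idx"]},
    "call": {"inputs": ["aligned.bam", "genome.idx"], "outputs": ["variants.vcf"]},
}


def job_graph(rules, target):
    """Work backwards from a target file to the rules needed to build it.

    Returns a dict mapping each needed rule to the set of rules it depends on.
    """
    producer = {out: name for name, rule in rules.items()
                for out in rule["outputs"]}
    graph = {}

    def visit(filename):
        rule = producer.get(filename)
        if rule is None or rule in graph:
            return  # a source file with no producing rule, or already planned
        graph[rule] = set()
        for needed in rules[rule]["inputs"]:
            upstream = producer.get(needed)
            if upstream is not None:
                graph[rule].add(upstream)
            visit(needed)

    visit(target)
    return graph


def parallel_batches(graph):
    """Group jobs into batches; jobs within a batch have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the rules are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(job_graph(RULES, "variants.vcf"))
print(batches)
```

In this toy example `align` and `index` do not depend on each other, so they land in the same batch and could be dispatched in parallel before `call` runs.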
CASAR: running (third party) codes
Composition, Abstraction, Scalability, Automation, Reporting
Manage Mess
Help Design
Handle
Heavy lifting
Report and
Reproduce
overcome incompatibilities
shield users from complexity of access
manage control and data flow
guide and validate composition
automation
dependencies and containers
changes in infrastructure
co-localising data and processing
scalable & optimised processing
test portability
security handling
error handling and alternate swapping
systematic reporting & logging of
called codes and data lineage
User-ware
Under-ware
Computational Workhorses
labour-saving– science labouris
The law of computational labour & cost
conservation:
“Labour and costs do not diminish, they shift to
different people at different parts of the tech stack in
different points in the research lifecycle”
Requires infrastructure.
Needs tools to be wrapped to make them components.
Need to be able to develop, run and keep up to date.
Need to find and understand them.
Need explanations to use properly and safely.
Shifting and sharing labour & costs ….
DIY Personal -> Community Cooperatives, something for everyone….
TOOL DEVELOPER
WORKFLOW USER
SYS ADMIN
WORKFLOW DEVELOPER & CUSTODIAN
WORKFLOW APPLICATION USER
COMPUTATIONAL USER
Workflow System as a Platform
Workflow System as a Service
Labour
Higher tech expertise
Reach
Lower Tech expertise
Preparing & maintaining
tools & workflows
Building libraries
Customising pre-cooked workflows
Embedded in applications
Guided workflow making
Managing
deployments
Interoperable, Usable & Reusable Find and Access
FAIR Principles for Data
tl;dr
Persistent machine-readable and
actionable metadata
Persistent identifiers
Clear licensing
Protocols for machine accessibility
Register / Index
Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3,
160018 (2016). https://doi.org/10.1038/sdata.2016.18
https://www.go-fair.org/fair-principles/
Image credit ANDS https://www.ands.org.au/working-with-data/fairdata/training
FAIR Principles for Workflows
Hybrid Digital Objects of Scholarship
Bioschemas.org type
Method Objects
The FAIR Data principles
can be adapted.
Software Objects
The FAIR principles
can be revised.
C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R.
Crusoe, K. Peters & D. Schober. FAIR computational workflows.
Data Intelligence 2(2020), 108–121. doi: 10.1162/dint_a_000
* FAIR4RS First Draft of FAIR4RS principles
Design for FAIR Data
Design for Reuse
RDA/ReSA/Force11
FAIR4Research Software WG
FAIR: Metadata for Machines & Access
Community efforts to describe workflows and get platforms on board to be
FAIR at source. Lots of JSON-LD
Common metadata for registration
and discovery, controlled vocabulary
Canonical workflow descriptions
machine and human readable
Type the input and outputs of the steps,
controlled vocabulary
Run Provenance
ontology
Format for run and test Records
FAIR Digital Objects
Package a workflow, its components and
associated objects, with associated
metadata into a citable object.
Format for Reporting, Exchanging
between services and Archiving.
Carrier of metadata.
Tools Registry Service API
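To make "package a workflow, its components and associated objects, with associated metadata" more concrete, here is a minimal sketch of a JSON-LD metadata document loosely modelled on RO-Crate 1.1 conventions. The helper name, file name, author and licence below are invented for illustration, and the exact properties a registry expects should be checked against the RO-Crate specification rather than taken from this example.

```python
import json


def make_crate_metadata(workflow_file, name, license_url, author_name):
    """Build a minimal JSON-LD description of a packaged workflow.

    The layout (metadata descriptor, root dataset, workflow entity) follows
    RO-Crate 1.1 conventions as understood here; treat it as a sketch only.
    """
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # describes the metadata file itself
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # the package as a whole
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "license": {"@id": license_url},
                "mainEntity": {"@id": workflow_file},
                "hasPart": [{"@id": workflow_file}],
            },
            {   # the workflow: both a method description and software
                "@id": workflow_file,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
                "author": {"@id": "#author"},
            },
            {"@id": "#author", "@type": "Person", "name": author_name},
        ],
    }


# Hypothetical values, for illustration only.
crate = make_crate_metadata(
    workflow_file="variant-calling.cwl",
    name="Example variant calling workflow",
    license_url="https://spdx.org/licenses/MIT",
    author_name="A. Researcher",
)
print(json.dumps(crate, indent=2))
```

A single machine-readable document like this is what lets a registry index the workflow, a journal cite it, and another service exchange or archive it.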
Design for FAIR Data
Tools/Codes/Datasets
• enable programmatic access to data & metadata
• avoid usage restrictions on data
Workflows
• use and make FAIR identifiers for data
• license data outputs
• avoid proprietary formats
• validate parameters to avoid faulty/unsafe results
• track data provenance
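Two of the workflow-side points above, validating parameters and tracking data provenance, can be sketched as a thin wrapper around a step. The parameter limits, the `keep_long_lines` tool and the record fields are assumptions made for this example, not part of any standard or of a specific workflow system.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical allowed range for each parameter: name -> (lowest, highest).
PARAM_RULES = {"min_length": (1, 1000)}


def validate(params):
    """Reject unknown or out-of-range parameters before any data is touched."""
    for name, value in params.items():
        if name not in PARAM_RULES:
            raise ValueError(f"unknown parameter: {name}")
        low, high = PARAM_RULES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")


def keep_long_lines(data: bytes, min_length: int) -> bytes:
    """Example tool: keep only the lines with at least min_length bytes."""
    return b"\n".join(line for line in data.splitlines()
                      if len(line) >= min_length)


def run_step(tool_name, tool, data: bytes, params):
    """Run one step and return its output with a simple provenance record."""
    validate(params)
    output = tool(data, **params)
    record = {
        "tool": tool_name,
        "parameters": dict(params),
        "input_sha256": hashlib.sha256(data).hexdigest(),
        "output_sha256": hashlib.sha256(output).hexdigest(),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    }
    return output, record


sample = b"ACGT\nAC\nACGTACGT\n"
output, record = run_step("keep_long_lines", keep_long_lines, sample,
                          {"min_length": 4})
print(output)
print(record["tool"], record["parameters"])
```

Checksums of inputs and outputs are one simple way to let a later reader confirm which data a result was derived from; a real system would typically use an established provenance vocabulary instead of an ad hoc dictionary.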
Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3,
160018 (2016). https://doi.org/10.1038/sdata.2016.18
The Data Principles were
primarily intended as
guidelines for machine
processable data and
metadata for automation!
Design for FAIR Data and Reuse – Community efforts
Quality Assured, Interoperable and Reusable Workflows and Workflow Blocks
Build libraries, Design using blocks, Workflow best practices, Register them!
IWC - Intergalactic Workflow Commission
Review and curate
Canonical descriptions
Recycle descriptions and sub-workflows
Platform independent pipeline exchange and comparison
Register and Publish
Workflow Services: The EOSC-Life Collaboratory
Takes a Village – people -> workflows, services and standards
The “Social Science” of
the Workflow Village
Open Communities working together
Regardless of organisational, project and national
boundaries. And everyone gets credit!
Dedicated Core teams.
Open Science and Open Source Software cultural
norms
Open ecosystem – no one system, no one stack.
Respect and on-board pre-existing platforms &
communities
Community clustering around systems for
economies of scale, support and sustainability
Workflow Registry
…
Respect the ecosystem
Credit workflow developers
and custodians
Citation & Credit
Discussion forums for
workflow developers and
users
Licensing
Analytics for impact profile
building
Rapidly released for COVID workflows
Listed in the EU COVID Data Portal
One third of workflows are COVID-related
Spaces, Teams, People
Linking up providers and users
Building visibility & reputation
Reciprocity to close the
“Find – Get – Use – Credit” loop
Build Knowledge Graphs linking out to
OpenAIRE, DataCite and other tools
The magic diamond to get everyone on board
COVID provided it all
Community(ies) with leadership
for services, standards, tools, needs
Sponsorship
by projects
Championing
by systems, users, policy makers
Drivers
Users
Baked in
Purpose
Resources
Delivery
Adoption
Lockdown has been amazingly productive and really GOOD
for this kind of pan-project, pan-organisation work
Equitable Participation
https://galaxyproject.org/gcc/
Community-led Capacity Building
Helping people get on board….
TL;DL Summary
Computational Workflows are a significant tool in the toolbox of
computational research, supporting CASAR
Research(er) productivity, democratising infrastructure, reproducibility
Labour & cost shifts – a community can get organised
Workflows are hybrid Digital Objects of scholarship – method and
software, which affects their FAIRness criteria
FAIR, workflows and their infrastructure take a village, as does all
computational research
Acknowledgements
The WorkflowHub Club, Bioschemas Community, RO-Crate
Community, CWL Community, Galaxy Europe, EOSC-Life
and ELIXIR Tools Platform.
Special Thanks
Stian Soiland-Reyes (U Manchester / U Amsterdam)
David De Roure (U Oxford)
Sean Bechhofer (U Manchester)
Björn Grüning (U Freiburg)
Frederik Coppens (VIB)
EOSC-Life https://www.eosc-life.eu/
ELIXIR http://elixir-europe.org
RO-Crate https://www.researchobject.org/ro-crate/
WorkflowHub https://workflowhub.eu/
Galaxy Europe https://galaxyproject.eu/
Bioschemas https://bioschemas.org
Common Workflow Language https://www.commonwl.org/
Dockstore https://dockstore.org/
Extras
FAIR Computational Workflows

  • 1. FAIR Computational Workflows the what, why, how and who Professor Carole Goble The University of Manchester UK EU Research Infrastructures: ELIXIR, IBISBA, EOSC-Life Centre of Excellence: BioExcel carole.goble@manchester.ac.uk ICTeSSH 2021, 30th June 2021 https://ictessh.uns.ac.rs/
  • 2. The Life Sciences from compounds & genomics to tissue banking, from plants to marine to humans… https://lifescience-ri.eu/ Life Science Research Infrastructures clustered together
  • 3. Life Science RIs, European Open Science Cloud Building a data and method commons Equivalent to SSHOC 1000s of tools 1000s of datasets 100s of data repositories 10s of registries 800+ metadata formats Lots of data pre- processing, processing, post processing, analysis, simulations…
  • 4. Data intensive science multi-step tool-chains to prepare, analyze, and share increasing volumes of complex data CryoEM Image Analysis Metagenomic Pipelines Drug Discovery Gravitational Wave Analysis Climate Modelling Ecological Niche Modelling
  • 5. “Workflow” is an overloaded term….. 5 Instructions how to do something Manual Protocols, SOPs, BPM https://marketplace.sshopencloud.eu/search?categories=workflow&order=label
  • 6. Data intensive science multi-step processing prepare, analyze, and share increasing volumes of complex data a tool in the toolbox that links together tools Computational Workflows inputs outputs tools, CLI, containers, workflows Flexible workflow composition mechanisms to construct & run executable control and data flows Access to computational infrastructure and datasets, interoperability of the tools, portability of the processing. An entry point to the cloud / commons resources. Specification description Software Execution WfMS Engine Workflow
  • 7. SARS-CoV-2 pre-processing, variant monitoring, analysis https://covid19.galaxyproject.org Automated monitoring of structured data from the European COVID-19 Data Portal Scalable via access to a global distributed compute network • Improved data quality • Uniformly analysed data for downstream analysis & visualisation • Submission of data to public archives https://elixir-europe.org/news/covid-19-variants-galaxy Workflows and Workflow Management System
  • 8. Distributed analysis , Pulsar network Managed online hosted Workflow as a Service Platform Designed for direct use by end users - 32K users Experts build workflows that others can use with their own data Researchers build and reuse workflows that are shared End users also use it to access and interact with a tool Workflow and Tool histories and reporting [Björn Grüning]
  • 9. Those workflows in the WorkflowHub Registry Find, publish and cite workflows and collections. Reuse, recycle, repurpose.
  • 10. Data pipelines Simulation & model sweeps Data analysis, Combining & integrating data One-off analysis & prototyping Pre-cooked workflows using my data, my models, my configurations Remixing, repurposing computational Lego Repetition Reproduce Reuse, Recycle Automated insight? Meta-analysis? Hypothesis generation?
  • 11. Computational tasks Data pipelines Simulation & model sweeps Data analysis, combining & integrating data One-off analysis & prototyping Pre-cooked workflows using my data, my models, my config. Adapting remixing, repurposing Repetition Reproduce Reuse, Recycle adaptable to your question/data explicit, reproducible, repeatable, reviewable transparent method fast prototyping shared know-how & tested recipes scholarly publications & supplementary materials using other codes and best tools, and reusing workflows Computational resource use Access to datasets and tools Secure access to sensitive data democratising processing and computational know-how interoperating datasets and accessing infrastructures
  • 12. Sharing & Publishing Tasks: Hybrid Digital Objects of Scholarship that can be published, cited, exchanged, reviewed, validated & reused in different ways 12 Registries Repos Containers Hosts Running Services Publishing Services Journals FAIR and Open Workflows Publish the method – description snapshot Publish the software – containers, hosted deployments, open source Living objects – dependencies and versions Composite objects – FAIR mixable interoperable tools: CLI, APIs & (meta)data standards Reusable objects – quality, maintenance, portability C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R. Crusoe, K. Peters & D. Schober. FAIR computational workflows. Data Intelligence 2(2020), 108–121. doi: 10.1162/dint_a_000
  • 13. Workflow Management System Framings Computational Reproducibility Labour saving Knowledge sharing Democratisation of computation analysis
  • 14. Springer Nature, September 2019 But what’s this got to do with Social Sciences and Humanities? Humanities are at a pivotal moment Social Science using new data sources Computational Social Science Computational Archival Science The rise of AI and Machine Learning Implicit workflows in notebooks and tools
  • 15. Specimen Data Refinery NH Museum Collections workflow ensembles, rerunning with new methods Physical Object Digital Object Walton S, Livermore L, Bánki O, Cubey RWN, Drinkwater R, Englund M, Goble C, Groom Q, Kermorvant C, Rey I, Santos CM, Scott B, Williams AR, Wu Z (2020) Landscape Analysis for the Specimen Data Refinery. Research Ideas and Outcomes 6: e57602. https://doi.org/10.3897/rio.6.e57602
  • 16. SSH examples with thanks - Sean Bechhofer and David De Roure Max Cycling for music Hathitrust HTRC Secure text mining protecting copyright Music genre analysis Music Information Retrieval Computational Analysis of Live Music Archives Backroom computational archival science Vocabulary indexing on historical documents Processing and modelling social media data
  • 17. It’s not like you don’t have a lot of tools…. [Daniela Duce, SAGE Publishing, Sponsors Session] Workflows do not only work with non-interactive tools. Biologists have GUI-based interactive tools too. They do a lot of text mining. And a lot of non-consumptive analysis of cohorts (called secure federated processing of sensitive data).
  • 18. What about manual steps? Our data is too messy! Our creativity & intellectual know-how! Notebooks calling workflows Workflow as a Service platforms (like Galaxy) The trick is to automate what can and should be automated. as a step in a workflow or between workflows interacting with a tool that is a step Humans in the Loop
  • 19. Workflow System Landscape Scripting environments Electronic Research Notebooks Workflow Management Systems Repositories Registries *https://s.apache.org/existing-workflow-systems 296 Systems*
  • 20. Difference between a workflow and a script…. Separation of the workflow specification from its execution Specification description Software Execution Precise description of a procedure composed of multiple steps coordinated by input/output data relationships. Execution of computational and composed processes with data consumed & produced by each step. WfMS Engine Workflow Sub Workflows Tools and codes Parameters Inputs Outputs Infrastructure Guidance Associated Objects Data Logs / Histories / Provenance Services, e.g. Test engines + Related workflows Checker workflows Contextual Entities Metadata Graphs Sample input parameters, test data
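The separation described on this slide can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular WfMS is implemented: the `WORKFLOW` dictionary, the `TOOLS` table, the step names and the `run` function are all hypothetical. The point is only that the specification is plain data, while the engine that executes it and records a history is separate code.

```python
# Toy illustration: the specification is data, the engine is separate code.
# All names (WORKFLOW, TOOLS, run, the step names) are hypothetical.

# Specification: steps coordinated by input/output data relationships.
WORKFLOW = {
    "inputs": ["raw"],
    "steps": {
        "clean": {"in": ["raw"], "out": "cleaned", "tool": "strip_blanks"},
        "count": {"in": ["cleaned"], "out": "n_records", "tool": "count_items"},
    },
    "outputs": ["n_records"],
}

# Tools: ordinary functions that the engine calls by name.
TOOLS = {
    "strip_blanks": lambda records: [r.strip() for r in records if r.strip()],
    "count_items": lambda records: len(records),
}


def run(spec, tools, inputs):
    """Minimal engine: run each step once all of its inputs are available."""
    data = dict(inputs)
    log = []  # simple execution history, a stand-in for provenance
    pending = dict(spec["steps"])
    while pending:
        ready = [name for name, step in pending.items()
                 if all(i in data for i in step["in"])]
        if not ready:
            raise ValueError("steps with unsatisfied inputs: %s" % sorted(pending))
        for name in ready:
            step = pending.pop(name)
            args = [data[i] for i in step["in"]]
            data[step["out"]] = tools[step["tool"]](*args)
            log.append({"step": name, "used": step["in"], "generated": step["out"]})
    return {key: data[key] for key in spec["outputs"]}, log


results, history = run(WORKFLOW, TOOLS, {"raw": [" a ", "", "b"]})
print(results)
print(history)
```

Because the specification never mentions how it is run, the same `WORKFLOW` could in principle be handed to a different engine, which is the property the slide contrasts with an ad hoc script.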
  • 21. Ad Hoc Scripting Labour is moved from the workflow maker to the workflow system https://xkcd.com/2054/
  • 22. Workflow Management System Zoo different species with different properties, from DIY to Community Platforms Level of the Under-Ware Domain specialisation & tools / datatypes User target – Command line prog -> GUI Take up of a WfMS depends on the “plugged-in” availability of data type specific codes, optimised processing and everyone else using it Lightweight frameworks and engines to fully fledged analysis platforms Desktop -> Cloud Interactive, automated *https://s.apache.org/existing-workflow-systems
  • 23. Maintaining the Zoo theme….Snake and Whales…. https://snakemake.github.io/ Workflows are rules Graph of jobs for automatic parallelisation Containerisation Documentation Reports Portable containers
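The idea of "workflows are rules" leading to a graph of jobs that can be parallelised automatically can be sketched as follows. This is not Snakemake syntax and not its implementation; it is a minimal Python illustration, assuming hypothetical rule and file names, that uses the standard library's `graphlib.TopologicalSorter` to derive batches of independent jobs from declared inputs and outputs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: each declares the files it reads and the file it writes.
RULES = {
    "trim_a": {"input": ["a.raw"], "output": "a.trimmed"},
    "trim_b": {"input": ["b.raw"], "output": "b.trimmed"},
    "merge": {"input": ["a.trimmed", "b.trimmed"], "output": "merged"},
    "report": {"input": ["merged"], "output": "report.html"},
}


def job_graph(rules):
    """Map each rule to the set of rules that produce its inputs."""
    producer = {rule["output"]: name for name, rule in rules.items()}
    return {name: {producer[f] for f in rule["input"] if f in producer}
            for name, rule in rules.items()}


def parallel_batches(rules):
    """Group rules into batches; rules within a batch are independent."""
    sorter = TopologicalSorter(job_graph(rules))
    sorter.prepare()  # also raises CycleError if the rules are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(RULES)
print(batches)
```

The user only states what each rule consumes and produces. The ordering, and the fact that the two trimming jobs could run at the same time, fall out of the graph.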
  • 24. CASAR: running (third party) codes Composition, Abstraction, Scalability, Automation, Reporting Manage Mess Help Design Handle Heavy lifting Report and Reproduce overcome incompatibilities shield users from complexity of access manage control and data flow guide and validate composition automation dependencies and containers changes in infrastructure co-localising data and processing scalable & optimised processing test portability security handling error handling and alternate swapping systematic reporting & logging of called codes and data lineage User-ware Under-ware
  • 25. Computational Workhorses labour-saving– science labouris The law of computational labour & cost conservation: “Labour and costs do not diminish, they shift to different people at different parts of the tech stack in different points in the research lifecycle” Requires infrastructure. Needs tools to be wrapped to make them components. Need to be able to develop, run and keep up to date. Need to find and understand them. Need explanations to use properly and safely.
  • 26. Shifting and sharing labour & costs …. DIY Personal -> Community Cooperatives, something for everyone…. TOOL DEVELOPER WORKFLOW USER SYS ADMIN WORKFLOW DEVELOPER & CUSTODIAN WORKFLOW APPLICATION USER COMPUTATIONAL USER Workflow System as a Platform Workflow System as a Service Labour Higher tech expertise Reach Lower Tech expertise Preparing & maintaining tools & workflows Building libraries Customising pre-cooked workflows Embedded in applications Guided workflow making Managing deployments Interoperable, Usable & Reusable Find and Access
  • 27. FAIR Principles for Data tl;dr Persistent machine-readable and actionable metadata Persistent identifiers Clear licensing Protocols for machine accessibility Register / Index Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18 https://www.go-fair.org/fair-principles/ Image credit ANDS https://www.ands.org.au/working-with-data/fairdata/training
  • 28. FAIR Principles for Workflows Hybrid Digital Objects of Scholarship Bioschema.org type Method Objects The FAIR Data principles can be adapted. Software Objects The FAIR principles can be revised. C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R. Crusoe, K. Peters & D. Schober. FAIR computational workflows. Data Intelligence 2(2020), 108–121. doi: 10.1162/dint_a_000 * FAIR4RS First Draft of FAIR4RS principles Design for FAIR Data Design for Reuse RDA/ReSA/Force11 FAIR4Research Software WG
  • 29. FAIR: Metadata for Machines & Access Community efforts to describe workflows and get platforms on board to be FAIR at source. Lots of JSON-LD Common metadata for registration and discovery, controlled vocabulary Canonical workflow descriptions machine and human readable Type the input and outputs of the steps, controlled vocabulary Run Provenance ontology Format for run and test Records FAIR Digital Objects Package a workflow, its components and associated objects, with associated metadata into a citable object. Format for Reporting, Exchanging between services and Archiving. Carrier of metadata. Tools Registry Service API
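As a rough illustration of packaging a workflow with machine-readable metadata, the sketch below builds a small JSON-LD document in Python. Its layout is loosely modelled on RO-Crate 1.1 as I understand it, so the specification should be consulted for the authoritative required fields. The function name `describe_workflow`, the file name, the licence URL and the author are hypothetical placeholders.

```python
import json


def describe_workflow(workflow_file, name, license_url, author):
    """Build a JSON-LD description of one workflow file (illustrative only)."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # descriptor: states what this metadata file is about
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # the package as a whole
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "license": {"@id": license_url},
                "hasPart": [{"@id": workflow_file}],
                "mainEntity": {"@id": workflow_file},
            },
            {   # the workflow itself, typed so that registries can index it
                "@id": workflow_file,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
                "author": {"@id": "#author"},
            },
            {"@id": "#author", "@type": "Person", "name": author},
        ],
    }


# Hypothetical values, for illustration only.
crate = describe_workflow(
    "workflow.cwl",
    "Example workflow",
    "https://spdx.org/licenses/MIT",
    "A. Researcher",
)
print(json.dumps(crate, indent=2))
```

Each entity carries its own `@id`, so the licence, the author and the workflow file can be referenced from other metadata and linked into a registry or knowledge graph.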
  • 30. Design for FAIR Data Tools/Codes/Datasets • enable programmatic access to data & metadata • avoid usage restrictions on data Workflows • use and make FAIR identifiers for data • license data outputs • avoid proprietary formats • validate parameters avoid faulty/unsafe results • track data provenance Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18 The Data Principles were primarily intended as guidelines for machine processable data and metadata for automation!
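Two of the workflow-side points above, validating parameters and tracking data provenance, can be sketched in Python. This is a minimal illustration under assumptions, not a recommended implementation: the parameter names, the ranges in `PARAM_SPEC` and the shape of the provenance record are all hypothetical.

```python
import hashlib

# Hypothetical parameter declarations: expected type plus allowed range.
PARAM_SPEC = {
    "min_quality": {"type": int, "min": 0, "max": 40},
    "threads": {"type": int, "min": 1, "max": 64},
}


def validate(params, spec):
    """Return a list of problems; an empty list means the parameters pass."""
    problems = []
    for name, rule in spec.items():
        if name not in params:
            problems.append(f"{name}: missing")
            continue
        value = params[name]
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
        elif not rule["min"] <= value <= rule["max"]:
            problems.append(f"{name}: {value} outside {rule['min']}..{rule['max']}")
    for name in params:
        if name not in spec:
            problems.append(f"{name}: unknown parameter")
    return problems


def provenance_record(step, params, input_bytes, output_bytes):
    """Record what a step consumed and produced, identified by SHA-256."""
    return {
        "step": step,
        "parameters": dict(params),
        "used": hashlib.sha256(input_bytes).hexdigest(),
        "generated": hashlib.sha256(output_bytes).hexdigest(),
    }


good = validate({"min_quality": 20, "threads": 4}, PARAM_SPEC)
bad = validate({"min_quality": 99}, PARAM_SPEC)
record = provenance_record("filter", {"min_quality": 20}, b"input data", b"output data")
print(good, bad, record["used"][:12])
```

Rejecting out-of-range parameters before a run is one way to avoid faulty or unsafe results, and checksums give each consumed and produced dataset a verifiable identity in the run record.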
  • 31. Design for FAIR Data and Reuse – Community efforts Quality Assured, Interoperable and Reusable Workflows and Workflow Blocks Build libraries, Design using blocks, Workflow best practices, Register them! IWC - Intergalactic Workflow Commission Review and curate Canonical descriptions Recycle descriptions and sub-workflows Platform independent pipeline exchange and comparison Register and Publish
  • 32. Workflow Services: The EOSC-Life Collaboratory Takes a Village – people -> workflows, services and standards
  • 33. The “Social Science” of the Workflow Village Open Communities working together Regardless of organisational, project and national boundaries. And everyone gets credit! Dedicated Core teams. Open Science and Open Source Software cultural norms Open ecosystem – no one system, no one stack. Respect and on-board pre-existing platforms & communities Community clustering around systems for economies of scale, support and sustainability
  • 34. Workflow Registry … Respect the ecosystem Credit workflow developers and custodians Citation & Credit Discussion forums for workflow developers and users Licensing Analytics for impact profile building Rapidly released for COVID workflows Listed in the EU COVID Data Portal 1/3rd workflows COVID related
  • 35. Spaces, Teams, People Linking up providers and users Building visibility & reputation Reciprocity to close the “Find – Get – Use – Credit” loop Build Knowledge Graphs linking out to OpenAIRE, DataCite and other tools
  • 36. The magic diamond to get everyone on board COVID provided it all Community(ies) with leadership for services, standards, tools, needs Sponsorship by projects Championing by systems, users, policy makers Drivers Users Baked in Purpose Resources Delivery Adoption
  • 37. Lockdown has been amazingly productive and really GOOD for this kind of pan-project, pan-organisation work Equitable Participation
  • 38. https://galaxyproject.org/gcc/ Community-led Capacity Building Helping people get on board….
  • 39. TL;DL Summary: Computational Workflows are a significant tool in the toolbox of computational research, supporting CASAR. Research(er) productivity, democratising infrastructure, reproducibility. Labour & cost shifts – a community can get organised. Workflows are hybrid Digital Objects of scholarship – method and software, which affects their FAIRness criteria. FAIR, workflows and their infrastructure take a village, as does all computational research.
  • 40. Acknowledgements The WorkflowHub Club, Bioschemas Community, RO-Crate Community, CWL Community, Galaxy Europe, EOSC-Life and ELIXIR Tools Platform. Special Thanks Stian Soiland-Reyes (U Manchester / U Amsterdam) David De Roure (U Oxford) Sean Bechhofer (U Manchester) Björn Grüning (U Freiburg) Frederik Coppens (VIB) EOSC-Life https://www.eosc-life.eu/ ELIXIR http://elixir-europe.org RO-Crate https://www.researchobject.org/ro-crate/ WorkflowHub https://workflowhub.eu/ Galaxy Europe https://galaxyproject.eu/ Bioschemas https://bioschemas.org Common Workflow Language https://www.commonwl.org/ Dockstore https://dockstore.org/