Presented at the EGI/SHIWA Workshops on e-Science Workflows
Budapest, 2012-02-10
https://www.egi.eu/indico/contributionDisplay.py?contribId=6&confId=656
Increasingly, published research is based on digital artefacts such as scientific workflows, web services and public data sets. myExperiment is a social website for sharing workflows and data, used by scientists of different domains such as bioinformatics, chemistry and astrophysics.
myExperiment lets scientists build a structured aggregation, or pack, of the artefacts supporting a scientific work, which we call a Research Object. Packs may include a workflow, input data, reference data and the workflow results, thereby giving other researchers the ability to independently reproduce the virtual experiment and verify the results, or reuse the work to analyse new data.
Research Objects can be collectively annotated, shared, published, modified, combined and derived, but are also subject to external changes, evolution and decay. For instance, rerunning a Research Object becomes difficult or impossible if its workflow depends on tools and services which are no longer executable. Similarly, researchers analysing the work might struggle to reuse a Research Object if it is not clear how its artefacts can be combined.
To address these preservation concerns we are developing “myExperiment 2.0” as both a software architecture and a reference implementation, called Wf4Ever. The Wf4Ever Architecture uses a Research Object Model to specify aggregations of digital artefacts with rich annotations, recording their provenance and evolution, and allowing sharing and collaboration through a set of decoupled RESTful Linked Data services.
The Wf4Ever Toolkit combines techniques such as analysing and comparing workflow structures, examining provenance traces from automated workflow reruns, and checking workflow integrity and authenticity, in order to detect Research Object decay, replay past workflow executions, attempt automated repair and recommend replacement workflow fragments.
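To make the aggregation and integrity-checking ideas concrete, here is a minimal Python sketch. It is not the Wf4Ever Research Object Model or Toolkit: the names `Resource`, `ResearchObject` and `changed_resources` are hypothetical, and comparing SHA-256 checksums is only one assumed way of noticing that an aggregated artefact has changed.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One aggregated artefact (hypothetical structure, not the Wf4Ever model)."""
    name: str
    role: str          # e.g. "workflow", "input data", "results"
    content: bytes
    sha256: str = ""   # checksum recorded when the artefact is aggregated

    def __post_init__(self):
        if not self.sha256:
            self.sha256 = hashlib.sha256(self.content).hexdigest()


@dataclass
class ResearchObject:
    """An aggregation of resources plus free-text annotations on them."""
    title: str
    resources: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)  # resource name -> notes

    def aggregate(self, resource):
        self.resources.append(resource)

    def annotate(self, name, note):
        self.annotations.setdefault(name, []).append(note)


def changed_resources(ro):
    """Return names of resources whose content no longer matches the recorded checksum."""
    return [r.name for r in ro.resources
            if hashlib.sha256(r.content).hexdigest() != r.sha256]


ro = ResearchObject("Example pack")
ro.aggregate(Resource("workflow.t2flow", "workflow", b"<workflow/>"))
ro.aggregate(Resource("input.csv", "input data", b"a,b\n1,2\n"))
ro.annotate("input.csv", "reference data for the example run")

ro.resources[1].content = b"a,b\n1,3\n"   # simulate an external change
print(changed_resources(ro))              # ['input.csv']
```

A real Research Object would also carry provenance and versioning information; this sketch only shows aggregation, annotation and a simple integrity check.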
http://www.myexperiment.org/
http://www.wf4ever-project.org/
2012 02-10 myExperiment 2.0 – Preserving digital Research Objects using the Wf4Ever architecture
1. myExperiment 2.0 –
Preserving digital Research Objects
using the Wf4Ever architecture
Stian Soiland-Reyes
myGrid, University of Manchester
EGI/SHIWA Workshops on e-Science Workflows
Budapest, 2012-02-10
2. Workflow-based Science
Background
Taverna - Scientific Workflow Management System
http://www.taverna.org.uk/
~85000 downloads
EU projects: SCAPE, BioVeL, HELIO, e-Lico, VPH-SHARE, EGI-INSPiRE….
myExperiment - Web 3.0 virtual
environment, library and social
network for workflows
http://www.myexperiment.org/
~5000 registered users
~2200 workflows
~21 different systems
3. Workflow-based Science
What is a Scientific Workflow?
» Workflows coordinate the
execution of services and
link together resources.
» Data-driven rather than
process-driven: «Send
output from A to B and C»
» Semi-automated
computational execution in
scientific problem-solving
repeatable, reproducible,
reusable
» The implementation of a
scientific method
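The data-driven style («Send output from A to B and C») can be illustrated with a small Python sketch. This is not how Taverna or any particular workflow system executes workflows; `run_dataflow` and the three toy steps are hypothetical, and the only point is that a step runs as soon as the outputs it depends on are available.

```python
def run_dataflow(steps, links, source_input):
    """Run each step once all of its upstream outputs exist.

    steps:        step name -> function of the upstream outputs
    links:        (source, target) pairs, i.e. "send output from source to target"
    source_input: value given to steps that have no upstream step
    """
    upstream = {name: [] for name in steps}
    for src, dst in links:
        upstream[dst].append(src)

    results = {}
    pending = set(steps)
    while pending:
        # A step is ready when every step feeding it has produced a result.
        ready = [s for s in pending if all(u in results for u in upstream[s])]
        if not ready:
            raise ValueError("cycle or missing link in workflow")
        for name in sorted(ready):
            args = [results[u] for u in upstream[name]] or [source_input]
            results[name] = steps[name](*args)
            pending.remove(name)
    return results


# Toy workflow: A splits text into words; B and C both consume A's output.
steps = {
    "A": lambda text: text.split(),
    "B": lambda words: len(words),
    "C": lambda words: sorted(words),
}
links = [("A", "B"), ("A", "C")]

results = run_dataflow(steps, links, "gene pathway gene")
print(results["B"])   # 3
print(results["C"])   # ['gene', 'gene', 'pathway']
```

Nothing in the workflow definition says when B or C should run; the order follows from the data links alone.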
5. Reuse, Recycling, Repurposing
• Paul writes workflows for identifying biological
pathways implicated in resistance to Trypanosomiasis
in cattle
• Paul meets Jo. Jo is investigating whipworm in
mouse.
• Jo reuses one of Paul’s workflows without change.
• Jo identifies the biological pathways involved in sex
dependence in the mouse model, believed to be
involved in the ability of mice to expel the parasite.
• Previously a manual two-year study by Jo had failed
to do this.
6. “A biologist would rather share their
toothbrush than their gene name”
Mike Ashburner and others
Professor in Dept of Genetics,
University of Cambridge, UK
7. http://www.myexperiment.org/
“Facebook for Scientists” ...but different to Facebook!
A repository of research methods
A social network of people and things
A Social Virtual Research Environment
A probe into researcher behaviour
Open source (BSD) Ruby on Rails app
REST and SPARQL, Linked Data
Influenced BioCatalogue, MethodBox and SysMO-SEEK
myExperiment currently has 5378 members, 292 groups, 2273 workflows, 534 files and 217 packs
10. myExperiment API
APIs for developers
[Architecture diagram. Labels: facebook, iGoogle, android, HTML, XML, API config, Managed REST API, SPARQL endpoint, Search Engine, RDF Store, mySQL, Enactor. Resources shown: tags, ratings, reviews, profiles, workflows, credits, groups, packs, friendships, files.]
11. myExperiment API
Taverna integration
» myExperiment plugin for Taverna
› Browse myExperiment workflows
• My workflows
• Tags
• Search
› Open workflow
• + Embed in existing workflow
› Upload workflows
• Provide metadata
13. Paul’s Pack
[Diagram: Paul’s Pack, a Research Object. Nodes: Workflow 16, QTL, Results, Logs, Metadata, Slides, Paper, Common pathways, Workflow 13, Results. Edge labels: produces, Included in, Published in, Feeds into.]
14. The R.* dimensions
Reusable. The key tenet of Research Objects is to support the sharing and reuse of data, methods and processes.
Repurposeable. Reuse may also involve the reuse of constituent parts of the Research Object.
Repeatable. There should be sufficient information in a Research Object to be able to repeat the study, perhaps years later.
Reproducible. A third party can start with the same inputs and methods and see if a prior result can be confirmed.
Replayable. Studies might involve single investigations that happen in milliseconds or protracted processes that take years.
Referenceable. If research objects are to augment or replace traditional publication methods, then they must be referenceable or citeable.
Revealable. Third parties must be able to audit the steps performed in the research in order to be convinced of the validity of results.
Respectful. Explicit representations of the provenance, lineage and flow of intellectual property.
“Replacing the Paper: The Twelve Rs of the e-Research Record” on http://blogs.nature.com/eresearch/
15. Pack analysis
• Workflow - pack contains a number of workflows
• Presentation - encapsulation of a single presentation
• Collection - a number of things: workflows, presentations, papers
• Heterogeneous - where the workflows do not appear to have a clear common purpose
• Homogeneous - workflows appear to be designed to work together
• Paper - source for a paper
• Tutorial - tutorial material
• Data - collection of data files
• Derived data - results of workflow
• Benchmark - benchmarking data
• Supplementary - stuff associated with a paper
• Noise - tests, tryouts, rubbish
• Oddity - none of the above
Analysis by Sean Bechhofer
16. Workflow Preservation
Research Objects
Provenance
Recommendation
Astronomy and Genomics
http://www.wf4ever-project.org/
17. Wf4Ever
Challenges
Preservation of scientific workflows in data-intensive science
» Scientific workflows aim at the heart of experimental science
› Enable automation of scientific
methods
› Encourage best practices
» Need to be preserved
› Reuse is fundamental for incremental
scientific development
› Method reproducibility is key for credit
and publication
» …but workflow preservation is complex
› Heterogeneous types of information
need to be aggregated, including
workflows and related resources into
research objects
› Research objects need to be trusted and
understandable n years from now
› Social aspects need to be addressed in
order to support reuse in scientific
communities
18. Wf4Ever
Stability, Completeness, Integrity, Authenticity, Quality
Workflow Decay
• Component level
• flux/decay/unavailability
• Data level
• formats/ids/standards
• Infrastructure level
• platform/resources
Experiment Decay
• Methodological changes
• New technologies
• New resources/components
• New data
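As a rough illustration of the workflow decay levels above, the following Python sketch groups a workflow's unavailable dependencies by level. It is not part of Wf4Ever: `Dependency`, `decay_report` and `is_decayed` are hypothetical names, the example dependency names are made up, and how availability is actually determined (for example by probing a service) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    """Something a workflow needs in order to run (hypothetical structure)."""
    name: str
    level: str       # "component", "data" or "infrastructure"
    available: bool  # outcome of some earlier check, not performed here


def decay_report(dependencies):
    """Group the names of unavailable dependencies by decay level."""
    report = {"component": [], "data": [], "infrastructure": []}
    for dep in dependencies:
        if not dep.available:
            report[dep.level].append(dep.name)
    return report


def is_decayed(dependencies):
    """A workflow counts as decayed if any dependency is unavailable."""
    return any(not dep.available for dep in dependencies)


deps = [
    Dependency("pathway lookup service", "component", available=False),
    Dependency("gene identifier format", "data", available=True),
    Dependency("execution platform", "infrastructure", available=True),
]

print(is_decayed(deps))     # True
print(decay_report(deps))
```

Experiment decay (methodological changes, new technologies, new data) is not captured by availability checks like this and would need a different kind of analysis.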
24. The Wf4Ever Proposal
Technical Infrastructure
• Models
• Research Object
• Annotation
• Provenance
• Evolution and Versioning
• Semantic Web Encoding
• Services
• Foundational, Extension, User
• APIs, Architecture
• Web protocols/services
• Principles
• Map into standards
• Adopt standards
• Lightweight components
• Ecosystem
• Command line
• Third party systems
25. The Wf4Ever Proposal
Services
User Clients
Extension Services
Foundation Services
26. Next steps
» Analyse decay within all myExperiment workflows
› Estimate: Roughly 50% don’t run correctly
› But why? Services gone? Did they ever work?
› How has the community evolved those workflows?
» Service and wf substitutions; recommendations
» Provenance analysis and use
› e.g. verifiability, replayability
» SHIWA approach – can handle “What if the
workflow system stops working”
27. Thank you!
Any Questions?
http://www.myexperiment.org/
http://www.wf4ever-project.org/
http://www.mygrid.org.uk/
http://www.taverna.org.uk/
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.