Presented at the 2nd BioVeL Workshop on taxonomic and phylogenetic workflows (http://www.biovel.eu/index.php?option=com_content&view=article&id=43:ms6-workshop&catid=22:biovel-meetings&Itemid=122)
The 10 Best Practices
for Workflow Design
BioVeL M6 Workshop
Göteborg, May 10-11, 2012
Kristina Hettne, Marco Roos (LUMC), Katy Wolstencroft, Carole Goble (myGrid)
Thanks: BioSemantics Group (LUMC), myGrid team (UoM), Yassene Mohamed, Harish Dharuri (LUMC)
Our specialty: Knowledge Discovery
http://biosemantics.org
Disambiguation*
Text Mining
Substrates for Knowledge Discovery
Methods for Knowledge Discovery
Applications
•Predict protein-protein, protein-disease associations, gene prioritization
•Genotype-phenotype studies, e.g. Huntington’s Disease, Metabolic Syndrome
•Yours?
* Global disambiguation initiative: http://snipurl.com/conceptweballiance
Introduction
Why build good workflows?
Good workflow design = good science!
Introduction
Best practices for workflow design
Best practices for workflow design = best practices for experimental science + best practices for software engineering
Best practice 6
Make workflow executable by others
How to check that others can execute your workflow?
» Try it! Proof of executability
› Ask a colleague
› Use an external t2web runner
» Tips
› Use Web Services
› If you use local command-line tools:
• Install the tools on a publicly accessible server (this applies e.g. to Rserve)
• Use a system that your users can set up (e.g. BioLinux)
Best practice 8
The reuse workflow
Not a best practice, but a tip: know-how is important for reuse
[Flowchart. Steps: Check workflows on myExperiment; Use scripts from colleagues; Check services on BioCatalogue; Search the internet; Invent a new wheel. Branch labels: Pos., Neg. Loop on a negative result (shown twice): Contact authors, Retry. Outcome: Reuse, Attribute, Respect licences]
Best practice 10
Maintain
Best practices to support maintenance
» Regularly check your workflow
› Ask colleagues
» Enable support for maintenance
› Register your workflow on myExperiment
› Register Web Services on BioCatalogue
» Enable peers to repair: annotate!
» Note about versioning
› No need to register all edits on myExperiment: use Subversion
› Register important updates on myExperiment
Workflow Forever
Preservation of good workflows for future applications
Workflow 74 “Protein Discovery” 2005
Workflow 2876 “Match gene lists by literature” 2012
Workflow 2805 “Get Pathway genes” 2012
Wf4Ever
Outcomes for BioVeL
myExperiment 2.0
BioCatalogue
Taverna
Research Objects
Linked Data
Methods
Protocols for Preservation and Conservation
The 10 Best Practices of Workflow Design
Thank you
Thank you for your attention
More information:
http://snipurl.com/workflowbestpractices
1. Make a sketch workflow
2. Use modules
3. Think about the output
4. Provide example inputs and outputs
5. Annotate
6. Make it executable from outside the local environment
7. Choose services carefully
8. Reuse existing workflows
9. Advertise
10. Maintain
Supporting information
Workflow jargon
› Scientific workflow
Paradigm to describe, manage, and share complex scientific analyses
› Workflow system
Software to design, execute, and monitor scientific workflows
› Module
= nested workflow = workflow in a workflow = workflow component
› Beanshell script
A Java-based scripting language.
Typically used for data type conversions in Taverna.
› Provenance
History or trace of a workflow run.
Allows you to look at intermediate data, which workflows and services
were run, with what data.
Editor's Notes
Designing a good workflow is part of doing good research!
This means that if you know one or both of these disciplines, you should apply their principles to workflow design as well. (In the end, using common sense about doing good science is a general best practice for creating workflows too.)
Workflow design is a variant of software design:
› Define hypothesis and approach
› Sketch a workflow of the approach
› Implement workflow
› Trial and error (iterate)
Comment: where are the workflow design patterns?
Boxes without content. The sketch can be made in Taverna using e.g. empty script boxes, as a PowerPoint flow chart, or on a napkin; if it is digital (e.g. Taverna), we can store it digitally.
< Comment: add concept mining workflow and a sketch. Cite Eleni: 'helps me to share workflow while developing it, that makes it better' >
How?
› In Taverna using empty Beanshells
› In PowerPoint
› In a sketch book
Why?
› Provides a reference point for the main task(s) of the workflow throughout the implementation process
› Promotes sharing between computer and workflow systems due to its non-explicit nature
› Helps design the experiment
› Helps communication (supervisors, colleagues)
The workflow on the left explains the basic steps of a text mining process. The expanded workflow is much harder to understand. We can use each nested workflow as a workflow on its own.
How?
› Describe and implement each of the executable processes in a workflow individually and independently
› In Taverna this can be done through nested workflows
Why?
› Facilitates independent testing and validation of the execution of each individual module
› Encourages reuse
Note: Make sure that you publish the separate modules as well as the final nested workflow (unfortunately, myExperiment does not support this very well), or at least annotate the components when you publish the whole.
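A minimal Python sketch of the modular idea, offered only as an illustration outside Taverna: each step is a small function that can be tested on its own, and the top-level function plays the role of the workflow that nests them. The step names (`tokenize`, `count_terms`, `top_terms`) are hypothetical and are not the steps of the text mining workflow shown on the slide.

```python
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Hypothetical module 1: split text into lowercase word tokens."""
    tokens = []
    for word in text.split():
        cleaned = word.strip(".,;:!?()").lower()
        if cleaned:
            tokens.append(cleaned)
    return tokens


def count_terms(tokens: List[str]) -> Dict[str, int]:
    """Hypothetical module 2: count how often each token occurs."""
    return dict(Counter(tokens))


def top_terms(counts: Dict[str, int], n: int) -> List[str]:
    """Hypothetical module 3: the n most frequent terms, ties broken alphabetically."""
    return sorted(counts, key=lambda term: (-counts[term], term))[:n]


def text_mining_workflow(text: str, n: int = 3) -> List[str]:
    """The 'nested' workflow: it only wires the independent modules together."""
    return top_terms(count_terms(tokenize(text)), n)
```

Because every module has its own inputs and outputs, each one can be validated, published, and reused separately from the workflow that combines them.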
How?
› Consider whether you want to populate data models/databases or create outputs as disconnected collections of files
› Consider who the results are for (an overview for users, or the next workflow component)
› General advice: at least have a report as an output (provenance will have the separate parts anyway)
› Use Taverna for provenance collection (intermediate results are captured by the provenance engine)
Why?
› It is easier to think about this at the design stage than to adjust a finished workflow
› Structures potentially large output data
How?
› Example inputs and outputs can be recorded in Taverna
› Alternatively: add input or output files to a pack containing the workflow
› Use real example data
Why?
› To help understand the workflow
› For validation
› For maintenance
Note: Make sure that the input and output examples are coupled. Keep in mind that the output has a timestamp: it may change due to changes in the underlying databases.
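One way to keep example inputs and outputs coupled, and to record when the output was produced, could be sketched in Python as follows. This is an assumption-laden illustration, not a Taverna or myExperiment feature: the record layout, the function names, and the placeholder values are all hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def make_example_record(inputs: dict, outputs: dict,
                        recorded_at: Optional[datetime] = None) -> dict:
    """Couple one example input with the output it produced, plus a timestamp."""
    if recorded_at is None:
        recorded_at = datetime.now(timezone.utc)
    return {
        "inputs": inputs,
        "outputs": outputs,
        # The timestamp matters because underlying databases may change later.
        "recorded_at": recorded_at.isoformat(),
    }


def dump_example(record: dict) -> str:
    """Serialise the record so it can be stored next to the workflow."""
    return json.dumps(record, indent=2, sort_keys=True)


def load_example(text: str) -> dict:
    """Read a stored record back."""
    return json.loads(text)


def outputs_match(record: dict, new_outputs: dict) -> bool:
    """Compare a fresh run against the recorded example output."""
    return record["outputs"] == new_outputs
```

A mismatch reported by `outputs_match` does not necessarily mean the workflow is broken; with the recorded timestamp, a reader can judge whether a database update is the more likely cause.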
How?
› Choose meaningful names for the workflow title, inputs, outputs, and for the processes that constitute the workflow
› Focus on how a component is used in this workflow and why it is in there
› If it exists, refer to information about what the component does in general (e.g. by referencing a service on BioCatalogue)
› Assume that a referenced resource may disappear or change at some time in the future
› Use Taverna description fields and example fields*. Taverna keeps this information with the workflow and myExperiment uses it
› Keep any notes that are related to the workflow, but not part of it, linked to it*
› Examples of useful "extra" information: execution time, keywords, contact information, attribution
› myExperiment offers some of this, but it is best to put it in the workflow descriptions
Why?
› Doing good science
› Record what is needed for a publication later on
› Increase reusability
Cite Kostas: 'many workflows are badly documented computer programs'
The Wf4Ever project will provide additional support (and incentives) for describing (the purpose of) workflow components, related objects and references (e.g. data sets), and support for storing the elements of an experiment with their metadata in a structured way.
Facilitate understanding and reuse
How?
› Use Web Services or any Taverna widget except the external tool; use the external tool only when it runs over ssh on a publicly accessible server
› Use Taverna with local tools, but installed on a publicly accessible server with the Taverna server
› Use local tools from an easy-to-set-up environment such as BioLinux (only for a certain niche of users)
› TRY IT!!
Why?
› Others will be able to run the workflow
› Proof of reproducibility
How? Choose a service that is reliable, based on:
› BioCatalogue reliability statistics (in practice: check on BioCatalogue whether it has a green light; at the moment there is not much more you can do)
› How often it is used in other workflows
› Contact with service providers. Communicate!
› The reputation of the institution providing the service: check the trustworthiness of the service provider (this can also be a person, for whom you can check whether they will remain at an institution to maintain the service)
Why?
› Prevent workflow decay, prolong the life of the workflow
Note to service developers: many workarounds and ugly workflow practices come from having to deal with badly behaved services!
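The selection criteria above could be written down as a simple ranking, sketched here in Python. Everything in it is an assumption made for illustration: the `ServiceCandidate` fields are hypothetical and filled in by hand (nothing is fetched from BioCatalogue), and the priority order of the criteria is one possible choice, not a rule from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceCandidate:
    name: str
    monitoring_ok: bool        # e.g. a "green light" noted by hand from a registry
    workflows_using_it: int    # how often it is used in other workflows
    provider_reachable: bool   # did the provider respond when contacted?


def rank_services(candidates: List[ServiceCandidate]) -> List[ServiceCandidate]:
    """Order candidates from most to least promising.

    Assumed priority: monitoring status first, then usage in other
    workflows, then provider contact; the name breaks remaining ties.
    """
    return sorted(
        candidates,
        key=lambda s: (
            not s.monitoring_ok,
            -s.workflows_using_it,
            not s.provider_reachable,
            s.name,
        ),
    )
```

Writing the criteria down like this mainly helps to make the choice explicit and repeatable; it does not replace actually contacting the service provider.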
Web Services are digital, their creators not. Communication saves web services and workflows from decay.
A common misconception is that because they are workflows, they are automatically stable. It takes effort and often communication to reuse work, especially when using ‘state-of-the-art’ products made by scientists.
How?
› Make your own workflows modular, since this promotes reuse
› Search myExperiment and filter on most downloaded or most viewed
› Check if it has been used in a publication
› Use your contacts: maybe someone has tried to solve something similar before using a workflow?
› Try and try harder, contact authors!
Why?
› Another user who is familiar with one of your workflows is more likely to understand another workflow that you designed
› Beneficial when repairing workflows: repairing a given workflow may entail repairing the workflows in which it is used as a subworkflow
› Fights redundancy
Note: attribute others and respect licenses
http://myExperiment.org/workflows/74?version=12
http://myExperiment.org/packs/258
How?
› Share your workflow (don’t forget contact info!):
• on myExperiment
• on other social media
• by e-mailing it around to colleagues
› Cite your workflow when publishing, using a stable identifier such as the one from myExperiment
› Make use of the pack functionality in myExperiment to bundle your workflow with other important documents such as a publication
Why?
› Good science – share your results
› Get cited – fame!
› Progress – let others build on your work without reinventing it
How?
› Act on information about services that are deprecated, by changing services or by providing a note that the specific process in the workflow is not executable anymore
› Put your services on BioCatalogue (you don't have to be the owner) and your workflows on myExperiment (notification is planned)
› Regularly test the workflow (like 'unit tests')
Why?
› Good practice – this is already demanded for some types of publications, like an application note in Bioinformatics
› Fight workflow decay, prolong the life of the workflow
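The advice to test the workflow regularly, like unit tests, might look as follows in Python. This is a sketch under stated assumptions: `run_workflow` is a hypothetical stand-in for however the real workflow is executed, and the example input and output are made-up placeholders, not data from the presentation.

```python
import unittest


def run_workflow(inputs: dict) -> dict:
    """Hypothetical stand-in for executing the real workflow."""
    return {"gene_count": len(set(inputs["genes"]))}


# Coupled example input and output, stored alongside the workflow (placeholders).
EXAMPLE_INPUTS = {"genes": ["geneA", "geneB", "geneA"]}
EXAMPLE_OUTPUTS = {"gene_count": 2}


class WorkflowRegressionTest(unittest.TestCase):
    def test_example_still_reproduces(self):
        # Fails when a service or database change alters the result.
        self.assertEqual(run_workflow(EXAMPLE_INPUTS), EXAMPLE_OUTPUTS)


def run_checks() -> bool:
    """Run the regression test and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(WorkflowRegressionTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Running `run_checks()` on a schedule gives an early signal of workflow decay, at which point the deprecated service can be replaced or the affected step annotated as no longer executable.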
A scientific workflow can be seen as the combination of data and processes into a configurable, structured set of steps that implement semi-automated computational solutions in scientific problem-solving, i.e. the implementation of a scientific method. Workflows need to be preserved (and conserved). More on this later.