Designing a good workflow is part of doing good research!
This means that if you know the best practices of experimental science, of software engineering, or both, you should apply their principles to workflow design as well. (At the end we can say that using common sense about doing good science is a general best practice for creating workflows too.)
Workflow design is a variant of software design:
» Define hypothesis and approach
» Sketch a workflow of the approach
» Implement workflow
» Trial and error (iterate)
Comment: where are the workflow design patterns?
Boxes without content: this can be done in Taverna using e.g. empty script boxes, in a PowerPoint flow chart, or on a napkin; if the sketch is digital (e.g. Taverna), we can store it digitally.
< Comment: add concept mining workflow and a sketch. Cite Eleni: 'helps me to share workflow while developing it, that makes it better' >
How?
» In Taverna using empty beanshells
» In PowerPoint
» In a sketch book
Why?
» Provides a reference point of the main task(s) of the workflow through the implementation process
» Promotes sharing between computer and workflow systems due to its non-explicit nature
» Helps design the experiment
» Helps communication (supervisors, colleagues)
The workflow on the left explains the basic steps of a text mining process. The expanded workflow is much harder to understand. We can use each nested workflow as a workflow on its own.
How?
» Describe and implement each of the executable processes in a workflow individually and independently
» In Taverna this can be done through nested workflows
Why?
» Facilitates independent testing and validation of the execution of each of the individual modules
» Encourages re-use
Note: Make sure that you publish the separate modules as well as the final nested workflow (unfortunately, myExperiment does not support this very well), or at least annotate the components when you publish the whole.
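The modular idea can be illustrated outside Taverna as well. The sketch below is only an analogy, not Taverna code: each module is an independent Python function, and the top-level workflow is their composition. The module names (`tokenize`, `find_concepts`, `count_concepts`) and the tiny vocabulary are hypothetical stand-ins for the steps of a text mining process.

```python
from typing import Callable, Dict, List

# Hypothetical modules: each one can be run, tested, and reused on its own.
def tokenize(text: str) -> List[str]:
    """Split text into lowercase tokens, stripping surrounding punctuation."""
    stripped = (word.strip(".,;:!?").lower() for word in text.split())
    return [word for word in stripped if word]

def find_concepts(tokens: List[str]) -> List[str]:
    """Keep only tokens that occur in a made-up concept vocabulary."""
    vocabulary = {"protein", "gene", "disease"}
    return [token for token in tokens if token in vocabulary]

def count_concepts(concepts: List[str]) -> Dict[str, int]:
    """Count how often each concept occurs."""
    counts: Dict[str, int] = {}
    for concept in concepts:
        counts[concept] = counts.get(concept, 0) + 1
    return counts

def run_workflow(data, modules: List[Callable]):
    """Run the modules in order, feeding each output into the next module."""
    for module in modules:
        data = module(data)
    return data

# The "nested workflow": an ordered list of independent modules.
TEXT_MINING = [tokenize, find_concepts, count_concepts]

if __name__ == "__main__":
    print(run_workflow("The gene encodes a protein. The protein binds.", TEXT_MINING))
```

Because every module has its own input and output, each can be validated separately before the whole chain is run, which is the point made under "Why?" above.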
How?
» Consider if you want to populate data models/databases or create outputs of disconnected collections of files
» Consider who the results are for (overview for users, or the next workflow component)
» General advice: at least have a report as an output (provenance will have the separate parts anyway)
» Use Taverna for provenance collection (intermediate results are captured by the provenance engine)
Why?
» Easier to think about this at the design stage than trying to adjust a ready workflow
» Structure potentially large output data
How?
» Example inputs and outputs can be recorded in Taverna
» Alternatively: add input or output files to a pack containing the workflow
» Use real example data
Why?
» To help understand the workflow
» For validation
» For maintenance
Note: Make sure that the input and the output examples are coupled. Keep in mind that the output has a timestamp. It may change due to changes in underlying databases.
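One way to keep input and output examples coupled, and to remember when the output was produced, is to store them together in a single record. The sketch below is a minimal illustration under that assumption; the `WorkflowExample` class, its field names, and the JSON layout are hypothetical and are not a Taverna or myExperiment format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class WorkflowExample:
    """One example input coupled to the output it produced (hypothetical format)."""
    workflow: str
    input_values: dict
    output_values: dict
    recorded_at: str  # ISO 8601 timestamp of the run that produced the output

def record_example(workflow: str, input_values: dict, output_values: dict,
                   now: Optional[datetime] = None) -> WorkflowExample:
    """Couple input and output and stamp them with the time of recording."""
    moment = now if now is not None else datetime.now(timezone.utc)
    return WorkflowExample(workflow, input_values, output_values, moment.isoformat())

def to_json(example: WorkflowExample) -> str:
    """Serialize the example so it can be stored next to the workflow."""
    return json.dumps(asdict(example), indent=2, sort_keys=True)

def from_json(text: str) -> WorkflowExample:
    """Load a previously stored example."""
    return WorkflowExample(**json.loads(text))

if __name__ == "__main__":
    example = record_example("demo workflow", {"query": "example term"},
                             {"hits": ["example hit"]})
    print(to_json(example))
```

The timestamp makes it explicit that a later run may legitimately return different output if the underlying databases have changed.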
How?
» Choose meaningful names for the workflow title, inputs, outputs, and for the processes that constitute the workflow.
» Focus on how a component is used in this workflow and why it is in there.
» If it exists, refer to information about what the component does in general (e.g. by referencing a service on BioCatalogue)
» Assume that a referenced resource may disappear or change at some time in the future
» Use Taverna description fields and example fields*. Taverna keeps this information with the workflow and myExperiment uses it.
» Keep any notes that are related to the workflow, but not part of it, linked to it*
» Examples of useful "extra" information: execution time, keywords, contact information, attribution
» myExperiment offers some of this, but it is best to put it in the workflow descriptions
Why?
» Doing good science
» Record what is needed for a publication later on
» Increase re-usability
Cite Kostas: 'many workflows are badly documented computer programs'
* The wf4ever project will provide additional support (and incentives) for describing (the purpose of) workflow components, related objects and references (e.g. data sets), and support for storing the elements of an experiment with their metadata in a structured way.
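Checking for missing annotations can be made routine. The sketch below assumes the descriptions have already been collected into a plain dictionary that maps a component name to its description text; that structure, the function name, and the example entries are hypothetical, and nothing here reads an actual Taverna workflow file.

```python
from typing import Dict, List

def missing_annotations(descriptions: Dict[str, str]) -> List[str]:
    """Return the names of components whose description is empty or only whitespace."""
    return sorted(name for name, text in descriptions.items() if not text.strip())

# Made-up component names and descriptions, for illustration only.
example_components = {
    "workflow": "Finds concepts in abstracts and counts them.",
    "input: query": "Search term sent to the literature service.",
    "Beanshell_1": "",
    "output: report": "   ",
}

if __name__ == "__main__":
    for name in missing_annotations(example_components):
        print(f"Missing description: {name}")
```

A default name such as `Beanshell_1` showing up in this list is also a hint that the component still needs a meaningful name.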
Facilitate understanding and reuse
How?
» Use Web Services, any Taverna widget except external tool, and external tool only when it runs over ssh on a publicly accessible server
» Use Taverna with local tools, but installed on a publicly accessible server with the Taverna server
» Use local tools from an easy to set up environment such as BioLinux (only for a certain niche of users)
» TRY IT!!
Why?
» Others will be able to run the workflow
» Proof of reproducibility
How?
Choose a service that is reliable, based on:
» BioCatalogue reliability statistics (in practice: check on BioCatalogue whether it has a green light; at the moment there is not much more you can do)
» How often it is used in other workflows
» Contact with service providers. Communicate!
» The reputation of the institution providing the service
» The trustworthiness of the service provider (this can also be a person; you can check whether they will remain at an institution to maintain the service)
Why?
» Prevent workflow decay, prolong the life of the workflow
Note to service developers: Many workarounds and ugly workflow practices come from having to deal with badly behaved services!
Web Services are digital; their creators are not. Communication saves web services and workflows from decay.
A common misconception is that because they are workflows, they are automatically stable. It takes effort and often communication to reuse work, especially when using 'state-of-the-art' products made by scientists.
How?
» Make your own workflows modular since this promotes reuse
» Search myExperiment and filter on most downloaded or most viewed
» Check if it has been used in a publication
» Use your contacts: maybe someone has tried to solve something similar before using a workflow?
» Try and try harder, contact authors!
Why?
» Another user who is familiar with one of your workflows is more likely to understand another workflow that you designed
» Beneficial when repairing workflows: repairing a given workflow may also repair the workflows in which it is used as a subworkflow
» Fights redundancy
Note: attribute others and respect licenses
http://myExperiment.org/workflows/74?version=12
http://myExperiment.org/packs/258
How?
» Share your workflow (don't forget contact info!):
 › on myExperiment
 › on other social media
 › by e-mailing it around to colleagues
» Cite your workflow when publishing, using a stable identifier like myExperiment
» Make use of the pack functionality in myExperiment to bundle your workflow with other important documents such as a publication
Why?
» Good science: share your results
» Get cited: fame!
» Progress: let others build on your work without reinventing it
How?
» Act on information about services that are deprecated by
 › changing services
 › providing a note that that specific process in the workflow is not executable anymore
» Put your services on BioCatalogue (you don't have to be the owner) and your workflows on myExperiment (notification is planned)
» Regularly test the workflow (like 'unit tests')
Why?
» Good practice: this is already demanded for some types of publications, like an application note in Bioinformatics
» Fight workflow decay, prolong the life of the workflow
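Regular testing "like unit tests" can be sketched with Python's standard `unittest` module: re-run the workflow on a recorded example input and compare the result with the recorded output. In the sketch below `run_workflow` is a hypothetical stand-in for however the real workflow is executed, and the example input and output are made up for illustration.

```python
import unittest
from typing import List

def run_workflow(gene_ids: List[str]) -> List[str]:
    """Hypothetical stand-in for executing the real workflow."""
    return sorted({gene_id.upper() for gene_id in gene_ids})

# Recorded example, coupled input and output (made-up values).
EXAMPLE_INPUT = ["brca1", "tp53", "brca1"]
EXAMPLE_OUTPUT = ["BRCA1", "TP53"]

class WorkflowRegressionTest(unittest.TestCase):
    def test_example_still_reproduces(self):
        """The recorded example input should still give the recorded output."""
        self.assertEqual(run_workflow(EXAMPLE_INPUT), EXAMPLE_OUTPUT)

if __name__ == "__main__":
    unittest.main()
```

A failing test does not always mean the workflow is broken: the output may have changed because an underlying database changed, in which case the recorded example needs updating.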
A Scientific Workflow can be seen as the combination of data and processes into a configurable, structured set of steps that implement semi-automated computational solutions in scientific problem-solving, i.e. the implementation of a scientific method. Workflows need to be preserved (and conserved). More on this later.
Could we skip this slide to save time?
10 Best Practices for Workflow Design
The 10 Best Practices for Workflow Design
BioVeL M6 Workshop, Göteborg, May 10-11, 2012
Kristina Hettne, Marco Roos (LUMC), Katy Wolstencroft, Carole Goble (myGrid)
Thanks: BioSemantics Group (LUMC), myGrid team (UoM), Yassene Mohamed, Harish Dharuri (LUMC)
Our specialty: Knowledge Discovery
http://biosemantics.org
Disambiguation*
Text Mining
Substrates for Knowledge Discovery
Methods for Knowledge Discovery
Applications
• Predict protein-protein, protein-disease associations, gene prioritization
• Genotype-phenotype studies, e.g. Huntington's Disease, Metabolic Syndrome
• Yours?
* Global disambiguation initiative: http://snipurl.com/conceptweballiance
Introduction
Why build good workflows?
Good workflow design = good science!
Introduction
Best practices for workflow design
Best Practices for workflow design = Best Practices experimental science + Best Practices software engineering
Best practice 5: Annotate
Each component in Taverna can be annotated
Best practice 5: Annotate and help your users
Best practice 6: Make workflow executable from outside the local environment
Best practice 6: Make workflow executable by others
How to check that others can execute your workflow?
» Try it! Proof of executability
 › Ask a colleague
 › Use an external t2web runner
» Tips
 › Use Web Services
 › If you use local command line tools
  • Install tools on a publicly accessible server (e.g. applies to Rserve)
  • Use a system that your users can set up (e.g. BioLinux)
Best practice 8: The reuse workflow
Not a best practice, but a tip: know-how is important for reuse
[Flow chart. Steps: Check workflows on myExperiment; Use scripts from colleagues; Check services on BioCatalogue; Search the internet. On a negative result (Neg.): Contact authors, Retry; finally: Invent a new wheel. On a positive result (Pos.): Reuse, Attribute, Respect licences.]
Best practice 10: Maintain
Best practices to support maintenance
» Regularly check your workflow
 › Ask colleagues
» Enable support for maintenance
 › Register your workflow on myExperiment
 › Register Web Services on
» Enable peers to repair: annotate!
» Note about versioning
 › No need to register all edits on myExperiment: use subversion
 › Register important updates on myExperiment
Workflow Forever
Preservation of good workflows for future applications
Workflow 74 "Protein Discovery" 2005
Workflow 2876 "Match gene lists by literature" 2012
Workflow 2805 "Get Pathway genes" 2012
Wf4Ever Outcomes for BioVeL
» myExperiment 2.0
» BioCatalogue
» Taverna
» Research Objects
» Linked Data
» Methods
» Protocols for Preservation and Conservation
The 10 Best Practices of Workflow Design
Thank you for your attention
More information: http://snipurl.com/workflowbestpractices
1. Make a sketch workflow
2. Use modules
3. Think about the output
4. Provide example inputs and outputs
5. Annotate
6. Make it executable from outside the local environment
7. Choose services carefully
8. Reuse existing workflows
9. Advertise
10. Maintain
Supporting information: Workflow jargon
› Scientific workflow: Paradigm to describe, manage, and share complex scientific analyses
› Workflow system: Software to design, execute, and monitor scientific workflows
› Module = nested workflow = workflow in a workflow = workflow component
› Beanshell script: A Java-based scripting language. Typically used for data type conversions in Taverna.
› Provenance: History or trace of a workflow run. Allows you to look at intermediate data, which workflows and services were run, with what data.