Why Workflows Break

This talk was presented by Khalid Belhajjame at the IEEE eScience 2012 conference in Chicago.

Published in: Education

  1. 1. Why Workflows Break - Understanding and Combating Decay in Taverna Workflows
     Jun Zhao, Jose Manuel Gomez-Perez, Khalid Belhajjame, Graham Klyne, Esteban Garcia-Cuesta, Aleix Garrido, Kristina Hettne, Marco Roos, David De Roure, and Carole Goble
     IEEE eScience 2012. Chicago, USA 10 October, 2012
     http://www.flickr.com/photos/sheepies/3798650645/ @ CC BY-NC 2.0
  2. 2. Reproducibility: Why Bother?
     ◉ Results produced by scientists not only give insight, they lead to progress and are built upon.
     ◉ Therefore, the ability to test results is important.
     ◉ In the natural sciences, when a scientist claims an experimental result, other scientists should be able to check it.
     ◉ This should also be possible for experiments carried out in computational environments.
  3. 3. 47 of 53 “landmark” publications could not be replicated
     Inadequate cell lines and animal models
     Nature, 483, 2012
     Credit to Carole Goble, JCDL 2012 Keynote
  6. 6. A famous quote
     "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."
     Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995
  7. 7. Another quote
     "Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry."
     Eric S. Raymond, The art of UNIX programming, 2004
  8. 8. Workflows: A Means for Preserving Scientific Methods
     Fortunately, there is a means that can be used to document the experiment that the scientist ran, and even re-run it!
     Scientific workflows
     • Increasingly adopted in modern sciences.
     • Transparent documentation of experimental methods
     • Repeatable and configurable
     [Figure: example workflow with nodes chromosome17, chromosome37, Kegg pathway query, Detect common pathways, Common pathways]
  9. 9. Workflow Decay
     A decayed or reduced ability to be executed or produce the same results
     Our Contributions
     • An empirical analysis for identifying and categorizing the causes of workflow decay
     • A software framework to assess workflow preservation
  10. 10. Storyline
      • The importance of reproducibility
      • Workflow as a means for preserving scientific methods
      • Understanding the causes of workflow decay
      • Combating decay
      • Lessons learnt and future work
  11. 11. Understanding The Causes of Workflow Decay
      We adopted an empirical approach
      • To identify the causes of workflow decay
      • To quantify their severity
      To do so, we analyzed a sample of real workflows to determine whether they suffer from decay and what caused it.
  12. 12. Experimental Setup
      Taverna workflows from myExperiment.org
      • Taverna 1
      • Taverna 2
      Software environment
      • Taverna 2.3
      Experiment metadata
      • June-July 2012
      • 4 researchers
      Selection process
      • By the creation year
      • By the creator
      • By the domain
  13. 13. Analyzed Workflows
      Number of Taverna 1 workflows from 2007 to 2011
                  2007  2008  2009  2010  2011
        Tested      12    10    10    10    4*
        Total       74   341   101    26    13
      Number of Taverna 2 workflows from 2009 to 2012
                  2009  2010  2011  2012
        Tested      12    10    15     9
        Total       97   308   289   184
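The sample sizes on slide 13 can be cross-checked with a few lines of Python. This is only an illustrative sketch: the dictionary and function names are made up here, the counts are copied from the slide, and the asterisk on the 2011 Taverna 1 count is dropped because the transcript does not explain it.

```python
# Counts copied from the slide as {year: (tested, total)}.
# Names are illustrative; the "*" on the 2011 Taverna 1 count is dropped.
TAVERNA_1 = {2007: (12, 74), 2008: (10, 341), 2009: (10, 101),
             2010: (10, 26), 2011: (4, 13)}
TAVERNA_2 = {2009: (12, 97), 2010: (10, 308), 2011: (15, 289),
             2012: (9, 184)}


def summarize(samples):
    """Return (tested, total, tested / total) summed over all years."""
    tested = sum(t for t, _ in samples.values())
    total = sum(n for _, n in samples.values())
    return tested, total, tested / total


for name, samples in (("Taverna 1", TAVERNA_1), ("Taverna 2", TAVERNA_2)):
    tested, total, share = summarize(samples)
    print(f"{name}: {tested} of {total} workflows tested ({share:.1%})")
```

Each sample contains 46 tested workflows, which is consistent with the 92 tested workflows reported later in the talk.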
  14. 14. Profile of Analyzed Workflows
  15. 15. The Proportion of Decay
      • 75% of the 92 tested workflows failed either to execute or to produce the same result (if testable)
      • Those from the early years (2007-2009) had a 91% failure rate
      [Charts: Taverna 1, Taverna 2]
  16. 16. The Cause of Decay
      Manual analysis
      • By the validation report from the Taverna workbench
      • By interpreting experiment results reported by Taverna
      Identified 4 categories of causes
      • Missing example data
      • Missing execution environment
      • Insufficient descriptions about workflows
      • Volatile third-party resources
      Other possible factors, not considered
      • Changes in the local operating environment (hardware, OS, middleware, compiler, etc.)
  17. 17. Decay Caused by Third-Party Resources
      Cause: Third-party resources are not available
      • Refined cause: The underlying dataset, particularly a locally hosted in-house dataset, is no longer available
        Example: The researcher hosting the data changed institution, and the server is no longer available
      • Refined cause: Services are deprecated
        Example: DDBJ web services are no longer provided despite the fact that they are used in many myExperiment workflows
      Cause: Third-party resources are available but not accessible
      • Refined cause: Data is available but identified using different IDs than the ones known to the user
        Example: For scalability reasons the input data is superseded by a new version, making the workflow not executable or causing it to return wrong results
      • Refined cause: Data is available but permission, a certificate, or network access is needed to reach it
        Example: Cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider
      • Refined cause: Services are available but need permission, a certificate, or network access to invoke them
        Example: The security policies of the execution framework are updated due to new hosting institution rules
      Cause: Third-party resources have changed
      • Refined cause: Services are still available under the same identifiers but their functionality has changed
        Example: The web services are updated
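The three top-level causes on slide 17 can be read as a decision order: first whether a resource answers at all, then whether it may be used, then whether it still behaves as recorded. The sketch below illustrates that reading only. `ProbeResult`, `ThirdPartyDecay`, and `classify` are hypothetical names, not part of Taverna or the authors' framework, and the three boolean fields simplify the refined causes (for example, changed identifiers are folded into "not accessible").

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ThirdPartyDecay(Enum):
    """Top-level causes as named on the slide."""
    NOT_AVAILABLE = "not available"
    NOT_ACCESSIBLE = "available but not accessible"
    CHANGED = "changed"


@dataclass
class ProbeResult:
    """Hypothetical outcome of probing one third-party resource."""
    reachable: bool       # did the resource respond at all?
    usable: bool          # could it be used with the IDs and credentials we hold?
    output_matches: bool  # does its output match the previously recorded one?


def classify(probe: ProbeResult) -> Optional[ThirdPartyDecay]:
    """Map a probe outcome to a decay cause, or None if no decay is seen."""
    if not probe.reachable:
        return ThirdPartyDecay.NOT_AVAILABLE
    if not probe.usable:
        return ThirdPartyDecay.NOT_ACCESSIBLE
    if not probe.output_matches:
        return ThirdPartyDecay.CHANGED
    return None


# A service that still answers and accepts the call, but returns different results.
print(classify(ProbeResult(reachable=True, usable=True, output_matches=False)))
```

How the three flags would be obtained in practice (network probes, credential checks, comparison against stored outputs) is left open here.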
  19. 19. Summary of Decay Causes
      • 50% of the decay was caused by volatility of third-party resources
        – Unavailable
        – Inaccessible
        – Updated
      • Missing example data
        – Unable to re-run
      • Missing execution environment
        – Such as local plugins
      • Insufficient metadata
        – Such as any required dependency libraries or permission information
  20. 20. Storyline
      • The importance of reproducibility
      • Workflow as a means for preserving scientific methods
      • Understanding the causes of workflow decay
      • Combating decay
      • Lessons learnt and future work
  21. 21. Combating Workflow Decay
      • Objective: To provide enough information to
        – Prevent decay
        – Detect decay
        – Repair decay
      • Approach: Research Objects + Checklists
        – Research Objects [1][2]: Aggregate workflow specifications together with auxiliary elements, such as example data inputs, annotations, and provenance traces, that can be used to prevent decay and/or repair the workflow in case of decay.
        – Checklists: to check that sufficient information is preserved along with the workflows
      [1] http://wf4ever.github.com/ro/
      [2] http://wf4ever.github.com/ro-primer/
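To make the idea of aggregation on slide 21 concrete, here is a minimal sketch of a research object as a plain Python data class. It is not the Wf4Ever Research Object model or its vocabulary: the class name, the field names, and the file names in the example are all assumptions chosen to mirror the auxiliary elements listed on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResearchObject:
    """Hypothetical container bundling a workflow with its auxiliary elements."""
    workflow_spec: str
    example_inputs: List[str] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)
    provenance_traces: List[str] = field(default_factory=list)

    def aggregated(self) -> List[str]:
        """List every resource the research object aggregates."""
        return [self.workflow_spec, *self.example_inputs,
                *self.annotations, *self.provenance_traces]


# File names below are invented for illustration.
ro = ResearchObject(
    workflow_spec="workflow.t2flow",
    example_inputs=["inputs/example.txt"],
    annotations=["README.txt"],
)
print(ro.aggregated())
```

Keeping the auxiliary elements next to the workflow specification is what later allows a checklist to ask whether anything needed for a re-run is missing.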
  22. 22. Checklists
      • Checklists are a well established tool for guiding practices to ensure safety, quality and consistency in the conduct of complex operations.
      • They have been adopted by the biological research community to promote consistency across research datasets.
      • In our case, we use checklists to assess if a research object contains sufficient information for running the workflow and checking that its results are replicable.
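The use of a checklist described on slide 22 can be sketched as a list of named requirements, each paired with a test applied to a research object. This is an illustration under stated assumptions, not the Minim model or the authors' tooling: the research object is a plain dictionary, and the requirement labels, keys, and file names are invented here.

```python
from typing import Callable, Dict, List, Tuple

# A requirement is a label plus a predicate over a research object,
# which is represented here as a plain dict (an assumption).
Requirement = Tuple[str, Callable[[Dict[str, list]], bool]]

CHECKLIST: List[Requirement] = [
    ("workflow specification is present", lambda ro: bool(ro.get("workflow"))),
    ("example input data is present", lambda ro: bool(ro.get("example_inputs"))),
    ("annotations describe the workflow", lambda ro: bool(ro.get("annotations"))),
]


def evaluate(ro: Dict[str, list],
             checklist: List[Requirement] = CHECKLIST) -> List[str]:
    """Return the labels of all requirements the research object fails."""
    return [label for label, satisfied in checklist if not satisfied(ro)]


# A research object that lacks example inputs (file names are invented).
incomplete = {
    "workflow": ["workflow.t2flow"],
    "example_inputs": [],
    "annotations": ["purpose.txt"],
}
missing = evaluate(incomplete)
print(missing)
```

An empty result means every requirement in the checklist is satisfied; a non-empty result names the information that still has to be added.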
  23. 23. Checklist-ing the Reproducibility of a Workflow
      The Minim model used in our approach is an adaptation of the MIM model [1][2].
      [1] Matthew Gamble, Jun Zhao, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. eScience 2012
      [2] https://raw.github.com/wf4ever/ro-manager/master/src/iaeval/Minim/minim.rdf
  24. 24. Use Case
      • 4 myExperiment packs
        – 2 from genomics, 1 from geography, and 1 domain-neutral
      • Experiment process:
        – Transform them into ROs
        – Create checklist descriptions
      • Observations
        – 2 research objects were found not to contain the information needed to run them; the 2 others failed because of updates to third-party resources and to the execution environment.
  25. 25. Storyline
      • The importance of reproducibility
      • Workflow as a means for preserving scientific methods
      • Understanding the causes of workflow decay
      • Combating decay
      • Lessons learnt and future work
  26. 26. Lessons Learnt
      1. Dependency is the root enemy of reproducible workflows
      2. Documentation, i.e., annotation, is vital
      3. Documentation should be easy to create
  27. 27. The Future Work
      • Decay detection, explanation, and repair
      • Reproducibility and provenance
      • Working with scientists is vital for reproducible science
        – GigaScience
        – BioVel
        – 2020 Science
  28. 28. Acknowledgement
      EU Wf4Ever project (270129), funded under EU FP7 (ICT-2009.4.1) (http://www.wf4ever-project.org)
      The principles of provenance. Dagstuhl, March 1, 2012
