Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Converting scripts into reproducible workflow research objects

302 views

Published on

This talk was given at the eScience 2016 conference. It presents a principled methodology for converting raw scripts into annotated workflow research objects.

Published in: Education
  • Be the first to comment

  • Be the first to like this

Converting scripts into reproducible workflow research objects

  1. 1. Converting Scripts into Reproducible Workflow Research Objects Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros lucas.carvalho@ic.unicamp.br Baltimore, Maryland, USA October 23-26, 2016
  2. 2. 2 Background and Motivation ● Data-Intensive Experiments – Collection of scripts, programs and (big) data Papers
  3. 3. 3 Background and Motivation ● Data-Intensive Experiments – Collection of scripts, programs and (big) data Papers How to understand, reproduce or reuse data and models of experiments?
  4. 4. 4 Background and Motivation ● Data-Intensive Experiments – Collection of scripts, programs and (big) data Manual collection and organization of data provenance Papers How to understand, reproduce or reuse data and models of experiments?
  5. 5. 5 Background and Motivation ● Script-based experiments What are the inputs and outputs? How to change this local program for a similar web service? Example of script code. Difficult to understand, to reuse, and to reproduce.
  6. 6. 6 Background and Motivation ● Scientific Workflows Example of Scientific Workflow Management System.
  7. 7. 7 Create Understand Reuse Reproduce Overview
  8. 8. 8 Create Understand Reuse Reproduce Overview +
  9. 9. 9 Create Understand Reuse Reproduce Overview + Step 2 Step 1 Step 3 Step 4 Step 5 Methodology
  10. 10. 10 Related Work ● Script-language specific. ● Workflow-engine specific. ● A new language is needed. ● Outcome is not an executable workflow. ● Do not collect provenance data of the conversion process.
  11. 11. 11 Two Kind of Experts ● Scientists – Domain experts who understand the experiment, and the script (sometimes called user); ● Curators: – Scientists who are also familiar with workflow and script programming or; – Computer scientists who are familiar enough with the domain to be able to implement our methodology; – Responsible for authoring, documenting and publishing workflows and associated resources.
  12. 12. 12 Requirements ● Produce workflow-like view of the script. ● Create an executable workflow and compare execution of workflow and script. ● Modify the workflow resources. ● Record provenance data. ● Aggregate all resources to support Reproducibility and Reuse. 1 2 3 4 5
  13. 13. 13 Requirements ● Produce workflow-like view of the script.1 Activity 1 Port 1 Port 2 Port 3 Port 1 Port 2 Activity 2 Port 3 Port 3 Activity n Port n Script-based experiment. Abstract workflow.
  14. 14. 14 Requirements ● Create executable workflow and compare execution of workflow and script. 2 Executable workflow. Script-based experiment.
  15. 15. 15 Requirements ● Modify the workflow resources.3 Local (a) (b) Algorithm A Algorithm B
  16. 16. 16 Requirements ● Record provenance data4 Activity 1 Output 1 Output 2 wasGeneratedBy wasGeneratedBy Sample used “2012-06-01” wasStartedAt Activity 2 used LucasWorkflow Run wasAssociatedWith used
  17. 17. 17 Requirements ● Aggregate all resources to support Reproducibility and Reuse. 5 Abstract workflows Concrete workflows Annotations Papers and Reports Provenance Authors Scripts Data
  18. 18. 18 Script Generate Abstract Workflow Generate Abstract Workflow Create an executable workflow Create an executable workflow Refine workflowRefine workflow Bundle Resources into a Research Object Bundle Resources into a Research Object Annotate and check quality Annotate and check quality Abstract workflow Concrete workflow 2 1 3 4 5 Methodology
  19. 19. 19 Workflow Research Object (WRO) ● Research Objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations. ● WROs encapsulate scientific workflows and additional information regarding their context and resources. Research Object Model
  20. 20. 20 Running Example ● Molecular Dynamics Simulations – Many branches of material sciences, computational engineering, physics and chemistry. – Scripts (shell script), programs (NAMD, VMD, Fortran) – Phases: set up, simulation and analysis of trajectories. – Inputs: protein structure, simulation parameters and force field files. – Output: trajectories and analysis results.
  21. 21. 21 Step Generate Abstract Workflow 1 Script code.
  22. 22. 22 Step Generate Abstract Workflow 1 Manually annotate Script code. Annotated script code.
  23. 23. 23 Step Generate Abstract Workflow 1 Manually annotate Create workflow-like view Script code. Annotated script code. Abstract workflow.
  24. 24. 24 Step Generate Abstract Workflow 1 code blocks Input/ouput YesWorkflow McPhillips et. al, 2015 - Code comments - Tags: ● @begin ● @end ● @desc ● @in ● @out ● ... T. McPhillips et al. (2015), “Yesworkflow: A user-oriented, language- independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313, 2015. Create Workflow-like view Abstract workflow. Annotated script code.
  25. 25. 25 Step Generate Abstract Workflow 1 Create Workflow-like view Abstract workflow. Annotated script code.
  26. 26. 26 Step Create an executable workflow 2 Abstract workflow.
  27. 27. 27 Step Create an executable workflow 2 Create implementation of activities Copy code blocks from the script. Abstract workflow. Executable workflow.
  28. 28. 28 Step Create an executable workflow 2 Create implementation of activities Copy code blocks from the script. Abstract workflow. Executable workflow.
  29. 29. 29 Step Create an executable workflow 2 Create implementation of activities Copy code blocks from the script. Abstract workflow. Executable workflow. Script code.
  30. 30. 30 Step Refine executable workflow 3 Modify resources: ● Algorithms ● Data Sets ● Parallelization ● Web Services ● ... Executable workflow. New workflow version.
  31. 31. 31 Step Refine executable workflow 3 Create new version Modify resources: ● Algorithms ● Data Sets ● Parallelization ● Web Services ● ... Executable workflow. New workflow version.
  32. 32. 32 Steps Record provenance data: execution traces. 2 3 wasEnactedBy split Output 1 Output 2 wasGeneratedBy wasGeneratedBy Sample used “2012-06-01” wasStartedAt psgen used LucasWorkflow Run wasAssociatedWith used hasSpecification W3C PROV Executable workflow.
  33. 33. 33 Steps Record provenance data: conversion process. 2 3 wasDerivedFrom wasDerivedFrom wasDerivedFrom wasAssociatedWith CuratorCurator W3C PROV Executable workflow. New workflow version. Script code.
  34. 34. 34 Step Annotate and check quality ● Annotations describing the workflow. ● Use provenance data – To check the quality of the conversion process. ● Run checks to verify the soundness of the workflow. 4
  35. 35. 35 Step Annotate and check quality 4 Script code. Executable workflow.
  36. 36. 36 Step Annotate and check quality 4 Workflow version. Initial Executable workflow.
  37. 37. 37 Step Annotate and check quality ● Common mistakes during the conversion: – not clearly identified the main logical processing units in the script; – a mistake when migrating script code into the corresponding activity; – not provided the correct input files and parameters; – the coding of the workflow itself contained errors. 4
  38. 38. 38 Step Bundle Resources into a Research Object 5 Script Abstract workflow Concrete workflow(s) Annotations Paper Provenance Data Attributions
  39. 39. 39 Contributions ● A methodology that guides curators in a principled manner to transform scripts into reproducible and reusable WRO; ● This addresses an important issue in the area of script provenance;
  40. 40. 40 Conclusions ● We addressed issues wrt understanding, reuse and reproducibility of script-based experiments. ● The methodology created was: – elaborated based on requirements; – showcased via a real world use case from the field of Molecular Dynamics; ● We exploited tools and standards from the scientific community: – Scientific Workflows, YesWorkflow, Research Objects, the W3C PROV recommendations and the Web Annotation Data Model. ● The bundle is available at http://w3id.org/w2share/s2rwro/
  41. 41. 41 Next Steps ● Evaluation using other case studies; ● Evaluation of the cost of the effectiveness of our methodology; ● Extension of YesWorkflow to support the semantic annotation of blocks; ● Implementation of tools.
  42. 42. 42 Acknowledgments ● FAPESP (grant # 2014/23861-4) ● CCES/CEPID (grant # 2013/08293-7) – Center for Computational Engineering & Sciences ● LIS (Laboratory of Information Systems) ● Prof. Munir Skaf and his group from Institute of Chemistry - Unicamp.
  43. 43. Converting Scripts into Reproducible Workflow Research Objects Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros lucas.carvalho@ic.unicamp.br Baltimore, Maryland, USA October 23-26, 2016

×