Converting Scripts into Reproducible Workflow Research Objects
Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros
lucas.carvalho@ic.unicamp.br
Baltimore, Maryland, USA
October 23-26, 2016
4
Background and Motivation
● Data-Intensive Experiments
– Collection of scripts, programs and (big) data
– Manual collection and organization of data provenance
– Papers
● How to understand, reproduce or reuse data and models of experiments?
5
Background and Motivation
● Script-based experiments
– Difficult to understand, to reuse, and to reproduce.
– What are the inputs and outputs?
– How to replace this local program with a similar web service?
[Figure: example of script code.]
6
Background and Motivation
● Scientific Workflows
Example of Scientific Workflow Management System.
9
Overview
[Figure: overview of the methodology, Steps 1 to 5, with the goals Create, Understand, Reuse and Reproduce.]
10
Related Work
● Some approaches are specific to one script language.
● Some are specific to one workflow engine.
● Some require a new language.
● In some, the outcome is not an executable workflow.
● They do not collect provenance data of the conversion process.
11
Two Kinds of Experts
● Scientists:
– Domain experts who understand the experiment and the script (sometimes called users).
● Curators:
– Scientists who are also familiar with workflow and script programming; or
– Computer scientists who are familiar enough with the domain to be able to implement our methodology.
– Responsible for authoring, documenting and publishing workflows and associated resources.
12
Requirements
● (1) Produce workflow-like view of the script.
● (2) Create an executable workflow and compare execution of workflow and script.
● (3) Modify the workflow resources.
● (4) Record provenance data.
● (5) Aggregate all resources to support Reproducibility and Reuse.
13
Requirements
● (1) Produce workflow-like view of the script.
[Figure: a script-based experiment and the corresponding abstract workflow, in which Activity 1, Activity 2, ..., Activity n are connected through ports.]
14
Requirements
● (2) Create executable workflow and compare execution of workflow and script.
[Figure: executable workflow and script-based experiment.]
15
Requirements
● (3) Modify the workflow resources.
[Figure: (a) Local; (b) Algorithm A, Algorithm B.]
16
Requirements
● (4) Record provenance data.
[Figure: provenance graph of a workflow run, with nodes Activity 1, Activity 2, Output 1, Output 2, Sample, Lucas and Workflow Run, the time “2012-06-01”, and the relations used, wasGeneratedBy, wasStartedAt and wasAssociatedWith.]
17
Requirements
● (5) Aggregate all resources to support Reproducibility and Reuse.
– Abstract workflows
– Concrete workflows
– Annotations
– Papers and Reports
– Provenance
– Authors
– Scripts
– Data
18
Methodology
● Step 1: Generate Abstract Workflow (from the script to an abstract workflow)
● Step 2: Create an executable workflow (from the abstract workflow to a concrete workflow)
● Step 3: Refine workflow
● Step 4: Annotate and check quality
● Step 5: Bundle Resources into a Research Object
19
Workflow Research Object (WRO)
● Research Objects are
semantically rich
aggregations of resources
that bring together data,
methods and people in
scientific investigations.
● WROs encapsulate scientific
workflows and additional
information regarding their
context and resources.
Research Object Model
20
Running Example
● Molecular Dynamics Simulations
– Used in many branches of materials science, computational
engineering, physics and chemistry.
– Scripts (shell script), programs (NAMD, VMD, Fortran)
– Phases: set up, simulation and analysis of trajectories.
– Inputs: protein structure, simulation parameters and
force field files.
– Output: trajectories and analysis results.
23
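Step 1: Generate Abstract Workflow
● Manually annotate the script code to obtain annotated script code.
● Create a workflow-like view of the annotated script code to obtain the abstract workflow.
[Figure: script code, annotated script code and abstract workflow.]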
24
Step 1: Generate Abstract Workflow
● YesWorkflow (McPhillips et al., 2015)
– Annotations are written as code comments.
– Tags: @begin, @end, @desc, @in, @out, ...
– Tags mark code blocks and their inputs/outputs.
● Create a workflow-like view: annotated script code becomes an abstract workflow.
T. McPhillips et al. (2015), “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313, 2015.
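To make the tags concrete, the sketch below shows a hypothetical annotated script and a few lines of Python that pull out the blocks and their ports. The script text, the block names (`setup`, `simulate`) and the `extract_blocks` helper are illustrative assumptions. This is not the YesWorkflow tool itself, only a toy reading of the `@begin`, `@in`, `@out` and `@end` tags listed on the slide.

```python
import re

# Hypothetical annotated shell script; block and port names are illustrative.
ANNOTATED_SCRIPT = """\
# @begin setup @desc Prepare the system for simulation
# @in structure
# @out prepared_system
echo "setting up"
# @end setup
# @begin simulate @desc Run the simulation
# @in prepared_system
# @in parameters
# @out trajectory
echo "simulating"
# @end simulate
"""

# Matches a comment line that starts with one of the four tags handled here.
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")

def extract_blocks(script):
    """Return {block_name: {"in": [...], "out": [...]}} for top-level blocks."""
    blocks, current = {}, None
    for line in script.splitlines():
        match = TAG.match(line.strip())
        if not match:
            continue  # ordinary code line, not an annotation
        tag, name = match.groups()
        if tag == "begin":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            blocks[current][tag].append(name)
    return blocks

blocks = extract_blocks(ANNOTATED_SCRIPT)
```

The resulting dictionary is one possible in-memory form of an abstract workflow: each block is an activity, and each `@in` or `@out` name is a port.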
29
Step 2: Create an executable workflow
● Create an implementation of each activity.
● Copy code blocks from the script.
[Figure: script code, abstract workflow and executable workflow.]
31
Step 3: Refine executable workflow
● Modify resources to create a new workflow version:
– Algorithms
– Data Sets
– Parallelization
– Web Services
– ...
[Figure: executable workflow and new workflow version.]
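One way to picture a refinement is as a new workflow version that replaces a single activity and remembers which version it came from. The sketch below assumes a hypothetical `WorkflowVersion` record and two stand-in implementations of an `analyze` activity. None of these names come from the slides.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class WorkflowVersion:
    version: int
    activities: dict                     # activity name -> implementing callable
    derived_from: "int | None" = None    # version this one was refined from

def local_analysis(trajectory):
    return f"local analysis of {trajectory}"

def service_analysis(trajectory):
    # Stand-in for a call to a remote web service (no network access here).
    return f"service analysis of {trajectory}"

def refine(workflow, name, implementation):
    """Return a new version with one activity replaced; the old one is kept."""
    activities = {**workflow.activities, name: implementation}
    return replace(workflow, version=workflow.version + 1,
                   activities=activities, derived_from=workflow.version)

v1 = WorkflowVersion(version=1, activities={"analyze": local_analysis})
v2 = refine(v1, "analyze", service_analysis)
```

Because `v1` is left untouched, both versions can be rerun and compared.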
32
Steps 2 and 3: Record provenance data (execution traces)
● Provenance model: W3C PROV
[Figure: execution trace of the executable workflow, with nodes split, psgen, Output 1, Output 2, Sample, Lucas and Workflow Run, the time “2012-06-01”, and the relations wasEnactedBy, wasGeneratedBy, used, wasStartedAt, wasAssociatedWith and hasSpecification.]
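An execution trace like the one in the figure can be sketched as a list of (subject, relation, object) triples. The node and relation names below are taken from the slide, but which node connects to which is an assumption made for illustration, and the `upstream` helper is hypothetical. A real trace would be serialized with a PROV-aware library or format.

```python
# Provenance as (subject, relation, object) triples; the endpoints are assumed.
TRACE = [
    ("Output 1", "wasGeneratedBy", "split"),
    ("Output 2", "wasGeneratedBy", "split"),
    ("split", "used", "Sample"),
    ("Workflow Run", "wasAssociatedWith", "Lucas"),
    ("Workflow Run", "wasStartedAt", "2012-06-01"),
]

def objects(trace, subject, relation):
    """All objects linked to `subject` by `relation`."""
    return [o for s, r, o in trace if s == subject and r == relation]

def upstream(trace, entity):
    """Entities that `entity` depends on, via wasGeneratedBy then used."""
    found = []
    for activity in objects(trace, entity, "wasGeneratedBy"):
        for source in objects(trace, activity, "used"):
            found.append(source)
            found.extend(upstream(trace, source))
    return found

lineage = upstream(TRACE, "Output 1")
```

Walking the graph backwards in this way answers the question of which inputs a given output was computed from.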
33
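Steps 2 and 3: Record provenance data (conversion process)
● Provenance model: W3C PROV
[Figure: provenance of the conversion, relating the script code, the executable workflow and the new workflow version through wasDerivedFrom, and associating the conversion with the Curator through wasAssociatedWith.]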
34
Step 4: Annotate and check quality
● Annotations describing the workflow.
● Use provenance data
– To check the quality of the conversion process.
● Run checks to verify the soundness of the workflow.
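One simple quality check is to compare what the script and the workflow produced for the same inputs. The sketch below assumes outputs are available as bytes keyed by name and treats byte-identical content as equivalent. The names `compare_outputs`, `script_run` and `workflow_run` and the sample values are hypothetical, and exact equality may be too strict for non-deterministic simulations.

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 hex digest used as a content fingerprint."""
    return hashlib.sha256(data).hexdigest()

def compare_outputs(script_outputs, workflow_outputs):
    """Return names of outputs that are missing or differ between the runs."""
    names = sorted(set(script_outputs) | set(workflow_outputs))
    return [
        name for name in names
        if name not in script_outputs
        or name not in workflow_outputs
        or digest(script_outputs[name]) != digest(workflow_outputs[name])
    ]

# Illustrative outputs of a script run and of the converted workflow run.
script_run = {"trajectory": b"frame-1 frame-2", "analysis": b"rmsd=1.2"}
workflow_run = {"trajectory": b"frame-1 frame-2", "analysis": b"rmsd=1.3"}
mismatches = compare_outputs(script_run, workflow_run)
```

An empty list suggests the conversion preserved behaviour for that run. Any name in the list points the curator at the activity to inspect.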
35
Step 4: Annotate and check quality
[Figure: script code and executable workflow.]
36
Step 4: Annotate and check quality
[Figure: workflow version and initial executable workflow.]
37
Step 4: Annotate and check quality
● Common mistakes during the conversion:
– the main logical processing units in the script were not clearly identified;
– script code was migrated incorrectly into the corresponding activity;
– the correct input files and parameters were not provided;
– the coding of the workflow itself contained errors.
38
Step 5: Bundle Resources into a Research Object
– Script
– Abstract workflow
– Concrete workflow(s)
– Annotations
– Paper
– Provenance
– Data
– Attributions
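The bundle can be thought of as a set of files plus a manifest that lists them. The sketch below builds such a manifest as JSON. The file paths, the `build_manifest` helper and the `aggregates`/`uri`/`type` layout are illustrative assumptions, loosely inspired by Research Object manifests. It does not claim to conform to any Research Object specification.

```python
import json

# Hypothetical paths; the keys are the resource types listed on the slide.
RESOURCES = {
    "Script": "script/experiment.sh",
    "Abstract workflow": "workflow/abstract.txt",
    "Concrete workflow": "workflow/concrete.xml",
    "Annotations": "annotations/annotations.txt",
    "Paper": "docs/paper.pdf",
    "Provenance": "provenance/run.txt",
    "Data": "data/inputs.zip",
    "Attributions": "docs/attributions.txt",
}

def build_manifest(resources):
    """List every resource with its path and type, in a stable order."""
    return {
        "aggregates": [
            {"uri": path, "type": kind}
            for kind, path in sorted(resources.items())
        ]
    }

manifest_json = json.dumps(build_manifest(RESOURCES), indent=2)
```

Sorting the entries keeps the manifest stable across runs, which makes differences between bundle versions easier to spot.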
39
Contributions
● A methodology that guides curators in a
principled manner to transform scripts into
reproducible and reusable WROs.
● This addresses an important issue in the area
of script provenance.
40
Conclusions
● We addressed issues regarding the understanding, reuse and
reproducibility of script-based experiments.
● The methodology created was:
– elaborated based on requirements;
– showcased via a real-world use case from the field of Molecular
Dynamics.
● We exploited tools and standards from the scientific
community:
– Scientific Workflows, YesWorkflow, Research Objects, the W3C
PROV recommendations and the Web Annotation Data Model.
● The bundle is available at http://w3id.org/w2share/s2rwro/
41
Next Steps
● Evaluation using other case studies;
● Evaluation of the cost-effectiveness of
our methodology;
● Extension of YesWorkflow to support the
semantic annotation of blocks;
● Implementation of tools.
42
Acknowledgments
● FAPESP (grant # 2014/23861-4)
● CCES/CEPID (grant # 2013/08293-7)
– Center for Computational Engineering & Sciences
● LIS (Laboratory of Information Systems)
● Prof. Munir Skaf and his group from the Institute of
Chemistry, Unicamp.
