Converting Scripts into Reproducible Workflow Research Objects

Lucas A. M. C. Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros
lucas.carvalho@ic.unicamp.br
Baltimore, Maryland, USA
October 23-26, 2016

2
Background and Motivation
● Data-Intensive Experiments
– Collection of scripts, programs and (big) data
Papers

3
Papers
How to understand, reproduce or reuse data and models of experiments?

4
Manual collection and organization of data provenance

5
● Script-based experiments
– What are the inputs and outputs?
– How can this local program be replaced with a similar web service?
– Scripts are difficult to understand, to reuse, and to reproduce.
Example of script code.

6
● Scientific Workflows
Example of Scientific Workflow Management System.

7
Overview
Create, Understand, Reuse, Reproduce

9
Overview
Create, Understand, Reuse, Reproduce
Methodology: Step 1, Step 2, Step 3, Step 4, Step 5

10
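Related Work
Existing approaches have one or more of these limitations:
● They are specific to one scripting language.
● They are specific to one workflow engine.
● They require a new language.
● Their outcome is not an executable workflow.
● They do not collect provenance data of the conversion process.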

11
Two Kinds of Experts
● Scientists:
– Domain experts who understand the experiment and the script (sometimes called users).
● Curators:
– Scientists who are also familiar with workflow and script programming, or
– Computer scientists who are familiar enough with the domain to implement our methodology.
– Responsible for authoring, documenting and publishing workflows and associated resources.

12
Requirements
● (1) Produce a workflow-like view of the script.
● (2) Create an executable workflow and compare the execution of the workflow and the script.
● (3) Modify the workflow resources.
● (4) Record provenance data.
● (5) Aggregate all resources to support reproducibility and reuse.

13
Requirements
● (1) Produce a workflow-like view of the script.
[Figure: a script-based experiment and the corresponding abstract workflow, drawn as activities (Activity 1, Activity 2, ..., Activity n) connected through ports (Port 1, Port 2, Port 3, ..., Port n).]

14
Requirements
● (2) Create an executable workflow and compare the execution of the workflow and the script.
[Figure: executable workflow shown next to the script-based experiment.]

15
Requirements
● (3) Modify the workflow resources.
[Figure: panels (a) and (b); labels: Local, Algorithm A, Algorithm B.]

16
Requirements
● (4) Record provenance data.
[Figure: provenance graph relating Activity 1, Activity 2, Output 1, Output 2, Sample, Lucas and the Workflow Run through used, wasGeneratedBy, wasStartedAt (“2012-06-01”) and wasAssociatedWith.]

17
Requirements
● (5) Aggregate all resources to support reproducibility and reuse.
– Abstract workflows
– Concrete workflows
– Annotations
– Papers and reports
– Provenance
– Authors
– Scripts
– Data

18
Methodology
Input: Script
Step 1: Generate Abstract Workflow (produces the abstract workflow)
Step 2: Create an executable workflow (produces the concrete workflow)
Step 3: Refine workflow
Step 4: Annotate and check quality
Step 5: Bundle Resources into a Research Object

19
Workflow Research Object (WRO)
● Research Objects are semantically rich aggregations of resources that bring together data, methods and people in scientific investigations.
● WROs encapsulate scientific workflows and additional information regarding their context and resources.
Research Object Model

20
Running Example
● Molecular Dynamics Simulations
– Used in many branches of materials science, computational engineering, physics and chemistry.
– Built from scripts (shell script) and programs (NAMD, VMD, Fortran).
– Phases: set up, simulation and analysis of trajectories.
– Inputs: protein structure, simulation parameters and force field files.
– Outputs: trajectories and analysis results.

21
Step 1: Generate Abstract Workflow
Script code.

22
Step 1
Manually annotate
Script code.
Annotated script code.

23
Step 1
Manually annotate
Create workflow-like view
Script code.
Abstract workflow.

24
Step 1
YesWorkflow (McPhillips et al., 2015)
- Annotates code blocks and their input/output
- Code comments
- Tags:
● @begin
● @end
● @desc
● @in
● @out
● ...
T. McPhillips et al., “YesWorkflow: A user-oriented, language-independent tool for recovering workflow information from scripts,” International Journal of Digital Curation, vol. 10, no. 1, pp. 298–313, 2015.
Create
Workflow-like
view
Abstract workflow.
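Because the YesWorkflow tags live in ordinary comments, the annotated script still runs unchanged, and the block structure can be recovered by scanning those comments. The following is a minimal sketch of that idea, not YesWorkflow itself: the script fragment, the port names and the `extract_blocks` helper are hypothetical, and only the `@begin`, `@end`, `@in` and `@out` tags are handled.

```python
import re

# Hypothetical annotated shell fragment; block and port names are illustrative.
ANNOTATED_SCRIPT = """\
# @begin setup @desc Prepare the structure for simulation
# @in initial_structure
# @out psf_file

# @begin split @desc Split the input structure into segments
# @in initial_structure
# @out segments
echo "splitting"
# @end split

# @begin psgen @desc Generate the protein structure file
# @in segments
# @out psf_file
echo "generating"
# @end psgen

# @end setup
"""

TAG = re.compile(r"@(begin|end|in|out)\s+(\S+)")


def extract_blocks(script_text):
    """Return {block: {"in": [...], "out": [...], "parent": name or None}}."""
    blocks, stack = {}, []
    for line in script_text.splitlines():
        if not line.lstrip().startswith("#"):
            continue  # tags are only read from comment lines
        for tag, name in TAG.findall(line):
            if tag == "begin":
                parent = stack[-1] if stack else None
                blocks[name] = {"in": [], "out": [], "parent": parent}
                stack.append(name)
            elif tag == "end":
                if not stack or stack[-1] != name:
                    raise ValueError(f"unbalanced @end {name}")
                stack.pop()
            elif stack:
                # @in / @out belong to the innermost open block
                blocks[stack[-1]][tag].append(name)
    if stack:
        raise ValueError(f"missing @end for {stack[-1]}")
    return blocks


blocks = extract_blocks(ANNOTATED_SCRIPT)
```

Each entry in `blocks` corresponds to one activity of the abstract workflow, and its `in` and `out` lists correspond to that activity's ports.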

25
Step 1
Create workflow-like view
Abstract workflow.

26
Step 2: Create an executable workflow
Abstract workflow.

27
Step 2
Create implementation of activities
Copy code blocks from the script.
Abstract workflow.
Executable workflow.
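Step 2 gives every activity of the abstract workflow an implementation taken from the matching code block of the script. The sketch below illustrates this with a toy, engine-agnostic runner, assuming activities are plain Python callables over named ports; the `Activity` class, the `execute` function and the two stand-in implementations are hypothetical and do not correspond to any particular workflow management system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Activity:
    name: str
    inputs: List[str]                      # input port names
    outputs: List[str]                     # output port names
    run: Callable[..., Dict[str, object]]  # returns {output_port: value}


def execute(activities, initial_data):
    """Run each activity once all of its input ports have data."""
    data = dict(initial_data)
    pending = list(activities)
    order = []
    while pending:
        ready = [a for a in pending if all(p in data for p in a.inputs)]
        if not ready:
            names = ", ".join(a.name for a in pending)
            raise RuntimeError("unsatisfied inputs for: " + names)
        for act in ready:
            result = act.run(**{p: data[p] for p in act.inputs})
            missing = set(act.outputs) - set(result)
            if missing:
                raise RuntimeError(f"{act.name} did not produce {sorted(missing)}")
            data.update(result)
            order.append(act.name)
            pending.remove(act)
    return data, order


# Hypothetical stand-ins for code blocks copied from the script.
def split(structure):
    return {"segments": structure.split(";")}


def psgen(segments):
    return {"psf": "PSF(" + ",".join(segments) + ")"}


workflow = [
    Activity("psgen", ["segments"], ["psf"], psgen),
    Activity("split", ["structure"], ["segments"], split),
]
data, order = execute(workflow, {"structure": "A;B"})
```

The execution order follows the data dependencies between ports, not the order in which the activities are listed.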

28
Step 2
Create implementation of activities
Copy code blocks from the script.
Abstract workflow.

29
Step 2
Create implementation of activities
Copy code blocks from the script.
Abstract workflow.
Script code.

30
Step 3: Refine executable workflow
Modify resources:
● Algorithms
● Data Sets
● Parallelization
● Web Services
● ...
New workflow version.

31
Step 3: Refine executable workflow
Create new version
Modify resources:
● Algorithms
● Data Sets
● Parallelization
● Web Services
● ...
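Refinement swaps the resource behind an activity while keeping its ports, and records the result as a new workflow version. A minimal sketch of that pattern follows, assuming a workflow version is just a mapping from activity names to implementations; `WorkflowVersion`, `refine`, and the two algorithms are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class WorkflowVersion:
    version: int
    implementations: Dict[str, Callable]  # activity name -> implementation
    derived_from: Optional[int] = None    # version this one was refined from


def refine(workflow, activity, new_impl):
    """Return a new version with one activity's implementation replaced."""
    if activity not in workflow.implementations:
        raise KeyError(f"unknown activity: {activity}")
    impls = dict(workflow.implementations)  # leave the old version untouched
    impls[activity] = new_impl
    return WorkflowVersion(workflow.version + 1, impls,
                           derived_from=workflow.version)


def algorithm_a(values):  # hypothetical original implementation
    return sum(values) / len(values)


def algorithm_b(values):  # hypothetical replacement with the same ports
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


v1 = WorkflowVersion(1, {"analysis": algorithm_a})
v2 = refine(v1, "analysis", algorithm_b)
```

Keeping `v1` intact and linking `v2` back to it is what later lets the provenance of the conversion be traced from the refined workflow to the original one.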

32
Steps 2 and 3
Record provenance data: execution traces.
[Figure: W3C PROV execution trace. The activities split and psgen, the entities Sample, Output 1 and Output 2, Lucas and the Workflow Run are linked by used, wasGeneratedBy, wasStartedAt (“2012-06-01”), wasAssociatedWith, wasEnactedBy and hasSpecification.]
W3C PROV
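An execution trace of this kind can be read as a set of subject, relation, object statements, which makes questions such as "what did this step depend on?" easy to answer. The sketch below uses the relation labels from the slide, stored as plain tuples. How the nodes are wired together is an assumption made for illustration, since the figure itself is not reproduced here, and the `lineage` helper is hypothetical.

```python
from collections import defaultdict

# Labels come from the slide; the wiring between nodes is assumed.
trace = [
    ("Output 1", "wasGeneratedBy", "split"),
    ("Output 2", "wasGeneratedBy", "split"),
    ("split", "used", "Sample"),
    ("psgen", "used", "Output 1"),
    ("Workflow Run", "wasAssociatedWith", "Lucas"),
    ("Workflow Run", "wasStartedAt", "2012-06-01"),
]


def lineage(statements, node):
    """List everything `node` depends on through used / wasGeneratedBy."""
    edges = defaultdict(list)
    for subj, rel, obj in statements:
        if rel in ("used", "wasGeneratedBy"):
            edges[subj].append(obj)
    seen, stack = [], [node]
    while stack:
        current = stack.pop()
        for parent in edges[current]:
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


upstream = lineage(trace, "psgen")
```

Under the assumed wiring, `upstream` contains Output 1, the split activity that generated it, and the Sample that split used.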

33
Steps 2 and 3
Record provenance data: conversion process.
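[Figure: W3C PROV graph of the conversion process, linking the script code and the workflows derived from it through wasDerivedFrom, and associating the conversion with the Curator through wasAssociatedWith.]
W3C PROV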

34
Step 4: Annotate and check quality
● Add annotations describing the workflow.
● Use provenance data to check the quality of the conversion process.
● Run checks to verify the soundness of the workflow.
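One concrete soundness check is to run the original script and the executable workflow on the same inputs and compare what they produce. The sketch below does this by hashing the output files of two run directories. The directory layout and the `compare_runs` helper are assumptions for illustration, and byte-for-byte equality may be too strict for numerical simulations, where a tolerance-based comparison could be more appropriate.

```python
import hashlib
from pathlib import Path


def digest(path):
    """SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def compare_runs(script_dir, workflow_dir):
    """Map each script output file to 'match', 'differs' or 'missing'."""
    script_dir, workflow_dir = Path(script_dir), Path(workflow_dir)
    report = {}
    for reference in sorted(p for p in script_dir.iterdir() if p.is_file()):
        candidate = workflow_dir / reference.name
        if not candidate.is_file():
            report[reference.name] = "missing"    # workflow produced no such file
        elif digest(reference) == digest(candidate):
            report[reference.name] = "match"
        else:
            report[reference.name] = "differs"
    return report
```

A report containing anything other than "match" points the curator to the activity that produced the file in question.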

35
Step 4
Script code.

36
Step 4
Workflow version.
Initial Executable workflow.

37
Step 4
● Common mistakes during the conversion:
– the main logical processing units in the script were not clearly identified;
– code was migrated incorrectly from the script into the corresponding activity;
– the correct input files and parameters were not provided;
– the workflow code itself contained errors.

38
Step 5: Bundle Resources into a Research Object
– Script
– Abstract workflow
– Concrete workflow(s)
– Annotations
– Paper
– Provenance
– Data
– Attributions
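Bundling amounts to placing all of these resources in one archive together with a manifest that lists what is aggregated and who is credited. The sketch below builds such an archive in memory. It loosely follows the Research Object Bundle layout of a ZIP file with a `.ro/manifest.json` entry, but the manifest keys, the file paths and the `build_bundle` helper are assumptions for illustration and should be checked against the specification before use.

```python
import io
import json
import zipfile


def build_bundle(resources, attributions):
    """Pack {path: bytes} resources and a manifest into a ZIP; return bytes."""
    manifest = {
        # Key names are illustrative, not a verified vocabulary.
        "attributions": [{"name": name} for name in attributions],
        "aggregates": [{"uri": "/" + path} for path in sorted(resources)],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path in sorted(resources):
            bundle.writestr(path, resources[path])
    return buffer.getvalue()


# Hypothetical resource paths and contents.
payload = build_bundle(
    {
        "script/run.sh": b"echo simulate",
        "workflow/abstract.txt": b"split -> psgen",
        "provenance/trace.txt": b"psgen used Output 1",
    },
    ["Curator"],
)
```

Writing the manifest next to the resources keeps the bundle self-describing, so a reader can discover the script, workflows and provenance without prior knowledge of the layout.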

39
Contributions
● A methodology that guides curators in a principled manner to transform scripts into reproducible and reusable WROs.
● This addresses an important issue in the area of script provenance.

40
Conclusions
● We addressed issues with respect to understanding, reuse and reproducibility of script-based experiments.
● The methodology was:
– elaborated based on requirements;
– showcased via a real-world use case from the field of Molecular Dynamics.
● We exploited tools and standards from the scientific
community:
– Scientific Workflows, YesWorkflow, Research Objects, the W3C
PROV recommendations and the Web Annotation Data Model.
● The bundle is available at http://w3id.org/w2share/s2rwro/

41
Next Steps
● Evaluation using other case studies;
● Evaluation of the cost-effectiveness of our methodology;
● Extension of YesWorkflow to support the
semantic annotation of blocks;
● Implementation of tools.

42
Acknowledgments
● FAPESP (grant # 2014/23861-4)
● CCES/CEPID (grant # 2013/08293-7)
– Center for Computational Engineering & Sciences
● LIS (Laboratory of Information Systems)
● Prof. Munir Skaf and his group from the Institute of Chemistry, Unicamp.
