Aussois bda-mdd-2018

Computational
Reproducibility: Workflows,
Provenance and Scripts
Khalid Belhajjame
PSL, LAMSADE, Université Paris-Dauphine
kbelhajj@gmail.com
https://www.slideshare.net/kbelhajj
BDA MDD 2018
1

Data-Oriented
Science
  Computing is transforming the practice of science. The so-called
“Fourth Paradigm of scientific research” [1] refers to the current
era, where scientists utilize computational tools and technologies
to manage, share, federate, analyze, visualize data to underpin
scientific findings.
  The objective of data-oriented science is is to create a richer
research ecosystem in which emphasis is given not only to the
build-up of scientific knowledge, but also to the build-up and
dissemination of other work-products of research such as data,
protocols, models and tools.
  Why is that?
[1] Tony Hey, Stewart Tansley, and Kristin M. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft
Research, 2009.BDA MDD 2018

Scholarly Articles Are Not Enough
  Scholarly articles remain the main trusted means for scientists to
communicate their findings
  However, they are noticeably insufficient to communicate all the
actual scientific knowledge behind the reported findings.
  There is a need for communicating and preserving other artifacts to
enable the understanding, verification, and reuse. In other words, ….
reproducibility
BDA MDD 2018

47 of 53
“landmark”
publications
could not be
replicated
Inadequate cell lines and
animal models
Nature, 483, 2012
basic studies on cancer are
unreliable, with grim
consequences for producing
new medicines in the future
BDA MDD 2018

The research result, obtained by Stapel and co-workers Roos Vonk (Radboud
University) and Marcel Zeelenberg (nl) (Tilburg University), showing that meat
eaters are more selfish than vegetarians, which was widely publicized in Dutch
media is suspected to be based on faked data.
BDA MDD 2018

Reproducibility is not just about finding
cheaters … it is above all a noble cause
BDA MDD 2018

Researchers in experimental biology use carefully lab
notebooks to document different aspects of their experiments.
  This is not the case for computational scientists who tend to
run their analysis with no clear record of the exact process they
followed or intermediary datasets (results) they used and
generated.
  It is therefore possible that numerous published results may be
unreliable or even completely invalid.
Culture of Reproducibility
BDA MDD 2018

Culture of Reproducibility
  Often, there is no record of the process (workflow) that
produced the published computational results in scholarly
communications.
  Even the code is missing, or underwent changes.
  It cannot be used to process the data referred to, (if we are
lucky).
BDA MDD 2018

Open and transparent Communication
“The reproducible research movement
recognizes that traditional scientific
research and publication practices now fall
short …, and encourages all those involved
in the production of computational
science ... to facilitate and practice really
reproducible research.”
V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to
reproducible: Reproducibility in computational and experimental mathematics.
We witnessed recently the emergence of a
number of methods and tools for enabling
reproducibility
BDA MDD 2018

Scope of this seminar
  We will focus on the reproducibility of scientific
workflows.
  These have been adopted in modern sciences, notably
life sciences and bio-diversity for encoding and enacting
scientific experiments
  We will look at what it means to reproduce a scientific
workflow, and draw a map of some solutions that have
been proposed in this direction
BDA MDD 2018

Agenda
  Scientific workflows
  Scientific Workflow Reproducibility
  Workflow Preservation Against Decay
  From Workflows to Scripts
BDA MDD 2018

Scientific workflow
• Workflow technology is increasingly
used for specifying and enacting scientific
experiments.
• A scientific workflow is a series of
analysis operations connected using data
links.
• Analysis operations can be supplied
locally or can be independently
developed web services.
BDA MDD 2018

Science with workflows
GWAS, Pharmacogenomics
Association study of
Nevirapine-induced skin rash
in Thai Population Trypanosomiasis (sleeping
sickness parasite) in
African Cattle
Astronomy &
HelioPhysics
Library Doc
Preservation
Systems Biology
of Micro-
Organisms
Observing Systems
Simulation Experiments
JPL, NASA
BioDiversity
Invasive Species
Modelling
[Credit Carole A. Goble]BDA MDD 2018

Workflows for systematic
resource use
• Access heterogeneous resources.
• Explicit, runnable, repeatable analytical
process.
• Explore parameter spaces.
• Sweep an analysis over datasets.
• Transparent and efficient analyses with
provenance collected from workflow
executions
Workflow
Provenance
Data
BDA MDD 2018

Demo: Example
showing the use of
Taverna
BDA MDD 2018

Reproducibility Terminology
  Reproducibility has been studied in science in larger
contexts than computational reproducibility, in
particular where wet experiments are involved.
  A plethora of terms are used including repeat,
replicate, reproduce, redo, rerun, recompute, reuse and
repurpose etc. to name a few.
We will focus on 4 Rs: Repeat, Replicate, Reproduce
and Reuse.
  For each of them, we will give the definition in wet-lab
contexts and propose a definition in a computational
setting.
BDA MDD 2018

Repeat
  A wet experiment is said to be repeated when the
experiment is performed in the same lab as the original
experiment, that is, on the same scientific environment.
  By analogy, an in silico experiment is said to be repeated
when it is performed in the same computational setting as
the original experiment.
  The major goal of the repeat task is to check whether the
initial experiment was correct and can be performed again.
  The difficulty lies in recording as much information as
possible to repeat the experiment so that the same
conclusion can be drawn.
BDA MDD 2018

Replicate
  A wet experiment is said to be replicated when the
experiment is performed in a different (wet) ”lab” than
the original experiment.
  By analogy, a replicated in silico experiment is
performed in a new setting and computational
environment, although similar to the original ones).
  When replicated, a result has a high level of robustness:
the result remains valid in a similar (even though
different) setting has been considered.
  A continuum of situations can be considered between
a repeated and replicated experiments.
BDA MDD 2018

Reproduce
  Reproduce is defined in the broadest possible sense of the
term and denotes the situation where an experiment is
performed within a different set-up but with the aim to
validate the same scientific hypothesis.
  In other words, what matters is the conclusion obtained and
not the methodology considered to reach it.
  Completely different approaches can be designed,
completely different data sets can be used, as long as both
experiments converge to the same scientific conclusion.
  A reproducible result is thus a high- quality result,
confirmed while obtained in various ways.
BDA MDD 2018

Reuse
  A very important concept related to reproducibility is
Reuse which denotes the case where a different
experiment is performed, with similarities with an
original experiment.
  A specific kind of reuse occurs when a single
experiment is reused in a new context (and thus
adapted to new needs), the experiment is then said to
be repurposed.
BDA MDD 2018

Repeat, Replicate, Reproduce and Reuse
  Reproduce and reuse are the most important scientific targets.
  However, before investigating alternative ways of obtaining a result
(to reach reproducibility) or before reusing a given methodology in
a new context (to reach reuse), the original experiment has to be
carefully tested (possibly by reviewers and/or any peers),
demonstrating its ability to be at least repeated and hopefully
replicated
  The database community lags well behind other computer
science communities, e.g., the Semantic Web community
  ISWC and ESWC encourages the authors to submit with the
paper auxiliary resources about the experiment they used as
well as the software/prototype they built if any.
BDA MDD 2018

Reproducibility
and Scientific Workflows
We now introduce definitions of reproducibility concepts in the
particular context of use of scientific workflow systems.
In our definition, we distinguish six components of an analysis
designed using a scientific workflow.
1.  S, the workflow specification, providing the analysis steps
associated with tools, chained in a given order,
2.  I, the input of the workflow used for its execution, that is, the
concrete data sets and parameter settings specified for any tools,
3.  E, the workflow context and runtime environment, that is, the
computational context of the execution (OS, libs, etc.).
Additionally, we consider R and C, the result of the analysis (typically
the final data sets) and the high level conclusion that can be reached
from this analysis, respectively.
BDA MDD 2018

Repeatability of a Scientific Workflow
  Given two analyses A and A’ performed using scientific
workflows, we say that A’ repeats A if and only if A and
A’ are identical on all their components.
Replicability of a Scientific Workflow
workflows, we say that A’ replicates A if and only if they
reach the same conclusion while their specification and
input components are similar and other components may
differ (in particular no condition is set on the run-time
environment).
  Terms such as rerun, re-compute typically consider situations
where the workflow specification is unchanged.
BDA MDD 2018

Reproducibility of a Scientific Workflow
workflows, we say that A’ reproduces A if and only if
they reach the same conclusion. No condition is set on
any other components of the analysis.
Reuse of a Scientific Workflow
workflows, we say that A’ reuses A if and only if the
specification or input of A’ is part of the specification
or input of A’.
  No other condition is set, especially the conclusion to
reach may be different.
BDA MDD 2018

Reproducibility of a Scientific Workflow
Given two analyses A and A’ performed using scientific
workflows, we say that A’ reproduces A if and only if
they reach the same conclusion. No condition is set on
any other components of the analysis.
Reuse of a Scientific Workflow
workflows, we say that A’ reuses A if and only if the
specification or input of A’ is part of the specification
or input of A’.
  No other condition is set, especially the conclusion to
reach may be different.
• Paul writes workflows for identifying
biological pathways implicated in
resistance to Trypanosomiasis in cattle
• Paul meets Jo who is investigating
Whipworm in mouse.
• Jo reuses one of Paul’s workflow without
change.
• Jo identifies the biological pathways
involved in sex dependence in the mouse
model, believed to be involved in the
ability of mice to expel the parasite.
• Previously a manual two year study by Jo
had failed to do this.
Computational Workflows
Carole Goble
Reuse can be impressive when it works
…but is generally hard to achieve
Real-Life Example Of Reuse
BDA MDD 2018

Which level of reproducibility
are we at?
  Repeatability and Replicability L
  Even these two are hard to achieve most of the time.
  Needless to speak about reuse at this point. There are
few use cases that show the potential of workflow reuse,
but we are still at the stage of use cases.
  Solutions for enabling scientific workflow repeatability
and replication has mainly focused on their
preservation against decay
BDA MDD 2018

Workflow Preservation
  Public repositories such as myExperiment and
CrowdLabs have been used by scientists to publish
workflow specification and share them over the web.
  The availability of workflow specification is however
not sufficient for enabling their repeatability and
replicability.
  Indeed, an empirical study that we conducted showed
that the majority of workflow suffers from decay.
BDA MDD 2018

Understanding The Causes of
Workflow Decay
  We adopted an empirical approach
  To identify the causes of workflow decay
  To quantify their severity
  To do so, we analyzed a sample of real
workflows to determine if they suffer from
decay and the reasons that caused their decay
BDA MDD 2018

Experimental Setup
  Taverna workflows from
myExperiment.org
  Taverna 1
  Taverna 2
  Selection process
  By the creation year
  By the creator
  By the domain
  Software environment
  Taverna 2.3
  Experiment metadata
  4 researchers
BDA MDD 2018

Analyzed Workflows
Number of Taverna 1 workflows from 2007 to 2011
2007 2008 2009 2010 2011
Tested 11 10 10 10 4*
Total 74 341 101 26 13
Number of Taverna 2 workflows from 2009 to 2012
2009 2010 2011 2012
Tested 12 10 15 9
Total 97 308 289 184
BDA MDD 2018

Profile of Analyzed Workflows
BDA MDD 2018

The Proportion of Decay
  75% of the 92 tested
workflows failed to be
either executed or
produce the same result (if
testable)
  Those from earlier years
(2007-2009) had 91%
failure rate
Taverna 1
Taverna 2
BDA MDD 2018

The Cause of Decay
  Manual analysis
  By the validation report from Taverna workbench
  By interpreting experiment results reported by Taverna
  Identified 4 categories of causes
  Missing example data
  Missing execution environment
  Insufficient descriptions about workflows
  Volatile third-party Resources
BDA MDD 2018

Decay Caused by Third-Party Resources
Causes Refined Causes Examples
Third party resources
are not available
Underlying dataset, particularly those
locally hosted in-house dataset, is no
longer available
Researcher hosting the data changed
institution, server is no longer available
Services are deprecated DDBJ web services are not longer
provided despite the fact that they are
used in many myExperiment
workflows
are available but not
accessible
Data is available but identified using
different IDs that the one known to the
user
Due to scalability reasons the input
data is superseded by new one making
the workflow not executable or
providing wrong results
Data is available but permission,
certificate, or network to access it is
needed
Cannot get the input, which is a
security token that can only be
obtained by a registered user of
ChemiSpider
Services are available but need
permission, certificate, or network to
access and invoke them
The security policies of the execution
framework are updated due to new
hosting institution rules
have changed
Services are still available by using the
same identifiers but their functionality
have changed
The web services are updated
BDA MDD 2018

Summary of Decay Causes
  50% of the decay was caused by
volatility of 3rd-party resource
  Unavailable
  Inaccessible
  Updated
  Missing example data
  Unable to re-run
  Missing execution environment
  Such as local plugins
  Insufficient metadata
  Such as any required
dependency libraries or
permission information
BDA MDD 2018

Combating Workflow Decay
BDA MDD 2018

Combating Workflow Decay
  Objective: Provide enough information to
  Prevent decay
  Detect decay
  Repair decay
  Approach: Research Objects + Checklists
  Research Object: Aggregate workflow specifications together
with auxiliary elements, such as example data inputs,
annotations, provenance traces that can ne used to prevent
decay and/or repair the workflow in case of decay.
  Checklists: to check that sufficient information is preserved
along with workflows
BDA MDD 2018

!"##$%&'()*+,!*)
!"#$%&'()*#)+',()-#)&*.,#/#)&
0(,1!(2*3#/$,'.&'()*.,#/#)&
').%&*"4#/*.,#/#)&
.5,5-#&#,*+54%#/*3#")#3
,#6%',#3*/#,+'$#/*5$$#//'74#
Checklists
• Checklists are a well- established tool
for guiding practices to ensure safety,
quality and consistency in the conduct
of complex operations.
• They have been adopted by the
biological research community to
promote consistency across research
datasets
• In our case, we use checklists to
assess if a research object contains
sufficient information for running the
workflow and checking that its results
are replicable.
BDA MDD 2018

Cheklisting the Reproducibility
of a Workflow
!"#"$%&'()&
"*'#+'%"&
$,"$-#./%
0.(.1&
)"/$2.3%.4(
5"/"'2$,&
678"$%
9*'#+'%.4(&
2"342%
:"7
;+234/"
The Minim model used in our approach is an adaptation of the MiM model [1].
[1] Matthew Gamble, Jun Zaho, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and Framework
for Scientific Linked Data. eScience 2012
BDA MDD 2018

Use Case
  4 myExperiment packs
  2 from genomics, 1 from geography, and 1 domain-neutral
  Experiment process:
  Transform them into RO
  Create checklist descriptions
  Observations
  2 research objects did not contains example inputs, the other
2 failed because of update to third party resources and
environment of execution.
BDA MDD 2018

Lessons Learnt
1.  Dependency is the root enemy of reproducible
workflows
2.  Documentation, i.e., annotation, is vital
3.  Documentation should be easy to create
BDA MDD 2018

Benefits Of Research Objects
  A research object aggregates all elements that are
necessary to understand research investigations.
  Methods (experiments) are viewed as first class citizens
  Promote reuse
  Enable the verification of reproducibility of the results
BDA MDD 2018

Research Obejects Specifications and Tooling can be found at
http://www.researchobject.org/specifications/
BDA MDD 2018

Research Object Model: Overview
The model specification can be found at http://wf4ever.github.com/ro/
And the primer at http://wf4ever.github.com/ro-primer/
BDA MDD 2018

Workflow Template and Workflow Run
BDA MDD 2018

Grounding Workflow-centric Research
Objects Using Semantic Technologies
  Workflow-centric research objects are encoded using RDF, according to a set of
ontologies that are publicly available
  Research objects use the Object Exchange and Reuse (ORE) model, to represent
aggregation.
ORE
BDA MDD 2018

We use the Annotation Ontology (AO), to annotate research object
resources and their relationships.
Grounding Workflow-centric Research
Objects Using Semantic Technologies
BDA MDD 2018

57
Scientist
Live RO Live RO
RO snapshot
<<copy>>
Identified by a URI
Some metadata
Some curation
Mostly private (for my group)
RO snapshot
<<copy>>
Identified by a URI
Some metadata
Some curation
Mostly private (for my group
and for paper reviewers)
Librarian/Curator
Scientist
My supervisor calls
me to report my work
My supervisor calls
me again and we
decide to publish our
RO+paper
<<versionOf>>
Archived RO
<<copy, filter
and curate>>
Identified by a URI
Good metadata
and curation
Mostly public
Reviews
received and
final version
published
<<versionOf>>
A new PhD
student
continues my
work
<<copy>>

Using Research Objects for the Preservation
of Workflows/Experiments
Case study: investigating the epigenetic mechanisms
involved in Huntington’s disease (HD). It is the most
commonly inherited neurodegenerative disorder in
Europe, that affects 1 out of 10 000 people.
The scientist in this use case were convinced to use
Research Object as a model for packaging their
investigation
BDA MDD 2018

Preserving Scientific Wokflows
when they have not been
packaged into research objects
  … Which is the case of most of workflows.
  And even if they are packaged into research objects,
scientific workflows can still suffer from decay.
BDA MDD 2018

Scientific Workflow
Preservation
  Issue: As we have seen from the results of the empirical study
we presented earlier, workflow preservation is frequently
hampered by the volatility of the web services implementing
the analysis operations that constitute workflows.
  Objective: to provide a means for scientists to repair
workflows by identifying service operations that can play the
same role as the unavailable ones.
BDA MDD 2018

Outline
✔  Context: Preservation of Scientific Workflows
■  Discovering Substitute Services Using Semantic
Annotation of Web Services
■  Discovering Substitute Services Using Existing
Workflow Specifications and Provenance traces
■  Conclusions
BDA MDD 2018

Ontologies Used For Annotating
Web Services
 Task ontology: captures information about the action carried
out by service operations within a domain of interest, e.g.,
Sequence_alignment and Protein_identification
 Domain ontology: captures information about the application
domains covered by operation parameters, e.g., Protein_record and
DNA_sequence
BDA MDD 2018

Task Replaceability
Task replaceability: For an operation op2 to be able to substitute
an operation op1, op2 must fulfil a task that is equivalent to or
subsumes the task op1 performs:
BDA MDD 2018

Parameter compatibility
Parameter replaceability: To be compatible the domain of the
output must be the same as or subconcept of the domain of the
subsequent input.
BDA MDD 2018

Limitations
While the method just presented is sound, its practical applicability is
hindered by the following facts
§  Semantic annotations of web services are scarce.
§  Our experience suggests that a large proportion of existing semantic
annotations suffer from inaccuracies
§  As a result, a substitute that is discovered for replacing an unavailable
operation using such annotations may turn out to be unsuitable, and,
inversely, a suitable substitute may be discarded.
BDA MDD 2018

Discovering Substitute Services Using Existing Workflow
Specifications and Provenance traces
Existing Workflow
Specifications
Provenance traces of missing
operations
BDA MDD 2018

Parameter Compatibility
Formally, let wf1 be a workflow in which the operation op1 is unavailable.
The operation op2 can replace the operation op1 in terms of its inputs and
outputs if:
BDA MDD 2018

Task Compatibility
  In addition to the compatibility in terms of inputs and outputs, we have to
check that the candidate substitute performs a task compatible with that of the
unavailable operation.
  To perform this test, we exploit the following observation. An operation op2 is
able to replace the operation op1 in terms of task, if for every possible input
instances that op1 is able to consume, op2 delivers the same output as that
obtained by invoking op1.
  To perform the above test, however, we will have to call the missing operation
op1!
  A solution that we adopt for overcoming the above problem makes use of
workflow provenance logs. These are traces that contain intermediate data that
were used as input and delivered as output by the constituent operations of a
workflow when enacted.
BDA MDD 2018

Task Compatibility (cont.)
§  An operation op2 may be compatible in terms of task with op1
if:
op2 delivers the same results that op1 delivered in past
executions, that are logged within provenance logs, when fed
using the same input values.
§  Notice that we say may be compatible. This is because we may
not be able to compare the outputs obtained for every possible
input value of the operation op1.
BDA MDD 2018

Relaxing Substitutability
Conditions
  The condition that we have described for checking the
suitability of an operation as a substitute for another one may
be stronger than is required in practice.
  There are various parameter representations that are adopted
in bioinformatics.
  Because of representation mismatch, a service operation that
performs a task similar to the missing operation may be found
to be unsuitable.
BDA MDD 2018

Example of values delivered by two operations using the same
input value
Value1
Value2
CosSym(value1,value2) = 0.007
BDA MDD 2018

Relaxing Substitutability
Conditions
To overcome this problem, we use a two step process when
comparing the values of parameters:
1.  Given a parameter value, we derive its representation.
2.  If the representation is associated with a key attribute
(identifier), extract the value of such an attribute
If two parameter values are associated with identifiers, then they
are compared by comparing their identifiers.
BDA MDD 2018

Example of values delivered by two operations using the same
input value
Value1
Value2
Fasta Format
Uniprot Format
BDA MDD 2018

Data Examples for Characterizing
Scientific Operations
  We have conducted an empirical evaluation to assess
the effectiveness of the method described.
  The issue that we faced is the ability to have examples
that characterize the missing operation, and that can be
used for comparison with available modules.
  This motivated a proposal that we have worked on for
characterizing analysis operations using data examples.
BDA MDD 2018

Data Example
Describes >
BDA MDD 2018

Generating Data Examples
  Data examples can be used as a means to
describe the behavior of analysis operations.
  Enumerating all possible data examples that
can be used to describe a given operation may
be expensive, and may contain redundant data
examples that describe the same behavior.
  Issue: which data examples should be used to characterize the
functionality of a given operation?
  Solution: We have showed how software testing techniques can
be adapted to the problem of generating data examples without
relying on the availability of the operation specification, which
often is not accessible.
Trick: Use domain ontologies for
partitioning the space of possible values
BDA MDD 2018

From Workflow to Scripts…
and then Back
  Scientific Workflows have proved their utility, and they
are used in practice by scientist
  However, the majority of scientists utilize scripting
languages to specify and enact their data analysis.
  In order to promote their reproducibility, we have seen
a number of proposals in recent years that seek to bring
some advantages that characterize workflows to scripts.
  We will see some of them in what follows.
BDA MDD 2018

Meanwhile, on a nearby planet …
Interactive Visualization
R and Python and the Winners
BDA MDD 2018

Why Bother?
  Workflow provides key features to enable reproducibility
that scripts lack
  Modularity
  This features lack in scripts in general
  Workflow can be repurposed in a straightforward manner
by customizing the resources and the dependencies
  Scalability: some workflow systems can handle large
amounts of data
  Provenance: Most workflow systems are instrumented to
capture provenance information about workflow execution
YesWorkflow to the rescue
BDA MDD 2018

Science Example: Paleoclimate ReconstrucRon
BDA MDD 2018

YesWorkflow = Script + Comments
  Scripts can be hard to digest,
communicate
  Idea:
  Add structured comments (cf. JavaDoc) =>
reveal workflow structure and dataflow
  => obtain some scientific workflow benefits
BDA MDD 2018

YesWorkflow Generates Three
Views from the Script
BDA MDD 2018

User Comments: YesWorkflow Annotations
BDA MDD 2018

Paleoclimate+ReconstrucRon+…+++
B."Ludäscher"""""""""""""""""""""""""""""""""""""""""""""""""YesWorkflow:"Workflow"Views"from"Scripts."IDCC'15,"London"" 9"
GetModernClimate
PRISM_annual_growing_season_precipitation
SubsetAllData
dendro_series_for_calibration
dendro_series_for_reconstruction CAR_Analysis_unique
cellwise_unique_selected_linear_models
CAR_Analysis_union
cellwise_union_selected_linear_models
CAR_Reconstruction_union
raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors
CAR_Reconstruction_union_output
ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif
master_data_directory prism_directory
tree_ring_datacalibration_years retrodiction_years
•  …"explained"using"YesWorkflow+
Kyle"B.,"(computa9onal)"archeologist:""
"It!took!me!about!20!minutes!to!comment.!Less!
than!an!hour!to!learn!and!YWAannotate,!allAtold."!
BDA MDD 2018

YesWorkflow Architecture
  • YW-Extract
  – ... structured comments
  YW-Model
  Program Block, Workflow
  Port (data, parameters) – Channels
(dataflow)
  YW-Graph
  using GraphViz/DOT files
BDA MDD 2018

What About Provenance?
  There are some solutions that allow capturing the
provenance of a script.
  Use (R, Python, ..) libraries and/or code
instrumentation to capture runtime observables
  file read/write, function calls, program variables & state,
…
  noWorkflow system
  [Murta-Braganholo-Chiriga=-Koop-Freire-IPAW14]
  exploit Python profiling library to capture run=me
provenance
Can be messy as they capture every operating system event/call!
BDA MDD 2018

Actually, We Can Construct the Provenance Without
Recording it in the First Place!
YW+annota)ons:(Model(your(Workﬂow!(
YesWorkﬂow(Provenance(@(TaPP'15( 17(
BDA MDD 2018

and You have the Provenance for Freerun/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
│ ├── image_001.raw
... ...
│ └── image_030.raw
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt

YWDRECON:+Prospec=ve(&(Retrospec)ve(
Provenance(…((almost)(for(free!((
YesWorkflow(Provenance(@(TaPP'15( 23(
cassette_id
sample_score_cutoff
sample_spreadsheet
file:cassette_{cassette_id}_spreadsheet.csv
calibration_image
file:calibration.img
initialize_run
run_log
file:run/run_log.txt
load_screening_results
sample_namesample_quality
calculate_strategy
rejected_sample accepted_sample num_images energies
log_rejected_sample
rejection_log
file:/run/rejected_samples.txt
collect_data_set
sample_id energy frame_number
raw_image
file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
transform_images
corrected_image
file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensitypixel_count corrected_image_path
log_average_image_intensity
collection_log
file:run/collected_images.csv
•  URIDtemplates+link(conceptual(en==es(
to(run)me+provenance+“le|(behind”(by(
the(script(author(…((
•  …(facilita=ng(provenance(reconstruc=on(BDA MDD 2018

Back to Workflow Land
Converting Scripts into Reproducible Workflow
Research Objects
95
BDA MDD 2018

38
Step
Bundle Resources into a Research Object
5
Script Abstract
workow
Concrete
workow(s)
Annotations
Paper
Provenance
Data
Attributions
BDA MDD 2018

Conclusions
  Research in enabling reproducibility has seen a real push in recent
year, with some great initiatives, software products and data
repositories
Figshare, Dataverse, OpenAir, DataONE, RDA
  Workflows and Scripts are no exception, and there have been
some good proposals from a handful of researchers as well as
practitioners.
  MADICS Workfing Group on Reproducibility.
  We are just scratching the surface and there are numerous issues
that still need to be addressed.
  workflow/scripts similarities, comparison of scientific results,
incremental re-computation, to cite a few are still open topics.
BDA MDD 2018

Acknowledgement
  Pinar Alper,
  Lucas Augusto
Carvalho
  Shawn Bowers
  Sarah Cohen Boulakia
  Alban Gaignard
  Daniel Garijo
  Carole Goble,
  Bertram Ludascher
  Timothy McPhilips
  Claudia Medeiros
  Paolo Missier
Stian Soiland-Reyes
BDA MDD 2018

References
  Pinar Alper, Khalid Belhajjame, Carole A. Goble: Static analysis of Taverna workflows to predict provenance patterns.
Future Generation Comp. Syst. 75: 310-329 (2017)
  Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina M. Hettne, Raúl Palma, Eleni Mina, Óscar
Corcho, José Manuél Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole A. Goble: Using a suite of ontologies for
preserving workflow-centric research objects. J. Web Sem. 32: 16-42 (2015)
  Khalid Belhajjame, Carole A. Goble, Stian Soiland-Reyes, David De Roure: Fostering Scientific Workflow Preservation
through Discovery of Substitute Services. eScience 2011: 97-104
  Sarah Cohen Boulakia, Khalid Belhajjame, Olivier Collin, Jérôme Chopard, Christine Froidevaux, Alban Gaignard,
Konrad Hinsen, Pierre Larmande, Yvan Le Bras, Frédéric Lemoine, Fabien Mareuil, Hervé Ménager, Christophe Pradal,
Christophe Blanchet: Scientific workflows for computational reproducibility in the life sciences: Status, challenges and
opportunities. Future Generation Comp. Syst. 75: 284-298 (2017)
  Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing Cross-workflow Provenance.
SeWeBMeDA@ESWC 2017: 50-64
  Lucas Augusto Montalvão Costa Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros: Converting scripts into
reproducible workflow research objects. eScience 2016: 71-80
  Timothy M. McPhillips, Shawn Bowers, Khalid Belhajjame, Bertram Ludäscher: Retrospective Provenance Without a
Runtime Provenance Recorder. TaPP 2015
  Timothy M. McPhillips, Tianhong Song, Tyler Kolisnik, Steve Aulenbach, Khalid Belhajjame, Kyle Bocinsky, Yang Cao,
Fernando Chirigati, Saumen C. Dey, Juliana Freire, Deborah N. Huntzinger, Christopher Jones, David Koop, Paolo
Missier, Mark Schildhauer, Christopher R. Schwalm, Yaxing Wei, James Cheney, Mark Bieda, Bertram Ludäscher:
YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. CoRR
abs/1502.02403 (2015)
BDA MDD 2018

Computational
Reproducibility: Workflows,
Provenance and Scripts
Khalid Belhajjame
PSL, LAMSADE, Université Paris-Dauphine
kbelhajj@gmail.com
https://www.slideshare.net/kbelhajj
BDA MDD 2018
101

Aussois bda-mdd-2018

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Aussois bda-mdd-2018

Similar to Aussois bda-mdd-2018 (20)

More from Khalid Belhajjame

More from Khalid Belhajjame (20)

Recently uploaded

Recently uploaded (20)

Aussois bda-mdd-2018