Reproducibility: Workflows,
Provenance and Scripts
Khalid Belhajjame
PSL, LAMSADE, Université Paris-Dauphine
BDA MDD 2018
  Computing is transforming the practice of science. The so-called
“Fourth Paradigm of scientific research” [1] refers to the current
era, where scientists utilize computational tools and technologies
to manage, share, federate, analyze, visualize data to underpin
scientific findings.
  The objective of data-oriented science is to create a richer
research ecosystem in which emphasis is given not only to the
build-up of scientific knowledge, but also to the build-up and
dissemination of other work-products of research such as data,
protocols, models and tools.
  Why is that?
[1] Tony Hey, Stewart Tansley, and Kristin M. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft
Research, 2009.
Scholarly Articles Are Not Enough
  Scholarly articles remain the main trusted means for scientists to
communicate their findings
  However, they are noticeably insufficient to communicate all the
actual scientific knowledge behind the reported findings.
  There is a need for communicating and preserving other artifacts to
enable the understanding, verification, and reuse. In other words, ….
47 of 53
could not be
Inadequate cell lines and
animal models
Nature, 483, 2012
Basic studies on cancer are unreliable, with grim consequences for producing new medicines in the future
The research result obtained by Stapel and co-workers Roos Vonk (Radboud
University) and Marcel Zeelenberg (Tilburg University), showing that meat
eaters are more selfish than vegetarians, which was widely publicized in Dutch
media, is suspected to be based on faked data.
Reproducibility is not just about finding
cheaters … it is above all a noble cause
  Researchers in experimental biology carefully use lab
notebooks to document different aspects of their experiments.
  This is not the case for computational scientists who tend to
run their analysis with no clear record of the exact process they
followed or intermediary datasets (results) they used and
  It is therefore possible that numerous published results may be
unreliable or even completely invalid.
Culture of Reproducibility
Culture of Reproducibility
  Often, there is no record of the process (workflow) that
produced the published computational results in scholarly
  Even the code is missing, or underwent changes.
  It cannot be used to process the data referred to, (if we are
Open and transparent Communication
“The reproducible research movement
recognizes that traditional scientific
research and publication practices now fall
short …, and encourages all those involved
in the production of computational
science ... to facilitate and practice really
reproducible research.”
V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to
reproducible: Reproducibility in computational and experimental mathematics.
We witnessed recently the emergence of a
number of methods and tools for enabling
Scope of this seminar
  We will focus on the reproducibility of scientific
  These have been adopted in modern sciences, notably
life sciences and bio-diversity for encoding and enacting
scientific experiments
  We will look at what it means to reproduce a scientific
workflow, and draw a map of some solutions that have
been proposed in this direction
  Scientific workflows
  Scientific Workflow Reproducibility
  Workflow Preservation Against Decay
  From Workflows to Scripts
Scientific workflow
• Workflow technology is increasingly
used for specifying and enacting scientific
• A scientific workflow is a series of
analysis operations connected using data
• Analysis operations can be supplied
locally or can be independently
developed web services.
Science with workflows
GWAS, Pharmacogenomics
Association study of Nevirapine-induced skin rash in Thai Population
Trypanosomiasis (sleeping sickness parasite) in African Cattle
Astronomy &
Library Doc
Systems Biology of Micro-
Observing Systems Simulation Experiments
Invasive Species
[Credit: Carole A. Goble]
Workflows for systematic
resource use
• Access heterogeneous resources.
• Explicit, runnable, repeatable analytical
• Explore parameter spaces.
• Sweep an analysis over datasets.
• Transparent and efficient analyses with
provenance collected from workflow
Workflow Systems
Demo: Example
showing the use of
  Scientific workflows
  Scientific Workflow Reproducibility
  Workflow Preservation Against Decay
  From Workflows to Scripts
Reproducibility Terminology
  Reproducibility has been studied in science in larger
contexts than computational reproducibility, in
particular where wet experiments are involved.
  A plethora of terms is used, including repeat,
replicate, reproduce, redo, rerun, recompute, reuse and
repurpose, to name a few.
We will focus on four Rs: Repeat, Replicate, Reproduce
and Reuse.
  For each of them, we will give the definition in wet-lab
contexts and propose a definition in a computational
  A wet experiment is said to be repeated when the
experiment is performed in the same lab as the original
experiment, that is, in the same scientific environment.
  By analogy, an in silico experiment is said to be repeated
when it is performed in the same computational setting as
the original experiment.
  The major goal of the repeat task is to check whether the
initial experiment was correct and can be performed again.
  The difficulty lies in recording as much information as
possible to repeat the experiment so that the same
conclusion can be drawn.
  A wet experiment is said to be replicated when the
experiment is performed in a different (wet) "lab" than
the original experiment.
  By analogy, a replicated in silico experiment is
performed in a new setting and computational
environment (although similar to the original ones).
  When replicated, a result has a high level of robustness:
the result remains valid when a similar (even though
different) setting is considered.
  A continuum of situations can be considered between
repeated and replicated experiments.
  Reproduce is defined in the broadest possible sense of the
term and denotes the situation where an experiment is
performed within a different set-up but with the aim to
validate the same scientific hypothesis.
  In other words, what matters is the conclusion obtained and
not the methodology considered to reach it.
  Completely different approaches can be designed,
completely different data sets can be used, as long as both
experiments converge to the same scientific conclusion.
  A reproducible result is thus a high-quality result,
confirmed while obtained in various ways.
  A very important concept related to reproducibility is
Reuse which denotes the case where a different
experiment is performed, with similarities with an
original experiment.
  A specific kind of reuse occurs when a single
experiment is reused in a new context (and thus
adapted to new needs), the experiment is then said to
be repurposed.
Repeat, Replicate, Reproduce and Reuse
  Reproduce and reuse are the most important scientific targets.
  However, before investigating alternative ways of obtaining a result
(to reach reproducibility) or before reusing a given methodology in
a new context (to reach reuse), the original experiment has to be
carefully tested (possibly by reviewers and/or any peers),
demonstrating its ability to be at least repeated and hopefully
  The database community lags well behind other computer
science communities, e.g., the Semantic Web community
  ISWC and ESWC encourage authors to submit, along with the
paper, auxiliary resources about the experiments they ran, as
well as any software or prototype they built.
and Scientific Workflows
We now introduce definitions of reproducibility concepts in the
particular context of use of scientific workflow systems.
In our definition, we distinguish six components of an analysis
designed using a scientific workflow.
1.  S, the workflow specification, providing the analysis steps
associated with tools, chained in a given order,
2.  I, the input of the workflow used for its execution, that is, the
concrete data sets and parameter settings specified for any tools,
3.  E, the workflow context and runtime environment, that is, the
computational context of the execution (OS, libs, etc.).
Additionally, we consider R and C, the result of the analysis (typically
the final data sets) and the high level conclusion that can be reached
from this analysis, respectively.
Repeatability of a Scientific Workflow
  Given two analyses A and A’ performed using scientific
workflows, we say that A’ repeats A if and only if A and
A’ are identical on all their components.
Replicability of a Scientific Workflow
  Given two analyses A and A’ performed using scientific
workflows, we say that A’ replicates A if and only if they
reach the same conclusion while their specification and
input components are similar and other components may
differ (in particular no condition is set on the run-time environment).
  Terms such as rerun, re-compute typically consider situations
where the workflow specification is unchanged.
Reproducibility of a Scientific Workflow
  Given two analyses A and A’ performed using scientific
workflows, we say that A’ reproduces A if and only if
they reach the same conclusion. No condition is set on
any other components of the analysis.
Reuse of a Scientific Workflow
  Given two analyses A and A’ performed using scientific
workflows, we say that A’ reuses A if and only if the
specification or input of A is part of the specification
or input of A’.
  No other condition is set; in particular, the conclusion to
reach may be different.
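The four definitions can be sketched as predicates over the components S, I, E, R and C. This is only an illustrative reading of the slides: the `Analysis` class, the set-based encoding of specifications and inputs, and the caller-supplied `similar` function are assumptions, since the slides leave "similar" and "part of" informal.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Analysis:
    spec: FrozenSet[str]    # S: workflow steps, simplified to a set of step names
    inputs: FrozenSet[str]  # I: data sets and parameter settings
    env: str                # E: runtime environment (OS, libs, ...)
    result: str             # R: result of the analysis
    conclusion: str         # C: high-level conclusion


def repeats(a_new: Analysis, a_orig: Analysis) -> bool:
    # Identical on all components.
    return a_new == a_orig


def replicates(a_new: Analysis, a_orig: Analysis,
               similar: Callable[[FrozenSet[str], FrozenSet[str]], bool]) -> bool:
    # Same conclusion, similar specification and input; nothing required of E or R.
    return (a_new.conclusion == a_orig.conclusion
            and similar(a_new.spec, a_orig.spec)
            and similar(a_new.inputs, a_orig.inputs))


def reproduces(a_new: Analysis, a_orig: Analysis) -> bool:
    # Only the conclusion has to agree.
    return a_new.conclusion == a_orig.conclusion


def reuses(a_new: Analysis, a_orig: Analysis) -> bool:
    # The original specification or input is contained in the new one.
    spec_reused = bool(a_orig.spec) and a_orig.spec <= a_new.spec
    input_reused = bool(a_orig.inputs) and a_orig.inputs <= a_new.inputs
    return spec_reused or input_reused
```

Under this encoding, repeating implies replicating (for any reflexive `similar`), and replicating implies reproducing, which matches the ordering of the terms on the slides.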
• Paul writes workflows for identifying
biological pathways implicated in
resistance to Trypanosomiasis in cattle
• Paul meets Jo who is investigating
Whipworm in mouse.
• Jo reuses one of Paul’s workflows without
• Jo identifies the biological pathways
involved in sex dependence in the mouse
model, believed to be involved in the
ability of mice to expel the parasite.
• Previously a manual two-year study by Jo
had failed to do this.
Computational Workflows
Carole Goble
Reuse can be impressive when it works
…but is generally hard to achieve
Real-Life Example Of Reuse
Which level of reproducibility
are we at?
  Repeatability and Replicability ☹
  Even these two are hard to achieve most of the time.
  Not to mention reuse at this point. There are a
few use cases that show the potential of workflow reuse,
but we are still at the stage of use cases.
  Solutions for enabling scientific workflow repeatability
and replication have mainly focused on their
preservation against decay
  Scientific workflows
  Scientific Workflow Reproducibility
  Workflow Preservation Against Decay
  From Workflows to Scripts
Workflow Preservation
  Public repositories such as myExperiment and
CrowdLabs have been used by scientists to publish
workflow specifications and share them over the web.
  The availability of workflow specification is however
not sufficient for enabling their repeatability and
  Indeed, an empirical study that we conducted showed
that the majority of workflows suffer from decay.
Understanding The Causes of
Workflow Decay
  We adopted an empirical approach
  To identify the causes of workflow decay
  To quantify their severity
  To do so, we analyzed a sample of real
workflows to determine if they suffer from
decay and the reasons that caused their decay
Experimental Setup
  Taverna workflows from
  Taverna 1
  Taverna 2
  Selection process
  By the creation year
  By the creator
  By the domain
  Software environment
  Taverna 2.3
  Experiment metadata
  4 researchers
Analyzed Workflows
Number of Taverna 1 workflows from 2007 to 2011
         2007  2008  2009  2010  2011
Tested     11    10    10    10    4*
Total      74   341   101    26    13
Number of Taverna 2 workflows from 2009 to 2012
         2009  2010  2011  2012
Tested     12    10    15     9
Total      97   308   289   184
Profile of Analyzed Workflows
The Proportion of Decay
  75% of the 92 tested
workflows failed to be
either executed or
produce the same result (if
  Those from earlier years
(2007-2009) had 91%
failure rate
Taverna 1
Taverna 2
The Cause of Decay
  Manual analysis
  By the validation report from Taverna workbench
  By interpreting experiment results reported by Taverna
  Identified 4 categories of causes
  Missing example data
  Missing execution environment
  Insufficient descriptions about workflows
  Volatile third-party Resources
Decay Caused by Third-Party Resources
Cause: Third-party resources are not available
  Refined cause: Underlying dataset, particularly a locally hosted in-house dataset, is no longer available
    Example: Researcher hosting the data changed institution, server is no longer available
  Refined cause: Services are deprecated
    Example: DDBJ web services are no longer provided despite the fact that they are used in many myExperiment
Cause: Third-party resources are available but not
  Refined cause: Data is available but identified using different IDs than the one known to the
    Example: Due to scalability reasons the input data is superseded by new one, making the workflow not executable or providing wrong results
  Refined cause: Data is available but permission, certificate, or network to access it is
    Example: Cannot get the input, which is a security token that can only be obtained by a registered user of
  Refined cause: Services are available but need permission, certificate, or network to access and invoke them
    Example: The security policies of the execution framework are updated due to new hosting institution rules
Cause: Third-party resources have changed
  Refined cause: Services are still available by using the same identifiers but their functionality has changed
    Example: The web services are updated
Summary of Decay Causes
  50% of the decay was caused by
volatility of 3rd-party resource
  Missing example data
  Unable to re-run
  Missing execution environment
  Such as local plugins
  Insufficient metadata
  Such as any required
dependency libraries or
permission information
Combating Workflow Decay
  Objective: Provide enough information to
  Prevent decay
  Detect decay
  Repair decay
  Approach: Research Objects + Checklists
  Research Object: Aggregate workflow specifications together
with auxiliary elements, such as example data inputs,
annotations, provenance traces that can be used to prevent
decay and/or repair the workflow in case of decay.
  Checklists: to check that sufficient information is preserved
along with workflows
• Checklists are a well-established tool
for guiding practices to ensure safety,
quality and consistency in the conduct
of complex operations.
• They have been adopted by the
biological research community to
promote consistency across research
• In our case, we use checklists to
assess if a research object contains
sufficient information for running the
workflow and checking that its results
are replicable.
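A checklist of this kind can be sketched as a list of requirements evaluated against the contents of a research object. The sketch below is a simplification and does not use the Minim vocabulary itself: the dictionary layout of the research object, the item descriptions and the MUST/SHOULD split are assumptions made for illustration.

```python
from typing import Callable, Dict, List, NamedTuple


class Item(NamedTuple):
    level: str                     # "MUST" or "SHOULD" (hypothetical requirement levels)
    description: str
    check: Callable[[dict], bool]  # returns True when the requirement is met


# Hypothetical checklist for "the workflow can be run and its results checked".
CHECKLIST: List[Item] = [
    Item("MUST", "has workflow specification", lambda ro: bool(ro.get("workflow"))),
    Item("MUST", "has example inputs", lambda ro: bool(ro.get("example_inputs"))),
    Item("SHOULD", "has provenance traces", lambda ro: bool(ro.get("provenance"))),
    Item("SHOULD", "has annotations", lambda ro: bool(ro.get("annotations"))),
]


def evaluate(ro: dict, checklist: List[Item]) -> Dict[str, object]:
    """Report which requirements the research object fails to meet."""
    missing_must = [i.description for i in checklist
                    if i.level == "MUST" and not i.check(ro)]
    missing_should = [i.description for i in checklist
                      if i.level == "SHOULD" and not i.check(ro)]
    return {
        "satisfied": not missing_must,  # only MUST items block satisfaction
        "missing_must": missing_must,
        "missing_should": missing_should,
    }
```

A research object with a workflow file but no example inputs would be reported as not satisfied, with "has example inputs" listed under `missing_must`.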
Checklisting the Reproducibility
of a Workflow
The Minim model used in our approach is an adaptation of the MIM model [1].
[1] Matthew Gamble, Jun Zhao, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and Framework
for Scientific Linked Data. eScience 2012
Use Case
  4 myExperiment packs
  2 from genomics, 1 from geography, and 1 domain-neutral
  Experiment process:
  Transform them into RO
  Create checklist descriptions
  2 research objects did not contain example inputs; the other
2 failed because of updates to third-party resources and the
execution environment.
Lessons Learnt
1.  Dependency is the root enemy of reproducible
2.  Documentation, i.e., annotation, is vital
3.  Documentation should be easy to create
Research Objects
Benefits Of Research Objects
  A research object aggregates all elements that are
necessary to understand research investigations.
  Methods (experiments) are viewed as first class citizens
  Promote reuse
  Enable the verification of reproducibility of the results
Research Objects Specifications and Tooling can be found at
Research Object Model: Overview
The model specification can be found at
And the primer at
Workflow Template and Workflow Run
Grounding Workflow-centric Research
Objects Using Semantic Technologies
  Workflow-centric research objects are encoded using RDF, according to a set of
ontologies that are publicly available
  Research objects use the Object Reuse and Exchange (ORE) model, to represent
  We use the Annotation Ontology (AO), to annotate research object
resources and their relationships.
Grounding Workflow-centric Research
Objects Using Semantic Technologies
[Figure: Live RO, RO snapshots and Archived RO]
Live RO
RO snapshot: identified by a URI; some metadata; some curation; mostly private (for my group)
RO snapshot: identified by a URI; some metadata; some curation; mostly private (for my group and for paper reviewers)
Archived RO (<<copy, filter and curate>>): identified by a URI; good metadata and curation; mostly public
Figure annotations: "My supervisor calls me to report my work"; "My supervisor calls me again and we decide to publish our"; "received and final version"; "A new PhD continues my"
Using Research Objects for the Preservation
of Workflows/Experiments
Case study: investigating the epigenetic mechanisms
involved in Huntington’s disease (HD). It is the most
commonly inherited neurodegenerative disorder in
Europe, that affects 1 out of 10 000 people.
The scientists in this use case were convinced to use
Research Object as a model for packaging their
Preserving Scientific Workflows
when they have not been
packaged into research objects
  … Which is the case of most workflows.
  And even if they are packaged into research objects,
scientific workflows can still suffer from decay.
Scientific Workflow
  Issue: As we have seen from the results of the empirical study
we presented earlier, workflow preservation is frequently
hampered by the volatility of the web services implementing
the analysis operations that constitute workflows.
  Objective: to provide a means for scientists to repair
workflows by identifying service operations that can play the
same role as the unavailable ones.
✔  Context: Preservation of Scientific Workflows
■  Discovering Substitute Services Using Semantic
Annotation of Web Services
■  Discovering Substitute Services Using Existing
Workflow Specifications and Provenance traces
■  Conclusions
Ontologies Used For Annotating
Web Services
 Task ontology: captures information about the action carried
out by service operations within a domain of interest, e.g.,
Sequence_alignment and Protein_identification
 Domain ontology: captures information about the application
domains covered by operation parameters, e.g., Protein_record and
Task Replaceability
Task replaceability: For an operation op2 to be able to substitute
an operation op1, op2 must fulfil a task that is equivalent to or
subsumes the task op1 performs:
Parameter compatibility
Parameter replaceability: To be compatible the domain of the
output must be the same as or subconcept of the domain of the
subsequent input.
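Both replaceability conditions reduce to a subsumption test over an ontology. A minimal sketch, assuming each ontology is a single-inheritance hierarchy stored as a child-to-parent dictionary: `Sequence_alignment` and `Protein_record` come from the slides, while the other concept names and all function names are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical fragments of a task ontology and a domain ontology (child -> parent).
PARENT: Dict[str, str] = {
    "Pairwise_alignment": "Sequence_alignment",
    "Sequence_alignment": "Sequence_analysis",
    "Uniprot_record": "Protein_record",
    "Protein_record": "Sequence_record",
}


def subsumes(general: str, specific: str, parent: Dict[str, str] = PARENT) -> bool:
    """True if `general` equals `specific` or is one of its ancestors."""
    current: Optional[str] = specific
    while current is not None:
        if current == general:
            return True
        current = parent.get(current)
    return False


def task_replaceable(task_op1: str, task_op2: str) -> bool:
    # op2 can substitute op1 if op2's task is equivalent to, or subsumes, op1's task.
    return subsumes(task_op2, task_op1)


def parameter_compatible(output_domain: str, input_domain: str) -> bool:
    # The output domain must be the same as, or a subconcept of, the next input's domain.
    return subsumes(input_domain, output_domain)
```

For example, an operation annotated with `Sequence_alignment` could stand in for one annotated with `Pairwise_alignment`, but not the other way round.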
While the method just presented is sound, its practical applicability is
hindered by the following facts
§  Semantic annotations of web services are scarce.
§  Our experience suggests that a large proportion of existing semantic
annotations suffer from inaccuracies
§  As a result, a substitute that is discovered for replacing an unavailable
operation using such annotations may turn out to be unsuitable, and,
inversely, a suitable substitute may be discarded.
Discovering Substitute Services Using Existing Workflow
Specifications and Provenance traces
Existing Workflow
Provenance traces of missing
Parameter Compatibility
Formally, let wf1 be a workflow in which the operation op1 is unavailable.
The operation op2 can replace the operation op1 in terms of its inputs and
outputs if:
Task Compatibility
  In addition to the compatibility in terms of inputs and outputs, we have to
check that the candidate substitute performs a task compatible with that of the
unavailable operation.
  To perform this test, we exploit the following observation. An operation op2 is
able to replace the operation op1 in terms of task, if for every possible input
instances that op1 is able to consume, op2 delivers the same output as that
obtained by invoking op1.
  To perform the above test, however, we will have to call the missing operation
  A solution that we adopt for overcoming the above problem makes use of
workflow provenance logs. These are traces that contain intermediate data that
were used as input and delivered as output by the constituent operations of a
workflow when enacted.
Task Compatibility (cont.)
§  An operation op2 may be compatible in terms of task with op1 if
op2 delivers the same results that op1 delivered in past
executions, as logged within provenance logs, when fed
the same input values.
§  Notice that we say may be compatible. This is because we may
not be able to compare the outputs obtained for every possible
input value of the operation op1.
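The provenance-based test can be sketched as replaying the logged inputs of the missing operation against a candidate and comparing outputs. The representation of the log as a list of (input, output) pairs and the function names are assumptions; the reverse-complement operation is only a stand-in for an unavailable service.

```python
from typing import Any, Callable, Iterable, Tuple


def may_replace(candidate: Callable[[Any], Any],
                provenance_log: Iterable[Tuple[Any, Any]]) -> bool:
    """True if the candidate reproduces every logged (input, output) pair.

    A True result only means the candidate *may* be compatible: the log
    covers past executions, not every possible input.
    """
    tested = 0
    for logged_input, logged_output in provenance_log:
        try:
            if candidate(logged_input) != logged_output:
                return False
        except Exception:
            # A candidate that cannot consume a logged input is not a substitute.
            return False
        tested += 1
    return tested > 0  # an empty log gives no evidence either way


# Hypothetical log left behind by a missing "reverse complement" service.
LOG = [("ACGT", "ACGT"), ("AAGC", "GCTT")]


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def reverse_only(seq: str) -> str:
    return seq[::-1]
```

Here `may_replace(reverse_complement, LOG)` holds while `may_replace(reverse_only, LOG)` does not, since the second candidate disagrees with the first logged output.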
Relaxing Substitutability
  The condition that we have described for checking the
suitability of an operation as a substitute for another one may
be stronger than is required in practice.
  There are various parameter representations that are adopted
in bioinformatics.
  Because of representation mismatch, a service operation that
performs a task similar to the missing operation may be found
to be unsuitable.
Example of values delivered by two operations using the same
input value
CosSym(value1,value2) = 0.007
Relaxing Substitutability
To overcome this problem, we use a two step process when
comparing the values of parameters:
1.  Given a parameter value, we derive its representation.
2.  If the representation is associated with a key attribute
(identifier), extract the value of such an attribute
If two parameter values are associated with identifiers, then they
are compared by comparing their identifiers.
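The two-step comparison can be sketched as follows. The format detection and the regular expressions are simplified assumptions covering only a UniProt-style FASTA header and the `AC` line of a UniProt flat-file entry; the function names are hypothetical.

```python
import re
from typing import Optional

# Simplified patterns: accession in a FASTA header such as ">sp|P68871|HBB_HUMAN ..."
# and in a flat-file line such as "AC   P68871; ...".
FASTA_ID = re.compile(r"^>(?:sp|tr)\|([A-Z0-9]+)\|", re.MULTILINE)
UNIPROT_ID = re.compile(r"^AC\s+([A-Z0-9]+);", re.MULTILINE)


def derive_representation(value: str) -> str:
    """Step 1: guess the representation of a parameter value."""
    text = value.strip()
    if text.startswith(">"):
        return "fasta"
    if re.search(r"^ID\s+", text, re.MULTILINE):
        return "uniprot"
    return "unknown"


def extract_identifier(value: str) -> Optional[str]:
    """Step 2: extract the key attribute, if the representation has one."""
    pattern = {"fasta": FASTA_ID, "uniprot": UNIPROT_ID}.get(derive_representation(value))
    if pattern is None:
        return None
    match = pattern.search(value.strip())
    return match.group(1) if match else None


def same_value(v1: str, v2: str) -> bool:
    """Compare by identifier when both values carry one, else compare literally."""
    id1, id2 = extract_identifier(v1), extract_identifier(v2)
    if id1 is not None and id2 is not None:
        return id1 == id2
    return v1 == v2
```

With this relaxation, a FASTA record and a flat-file record describing the same accession are treated as equal even though their text differs almost entirely.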
Example of values delivered by two operations using the same
input value
Fasta Format
Uniprot Format
Data Examples for Characterizing
Scientific Operations
  We have conducted an empirical evaluation to assess
the effectiveness of the method described.
  The issue that we faced is the ability to have examples
that characterize the missing operation, and that can be
used for comparison with available modules.
  This motivated a proposal that we have worked on for
characterizing analysis operations using data examples.
Data Example
Describes >
Generating Data Examples
  Data examples can be used as a means to
describe the behavior of analysis operations.
  Enumerating all possible data examples that
can be used to describe a given operation may
be expensive, and may contain redundant data
examples that describe the same behavior.
  Issue: which data examples should be used to characterize the
functionality of a given operation?
  Solution: We have shown how software testing techniques can
be adapted to the problem of generating data examples without
relying on the availability of the operation specification, which
often is not accessible.
Trick: Use domain ontologies for
partitioning the space of possible values
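The partitioning trick resembles equivalence partitioning in software testing: group candidate input values by the ontology concept they instantiate, then keep one data example per group. A minimal sketch under that reading, where the concept names, the pool of values and the `transcribe` operation are hypothetical:

```python
from typing import Any, Callable, Dict, List


def generate_data_examples(operation: Callable[[Any], Any],
                           values_by_concept: Dict[str, List[Any]]) -> List[dict]:
    """Build one data example (input, output) per ontology concept.

    Values under the same concept are assumed to exercise the same behavior,
    so only the first value of each partition is invoked.
    """
    examples = []
    for concept, values in values_by_concept.items():
        if not values:
            continue  # no representative available for this partition
        representative = values[0]
        examples.append({
            "concept": concept,
            "input": representative,
            "output": operation(representative),
        })
    return examples


# Hypothetical operation and pool of annotated values.
def transcribe(sequence: str) -> str:
    return sequence.replace("T", "U")


POOL = {
    "DNA_sequence": ["ACGT", "TTTT"],
    "RNA_sequence": ["ACGU"],
    "Protein_sequence": ["MVHL"],
}
```

Calling `generate_data_examples(transcribe, POOL)` invokes the operation three times instead of four, and the saving grows with the size of each partition.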
  Scientific workflows
  Scientific Workflow Reproducibility
  Workflow Preservation Against Decay
  From Workflows to Scripts
From Workflow to Scripts…
and then Back
  Scientific workflows have proved their utility, and they
are used in practice by scientists
  However, the majority of scientists utilize scripting
languages to specify and enact their data analysis.
  In order to promote their reproducibility, we have seen
a number of proposals in recent years that seek to bring
some advantages that characterize workflows to scripts.
  We will see some of them in what follows.
Meanwhile, on a nearby planet …
Interactive Visualization
R and Python are the Winners
Why Bother?
  Workflows provide key features to enable reproducibility
that scripts generally lack
  Workflow can be repurposed in a straightforward manner
by customizing the resources and the dependencies
  Scalability: some workflow systems can handle large
amounts of data
  Provenance: Most workflow systems are instrumented to
capture provenance information about workflow execution
YesWorkflow to the rescue
Science Example: Paleoclimate Reconstruction
YesWorkflow = Script + Comments
  Scripts can be hard to digest,
  Add structured comments (cf. JavaDoc) =>
reveal workflow structure and dataflow
  => obtain some scientific workflow benefits
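The idea of recovering workflow structure from structured comments can be sketched in a few lines. The `@begin`, `@end`, `@in` and `@out` tags are YesWorkflow annotation keywords; the sample script, the regular expression and the two functions below are a toy illustration, not the YesWorkflow implementation.

```python
import re
from typing import Dict, List, Tuple

# A toy annotated script, kept as text: it is parsed, never executed.
SCRIPT = '''
# @begin clean_data
# @in raw_file
# @out clean_table
table = load(raw_file)
# @end clean_data

# @begin fit_model
# @in clean_table
# @out model
model = fit(table)
# @end fit_model
'''

ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\w+)")


def extract_blocks(script: str) -> Dict[str, Dict[str, List[str]]]:
    """Collect the input and output ports declared inside each @begin/@end block."""
    blocks: Dict[str, Dict[str, List[str]]] = {}
    stack: List[str] = []
    for line in script.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            blocks[name] = {"in": [], "out": []}
            stack.append(name)
        elif tag == "end":
            if stack:
                stack.pop()
        elif stack:
            blocks[stack[-1]][tag].append(name)  # port belongs to innermost open block
    return blocks


def dataflow_edges(blocks: Dict[str, Dict[str, List[str]]]) -> List[Tuple[str, str, str]]:
    """Connect a block's output to every block that declares it as input."""
    return sorted((producer, data, consumer)
                  for producer, p in blocks.items()
                  for data in p["out"]
                  for consumer, c in blocks.items()
                  if data in c["in"])
```

For the sample script, the only edge is `clean_data` to `fit_model` through `clean_table`, which is the kind of dataflow view the comments are meant to reveal.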
YesWorkflow Generates Three
Views from the Script
User Comments: YesWorkflow Annotations
B. Ludäscher. YesWorkflow: Workflow Views from Scripts. IDCC'15, London
[Figure labels: dendro_series_for_reconstruction, CAR_Analysis_unique, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years]
•  … explained using YesWorkflow
YesWorkflow Architecture
  • YW-Extract
  – ... structured comments
  Program Block, Workflow
  Port (data, parameters) – Channels
  using GraphViz/DOT files
What About Provenance?
  There are some solutions that allow capturing the
provenance of a script.
  Use (R, Python, ..) libraries and/or code
instrumentation to capture runtime observables
  file read/write, function calls, program variables & state,
  noWorkflow system
  exploit Python profiling library to capture runtime
Can be messy as they capture every operating system event/call!
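Capturing runtime observables through profiling hooks can be illustrated with the standard `sys.setprofile` function. This is a toy sketch of the general technique, not how noWorkflow is implemented: it records only the names of Python functions that are called, and the helper and example functions are hypothetical.

```python
import sys
from typing import Any, Callable, List, Tuple


def trace_calls(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, List[str]]:
    """Run `func` and return its result plus the names of Python functions called."""
    calls: List[str] = []

    def profiler(frame, event, arg):
        # "call" fires when a Python function is entered; C calls are ignored here.
        if event == "call":
            calls.append(frame.f_code.co_name)

    previous = sys.getprofile()
    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(previous)  # always restore the previous hook
    return result, calls


def square(x: int) -> int:
    return x * x


def sum_of_squares(values: List[int]) -> int:
    total = 0
    for value in values:
        total += square(value)
    return total
```

Even this tiny example records one event per function call, which hints at why traces captured at this level get large and noisy for real scripts.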
Actually, We Can Construct the Provenance Without
Recording it in the First Place!
YesWorkflow Provenance @ TaPP'15
and You Have the Provenance for Free
run/
├──  raw  
│      └──  q55  
│              ├──  DRT240  
│              │      ├──  e10000  
│              │      │      ├──  image_001.raw  
...          ...  ...  ...  
│              │      │      └──  image_037.raw  
│              │      └──  e11000  
│              │              ├──  image_001.raw  
...          ...          ...  
│              │              └──  image_037.raw  
│              └──  DRT322  
│                      ├──  e10000  
│                      │      ├──  image_001.raw  
...                  ...  ...  
│                      │      └──  image_030.raw  
│                      └──  e11000  
│                              ├──  image_001.raw  
...                          ...  
│                              └──  image_030.raw  
├──  data  
│      ├──  DRT240  
│      │      ├──  DRT240_10000eV_001.img  
...  ...  ...  
│      │      └──  DRT240_11000eV_037.img  
│      └──  DRT322  
│              ├──  DRT322_10000eV_001.img  
...          ...  
│              └──  DRT322_11000eV_030.img  
├──  collected_images.csv  
├──  rejected_samples.txt  
└──  run_log.txt  
[Figure labels: rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path]
•  URI-templates link conceptual entities
•  … facilitating provenance reconstruction
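Reconstructing provenance from what a run leaves on disk can be sketched by matching file paths against a URI template. The template string is an assumption modelled on the file names in the directory listing (for example `data/DRT240/DRT240_10000eV_001.img`); the variable names `sample_id`, `energy` and `frame_number` appear on the slide, and the functions are hypothetical.

```python
import re
from typing import Dict, List

# Assumed template for the corrected images written under data/.
TEMPLATE = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"


def template_to_regex(template: str) -> "re.Pattern[str]":
    """Turn {variable} placeholders into named groups; repeats become backreferences."""
    seen = set()
    pattern = ""
    position = 0
    for match in re.finditer(r"\{(\w+)\}", template):
        pattern += re.escape(template[position:match.start()])
        name = match.group(1)
        if name in seen:
            pattern += f"(?P={name})"          # must equal the earlier occurrence
        else:
            pattern += f"(?P<{name}>[^/_]+)"   # value without '/' or '_'
            seen.add(name)
        position = match.end()
    pattern += re.escape(template[position:])
    return re.compile("^" + pattern + "$")


def reconstruct(paths: List[str], template: str = TEMPLATE) -> List[Dict[str, str]]:
    """Return the variable bindings for every path that matches the template."""
    regex = template_to_regex(template)
    records = []
    for path in paths:
        match = regex.match(path)
        if match:
            records.append(match.groupdict())
    return records
```

Applied to the listing, each matching image yields its sample, energy and frame number without any recorder having run during execution.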
Back to Workflow Land
Converting Scripts into Reproducible Workflow
Research Objects
Bundle Resources into a Research Object
Script Abstract
  Research in enabling reproducibility has seen a real push in recent
years, with some great initiatives, software products and data
Figshare, Dataverse, OpenAIRE, DataONE, RDA
  Workflows and Scripts are no exception, and there have been
some good proposals from a handful of researchers as well as
  MADICS Working Group on Reproducibility.
  We are just scratching the surface and there are numerous issues
that still need to be addressed.
  workflow/scripts similarities, comparison of scientific results,
incremental re-computation, to cite a few are still open topics.
  Pinar Alper
  Lucas Augusto
  Shawn Bowers
  Sarah Cohen Boulakia
  Alban Gaignard
  Daniel Garijo
  Carole Goble
  Bertram Ludäscher
  Timothy McPhillips
  Claudia Medeiros
  Paolo Missier
Stian Soiland-Reyes
  Pinar Alper, Khalid Belhajjame, Carole A. Goble: Static analysis of Taverna workflows to predict provenance patterns.
Future Generation Comp. Syst. 75: 310-329 (2017)
  Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina M. Hettne, Raúl Palma, Eleni Mina, Óscar
Corcho, José Manuél Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole A. Goble: Using a suite of ontologies for
preserving workflow-centric research objects. J. Web Sem. 32: 16-42 (2015)
  Khalid Belhajjame, Carole A. Goble, Stian Soiland-Reyes, David De Roure: Fostering Scientific Workflow Preservation
through Discovery of Substitute Services. eScience 2011: 97-104
  Sarah Cohen Boulakia, Khalid Belhajjame, Olivier Collin, Jérôme Chopard, Christine Froidevaux, Alban Gaignard,
Konrad Hinsen, Pierre Larmande, Yvan Le Bras, Frédéric Lemoine, Fabien Mareuil, Hervé Ménager, Christophe Pradal,
Christophe Blanchet: Scientific workflows for computational reproducibility in the life sciences: Status, challenges and
opportunities. Future Generation Comp. Syst. 75: 284-298 (2017)
  Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing Cross-workflow Provenance.
SeWeBMeDA@ESWC 2017: 50-64
  Lucas Augusto Montalvão Costa Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros: Converting scripts into
reproducible workflow research objects. eScience 2016: 71-80
  Timothy M. McPhillips, Shawn Bowers, Khalid Belhajjame, Bertram Ludäscher: Retrospective Provenance Without a
Runtime Provenance Recorder. TaPP 2015
  Timothy M. McPhillips, Tianhong Song, Tyler Kolisnik, Steve Aulenbach, Khalid Belhajjame, Kyle Bocinsky, Yang Cao,
Fernando Chirigati, Saumen C. Dey, Juliana Freire, Deborah N. Huntzinger, Christopher Jones, David Koop, Paolo
Missier, Mark Schildhauer, Christopher R. Schwalm, Yaxing Wei, James Cheney, Mark Bieda, Bertram Ludäscher:
YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. CoRR
abs/1502.02403 (2015)

Aussois bda-mdd-2018

  • 1. Computational Reproducibility: Workflows, Provenance and Scripts Khalid Belhajjame PSL, LAMSADE, Université Paris-Dauphine BDA MDD 2018 1
  • 2. Data-Oriented Science   Computing is transforming the practice of science. The so-called “Fourth Paradigm of scientific research” [1] refers to the current era, where scientists utilize computational tools and technologies to manage, share, federate, analyze, visualize data to underpin scientific findings.   The objective of data-oriented science is is to create a richer research ecosystem in which emphasis is given not only to the build-up of scientific knowledge, but also to the build-up and dissemination of other work-products of research such as data, protocols, models and tools.   Why is that? [1] Tony Hey, Stewart Tansley, and Kristin M. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.BDA MDD 2018
  • 3. Scholarly Articles Are Not Enough   Scholarly articles remain the main trusted means for scientists to communicate their findings   However, they are noticeably insufficient to communicate all the actual scientific knowledge behind the reported findings.   There is a need for communicating and preserving other artifacts to enable the understanding, verification, and reuse. In other words, …. reproducibility BDA MDD 2018
  • 4. 47 of 53 “landmark” publications could not be replicated. Inadequate cell lines and animal models (Nature, 483, 2012). Basic studies on cancer are unreliable, with grim consequences for producing new medicines in the future.
  • 5. The research result obtained by Stapel and co-workers Roos Vonk (Radboud University) and Marcel Zeelenberg (Tilburg University), showing that meat eaters are more selfish than vegetarians and widely publicized in Dutch media, is suspected to be based on faked data.
  • 6. Reproducibility is not just about finding cheaters … it is above all a noble cause BDA MDD 2018
  • 7. Culture of Reproducibility   Researchers in experimental biology carefully use lab notebooks to document different aspects of their experiments.   This is not the case for computational scientists, who tend to run their analyses with no clear record of the exact process they followed or of the intermediary datasets (results) they used and generated.   It is therefore possible that numerous published results are unreliable or even completely invalid.
  • 8. Culture of Reproducibility   Often, there is no record of the process (workflow) that produced the published computational results in scholarly communications.   Even the code is missing, or underwent changes.   It cannot be used to process the data referred to, (if we are lucky). BDA MDD 2018
  • 9. Open and transparent Communication “The reproducible research movement recognizes that traditional scientific research and publication practices now fall short …, and encourages all those involved in the production of computational science ... to facilitate and practice really reproducible research.” V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to reproducible: Reproducibility in computational and experimental mathematics. We have recently witnessed the emergence of a number of methods and tools for enabling reproducibility.
  • 10. Scope of this seminar   We will focus on the reproducibility of scientific workflows.   These have been adopted in modern sciences, notably life sciences and bio-diversity for encoding and enacting scientific experiments   We will look at what it means to reproduce a scientific workflow, and draw a map of some solutions that have been proposed in this direction BDA MDD 2018
  • 11. Agenda   Scientific workflows   Scientific Workflow Reproducibility   Workflow Preservation Against Decay   From Workflows to Scripts BDA MDD 2018
  • 12. Scientific workflow • Workflow technology is increasingly used for specifying and enacting scientific experiments. • A scientific workflow is a series of analysis operations connected using data links. • Analysis operations can be supplied locally or can be independently developed web services. BDA MDD 2018
  • 13. Science with workflows   GWAS, Pharmacogenomics: association study of Nevirapine-induced skin rash in Thai population   Trypanosomiasis (sleeping sickness parasite) in African cattle   Astronomy & HelioPhysics   Library Doc Preservation   Systems Biology of Micro-Organisms   Observing Systems Simulation Experiments (JPL, NASA)   BioDiversity: invasive species modelling   [Credit Carole A. Goble]
  • 14. Workflows for systematic resource use • Access heterogeneous resources. • Explicit, runnable, repeatable analytical process. • Explore parameter spaces. • Sweep an analysis over datasets. • Transparent and efficient analyses with provenance collected from workflow executions Workflow Provenance Data BDA MDD 2018
  • 16. Demo: Example showing the use of Taverna BDA MDD 2018
  • 17. Agenda   Scientific workflows   Scientific Workflow Reproducibility   Workflow Preservation Against Decay   From Workflows to Scripts BDA MDD 2018
  • 18. Reproducibility Terminology   Reproducibility has been studied in science in larger contexts than computational reproducibility, in particular where wet experiments are involved.   A plethora of terms are used including repeat, replicate, reproduce, redo, rerun, recompute, reuse and repurpose etc. to name a few. We will focus on 4 Rs: Repeat, Replicate, Reproduce and Reuse.   For each of them, we will give the definition in wet-lab contexts and propose a definition in a computational setting. BDA MDD 2018
  • 20. Repeat   A wet experiment is said to be repeated when the experiment is performed in the same lab as the original experiment, that is, on the same scientific environment.   By analogy, an in silico experiment is said to be repeated when it is performed in the same computational setting as the original experiment.   The major goal of the repeat task is to check whether the initial experiment was correct and can be performed again.   The difficulty lies in recording as much information as possible to repeat the experiment so that the same conclusion can be drawn. BDA MDD 2018
  • 21. Replicate   A wet experiment is said to be replicated when the experiment is performed in a different (wet) “lab” than the original experiment.   By analogy, a replicated in silico experiment is performed in a new setting and computational environment (although similar to the original ones).   When replicated, a result has a high level of robustness: the result remains valid when a similar (even though different) setting is considered.   A continuum of situations can be considered between repeated and replicated experiments.
  • 22. Reproduce   Reproduce is defined in the broadest possible sense of the term and denotes the situation where an experiment is performed within a different set-up but with the aim of validating the same scientific hypothesis.   In other words, what matters is the conclusion obtained and not the methodology used to reach it.   Completely different approaches can be designed and completely different data sets can be used, as long as both experiments converge to the same scientific conclusion.   A reproducible result is thus a high-quality result, confirmed while obtained in various ways.
  • 23. Reuse   A very important concept related to reproducibility is Reuse, which denotes the case where a different experiment is performed that has similarities with an original experiment.   A specific kind of reuse occurs when a single experiment is reused in a new context (and thus adapted to new needs). The experiment is then said to be repurposed.
  • 24. Repeat, Replicate, Reproduce and Reuse   Reproduce and reuse are the most important scientific targets.   However, before investigating alternative ways of obtaining a result (to reach reproducibility) or before reusing a given methodology in a new context (to reach reuse), the original experiment has to be carefully tested (possibly by reviewers and/or any peers), demonstrating its ability to be at least repeated and hopefully replicated.   The database community lags well behind other computer science communities, e.g., the Semantic Web community.   ISWC and ESWC encourage authors to submit, along with the paper, auxiliary resources about the experiment they ran as well as the software/prototype they built, if any.
  • 25. Reproducibility and Scientific Workflows We now introduce definitions of reproducibility concepts in the particular context of use of scientific workflow systems. In our definitions, we distinguish the following components of an analysis designed using a scientific workflow. 1.  S, the workflow specification, providing the analysis steps associated with tools, chained in a given order, 2.  I, the input of the workflow used for its execution, that is, the concrete data sets and parameter settings specified for any tools, 3.  E, the workflow context and runtime environment, that is, the computational context of the execution (OS, libs, etc.). Additionally, we consider R and C, the result of the analysis (typically the final data sets) and the high-level conclusion that can be reached from this analysis, respectively.
  • 26. Repeatability of a Scientific Workflow   Given two analyses A and A’ performed using scientific workflows, we say that A’ repeats A if and only if A and A’ are identical on all their components. Replicability of a Scientific Workflow   Given two analyses A and A’ performed using scientific workflows, we say that A’ replicates A if and only if they reach the same conclusion while their specification and input components are similar and other components may differ (in particular no condition is set on the run-time environment).   Terms such as rerun, re-compute typically consider situations where the workflow specification is unchanged. BDA MDD 2018
  • 27. Reproducibility of a Scientific Workflow   Given two analyses A and A’ performed using scientific workflows, we say that A’ reproduces A if and only if they reach the same conclusion. No condition is set on any other components of the analysis. Reuse of a Scientific Workflow   Given two analyses A and A’ performed using scientific workflows, we say that A’ reuses A if and only if the specification or input of A is part of the specification or input of A’.   No other condition is set; in particular, the conclusion to reach may be different.
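The four definitions can be read as predicates over the components S, I, E, R and C introduced earlier. The following is a minimal sketch, not the authors' formalization: the `Analysis` class, the set-based encoding of S and I, and the Jaccard threshold used to approximate "similar" are all assumptions made for illustration (the slides do not define similarity).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Analysis:
    spec: frozenset      # S: workflow steps (hypothetical set encoding)
    inputs: frozenset    # I: data sets and parameter settings
    env: str             # E: runtime environment
    result: str          # R: result of the analysis
    conclusion: str      # C: high-level conclusion


def similar(x, y, threshold=0.5):
    """Assumed similarity test: Jaccard overlap of two sets."""
    union = x | y
    return bool(union) and len(x & y) / len(union) >= threshold


def repeats(a, a2):
    # A' repeats A iff they are identical on all components.
    return a == a2


def replicates(a, a2):
    # Same conclusion, similar specification and input, no condition on E.
    return (a.conclusion == a2.conclusion
            and similar(a.spec, a2.spec)
            and similar(a.inputs, a2.inputs))


def reproduces(a, a2):
    # Only the conclusion has to match.
    return a.conclusion == a2.conclusion


def reuses(a, a2):
    # The specification or input of A is part of that of A'.
    return a.spec <= a2.spec or a.inputs <= a2.inputs


original = Analysis(frozenset({"align", "filter"}), frozenset({"cattle.fa"}),
                    "linux", "pathways.csv", "pathway P is implicated")
other_env = replace(original, env="macos")
extended = Analysis(frozenset({"align", "filter", "plot"}),
                    frozenset({"mouse.fa"}), "linux", "pathways2.csv",
                    "pathway Q is implicated")
```

Under these assumptions, `other_env` replicates and reproduces `original` without repeating it, while `extended` reuses `original` even though it reaches a different conclusion.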
  • 28. Real-Life Example Of Reuse • Paul writes workflows for identifying biological pathways implicated in resistance to Trypanosomiasis in cattle. • Paul meets Jo, who is investigating Whipworm in mouse. • Jo reuses one of Paul’s workflows without change. • Jo identifies the biological pathways involved in sex dependence in the mouse model, believed to be involved in the ability of mice to expel the parasite. • Previously, a manual two-year study by Jo had failed to do this. [Computational Workflows, Carole Goble] Reuse can be impressive when it works … but is generally hard to achieve.
  • 29. Which level of reproducibility are we at?   Repeatability and Replicability   Even these two are hard to achieve most of the time.   It is too early to speak about reuse at this point. There are a few use cases that show the potential of workflow reuse, but we are still at the stage of use cases.   Solutions for enabling scientific workflow repeatability and replication have mainly focused on their preservation against decay.
  • 30. Agenda   Scientific workflows   Scientific Workflow Reproducibility   Workflow Preservation Against Decay   From Workflows to Scripts BDA MDD 2018
  • 31. Workflow Preservation   Public repositories such as myExperiment and CrowdLabs have been used by scientists to publish workflow specifications and share them over the web.   The availability of workflow specifications is, however, not sufficient for enabling their repeatability and replicability.   Indeed, an empirical study that we conducted showed that the majority of workflows suffer from decay.
  • 32. Understanding The Causes of Workflow Decay   We adopted an empirical approach   To identify the causes of workflow decay   To quantify their severity   To do so, we analyzed a sample of real workflows to determine if they suffer from decay and the reasons that caused their decay BDA MDD 2018
  • 33. Experimental Setup   Taverna workflows from   Taverna 1   Taverna 2   Selection process   By the creation year   By the creator   By the domain   Software environment   Taverna 2.3   Experiment metadata   4 researchers BDA MDD 2018
  • 34. Analyzed Workflows
      Number of Taverna 1 workflows from 2007 to 2011
        Year     2007  2008  2009  2010  2011
        Tested     11    10    10    10    4*
        Total      74   341   101    26    13
      Number of Taverna 2 workflows from 2009 to 2012
        Year     2009  2010  2011  2012
        Tested     12    10    15     9
        Total      97   308   289   184
  • 35. Profile of Analyzed Workflows BDA MDD 2018
  • 36. The Proportion of Decay   75% of the 92 tested workflows either failed to execute or failed to produce the same result (if testable).   Those from earlier years (2007-2009) had a 91% failure rate. [Figure: decay proportions for Taverna 1 and Taverna 2 workflows]
  • 37. The Cause of Decay   Manual analysis   By the validation report from Taverna workbench   By interpreting experiment results reported by Taverna   Identified 4 categories of causes   Missing example data   Missing execution environment   Insufficient descriptions about workflows   Volatile third-party Resources BDA MDD 2018
  • 38. Decay Caused by Third-Party Resources
      Cause: Third-party resources are not available
        - Refined cause: The underlying dataset, particularly a locally hosted in-house dataset, is no longer available. Example: the researcher hosting the data changed institution and the server is no longer available.
        - Refined cause: Services are deprecated. Example: DDBJ web services are no longer provided, despite the fact that they are used in many myExperiment workflows.
      Cause: Third-party resources are available but not accessible
        - Refined cause: Data is available but identified using different IDs than the ones known to the user. Example: for scalability reasons, the input data is superseded by new data, making the workflow not executable or providing wrong results.
        - Refined cause: Data is available, but permission, a certificate, or a network is needed to access it. Example: cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider.
        - Refined cause: Services are available but need permission, a certificate, or a network to access and invoke them. Example: the security policies of the execution framework are updated due to new hosting institution rules.
      Cause: Third-party resources have changed
        - Refined cause: Services are still available under the same identifiers, but their functionality has changed. Example: the web services are updated.
  • 40. Summary of Decay Causes   50% of the decay was caused by the volatility of third-party resources   Unavailable   Inaccessible   Updated   Missing example data   Unable to re-run   Missing execution environment   Such as local plugins   Insufficient metadata   Such as any required dependency libraries or permission information
  • 42. Combating Workflow Decay   Objective: Provide enough information to   Prevent decay   Detect decay   Repair decay   Approach: Research Objects + Checklists   Research Object: Aggregates workflow specifications together with auxiliary elements, such as example data inputs, annotations, and provenance traces, that can be used to prevent decay and/or repair the workflow in case of decay.   Checklists: to check that sufficient information is preserved along with workflows
  • 43. Checklists [Figure: example checklist with the items: execution environment present; workflow description present; input files present; parameter values defined; required services accessible] • Checklists are a well-established tool for guiding practices to ensure safety, quality and consistency in the conduct of complex operations. • They have been adopted by the biological research community to promote consistency across research datasets. • In our case, we use checklists to assess if a research object contains sufficient information for running the workflow and checking that its results are replicable.
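A checklist of this kind can be sketched as a list of named predicates evaluated against a research object. This is an illustrative sketch only: the dictionary layout of the research object, its keys, and the three checklist items are assumptions chosen to mirror the decay causes listed earlier, and this is not the Minim model or its tooling.

```python
# Hypothetical checklist: each item is a label plus a predicate over a
# research object represented as a plain dictionary (assumed layout).
CHECKLIST = [
    ("workflow description present",
     lambda ro: bool(ro.get("workflow"))),
    ("example inputs present",
     lambda ro: bool(ro.get("example_inputs"))),
    ("required services accessible",
     lambda ro: all(ro.get("service_status", {}).values())),
]


def evaluate(research_object, checklist=CHECKLIST):
    """Return a report mapping each checklist item to True or False."""
    return {label: check(research_object) for label, check in checklist}


def is_runnable(research_object, checklist=CHECKLIST):
    """A research object passes only if every checklist item is satisfied."""
    return all(evaluate(research_object, checklist).values())


complete_ro = {
    "workflow": "pathways.t2flow",
    "example_inputs": ["cattle.fa"],
    "service_status": {"alignment_service": True},
}
decayed_ro = {
    "workflow": "pathways.t2flow",
    "example_inputs": [],
    "service_status": {"alignment_service": False},
}
```

Keeping the report per item, rather than a single pass/fail flag, makes it possible to tell which kind of information is missing and therefore which repair is needed.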
  • 44. Checklisting the Reproducibility of a Workflow [Figure labels: Select and evaluate checklist; Minim description; Research Object; Evaluation report; Web; Purpose] The Minim model used in our approach is an adaptation of the MIM model [1]. [1] Matthew Gamble, Jun Zhao, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. eScience 2012
  • 45. Use Case   4 myExperiment packs   2 from genomics, 1 from geography, and 1 domain-neutral   Experiment process:   Transform them into ROs   Create checklist descriptions   Observations   2 research objects did not contain example inputs; the other 2 failed because of updates to third-party resources and to the execution environment.
  • 46. Lessons Learnt 1.  Dependency is the root enemy of reproducible workflows 2.  Documentation, i.e., annotation, is vital 3.  Documentation should be easy to create BDA MDD 2018
  • 48. Benefits Of Research Objects   A research object aggregates all elements that are necessary to understand research investigations.   Methods (experiments) are viewed as first class citizens   Promote reuse   Enable the verification of reproducibility of the results BDA MDD 2018
  • 49. Research Objects Specifications and Tooling can be found at
  • 50. Research Object Model: Overview The model specification can be found at And the primer at BDA MDD 2018
  • 51. Workflow Template and Workflow Run BDA MDD 2018
  • 55. Grounding Workflow-centric Research Objects Using Semantic Technologies   Workflow-centric research objects are encoded using RDF, according to a set of ontologies that are publicly available.   Research objects use the Object Reuse and Exchange (ORE) model to represent aggregation.
  • 56. Grounding Workflow-centric Research Objects Using Semantic Technologies   We use the Annotation Ontology (AO) to annotate research object resources and their relationships.
  • 57. [Figure: research object (RO) lifecycle, involving a Scientist and a Librarian/Curator]
      - Live RO: the scientist's working research object.
      - "My supervisor calls me to report my work": RO snapshot (<<copy>>), identified by a URI, some metadata, some curation, mostly private (for my group).
      - "My supervisor calls me again and we decide to publish our RO+paper": RO snapshot (<<copy>>), identified by a URI, some metadata, some curation, mostly private (for my group and for paper reviewers).
      - "Reviews received and final version published": Archived RO (<<copy, filter and curate>>), identified by a URI, good metadata and curation, mostly public.
      - "A new PhD student continues my work": a new Live RO (<<copy>>).
      - Relations shown in the figure: <<copy>>, <<copy, filter and curate>>, <<versionOf>>.
  • 58. Using Research Objects for the Preservation of Workflows/Experiments Case study: investigating the epigenetic mechanisms involved in Huntington’s disease (HD). It is the most commonly inherited neurodegenerative disorder in Europe, affecting 1 out of 10 000 people. The scientists in this use case were convinced to use Research Objects as a model for packaging their investigation.
  • 59. Preserving Scientific Workflows when they have not been packaged into research objects   … Which is the case for most workflows.   And even if they are packaged into research objects, scientific workflows can still suffer from decay.
  • 60. Scientific Workflow Preservation   Issue: As we have seen from the results of the empirical study we presented earlier, workflow preservation is frequently hampered by the volatility of the web services implementing the analysis operations that constitute workflows.   Objective: to provide a means for scientists to repair workflows by identifying service operations that can play the same role as the unavailable ones. BDA MDD 2018
  • 61. Outline ✔  Context: Preservation of Scientific Workflows ■  Discovering Substitute Services Using Semantic Annotation of Web Services ■  Discovering Substitute Services Using Existing Workflow Specifications and Provenance traces ■  Conclusions BDA MDD 2018
  • 62. Ontologies Used For Annotating Web Services  Task ontology: captures information about the action carried out by service operations within a domain of interest, e.g., Sequence_alignment and Protein_identification  Domain ontology: captures information about the application domains covered by operation parameters, e.g., Protein_record and DNA_sequence BDA MDD 2018
  • 63. Task Replaceability Task replaceability: For an operation op2 to be able to substitute an operation op1, op2 must fulfil a task that is equivalent to or subsumes the task op1 performs: BDA MDD 2018
  • 64. Parameter compatibility Parameter replaceability: To be compatible the domain of the output must be the same as or subconcept of the domain of the subsequent input. BDA MDD 2018
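The two conditions (task replaceability and parameter compatibility) both reduce to a subsumption test over an ontology. The sketch below assumes toy ontologies encoded as child-to-parent dictionaries; only `Sequence_alignment`, `Protein_identification`, `Protein_record` and `DNA_sequence` come from the slides, and every other concept name and the function names are hypothetical.

```python
# Toy ontologies as child -> parent maps (assumed encoding).
TASK_ONTOLOGY = {
    "Pairwise_alignment": "Sequence_alignment",      # hypothetical concept
    "Sequence_alignment": "Sequence_analysis",       # parent is hypothetical
    "Protein_identification": "Sequence_analysis",
}
DOMAIN_ONTOLOGY = {
    "DNA_sequence": "Biological_sequence",           # parent is hypothetical
    "Protein_sequence": "Biological_sequence",       # hypothetical concept
    "Protein_record": "Biological_record",           # parent is hypothetical
}


def subsumes(ontology, general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = ontology.get(node)
    return False


def task_replaceable(task_op1, task_op2):
    # op2 may substitute op1 if its task is equivalent to or subsumes op1's.
    return subsumes(TASK_ONTOLOGY, task_op2, task_op1)


def parameter_compatible(output_domain, input_domain):
    # The output domain must be the same as, or a subconcept of,
    # the domain of the subsequent input.
    return subsumes(DOMAIN_ONTOLOGY, input_domain, output_domain)
```

For instance, an operation annotated with `Sequence_alignment` can stand in for one annotated with `Pairwise_alignment`, but not the other way round, and a `DNA_sequence` output can feed an input that expects any `Biological_sequence`.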
  • 65. Limitations While the method just presented is sound, its practical applicability is hindered by the following facts §  Semantic annotations of web services are scarce. §  Our experience suggests that a large proportion of existing semantic annotations suffer from inaccuracies §  As a result, a substitute that is discovered for replacing an unavailable operation using such annotations may turn out to be unsuitable, and, inversely, a suitable substitute may be discarded. BDA MDD 2018
  • 66. Discovering Substitute Services Using Existing Workflow Specifications and Provenance traces Existing Workflow Specifications Provenance traces of missing operations BDA MDD 2018
  • 67. Parameter Compatibility Formally, let wf1 be a workflow in which the operation op1 is unavailable. The operation op2 can replace the operation op1 in terms of its inputs and outputs if: BDA MDD 2018
  • 68. Task Compatibility   In addition to the compatibility in terms of inputs and outputs, we have to check that the candidate substitute performs a task compatible with that of the unavailable operation.   To perform this test, we exploit the following observation. An operation op2 is able to replace the operation op1 in terms of task if, for every possible input instance that op1 is able to consume, op2 delivers the same output as that obtained by invoking op1.   To perform the above test, however, we would have to call the missing operation op1!   The solution that we adopt to overcome this problem makes use of workflow provenance logs. These are traces that contain the intermediate data that were used as input and delivered as output by the constituent operations of a workflow when enacted.
  • 69. Task Compatibility (cont.) §  An operation op2 may be compatible in terms of task with op1 if: op2 delivers the same results that op1 delivered in past executions, that are logged within provenance logs, when fed using the same input values. §  Notice that we say may be compatible. This is because we may not be able to compare the outputs obtained for every possible input value of the operation op1. BDA MDD 2018
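The provenance-based test can be sketched as replaying the logged inputs of the missing operation against a candidate and comparing outputs. This is a minimal sketch under assumptions: the log is modeled as a list of (input, output) pairs, candidates are plain Python callables, and the reverse-complement example is invented for illustration.

```python
def may_replace(candidate, provenance_log):
    """Check a candidate against logged (input, output) pairs of op1.

    Returns True only if the candidate reproduces every logged output.
    A True result means "may be compatible": inputs that never appear
    in the log remain untested.
    """
    checked = 0
    for logged_input, logged_output in provenance_log:
        try:
            actual = candidate(logged_input)
        except Exception:
            return False          # the candidate cannot consume this input
        if actual != logged_output:
            return False
        checked += 1
    return checked > 0            # an empty log gives no evidence


# Hypothetical provenance of an unavailable reverse-complement service.
LOG = [("ATGC", "GCAT"), ("AAAC", "GTTT")]


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def reverse_only(seq):
    return seq[::-1]
```

Here `may_replace(reverse_complement, LOG)` holds while `may_replace(reverse_only, LOG)` does not, which matches the slide's caveat that agreement on past executions is evidence of compatibility rather than proof.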
  • 70. Relaxing Substitutability Conditions   The condition that we have described for checking the suitability of an operation as a substitute for another one may be stronger than is required in practice.   There are various parameter representations that are adopted in bioinformatics.   Because of representation mismatch, a service operation that performs a task similar to the missing operation may be found to be unsuitable. BDA MDD 2018
  • 71. Example of values delivered by two operations using the same input value Value1 Value2 CosSym(value1,value2) = 0.007 BDA MDD 2018
  • 72. Relaxing Substitutability Conditions To overcome this problem, we use a two step process when comparing the values of parameters: 1.  Given a parameter value, we derive its representation. 2.  If the representation is associated with a key attribute (identifier), extract the value of such an attribute If two parameter values are associated with identifiers, then they are compared by comparing their identifiers. BDA MDD 2018
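The two-step comparison can be sketched as follows. The detection rules are simplifying assumptions: a value starting with `>` is treated as FASTA with a `>db|ACCESSION|name` header, and a value containing an `AC` line is treated as a UniProt flat-file entry. The sample records are illustrative, not real service outputs.

```python
import re


def detect_representation(value):
    """Step 1: derive the representation of a parameter value (assumed rules)."""
    if value.lstrip().startswith(">"):
        return "fasta"
    if re.search(r"^AC\s+", value, flags=re.MULTILINE):
        return "uniprot"
    return "unknown"


def extract_identifier(value):
    """Step 2: extract the key attribute (identifier), if there is one."""
    representation = detect_representation(value)
    if representation == "fasta":
        header = value.lstrip().splitlines()[0]
        match = re.match(r">(?:sp|tr)\|([^|]+)\|", header)
    elif representation == "uniprot":
        match = re.search(r"^AC\s+([^;\s]+);", value, flags=re.MULTILINE)
    else:
        match = None
    return match.group(1) if match else None


def same_value(value1, value2):
    """Compare by identifier when both have one, else compare literally."""
    id1, id2 = extract_identifier(value1), extract_identifier(value2)
    if id1 is not None and id2 is not None:
        return id1 == id2
    return value1 == value2


# Illustrative values describing the same protein in two representations.
FASTA_VALUE = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha\nMVLSPADKTN\n"
UNIPROT_VALUE = "ID   HBA_HUMAN   Reviewed;\nAC   P69905;\nSQ   SEQUENCE\n     MVLSPADKTN\n//\n"
```

A literal comparison of the two strings fails, whereas `same_value` treats them as equal because both carry the same accession.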
  • 73. Example of values delivered by two operations using the same input value Value1 Value2 Fasta Format Uniprot Format BDA MDD 2018
  • 74. Data Examples for Characterizing Scientific Operations   We have conducted an empirical evaluation to assess the effectiveness of the method described.   The issue that we faced is the ability to have examples that characterize the missing operation, and that can be used for comparison with available modules.   This motivated a proposal that we have worked on for characterizing analysis operations using data examples. BDA MDD 2018
  • 76. Generating Data Examples   Data examples can be used as a means to describe the behavior of analysis operations.   Enumerating all possible data examples that can be used to describe a given operation may be expensive, and the resulting set may contain redundant data examples that describe the same behavior.   Issue: which data examples should be used to characterize the functionality of a given operation?   Solution: We have shown how software testing techniques can be adapted to the problem of generating data examples without relying on the availability of the operation specification, which often is not accessible. Trick: Use domain ontologies for partitioning the space of possible values
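The partitioning trick resembles equivalence partitioning in software testing: one representative value is drawn per leaf concept of the input's domain, and the operation is invoked as a black box. The sketch below is an assumption-laden illustration; the ontology, the value pool, and the `transcribe` operation are all hypothetical.

```python
# Toy domain ontology as concept -> subconcepts (assumed encoding).
ONTOLOGY = {
    "Biological_sequence": ["DNA_sequence", "RNA_sequence", "Protein_sequence"],
}
# Hypothetical pool of annotated values to draw representatives from.
POOL = {
    "DNA_sequence": ["ATGC", "GGCC"],
    "RNA_sequence": ["AUGC"],
    "Protein_sequence": ["MVLS"],
}


def leaf_concepts(ontology, concept):
    """Leaves under `concept`; each leaf is one partition of the input space."""
    children = ontology.get(concept, [])
    if not children:
        return [concept]
    leaves = []
    for child in children:
        leaves.extend(leaf_concepts(ontology, child))
    return leaves


def generate_data_examples(operation, input_concept, ontology=ONTOLOGY, pool=POOL):
    """Invoke the black-box operation once per partition and record the outcome."""
    examples = []
    for partition in leaf_concepts(ontology, input_concept):
        values = pool.get(partition, [])
        if not values:
            continue
        value = values[0]             # one representative per partition
        try:
            output = operation(value)
        except Exception as error:
            output = f"error: {type(error).__name__}"
        examples.append((partition, value, output))
    return examples


def transcribe(seq):
    """Hypothetical operation under study: DNA to RNA transcription."""
    if set(seq) - set("ACGT"):
        raise ValueError("not a DNA sequence")
    return seq.replace("T", "U")


EXAMPLES = generate_data_examples(transcribe, "Biological_sequence")
```

Three examples are produced instead of one per pooled value, and the failing partitions are informative too: they document which kinds of input the operation rejects.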
  • 77. Agenda   Scientific workflows   Scientific Workflow Reproducibility   Workflow Preservation Against Decay   From Workflows to Scripts BDA MDD 2018
  • 78. From Workflow to Scripts… and then Back   Scientific workflows have proved their utility, and they are used in practice by scientists.   However, the majority of scientists use scripting languages to specify and enact their data analyses.   To promote the reproducibility of scripts, a number of proposals in recent years have sought to bring some of the advantages that characterize workflows to scripts.   We will see some of them in what follows.
  • 79. Meanwhile, on a nearby planet … Interactive Visualization: R and Python are the Winners
  • 80. Why Bother?   Workflows provide key features to enable reproducibility that scripts lack   Modularity   This feature is generally lacking in scripts   Workflows can be repurposed in a straightforward manner by customizing the resources and the dependencies   Scalability: some workflow systems can handle large amounts of data   Provenance: most workflow systems are instrumented to capture provenance information about workflow execution   YesWorkflow to the rescue
  • 81. Science Example: Paleoclimate Reconstruction
  • 83. YesWorkflow = Script + Comments   Scripts can be hard to digest, communicate   Idea:   Add structured comments (cf. JavaDoc) => reveal workflow structure and dataflow   => obtain some scientific workflow benefits BDA MDD 2018
  • 84. YesWorkflow Generates Three Views from the Script BDA MDD 2018
  • 85. User Comments: YesWorkflow Annotations BDA MDD 2018
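To make the idea concrete, the sketch below shows a small script annotated with YesWorkflow-style comments (`@begin`, `@end`, `@in`, `@out`) and a toy extractor that recovers the blocks and the dataflow between them. The extractor is a simplified stand-in written for this illustration; it is not the YesWorkflow tool, and the annotated script and all function names are hypothetical.

```python
import re

# A script whose structure is revealed only by its comments.
SCRIPT = '''
# @begin clean_and_average
# @in raw_values
# @out mean_value

# @begin drop_missing
# @in raw_values
# @out clean_values
clean_values = [v for v in raw_values if v is not None]
# @end drop_missing

# @begin compute_mean
# @in clean_values
# @out mean_value
mean_value = sum(clean_values) / len(clean_values)
# @end compute_mean

# @end clean_and_average
'''

ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def extract_blocks(source):
    """Collect the declared inputs and outputs of every annotated block."""
    blocks, stack = {}, []
    for line in source.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            blocks[name] = {"in": [], "out": []}
            stack.append(name)
        elif tag == "end":
            stack.pop()
        elif stack:
            blocks[stack[-1]][tag].append(name)   # attach to innermost block
    return blocks


def dataflow_edges(blocks):
    """Link a block to another when an output of one is an input of the other."""
    edges = []
    for producer, p in blocks.items():
        for consumer, c in blocks.items():
            if producer == consumer:
                continue
            for data in p["out"]:
                if data in c["in"]:
                    edges.append((producer, data, consumer))
    return edges


BLOCKS = extract_blocks(SCRIPT)
EDGES = dataflow_edges(BLOCKS)
```

Because the annotations live in comments, the same approach applies to any scripting language with a comment syntax, and the script itself never has to be executed to recover this structure.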
  • 86. Paleoclimate Reconstruction … explained using YesWorkflow [Figure: YesWorkflow graph of the script. Program blocks: GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output. Data: master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years, PRISM_annual_growing_season_precipitation, dendro_series_for_calibration, dendro_series_for_reconstruction, cellwise_unique_selected_linear_models, cellwise_union_selected_linear_models, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif] Kyle B., (computational) archeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all told." [Source: B. Ludäscher, YesWorkflow: Workflow Views from Scripts. IDCC'15, London]
  • 87. YesWorkflow Architecture   YW-Extract: ... structured comments   YW-Model: Program Block, Workflow; Port (data, parameters); Channels (dataflow)   YW-Graph: using GraphViz/DOT files
  • 88. What About Provenance?   There are some solutions that allow capturing the provenance of a script.   Use (R, Python, ..) libraries and/or code instrumentation to capture runtime observables   file read/write, function calls, program variables & state, …   noWorkflow system   [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14]   exploits the Python profiling library to capture runtime provenance   Can be messy, as they capture every operating system event/call!
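The flavor of profiling-based capture can be illustrated with the standard `sys.setprofile` hook, which reports Python-level calls and returns. This is only a sketch of the general idea; it is not noWorkflow's implementation or API, and `capture_calls` and the sample functions are hypothetical names.

```python
import sys


def capture_calls(func, *args, **kwargs):
    """Run func and record Python-level call/return events as a flat trace."""
    trace = []

    def profiler(frame, event, arg):
        # C-level events (c_call, c_return, ...) are deliberately ignored.
        if event == "call":
            trace.append(("call", frame.f_code.co_name))
        elif event == "return":
            trace.append(("return", frame.f_code.co_name, arg))

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)      # always remove the hook
    return result, trace


def square(x):
    return x * x


def sum_of_squares(values):
    total = 0
    for value in values:
        total += square(value)
    return total


RESULT, TRACE = capture_calls(sum_of_squares, [1, 2])
```

Even this two-function example yields six events, which hints at why traces captured this way grow quickly and need filtering before they are useful as provenance.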
  • 89. Actually, We Can Construct the Provenance Without Recording it in the First Place! YW annotations: Model your Workflow! [Source: YesWorkflow Provenance @ TaPP'15]
  • 90. and You have the Provenance for Free
      run/
      ├── raw
      │   └── q55
      │       ├── DRT240
      │       │   ├── e10000
      │       │   │   ├── image_001.raw
      │       │   │   ...
      │       │   │   └── image_037.raw
      │       │   └── e11000
      │       │       ├── image_001.raw
      │       │       ...
      │       │       └── image_037.raw
      │       └── DRT322
      │           ├── e10000
      │           │   ├── image_001.raw
      │           │   ...
      │           │   └── image_030.raw
      │           └── e11000
      │               ├── image_001.raw
      │               ...
      │               └── image_030.raw
      ├── data
      │   ├── DRT240
      │   │   ├── DRT240_10000eV_001.img
      │   │   ...
      │   │   └── DRT240_11000eV_037.img
      │   └── DRT322
      │       ├── DRT322_10000eV_001.img
      │       ...
      │       └── DRT322_11000eV_030.img
      ├── collected_images.csv
      ├── rejected_samples.txt
      └── run_log.txt
    YW-RECON: Prospective & Retrospective Provenance … (almost) for free! [Source: YesWorkflow Provenance @ TaPP'15]
    [Figure: YesWorkflow graph. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data with URI templates: sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv), calibration_image (file:calibration.img), run_log (file:run/run_log.txt), rejection_log (file:/run/rejected_samples.txt), raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw), corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img), collection_log (file:run/collected_images.csv). Other data: cassette_id, sample_score_cutoff, sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path]
    •  URI-templates link conceptual entities to runtime provenance “left behind” by the script author …
    •  … facilitating provenance reconstruction
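The role of URI templates in reconstruction can be sketched by turning a template into a regular expression and matching it against the files a run left behind. This is an illustrative sketch, not YesWorkflow's reconstruction code: `template_to_regex` and `reconstruct` are hypothetical helpers, and the assumption that a variable never spans a `/` is a simplification.

```python
import re

# Templates taken from the slide (without the "file:" prefix).
RAW_IMAGE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED_IMAGE = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"


def template_to_regex(template):
    """Compile a URI template into a regex with one named group per variable."""
    pattern, position, seen = "", 0, set()
    for match in re.finditer(r"\{(\w+)\}", template):
        pattern += re.escape(template[position:match.start()])
        name = match.group(1)
        if name in seen:
            pattern += f"(?P={name})"          # repeated variable must agree
        else:
            pattern += f"(?P<{name}>[^/]+?)"   # assumption: no "/" in values
            seen.add(name)
        position = match.end()
    pattern += re.escape(template[position:])
    return re.compile("^" + pattern + "$")


def reconstruct(template, paths):
    """Return the variable bindings of every path that matches the template."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in map(regex.match, paths) if m]


PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "data/DRT240/DRT240_10000eV_001.img",
    "run/run_log.txt",
]
RAW_BINDINGS = reconstruct(RAW_IMAGE, PATHS)
CORRECTED_BINDINGS = reconstruct(CORRECTED_IMAGE, PATHS)
```

Each binding ties a concrete file to the conceptual data item it instantiates and to the values (sample, energy, frame) that produced it, which is the retrospective information the slide describes as coming almost for free.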
  • 95. Back to Workflow Land Converting Scripts into Reproducible Workflow Research Objects 95 BDA MDD 2018
  • 97. Step 5: Bundle Resources into a Research Object   Script   Abstract workflow   Concrete workflow(s)   Annotations   Paper   Provenance   Data   Attributions
  • 98. Conclusions   Research on enabling reproducibility has seen a real push in recent years, with some great initiatives, software products and data repositories: Figshare, Dataverse, OpenAIRE, DataONE, RDA.   Workflows and scripts are no exception, and there have been some good proposals from a handful of researchers as well as practitioners.   MADICS Working Group on Reproducibility.   We are just scratching the surface and there are numerous issues that still need to be addressed.   Workflow/script similarity, comparison of scientific results, and incremental re-computation, to cite a few, are still open topics.
  • 99. Acknowledgement   Pinar Alper   Lucas Augusto Carvalho   Shawn Bowers   Sarah Cohen Boulakia   Alban Gaignard   Daniel Garijo   Carole Goble   Bertram Ludäscher   Timothy McPhillips   Claudia Medeiros   Paolo Missier   Stian Soiland-Reyes
  • 100. References   Pinar Alper, Khalid Belhajjame, Carole A. Goble: Static analysis of Taverna workflows to predict provenance patterns. Future Generation Comp. Syst. 75: 310-329 (2017)   Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina M. Hettne, Raúl Palma, Eleni Mina, Óscar Corcho, José Manuél Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole A. Goble: Using a suite of ontologies for preserving workflow-centric research objects. J. Web Sem. 32: 16-42 (2015)   Khalid Belhajjame, Carole A. Goble, Stian Soiland-Reyes, David De Roure: Fostering Scientific Workflow Preservation through Discovery of Substitute Services. eScience 2011: 97-104   Sarah Cohen Boulakia, Khalid Belhajjame, Olivier Collin, Jérôme Chopard, Christine Froidevaux, Alban Gaignard, Konrad Hinsen, Pierre Larmande, Yvan Le Bras, Frédéric Lemoine, Fabien Mareuil, Hervé Ménager, Christophe Pradal, Christophe Blanchet: Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities. Future Generation Comp. Syst. 75: 284-298 (2017)   Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing Cross-workflow Provenance. SeWeBMeDA@ESWC 2017: 50-64   Lucas Augusto Montalvão Costa Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros: Converting scripts into reproducible workflow research objects. eScience 2016: 71-80   Timothy M. McPhillips, Shawn Bowers, Khalid Belhajjame, Bertram Ludäscher: Retrospective Provenance Without a Runtime Provenance Recorder. TaPP 2015   Timothy M. McPhillips, Tianhong Song, Tyler Kolisnik, Steve Aulenbach, Khalid Belhajjame, Kyle Bocinsky, Yang Cao, Fernando Chirigati, Saumen C. Dey, Juliana Freire, Deborah N. Huntzinger, Christopher Jones, David Koop, Paolo Missier, Mark Schildhauer, Christopher R. Schwalm, Yaxing Wei, James Cheney, Mark Bieda, Bertram Ludäscher: YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. 
CoRR abs/1502.02403 (2015)