
Aussois bda-mdd-2018



This talk on Reproducibility, Workflows, Provenance and Scripts was given at the French Database Summer School BDA in Aussois,

Published in: Education


  1. 1. Computational Reproducibility: Workflows, Provenance and Scripts. Khalid Belhajjame, PSL, LAMSADE, Université Paris-Dauphine. BDA MDD 2018
  2. 2. Data-Oriented Science. Computing is transforming the practice of science. The so-called “Fourth Paradigm of scientific research” [1] refers to the current era, in which scientists use computational tools and technologies to manage, share, federate, analyze, and visualize the data that underpin scientific findings. The objective of data-oriented science is to create a richer research ecosystem in which emphasis is given not only to the build-up of scientific knowledge, but also to the build-up and dissemination of other work products of research such as data, protocols, models and tools. Why is that? [1] Tony Hey, Stewart Tansley, and Kristin M. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
  3. 3. Scholarly Articles Are Not Enough. Scholarly articles remain the main trusted means for scientists to communicate their findings. However, they are noticeably insufficient to communicate all the scientific knowledge behind the reported findings. Other artifacts need to be communicated and preserved to enable understanding, verification, and reuse. In other words: reproducibility.
  4. 4. 47 of 53 “landmark” publications could not be replicated. Inadequate cell lines and animal models. (Nature, 483, 2012.) Basic studies on cancer are unreliable, with grim consequences for producing new medicines in the future.
  5. 5. The research result obtained by Stapel and co-workers Roos Vonk (Radboud University) and Marcel Zeelenberg (Tilburg University), showing that meat eaters are more selfish than vegetarians, which was widely publicized in Dutch media, is suspected to be based on faked data.
  6. 6. Reproducibility is not just about finding cheaters … it is above all a noble cause
  7. 7. Culture of Reproducibility. Researchers in experimental biology carefully use lab notebooks to document the different aspects of their experiments. This is not the case for computational scientists, who tend to run their analyses with no clear record of the exact process they followed or of the intermediary datasets (results) they used and generated. It is therefore possible that numerous published results are unreliable or even completely invalid.
  8. 8. Culture of Reproducibility. Often, scholarly communications contain no record of the process (workflow) that produced the published computational results. Even the code is often missing or has since changed. When we are lucky enough to have it, it may still not be usable to process the data referred to.
  9. 9. Open and Transparent Communication. “The reproducible research movement recognizes that traditional scientific research and publication practices now fall short …, and encourages all those involved in the production of computational science ... to facilitate and practice really reproducible research.” (V. Stodden, D. H. Bailey, J. M. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the default to reproducible: Reproducibility in computational and experimental mathematics.) A number of methods and tools for enabling reproducibility have recently emerged.
  10. 10. Scope of this seminar. We will focus on the reproducibility of scientific workflows. These have been adopted in modern sciences, notably the life sciences and biodiversity, for encoding and enacting scientific experiments. We will look at what it means to reproduce a scientific workflow, and draw a map of some of the solutions that have been proposed in this direction.
  11. 11. Agenda: Scientific workflows; Scientific Workflow Reproducibility; Workflow Preservation Against Decay; From Workflows to Scripts
  12. 12. Scientific workflow • Workflow technology is increasingly used for specifying and enacting scientific experiments. • A scientific workflow is a series of analysis operations connected using data links. • Analysis operations can be supplied locally or can be independently developed web services.
  13. 13. Science with workflows
      • GWAS, Pharmacogenomics: association study of Nevirapine-induced skin rash in Thai population
      • Trypanosomiasis (sleeping sickness parasite) in African cattle
      • Astronomy & HelioPhysics
      • Library Doc Preservation
      • Systems Biology of Micro-Organisms
      • Observing Systems Simulation Experiments (JPL, NASA)
      • BioDiversity: Invasive Species Modelling
      [Credit Carole A. Goble]
  14. 14. Workflows for systematic resource use • Access heterogeneous resources. • Explicit, runnable, repeatable analytical process. • Explore parameter spaces. • Sweep an analysis over datasets. • Transparent and efficient analyses with provenance collected from workflow executions. (Figure labels: Workflow, Provenance, Data)
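To make the ideas on the two previous slides concrete, here is a minimal sketch of a workflow as operations connected by data links, swept over several datasets, with a provenance record kept for every invocation. It is not Taverna or any real workflow system: `run_workflow`, the `Operation` layout and the toy GC-content analysis are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: operation name -> (function, names of the data it consumes).
Operation = Tuple[Callable[..., object], List[str]]


def run_workflow(ops: Dict[str, Operation], inputs: Dict[str, object]):
    """Run each operation once its data links are satisfied; return data and provenance."""
    data = dict(inputs)   # all data items supplied or produced so far
    trace = []            # one provenance record per operation invocation
    pending = dict(ops)
    while pending:
        ready = [name for name, (_, deps) in pending.items()
                 if all(dep in data for dep in deps)]
        if not ready:
            raise ValueError("unsatisfied data links for: " + ", ".join(sorted(pending)))
        for name in ready:
            func, deps = pending.pop(name)
            args = [data[dep] for dep in deps]
            data[name] = func(*args)
            trace.append({"operation": name,
                          "used": dict(zip(deps, args)),
                          "generated": data[name]})
    return data, trace


def gc_content(sequence: str) -> float:
    """Fraction of G and C characters in a sequence (toy analysis step)."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


workflow: Dict[str, Operation] = {
    "clean": (lambda raw: raw.strip().upper(), ["raw_sequence"]),
    "gc_content": (gc_content, ["clean"]),
}

# Sweep the same analysis over several datasets, keeping one trace per run.
runs = {raw: run_workflow(workflow, {"raw_sequence": raw})
        for raw in [" gcta ", "ggcc", "atat"]}
results = {raw: data["gc_content"] for raw, (data, _) in runs.items()}
```

Because the process is explicit, the same `workflow` dictionary can be rerun unchanged on new inputs, and each `trace` says which data every step used and generated.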
  15. 15. Workflow Systems
  16. 16. Demo: Example showing the use of Taverna
  17. 17. Agenda: Scientific workflows; Scientific Workflow Reproducibility; Workflow Preservation Against Decay; From Workflows to Scripts
  18. 18. Reproducibility Terminology. Reproducibility has been studied in science in contexts broader than computational reproducibility, in particular where wet experiments are involved. A plethora of terms are used, including repeat, replicate, reproduce, redo, rerun, recompute, reuse and repurpose, to name a few. We will focus on four Rs: Repeat, Replicate, Reproduce and Reuse. For each of them, we will give the definition in wet-lab contexts and propose a definition in a computational setting.
  19. 19. BDA MDD 2018
  20. 20. Repeat. A wet experiment is said to be repeated when the experiment is performed in the same lab as the original experiment, that is, in the same scientific environment. By analogy, an in silico experiment is said to be repeated when it is performed in the same computational setting as the original experiment. The major goal of the repeat task is to check whether the initial experiment was correct and can be performed again. The difficulty lies in recording as much information as possible, so that the experiment can be repeated and the same conclusion drawn.
  21. 21. Replicate. A wet experiment is said to be replicated when the experiment is performed in a different (wet) “lab” than the original experiment. By analogy, a replicated in silico experiment is performed in a new setting and computational environment, although similar to the original ones. When replicated, a result has a high level of robustness: it remains valid when a similar (though different) setting is considered. A continuum of situations can be considered between repeated and replicated experiments.
  22. 22. Reproduce. Reproduce is defined in the broadest possible sense of the term and denotes the situation where an experiment is performed within a different set-up but with the aim of validating the same scientific hypothesis. In other words, what matters is the conclusion obtained and not the methodology used to reach it. Completely different approaches can be designed, and completely different data sets can be used, as long as both experiments converge to the same scientific conclusion. A reproducible result is thus a high-quality result, confirmed while obtained in various ways.
  23. 23. Reuse. A very important concept related to reproducibility is reuse, which denotes the case where a different experiment is performed that shares similarities with an original experiment. A specific kind of reuse occurs when a single experiment is reused in a new context (and thus adapted to new needs); the experiment is then said to be repurposed.
  24. 24. Repeat, Replicate, Reproduce and Reuse. Reproduce and reuse are the most important scientific targets. However, before investigating alternative ways of obtaining a result (to reach reproducibility) or reusing a given methodology in a new context (to reach reuse), the original experiment has to be carefully tested (possibly by reviewers and/or peers), demonstrating that it can at least be repeated and, hopefully, replicated. The database community lags well behind other computer science communities, e.g., the Semantic Web community: ISWC and ESWC encourage authors to submit, along with the paper, auxiliary resources about the experiment they ran as well as the software/prototype they built, if any.
  25. 25. Reproducibility and Scientific Workflows. We now introduce definitions of reproducibility concepts in the particular context of scientific workflow systems. In our definitions, we distinguish the following components of an analysis designed using a scientific workflow:
      1. S, the workflow specification, providing the analysis steps associated with tools, chained in a given order;
      2. I, the input of the workflow used for its execution, that is, the concrete data sets and parameter settings specified for any tools;
      3. E, the workflow context and runtime environment, that is, the computational context of the execution (OS, libraries, etc.).
      Additionally, we consider R and C: respectively, the result of the analysis (typically the final data sets) and the high-level conclusion that can be reached from this analysis.
  26. 26. Repeatability of a Scientific Workflow: given two analyses A and A’ performed using scientific workflows, we say that A’ repeats A if and only if A and A’ are identical on all their components. Replicability of a Scientific Workflow: given two analyses A and A’ performed using scientific workflows, we say that A’ replicates A if and only if they reach the same conclusion while their specification and input components are similar; other components may differ (in particular, no condition is set on the run-time environment). Terms such as rerun and re-compute typically refer to situations where the workflow specification is unchanged.
  27. 27. Reproducibility of a Scientific Workflow: given two analyses A and A’ performed using scientific workflows, we say that A’ reproduces A if and only if they reach the same conclusion. No condition is set on any other component of the analysis. Reuse of a Scientific Workflow: given two analyses A and A’ performed using scientific workflows, we say that A’ reuses A if and only if the specification or input of A is part of the specification or input of A’. No other condition is set; in particular, the conclusion reached may be different.
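The four definitions can be written down as predicates over the components S, I, E, R and C. The sketch below is only one possible reading: the slides do not say what "similar" means, so a Jaccard threshold is assumed here, and "part of" is read as set inclusion of A's steps or inputs in those of A’. The class name `Analysis`, the field names and the example values are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Analysis:
    spec: FrozenSet[str]     # S: workflow steps (simplified to a set of step names)
    inputs: FrozenSet[str]   # I: data sets and parameter settings
    env: str                 # E: runtime environment
    result: str              # R: result of the analysis
    conclusion: str          # C: high-level conclusion


def jaccard_similar(x: FrozenSet[str], y: FrozenSet[str], threshold: float = 0.5) -> bool:
    """Assumed similarity test; the talk leaves 'similar' undefined."""
    union = x | y
    return True if not union else len(x & y) / len(union) >= threshold


def repeats(a2: Analysis, a1: Analysis) -> bool:
    return a2 == a1  # identical on all components


def replicates(a2: Analysis, a1: Analysis,
               similar: Callable[[FrozenSet[str], FrozenSet[str]], bool] = jaccard_similar) -> bool:
    return (a2.conclusion == a1.conclusion
            and similar(a2.spec, a1.spec)
            and similar(a2.inputs, a1.inputs))


def reproduces(a2: Analysis, a1: Analysis) -> bool:
    return a2.conclusion == a1.conclusion  # no condition on anything else


def reuses(a2: Analysis, a1: Analysis) -> bool:
    return a1.spec <= a2.spec or a1.inputs <= a2.inputs


original = Analysis(frozenset({"align", "filter", "plot"}), frozenset({"cattle.fa"}),
                    "Linux", "pathways.csv", "pathway P is implicated")
elsewhere = replace(original, env="macOS")            # same analysis, other environment
extended = Analysis(original.spec | {"annotate"}, frozenset({"mouse.fa"}),
                    "Linux", "mouse.csv", "pathway Q is implicated")
```

Here `elsewhere` replicates and reproduces `original` without repeating it, while `extended` reuses `original` although it reaches a different conclusion.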
  28. 28. Real-Life Example of Reuse
      • Paul writes workflows for identifying biological pathways implicated in resistance to Trypanosomiasis in cattle.
      • Paul meets Jo, who is investigating Whipworm in mouse.
      • Jo reuses one of Paul’s workflows without change.
      • Jo identifies the biological pathways involved in sex dependence in the mouse model, believed to be involved in the ability of mice to expel the parasite.
      • Previously, a manual two-year study by Jo had failed to do this.
      Reuse can be impressive when it works … but is generally hard to achieve. (Computational Workflows, Carole Goble)
  29. 29. Which level of reproducibility are we at? Repeatability and Replicability ☹. Even these two are hard to achieve most of the time, not to mention reuse at this point. There are a few use cases that show the potential of workflow reuse, but we are still at the stage of use cases. Solutions for enabling scientific workflow repeatability and replication have mainly focused on preserving workflows against decay.
  30. 30. Agenda: Scientific workflows; Scientific Workflow Reproducibility; Workflow Preservation Against Decay; From Workflows to Scripts
  31. 31. Workflow Preservation. Public repositories such as myExperiment and CrowdLabs have been used by scientists to publish workflow specifications and share them over the web. The availability of workflow specifications is, however, not sufficient to make them repeatable and replicable. Indeed, an empirical study that we conducted showed that the majority of workflows suffer from decay.
  32. 32. Understanding The Causes of Workflow Decay. We adopted an empirical approach to identify the causes of workflow decay and to quantify their severity. To do so, we analyzed a sample of real workflows to determine whether they suffer from decay and what caused it.
  33. 33. Experimental Setup
      • Taverna workflows from: Taverna 1; Taverna 2
      • Selection process: by the creation year; by the creator; by the domain
      • Software environment: Taverna 2.3
      • Experiment metadata: 4 researchers
  34. 34. Analyzed Workflows
      Number of Taverna 1 workflows from 2007 to 2011
                2007   2008   2009   2010   2011
      Tested      11     10     10     10     4*
      Total       74    341    101     26     13
      Number of Taverna 2 workflows from 2009 to 2012
                2009   2010   2011   2012
      Tested      12     10     15      9
      Total       97    308    289    184
  35. 35. Profile of Analyzed Workflows
  36. 36. The Proportion of Decay. 75% of the 92 tested workflows either failed to execute or failed to produce the same result (where testable). Those from earlier years (2007-2009) had a 91% failure rate. (Chart panels: Taverna 1, Taverna 2)
  37. 37. The Cause of Decay
      • Manual analysis: by the validation report from the Taverna workbench; by interpreting experiment results reported by Taverna
      • Identified 4 categories of causes: missing example data; missing execution environment; insufficient descriptions of workflows; volatile third-party resources
  38. 38. Decay Caused by Third-Party Resources (causes, refined causes, examples)
      • Third-party resources are not available
          – Underlying dataset, particularly a locally hosted in-house dataset, is no longer available. Example: the researcher hosting the data changed institution and the server is no longer available.
          – Services are deprecated. Example: DDBJ web services are no longer provided, despite the fact that they are used in many myExperiment workflows.
      • Third-party resources are available but not accessible
          – Data is available but identified using different IDs than the ones known to the user. Example: for scalability reasons, the input data is superseded by new data, making the workflow not executable or causing it to produce wrong results.
          – Data is available but permission, a certificate, or network access is needed to reach it. Example: cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider.
          – Services are available but need permission, a certificate, or network access to be invoked. Example: the security policies of the execution framework are updated due to new hosting institution rules.
      • Third-party resources have changed
          – Services are still available under the same identifiers but their functionality has changed. Example: the web services are updated.
  40. 40. Summary of Decay Causes
      • 50% of the decay was caused by the volatility of third-party resources: unavailable, inaccessible, or updated
      • Missing example data: unable to re-run
      • Missing execution environment: such as local plugins
      • Insufficient metadata: such as any required dependency libraries or permission information
  41. 41. Combating Workflow Decay
  42. 42. Combating Workflow Decay
      • Objective: provide enough information to prevent decay, detect decay, and repair decay
      • Approach: Research Objects + Checklists
      • Research Object: aggregates workflow specifications together with auxiliary elements, such as example data inputs, annotations, and provenance traces, that can be used to prevent decay and/or repair the workflow in case of decay
      • Checklists: check that sufficient information is preserved along with workflows
  43. 43. Checklists • Checklists are a well-established tool for guiding practices to ensure safety, quality and consistency in the conduct of complex operations. • They have been adopted by the biological research community to promote consistency across research datasets. • In our case, we use checklists to assess whether a research object contains sufficient information for running the workflow and checking that its results are replicable. (Checklist items shown in the slide figure: execution environment present; workflow description present; input files present; parameter values defined; required services accessible.)
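A checklist of this kind can be sketched as a list of named tests applied to a research object. This is not the Minim model or the actual checklist tooling; the research object is a plain dictionary, and its keys (`environment`, `description`, `example_inputs`, `parameters`, `services`) are hypothetical. The five item names are my reading of the checklist figure on the slide.

```python
from typing import Any, Callable, Dict, List, Tuple

ResearchObject = Dict[str, Any]   # hypothetical in-memory stand-in for a research object

CHECKLIST: List[Tuple[str, Callable[[ResearchObject], bool]]] = [
    ("execution environment present", lambda ro: bool(ro.get("environment"))),
    ("workflow description present", lambda ro: bool(ro.get("description"))),
    ("input files present", lambda ro: bool(ro.get("example_inputs"))),
    ("parameter values defined",
     lambda ro: all(v is not None for v in ro.get("parameters", {}).values())),
    ("required services accessible",
     lambda ro: all(ro.get("services", {}).values())),
]


def evaluate(ro: ResearchObject) -> Dict[str, bool]:
    """Return a pass/fail verdict for every checklist item."""
    return {item: test(ro) for item, test in CHECKLIST}


def failures(ro: ResearchObject) -> List[str]:
    """List the items that give a reason to suspect (future) decay."""
    return [item for item, ok in evaluate(ro).items() if not ok]


example_ro = {
    "environment": "Taverna 2.3",
    "description": "Pathway identification workflow",
    "example_inputs": [],                       # nothing packaged with the workflow
    "parameters": {"threshold": 0.05},
    "services": {"alignment_service": True, "lookup_service": False},  # one service is down
}
report = failures(example_ro)
```

In this example `report` lists the missing inputs and the inaccessible service, which mirrors the two failure modes reported in the use case that follows.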
  44. 44. Checklisting the Reproducibility of a Workflow. (Figure labels: Select and evaluate checklist; Minim description; Research Object; Evaluation report; Web; Purpose.) The Minim model used in our approach is an adaptation of the MIM model [1]. [1] Matthew Gamble, Jun Zhao, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. eScience 2012
  45. 45. Use Case
      • 4 myExperiment packs: 2 from genomics, 1 from geography, and 1 domain-neutral
      • Experiment process: transform them into ROs; create checklist descriptions
      • Observations: 2 research objects did not contain example inputs; the other 2 failed because of updates to third-party resources and to the execution environment
  46. 46. Lessons Learnt 1. Dependency is the root enemy of reproducible workflows 2. Documentation, i.e., annotation, is vital 3. Documentation should be easy to create
  47. 47. Research Objects
  48. 48. Benefits Of Research Objects. A research object aggregates all elements that are necessary to understand research investigations. Methods (experiments) are viewed as first-class citizens. Promote reuse. Enable the verification of reproducibility of the results.
  49. 49. Research Objects: Specifications and Tooling can be found at
  50. 50. Research Object Model: Overview. The model specification can be found at And the primer at
  51. 51. Workflow Template and Workflow Run
  52. 52. Example
  53. 53. Example
  54. 54. Example
  55. 55. Grounding Workflow-centric Research Objects Using Semantic Technologies. Workflow-centric research objects are encoded using RDF, according to a set of publicly available ontologies. Research objects use the Object Reuse and Exchange (ORE) model to represent aggregation.
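As a rough illustration of what ORE-style aggregation looks like in RDF, the sketch below emits N-Triples for a research object that aggregates three resources. It uses only string formatting, not an RDF library. The namespace URIs for ORE and for the research object vocabulary are written as I recall them and should be checked against the published ontologies; the `example.org` URIs and both function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Namespaces as I recall them (assumption: verify against the published vocabularies).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ORE = "http://www.openarchives.org/ore/terms/"
RO = "http://purl.org/wf4ever/ro#"


def aggregation_triples(ro_uri: str, resource_uris: Iterable[str]) -> List[Triple]:
    """Describe a research object as an ORE aggregation of its resources."""
    triples = [
        (ro_uri, RDF_TYPE, RO + "ResearchObject"),
        (ro_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    triples += [(ro_uri, ORE + "aggregates", uri) for uri in resource_uris]
    return triples


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize URI-only triples in N-Triples syntax."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


ro_uri = "http://example.org/ro/pathway-study/"
document = to_ntriples(aggregation_triples(ro_uri, [
    ro_uri + "workflow.t2flow",
    ro_uri + "inputs/example-input.txt",
    ro_uri + "provenance/run-1.ttl",
]))
```

Annotations about the aggregated resources would be further triples pointing at these same URIs, which is what lets workflow, data and provenance be retrieved together.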
  56. 56. Grounding Workflow-centric Research Objects Using Semantic Technologies. We use the Annotation Ontology (AO) to annotate research object resources and their relationships.
  57. 57. (Figure) Actors: Scientist; Librarian/Curator. Live RO. RO snapshot <<copy>>: identified by a URI; some metadata; some curation; mostly private (for my group). RO snapshot <<copy>>: identified by a URI; some metadata; some curation; mostly private (for my group and for paper reviewers). Archived RO <<copy, filter and curate>>: identified by a URI; good metadata and curation; mostly public. Events: “My supervisor calls me to report my work”; “My supervisor calls me again and we decide to publish our RO+paper” (<<versionOf>>); “Reviews received and final version published” (<<versionOf>>); “A new PhD student continues my work” (<<copy>>).
  58. 58. Using Research Objects for the Preservation of Workflows/Experiments. Case study: investigating the epigenetic mechanisms involved in Huntington’s disease (HD). It is the most commonly inherited neurodegenerative disorder in Europe and affects 1 out of 10 000 people. The scientists in this use case were convinced to use Research Objects as a model for packaging their investigation.
  59. 59. Preserving Scientific Workflows when they have not been packaged into research objects … which is the case for most workflows. And even if they are packaged into research objects, scientific workflows can still suffer from decay.
  60. 60. Scientific Workflow Preservation. Issue: as we have seen from the results of the empirical study presented earlier, workflow preservation is frequently hampered by the volatility of the web services implementing the analysis operations that constitute workflows. Objective: to provide a means for scientists to repair workflows by identifying service operations that can play the same role as the unavailable ones.
  61. 61. Outline ✔ Context: Preservation of Scientific Workflows ■ Discovering Substitute Services Using Semantic Annotation of Web Services ■ Discovering Substitute Services Using Existing Workflow Specifications and Provenance Traces ■ Conclusions
  62. 62. Ontologies Used For Annotating Web Services • Task ontology: captures information about the action carried out by service operations within a domain of interest, e.g., Sequence_alignment and Protein_identification • Domain ontology: captures information about the application domains covered by operation parameters, e.g., Protein_record and DNA_sequence
  63. 63. Task Replaceability. For an operation op2 to be able to substitute an operation op1, op2 must fulfil a task that is equivalent to or subsumes the task op1 performs:
  64. 64. Parameter Compatibility. To be compatible, the domain of an output must be the same as, or a subconcept of, the domain of the subsequent input.
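Both conditions reduce to subsumption tests over the task and domain ontologies. The sketch below assumes a toy ontology stored as a child-to-parent map; `Sequence_alignment`, `Protein_identification` and `DNA_sequence` come from the slides, while the other concept names, the hierarchy itself and the function names are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical toy ontology: concept -> direct super-concept.
PARENT: Dict[str, str] = {
    "Pairwise_sequence_alignment": "Sequence_alignment",
    "Sequence_alignment": "Task",
    "Protein_identification": "Task",
    "DNA_sequence": "Biological_sequence",
    "Protein_sequence": "Biological_sequence",
    "Biological_sequence": "Data",
}


def subsumes(general: str, specific: str) -> bool:
    """True if `general` is `specific` itself or one of its ancestors."""
    node: Optional[str] = specific
    while node is not None:
        if node == general:
            return True
        node = PARENT.get(node)
    return False


def task_replaceable(task_op1: str, task_op2: str) -> bool:
    """op2 may substitute op1 if op2's task is equivalent to or subsumes op1's task."""
    return subsumes(task_op2, task_op1)


def parameters_compatible(output_domain: str, input_domain: str) -> bool:
    """An output can feed an input if its domain equals or specializes the input's domain."""
    return subsumes(input_domain, output_domain)
```

For instance, an operation annotated with `Sequence_alignment` passes the task test as a substitute for one annotated with `Pairwise_sequence_alignment`, and an output of domain `DNA_sequence` can feed an input expecting `Biological_sequence`, but not the other way round.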
  65. 65. Limitations. While the method just presented is sound, its practical applicability is hindered by the following facts: • Semantic annotations of web services are scarce. • Our experience suggests that a large proportion of existing semantic annotations suffer from inaccuracies. • As a result, a substitute discovered for an unavailable operation using such annotations may turn out to be unsuitable and, conversely, a suitable substitute may be discarded.
  66. 66. Discovering Substitute Services Using Existing Workflow Specifications and Provenance Traces. (Figure labels: existing workflow specifications; provenance traces of missing operations)
  67. 67. Parameter Compatibility. Formally, let wf1 be a workflow in which the operation op1 is unavailable. The operation op2 can replace the operation op1 in terms of its inputs and outputs if:
  68. 68. Task Compatibility. In addition to compatibility in terms of inputs and outputs, we have to check that the candidate substitute performs a task compatible with that of the unavailable operation. To perform this test, we exploit the following observation: an operation op2 is able to replace the operation op1 in terms of task if, for every possible input instance that op1 is able to consume, op2 delivers the same output as that obtained by invoking op1. To perform this test, however, we would have to call the missing operation op1! The solution we adopt to overcome this problem makes use of workflow provenance logs. These are traces that contain the intermediate data that were used as input and delivered as output by the constituent operations of a workflow when it was enacted.
  69. 69. Task Compatibility (cont.) • An operation op2 may be compatible in terms of task with op1 if op2, when fed the same input values, delivers the same results that op1 delivered in past executions recorded in provenance logs. • Notice that we say may be compatible. This is because we may not be able to compare the outputs obtained for every possible input value of the operation op1.
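The provenance-based test amounts to replaying the logged inputs of the unavailable operation on a candidate and comparing outputs. The sketch below assumes the log is a list of (inputs, output) pairs; `may_replace`, the log layout and the two toy candidates are hypothetical and stand in for real service invocations.

```python
from typing import Callable, Iterable, Tuple

LogEntry = Tuple[tuple, object]   # (input values, output value) recorded for op1


def may_replace(candidate: Callable[..., object],
                provenance_log: Iterable[LogEntry]) -> bool:
    """True if the candidate reproduces every logged output of the missing operation."""
    entries = list(provenance_log)
    if not entries:
        raise ValueError("no provenance recorded for the unavailable operation")
    for inputs, expected in entries:
        try:
            if candidate(*inputs) != expected:
                return False
        except Exception:
            return False   # a candidate that cannot consume a logged input is rejected
    return True


COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def complement_only(seq: str) -> str:
    return seq.translate(COMPLEMENT)


# Past invocations of the now-unavailable operation, as found in provenance traces.
log = [(("ATGC",), "GCAT"), (("AAGT",), "ACTT")]
```

A `True` answer only means "may be compatible": the check covers the logged inputs, not every input the missing operation could consume.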
  70. 70. Relaxing Substitutability Conditions. The condition that we have described for checking the suitability of an operation as a substitute for another one may be stronger than is required in practice. Various parameter representations are in use in bioinformatics. Because of representation mismatch, a service operation that performs a task similar to that of the missing operation may be found to be unsuitable.
  71. 71. Example of values delivered by two operations using the same input value: Value1, Value2; CosSym(value1, value2) = 0.007
  72. 72. Relaxing Substitutability Conditions. To overcome this problem, we use a two-step process when comparing the values of parameters: 1. Given a parameter value, we derive its representation. 2. If the representation is associated with a key attribute (identifier), we extract the value of that attribute. If two parameter values are both associated with identifiers, they are compared by comparing their identifiers.
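The two-step comparison can be sketched for the two representations named on the next slide, FASTA and the UniProt flat-file format. The detection rules and regular expressions below are simplifying assumptions (a UniProt-style FASTA header of the form `>db|ACCESSION|NAME`, and an `AC` line in the flat file), and the record snippets are abbreviated illustrations, not complete entries.

```python
import re
from typing import Optional


def detect_representation(value: str) -> str:
    """Step 1: derive the representation of a parameter value (two formats only)."""
    if value.lstrip().startswith(">"):
        return "fasta"
    if re.search(r"^ID\s+\S+", value, re.M):
        return "uniprot"
    return "unknown"


def extract_identifier(value: str) -> Optional[str]:
    """Step 2: extract the key attribute (accession) if the representation has one."""
    representation = detect_representation(value)
    if representation == "fasta":
        match = re.match(r">\w+\|([A-Z0-9]+)\|", value.lstrip())
    elif representation == "uniprot":
        match = re.search(r"^AC\s+([A-Z0-9]+);", value, re.M)
    else:
        match = None
    return match.group(1) if match else None


def same_value(v1: str, v2: str) -> bool:
    """Compare by identifier when both values have one, else fall back to equality."""
    id1, id2 = extract_identifier(v1), extract_identifier(v2)
    if id1 is not None and id2 is not None:
        return id1 == id2
    return v1 == v2


fasta_value = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha\nMVLSPADKTNVKAAWGKVGA\n"
uniprot_value = "ID   HBA_HUMAN   Reviewed;   142 AA.\nAC   P69905;\nDE   RecName: Full=Hemoglobin subunit alpha;\n"
```

A character-level similarity between these two strings is very low, yet `same_value(fasta_value, uniprot_value)` holds because both carry the same accession.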
  73. 73. Example of values delivered by two operations using the same input value: Value1 (Fasta format), Value2 (Uniprot format)
  74. 74. Data Examples for Characterizing Scientific Operations. We have conducted an empirical evaluation to assess the effectiveness of the method described. The issue we faced was obtaining examples that characterize the missing operation and that can be used for comparison with available modules. This motivated a proposal that we have worked on for characterizing analysis operations using data examples.
  75. 75. Data Example (figure label: Describes)
  76. 76. Generating Data Examples. Data examples can be used as a means to describe the behavior of analysis operations. Enumerating all possible data examples for a given operation may be expensive, and the result may contain redundant data examples that describe the same behavior. Issue: which data examples should be used to characterize the functionality of a given operation? Solution: we have shown how software testing techniques can be adapted to the problem of generating data examples without relying on the availability of the operation specification, which is often not accessible. Trick: use domain ontologies to partition the space of possible values.
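The partitioning trick is close to equivalence partitioning in software testing: pick one representative value per ontology concept and record what the operation does with it. The sketch below assumes the partitions are given as a mapping from concept names to candidate values; the concept names other than `DNA_sequence`, the `transcribe` operation and `generate_data_examples` are hypothetical.

```python
from typing import Callable, Dict, List


def generate_data_examples(operation: Callable[[str], object],
                           partitions: Dict[str, List[str]]) -> List[dict]:
    """Invoke the operation on one representative value per ontology concept."""
    examples = []
    for concept, candidates in partitions.items():
        representative = candidates[0]   # one value stands for the whole partition
        try:
            output = operation(representative)
        except Exception as exc:
            output = f"error: {exc}"     # failures also characterize the behavior
        examples.append({"concept": concept, "input": representative, "output": output})
    return examples


def transcribe(sequence: str) -> str:
    """Toy black-box operation: DNA to RNA."""
    if set(sequence) - set("ACGT"):
        raise ValueError("not a DNA sequence")
    return sequence.replace("T", "U")


# Partition of the input space induced by (hypothetical) domain-ontology concepts.
partitions = {
    "DNA_sequence": ["ATGC", "GGCC", "TTAA"],
    "RNA_sequence": ["AUGC"],
    "Protein_sequence": ["MVLS"],
}
data_examples = generate_data_examples(transcribe, partitions)
```

Three invocations are enough to show that the operation accepts DNA and rejects the other two concepts, whereas enumerating all candidate values would add redundant examples of the same behavior.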
  77. 77. Agenda: Scientific workflows; Scientific Workflow Reproducibility; Workflow Preservation Against Decay; From Workflows to Scripts
  78. 78. From Workflows to Scripts… and then Back. Scientific workflows have proved their utility, and they are used in practice by scientists. However, the majority of scientists use scripting languages to specify and enact their data analyses. To promote the reproducibility of scripts, a number of proposals in recent years have sought to bring to scripts some of the advantages that characterize workflows. We will see some of them in what follows.
  79. 79. Meanwhile, on a nearby planet … Interactive Visualization. R and Python are the Winners
  80. 80. Why Bother? Workflows provide key features for enabling reproducibility that scripts lack:
      • Modularity: this feature is generally lacking in scripts; a workflow can be repurposed in a straightforward manner by customizing its resources and dependencies
      • Scalability: some workflow systems can handle large amounts of data
      • Provenance: most workflow systems are instrumented to capture provenance information about workflow executions
      YesWorkflow to the rescue
  81. 81. Science Example: Paleoclimate Reconstruction
  82. 82. BDA MDD 2018
  83. 83. YesWorkflow = Script + Comments. Scripts can be hard to digest and communicate. Idea: add structured comments (cf. JavaDoc) => reveal workflow structure and dataflow => obtain some scientific workflow benefits.
  84. 84. YesWorkflow Generates Three Views from the Script
  85. 85. User Comments: YesWorkflow Annotations
  86. 86. Paleoclimate Reconstruction … explained using YesWorkflow. Workflow graph nodes: GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years. Kyle B., (computational) archeologist: “It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told.” (B. Ludäscher, YesWorkflow: Workflow Views from Scripts. IDCC'15, London.)
  87. 87. YesWorkflow Architecture
      • YW-Extract: ... structured comments
      • YW-Model: Program Block, Workflow; Port (data, parameters); Channels (dataflow)
      • YW-Graph: using GraphViz/DOT files
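The extract, model and graph stages can be imitated in a few lines to show the idea. This is a toy re-implementation for illustration, not the YesWorkflow tool: it handles only flat `@begin`/`@end` blocks with `@in` and `@out` ports (the keyword names follow YesWorkflow's annotation style as I understand it), and the annotated script, block names and function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\w+)")

SCRIPT = '''
# @begin clean_data
# @in raw_file
# @out clean_table
clean_table = [row.strip() for row in open(raw_file)]
# @end clean_data

# @begin fit_model
# @in clean_table
# @out model
model = fit(clean_table)
# @end fit_model
'''


def extract(script: str) -> List[Tuple[str, str]]:
    """Extract stage: pull (keyword, name) pairs out of structured comments."""
    return ANNOTATION.findall(script)


def build_model(annotations: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Model stage: group ports under the program block that declares them."""
    blocks, current = [], None
    for keyword, name in annotations:
        if keyword == "begin":
            current = {"name": name, "in": [], "out": []}
        elif keyword == "end":
            blocks.append(current)
            current = None
        elif current is not None:
            current[keyword].append(name)
    return blocks


def to_dot(blocks: List[Dict[str, object]]) -> str:
    """Graph stage: render blocks and data as a Graphviz DOT digraph."""
    lines = ["digraph workflow {"]
    for block in blocks:
        lines.append(f'  "{block["name"]}" [shape=box];')
        lines += [f'  "{data}" -> "{block["name"]}";' for data in block["in"]]
        lines += [f'  "{block["name"]}" -> "{data}";' for data in block["out"]]
    lines.append("}")
    return "\n".join(lines)


blocks = build_model(extract(SCRIPT))
dot = to_dot(blocks)
```

The script itself is never executed or parsed as code: only the comments are read, which is why the approach is independent of the scripting language.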
  88. 88. What About Provenance? Some solutions allow capturing the provenance of a script. They use (R, Python, ...) libraries and/or code instrumentation to capture runtime observables: file reads/writes, function calls, program variables and state, … The noWorkflow system [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14] exploits the Python profiling library to capture runtime provenance. The result can be messy, as such tools capture every operating system event/call!
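To give a feel for profiling-based capture, the sketch below records which Python functions are called while a piece of analysis code runs, using the standard library hook `sys.setprofile`. It is not noWorkflow's implementation, only an illustration of the general mechanism; `capture_calls` and the toy functions are hypothetical.

```python
import sys
from typing import Callable, List, Tuple


def capture_calls(func: Callable, *args, **kwargs) -> Tuple[object, List[str]]:
    """Run func and record the name of every Python function called meanwhile."""
    calls: List[str] = []

    def profiler(frame, event, arg):
        if event == "call":               # ignore returns and C-level calls
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)              # always remove the hook
    return result, calls


def mean(values):
    return sum(values) / len(values)


def normalize(values):
    center = mean(values)
    shifted = []
    for value in values:
        shifted.append(value - center)
    return shifted


result, calls = capture_calls(normalize, [1, 2, 3])
```

Even this tiny hook sees every Python-level call, which hints at why traces captured this way grow quickly and need filtering before they are useful.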
  89. 89. Actually, We Can Construct the Provenance Without Recording it in the First Place! YW annotations: Model your Workflow! (YesWorkflow Provenance @ TaPP'15)
  90. 90. … and You Have the Provenance for Free
      run/
      ├── raw
      │   └── q55
      │       ├── DRT240
      │       │   ├── e10000
      │       │   │   ├── image_001.raw
      │       │   │   ...
      │       │   │   └── image_037.raw
      │       │   └── e11000
      │       │       ├── image_001.raw
      │       │       ...
      │       │       └── image_037.raw
      │       └── DRT322
      │           ├── e10000
      │           │   ├── image_001.raw
      │           │   ...
      │           │   └── image_030.raw
      │           └── e11000
      │               ├── image_001.raw
      │               ...
      │               └── image_030.raw
      ├── data
      │   ├── DRT240
      │   │   ├── DRT240_10000eV_001.img
      │   │   ...
      │   │   └── DRT240_11000eV_037.img
      │   └── DRT322
      │       ├── DRT322_10000eV_001.img
      │       ...
      │       └── DRT322_11000eV_030.img
      ├── collected_images.csv
      ├── rejected_samples.txt
      └── run_log.txt
      YW-RECON: Prospective & Retrospective Provenance … (almost) for free! (YesWorkflow Provenance @ TaPP'15)
      Workflow graph (blocks and data, with URI templates):
      • Blocks: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity
      • Data: cassette_id, sample_score_cutoff, sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path
      • sample_spreadsheet: file:cassette_{cassette_id}_spreadsheet.csv
      • calibration_image: file:calibration.img
      • run_log: file:run/run_log.txt
      • rejection_log: file:/run/rejected_samples.txt
      • raw_image: file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
      • corrected_image: file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
      • collection_log: file:run/collected_images.csv
      URI templates link conceptual entities to runtime provenance “left behind” by the script author … facilitating provenance reconstruction.
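The reconstruction step can be sketched as matching the files a run left behind against the URI templates declared in the annotations. The two templates below are the ones shown on the slide; the translation of a template into a regular expression, including the assumption that placeholder values never contain `/`, `_` or `.`, is mine, as are the function names.

```python
import re
from typing import Dict, Optional

PLACEHOLDER = re.compile(r"\{(\w+)\}")

RAW_IMAGE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED_IMAGE = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"


def template_to_regex(template: str):
    """Turn a URI template into a regex with one named group per variable."""
    parts, seen, position = [], set(), 0
    for match in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[position:match.start()]))
        name = match.group(1)
        if name in seen:
            parts.append(f"(?P={name})")            # repeated variable: same value again
        else:
            parts.append(f"(?P<{name}>[^/_.]+)")    # assumed value alphabet
            seen.add(name)
        position = match.end()
    parts.append(re.escape(template[position:]))
    return re.compile("".join(parts))


def bind(template: str, path: str) -> Optional[Dict[str, str]]:
    """Recover variable bindings from a file path, or None if it does not fit."""
    match = template_to_regex(template).fullmatch(path)
    return match.groupdict() if match else None


raw_binding = bind(RAW_IMAGE, "run/raw/q55/DRT240/e10000/image_001.raw")
img_binding = bind(CORRECTED_IMAGE, "data/DRT240/DRT240_10000eV_001.img")
```

Joining the two bindings on `sample_id`, `energy` and `frame_number` links a corrected image back to the raw image it was derived from, without any runtime recorder.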
  95. 95. Back to Workflow Land: Converting Scripts into Reproducible Workflow Research Objects
  97. 97. Step 5: Bundle Resources into a Research Object   Script   Abstract workflow   Concrete workflow(s)   Annotations   Paper   Provenance   Data   Attributions
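A minimal sketch of the bundling step, assuming the resources are plain files: the snippet writes them into a ZIP archive together with a JSON manifest that lists each aggregated resource and its role. The function name `bundle_resources`, the manifest layout, and the file names and contents are all assumptions for illustration; this follows neither the Research Object Bundle specification nor the authors' tooling.

```python
import io
import json
import zipfile


def bundle_resources(resources, out):
    """Write resources into a ZIP archive plus a manifest that lists them.

    resources: dict mapping archive path -> (role, bytes content).
    out: file path or binary file-like object receiving the archive.
    Returns the manifest as a dict.
    """
    manifest = {
        "aggregates": [
            {"path": path, "role": role}
            for path, (role, _) in sorted(resources.items())
        ]
    }
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, (_, content) in resources.items():
            zf.writestr(path, content)
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return manifest


# Hypothetical resources; the contents are placeholders.
resources = {
    "script.py": ("script", b"print('hello')\n"),
    "workflow/abstract.txt": ("abstract workflow", b"placeholder"),
    "provenance/trace.txt": ("provenance", b"placeholder"),
}
buffer = io.BytesIO()
manifest = bundle_resources(resources, buffer)

# Re-open the archive to check what was bundled.
with zipfile.ZipFile(buffer) as zf:
    names = sorted(zf.namelist())
print(names)
```

Keeping the manifest inside the archive means the bundle describes itself: a reader can list the script, workflows, provenance and data without any external index.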
  98. 98. Conclusions   Research on enabling reproducibility has seen a real push in recent years, with some great initiatives, software products and data repositories: Figshare, Dataverse, OpenAIRE, DataONE, RDA.   Workflows and scripts are no exception, and there have been some good proposals from a handful of researchers as well as practitioners.   MADICS Working Group on Reproducibility.   We are just scratching the surface, and numerous issues still need to be addressed.   Workflow/script similarity, comparison of scientific results, and incremental re-computation, to cite a few, are still open topics.
  99. 99. Acknowledgement   Pinar Alper   Lucas Augusto Carvalho   Shawn Bowers   Sarah Cohen Boulakia   Alban Gaignard   Daniel Garijo   Carole Goble   Bertram Ludäscher   Timothy McPhillips   Claudia Medeiros   Paolo Missier   Stian Soiland-Reyes
  100. 100. References   Pinar Alper, Khalid Belhajjame, Carole A. Goble: Static analysis of Taverna workflows to predict provenance patterns. Future Generation Comp. Syst. 75: 310-329 (2017)   Khalid Belhajjame, Jun Zhao, Daniel Garijo, Matthew Gamble, Kristina M. Hettne, Raúl Palma, Eleni Mina, Óscar Corcho, José Manuél Gómez-Pérez, Sean Bechhofer, Graham Klyne, Carole A. Goble: Using a suite of ontologies for preserving workflow-centric research objects. J. Web Sem. 32: 16-42 (2015)   Khalid Belhajjame, Carole A. Goble, Stian Soiland-Reyes, David De Roure: Fostering Scientific Workflow Preservation through Discovery of Substitute Services. eScience 2011: 97-104   Sarah Cohen Boulakia, Khalid Belhajjame, Olivier Collin, Jérôme Chopard, Christine Froidevaux, Alban Gaignard, Konrad Hinsen, Pierre Larmande, Yvan Le Bras, Frédéric Lemoine, Fabien Mareuil, Hervé Ménager, Christophe Pradal, Christophe Blanchet: Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities. Future Generation Comp. Syst. 75: 284-298 (2017)   Alban Gaignard, Khalid Belhajjame, Hala Skaf-Molli: SHARP: Harmonizing Cross-workflow Provenance. SeWeBMeDA@ESWC 2017: 50-64   Lucas Augusto Montalvão Costa Carvalho, Khalid Belhajjame, Claudia Bauzer Medeiros: Converting scripts into reproducible workflow research objects. eScience 2016: 71-80   Timothy M. McPhillips, Shawn Bowers, Khalid Belhajjame, Bertram Ludäscher: Retrospective Provenance Without a Runtime Provenance Recorder. TaPP 2015   Timothy M. McPhillips, Tianhong Song, Tyler Kolisnik, Steve Aulenbach, Khalid Belhajjame, Kyle Bocinsky, Yang Cao, Fernando Chirigati, Saumen C. Dey, Juliana Freire, Deborah N. Huntzinger, Christopher Jones, David Koop, Paolo Missier, Mark Schildhauer, Christopher R. Schwalm, Yaxing Wei, James Cheney, Mark Bieda, Bertram Ludäscher: YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts. 
CoRR abs/1502.02403 (2015)