Towards Reproducible Science: a few building blocks from my personal experience

Full Professor at Universidad Politécnica de Madrid / Localidata
Oct. 22, 2017

Editor's Notes

  1. Change the license to whichever one applies.
  2. Experiments are central to empirical science: they are the foundation on which experimental sciences are built and improved. They allow scientists to verify the hypotheses defined according to the scientific method, and to convince the reader (other scientists) that the conclusions of a study are correct. For that, and for supporting the growth of science, an experiment must be a repeatable process (both by the original scientist and by other scientists).
  3. In recent decades there has been an evolution in the way experimental science is conducted, adding computational resources for solving scientific problems. We have moved from a paradigm in which experiments were mainly conducted in laboratories or in nature, also referred to as in vitro or in vivo science, to a paradigm in which simulations and mathematical models, executed over computational resources, are used for obtaining scientific insights, also referred to as computational science or in silico science. Computational experiments complement rather than substitute classical experiments.
  4. In both cases, in classical or computational experimental science, experiments must be a repeatable process, both for trusting the scientific results and for allowing the development of incremental research.
  5. In this context, we must define which kind of repeatability we are looking for and how we plan to achieve it. The first thing we have to do is define how we are going to take care of the object of interest, which can be done in two main ways. Preservation: the act of isolating the object, preventing any interaction that could damage it. Conservation: the set of actions for studying the object and its associated features, allowing a supervised or restricted use of it. Both processes prolong the life of the object.
  6. Once a plan for taking care of the object has been stated, there are likewise two ways of obtaining a repetition of it. A replication: a copy of the original object that is as close as possible to the original. A reproduction: an object that exposes or mimics a certain set of features in the same way as the original one. In this work we explore how conservation techniques can be applied to the reproducibility of experimental science. For achieving this conservation and reproducibility…
  7. Any scientific experiment can be divided into three main components. DATA: the phenomena we study from nature, such as light from stars, genomes from plants or animals, reports in social science, etc. SCIENTIFIC PROCEDURE: the set of steps that have to be performed in order to obtain the results of the experiment. EQUIPMENT: the set of tools that scientists require in order to capture, process and interpret the desired data. From telescopes to microscopes, Petri dishes or Bunsen burners, there is a wide range of tools depending on the scientific domain. All these components…
  8. All these components have a counterpart in the computational science world. DATA is often represented by means of tables in databases, structured files, or even web services providing data. The SCIENTIFIC PROCEDURE can be defined by source code written in a given language, by descriptions of a set of invocations of different tools, or, in recent decades, by scientific workflows, which have emerged as a paradigm for formally defining the set of data transformations that make up the scientific procedure of a computational experiment. Finally, the EQUIPMENT of a computational experiment is defined by the set of hardware and software resources that are required to execute the experiment. Some initiatives have…
  9. In our platform, users log in with an ORCID ID.
  10. We capture bibliographic data and information related to the description of the protocol, such as purpose, applications, advantages, limitations, etc.
  11. We capture a set of metadata for representing the sample. One item is the name of the organism, and the name of the organism comes from …
  12. In the case of the reagents, we retrieve them from the PubChem API.
  13. Users can draw their workflows, describe each step or instruction, and capture additional information such as the equipment, reagents, kits and software that participate in each step. Users can also include alert messages, etc.
  14. All these components have a counterpart in the computational science world. DATA is often represented by means of tables in databases, structured files, or even web services providing data. The SCIENTIFIC PROCEDURE can be defined by source code written in a given language, by descriptions of a set of invocations of different tools, or, in recent decades, by scientific workflows, which have emerged as a paradigm for formally defining the set of data transformations that make up the scientific procedure of a computational experiment. Finally, the EQUIPMENT of a computational experiment is defined by the set of hardware and software resources that are required to execute the experiment. Some initiatives have…
  15. Some initiatives have been proposed to target the reproducibility issues of the different components of experiments in computational science.
  16. DATA. Examples: RDA, Open Provenance Model, MIBBI, VCR…
  17. Some initiatives have been proposed to target the reproducibility issues of the parts of computational experiments. SCIENTIFIC PROCEDURE. Examples: Taverna, Pegasus, WINGS, Galaxy, SCUFL. Workflow management systems (WMS) and their related workflow languages are a way of encapsulating and preserving the scientific procedure in computational experiments. Platforms such as myExperiment allow it to be shared and reproduced.
  18. Finally, we found that there was a lack of approaches targeting the computational equipment at the time we started this work. Most of the work done in the area at that time focused on sharing virtual machine images as a way of providing exact copies of the execution environment. During the course of this work, other initiatives have appeared targeting this problem, as we will discuss later. EQUIPMENT. There is a lack of initiatives in this aspect. Some projects have aimed to address it during the time of this work. Most of them focus on the use of VMs -> BLACK BOXES (here we should motivate the need to expose the knowledge about the execution environment in order to increase reproducibility). Examples: CernVM, ReproZip, TIMBUS. NOTE: LINK THIS ONE WITH THE FOLLOWING SLIDE ABOUT THE OPEN RESEARCH PROBLEMS
  19. To share your research materials (RO as a social object). To facilitate reproducibility and reuse of methods. To be recognized and cited (even for constituent resources). To preserve results and prevent decay (curation of workflow definition; using provenance for partial rerun). Middleware.
  20. All these components have a counterpart in the computational science world. DATA is often represented by means of tables in databases, structured files, or even web services providing data. The SCIENTIFIC PROCEDURE can be defined by source code written in a given language, by descriptions of a set of invocations of different tools, or, in recent decades, by scientific workflows, which have emerged as a paradigm for formally defining the set of data transformations that make up the scientific procedure of a computational experiment. Finally, the EQUIPMENT of a computational experiment is defined by the set of hardware and software resources that are required to execute the experiment. Some initiatives have…
  21. The first open problem we identified is that… Open Research Problem 1: Computational Infrastructures are usually a predefined element of a Computational Scientific Workflow. The majority of computational scientists develop their experiments with an already existing infrastructure in mind, thus not considering its definition as part of the experiment. Open Research Problem 2: Execution Environments are poorly described, or even not described at all, when describing the results of an experiment. Often, the infrastructure used in the evaluation process is summarized by briefly explaining its overall hardware capabilities and the basic software stack. This lack of information compromises the conservation and reproducibility of the experiment. Open Research Problem 3: Current approaches for the conservation and reproducibility of Computational Scientific Experiments take into account only the computational process of the experiment (scientific procedure) and the data used and produced, but not the execution environment.
  22. Open Research Problem 1: Computational Infrastructures are usually a predefined element of a Computational Scientific Workflow. The majority of computational scientists develop their experiments with an already existing infrastructure in mind, thus not considering its definition as part of the experiment.
  23. Open Research Problem 2: Execution Environments are poorly described, or even not described at all, when describing the results of an experiment. Often, the infrastructure used in the evaluation process is summarized explaining briefly its hardware overall capabilities and the basic software stack. This lack of information compromises the conservation and reproducibility of the experiment.
  24. Open Research Problem 3: Current approaches for the conservation and reproducibility of Computational Scientific Experiments take into account only the computational process of the experiment (scientific procedure) and the data used and produced, but not the execution environment. Based on this study, in this work we focus on the aspects related to the reproducibility of the computational EQUIPMENT of a scientific experiment defined as a computational scientific workflow.
  25. That is, a set of models for annotating the original environment, which can be used for specifying and reproducing a new equivalent one using cloud solutions.
  26. As a result of this process, we developed the WICUS ontology network, which is composed…
  27. The first ontology is the workflow execution environment, which introduces the concept of workflow… Using this ontology we can describe the structure of a workflow, such as the ones depicted in this figure, which shows 3 workflows belonging to the Pegasus WMS, represented by the different figures and colors. Here we see how each of the workflows is composed of a set of subworkflows, each one of them related to different execution requirements, as well as the requirements defined by the WMS (Pegasus in this case).
  28. DEPENDENCIES: JAR FILES DEPEND ON THE JAVA VM
  29. Examples… Based on these models, which allow us to describe the execution environments of scientific workflows…
  30. This system is composed of 3 main stages, which process the available experimental materials to obtain the corresponding enactment files. These enactment files can be executed to deploy a reproduced execution environment. This overview can be decomposed into a set of modules and intermediate results generated during the process of reproducing an experiment. There are several input files and registries that can be used to extract information about the execution environment of the workflows: workflow specification (DAG, make, etc.), software component registry (TC), WMS annotations (manual), SVA catalog (manual).
  31. We evaluated a total of 6 different workflows. They expose different computational characteristics, from small ones, such as Internal Extinction, to really large ones, such as SoyKB, and from those requiring a small amount of execution time, such as xcorr or Montage, to the ones requiring 20 to 24 hours, such as BLAST. All these workflows have been developed by different institutions and published in different conferences and journals. Some of them date from a decade ago, whereas others have been published recently. We selected them based on their domain, the availability of their materials, and the support of their communities.
  32. Executed the 6 workflows in their original context. Documented their execution environment. Executed the ISA, obtaining enactment scripts. Enacted the reproduced environments and executed the workflows. Workflow results compared to the corresponding baseline executions. Montage: pHash similarity, factor 1.0, 0.85 factor. Epigenomics and SoyKB: non-deterministic; output files equal in terms of number of lines and content, with no errors. Internal Extinction and xcorr: exact same results, even though in the case of Internal Extinction they may vary. BLAST: equal results. With this we consider the reproduction of the execution environments to be successful.
  33. Executed the 6 workflows in their original context. Documented their execution environment. Executed the ISA, obtaining enactment scripts. Enacted the reproduced environments and executed the workflows. Workflow results compared to the corresponding baseline executions. Montage, which generates an image of the sky: pHash similarity, factor 1.0, 0.85 factor. Epigenomics and SoyKB: non-deterministic; output files equal in terms of number of lines and content, with no errors. Internal Extinction and xcorr: exact same results, even though in the case of Internal Extinction they may vary. BLAST: equal results. With this we consider the reproduction of the execution environments to be successful.
  34. Executed the 6 workflows in their original context. Documented their execution environment. Executed the ISA, obtaining enactment scripts. Enacted the reproduced environments and executed the workflows. Workflow results compared to the corresponding baseline executions. Montage, which generates an image of the sky: pHash similarity, factor 1.0, 0.85 factor. Epigenomics and SoyKB: non-deterministic; output files equal in terms of number of lines and content, with no errors. Internal Extinction and xcorr: exact same results, even though in the case of Internal Extinction they may vary. BLAST: equal results. With this we consider the reproduction of the execution environments to be successful.
  35. Executed the 6 workflows in their original context. Documented their execution environment. Executed the ISA, obtaining enactment scripts. Enacted the reproduced environments and executed the workflows. Workflow results compared to the corresponding baseline executions. Montage, which generates an image of the sky: pHash similarity, factor 1.0, 0.85 factor. Epigenomics and SoyKB: non-deterministic; output files equal in terms of number of lines and content, with no errors. Internal Extinction and xcorr: exact same results, even though in the case of Internal Extinction they may vary. BLAST: equal results. With this we consider the reproduction of the execution environments to be successful.
  36. Executed the 6 workflows in their original context. Documented their execution environment. Executed the ISA, obtaining enactment scripts. Enacted the reproduced environments and executed the workflows. Workflow results compared to the corresponding baseline executions. Montage, which generates an image of the sky: pHash similarity, factor 1.0, 0.85 factor. Epigenomics and SoyKB: non-deterministic; output files equal in terms of number of lines and content, with no errors. Internal Extinction and xcorr: exact same results, even though in the case of Internal Extinction they may vary. BLAST: equal results. With this we consider the reproduction of the execution environments to be successful.
  37. Executed the 6 workflows in their original context. Documented their execution environment. Executed the ISA, obtaining enactment scripts. Enacted the reproduced environments and executed the workflows. Workflow results compared to the corresponding baseline executions. Montage, which generates an image of the sky: pHash similarity, factor 1.0, 0.85 factor. Epigenomics and SoyKB: non-deterministic; output files equal in terms of number of lines and content, with no errors. Internal Extinction and xcorr: exact same results, even though in the case of Internal Extinction they may vary. BLAST: equal results. With this we consider the reproduction of the execution environments to be successful.
  38. Change the license to whichever one applies.
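
The notes on dependencies (JAR files depend on the Java VM) and on enactment files suggest a simple idea: once the software components of an execution environment and their dependencies are annotated, a deployment order can be derived from them. The following is a minimal sketch of that idea only, not the system described in the slides. The component names, the DEPENDENCIES mapping and the "install" lines are hypothetical placeholders; the ordering uses Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical annotations: each software component maps to the set of
# components it depends on. The only dependency taken from the notes is
# "JAR files depend on the Java VM"; every name here is illustrative.
DEPENDENCIES = {
    "java-vm": set(),
    "workflow-step.jar": {"java-vm"},
    "workflow-engine": {"java-vm"},
    "workflow": {"workflow-engine", "workflow-step.jar"},
}


def deployment_order(dependencies):
    """Return the components ordered so that each one follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def enactment_script(dependencies):
    """Render a placeholder enactment script, one 'install' line per component."""
    return "\n".join(f"install {name}" for name in deployment_order(dependencies))


if __name__ == "__main__":
    print(enactment_script(DEPENDENCIES))
```

Keeping the dependency annotations as explicit data, rather than hiding them inside a virtual machine image, is what lets a new, equivalent environment be specified and deployed instead of copying a black box.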