Towards Reproducible Science: a few building blocks from my personal experience
Invited keynote given at the Second International Workshop on Semantics for BioDiversity (http://fusion.cs.uni-jena.de/s4biodiv2017/), held in conjunction with ISWC2017 (https://iswc2017.semanticweb.org/)
Towards Reproducible Science: a few building blocks from my personal experience
1. Oscar Corcho
(with contributions from Olga Giraldo, Alexander García,
and Idafen Santana)
http://www.oeg-upm.net/index.php/en/researchareas/3-semanticscience/index.html
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Towards Reproducible Science: a
few building blocks from my
personal experience
ocorcho@fi.upm.es
@ocorcho
22/10/2017
S4BioDiv2017, Vienna
5. Towards Reproducible Science 5
Before continuing….
What does reproducibility
mean for you?
And for your colleagues?
And for the colleagues from
other disciplines?
10. Towards Reproducible Science
Experiment components
10
DATA | SCIENTIFIC PROCEDURE | EQUIPMENT
IN VIVO / IN VITRO | IN SILICO
This has attracted most
of the attention so far
11. Towards Reproducible Science
Block 1. Experimental Protocols
11
Olga Giraldo
Alexander Garcia
Explore alternative ways for documenting and
retrieving information from experimental protocols
Using Semantics and NLP in the SMART Protocols Repository. Giraldo O, García-Castro
A, Corcho O - ICBO, 2015
Using Semantics and Natural Language Processing in Experimental Protocols. Giraldo
O, García-Castro A, Figueredo J, Corcho O - J Biomedical Semantics, to appear
SMART protocols: semantic representation for experimental protocols. Giraldo O,
García-Castro A, Corcho O – Linked Science 2014
12. Towards Reproducible Science
What is an experimental protocol
Experimental protocols
are like cooking recipes
They have ingredients: reagents and samples.
They have appliances: equipment.
They have a list of instructions.
They have a total time.
They have critical steps…
Protocols should have complete information that allows anybody to recreate an experiment.
13. Towards Reproducible Science
Some of the issues we aim to address
• Incubate the
centrifuge tubes in a
water bath.
• Incubate the samples
for 5 min with gentle
shaking.
• Rinse DNA briefly in
1-2 ml of wash.
• Incubate at -20C
overnight.
Some protocols present insufficient granularity.
The instructions can be imprecise or ambiguous due to the use of natural language.
Protocols often lack structure.
14. Towards Reproducible Science
Bio-ontologies
OBI, EXPO, EXACT, BAO, IAO, ERO…
Data repository
for making data
available
Few efforts focus on representing and standardizing experimental protocols.
For reproducibility purposes, if the data must be available, so must the experimental protocol detailing the methodology followed to derive the data.
Resources for
reporting guidelines or
Minimum Information
standards
Ingredients for Improving Reproducibility
16. Towards Reproducible Science
Our approach
• Ontology model representing lab protocols
• Gazetteer-based method: use existing lists of named
entities
Lists of proper nouns, which refer to real-life entities
• Rule-based approaches:
write manual extraction
rules
• Development of a Gold
Standard of protocols
annotated manually
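To make the gazetteer-based step concrete, here is a minimal, hypothetical sketch; the real system uses GATE gazetteers built from the SMART Protocols vocabularies, so the entity lists and function below are illustrative assumptions only.

```python
# Minimal sketch of gazetteer-based annotation: scan protocol text against
# hand-curated lists of named entities. The lists here are tiny illustrative
# samples, not the real SMART Protocols gazetteers.
GAZETTEERS = {
    "Sample":     ["neural stem cells", "insect cells"],
    "Instrument": ["centrifuge", "water bath", "incubator"],
    "Reagent":    ["DMEM/F12", "glucose", "B27 supplement"],
}

def annotate(text):
    """Return (entity, category, offset) tuples found in the protocol text."""
    found = []
    lowered = text.lower()
    for category, entries in GAZETTEERS.items():
        for entry in entries:
            start = lowered.find(entry.lower())
            if start != -1:
                found.append((text[start:start + len(entry)], category, start))
    return found

print(annotate("Incubate the neural stem cells in DMEM/F12 in a water bath."))
```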
18. Towards Reproducible Science
The SIRO model
Sample/Specimen
(whole organism, anatomical
part, bodily fluids, etc.)
Instruments
(equipment, devices,
consumables, software)
Reagents
(chemical compounds,
mixtures)
Objective
(purpose)
The SIRO model
supports search,
retrieval and
classification of
experimental protocols
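SIRO itself is published as an ontology; purely as an illustration of how tagging each protocol along the four SIRO dimensions enables search and retrieval, here is a hypothetical in-memory sketch (class and field names are ours, not the ontology's).

```python
from dataclasses import dataclass, field

@dataclass
class SIROAnnotation:
    # Hypothetical record mirroring the four SIRO dimensions.
    protocol_id: str
    samples: set = field(default_factory=set)      # whole organism, anatomical part, fluids
    instruments: set = field(default_factory=set)  # equipment, devices, consumables, software
    reagents: set = field(default_factory=set)     # chemical compounds, mixtures
    objective: str = ""                            # purpose of the protocol

def find_protocols(protocols, reagent=None, instrument=None):
    """Retrieve protocols whose annotations match the requested facets."""
    return [p for p in protocols
            if (reagent is None or reagent in p.reagents)
            and (instrument is None or instrument in p.instruments)]

corpus = [SIROAnnotation("p1", {"neural stem cells"}, {"incubator"}, {"DMEM/F12"},
                         "assess neural migration in vitro")]
print(find_protocols(corpus, reagent="DMEM/F12"))
```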
19. Towards Reproducible Science
Design of semantic Gazetteer and JAPE rules
Design of semantic Gazetteers
• Facilitate the annotation of instances
related to:
Experimental actions
Instruments
Samples/ organisms
Reagents
Design of grammar
rules
• Facilitate the
annotation of
instructions
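The actual grammar rules are JAPE rules executed in GATE; as a language-neutral approximation of what such a rule does, here is a regex-based sketch that marks sentences starting with an experimental action as instructions (the verb list and pattern are illustrative assumptions).

```python
import re

# Tiny illustrative list of experimental actions; the real gazetteer is much larger.
ACTIONS = ["incubate", "rinse", "centrifuge", "add", "mix"]
INSTRUCTION = re.compile(
    r"^\s*(?P<action>" + "|".join(ACTIONS) + r")\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

protocol = """Incubate the samples for 5 min with gentle shaking.
Rinse DNA briefly in 1-2 ml of wash.
The expected yield is 50 ug."""

# Each match is annotated as an Instruction whose head is an Experimental Action.
for m in INSTRUCTION.finditer(protocol):
    print("Instruction:", m.group(0).strip(), "| action:", m.group("action"))
```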
20. Towards Reproducible Science
Development of a Gold Standard
100 protocols published in
several repositories
Annotators - experts in
life sciences
http://smart-protocols.labs.linkingdata.io/dist/dev/#/login
The SMART Protocols
Annotation Tool
Guidelines about what and how to annotate
Materials:
• BioTechniques,
• CSH-Protocols,
• Current protocols,
• Genet and Mol. Res,
• Journal of Biolog. Methods,
• Jove,
• MethodsX,
• Nature protocols exchange,
• Nature protocols
• Curso BIOS 2016, Colombia
• Universidad del Valle,
Colombia
• Japan (Database Center for
Life Science (DBCLS),
Robotic Biology Institute
(RBI), Spiber, Yachie-Lab,
University of Tokyo).
• Universidad Santiago de
Cali, Colombia
21. Towards Reproducible Science
Preliminary results
Annotation counts per entity (3 annotators); categories: sample / instrument / reagent / objective
Sample:
• Neural cell: 3 / 0 / 0 / 0
• neural stem cells (NSCs): 3 / 0 / 0 / 0
Instrument:
• Cell culture centrifuge: 0 / 3 / 0 / 0
• cell culture incubator: 0 / 3 / 0 / 0
• Microscope: 0 / 3 / 0 / 0
• Millicell culture plate inserts, 8-µm pore size: 0 / 3 / 0 / 0
Reagent:
• B27 supplement: 0 / 0 / 3 / 0
• DMEM/F12: 0 / 0 / 3 / 0
• FGF2 neutralizing antibody: 0 / 0 / 3 / 0
• glucose: 0 / 0 / 3 / 0
Objective:
• "Here we describe two migration assays, a matrigel migration assay and a Boyden chamber migration assay, which allow the in vitro assessment of neural migration under defined conditions" (Ladewig, Koch and Brüstle, 2014): 0 / 0 / 0 / 3

Annotation counts per entity (3 annotators); categories: sample / instrument / reagent
Reagent - Sample/Organism (votes split between the two categories):
• Ac-omega viral DNA: 1 / 2
• baculoviral: 1 / 2
• DNA insert: 2 / 1
• I-Sce I meganuclease: 1 / 2
Sample/Organism:
• Insect cells: 3
Instrument:
• spinner: 3
• Centrifuge: 3
• Flask: 3
Reagent:
• IPL-41 powdered: 3
• Liposome formulation: 3
• Phenol:chloroform: 3

Fleiss' Kappa for 3 raters = 1.0
Fleiss' Kappa for 3 raters = 0.755
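The agreement figures above are Fleiss' kappa values; a minimal sketch of how such a value is computed from per-item category counts (the example rows reuse counts from the first table):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])                     # raters per item (3 here)
    n_assign = n_items * n_raters
    # Proportion of all assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / n_assign for j in range(len(counts[0]))]
    # Observed agreement per item, then averaged.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    P_bar = sum(P_i) / n_items
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Two rows where all 3 raters agree ('sample', then 'instrument').
print(fleiss_kappa([[3, 0, 0, 0], [0, 3, 0, 0]]))   # perfect agreement -> 1.0
```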
22. Towards Reproducible Science
Our ongoing work
22
So far, this is OK for handling protocols that have already been reported in papers
Can we actually change the way in which
these protocols are produced?
23. Towards Reproducible Science
Platform for publishing semantic protocols
Features:
Open semantic publishing platform
o The protocols are born semantic
Self-describing documents
o Meaningful entities
o Machine-processable workflows
Documents will reference existing URIs
o Samples/organisms
o Reagents/chemical compounds
o Instruments
SMART Protocols Ontology /
Gazetteers / Grammar rules
UniProt
NCBI taxonomy
PubChem
Vendors
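For instance, to make reagents point at existing URIs rather than free text, a reagent name can be resolved against PubChem; the sketch below uses PubChem's public PUG REST service (endpoint as publicly documented; no error handling, and not the platform's actual code).

```python
import json
import urllib.parse
import urllib.request

def pubchem_cid(reagent_name):
    """Resolve a reagent name to a PubChem compound identifier (CID), if any."""
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           + urllib.parse.quote(reagent_name) + "/cids/JSON")
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    cids = data.get("IdentifierList", {}).get("CID", [])
    return cids[0] if cids else None

cid = pubchem_cid("glucose")
if cid is not None:
    # The protocol document can then point at a stable identifier instead of free text.
    print("https://pubchem.ncbi.nlm.nih.gov/compound/%d" % cid)
```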
30. Towards Reproducible Science
Block 2. Computational Environments
30
Idafen Santana
Is it possible to describe the main properties of the
Execution Environment of a Computational Scientific
Experiment and, based on this description, derive a
reproduction process for generating an equivalent
environment using virtualization techniques?
Conservation of Computational Scientific Execution Environments for Workflow-
based Experiments Using Ontologies. Santana-Pérez I. PhD thesis, 2016.
http://oa.upm.es/39520/
36. Towards Reproducible Science
A Research Object bundles and relates digital resources of a scientific experiment or investigation using standard mechanisms ("tool middleware")
http://www.w3.org/community/rosc/
http://www.researchobject.org/
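A Research Object, in essence, aggregates an experiment's resources plus metadata in a single package; the sketch below illustrates that idea only and does not follow the official RO bundle specification (all file names, roles and identifiers are placeholders).

```python
import json
import zipfile

# Illustrative aggregation of an experiment's resources; not the official
# Research Object bundle layout, just the underlying idea.
manifest = {
    "title": "Neural migration assay, run 2017-10",
    "creator": "https://orcid.org/0000-0000-0000-0000",       # placeholder ORCID
    "aggregates": [
        {"uri": "data/raw_counts.csv",          "role": "input data"},
        {"uri": "workflow/migration.cwl",       "role": "scientific procedure"},
        {"uri": "environment/annotations.ttl",  "role": "execution environment"},
        {"uri": "https://doi.org/10.xxxx/yyyy", "role": "paper"},   # placeholder DOI
    ],
}

with zipfile.ZipFile("research-object.zip", "w") as ro:
    ro.writestr("manifest.json", json.dumps(manifest, indent=2))
```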
39. Towards Reproducible Science
Open Research Problems
40
Computational Infrastructures are usually a predefined
element of a Computational Scientific Workflow.
40. Towards Reproducible Science
Open Research Problems
41
Computational Infrastructures are usually a predefined
element of a Computational Scientific Workflow.
Execution Environments are poorly described.
41. Towards Reproducible Science
Open Research Problems
42
Computational Infrastructures are usually a predefined
element of a Computational Scientific Workflow.
Execution Environments are poorly described.
Current reproducibility approaches for computational
experiments consider mostly data and procedure.
51. Towards Reproducible Science
Evaluation
Workflows reproduced
o 3 scientific domains
o 3 workflow management systems
o 6 different workflows
52
Domain: Seismic | Astronomy | Bio
WMS: dispel4py | Pegasus | Makeflow
Workflows: xcorr, Internal Extinction, Montage, Epigenomics, SoyKB, BLAST
Publication years: 2003, 2014, 2014, 2015, 2011, 2011
52. Towards Reproducible Science
Evaluation
53
Domain: Seismic | Astronomy | Bio
WMS: dispel4py | Pegasus | Makeflow
Workflows: xcorr, Internal Extinction, Montage, Epigenomics, SoyKB, BLAST
Results
[Diagram: FORMER EQUIPMENT -> ANNOTATE (SEMANTIC ANNOTATIONS) -> REPRODUCE -> EQUIVALENT EXECUTION ENVIRONMENT (CLOUD) -> COMPARE]
53. Towards Reproducible Science
Evaluation
54
[Table repeated from the previous slide]
Results
[FORMER EQUIPMENT -> ANNOTATE -> REPRODUCE -> COMPARE diagram repeated from the previous slide]
54. Towards Reproducible Science
Evaluation
55
[Table repeated from the previous slide]
Results
[FORMER EQUIPMENT -> ANNOTATE -> REPRODUCE -> COMPARE diagram repeated from the previous slide]
• Non-deterministic
• Standard and error output
• Generated files equivalent
55. Towards Reproducible Science
Evaluation
56
[Table repeated from the previous slide]
Results
[FORMER EQUIPMENT -> ANNOTATE -> REPRODUCE -> COMPARE diagram repeated from the previous slide]
• Same results
• Results from Int. Extinction
may vary
56. Towards Reproducible Science
Evaluation
57
[Table repeated from the previous slide]
Results
[FORMER EQUIPMENT -> ANNOTATE -> REPRODUCE -> COMPARE diagram repeated from the previous slide]
• Genomic data
• Exact match
57. Towards Reproducible Science
Evaluation
58
[Table repeated from the previous slide]
Results
[FORMER EQUIPMENT -> ANNOTATE -> REPRODUCE -> COMPARE diagram repeated from the previous slide]
58. Towards Reproducible Science
Summarizing
Two building blocks towards reproducibility of
scientific experiments
o In vivo/vitro
• Focus on providing structured descriptions of methods
(laboratory protocols)
• Our tools: ontologies, gazetteers, NLP tools and
automatic and manual annotation tools
• Challenge: make protocols more structured (and
semantic) from the beginning
o In silico
• Focus on the equipment (computational infrastructure)
for workflow-based experiments
• Ontologies, automatic and manual annotation tools, and
an execution environment
• Challenge: keep track of all types of appliances, and
make scientists work on providing annotations
Is this enough?
59
60. Oscar Corcho
(with contributions from Olga Giraldo, Alexander García,
and Idafen Santana)
Ontology Engineering Group
Universidad Politécnica de Madrid, Spain
Towards Reproducible Science: a
few building blocks from my
personal experience
ocorcho@fi.upm.es
@ocorcho
22/10/2017
S4BioDiv2017, Vienna
Experiments are central to empirical science: they are the foundation on which experimental sciences are built and improved.
They allow us to verify the hypotheses defined according to the scientific method.
They convince the reader (other scientists) that the conclusions of a study are correct.
For that, and to support the growth of science, they must be a repeatable process (both by the original scientist and by other scientists).
In recent decades there has been an evolution in the way experimental science is conducted, adding computational resources for solving scientific problems.
We have moved from a paradigm in which experiments were mainly conducted in laboratories or in nature, also referred to as in vitro or in vivo science,
to a paradigm in which simulations and mathematical models executed over computational resources are used for obtaining scientific insights, also referred to as computational science or in silico science.
Computational experiments complement rather than replace classical experiments.
In both cases, either in classical or computational experimental science, experiments must be a repeatable process
For trusting the scientific results
And for allowing the development of incremental research.
In this context, a definition of which kind of repeatability we are looking for, and how we plan to achieve it, must be provided.
The first thing we have to do is define how we are going to take care of the object of interest, which can be done in two main ways:
Preservation: the act of isolating the object, preventing any interaction that could damage it.
Conservation: the set of actions for studying the object and its associated features, allowing a supervised or restricted use of it.
These processes prolong the life of the object.
Once a plan for taking care of the object has been stated, we have two ways of obtaining a repetition of it:
A replication: a copy of the original object which is as close as possible to the original.
A reproduction: an object that exposes or mimics a certain set of features in the same way as the original one.
In this work we explore how conservation techniques can be applied for experimental science reproducibility
For achieving this conservation and reproducibility…
Any scientific experiment can be divided into three main components
DATA: the phenomena we study from nature, light from stars, genomes from plants or animals, reports in social science, etc.
SCIENTIFIC PROCEDURE: the set of steps that have to be performed in order to obtain the results of the experiment.
EQUIPMENT: the set of tools that are required by scientists in order to capture, process and interpret the desired data. From telescopes to microscopes, petri dishes or bunsen burners, there is a wide range of tools depending on the scientific domain.
All these components…
__________________________
All these components have a counterpart in the Computational Science world.
DATA is often represented by means of tables in databases, structured files, or even web services providing data.
The SCIENTIFIC PROCEDURE can be defined by the source code written in a given language or by the descriptions of a set of invocations of different tools,
… and in recent decades, as Scientific Workflows, which have emerged as a paradigm for formally defining the set of data transformations that perform the scientific procedure of a computational experiment.
Finally, the EQUIPMENT of a computational experiment is defined by the set of hardware and software resources that are required to execute the experiment.
Some initiatives have ….
In our platform the users log in with an ORCID iD.
We capture bibliographic data and information related to the description of the protocol, like purpose, applications, advantages, limitations, etc.
We capture a set of metadata for representing the sample, one of them being the name of the organism; and the name of the organism comes from …
And in the case of the reagents, we capture the reagents from the PubChem API.
The users can draw their workflows, describe each step or instruction and capture additional information such as equipment, reagents, kits, or software that participate in each step; the users can also include alert messages, etc.
Some initiatives have been proposed to target the reproducibility issues of the different components of experiments in computational science.
DATA
Examples: RDA, Open Provenance Model, MIBBI, VCR…
Some initiatives have been proposed to target the reproducibility issues of the parts of computational experiments.
SC. PROCEDURE
Examples: Taverna, Pegasus, WINGS, Galaxy, SCUFL
WMSs and their related workflow languages are a way of encapsulating and preserving the scientific procedure in computational experiments.
Platforms such as myExperiment allow sharing and reproducing them.
Finally, we found that there was a lack of approaches targeting the computational equipment at the time we started this work.
Most of the work done in the area by that time focused on sharing virtual machine images as a way of providing exact copies of the execution environment.
During the time of this work, some other initiatives have appeared targeting this problem, as we will discuss later.
-------------------------------------------------
EQUIPMENT
There is a lack of initiatives in this aspect
Some projects have aimed to approach it during the time of this work.
Most of them focus on the use of VM -> BLACK BOXES (here we should motivate the need of exposing the knowledge about the execution environment for increasing the reproducibility)
Examples: CernVM, ReproZip, TIMBUS
NOTE: LINK THIS ONE WITH THE FOLLOWING SLIDE ABOUT THE OPEN RESEARCH PROBLEMS
To share your research materials (RO as a social object)
To facilitate reproducibility and reuse of methods
To be recognized and cited (even for constituent resources)
To preserve results and prevent decay (curation of workflow definition; using provenance for partial rerun)
Middleware
The first open problem we identified is that…
____________________________________
Open Research Problem 1: Computational Infrastructures are usually a predefined element of a Computational Scientific Workflow. The majority of computational scientists develop their experiments with an already existing infrastructure in mind, thus not considering its definition as part of the experiment.
Open Research Problem 2: Execution Environments are poorly described, or even not described at all, when describing the results of an experiment. Often, the infrastructure used in the evaluation process is summarized explaining briefly its hardware overall capabilities and the basic software stack. This lack of information compromises the conservation and reproducibility of the experiment.
Open Research Problem 3: Current approaches for Computational Scientific Experiments conservation and reproducibility take into account only the computational process of the experiment (scientific procedure) and the data used and produced, but not the execution environment.
Based on this study, in this work we focus on the aspects related to the reproducibility of the computational EQUIPMENT of a scientific experiment defined as a computational scientific workflow.
That is, a set of models for annotating the original environment, which can then be used for specifying and reproducing a new, equivalent environment using cloud solutions.
As a result of this process, we developed the WICUS ontology network, which is composed…
The first ontology is the workflow execution environment, which introduces the concept of workflow…
Using this ontology we can describe the structure of a workflow, such as the ones depicted in this figure, which describes 3 workflows belonging to the Pegasus WMS, represented by the different figures and colors. Here we see how each of the workflows is composed of a set of subworkflows, each of them related to a different execution requirement, as well as the requirements defined by the WMS (Pegasus in this case).
- DEPENDENCIES: JAR files depend on the Java VM
Examples…
Based on these models, which allow us to describe execution environments of scientific workflows….
This system is composed of 3 main stages, which process the available experimental materials to obtain the corresponding enactment files.
These enactment files can be executed to deploy a reproduced execution environment.
This overview can be decomposed into a set of modules and intermediate results generated during the process of reproducing an experiment.
_________________________________________
There are several input files and registries that can be used to extract information about the execution environment of the workflows
Wf spec (DAG, make, etc.)
SW comp registry (TC)
WMS annotations (manual)
SVA catalog (manual)
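To illustrate the annotate-then-enact idea behind this pipeline (this is not the actual WICUS/ISA tooling; component names, dependencies and install commands are placeholders), a minimal dependency-resolving sketch:

```python
# Hypothetical, simplified annotation of an execution environment: each
# software component declares its dependencies and a placeholder install command.
COMPONENTS = {
    "java-8":  {"requires": [],         "install": "apt-get install -y openjdk-8-jre"},
    "pegasus": {"requires": ["java-8"], "install": "./install-pegasus.sh   # placeholder"},
    "montage": {"requires": [],         "install": "./install-montage.sh   # placeholder"},
}

def enactment_script(targets, components=COMPONENTS, done=None):
    """Resolve dependencies and emit install commands in a valid order."""
    done = done if done is not None else set()
    lines = []
    for name in targets:
        if name in done:
            continue
        lines += enactment_script(components[name]["requires"], components, done)
        lines.append(components[name]["install"])
        done.add(name)
    return lines

# The resulting "enactment script" would be run on a clean cloud VM.
print("\n".join(enactment_script(["pegasus", "montage"])))
```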
We evaluated a total of 6 different workflows.
All of them expose different computational characteristics,
from small ones, such as Internal Extinction, to really large ones such as SoyKB,
or those requiring a small amount of time for execution, such as xcorr or Montage, to the ones requiring 20 to 24 hours, such as BLAST.
All these workflows have been developed by different institutions, and published in different conferences and journals.
Some of them date from a decade ago, whereas others have been published recently.
We selected them based on their domain and on the availability of their materials and support by the communities.
Executed the 6 workflows in their original context
Documented their execution environment
Executed the ISA, obtaining enactment scripts
Enacted the reproduced environments and executed the workflows.
Workflow results compared to the corresponding baseline executions
Montage (which generates an image of the sky): pHash similarity, factor 1.0, 0.85 factor
Epigenomics and SoyKB: non-deterministic, out files equal in terms of number of lines and content, with no errors.
Internal Extinction and xcorr: exact same results, even though in the case of Internal Extinction they may vary
BLAST: equal results
With this we consider the reproduction of the execution environments to be successful
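As an illustration of the pHash-based comparison used for the Montage sky image (our reading is that 0.85 is the acceptance factor; file names below are placeholders), a sketch assuming the third-party Pillow and imagehash packages:

```python
from PIL import Image
import imagehash   # third-party perceptual hashing library

def phash_similarity(path_a, path_b):
    """Return a 0..1 similarity factor between two images based on pHash."""
    h_a = imagehash.phash(Image.open(path_a))
    h_b = imagehash.phash(Image.open(path_b))
    # Hamming distance over the 64-bit hash, turned into a similarity factor.
    return 1.0 - (h_a - h_b) / len(h_a.hash) ** 2

# Baseline mosaic vs. the one produced in the reproduced environment;
# 0.85 is used here as the acceptance factor mentioned in the notes above.
if phash_similarity("montage_baseline.png", "montage_reproduced.png") >= 0.85:
    print("images considered equivalent")
```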