Reproducibility in e-science: the Simulagora platform
This poster first reviews what reproducibility in e-science is all about[1][2][3]. Three important
features of a reproducible study are listed, along with the challenges that must be addressed so that a
research work can really be reproduced by its own author (e.g. to improve upon it or just to check results
obtained in the past), by reviewers (e.g. for careful analysis before publication), or by collaborators
(e.g. to derive a new work). The Simulagora platform is then presented to show how it can help e-scientists
produce or reproduce a research work. Future developments of the Simulagora platform are also outlined.
Replication 
Replication is the first step towards reproducibility. It aims at making it as
easy as possible to replay the complete research study without any
modification of the input data or the software.
The first challenge is to find available hardware resources that are 
compatible with the workload to be replicated. Among other things, the 
computing power and storage capacity must be sufficient. Ideally, the 
hardware resources used at this step are the same as those used to carry 
out the initial study. 
A second challenge is to replicate the complete software environment of 
the initial study, including all the dependencies. Note that proprietary 
software may be an intractable issue at this step, as licenses might not be 
available to the people who replicate the study at the time they do it. 
Addressing these two challenges may take a significant amount of time, making
the replication impractical. The overall ease of the replication process is
therefore critical, and requires dedicated tools and methods.
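As an illustration of the second challenge, the following minimal sketch records the software environment of a study so that a later replication can be checked against it. It is only an assumption of what such a tool could look like, restricted to Python packages; the function names `capture_environment` and `diff_packages` are hypothetical and are not part of Simulagora.

```python
import platform
import sys
from importlib import metadata


def capture_environment():
    """Return a dictionary describing the current software environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata is incomplete
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


def diff_packages(reference, current):
    """Map each package whose version differs to (reference, current)."""
    ref, cur = reference["packages"], current["packages"]
    return {
        name: (ref.get(name), cur.get(name))
        for name in sorted(set(ref) | set(cur))
        if ref.get(name) != cur.get(name)
    }
```

A manifest captured at the time of the initial study can be stored with its results; calling `diff_packages` with a manifest captured in the replication environment then lists every package that is missing or has a different version.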
Re-appropriation
For true reproducibility, the re-appropriation of the process described in a
research paper is even more important. It requires a thorough analysis of the
data workflow and of the software. A final goal should be to derive some work
from the initial one in order to test the robustness of the proposed approach.
Documenting the data workflow of the initial study is of great help but not 
sufficient here as errors may occur. A tool that ensures the traceability of the 
data and software throughout the process would be highly valuable. 
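One common way to make data traceable is to fingerprint every file with a cryptographic hash, so that any later change can be detected. The sketch below assumes this technique; it is not taken from Simulagora, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_tree(root):
    """Map the relative path of each file under root to its digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before, after):
    """Return the paths that were modified, added or removed."""
    return sorted(
        name
        for name in set(before) | set(after)
        if before.get(name) != after.get(name)
    )
```

Comparing the fingerprints taken before and after a processing step shows which files the step produced or altered, which is the kind of information a traceability tool needs to keep.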
To deeply analyze a given step of the process, dedicated tools like statistics 
libraries or data visualization software are necessary and must be provided or 
easily installable in the initial study's environment: the data volume may be too 
big to be transferred. 
When it comes to software review, a distinction may be made between high-level,
study-specific software and widely accepted, community-supported,
reliable libraries: auditing the study-specific software should be as easy as
possible, which might require code or input data editing.
http://www.logilab.fr 
contact@logilab.fr 
Cloud resource usage 
Virtualization is an ideal technology for replicating work.
Although it can be used in small-scale data centers, public
clouds additionally give access to virtually unlimited computing
power and storage capacity, very high availability and
ever lower prices thanks to economies of scale.
When used on a typical public cloud, Simulagora gets a
new virtual machine within a few minutes of the
user's request, fetches the input data and launches the
required program, either for a computation-intensive
task or for an interactive working session.
Science-targeted machine images
Simulagora uses Debian[5] Linux as a basis for its machine 
images. Debian is a natural choice for scientists, thanks to its 
very large open source scientific software repository. 
Logilab actively contributes to Debian, notably by packaging 
free scientific software (recently, the Code_ASTER[6] finite 
element solver for mechanics). 
Simulagora's machine images are produced using SaltStack[7],
which makes it easy to install and configure Simulagora-specific
software.
Collaborate on software – Mercurial 
Simulagora uses Mercurial[14] repositories to store the program
that launches the computations and that may use the pre-installed
software. The Mercurial revision of this program is
stored in Simulagora's database along with the machine image's
version, the computation's input data, and results, … These
repositories are exposed to power users, who may then write
their own programs, then share, clone and modify them before
pushing them back to the platform for use in a new study.
Model-based application
Traceability is achieved using a web application
based on the open source CubicWeb[8] framework,
written in Python[9]. The input, output and software of
each processing step are described and stored in a
PostgreSQL[10] database before the step is executed,
and then become immutable.
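The idea of describing a step before running it and then freezing that description can be sketched with a frozen Python dataclass. This is only an in-memory stand-in under stated assumptions: the `ProcessingStep` class and its field names are hypothetical, do not reflect Simulagora's actual CubicWeb schema, and all values are placeholders.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProcessingStep:
    """Hypothetical description of one processing step."""

    program_revision: str           # e.g. a Mercurial changeset identifier
    image_version: str              # version of the machine image used
    input_names: Tuple[str, ...]    # input data consumed by the step
    output_names: Tuple[str, ...]   # output data the step is declared to produce


# The description is created before the step is executed (placeholder values).
step = ProcessingStep(
    program_revision="0123abcd",
    image_version="image-1",
    input_names=("mesh.dat",),
    output_names=("result.dat",),
)

# Any later attempt to alter the record is rejected.
try:
    step.image_version = "image-2"
except FrozenInstanceError:
    print("a step description cannot be modified once created")
```

In a database-backed application the same guarantee would have to be enforced by the application or the database itself; the dataclass only illustrates the principle.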
Collaborate on documentation – vcwiki 
The wiki-like functionality from vcwiki[15] is being integrated into
Simulagora. Documents can be edited through the web interface or pushed
using the Mercurial version control system. Vcwiki currently uses
reStructuredText but could be extended to other textual markups if users ask.
https://www.simulagora.com 
Remote access and visualization 
Simulagora users are granted full root access to their virtual
machines. They can simply use OpenSSH[11] from a terminal
to monitor their work or install software. Alternatively, they can take
advantage of NoVNC[12] and remotely view a full XFCE[13]
desktop environment directly in their WebSocket-capable
browser, for zero-install usage of Simulagora.
Collaboration 
Strictly speaking, collaboration is not a requirement of research reproducibility, but
rather an extension of it. Collaboration features are however often
required during the review process, when the reviewer wants to check the
results with slightly modified algorithms or input data. In other contexts,
collaboration features are essential to make it easier for e-scientists to
base their efforts on previous successful ones, and might considerably
improve collective efficiency.
The first need for proper collaboration in e-science is the ability to share
code with collaborators in a secure way, as web-based software forges do.
However, sharing data is more difficult: depending on the size of the data
involved in the study under consideration, getting a copy might be impractical. In
this case, it is more suitable to bring the software required by the new work
to the initial data than the other way round.
Collaborate on studies – Clone, modify, compare 
Simulagora users can allow people of their choice to access a complete study, 
including input data and results. They may also freeze the study so that 
collaborators can clone it as a start for a derived work, then perhaps compare 
the new results with those of the original study. 
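The clone, modify, compare cycle can be sketched as follows. This is a minimal illustration, assuming that a study boils down to named numeric parameters and results; the `Study` class and the two helper functions are hypothetical and do not describe Simulagora's data model.

```python
import math
from dataclasses import dataclass, field, replace
from typing import Dict


@dataclass(frozen=True)
class Study:
    """Hypothetical stand-in for a frozen study."""

    name: str
    parameters: Dict[str, float]
    results: Dict[str, float] = field(default_factory=dict)


def clone_study(original, name, **changes):
    """Return a derived study with some parameters overridden and no results."""
    parameters = {**original.parameters, **changes}
    return replace(original, name=name, parameters=parameters, results={})


def compare_results(reference, derived, rel_tol=1e-6):
    """Return the result names that are missing or differ beyond rel_tol."""
    differing = []
    for key in sorted(set(reference.results) | set(derived.results)):
        a = reference.results.get(key)
        b = derived.results.get(key)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            differing.append(key)
    return differing
```

The original study is never modified: `clone_study` builds a new object, so the frozen study remains a stable reference against which the derived results can be compared.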
References:
[1] http://recomputation.org
[2] http://ropensci.org/blog/2014/06/09/reproducibility
[3] http://www.nature.com/nature/focus/reproducibility
Further reading:
[4] http://www.openstack.org
[5] http://www.debian.org
[6] http://www.code-aster.org
[7] http://www.saltstack.com
[8] http://www.cubicweb.org
[9] http://www.python.org
[10] http://www.postgresql.org
[11] http://www.openssh.org
[12] http://novnc.com
[13] http://www.xfce.org
[14] http://mercurial.selenic.com
[15] http://www.cubicweb.org/project/cubicweb-vcwiki

Simulagora (Euroscipy2014 - Logilab)
