Simulagora (Euroscipy2014 - Logilab)

Reproducibility in escience:
the Simulagora platform
The present poster first reviews what reproducibility in escience
is all about[1][2][3]. Three important features of a reproducible study are listed and the challenges to be
addressed exhibited, so that a research work can really be reproduced by its author itself (e.g. to improve upon it or just to check results obtained in the past), by
reviewers (e.g. for careful analysis before publication), or by collaborators (e.g. to derive a new work). The Simulagora platform is then presented to show how it can
help escientists
produce or reproduce a research work. Future developments of the Simulagora platform are also quoted.
Replication
Replication is the first step towards reproducibility. It aims at making it as
easy as possible to replay
the complete research study without any
modification of the input data or the software.
The first challenge is to find available hardware resources that are
compatible with the workload to be replicated. Among other things, the
computing power and storage capacity must be sufficient. Ideally, the
hardware resources used at this step are the same as those used to carry
out the initial study.
A second challenge is to replicate the complete software environment of
the initial study, including all the dependencies. Note that proprietary
software may be an intractable issue at this step, as licenses might not be
available to the people who replicate the study at the time they do it.
These two challenges may take a significant amount of time, making the
replication impracticable. The global ease of the replication process is
critical, thus requiring dedicated tools and methods.
Reappropriation
For a true reproducibility, the reappropriation
of the process described in a
research paper is even more important. It requires a thorough analysis of the
data workflow and of the software. A final goal should be to derive some work
from the initial one in order to test the robustness of the proposed approach.
Documenting the data workflow of the initial study is of great help but not
sufficient here as errors may occur. A tool that ensures the traceability of the
data and software throughout the process would be highly valuable.
To deeply analyze a given step of the process, dedicated tools like statistics
libraries or data visualization software are necessary and must be provided or
easily installable in the initial study's environment: the data volume may be too
big to be transferred.
When it comes to software review, a distinction may be made between highlevel,
studyspecific
software and widely accepted, communitysupported,
reliable libraries: auditing the studyspecific
software should be as easy as
possible, which might require code or input data editing.
http://www.logilab.fr
contact@logilab.fr
Cloud resource usage
Virtualization is an ideal work replication technology.
Although usable on smallscale
data centers, public
clouds allow access to virtually infinite computing
power and storage capacity, very high availability and
always lower prices thanks to economies of scale.
When used on a typical public cloud Simulagora gets a
new virtual machine within a few minutes after the
user's request, gets the input data and launches the
required program, either for computationintensive
tasks or an interactive working session.
Sciencetargeted
machine images
Simulagora uses Debian[5] Linux as a basis for its machine
images. Debian is a natural choice for scientists, thanks to its
very large open source scientific software repository.
Logilab actively contributes to Debian, notably by packaging
free scientific software (recently, the Code_ASTER[6] finite
element solver for mechanics).
Simulagora's machine images are produced using Saltstack[7]
to make it easy to install and configure Simulagora's specific
software.
Collaborate on software – Mercurial
Simulagora uses Mercurial[14] repositories to store the program
that launches the computations and that may use the preinstalled
software. The Mercurial version of this program is
stored in Simulagora's database along with the machine image's
version, the computation's input data, and results, … These
repositories are exposed to power users, who may then write
their own programs, share, clone and modify them before
pushing back to the platform for use in a new study.
Modelbased
application
The traceability is achieved using a WEB application
based on the Open Source CubicWeb[8] framework,
written in Python[9]. The input, output and software of
each processing step is described and stored in a
PostgreSQL[10] database before being executed and
become immutable.
Collaborate on documentation – vcwiki
The wikilike
functionality from vcwiki[15], is being integrated into
Simulagora. Documents can be edited through the WEB interface or pushed
using the Mercurial version control system. Vcwiki currently uses
RestructuredText but could be extended to other textual markups if users ask.
https://www.simulagora.com
Remote access and visualization
Simulagora users are granted full root access to their virtual
machines. They can simply use OpenSSH[11] from a terminal
and monitor their work or install software. Or they can take
advantage of NoVNC[12] and remotely view a full XFCE[13]
desktop environment directly in their websocket
capable
browser, for a zeroinstall
usage of Simulagora.
Collaboration
Strictly speaking, this is not a requirement of research reproducibility, but
rather an extension. Collaborationlike
features are however often
required during the reviewing process, if the reviewer wants to check the
results with slightly modified algorithms or input data. In other contexts,
collaboration features are essential to make it easier for escientists
to
base their efforts on previous successful ones, and might considerably
improve the collective efficiency.
The first need for proper collaboration in escience
is the ability to share
code with collaborators in a secure way, as WEBbased
software forges
do.
However, sharing data is more difficult: depending on the size of the data
involved in the considered study, getting a copy might be impracticable. In
this case, it is more suitable to push the software required by the new work
to the initial data than the other way round.
Collaborate on studies – Clone, modify, compare
Simulagora users can allow people of their choice to access a complete study,
including input data and results. They may also freeze the study so that
collaborators can clone it as a start for a derived work, then perhaps compare
the new results with those of the original study.
[1] http://recomputation.org
[2] http://ropensci.org/blog/2014/06/09/reproducibility
[3] http://www.nature.com/nature/focus/reproducibility
Further reading:
[4] http://www.openstack.org
[5] http://www.debian.org
[6] http://www.codeaster.
org
[7] http://www.saltstack.com
[8] http://www.cubicweb.org
[9] http://www.python.org
[10] http://www.postgresql.org
[11] http://www.openssh.org
[12] http://novnc.com
[13] http://www.xfce.org
[14] http://mercurial.selenic.com
[15] http://www.cubicweb.org/project/cubicwebvcwiki
References:

Simulagora (Euroscipy2014 - Logilab)

More Related Content

What's hot

Similar to Simulagora (Euroscipy2014 - Logilab)

More from Logilab

Recently uploaded

Simulagora (Euroscipy2014 - Logilab)