Reproducibility in e-science: the Simulagora platform

Simulagora (Euroscipy2014 - Logilab)

The other poster presented by Logilab covers Simulagora, an online service for collaborative numerical simulation. It lets users run computations in the cloud (and therefore without investing in hardware or system administration), and it emphasizes the traceability and reproducibility of computations as well as collaborative work (sharing software, data, and complete numerical studies).

Published in: Science

The present poster first reviews what reproducibility in e-science is all about[1][2][3]. Three important features of a reproducible study are listed, along with the challenges to be addressed so that a research work can really be reproduced by its own author (e.g. to improve upon it or just to check results obtained in the past), by reviewers (e.g. for careful analysis before publication), or by collaborators (e.g. to derive a new work). The Simulagora platform is then presented to show how it can help e-scientists produce or reproduce a research work. Future developments of the Simulagora platform are also outlined.

Replication

Replication is the first step towards reproducibility. It aims at making it as easy as possible to replay the complete research study without any modification of the input data or the software.

The first challenge is to find available hardware resources that are compatible with the workload to be replicated. Among other things, the computing power and storage capacity must be sufficient. Ideally, the hardware resources used at this step are the same as those used to carry out the initial study.

A second challenge is to replicate the complete software environment of the initial study, including all the dependencies. Note that proprietary software may be an intractable issue at this step, as licenses might not be available to the people who replicate the study at the time they do it.

Addressing these two challenges may take a significant amount of time, making the replication impracticable. The overall ease of the replication process is critical and therefore requires dedicated tools and methods.

Re-appropriation

For true reproducibility, the re-appropriation of the process described in a research paper is even more important. It requires a thorough analysis of the data workflow and of the software.
A final goal should be to derive some work from the initial one in order to test the robustness of the proposed approach. Documenting the data workflow of the initial study is of great help but is not sufficient here, as errors may occur. A tool that ensures the traceability of the data and software throughout the process would be highly valuable.

To analyze a given step of the process in depth, dedicated tools such as statistics libraries or data visualization software are necessary. They must be provided, or be easily installable, in the initial study's environment, since the data volume may be too big to be transferred.

When it comes to software review, a distinction may be made between high-level, study-specific software and widely accepted, community-supported, reliable libraries. Auditing the study-specific software should be as easy as possible, which might require editing code or input data.

http://www.logilab.fr
contact@logilab.fr

Cloud resource usage

Virtualization is an ideal work replication technology. Although it is usable in small-scale data centers, public clouds give access to virtually infinite computing power and storage capacity, very high availability and ever lower prices thanks to economies of scale. When used on a typical public cloud, Simulagora gets a new virtual machine within a few minutes of the user's request, fetches the input data and launches the required program, either for computation-intensive tasks or for an interactive working session.

Science-targeted machine images

Simulagora uses Debian[5] Linux as a basis for its machine images. Debian is a natural choice for scientists, thanks to its very large open source scientific software repository. Logilab actively contributes to Debian, notably by packaging free scientific software (recently, the Code_ASTER[6] finite element solver for mechanics). Simulagora's machine images are produced using SaltStack[7], which makes it easy to install and configure Simulagora's specific software.
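
As an illustration of the traceability tool called for under Re-appropriation, the following minimal Python sketch shows one way to describe a processing step by its software revision, its machine image version and the digests of its input data, and then to freeze that description. It is not Simulagora's implementation: the names StepRecord, describe_step, digest and fingerprint, as well as the choice of SHA-256, are assumptions made for this example only.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class StepRecord:
    """Hypothetical, immutable description of one processing step."""
    software_revision: str  # e.g. a version-control changeset identifier
    image_version: str      # version of the machine image used for the run
    input_digests: tuple    # sorted SHA-256 digests of the input data

    def fingerprint(self) -> str:
        """Return a stable identifier derived from all fields of the record."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a block of input data."""
    return hashlib.sha256(data).hexdigest()


def describe_step(software_revision, image_version, inputs):
    """Build a frozen record from a revision, an image version and raw inputs.

    Digests are sorted so that the order in which the inputs are listed
    does not change the resulting fingerprint.
    """
    digests = tuple(sorted(digest(blob) for blob in inputs))
    return StepRecord(software_revision, image_version, digests)
```

Under these assumptions, two descriptions built from the same revision, image version and inputs yield the same fingerprint regardless of input order, while a change to the software revision or to any input yields a different one. Because the record is frozen, it cannot be altered after it has been created.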
Collaborate on software – Mercurial

Simulagora uses Mercurial[14] repositories to store the program that launches the computations and that may use the pre-installed software. The Mercurial revision of this program is stored in Simulagora's database along with the machine image's version, the computation's input data, and results, … These repositories are exposed to power users, who may then write their own programs, and share, clone and modify them before pushing them back to the platform for use in a new study.

Model-based application

Traceability is achieved using a web application based on the open source CubicWeb[8] framework, written in Python[9]. The input, output and software of each processing step are described and stored in a PostgreSQL[10] database before the step is executed, and they then become immutable.

Collaborate on documentation – vcwiki

The wiki-like functionality of vcwiki[15] is being integrated into Simulagora. Documents can be edited through the web interface or pushed using the Mercurial version control system. Vcwiki currently uses reStructuredText but could be extended to other textual markups if users ask.

https://www.simulagora.com

Remote access and visualization

Simulagora users are granted full root access to their virtual machines. They can simply use OpenSSH[11] from a terminal to monitor their work or install software. Alternatively, they can use NoVNC[12] to remotely view a full XFCE[13] desktop environment directly in their WebSocket-capable browser, for zero-install usage of Simulagora.

Collaboration

Strictly speaking, this is not a requirement of research reproducibility, but rather an extension. Collaboration-like features are however often required during the reviewing process, if the reviewer wants to check the results with slightly modified algorithms or input data.
In other contexts, collaboration features are essential to make it easier for e-scientists to build on previous successful efforts, and might considerably improve collective efficiency.

The first need for proper collaboration in e-science is the ability to share code with collaborators in a secure way, as web-based software forges do. However, sharing data is more difficult: depending on the size of the data involved in the study, getting a copy might be impracticable. In this case, it is more suitable to push the software required by the new work to the initial data than the other way round.

Collaborate on studies – Clone, modify, compare

Simulagora users can allow people of their choice to access a complete study, including input data and results. They may also freeze the study so that collaborators can clone it as a starting point for a derived work, then perhaps compare the new results with those of the original study.

References:
[1] http://recomputation.org
[2] http://ropensci.org/blog/2014/06/09/reproducibility
[3] http://www.nature.com/nature/focus/reproducibility

Further reading:
[4] http://www.openstack.org
[5] http://www.debian.org
[6] http://www.code-aster.org
[7] http://www.saltstack.com
[8] http://www.cubicweb.org
[9] http://www.python.org
[10] http://www.postgresql.org
[11] http://www.openssh.org
[12] http://novnc.com
[13] http://www.xfce.org
[14] http://mercurial.selenic.com
[15] http://www.cubicweb.org/project/cubicweb-vcwiki
