Biostatflow

BioStatFlow –© INRA DJ 2014
PMFB –UMR 1332, INRA, F-33140 Villenave d’Ornon
djacob@bordeaux.inra.fr
http://biostatflow.org

A web-based tool for Statistical Analysis

BioStatFlow is a web application designed for the statistical analysis of "omics" data,
including metabolomics data. It handles data sets generated from experiments.
Omics experiments yield large amounts of data, too much to be interpreted by eye.
A combination of multivariate and univariate data analyses is therefore essential to
extract and visualize the information of interest. Biologists need basic knowledge of
the statistics employed in order to contribute critically to, and evaluate, their
experimental design, protocols, and results.
Nevertheless, there is still a lack of useful, fast, and easy online statistical tools for those
who are not experts in statistics. BioStatFlow has been developed to meet this need.

Motivation of the design of BioStatFlow
1. The main goal of BioStatFlow is to make statistical tools accessible to biologists who are
not specialists. It is designed to execute statistical analyses sequentially, i.e. as a linear chain
of statistical processing steps, called a workflow in BioStatFlow. Drawing on a set of identified
use cases (mainly around omics data), BioStatFlow is based on the following typical workflow:
A set of analyses is first proposed as a fixed sequence in order to normalize the
dataset. At this stage, users have to follow the order of the sequence. Because
the levels of some analytical variables (features) cannot be determined owing to
experimental issues with the technical equipment, or because different
experiments need to be compared, missing value estimation and data scaling
are helpful pre-processing steps. This is the default use case (default
workflow). Then, users can choose any additional methods depending on
the dataset and the corresponding experimental design (i.e. factors), in order i)
to visualize the whole dataset, ii) to reveal biomarkers, iii) to analyse interactions
between factors, iv) to discriminate groups, and so on.
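
As an illustration of the two pre-processing steps named above, here is a minimal sketch in Python. It assumes column-mean imputation for missing values and unit-variance scaling; these are common choices picked for the example, not necessarily the methods BioStatFlow applies, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def impute_missing(matrix):
    """Replace None entries with the mean of the observed values in the same column.

    Assumption: column-mean imputation, chosen only for illustration.
    """
    columns = list(zip(*matrix))
    col_means = []
    for col in columns:
        observed = [v for v in col if v is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(row)]
            for row in matrix]

def autoscale(matrix):
    """Center each column on 0 and scale it to unit variance."""
    columns = list(zip(*matrix))
    centers = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead to avoid an error.
    spreads = [pstdev(col) or 1.0 for col in columns]
    return [[(v - centers[j]) / spreads[j] for j, v in enumerate(row)]
            for row in matrix]

# Rows are samples, columns are features; None marks an undetermined level.
raw = [[1.0, 10.0], [None, 30.0], [3.0, 20.0]]
filled = impute_missing(raw)
scaled = autoscale(filled)
```

After these two steps every feature has mean 0 and unit variance, so features measured on different scales become comparable.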
The input of each treatment is the output of the previous treatment.
If a treatment generates a data table (matrix) as an output, that table is used as
input to the next step. Otherwise, if the treatment only generates results (text
and images) but does not change the input table, the input is passed on unchanged
as its output.
Each treatment can be written as an R script (most common) or as a Perl
script, embedding binary tools (such as compiled Matlab scripts).
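
The chaining rule described above can be sketched as follows. This is a minimal Python illustration, not BioStatFlow's actual code: it assumes each treatment is a function returning a pair (new table or None, results), and the two sample treatments are hypothetical.

```python
def run_workflow(matrix, treatments):
    """Apply treatments in order; each returns (new_matrix_or_None, results)."""
    all_results = []
    current = matrix
    for name, treatment in treatments:
        new_matrix, results = treatment(current)
        all_results.append((name, results))
        # A treatment that only reports results leaves the data table unchanged,
        # so its input is passed on as its output.
        if new_matrix is not None:
            current = new_matrix
    return current, all_results

def double_values(matrix):
    # Hypothetical treatment that transforms the data table.
    return [[2 * v for v in row] for row in matrix], "values doubled"

def summarize(matrix):
    # Hypothetical treatment that only produces a report.
    return None, f"{len(matrix)} rows"

final, log = run_workflow(
    [[1, 2], [3, 4]],
    [("double", double_values), ("summary", summarize)],
)
```

Here the table leaving the workflow is the one produced by the first treatment, because the second one only reports results.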

Overview of how to use BioStatFlow
Tutorial: http://biostatflow.org/doc/pg?id=tutorial:start

STEP 1: Input dataset
Provided by the user, by uploading a correctly formatted dataset file, then "Next Step"

STEP 2: Workflow selection
Modify parameters and/or add another analysis, then "Launch"

STEP 3: Visualization of results
Select a result, zoom in/out, or download

2. BioStatFlow allows bioinformaticians to easily integrate a new statistical analysis
method into a workflow, or even to create their own workflows. To this end, the analysis
scripts and the workflow definition files are stored in separate catalogs of the application,
and configuration files enable integration without modifying the application source code.

The BioStatFlow software components consist of:
1. The BioStatFlow core, which is responsible for:
• managing the input-output through the GUI (datasets, workflows, parameters of each analysis, and results),
• creating batch scripts, from the workflow definition files,
• launching the analysis scripts,
• managing the persistent sessions (including access management)
2. The workflow and statistical analysis catalogs. These catalogs may be enriched at any time by adding statistical
analyses or even a new workflow.
3. The repository of persistent sessions. To save your work in a persistent session, you must register first.
[Figure: Architecture, with the three components listed above labelled 1, 2 and 3]

Workflow and Statistical Analysis catalogs

Catalog's Root
  Workflow 1
    def            Definition files (e.g. PCA.def, …)
    doc            Documentation files (e.g. PCA.xml, …)
    scripts        Script files (e.g. PCA.R, …)
    workflow.def   Workflow definition file
  Workflow 2
  …
  Workflow n

• A workflow is implemented as a directory containing three sub-directories plus one definition file.
• The 'def' sub-directory:
  • contains the analysis definition files, which are used to automatically build the GUI input
    forms for the analysis parameters (with default values) and the header of the R scripts,
    which initializes the parameters with the values given by the user.
• The 'doc' sub-directory:
  • contains the analysis documentation files describing the analysis parameters shown in the
    input form.
• The 'scripts' sub-directory:
  • contains the analysis scripts themselves (without the initialization of their parameters,
    which is handled by the automatically generated header of each script).
• The 'workflow.def' file:
  • contains the list of all analyses within the workflow.
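
The layout described above (three sub-directories plus workflow.def) can be checked with a few lines of code. The sketch below is a hypothetical helper in Python, not part of BioStatFlow; it relies only on the directory and file names given in the text and makes no assumption about the content of workflow.def.

```python
from pathlib import Path
import tempfile

EXPECTED_SUBDIRS = ("def", "doc", "scripts")

def missing_parts(workflow_dir):
    """Return the names of the expected sub-directories or files that are absent."""
    root = Path(workflow_dir)
    missing = [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
    if not (root / "workflow.def").is_file():
        missing.append("workflow.def")
    return missing

# Build a throwaway workflow directory (deliberately without 'doc') to try the check.
with tempfile.TemporaryDirectory() as tmp:
    wf = Path(tmp) / "workflow1"
    for sub in ("def", "scripts"):
        (wf / sub).mkdir(parents=True)
    (wf / "workflow.def").write_text("")
    report = missing_parts(wf)
```

A check of this kind could be run before adding a new workflow to the catalog, so that an incomplete directory is caught early.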

An example: PCA
Overview of the interaction mechanism of the different file types
[Figure: diagram linking PCA.def (header of the R script, automatically generated), PCA.R (the R script, written by the provider) and PCA.xml, with inputs dataInMat and dataInFact, outputs dataOutMat and dataOutFact, and the labels Params and Results]
[Figure: PCA.def, with the automatically generated GUI and the automatically generated header of the R code]
[Figure: PCA.R, the R code written by the provider]
[Figure: results of the PCA example]
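
To make the idea of an automatically generated script header concrete, here is a minimal Python sketch that merges default parameter values with the values entered by the user and emits R assignment lines. The format of the .def files is not given in this document, so the dictionary input, the parameter names (nbComp, scaleData, title) and both functions are assumptions made for illustration only.

```python
def r_literal(value):
    """Render a Python value as an R literal (booleans, numbers and strings only)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

def build_header(defaults, user_values):
    """Merge defaults with user values and emit one R assignment per parameter."""
    params = {**defaults, **user_values}
    return "\n".join(f"{name} <- {r_literal(value)}" for name, value in params.items())

# Hypothetical PCA parameters; the user overrides only the number of components.
defaults = {"nbComp": 2, "scaleData": True, "title": "PCA"}
header = build_header(defaults, {"nbComp": 3})
```

The resulting text would be placed in front of the provider's script, which can then use the parameters without initializing them itself.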

Repository of persistent sessions

Repository's Root
  Session 1
    query          Sub-directory of input data
      imported_matrix_file.csv
    bswf           Sub-directory of the analysis results
      p0 : Data Formatting
      p1 : Split names
      …
      p5 : Scaling
      …
    sessparams     Session parameters
  Session 2
  …
  Session n

3. BioStatFlow helps disseminate the results of statistical analyses by saving them in a
persistent session so that they can be fully restored. One can thus provide the session
identifier when publishing results (see the tutorial).
To disseminate your data and the associated statistical analyses, share the URL of the form:
http://biostatflow.org/view/<SESSION ID>

Example with session ID G633: http://biostatflow.org/view/G633
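
Building such a link is a simple string operation. The short Python sketch below uses the URL pattern given above; the function name and the percent-encoding of the identifier are illustrative choices, not something the BioStatFlow documentation prescribes.

```python
from urllib.parse import quote

BASE_URL = "http://biostatflow.org/view/"

def session_url(session_id):
    """Return the sharing URL for a persistent session identifier."""
    if not session_id:
        raise ValueError("session_id must not be empty")
    # Percent-encode the identifier so that unexpected characters cannot break the URL.
    return BASE_URL + quote(session_id, safe="")

url = session_url("G633")
```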
Motivation of the design of BioStatFlow: Dissemination
[Figure labels: Datasets, R code, Results of statistical analyses]

Some Links
A Spotlight on BioStatFlow in MetaboNews
http://www.metabonews.ca/Feb2015/MetaboNews_Feb2015.htm#spotlight
BioStatFlow is available online:
http://biostatflow.org
A Tutorial on BioStatFlow
http://biostatflow.org/doc/pg?id=tutorial:start

