Bio-UnaGrid: Easing bioinformatics workflow execution

The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)

Bio-UnaGrid: Easing Bioinformatics Workflow Execution Using LONI
Pipeline and a Virtual Desktop Grid

Mario Villamizar, Harold Castro, David Mendez
Department of Systems and Computer Engineering
Universidad de los Andes
Bogotá, Colombia

Silvia Restrepo, Luis Rodriguez
Department of Biological Sciences
Universidad de los Andes
Bogotá D.C., Colombia


Agenda

The Problem Context
Related Work
Bio‐UnaGrid: Architecture
Bio‐UnaGrid: Implementation
Bio‐UnaGrid: Performance Tests
Conclusions and Future Work


Around the problem
Bioinformatics researchers use applications that require large computational
capabilities.

Cluster and grid computing are common solutions to support these
capabilities.

Cluster computing aggregates homogeneous
computational resources for a
specific research group or
organization.


Around the problem
Grid computing aggregates heterogeneous computational resources from
different organizations to support larger computational capabilities in
e-Science projects.

In grid computing, a set of
organizations with a common
objective (a virtual organization, VO) shares
computational resources.


Around the problem
High performance computing (HPC) infrastructures, such as cluster or grid
computing, provide results in a shorter time. However, when bioinformatics
researchers want to execute applications, they face several time-consuming
problems.


Around the problem
1. There are many application suites to run different analyses, so researchers
must learn the command line syntax (options, input and output files, distributed
execution environment, etc.) for hundreds of applications.


Around the problem
2. Researchers must learn commands to manage and use distributed
computing infrastructures.


Around the problem
3. Applications require specific and complex configurations to operate in
distributed environments.


Around the problem
4. Researchers frequently execute a set of applications in a sequential and/or
coordinated manner, called pipelines or workflows.

In many cases the workflow is executed manually by the researchers from the
command line: execute an application, wait, execute the next application,
wait again. This process may take a lot of time.


Around the problem
5. Very often researchers want to execute applications that require larger
processing capabilities than those provided by dedicated cluster computing
infrastructures, so they must wait weeks or months before getting their
results.


The problem
We were having the following problem:

Manual, command-based application and workflow execution requires
bioinformatics researchers to spend most of their time configuring
application parameters, managing HPC infrastructures, managing scientific
data, and linking partial results from one application to another.

When a new researcher or student wants to use the applications, he/she spends
a lot of time learning all of these things.


Related work
Several projects have been developed to facilitate automatic workflow
execution in bioinformatics and other scientific fields.


Related work
Most workflow tools have limitations for bioinformatics applications, such
as:

They require applications to be recompiled or modified.

It is too complex to modify or recompile every bioinformatics application.

They support internal data sources with specific data structures.

Bioinformatics data have different formats and they can be internal or
external.

They operate with specific platforms, and cluster or grid middlewares.

Institutions regularly have different cluster and grid configurations. In our
case we are part of different Grid projects such as GISELA, EGEE, and
OSG.


Related work
Most workflow tools have limitations for bioinformatics applications, such
as:

They regularly require researchers to use complex commands involving
time-consuming tasks.

Bioinformatics researchers should use easy GUIs to execute workflows.

They operate on dedicated cluster and grid infrastructures. However, on a
university or enterprise campus there are tens or hundreds of desktop
computers that are under-utilized.

Opportunistic infrastructures, or desktop grid and volunteer computing
systems (DGVCSs), can provide large processing capabilities at low cost.


How to solve these problems

Define an infrastructure into which existing bioinformatics applications
can be easily incorporated without modifications.

Workflows should be created through GUIs and executed on different HPC
infrastructures.

The infrastructure should also allow the use of different data sources
(internal and external), and the incorporation of MPI applications.

To get results faster, the infrastructure should operate on dedicated
computing infrastructures and on DGVCSs.

Bio-UnaGrid integrates the capabilities of LONI Pipeline and UnaGrid to
provide these features.


LONI Pipeline
LONI Pipeline is a free framework used to
execute neuroscience workflows.

From a computational perspective, LONI differs
from other workflow tools in several features:

It does not require external applications to be
recompiled.
It supports external data sources.
It is hardware platform independent.
It can be installed on different cluster or grid
middlewares using the Pipeline plugins.

From a research user perspective, LONI was
designed with usability, portability,
intuitiveness, transparency and abstraction of
cluster or grid infrastructures in mind.


LONI Pipeline Architecture


UnaGrid

Dedicated infrastructures are an unviable option in organizations or
countries with low financial resources. However, these organizations have
many computer labs which are not fully utilized by employees or university
students.


UnaGrid: Opportunistic virtual clusters
[Figure: a Linux processing virtual machine using X cores on a physical
machine of a computer room. Panel b: when there is no end user using the
physical machine.]

A virtual cluster is a set of interconnected commodity desktops that execute
virtual machines (VMs) in the background, at low priority, through
virtualization technologies. These VMs take advantage of the idle
processing capabilities available in computer labs on a university campus.



A virtual machine is executed on each computer of a lab and takes the role
of a cluster slave. All of the running virtual machines together make up a
virtual processing cluster. Each virtual cluster also needs a dedicated
node, which takes the role of the cluster master.



Each research group can define its own virtual clusters with custom
application environments (middlewares, applications, configurations, etc.)

A grid solution (several virtual clusters) can be deployed for supporting
the processing capabilities required by some applications.


Bio-UnaGrid Architecture

The LONI Pipeline Client is the main entry point of bioinformatics
researchers to the Bio-UnaGrid infrastructure.


Bio-UnaGrid Operation
Each researcher receives a username and a password.

He/she can proceed to create the workflows, using the bioinformatics
LONI Module Library, drag and drop, and other tools, provided by the
LONI Pipeline Client.

For each LONI Module (a bioinformatics application), the researcher
must define the input and output parameters, and how each LONI
Module is connected with other LONI Modules of the workflow.

Once a workflow has been created, researchers can execute it, sending it
to the LONI Pipeline Server.

Researchers can monitor and visualize the status of the workflow, and
download partial results, through the GUIs of the LONI Pipeline Client.

The Grid/Cluster infrastructure is transparent for researchers.
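
The way connected modules determine an execution order can be sketched as a
small dependency graph. This is only an illustration of the idea, not LONI
Pipeline's internal representation; the module names and the dictionary
layout are assumptions made for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a module, each value the set of
# modules whose outputs it consumes. The names are illustrative only.
workflow = {
    "blastx": {"load_sequences"},
    "hmm_search": {"load_sequences"},
    "merge_annotations": {"blastx", "hmm_search"},
}


def execution_order(dependencies):
    """Return one valid order in which the modules can be executed."""
    return list(TopologicalSorter(dependencies).static_order())


def ready_batches(dependencies):
    """Group modules into batches whose members can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        batches.append(batch)
        sorter.done(*batch)
    return batches


if __name__ == "__main__":
    print(execution_order(workflow))
    print(ready_batches(workflow))
```

Modules in the same batch do not depend on each other, so a server could
dispatch them to different cluster or grid nodes at the same time.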


Bio-UnaGrid: Bioinformatics Applications

For executing workflows, bioinformatics researchers do
not have to worry about applications’ commands or
complex HPC infrastructures.


Bio-UnaGrid: Incorporating Bioinformatics Applications
The process of incorporating existing and new bioinformatics application
suites, like NCBI BLAST, into the Bio-UnaGrid infrastructure is summarized in
five steps.

1) Installation and configuration of the application suite on a Shared
Storage System.

2) Creation of a LONI Pipeline Module using the LONI Pipeline Client for
each executable of the suite.

3) Individual tests for each LONI Pipeline Module of the suite from the
LONI Pipeline Client.

4) Storage of the LONI Pipeline Module in the LONI Pipeline Server.

5) Using the modules from workflows created by bioinformatics
researchers in the LONI Pipeline Client.
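
Step 2 amounts to describing an executable and its parameters so that a
command line can be generated for the researcher. The sketch below shows
that idea with plain Python classes. It is not the LONI Pipeline module
format; the class names and the installation path are hypothetical, and the
flags are those of the legacy NCBI blastall program.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    """One command-line parameter of a wrapped executable."""
    flag: str           # e.g. "-i"
    description: str    # text a GUI could show to the researcher
    required: bool = True


@dataclass
class Module:
    """Hypothetical description of one executable of a suite."""
    name: str
    executable: str     # path on the shared storage system (assumed)
    parameters: list = field(default_factory=list)

    def command(self, values):
        """Build the argument list for the given {flag: value} mapping."""
        args = [self.executable]
        for p in self.parameters:
            if p.flag in values:
                args += [p.flag, str(values[p.flag])]
            elif p.required:
                raise ValueError(f"missing required parameter {p.flag}")
        return args


blastn = Module(
    name="BLASTn",
    executable="/shared/blast-2.2.20/bin/blastall",  # hypothetical path
    parameters=[
        Parameter("-p", "program name"),
        Parameter("-d", "database"),
        Parameter("-i", "query file"),
        Parameter("-o", "output file", required=False),
    ],
)

if __name__ == "__main__":
    print(blastn.command({"-p": "blastn", "-d": "nt", "-i": "query.fasta"}))
```

Because the wrapper only assembles arguments, the application itself is
neither modified nor recompiled.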


Bio-UnaGrid: Incorporating Bioinformatics Applications


Bio-UnaGrid: MPI applications on Bio-UnaGrid

At the time of this development, LONI did not support MPI applications.

To allow the execution of MPI applications, we developed a wrapper, called
mpiJobManager, which uses the Distributed Resource Management
Application API (DRMAA).

To execute an MPI Application, such as mpiBLAST, a researcher uses the
MPI LONI Module of the MPI application to specify the number of MPI
processes to be used.

The LONI Pipeline Server receives the MPI job and proceeds to execute it
on the DRM slaves.
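
As a rough illustration of what a wrapper has to do with the requested
number of MPI processes, the sketch below turns that number into a launch
command. It is not the mpiJobManager code, which submits jobs through
DRMAA; the function name is hypothetical, and the example arguments follow
the usual mpiBLAST command line.

```python
import shlex


def build_mpi_command(num_processes, app_args, launcher="mpiexec"):
    """Return the argument list that launches app_args under MPI."""
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if not app_args:
        raise ValueError("app_args must name the MPI executable")
    return [launcher, "-n", str(num_processes), *app_args]


if __name__ == "__main__":
    cmd = build_mpi_command(
        10,
        ["mpiblast", "-p", "blastn", "-d", "nt",
         "-i", "query.fasta", "-o", "results.txt"],
    )
    # Print the command as it would be typed in a shell.
    print(shlex.join(cmd))
```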


Bio-UnaGrid: Implementation
[Figure: Bio-UnaGrid implementation. Bioinformatics infrastructure: LONI
Pipeline Server, OGE Master, GUMA Portal, User Database, NFS-NAS Server and
6 dedicated OGE slaves (24 cores). UnaGrid infrastructure: a Linux CVC with
21 OGE slaves (21 cores) running as VMs on a Windows 7 lab. Bioinformatics
researchers connect through the LONI Client and GUMA Client over the
LAN/Internet. Link speeds shown range from 155 Mbps to 10 Gbps.]

Bio-UnaGrid was implemented to support several genomic projects related
to coffee, potato and cassava, focused on increasing their production.



Software: LONI Pipeline Server version 5.0.2 and the DRM master of Oracle
Grid Engine (OGE) version 6.2 update 5.

6 dedicated servers, each with an Intel Xeon X5560 quad-core processor at
2.8 GHz and 4 GB of RAM.

A computer lab (Windows 7) with Intel i5 processors at 3.46 GHz and 8 GB of
RAM. 21 VMs (UnaGrid) were deployed, each configured with one CPU core
and 4 GB of RAM.

The user database was installed on a dedicated server using MySQL.

We use an NFSv3 NAS for shared storage.

All nodes are interconnected through 1 GbE and 10 GbE links.


Fault tolerance mechanisms were configured for bag-of-tasks (BoT)
applications using the OGE capabilities.

Four bioinformatics application suites have been installed on the
Bio-UnaGrid implementation and 21 LONI Modules were created.

NCBI BLAST version 2.2.20: BLASTn, BLASTp, BLASTx, tBLASTn,
tBLASTx and MEGABLAST.

HMMER version 2.3.2: Build, Search and Calibrate

InterProScan version 4.6: BlastProdom, Coils, Gene3D, PIR, Panther,
Pfam, SEG, SMART, SuperFamily, TIGRfam and fPrintScan

mpiBLAST version 1.6.0: mpiBLAST

Because mpiBLAST requires the use of an MPI implementation, the MPICH-2
implementation version 1.2.7 was installed.


Bio-UnaGrid: A Simple Workflow Example


Bio-UnaGrid: An Example of A More Complex Workflow


Performance Tests - BLASTn

We executed performance tests in both environments, in the dedicated
cluster and in the opportunistic UnaGrid CVC.

We selected the highly popular BLAST algorithm to execute these tests.

For testing the performance of Bio-UnaGrid for executing BoT applications
we used the NCBI BLASTn application, varying the number of CPU cores
and the size of the sequence set.

We executed BLASTn searches of nucleotide sets of 480 (487 KB) and
960 (954 KB) sequences against the nt database (28 GB).

The searches were executed using 5, 10, 15 and 20 independent BLASTn
processes (BoT BLASTn), using query segmentation.

Each search was executed using a single CPU core.
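
Query segmentation, as used above, means dividing the query sequence set
among independent tasks that each search the full database. The following
is a minimal sketch under assumptions: the function names are hypothetical
and the round-robin distribution is one possible policy, not necessarily
the one used in these tests.

```python
def read_fasta(text):
    """Split FASTA text into a list of (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def segment_queries(records, num_tasks):
    """Distribute records round-robin over num_tasks independent tasks."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    segments = [[] for _ in range(num_tasks)]
    for index, record in enumerate(records):
        segments[index % num_tasks].append(record)
    return segments


if __name__ == "__main__":
    queries = [(f"seq{i}", "ACGT") for i in range(480)]
    for task_id, segment in enumerate(segment_queries(queries, 20)):
        print(task_id, len(segment))
```

With 480 sequences and 20 tasks, each task receives 24 sequences and can be
scheduled on its own CPU core.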


Performance Tests - BLASTn

Execution time of BoT BLASTn searches between sets of 480 and 960
sequences, and the nt database.

480 seq. - Dedicated Cluster (20 Cores) 10.9x – UnaGrid (20 Cores) 13.1x
960 seq. - Dedicated Cluster (15 Cores) 6.1x – UnaGrid (20 Cores) 10x
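
One way to read these factors is as parallel efficiency, that is, the
factor divided by the number of cores. The slide does not state the
baseline, so the sketch below assumes the "x" values are speedups relative
to a single-core run; under a different baseline the percentages would not
apply.

```python
def parallel_efficiency(speedup, cores):
    """Efficiency = speedup / cores (1.0 would be ideal linear scaling)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return speedup / cores


# (sequences, environment, cores, reported factor) taken from the slide.
reported = [
    (480, "Dedicated Cluster", 20, 10.9),
    (480, "UnaGrid", 20, 13.1),
    (960, "Dedicated Cluster", 15, 6.1),
    (960, "UnaGrid", 20, 10.0),
]

if __name__ == "__main__":
    for sequences, environment, cores, factor in reported:
        eff = parallel_efficiency(factor, cores)
        print(f"{sequences} seq., {environment}, {cores} cores: {eff:.1%}")
```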


Performance Tests - mpiBLAST

For testing the execution of MPI applications on both environments, we
executed the same tests using mpiBLAST.

In these tests we executed MPI processes coordinated on 5, 10, 15 and 20
CPU cores.

In each test, the additional process called mpiJobManager was executed in
another CPU core.

We used database segmentation to divide the database into 5, 10, 15 and
20 partitions using the mpiformatdb tool installed with mpiBLAST.


Performance Tests - mpiBLAST

Execution time of mpiBLAST searches between sets of 480 and 960
sequences, and the nt database.



Conclusions
Using the idle processing capabilities available in
computer labs using Windows, Linux or Mac
desktops, Bio-UnaGrid provides additional
processing capabilities to those provided by
dedicated computational clusters.

Bio-UnaGrid allows bioinformatics researchers to
easily define workflows using GUIs and execute
them on cluster and grid computing infrastructures
with a simple click.

Bio-UnaGrid is highly extensible as existing
bioinformatics applications can be incorporated
without modifications.

With Bio-UnaGrid, researchers focus on analysis
of application results, not on technical issues of
distributed computing infrastructures.


Future Work
We will incorporate more bioinformatics
application suites into Bio-UnaGrid.

We will execute new performance tests with
other applications such as HMMER, MPI-
HMMER, ClustalW and ClustalW-MPI.

We will execute new performance tests in a
larger grid deployment involving hundreds of
heterogeneous desktop computers in different
administrative domains.

We will implement a shared storage system
more scalable than NFSv3.

We also plan to implement a fault tolerance
mechanism for MPI applications.


Thanks for your attention !!!


Contact Information

mj.villamizar24@uniandes.edu.co mjvc007@hotmail.com

http://twitter.com/mjvc007

http://linkedin.com/in/mariojosevillamizarcano

Mario José Villamizar Cano
