We propose the Bio-UnaGrid infrastructure to facilitate the automatic execution of intensive-computing workflows that require the use of existing application suites and distributed computing infrastructures. With Bio-UnaGrid, bioinformatics workflows are easily created and executed, with a simple click and in a transparent manner, on different cluster and grid computing infrastructures (line command is not used). To provide more processing capabilities, at low cost, Bio-UnaGrid use the idle processing capabilities of computer labs with Windows, Linux and Mac desktop computers, using a key virtualization strategy. We implement Bio-UnaGrid in a dedicated cluster and a computer lab. Results of performance tests evidence the gain obtained by our researchers.
1. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: Easing Bioinformatics Workflow Execution Using LONI
Pipeline and a Virtual Desktop Grid
Mario Villamizar, Harold Castro, Silvia Restrepo, Luis Rodriguez
David Mendez Department of Biological Sciences
Departments of Systems and Computer Universidad de los Andes
Engineering Bogotá D.C., Colombia
Universidad de los Andes
Bogotá, Colombia
2. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Agenda
The Problem Context
Related Work
Bio‐UnaGrid: Architecture
Bio‐UnaGrid: Implementation
Bio‐UnaGrid: Performance Tests
Conclusions and Future Work
3. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Agenda
The Problem Context
Related Work
Bio‐UnaGrid: Architecture
Bio‐UnaGrid: Implementation
Bio‐UnaGrid: Performance Tests
Conclusions and Future Work
4. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Around the problem
Bioinformatics researchers use applications that require large computational
capabilities.
Cluster and grid computing are common solutions to support these
capabilities.
Cluster computing allows to
aggregate homogeneous
computational resources for an
specific research group or
organization
5. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Around the problem
Grid computing allows to aggregate heterogeneous computational resources
of different organizations to support larger computational capabilities in e-
Science projects.
In grid computing a set of
organization, with a common
objective (VO), share
computational resources.
6. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Around the problem
High performance computing (HPC) infrastructures, such as cluster or grid
computing, provide results in shorter time, however when bioinformatics
researchers want to execute applications, they face several consuming time
problems.
7. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Around the problem
1. There are many application suites to run different analysis, so researchers
must learn command line syntax (options, input and output files, distributed
execution environment, etc.) for hundreds of applications.
8. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Around the problem
2. Researchers must learn commands to manage and use distributed
computing infrastructures.
9. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Around the problem
3. Applications require specific and complex configurations to operate in
distributed environments.
10. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Around the problem
4. Researchers frequently execute a set of applications in a sequential and/or
coordinated manner, called pipelines or workflows.
In many cases the workflow execution is manually done by the researchers
using command line. Execute an application, wait …., execute the next
application, wait …. This process may take a lot of time.
11. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Around the problem
5. Very often researchers want to execute applications that require larger
processing capabilities than those provided by dedicated cluster computing
infrastructures, so they must wait weeks or months before getting their
results.
12. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
The problem
We were having the following problem:
The manual and command-based application and workflow execution require
bioinformatics researchers to spend most of their time in: configuring
application parameters, managing HPC infrastructures, managing scientific
data, and linking partial results from one application to another.
When a new researcher or student want to use applications he/she spends a
lot of time learning all those things.
13. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Agenda
The Problem Context
Related Work
Bio‐UnaGrid: Architecture
Bio‐UnaGrid: Implementation
Bio‐UnaGrid: Performance Tests
Conclusions and Future Work
14. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Related work
Several projects have been developed to facilitate the automatic workflow
execution in bioinformatics and other scientific fields.
15. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Related work
Most of workflow tools have limitations for bioinformatics applications such
as:
They require applications to be recompiled or modified.
It is too complex to modify or recompile every bioinformatics application.
They support internal data sources with specific data structures.
Bioinformatics data have different formats and they can be internal or
external.
They operate with specific platforms, and cluster or grid middlewares.
Institutions regularly have different cluster and grid configurations. In our
case we are part of different Grid projects such as GISELA, EGEE, and
OSG.
16. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Related work
Most of workflow tools have limitations for bioinformatics applications such
as:
They regularly require researchers to use complex commands involving
consuming-time tasks.
Bioinformatics researchers should use easy GUIs to execute workflows.
They operate on dedicated cluster and grid infrastructures, however in a
university or enterprise campus there are tens or hundreds of desktop
computers that are under-utilized.
Opportunistic or DGVCSs can provide large processing capabilities, at low
cost.
17. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
How to solve these problems
Define an infrastructure that allows that
existing bioinformatics applications can be
easily incorporated without modifications
LONI Pipeline
Workflows should be created through
GUIs and executed on different HPC
infrastructures.
The infrastructure should also allow the
use of different data sources (internal and
external), and the incorporation of MPI
applications.
UnaGrid
For getting results faster the infrastructure
Bio-UnaGrid integrates the should operate on dedicated computing
capabilities of LONI Pipeline and infrastructures and on DGVCS.
UnaGrid to provide these features.
18. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
LONI Pipeline
LONI Pipeline is a free framework used to
execute neuroscience workflows.
From a computational perspective, LONI differs
from other workflow tools in several features:
It does not require external application being
recompiled.
It supports external data sources.
It is hardware platform independent.
It can be installed on different cluster or grid
middlewares using the Pipeline plugins.
From a research user perspective, LONI was
designed having in mind features like usability,
portability, intuitiveness, transparency and
abstraction of cluster or grid infrastructures.
19. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
LONI Pipeline Architecture
20. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
UnaGrid
Dedicated infrastructures are an unviable option in organizations or
countries with low financial resources. However, these organizations have
many computer labs which are not fully utilized by employees or university
students.
21. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
UnaGrid: Opportunistic virtual clusters
X
Cores
Linux
Processing
Virtual Machine
Physical Machine of a
Computer Room
b. When there is not an End User
using the physical machine
A virtual cluster is a set of commodity and interconnected desktops
executing virtual machines (VMs) in background and low-priority through
virtualization technologies, these VMs take advantage of the available idle
processing capabilities in computer labs on an university campus.
22. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
UnaGrid: Opportunistic virtual clusters
A virtual machine is executed on each computer of a lab and it supports
the role of a cluster slave and all of these virtual machines on execution
make up a virtual processing cluster. A dedicated node is necessary for a
virtual cluster and it supports the role of the cluster master.
23. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
UnaGrid: Opportunistic virtual clusters
A virtual machine is executed on each computer of a lab and it supports
the role of a cluster slave and all of these virtual machines on execution
make up a virtual processing cluster. A dedicated node is necessary for a
virtual cluster and it supports the role of the cluster master.
24. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
UnaGrid: Opportunistic virtual clusters
A virtual machine is executed on each computer of a lab and it supports
the role of a cluster slave and all of these virtual machines on execution
make up a virtual processing cluster. A dedicated node is necessary for a
virtual cluster and it supports the role of the cluster master.
25. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
UnaGrid: Opportunistic virtual clusters
Each research group can define its own virtual clusters with custom
application environments (middlewares, applications, configurations, etc.)
A grid solution (several virtual clusters) can be deployed for supporting
the processing capabilities required by some applications.
26. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Agenda
The Problem Context
Related Work
Bio‐UnaGrid: Architecture
Bio‐UnaGrid: Implementation
Bio‐UnaGrid: Performance Tests
Conclusions and Future Work
27. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid Architecture
The LONI Pipeline Client is the main entry point of bioinformatics
researchers to the Bio-UnaGrid infrastructure.
28. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid Operation
Each researcher receives a username and a password.
He/she can proceed to create the workflows, using the bioinformatics
LONI Module Library, drag and drop, and other tools, provided by the
LONI Pipeline Client.
For each LONI Module (a bioinformatics application), the researcher
must define the input and output parameters, and how each LONI
Module is connected with other LONI Modules of the workflow.
Once a workflow has been created, researchers can execute it, sending it
to the LONI Pipeline Server.
Researcher can monitor and visualize the status of the workflow, and
download partial results, through GUIs of the LONI Pipeline Client.
The Grid/Cluster infrastructure is transparent for researchers.
29. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: Bioinformatics Applications
For executing workflows, bioinformatics researchers do
not have to worry about applications’ commands or
complex HPC infrastructures.
30. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: Incorporating Bioinformatics Applications
The process to incorporate existing and new bioinformatics application
suites, like NCBI BLAST, into Bio-UnaGrid infrastructure are summarized in
five steps.
1) Installation and configuration of the application suite on a Shared
Storage System.
2) Creation of a LONI Pipeline Module using the LONI Pipeline Client for
each executable of the suite.
3) Individual tests for each LONI Pipeline Module of the suite from the
LONI Pipeline Client.
4) Storage of the LONI Pipeline Module in the LONI Pipeline Server.
5) Using the modules from workflows created by bioinformatics
researchers in the LONI Pipeline Client.
31. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: Incorporating Bioinformatics Applications
32. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: MPI applications on Bio-UnaGrid
At the time of this development, LONI did not support MPI applications.
To allow the execution of MPI applications, we developed a wrapper, called
mpiJobManager, which uses the Distributed Resource Management
Application API (DRMAA).
To execute an MPI Application, such as mpiBLAST, a researcher uses the
MPI LONI Module of the MPI application to specify the number of MPI
processes to be used.
The LONI Pipeline Server receives the MPI job and proceeds to execute it
on the DRM slaves.
33. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Agenda
The Problem Context
Related Work
Bio‐UnaGrid: Architecture
Bio‐UnaGrid: Implementation
Bio‐UnaGrid: Performance Tests
Conclusions and Future Work
34. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: Implementation
Bioinformatics Infrastructure
6 Dedicated OGE Slaves – 24 Cores
1 Gbps S&C Infrastructure
LONI Client GUMA Client
10
Gbps NFS-NAS Server
10 Gbps
LONI Pipeline Server
10 1 OGE Master
Bioinformatics Gbps Gbps
Researcher GUMA Portal
10 Gbps 10 Gbps
155
LAN/ Mbps 10
Internet Gbps User Database
1 Gbps
Linux CVC with 21 OGE Slaves – 21 Cores
VM VM VM VM ... VM
Windows 7 Lab
UnaGrid Infrastructure
Bio-UnaGrid was implemented to support several genomic projects related
to coffee, potato and cassava, focused in increasing their production.
35. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: Implementation
The LONI Pipeline Server version 5.0.2 and the DRM Master of OGE
version 6.2 update 5.
6 dedicated servers - Intel Xeon X5560 Quad Core processor of 2.8 GHz
and 4 GB of RAM.
A computer lab (Windows 7) - Intel i5 processor of 3.46 GHz and 8 GB of
RAM. 21 VMs (UnaGrid) was deployed and configured with a CPU core
and 4 GB of RAM.
The Database User was installed on a dedicated server using MySQL.
We use an NFSv3-NAS.
All nodes are interconnected through 1 GbE and 10GbE links.
36. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: Implementation
Fault tolerance mechanisms were configured for BoT applications using the
OGE capabilities.
Four bioinformatics application suites have been installed on the Bio-
UnaGrid implementation and 21 LONI Modules were created.
NCBI BLAST version 2.2.20: BLASTn, BLASTp, BLASTx, tBLASTn,
tBLASTx and MEGABLAST.
HMMER version 2.3.2: Build, Search and Calibrate
InterProScan version 4.6: BlastProdom, Coils, Gene3D, PIR, Panther,
Pfam, SEG, SMART, SuperFamily, TIGRfam and fPrintScan
mpiBLAST version 1.6.0: mpiBLAST
Because mpiBLAST requires the use of an MPI implementation, the MPICH-
2 implementation version 1.2.7 was installed.
37. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: A Simple Workflow Example
38. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Bio-UnaGrid: An Example of A More Complex Workflow
39. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Agenda
The Problem Context
Related Work
Bio‐UnaGrid: Architecture
Bio‐UnaGrid: Implementation
Bio‐UnaGrid: Performance Tests
Conclusions and Future Work
40. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Performance Tests - BLASTn
We executed performance tests in both environments, in the dedicated
cluster and in the opportunistic UnaGrid CVC.
We selected the high popular BLAST algorithm to execute these tests.
For testing the performance of Bio-UnaGrid for executing BoT applications
we used the NCBI BLASTn application, varying the number of CPU cores
and the size of the sequence set.
We executed BLASTn searches between nucleotide sets of 480 (487 KB)
and 960 (954 KB) sequences, and the nt database (28 GB).
The searches were executed using 5, 10, 15 and 20, BLASTn independent
processes (BoT BLASTn), using query segmentation.
Each search was executed using a single CPU core.
41. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Performance Tests - BLASTn
Execution time of BoT BLASTn searches between sets of 480 and 960
sequences, and the nt database.
480 seq. - Dedicated Cluster (20 Cores) 10.9x – UnaGrid (20 Cores) 13.1x
960 seq. - Dedicated Cluster (15 Cores) 6.1x – UnaGrid (20 Cores) 10x
42. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Performance Tests - mpiBLAST
For testing the execution of MPI applications on both environments, we
executed the same tests using mpiBLAST.
In these tests we executed MPI processes coordinated on 5, 10, 15 and 20
CPU cores.
In each test, the additional process called mpiJobManager was executed in
another CPU core.
We used database segmentation to divide the database in 5, 10, 15 and
20 partitions using the mpiformatdb installed with mpiBLAST.
43. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Performance Tests - mpiBLAST
Execution time of mpiBLAST searches between sets of 480 and 960
sequences, and the nt database.
480 seq. - Dedicated Cluster (15 Cores) 2.3x – UnaGrid (20 Cores) 4.7x
960 seq. - Dedicated Cluster (20 Cores) 3.0x – UnaGrid (20 Cores) 3.7x
44. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Agenda
The Problem Context
Related Work
Bio‐UnaGrid: Architecture
Bio‐UnaGrid: Implementation
Bio‐UnaGrid: Performance Tests
Conclusions and Future Work
45. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Conclusions
Using the idle processing capabilities available in
computer labs using Windows, Linux or Mac
desktops, Bio-UnaGrid provides additional
processing capabilities to those provided by
dedicated computational clusters.
Bio-UnaGrid allows bioinformatics researchers to
easily define workflows using GUIs and execute
them on cluster and grid computing infrastructures
with a simple click.
Bio-UnaGrid is highly extensible as existing
bioinformatics applications can be incorporated
without modifications.
With Bio-UnaGrid, researchers focus on analysis
of application results, not on technical issues of
distributed computing infrastructures.
46. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Future Work
We will incorporate more bioinformatics
applications suites into Bio-UnaGrid.
We will execute new performance tests with
other applications such as HMMER, MPI-
HMMER, ClustalW and ClustalW-MPI.
We will execute new performance tests in a
larger grid deployment involving hundreds of
heterogeneous desktop computers in different
administrative domains.
We will implement a shared storage system
more scalable than NFSv3.
We also plan to implement a fault tolerance
mechanism for MPI applications.
47. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Thanks for your attention !!!
48. The Third International Conference on Bioinformatics,
Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)
Contact Information
mj.villamizar24@uniandes.edu.co mjvc007@hotmail.com
http://twitter.com/mjvc007
http://linkedin.com/in/mariojosevillamizarcano
Mario José Villamizar Cano