• Save
Bio-UnaGrid: Easing bioinformatics workflow execution
Upcoming SlideShare
Loading in...5
×
 

Bio-UnaGrid: Easing bioinformatics workflow execution

on

  • 1,029 views

We propose the Bio-UnaGrid infrastructure to facilitate the automatic execution of intensive-computing workflows that require the use of existing application suites and distributed computing ...

We propose the Bio-UnaGrid infrastructure to facilitate the automatic execution of intensive-computing workflows that require the use of existing application suites and distributed computing infrastructures. With Bio-UnaGrid, bioinformatics workflows are easily created and executed, with a simple click and in a transparent manner, on different cluster and grid computing infrastructures (line command is not used). To provide more processing capabilities, at low cost, Bio-UnaGrid use the idle processing capabilities of computer labs with Windows, Linux and Mac desktop computers, using a key virtualization strategy. We implement Bio-UnaGrid in a dedicated cluster and a computer lab. Results of performance tests evidence the gain obtained by our researchers.

Statistics

Views

Total Views
1,029
Slideshare-icon Views on SlideShare
1,028
Embed Views
1

Actions

Likes
0
Downloads
0
Comments
0

1 Embed 1

http://www.linkedin.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Bio-UnaGrid: Easing bioinformatics workflow execution Bio-UnaGrid: Easing bioinformatics workflow execution Presentation Transcript

    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)Bio-UnaGrid: Easing Bioinformatics Workflow Execution Using LONI Pipeline and a Virtual Desktop GridMario Villamizar, Harold Castro, Silvia Restrepo, Luis Rodriguez David Mendez Department of Biological SciencesDepartments of Systems and Computer Universidad de los Andes Engineering Bogotá D.C., Colombia Universidad de los Andes Bogotá, Colombia
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) AgendaThe Problem ContextRelated WorkBio‐UnaGrid: ArchitectureBio‐UnaGrid: ImplementationBio‐UnaGrid: Performance TestsConclusions and Future Work
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) AgendaThe Problem ContextRelated WorkBio‐UnaGrid: ArchitectureBio‐UnaGrid: ImplementationBio‐UnaGrid: Performance TestsConclusions and Future Work
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Around the problemBioinformatics researchers use applications that require large computationalcapabilities.Cluster and grid computing are common solutions to support thesecapabilities. Cluster computing allows to aggregate homogeneous computational resources for an specific research group or organization
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Around the problemGrid computing allows to aggregate heterogeneous computational resourcesof different organizations to support larger computational capabilities in e-Science projects. In grid computing a set of organization, with a common objective (VO), share computational resources.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Around the problemHigh performance computing (HPC) infrastructures, such as cluster or gridcomputing, provide results in shorter time, however when bioinformaticsresearchers want to execute applications, they face several consuming timeproblems.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Around the problem1. There are many application suites to run different analysis, so researchersmust learn command line syntax (options, input and output files, distributedexecution environment, etc.) for hundreds of applications.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Around the problem2. Researchers must learn commands to manage and use distributedcomputing infrastructures.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Around the problem3. Applications require specific and complex configurations to operate indistributed environments.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Around the problem4. Researchers frequently execute a set of applications in a sequential and/orcoordinated manner, called pipelines or workflows.In many cases the workflow execution is manually done by the researchers using command line. Execute an application, wait …., execute the next application, wait …. This process may take a lot of time.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Around the problem5. Very often researchers want to execute applications that require largerprocessing capabilities than those provided by dedicated cluster computinginfrastructures, so they must wait weeks or months before getting theirresults.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) The problemWe were having the following problem:The manual and command-based application and workflow execution requirebioinformatics researchers to spend most of their time in: configuringapplication parameters, managing HPC infrastructures, managing scientificdata, and linking partial results from one application to another.When a new researcher or student want to use applications he/she spends alot of time learning all those things.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) AgendaThe Problem ContextRelated WorkBio‐UnaGrid: ArchitectureBio‐UnaGrid: ImplementationBio‐UnaGrid: Performance TestsConclusions and Future Work
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Related workSeveral projects have been developed to facilitate the automatic workflowexecution in bioinformatics and other scientific fields.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Related workMost of workflow tools have limitations for bioinformatics applications suchas: They require applications to be recompiled or modified.It is too complex to modify or recompile every bioinformatics application. They support internal data sources with specific data structures.Bioinformatics data have different formats and they can be internal orexternal. They operate with specific platforms, and cluster or grid middlewares.Institutions regularly have different cluster and grid configurations. In ourcase we are part of different Grid projects such as GISELA, EGEE, andOSG.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Related workMost of workflow tools have limitations for bioinformatics applications suchas: They regularly require researchers to use complex commands involving consuming-time tasks.Bioinformatics researchers should use easy GUIs to execute workflows. They operate on dedicated cluster and grid infrastructures, however in a university or enterprise campus there are tens or hundreds of desktop computers that are under-utilized.Opportunistic or DGVCSs can provide large processing capabilities, at lowcost.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) How to solve these problems Define an infrastructure that allows that existing bioinformatics applications can be easily incorporated without modifications LONI Pipeline Workflows should be created through GUIs and executed on different HPC infrastructures. The infrastructure should also allow the use of different data sources (internal and external), and the incorporation of MPI applications. UnaGrid For getting results faster the infrastructure Bio-UnaGrid integrates the should operate on dedicated computing capabilities of LONI Pipeline and infrastructures and on DGVCS.UnaGrid to provide these features.
    • The Third International Conference on Bioinformatics,Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) LONI Pipeline LONI Pipeline is a free framework used to execute neuroscience workflows. From a computational perspective, LONI differs from other workflow tools in several features: It does not require external application being recompiled. It supports external data sources. It is hardware platform independent. It can be installed on different cluster or grid middlewares using the Pipeline plugins. From a research user perspective, LONI was designed having in mind features like usability, portability, intuitiveness, transparency and abstraction of cluster or grid infrastructures.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)LONI Pipeline Architecture
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) UnaGrid Dedicated infrastructures are an unviable option in organizations orcountries with low financial resources. However, these organizations havemany computer labs which are not fully utilized by employees or universitystudents.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) UnaGrid: Opportunistic virtual clusters X Cores Linux Processing Virtual Machine Physical Machine of a Computer Room b. When there is not an End User using the physical machine A virtual cluster is a set of commodity and interconnected desktopsexecuting virtual machines (VMs) in background and low-priority throughvirtualization technologies, these VMs take advantage of the available idleprocessing capabilities in computer labs on an university campus.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) UnaGrid: Opportunistic virtual clusters A virtual machine is executed on each computer of a lab and it supportsthe role of a cluster slave and all of these virtual machines on executionmake up a virtual processing cluster. A dedicated node is necessary for avirtual cluster and it supports the role of the cluster master.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) UnaGrid: Opportunistic virtual clusters A virtual machine is executed on each computer of a lab and it supportsthe role of a cluster slave and all of these virtual machines on executionmake up a virtual processing cluster. A dedicated node is necessary for avirtual cluster and it supports the role of the cluster master.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) UnaGrid: Opportunistic virtual clusters A virtual machine is executed on each computer of a lab and it supportsthe role of a cluster slave and all of these virtual machines on executionmake up a virtual processing cluster. A dedicated node is necessary for avirtual cluster and it supports the role of the cluster master.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) UnaGrid: Opportunistic virtual clusters Each research group can define its own virtual clusters with customapplication environments (middlewares, applications, configurations, etc.) A grid solution (several virtual clusters) can be deployed for supportingthe processing capabilities required by some applications.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) AgendaThe Problem ContextRelated WorkBio‐UnaGrid: ArchitectureBio‐UnaGrid: ImplementationBio‐UnaGrid: Performance TestsConclusions and Future Work
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Bio-UnaGrid ArchitectureThe LONI Pipeline Client is the main entry point of bioinformaticsresearchers to the Bio-UnaGrid infrastructure.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Bio-UnaGrid OperationEach researcher receives a username and a password.He/she can proceed to create the workflows, using the bioinformaticsLONI Module Library, drag and drop, and other tools, provided by theLONI Pipeline Client.For each LONI Module (a bioinformatics application), the researchermust define the input and output parameters, and how each LONIModule is connected with other LONI Modules of the workflow.Once a workflow has been created, researchers can execute it, sending itto the LONI Pipeline Server.Researcher can monitor and visualize the status of the workflow, anddownload partial results, through GUIs of the LONI Pipeline Client. The Grid/Cluster infrastructure is transparent for researchers.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)Bio-UnaGrid: Bioinformatics Applications For executing workflows, bioinformatics researchers do not have to worry about applications’ commands or complex HPC infrastructures.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Bio-UnaGrid: Incorporating Bioinformatics ApplicationsThe process to incorporate existing and new bioinformatics applicationsuites, like NCBI BLAST, into Bio-UnaGrid infrastructure are summarized infive steps.1) Installation and configuration of the application suite on a Shared Storage System.2) Creation of a LONI Pipeline Module using the LONI Pipeline Client for each executable of the suite.3) Individual tests for each LONI Pipeline Module of the suite from the LONI Pipeline Client.4) Storage of the LONI Pipeline Module in the LONI Pipeline Server.5) Using the modules from workflows created by bioinformatics researchers in the LONI Pipeline Client.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)Bio-UnaGrid: Incorporating Bioinformatics Applications
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Bio-UnaGrid: MPI applications on Bio-UnaGridAt the time of this development, LONI did not support MPI applications.To allow the execution of MPI applications, we developed a wrapper, calledmpiJobManager, which uses the Distributed Resource ManagementApplication API (DRMAA).To execute an MPI Application, such as mpiBLAST, a researcher uses theMPI LONI Module of the MPI application to specify the number of MPIprocesses to be used.The LONI Pipeline Server receives the MPI job and proceeds to execute iton the DRM slaves.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) AgendaThe Problem ContextRelated WorkBio‐UnaGrid: ArchitectureBio‐UnaGrid: ImplementationBio‐UnaGrid: Performance TestsConclusions and Future Work
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Bio-UnaGrid: Implementation Bioinformatics Infrastructure 6 Dedicated OGE Slaves – 24 Cores 1 Gbps S&C Infrastructure LONI Client GUMA Client 10 Gbps NFS-NAS Server 10 Gbps LONI Pipeline Server 10 1 OGE Master Bioinformatics Gbps Gbps Researcher GUMA Portal 10 Gbps 10 Gbps 155 LAN/ Mbps 10 Internet Gbps User Database 1 Gbps Linux CVC with 21 OGE Slaves – 21 Cores VM VM VM VM ... VM Windows 7 Lab UnaGrid InfrastructureBio-UnaGrid was implemented to support several genomic projects relatedto coffee, potato and cassava, focused in increasing their production.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Bio-UnaGrid: ImplementationThe LONI Pipeline Server version 5.0.2 and the DRM Master of OGEversion 6.2 update 5.6 dedicated servers - Intel Xeon X5560 Quad Core processor of 2.8 GHzand 4 GB of RAM.A computer lab (Windows 7) - Intel i5 processor of 3.46 GHz and 8 GB ofRAM. 21 VMs (UnaGrid) was deployed and configured with a CPU coreand 4 GB of RAM.The Database User was installed on a dedicated server using MySQL.We use an NFSv3-NAS.All nodes are interconnected through 1 GbE and 10GbE links.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Bio-UnaGrid: ImplementationFault tolerance mechanisms were configured for BoT applications using theOGE capabilities.Four bioinformatics application suites have been installed on the Bio-UnaGrid implementation and 21 LONI Modules were created. NCBI BLAST version 2.2.20: BLASTn, BLASTp, BLASTx, tBLASTn, tBLASTx and MEGABLAST. HMMER version 2.3.2: Build, Search and Calibrate InterProScan version 4.6: BlastProdom, Coils, Gene3D, PIR, Panther, Pfam, SEG, SMART, SuperFamily, TIGRfam and fPrintScan mpiBLAST version 1.6.0: mpiBLASTBecause mpiBLAST requires the use of an MPI implementation, the MPICH-2 implementation version 1.2.7 was installed.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)Bio-UnaGrid: A Simple Workflow Example
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)Bio-UnaGrid: An Example of A More Complex Workflow
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) AgendaThe Problem ContextRelated WorkBio‐UnaGrid: ArchitectureBio‐UnaGrid: ImplementationBio‐UnaGrid: Performance TestsConclusions and Future Work
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Performance Tests - BLASTnWe executed performance tests in both environments, in the dedicatedcluster and in the opportunistic UnaGrid CVC.We selected the high popular BLAST algorithm to execute these tests.For testing the performance of Bio-UnaGrid for executing BoT applicationswe used the NCBI BLASTn application, varying the number of CPU coresand the size of the sequence set.We executed BLASTn searches between nucleotide sets of 480 (487 KB)and 960 (954 KB) sequences, and the nt database (28 GB).The searches were executed using 5, 10, 15 and 20, BLASTn independentprocesses (BoT BLASTn), using query segmentation.Each search was executed using a single CPU core.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Performance Tests - BLASTn Execution time of BoT BLASTn searches between sets of 480 and 960 sequences, and the nt database.480 seq. - Dedicated Cluster (20 Cores) 10.9x – UnaGrid (20 Cores) 13.1x960 seq. - Dedicated Cluster (15 Cores) 6.1x – UnaGrid (20 Cores) 10x
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Performance Tests - mpiBLASTFor testing the execution of MPI applications on both environments, weexecuted the same tests using mpiBLAST.In these tests we executed MPI processes coordinated on 5, 10, 15 and 20CPU cores.In each test, the additional process called mpiJobManager was executed inanother CPU core.We used database segmentation to divide the database in 5, 10, 15 and20 partitions using the mpiformatdb installed with mpiBLAST.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Performance Tests - mpiBLAST Execution time of mpiBLAST searches between sets of 480 and 960 sequences, and the nt database.480 seq. - Dedicated Cluster (15 Cores) 2.3x – UnaGrid (20 Cores) 4.7x960 seq. - Dedicated Cluster (20 Cores) 3.0x – UnaGrid (20 Cores) 3.7x
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) AgendaThe Problem ContextRelated WorkBio‐UnaGrid: ArchitectureBio‐UnaGrid: ImplementationBio‐UnaGrid: Performance TestsConclusions and Future Work
    • The Third International Conference on Bioinformatics,Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) ConclusionsUsing the idle processing capabilities available incomputer labs using Windows, Linux or Macdesktops, Bio-UnaGrid provides additionalprocessing capabilities to those provided bydedicated computational clusters.Bio-UnaGrid allows bioinformatics researchers toeasily define workflows using GUIs and executethem on cluster and grid computing infrastructureswith a simple click.Bio-UnaGrid is highly extensible as existingbioinformatics applications can be incorporatedwithout modifications.With Bio-UnaGrid, researchers focus on analysisof application results, not on technical issues ofdistributed computing infrastructures.
    • The Third International Conference on Bioinformatics,Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Future Work We will incorporate more bioinformatics applications suites into Bio-UnaGrid. We will execute new performance tests with other applications such as HMMER, MPI- HMMER, ClustalW and ClustalW-MPI. We will execute new performance tests in a larger grid deployment involving hundreds of heterogeneous desktop computers in different administrative domains. We will implement a shared storage system more scalable than NFSv3. We also plan to implement a fault tolerance mechanism for MPI applications.
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011)Thanks for your attention !!!
    • The Third International Conference on Bioinformatics, Biocomputational Systems and Biotechnologies (BIOTECHNO 2011) Contact Informationmj.villamizar24@uniandes.edu.co mjvc007@hotmail.com http://twitter.com/mjvc007 http://linkedin.com/in/mariojosevillamizarcano Mario José Villamizar Cano