bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data

Presented at BOSC 2012 by J Wang.

Speaker notes:
  • This vision paper advocates using scientific workflows for the analysis of large-scale biological data. I'll first talk about scientific workflows and what they do, then discuss some challenges in processing large-scale biological data and our approaches to those challenges.
  • To overcome these challenges, we execute bioinformatics tools using Distributed Data-Parallel (DDP) frameworks. These frameworks have been shown to work well for big-data problems by exploiting data locality: data sets are partitioned, and computational tasks operate mainly on data held on local disk. While using MapReduce and similar frameworks for bioinformatics problems is not new, our goal is to make these frameworks easier to use: we are creating configurable and reusable DDP components in a scientific workflow system that can use different execution engines and computational environments. See the sketch below.
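A minimal, engine-agnostic sketch of the DDP pattern described in this note: partition the input, run a map task on each partition in parallel, then reduce the partial results. The names `partition`, `map_task`, and `reduce_task` are illustrative, not bioKepler API, and a local process pool stands in for a real DDP engine.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

def partition(records, n_splits):
    """Round-robin split of the input into n_splits equal-sized parts."""
    return [records[i::n_splits] for i in range(n_splits)]

def map_task(chunk):
    """Stand-in for a bioinformatics tool run on one local partition."""
    return [len(seq) for seq in chunk]  # e.g., per-sequence computation

def reduce_task(left, right):
    """Merge two partial results into one."""
    return left + right

if __name__ == "__main__":
    sequences = ["ACGT", "ACGTACGT", "TTGACA", "GATTACA"]
    splits = partition(sequences, n_splits=2)
    # Each worker handles its own partition, mirroring data locality.
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(map_task, splits))
    print(reduce(reduce_task, partials))  # merged result: [4, 6, 8, 7]
```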
  • The bioinformatician builds the pipeline, or workflow, choosing from a set of bioinformatics tools. Each of these tools is wrapped as a "bioActor", and the user chooses from a library of bioActors. Each bioActor has an implementation that runs the tool for a set of data-parallel execution patterns; for example, BLAST supports MapReduce and MasterSlave, but HMMER only supports All-Pairs. When the workflow is executed, the director, or scheduler, creates an executable workflow plan, deciding which pattern to use for each bioActor (see the sketch below). The plan is then handed to the execution engine, which deploys and executes it on the compute resources, transferring data as needed. In addition to the library of bioActors and the data-parallel director, Kepler provides several other components: the provenance system records data lineage and execution history, the reporting framework generates reports about workflow results, the run manager searches past workflow executions and provides a tagging interface to label and describe executions, and the fault-tolerance module provides error-handling mechanisms such as a framework for designing alternative pipelines or sub-workflows to execute when the primary sub-workflow fails.
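A toy sketch of the director's pattern-selection step: each bioActor declares which DDP patterns it supports, and the director picks one per tool when building the executable plan. The `SUPPORTED_PATTERNS` registry and `choose_pattern` function are invented for illustration; this is not the actual bioKepler scheduler.

```python
# Which patterns each tool supports, per the note above.
SUPPORTED_PATTERNS = {
    "BLAST": ["MapReduce", "MasterSlave"],
    "HMMER": ["All-Pairs"],
}

# A hypothetical engine-side preference order.
ENGINE_PREFERENCE = ["MapReduce", "All-Pairs", "MasterSlave"]

def choose_pattern(tool):
    """Pick the first preferred pattern the tool supports."""
    for pattern in ENGINE_PREFERENCE:
        if pattern in SUPPORTED_PATTERNS[tool]:
            return pattern
    raise ValueError(f"no executable plan for {tool}")

plan = {tool: choose_pattern(tool) for tool in SUPPORTED_PATTERNS}
print(plan)  # {'BLAST': 'MapReduce', 'HMMER': 'All-Pairs'}
```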
  • At the top is Kepler with distributed data-parallel components, including both actors and directors. The actors provide data-parallel patterns such as Map, Reduce, and All-Pairs, plus actors for I/O: accessing data in HDFS, on local disk, or in Amazon S3 (sketched below). The directors run the workflow on a distributed data-parallel execution engine; we plan to support Hadoop and Stratosphere, and are looking into others. The execution engines in turn use various computational environments such as cloud, grid, or cluster.
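A hypothetical sketch of the I/O-actor idea in this layer: one data-source interface with interchangeable local, HDFS, and S3 backends. The class names are invented, not actual Kepler actors, and the HDFS and S3 backends are stubbed rather than wired to real clients.

```python
from abc import ABC, abstractmethod

class FileDataSource(ABC):
    """One interface for staging data in, regardless of where it lives."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...

class LocalSource(FileDataSource):
    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()

class HdfsSource(FileDataSource):
    def read(self, path: str) -> bytes:
        # Would delegate to an HDFS client (e.g., WebHDFS); stubbed here.
        raise NotImplementedError("requires an HDFS client")

class S3Source(FileDataSource):
    def read(self, path: str) -> bytes:
        # Would delegate to an S3 client; stubbed here.
        raise NotImplementedError("requires an S3 client")

def stage_in(source: FileDataSource, path: str) -> bytes:
    """A workflow step depends only on the interface, not the backend."""
    return source.read(path)
```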
  • Here's an example bioActor that runs BLAST using MapReduce. BLAST performs sequence alignment and has two inputs: a set of query sequences and a reference database. This bioActor is implemented as a pipeline: it reads and splits the query sequences in FileDataSource, runs BLAST in parallel on each split in Map, merges the outputs into a single file in Reduce, and stages the result out in FileDataSink. This is just one pattern for BLAST; an alternative is to split the reference database, or to split both the query sequences and the reference database. A standalone sketch of the split-queries pipeline follows.
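A minimal standalone sketch of the same split/map/merge pipeline. It assumes NCBI BLAST+ (`blastn`) is on the PATH and a pre-built BLAST database named `refdb`; both assumptions, the input file name, and the helper names are illustrative, not part of the workflow definition. Local threads stand in for the DDP engine's parallel Map tasks.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def split_fasta(fasta_path, n_splits):
    """FileDataSource step: split the query FASTA into n_splits files."""
    records = Path(fasta_path).read_text().split(">")[1:]
    splits = []
    for i in range(n_splits):
        part = Path(f"queries_{i}.fa")
        part.write_text("".join(">" + rec for rec in records[i::n_splits]))
        splits.append(part)
    return splits

def run_blast(query_file):
    """Map step: align one split of queries against the reference DB."""
    out = query_file.with_suffix(".out")
    subprocess.run(["blastn", "-query", str(query_file),
                    "-db", "refdb", "-out", str(out)], check=True)
    return out

def merge(outputs, merged="blast_merged.out"):
    """Reduce / FileDataSink step: concatenate per-split outputs."""
    Path(merged).write_text("".join(p.read_text() for p in outputs))

if __name__ == "__main__":
    splits = split_fasta("queries.fa", n_splits=4)
    with ThreadPoolExecutor() as pool:               # local stand-in for
        outputs = list(pool.map(run_blast, splits))  # the engine's Map
    merge(outputs)
```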
  • This is the same BLAST bioActor, using MapReduce to split the query sequences. Circled is the director, in this case Stratosphere: when this bioActor is run, it uses the Stratosphere framework. We can change the director to the Hadoop director, and no other part of the workflow needs to change to run on Hadoop. The sketch below illustrates this design point.
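A toy illustration of that director abstraction: the workflow stays fixed and only the director (the execution-engine binding) changes. The `Director` classes here are hypothetical Python stand-ins, not Kepler's actual Java director API.

```python
class Director:
    """Binds a workflow to an execution engine."""
    def execute(self, workflow):
        raise NotImplementedError

class StratosphereDirector(Director):
    def execute(self, workflow):
        print(f"submitting {workflow} to Stratosphere")

class HadoopDirector(Director):
    def execute(self, workflow):
        print(f"submitting {workflow} as a Hadoop MapReduce job")

workflow = "DDP-BLAST-split-queries"
director = StratosphereDirector()  # swap in HadoopDirector() and nothing
director.execute(workflow)         # else in the workflow changes
```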
Slides:

    1. bioKepler: A Comprehensive Bioinformatics Scientific Workflow Module for Distributed Analysis of Large-Scale Biological Data. Project Website: http://www.biokepler.org. Ilkay Altintas(1), Daniel Crawl(1), Weizhong Li(2), Shulei Sun(2), Jianwu Wang(1), Sitao Wu(2). (1) San Diego Supercomputer Center, UCSD; (2) Center for Research in Biological Systems, UCSD.
    2. Kepler: a Scientific Workflow System (www.kepler-project.org). A cross-project collaboration initiated August 2003; downloads > 40,000; version 2.3 released on 20 Jan 2012; builds upon the open-source Ptolemy II framework. Ptolemy II: a laboratory for investigating design. KEPLER: a problem-solving environment for scientific workflow. KEPLER = "Ptolemy II + X" for scientific workflows.
    3. bioKepler: a Module Being Built in Kepler. Use Distributed Data-Parallel (DDP) frameworks, e.g., MapReduce, to accelerate bioinformatics tool execution; create configurable, reusable, and executable DDP components in a scientific workflow system; support different execution engines and computational environments and optimize workflow execution.
    4. Conceptual Framework (figure).
    5. Software Architecture (figure).
    6. Sample bioActors. Alignment: BLAST, BLAT. Profile-Sequence Alignment: PSI-BLAST. Hidden Markov Model: HMMER. Mapping: Bowtie, BWA, Samtools. Multiple Alignment: ClustalW, Muscle. Clustering: CD-HIT, Blastclust. Gene Prediction: Glimmer, Genescan, FragGeneScan. tRNA Prediction: tRNA-scan, Meta-RNA. Phylogeny: FastTree, RAxML.
    7. DDP BLAST Workflow via Splitting Query Sequences (workflow figure; annotations: "execute with data partition"; "switch director to work with other DDP engines, such as Hadoop").
    8. DDP BLAST Workflow Experiments (figure).
    9. Questions? More information: jianwu@sdsc.edu, http://www.biokepler.org, http://www.kepler-project.org. Acknowledgements.
