WordSeeker (Lichtenberg, BOSC 2010)


1. Concurrent Bioinformatics Software for Discovering Genome-Wide Patterns and Word-Based Genomic Signatures
   Jens Lichtenberg, Kyle Kurz, Xiaoyu Liang, Rami Al-Ouran, Lev Neiman, Lee Nau, Joshua Welch, Edwin Jacox, Thomas Bitterman, Klaus Ecker, Laura Elnitski, Frank Drews, Stephen Lee, Lonnie Welch
2. The WordSeeker Tool
   - Enumeration: suffix tree and suffix array; radix tree
   - Scoring
   - Clustering: sequence clustering and word clustering
   - Conservation analysis: PhastCons score extraction
   - Location distributions
   - Sequence coverage: minimum set of words necessary to cover all sequences
   - Module discovery: enumerative, Ranger markup, basic functional elements
3. Software Properties
   - Google Code repository; GNU General Public License v3
   - Doxygen-generated internal documentation
   - SVN for command-line access
   - Requirements:
     - g++ compiler version 4.1 or higher
     - OpenMP headers
     - MPI environment (for the distributed version)
   - For visualizations and other post-processing steps:
     - Perl 5.8.8 with the modules TFBS, Set::Scalar, LWP::Simple, Parallel::ForkManager, GD::Graph::bars, Algorithm::Cluster, and Bio::SeqIO (all available through CPAN)
     - Gnuplot version 4.2 or higher
4. Need for a Scalable Approach
   - Word Enumeration Module
     - Represents a set of biological input sequences in a data structure
     - Keeps track of words, word counts, sequence counts, and word locations
     - Must keep the data persistent in memory
   - Word Scoring Module
     - Determines statistical scores for each word
     - Requires frequent lookups of words and substrings of words
     - Example: a Markov model of order m requires lookups of all substrings up to length m for every word
   - Goal: keep space complexity low while also keeping lookup time complexity low
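   As a minimal sketch of the enumeration module's bookkeeping — using a plain `std::map` as a simplified stand-in for WordSeeker's radix trie and suffix structures, with hypothetical names — tracking words and their counts looks like:

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <map>
   #include <string>

   // Count every word (substring) of length m in a sequence. A std::map is a
   // simplified stand-in for the radix trie / suffix tree used by WordSeeker;
   // word locations and per-sequence counts could be stored alongside counts.
   std::map<std::string, int> count_words(const std::string& seq, std::size_t m) {
       std::map<std::string, int> counts;
       for (std::size_t i = 0; i + m <= seq.size(); ++i)
           ++counts[seq.substr(i, m)];  // one sliding window per position
       return counts;
   }

   int main() {
       std::map<std::string, int> c = count_words("ACGTACGT", 3);
       assert(c["ACG"] == 2);  // "ACG" occurs at positions 0 and 4
       assert(c["CGT"] == 2);
       assert(c.size() == 4);  // distinct 3-mers: ACG, CGT, GTA, TAC
       return 0;
   }
   ```

   The map keeps lookups logarithmic in the number of distinct words; the trie and suffix-tree structures discussed on the following slides trade this against memory footprint.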
5. Enumeration Approaches
   - Total number of nucleotides in the input sequences: n
   - Word length: m
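   These two parameters drive the choice of data structure: the number of *possible* words of length m over {A, C, G, T} is 4^m, while the number of word occurrences in n nucleotides is only n − m + 1. A small sketch of that comparison (hypothetical helper names):

   ```cpp
   #include <cassert>
   #include <cstdint>

   // Number of possible words of length m over the 4-letter DNA alphabet: 4^m.
   std::uint64_t possible_words(unsigned m) {
       std::uint64_t p = 1;
       for (unsigned i = 0; i < m; ++i) p *= 4;
       return p;
   }

   // Number of word occurrences (sliding windows) in n nucleotides: n - m + 1.
   std::uint64_t word_windows(std::uint64_t n, std::uint64_t m) {
       return (n >= m) ? n - m + 1 : 0;
   }

   int main() {
       assert(possible_words(8) == 65536);              // 4^8: a dense table fits
       assert(possible_words(20) == 1099511627776ULL);  // 4^20 ≈ 1.1e12
       assert(word_windows(8, 3) == 6);
       return 0;
   }
   ```

   For short words a dense table over all 4^m candidates is feasible, but at m = 20 the candidate space vastly exceeds any genome, so only observed words should be stored — the motivation for the trie and suffix-tree approaches.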
6. Parallelization
   - Distributed solution: tasks executed on different nodes (distributed memory)
   - Multi-core solution: tasks executed on different cores (shared memory)
7. Parallel Software Properties
   - Shared memory: OpenMP parallelization
     - Simple, portable directives that compile even on unsupported architectures
     - Simple loops are run in parallel on multiple processors
   - Distributed memory: MPI parallelization
     - Hardware optimizations and support for Fortran, C/C++, and Perl
     - Each node is given a subset of the data to process
     - "Smart" division of tasks is key
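   The "simple loops" OpenMP parallelizes are loops whose iterations are independent — scoring each word is exactly that shape. A hedged sketch (the log-ratio score here is a hypothetical stand-in for WordSeeker's actual scoring functions):

   ```cpp
   #include <cassert>
   #include <cmath>
   #include <vector>

   // Score each word independently from its observed and expected counts.
   // Each iteration touches only scores[i], so OpenMP can split the loop
   // across threads; without -fopenmp the pragma is harmlessly ignored.
   std::vector<double> score_words(const std::vector<int>& observed,
                                   const std::vector<double>& expected) {
       std::vector<double> scores(observed.size());
       #pragma omp parallel for
       for (long i = 0; i < static_cast<long>(observed.size()); ++i)
           scores[i] = std::log((observed[i] + 1.0) / (expected[i] + 1.0));
       return scores;
   }

   int main() {
       std::vector<int> obs;
       obs.push_back(7); obs.push_back(1);
       std::vector<double> expct;
       expct.push_back(1.0); expct.push_back(7.0);
       std::vector<double> s = score_words(obs, expct);
       assert(s[0] > 0.0);  // over-represented word scores positive
       assert(s[1] < 0.0);  // under-represented word scores negative
       return 0;
   }
   ```

   An MPI version would instead scatter disjoint subsets of the word list to the nodes — the "smart division" above — and gather the scores afterwards.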
8. Results
   - Analyzed the Arabidopsis thaliana genome
     - All segments and the full genome
     - Multiple word lengths (1-20)
     - Searched top words against AGRIS (a repository of known elements in A. thaliana)
   - Characterized the framework
     - Speedup and runtime analysis
     - Radix trie and suffix tree
9. Memory Requirements for Arabidopsis thaliana
   - Conducted at the Ohio Supercomputer Center
10. Execution Times for Arabidopsis thaliana
11. Analyzing the Parallel System
    - Speedup, efficiency, and timing using A. thaliana core promoter sequences
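    The quantities plotted on the following slides follow the standard definitions — speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p for p processors. A minimal sketch with hypothetical timings:

    ```cpp
    #include <cassert>

    // Standard parallel-performance metrics:
    //   speedup(p)    = T(1) / T(p)   -- serial time over parallel time
    //   efficiency(p) = speedup(p) / p -- fraction of ideal linear scaling
    double speedup(double t1, double tp) { return t1 / tp; }
    double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

    int main() {
        // Hypothetical timings: 100 s serially, 30 s on 4 cores.
        assert(speedup(100.0, 30.0) > 3.3 && speedup(100.0, 30.0) < 3.4);
        assert(efficiency(100.0, 30.0, 4) < 1.0);  // sub-linear: some overhead
        return 0;
    }
    ```

    Efficiency below 1.0 reflects communication and synchronization overhead, which is what the shared- vs. distributed-memory comparisons on slides 12-14 measure.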
12. Shared and Distributed Memory Speedup
    - Radix trie and suffix tree
13. Shared and Distributed Memory Efficiency
    - Radix trie and suffix tree
14. Shared and Distributed Memory Performance
    - Radix trie and suffix tree
15. Scoring Speedup Contribution
    - Runtime and scoring
16. Results: Pushing the Limits
17. Summary
    - Parallel: shared memory on single nodes; distributed memory on 5 nodes
    - High-throughput: full genomes analyzed in under 5 hours
    - Long word lengths: approaching 20 for full genomes, often 100 or greater for smaller files
    - Powerful analysis: detailed statistics, degeneracy via clustering, additional post-processing (scatter plots, logos, etc.)
18. Future Work
    - Post-processing: word distributions, sequence clustering, GBrowse visualization
    - Further parallelization: within a node; greater distributed abstraction (more prefixes)
19. Questions?