Lichtenberg bosc2010 wordseeker
Published in: Technology
  • MPI: Widely Supported by network interface designers
  • Transcript

    • 1. Concurrent Bioinformatics Software for Discovering Genome-Wide Patterns and Word-Based Genomic Signatures
      Jens Lichtenberg, Kyle Kurz, Xiaoyu Liang, Rami Al-Ouran, Lev Neiman, Lee Nau, Joshua Welch, Edwin Jacox, Thomas Bitterman, Klaus Ecker, Laura Elnitski, Frank Drews, Stephen Lee, Lonnie Welch
    • 2. The WordSeeker Tool
      • Enumeration
      • Suffix Tree and Suffix Array
      • Radix Tree
      • Scoring
      • Clustering
      • Sequence Clustering
      • Word Clustering
      • Conservation Analysis
      • PhastCons Score Extraction
      • Location Distributions
      • Sequence Coverage
      • Minimum set of words necessary to cover all sequences
      • Module Discovery
      • Enumerative
      • Ranger Markup
      • Basic Functional Elements
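The sequence coverage item asks for a minimum set of words that covers all sequences. Finding an exact minimum is a set cover problem, so the sketch below swaps in the standard greedy approximation rather than whatever WordSeeker actually does. The function name `greedy_word_cover` and the input layout (word mapped to the set of sequence indices containing it) are assumptions made for illustration.

```python
def greedy_word_cover(word_to_sequences, num_sequences):
    """Greedy set cover (an approximation, not a guaranteed minimum).

    word_to_sequences: dict mapping a word to the set of sequence indices
    that contain it (hypothetical input layout).
    Returns the chosen words and any sequences that no word covers.
    """
    uncovered = set(range(num_sequences))
    chosen = []
    while uncovered:
        # Pick the word covering the most still-uncovered sequences;
        # iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(word_to_sequences),
                   key=lambda w: len(word_to_sequences[w] & uncovered))
        gained = word_to_sequences[best] & uncovered
        if not gained:
            break  # the remaining sequences contain none of the words
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Toy input: four sequences, four candidate words.
words = {"ACG": {0, 1, 2}, "TTA": {2, 3}, "GGC": {3}, "CAT": {0}}
chosen, uncovered = greedy_word_cover(words, 4)
print(chosen, uncovered)
```

The greedy choice is cheap to compute from the sequence sets that the enumeration step already tracks, at the cost of occasionally returning a cover that is larger than the true minimum.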
    • 3. Software Properties
      • Google Code repository: http://code.google.com/p/word-seeker/
      • GNU General Public License v3
      • Doxygen documentation generator (internal documentation)
      • SVN for command-line access: http://word-seeker.googlecode.com/svn/trunk
      • Requirements
        • G++ compiler version 4.1* or higher
        • OpenMP headers
        • MPI environment (distributed version)
      • For visualizations and other post-processing steps
        • Perl 5.8.8
        • TFBS (http://tfbs.genereg.net/)
        • Set::Scalar
        • LWP::Simple
        • Parallel::ForkManager
        • GD::Graph::bars
        • Algorithm::Cluster
        • Bio::SeqIO (all available through CPAN)
        • Gnuplot version 4.2 or higher
    • 4. Need for a Scalable Approach
      • Word Enumeration Module
        • Represents a set of biological input sequences using some data structure
        • Keeps track of words, word counts, sequence counts, and word locations
        • Needs to keep the data persistent in memory
      • Word Scoring Module
        • Determines statistical scores for each word
        • Requires frequent lookups for words and substrings of words
        • Example: a Markov model of order m requires lookups for all substrings of up to length m for all words
      • Keep space complexity low
      • Keep time complexity for lookups low
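A minimal sketch of the two modules described on this slide, using a plain dictionary instead of the suffix tree or radix tree named elsewhere in the talk. The function names are hypothetical, and the scoring part uses one common Markov estimator (product of the counts of all (m+1)-mers in the word, divided by the product of the counts of its interior m-mers); the slides do not say which estimator WordSeeker uses.

```python
from collections import defaultdict


def enumerate_words(sequences, length):
    """Record, for every word of the given length, its occurrence count,
    the set of sequences containing it, and its (sequence, offset) locations."""
    table = defaultdict(lambda: {"count": 0, "sequences": set(), "locations": []})
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - length + 1):
            entry = table[seq[pos:pos + length]]
            entry["count"] += 1
            entry["sequences"].add(seq_id)
            entry["locations"].append((seq_id, pos))
    return dict(table)


def expected_count(word, order, sequences):
    """Expected count of `word` under a Markov model of the given order.

    Assumed estimator: counts of all (order+1)-mers in the word multiplied
    together, divided by the counts of the interior order-mers.
    """
    if order < 1 or order + 1 > len(word):
        raise ValueError("need 1 <= order < len(word)")
    long_table = enumerate_words(sequences, order + 1)
    short_table = enumerate_words(sequences, order)
    numerator = 1.0
    for i in range(len(word) - order):
        numerator *= long_table.get(word[i:i + order + 1], {"count": 0})["count"]
    denominator = 1.0
    for i in range(1, len(word) - order):
        denominator *= short_table.get(word[i:i + order], {"count": 0})["count"]
    return numerator / denominator if denominator else 0.0


sequences = ["ACGTACGT", "TTACG"]  # toy input, not data from the talk
print(enumerate_words(sequences, 3)["ACG"])
print(expected_count("ACG", 1, sequences))
```

Every call to `expected_count` touches several shorter substrings per word, which is why the slide stresses cheap lookups: scoring all words multiplies that cost by the size of the word table.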
    • 5. Enumeration Approaches
      • Total number of nucleotides in the input sequences: n
      • Word length: m
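To make n and m concrete, here is a sketch of an uncompressed trie that stores every window of length m from the input. It is a simplification: the talk names a radix tree (a compressed trie) and a suffix tree, and this plain trie is neither. The class and function names are assumptions. Building it visits each of the roughly n windows once and walks m characters per window, and a lookup walks at most m nodes.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.count = 0      # number of windows whose prefix ends at this node


def build_trie(sequences, length):
    """Insert every window of the given length into an uncompressed trie."""
    root = TrieNode()
    for seq in sequences:
        for pos in range(len(seq) - length + 1):
            node = root
            for ch in seq[pos:pos + length]:
                node = node.children.setdefault(ch, TrieNode())
                node.count += 1
    return root


def prefix_count(root, prefix):
    """Number of stored windows that start with `prefix` (non-empty)."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


sequences = ["ACGTACGT", "TTACG"]  # toy input, not data from the talk
trie = build_trie(sequences, 3)
print(prefix_count(trie, "ACG"), prefix_count(trie, "TA"))
```

Note that a count at depth d reflects windows of length m that begin with that prefix, so substrings too close to the end of a sequence to start a full window are not included.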
    • 6. Distributed Solution
      • Tasks executed on different nodes
      • Distributed Memory
      • Multi-core Solution
      • Tasks executed on different cores
      • Shared Memory Solution
      • Parallelization
    • 7. Parallel Software Properties
      • Shared Memory
        • OpenMP parallelization
        • Simple, portable directives that compile even on unsupported architectures
        • Simple loops are run in parallel on multiple processors
      • Distributed Memory
        • MPI parallelization
        • Hardware optimizations and support for Fortran, C/C++, Perl
        • Each node is provided a subset of the data to process
        • “Smart” division of tasks is key
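One way to picture the "smart" division of tasks is to split the word space by prefix and balance the prefixes across workers by how often they occur. The sketch below is an assumption about what such a division could look like, not the WordSeeker implementation; it makes no MPI calls and runs the workers' shares one after another, and all function names are hypothetical.

```python
from collections import Counter


def count_words(sequences, length, prefixes=None, prefix_len=0):
    """Count words of the given length; if `prefixes` is given, keep only
    words whose first `prefix_len` characters are in that set."""
    counts = Counter()
    for seq in sequences:
        for pos in range(len(seq) - length + 1):
            word = seq[pos:pos + length]
            if prefixes is None or word[:prefix_len] in prefixes:
                counts[word] += 1
    return counts


def assign_prefixes(sequences, prefix_len, workers):
    """Greedy load balancing: heaviest prefix first, always handed to the
    currently least-loaded worker."""
    load = count_words(sequences, prefix_len)
    bins = [set() for _ in range(workers)]
    totals = [0] * workers
    for prefix, weight in sorted(load.items(), key=lambda kv: (-kv[1], kv[0])):
        target = totals.index(min(totals))
        bins[target].add(prefix)
        totals[target] += weight
    return bins


sequences = ["ACGTACGT", "TTACG"]  # toy input, not data from the talk
bins = assign_prefixes(sequences, prefix_len=1, workers=2)

# Each worker would count only its own prefixes; merging the partial
# tables reproduces the full enumeration.
merged = Counter()
for share in bins:
    merged.update(count_words(sequences, 3, share, prefix_len=1))
print(bins, merged == count_words(sequences, 3))
```

Because every word belongs to exactly one prefix, the partial tables never overlap and the merge step is a simple union. In a distributed run each share would go to a different node.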
    • 8. Results
      • Analyzed the Arabidopsis thaliana genome
        • All segments and the full genome
        • Multiple word lengths (1-20)
        • Searched top words against AGRIS (repository of known elements in A. thaliana)
      • Characterized the Framework
        • Speedup and runtime analysis
        • Radix Trie and Suffix Tree
    • 9. Memory Requirements for Arabidopsis thaliana
      • Conducted at the Ohio Supercomputer Center
    • 10. Execution Times for Arabidopsis thaliana
    • 11. Speedup, efficiency and timing using A. thaliana core promoter sequences.
      • Analyzing the Parallel System
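The slides do not define speedup and efficiency, so the sketch below assumes the usual textbook definitions: speedup is the one-worker runtime divided by the p-worker runtime, and efficiency is speedup divided by p. The timings in the usage lines are placeholders chosen for easy arithmetic, not measurements from the talk.

```python
def speedup(serial_seconds, parallel_seconds):
    """Speedup S(p) = T(1) / T(p)."""
    if parallel_seconds <= 0:
        raise ValueError("parallel runtime must be positive")
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, workers):
    """Efficiency E(p) = S(p) / p; 1.0 means perfect linear scaling."""
    if workers < 1:
        raise ValueError("need at least one worker")
    return speedup(serial_seconds, parallel_seconds) / workers


# Placeholder timings (NOT results from the talk): 100 s serial, 25 s on 8 workers.
print(speedup(100.0, 25.0))
print(efficiency(100.0, 25.0, 8))
```

Efficiency below 1.0 points to overhead such as communication, load imbalance, or serial sections that additional workers cannot hide.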
    • 12. Shared and Distributed Memory Speedup
      • Radix Trie
      • Suffix Tree
    • 13. Shared and Distributed Memory Efficiency
      • Radix Trie
      • Suffix Tree
    • 14. Shared and Distributed Memory Performance
      • Radix Trie
      • Suffix Tree
    • 15. Scoring Speedup Contribution
      • Runtime
      • Scoring
    • 16. Results: Pushing the limits
    • 17. Summary
      • Parallel
        • Shared memory on single nodes
        • Distributed memory on 5 nodes
      • High-throughput
        • Full genomes analyzed in under 5 hours
      • Long word lengths
        • Genomes approaching 20
        • Smaller files often 100 or greater
      • Powerful analysis
        • Detailed statistics
        • Degeneracy via clustering
        • Additional post-processing (scatter plots, logos, etc.)
    • 18. Future Work
      • Post-processing
        • Word distributions
        • Sequence clustering
        • GBrowse visualization
      • Further parallelization
        • Within a node
        • Greater distributed abstraction (more prefixes)
    • 19. Questions?
