Lichtenberg bosc2010 wordseeker
1. Concurrent Bioinformatics Software for Discovering Genome-Wide Patterns and Word-Based Genomic Signatures
  Jens Lichtenberg, Kyle Kurz, Xiaoyu Liang, Rami Al-Ouran, Lev Neiman, Lee Nau, Joshua Welch, Edwin Jacox, Thomas Bitterman, Klaus Ecker, Laura Elnitski, Frank Drews, Stephen Lee, Lonnie Welch
2. The WordSeeker Tool
  - Enumeration
    - Suffix Tree and Suffix Array
    - Radix Tree
  - Scoring
  - Clustering
    - Sequence Clustering
    - Word Clustering
  - Conservation Analysis
    - PhastCons Score Extraction
  - Location Distributions
  - Sequence Coverage
    - Min set of words necessary to cover all sequences
  - Module Discovery
    - Enumerative
  - Ranger Markup
  - Basic Functional Elements
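The enumeration and sequence-coverage steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WordSeeker's implementation: it counts words with a hash table rather than the suffix tree, suffix array, or radix tree named on the slide, and it approximates the minimum covering word set with a greedy heuristic that is not guaranteed to be minimal. The function names `enumerate_words` and `greedy_cover` are hypothetical.

```python
from collections import Counter


def enumerate_words(sequences, k):
    """Count every length-k word across all sequences.

    Hash-based stand-in for tree-based enumeration (an assumption of this sketch).
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def greedy_cover(sequences, k):
    """Greedily pick words until every sequence contains at least one pick.

    Sequences shorter than k contain no word of length k and are skipped.
    """
    word_sets = [
        {s.upper()[i:i + k] for i in range(len(s) - k + 1)} for s in sequences
    ]
    uncovered = {i for i, words in enumerate(word_sets) if words}
    chosen = []
    while uncovered:
        # Count, for each word, how many still-uncovered sequences contain it.
        tally = Counter(w for i in uncovered for w in word_sets[i])
        # Prefer the widest coverage; break ties alphabetically for determinism.
        best = min(tally, key=lambda w: (-tally[w], w))
        chosen.append(best)
        uncovered = {i for i in uncovered if best not in word_sets[i]}
    return chosen
```

Calling `greedy_cover(sequences, k)` on a list of DNA strings returns the chosen words in the order they were picked; an exact minimum would need a set-cover solver instead of this heuristic.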
3. Software Properties
  - Google Code repository: http://code.google.com/p/word-seeker/
  - GNU General Public License v3
  - Doxygen code generator (internal documentation)
  - SVN for command-line access: http://word-seeker.googlecode.com/svn/trunk
  - Requirements
    - G++ compiler version 4.1* or higher
    - OpenMP headers
    - MPI environment (distributed version)
  - For visualizations and other post-processing steps
    - Perl 5.8.8, TFBS (http://tfbs.genereg.net/)
    - Set::Scalar, LWP::Simple, Parallel::ForkManager, GD::Graph::bars, Algorithm::Cluster, Bio::SeqIO (all available through CPAN)
    - Gnuplot version 4.2 or higher
6. Parallelization
  - Distributed Solution
    - Tasks executed on different nodes
    - Distributed Memory
  - Multi-core Solution
    - Tasks executed on different cores
    - Shared Memory Solution
7. Parallel Software Properties
  - Shared Memory: OpenMP parallelization
    - Simple, portable directives that compile even on unsupported architectures
    - Simple loops are run in parallel on multiple processors
  - Distributed Memory: MPI parallelization
    - Hardware optimizations and support for Fortran, C/C++, Perl
    - Each node is provided a subset of the data to process
    - “Smart” division of tasks is key
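One way to picture the division of tasks described above is to split the word space by prefix, so that each worker counts only the words starting with its assigned prefix and the partial counts are merged at the end. The sketch below is an assumption based on the "more prefixes" remark under Future Work, not a description of WordSeeker's actual scheme. It uses Python threads purely as a stand-in for OpenMP threads or MPI ranks, and the names `count_words_with_prefix` and `count_words_partitioned` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def count_words_with_prefix(sequence, k, prefix):
    """Count the length-k words in `sequence` that begin with `prefix`."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if word.startswith(prefix):
            counts[word] += 1
    return counts


def count_words_partitioned(sequence, k, prefix_len=1, workers=4, alphabet="ACGT"):
    """Split the word space into one task per prefix and merge the results.

    Assumes prefix_len <= k. Words whose prefix contains a character outside
    `alphabet` (for example N) belong to no task and are not counted.
    """
    prefixes = ["".join(p) for p in product(alphabet, repeat=prefix_len)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda p: count_words_with_prefix(sequence, k, p), prefixes
        )
        # Tasks are disjoint by construction, so merging is a plain sum.
        for partial in partials:
            total.update(partial)
    return total
```

Raising `prefix_len` produces more, smaller tasks (4 tasks for length 1, 16 for length 2 on a DNA alphabet), which is the usual lever for balancing load when some prefixes are far more frequent than others.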
8. Results
  - Analyzed the Arabidopsis thaliana genome
    - All segments and the full genome
    - Multiple word lengths (1-20)
    - Searched top words against AGRIS (repository of known elements in A. thaliana)
  - Characterized the Framework
    - Speedup and runtime analysis
    - Radix Trie and Suffix Tree
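A speedup and runtime analysis of this kind is conventionally summarized with two ratios: speedup, the one-worker runtime divided by the p-worker runtime, and parallel efficiency, the speedup divided by p. The helper below is a minimal sketch of that standard calculation; the name `speedup_table` is hypothetical, and the deck's transcript does not show how the authors computed their figures.

```python
def speedup_table(runtimes):
    """Map worker count -> (speedup, efficiency).

    `runtimes` maps a worker count to wall-clock seconds and must include
    an entry for 1 worker, which serves as the baseline.
    """
    baseline = runtimes[1]
    table = {}
    for workers, seconds in sorted(runtimes.items()):
        speedup = baseline / seconds
        table[workers] = (speedup, speedup / workers)
    return table
```

Feeding it measured timings, keyed by the number of cores or nodes used, gives one row per configuration; an efficiency well below 1.0 indicates that communication or load imbalance is absorbing the extra workers.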
17. Summary
  - Parallel
    - Shared memory on single nodes
    - Distributed memory on 5 nodes
  - High-throughput
    - Full genomes analyzed in under 5 hours
  - Long word lengths
    - Genomes: approaching 20
    - Smaller files: often 100 or greater
  - Powerful analysis
    - Detailed statistics
    - Degeneracy via clustering
    - Additional post-processing (scatter plots, logos, etc.)
18. Future Work
  - Post-processing
    - Word distributions
    - Sequence clustering
    - GBrowse visualization
  - Further parallelization
    - Within a node
    - Greater distributed abstraction (more prefixes)