Lichtenberg bosc2010 wordseeker

BOSC 2010

Concurrent Bioinformatics Software for Discovering Genome-Wide Patterns and Word-based Genomic Signatures
Jens Lichtenberg, Kyle Kurz, Xiaoyu Liang, Rami Al-Ouran, Lev Neiman, Lee Nau, Joshua Welch, Edwin Jacox, Thomas Bitterman, Klaus Ecker, Laura Elnitski, Frank Drews, Stephen Lee, Lonnie Welch

The WordSeeker Tool: Enumeration (Suffix Tree and Suffix Array, Radix Tree); Scoring; Clustering (Sequence Clustering, Word Clustering); Conservation Analysis (PhastCons Score Extraction); Location Distributions; Sequence Coverage (minimum set of words necessary to cover all sequences); Module Discovery; Enumerative; Ranger; Markup; Basic Functional Elements
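The sequence coverage step asks for a small set of words that together occur in every input sequence. The slides do not say how WordSeeker computes it, so the following is only a sketch of the textbook greedy set-cover heuristic in Python. The function name `greedy_word_cover` and the input layout (a mapping from each word to the indices of the sequences containing it) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def greedy_word_cover(word_to_seqs: Dict[str, Set[int]], num_sequences: int) -> List[str]:
    """Greedily pick words until every sequence index in range(num_sequences) is covered.

    This is the standard greedy set-cover heuristic (an assumption, not
    WordSeeker's documented method). It returns a small cover, which is not
    guaranteed to be the minimum one.
    """
    uncovered = set(range(num_sequences))
    chosen: List[str] = []
    while uncovered:
        # Pick the word covering the most still-uncovered sequences;
        # iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(word_to_seqs), key=lambda w: len(word_to_seqs[w] & uncovered))
        gained = word_to_seqs[best] & uncovered
        if not gained:
            raise ValueError("some sequences contain none of the candidate words")
        chosen.append(best)
        uncovered -= gained
    return chosen
```

Finding a true minimum cover is NP-hard in general, which is why a heuristic like this is a common stand-in when the candidate word list is large.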

Software Properties
Google Code repository: http://code.google.com/p/word-seeker/
License: GNU General Public License v3
Internal documentation: generated with Doxygen
SVN for command-line access: http://word-seeker.googlecode.com/svn/trunk
Requirements: G++ compiler version 4.1* or higher; OpenMP headers; MPI environment (distributed version)
For visualizations and other post-processing steps: Perl 5.8.8; TFBS (http://tfbs.genereg.net/); Set::Scalar, LWP::Simple, Parallel::ForkManager, GD::Graph::bars, Algorithm::Cluster, Bio::SeqIO (all available through CPAN); Gnuplot version 4.2 or higher

Need for a Scalable Approach
Word Enumeration Module: represents a set of biological input sequences based on some data structure; keeps track of words, word counts, sequence counts, and word locations; needs to keep the data persistent in memory.
Word Scoring Module: determines statistical scores for each word; requires frequent lookups for words and substrings of words. Example: a Markov model of order m requires lookups for all substrings of up to length m for all words ... lookups low
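To make the lookup cost of the Markov example concrete, here is a minimal Python sketch of the usual maximum-likelihood estimate of a word's expected count under an order-m Markov model: the product of the counts of its (m+1)-long substrings divided by the product of the counts of its interior m-long substrings. The helper names `count_words` and `expected_count` are hypothetical, and the slides do not confirm that WordSeeker uses exactly this estimator.

```python
from collections import Counter
from typing import Iterable, List


def count_words(sequences: Iterable[str], length: int) -> Counter:
    """Count every overlapping substring of the given length."""
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            counts[seq[i:i + length]] += 1
    return counts


def expected_count(word: str, order: int, sequences: List[str]) -> float:
    """Expected occurrences of `word` under an order-`order` Markov model.

    Assumed estimator: product of counts of the (order+1)-mers in `word`
    divided by the product of counts of its interior order-mers.
    """
    if not 1 <= order < len(word):
        raise ValueError("order must satisfy 1 <= order < len(word)")
    long_counts = count_words(sequences, order + 1)
    short_counts = count_words(sequences, order)
    expected = 1.0
    # One lookup per (order+1)-long substring of the word.
    for i in range(len(word) - order):
        expected *= long_counts[word[i:i + order + 1]]
    # One lookup per interior order-long substring of the word.
    for i in range(1, len(word) - order):
        denom = short_counts[word[i:i + order]]
        if denom == 0:
            return 0.0
        expected /= denom
    return expected
```

Scoring a single word already needs several substring lookups, and a real run would count once and reuse the tables for every word, which is where a fast, memory-resident enumeration structure matters.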

Enumeration Approaches. Total number of nucleotides in the input sequences: n. Word length: m.
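As a rough illustration of tree-based enumeration with n nucleotides and word length m, the sketch below stores every length-m word in a plain character trie and records word counts, sequence counts, and locations at the leaf. It is a simplification: a real radix tree would compress single-child paths, and the class and method names (`WordTrie`, `add_sequence`, `count`, `sequence_count`) are assumptions for this example only.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0  # occurrences of the full-length word ending at this node
        self.locations: List[Tuple[int, int]] = []  # (sequence index, offset)


class WordTrie:
    """Uncompressed trie over all words of one fixed length (hypothetical sketch)."""

    def __init__(self, word_length: int) -> None:
        self.word_length = word_length
        self.root = TrieNode()

    def add_sequence(self, seq_index: int, seq: str) -> None:
        m = self.word_length
        # Roughly n - m + 1 windows, each costing m steps: O(n * m) overall.
        for offset in range(len(seq) - m + 1):
            node = self.root
            for ch in seq[offset:offset + m]:
                child = node.children.get(ch)
                if child is None:
                    child = node.children[ch] = TrieNode()
                node = child
            node.count += 1
            node.locations.append((seq_index, offset))

    def _find(self, word: str) -> Optional[TrieNode]:
        node: Optional[TrieNode] = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def count(self, word: str) -> int:
        """Occurrences of a full-length word (shorter prefixes carry no count)."""
        node = self._find(word)
        return 0 if node is None else node.count

    def sequence_count(self, word: str) -> int:
        """Number of distinct sequences that contain the word."""
        node = self._find(word)
        return 0 if node is None else len({s for s, _ in node.locations})
```

For example, after `trie = WordTrie(3)` and `trie.add_sequence(0, "ACGACG")`, the call `trie.count("ACG")` reports both occurrences of that word.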

Solution Parallelization. Distributed solution: tasks executed on different nodes (distributed memory). Multi-core solution: tasks executed on different cores (shared memory).

Parallel Software Properties
Shared memory: OpenMP parallelization. Simple, portable directives that compile even on unsupported architectures; simple loops are run in parallel on multiple processors.
Distributed memory: MPI parallelization. Hardware optimizations and support for Fortran, C/C++, Perl; each node is provided a subset of the data to process; "smart" division of tasks is key.
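One way to picture a division of tasks is to split the word space by prefix, so that each worker counts only the words it owns and the partial results are merged at the end. The Python sketch below runs the workers one after another in a single process purely to show the partitioning logic. It uses neither OpenMP nor MPI, and the prefix-based scheme, the round-robin assignment, and all function names are assumptions rather than WordSeeker's confirmed design.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence


def assign_prefixes(prefix_length: int, num_workers: int, alphabet: str = "ACGT") -> List[List[str]]:
    """Deal all prefixes of the given length round-robin to the workers."""
    prefixes = ["".join(p) for p in product(alphabet, repeat=prefix_length)]
    return [prefixes[w::num_workers] for w in range(num_workers)]


def count_for_worker(sequences: Sequence[str], word_length: int, prefixes: Sequence[str]) -> Counter:
    """Count only the words that start with one of this worker's prefixes."""
    owned = tuple(prefixes)
    counts: Counter = Counter()
    for seq in sequences:
        for i in range(len(seq) - word_length + 1):
            word = seq[i:i + word_length]
            if word.startswith(owned):
                counts[word] += 1
    return counts


def count_partitioned(sequences: Sequence[str], word_length: int,
                      prefix_length: int, num_workers: int) -> Counter:
    """Merge the per-worker counts; each loop iteration stands in for one node."""
    total: Counter = Counter()
    for prefixes in assign_prefixes(prefix_length, num_workers):
        total.update(count_for_worker(sequences, word_length, prefixes))
    return total
```

Because every word has exactly one prefix of a given length, the workers' results never overlap and merging is a plain sum. Longer prefixes give more, smaller tasks, which makes it easier to balance the load across nodes.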

Results
Analyzed the Arabidopsis thaliana genome: all segments and the full genome, with multiple word lengths (1-20). Searched top words against AGRIS (repository of known elements in A. thaliana).
Characterized the framework: speedup and runtime analysis for the Radix Trie and Suffix Tree.

Memory Requirements for Arabidopsis thaliana (conducted at the Ohio Supercomputer Center)

Execution Times for Arabidopsis thaliana

Analyzing the Parallel System: speedup, efficiency and timing using A. thaliana core promoter sequences.
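For reference, speedup and efficiency are conventionally defined as S(p) = T(1) / T(p) and E(p) = S(p) / p, where T(p) is the runtime on p workers. The short Python sketch below encodes those standard definitions; the function names are illustrative, and no timings from the talk are reproduced here.

```python
def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """S(p) = T(1) / T(p): how many times faster the parallel run is."""
    if parallel_seconds <= 0:
        raise ValueError("parallel runtime must be positive")
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds: float, parallel_seconds: float, workers: int) -> float:
    """E(p) = S(p) / p: the fraction of ideal linear speedup achieved."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(serial_seconds, parallel_seconds) / workers
```

With purely hypothetical timings of 100 s on one worker and 25 s on five workers, the speedup would be 4 and the efficiency 0.8.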

Shared and Distributed Memory Speedup (panels: Radix Trie, Suffix Tree)

Shared and Distributed Memory Efficiency (panels: Radix Trie, Suffix Tree)

Shared and Distributed Memory Performance (panels: Radix Trie, Suffix Tree)

Scoring Speedup Contribution Runtime Scoring

Summary
Parallel: shared memory on single nodes; distributed memory on 5 nodes.
High-throughput: full genomes analyzed in under 5 hours.
Long word lengths: genomes approaching 20; smaller files often 100 or greater.
Powerful analysis: detailed statistics; degeneracy via clustering; additional post-processing (scatter plots, logos, etc.).

Future Work
Post-processing: word distributions; sequence clustering; GBrowse visualization.
Further parallelization: within a node; greater distributed abstraction (more prefixes).

16. Results: Pushing the limits

19. Questions?

Editor's Notes

MPI: widely supported by network interface designers
