
# Fast Searching in Biological Sequences Using Multiple Hash Functions

Published in: Education, Technology, Sports

### Fast Searching in Biological Sequences Using Multiple Hash Functions

1. Algorithms & Complexity Evaluation: Fast Search in Biological Sequences using Multiple Hash Functions. (Title graphic: the DNA sequence A T A C G T T C A G A T T G C C A G C A C G T T.)
2. Grasping the problem: string matching, what's this? We are going to deal with a very tiny alphabet representing the nucleotides in a genetic sequence: Adenine (A), Thymine (T), Guanine (G) and Cytosine (C). The task is to search a DNA sequence for multiple patterns. A search window is placed over the sequence; after verifying matches, the window advances by one position (pos++). (Diagram: a DNA sequence with the search window over it, the patterns to search, and the same window shifted by 1 position.) Presentation by Simone Tino - All rights reserved. Authored from November 2012 to December 2012 - University of Catania - Faculty of Computer Science - Algoritmi e Complessità
3. Let's talk about Wu & Manber. Don't worry, it's not a magic spell, it's just an algorithm. First we have the pre-processing stage, which works on a pattern (not a text!), for example T G A G C A C T G with q-gram size q = 3. The first q-gram, T G A, is extracted and fed to the hash function, which returns a value with 0 <= hash <= MAX. The calculated hash is used as an index into the shift array, sh[HASH(TGA)] = shift, and the stored value is later used to shift the window. The pattern itself is recorded in the table F: F[HASH('CTG')] = patterns[cur]. Then we can move to the real search, now on a text: A G T C C T G T A A A G A G G A C C T C T G A C G T G G G G T C C A A T G G G C A C A C. The window size equals the pattern size m. For a window such as A C A A C T G G C, only the last q-gram, G G C, is extracted; the hash function returns 0 <= hash <= MAX, and that hash is the shift index: shift = sh[HASH(GGC)]. If the shift is 0, a naive check is performed.
4. The W-M limit. For a q-gram such as T G A, k = Math.floor(w/q), so q and k cannot both be increased. Either choice aims to decrease the number of false positives: increasing q means more text to analyze, while increasing k means more bits per char.
5. Enhancing W-M: pre-processing. The slide shows the pattern T G A G C A C T G for γ = 1 and γ = 2. Two hashes are taken from the end of the pattern: HASH('CTG') = h1 and HASH('GCA') = h2. For a q-gram at position i, such as T G A, the two shift tables are filled with sh1[HASH(TGA)] = m-q-i and sh2[HASH(TGA)] = m-2q-i. The two hashes are combined as h = (h1 << 1) + h2, and the pattern is stored as F[h] = patterns[cur]. To be continued...
6. ...Enhancing W-M: search. Now on a text: A G T C C T G T A A A G A G G A C C T C T G A C G T G G G G T C C A A T G G G C A C A C. For the search window A C A A C T G G C, h1 = HASH(GGC) gives shift1 = sh1[h1], and h2 = HASH(ACT) gives shift2 = sh2[h2]. In the end: h = (h1 << 1) + h2; if (shift1 == 0 && shift2 == 0) foreach (p in F[h]) checkOccurrInWin(p); ...now you can't go back.
7. Complexities. Pre-processing: O ( MAX (1+ O ( MAX + r ) + r ) = Space requirement m q ) = Time requirement. Search phase: $O(m^{(1)} n)$ = Time requirement, where $m^{(1)} = \sum_{i=1}^{r} \mathrm{len}(p)_i$.
8. Experimental results. (Three plots of running time against w, with w = 8, 16, 32, 64, 128, for |P| = 100, |P| = 1000 and |P| = 10000. Legend: MBNDM time and best WM(q,γ) time. Labelled configurations include WM(4,2), WM(6,1), WM(8,1), WM(8,2) and WM(8,3).) The plots compare the execution times of WM(q,γ) with one of the fastest algorithms currently in the literature.
9. The End. (Closing graphic: the DNA sequence A T A C G T T C A G A T T G C C A G C A C G T T.)
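
The pre-processing and search steps from slides 3, 5 and 6 can be sketched in Python as follows. This is only an illustration under several assumptions, not the author's implementation: the names `qgram_hash`, `preprocess` and `search` are hypothetical; the hash simply packs each nucleotide into 2 bits; dictionaries stand in for the MAX-sized arrays sh and F; a tuple of the γ hashes replaces the slides' h = (h1 << 1) + h2; and the window advances by the largest of the γ shifts, which the slides do not state explicitly (they only show the check when both shifts are 0).

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding: 2 bits per nucleotide


def qgram_hash(s, start, q):
    """Pack the q nucleotides s[start:start+q] into one integer (2 bits each)."""
    h = 0
    for ch in s[start:start + q]:
        h = (h << 2) | CODE[ch]
    return h


def preprocess(patterns, q, gamma):
    """Build gamma shift tables and the bucket table F for the pattern set."""
    m = min(len(p) for p in patterns)  # window size: shortest pattern length
    if m < gamma * q:
        raise ValueError("need m >= gamma * q")
    # Table k describes the (k+1)-th q-gram counted from the end of the window.
    # A q-gram never seen by table k allows a shift of m - (k+1)*q + 1.
    default = [m - (k + 1) * q + 1 for k in range(gamma)]
    shift = [dict() for _ in range(gamma)]
    buckets = defaultdict(list)  # plays the role of F in the slides
    for p in patterns:
        for k in range(gamma):
            last_start = m - (k + 1) * q  # where this q-gram sits in the window
            for i in range(last_start + 1):
                h = qgram_hash(p, i, q)
                s = last_start - i  # e.g. m-q-i for k = 0, m-2q-i for k = 1
                if s < shift[k].get(h, default[k]):
                    shift[k][h] = s
        key = tuple(qgram_hash(p, m - (k + 1) * q, q) for k in range(gamma))
        buckets[key].append(p)
    return m, shift, default, buckets


def search(text, patterns, q=3, gamma=2):
    """Return (position, pattern) for every occurrence of a pattern in text."""
    m, shift, default, buckets = preprocess(patterns, q, gamma)
    hits = []
    pos = 0
    while pos + m <= len(text):
        # Hash the last gamma q-grams of the current window.
        key = tuple(qgram_hash(text, pos + m - (k + 1) * q, q)
                    for k in range(gamma))
        s = max(shift[k].get(key[k], default[k]) for k in range(gamma))
        if s == 0:
            # All shifts are zero: naive check of the candidates in the bucket.
            for p in buckets.get(key, ()):
                if text.startswith(p, pos):
                    hits.append((pos, p))
            s = 1  # after verifying matches, advance the window by one
        pos += s
    return hits


text = "AGTCCTGTAAAGAGGACCTCTGACGTGGGGTCCAATGGGCACAC"  # text from slide 3
patterns = ["CTGACGTGG", "GTCCAATGG", "TGAGCACTG"]
print(search(text, patterns, q=3, gamma=2))
```

With `gamma=1` the sketch reduces to the single-hash scheme of slide 3. The input is assumed to contain only the letters A, C, G and T; any other character raises a `KeyError` in `qgram_hash`.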