A T A C G T T C A G A T T G C C A G C A C G T T

Algorithms & Complexity Evaluation

Fast Search in
Biological Sequences
u...
We are going to deal with a very tiny alphabet
representing nucleotides in a genetic sequence.

ADENINE
THYMINE

Searchi...
a pattern
NOT A TEXT!!!

now... a text!

T G A G C A C T G
gram dim q = 3

T G A G C A C T G
extracting the first q-gram
...
T G A
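
The slides go on to hash every q-gram of each pattern, use the hash as an index into a shift array, and verify naively whenever the shift is 0. A minimal sketch of that Wu & Manber style scheme follows. It rests on assumptions not stated in the slides: a 2-bit code per nucleotide as the hash (so MAX = 4^q - 1), a default shift of m - q + 1 for unseen q-grams, and the hypothetical names `hash_qgram`, `preprocess` and `search`.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding, not from the slides

def hash_qgram(s, start, q):
    """Pack the q nucleotides beginning at `start` into one integer."""
    h = 0
    for ch in s[start:start + q]:
        h = (h << 2) | CODE[ch]
    return h

def preprocess(patterns, q):
    m = min(len(p) for p in patterns)      # window size = shortest pattern
    size = 4 ** q                          # MAX + 1 under the assumed hash
    sh = [m - q + 1] * size                # assumed default for unseen q-grams
    F = [[] for _ in range(size)]          # patterns bucketed by last q-gram
    for p in patterns:
        for i in range(m - q + 1):
            h = hash_qgram(p, i, q)
            sh[h] = min(sh[h], m - q - i)  # distance from this q-gram to the window end
        F[hash_qgram(p, m - q, q)].append(p)
    return m, sh, F

def search(text, patterns, q=3):
    m, sh, F = preprocess(patterns, q)
    hits, pos = [], 0
    while pos + m <= len(text):
        h = hash_qgram(text, pos + m - q, q)   # last q-gram of the window only
        shift = sh[h]
        if shift == 0:                         # candidate: naive check
            for p in F[h]:
                if text.startswith(p, pos):
                    hits.append((pos, p))
            pos += 1
        else:
            pos += shift
    return hits

print(search("ACGTACGT", ["CGT"]))
```

The sketch only accepts the letters A, C, G and T; any other character raises a `KeyError` in `hash_qgram`.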

k = Math.floor(w/q);

[bit string: 0 1 0 0 1 0 1, labelled k]

W-M limit
cannot increase
them both...
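
The formula k = Math.floor(w/q) is what creates this limit. A small sketch of the arithmetic, assuming w is the total number of bits available for the hash of one q-gram and k is the number of bits each character contributes. The slides do not define w, so this reading is an assumption, and `bits_per_char` is a hypothetical helper name.

```python
def bits_per_char(w, q):
    """k = floor(w / q): bits each of the q characters can contribute."""
    return w // q

# With a fixed budget of w = 16 bits (an assumed value), a longer q-gram
# leaves fewer bits per character, so q and k cannot both grow.
for q in (2, 4, 8):
    print(f"q={q} -> k={bits_per_char(16, q)}")
```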

Decrease
number
of false
posi...
Enhancing
W-M...

T G A G C A C T G

γ =1

γ =2

T G A G C A C T G

pre-processing

T G A G C A C T G

HASH(’CTG’) = h1

H...
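
The transcript gives the two shift rules for this step as sh1[...] = m-q-i and sh2[...] = m-2q-i, where i is the start of a q-gram in the pattern. A short worked sketch for the slide's pattern, assuming i counts from 0 and that sh2 only receives entries when m-2q-i is not negative (`shift_entries` is a hypothetical helper name):

```python
def shift_entries(pattern, q):
    """List (i, q-gram, m-q-i, m-2q-i or None) for every q-gram of the pattern."""
    m = len(pattern)
    rows = []
    for i in range(m - q + 1):
        s1 = m - q - i                 # value stored in the first shift table
        s2 = m - 2 * q - i             # value stored in the second shift table
        rows.append((i, pattern[i:i + q], s1, s2 if s2 >= 0 else None))
    return rows

for row in shift_entries("TGAGCACTG", 3):
    print(row)
```

Under these assumptions the q-gram CTG gets shift 0 in the first table and GCA gets shift 0 in the second, which matches the two q-grams the slide hashes into h1 and h2.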
...Enhancing W-M

search window

A G T C C T G T A A A G A G G A C C T C T G A C G T G G G G T C C A A T G G G C
...
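
The enhanced search described in the transcript hashes the last two q-grams of the window, looks each up in its own shift table, and checks the patterns in F[h] only when both shifts are 0. A minimal sketch for γ = 2 follows. Several points are assumptions rather than something the slides state: the 2-bit nucleotide hash, dictionaries instead of arrays of size MAX, the default shifts m-q+1 and m-2q+1, and advancing by the larger of the two shifts when a check is not triggered. All function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

def hash_qgram(s, start, q):
    h = 0
    for ch in s[start:start + q]:
        h = (h << 2) | CODE[ch]
    return h

def preprocess(patterns, q):
    m = min(len(p) for p in patterns)
    if m < 2 * q:
        raise ValueError("two q-grams must fit in the window")
    sh1, sh2, F = {}, {}, {}
    for p in patterns:
        for i in range(m - q + 1):
            h = hash_qgram(p, i, q)
            sh1[h] = min(sh1.get(h, m - q + 1), m - q - i)
            if i <= m - 2 * q:
                sh2[h] = min(sh2.get(h, m - 2 * q + 1), m - 2 * q - i)
        h1 = hash_qgram(p, m - q, q)        # last q-gram
        h2 = hash_qgram(p, m - 2 * q, q)    # the q-gram before it
        F.setdefault((h1 << 1) + h2, []).append(p)
    return m, sh1, sh2, F

def search(text, patterns, q=3):
    m, sh1, sh2, F = preprocess(patterns, q)
    hits, pos = [], 0
    while pos + m <= len(text):
        h1 = hash_qgram(text, pos + m - q, q)
        h2 = hash_qgram(text, pos + m - 2 * q, q)
        shift1 = sh1.get(h1, m - q + 1)
        shift2 = sh2.get(h2, m - 2 * q + 1)
        if shift1 == 0 and shift2 == 0:
            for p in F.get((h1 << 1) + h2, []):
                if text.startswith(p, pos):   # naive verification
                    hits.append((pos, p))
            pos += 1
        else:
            pos += max(shift1, shift2)        # assumed way to combine the shifts
    return hits

print(search("AGTCCTGTAAAGAGGACCTCTGACGTGG", ["TGACGT", "TGAGCACTG"]))
```

Taking the maximum is safe in this sketch because each table gives a lower bound on how far the window must move before an occurrence can start.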
Complexities

Pre-processing
O(MAX (1 + …) + r) = Space requirement
O(MAX + r m/q) = Time requirement

Search phase
O...
Experimental results
[Figure: execution time plots for |P| = 100 and |P| = 1000, comparing the best WM(q,γ) (labels include WM(6,1) and WM(4,2)) with MBNDM]
...
A T A C G T T C A G A T T G C C A G C A C G T T

The End
Fast Searching in Biological Sequences Using Multiple Hash Functions


  1. A T A C G T T C A G A T T G C C A G C A C G T T
     Algorithms & Complexity Evaluation
     Fast Search in Biological Sequences using Multiple Hash Functions
  2. Grasping the problem: string matching??? What's this?
     We are going to deal with a very tiny alphabet representing nucleotides in a genetic sequence: ADENINE, THYMINE, GUANINE, CYTOSINE.
     Searching a sequence for multiple patterns. After verifying matches, advance the window: pos++
     Diagram (DNA sequence): search window T G A G C A G G C A T G T C G patterns to search T G A G C shift window by 1 position A T G A C G A C T A G G C A T G T C G A T G A C G A C T
     Presentation by Simone Tino - All rights reserved. Authored from November 2012 to December 2012 - University of Catania - Faculty of Computer Science - Algoritmi e Complessità
  3. Let's talk about Wu & Manber. Don't worry! It's not a magic spell... it's just an algorithm.
     First we have the pre-processing stage, which works on a pattern, NOT A TEXT!!!
     T G A G C A C T G, gram dim q = 3
     Extracting the first q-gram: T G A
     Feeding the hash function with the extracted q-gram, a hash is returned, 0 <= hash <= MAX: HASH ( T G A ) = #@!*$%£&?
     The calculated hash is used as an index in the shift array, and the value is used to shift the window: sh[ #@!*$%£&? ] = shift
     F[HASH('CTG')] = patterns[cur]
     Then we can move to the real search... now, a text!
     A G T C C T G T A A A G A G G A C C T C T G A C G T G G G G T C C A A T G G G C A C A C
     Window size = pattern size = m: A C A A C T G G C
     Extracting the last q-gram only: G G C
     The hash function gets the q-gram, hash returned, 0 <= hash <= MAX: HASH ( G G C ) = ^@!*%£$?#
     Shift index: shift = sh[ ^@!*%£$?# ]
     Is shift 0? If true: NAIVE CHECK
  4. T G A
     k = Math.floor(w/q);
     [bit-string diagram: 0 1 0 0 1 0 1 and 1 1 0 0 1 1 0, with labels k and w]
     Decrease the number of false positives:
     More text to analyze: increase q
     More bits per char: increase k
     W-M limit: cannot increase them both...
  5. Enhancing W-M... pre-processing
     T G A G C A C T G
     γ = 1, γ = 2
     HASH('CTG') = h1
     HASH('GCA') = h2
     HASH ( T G A ) = #@!*$%
     sh1[ #@!*$% ] = m-q-i
     sh2[ #@!*$% ] = m-2q-i
     h = ( h1 << 1) + h2
     F[h] = patterns[cur]
     to be continued...
  6. ...Enhancing W-M: search
     a text: A G T C C T G T A A A G A G G A C C T C T G A C G T G G G G T C C A A T G G G C A C A C
     window: A C A A C T G G C
     h1: HASH ( G G C ) = §+!#*£$?%, shift1 = sh1[ §+!#*£$?% ]
     h2: HASH ( A C T ) = ^@!*%£$?#, shift2 = sh2[ ^@!*%£$?# ]
     In the end...
     h = ( h1 << 1) + h2
     if (shift1 == 0 && shift2 == 0) foreach (p in F[h]) checkOccurrInWin(p);
     ...now you can't go back
  7. Complexities
     Pre-processing
     O(MAX (1 + …) + r) = Space requirement
     O(MAX + r m/q) = Time requirement
     Search phase
     O(m(1) n) = Time requirement, where m(1) = Σ_{i=1}^{r} len(p_i)
  8. Experimental results
     [Figure: three plots of execution time against w (8, 16, 32, 64, 128), for |P| = 100, |P| = 1000 and |P| = 10000, comparing the best WM(q,γ) with MBNDM; points are annotated with configurations WM(4,2), WM(6,1), WM(8,1), WM(8,2) and WM(8,3)]
     Comparison of execution times between WM(q,γ) and one of the fastest algorithms currently in the literature.
  9. A T A C G T T C A G A T T G C C A G C A C G T T
     The End
