Fast Searching in Biological Sequences Using Multiple Hash Functions
1. A T A C G T T C A G A T T G C C A G C A C G T T
Algorithms & Complexity Evaluation
Fast Search in Biological Sequences using Multiple Hash Functions
2. Grasping the problem
String matching??? What's this?
We are going to deal with a very tiny alphabet representing the nucleotides in a genetic sequence:
A = ADENINE
T = THYMINE
G = GUANINE
C = CYTOSINE
Searching a sequence for multiple patterns.
DNA sequence:
A T G A C G A C T
Patterns to search:
T G A G C
A G G C A
T G T C G
The search window slides over the DNA sequence. After verifying matches, advance the window by 1 position: pos++
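The naive approach described on this slide can be sketched as follows. This is only an illustration: the function name `naive_multi_search` and the sample text in the usage note are made up here, and the window length is assumed to be the length of the shortest pattern.

```python
def naive_multi_search(text, patterns):
    """Slide a window over the text one position at a time and
    check every pattern at every position (no clever shifting)."""
    m = min(len(p) for p in patterns)  # assumed window size: shortest pattern
    matches = []
    pos = 0
    while pos + m <= len(text):
        for p in patterns:
            # str.startswith with a start offset compares p against text[pos:]
            if text.startswith(p, pos):
                matches.append((pos, p))
        pos += 1  # after verifying matches, advance the window: pos++
    return matches
```

For instance, `naive_multi_search("ATGAGCA", ["TGAGC", "AGGCA"])` reports `TGAGC` at position 1. Every window position is visited, which is exactly the cost the shift tables of the next slides try to avoid.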
Presentation by Simone Tino - All rights reserved. Authored from November 2012 to December 2012 - University of Catania - Faculty of Computer Science - Algoritmi e Complessità
3. Let's talk about Wu & Manber
Don't worry! It's not a magic spell... it's just an algorithm.
First we have the pre-processing stage...
A pattern (NOT A TEXT!!! The text comes later, in the search phase):
T G A G C A C T G
q-gram size: q = 3
Extracting the first q-gram: T G A
Feeding the hash function with the extracted q-gram, a hash is returned: 0 <= hash <= MAX
HASH ( T G A ) = #@!*$%£&?
The calculated hash is used as an index into the shift array, and the stored value is used to shift the window:
sh[ #@!*$%£&? ] = shift
The pattern itself is stored under the hash of its last q-gram:
F[HASH('CTG')] = patterns[cur]
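A minimal sketch of this pre-processing stage, under a few assumptions that are not in the slides: the hash simply packs 2 bits per nucleotide (so MAX = 4^q - 1), the shift stored for a q-gram starting at index i is m - q - i, q-grams that never occur get the default shift m - q + 1, and `F` is a dictionary of buckets. The names `hash_qgram` and `preprocess` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit nucleotide encoding

def hash_qgram(gram):
    """Toy hash (an assumption): pack 2 bits per character."""
    h = 0
    for ch in gram:
        h = (h << 2) | CODE[ch]
    return h

def preprocess(patterns, q):
    """Build the shift array sh and the bucket table F."""
    m = min(len(p) for p in patterns)   # window size
    sh = [m - q + 1] * (4 ** q)         # default shift for unseen q-grams
    F = {}
    for p in patterns:
        for i in range(m - q + 1):      # every q-gram of the first m characters
            h = hash_qgram(p[i:i + q])
            sh[h] = min(sh[h], m - q - i)
        # the pattern is stored under the hash of its last q-gram
        F.setdefault(hash_qgram(p[m - q:m]), []).append(p)
    return m, sh, F
```

With the slide's pattern `TGAGCACTG` and q = 3, the first q-gram `TGA` gets shift 6, the last q-gram `CTG` gets shift 0, and the pattern lands in the bucket `F[hash_qgram("CTG")]`.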
[Figure: DNA sequence]
Then we can move to the real search...
window size = pattern size = m
Text in the window: A C A A C T G G C
Extracting the last q-gram only: G G C
The hash function gets the q-gram, a hash is returned: 0 <= hash <= MAX
HASH ( G G C ) = ^@!*%£$?#
The hash is the shift index:
shift = sh[ ^@!*%£$?# ]
shift == 0? If true: NAIVE CHECK. Otherwise the window is shifted by the value found.
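The search loop can be sketched like this. It is an illustration under assumptions, not the presenter's code: the 2-bit hash, the default shift m - q + 1, the dictionary `F`, and the names `hash_qgram`, `preprocess` and `wm_search` are all made up for the example, and the text is assumed to contain only A, C, G, T.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit nucleotide encoding

def hash_qgram(gram):
    h = 0
    for ch in gram:
        h = (h << 2) | CODE[ch]
    return h

def preprocess(patterns, q):
    m = min(len(p) for p in patterns)
    sh = [m - q + 1] * (4 ** q)          # default shift for unseen q-grams
    F = {}
    for p in patterns:
        for i in range(m - q + 1):
            h = hash_qgram(p[i:i + q])
            sh[h] = min(sh[h], m - q - i)
        F.setdefault(hash_qgram(p[m - q:m]), []).append(p)
    return m, sh, F

def wm_search(text, patterns, q=3):
    m, sh, F = preprocess(patterns, q)
    matches = []
    pos = 0
    while pos + m <= len(text):
        # hash only the last q-gram of the current window
        h = hash_qgram(text[pos + m - q:pos + m])
        shift = sh[h]
        if shift == 0:
            # naive check of every pattern stored in this bucket
            for p in F.get(h, []):
                if text.startswith(p, pos):
                    matches.append((pos, p))
            pos += 1
        else:
            pos += shift
    return matches
```

On the window `ACAACTGGC` the last q-gram `GGC` does not occur in the pattern `TGAGCACTG`, so the window jumps by the default shift instead of moving one position at a time.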
4. T G A
k = Math.floor(w/q);
[Figure: bit strings labeled k and w]
Decrease the number of false positives:
More text to analyze: increase q
More bits per char: increase k
W-M limit: cannot increase them both...
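A small sketch of the trade-off, assuming that w is the number of bits available for one hash value and that the hash keeps k bits from each of the q characters. Both assumptions, and the helper names `bits_per_char` and `hash_qgram`, are illustrative and not taken from the slides.

```python
def bits_per_char(w, q):
    """k = floor(w / q): bits each character may contribute to a w-bit hash."""
    return w // q

def hash_qgram(gram, w):
    """Toy hash (an assumption): keep the k low bits of each character code."""
    k = bits_per_char(w, len(gram))
    mask = (1 << k) - 1
    h = 0
    for ch in gram:
        h = (h << k) | (ord(ch) & mask)
    return h  # always fits in w bits, since q * k <= w
```

With w = 16, choosing q = 2 leaves k = 8 bits per character, q = 4 leaves k = 4, and q = 8 leaves only k = 2: a longer q-gram means fewer bits per character, which is why both cannot grow together.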
5. Enhancing W-M...
pre-processing
T G A G C A C T G
γ = 1: HASH('CTG') = h1
γ = 2: HASH('GCA') = h2
HASH ( T G A ) = #@!*$%
sh1[ #@!*$% ] = m-q-i
sh2[ #@!*$% ] = m-2q-i
h = (h1 << 1) + h2
F[h] = patterns[cur]
to be continued...
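A possible sketch of the two-table pre-processing for γ = 2. Assumptions made here and not stated on the slide: the 2-bit hash, the default shifts m - q + 1 and m - 2q + 1 for q-grams that never occur, the restriction of sh2 to q-grams with i <= m - 2q (so the stored value is never negative), and the requirement m >= 2q. The names `hash_qgram` and `preprocess2` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit nucleotide encoding

def hash_qgram(gram):
    h = 0
    for ch in gram:
        h = (h << 2) | CODE[ch]
    return h

def preprocess2(patterns, q):
    """Build sh1, sh2 and the bucket table F for two hash values."""
    m = min(len(p) for p in patterns)
    assert m >= 2 * q, "this sketch needs two adjacent q-grams in the window"
    size = 4 ** q
    sh1 = [m - q + 1] * size        # default for the last q-gram
    sh2 = [m - 2 * q + 1] * size    # default for the q-gram before it
    F = {}
    for p in patterns:
        for i in range(m - q + 1):
            h = hash_qgram(p[i:i + q])
            sh1[h] = min(sh1[h], m - q - i)
            if i <= m - 2 * q:
                sh2[h] = min(sh2[h], m - 2 * q - i)
        h1 = hash_qgram(p[m - q:m])           # last q-gram, e.g. 'CTG'
        h2 = hash_qgram(p[m - 2 * q:m - q])   # q-gram before it, e.g. 'GCA'
        F.setdefault((h1 << 1) + h2, []).append(p)
    return m, sh1, sh2, F
```

For the pattern `TGAGCACTG` with q = 3 this gives sh1 = 6 and sh2 = 3 for `TGA`, sh1 = 0 for `CTG`, and sh2 = 0 for `GCA`.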
6. ...Enhancing W-M
search window
[Figure: DNA sequence]
a text: A C A A C T G G C
HASH ( G G C ) = §+!#*£$?% (h1)
shift1 = sh1[ §+!#*£$?% ]
HASH ( A C T ) = ^@!*%£$?# (h2)
shift2 = sh2[ ^@!*%£$?# ]
...now you can't go back
In the end...
h = (h1 << 1) + h2
if (shift1 == 0 && shift2 == 0)
    foreach (p in F[h])
        checkOccurrInWin(p);
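Putting both phases together, a runnable sketch of the search with two hash values might look like this. Several choices are assumptions of this example rather than facts from the slides: the 2-bit hash, the default shifts, advancing by the larger of the two shifts when they are not both zero, advancing by 1 after a verification, and the names `preprocess2` and `wm2_search`.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit nucleotide encoding

def hash_qgram(gram):
    h = 0
    for ch in gram:
        h = (h << 2) | CODE[ch]
    return h

def preprocess2(patterns, q):
    m = min(len(p) for p in patterns)
    assert m >= 2 * q, "this sketch needs two adjacent q-grams in the window"
    sh1 = [m - q + 1] * (4 ** q)
    sh2 = [m - 2 * q + 1] * (4 ** q)
    F = {}
    for p in patterns:
        for i in range(m - q + 1):
            h = hash_qgram(p[i:i + q])
            sh1[h] = min(sh1[h], m - q - i)
            if i <= m - 2 * q:
                sh2[h] = min(sh2[h], m - 2 * q - i)
        h1 = hash_qgram(p[m - q:m])
        h2 = hash_qgram(p[m - 2 * q:m - q])
        F.setdefault((h1 << 1) + h2, []).append(p)
    return m, sh1, sh2, F

def wm2_search(text, patterns, q=3):
    m, sh1, sh2, F = preprocess2(patterns, q)
    matches = []
    pos = 0
    while pos + m <= len(text):
        h1 = hash_qgram(text[pos + m - q:pos + m])          # last q-gram
        h2 = hash_qgram(text[pos + m - 2 * q:pos + m - q])  # the one before it
        shift1, shift2 = sh1[h1], sh2[h2]
        if shift1 == 0 and shift2 == 0:
            # both tables agree a match is possible: verify the bucket
            for p in F.get((h1 << 1) + h2, []):
                if text.startswith(p, pos):
                    matches.append((pos, p))
            pos += 1
        else:
            pos += max(shift1, shift2)  # assumed: take the larger safe shift
    return matches
```

Taking the larger shift is safe in this sketch because each table alone gives a lower bound on how far the window must move before a match becomes possible. The second table also filters out windows whose last q-gram matches by chance, so fewer naive checks are triggered.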
7. Complexities
Pre-processing
$O(\mathrm{MAX}\,(1 + \gamma) + r)$ = Space requirement
$O(\mathrm{MAX} + r\,m\,q)$ = Time requirement
Search phase
$O(m^{(1)}\,n)$ = Time requirement
$m^{(1)} = \sum_{i=1}^{r} \mathrm{len}(p_i)$
8. Experimental results
[Figure: three plots of time versus w (w = 8, 16, 32, 64, 128), one each for |P| = 100, |P| = 1000 and |P| = 10000. Each plot compares MBNDM with the best WM(q,γ); the variants labeled along the curves are WM(6,1), WM(4,2), WM(8,1), WM(8,2) and WM(8,3).]
Comparison of execution times between WM(q,γ) and one of the fastest algorithms currently in the literature.
9. A T A C G T T C A G A T T G C C A G C A C G T T
The End