String searching

String Searching Algorithms
Problem Description
Given two strings P and T over the same alphabet Σ,
determine whether P occurs as a substring of T (or
find the position(s) at which P occurs as a substring of T).
The strings P and T are called the pattern and the target,
respectively.
[Adapted from G.Plaxton]

Some applications
[Adapted from K.Wayne]

Trivial Approach - Algorithm
SimpleMatcher(string P, string T)
  n ← length[T]
  m ← length[P]
  for s ← 0 to n − m do
    if P[1...m] = T[s+1 ... s+m] then
      print s
T(n,m) = (n − m + 1) · m · Θ(1) = Θ(n m)
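A minimal Python sketch of SimpleMatcher, assuming 0-based strings and returning the shifts in a list instead of printing them (the name simple_matcher is illustrative, not from the slides):

```python
def simple_matcher(p: str, t: str) -> list[int]:
    """Return every shift s (0-based) at which p occurs in t."""
    n, m = len(t), len(p)
    shifts = []
    for s in range(n - m + 1):
        # Compare the whole window T[s+1 .. s+m] with the pattern.
        if t[s:s + m] == p:
            shifts.append(s)
    return shifts


print(simple_matcher("234", "289462340372392345"))  # [5, 14]
```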

Rabin-Karp Algorithm - Idea
Pattern - P[1..m]
Target - T[1..n]
p = P[m] + 10 P[m−1] + 100 P[m−2] + … + 10^(m−1) P[1]
for s = 0 to n − m:
t_s = T[s+m] + 10 T[s+m−1] + 100 T[s+m−2] + … + 10^(m−1) T[s+1]
P matches T at shift s if and only if p = t_s

Rabin-Karp Algorithm - Idea
p = P[m] + 10 P[m−1] + 100 P[m−2] + … + 10^(m−1) P[1]
p = P[m] + 10 (P[m−1] + 10 (P[m−2] + …
… + 10 (P[2] + 10 P[1]) … ))
t_0 = T[m] + 10 T[m−1] + 100 T[m−2] + … + 10^(m−1) T[1]
t_0 = T[m] + 10 (T[m−1] + 10 (T[m−2] + …
… + 10 (T[2] + 10 T[1]) … ))
t_(s+1) = 10 (t_s − 10^(m−1) T[s+1]) + T[s+m+1]
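The Horner evaluation and the rolling update above can be sketched in Python as follows. This is only an illustration, assuming the target is a string of decimal digits and using 0-based indexing; window_values is a hypothetical helper name.

```python
def window_values(t: str, m: int) -> list[int]:
    """Values t_0, t_1, ..., t_(n-m) of all length-m windows of a digit string."""
    digits = [int(c) for c in t]
    n = len(digits)
    if m == 0 or m > n:
        return []
    high = 10 ** (m - 1)          # weight of the leading digit of a window
    ts = 0
    for d in digits[:m]:          # Horner's rule for t_0
        ts = 10 * ts + d
    values = [ts]
    for s in range(n - m):
        # Drop the leading digit T[s+1], shift left, append T[s+m+1].
        ts = 10 * (ts - high * digits[s]) + digits[s + m]
        values.append(ts)
    return values


print(window_values("289462340372392345", 3)[:6])  # [289, 894, 946, 462, 623, 234]
```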

Rabin-Karp Algorithm - Example 1
T = 289462340372392345
P = 234
p = 234
t = 289, 894, 946, 462, 623, 234, 340, 403, 37, 372, 723, 239,
392, 923, 234, 345

Rabin-Karp Algorithm - Problem
What can we do if p is too large to be stored in an
integer data type?
(The simplest) solution:
Use p mod q and t_s mod q instead of p and t_s
If p mod q ≠ t_s mod q, no match is possible at shift s
If p mod q = t_s mod q, we have to check for a match explicitly

Rabin-Karp Algorithm - Algorithm
RabinKarpMatcher(string P, string T, integer d, integer q)
  n ← length[T]; m ← length[P]
  h ← d^(m−1) mod q
  p ← 0; t_0 ← 0
  for i ← 1 to m do
    p ← (d p + P[i]) mod q
    t_0 ← (d t_0 + T[i]) mod q
  for s ← 0 to n − m do
    if p = t_s then
      if P[1...m] = T[s+1 ... s+m] then
        print s
    if s < n − m then
      t_(s+1) ← (d (t_s − T[s+1] h) + T[s+m+1]) mod q
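A hedged Python sketch of RabinKarpMatcher. It assumes 0-based strings, uses ord() as the character value, and picks d = 256 and q = 101 as arbitrary illustrative defaults; none of these choices come from the slides.

```python
def rabin_karp_matcher(p: str, t: str, d: int = 256, q: int = 101) -> list[int]:
    """Return all shifts s (0-based) at which p occurs in t."""
    n, m = len(t), len(p)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    ph = th = 0
    for i in range(m):                   # Horner's rule, reduced mod q
        ph = (d * ph + ord(p[i])) % q
        th = (d * th + ord(t[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes only suggest a match; verify it explicitly.
        if ph == th and t[s:s + m] == p:
            shifts.append(s)
        if s < n - m:
            # Rolling update; Python's % keeps the result non-negative.
            th = (d * (th - ord(t[s]) * h) + ord(t[s + m])) % q
    return shifts


print(rabin_karp_matcher("234", "289462340372392345"))  # [5, 14]
```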

Rabin-Karp Algorithm - Example 2
T = 289462340372392345
P = 234
q = 5
p = 234
p mod q = 4
t = 289, 894, 946, 462, 623, 234, 340, 403, 37, 372, 723, 239,
392, 923, 234, 345
t mod q = 4, 4, 1, 2, 3, 4, 0, 3, 2, 2, 3, 4, 2, 3, 4, 0
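The example above can be replayed with a short Python sketch that separates true matches from hits that agree only modulo q. It assumes digit strings and recomputes every window value directly for clarity; classify_hits is an illustrative name.

```python
def classify_hits(p: str, t: str, q: int) -> tuple[list[int], list[int]]:
    """Split the shifts where t_s mod q == p mod q into true and spurious hits."""
    m = len(p)
    p_mod = int(p) % q
    true_matches, spurious = [], []
    for s in range(len(t) - m + 1):
        window = t[s:s + m]
        if int(window) % q == p_mod:
            # The residues agree, so an explicit comparison is required.
            if window == p:
                true_matches.append(s)
            else:
                spurious.append(s)
    return true_matches, spurious


print(classify_hits("234", "289462340372392345", 5))  # ([5, 14], [0, 1, 11])
```

With q = 5, three of the five hash hits are spurious; a larger q makes such false hits rarer.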

Rabin-Karp Algorithm - generalization
Instead of calculating numbers mod q, we can use an arbitrary
hash function

Rabin-Karp Algorithm - Complexity
[Adapted from T.Ralphs]

Rabin-Karp Algorithm - Complexity
Worst case:
T(n,m) = (n − m + 1) · m · Θ(1) = Θ(n m)
Average case:
number of correct matches: v
number of incorrect matches (spurious hits): about n/q
T(n,m) = Θ(n + m) + Θ(m (v + n/q))
If v is small and m ≤ q, then T(n,m) = Θ(n + m)

Two dimensional pattern matching
[Adapted from M.Crochemore,T.Lecroq]

Knuth-Morris-Pratt Algorithm - Idea
[Adapted from A.Cawsey]

Knuth-Morris-Pratt Algorithm - Some history
[Adapted from K.Wayne]

T = gadji beri bimba glandridi
P = gadjama
g a d j i b e r i b i m b a g l a n d r i d i
g a d j a m a
g a d j a m a

T = gadjama gramma berida
P = gaga
g a d j a m a g r a m m a b e r i d a
g a g a
g a d j a m a g r a m m a b e r i d a
g a g a

For each position q = 1, …, m in P, compute the number
of positions by which the pattern can be advanced if a
mismatch is detected at the q-th position.

KMP - Algorithm
KnuthMorrisPrattMatcher(string P, string T)
  n ← length[T]
  m ← length[P]
  π ← PrefixFunction(P)
  q ← 0
  for i ← 1 to n do
    while q > 0 & P[q+1] ≠ T[i] do
      q ← π[q]
    if P[q+1] = T[i] then
      q ← q + 1
    if q = m then
      print i − m
      q ← π[q]
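A Python sketch of KnuthMorrisPrattMatcher together with the PrefixFunction it relies on. It assumes 0-based strings (so P[q+1] becomes p[q]) and keeps π 1-indexed with an unused slot 0; the function names are illustrative.

```python
def prefix_function(p: str) -> list[int]:
    """pi[q] for q = 1..m; pi[0] is an unused placeholder."""
    m = len(p)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(p: str, t: str) -> list[int]:
    """Return all shifts (0-based) at which p occurs in t."""
    n, m = len(t), len(p)
    if m == 0:
        return []
    pi = prefix_function(p)
    q = 0                                  # number of pattern characters matched
    shifts = []
    for i in range(1, n + 1):
        while q > 0 and p[q] != t[i - 1]:
            q = pi[q]                      # fall back to a shorter matched prefix
        if p[q] == t[i - 1]:
            q += 1
        if q == m:
            shifts.append(i - m)
            q = pi[q]                      # keep looking for overlapping matches
    return shifts


print(kmp_matcher("abracadabra", "abracadabracadabra"))  # [0, 7]
```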

KMP - Prefix Function
A << B: A is a prefix of B, e.g. ab << abacae
A >> B: A is a suffix of B, e.g. ae >> abacae
P_s: initial substring (prefix) of P of length s
_sP: terminal substring (suffix) of P of length s
Prefix function:
π : {1, 2, …, m} → {0, 1, 2, …, m−1}
π[q] = max {k : k < q & P_k >> P_q}

KMP - Prefix Function
π[q] = max {k : k < q & P_k >> P_q}
π[q] is the length of the longest prefix of P that is a proper
suffix of P_q.
If the first q characters of the pattern have matched and a
mismatch is detected at the next position, then the pattern
can be advanced by q − π[q] positions.
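The definition can be checked directly with a brute-force Python sketch. It is quadratic or worse and is meant only to illustrate what the prefix function measures, not how KMP computes it; the function name is illustrative.

```python
def prefix_function_by_definition(p: str) -> list[int]:
    """pi[1..m] computed literally as max{k < q : P_k is a suffix of P_q}."""
    m = len(p)
    result = []
    for q in range(1, m + 1):
        # k = 0 always qualifies, because the empty prefix is a suffix of anything.
        result.append(max(k for k in range(q) if p[:q].endswith(p[:k])))
    return result


print(prefix_function_by_definition("abracadabra"))
# [0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
```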

KMP - Prefix Function - Example
π[q] = max {k : k < q & P_k >> P_q}
π[q] is the length of the longest prefix of P that is a proper
suffix of P_q.
P = abracadabra
π = 0,0,0,1,0,1,0,1,2,3,4

KMP - Prefix Function - Algorithm
PrefixFunction(string P)
  m ← length[P]
  π[1] ← 0
  k ← 0
  for q ← 2 to m do
    while k > 0 & P[k+1] ≠ P[q] do
      k ← π[k]
    if P[k+1] = P[q] then
      k ← k + 1
    π[q] ← k
  return π

KMP - Complexity
In KnuthMorrisPrattMatcher:
the for loop over i is executed Θ(n) times
in the worst case the while loop is executed Θ(m) times per iteration of the for loop
Thus T(n,m) = O(n m)...

KMP - Complexity
T(n,m) = T_P(m) + n T_While(m) = O(n m)?
(T_P(m) is the running time of PrefixFunction)
• the value of q is increased at most n times
• q ≥ 0 always
Thus q cannot be decreased more than n times, i.e. the
while loop can be executed no more than n times in total.
T(n,m) = T_P(m) + n T_While(m) = T_P(m) + Θ(n)

KMP - Prefix Function - Correctness
π[q] = max {k : k < q & P_k >> P_q}
π[q] is the length of the longest prefix of P that is a proper
suffix of P_q.
We define: π^0[q] = q, π^(i+1)[q] = π[π^i[q]]
π*[q] = {π[q], π^2[q], …, π^t[q] = 0}

Lemma
Let P be a pattern of length m with prefix function π.
Then, for q = 1, 2, …, m we have π*[q] = {k : k < q & P_k >> P_q}

Lemma
For q = 1, 2, …, m, if π[q] > 0, then π[q] − 1 ∈ π*[q−1].

For q = 2, …, m we define E_(q−1) ⊆ π*[q−1] by
E_(q−1) = {k : k ∈ π*[q−1] & P[k+1] = P[q]}
Corollary
For q = 2, …, m:
π[q] = 0, if E_(q−1) = ∅
π[q] = 1 + max{k ∈ E_(q−1)}, if E_(q−1) ≠ ∅
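The corollary can be turned into code almost literally. The sketch below is an illustration only, with hypothetical function names; it takes π*[q] to be the values reached by iterating π from q, with q itself excluded, and builds π[q] from the set E for q − 1.

```python
def pi_star(pi: list[int], q: int) -> list[int]:
    """Iterates pi[q], pi[pi[q]], ..., 0 (q itself is not included)."""
    values = []
    k = q
    while k > 0:
        k = pi[k]
        values.append(k)
    return values


def prefix_function_via_corollary(p: str) -> list[int]:
    """pi[1..m], with pi[q] = 1 + max E_(q-1), or 0 when E_(q-1) is empty."""
    m = len(p)
    pi = [0] * (m + 1)                     # 1-indexed; pi[0] is unused
    for q in range(2, m + 1):
        # E_(q-1): members k of pi*[q-1] with P[k+1] = P[q] (0-based: p[k] == p[q-1]).
        e = [k for k in pi_star(pi, q - 1) if p[k] == p[q - 1]]
        pi[q] = 1 + max(e) if e else 0
    return pi[1:]


print(prefix_function_via_corollary("abracadabra"))
# [0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4]
```

PrefixFunction computes the same values, but it stops at the first (largest) qualifying k instead of materialising the whole set.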

We consecutively compute π[1], π[2], …, π[m]
π[1] = 0
For k > 1:
if P[k] = P[π[k−1] + 1], then π[k] = π[k−1] + 1,
else, if P[k] = P[π[π[k−1]] + 1], then π[k] = π[π[k−1]] + 1,
else, if P[k] = P[π[π[π[k−1]]] + 1], then π[k] = π[π[π[k−1]]] + 1,
…
if none of these tests succeeds, then π[k] = 0.

KMP - Prefix Function - Complexity
T_P(m) = const + m T_While(m) = O(m^2)?
• the value of k is increased at most m times
• k ≥ 0 always
Thus k cannot be decreased more than m times, i.e. the
while loop in PrefixFunction can be executed no more than m times in total.
T_P(m) = const + m T_While(m) = Θ(m)

KMP - Complexity
T(n,m) = T_P(m) + n T_While(m) =
= T_P(m) + Θ(n) =
= Θ(m) + Θ(n) =
= Θ(m + n)

Boyer-Moore Algorithm - Idea 1
P = lonni
l o n n i
l o n n i
Bad character heuristic

Boyer-Moore Algorithm - Idea 2
P = ajiji
a j i j i
a j i j i
Good suffix heuristic

Boyer-Moore - Bad Character Function
Bad character function:
λ : Σ → {0, 1, 2, …, m}
λ[s] = max {k : P[k] = s} (if such k exists)
λ[s] = 0 (otherwise)
[Adapted from M.Goodrich, R.Tamassia]

Boyer-Moore - Bad Character Function
BadCharacterFunction(string P, set Σ)
  m ← length[P]
  for a ∈ Σ do
    λ[a] ← 0
  for j ← 1 to m do
    λ[P[j]] ← j
  return λ
T_B(m,|Σ|) = Θ(m + |Σ|)
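A small Python sketch of BadCharacterFunction, assuming the alphabet is passed as a string of characters and the result is stored in a dict (both are illustrative choices, not part of the slides):

```python
def bad_character_function(p: str, alphabet: str) -> dict[str, int]:
    """Map each character to the 1-based position of its last occurrence in p (0 if absent)."""
    table = {a: 0 for a in alphabet}
    for j, ch in enumerate(p, start=1):
        table[ch] = j          # later occurrences overwrite earlier ones
    return table


print(bad_character_function("lonni", "ilnox"))
# {'i': 5, 'l': 1, 'n': 4, 'o': 2, 'x': 0}
```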

Boyer-Moore - Suffix Function
Suffix function:
γ : {1, 2, …, m} → {1, 2, …, m}
γ[j] = m − max {k : k < m & (_jP >> P_k ∨ P_k >> _jP)}
[Adapted from R.Lee, C.Lu]

[Adapted from R.Lee, C.Lu]

SuffixFunction(string P)
  m ← length[P]
  π ← PrefixFunction(P)
  P’ ← Reverse(P); π’ ← PrefixFunction(P’)
  for j ← 0 to m do
    γ[j] ← m − π[m]
  for l ← 1 to m do
    j ← m − π’[l]
    if γ[j] > l − π’[l] then
      γ[j] ← l − π’[l]
  return γ
T_S(m) = Θ(m)

Boyer-Moore Algorithm - Algorithm
BoyerMooreMatcher(string P, string T, set Σ)
  n ← length[T]
  m ← length[P]
  λ ← BadCharacterFunction(P, Σ)
  γ ← SuffixFunction(P)
  s ← 0
  while s ≤ n − m do
    j ← m
    while j > 0 & P[j] = T[s + j] do
      j ← j − 1
    if j = 0 then
      print s
      s ← s + γ[0]
    else s ← s + max(γ[j], j − λ[T[s + j]])
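A hedged Python sketch of BoyerMooreMatcher. It assumes 0-based strings, stores the bad character function in a dict (characters absent from the pattern map to 0, so no explicit alphabet is needed), and computes the suffix function from the prefix functions of P and of its reverse. All function names are illustrative.

```python
def prefix_function(p: str) -> list[int]:
    """pi[q] for q = 1..m; pi[0] is an unused placeholder."""
    m = len(p)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi


def suffix_function(p: str) -> list[int]:
    """Good suffix shifts gamma[0..m]; every entry is at least 1 for m >= 1."""
    m = len(p)
    pi = prefix_function(p)
    pi_rev = prefix_function(p[::-1])
    gamma = [m - pi[m]] * (m + 1)
    for l in range(1, m + 1):
        j = m - pi_rev[l]
        gamma[j] = min(gamma[j], l - pi_rev[l])
    return gamma


def boyer_moore_matcher(p: str, t: str) -> list[int]:
    """Return all shifts (0-based) at which p occurs in t."""
    n, m = len(t), len(p)
    if m == 0:
        return []
    last = {ch: j for j, ch in enumerate(p, start=1)}   # last occurrence, 1-based
    gamma = suffix_function(p)
    shifts = []
    s = 0
    while s <= n - m:
        j = m
        while j > 0 and p[j - 1] == t[s + j - 1]:       # compare right to left
            j -= 1
        if j == 0:
            shifts.append(s)
            s += gamma[0]
        else:
            # Take the larger of the good suffix and bad character shifts.
            s += max(gamma[j], j - last.get(t[s + j - 1], 0))
    return shifts


print(boyer_moore_matcher("234", "289462340372392345"))  # [5, 14]
```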

Boyer-Moore Algorithm - Complexity
T_B(m,|Σ|) = Θ(m + |Σ|)
T_S(m) = Θ(m)
T(n,m,|Σ|) = T_B(m,|Σ|) + T_S(m) + n T_While(m) =
= Θ(m + |Σ|) + Θ(m) + O(n m) = O(|Σ| + n m)?
It can be shown that
T(n,m,|Σ|) = Θ(|Σ| + n + m)

It can be shown that:
T(n,m,|Σ|) = Θ(|Σ| + n m) using only the bad character rule
T(n,m,|Σ|) = Θ(|Σ| + n + m) using only the good suffix rule, if
the pattern does not occur in the text
T(n,m,|Σ|) = Θ(|Σ| + n m) using only the good suffix rule, if
the pattern does occur in the text

With Galil's modification:
T(n,m,|Σ|) = Θ(|Σ| + n + m) using only the good suffix rule
There is also a similar Apostolico-Giancarlo algorithm that
achieves the Θ(|Σ| + n + m) time bound (which is much easier to
prove)
On average the number of character comparisons is about n/m
(for large |Σ|)

Algorithms - Complexity comparison
[Adapted from H.Løvengreen]

Algorithms - Efficiency comparison
n=5000
[Adapted from I.Spence]

Complexity - Lower Bound
Theorem (Rivest)
Any string searching algorithm has worst-case time
complexity
T(n,m) = Ω(m + n)

Suffix Trees - The problem
Theorem (Rivest)
Any string searching algorithm has worst-case time
complexity
T(n,m) = Ω(m + n)
Despite this, we probably can do better!
(Well, for a slightly different problem...)
[Adapted from P.Kilpeläinen]

Suffix Trees

Suffix Trees - Example

Suffix Trees - Do they always exist?

Suffix Trees - Application to string matching

Suffix Trees - Construction

Suffix Trees - Construction - Example

Suffix Trees - Construction - Complexity

Suffix Trees - Compact representation

Suffix Trees - Compact representation - Example

Suffix Trees - Some history

Suffix Trees - Ukkonen's algorithm

Suffix Trees - Implicit trees

Suffix Trees - String paths

Suffix Trees - Extensions

Suffix Trees - Extensions - Example

Suffix Trees - Ukkonen's algorithm - Complexity

Suffix Trees - Suffix links

Suffix Trees - Speeding up

Suffix Trees - Eliminating extensions

Suffix Trees - Single phase algorithm
