String Searching Algorithms 
Problem Description 
Given two strings P and T over the same alphabet Σ, 
determine whether P occurs as a substring in T (or 
find at which position(s) P occurs as a substring in T). 
The strings P and T are called the pattern and the target, 
respectively. 
[Adapted from G.Plaxton]
Some applications 
[Adapted from K.Wayne]
Trivial Approach - Algorithm 
SimpleMatcher(string P, string T) 
  n ← length[T] 
  m ← length[P] 
  for s ← 0 to n – m do 
    if P[1...m] = T[s+1 ... s+m] then 
      print s 
T(n,m) = (n – m + 1) · m · Θ(1) = Θ(n m)
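A minimal Python sketch of the trivial approach, using 0-based shifts s as in the pseudocode. The function name simple_matcher and the choice to return a list instead of printing are assumptions made for illustration.

```python
def simple_matcher(pattern: str, text: str) -> list[int]:
    """Return every shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):            # s = 0 .. n - m
        if text[s:s + m] == pattern:      # compares up to m characters
            shifts.append(s)
    return shifts


print(simple_matcher("234", "289462340372392345"))  # [5, 14]
```

Each of the n – m + 1 shifts costs up to m character comparisons, which is where the Θ(n m) bound comes from.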
Rabin-Karp Algorithm - Idea 
Pattern - P[1..m] 
Target - T[1..n] 
Interpret each character as a decimal digit: 
p = P[m] + 10 P[m–1] + 100 P[m–2] + ... + 10^{m–1} P[1] 
for s = 0 to n – m: 
  t_s = T[s+m] + 10 T[s+m–1] + 100 T[s+m–2] + ... + 10^{m–1} T[s+1] 
P matches T at shift s if and only if p = t_s
Rabin-Karp Algorithm - Idea 
p = P[m] + 10 P[m–1] + 100 P[m–2] + ... + 10^{m–1} P[1] 
By Horner's rule: 
p = P[m] + 10 (P[m–1] + 10 (P[m–2] + ... + 10 (P[2] + 10 P[1]) ...)) 
t_0 = T[m] + 10 T[m–1] + 100 T[m–2] + ... + 10^{m–1} T[1] 
t_0 = T[m] + 10 (T[m–1] + 10 (T[m–2] + ... + 10 (T[2] + 10 T[1]) ...)) 
Each following value is obtained from the previous one in constant time: 
t_{s+1} = 10 (t_s – 10^{m–1} T[s+1]) + T[s+m+1]
Rabin-Karp Algorithm - Example 1 
T = 289462340372392345 
P = 234 
p = 234 
t = 289, 894, 946, 462, 623, 234, 340, 403, 37, 372, 723, 239, 392, 923, 234, 345 
p = t_s at s = 5 and s = 14
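The window values of this example can be reproduced with Horner's rule for t_0 and the constant-time update for t_{s+1}. This is a sketch only: the name window_values and the restriction to texts made of decimal digits are assumptions for illustration.

```python
def window_values(text: str, m: int, base: int = 10) -> list[int]:
    """Return t_0, ..., t_{n-m} for a text consisting of decimal digits."""
    digits = [int(c) for c in text]
    n = len(digits)
    if m == 0 or m > n:
        return []
    t = 0
    for d in digits[:m]:                  # Horner's rule for t_0
        t = base * t + d
    high = base ** (m - 1)                # weight of the leading digit
    values = [t]
    for s in range(n - m):                # t_{s+1} from t_s in O(1)
        t = base * (t - high * digits[s]) + digits[s + m]
        values.append(t)
    return values


t = window_values("289462340372392345", 3)
print(t)
print([s for s, v in enumerate(t) if v == 234])  # [5, 14]
```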
Rabin-Karp Algorithm - Problem 
What to do if p is too large to be stored in an 
integer data type? 
(The simplest) solution: 
Use p mod q and t_s mod q instead of p and t_s. 
If p mod q ≠ t_s mod q, no match is possible at shift s. 
If p mod q = t_s mod q, we have to check for a match explicitly.
Rabin-Karp Algorithm - Algorithm 
RabinKarpMatcher(string P, string T, integer d, integer q) 
  n ← length[T]; m ← length[P] 
  h ← d^{m–1} mod q 
  p ← 0; t_0 ← 0 
  for i ← 1 to m do 
    p ← (d p + P[i]) mod q 
    t_0 ← (d t_0 + T[i]) mod q 
  for s ← 0 to n – m do 
    if p = t_s then 
      if P[1...m] = T[s+1 ... s+m] then 
        print s 
    if s < n – m then 
      t_{s+1} ← (d (t_s – T[s+1] h) + T[s+m+1]) mod q
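A Python sketch of the matcher above. Assumptions made for illustration: characters are mapped to integers with ord, the defaults d = 256 and q = 1_000_003 are arbitrary choices, and shifts are returned in a list instead of being printed.

```python
def rabin_karp_matcher(pattern: str, text: str,
                       d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return every shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                  # d^(m-1) mod q
    p = t = 0
    for i in range(m):                    # Horner's rule, reduced mod q
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes only make a match possible; verify explicitly.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                     # slide the window by one
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("234", "289462340372392345", d=10, q=5))  # [5, 14]
```

Python's % always returns a non-negative result for a positive modulus, so the subtraction inside the update needs no extra correction here; in languages where % can be negative, q would have to be added back.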
Rabin-Karp Algorithm - Example 2 
T = 289462340372392345 
P = 234 
q = 5 
p = 234 
p mod q = 4 
t = 289, 894, 946, 462, 623, 234, 340, 403, 37, 372, 723, 239, 392, 923, 234, 345 
t mod q = 4, 4, 1, 2, 3, 4, 0, 3, 2, 2, 3, 4, 2, 3, 4, 0 
p mod q = t_s mod q at s = 0, 1, 5, 11, 14; only s = 5 and s = 14 are real matches
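The residues of this example, and the shifts where the hash agrees without a real match, can be recomputed directly. The variable names below are assumptions for illustration; the windows are re-read from the text rather than updated incrementally, to keep the sketch short.

```python
T = "289462340372392345"
P = "234"
q = 5
m = len(P)

# Residue of every window of length m, read as a decimal number.
residues = [int(T[s:s + m]) % q for s in range(len(T) - m + 1)]

# Shifts whose residue equals p mod q must be verified explicitly.
candidates = [s for s, r in enumerate(residues) if r == int(P) % q]
matches = [s for s in candidates if T[s:s + m] == P]
spurious = [s for s in candidates if s not in matches]

print(residues)
print(candidates)  # [0, 1, 5, 11, 14]
print(matches)     # [5, 14]
print(spurious)    # [0, 1, 11]
```

With such a small q most of the candidate shifts are spurious, which is why q is normally taken as large as the integer type comfortably allows.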
Rabin-Karp Algorithm - generalization 
Instead of computing numbers mod q, we can use an arbitrary 
hash function (ideally one that can be updated in constant time 
when the window slides by one position)
Rabin-Karp Algorithm - Complexity 
[Adapted from T.Ralphs]
Worst case: 
T(n,m) = (n – m + 1) · m · Θ(1) = Θ(n m) 
Average case: 
number of correct matches - v 
expected number of incorrect (spurious) matches - about n/q 
T(n,m) = Θ(n + m) + Θ(m (v + n/q)) 
If v is small and m ≤ q, then T(n,m) = Θ(n + m)
Two dimensional pattern matching 
[Adapted from M.Crochemore,T.Lecroq]
Knuth-Morris-Pratt Algorithm - Idea 
[Adapted from A.Cawsey]
Knuth-Morris-Pratt Algorithm - Some history 
[Adapted from K.Wayne]
Knuth-Morris-Pratt Algorithm - Idea 
T = gadji beri bimba glandridi 
P = gadjama 
g a d j i b e r i b i m b a g l a n d r i d i 
g a d j a m a 
g a d j i b e r i b i m b a g l a n d r i d i 
g a d j a m a
Knuth-Morris-Pratt Algorithm - Idea 
T = gadjama gramma berida 
P = gaga 
g a d j a m a g r a m m a b e r i d a 
g a g a 
g a d j a m a g r a m m a b e r i d a 
g a g a
Knuth-Morris-Pratt Algorithm - Idea 
For each q = 1, ..., m, compute the number of positions by 
which the pattern can be advanced if a mismatch is detected 
after the first q characters of P have been matched.
KMP - Algorithm 
KnuthMorrisPrattMatcher(string P, string T) 
  n ← length[T] 
  m ← length[P] 
  π ← PrefixFunction(P) 
  q ← 0 
  for i ← 1 to n do 
    while q > 0 & P[q+1] ≠ T[i] do 
      q ← π[q] 
    if P[q+1] = T[i] then 
      q ← q + 1 
    if q = m then 
      print i – m 
      q ← π[q]
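A Python sketch of the matcher, with 0-based strings. The helper prefix_function is included only to make the block self-contained; it stores π[q] at index q – 1. The function names and the list return value are assumptions for illustration.

```python
def prefix_function(p: str) -> list[int]:
    """pi[q - 1] holds the 1-based value pi[q] for q = 1..m."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(pattern: str, text: str) -> list[int]:
    """Return every shift s (0-based) at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    pi = prefix_function(pattern)
    shifts = []
    q = 0                                  # number of characters matched
    for i, c in enumerate(text):
        while q > 0 and pattern[q] != c:   # fall back along pi
            q = pi[q - 1]
        if pattern[q] == c:
            q += 1
        if q == m:                         # full match ending at index i
            shifts.append(i - m + 1)
            q = pi[q - 1]
    return shifts


print(kmp_matcher("abra", "abracadabra"))  # [0, 7]
```

The text index i never moves backwards; only q is adjusted after a mismatch.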
KMP - Prefix Function 
A << B - A is a prefix of B, e.g. ab << abacae 
A >> B - A is a suffix of B, e.g. ae >> abacae 
P_s - initial substring (prefix) of P of length s 
sP - terminal substring (suffix) of P of length s 
Prefix function: 
π : {1, 2, ..., m} → {0, 1, 2, ..., m–1} 
π[q] = max {k : k < q & P_k >> P_q}
KMP - Prefix Function 
π[q] = max {k : k < q & P_k >> P_q} 
π[q] is the length of the longest prefix of P that is a proper 
suffix of P_q. 
If a mismatch is detected after q characters have been matched 
(i.e. at position q + 1 of P), then the pattern can be advanced 
by q – π[q] positions.
KMP - Prefix Function - Example 
π[q] = max {k : k < q & P_k >> P_q} 
π[q] is the length of the longest prefix of P that is a proper 
suffix of P_q. 
P = abracadabra 
π = 0,0,0,1,0,1,0,1,2,3,4
KMP - Prefix Function - Algorithm 
PrefixFunction(string P) 
  m ← length[P] 
  π[1] ← 0 
  k ← 0 
  for q ← 2 to m do 
    while k > 0 & P[k+1] ≠ P[q] do 
      k ← π[k] 
    if P[k+1] = P[q] then 
      k ← k + 1 
    π[q] ← k 
  return π
KMP - Complexity 
In KnuthMorrisPrattMatcher the for loop is executed Θ(n) times and, 
in the worst case, the while loop is executed Θ(m) times within one 
iteration of the for loop. 
Thus T(n,m) = O(n m)... 
T(n,m) = T_P(m) + n T_While(m) = O(n m)? 
• the value of q is increased at most n times (at most once per iteration of the for loop) 
• q ≥ 0 always, and each execution of the while loop body (q ← π[q]) strictly decreases q 
Thus q cannot be decreased more than n times, i.e. 
the while loop body is executed no more than n times in total. 
T(n,m) = T_P(m) + Θ(n)
KMP - Prefix Function - Correctness 
π[q] = max {k : k < q & P_k >> P_q} 
π[q] is the length of the longest prefix of P that is a proper 
suffix of P_q. 
We define: π^0[q] = q, π^{i+1}[q] = π[π^i[q]] 
π*[q] = {q, π[q], π^2[q], ..., π^t[q] = 0}
KMP - Prefix Function - Correctness 
π^0[q] = q, π^{i+1}[q] = π[π^i[q]] 
π*[q] = {q, π[q], π^2[q], ..., π^t[q] = 0} 
Lemma 
Let P be a pattern of length m with prefix function π. 
Then, for q = 1, 2, ..., m we have π*[q] = {k : P_k >> P_q} 
KMP - Prefix Function - Correctness 
Lemma 
Let P be a pattern of length m with prefix function π. 
For q = 1, 2, ..., m, if π[q] > 0, then π[q] – 1 ∈ π*[q–1]. 
KMP - Prefix Function - Correctness 
For q = 2, ..., m we define E_{q–1} ⊆ π*[q–1] by 
E_{q–1} = {k : k ∈ π*[q–1] & k < q–1 & P[k+1] = P[q]} 
(k = q–1 is excluded, since π*[q–1] as defined here contains q–1 itself) 
Corollary 
Let P be a pattern of length m with prefix function π. 
For q = 2, ..., m: 
π[q] = 0, if E_{q–1} = ∅ 
π[q] = 1 + max{k ∈ E_{q–1}}, if E_{q–1} ≠ ∅
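The lemma on π* and the corollary can be checked numerically for a given pattern. This is a sketch under stated assumptions: π is stored 1-based (index 0 unused) to match the formulas, E_{q–1} is taken with k = q – 1 excluded, and the names pi_star and check are hypothetical helpers, not part of the original material.

```python
def prefix_function(p: str) -> list[int]:
    """Return pi with pi[q] for q = 1..m (1-based; pi[0] is unused)."""
    m = len(p)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:   # P[k+1] != P[q]
            k = pi[k]
        if p[k] == p[q - 1]:                # P[k+1] = P[q]
            k += 1
        pi[q] = k
    return pi


def pi_star(pi: list[int], q: int) -> set[int]:
    """The chain {q, pi[q], pi[pi[q]], ..., 0}."""
    chain = {q}
    while q > 0:
        q = pi[q]
        chain.add(q)
    return chain


def check(p: str) -> bool:
    m = len(p)
    pi = prefix_function(p)
    for q in range(1, m + 1):
        # Lemma: pi*[q] = {k : P_k is a suffix of P_q}
        borders = {k for k in range(q + 1) if p[:q].endswith(p[:k])}
        if pi_star(pi, q) != borders:
            return False
    for q in range(2, m + 1):
        # Corollary, with k = q - 1 excluded from E_{q-1}
        e = {k for k in pi_star(pi, q - 1)
             if k < q - 1 and p[k] == p[q - 1]}
        if pi[q] != (1 + max(e) if e else 0):
            return False
    return True


print(check("abracadabra"))  # True
```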
KMP - Prefix Function - Correctness 
We consecutively compute π[1], π[2], ..., π[m] 
π[1] = 0 
For k > 1: 
if P[k] = P[π[k–1] + 1], then π[k] = π[k–1] + 1, 
else, if P[k] = P[π[π[k–1]] + 1], then π[k] = π[π[k–1]] + 1, 
else, if P[k] = P[π[π[π[k–1]]] + 1], then π[k] = π[π[π[k–1]]] + 1, 
... 
and π[k] = 0 if no iterate of π yields a match.
KMP - Prefix Function - Complexity 
T_P(m) = const + m T_While(m) = O(m^2)? 
• the value of k is increased at most m – 1 times (at most once per iteration of the for loop) 
• k ≥ 0 always 
Thus k cannot be decreased more than m – 1 times, i.e. 
the while loop body of PrefixFunction is executed no more than m – 1 times in total. 
T_P(m) = Θ(m)
KMP - Complexity 
T(n,m) = T_P(m) + Θ(n) = Θ(m) + Θ(n) = Θ(m + n)
Boyer-Moore Algorithm - Idea 1 
T = gadji beri bimba glandridi 
P = lonni 
g a d j i b e r i b i m b a g l a n d r i d i 
l o n n i 
g a d j i b e r i b i m b a g l a n d r i d i 
l o n n i 
Bad character heuristic
Boyer-Moore Algorithm - Idea 2 
T = gadji beri bimba glandridi 
P = ajiji 
g a d j i b e r i b i m b a g l a n d r i d i 
a j i j i 
g a d j i b e r i b i m b a g l a n d r i d i 
a j i j i 
Good suffix heuristic
Boyer-Moore - Bad Character Function 
Bad character function: 
λ : Σ → {0, 1, 2, ..., m} 
λ[s] = max {k : P[k] = s} (if such k exists) 
λ[s] = 0 (otherwise) 
[Adapted from M.Goodrich, R.Tamassia]
Boyer-Moore - Bad Character Function 
BadCharacterFunction(string P, set Σ) 
  m ← length[P] 
  for a ∈ Σ do 
    λ[a] ← 0 
  for j ← 1 to m do 
    λ[P[j]] ← j 
  return λ 
T_B(m,|Σ|) = Θ(m + |Σ|)
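A Python sketch of the bad character table, keeping the 1-based positions of the pseudocode. Representing the alphabet as a string and the table as a dict are assumptions for illustration.

```python
def bad_character_function(pattern: str, alphabet: str) -> dict[str, int]:
    """Map each character to the 1-based index of its last occurrence
    in pattern, or to 0 if it does not occur."""
    last = {a: 0 for a in alphabet}
    for j, c in enumerate(pattern, start=1):
        last[c] = j                       # later occurrences overwrite
    return last


table = bad_character_function("lonni", "abcdefghijklmnopqrstuvwxyz ")
print({c: j for c, j in table.items() if j > 0})
# {'i': 5, 'l': 1, 'n': 4, 'o': 2}
```

The first loop costs |Σ| steps and the second m steps, matching the stated bound.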
Boyer-Moore - Suffix Function 
Suffix function: 
γ : {0, 1, ..., m} → {1, 2, ..., m} 
γ[j] = m – max {k : 0 ≤ k < m & (P[j+1..m] >> P_k ∨ P_k >> P[j+1..m])} 
Here P[j+1..m] is the suffix of P that has already been matched when a mismatch occurs at position j. 
[Adapted from R.Lee, C.Lu]
Boyer-Moore - Suffix Function 
SuffixFunction(string P) 
  m ← length[P] 
  π ← PrefixFunction(P) 
  P' ← Reverse(P); π' ← PrefixFunction(P') 
  for j ← 0 to m do 
    γ[j] ← m – π[m] 
  for l ← 1 to m do 
    j ← m – π'[l] 
    if γ[j] > l – π'[l] then 
      γ[j] ← l – π'[l] 
  return γ 
T_S(m) = Θ(m)
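A Python sketch of SuffixFunction. The returned list is indexed by j = 0..m as in the pseudocode, while the helper prefix_function stores the 1-based value π[q] at index q – 1. A non-empty pattern is assumed, and the function names are assumptions for illustration.

```python
def prefix_function(p: str) -> list[int]:
    """pi[q - 1] holds the 1-based value pi[q] for q = 1..m."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def suffix_function(p: str) -> list[int]:
    """Good suffix shifts gamma[0..m]; assumes len(p) >= 1."""
    m = len(p)
    pi = prefix_function(p)
    pi_rev = prefix_function(p[::-1])     # prefix function of Reverse(P)
    gamma = [m - pi[m - 1]] * (m + 1)     # default shift: m - pi[m]
    for l in range(1, m + 1):
        j = m - pi_rev[l - 1]
        shift = l - pi_rev[l - 1]
        if gamma[j] > shift:
            gamma[j] = shift
    return gamma


print(suffix_function("ajiji"))  # [5, 5, 5, 2, 2, 1]
```

Both prefix functions take linear time and the two loops are linear as well, in line with T_S(m) = Θ(m).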
Boyer-Moore Algorithm - Algorithm 
BoyerMooreMatcher(string P, string T, set Σ) 
  n ← length[T] 
  m ← length[P] 
  λ ← BadCharacterFunction(P, Σ) 
  γ ← SuffixFunction(P) 
  s ← 0 
  while s ≤ n – m do 
    j ← m 
    while j > 0 & P[j] = T[s + j] do 
      j ← j – 1 
    if j = 0 then 
      print s 
      s ← s + γ[0] 
    else s ← s + max(γ[j], j – λ[T[s + j]])
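A Python sketch of the matcher loop. To keep the block short and self-contained, the good suffix table is computed here directly from its definition, which is a slow stand-in (not linear time) for SuffixFunction; the bad character table is a dict that treats missing characters as 0. All function names are assumptions for illustration.

```python
def suffix_function_by_definition(p: str) -> list[int]:
    """Good suffix shifts gamma[0..m], evaluated naively (slow stand-in)."""
    m = len(p)
    gamma = []
    for j in range(m + 1):
        suf = p[j:]                       # matched suffix P[j+1..m]
        k = max(k for k in range(m)
                if p[:k].endswith(suf) or suf.endswith(p[:k]))
        gamma.append(m - k)
    return gamma


def boyer_moore_matcher(pattern: str, text: str) -> list[int]:
    """Return every shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = {c: j for j, c in enumerate(pattern, start=1)}  # 1-based
    gamma = suffix_function_by_definition(pattern)
    shifts = []
    s = 0
    while s <= n - m:
        j = m
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1                        # compare right to left
        if j == 0:
            shifts.append(s)
            s += gamma[0]
        else:
            bad = text[s + j - 1]         # the mismatching text character
            s += max(gamma[j], j - last.get(bad, 0))
    return shifts


print(boyer_moore_matcher("234", "289462340372392345"))  # [5, 14]
```

Since γ[j] ≥ 1 for every j, the shift is always positive even when the bad character term j – λ[...] is zero or negative.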
Boyer-Moore Algorithm - Complexity 
T_B(m,|Σ|) = Θ(m + |Σ|) 
T_S(m) = Θ(m) 
T(n,m,|Σ|) = T_B(m,|Σ|) + T_S(m) + n T_While(m) = 
= Θ(m + |Σ|) + Θ(m) + O(n m) = O(|Σ| + n m)? 
It can be shown that, when the pattern does not occur in the text, 
T(n,m,|Σ|) = Θ(|Σ| + n + m) in the worst case
Boyer-Moore Algorithm - Complexity 
It can be shown that, in the worst case: 
T(n,m,|Σ|) = Θ(|Σ| + n m) using only the bad character rule 
T(n,m,|Σ|) = Θ(|Σ| + n + m) using only the good suffix rule, if 
the pattern does not occur in the text 
T(n,m,|Σ|) = Θ(|Σ| + n m) using only the good suffix rule, if 
the pattern does occur in the text
Boyer-Moore Algorithm - Complexity 
With Galil's modification: 
T(n,m,|Σ|) = Θ(|Σ| + n + m) using only the good suffix rule 
There is also a similar Apostolico-Giancarlo algorithm that 
achieves the Θ(|Σ| + n + m) time bound (which is much easier to 
prove) 
On average the number of character comparisons is about n/m 
(for large |Σ|)
Algorithms - Complexity comparison 
[Adapted from H.Løvengreen]
Algorithms - Efficiency comparison 
n=5000 
[Adapted from I.Spence]
Complexity - Lower Bound 
Theorem (Rivest) 
Any string searching algorithm has worst-case time 
complexity 
T(n,m) = Ω(m + n) 
Suffix Trees - The problem 
Despite this lower bound, we can probably do better! 
(Well, for a slightly different problem...) 
[Adapted from P.Kilpeläinen]
Suffix Trees 
Suffix Trees - Example 
Suffix Trees - Do they always exist? 
Suffix Trees - Application to string matching 
Suffix Trees - Construction 
Suffix Trees - Construction - Example 
Suffix Trees - Construction - Complexity 
Suffix Trees - Compact representation 
Suffix Trees - Compact representation - Example 
Suffix Trees - Some history 
Suffix Trees - Ukkonen's algorithm 
Suffix Trees - Implicit trees 
Suffix Trees - String paths 
Suffix Trees - Ukkonen's algorithm 
Suffix Trees - Extensions 
Suffix Trees - Extensions - Example 
Suffix Trees - Ukkonen's algorithm - Complexity 
Suffix Trees - Suffix links 
Suffix Trees - Speeding up 
Suffix Trees - Eliminating extensions 
Suffix Trees - Single phase algorithm 
Suffix Trees - Ukkonen's algorithm - Complexity 
[Adapted from P.Kilpeläinen]
