Analysis of Algorithms
String matching
Andres Mendez-Vazquez
November 24, 2015
1 / 40
Outline
1 String Matching
Introduction
Some Notation
2 Naive Algorithm
Using Brute Force
3 The Rabin-Karp Algorithm
Efficiency After All
Horner’s Rule
Generating Possible Matches
The Final Algorithm
Other Methods
4 Exercises
2 / 40
What is string matching?
Given two sequences of characters drawn from a finite alphabet Σ, a text
T[1..n] and a pattern P[1..m], find where P occurs in T.
Text T:    a b c a b a a b c a b a c a b
Pattern P: a b a a
Valid shift: s = 3 (P matches T[4..7])
3 / 40
Where...
A Valid Shift
P occurs with valid shift s in T if 0 ≤ s ≤ n − m and
T[s + 1..s + m] == P[1..m].
Otherwise
it is an invalid shift.
Thus
The string-matching problem is the problem of finding all valid shifts
of a given pattern P in a text T.
4 / 40
Possible Algorithms
We have the following ones
Algorithm            Preprocessing time   Worst-case matching time
Naive                0                    O((n − m + 1)m)
Rabin-Karp           Θ(m)                 O((n − m + 1)m)
Finite automaton     O(m|Σ|)              Θ(n)
Knuth-Morris-Pratt   Θ(m)                 Θ(n)
5 / 40
Notation and Terminology
Definition
We denote by Σ∗ (read “sigma-star”) the set of all finite length strings
formed using characters from the alphabet Σ.
Constraint
We consider only finite-length strings.
Some basic concepts
The zero-length empty string, denoted ε, also belongs to Σ∗.
The length of a string x is denoted |x|.
The concatenation of two strings x and y, denoted xy, has length
|x| + |y|
7 / 40
Notation and Terminology
Prefix
A string w is a prefix of a string x, denoted w ⊏ x, if x = wy for some string y ∈ Σ∗.
Suffix
A string w is a suffix of a string x, denoted w ⊐ x, if x = yw for some string y ∈ Σ∗.
Properties
Prefix: if w ⊏ x, then |w| ≤ |x|.
Suffix: if w ⊐ x, then |w| ≤ |x|.
The empty string ε is both a suffix and a prefix of every string.
8 / 40
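As a small illustration, the prefix and suffix relations can be checked directly in Python. This is only a sketch: the helper names is_prefix and is_suffix are illustrative and do not come from the slides.

```python
def is_prefix(w: str, x: str) -> bool:
    """Return True if w is a prefix of x, i.e. x = w + y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """Return True if w is a suffix of x, i.e. x = y + w for some string y."""
    return x.endswith(w)


print(is_prefix("ab", "abcca"))   # True
print(is_suffix("cca", "abcca"))  # True
# The empty string is both a prefix and a suffix of every string.
print(is_prefix("", "abcca"), is_suffix("", "abcca"))  # True True
```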
Notation and Terminology
In addition
For any strings w and x and any character a, we have w ⊏ x if and
only if aw ⊏ ax.
In addition, ⊏ and ⊐ are transitive relations.
9 / 40
The naive string algorithm
Algorithm
NAIVE-STRING-MATCHER(T, P)
    n = T.length
    m = P.length
    for s = 0 to n − m
        if P[1..m] == T[s + 1..s + m]
            print “Pattern occurs with shift” s
Complexity
O((n − m + 1)m), which is Θ(n²) if m = ⌊n/2⌋
11 / 40
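The pseudocode above can be sketched in Python as follows. The function name and the choice to return a list of shifts instead of printing them are assumptions made for illustration. Python strings are 0-indexed, so the slice text[s:s + m] plays the role of T[s + 1..s + m] and the shift values s are unchanged.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # s = 0 .. n - m
        if text[s:s + m] == pattern:    # compares up to m characters
            shifts.append(s)
    return shifts


# The text and pattern from the introductory slide.
print(naive_string_matcher("abcabaabcabacab", "abaa"))  # [3]
```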
A more elaborate algorithm
Rabin-Karp algorithm
Let us assume that Σ = {0, 1, ..., 9}.
Then each string of k consecutive characters can be viewed as a k-digit decimal
number:
c_1 c_2 ⋯ c_{k−1} c_k = 10^{k−1} c_1 + 10^{k−2} c_2 + … + 10 c_{k−1} + c_k
Thus
p denotes the decimal value of the pattern P[1..m].
t_s denotes the decimal value of the m-length substring T[s + 1..s + m] for
s = 0, 1, ..., n − m.
13 / 40
Then
Properties
Clearly t_s == p if and only if T[s + 1..s + m] == P[1..m]; thus s is a
valid shift if and only if t_s == p.
14 / 40
Now, think about this
What if every value fits in a single machine word?
Suppose we can compute p in Θ(m) time.
Suppose we can compute all the t_s values in Θ(n − m + 1) time.
Then we can find all valid shifts in time
Θ(m) + Θ(n − m + 1) = Θ(n)
15 / 40
Remember Horner’s Rule
Consider
We can compute the decimal value of the pattern with Horner’s rule:
p = P[m] + 10(P[m − 1] + 10(P[m − 2] + ⋯ + 10(P[2] + 10 P[1]) ⋯)) (1)
Thus, we can compute p, and similarly t_0 from T[1..m], using this rule in
Θ(m) (2)
17 / 40
Example
If you have the following digits
2 4 0 1
Using Horner’s rule for m = 4
2401 = 1 + 10 × (0 + 10 × (4 + 10 × 2))
     = 2 × 10^3 + 4 × 10^2 + 0 × 10 + 1
18 / 40
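Horner’s rule as used above can be sketched in a few lines of Python. The function name horner and the radix parameter d are illustrative assumptions; digits are given most significant first, matching P[1], P[2], ..., P[m].

```python
def horner(digits: list[int], d: int = 10) -> int:
    """Evaluate the radix-d number whose digits are listed most significant first."""
    value = 0
    for digit in digits:
        value = d * value + digit   # one multiplication and one addition per digit
    return value


print(horner([2, 4, 0, 1]))  # 2401
```

The loop performs m multiplications and m additions, which is where the Θ(m) bound comes from.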
Then
To compute the remaining values, we can use the previous value
t_{s+1} = 10 (t_s − 10^{m−1} T[s + 1]) + T[s + m + 1]. (3)
Notice the following
Subtracting 10^{m−1} T[s + 1] removes the high-order digit from t_s.
Multiplying the result by 10 shifts the number left by one digit
position.
Adding T[s + m + 1] brings in the appropriate low-order digit.
19 / 40
Example
Imagine that you have...
index 1 2 3 4 5 6
digit 1 0 2 4 1 0
1 We have t_0 = 1024 and we want to compute t_1 = 0241
2 We subtract 10^3 × T[1] == 1000 from 1024
3 We get 0024
4 Multiply by 10, and we get 0240
5 We add T[5] == 1
6 Finally, we get 0241
20 / 40
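The steps above can be sketched in Python, without any modulus yet. The function name window_values is a hypothetical helper; the list is 0-indexed, so digits[s] corresponds to T[s + 1] and digits[s + m] to T[s + m + 1].

```python
def window_values(digits: list[int], m: int) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for decimal digits using the rolling update."""
    n = len(digits)
    if m == 0 or m > n:
        return []
    high = 10 ** (m - 1)            # weight of the high-order digit
    t = 0
    for digit in digits[:m]:        # t_0 by Horner's rule
        t = 10 * t + digit
    values = [t]
    for s in range(n - m):
        # Drop the high-order digit, shift left, bring in the next digit.
        t = 10 * (t - high * digits[s]) + digits[s + m]
        values.append(t)
    return values


print(window_values([1, 0, 2, 4, 1, 0], 4))  # [1024, 241, 2410]
```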
Remarks
First
We can extend this beyond decimal digits to any radix-d number system.
What happens when the numbers are quite large?
We can use modular arithmetic.
Meaning
Compute p and the t_s values modulo a suitable modulus q.
21 / 40
Remember Hash Functions
Yes, we are mapping the large numbers into the set
{0, 1, 2, ..., q − 1}
23 / 40
Then, to reduce the size of the representation
Use the modulus q
It is possible to compute p mod q in Θ (m) time and all the t_s mod q values in
Θ (n − m + 1) time.
Something Notable
If we choose the modulus q as a prime such that 10q just fits within one
computer word, then we can perform all the necessary computations with
single-precision arithmetic.
24 / 40
Why 10q?
After all
Values of size up to about 10q arise when we subtract and multiply in the update step, before reducing modulo q again.
We use “truncated division” to implement the modulo operation
For example, given two numbers a and n, we can do the following (here q denotes the quotient, not the modulus)
q = trunc(a / n)
Then
r = a − nq
25 / 40
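A minimal Python sketch of the truncated-division remainder follows, using the slide’s names a, n, q and r. The function name truncated_remainder is an assumption. The comparison with Python’s built-in % operator, which rounds the quotient toward negative infinity instead, matters because the subtraction in the rolling update can produce a negative intermediate value.

```python
def truncated_remainder(a: int, n: int) -> int:
    """Return r = a - n*q, where q = trunc(a / n) is rounded toward zero."""
    q = abs(a) // abs(n)            # magnitude of the quotient
    if (a < 0) != (n < 0):          # the quotient is negative when the signs differ
        q = -q
    return a - n * q


print(truncated_remainder(14, 11))   # 3
print(truncated_remainder(-7, 11))   # -7
print(-7 % 11)                       # 4, because Python's % uses a floored quotient
```

With truncated division, a negative remainder has to be brought back into {0, ..., q − 1} by adding the modulus once.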
Why 10q?
Thus
If q is a prime, we can use 10q as the divisor in the truncated division and
obtain the remainder of a value a ≥ 0 by repeatedly subtracting 10q.
Truncated Algorithm
TRUNCATED-MODULE(a, q)
    r = a
    while r ≥ 10q
        r = r − 10q
    return r
26 / 40
Then
Thus
We can then implement this with basic CPU arithmetic operations.
Not only that
In general, for a d-ary alphabet, we choose q such that dq fits within a
computer word.
27 / 40
In general, we have
The following
t_{s+1} = (d(t_s − T[s + 1] h) + T[s + m + 1]) mod q (4)
where h ≡ d^{m−1} (mod q) is the value of the digit “1” in the high-order
position of an m-digit text window.
Here
We have a small problem.
Question
If p mod q and t_s mod q are equal, can we conclude that p == t_s?
28 / 40
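Recurrence (4) can be sketched in Python as below. The function name rolling_residues, the default values d = 10 and q = 13, and the sample digit string are assumptions chosen only to keep the numbers small. Python’s % always returns a value in {0, ..., q − 1} for positive q, so the possibly negative difference needs no extra correction here.

```python
def rolling_residues(digits: list[int], m: int, d: int = 10, q: int = 13) -> list[int]:
    """Return t_s mod q for s = 0 .. n-m using recurrence (4)."""
    n = len(digits)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)            # h = d^(m-1) mod q
    t = 0
    for digit in digits[:m]:        # t_0 mod q by Horner's rule
        t = (d * t + digit) % q
    residues = [t]
    for s in range(n - m):
        t = (d * (t - digits[s] * h) + digits[s + m]) % q
        residues.append(t)
    return residues


text = [int(c) for c in "2359023141526739921"]
print(rolling_residues(text, 5))
# [8, 9, 3, 11, 0, 1, 7, 8, 4, 5, 10, 11, 7, 9, 11]
```

Each update costs a constant number of arithmetic operations, so all n − m + 1 residues are obtained in Θ(n − m + 1) time after the Θ(m) setup.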
What?
Look at this with q = 11
14 mod 11 == 3
25 mod 11 == 3
But 14 ≠ 25
However
11 mod 11 == 0
25 mod 11 == 3
So we can say that 11 ≠ 25.
This means
We can use the modulus to tell numbers apart, but not to say with certainty
that they are equal.
29 / 40
Thus
We have the following logic
t_s ≡ p (mod q) does not imply that t_s == p.
If t_s ≢ p (mod q), then we know that t_s ≠ p.
Fixing the problem
To fix this problem, we explicitly check whether each hit is a valid shift or a spurious hit.
Note
If q is large enough, then we can hope that spurious hits occur infrequently
enough that the cost of the extra checking is low.
30 / 40
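To make the distinction concrete, the following Python sketch separates valid shifts from spurious hits on a small decimal example. It deliberately recomputes every residue directly instead of rolling, since the point is only the classification. The function name classify_hits, the sample text and q = 13 are illustrative assumptions.

```python
def classify_hits(text: str, pattern: str, q: int) -> tuple[list[int], list[int]]:
    """Split the shifts with t_s ≡ p (mod q) into valid shifts and spurious hits."""
    m = len(pattern)
    p = int(pattern) % q
    valid, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p:                 # residues agree: a hit
            if window == pattern:                # explicit check of the hit
                valid.append(s)
            else:
                spurious.append(s)
    return valid, spurious


print(classify_hits("2359023141526739921", "31415", 13))  # ([6], [12])
```

Here both 31415 and 67399 leave remainder 7 modulo 13, so shift 12 is a hit that the explicit comparison rejects.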
The Final Algorithm
RABIN-KARP-MATCHER(T, P, d, q)
1  n = T.length
2  m = P.length
3  h = d^{m−1} mod q // Store the remainder of the highest power
4  p = 0
5  t_0 = 0
6  for i = 1 to m // Preprocessing
7      p = (d p + P[i]) mod q
8      t_0 = (d t_0 + T[i]) mod q
9  for s = 0 to n − m
10     if p == t_s
11         if P[1..m] == T[s + 1..s + m] // Actually a loop
12             print “Pattern occurs with shift” s
13     if s < n − m
14         t_{s+1} = (d(t_s − T[s + 1] h) + T[s + m + 1]) mod q
32 / 40
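A Python sketch of the matcher follows, under a few assumptions that are not in the slides: character codes from ord() serve as the digit values, the defaults d = 256 and q = 1_000_003 are arbitrary illustrative choices, and the function returns the list of 0-based shifts instead of printing them. Because every hit is verified explicitly, the result is correct for any positive q; q only affects how many spurious hits have to be checked.

```python
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 1_000_003) -> list[int]:
    """Return all valid shifts of pattern in text using a rolling hash modulo q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                      # d^(m-1) mod q
    p = t = 0
    for i in range(m):                        # preprocessing, Θ(m)
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Compare residues first; verify the window only on a hit.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:                         # roll the window one position
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("2359023141526739921", "31415", d=10, q=13))  # [6]
print(rabin_karp("abcabaabcabacab", "abaa"))                   # [3]
```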
Complexity
Preprocessing
1 Computing p mod q takes Θ(m) time.
2 Computing all the t_s mod q values takes Θ(n − m + 1) time.
Then checking P[1..m] == T[s + 1..s + m]
In the worst case, when every shift is a hit, this takes Θ((n − m + 1)m) time.
33 / 40
Still we can do better!
First
The expected number of spurious hits is O (n/q).
Because
We can estimate the chance that an arbitrary t_s will be equivalent to p
modulo q as 1/q.
Properties
There are O (n) positions at which the test of line 10 fails, so we expect
O (n/q) spurious hits, and we spend O (m) time verifying each hit.
34 / 40
Finally, we have that
The expected matching time
The expected matching time of the Rabin-Karp algorithm is
O(n) + O(m(v + n/q))
where v is the number of valid shifts.
In addition
Suppose v = O(1) (the number of valid shifts is small) and we choose q ≥ m such that
n/q = O(1) (q sufficiently larger than the pattern’s length).
Then
The algorithm takes O(m + n) expected time to find the matches.
Finally, because m ≤ n, the expected time is O(n).
35 / 40
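The step from the general bound to O(n) can be spelled out as a short derivation. As a side remark that goes slightly beyond the slide, the condition q ≥ m alone already suffices, since then m/q ≤ 1 and the spurious-hit term is at most n.

```latex
\begin{align*}
O(n) + O\!\left(m\left(v + \frac{n}{q}\right)\right)
  &= O(n) + O(mv) + O\!\left(\frac{mn}{q}\right) \\
  &= O(n) + O(m) + O(n)
     && \text{since } v = O(1) \text{ and } \frac{m}{q} \le 1 \\
  &= O(n + m) = O(n)
     && \text{since } m \le n .
\end{align*}
```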
We have other methods
We have the following ones
Algorithm            Preprocessing time   Worst-case matching time
Rabin-Karp           Θ(m)                 O((n − m + 1)m)
Finite automaton     O(m|Σ|)              Θ(n)
Knuth-Morris-Pratt   Θ(m)                 Θ(n)
37 / 40
Remarks about Knuth-Morris-Pratt
The Algorithm
It is quite an elegant algorithm that improves on the finite-automaton matcher.
How?
It avoids computing the automaton’s transition function by using
the prefix function π.
π encapsulates information about how the pattern matches against shifts of
itself.
38 / 40
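As a sketch of what the prefix function contains, the following Python function computes π for a pattern. The function name prefix_function and the 0-based list layout are assumptions: pi[i] holds the length of the longest proper prefix of pattern[:i + 1] that is also a suffix of it.

```python
def prefix_function(pattern: str) -> list[int]:
    """Return pi, where pi[i] is the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    pi = [0] * len(pattern)
    k = 0                                   # length of the current matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[i]:
            k = pi[k - 1]                   # fall back to a shorter border
        if pattern[k] == pattern[i]:
            k += 1
        pi[i] = k
    return pi


print(prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

The table is built from the pattern alone in Θ(m) time, which matches the preprocessing cost listed for Knuth-Morris-Pratt.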
However, at the same time (circa 1977)
Boyer–Moore string search algorithm
It was presented at about the same time as Knuth-Morris-Pratt.
It is used in grep for pattern matching in UNIX.
It is the standard algorithm to beat when doing research in this area.
Richard Cole (circa 1991)
He proved an upper bound of 3n comparisons in the worst case for the
algorithm, where n is the length of the text.
39 / 40
Exercises
32.1-1
32.1-2
32.1-4
32.2-1
32.2-2
32.2-3
32.2-4
40 / 40

5 parametric equations, tangents and curve lengths in polar coordinatesmath267
 
Quadrature
QuadratureQuadrature
Quadrature
Linh Tran
 
Aho corasick-lecture
Aho corasick-lectureAho corasick-lecture
Aho corasick-lecture
PekkaKilpelinen2
 
DSP_FOEHU - MATLAB 02 - The Discrete-time Fourier Analysis
DSP_FOEHU - MATLAB 02 - The Discrete-time Fourier AnalysisDSP_FOEHU - MATLAB 02 - The Discrete-time Fourier Analysis
DSP_FOEHU - MATLAB 02 - The Discrete-time Fourier Analysis
Amr E. Mohamed
 
Fourier Series for Continuous Time & Discrete Time Signals
Fourier Series for Continuous Time & Discrete Time SignalsFourier Series for Continuous Time & Discrete Time Signals
Fourier Series for Continuous Time & Discrete Time Signals
Jayanshu Gundaniya
 
Pattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatchesPattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatches
Benjamin Sach
 
Algorithms on Strings
Algorithms on StringsAlgorithms on Strings
Algorithms on Strings
Michael Soltys
 
Zero. Probabilystic Foundation of Theoretyical Physics
Zero. Probabilystic Foundation of Theoretyical PhysicsZero. Probabilystic Foundation of Theoretyical Physics
Zero. Probabilystic Foundation of Theoretyical Physics
Gunn Quznetsov
 
Hastings 1970
Hastings 1970Hastings 1970
Hastings 1970
Julyan Arbel
 

Similar to 25 String Matching (20)

String-Matching Algorithms Advance algorithm
String-Matching  Algorithms Advance algorithmString-Matching  Algorithms Advance algorithm
String-Matching Algorithms Advance algorithm
 
Kernel Lower Bounds
Kernel Lower BoundsKernel Lower Bounds
Kernel Lower Bounds
 
Naive string matching algorithm
Naive string matching algorithmNaive string matching algorithm
Naive string matching algorithm
 
Automata
AutomataAutomata
Automata
 
Automata
AutomataAutomata
Automata
 
Ch05 1
Ch05 1Ch05 1
Ch05 1
 
Convergence
ConvergenceConvergence
Convergence
 
Programmable PN Sequence Generators
Programmable PN Sequence GeneratorsProgrammable PN Sequence Generators
Programmable PN Sequence Generators
 
Chapter 3 REGULAR EXPRESSION.pdf
Chapter 3 REGULAR EXPRESSION.pdfChapter 3 REGULAR EXPRESSION.pdf
Chapter 3 REGULAR EXPRESSION.pdf
 
Nt lecture skm-iiit-bh
Nt lecture skm-iiit-bhNt lecture skm-iiit-bh
Nt lecture skm-iiit-bh
 
5 parametric equations, tangents and curve lengths in polar coordinates
5 parametric equations, tangents and curve lengths in polar coordinates5 parametric equations, tangents and curve lengths in polar coordinates
5 parametric equations, tangents and curve lengths in polar coordinates
 
Quadrature
QuadratureQuadrature
Quadrature
 
Reductions
ReductionsReductions
Reductions
 
Aho corasick-lecture
Aho corasick-lectureAho corasick-lecture
Aho corasick-lecture
 
DSP_FOEHU - MATLAB 02 - The Discrete-time Fourier Analysis
DSP_FOEHU - MATLAB 02 - The Discrete-time Fourier AnalysisDSP_FOEHU - MATLAB 02 - The Discrete-time Fourier Analysis
DSP_FOEHU - MATLAB 02 - The Discrete-time Fourier Analysis
 
Fourier Series for Continuous Time & Discrete Time Signals
Fourier Series for Continuous Time & Discrete Time SignalsFourier Series for Continuous Time & Discrete Time Signals
Fourier Series for Continuous Time & Discrete Time Signals
 
Pattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatchesPattern Matching Part Two: k-mismatches
Pattern Matching Part Two: k-mismatches
 
Algorithms on Strings
Algorithms on StringsAlgorithms on Strings
Algorithms on Strings
 
Zero. Probabilystic Foundation of Theoretyical Physics
Zero. Probabilystic Foundation of Theoretyical PhysicsZero. Probabilystic Foundation of Theoretyical Physics
Zero. Probabilystic Foundation of Theoretyical Physics
 
Hastings 1970
Hastings 1970Hastings 1970
Hastings 1970
 

More from Andres Mendez-Vazquez

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
Andres Mendez-Vazquez
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
Andres Mendez-Vazquez
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
Andres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
Andres Mendez-Vazquez
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
Andres Mendez-Vazquez
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
Andres Mendez-Vazquez
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
Andres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
Andres Mendez-Vazquez
 
Zetta global
Zetta globalZetta global
Zetta global
Andres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
Andres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
Andres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
Andres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
Andres Mendez-Vazquez
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
Andres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
Andres Mendez-Vazquez
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
Andres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
Andres Mendez-Vazquez
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
Andres Mendez-Vazquez
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
Andres Mendez-Vazquez
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
Andres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Recently uploaded

Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Soumen Santra
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
SUTEJAS
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
aqil azizi
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdfThe Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
Nettur Technical Training Foundation
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 

Recently uploaded (20)

Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTSHeap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
Understanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine LearningUnderstanding Inductive Bias in Machine Learning
Understanding Inductive Bias in Machine Learning
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
 
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdfTutorial for 16S rRNA Gene Analysis with QIIME2.pdf
Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdfThe Role of Electrical and Electronics Engineers in IOT Technology.pdf
The Role of Electrical and Electronics Engineers in IOT Technology.pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 

25 String Matching

  • 1. Analysis of Algorithms: String Matching. Andres Mendez-Vazquez, November 24, 2015.
  • 2. Outline: (1) String Matching: Introduction; Some Notation. (2) Naive Algorithm: Using Brute Force. (3) The Rabin-Karp Algorithm: Efficiency After All; Horner’s Rule; Generating Possible Matches; The Final Algorithm; Other Methods. (4) Exercises.
  • 3. What is string matching? Given two sequences of characters drawn from a finite alphabet Σ, a text T[1..n] and a pattern P[1..m], find where P occurs in T. Example: text T = abcabaabcabac, pattern P = abaa; s = 3 is a valid shift, since T[4..7] = abaa.
  • 4. Where... A valid shift: P occurs with valid shift s in T if 0 ≤ s ≤ n − m and T[s + 1..s + m] = P[1..m]. Otherwise, s is an invalid shift. Thus, the string-matching problem is the problem of finding all valid shifts of a given pattern P in a text T.
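The definition of a valid shift can be checked directly. The following is a minimal Python sketch, not part of the original slides; the function name `is_valid_shift` is hypothetical, and Python's 0-based slicing stands in for the slides' 1-based indexing.

```python
def is_valid_shift(text: str, pattern: str, s: int) -> bool:
    """Return True if pattern occurs in text with shift s.

    The slides index strings from 1, so T[s+1..s+m] there corresponds
    to the Python slice text[s:s+m] here (an assumption of this sketch).
    """
    n, m = len(text), len(pattern)
    # A shift is only meaningful when 0 <= s <= n - m.
    if not 0 <= s <= n - m:
        return False
    return text[s:s + m] == pattern


# Strings from the slide-3 example.
T = "abcabaabcabac"
P = "abaa"
print(is_valid_shift(T, P, 3))  # True
print(is_valid_shift(T, P, 0))  # False
```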
  • 7. Possible Algorithms. We have the following ones:
      Algorithm            Preprocessing time   Worst-case matching time
      Naive                0                    O((n − m + 1)m)
      Rabin-Karp           Θ(m)                 O((n − m + 1)m)
      Finite automaton     O(m|Σ|)              Θ(n)
      Knuth-Morris-Pratt   Θ(m)                 Θ(n)
  • 9. Notation and Terminology. Definition: we denote by Σ∗ (read “sigma-star”) the set of all finite-length strings formed using characters from the alphabet Σ. Constraint: we consider only finite-length strings. Some basic concepts: the zero-length empty string, ε, also belongs to Σ∗; the length of a string x is denoted |x|; the concatenation of two strings x and y, denoted xy, has length |x| + |y|.
  • 14. Notation and Terminology. Prefix: a string w is a prefix of x, denoted w ⊏ x, if x = wy for some string y ∈ Σ∗. Suffix: a string w is a suffix of x, denoted w ⊐ x, if x = yw for some string y ∈ Σ∗. Properties: if w ⊏ x, then |w| ≤ |x|; if w ⊐ x, then |w| ≤ |x|. The empty string ε is both a suffix and a prefix of every string.
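As an illustration of the prefix and suffix relations, here is a small Python sketch. The helper names `is_prefix` and `is_suffix` are hypothetical; they simply wrap the built-in `str.startswith` and `str.endswith` methods.

```python
def is_prefix(w: str, x: str) -> bool:
    """w is a prefix of x: x == w + y for some string y."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """w is a suffix of x: x == y + w for some string y."""
    return x.endswith(w)


print(is_prefix("ab", "abcca"))   # True
print(is_suffix("cca", "abcca"))  # True
# The empty string is both a prefix and a suffix of every string.
print(is_prefix("", "abcca"), is_suffix("", "abcca"))  # True True
```

In both cases a true result implies `len(w) <= len(x)`, which matches the length property stated on the slide.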
  • 19. Notation and Terminology. In addition: for any strings w and x and any character a, we have w ⊏ x if and only if aw ⊏ ax. Moreover, ⊏ and ⊐ are transitive relations.
  • 22. The naive string algorithm. Algorithm:
      NAIVE-STRING-MATCHER(T, P)
        n = T.length
        m = P.length
        for s = 0 to n − m
          if P[1..m] == T[s + 1..s + m]
            print “Pattern occurs with shift” s
    Complexity: O((n − m + 1)m), which is Θ(n²) if m = ⌊n/2⌋.
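The NAIVE-STRING-MATCHER pseudocode translates almost line for line into Python. This is a sketch under two assumptions that are not in the slides: strings are 0-indexed, and the valid shifts are returned as a list instead of being printed.

```python
def naive_string_matcher(text: str, pattern: str) -> list:
    """Return every valid shift s with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    shifts = []
    # s ranges over 0..n-m inclusive; the range is empty when m > n.
    for s in range(n - m + 1):
        # Comparing the two m-character strings costs O(m) in the worst case.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabac", "abaa"))  # [3]
print(naive_string_matcher("aaaa", "aa"))             # [0, 1, 2]
```

The loop runs n − m + 1 times and each comparison may inspect up to m characters, which gives the O((n − m + 1)m) bound.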
  • 25. A more elaborate algorithm: the Rabin-Karp algorithm. Let us assume that Σ = {0, 1, ..., 9}. Then each string of k consecutive characters can be read as a k-digit decimal number: $c_1 c_2 \cdots c_{k-1} c_k = 10^{k-1} c_1 + 10^{k-2} c_2 + \cdots + 10\, c_{k-1} + c_k$. Thus, p denotes the decimal value of the pattern P[1..m], and $t_s$ denotes the decimal value of the m-length substring T[s + 1..s + m] for s = 0, 1, ..., n − m.
  • 29. Then. Properties: clearly, $t_s = p$ if and only if T[s + 1..s + m] = P[1..m]; that is, exactly when s is a valid shift.
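The equivalence between equal decimal values and matching substrings can be tried out directly. The sketch below is illustrative only: the name `decimal_shifts` is hypothetical, the inputs are assumed to be non-empty strings of decimal digits, and Python's `int` is used to obtain each value (so every window is recomputed from scratch, with no attempt at efficiency).

```python
def decimal_shifts(text_digits: str, pattern_digits: str) -> list:
    """Return the shifts s whose window value t_s equals the pattern value p.

    Assumes both arguments are non-empty strings over the digits 0-9.
    """
    m = len(pattern_digits)
    p = int(pattern_digits)
    shifts = []
    for s in range(len(text_digits) - m + 1):
        t_s = int(text_digits[s:s + m])  # decimal value of the window
        # Windows and pattern all have length m, so equal values
        # imply equal digit strings (leading zeros included).
        if t_s == p:
            shifts.append(s)
    return shifts


print(decimal_shifts("102410", "0241"))  # [1]
```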
  • 30. Now, think about this. What if each value fits in a single machine word? If we can compute p in Θ(m) time, and all the values $t_s$ in a total of Θ(n − m + 1) time, then we can find all valid shifts in Θ(m) + Θ(n − m + 1) = Θ(n) time.
  • 34. Remember Horner’s Rule. We can compute the decimal value of the pattern using Horner’s rule: $p = P[m] + 10\,(P[m-1] + 10\,(P[m-2] + \cdots + 10\,(P[2] + 10\,P[1]) \cdots))$ (1). Thus, we can also compute $t_0$ from T[1..m] using this rule in Θ(m) time (2).
  • 36. Example. Suppose you have the following sequence of digits: 2 4 0 1. Using Horner’s rule for m = 4: $2401 = 1 + 10 \times (0 + 10 \times (4 + 10 \times 2)) = 2 \times 10^3 + 4 \times 10^2 + 0 \times 10 + 1$.
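Horner's rule can be sketched in a few lines of Python. The function name `horner_value` is hypothetical, and the input is assumed to be a list of integer digits with the most significant digit first.

```python
def horner_value(digits: list) -> int:
    """Evaluate a digit sequence as a decimal number with Horner's rule.

    Uses one multiplication and one addition per digit, so Theta(m) overall.
    """
    value = 0
    for d in digits:
        # Shift the digits accumulated so far one place left, then add d.
        value = 10 * value + d
    return value


print(horner_value([2, 4, 0, 1]))  # 2401
```

For the digits 2 4 0 1 the accumulator takes the values 2, 24, 240, 2401, which is the nested expression from the example evaluated from the innermost parenthesis outward.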
  • 38. Then. To compute the remaining values, we can use the previous value: $t_{s+1} = 10\,(t_s - 10^{m-1}\,T[s+1]) + T[s+m+1]$ (3). Notice the following: subtracting $10^{m-1}\,T[s+1]$ from $t_s$ removes the high-order digit; multiplying the result by 10 shifts the number left by one digit position; adding T[s + m + 1] brings in the appropriate low-order digit.
  • 41. Example. Imagine that you have the text
      index: 1 2 3 4 5 6
      digit: 1 0 2 4 1 0
      1. We have t_0 = 1024 and want to calculate t_1 = 0241.
      2. We subtract 10^3 × T[1] = 1000 from 1024.
      3. We get 0024.
      4. We multiply by 10 and get 0240.
      5. We add T[5] = 1.
      6. Finally, we get 0241.
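The update step from equation (3) can be checked with a small Python sketch. The helper name `roll` and its parameters are illustrative assumptions. The list `T` is 0-indexed here, so `T[0]` plays the role of T[1] on the slide.

```python
def roll(t_s, outgoing, incoming, m, base=10):
    """Return t_{s+1} from t_s: drop the high-order digit, shift, append a new digit."""
    high = base ** (m - 1)  # value of a "1" in the high-order position
    return base * (t_s - high * outgoing) + incoming


T = [1, 0, 2, 4, 1, 0]
m = 4
t0 = 1024
t1 = roll(t0, outgoing=T[0], incoming=T[4], m=m)
t2 = roll(t1, outgoing=T[1], incoming=T[5], m=m)
print(t1, t2)  # 241 2410
```

Each call does a constant number of arithmetic operations, so all the remaining windows cost Θ(n − m) in total.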
  • 48. Remarks. First, we can extend this beyond decimal digits to any radix-d number system. What happens when the numbers are quite large? We can use modular arithmetic: compute the p and t_s values modulo a suitable modulus q.
  • 52. Remember hash functions? Yes, we are mapping the large numbers into the set {0, 1, 2, ..., q − 1}.
  • 53. Then, to reduce the size of the representation, work modulo q. It is possible to compute p mod q in Θ(m) time and all the t_s mod q in Θ(n − m + 1) time. Something notable: if we choose the modulus q as a prime such that 10q just fits within one computer word, then we can perform all the necessary computations with single-precision arithmetic.
  • 55. Why 10q? After all, values of nearly 10q appear whenever a residue smaller than q is multiplied by 10, before it is reduced again. We use “truncated division” to implement the modulo operation. For example, given two numbers a and n, we compute the quotient q = trunc(a / n) and then the remainder r = a − n q (here q names the quotient, not the modulus).
  • 58. Why 10q? Thus, because every non-negative intermediate value a stays below 10q, we can reduce it modulo q by repeated subtraction in fewer than 10 steps.
      Truncated-Module(a, q)
        r = a
        while r ≥ q
          r = r − q
        return r
  • 60. Then we can implement this with basic arithmetic CPU operations. Not only that: in general, for a d-ary alphabet, we choose q such that dq fits within a computer word.
  • 62. In general, we have the following: t_{s+1} = (d (t_s − T[s + 1] h) + T[s + m + 1]) mod q (4), where h ≡ d^{m−1} (mod q) is the value of the digit “1” in the high-order position of an m-digit text window. Here we have a small problem! Question: can we tell whether p == t_s by comparing only p mod q and t_s mod q?
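Equation (4) can be sketched as follows. The function name `rolling_residues`, the sample digit string and the choice q = 13 are illustrative assumptions, not taken from the slides. Indices are 0-based.

```python
def rolling_residues(digits, m, d=10, q=13):
    """Return t_s mod q for every window of length m, using the rolling update (4)."""
    h = pow(d, m - 1, q)  # value of a high-order "1", already reduced modulo q
    t = 0
    for i in range(m):  # Horner's rule modulo q for the first window
        t = (d * t + digits[i]) % q
    residues = [t]
    for s in range(len(digits) - m):
        # Remove the outgoing digit, shift, and bring in the incoming digit.
        t = (d * (t - digits[s] * h) + digits[s + m]) % q
        residues.append(t)
    return residues


digits = [int(c) for c in "2359023141526739921"]
print(rolling_residues(digits, m=5))
```

Python's `%` returns a non-negative result for a positive modulus, so the subtraction needs no extra correction here. In a language where the remainder can be negative, q would have to be added back before reducing.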
  • 65. What? Look at this with q = 11: 14 mod 11 == 3 and 25 mod 11 == 3, but 14 ≠ 25. However, 11 mod 11 == 0 and 25 mod 11 == 3, so we can say that 11 ≠ 25! This means we can use the modulus to tell numbers apart, but not to say with certainty that they are equal.
  • 72. Thus, we have the following logic: t_s ≡ p (mod q) does not mean that t_s == p, but if t_s ≢ p (mod q), then t_s ≠ p. Fixing the problem: whenever the residues agree, we simply check explicitly whether the hit is spurious by comparing P[1..m] with T[s + 1..s + m]. Note: if q is large enough, then we can hope that spurious hits occur infrequently enough that the cost of the extra checking is low.
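To see a spurious hit, the sketch below compares residues and then verifies each hit explicitly. The function name `classify_hits`, the sample text and pattern, and q = 13 are assumptions chosen for illustration. For brevity it hashes every window directly instead of rolling.

```python
def classify_hits(text, pattern, q):
    """Split the shifts whose residue matches into valid hits and spurious hits."""
    m = len(pattern)
    p = int(pattern) % q
    valid, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p:  # residues agree: a hit
            if window == pattern:  # explicit check decides whether it is real
                valid.append(s)
            else:
                spurious.append(s)
    return valid, spurious


valid, spurious = classify_hits("2359023141526739921", "31415", q=13)
print(valid, spurious)  # [6] [12]
```

Shift 12 holds the window 67399, which has the same residue modulo 13 as 31415 even though the two numbers differ.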
  • 77. The Final Algorithm
      Rabin-Karp-Matcher(T, P, d, q)
       1  n = T.length
       2  m = P.length
       3  h = d^{m−1} mod q   // Storing the remainder of the highest power
       4  p = 0
       5  t_0 = 0
       6  for i = 1 to m   // Preprocessing
       7      p = (d p + P[i]) mod q
       8      t_0 = (d t_0 + T[i]) mod q
       9  for s = 0 to n − m
      10      if p == t_s
      11          if P[1..m] == T[s + 1..s + m]   // Actually a loop
      12              print “Pattern occurs with shift” s
      13      if s < n − m
      14          t_{s+1} = (d (t_s − T[s + 1] h) + T[s + m + 1]) mod q
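A minimal Python sketch of the matcher follows, under a few assumptions that are not in the slides: characters are mapped to digits with `ord`, the defaults d = 256 and q = 1_000_003 are arbitrary choices, and shifts are returned as a 0-based list instead of being printed.

```python
def rabin_karp(text, pattern, d=256, q=1_000_003):
    """Return every shift s (0-based) with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # d^(m-1) mod q
    p = t = 0
    for i in range(m):  # preprocessing with Horner's rule modulo q
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Compare residues first; verify character by character only on a hit.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:  # roll the window one position to the right
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("abcabaabcabac", "abaa"))  # [3]
```

Python integers do not overflow, so the requirement that dq fit in a machine word matters only when porting this to fixed-width arithmetic. The explicit comparison keeps the result correct for any q, even a very small one that produces many spurious hits.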
  • 81. Complexity. Computing the hash values: 1. p mod q (together with t_0 mod q) is computed in Θ(m). 2. All the t_s mod q are computed in Θ(n − m + 1). Then comes checking P[1..m] == T[s + 1..s + m], which in the worst case, when every shift is a hit, costs Θ((n − m + 1) m).
  • 83. Still, we can do better! First, the expected number of spurious hits is O(n/q), because we can estimate the chance that an arbitrary t_s is equivalent to p modulo q as 1/q. Properties: there are O(n) positions at which the test of line 10 fails, each costing O(1), and we spend O(m) time on each hit, whether it is valid or one of the O(n/q) spurious ones.
  • 86. Finally, the expected matching time of the Rabin-Karp algorithm is O(n) + O(m (v + n/q)), where v is the number of valid shifts. In addition, if v = O(1) (the number of valid shifts is small) and we choose q ≥ m (q at least as large as the pattern’s length), then m · n/q ≤ n and the algorithm takes O(m + n) to find the matches. Finally, because m ≤ n, the expected time is O(n).
  • 92. We have other methods:
      Algorithm            Preprocessing Time   Worst-Case Matching Time
      Rabin-Karp           Θ(m)                 O((n − m + 1) m)
      Finite Automaton     O(m |Σ|)             Θ(n)
      Knuth-Morris-Pratt   Θ(m)                 Θ(n)
  • 93. Remarks about Knuth-Morris-Pratt. The algorithm: it is quite an elegant algorithm that improves on the finite automaton matcher. How? It avoids computing the automaton’s transition function by using the prefix function π, which encapsulates information about how the pattern matches against shifts of itself.
  • 95. However, at the same time (circa 1977) the Boyer–Moore string search algorithm was presented. It is used in the grep utility for pattern matching in UNIX, and it is the standard algorithm to beat when doing research in this area. Richard Cole (circa 1991) gave a proof of an upper bound of 3n comparisons in the worst case, where n is the text length.