The document describes the Knuth-Morris-Pratt (KMP) string matching algorithm. KMP finds all occurrences of a pattern string P in a text string T. It improves on the naive algorithm by not re-checking characters when a mismatch occurs. This is done by precomputing a function h that determines how many characters P can skip ahead while still maintaining the matching prefix. With h, KMP ensures each character is checked at most twice, giving it O(m+n) time complexity where m and n are the lengths of P and T.
Algorithms Discussed
Knuth–Morris–Pratt algorithm
Boyer–Moore string search algorithm
Bitap algorithm (for exact string searching)
-------------------
Checking whether two or more strings are same or not.
Finding a string (pattern) into another string (text). --> Looking for substring
Algorithms Discussed
Knuth–Morris–Pratt algorithm
Boyer–Moore string search algorithm
Bitap algorithm (for exact string searching)
-------------------
Checking whether two or more strings are same or not.
Finding a string (pattern) into another string (text). --> Looking for substring
Tree and Binary search tree in data structure.
The complete explanation of working of trees and Binary Search Tree is given. It is discussed such a way that everyone can easily understand it. Trees have great role in the data structures.
Given presentation tell us about string, string matching and the navie method of string matching. Well this method has O((n-m+1)*m) time complexicity. It also tells the problem with naive approach and gives list of approaches which can be applied to reduce the time complexicity
This presentation contains the following operations
Creating string objects.
Reading string objects from keyboard.
Displaying string objects to the screen.
Finding a substring from a string.
Modifying string objects.
Adding string objects.
Accessing characters in a string.
Obtaining the size of string.
Tree and Binary search tree in data structure.
The complete explanation of working of trees and Binary Search Tree is given. It is discussed such a way that everyone can easily understand it. Trees have great role in the data structures.
Given presentation tell us about string, string matching and the navie method of string matching. Well this method has O((n-m+1)*m) time complexicity. It also tells the problem with naive approach and gives list of approaches which can be applied to reduce the time complexicity
This presentation contains the following operations
Creating string objects.
Reading string objects from keyboard.
Displaying string objects to the screen.
Finding a substring from a string.
Modifying string objects.
Adding string objects.
Accessing characters in a string.
Obtaining the size of string.
I am Charles B. I am a Programming Exam Expert at programmingexamhelp.com. I hold a Ph.D. in Programming Texas University, USA. I have been helping students with their exams for the past 9 years. You can hire me to take your exam in Programming.
Visit programmingexamhelp.com or email support@programmingexamhelp.com. You can also call on +1 678 648 4277 for any assistance with the Programming Exam.
WHAT IS PATTERN RECOGNISITION ?
Given a text string T[0..n-1] and a pattern P[0..m-1], find all occurrences of the pattern within the text.
Example: T = 000010001010001 and P = 0001, the occurrences are:
first occurrence starts at T[1]
second occurrence starts at T[5]
third occurrence starts at T[11]
APPLICATION :
Image preprocessing
• Computer vision
• Artificial intelligence
• Radar signal classification/analysis
• Speech recognition/understanding
• Fingerprint identification
• Character (letter or number) recognition
• Handwriting analysis
• Electro-cardiographic signal analysis/understanding
• Medical diagnosis
• Data mining/reduction
ALGORITHM DESCRIPTION
The Knuth-Morris-Pratt (KMP) algorithm:
It was published by Donald E. Knuth, James H.Morris and Vaughan R. Pratt, 1977 in: “Fast Pattern Matching in Strings.“
To illustrate the ideas of the algorithm, consider the following example:T = xyxxyxyxyyxyxyxyyxyxyxxy
And P = xyxyyxyxyxx
it considers shifts in order from 1 to n-m, and determines if the pattern matches at that shift. The difference is that the KMP algorithm uses information gleaned from partial matches of the pattern and text to skip over shifts that are guaranteed not to result in a match.
The text and pattern are included in Figure 1, with numbering, to make it easier to follow.
1.Consider the situation when P[1……3] is successfully matched with T[1……..3]. We then find a mismatch: P[4] = T[4]. Based on our knowledge that P[1…… 3] =T[1…… 3], and ignoring symbols of the pattern and text after position 3, what can we deduce about where a potential match might be? In this case, the algorithm slides the pattern 2 positions to the right so that P[1] is lined up with T[3]. The next comparison is between P[2] and T[4].
P: x y x y y x y x y x x
q: 1 2 3 4 5 6 7 8 9 10 11
_(q): 0 0 1 2 0 3
Table 1: Table of values for pattern P.
Time Complexity :
The call to compute prefix is O(m)
using q as the value of the potential function, we argue in the same manner as above to show the loop is O(n)
Therefore the overall complexity is O(m + n)
Boyer-Moore algorithm:
It was developed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses the pattern string that is being searched in text string.
String pattern matching - Boyer-Moore:
This algorithm uses fail-functions to shift the pattern efficiently. Boyer-Moore starts however at the end of the pattern, which can result in larger shifts.Two heuristics are used:1: if we encounter a mismatch at character c in Q, we can shift to the first occurrence of c in P from the right:
Q a b c a b c d g a b c e a b c d a c e d
P a b c e b c d
a b c e b c d
(restart here)
Time Complexity:
• performs the comparisons from right to left;
• preprocessing phase in O(m+ ) t
string searching algorithms. Given two strings P and T over the same alphabet E, determine whether P occurs as a substring in T (or find in which position(s) P occurs as a substring in T). The strings P and T are called pattern and target respectively.
I am Gabriel C. I love exploring new topics. Academic writing seemed an interesting option for me. After working for many years with programmingexamhelp.com, I have assisted many students with their exams. I can proudly say, each student I have served is happy with the quality of the solution that I have provided. I have acquired Business analyst of Information Technology, Montreal College of Information Technology, Canada.
1. Case Study: Searching for Patterns
Problem: find all occurrences of pattern
P of length m inside the text T of
length n.
Þ Exact matching problem
1
2. String Matching - Applications
2
Text editing
Term rewriting
Lexical analysis
Information retrieval
And, bioinformatics
3. Given a string P (pattern) and a longer string T (text).
Aim is to find all occurrences of pattern P in text T .
If m is the length of P , and n is the length of T , then
3
Exact matching problem
The naive method:
T a g c a g a a g a g t a
P Paagg Paaggagg Paaggaggag Paggaaggagg Pagaaggaggg Pgaggagaagg agaagggagg gaggag agg g
Time complexity = O(m.n),
Space complexity = O(m + n)
4. 4
Can we be more clever ?
When a mismatch is detected, say at position
k in the pattern string, we have already
successfully matched k-1 characters.
We try to take advantage of this to decide
where to restart matching
T a g c a g a a g a g t a
P aagg Pagg Pagaagg aagggagg Paggag Paaggagg aaggaggg aggag agg g
5. The Knuth-Morris-Pratt Algorithm
Observation: when a mismatch occurs, we
may not need to restart the comparison all way
back (from the next input position).
What to do:
Constructing an array h, that determines how
many characters to shift the pattern to the right
in case of a mismatch during the pattern-matching
5
process.
6. The key idea is that if we have
successfully matched the prefix P[1…i-
1] of the pattern with the substring T[j-i+
1,…,j-1] of the input string and P(i) ¹
T(j), then we do not need to reprocess
any of the suffix T[j-i+1,…,j-1] since we
know this portion of the text string is
the prefix of the pattern that we have
just matched.
6
KMP (2)
7. The failure function h
For each position i in pattern P, define hi to be
the length of the longest proper suffix of P[1,…,i]
that matches a prefix of P.
i = 11
1 2 3 4 5 6 7 8 9 10 11 12 13
Pattern: a b a a b a b a a b a a b
7
Hence, h(11) = 6.
If there is no proper suffix of P[1,…,i] with the
property . mentioned above, then h(i) = 0
8. The first mismatch in position
k=12 of T and in pos. i+1=12 of P.
T: a b a a b a b a a b a c a b a a b a b a a b
P:
P:
a b a a b a b a a b a a b
8
The KMP shift rule
a b a a b a b a a b a a b
h11 = 6
Shift P to the right so that P[1,…,h(i)] aligns with the suffix
T[k- h(i),…,k-1].
They must match because of the definition of h.
In other words, shift P exactly i – h(i) places to the right.
If there is no mismatch, then shift P by m – h(m) places to
the right.
9. The KMP algorithm finds all occurrences of
P in T.
P before shift
missed occurrence of P
9
P after shift
First mismatch in position
k of T and in pos. i+1 of P.
Suppose not, …
T:
b b
10. First mismatch in position
k of T and in pos. i+1 of P.
10
Correctness of KMP.
P before shift
P after shift
missed occurrence of P
b
a
a
b
|a| > |b| = h(i)
T:
It is a contradiction.
T(k)
11. Array h: 0 0 1 1 2 3 2 3 4 5 6 4 5
What is h(i)= h(11) = ? h(11) = 6 Þ i – h(i) = 11 – 6 = 5
11
An Example
1 2 3 4 5 6 7 8 9 10 11 12 13
Pattern: a b a a b a b a a b a a b
Next funciton: 0 1 0 2 1 0 4 0 2 1 0 7 1
abaababaabacabaababaabaab.
Given:
Input string:
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
Scenario 1:
i+1 = 12
k = 12
Scenario 2: i+1
k
i = 6 , h(6)= 3
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
12. a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
12
An Example
Scenario 3: i+1
k
i = 3, h(3) = 1
Subsequently i = 2, 1, 0
Finally, a match is found:
i+1
a b a a b a b a a b a a b
a b a a b a b a a b a c a b a a b a b a a b a a b
k
13. Complexity of KMP
In the KMP algorithm, the number of character
comparisons is at most 2n.
After shift, no character will be no backtrack
compared before this position
P before shift
Hence #comparisons £ #shifts + |T| £ 2|T| = 2n.
13
P after shift
T:
In any shift at most one comparison involves a
character of T that was previously compared.
14. Computing the failure function
We can compute h(i+1) if we know h(1)..h(i)
To do this we run the KMP algorithm where the text is
the pattern with the first character replaced with a $.
Suppose we have successfully matched a prefix of the
pattern with a suffix of T[1..i]; the length of this match is
h(i).
If the next character of the pattern and text match then
h(i+1)=h(i)+1.
If not, then in the KMP algorithm we would shift the
pattern; the length of the new match is h(h(i)).
If the next character of the pattern and text match then
h(i+1)=h(h(i))+1, else we continue to shift the pattern.
Since the no. of comparisons required by KMP is length
of pattern+text, time taken for computing the failure
function is O(n).
14
15. 1 2 3 4 5 6 7 8 9 10 11 12 13
Failure fPuantctteiornn: h: 0 0 1 1 2 3 2 3 4 5 6
1 2 3 4 5 6 7 8 9 10 11 12 13
Text: $ b a a b a b a a b a a b
15
Computing h: an example
h(12)= 4 = h(6)+1 = h(h(11))+1
h(13)= 5 = h(12)+1
Given:
Pattern: a b a a b a
b
a
h(11)=6
4
Pattern a b a b
h(6)=3
5
16. preprocessing searching
16
KMP - Analysis
The KMP algorithm never needs to
backtrack on the text string.
Time complexity = O(m + n)
Space complexity = O(m + n),
where m =|P| and n=|T|.