Knuth-Morris-Pratt Substring
Search Algorithm
Prepared By:
Sabiya Fatima
Email ID: sabiya1990fatima@gmail.com
1
Outline
 Definition
 History
 Components of KMP
 Algorithm
 Example
 Run-Time Analysis
 Complexity comparison of String Matching Algorithms
 Advantages and Disadvantages
 Real Time Applications
 References
2
What is Pattern Searching ?
 Suppose you are reading a text document.
 You want to search for a word.
 You click CTRL + F and search for that word.
 The word processor scans the document and shows the position of
occurrence.
What exactly happens is that, word i.e. pattern is searched inside the
text document.
3
Definition
 Best known for linear time for exact matching. Compares from left to right.
 Shifts more than one position.
 Preprocessing approach of Pattern to avoid trivial comparisions.
 Avoids recomputing matches.
4
History
 This algorithm was conceived by Donald Knuth and Vaughan Pratt and independently by
James H.Morris in 1977.
 Knuth, Morris and Pratt discovered first linear time string-matching algorithm by
analysis of the naive algorithm.
 It keeps the information that naive approach wasted gathered during the scan of the
text. By avoiding this waste of information, it achieves a running time of O(m + n).
 The implementation of Knuth-Morris-Pratt algorithm is efficient because it minimizes
the total number of comparisons of the pattern against the input string.
5
Naïve Approach
The naïve approach is to check whether the pattern matches the string at
every possible position in the string.
P= Pattern (word) of length m
T= Text (document) of length n
Naive string matching algorithm
takes time O((n-m+1)m) or
O(mn)
6
The KMP Algorithm - Motivation
x
j
. . a b a a b . . . . .
a b a a b a
a b a a b a
No need to
repeat these
comparisons
Resume
comparing
here
 Knuth-Morris-Pratt’s algorithm
compares the pattern to the text
in left-to-right, but shifts the
pattern more intelligently than
the brute-force algorithm.
 When a mismatch occurs, what
is the most we can shift the
pattern so as to avoid redundant
comparisons?
 Answer: the largest prefix of
P[0..j] that is a suffix of P[1..j]
7
Components of KMP algorithm
 The prefix function, Π
The prefix function,Π for a pattern encapsulates knowledge about how the pattern
matches against shifts of itself. This information can be used to avoid useless shifts of
the pattern ‘p’. In other words, this enables avoiding backtracking on the text ‘T’.
 The KMP Matcher
With text ‘T’, pattern ‘p’ and prefix function ‘Π’ as inputs, finds the occurrence of ‘p’ in
‘T’ and returns the number of shifts of ‘p’ after which occurrence is found.
8
The prefix function, Π
Following pseudocode computes the prefix function, Π:
Compute-Prefix-Function (p)
1 m  length[p] //’p’ pattern to be matched
2 Π[1]  0
3 k  0
4 for q  2 to m
5 do while k > 0 and p[k+1] != p[q]
6 do k  Π[k]
7 If p[k+1] = p[q]
8 then k  k +1
9 Π[q]  k
10 return Π
9
Example: compute Π for the pattern ‘p’ below:
a b a b a c a
Initially: m = length[p] = 7
Π[1] = 0
k = 0
Step 1: q = 2, k=0
Π[2] = 0
Step 2: q = 3, k = 0,
Π[3] = 1
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1
10
p
10
Contd…
11
Step 3: q = 4, k = 1
Π[4] = 2 q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2
Step 4: q = 5, k =2
Π[5] = 3
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3
Step 5: q = 6, k = 3
Π[6] = 0
Step 6: q = 7, k = 0
Π[7] = 1
After iterating 6 times, the prefix function
computation is complete: 
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1
The running time of the prefix function is O(m).
12Contd…..
The KMP Matcher
Input: The KMP Matcher, with pattern ‘p’, text ‘T’and prefix function ‘Π’, finds a match of p in T.
Following pseudocode computes the matching component of KMP algorithm:
KMP-Matcher(T,p)
1 n  length[T]
2 m  length[p]
3 Π  Compute-Prefix-Function(p)
4 q  0 //number of characters matched
5 for i  1 to n //scan T from left to right
6 do while q > 0 and p[q+1] != T[i]
7 do q  Π[q] //next character does not match
8 if p[q+1] = T[i]
9 then q  q + 1 //next character matches
10 if q = m //is all of p matched?
11 then print “Pattern occurs with shift” i – m
12 q  Π[ q] // look for the next match
Note: KMP finds every occurrence of a ‘p’in ‘T’. That is why KMP does not terminate in step 12, rather it searches
remainder of ‘T’for any more occurrences of ‘p’.
13
Illustration: given a Text ‘T’ and pattern ‘p’ as follows:
T
b a c b a b a b a b a c a c a
p a b a b a c a
Let us execute the KMP algorithm to find whether ‘p’ occurs in ‘T’.
For ‘p’the prefix function, Π was computed previously and is as follows:
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1 14
14
b a c b a b a b a b a c a a b
a b a b a c a
Initially: n = size of T = 15;
m = size of p = 7
Step 1: i = 1, q = 0
comparing p[1] with T[1]
T
p
P[1] does not match with T[1]. ‘p’ will be shifted one position to the right.
15
15
Contd…
Step 2: i = 2, q = 0
comparing p[1] with T[2]
T
p
a b a b a c a
P[1] matches T[2]. Since there is a match, p is not shifted.
b a c b a b a b a b a c a a b
Contd…
16
T b a c b a b a b a b a c a a b
a b a b a c ap
p[2] does not match with T[3]
Backtracking on p, comparing p[1] and T[3]
Step 3: i = 3, q = 1
Comparing p[2] with T[3]
a b a b a c a
T
p
Step 4: i = 4, q = 0 comparing p[1] with T[4] p[1] does not match with T[4]
b a c b a b a b a b a c a a b
b a c b a b a b a b a c a a b
a b a b a c a
T
p
Step 5: i = 5, q = 0
comparing p[1] with T[5] p[1] matches with T[5]
17
17
Step 6: i = 6, q = 1 Comparing p[2] with T[6] p[2] matches with T[6]
T
p
b a c b a b a b a b a c a a b
a b a b a c a
Contd…
b a c b a b a b a b a c a a b
b a c b a b a b a b a c a a b
a b a b a c a
a b a b a c a
T
p
Step 7: i = 7, q = 2 Comparing p[3] with T[7]
p[3] matches with T[7]
Step 8: i = 8, q = 3 Comparing p[4] with T[8]
p[4] matches with T[8]
T
p
18
18
Contd…
Step 9: i = 9, q = 4
Comparing p[5] with T[9]
Comparing p[6] with T[10]Step 10: i = 10, q = 5
T
p
b a c b a b a b a b a c a a b
b a c b a b a b a b a c a a b
a b a b a c a
a b a b a c a
p[6] doesn’t match with T[10]
Backtracking on p, comparing p[4] with T[10] because after mismatch q = Π[5] = 3
p[5] matches with T[9]
19
19
T
p
Contd…
20
Step 11: i = 11, q = 4
Comparing p[5] with T[11] p[5] matches with T[11]
T
p
b a c b a b a b a b a c a a b
a b a b a c a
Contd…
Step 12: i = 12, q = 5 Comparing p[6] with T[12] p[6] matches with T[12]
a b a b a c ap
b a c b a b a b a b a c a a bT
b a c b a b a b a b a c a a b
a b a b a c a
Comparing p[7] with T[13]
T
p
Step 13: i = 13, q = 6 p[7] matches with T[13]
Pattern ‘p’ has been found to completely occur in text ‘T’. The total number of shifts
that took place for the match to be found are: i – m = 13 – 7 = 6 shifts.
The running time of the KMP-Matcher function is O(n).
21
21
Contd…
Complexity
 O(m) - It is to compute the prefix function values.
 O(n) - It is to compare the pattern to the text.
 Total of O(n + m) run time.
22
Complexity comparison of String Matching
Algorithms
23
Advantage and Disadvantage
Advantages:
1.The running time of the KMP algorithm is optimal (O(m + n)), which is very fast.
2.The algorithm never needs to move backwards in the input text T. It makes the
algorithm good for processing very large files.
Disadvantages:
Doesn’t work so well as the size of the alphabets increases. By which more chances
of mismatch occurs.
24
Real time Applications
 Good for plagiarism analysis.
 search engines
 language syntax checker
 database queries
 music content retrieval
25
Real time Applications
26
 DNA sequences analysis :
• It is mainly composed of nucleotides of four types. The four bases in DNA are
Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). A DNA sequence is a
representation of a string of nucleotides contained in a strand of DNA.
• DNA sequences analysis of various diseases which are stored in database for retrieval
and comparison. This system compares similarity values with threshold value and
stores particular result which is diseased or not.
• For example: ATTCGTAACTAGTAAGTTA. The DNA sequencing techniques have
allowed the vast amount of data to be analyzed in a short span of time. So, pattern
matching techniques plays a vital role in computational biology for data analysis
related to biological data such as DNA sequences
References
 Thomas H.Cormen; Charles E.Leiserson., Introduction to algorithms second edition , “The Knuth-Morris-Pratt
Algorithm”, year = 2001.
 https://pdfs.semanticscholar.org/fe41/52465f96d09c94b46b86a3b6408dae5dbe13.pdf
 http://research.ijcaonline.org/volume115/number23/pxc3902734.pdf
27
Thank You
28

Knuth morris pratt string matching algo

  • 1.
    Knuth-Morris-Pratt Substring Search Algorithm PreparedBy: Sabiya Fatima Email ID: sabiya1990fatima@gmail.com 1
  • 2.
    Outline  Definition  History Components of KMP  Algorithm  Example  Run-Time Analysis  Complexity comparison of String Matching Algorithms  Advantages and Disadvantages  Real Time Applications  References 2
  • 3.
    What is PatternSearching ?  Suppose you are reading a text document.  You want to search for a word.  You click CTRL + F and search for that word.  The word processor scans the document and shows the position of occurrence. What exactly happens is that, word i.e. pattern is searched inside the text document. 3
  • 4.
    Definition  Best knownfor linear time for exact matching. Compares from left to right.  Shifts more than one position.  Preprocessing approach of Pattern to avoid trivial comparisions.  Avoids recomputing matches. 4
  • 5.
    History  This algorithmwas conceived by Donald Knuth and Vaughan Pratt and independently by James H.Morris in 1977.  Knuth, Morris and Pratt discovered first linear time string-matching algorithm by analysis of the naive algorithm.  It keeps the information that naive approach wasted gathered during the scan of the text. By avoiding this waste of information, it achieves a running time of O(m + n).  The implementation of Knuth-Morris-Pratt algorithm is efficient because it minimizes the total number of comparisons of the pattern against the input string. 5
  • 6.
    Naïve Approach The naïveapproach is to check whether the pattern matches the string at every possible position in the string. P= Pattern (word) of length m T= Text (document) of length n Naive string matching algorithm takes time O((n-m+1)m) or O(mn) 6
  • 7.
    The KMP Algorithm- Motivation x j . . a b a a b . . . . . a b a a b a a b a a b a No need to repeat these comparisons Resume comparing here  Knuth-Morris-Pratt’s algorithm compares the pattern to the text in left-to-right, but shifts the pattern more intelligently than the brute-force algorithm.  When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?  Answer: the largest prefix of P[0..j] that is a suffix of P[1..j] 7
  • 8.
    Components of KMPalgorithm  The prefix function, Π The prefix function,Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern ‘p’. In other words, this enables avoiding backtracking on the text ‘T’.  The KMP Matcher With text ‘T’, pattern ‘p’ and prefix function ‘Π’ as inputs, finds the occurrence of ‘p’ in ‘T’ and returns the number of shifts of ‘p’ after which occurrence is found. 8
  • 9.
    The prefix function,Π Following pseudocode computes the prefix function, Π: Compute-Prefix-Function (p) 1 m  length[p] //’p’ pattern to be matched 2 Π[1]  0 3 k  0 4 for q  2 to m 5 do while k > 0 and p[k+1] != p[q] 6 do k  Π[k] 7 If p[k+1] = p[q] 8 then k  k +1 9 Π[q]  k 10 return Π 9
  • 10.
    Example: compute Πfor the pattern ‘p’ below: a b a b a c a Initially: m = length[p] = 7 Π[1] = 0 k = 0 Step 1: q = 2, k=0 Π[2] = 0 Step 2: q = 3, k = 0, Π[3] = 1 q 1 2 3 4 5 6 7 p a b a b a c a Π 0 0 q 1 2 3 4 5 6 7 p a b a b a c a Π 0 0 1 10 p 10
  • 11.
    Contd… 11 Step 3: q= 4, k = 1 Π[4] = 2 q 1 2 3 4 5 6 7 p a b a b a c a Π 0 0 1 2 Step 4: q = 5, k =2 Π[5] = 3 q 1 2 3 4 5 6 7 p a b a b a c a Π 0 0 1 2 3
  • 12.
    Step 5: q= 6, k = 3 Π[6] = 0 Step 6: q = 7, k = 0 Π[7] = 1 After iterating 6 times, the prefix function computation is complete:  q 1 2 3 4 5 6 7 p a b a b a c a Π 0 0 1 2 3 0 q 1 2 3 4 5 6 7 p a b a b a c a Π 0 0 1 2 3 0 1 q 1 2 3 4 5 6 7 p a b a b a c a Π 0 0 1 2 3 0 1 The running time of the prefix function is O(m). 12Contd…..
  • 13.
    The KMP Matcher Input:The KMP Matcher, with pattern ‘p’, text ‘T’and prefix function ‘Π’, finds a match of p in T. Following pseudocode computes the matching component of KMP algorithm: KMP-Matcher(T,p) 1 n  length[T] 2 m  length[p] 3 Π  Compute-Prefix-Function(p) 4 q  0 //number of characters matched 5 for i  1 to n //scan T from left to right 6 do while q > 0 and p[q+1] != T[i] 7 do q  Π[q] //next character does not match 8 if p[q+1] = T[i] 9 then q  q + 1 //next character matches 10 if q = m //is all of p matched? 11 then print “Pattern occurs with shift” i – m 12 q  Π[ q] // look for the next match Note: KMP finds every occurrence of a ‘p’in ‘T’. That is why KMP does not terminate in step 12, rather it searches remainder of ‘T’for any more occurrences of ‘p’. 13
  • 14.
    Illustration: given aText ‘T’ and pattern ‘p’ as follows: T b a c b a b a b a b a c a c a p a b a b a c a Let us execute the KMP algorithm to find whether ‘p’ occurs in ‘T’. For ‘p’the prefix function, Π was computed previously and is as follows: q 1 2 3 4 5 6 7 p a b a b a c a Π 0 0 1 2 3 0 1 14 14
  • 15.
    b a cb a b a b a b a c a a b a b a b a c a Initially: n = size of T = 15; m = size of p = 7 Step 1: i = 1, q = 0 comparing p[1] with T[1] T p P[1] does not match with T[1]. ‘p’ will be shifted one position to the right. 15 15 Contd… Step 2: i = 2, q = 0 comparing p[1] with T[2] T p a b a b a c a P[1] matches T[2]. Since there is a match, p is not shifted. b a c b a b a b a b a c a a b
  • 16.
    Contd… 16 T b ac b a b a b a b a c a a b a b a b a c ap p[2] does not match with T[3] Backtracking on p, comparing p[1] and T[3] Step 3: i = 3, q = 1 Comparing p[2] with T[3] a b a b a c a T p Step 4: i = 4, q = 0 comparing p[1] with T[4] p[1] does not match with T[4] b a c b a b a b a b a c a a b
  • 17.
    b a cb a b a b a b a c a a b a b a b a c a T p Step 5: i = 5, q = 0 comparing p[1] with T[5] p[1] matches with T[5] 17 17 Step 6: i = 6, q = 1 Comparing p[2] with T[6] p[2] matches with T[6] T p b a c b a b a b a b a c a a b a b a b a c a Contd…
  • 18.
    b a cb a b a b a b a c a a b b a c b a b a b a b a c a a b a b a b a c a a b a b a c a T p Step 7: i = 7, q = 2 Comparing p[3] with T[7] p[3] matches with T[7] Step 8: i = 8, q = 3 Comparing p[4] with T[8] p[4] matches with T[8] T p 18 18 Contd…
  • 19.
    Step 9: i= 9, q = 4 Comparing p[5] with T[9] Comparing p[6] with T[10]Step 10: i = 10, q = 5 T p b a c b a b a b a b a c a a b b a c b a b a b a b a c a a b a b a b a c a a b a b a c a p[6] doesn’t match with T[10] Backtracking on p, comparing p[4] with T[10] because after mismatch q = Π[5] = 3 p[5] matches with T[9] 19 19 T p Contd…
  • 20.
    20 Step 11: i= 11, q = 4 Comparing p[5] with T[11] p[5] matches with T[11] T p b a c b a b a b a b a c a a b a b a b a c a Contd… Step 12: i = 12, q = 5 Comparing p[6] with T[12] p[6] matches with T[12] a b a b a c ap b a c b a b a b a b a c a a bT
  • 21.
    b a cb a b a b a b a c a a b a b a b a c a Comparing p[7] with T[13] T p Step 13: i = 13, q = 6 p[7] matches with T[13] Pattern ‘p’ has been found to completely occur in text ‘T’. The total number of shifts that took place for the match to be found are: i – m = 13 – 7 = 6 shifts. The running time of the KMP-Matcher function is O(n). 21 21 Contd…
  • 22.
    Complexity  O(m) -It is to compute the prefix function values.  O(n) - It is to compare the pattern to the text.  Total of O(n + m) run time. 22
  • 23.
    Complexity comparison ofString Matching Algorithms 23
  • 24.
    Advantage and Disadvantage Advantages: 1.Therunning time of the KMP algorithm is optimal (O(m + n)), which is very fast. 2.The algorithm never needs to move backwards in the input text T. It makes the algorithm good for processing very large files. Disadvantages: Doesn’t work so well as the size of the alphabets increases. By which more chances of mismatch occurs. 24
  • 25.
    Real time Applications Good for plagiarism analysis.  search engines  language syntax checker  database queries  music content retrieval 25
  • 26.
    Real time Applications 26 DNA sequences analysis : • It is mainly composed of nucleotides of four types. The four bases in DNA are Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). A DNA sequence is a representation of a string of nucleotides contained in a strand of DNA. • DNA sequences analysis of various diseases which are stored in database for retrieval and comparison. This system compares similarity values with threshold value and stores particular result which is diseased or not. • For example: ATTCGTAACTAGTAAGTTA. The DNA sequencing techniques have allowed the vast amount of data to be analyzed in a short span of time. So, pattern matching techniques plays a vital role in computational biology for data analysis related to biological data such as DNA sequences
  • 27.
    References  Thomas H.Cormen;Charles E.Leiserson., Introduction to algorithms second edition , “The Knuth-Morris-Pratt Algorithm”, year = 2001.  https://pdfs.semanticscholar.org/fe41/52465f96d09c94b46b86a3b6408dae5dbe13.pdf  http://research.ijcaonline.org/volume115/number23/pxc3902734.pdf 27
  • 28.