STRING MATCHING
String Matching
 Definition of string matching
 Naive string-matching algorithm
 Rabin-Karp algorithm
 Finite automata
 Linear time matching using finite
automata
 Knuth-Morris-Pratt algorithm
Dr. AMIT KUMAR @JUET
Outline
String Matching
 Introduction
 Naïve Algorithm
Dr. AMIT KUMAR @JUET
Introduction
 What is string matching?
 Finding all occurrences of a pattern in a
given text (or body of text)
 Many applications
 While using editor/word processor/browser
 Login name & password checking
 Virus detection
 Header analysis in data communications
 DNA sequence analysis
Dr. AMIT KUMAR @JUET
TYPES OF STRING MATCHING:-
 Exact string matching:
means finding one or all exact occurrences
of a pattern in a text.
 Naïve (Brute force) algorithm
 Boyer and Moore
 Knuth-Morris and Pratt
are exact string matching
algorithms. Dr. AMIT KUMAR @JUET
 Approximate string matching
It is the technique of finding approximate
(may not exact) matches to a pattern in a
string
 Karp and Rabin algorithm
Dr. AMIT KUMAR @JUET
String-Matching Problem
 The text is in an array T [1..n] of length n
 The pattern is in an array P [1..m] of
length m
 Elements of T and P are characters from
a finite alphabet 
 E.g.,  = {0,1} or  = {a, b, …, z}
 Usually T and P are called strings of
characters
Dr. AMIT KUMAR @JUET
String-Matching Problem
…contd
 We say that pattern P occurs with shift s
in text T if:
a) 0 ≤ s ≤ n-m and
b) T [(s+1)..(s+m)] = P [1..m]
 If P occurs with shift s in T, then s is a valid
shift, otherwise s is an invalid shift
 String-matching problem: finding all
valid shifts for a given T and P
Dr. AMIT KUMAR @JUET
Example 1
a b c a b a a b c a b a c
a b a a
text T
pattern P s = 3
shift s = 3 is a valid shift
(n=13, m=4 and 0 ≤ s ≤ n-m holds)
1 2 3 4 5 6 7 8 9 10 11 12 13
1 2 3 4
Dr. AMIT KUMAR @JUET
Example 2
a b c a b a a b c a b a a
a b a a
text T
pattern P
s = 3
a b a a
a b a a
s = 9
1 2 3 4 5 6 7 8 9 10 11 12 13
1 2 3 4
Dr. AMIT KUMAR @JUET
Terminology
 Concatenation of 2 strings x and y is xy
 E.g., x=“putra”, y=“jaya”  xy =
“putrajaya”
 A string w is a prefix of a string x, if x=wy
for some string y
 E.g., “putra” is a prefix of “putrajaya”
 A string w is a suffix of a string x, if x=yw
for some string y
 E.g., “jaya” is a suffix of “putrajaya”
Dr. AMIT KUMAR @JUET
Naïve String-Matching Algorithm
Input: Text strings T [1..n] and P[1..m]
Result: All valid shifts displayed
NAÏVE-STRING-MATCHER (T, P)
n ← length[T]
m ← length[P]
for s ← 0 to n-m
if P[1..m] = T [(s+1)..(s+m)]
print “pattern occurs with shift” s
Dr. AMIT KUMAR @JUET
WORKING OF NAÏVE STRING
MATCHING
 The naive string‐matching procedure can be
interpreted graphically as sliding a
"template“ containing the pattern over the
text, noting for which shifts all of the
characters on the template equal the
corresponding characters in the text.
Dr. AMIT KUMAR @JUET
Contd…
 The for loop beginning on line 3 considers
each possible shift explicitly.
 match successfully or a mismatch is found.
 Line 5 prints out each valid shift s
 The test on line 4 determines whether the
current shift is valid or not; this test involves an
implicit loop to check corresponding character
positions until all positions Dr. AMIT KUMAR @JUET
Analysis: Worst-case Example
a a a a a a a a a a a a atext T
pattern P
a a a b
a a a b
1 2 3 4 5 6 7 8 9 10 11 12 13
1 2 3 4
a a a bDr. AMIT KUMAR @JUET
Worst-case Analysis
 There are m comparisons for each shift
in the worst case
 There are n-m+1 shifts
 So, the worst-case running time is
Θ((n-m+1)m) , which is Θ(n2) if
m = floor(n/2)
 In the example on previous slide, we
have (13-4+1)4 comparisons in total
 Naïve method is inefficient because
information from a shift is not used again
Dr. AMIT KUMAR @JUET
ADVANTAGES:-
 No preprocessing phase required
because the running time of
NAIVE‐STRING‐ MATCHER is equal to its
matching time
 No extra space are needed.
 Also, the comparisons can be done in
any order.
Dr. AMIT KUMAR @JUET
Problem with naïve algorithm
 Problem with Naïve algorithm:
 Suppose p=ababc, T=cabababcd.
T: c a b a b a b c d
P: a …
P: a b a b c
P: a…
P: a b a b c
 Whenever a character mismatch occurs after
matching of several characters, the comparison
begins by going back in from the character
which follows the last beginning character.
Dr. AMIT KUMAR @JUET
QUESTION???
Consider a situation where all characters of
pattern are different. Can we modify the
original Naive String Matching algorithm so
that it works better for these types of patterns.
If we can, then what are the changes to
original algorithm?
Dr. AMIT KUMAR @JUET
ANSWER:-
In the original Naive String matching algorithm , we
always slide the pattern by 1. When all characters of
pattern are different, we can slide the pattern by
more than 1.
When a mismatch occurs after j matches, we know
that the first character of pattern will not match the j
matched characters because all characters of
pattern are different. So we can always slide the
pattern by j without missing any valid shifts.
Dr. AMIT KUMAR @JUET
QUESTION??
HOW TO REDUCE THE
PROCESSING TIME OF NAÏVE
STRING MATCHING ??
Dr. AMIT KUMAR @JUET
Three exact single pattern matching
algorithms:-
 FC-RJ (First Character-Rami and Jehad)
 FLC-RJ (First and Last Characters-Rami
and Jehad)
 FMLC-RJ (First, Middle and Last
Characters-Rami and Jehad) .
Dr. AMIT KUMAR @JUET
FC-RJ (First Character-Rami and Jehad
 The algorithm creates a new array called
(Occurrence_List) of size (n - m + 1), where
n is the size of the text and m is the size of
the pattern. The length of the
Occurrence_List is (n - m + 1) because it is
impossible to the pattern to occur after
the position (n - m) in the text
Dr. AMIT KUMAR @JUET
 This array will hold the indices of the
occurrences of the pattern’s first character in the
text using an integer variable (i) starting from (0)
and incremented by one after each match
 The algorithm scans the text in a single pass,
using an integer variable (j) and compares its
characters with the pattern’s first character. If
the current character of the text (jth character)
is equal to the pattern's first character, the
algorithm saves the index of the current
character in the text (the value of j) in the ith
index of the Occurrence_List array and
increments the value by one. Dr. AMIT KUMAR @JUET
FLC-RJ algorithm:
 The concept of FLC-RJ (first and Last
Characters-Rami and Jehad) algorithm
follows the concept of FC-RJ algorithm.
 It seems more efficient to attempt
matching the pattern only with the sub-
strings of the text that start with the
pattern’s first character and also end with
the pattern’s last character.
 This technique decreases the number of
character comparisons in the text.
Dr. AMIT KUMAR @JUET
FMLC-RJ Algorithm:-
 FMLC-RJ algorithm adds another restriction to a sub-
string of the text to be considered as an expected
occurrence of the pattern.
 It seems more efficient to attempt matching the pattern
only with the sub-strings of the text that start with the
pattern’s first character and end with the pattern’s last
character and at the same time, they have middle
characters equal the pattern’s middle character.
 This technique decreases the number of character
comparisons in the text during the searching phase.
Dr. AMIT KUMAR @JUET
RESULTS:-
 The best performance of the naïve string
algorithms is when the length of the
pattern was relatively short. Since the
algorithm compares almost m characters
at each index of the text, the execution
time increases as m gets larger.
 The best performance of the FLC-RJ
algorithms is when the length of the
pattern was two characters. Since, the
algorithm only outputs the content of the
Occurrence_List array if the pattern’s
length is two characters.
Dr. AMIT KUMAR @JUET
Contd…
 The best performance of the FMLC-RJ
algorithms is when the length of the
pattern was three characters. The
algorithm searches for the first, middle and
last characters of the pattern and then it
outputs the content of the Occurrence_List
array as a result.
Dr. AMIT KUMAR @JUET
Dr. AMIT KUMAR @JUET
Experimental results of FC-
RJ algorithm
Experimental results of FLC-RJ algorithm
Dr. AMIT KUMAR @JUET
Experimental results of FMLC-RJ algorithm
Experimental results of the naïve string
algorithm
Dr. AMIT KUMAR @JUET
CONCLUSION:-
Dr. AMIT KUMAR @JUET
 It is apparent that the FC-RJ, FLC-RJ and FMLC-RJ algorithms
outperform the performance of the brute force algorithm.
 It is clear that our proposed algorithms enhance the execution time of
string matching as compared to the brute force algorithm.
 This enhancement is calculated by considering the differences in
execution times of the algorithms to search for 14 patterns samples as
recorded in Table 1.
Dr. AMIT KUMAR @JUET
SUMMARY
 The "naive" approach is easy to understand and
implement but it can be too slow in some cases. If
the length of the text is n and the length of the
pattern m, in the worst case it may take as much as
(n * m) iterations to complete the task.
 It should be noted though, that for most practical
purposes, which deal with texts based on human
languages, this approach is much faster since the
inner loop usually quickly finds a mismatch and
breaks. A problem arises when we are faced with
different kinds of "texts," such as the genetic code.Dr. AMIT KUMAR @JUET
THANK YOU
Dr. AMIT KUMAR @JUET

String matching, naive,

  • 1.
  • 2.
    String Matching  Definitionof string matching  Naive string-matching algorithm  Rabin-Karp algorithm  Finite automata  Linear time matching using finite automata  Knuth-Morris-Pratt algorithm Dr. AMIT KUMAR @JUET
  • 3.
    Outline String Matching  Introduction Naïve Algorithm Dr. AMIT KUMAR @JUET
  • 4.
    Introduction  What isstring matching?  Finding all occurrences of a pattern in a given text (or body of text)  Many applications  While using editor/word processor/browser  Login name & password checking  Virus detection  Header analysis in data communications  DNA sequence analysis Dr. AMIT KUMAR @JUET
  • 5.
    TYPES OF STRINGMATCHING:-  Exact string matching: means finding one or all exact occurrences of a pattern in a text.  Naïve (Brute force) algorithm  Boyer and Moore  Knuth-Morris and Pratt are exact string matching algorithms. Dr. AMIT KUMAR @JUET
  • 6.
     Approximate stringmatching It is the technique of finding approximate (may not exact) matches to a pattern in a string  Karp and Rabin algorithm Dr. AMIT KUMAR @JUET
  • 7.
    String-Matching Problem  Thetext is in an array T [1..n] of length n  The pattern is in an array P [1..m] of length m  Elements of T and P are characters from a finite alphabet   E.g.,  = {0,1} or  = {a, b, …, z}  Usually T and P are called strings of characters Dr. AMIT KUMAR @JUET
  • 8.
    String-Matching Problem …contd  Wesay that pattern P occurs with shift s in text T if: a) 0 ≤ s ≤ n-m and b) T [(s+1)..(s+m)] = P [1..m]  If P occurs with shift s in T, then s is a valid shift, otherwise s is an invalid shift  String-matching problem: finding all valid shifts for a given T and P Dr. AMIT KUMAR @JUET
  • 9.
    Example 1 a bc a b a a b c a b a c a b a a text T pattern P s = 3 shift s = 3 is a valid shift (n=13, m=4 and 0 ≤ s ≤ n-m holds) 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 Dr. AMIT KUMAR @JUET
  • 10.
    Example 2 a bc a b a a b c a b a a a b a a text T pattern P s = 3 a b a a a b a a s = 9 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 Dr. AMIT KUMAR @JUET
  • 11.
    Terminology  Concatenation of2 strings x and y is xy  E.g., x=“putra”, y=“jaya”  xy = “putrajaya”  A string w is a prefix of a string x, if x=wy for some string y  E.g., “putra” is a prefix of “putrajaya”  A string w is a suffix of a string x, if x=yw for some string y  E.g., “jaya” is a suffix of “putrajaya” Dr. AMIT KUMAR @JUET
  • 12.
    Naïve String-Matching Algorithm Input:Text strings T [1..n] and P[1..m] Result: All valid shifts displayed NAÏVE-STRING-MATCHER (T, P) n ← length[T] m ← length[P] for s ← 0 to n-m if P[1..m] = T [(s+1)..(s+m)] print “pattern occurs with shift” s Dr. AMIT KUMAR @JUET
  • 13.
    WORKING OF NAÏVESTRING MATCHING  The naive string‐matching procedure can be interpreted graphically as sliding a "template“ containing the pattern over the text, noting for which shifts all of the characters on the template equal the corresponding characters in the text. Dr. AMIT KUMAR @JUET
  • 14.
    Contd…  The forloop beginning on line 3 considers each possible shift explicitly.  match successfully or a mismatch is found.  Line 5 prints out each valid shift s  The test on line 4 determines whether the current shift is valid or not; this test involves an implicit loop to check corresponding character positions until all positions Dr. AMIT KUMAR @JUET
  • 15.
    Analysis: Worst-case Example aa a a a a a a a a a a atext T pattern P a a a b a a a b 1 2 3 4 5 6 7 8 9 10 11 12 13 1 2 3 4 a a a bDr. AMIT KUMAR @JUET
  • 16.
    Worst-case Analysis  Thereare m comparisons for each shift in the worst case  There are n-m+1 shifts  So, the worst-case running time is Θ((n-m+1)m) , which is Θ(n2) if m = floor(n/2)  In the example on previous slide, we have (13-4+1)4 comparisons in total  Naïve method is inefficient because information from a shift is not used again Dr. AMIT KUMAR @JUET
  • 17.
    ADVANTAGES:-  No preprocessingphase required because the running time of NAIVE‐STRING‐ MATCHER is equal to its matching time  No extra space are needed.  Also, the comparisons can be done in any order. Dr. AMIT KUMAR @JUET
  • 18.
    Problem with naïvealgorithm  Problem with Naïve algorithm:  Suppose p=ababc, T=cabababcd. T: c a b a b a b c d P: a … P: a b a b c P: a… P: a b a b c  Whenever a character mismatch occurs after matching of several characters, the comparison begins by going back in from the character which follows the last beginning character. Dr. AMIT KUMAR @JUET
  • 19.
    QUESTION??? Consider a situationwhere all characters of pattern are different. Can we modify the original Naive String Matching algorithm so that it works better for these types of patterns. If we can, then what are the changes to original algorithm? Dr. AMIT KUMAR @JUET
  • 20.
    ANSWER:- In the originalNaive String matching algorithm , we always slide the pattern by 1. When all characters of pattern are different, we can slide the pattern by more than 1. When a mismatch occurs after j matches, we know that the first character of pattern will not match the j matched characters because all characters of pattern are different. So we can always slide the pattern by j without missing any valid shifts. Dr. AMIT KUMAR @JUET
  • 21.
    QUESTION?? HOW TO REDUCETHE PROCESSING TIME OF NAÏVE STRING MATCHING ?? Dr. AMIT KUMAR @JUET
  • 22.
    Three exact singlepattern matching algorithms:-  FC-RJ (First Character-Rami and Jehad)  FLC-RJ (First and Last Characters-Rami and Jehad)  FMLC-RJ (First, Middle and Last Characters-Rami and Jehad) . Dr. AMIT KUMAR @JUET
  • 23.
    FC-RJ (First Character-Ramiand Jehad  The algorithm creates a new array called (Occurrence_List) of size (n - m + 1), where n is the size of the text and m is the size of the pattern. The length of the Occurrence_List is (n - m + 1) because it is impossible to the pattern to occur after the position (n - m) in the text Dr. AMIT KUMAR @JUET
  • 24.
     This arraywill hold the indices of the occurrences of the pattern’s first character in the text using an integer variable (i) starting from (0) and incremented by one after each match  The algorithm scans the text in a single pass, using an integer variable (j) and compares its characters with the pattern’s first character. If the current character of the text (jth character) is equal to the pattern's first character, the algorithm saves the index of the current character in the text (the value of j) in the ith index of the Occurrence_List array and increments the value by one. Dr. AMIT KUMAR @JUET
  • 25.
    FLC-RJ algorithm:  Theconcept of FLC-RJ (first and Last Characters-Rami and Jehad) algorithm follows the concept of FC-RJ algorithm.  It seems more efficient to attempt matching the pattern only with the sub- strings of the text that start with the pattern’s first character and also end with the pattern’s last character.  This technique decreases the number of character comparisons in the text. Dr. AMIT KUMAR @JUET
  • 26.
    FMLC-RJ Algorithm:-  FMLC-RJalgorithm adds another restriction to a sub- string of the text to be considered as an expected occurrence of the pattern.  It seems more efficient to attempt matching the pattern only with the sub-strings of the text that start with the pattern’s first character and end with the pattern’s last character and at the same time, they have middle characters equal the pattern’s middle character.  This technique decreases the number of character comparisons in the text during the searching phase. Dr. AMIT KUMAR @JUET
  • 27.
    RESULTS:-  The bestperformance of the naïve string algorithms is when the length of the pattern was relatively short. Since the algorithm compares almost m characters at each index of the text, the execution time increases as m gets larger.  The best performance of the FLC-RJ algorithms is when the length of the pattern was two characters. Since, the algorithm only outputs the content of the Occurrence_List array if the pattern’s length is two characters. Dr. AMIT KUMAR @JUET
  • 28.
    Contd…  The bestperformance of the FMLC-RJ algorithms is when the length of the pattern was three characters. The algorithm searches for the first, middle and last characters of the pattern and then it outputs the content of the Occurrence_List array as a result. Dr. AMIT KUMAR @JUET
  • 29.
  • 30.
    Experimental results ofFC- RJ algorithm Experimental results of FLC-RJ algorithm Dr. AMIT KUMAR @JUET
  • 31.
    Experimental results ofFMLC-RJ algorithm Experimental results of the naïve string algorithm Dr. AMIT KUMAR @JUET
  • 32.
  • 33.
     It isapparent that the FC-RJ, FLC-RJ and FMLC-RJ algorithms outperform the performance of the brute force algorithm.  It is clear that our proposed algorithms enhance the execution time of string matching as compared to the brute force algorithm.  This enhancement is calculated by considering the differences in execution times of the algorithms to search for 14 patterns samples as recorded in Table 1. Dr. AMIT KUMAR @JUET
  • 34.
    SUMMARY  The "naive"approach is easy to understand and implement but it can be too slow in some cases. If the length of the text is n and the length of the pattern m, in the worst case it may take as much as (n * m) iterations to complete the task.  It should be noted though, that for most practical purposes, which deal with texts based on human languages, this approach is much faster since the inner loop usually quickly finds a mismatch and breaks. A problem arises when we are faced with different kinds of "texts," such as the genetic code.Dr. AMIT KUMAR @JUET
  • 35.
    THANK YOU Dr. AMITKUMAR @JUET