STRING MATCHING
Aditya Pratap Singh
215/CO/15
Netaji Subhas Institute Of Technology
CONTENTS
● Introduction
● String Matching
● Basic Classification
● Naive Algorithm
● Rabin-Karp Algorithm
○ String Hashing
○ Hash value for substrings
● Knuth-Morris-Pratt Algorithm
○ Prefix Function
○ KMP Matcher
● Summary
INTRODUCTION
● String matching algorithms are an important class of string
algorithms that try to find one or more indices at which one
or several strings (patterns) occur in a larger string (the
text)
● Why do we need string matching?
String matching is used in applications such as spell
checkers, spam filters, search engines, plagiarism detectors,
bioinformatics and DNA sequencing.
STRING MATCHING
● Goal: find all occurrences of a pattern in a given text
● Formally, given a pattern P[1..m] and a text T[1..n], find all
occurrences of P in T. Both P and T belong to Σ*
● P occurs with shift s (beginning at position s+1) if P[1] = T[s+1], P[2] =
T[s+2], …, P[m] = T[s+m]
● If so, s is called a valid shift; otherwise it is an invalid shift
● Note: one occurrence can start within another one, i.e.
overlapping is allowed. E.g. for P=abab and T=abcabababbc, P occurs
at s=3 and s=5.
*text is the string that we are searching
*pattern is the string that we are searching for
*Shift is an offset into a string
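The definition of a valid shift can be checked on the example P=abab, T=abcabababbc with a short sketch. It relies on the standard `std::string::find` and restarts the search one position after each hit so that overlapping occurrences are kept. The function name `valid_shifts` and the choice to return nothing for an empty pattern are assumptions made for this illustration. The returned values are 0-based offsets, which coincide with the shift s used on this slide.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns every valid shift s (0-based offset into text) at which pattern occurs.
// Overlapping occurrences are reported because the search resumes at s + 1.
std::vector<int> valid_shifts(const std::string& text, const std::string& pattern) {
    std::vector<int> shifts;
    if (pattern.empty()) return shifts;  // assumption: an empty pattern yields no shifts
    std::size_t pos = text.find(pattern);
    while (pos != std::string::npos) {
        shifts.push_back(static_cast<int>(pos));
        pos = text.find(pattern, pos + 1);  // step by one, not by pattern length
    }
    return shifts;
}
```

For the example on this slide, `valid_shifts("abcabababbc", "abab")` yields the shifts 3 and 5.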
BASIC CLASSIFICATION
1. Naive Algorithm - The naive approach performs
a brute-force comparison of each character of
the pattern at each possible placement of the pattern in the
text. It is O(mn) in the worst case
2. Rabin-Karp Algorithm - It compares the strings’ hash values
rather than the strings themselves. It performs well in practice and
generalizes to algorithms for related problems such as
2D string matching
3. Knuth-Morris-Pratt Algorithm - It improves on the brute-force
algorithm and runs in O(m+n) in the worst
case. It improves the running time by taking advantage of
the prefix function
NAIVE ALGORITHM
The most obvious approach to the string matching
problem is to compare the first element of the pattern
‘p’ with the first element of the string ‘s’ in which
‘p’ is to be located.
If the first element of ‘p’ matches the first element of ‘s’,
compare the second elements, and so on. As long as the elements match, proceed
likewise until the entire ‘p’ is found. If a mismatch is found at any
position, shift ‘p’ one position to the right and restart the
comparison from the first element of ‘p’.
This approach is easy to understand and implement, but it can
be too slow in some cases.
In the worst case it may take about m*n comparisons to complete the task.
PSEUDOCODE
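vector<int> naive(string const& text, string const& pattern) {
    int n = text.size(), m = pattern.size();
    vector<int> occurrences;
    for (int i = 0; i + m <= n; i++) {      // every shift where the pattern still fits
        int j = 0;
        while (j < m && text[i + j] == pattern[j]) j++;  // stop at the first mismatch
        if (j == m) occurrences.push_back(i);            // all m characters matched
    }
    return occurrences;
}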
ILLUSTRATION
String S = a b c a b a a b c a b a c
Pattern P = a b a a
Step 1: Compare P[1] with S[1]
a b c a b a a b c a b a c
a b a a
Step 2: Compare P[2] with S[2]
a b c a b a a b c a b a c
a b a a
ILLUSTRATION
Step 3: Compare P[3] with S[3]
a b c a b a a b c a b a c
a b a a
Since a mismatch is detected, shift ‘p’ one position to the right and
perform steps analogous to steps 1 to 3. Whenever a
mismatch is detected, shift ‘p’ one more position to the
right and repeat the matching procedure.
ILLUSTRATION
Finally, a match is found after shifting ‘p’ three times to the right
side.
a b c a b a a b c a b a c
      a b a a
Drawback: if ‘m’ is the length of pattern P and ‘n’ is the length
of text T, the worst-case matching time is O(n*m), which is too
slow for long texts and patterns
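To make the O(n*m) bound concrete, the following sketch counts the character comparisons made by the brute-force matcher. The function name `naive_comparisons` is hypothetical and the counter is added only for illustration; the matching loop itself follows the naive algorithm described above.

```cpp
#include <cassert>
#include <string>

// Counts character comparisons made by the brute-force matcher.
long long naive_comparisons(const std::string& text, const std::string& pattern) {
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pattern.size());
    long long comparisons = 0;
    for (int i = 0; i + m <= n; i++) {          // every shift where the pattern fits
        for (int j = 0; j < m; j++) {
            comparisons++;
            if (text[i + j] != pattern[j]) break;  // mismatch: try the next shift
        }
    }
    return comparisons;
}
```

A text of ten 'a' characters with the pattern "aaab" forces all m = 4 comparisons at each of the n-m+1 = 7 shifts, 28 in total, whereas the illustration's text "abcabaabcabac" with pattern "abaa" needs only 21 comparisons over 10 shifts.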
RABIN-KARP ALGORITHM
Rabin-Karp is essentially the naive approach augmented with a powerful
programming technique: hashing
Algorithm :
1. Calculate the hash of the pattern P
2. Calculate the hash values of all the prefixes of the text T.
3. Using these prefix hashes, compute the hash of any substring of T of
length |P| and compare it with the hash of P in constant time.
This algorithm was published by Michael Rabin and Richard Karp
in 1987.
STRING HASHING
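Problem - Given a string S of length n = |S|, calculate the hash
value of S
Solution -
$$\mathrm{hash}(S) = \left( \sum_{i=0}^{n-1} S[i] \cdot p^{i} \right) \bmod m$$
where p and m are suitably chosen prime numbers and S[i] stands for the
numeric value of the i-th character (a = 1, b = 2, …).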
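A minimal sketch of the polynomial hash, hash(S) = (S[0] + S[1]·p + … + S[n-1]·p^(n-1)) mod m, which is the form used by the Rabin-Karp code later in this deck. The function name `poly_hash` is hypothetical, and the sketch assumes lowercase input with p = 31 and m = 10^9 + 9, mapping 'a'..'z' to 1..26.

```cpp
#include <cassert>
#include <string>

// Polynomial hash: sum of value(s[i]) * p^i, taken modulo m.
// Assumption: s contains only lowercase letters 'a'..'z', mapped to 1..26.
long long poly_hash(const std::string& s) {
    const long long p = 31;
    const long long m = 1000000009LL;
    long long hash = 0;
    long long p_pow = 1;  // p^i mod m for the current index i
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');
        hash = (hash + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash;
}
```

For instance, "abc" hashes to 1 + 2·31 + 3·31² = 2946. Mapping 'a' to 1 rather than 0 matters: with 'a' = 0 the strings "a", "aa" and "aaa" would all hash to 0.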
CHOICE OF PARAMETERS
‘p’ should be a prime roughly equal to the number of characters in
the input alphabet. If the input consists only of lowercase
letters of the English alphabet, p=31 is a good choice. If the
input may contain both uppercase and lowercase letters, then
p=53 is a good choice.
‘m’ should be a large prime. A popular choice is m = 10^9+7
(the code in this deck uses 10^9+9, which is also prime).
This is a large number, but still small enough that the
product of two values below m fits in a 64-bit integer.
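The 64-bit claim can be checked with a line of arithmetic: the largest product that can arise is (m-1)·(m-1), and for m = 10^9 + 7 this stays well below the largest signed 64-bit value.

```latex
(m-1)^2 = (10^{9}+6)^2 = 10^{18} + 12\cdot 10^{9} + 36 \approx 1.0\times 10^{18}
\;<\; 2^{63}-1 \approx 9.22\times 10^{18}
```

The same bound shows why a modulus near 10^18 would not work with plain 64-bit multiplication: the product of two residues could then overflow.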
HASH CALCULATION OF SUBSTRINGS OF GIVEN STRING
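Problem : Given a string S and indices i and j, find the hash value
of S[i..j]
Solution :
By definition we have,
$$\mathrm{hash}(S[i..j]) = \left( \sum_{k=i}^{j} S[k] \cdot p^{k-i} \right) \bmod m$$
Multiplying by p^i gives,
$$\mathrm{hash}(S[i..j]) \cdot p^{i} \equiv \sum_{k=i}^{j} S[k] \cdot p^{k} \equiv \mathrm{hash}(S[0..j]) - \mathrm{hash}(S[0..i-1]) \pmod m$$
So by knowing the hash value of each prefix of string S, we can
compute the hash of any substring (up to the known factor p^i) in constant O(1) time.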
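The prefix-hash idea can be sketched as a small helper that stores the hash of every prefix and compares two substrings in O(1). All names here (`PrefixHash`, `shifted`, `same_hash`) are hypothetical, and the sketch assumes lowercase input with p = 31 and m = 10^9 + 9. Instead of dividing by p^i, both sides are multiplied up to a common power of p, which avoids modular inverses.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Prefix hashes of a lowercase string; all names here are illustrative.
struct PrefixHash {
    static constexpr long long P = 31;
    static constexpr long long M = 1000000009LL;
    std::vector<long long> pow_p;   // pow_p[k] = P^k mod M
    std::vector<long long> prefix;  // prefix[k] = hash of the first k characters

    explicit PrefixHash(const std::string& s)
        : pow_p(s.size() + 1, 1), prefix(s.size() + 1, 0) {
        for (std::size_t k = 0; k < s.size(); k++) {
            pow_p[k + 1] = pow_p[k] * P % M;
            prefix[k + 1] = (prefix[k] + (s[k] - 'a' + 1) * pow_p[k]) % M;
        }
    }

    // hash(S[i..j]) * P^i mod M, for 0 <= i <= j < |S| (inclusive bounds).
    long long shifted(int i, int j) const {
        return (prefix[j + 1] - prefix[i] + M) % M;
    }

    // Compares S[i1..j1] with S[i2..j2] by hash in O(1).
    // Equal hashes only suggest equality; collisions are possible.
    bool same_hash(int i1, int j1, int i2, int j2) const {
        if (j1 - i1 != j2 - i2) return false;
        // Bring both shifted hashes to the common factor P^(i1 + i2).
        return shifted(i1, j1) * pow_p[i2] % M == shifted(i2, j2) * pow_p[i1] % M;
    }
};
```

With `PrefixHash h("abcabc")`, the call `h.same_hash(0, 2, 3, 5)` compares the two copies of "abc" and returns true, while `h.same_hash(0, 2, 1, 3)` compares "abc" with "bca" and returns false.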
PSEUDOCODE
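vector<int> rabin_karp(string const& pat, string const& text) {
    // Assumes lowercase letters 'a'..'z'. Matches are decided by hash only,
    // so a hash collision can (rarely) report a false occurrence.
    const int p = 31, m = 1e9 + 9;
    int S = pat.size(), T = text.size();
    if (S == 0 || S > T) return {};        // nothing to search for, or pattern too long
    vector<long long> p_pow(max(S, T));    // p_pow[i] = p^i mod m
    p_pow[0] = 1;
    for (int i = 1; i < (int)p_pow.size(); i++)
        p_pow[i] = (p_pow[i-1] * p) % m;
    vector<long long> h(T + 1, 0);         // h[i] = hash of the prefix text[0..i-1]
    for (int i = 0; i < T; i++)
        h[i+1] = (h[i] + (text[i] - 'a' + 1) * p_pow[i]) % m;
    long long h_s = 0;                     // hash of the pattern
    for (int i = 0; i < S; i++)
        h_s = (h_s + (pat[i] - 'a' + 1) * p_pow[i]) % m;
    vector<int> occurrences;
    for (int i = 0; i + S - 1 < T; i++) {
        long long cur_h = (h[i+S] + m - h[i]) % m;   // hash(text[i..i+S-1]) * p^i
        if (cur_h == h_s * p_pow[i] % m)             // compare at the same power of p
            occurrences.push_back(i);
    }
    return occurrences;
}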
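The prefix-hash code above reports a match whenever two hashes agree. A common alternative, sketched here under the same assumptions (lowercase input, p = 31, m = 10^9 + 9), keeps a single rolling hash of the current window and confirms every hash hit with a direct comparison. This is a different formulation from the code above, not a restatement of it: the hash is evaluated Horner-style, so the first character of the window carries the highest power of p. The name `rabin_karp_verified` is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Rolling-hash Rabin-Karp with explicit verification of each hash hit.
// Assumption: both strings contain only lowercase letters 'a'..'z'.
std::vector<int> rabin_karp_verified(const std::string& pat, const std::string& text) {
    const long long p = 31, m = 1000000009LL;
    const int S = static_cast<int>(pat.size());
    const int T = static_cast<int>(text.size());
    std::vector<int> occurrences;
    if (S == 0 || S > T) return occurrences;

    long long high = 1;  // p^(S-1) mod m, weight of the window's first character
    for (int i = 1; i < S; i++) high = high * p % m;

    long long h_pat = 0, h_win = 0;
    for (int i = 0; i < S; i++) {  // Horner evaluation of both hashes
        h_pat = (h_pat * p + (pat[i] - 'a' + 1)) % m;
        h_win = (h_win * p + (text[i] - 'a' + 1)) % m;
    }

    for (int i = 0; ; i++) {
        // Hash equality is only a hint; confirm with a direct comparison.
        if (h_win == h_pat && text.compare(i, S, pat) == 0) occurrences.push_back(i);
        if (i + S >= T) break;
        // Slide the window: drop text[i], then append text[i + S].
        h_win = (h_win - (text[i] - 'a' + 1) * high % m + m) % m;
        h_win = (h_win * p + (text[i + S] - 'a' + 1)) % m;
    }
    return occurrences;
}
```

Verification removes false positives, at the price of an O(m*n) worst case when nearly every window matches, for example a pattern of all 'a' in a text of all 'a'.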
KNUTH-MORRIS-PRATT ALGORITHM
Knuth, Morris and Pratt proposed a linear-time algorithm for the
string matching problem.
A matching time of O(n) is achieved by never moving backwards in the
text ‘S’: characters of ‘S’ that have already been matched against
the pattern ‘p’ are not re-examined, i.e.
backtracking on the string ‘S’ never occurs.
KMP makes use of the ‘prefix function’
PREFIX FUNCTION
The prefix function of a string s of length n is defined as an array Ⲡ of length n,
where Ⲡ[i] is the length of the longest proper prefix of the
substring s[0..i] which is also a suffix of this substring.
A proper prefix of a string is a prefix that is not equal to the string
itself. So by definition Ⲡ[0] = 0
Mathematically,
$$\Pi[i] = \max_{k = 0, \dots, i} \{\, k : s[0 \dots k-1] = s[i-(k-1) \dots i] \,\}$$
EXAMPLE
S = “aabaaab”
i    PREFIX s[0..i]    Ⲡ[i]
0    a                 0
1    aa                1
2    aab               0
3    aaba              1
4    aabaa             2
5    aabaaa            2
6    aabaaab           3
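The table can be checked by evaluating the definition directly. The sketch below tries every proper prefix length, longest first, and compares prefix and suffix with `std::string::compare`. It runs in O(n^3) and is meant only for verifying small examples; the name `prefix_function_by_definition` is hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Prefix function straight from the definition, O(n^3); for checking small examples only.
std::vector<int> prefix_function_by_definition(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> pi(n, 0);
    for (int i = 0; i < n; i++) {
        // Try every proper prefix length k of s[0..i], longest first.
        for (int k = i; k >= 1; k--) {
            // Prefix s[0..k-1] versus suffix s[i-k+1..i].
            if (s.compare(0, k, s, i - k + 1, k) == 0) {
                pi[i] = k;
                break;
            }
        }
    }
    return pi;
}
```

For "aabaaab" this yields 0, 1, 0, 1, 2, 2, 3, matching the table.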
ALGORITHM TO COMPUTE PREFIX FUNCTION
● We compute the prefix values Ⲡ[i] in a loop iterating from i=1 to
i=n-1 (Ⲡ[0] is simply set to 0)
● To calculate the current value Ⲡ[i], we keep a variable j denoting
the length of the best suffix found for ‘i-1’. Initially j = Ⲡ[i-1]
● Test whether the suffix of length ‘j+1’ is also a prefix by comparing
s[j] and s[i]. If they are equal, we assign Ⲡ[i] = j+1.
Otherwise, we reduce j to Ⲡ[j-1] and repeat this step.
● If we have reached the length j=0 and still have no
match, we assign Ⲡ[i] = 0 and go on to the next index ‘i+1’
PSEUDOCODE
vector<int> prefix_function(string s){
int n = (int)s.length();
vector<int> pi(n);
for(int i=1;i<n;i++){
int j = pi[i-1];
while(j>0 and s[i]!=s[j]) j = pi[j-1];
if(s[i] == s[j]) ++j;
pi[i] = j;
}
return pi;
}
Runtime - O(n)
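The O(n) bound is not obvious from the nested loops. One way to see it, sketched below, is to count how often the inner step j = pi[j-1] runs. Since j grows by at most 1 per index i and every fallback strictly decreases it, the total number of fallbacks cannot exceed n-1. The function name `prefix_function_counted` and the out-parameter `fallbacks` are hypothetical additions for this illustration; the loop is otherwise the same as in the pseudocode.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Same loop as prefix_function, with a counter for executions of j = pi[j-1].
// `fallbacks` is a hypothetical out-parameter added for this illustration.
std::vector<int> prefix_function_counted(const std::string& s, long long& fallbacks) {
    const int n = static_cast<int>(s.size());
    std::vector<int> pi(n, 0);
    fallbacks = 0;
    for (int i = 1; i < n; i++) {
        int j = pi[i - 1];
        while (j > 0 && s[i] != s[j]) {
            j = pi[j - 1];  // each fallback strictly decreases j
            fallbacks++;
        }
        if (s[i] == s[j]) ++j;  // j grows by at most 1 per index i
        pi[i] = j;
    }
    return pi;
}
```

For "aabaaab" the counter ends at 2, well under the bound n-1 = 6.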
KMP MATCHER
● This is a classical application of the prefix function, which we just
learned
● Given a text T and a string S, we need to find all occurrences of S
in T
● Denote by n the length of the string S and by m the length
of the string T, i.e. n = |S| and m = |T|
● Generate the string S + # + T, where # is a separator that appears
neither in S nor in T. Now calculate the prefix function of this
string
● By definition, for positions i in the T part, Ⲡ[i] is the length of
the longest prefix of S that ends at position ‘i’.
● Note: Ⲡ[i] cannot be larger than ‘n’ because of the separator #
that we used
● If Ⲡ[i] == n, then a complete occurrence of S ends at
this position.
EXAMPLE
S = “aba”
T = “aababac”
Generated string(G) = “aba#aababac”
Ⲡ[i] = n (= 3) at positions i = 7 and 9 of G, which means that the
pattern S occurs in the text at indices i - 2n = 1 and 3
Index (i) PREFIX Ⲡ[i]
4 a 1
5 aa 1
6 aab 2
7 aaba 3
8 aabab 2
9 aababa 3
10 aababac 0
PSEUDOCODE
vector<int> kmp(string pattern,string text){
string str = pattern + "#" + text;
int n = pattern.length(), m = str.length();
vector<int> pi = prefix_function(str);
vector<int> ret;
for(int i=n+1;i<m;i++) {
if(pi[i] == n) ret.push_back(i-2*n); // occurrence ends at i in str, so it starts at i-2n in text
}
return ret;
}
Runtime: O(n+m)
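The concatenation S + # + T needs O(n+m) extra memory and a separator that occurs in neither string. A common variant, sketched here, computes the prefix function of the pattern only and then scans the text while carrying the current match length j along. The name `kmp_stream` is hypothetical, and returning no occurrences for an empty pattern is an assumption made for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// KMP without building pattern + "#" + text: only the pattern's prefix function is stored.
std::vector<int> kmp_stream(const std::string& pattern, const std::string& text) {
    const int n = static_cast<int>(pattern.size());
    std::vector<int> occurrences;
    if (n == 0) return occurrences;  // assumption: an empty pattern reports nothing

    // Prefix function of the pattern only.
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        int j = pi[i - 1];
        while (j > 0 && pattern[i] != pattern[j]) j = pi[j - 1];
        if (pattern[i] == pattern[j]) ++j;
        pi[i] = j;
    }

    int j = 0;  // length of the longest pattern prefix that is a suffix of text[0..i]
    for (int i = 0; i < static_cast<int>(text.size()); i++) {
        while (j > 0 && text[i] != pattern[j]) j = pi[j - 1];
        if (text[i] == pattern[j]) ++j;
        if (j == n) {
            occurrences.push_back(i - n + 1);  // start index of this occurrence
            j = pi[n - 1];                     // fall back so overlaps are still found
        }
    }
    return occurrences;
}
```

On the example from the previous slide, `kmp_stream("aba", "aababac")` returns the indices 1 and 3. The text index i never decreases, which is the "no backtracking" property stated for KMP.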
SUMMARY
Algorithm             Time Complexity   Key Idea                                       Approach
Brute Force (Naive)   O(m*n)            Compare the pattern at every possible shift    Linear searching
Rabin-Karp            Θ(m+n)            Compare text and pattern using hash values     Hashing based
Knuth-Morris-Pratt    O(m+n)            Construct an automaton from the pattern        Prefix function based
n = |pattern| , length of pattern
m = |text| , length of text
Rabin-Karp: Θ(m+n) holds when matches are decided by hash comparison alone;
verifying each hash match character by character gives O(m*n) in the worst case.
THANK YOU
