The Rabin-Karp Algorithm
String Matching
Jonathan M. Elchison
19 November 2004
CS-3410 Algorithms
Dr. Shomper
Background
• String matching
• Naïve method
• n ≡ size of input string
• m ≡ size of pattern to be matched
• O( (n-m+1)m )
• Θ( n² ) if m = floor( n/2 )
• We can do better
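As a baseline for comparison, the naïve method can be sketched in Python as follows. The function name `naive_match` and the 0-based shifts are choices made for this illustration, not part of the original slides.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):           # n - m + 1 candidate shifts
        if text[s:s + m] == pattern:     # up to m character comparisons each
            shifts.append(s)
    return shifts


print(naive_match("BANANA", "ANA"))      # [1, 3]
```

Each of the n - m + 1 shifts may cost up to m comparisons, which is where the O( (n-m+1)m ) bound comes from.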
How it works
• Consider a hashing scheme
• Each symbol in alphabet Σ can be represented by
an ordinal value { 0, 1, 2, ..., d − 1 }
• |Σ| = d
• “Radix-d digits”
How it works
• Hash pattern P into a numeric value
• Let a string be represented by the sum of these
digits
• Horner’s rule (§ 30.1)
• Example
• { A, B, C, ..., Z } → { 0, 1, 2, ..., 25 }
• BAN → 1 + 0 + 13 = 14
• CARD → 2 + 0 + 17 + 3 = 22
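The example above can be checked with a short Python sketch. The names `ordinal`, `additive_hash`, and `horner_value` are illustrative and do not appear in the slides, and the mapping A → 0, ..., Z → 25 is assumed. The sum is the simplified scheme used in this example, while Horner's rule evaluates the string as a positional radix-d number.

```python
def ordinal(ch: str) -> int:
    """Map 'A'..'Z' to 0..25 (assumed alphabet for this sketch)."""
    return ord(ch) - ord("A")


def additive_hash(s: str) -> int:
    """Simplified hash from the example: the sum of the ordinal values."""
    return sum(ordinal(ch) for ch in s)


def horner_value(s: str, d: int = 26) -> int:
    """Positional radix-d value of s, evaluated with Horner's rule."""
    value = 0
    for ch in s:
        value = d * value + ordinal(ch)
    return value


print(additive_hash("BAN"), additive_hash("CARD"))   # 14 22
print(horner_value("BAN"), horner_value("CARD"))     # 689 35597
```

The positional values grow quickly with the pattern length, which motivates the bound discussed next.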
Upper limits
• Problem
• For long patterns, or for large alphabets, the number
representing a given string may be too large to be practical
• Solution
• Use MOD operation
• When MOD q, values will be < q
• Example
• BAN = 1 + 0 + 13 = 14
• 14 mod 13 = 1
• BAN → 1
• CARD = 2 + 0 + 17 + 3 = 22
• 22 mod 13 = 9
• CARD → 9
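A minimal Python sketch of the MOD step, assuming q = 13 and the A → 0, ..., Z → 25 mapping from the earlier example. The function names are hypothetical. The second function reduces after every Horner step, so no intermediate value ever reaches q·d.

```python
def ordinal(ch: str) -> int:
    """Map 'A'..'Z' to 0..25 (assumed alphabet for this sketch)."""
    return ord(ch) - ord("A")


def additive_hash_mod(s: str, q: int = 13) -> int:
    """Sum of ordinal values, reduced mod q (the example's scheme)."""
    return sum(ordinal(ch) for ch in s) % q


def horner_mod(s: str, d: int = 26, q: int = 13) -> int:
    """Radix-d value of s mod q, reducing at every step of Horner's rule."""
    value = 0
    for ch in s:
        value = (d * value + ordinal(ch)) % q   # always in 0..q-1
    return value


print(additive_hash_mod("BAN"), additive_hash_mod("CARD"))   # 1 9
print(horner_mod("BAN"), horner_mod("CARD"))                 # 0 3
```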
Searching
Spurious Hits
• Question
• Does a hash value match mean that the patterns match?
• Answer
• No – these are called “spurious hits”
• Possible cases
• MOD operation interfered with uniqueness of hash values
• 14 mod 13 = 1
• 27 mod 13 = 1
• MOD value q is usually chosen as a prime such that d·q just fits
within 1 computer word (10q for decimal digits)
• Information is lost in generalization (addition)
• BAN → 1 + 0 + 13 = 14
• CAM → 2 + 0 + 12 = 14
Code
RABIN-KARP-MATCHER( T, P, d, q )
    n ← length[ T ]
    m ← length[ P ]
    h ← d^(m-1) mod q
    p ← 0
    t_0 ← 0
    for i ← 1 to m                          ► Preprocessing
        do p ← ( d*p + P[ i ] ) mod q
           t_0 ← ( d*t_0 + T[ i ] ) mod q
    for s ← 0 to n – m                      ► Matching
        do if p = t_s
              then if P[ 1..m ] = T[ s+1 .. s+m ]
                      then print “Pattern occurs with shift” s
           if s < n – m
              then t_(s+1) ← ( d * ( t_s – T[ s + 1 ] * h ) + T[ s + m + 1 ] ) mod q
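One possible Python rendering of the pseudocode, offered as a sketch rather than a reference implementation. Assumptions not in the slides: indices are 0-based, characters are mapped to digits with `ord`, the defaults d = 256 and q = 101 are arbitrary illustrative values, matches are returned as a list instead of printed, and an empty pattern yields no matches.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:                  # assumption: nothing to report
        return []
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    p = 0                                # hash of the pattern
    t = 0                                # hash of the current text window
    for i in range(m):                   # preprocessing
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):           # matching
        # Confirm every hash hit character by character to reject spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Roll the window: drop text[s], append text[s + m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("BANANA", "ANA"))                               # [1, 3]
print(rabin_karp_matcher("2359023141526739921", "31415", d=10, q=13))    # [6]
```

Python's `%` returns a non-negative result for a positive modulus, so the subtraction in the rolling update needs no extra correction here; in languages where the remainder can be negative, adding q before reducing is a common fix.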
Performance
• Preprocessing (determining each pattern hash)
• Θ( m )
• Worst case running time
• Θ( (n-m+1)m )
• No better than naïve method
• Expected case
• If we assume the number of hits is constant
compared to n, we expect O( n )
• Character-by-character comparison runs only on hash “hits” – not on all shifts
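To make the worst and expected cases concrete, the sketch below counts how many shifts trigger a character-by-character check. The function `count_hits` and its defaults (d = 256, q = 101) are assumptions for illustration. Setting q = 1 forces every shift to collide, which mimics the worst case.

```python
def count_hits(text: str, pattern: str, d: int = 256, q: int = 101) -> tuple[int, int]:
    """Return (hash_hits, true_matches) over all shifts."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return (0, 0)
    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    hits = matches = 0
    for s in range(n - m + 1):
        if p == t:                        # each hit costs up to m comparisons
            hits += 1
            if text[s:s + m] == pattern:
                matches += 1
        if s < n - m:
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return hits, matches


print(count_hits("A" * 20, "A" * 5))           # (16, 16): every shift is a true match
print(count_hits("ABCDEFGHIJ", "XYZ", q=1))    # (8, 0): every shift is a spurious hit
```

In both runs all n - m + 1 shifts are verified, giving the Θ( (n-m+1)m ) behaviour; with few hits the verification cost is small next to the linear scan.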
Demonstration
• http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
Sources:
• Cormen, Thomas H., et al. Introduction to Algorithms. 2nd ed. Cambridge: MIT Press, 2001.
• Karp-Rabin algorithm. 15 Jan 1997. <http://www-igm.univ-mlv.fr/~lecroq/string/node5.html>.
• Shomper, Keith. “Rabin-Karp Animation.” E-mail to Jonathan Elchison. 12 Nov 2004.
