2. String-Searching Algorithms
• The goal of any string-searching algorithm is
to determine whether or not a match of a
particular string exists within another
(typically much longer) string.
• Many such algorithms exist, with varying
efficiencies.
• String-searching algorithms are important to a
number of fields, including computational
biology, computer science, and mathematics.
3. The Boyer-Moore String Search Algorithm
• Developed in 1977, the B-M string search
algorithm is particularly efficient, and has
served as a standard benchmark for string
search algorithms ever since.
• This algorithm’s execution time can be
sublinear, as not every character of the string
to be searched needs to be checked.
• Generally speaking, the algorithm gets faster
as the target string becomes longer, because
a longer target allows longer skips.
4. How does it work?
• The B-M algorithm takes a ‘backward’ approach: the
target string is aligned with the start of the check
string, and the last character of the target string is
checked against the corresponding character in the
check string.
• In the case of a match, the second-to-last
character of the target string is compared to the
corresponding check string character, and so on
toward the left. (A match gives no gain in
efficiency over the brute-force method.)
• In the case of a mismatch, the algorithm computes a
new alignment for the target string based on the
mismatch. This is where the algorithm gains
considerable efficiency.
5. An example
• Target string: rockstar
Check string: -------x-----
• Aligning the start of each string pairs the final ‘r’
of the target string with ‘x’.
• Since ‘x’ is not a character in ‘rockstar’, no
alignment that overlaps ‘x’ can produce a match,
so the B-M algorithm skips all such alignments and
resumes just past ‘x’.
• This eliminates 7 alignments that a brute-force
search would have checked, at the cost of a single
comparison between two characters (‘r’ and ‘x’).
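The example above can be traced with a minimal sketch. The helper name first_mismatch is hypothetical (it does not come from the slides), and the sketch covers only the first alignment and the simple case where the mismatched character is absent from the target string.

```python
def first_mismatch(target, check, start):
    """Compare target with check[start:start+len(target)] from right to left.

    Returns the index within the target of the first mismatch found,
    or -1 if every character matched.
    """
    for j in range(len(target) - 1, -1, -1):
        if target[j] != check[start + j]:
            return j
    return -1


target = "rockstar"
check = "-------x-----"

# First alignment: the target starts at position 0 of the check string.
j = first_mismatch(target, check, 0)
bad_char = check[j]  # the alignment starts at 0, so the check index equals j

if bad_char not in target:
    # No alignment that still covers this character can match,
    # so the next alignment begins just past it.
    next_start = j + 1
else:
    # Assumption for this sketch only: fall back to a one-step shift.
    next_start = 1

skipped = next_start - 1  # alignments 1 .. next_start - 1 are never tried

print(j, bad_char, next_start, skipped)  # prints: 7 x 8 7
```

Only one comparison was made before the jump. Alignment 8 would run past the end of this short check string, so a full search would stop here.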
6. Efficiency of the B-M Algorithm
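• For a target string of length M and a check
string of length N, the B-M algorithm typically
needs far fewer than N character comparisons; on
favorable inputs the count approaches N/M.
• In the best case, only one in M characters
needs to be checked.
• In the worst case, about 3N comparisons are
needed when the target does not occur in the
check string, giving a complexity of O(N).
Keeping the search linear when matches do occur
requires an additional refinement (the Galil rule).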
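To make the N/M best case concrete, the sketch below counts character comparisons. It is not the full B-M algorithm: it is a simplified Horspool-style variant that shifts using only a distance-from-the-right table, and the function name count_comparisons is an assumption made for illustration.

```python
def count_comparisons(target, check):
    """Search check for target; return (match positions, comparisons made).

    Simplified Horspool-style variant, not the complete B-M algorithm:
    the shift depends only on the check-string character aligned with
    the last character of the target.
    """
    m, n = len(target), len(check)
    # Distance from the right-most character, for every character except
    # the last one; later (right-most) occurrences overwrite earlier ones.
    shift = {c: m - 1 - i for i, c in enumerate(target[:-1])}

    comparisons = 0
    matches = []
    start = 0
    while start + m <= n:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if target[j] != check[start + j]:
                break
            j -= 1
        if j < 0:
            matches.append(start)
        # Characters absent from the table shift by the full target length.
        start += shift.get(check[start + m - 1], m)
    return matches, comparisons


# Best case: no character of the check string occurs in the target,
# so each alignment costs one comparison and jumps M = 8 positions.
print(count_comparisons("rockstar", "-" * 64))      # ([], 8), i.e. N/M = 64/8
print(count_comparisons("rockstar", "a rockstar"))  # ([2], 9)
```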
7. Pre-processing Tables
• The B-M algorithm computes two preprocessing tables to
determine the next suitable alignment after each failed
verification.
• The first table gives how many positions ahead of the
current position to start the next search (based on the
character that caused the failed verification).
• The second table makes a similar calculation based on how
many characters were matched successfully before the failed
verification.
• These tables are often referred to as ‘jump tables’, though
this leads to some ambiguity with the more common
meaning of the term in computer science, which refers to
an efficient way of transferring control from one part of a
program to another.
8. Calculation of Preprocessing Tables
• Table 1
– Starting at the second-to-last character of the target
string, move left toward the first character. At each
character, if the character is not already in the table,
add it to the table.
– This character’s shift value is equal to its distance
from the right-most character in the string.
– All other characters receive a shift value equal to the
total length of the string.
– Example: ‘PETERPAN’ would produce the following
table: (character, shift) = (A, 1), (P, 2), (R, 3), (E, 4),
(T, 5), (all other characters, 8). The final ‘N’ occurs
nowhere else in the string, so it falls under ‘all other
characters’.
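The Table 1 procedure can be sketched as follows. The function names are hypothetical, and the sketch assumes the walk begins at the second-to-last character, so the final character is only entered if it also occurs earlier in the string.

```python
def build_table_1(target):
    """Map each character to its distance from the right-most character.

    The walk starts at the second-to-last character and moves left;
    a character is recorded only the first time it is seen, which keeps
    the smallest (right-most) distance.
    """
    m = len(target)
    table = {}
    for i in range(m - 2, -1, -1):
        c = target[i]
        if c not in table:
            table[c] = m - 1 - i
    return table


def table_1_shift(table, target, c):
    """Look up a shift; characters not in the table get the full length."""
    return table.get(c, len(target))


table = build_table_1("PETERPAN")
print(table)                                  # {'A': 1, 'P': 2, 'R': 3, 'E': 4, 'T': 5}
print(table_1_shift(table, "PETERPAN", "X"))  # 8
```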
9. Calculation of Preprocessing Tables
• Table 2
– First, for each value of i less than the length of the
target string, form the pattern consisting of the last i
characters of the target string preceded by a mismatch
for the character before them.
– Then, determine the least number of positions the
partial pattern must be shifted left before it agrees
with the target string (positions shifted past the start
of the target string always agree).
– Example: for ‘ANPANMAN’, the table would be (i,
pattern, shift) = (0, (-N), 1), (1, (-A)N, 8), (2, (-M)AN, 3),
(3, (-N)MAN, 6), (4, (-A)NMAN, 6), (5, (-P)ANMAN, 6),
(6, (-N)PANMAN, 6), (7, (-A)NPANMAN, 6). (Here, -X
means ‘not X’.)
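The Table 2 calculation can be checked with a direct, brute-force sketch. It follows the description literally rather than using the faster preprocessing found in production implementations, and the function name build_table_2 is an assumption made for illustration.

```python
def build_table_2(target):
    """Return a list where entry i is the shift after i matched characters.

    Brute-force sketch: for each i, try shifts 1, 2, ... and keep the first
    one where the matched suffix lines up with the target again and the
    character before it differs from the one that just mismatched.
    Positions shifted past the start of the target always agree.
    """
    m = len(target)
    table = []
    for i in range(m):
        p = m - i - 1  # position of the character that mismatched
        for s in range(1, m + 1):
            suffix_ok = all(
                k - s < 0 or target[k - s] == target[k]
                for k in range(p + 1, m)
            )
            mismatch_ok = p - s < 0 or target[p - s] != target[p]
            if suffix_ok and mismatch_ok:
                table.append(s)
                break  # a shift of m always qualifies, so this is reached
    return table


print(build_table_2("ANPANMAN"))  # [1, 8, 3, 6, 6, 6, 6, 6]
```

The triple nesting makes this cubic in the target length, which is acceptable for checking small examples by hand but not for long target strings.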
10. Comparison of String Searching Algorithm Complexities
• Boyer-Moore: O(n)
• Naïve string search algorithm: O((n-m+1)m)
• Bitap Algorithm: O(mn)
• Rabin-Karp string search algorithm: O(n+m) on
average
(n = length of check string, m = length of target
string)
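For reference, the naïve bound can be observed directly. The sketch below is a plain brute-force search instrumented with a comparison counter; the function name naive_search is an assumption, not something taken from the slides.

```python
def naive_search(target, check):
    """Brute-force search; return (match positions, comparisons made).

    Every one of the n - m + 1 alignments is tried, comparing left to
    right and stopping at the first mismatch.
    """
    n, m = len(check), len(target)
    comparisons = 0
    matches = []
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if check[start + j] != target[j]:
                break
            j += 1
        if j == m:
            matches.append(start)
    return matches, comparisons


# Worst case: every alignment matches in full, so the count is
# (n - m + 1) * m = (8 - 3 + 1) * 3 = 18.
print(naive_search("aaa", "aaaaaaaa"))  # ([0, 1, 2, 3, 4, 5], 18)
```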
11. About the Creators
• Robert Boyer is a Professor Emeritus of the University of
Texas at Austin Computer Science Department.
He received his BA and PhD in mathematics at UT Austin,
and has authored and co-authored several books
concerning automatic theorem-proving.
• J. Strother Moore holds the Admiral B.R. Inman Centennial Chair
in Computer Theory in the Department of Computer Sciences at UT
Austin. He received his BS in mathematics from MIT in 1970, and
his PhD in computational logic from the University of Edinburgh
in 1973. He has authored and co-authored several books
concerning automatic theorem-proving, some of them in
cooperation with Robert Boyer.