Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Liwei Ren
Data Center Research
Trend Micro
Cupertino, USA
e-mail: liwei_ren@trendmicro.com
Abstract: The bad character shift rule of the Boyer-Moore string
search algorithm is studied in this paper for the purpose of
extending it to more general string match problems. An abstract
string match problem is defined in general terms. An optimized string
match algorithm based on the bad character heuristic is
proposed to solve the abstract match problem efficiently.
Keywords: pattern; string; sequence; search; match; bad
character; Boyer-Moore
I. INTRODUCTION
String searching is a classic problem in many text
processing applications. Among the many string searching
algorithms, the Boyer-Moore algorithm [1] is a particularly
efficient one for single pattern string match. It uses both
the good suffix shift and the bad character heuristic
to accelerate the match. Two shift tables are
built to determine how far to shift the pattern after
a match fails. The algorithm shifts the pattern by the
larger of the shifts given by the two tables.
The Horspool algorithm [2] is the best known variant of
the Boyer-Moore algorithm. It uses only the bad character
heuristic to build its shift table. There are other variants as
well, such as the algorithms given by Raita [3] and Sunday
[4].
In summary, the essence of all Boyer-Moore style
algorithms is to skip as many unnecessary character
comparisons as possible.
If we introduce the concept of a match window as a
substring of the reference string, the naïve string searching
algorithm is basically a sliding window match algorithm
with N-M+1 match windows, where N and M are the sizes
of the reference string and the pattern respectively. Hence,
in practice, the Boyer-Moore algorithm examines only a few
candidate match windows that possibly contain the target
string. This is done by ruling out many windows that
definitely contain no target substring.
The bad character shift of the Boyer-Moore algorithm can
take a weaker form, character identity verification, which
checks whether a given character of the reference string
belongs to the alphabet of the search pattern or not.
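The relation between the two forms can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, and the shift table follows the Horspool convention rather than the full Boyer-Moore tables. The weaker form keeps only the set of characters and discards the shift distances.

```python
def horspool_shift_table(pattern: str) -> dict:
    """Bad character shifts in the Horspool style (hypothetical helper).

    For each character of the pattern except the last position, store the
    distance from its last occurrence to the end of the pattern. Characters
    absent from the table would shift by len(pattern).
    """
    last = len(pattern) - 1
    return {ch: last - i for i, ch in enumerate(pattern[:-1])}


def pattern_alphabet(pattern: str) -> frozenset:
    """The alphabet of the pattern: its set of unique characters."""
    return frozenset(pattern)


def passes_identity_check(ch: str, alphabet: frozenset) -> bool:
    """Weaker form: only ask whether ch belongs to the alphabet at all."""
    return ch in alphabet


table = horspool_shift_table("abcab")
alphabet = pattern_alphabet("abcab")
```

The identity check needs no per-pattern distances, which is why it can still be used when the target is described by a rule instead of a fixed pattern.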
We can extend the concepts of both match window and
character identity verification to other string match
problems, for instance, the regular expression based pattern
match problem, which has many applications in practice.
This paper proposes an abstract problem of string match
which includes the two classic string matching problems, i.e.,
single pattern string search and regular expression pattern
match, as special cases.
An efficient algorithm is constructed to solve the abstract
problem based on the concepts of match window and
character identity verification.
II. A GENERAL PROBLEM OF STRING MATCH
In this section, we use an abstract model to present
string match problems in more general terms. With this
model, many practical problems can be covered beyond the
scope of both single pattern string searching and regular
expression based string matching.
Before we define the problem, let us observe the following
about classic string match problems:
1. The target string has a small alphabet S
compared to the whole character space. In the case
of the single pattern string search problem, S consists
of all unique characters of the pattern string. In the
case of regular expression match, it is typical that
most entities defined by regular expression patterns
in practical applications have small alphabets as
well. Examples of these entities include IP
addresses, dates, credit card numbers, bank account
numbers, ID numbers, etc.
2. The target strings have well-defined minimum and
maximum lengths. This is obvious for the single
pattern search problem. As for regular
expression match, it is not uncommon that these
two numbers can be pre-defined. For example, to
match MasterCard credit card numbers in a text, the
minimum length is 16 while the maximum length
can be defined as 19 if one also includes the format
dddd-dddd-dddd-dddd.
Pattern Match Function: For any given reference
string R and match window R[s,e], a pattern match
function F extracts a target string from the window R[s,e],
based on well-defined matching rules, if one exists;
otherwise it returns NIL. The function is denoted as
F(R,s,e). The match mechanism is defined inside F itself.
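As a minimal sketch, a pattern match function can be written in Python as a closure over a compiled regular expression. Several choices here are assumptions made for illustration and are not part of the definition: the helper name `make_regex_matcher`, the use of 0-based inclusive indices instead of the 1-based indices of the text, returning the position pair of the target rather than the string itself, and representing NIL by `None`.

```python
import re
from typing import Callable, Optional, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive, 0-based


def make_regex_matcher(regex: str) -> Callable[[str, int, int], Optional[Span]]:
    """Build a pattern match function F(R, s, e) from a regular expression."""
    compiled = re.compile(regex)

    def F(R: str, s: int, e: int) -> Optional[Span]:
        # Search only inside the window R[s..e]; endpos is exclusive.
        found = compiled.search(R, s, e + 1)
        if found is None:
            return None  # plays the role of NIL
        return found.start(), found.end() - 1

    return F


three_digits = make_regex_matcher(r"\d{3}")
```

Here `three_digits("ab123cd", 0, 6)` locates the target inside the window, while `three_digits("ab123cd", 0, 3)` returns `None` because the window `ab12` is too short to contain three digits.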
Abstract Problem of String Match: The string match
problem is to retrieve all target substrings from a given
reference string R[1,…,N] with a pattern match function
F(R,s,e), where F defines what the
target substrings are, subject to the following conditions:
a. All target substrings consist of characters from a
small alphabet S.
b. The length of each target substring falls in the
interval [m,M], where m is the minimum length and
M the maximum.
Both single pattern string search and regular expression
pattern search are special cases of this abstract match
problem.
Yet another example is the problem of regular
expression pattern match with checksum validation, which
requires every target substring to be validated by a
checksum. This example is useful in data discovery
systems for minimizing false positives.
III. OPTIMIZED STRING MATCH ALGORITHM
A naïve algorithm to solve the abstract problem of string
match can be easily constructed. It is based on the
mechanism of sliding match windows.
Naïve String Match Algorithm: One starts from the 1st
match window R[1,M] and calls the match function F. If a match
exists, obtain the target substring and move to the
match window that starts immediately after the target substring;
otherwise, slide the match window one step further. Repeat
this until the reference string R is exhausted.
With the naïve string match, one will go through N-M+1
match windows if there is no target string at all. That is
not efficient.
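The naïve procedure can be sketched in Python as follows. This is an illustration under assumptions, not a reference implementation: indices are 0-based and inclusive, F returns a (start, end) pair or `None` for NIL, the last windows are clipped at the end of R and the loop stops once fewer than m characters remain, and `three_digits` is a hypothetical match function used only for the demonstration.

```python
import re
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive, 0-based
MatchFn = Callable[[str, int, int], Optional[Span]]


def naive_match(R: str, m: int, M: int, F: MatchFn) -> List[Span]:
    """Slide a window of at most M characters over R, one step at a time."""
    matches: List[Span] = []
    s, n = 0, len(R)
    while n - s >= m:                 # a target of length m could still fit
        e = min(s + M - 1, n - 1)     # clip the window at the end of R
        hit = F(R, s, e)
        if hit is None:
            s += 1                    # no target: slide one step further
        else:
            matches.append(hit)
            s = hit[1] + 1            # continue right after the target
    return matches


_THREE_DIGITS = re.compile(r"\d{3}")


def three_digits(R: str, s: int, e: int) -> Optional[Span]:
    """Hypothetical match function: the first run of three digits in R[s..e]."""
    found = _THREE_DIGITS.search(R, s, e + 1)
    return None if found is None else (found.start(), found.end() - 1)
```

Every window that does not start a target costs one call to F, which is the overhead the optimized algorithm tries to avoid.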
We can reduce the number of match windows if we
are able to determine quickly that a match window does not
contain a target string at all. That can be done with
character identity verification. Let us construct the optimized
algorithm as follows.
Optimized String Match Algorithm:
Input: Minimum length m, maximum length M, target
string alphabet S, pattern match function F, reference
string R[1,…,N]
Matching Procedure:
Step 1: Set s=1
Step 2: Let r=MIN(s+M-1, N)
Step 3: If r-s<m-1, RETURN
Step 4: Set match window as W=R[s,…,r]
Step 5: Set sub-window w=R[s,…,s+m-1]. Scanning w from
right to left, find the rightmost character R[s+p] that
does not belong to S. If such a character exists, set
s=s+p+1 and go to Step 2
Step 6: Otherwise, all characters of sub-window w pass
identity verification. Match with the function
F(R,s,r):
a. If the result is NIL, let s=s+1
b. If a target substring R[t,…,e] is matched, save
it and let s=e+1
Step 7: Go to Step 2
Output: Matches
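The matching procedure can be sketched in Python as follows. This is a sketch under stated assumptions rather than a definitive implementation: indices are 0-based and inclusive, F returns a (start, end) pair or `None` for NIL, and after a failed identity check the window is moved to the position just after the rejected character so that the scan always makes progress. The names `optimized_match`, `DIGITS` and `four_digits` are hypothetical.

```python
import re
from typing import Callable, FrozenSet, List, Optional, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive, 0-based
MatchFn = Callable[[str, int, int], Optional[Span]]


def optimized_match(R: str, m: int, M: int, S: FrozenSet[str], F: MatchFn) -> List[Span]:
    matches: List[Span] = []
    s, n = 0, len(R)                      # Step 1
    while True:
        r = min(s + M - 1, n - 1)         # Step 2
        if r - s < m - 1:                 # Step 3: fewer than m characters left
            return matches
        # Step 5: scan the sub-window R[s..s+m-1] from right to left and
        # stop at the first (rightmost) character that is not in S.
        bad = next((i for i in range(s + m - 1, s - 1, -1) if R[i] not in S), None)
        if bad is not None:
            s = bad + 1                   # no target can contain R[bad]
            continue
        hit = F(R, s, r)                  # Step 6: full pattern match
        if hit is None:
            s += 1
        else:
            matches.append(hit)
            s = hit[1] + 1


DIGITS = frozenset("0123456789")
_FOUR_DIGITS = re.compile(r"\d{4}")


def four_digits(R: str, s: int, e: int) -> Optional[Span]:
    """Hypothetical match function: the first run of four digits in R[s..e]."""
    found = _FOUR_DIGITS.search(R, s, e + 1)
    return None if found is None else (found.start(), found.end() - 1)


result = optimized_match("in 1977 and 1980 ok", 4, 4, DIGITS, four_digits)
```

For this input `result` is `[(3, 6), (12, 15)]`, and F is invoked only on the two windows that pass identity verification; all other windows are discarded by the set membership test alone.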
IV. ANALYSIS OF THE ALGORITHM
The algorithm starts with the first match window, set up
by steps 1 to 4. The key step for optimization is step 5. Step 5
does the identity verification for characters in the
sub-window w. The verification is done character by character
from the rightmost position of the sub-window. When any character
fails the verification, we slide the match window just past that
character, which usually advances it by several steps instead of
one. This step is somewhat
like Raita's [3] multiple point checking. It may cost
more time when the target substring does exist in the
window; however, in most cases, it reduces the number of
match windows by shifting multiple steps. The best case
is a shift of m steps, which occurs when the rightmost
character of w does not belong to S. Step 6 does the pattern
match. If the match fails,
unlike in the Boyer-Moore or Horspool algorithms, there is no
shift table that advises shifting more than one step.
The optimized algorithm is not designed to outperform the
Boyer-Moore algorithm or its variants for single pattern
string match. Instead, its purpose is to extend the concept of
the bad character shift rule to more general cases. This extension
has immediate applications in two special pattern match
problems:
a. Regular expression pattern match.
b. Regular expression pattern match with checksum
validation.
Example 1: One needs to search for all social security
numbers (SSN) in a text with the regular expression
pattern defined as \d{9}|\d{3}-\d{2}-\d{4}. The alphabet
S={0,1,2,3,4,5,6,7,8,9,-} has 11 characters. The minimum
and maximum lengths for an SSN are 9 and 11 respectively. The
best case is that we do not need to apply the regular expression
pattern match at all if the text does not contain any digits
or hyphens.
Example 2: One needs to search for MasterCard or Visa credit
card numbers (CCN) in a text with the regular
expression pattern defined as
\d{16}|\d{4}-\d{4}-\d{4}-\d{4}. The alphabet
S={0,1,2,3,4,5,6,7,8,9,-} has 11
characters. The minimum and maximum lengths for a CCN
are 16 and 19 respectively. The checksum applies the Luhn
algorithm [5] to validate the CCN.
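A pattern match function for this example might combine the regular expression with the Luhn check, as in the following Python sketch. The names `luhn_valid` and `ccn_match`, the 0-based inclusive indices and the use of `None` for NIL are assumptions made for illustration. For simplicity the sketch validates only the first regular expression hit in the window and reports NIL if it fails the checksum, relying on the caller to slide the window.

```python
import re
from typing import Optional, Tuple

Span = Tuple[int, int]  # (start, end), both inclusive, 0-based

_CCN = re.compile(r"\d{16}|\d{4}-\d{4}-\d{4}-\d{4}")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit counted from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ccn_match(R: str, s: int, e: int) -> Optional[Span]:
    """F(R, s, e): regular expression match followed by checksum validation."""
    found = _CCN.search(R, s, e + 1)
    if found is None:
        return None
    if not luhn_valid(found.group().replace("-", "")):
        return None  # well-formed but fails the checksum: treat as NIL
    return found.start(), found.end() - 1
```

With the commonly used test number 4111-1111-1111-1111 the function reports a target, while changing the last digit to 2 makes the checksum fail and the function returns `None`.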
V. PROBLEM OF MATCHING SEQUENCE OF OBJECTS
This paper has focused on the problem of string
search. Because we have used general
terms to discuss the problem and the solution, the abstract
problem of string match can be extended to a more general
problem. This is the problem of sequence match, where
a sequence is a sequence of objects and a subsequence is
a run of consecutive objects. We can achieve this
by extending two basic concepts: character and string.
Let us use object instead of character and sequence instead of
string. Then the pattern match function, the abstract problem of
sequence match and the optimized algorithm can be introduced
accordingly. It is not yet clear whether this further
abstraction of the problem has any practical implication.
However, it deserves theoretical consideration.
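A minimal Python sketch of this generalization replaces the alphabet S by a membership predicate over objects and the string by any indexable sequence. Everything specific here is an assumption made for illustration: the names `sequence_match`, `admissible` and `rising_triple`, the 0-based inclusive indices, and the toy problem of finding three consecutive increasing positive integers.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
Span = Tuple[int, int]  # (start, end), both inclusive, 0-based


def sequence_match(
    seq: Sequence[T],
    m: int,
    M: int,
    admissible: Callable[[T], bool],
    F: Callable[[Sequence[T], int, int], Optional[Span]],
) -> List[Span]:
    """Window-skipping match over objects; admissible() plays the role of S."""
    matches: List[Span] = []
    s, n = 0, len(seq)
    while True:
        r = min(s + M - 1, n - 1)
        if r - s < m - 1:
            return matches
        # Identity verification on the sub-window, scanned right to left.
        bad = next(
            (i for i in range(s + m - 1, s - 1, -1) if not admissible(seq[i])),
            None,
        )
        if bad is not None:
            s = bad + 1
            continue
        hit = F(seq, s, r)
        if hit is None:
            s += 1
        else:
            matches.append(hit)
            s = hit[1] + 1


def rising_triple(seq: Sequence[int], s: int, e: int) -> Optional[Span]:
    """Hypothetical F: first three consecutive increasing values in seq[s..e]."""
    for i in range(s, e - 1):
        if seq[i] < seq[i + 1] < seq[i + 2]:
            return i, i + 2
    return None


found = sequence_match([0, 5, 0, 1, 2, 3, 0, 0, 4], 3, 3, lambda x: x > 0, rising_triple)
```

In this toy run `found` is `[(3, 5)]`; the windows containing a zero are skipped by the predicate without calling `rising_triple`.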
VI. CONCLUSION
We presented a general problem of string match and an
optimized algorithm for it, inspired by the bad character shift rule
of the Boyer-Moore string search algorithm. The abstract
nature of the problem allows us to include both single
pattern string search and regular expression pattern match as
its two special cases.
While the optimized algorithm discussed is not better
than Boyer-Moore type string search algorithms, it can be
used for match optimization in other pattern problems such as
regular expression pattern match or regular
expression pattern match with checksum validation. One
can even use it for many other pattern match problems
beyond the scope of strings of characters, such as sequences of
objects, where the concept of object can be very general.
ACKNOWLEDGMENT
Special thanks to Joe Lin, the engineering site director at
Trend Micro, for his support. Without his sponsorship, this
research work would not have been possible.
REFERENCES
[1] R. Boyer, J. Moore, "A fast string searching algorithm",
Comm. ACM, vol. 20, pp. 762–772, 1977
[2] R. Horspool, "Practical fast searching in strings", Software -
Practice & Experience, vol. 10(6), pp. 501–506, 1980
[3] T. Raita, "Tuning the Boyer–Moore–Horspool String
Searching Algorithm", Software - Practice & Experience, vol.
22(10), pp. 879–884, 1992
[4] D. Sunday, "Very Fast Substring Search Algorithm", Comm.
ACM, vol. 33(8), pp. 132–142, 1990
[5] http://en.wikipedia.org/wiki/Luhn_algorithm