Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Liwei Ren
Data Center Research
Trend Micro
Cupertino, USA
e-mail:
[email protected]
Abstract— The bad character shift rule of Boyer-Moore string
search algorithm is studied in this paper for the purpose of
extending it to more general string match problems. An abstract
problem of string match is defined in general. An optimized string
match algorithm based one the bad character heuristics is
proposed to solve the abstract match problem efficiently.
Keywords: pattern; string; sequence; search; match; bad
character; Boyer-Moore
I. INTRODUCTION
String searching is a classic problem in many text
processing applications. Among many string searching
algorithms, Boyer-Moore algorithm [1] is a particular
efficient one for single pattern string match. It uses both
the concepts of good suffix shift and bad character heuristics
to accelerate the string match. Two shift tables are
established to determine how many shifts to make after
match fails. The algorithm shifts the pattern according to the
larger shift given by two shift tables.
The Horspool algorithm [2] is the best known variant of
Boyer-Moore algorithm. It only uses the bad character
heuristics to build the shift table. There are other variants as
well such as the algorithms given by Raita [3] and Sunday
[4].
In summary, the essence of all the Boyer-Moore style
algorithms is to skip the unnecessary character comparisons
as many as possible.
If we introduce the concept of match window as a
substring of the reference string , the naïve string searching
algorithm is basically a sliding window match algorithm
with N-M+1 match windows, where N and M are the sizes
of the reference string and the pattern respectively. Hence,
in practice, the Boyer-Moore algorithm selects only a few of
candidate match windows that possibly contains the target
strings. This is done by ruling out many windows that
definitely have no target substrings.
The bad character shift with Boyer-Moore algorithm can
take a weaker form as character identity verification. It
verifies whether a given character in the reference string
belongs to the alphabet of the search pattern or not.
We can extends the concepts of both match window and
character identity verification to other string match
problems, for instance, the regular expression based pattern
match problem which has many applications in practice.
This paper proposes an abstract problem of string match
which includes the two classic string matching problems, i.e.,
single pattern string search and regular expression pattern
match, as the special cases.
An efficient algorithm is constructed to solve the abstract
problem based on the concepts of match window and
character identity verification.
II. A GENERAL PROBLEM OF STRING MATCH
In this section, we uses an abstract model to present
string match problems in more general terms. With this
model, many practical problems can be covered beyond the
scope of both single pattern string searching and regular
expression based string matching.
Before we define the problem, lets observes the follows
from classic string match problems:
1. The target string has a small alphabet S when
comparing to the whole character space. In the case
of single pattern string search problem, S consists
of all unique characters of the pattern string. In the
case of regular expression match, it is typical that
most entities defined by regular expression patterns
in practical applications have small alphabets as
well. Examples of these entities include IP
addresses, dates, credit card numbers, bank account
numbers , ID numbers and etc..
2. The target strings have well-defined minimum and
maximum lengths. This is obvious with the single
pattern search problem. As to the regular
expression match, it is not uncommon that these
two numbers can be pre-defined. For example, to
match master credit card number from a text, the
minimum length is 16 while the maximum length
can be defined as 19 if one also includes the format
dddd-dddd-dddd-dddd.