A Comparison of Serial and Parallel Substring Matching Algorithms
Gerrett Diamond, Thomas Manzini, Zexin Wan, Paul Zhou
Rensselaer Polytechnic Institute - Parallel Programming and Computing
Abstract
We present a study of serial and parallel algorithms for an advanced string matching problem between two large texts: finding the longest common substring of texts A and B. The study was performed by parallelizing both a naive serial algorithm and a dynamic programming algorithm on RPI's Blue Gene/Q supercomputer, AMOS. We use a range of test cases and situations to show where each method has bottlenecks, and we show that the two parallel algorithms peak under different conditions, so the better algorithm depends on the input being run.
1 Introduction
String matching has long been a focus of research. It has many applications in the academic realm, as a tool for maintaining academic integrity in the form of plagiarism checkers, but also in other fields such as search and information retrieval. For our purposes, we set out to examine the differences between naive and dynamically programmed solutions and how each would behave when parallelized. Our implementations were run on the RPI Blue Gene/Q machine, AMOS, under many different initial conditions for both the number of tasks and the number of nodes. All versions of the code performed differently under different circumstances, so a wide range of tests was applied to ensure that no behavior was overlooked. From there we compared the runtimes of the various algorithms and their overall performance. On the whole we saw a wide range of performance differences.
2 Related Works
Typically in research, string matching refers to the task of finding a small pattern string within a larger text. Many serial algorithms have been implemented for this, ranging from O(nm), where n and m are the lengths of the text and pattern respectively, down to O(n) with an O(m) preprocessing step. For parallel algorithms, two were shown by Zvi Galil: first an O(log log n) algorithm [1], later followed by an O(1) algorithm [2]. Both of Galil's algorithms were designed for the parallel random access machine (PRAM) computation model and use preprocessing of the pattern string to achieve better runtimes. This form of string matching has been explored to its limit, reaching constant time, but our problem involves searching between two large bodies of text and cannot use the same preprocessing tricks.
A problem similar to ours is described by Landau [3], who explores the case where the smaller pattern may have up to k differences from a substring of the larger text. Although our problem involves comparing two full texts, similar concepts can be applied to handling the difference checking. While our model looks only for exact matches, the idea of checking throughout the pattern string for differences provides key concepts for comparing the two texts to each other.
3 Implementation
Two simple algorithms for finding the longest common substring of two texts are a naive three-loop structure and dynamic programming. Our goal was to compare these two methods in both serial and parallel, and to see whether their differences carried over to the parallel versions. To do this we implemented both serial algorithms and used them as the basis for the parallel algorithms.
3.1 Serial Algorithms
The naive method loops over each character of the two texts and then runs a subsequent inner loop, while the two characters are equal, to find the common substring length. This method has a runtime of O(n*m*k), where n and m are the sizes of the texts and k is the size of the longest common substring. It is attractive for its simplicity and low cost per iteration, but as the length of the common substring increases, more work must be done and the runtime takes a large hit. If the two files being compared are identical, the runtime is O(n^3). The dynamic method fills a two-dimensional array that tracks the current longest substring along diagonals, giving a runtime of O(n*m). Unlike the naive method, it is unaffected by the substring length. Its overhead is the two-dimensional array, which takes an extra O(n*m) space in memory, and the additional calculations required to maintain that array.
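For concreteness, the two serial methods can be sketched as follows. This is a minimal C++ sketch, not the exact benchmarked code, and the function names are ours:

```cpp
#include <string>
#include <vector>
#include <algorithm>

// Naive method: O(n*m*k). For every pair of starting positions,
// extend the match while the characters agree.
int lcs_naive(const std::string& a, const std::string& b) {
    int best = 0;
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b.size(); ++j) {
            size_t k = 0;
            while (i + k < a.size() && j + k < b.size() && a[i + k] == b[j + k])
                ++k;
            best = std::max(best, (int)k);
        }
    return best;
}

// Dynamic programming method: O(n*m) time and space.
// dp[i][j] holds the length of the common substring ending at a[i-1]
// and b[j-1]; matches accumulate along the diagonals of the table.
int lcs_dynamic(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> dp(a.size() + 1,
                                     std::vector<int>(b.size() + 1, 0));
    int best = 0;
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            if (a[i - 1] == b[j - 1]) {
                dp[i][j] = dp[i - 1][j - 1] + 1;
                best = std::max(best, dp[i][j]);
            }
    return best;
}
```

The naive version allocates nothing beyond loop counters, while the dynamic version pays the O(n*m) table up front, which is exactly the trade-off discussed above.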
3.2 Parallel Naive
The parallel naive algorithm, like the serial version, finds the longest substring length through brute force, but uses parallelization to break up the task. The parallelization is done on string B, splitting it into multiple chunks and comparing each chunk individually to string A. This, however, was not trivial, because processes must pass partial match information to each other in order to assemble the longest common substring. To accomplish this, each process stores an associative set mapping starting index to length for all matches that include the beginning of its B chunk, and an associative set mapping ending index to length for all matches that include the end of its chunk. Using MPI, each process passes its ending set to the process of rank one greater, which compares the received ending set against its own starting set and adds the substring lengths together when an ending index meets a starting index. This is done in order from rank 0 to the last rank, after which all partial matches are combined and the maximum substring length is known. Because the sends and receives must be blocking, this final MPI pass is expected to cause more overhead as the number of tasks grows.
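The boundary bookkeeping can be sketched without MPI. In the following C++ sketch (a serial simulation, with names of our choosing), B is split into chunks; each "rank" records match runs that touch its chunk's right edge, keyed by the diagonal i - j, and the next rank chains them onto runs starting at its own left edge, mirroring the rank-0-to-last pass described above:

```cpp
#include <string>
#include <map>
#include <algorithm>

// Serial stand-in for the parallel naive scheme: B is split into
// `ranks` chunks; runs that touch a chunk boundary are passed to the
// next rank, as the blocking MPI pass does in the real implementation.
int lcs_chunked(const std::string& a, const std::string& b, int ranks) {
    int n = (int)a.size(), m = (int)b.size(), best = 0;
    // carry maps a diagonal (i - j) to the accumulated length of the
    // match run ending at the previous chunk's final character.
    std::map<int, int> carry;
    for (int r = 0; r < ranks; ++r) {
        int s = (int)((long long)m * r / ranks);
        int e = (int)((long long)m * (r + 1) / ranks);
        std::map<int, int> next_carry;
        for (int d = -(e - 1); d < n - s; ++d) {  // diagonals touching chunk
            int j = s;
            while (j < e) {
                int i = j + d;
                if (i < 0 || i >= n || a[i] != b[j]) { ++j; continue; }
                int j0 = j;                        // start of a match run
                while (j < e && j + d < n && a[j + d] == b[j]) ++j;
                int total = j - j0;
                if (j0 == s && carry.count(d))     // chain across boundary
                    total += carry[d];
                best = std::max(best, total);
                if (j == e) next_carry[d] = total; // touches right edge
            }
        }
        carry = next_carry;
    }
    return best;
}
```

Runs spanning several chunks are handled because the carried length accumulates rank by rank; this assumes `ranks` does not exceed the length of B, so no chunk is empty.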
3.3 Parallel Dynamic
The parallelization of the dynamic method maintained the two-dimensional array structure, but only as much of it as needed per node. Both strings were split evenly between nodes, with the largest pieces given to the last node. The algorithm is a two-step process: first a local computation is done, and then global corrections are applied to get the correct lengths. The local computation passes sections of the second text around in a ring using MPI, and therefore compares each section of both texts against each other while building a local two-dimensional array. This initial computation, however, does not account for substrings continuing from previous nodes. Therefore a second step uses blocking MPI calls in sequence to compute the overall substring lengths. This second step suffers more latency as the number of tasks grows. The two steps have opposing overheads that create a peak performance point: the first computation gets faster with more processes, as the work is split up more, while the second slows down, as more blocking must occur to correct each node's calculations.
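The correction step can be illustrated without MPI: only the boundary row of the dynamic programming table needs to pass from one node's piece of text A to the next. The following C++ sketch (a serial stand-in for the two-step scheme, with names of our choosing) splits A into pieces and carries the boundary row between them:

```cpp
#include <string>
#include <vector>
#include <algorithm>

// Piece-by-piece dynamic programming over A: each piece fills its own
// local table against all of B, seeded with the previous piece's last
// row. In the MPI version this row is what the sequential blocking
// pass delivers to each node.
int lcs_dynamic_blocked(const std::string& a, const std::string& b,
                        int pieces) {
    int n = (int)a.size(), m = (int)b.size(), best = 0;
    std::vector<int> carry(m + 1, 0);  // previous piece's boundary row
    for (int p = 0; p < pieces; ++p) {
        int lo = (int)((long long)n * p / pieces);
        int hi = (int)((long long)n * (p + 1) / pieces);
        std::vector<std::vector<int>> dp(hi - lo + 1,
                                         std::vector<int>(m + 1, 0));
        dp[0] = carry;                 // correction from earlier pieces
        for (int i = 1; i <= hi - lo; ++i)
            for (int j = 1; j <= m; ++j)
                if (a[lo + i - 1] == b[j - 1]) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    best = std::max(best, dp[i][j]);
                }
        carry = dp[hi - lo];           // pass boundary row onward
    }
    return best;
}
```

Because only the single carried row crosses a boundary, each piece's local work is independent once its predecessor's last row is available, which is the essence of the local-computation-plus-correction design.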
4 Contribution
Thomas Manzini wrote serial implementations of a string matching algorithm, initially implemented in Python. This code was later adapted to C++ by Gerrett Diamond for the serial portions of the project that were run as benchmarks. The Python code served as a proof of concept for both the serial and parallel implementations. At the same time, additional code was written to create test cases that could be tailored to the problems we needed. Our group was broken into two teams: the first, Thomas Manzini and Paul Zhou, worked on the parallel naive implementation; the second, Gerrett Diamond and Zexin Wan, implemented the parallel dynamic algorithm. Thomas Manzini wrote the file I/O portion of the parallel naive implementation, which was then used by Paul Zhou, who wrote the string matching portion of the C++ parallel naive implementation. Zexin Wan was responsible for implementing the parallel I/O for the dynamic parallel program, and Gerrett Diamond implemented the dynamic parallel algorithm. As the code was being finished, all members contributed to running tests and gathering data.
5 Testing and Expectations
We used both randomly generated texts with known substring lengths and published books to test our programs. The randomly generated tests were used to compare the serial algorithms to the parallel ones, as well as to show the effects of substring length and text length on each program. We generated two sets of tests: one with a constant substring length of 50% of the file, and the other with a fixed file size of 16384 bytes. The first set varied the file size from 1KB to 64KB; the second tested substring lengths from 0% to 100%. Lastly, the real world cases were selected from the top selections portion of the Project Gutenberg website [4]. These selections were then trimmed down to a size the programs could handle. For small data cases, we expected the serial code to be faster than the parallel algorithms, since when computation time is short, blocking time makes up the majority of execution time; as the data grows larger, the parallel algorithms should be faster. We also expected the naive algorithms to have equal or even better performance than the dynamic algorithm, except when the longest common substring grows to a large percentage of the file, where the dynamic algorithm should be much faster.
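A generator along the lines described above can be sketched as follows; our actual generator is not reproduced here, so the function name, signature, and defaults are illustrative only:

```cpp
#include <random>
#include <string>
#include <utility>

// Generate two random lowercase texts of `size` bytes sharing a planted
// common substring covering `fraction` of each file, inserted at a
// random offset in each text.
std::pair<std::string, std::string>
make_test_case(size_t size, double fraction, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> ch('a', 'z');
    auto random_text = [&](size_t len) {
        std::string s(len, ' ');
        for (auto& c : s) c = (char)ch(rng);
        return s;
    };
    size_t shared = (size_t)(size * fraction);
    std::string common = random_text(shared);
    std::string a = random_text(size), b = random_text(size);
    if (shared > 0) {
        std::uniform_int_distribution<size_t> pos(0, size - shared);
        a.replace(pos(rng), shared, common);  // plant the shared substring
        b.replace(pos(rng), shared, common);
    }
    return {a, b};
}
```

Fixing the seed makes a test set reproducible across runs, which matters when comparing serial and parallel timings on the same inputs.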
6 Performance Results
The graphs displaying execution times
as a function of the size of the two input files
(Fig. 1, Fig. 2) show that the naive serial code is
always faster than the dynamic serial code. On
the other hand, the parallel naive method is
faster than the parallel dynamic method only for
smaller file sizes, with the parallel dynamic
method outperforming it as the file size
increases.
Fig. 1 This figure shows the graph of the execution time
versus the size of the files that are being searched by the
naïve methods
For the naive algorithm, the serial code outperforms the parallel code at 1KB, but becomes much slower beyond 2KB. The curves for the various task counts show that runs with smaller task counts tend to be faster than those with larger task counts.
Fig. 2 This figure shows the graph of execution time versus the size of the files that are being searched by the dynamic methods.
Similarly, for the dynamic implementations, the serial code outperforms the parallel code at 1KB and is slower beyond 2KB. Runs with smaller task counts are faster than those with larger task counts, except when the file size is large, at 64KB.
The graphs in Fig. 3 and Fig. 4 show the time the programs took to execute versus the size of the substring. Two obvious trends appear in this data. The first is that, between the two graphs, we
see that the serial implementations at this scale
take significantly more time than either of the
parallel implementations.
Fig. 3 This is the graph that shows the execution time
versus the percent of the file that contains the substring that
is being searched for by the naïve method
For the naive implementation we see an obvious trend: as the size of the shared substring increases, the execution time increases as well. The data shows a rapid rise in execution time when between 20% and 40% of the file is a shared substring. From that point the execution time levels off and approaches a limit. This limit is different for each number of tasks and appears to grow as the number of tasks grows. One thing worth noting is that the serial implementation's time also increases as the substring grows.
Fig. 4 This is the graph that shows the execution time versus the percent of the file that contains the substring that is being searched for by the dynamic method.
For the dynamic implementation, we see
something strikingly different. Aside from the
serial implementation we see that the execution
time for the different numbers of tasks seems to
stay around a constant value. These constant
values appear to be different for each different
number of tasks. It appears that as the number of
tasks increases, the execution time increases as
well. For all implementations, including the
serial one, the execution time appears to be, for
the most part, constant regardless of the size of
the substring.
Fig. 5 This graph shows the execution time versus the number of tasks that the system used when performing the string matching; it compares the naïve and dynamic implementations when comparing Pride and Prejudice and The Divine Comedy.
Fig. 6 This graph shows the execution time versus the number of tasks that the system used when performing the string matching; it compares the naïve and dynamic implementations when comparing The Adventures of Huckleberry Finn and The Divine Comedy.
Fig. 7 This graph shows the execution time versus the number of tasks that the system used when performing the string matching; it compares the naïve and dynamic implementations when comparing Pride and Prejudice and The Adventures of Huckleberry Finn.
The graphs seen in Fig. 5, Fig. 6, and
Fig. 7 show us the performance time as a
function of the number of tasks. These graphs
refer to the time that it takes for the comparison
of the novels Pride and Prejudice and The
Divine Comedy, The Adventures of Huckleberry
Finn and The Divine Comedy, and Pride and
Prejudice and Huckleberry Finn, respectively.
The data shows an interesting trend: the parallel naive implementation outperforms the parallel dynamic implementation in the vast majority of the test cases. This changes, however, in the 1024- and 2048-task cases. The results differ between the graphs, but the naive results appear much more consistent, whereas the dynamic results become much less consistent once the number of tasks passes 512. Though the naive implementation is not always faster, the inconsistency of the dynamic implementation means that the naive implementation performs better on the whole.
Fig. 8 This graph shows the percentage of time that each
program spent using the message passing interface. It is the
percent time spent versus the number of tasks utilized.
For every algorithm, a significant
portion of the execution time is spent performing
MPI sending and receiving, depending on the
number of tasks. For every number of tasks, the
dynamic code takes less time in MPI than the
naive. The dynamic code, however, shows a
steeper difference between the minimum and
maximum numbers of tasks, with a factor of 6,
than the naive code, with a factor of 2.2.
On the whole we saw an average speedup of roughly 25.8 times when comparing the average runtime of the fastest serial implementation (naïve) against the average runtimes of both the naïve and dynamic parallel implementations with 256, 512, and 1024 tasks.
7 Analysis of Performance Results
The serial execution times grow exponentially; the parallel execution times, although lower for our test cases, appear to grow at an even greater rate. This is due to the increasing overhead of blocking MPI calls. We expect that, given larger test cases, the parallel code may end up slower than the serial code.
However, the parallel dynamic code had memory problems with large input files, so we had to limit the file size of our test cases to 200KB; larger files cause memory allocation errors. This is because the dynamic code creates a 2D array with dimensions (size of file A) x (size of file B), meaning that memory usage scales quadratically with file size. We believe that
an environment with more memory would be
needed to try these larger test cases.
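One standard mitigation, which we did not implement, is to observe that each cell of the longest-common-substring table depends only on the cell diagonally above and to the left, so only one previous row needs to be kept. A hedged C++ sketch (function name ours):

```cpp
#include <string>
#include <vector>
#include <algorithm>

// Row-compressed dynamic programming: keeps two rows of the table
// instead of the full (n+1) x (m+1) array, reducing memory from
// O(n*m) to O(m) while computing the same longest-substring length.
int lcs_low_memory(const std::string& a, const std::string& b) {
    std::vector<int> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    int best = 0;
    for (size_t i = 1; i <= a.size(); ++i) {
        for (size_t j = 1; j <= b.size(); ++j)
            cur[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
        best = std::max(best,
                        *std::max_element(cur.begin(), cur.end()));
        std::swap(prev, cur);  // current row becomes previous row
    }
    return best;
}
```

The trade-off is that the full table is no longer available afterward; whether this composes cleanly with our global correction step is untested.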
We observed that as the data size increases, the gaps between the execution times of the parallel methods with different numbers of nodes become smaller. The reason is that the time for data transfer between nodes remains constant while the computation time grows linearly. Comparing Fig. 1 and Fig. 2 shows that the parallel naive method is faster than the parallel dynamic method when the data is small, while the parallel dynamic method outperforms the parallel naive method once the data size grows beyond 16KB.
The results in Fig. 3 and Fig. 4 are in line with what we were expecting. For the naive implementation, we see that as the size of the substring increases, the execution time increases as well. This is to be expected: as a larger shared string is calculated, it must be passed to all the relevant nodes. This increases communication time among the nodes, not only because the data must be passed but also because the amount of data that must be passed grows. We see this same trend for all of the parallel implementations. For the serial case, we also see a slight increase as the size of the substring grows. This is caused by the fact that the serial implementation continues looking through the contents of the file even after a substring has been found; after the initial match is found, the program keeps searching to ensure that it hasn't missed anything.
For the dynamic implementation we also see data in line with our expectations: a very consistent execution time regardless of the size of the substring in the file. This makes sense, as the amount of communication does not depend on the substring size; the data passed around to each node remains constant throughout the program. The changes we see as the number of tasks increases are also consistent, since the number of blocking calls, and hence communication, grows with the number of tasks. From there it is simple to see how, past a certain point, the time the program spends communicating outweighs the advantage of splitting up the file, and the execution time increases.
Fig. 5, Fig. 6 and Fig. 7 test the ability to run very large text files. The three books we used for testing are Pride and Prejudice, The Adventures of Huckleberry Finn and The Divine Comedy, each about 600KB in size. From the performance curves of the naive method, we can see it reaches peak performance at 256 tasks, after which the execution times start to rise. The reason is that the longest substrings between books are very short compared to the size of a book, so the naive method, which depends strongly on the length of the longest substring, has very few computations per task. As the program gets more tasks, the time for computing the local maximum substring decreases only slightly, while the time spent in MPI blocking increases in proportion to the number of nodes. The dynamic method shows similar behavior. Fig. 6 and Fig. 7 show the dynamic method reaching its peak performance at 1024 tasks and outperforming the peak of the naive method. The reason the dynamic method takes more tasks to max out is that it does not depend on the length of the longest substring and therefore has more to compute.
A general trend displayed in these graphs is that the more tasks we use, the slower the execution. The only times using more than 256 tasks is beneficial are with a string match of 0% using the naive algorithm, and with file sizes greater than 16KB using the dynamic algorithm. We attribute this to the MPI I/O overhead, which scales with the number of tasks. For smaller calculations, the overhead incurred by larger task counts outweighs the benefit of the increased parallelization from having more tasks.
As expected, the naive algorithm spent a larger percentage of its running time doing MPI communication, because passing the partial matches requires sending more data. It is notable, however, that the naive algorithm was generally faster for our test cases: despite its greater communication overhead, its running time was significantly lower than that of the dynamic algorithm.
8 Future Works
A current bottleneck in the naive method is that while text B is broken up between the processes, text A is kept whole on each node. A method for breaking up A as well was explored but not implemented, as the amount of information that must be stored is at least doubled and the model becomes much more complex. It does, however, offer a potential speedup, as less checking would happen per node at a time.
While researching parallel naive algorithms, we also found other algorithms that could be faster than our current parallel naive one. The idea is to split both strings A and B, converting string B into a ring of small pieces. Each rank rotates the ring, comparing a piece of B against its part of string A and recording the length of the longest substring, its position, and the piece's order in the ring, then passes the piece to the next rank. Ideally this method should be much faster than the current naive algorithm, since it scales better and has fewer dependencies. However, we did not implement it because it was not only more memory intensive than the current method but also considerably more complicated. We list it as a possible direction for future study.
There are also potentially better serial algorithms that could be extended to parallel. We took two of the simplest methods in order to make runtime comparison easy, but future work could be done on algorithms with better runtimes.
9 Conclusions
From all the data gathered for this study, we conclude that there is no single best method for finding the longest common substring of two texts; performance differs as the cases change. If the user has a powerful machine, massive texts, and multiple long substrings, the dynamic method is the better choice. However, in the real world most texts are not extremely similar and most people do not own a supercomputer like a Blue Gene/Q, so the naive method is the more practical choice. (The alternative naive method mentioned in the Future Works section should perform even better, since it scales better in theory.)
References
[1] Dany Breslauer, Zvi Galil. "An Optimal O(log log n) Time Parallel String Matching Algorithm". SIAM J. Comput., 19(6), 1051-1058. http://epubs.siam.org/doi/abs/10.1137/0219072?journalCode=smjcat
[2] Zvi Galil. "A constant-time optimal parallel string-matching algorithm". Journal of the ACM, Volume 42, Issue 4, July 1995, Pages 908-918. http://dl.acm.org/citation.cfm?id=210341
[3] Gad M. Landau. "Fast parallel and serial approximate string matching". Journal of Algorithms, Volume 10, Issue 2, June 1989, Pages 157-169. http://www.sciencedirect.com/science/article/pii/0196677489900102
[4] Michael S. Hart. Project Gutenberg. University of North Carolina, 1 Dec. 1996. Web. 7 May 2014. http://www.gutenberg.org/