UNIVERSITY OF UTAH, CS 6230: PARALLEL AND HIGH PERFORMANCE COMPUTING
Fast fgrep using parallel string matching algorithm
Myungho Jung
May 10, 2015
1. INTRODUCTION
Grep is a common tool for searching for patterns in files and printing the matching lines on Unix-like systems. GNU grep can find patterns quickly using string matching algorithms such as the Boyer-Moore algorithm. However, GNU grep cannot take full advantage of multi-core systems because it is implemented with sequential algorithms. The purpose of this project is to implement a parallel string matching algorithm and apply it to the existing grep code. It would be hard to beat the performance of GNU grep, since it has been developed and optimized for a long time. Nevertheless, this project can demonstrate a new way for grep to utilize multi-core systems.
GNU grep is divided into several parts. First, egrep, equivalent to the option ‘-E’, searches for patterns given as extended regular expressions. Second, grep can find fixed strings using fgrep or the option ‘-F’. Lastly, pcregrep searches files for patterns given as Perl regular expressions. As a first step toward parallelizing GNU grep, fgrep is parallelized in this project.
Fgrep uses one of two algorithms depending on the number of patterns. If there is a single input pattern, the Boyer-Moore algorithm is used; if there is more than one, grep finds the patterns with the Commentz-Walter algorithm. In this project, the first case is parallelized.
2. METHODS
There are several ways to parallelize grep. First, file-level parallelism is possible: while the first thread searches for the pattern in one file, a second thread can search another file at the same time. This is easy to implement, but a single large file cannot be searched in parallel, and the results of different files will be interleaved. Second, grep can be parallelized line by line, but this has similar problems: lines will not be printed in order, and it cannot be applied to binary files, which have no newline characters. Therefore, the goal of this project is to produce exactly the same output as the sequential algorithm.
Many parallel algorithms for string matching have been studied. One of the earliest optimal algorithms was devised by Zvi Galil [2]; however, it is limited to fixed alphabets. Uzi Vishkin improved the algorithm to work in the general case [3]. A constant-time algorithm using randomization was also developed by Zvi Galil et al. [1]. In this project, the algorithm is based on Uzi Vishkin’s idea, with several mechanisms added for the grep utility.
2.1. DEFINITIONS
It is necessary to define some terms for this algorithm.
• Given a pattern of length m and a text of length n, the goal of the algorithm is to find every occurrence of the pattern in the text.
• Period: given a string S, a string X is a period of S if S = X^k X', where X is repeated k times and X' is a (possibly empty) prefix of X.
• The period: the shortest period, if there is more than one.
• Periodic: a string of length m is periodic if the length of its period is at most m/2.
Ex> ‘abaabcab’ is not periodic, but ‘abcabcab’ is periodic (its period ‘abc’ has length 3 ≤ 8/2).
• Witness: an array recording, for each overlap offset, an index at which two overlapped copies of the pattern differ. Given witness[i] = j, when the pattern is laid over a copy of itself shifted by i, j is the index in the upper string at which the two characters differ.
• Duel: using the witness array, for two overlapping candidate positions it is guaranteed that at least one of the two pattern characters aligned at the witness position differs from the text character there, so at least one candidate can be eliminated in constant time.
2.2. ALGORITHMS
2.2.1. PATTERN ANALYSIS
First, the witness array is built from the pattern string. This step can be parallelized, but doing so is not useful in practice, because the pattern is usually short and the overhead of parallelism would outweigh the sequential algorithm.
Algorithm 1 The algorithm for pattern analysis
function PATTERNANALYSIS(pattern[1..m])
    for i = 1; i < log m; ++i do
        Divide the pattern into blocks of size 2^i
        Get j, the candidate of the first block that survives for matching
        Compute witness[j] against pattern[1..2^(i+1)] by brute force    D: O(1), W: O(2^i)
        if witness[j] = 0 then
            Apply the duel function to the remaining blocks
        else
            Find the largest α (α ≥ i+1) such that j is the period of
            pattern[1..2^α] but not of pattern[1..2^(α+1)]    D: O(1), W: O(2^α)
            if α = log m then
                return witness
            else
                Eliminate all candidates in the first α−1 blocks using the duel function
                i = α
                continue
            end if
        end if
    end for
    return witness
end function
2.2.2. TEXT ANALYSIS (NONPERIODIC CASE)
This is the common case. If the pattern is not periodic, the algorithm is as follows:
1. Partition the text into blocks of size m/2.
2. In each block, eliminate all but one candidate using the duel function. D: O(log m), W: O(m)
3. For each surviving candidate, check by brute force whether the pattern matches at that position. D: O(1), W: O(n)
In total, the depth is O(log m) and the work is O(n).
2.2.3. TEXT ANALYSIS (PERIODIC CASE)
In reality, a periodic pattern is a rare case. However, if the pattern is periodic, the algorithm above cannot be applied, because each block of size m/2 may contain more than one position where the pattern matches. The algorithm is as follows:
1. Let p be the period of the pattern. Compute all candidates by matching the prefix pattern[1..2p−1]. D: O(log m), W: O(n)
2. For each occurrence, check whether p^2 p' occurs at the position, where p' is the prefix of p that ends the pattern. D: O(log m), W: O(p)
3. The pattern occurs at position i if p^2 p' occurs at all of the positions i + jp for j = 0..m/p. D: O(log m), W: O(n)
Therefore, the total depth is O(log m) and the work is O(n).
3. IMPLEMENTATION DETAILS
Many parts of this algorithm could be parallelized, but parallelizing every part is not a good idea because it would add overhead. Therefore, only the outermost loop is parallelized. Applying this algorithm to grep is a little tricky: the sequential algorithm in the original grep returns the first occurrence of the pattern in the buffer, whereas the parallel algorithm can only find all occurrences, in no particular order. Thus, the code had to be modified before the algorithm could be applied.
The sequential grep reads a file into a buffer; if the file is larger than the buffer, it reads as much data as the buffer holds. It does not search the text line by line. To improve performance, it returns the position of the first occurrence of the string and prints the containing line, searching for further occurrences only when the color option is set. This approach is efficient and fast, but hard to parallelize.
To apply the parallel algorithm, part of the original grep code is edited, but almost all of the code for the parallel algorithm is added in separate source files, ‘vuset.c’ and ‘vuset.h’. The program loads data into the buffer in the same way as the original grep. The algorithm then finds all occurrences of the pattern in the buffer and returns them as a linked list, which acts as a cache for printing lines. A linked list is normally slow for random access, but it works well here because the program prints the results sequentially. Although this approach is not effective for small files, it works well in common cases.
4. EXPERIMENTS & RESULTS
4.1. TEST ENVIRONMENT
The program is implemented using OpenMP and tested on Stampede. To allow comparison with the sequential GNU grep, the parallel parts are added to the GNU grep code, so many functions of the original GNU grep can be reused by the parallel grep. Each case is run 10 times and the times are averaged. The largemem queue of Stampede is used to test on 32 cores per node. Execution time is measured with the Linux time command.
A large test file of about 600MB is generated for the test. It consists of 5 million lines, each containing 12 random words, and is created by a Ruby script using the Linux dictionary. The program is also tested on many small files: the 3,216 text files in the source code of autotools (automake, autoconf, and m4).
4.2. RESULTS
4.2.1. TESTS ON A LARGE FILE
First, the program is tested on the large file of about 600MB. The pattern is ‘apple’, and it occurs several times in the file, so this is not the worst case. The result is below:
[Plot: execution time (seconds) vs. number of threads]
Figure 4.1: Execution times of the normal case (there are occurrences of the pattern)
As the result shows, the execution time decreases as the number of threads increases. The sequential grep (option -F) takes 1.3374 seconds. However, a direct comparison with the original grep is not very meaningful, because GNU grep, unlike the parallel grep, has been developed and optimized for many years. Nevertheless, the result shows that the performance of the parallel grep can potentially exceed that of the sequential one.
The second test is the worst case: there is no occurrence of the pattern in the test file. The result is below:
[Plot: execution time (seconds) vs. number of threads]
Figure 4.2: Execution times at the worst case (there is no occurrence of the pattern)
Although this is the worst case, the execution time is lower than in the first test. The first test takes more time because it includes the time to print the matching lines, which makes the process I/O bound.
Finally, the program is tested with a periodic pattern. The result is below:
[Plot: execution time (seconds) vs. number of threads]
Figure 4.3: Execution times at the worst case when the pattern ‘abcdeabcdeabc’ is periodic
The result is similar to the nonperiodic case. In all of the tests, the execution time does not decrease much as the number of threads increases. Analysis showed that there are many sequential parts that cannot be parallelized; for example, the code that reads files into the buffer and prints lines must run sequentially. This is the main limitation of the project, and most of the code would have to be modified to parallelize it further.
4.2.2. TESTS ON MULTIPLE FILES
The program is also tested on many small files (the 3,216 files from the autotools source code). The results for the normal and the worst case are below:
[Plot: execution time (seconds) vs. number of threads]
Figure 4.4: Execution times of a normal case for many small files
[Plot: execution time (seconds) vs. number of threads]
Figure 4.5: Execution times at the worst case for many small files
As the results above show, this program is not efficient for many small files. The reason is the overhead of parallelizing each file separately: repeatedly creating and joining threads takes a long time.
There is a way to overcome this. Instead of loading a single file into the buffer, read multiple files into a larger buffer and store the offsets of the beginning and end of each file. Then find all occurrences of the pattern in the buffer, and finally print the lines and file names using the stored offsets.
4.2.3. COMPARISON WITH ANOTHER METHOD TO PARALLELIZE
There is no good existing way to parallelize grep, which is why this project was started. A simple approach is GNU parallel, a command-line utility for Linux and other Unix-like operating systems that executes shell jobs in parallel. For example, the command ‘parallel -j 20 --pipe --block 1024k grep -F --color=always apple < text.txt’ parallelizes grep easily: it divides a large file into blocks of size 1024KB and runs grep on them in parallel. However, the execution time was 8.55 seconds when this command was run with the same test file and environment, and the output differs from that of the original grep. Thus, the parallel command cannot effectively parallelize grep.
5. CONCLUSIONS
Using a parallel string matching algorithm, a parallel grep is implemented and tested. Unfortunately, the results show that it is not yet ready to substitute for the original grep; more work is required to solve the remaining problems.
Although this algorithm may not be the best fit for GNU grep, it can be used in other programs. For example, gzip, a file compression tool, uses string matching to compress files; applying the parallel algorithm could reduce compression time on multi-core systems. The find tool searches files by name on Unix-like systems, and many other Unix utilities use string matching algorithms as well. If the algorithm is implemented as a library, it can be used in many areas and improve the load balance and efficiency of these tools on multi-core machines.
A. HOW TO BUILD AND EXECUTE THE PROGRAM
A.1. BUILD
GNU grep is packaged with autotools, so it can be compiled and installed with the usual build commands:
> configure && make && make install
If you use a system where you do not have root permission, you can change the installation directory:
> configure --prefix=<path to be installed>
If the latest version of autotools is not installed, the program may not compile. In this case, autotools (automake, autoconf, and m4) should be installed before compiling:
automake 1.15: http://ftp.gnu.org/gnu/automake/automake-1.15.tar.gz
autoconf 2.69: http://ftp.gnu.org/gnu/autoconf/autoconf-2.69.tar.gz
m4 1.4.17: http://ftp.gnu.org/gnu/m4/m4-1.4.17.tar.gz
Likewise, the installation path for these tools can be changed with the configure command. To run the program on Stampede, they should be installed under your account.
A.2. EXECUTION
The program runs in parallel with the option ‘-p’.
Usage: grep -p <number of threads> <pattern> <files>
Ex> grep -p 4 --color=always apple test.txt
Ex> grep -p 4 --color=always -r apple /test/*
The number of patterns must be one.
REFERENCES
[1] Maxime Crochemore, Zvi Galil, Leszek Gasieniec, Kunsoo Park, and Wojciech Rytter. Constant-time
randomized parallel string matching. SIAM Journal on Computing, 26(4):950–960, 1997.
[2] Zvi Galil. Optimal parallel algorithms for string matching. Information and Control, 67(1):144–157,
1985.
[3] Uzi Vishkin. Optimal parallel pattern matching in strings. Information and Control, 67(1-3):91–113, 1985.