The document discusses parallelizing the string-matching Aho-Corasick algorithm. It describes splitting the text into chunks that are processed independently by threads. However, this can miss occurrences that span chunks, so an overlap between chunks is used. The matching phase is parallelized using OpenMP pragmas. Testing on large Bible and dictionary files showed speedups from using multiple threads, up to a point determined by hardware threads.
Parallelization of the Aho-Corasick string-matching algorithm
1. Parallelization of a string-matching algorithm
Advanced Algorithms
Alessandro Liparoti
2. String-matching: AC algorithm
String-matching algorithms are a class of algorithms
that aim to find occurrences of words (patterns) within
a larger string (text)
The Aho-Corasick algorithm (AC) is a classic solution to
exact set matching.
Given
pattern set $P = \{P_1, \dots, P_k\}$
text $T[1 \dots m]$
total length of patterns $n = \sum_{i=1}^{k} |P_i|$
the AC algorithm's complexity is $O(n + m + z)$, where $z$
is the number of pattern occurrences in $T$
3. AC algorithm: finite-state machine
The AC algorithm builds a finite-state automaton (FSA) to
store the pattern set efficiently
The FSA is stored along with three functions
the goto function $g(q, a)$ gives the state entered
from the current state $q$ by matching the target char $a$
the failure function $f(q)$, $q \neq 0$, gives the state
entered at a mismatch
the output function $out(q)$ gives the set of patterns
recognized when entering state $q$
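The three functions can be held in plain arrays indexed by state number. The following is a minimal sketch of one way to build them, not the deck's actual code: the names `ac_automaton`, `ac_build`, `MAX_STATES` and `NO_EDGE` are hypothetical, and representing out(q) as a 32-bit mask (so at most 32 patterns) is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <string.h>

#define ALPHABET   256   /* one transition per byte value */
#define MAX_STATES 1024  /* hypothetical capacity for this sketch */
#define NO_EDGE    (-1)  /* stands for "g(q, a) is undefined" */

typedef struct {
    int go[MAX_STATES][ALPHABET]; /* goto function g(q, a) */
    int fail[MAX_STATES];         /* failure function f(q) */
    unsigned out[MAX_STATES];     /* out(q) as a bitmask: bit j set = pattern j */
    int nstates;
} ac_automaton;

/* Builds g, f and out for up to 32 patterns (the width of the bitmask). */
void ac_build(ac_automaton *ac, const char *const *patterns, int k)
{
    assert(ac && patterns && k > 0 && k <= 32);
    memset(ac->go, 0xFF, sizeof ac->go);  /* all bytes 0xFF: every entry is -1 */
    memset(ac->fail, 0, sizeof ac->fail);
    memset(ac->out, 0, sizeof ac->out);
    ac->nstates = 1;                      /* state 0 is the root */

    /* Phase 1: insert each pattern into the trie (goto function). */
    for (int j = 0; j < k; j++) {
        int q = 0;
        for (const unsigned char *p = (const unsigned char *)patterns[j]; *p; p++) {
            if (ac->go[q][*p] == NO_EDGE) {
                assert(ac->nstates < MAX_STATES);
                ac->go[q][*p] = ac->nstates++;
            }
            q = ac->go[q][*p];
        }
        ac->out[q] |= 1u << j;
    }
    /* The root never fails: its missing edges loop back to state 0. */
    for (int a = 0; a < ALPHABET; a++)
        if (ac->go[0][a] == NO_EDGE) ac->go[0][a] = 0;

    /* Phase 2: breadth-first traversal to fill f and complete out. */
    int queue[MAX_STATES], head = 0, tail = 0;
    for (int a = 0; a < ALPHABET; a++)
        if (ac->go[0][a] != 0) queue[tail++] = ac->go[0][a];  /* depth 1: f = 0 */
    while (head < tail) {
        int r = queue[head++];
        for (int a = 0; a < ALPHABET; a++) {
            int s = ac->go[r][a];
            if (s == NO_EDGE) continue;
            queue[tail++] = s;
            int q = ac->fail[r];
            while (ac->go[q][a] == NO_EDGE) q = ac->fail[q];
            ac->fail[s] = ac->go[q][a];
            ac->out[s] |= ac->out[ac->fail[s]];  /* inherit matches of the suffix */
        }
    }
}
```

The structure is about 1 MB with these constants, so it is better declared `static` or allocated on the heap than placed on the stack. For the pattern set {he, she, his, hers}, the state reached by "she" fails to the state reached by "he" and outputs both patterns.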
5. AC algorithm: matching phase
The AC algorithm uses the FSA to match the text
against the keywords

AC_matching(T[1 … m])
    q := 0;                          // initial state (root)
    for i := 1 to m do
        while g(q, T[i]) = ∅ do
            q := f(q);               // follow a fail
        q := g(q, T[i]);             // follow a goto
        if out(q) ≠ ∅ then print i, out(q);
    endfor

Here ∅ means that no goto transition is defined; g(0, a) is defined
for every character a, so the while loop always stops at the root
The number of steps of the loop is equal to the length
of the text
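The matching loop translates almost line by line into C. The sketch below assumes the goto, failure and output tables already exist and, since the implementation described later only reports a count, stores the size of out(q) instead of the set itself. The names `ac_count`, `demo_count_ab` and `NO_EDGE` are hypothetical, and the tables in `demo_count_ab` are written by hand for the single pattern "ab" purely to make the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

#define ALPHABET 256
#define NO_EDGE  (-1)   /* stands for "g(q, a) has no transition" */

/* Counts the pattern occurrences in text[0..m-1].
 * go, fail and out_count are assumed to be prebuilt; go[0][a] must never
 * be NO_EDGE, and out_count[q] is the size of out(q). */
size_t ac_count(int (*go)[ALPHABET], const int *fail, const int *out_count,
                const unsigned char *text, size_t m)
{
    assert(go && fail && out_count && text);
    size_t total = 0;
    int q = 0;                              /* initial state (root) */
    for (size_t i = 0; i < m; i++) {
        while (go[q][text[i]] == NO_EDGE)
            q = fail[q];                    /* follow a fail */
        q = go[q][text[i]];                 /* follow a goto */
        total += (size_t)out_count[q];      /* occurrences ending at index i */
    }
    return total;
}

/* Hand-built tables for the single pattern "ab":
 * states 0 (root), 1 ("a") and 2 ("ab"). */
size_t demo_count_ab(const char *text, size_t m)
{
    static int go[3][ALPHABET];
    static const int fail[3] = {0, 0, 0};
    static const int out_count[3] = {0, 0, 1};
    for (int q = 0; q < 3; q++)
        for (int a = 0; a < ALPHABET; a++)
            go[q][a] = (q == 0) ? 0 : NO_EDGE;  /* the root loops on unknown chars */
    go[0]['a'] = 1;
    go[1]['b'] = 2;
    return ac_count(go, fail, out_count, (const unsigned char *)text, m);
}
```

The outer loop executes exactly m times; the inner `while` only moves towards the root, which is why the total matching cost stays linear in the text length.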
6. Parallelization step
Idea: parallelize the matching phase of the AC
algorithm (the FSA needs to be built only once for each
pattern set)
The $m$ steps of the loop can be split into $k$ chunks,
each of length $l = \lceil m / k \rceil$, and each chunk
can then be processed by a thread
This is feasible because a chunk can be analyzed
independently
Example: $m = 19$, $k = 3$, $l = 7$
7. Parallelization: problems
Splitting the text as described above can lead
to missed occurrences
Let us assume $P = \{adv, orit, ed\}$
Each thread would run AC on its own chunk
Thread 1: $T$ = "advance"
Thread 2: $T$ = "d algor"
Thread 3: $T$ = "ithms"
None of them would find the occurrences of the
second and third keywords, because both span a chunk boundary
Some redundancy is needed for text that spans two
chunks
8. Parallelization: solutions
The maximum overlap needed, $o$, is the length of
the longest word in the pattern set minus 1
Each chunk will contain the last $o$ characters of the
previous one
However: orit is now correctly found by thread 3, but ed
is incorrectly matched twice (by threads 1 and 2)
Correction: each thread starts counting matches only after
it has read the $o$ overlapping characters
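The overlap and the counting rule come down to index arithmetic. The sketch below shows one possible layout, assumed here for illustration: disjoint chunks of length ⌈m/k⌉, each extended backwards by o characters, with matches counted only when they end inside the chunk's own range. The type `chunk_bounds` and the function `chunk_with_overlap` are hypothetical names, and the deck's implementation may lay out its chunks differently.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t scan_from;   /* first index the thread reads (includes the overlap) */
    size_t count_from;  /* matches ending before this index are not counted */
    size_t end;         /* one past the last index of the chunk */
} chunk_bounds;

/* Bounds of chunk t (0-based) when a text of length m is split into k chunks
 * of length l = ceil(m / k), each extended backwards by o characters. */
chunk_bounds chunk_with_overlap(size_t m, size_t k, size_t o, size_t t)
{
    assert(k > 0 && t < k);
    size_t l = (m + k - 1) / k;
    chunk_bounds b;
    b.count_from = t * l < m ? t * l : m;
    b.end = b.count_from + l < m ? b.count_from + l : m;
    b.scan_from = b.count_from > o ? b.count_from - o : 0;  /* chunk 0 has no overlap */
    return b;
}
```

With m = 19, k = 3 and o = 3 (longest pattern "orit"), the second thread reads indices 4 to 13 and counts from index 7, and the third reads 11 to 18 and counts from 14. Any pattern of length at most o + 1 that ends at or after `count_from` starts at or after `scan_from`, so it is seen in full by exactly one thread.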
9. Implementation
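AC has been implemented in C using OpenMP; the
matching phase has been split among threads
using the OpenMP for directive (#pragma omp for)
Input: text, keywords, number of threads
Output: number of occurrences
The chunk size $l$ is computed with the following
formula
$l = \frac{m + ov\,(k - 1)}{k}$
where $ov$ is the overlap $o$: $k$ chunks of length $l$, with consecutive chunks sharing $ov$ characters, cover $k\,l - ov\,(k - 1) = m$ characters
The output variable is aggregated after the end of
the loop (reduction clause)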
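A compact sketch of how the pieces might fit together is given here; it is not the deck's code. Several choices are assumptions: the names `ac_build`, `parallel_count` and the `ac_*` tables are hypothetical; missing goto transitions are precomputed during construction (a common variation that removes the inner fail-following loop); and chunks are laid out as disjoint ranges of length ⌈m/k⌉ extended backwards by o characters instead of using the chunk-size formula above. Compile with an OpenMP flag such as `-fopenmp`; without it the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ALPHABET   256
#define MAX_STATES 4096   /* hypothetical capacity for this sketch */

static int ac_go[MAX_STATES][ALPHABET];  /* goto, completed so no lookup fails */
static int ac_fail[MAX_STATES];
static int ac_nout[MAX_STATES];          /* number of patterns ending in a state */
static int ac_nstates;

/* Builds the automaton; returns the overlap o = (longest pattern length) - 1. */
size_t ac_build(const char *const *pat, int k)
{
    size_t longest = 0;
    memset(ac_go, 0xFF, sizeof ac_go);   /* every entry becomes -1 (no edge) */
    memset(ac_nout, 0, sizeof ac_nout);
    ac_nstates = 1;
    for (int j = 0; j < k; j++) {        /* trie insertion */
        int q = 0;
        size_t len = strlen(pat[j]);
        if (len > longest) longest = len;
        for (size_t i = 0; i < len; i++) {
            unsigned char a = (unsigned char)pat[j][i];
            if (ac_go[q][a] < 0) {
                assert(ac_nstates < MAX_STATES);
                ac_go[q][a] = ac_nstates++;
            }
            q = ac_go[q][a];
        }
        ac_nout[q]++;
    }
    int queue[MAX_STATES], head = 0, tail = 0;
    ac_fail[0] = 0;
    for (int a = 0; a < ALPHABET; a++) {
        if (ac_go[0][a] < 0) ac_go[0][a] = 0;             /* root loops on itself */
        else { ac_fail[ac_go[0][a]] = 0; queue[tail++] = ac_go[0][a]; }
    }
    while (head < tail) {                /* breadth-first: fail links, missing edges */
        int r = queue[head++];
        ac_nout[r] += ac_nout[ac_fail[r]];
        for (int a = 0; a < ALPHABET; a++) {
            int s = ac_go[r][a];
            if (s < 0) ac_go[r][a] = ac_go[ac_fail[r]][a];
            else { ac_fail[s] = ac_go[ac_fail[r]][a]; queue[tail++] = s; }
        }
    }
    return longest > 0 ? longest - 1 : 0;
}

/* Counts occurrences in text[0..m-1] using k chunks, one loop iteration each. */
long parallel_count(const unsigned char *text, size_t m, int k, size_t o)
{
    assert(text && k > 0);
    size_t l = (m + (size_t)k - 1) / (size_t)k;  /* chunk length without overlap */
    long total = 0;
    #pragma omp parallel for reduction(+:total) num_threads(k)
    for (int t = 0; t < k; t++) {
        size_t count_from = (size_t)t * l < m ? (size_t)t * l : m;
        size_t end = count_from + l < m ? count_from + l : m;
        size_t i = count_from > o ? count_from - o : 0;  /* re-read the overlap */
        int q = 0;                                       /* each thread starts at the root */
        for (; i < end; i++) {
            q = ac_go[q][text[i]];
            if (i >= count_from) total += ac_nout[q];    /* skip matches in the overlap */
        }
    }
    return total;
}
```

The `reduction(+:total)` clause gives each thread a private copy of `total` and sums the copies when the loop ends, so no locking is needed inside the matching loop. On the text "advanced algorithms" with the patterns adv, orit and ed, the count is 3 regardless of the number of chunks.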
10. Implementation
Each character read is converted to its numeric code
(its byte value, 0 to 255)
Therefore, each state of the FSA
allows 256 different
transitions
This allows the AC
algorithm to be used even with
non-textual files
Binary files must be
read bytewise
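Reading the input as raw bytes could look like the sketch below. The function name `read_all_bytes` is hypothetical, and the deck does not show how its implementation loads files. The buffer is `unsigned char` so that every element is a value from 0 to 255 and can index a 256-entry transition table directly; a plain `char` may be signed and would yield negative indices for bytes above 127.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads every remaining byte of fp into a malloc'd buffer.
 * Returns NULL on failure; the caller frees the result. */
unsigned char *read_all_bytes(FILE *fp, size_t *len)
{
    assert(fp && len);
    size_t cap = 4096, n = 0;
    unsigned char *buf = malloc(cap);
    if (!buf) return NULL;
    for (;;) {
        n += fread(buf + n, 1, cap - n, fp);
        if (n < cap) break;                  /* short read: end of file or error */
        unsigned char *grown = realloc(buf, cap * 2);
        if (!grown) { free(buf); return NULL; }
        buf = grown;
        cap *= 2;
    }
    if (ferror(fp)) { free(buf); return NULL; }
    *len = n;
    return buf;
}
```

The stream should be opened in binary mode, for example `fopen(path, "rb")`, so that no newline translation alters the bytes on platforms that distinguish text and binary streams.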
11. Test
Very large input files have been used in order to
test the algorithm's performance
a text file containing the English version of the
Bible
a dictionary including the 10,000 most common
English words
Each test aggregates the measurements (mean and minimum) of
10 runs of the algorithm on these inputs
using the same number of threads
12. Test
[Figure: execution time (sec) vs. number of threads (1 to 10) on an i7 4700MQ (4 cores/8 threads); series: Mean, Minimum]
14. Conclusion
This work has shown a parallelization
procedure for an algorithm designed to run serially
The more threads are used, the faster the
algorithm runs, up to a certain point after which
no further improvement is obtained
Parallelization improves performance but requires
modifications that are not always clear from the beginning
and that often introduce overhead