Parallelization of the Aho-Corasick string-matching algorithm

Parallelization of a string-matching algorithm
Advanced Algorithms
Alessandro Liparoti

<Name Surname>
2
String-matching: AC algorithm
• String-matching algorithms are a class of algorithms that find the occurrences of words (patterns) within a larger string (text)
• The Aho-Corasick algorithm (AC) is a classic solution to the exact set matching problem
• Given
  - a pattern set 𝑃 = {𝑃1, …, 𝑃𝑘}
  - a text 𝑇[1 … 𝑚]
  - the total length of the patterns $n = \sum_{i=1}^{k} |P_i|$
the complexity of the AC algorithm is 𝑂(𝑛 + 𝑚 + 𝑧), where 𝑧 is the number of pattern occurrences in 𝑇

AC algorithm: finite-state machine
• The AC algorithm builds a finite-state automaton (FSA) that stores the pattern set compactly
• The FSA is stored together with three functions
  - the goto function 𝑔(𝑞, 𝑎) gives the state entered from the current state 𝑞 when the next text character 𝑎 matches
  - the failure function 𝑓(𝑞), 𝑞 ≠ 0, gives the state entered on a mismatch
  - the output function 𝑜𝑢𝑡(𝑞) gives the set of patterns recognized on entering state 𝑞

AC algorithm: FSA example
• 𝑃 = {he, she, his, hers}
• Dashed arrows are fail transitions

AC algorithm: matching phase
• The AC algorithm uses the FSA to match the text against the keywords
• AC_matching(T[1 … m])
    q := 0; // initial state (root)
    for i := 1 to m do
        while g(q, T[i]) = fail do
            q := f(q); // follow a fail
        q := g(q, T[i]); // follow a goto
        if out(q) ≠ ∅ then print i, out(q);
    endfor
• Here 𝑔 returns fail when state 𝑞 has no transition on 𝑇[𝑖]; the root never fails, so the while loop always terminates
• The for loop runs exactly 𝑚 iterations, one per text character
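The three functions and the matching loop can be sketched together in C. This is a minimal illustration, not the implementation behind these slides: the names (`ac_t`, `ac_build`, `ac_count`, `count_occurrences`, `FSA_EXAMPLE_PATTERNS`), the fixed `MAX_STATES` bound, and the choice to store only the size of out(q) instead of the pattern set itself are all assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STATES 64   /* assumed bound: total pattern length + 1 */
#define ALPHABET   256  /* one possible transition per byte value */
#define FAIL       (-1) /* "no goto transition from this state" */

typedef struct {
    int go[MAX_STATES][ALPHABET]; /* goto function g(q, a) */
    int fail[MAX_STATES];         /* failure function f(q) */
    int out[MAX_STATES];          /* |out(q)|: patterns recognized in q */
    int n_states;
} ac_t;

/* Pattern set from the FSA example slide. */
const char *const FSA_EXAMPLE_PATTERNS[] = { "he", "she", "his", "hers" };

int ac_new_state(ac_t *ac)
{
    assert(ac->n_states < MAX_STATES);
    int q = ac->n_states++;
    for (int a = 0; a < ALPHABET; a++) ac->go[q][a] = FAIL;
    ac->fail[q] = 0;
    ac->out[q] = 0;
    return q;
}

void ac_build(ac_t *ac, const char *const *patterns, int k)
{
    ac->n_states = 0;
    ac_new_state(ac);                       /* state 0 is the root */

    /* Goto function: insert every pattern into a trie. */
    for (int p = 0; p < k; p++) {
        int q = 0;
        for (const char *s = patterns[p]; *s; s++) {
            unsigned char a = (unsigned char)*s;
            if (ac->go[q][a] == FAIL) {
                int r = ac_new_state(ac);
                ac->go[q][a] = r;
            }
            q = ac->go[q][a];
        }
        ac->out[q]++;
    }
    /* The root never fails: unused symbols loop back to it. */
    for (int a = 0; a < ALPHABET; a++)
        if (ac->go[0][a] == FAIL) ac->go[0][a] = 0;

    /* Failure function: breadth-first traversal of the trie. */
    int queue[MAX_STATES], head = 0, tail = 0;
    for (int a = 0; a < ALPHABET; a++)
        if (ac->go[0][a] != 0) queue[tail++] = ac->go[0][a];
    while (head < tail) {
        int r = queue[head++];
        for (int a = 0; a < ALPHABET; a++) {
            int s = ac->go[r][a];
            if (s == FAIL) continue;
            queue[tail++] = s;
            int q = ac->fail[r];
            while (ac->go[q][a] == FAIL) q = ac->fail[q];
            ac->fail[s] = ac->go[q][a];
            ac->out[s] += ac->out[ac->fail[s]]; /* patterns ending as a suffix */
        }
    }
}

/* Matching phase: returns the number of occurrences in text[0..m). */
long ac_count(const ac_t *ac, const char *text, size_t m)
{
    long occurrences = 0;
    int q = 0;                                        /* root */
    for (size_t i = 0; i < m; i++) {
        unsigned char a = (unsigned char)text[i];
        while (ac->go[q][a] == FAIL) q = ac->fail[q]; /* follow a fail */
        q = ac->go[q][a];                             /* follow a goto */
        occurrences += ac->out[q];
    }
    return occurrences;
}

/* Convenience wrapper: build, match, release. Returns -1 if out of memory. */
long count_occurrences(const char *const *patterns, int k, const char *text)
{
    ac_t *ac = malloc(sizeof *ac);
    if (ac == NULL) return -1;
    ac_build(ac, patterns, k);
    long n = ac_count(ac, text, strlen(text));
    free(ac);
    return n;
}
```

With the example pattern set, `count_occurrences(FSA_EXAMPLE_PATTERNS, 4, "ushers")` returns 3, for she, he and hers. Storing a count per state is enough when only the number of occurrences is needed; reporting which pattern matched would require keeping the pattern identifiers in each state.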

Parallelization step
• Idea: parallelize the matching phase of the AC algorithm (the FSA only needs to be built once per pattern set)
• The 𝑚 iterations of the loop can be split into 𝑘 chunks (here 𝑘 is the number of chunks, not the number of patterns), each of length $l = \lceil m/k \rceil$, and each chunk can be processed by its own thread
• This is feasible because each chunk can be scanned independently of the others
• Example: 𝑚 = 19, 𝑘 = 3, 𝑙 = 7

Parallelization: problems
• Splitting the text as described above can miss occurrences
• Assume 𝑃 = {adv, orit, ed}
• Each thread would run AC on its own chunk
  - Thread 1: 𝑇 = "advance"
  - Thread 2: 𝑇 = "d algor"
  - Thread 3: 𝑇 = "ithms"
• No thread would find the occurrences of the second and third keywords (orit and ed), because each one straddles a chunk boundary
• Adjacent chunks therefore need to overlap

Parallelization: solutions
• The maximum overlap needed, 𝑜, is the length of the longest pattern minus 1
• Each chunk starts with the last 𝑜 characters of the previous one
• However: orit is now correctly found by thread 3, but ed is matched twice (by threads 1 and 2)
• Correction: each thread starts counting matches only after it has read the first 𝑜 characters of its chunk (the first thread has no predecessor, so it counts from the start)
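The overlap and the counting rule come down to a little index arithmetic, sketched below. The names `chunk_t` and `chunk_bounds`, the 0-based half-open ranges, and the rounding-up of the chunk length are assumptions made for this illustration. They are not taken from the implementation behind these slides.

```c
#include <assert.h>
#include <stddef.h>

/* Text positions are 0-based; a chunk scans the half-open range [start, end). */
typedef struct {
    size_t start;         /* first position read by the thread */
    size_t end;           /* one past the last position read */
    size_t first_counted; /* matches ending before this position are ignored */
} chunk_t;

/* m: text length, k: number of chunks, o: overlap, t: chunk index (0-based).
 * Assumes m > o, so that consecutive chunks advance by at least one position. */
chunk_t chunk_bounds(size_t m, size_t k, size_t o, size_t t)
{
    assert(k > 0 && t < k && m > o);
    size_t l = (m + o * (k - 1) + k - 1) / k; /* chunk length, rounded up */
    size_t stride = l - o;                    /* distance between chunk starts */
    chunk_t c;
    c.start = t * stride;
    c.end = c.start + l;
    if (c.start > m) c.start = m;             /* clip to the text */
    if (c.end > m) c.end = m;
    c.first_counted = (t == 0) ? 0 : c.start + o;
    return c;
}
```

For the example (m = 19, k = 3, o = 3) this gives the ranges [0, 9), [6, 15) and [12, 19), that is "advanced ", "ed algori" and "orithms". Thread 2 ignores matches that end before position 9, so the occurrence of ed ending at position 7 is counted only by thread 1. Because each chunk's first counted position equals the previous chunk's end, every match end falls in exactly one counting range.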

Implementation
• AC has been implemented in C using OpenMP; the matching phase is split among threads with the omp for pragma
• Input: text, keywords, number of threads
• Output: number of occurrences
• The chunk size 𝑙 is computed with the following formula, where 𝑜𝑣 is the overlap
  $l = \frac{m + ov\,(k - 1)}{k}$
• The per-thread counts are combined into the output variable after the end of the loop (reduction clause)
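The loop structure can be sketched as follows. To keep the example short, the per-chunk scan is a naive `memcmp` scan that stands in for the AC matching loop; it is a swapped-in technique, not the matcher described in these slides. The names `scan_chunk`, `parallel_count`, `SPLIT_EXAMPLE_PATTERNS` and `SPLIT_EXAMPLE_TEXT`, the rounding-up of the chunk size, and the use of `num_threads(k)` are likewise assumptions. Only OpenMP pragmas are used, so the code gives the same result when compiled without OpenMP support.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pattern set and text from the splitting example. */
const char *const SPLIT_EXAMPLE_PATTERNS[] = { "adv", "orit", "ed" };
const char SPLIT_EXAMPLE_TEXT[] = "advanced algorithms";

/* Stand-in for the per-chunk AC scan: counts occurrences lying inside
 * text[start, end) whose last character is at position >= first_counted. */
long scan_chunk(const char *text, size_t start, size_t end,
                size_t first_counted,
                const char *const *patterns, int n_patterns)
{
    long found = 0;
    for (int p = 0; p < n_patterns; p++) {
        size_t len = strlen(patterns[p]);
        if (len == 0) continue;
        for (size_t i = start; i + len <= end; i++) {
            if (i + len - 1 >= first_counted &&
                memcmp(text + i, patterns[p], len) == 0)
                found++;
        }
    }
    return found;
}

/* Counts all occurrences in text[0..m) using k overlapping chunks. */
long parallel_count(const char *text, size_t m,
                    const char *const *patterns, int n_patterns, int k)
{
    assert(k > 0);

    /* Overlap: length of the longest pattern minus 1. */
    size_t ov = 0;
    for (int p = 0; p < n_patterns; p++) {
        size_t len = strlen(patterns[p]);
        if (len > ov + 1) ov = len - 1;
    }
    if (m <= ov) /* text too short to split: scan it as one chunk */
        return scan_chunk(text, 0, m, 0, patterns, n_patterns);

    /* Chunk size l = (m + ov (k - 1)) / k, rounded up. */
    size_t l = (m + ov * (size_t)(k - 1) + (size_t)k - 1) / (size_t)k;
    size_t stride = l - ov; /* at least 1 because m > ov */

    long total = 0;
    #pragma omp parallel for reduction(+:total) num_threads(k)
    for (int t = 0; t < k; t++) {
        size_t start = (size_t)t * stride;
        size_t end = start + l;
        if (start > m) start = m;
        if (end > m) end = m;
        /* All chunks but the first skip matches ending in the overlap. */
        size_t first_counted = (t == 0) ? 0 : start + ov;
        total += scan_chunk(text, start, end, first_counted,
                            patterns, n_patterns);
    }
    return total;
}
```

Calling `parallel_count(SPLIT_EXAMPLE_TEXT, 19, SPLIT_EXAMPLE_PATTERNS, 3, 3)` returns 3: adv and ed are counted by the first chunk and orit by the third. The `reduction(+:total)` clause gives each thread a private copy of `total` and sums the copies when the loop ends, so no explicit locking is needed.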

Implementation
• Each character read is converted to its byte value (its ASCII code for plain text)
• Therefore, the FSA allows 256 different transitions per state
• This makes it possible to run the AC algorithm on non-textual files as well
• Binary files must be read bytewise
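A minimal sketch of the byte handling, assuming a transition table indexed by byte value. The helper names `symbol_index` and `read_bytes` are hypothetical and not taken from the implementation behind these slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Map a char to a table index in [0, 255]. Converting through unsigned char
 * avoids negative indices on platforms where plain char is signed. */
size_t symbol_index(char c)
{
    return (size_t)(unsigned char)c;
}

/* Read a whole file as raw bytes; "rb" disables newline translation.
 * Returns a malloc'd buffer and stores its length in *len, or NULL on failure. */
unsigned char *read_bytes(const char *path, size_t *len)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) return NULL;
    size_t cap = 4096, n = 0;
    unsigned char *buf = malloc(cap);
    while (buf != NULL) {
        n += fread(buf + n, 1, cap - n, fp);
        if (n < cap) break;                 /* short read: end of file or error */
        cap *= 2;
        unsigned char *bigger = realloc(buf, cap);
        if (bigger == NULL) free(buf);
        buf = bigger;
    }
    if (buf != NULL && ferror(fp)) { free(buf); buf = NULL; }
    fclose(fp);
    if (buf != NULL) *len = n;
    return buf;
}
```

Indexing a transition row with a plain `char` would be a bug for bytes above 127 wherever `char` is signed, which is why the conversion goes through `unsigned char`.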

Test
• Very large input files were used to test the algorithm's performance
  - a text file containing the English version of the Bible
  - a dictionary of the 10,000 most common English words
• A single test aggregates 10 runs of the algorithm on these inputs, all with the same number of threads
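One way to aggregate the runs of a single test is to keep the mean and the minimum of the measured times. The sketch below assumes the timings are already available as an array of seconds; `run_stats_t` and `summarize_runs` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double mean; /* average execution time over the runs */
    double min;  /* fastest run */
} run_stats_t;

/* Aggregate n measured execution times (in seconds); n must be positive. */
run_stats_t summarize_runs(const double *seconds, size_t n)
{
    assert(n > 0);
    run_stats_t s = { 0.0, seconds[0] };
    for (size_t i = 0; i < n; i++) {
        s.mean += seconds[i];
        if (seconds[i] < s.min) s.min = seconds[i];
    }
    s.mean /= (double)n;
    return s;
}
```

The minimum is less sensitive than the mean to interference from other processes, so reporting both helps separate the algorithm's behaviour from system noise.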

Test
[Figure: execution time (sec) versus number of threads (1 to 10) on an i7 4700MQ, 4 cores/8 threads; series: mean, minimum]

Test
[Figure: execution time (sec) versus number of threads (1 to 30) on a 12 cores/24 threads machine; series: mean, minimum]

Conclusion
• This work presented a procedure for parallelizing an algorithm that was designed to run serially
• Adding threads makes the algorithm faster up to a certain point, after which no further improvement is obtained
• Parallelization improves performance, but it requires modifications that are not always obvious at the outset and that often introduce overhead
