Efficient String Matching with Aho-Corasick Algorithm
1. String Matching with Finite
Automata
Aho-Corasick String Matching
By Waqas Shehzad
Fast NU Pakistan
2. String Matching
Whenever you use a search engine, or
a “find” function like grep, you are
utilizing a string matching program.
Many of these programs create finite
automata in order to effectively search
for your string.
3. Finite state machines
A finite state machine (FSM, also
known as a deterministic finite
automaton or DFA) is a way of
representing a language: we represent
the language as the set of strings
accepted by some program. So, once we
have found the right machine, we can
test whether a given string matches
just by running it.
4. How it works
We'll draw pictures with circles and arrows. A
circle will represent a state, an arrow with a
label will represent that we go to that state if
we see that character.
A finite automaton accepts the strings of a
specific language. It begins in state q0 and
reads characters one at a time from the input
string. It makes transitions (δ) based on
these characters, and if it is in one of the
accept states when it reaches the end of the
tape, the string is accepted by the
automaton.
5. Example
An example that could be used by the C preprocessor (a part of most C compilers)
to tell which characters are part of comments and can be removed from the input.
Finite automata can be viewed as just a special kind of graph, and we can use any of
the normal graph representations to store them.
6. cont
One particularly useful representation is a transition
table: we make a table with rows indexed by states
and columns indexed by possible input characters.
Each entry gives the next state.
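
As a rough illustration of a transition table, the sketch below encodes a four-state machine for C-style /* ... */ comments as a Python dictionary keyed by state and then by character. The state names, the "other" default column and the strip_comments helper are assumptions made for this example, not taken from the slides; string literals and // comments are deliberately ignored.

```python
# Transition table: rows are states, columns are input characters.
# The "other" column stands for every character not listed explicitly.
TABLE = {
    "code":    {"/": "slash", "other": "code"},
    "slash":   {"/": "slash", "*": "comment", "other": "code"},
    "comment": {"*": "star", "other": "comment"},
    "star":    {"/": "code", "*": "star", "other": "comment"},
}


def step(state, ch):
    """Look up the next state in the table."""
    row = TABLE[state]
    return row.get(ch, row["other"])


def strip_comments(text):
    """Copy text to the output, dropping everything inside /* ... */."""
    out = []
    state = "code"
    for ch in text:
        nxt = step(state, ch)
        if state == "code" and nxt == "code":
            out.append(ch)
        elif state == "slash" and nxt == "code":
            out.append("/" + ch)   # the held '/' was ordinary code
        elif state == "slash" and nxt == "slash":
            out.append("/")        # emit the first '/', keep holding the second
        state = nxt
    if state == "slash":
        out.append("/")            # input ended on a pending '/'
    return "".join(out)
```

Calling strip_comments on a line of C source returns the same line with the comment bodies and their delimiters removed; all the decision logic lives in the table, so changing the language only means changing the rows.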
7. Finite Automata
A finite automaton is a quintuple (Q, Σ, δ, s, F):
Q: the finite set of states
Σ: the finite input alphabet
δ: the “transition function” from Q × Σ to Q
s ∈ Q: the start state
F ⊆ Q: the set of final (accepting) states
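
The quintuple maps fairly directly onto a small data structure. The following is a minimal sketch, assuming a hypothetical DFA class and a made-up two-state machine that accepts strings over {a, b} containing an even number of a's; neither appears in the slides.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: frozenset     # Q
    alphabet: frozenset   # Σ
    delta: dict           # δ, keyed by (state, symbol)
    start: str            # s
    accepting: frozenset  # F

    def accepts(self, text):
        """Run the automaton on text and report whether it ends in F."""
        state = self.start
        for ch in text:
            state = self.delta[(state, ch)]
        return state in self.accepting


# Hypothetical example machine: an even number of a's.
even_a = DFA(
    states=frozenset({"even", "odd"}),
    alphabet=frozenset("ab"),
    delta={
        ("even", "a"): "odd", ("even", "b"): "even",
        ("odd", "a"): "even", ("odd", "b"): "odd",
    },
    start="even",
    accepting=frozenset({"even"}),
)
```

Here even_a.accepts("abba") is true and even_a.accepts("ab") is false, because the second string leaves the machine in the non-accepting state "odd".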
8. Example: nano
State diagram for finding the word "nano" with the grep
utility.
Simulating this on the string "banananona",
we get the sequence of states empty, empty, empty, "n", "na", "nan",
"na", "nan", "nano", "nano", "nano".
10. Running Time of
Compute-Transition-Function
For a pattern of length m and a text of length n,
it takes something like O(m^3 + n) time, treating
the alphabet size as a constant:
O(m^3) to build the state table described
above,
O(n) to simulate it on the input file.
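
To make the two costs concrete, here is a sketch of the naive table construction followed by the linear scan. The function names are hypothetical, and taking the alphabet from the pattern and text is an assumption made to keep the example short. Three nested loops over at most m + 1 values, each with a string comparison of length up to m, give the cubic build cost (times the alphabet size).

```python
def compute_transition_function(pattern, alphabet):
    """For every state q and character a, find the longest prefix of
    pattern that is a suffix of pattern[:q] + a."""
    m = len(pattern)
    table = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            k = min(m, q + 1)
            while not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1              # k = 0 always succeeds (empty prefix)
            row[a] = k
        table.append(row)
    return table


def scan(pattern, text):
    """Simulate the table on text; return (state trace, match start indices)."""
    alphabet = set(pattern) | set(text)
    table = compute_transition_function(pattern, alphabet)
    q, trace, hits = 0, [0], []
    for i, ch in enumerate(text):
        q = table[q][ch]            # one table lookup per input character
        trace.append(q)
        if q == len(pattern):
            hits.append(i - len(pattern) + 1)
    return trace, hits
```

For scan("nano", "banananona") the states are numbered by how many characters of the pattern have been matched, so state 3 corresponds to "nan". Unlike a machine that stops at the first match, this table keeps looking for further occurrences after reaching the final state.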
12. Introduction
Locate all occurrences of any of a finite
number of keywords in a string of text.
Consists of constructing a finite state
pattern matching machine from the
keywords and then using the pattern
matching machine to process the text
string in a single pass.
13. Pattern Matching Machine(1)
Let K = {y_1, y_2, ..., y_k} be a finite set of
strings which we shall call keywords,
and let x be an arbitrary string which we
shall call the text string.
The behavior of the pattern matching
machine is dictated by three functions:
a goto function g , a failure function f ,
and an output function output.
15. Pattern Matching Machine(2)
Goto function g : maps a pair consisting of
a state and an input symbol into a state or the
message fail.
Failure function f : maps a state into a
state, and is consulted whenever the goto
function reports fail.
Output function output : associates a set of
keywords (possibly empty) with every state.
17. Start state is state 0.
Let s be the current state and a the
current symbol of the input string x.
Operating cycle:
If g(s, a) = s', the machine makes a goto
transition: it enters state s' and the next
symbol of x becomes the current input symbol.
If g(s, a) = fail, the machine makes a failure
transition: it consults the failure function f.
If f(s) = s', the machine repeats the cycle
with s' as the current state and a as the
current input symbol.
19. Example
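Text: u s h e r s
State: 0 0 3 4 5 (2) 8 9
State 2, in parentheses, is entered by a failure
transition from state 5 without consuming an
input symbol.
In state 4, since g(4, e) = 5, the
machine enters state 5, finds the
keywords “she” and “he” ending at
position four of the text string, and emits output(5).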
20. Example Cont’d
In state 5 on input symbol r, the machine
makes two state transitions in its
operating cycle.
Since g(5, r) = fail, M enters state 2 = f(5).
Then since g ( 2, r ) = 8, M enters state 8 and
advances to the next input symbol.
No output is generated in this operating
cycle.
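
The operating cycle and the "ushers" trace can be reproduced with a short sketch. The tables below are written out by hand and assume the keyword set {he, she, his, hers} with the state numbering used in the example; "hers" is an assumption, since the slides name only "he", "she" and "his". The names GOTO, FAILURE, OUTPUT and run are hypothetical.

```python
FAIL = None

# Hand-written goto function g (assumed keywords: he, she, his, hers).
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8, (3, "h"): 4,
    (4, "e"): 5, (6, "s"): 7,
    (8, "s"): 9,
}
# Failure function f and output function for the same machine.
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: {"he"}, 5: {"she", "he"}, 7: {"his"}, 9: {"hers"}}


def g(state, symbol):
    """Goto lookup; state 0 loops to itself instead of failing."""
    if (state, symbol) in GOTO:
        return GOTO[(state, symbol)]
    return 0 if state == 0 else FAIL


def run(text):
    """Return (list of states entered, list of (end position, keywords))."""
    state = 0
    visited = [0]
    matches = []
    for position, symbol in enumerate(text, start=1):
        while g(state, symbol) is FAIL:   # failure transitions, same symbol
            state = FAILURE[state]
            visited.append(state)
        state = g(state, symbol)          # goto transition, next symbol
        visited.append(state)
        if state in OUTPUT:
            matches.append((position, OUTPUT[state]))
    return visited, matches
```

Running run("ushers") visits the states 0, 0, 3, 4, 5, 2, 8, 9 and reports "she" and "he" ending at position 4, matching the walkthrough, plus the assumed keyword "hers" ending at position 6.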
21. Constructing the functions
The construction has two parts.
First : Determine the states and the goto
function.
Second : Compute the failure function.
Computation of the output function begins in
the first part and is completed in the second.
22. Construction of Goto function
Construct a goto graph, beginning with a
single vertex for the start state.
Enter each keyword by adding new vertices
and edges to the graph, starting at the
start state.
Add new edges only when necessary.
Add a loop from state 0 to state 0 on all
input symbols other than those that begin
a keyword.
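
The construction steps above can be sketched as a trie insertion. This is an illustrative version only: build_goto and g are hypothetical names, and representing the loop at state 0 inside the lookup function rather than as explicit edges is a design choice made here for brevity.

```python
def build_goto(keywords):
    """Build the goto graph as a trie; returns (goto, output)."""
    goto = [{}]    # goto[state][symbol] -> next state; state 0 is the start
    output = {}    # partial output function: state -> set of keywords
    for word in keywords:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:   # add an edge only when necessary
                goto.append({})
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output.setdefault(state, set()).add(word)
    return goto, output


def g(goto, state, symbol):
    """Goto lookup; None stands for fail. State 0 never fails."""
    if symbol in goto[state]:
        return goto[state][symbol]
    return 0 if state == 0 else None
```

With the assumed keyword list ["he", "she", "his", "hers"] entered in that order, this yields ten states, and the shared prefix "h" of "he", "his" and "hers" is stored only once.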
27. About construction
When we determine f(s) = s', we merge the
outputs of state s with the outputs of state s'.
The failure function can cause unnecessary
transitions: if the keyword “his” were not present,
the machine could go directly from state 4 to state 0,
skipping an unnecessary intermediate
transition to state 1.
To avoid this, we can use a deterministic
finite automaton, which is discussed later.
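
One common way to compute the failure function is a breadth-first pass over the goto graph, merging outputs as each f(s) is found. The sketch below follows that approach under stated assumptions: build_machine is a hypothetical name, and the keyword set in the usage note includes "hers", which the slides do not mention.

```python
from collections import deque


def build_machine(keywords):
    # Part 1: goto graph (trie) and partial output function.
    goto, output = [{}], [set()]
    for word in keywords:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output[state].add(word)

    # Part 2: failure function, computed in order of increasing depth.
    failure = [0] * len(goto)
    queue = deque(goto[0].values())      # depth-1 states fail to state 0
    while queue:
        r = queue.popleft()
        for symbol, s in goto[r].items():
            queue.append(s)
            state = failure[r]
            # Follow failure links until a goto on symbol exists;
            # state 0 never fails, so the loop stops there.
            while state != 0 and symbol not in goto[state]:
                state = failure[state]
            failure[s] = goto[state].get(symbol, 0)
            output[s] |= output[failure[s]]   # merge outputs of s and f(s)
    return goto, failure, output
```

For ["he", "she", "his", "hers"] this gives f(4) = 1 and f(5) = 2, and the merge step is what adds "he" to the output of state 5 alongside "she".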
28. Time Complexity of Algorithms 1,
2, and 3
Algorithm 1 (the pattern matching machine)
makes fewer than 2n state transitions in
processing a text string of length n.
Algorithm 2 (construction of the goto function)
requires time linearly proportional to the sum
of the lengths of the keywords.
Algorithm 3 (construction of the failure function)
can be implemented to run in time proportional
to the sum of the lengths of the keywords.
29. Eliminating Failure Transitions
In Algorithm 1, the goto and failure functions
can be replaced by a next move function δ
such that δ(s, a) is a state for each state s
and input symbol a.
By using the next move function δ, we
can dispense with all failure transitions
and make exactly one state transition
per input character.
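
A next move function can be filled in breadth-first, copying each missing entry from the state the failure function points to. The following is a sketch under assumptions: build_dfa and search are hypothetical names, the alphabet is taken from the keywords and the text, and the keyword set in the usage note includes "hers", which is not named in the slides.

```python
from collections import deque


def build_dfa(keywords, alphabet):
    alphabet = set(alphabet) | set("".join(keywords))
    # Goto graph (trie) and partial outputs.
    goto, output = [{}], [set()]
    for word in keywords:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        output[state].add(word)

    # Next move function delta, filled in order of increasing depth.
    delta = [dict() for _ in goto]
    failure = [0] * len(goto)
    for a in alphabet:
        delta[0][a] = goto[0].get(a, 0)
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for a in alphabet:
            if a in goto[r]:
                s = goto[r][a]
                failure[s] = delta[failure[r]][a]
                output[s] |= output[failure[s]]
                delta[r][a] = s
                queue.append(s)
            else:
                # Precompute what the failure transitions would have done.
                delta[r][a] = delta[failure[r]][a]
    return delta, output


def search(keywords, text):
    delta, output = build_dfa(keywords, set(text))
    state, matches = 0, []
    for position, a in enumerate(text, start=1):
        state = delta[state][a]   # exactly one transition per character
        if output[state]:
            matches.append((position, output[state]))
    return matches
```

The trade-off is visible in the data structure: delta stores one entry per state and alphabet symbol, whereas the goto graph stores only the trie edges.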
32. Conclusion
Attractive for large numbers of
keywords, since all keywords can be
matched simultaneously in one pass.
Using the next move function
can reduce the number of state transitions
by up to 50%, but requires more memory.
The machine spends most of its time in
state 0, from which there are no failure
transitions.