Biosequence Algorithms, Spring 2012
Lecture 4: Set Matching and
Aho-Corasick Algorithm
Pekka Kilpeläinen
University of Eastern Finland
School of Computing
BSA Lecture 4: Aho-Corasick matching – p.1/27
Exact Set Matching Problem
In the exact set matching problem we locate occurrences
of any pattern of a set P = {P1, . . . , Pk}, in target T[1 . . . m]
Let $n = \sum_{i=1}^{k} |P_i|$. Exact set matching can be solved in time
O(|P1| + m + · · · + |Pk| + m) = O(n + km)
by applying any linear-time exact matching k times
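As a baseline for comparison, the k-pass approach can be sketched in Python as follows. This is only an illustration: the name naive_set_matching is made up here, and Python's built-in str.find stands in for a linear-time exact matcher (it may not be linear-time in every implementation).

```python
def naive_set_matching(patterns, text):
    """Report (start, pattern) pairs by scanning the text once per pattern."""
    hits = []
    for p in patterns:                       # k separate passes over T
        start = text.find(p)
        while start != -1:
            hits.append((start + 1, p))      # 1-based start position, as in the slides
            start = text.find(p, start + 1)  # step by one to allow overlapping occurrences
    return sorted(hits)


print(naive_set_matching(["he", "she", "his", "hers"], "ushers"))
```

Each pattern costs one full pass over the target, which is where the km term in the bound comes from.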
The Aho-Corasick algorithm (AC) is a classic solution to
exact set matching. It works in time O(n + m + z), where z
is the number of pattern occurrences in T
(Main reference here [Aho and Corasick, 1975])
AC is based on a refinement of a keyword tree
Keyword Trees
A keyword tree (or a trie) for a set of patterns P is a
rooted tree K such that
1. each edge of K is labeled by a character
2. any two edges out of a node have different labels
Define the label of a node v as the concatenation of edge
labels on the path from the root to v, and denote it by L(v)
3. for each P ∈ P there’s a node v with L(v) = P, and
4. the label L(v) of any leaf v equals some P ∈ P
Example of a Keyword Tree
A keyword tree for P = {he, she, his, hers}:
[Figure: the keyword tree, with root paths h-e, h-e-r-s, h-i-s and s-h-e; the nodes where he, she, his and hers end carry the identifiers 1, 2, 3 and 4.]
A keyword tree is an efficient implementation of a
dictionary of strings
Keyword Tree: Construction
Construction for P = {P1, . . . , Pk}:
Begin with a root node only;
Insert each pattern Pi, one after the other, as follows:
Starting at the root, follow the path labeled by chars of Pi;
If the path ends before Pi, continue it by adding new
edges and nodes for the remaining characters of Pi
Store identifier i of Pi at the terminal node of the path
This takes clearly O(|P1| + · · · + |Pk|) = O(n) time
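The construction steps above can be sketched in Python as follows. The representation is an assumption made for illustration (a list of per-node dictionaries, with node 0 as the root); the names build_keyword_tree, children and ident do not come from the lecture.

```python
def build_keyword_tree(patterns):
    """Return (children, ident).

    children[v] maps a character to the child of node v along that edge;
    ident[v] is the 1-based identifier of the pattern ending at node v.
    """
    children = [{}]          # node 0 is the root
    ident = {}
    for i, p in enumerate(patterns, start=1):
        v = 0
        for ch in p:
            if ch not in children[v]:        # path ends: add a new edge and node
                children.append({})
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        ident[v] = i                         # store identifier i at the terminal node
    return children, ident


children, ident = build_keyword_tree(["he", "she", "his", "hers"])
print(len(children), ident)   # 10 {2: 1, 5: 2, 7: 3, 9: 4}
```

Every character of every pattern is handled once, which matches the O(n) bound when dictionary access is treated as constant time.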
Keyword Tree: Lookup
Lookup of a string P: Starting at root, follow the path
labeled by characters of P as long as possible;
If the path leads to a node with an identifier, P is a
keyword in the dictionary
If the path terminates before P, the string is not in the
dictionary
Takes clearly O(|P|) time — An efficient look-up method!
Naive application to pattern matching would lead to Θ(nm)
time
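The lookup procedure can be sketched in Python as below. This is a minimal illustration using nested dictionaries as the trie; the marker key END and the function names are hypothetical choices, not part of the lecture.

```python
END = "$id"   # hypothetical key holding a pattern identifier; longer than any single character


def insert(root, pattern, ident):
    """Add pattern to the trie rooted at root and store its identifier."""
    node = root
    for ch in pattern:
        node = node.setdefault(ch, {})
    node[END] = ident


def lookup(root, s):
    """Return the identifier of s if it is a keyword, else None; O(|s|) steps."""
    node = root
    for ch in s:
        if ch not in node:       # the path terminates before the end of s
            return None
        node = node[ch]
    return node.get(END)         # None if this node carries no identifier


root = {}
for i, p in enumerate(["he", "she", "his", "hers"], start=1):
    insert(root, p, i)
print(lookup(root, "hers"), lookup(root, "her"), lookup(root, "hex"))   # 4 None None
```

Using lookup for matching would mean restarting it at every position of the target, which is the source of the Θ(nm) behaviour mentioned above.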
Next we extend a keyword tree into an automaton,
to support linear-time matching
Aho-Corasick Automaton (1)
States: nodes of the keyword tree
initial state: 0 = the root
Actions are determined by three functions:
1. the goto function g(q, a) gives the state entered from
current state q by matching target char a
if edge (q, v) is labeled by a, then g(q, a) = v;
g(0, a) = 0 for each a that does not label an edge
out of the root
the automaton stays at the initial state while
scanning non-matching characters
Otherwise g(q, a) = ∅
Aho-Corasick Automaton (2)
2. the failure function f(q) for q ≠ 0 gives the state
entered at a mismatch
f(q) is the node labeled by the longest proper suffix
w of L(q) s.t. w is a prefix of some pattern
→ a fail transition does not miss any potential
occurrences
NB: f(q) is always defined, since L(0) = ε is a
prefix of any pattern
3. the output function out(q) gives the set of patterns
recognized when entering state q
Example of an AC Automaton
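[Figure: the AC automaton for P = {he, she, his, hers}, with states 0-9. Goto edges: 0-h->1-e->2-r->8-s->9, 1-i->6-s->7 and 0-s->3-h->4-e->5; the root loops on every character ≠ h, s. Outputs: {he} at state 2, {he, she} at state 5, {his} at state 7, {hers} at state 9.]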
Dashed arrows are fail transitions
AC Search of Target T[1 . . . m]
q := 0; // initial state (root)
for i := 1 to m do
    while g(q, T[i]) = ∅ do
        q := f(q); // follow a fail
    q := g(q, T[i]); // follow a goto
    if out(q) ≠ ∅ then print i, out(q);
endfor;
Example:
Search text “ushers” with the preceding automaton
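The search loop can be tried out with the following Python sketch. The automaton tables are hand-coded for P = {he, she, his, hers}; the fail values are derived here from the definition of f (longest proper suffix that is a prefix of some pattern) rather than copied from the slides, and the names GOTO, FAIL, OUT and ac_search are illustrative.

```python
# Hand-coded automaton for P = {he, she, his, hers}, states 0..9.
GOTO = {
    (0, "h"): 1, (1, "e"): 2, (2, "r"): 8, (8, "s"): 9,
    (1, "i"): 6, (6, "s"): 7,
    (0, "s"): 3, (3, "h"): 4, (4, "e"): 5,
}
FAIL = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUT = {2: ["he"], 5: ["she", "he"], 7: ["his"], 9: ["hers"]}


def g(q, a):
    """Goto function; None plays the role of the empty value."""
    if (q, a) in GOTO:
        return GOTO[(q, a)]
    return 0 if q == 0 else None      # the root loops on non-matching characters


def ac_search(text):
    """Return the state entered after each character and the reported outputs."""
    q, trace, hits = 0, [], []
    for i, a in enumerate(text, start=1):
        while g(q, a) is None:
            q = FAIL[q]               # follow a fail
        q = g(q, a)                   # follow a goto
        trace.append(q)
        if q in OUT:
            hits.append((i, OUT[q]))  # i is the end position of the occurrences
    return trace, hits


trace, hits = ac_search("ushers")
print(trace)   # [0, 3, 4, 5, 8, 9]
print(hits)    # [(4, ['she', 'he']), (6, ['hers'])]
```

The only fail transition on this input happens at the character r, where the automaton moves from state 5 to state 2 before taking the goto to state 8.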
AC Search Example
[Figure: the AC automaton of the preceding example, repeated for tracing the search]
1 2 3 4 5 6
T: u s h e r s
q:
Complexity of AC Search
Theorem Searching target T[1 . . . m] with an AC
automaton takes time O(m + z), where z is the number of
pattern occurrences
Proof. For each target character, the automaton performs
0 or more fail transitions, followed by a goto.
Each goto either stays at the root, or increases the depth
of q by 1 ⇒ the depth of q is increased ≤ m times
Each fail moves q closer to the root
⇒ the total number of fail transitions is ≤ m
The z occurrences can be reported in z × O(1) = O(z) time
(say, as pattern identifiers and start positions of
occurrences)
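The counting argument in the proof can be written out as a short derivation. It assumes only what the proof states: the search starts at depth 0, each goto raises the depth of q by at most 1, and each fail lowers it by at least 1.

```latex
\[
0 \;\le\; \operatorname{depth}(q_{\mathrm{final}})
  \;\le\; \#\mathrm{gotos} - \#\mathrm{fails}
\quad\Longrightarrow\quad
\#\mathrm{fails} \;\le\; \#\mathrm{gotos} \;=\; m ,
\]
\[
\text{total time} \;=\; O(\#\mathrm{gotos} + \#\mathrm{fails} + z)
  \;=\; O(m + m + z) \;=\; O(m + z).
\]
```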
Constructing an AC Automaton
The AC automaton can be constructed in two phases
Phase I:
1. Construct the keyword tree for P
for each P ∈ P added to the tree, set
out(v) := {P} for the node v s.t. L(v) = P
2. complete the goto function for the root by setting
g(0, a) := 0
for each a ∈ Σ that doesn’t label an edge out of the root
If the alphabet Σ is fixed, Phase I takes time O(n)
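Phase I can be sketched in Python as follows. The data layout (a list of dictionaries for goto and a list of lists for out) and the name phase_one are assumptions made for this illustration; the alphabet is passed in explicitly so that the root can be completed.

```python
def phase_one(patterns, alphabet):
    """Build goto and out for the keyword tree and complete goto at the root."""
    goto, out = [{}], [[]]            # state 0 is the root
    for p in patterns:
        q = 0
        for a in p:
            if a not in goto[q]:      # extend the tree with a new state
                goto.append({})
                out.append([])
                goto[q][a] = len(goto) - 1
            q = goto[q][a]
        out[q].append(p)              # out(v) := {P} for the node v with L(v) = P
    for a in alphabet:                # step 2: g(0, a) := 0 for unused characters
        goto[0].setdefault(a, 0)
    return goto, out


goto, out = phase_one(["he", "she", "his", "hers"], "ehirsu")
print(len(goto), goto[0], out[5])
```

Inserting the patterns in the order he, she, his, hers creates the states in the order 1, 2, ..., 9, so after this phase out[5] holds only "she".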
Result of Phase I
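[Figure: the automaton after Phase I: the keyword tree with states 0-9 and the root loop on characters ≠ h, s; outputs {he}, {she}, {his} and {hers} at the states where the patterns end (state 5 has only {she}); no fail links yet.]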
Result of Phase I
After Phase I
goto is complete
out() may be incomplete (e.g., "he" is still missing from out(5))
fail links are missing
The automaton is completed in Phase II
Phase II of the AC Construction
Q := emptyQueue();
for a ∈ Σ do
    if g(0, a) = q ≠ 0 then
        f(q) := 0; enqueue(q, Q);
while not isEmpty(Q) do
    r := dequeue(Q);
    for a ∈ Σ do
        if g(r, a) = u ≠ ∅ then
            enqueue(u, Q); v := f(r);
            while g(v, a) = ∅ do v := f(v); // (*)
            f(u) := g(v, a);
            out(u) := out(u) ∪ out(f(u));
What does this do?
Explanation of Phase II
Functions fail and output are computed for the nodes of
the trie in a breadth-first order
nodes closer to the root have already been processed
Consider nodes r and u = g(r, a), that is, r is the parent of
u and L(u) = L(r)a
Now what should f(u) be?
A: The deepest node labeled by a proper suffix of L(u).
The executions of line (*) find this, by locating the deepest
node v s.t. L(v) is a proper suffix of L(r) and g(v, a)
(= f(u)) is defined.
(Notice that v and g(v, a) may both be the root.)
Completing the Output Functions
What about
out(u) := out(u) ∪ out(f(u)); ?
This is done because the patterns recognized at f(u) (if
any), and only those, are proper suffixes of L(u), and shall
thus be recognized at state u also.
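The breadth-first computation of f and out can be sketched in Python as follows. This is an illustrative version under some assumptions: the root loop g(0, a) = 0 is handled inside the helper g instead of being stored, a missing dictionary key stands for the empty goto value, and the output sets are plain lists that are copied on union.

```python
from collections import deque


def build_ac(patterns):
    """Return (goto, fail, out) for the given patterns."""
    # Phase I: keyword tree.
    goto, out = [{}], [[]]
    for p in patterns:
        q = 0
        for a in p:
            if a not in goto[q]:
                goto.append({})
                out.append([])
                goto[q][a] = len(goto) - 1
            q = goto[q][a]
        out[q].append(p)

    def g(q, a):
        # None is the empty value; the root never fails
        return goto[q].get(a, 0 if q == 0 else None)

    # Phase II: breadth-first traversal of the tree.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())        # depth-1 states fail to the root
    while queue:
        r = queue.popleft()
        for a, u in goto[r].items():       # u = g(r, a), so L(u) = L(r)a
            queue.append(u)
            v = fail[r]
            while g(v, a) is None:         # line (*)
                v = fail[v]
            fail[u] = g(v, a)
            out[u] = out[u] + out[fail[u]]
    return goto, fail, out


goto, fail, out = build_ac(["he", "she", "his", "hers"])
print(fail)      # [0, 0, 0, 0, 1, 2, 0, 3, 0, 3]
print(out[5])    # ['she', 'he']
```

Because states are processed in breadth-first order, fail[r] and out[fail[u]] are already final when state u is handled.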
Complexity of the AC Construction
Phase II can be implemented to run in time O(n), too:
The breadth-first traversal alone takes time proportional to
the size of the tree, which is O(n);
OK; . . .
Is there also an O(n) bound for the number of times that
the f transitions are followed (on line (*) of Phase II)?
A: Yes! See next
AC Construction: Number of fail transitions
Consider nodes u1, . . . , ul created by entering a pattern
a1 . . . al; Let df(ui) denote the depth of their f nodes:
[Figure: the chain of nodes u1, ..., ul reached from the root by the edges a1, ..., al, annotated with the fail depths df(u1) = 0, df(u2) = 0 or 1, ..., df(ui), df(ui+1), ..., df(ul).]
When computing f(ui+1), each repeat of line (*)
decrements the value of df(ui+1)
Number of fail transitions (cont)
So, for a pattern of length l,
the line (*) cannot be repeated over l times
Thus, in total over all patterns,
line (*) cannot be repeated over n times
AC Construction: Unions of output functions
Is it costly to perform
out(u) := out(u) ∪ out(f(u)); ?
No: Before the assignment, out(u) = ∅ or out(u) = {L(u)}.
Any patterns in out(f(u)) are shorter than L(u)
⇒ the sets are disjoint
→ Output sets can be implemented as linked lists, and
united in constant time
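One way to picture the constant-time union is with singly linked lists that share their tails. The sketch below is an assumption about how such lists might look in Python (cells are (pattern, tail) tuples and None is the empty list); the names union and to_list are illustrative.

```python
def union(own_pattern, out_of_fail):
    """Return out(u) ∪ out(f(u)) in O(1) by prepending at most one cell.

    own_pattern is L(u) if L(u) is a pattern, else None;
    out_of_fail is the finished list of f(u), which is shared, not copied.
    """
    if own_pattern is None:
        return out_of_fail
    return (own_pattern, out_of_fail)


def to_list(cell):
    """Walk a linked list and collect its patterns."""
    items = []
    while cell is not None:
        items.append(cell[0])
        cell = cell[1]
    return items


# States of the example automaton: L(2) = he, L(5) = she, f(5) = 2.
out2 = union("he", None)
out5 = union("she", out2)
print(to_list(out5))      # ['she', 'he']
print(out5[1] is out2)    # True: the tail is shared
```

Since the two sets are disjoint, no duplicate check is needed, and enumerating out(u) during the search costs time proportional to its size.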
Biological Applications
1. Matching against a library of known patterns
A Sequence-tagged-site (STS) is, roughly, a DNA string
of 200–300 bases whose left and right ends occur only
once in the entire genome
ESTs (expressed sequence tags) are STSs that participate
in gene expression, and thus belong to genes
Hundreds of thousands of STSs and tens of thousands of
ESTs (by mid-90’s) are stored in databases, and used to
compare against new DNA sequences
Ability to search for occurrences of patterns in
time that is independent of their number is very useful
2. Matching with Wild Cards
Let φ be a wild card that matches any single character
For example, abφφcφ occurs at positions 2 and 7 of
1234567890123
xabvccababcax
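The example can be checked with a brute-force matcher that tries every alignment. This is only a reference sketch for the definition of a wild-card match, not the linear-time method; the function name and the wild parameter are illustrative.

```python
def wildcard_occurrences(pattern, text, wild="φ"):
    """Return the 1-based start positions where pattern matches text."""
    n, m = len(pattern), len(text)
    return [
        s + 1
        for s in range(m - n + 1)
        if all(p == wild or p == t for p, t in zip(pattern, text[s:s + n]))
    ]


print(wildcard_occurrences("abφφcφ", "xabvccababcax"))   # [2, 7]
```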
A transcription factor is a protein that binds to specific
locations of DNA and regulates its transcription to RNA
Many families of transcription factors are characterized by
substrings with wild cards
Example: Transcription factor Zinc Finger has signature
CφφCφφφφφφφφφφφφφHφφH
(C and H: amino acids; cysteine and histidine)
Matching with Wild Cards (2)
If the number of wild cards is bounded by a constant,
patterns with wild-cards can be matched in linear time, by
counting occurrences of non-wild-card substrings of P:
Let P = {P1, . . . , Pk} be the substrings of P separated by
wild-cards, and let l1, . . . , lk be their end positions in P
Preprocess: Build an AC automaton for P;
Initiate occurrence counts: for i := 1 to |T| do C[i] := 0;
Search target T with the AC automaton
When pattern Pj is found to end at position i ≥ lj of T,
increment C[i − lj + 1] by one;
Then P occurs at T[i] iff C[i] = k
Example
Let P = φATCφφTCφATC
Then P = {ATC, TC, ATC} with l1 = 4, l2 = 8 and l3 = 12
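[Figure: AC automaton for {ATC, TC}: goto edges 0-A->1-T->2-C->3 and 0-T->4-C->5, with the root looping on C and G; the outputs are the end positions {4, 12, 8} at state 3 and {8} at state 5.]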
Search on
i: 12345678901234...
T: ACGATCTCTCGATC...
C[1] = C[7] = C[11] = 1 and C[3] = 3 (∼ occurrence)
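The counting scheme can be sketched in Python and run on the example above. This is an illustrative version under stated assumptions: out[q] holds indices into the list of substrings (so a repeated substring such as ATC contributes one index per end position), only the 14 characters of T shown above are used, and the names build_ac and wildcard_match are made up here.

```python
from collections import deque


def build_ac(patterns):
    """Compact Aho-Corasick automaton; out[q] lists indices into patterns."""
    goto, out = [{}], [[]]
    for j, p in enumerate(patterns):
        q = 0
        for a in p:
            if a not in goto[q]:
                goto.append({})
                out.append([])
                goto[q][a] = len(goto) - 1
            q = goto[q][a]
        out[q].append(j)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for a, u in goto[r].items():
            queue.append(u)
            v = fail[r]
            while v and a not in goto[v]:
                v = fail[v]
            fail[u] = goto[v].get(a, 0)
            out[u] = out[u] + out[fail[u]]
    return goto, fail, out


def wildcard_match(pattern, text, wild="φ"):
    """Return (start positions of pattern in text, the count array C)."""
    # Split P into wildcard-free substrings with 1-based end positions l_j.
    subs, ends, start = [], [], None
    for pos, ch in enumerate(pattern + wild, start=1):
        if ch != wild and start is None:
            start = pos
        elif ch == wild and start is not None:
            subs.append(pattern[start - 1:pos - 1])
            ends.append(pos - 1)
            start = None
    goto, fail, out = build_ac(subs)
    count = [0] * (len(text) + 2)              # C[1..m]; index 0 is unused
    q = 0
    for i, a in enumerate(text, start=1):
        while q and a not in goto[q]:
            q = fail[q]
        q = goto[q].get(a, 0)
        for j in out[q]:                       # substring j ends at position i
            if i >= ends[j]:
                count[i - ends[j] + 1] += 1
    last = len(text) - len(pattern) + 1        # P must also fit inside T
    k = len(subs)
    return [s for s in range(1, last + 1) if count[s] == k], count


matches, C = wildcard_match("φATCφφTCφATC", "ACGATCTCTCGATC")
print(matches, C[1], C[3], C[7], C[11])   # [3] 1 3 1 1
```

The extra bound on the start position is an addition in this sketch: it rejects positions where all substrings match but the trailing part of P would run past the end of T.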
Complexity of AC Wild-Card Matching
Let |P| = n and |T| = m
Preprocessing: O(n + m) (← $\sum_{i=1}^{k} |P_i| \le n$)
Search: O(m + z), where z is # of occurrences of P in T
Each of them increments a cell of C by one, and each of
C[1], . . . , C[m] is incremented at most k times ⇒ z ≤ km
We have derived the following result:
Theorem Matching of a pattern P with k wild-cards can be
performed in time O(|P| + k|T|)
Aho corasick-lecture

Biosequence Algorithms, Spring 2012
Lecture 4: Set Matching and Aho-Corasick Algorithm
Pekka Kilpeläinen
University of Eastern Finland
School of Computing
BSA Lecture 4: Aho-Corasick matching – p.1/27

Exact Set Matching Problem
In the exact set matching problem we locate the occurrences of any pattern of a set 𝒫 = {P1, ..., Pk} in a target T[1 ... m]
Let $n = \sum_{i=1}^{k} |P_i|$. Exact set matching can be solved in time
O(|P1| + m + ··· + |Pk| + m) = O(n + km)
by applying any linear-time exact matching algorithm k times
The Aho-Corasick algorithm (AC) is a classic solution to exact set matching. It works in time O(n + m + z), where z is the number of pattern occurrences in T
(Main reference here: [Aho and Corasick, 1975])
AC is based on a refinement of a keyword tree

Keyword Trees
A keyword tree (or a trie) for a set of patterns 𝒫 is a rooted tree K such that
1. each edge of K is labeled by a character
2. any two edges out of a node have different labels
Define the label of a node v as the concatenation of the edge labels on the path from the root to v, and denote it by L(v)
3. for each P ∈ 𝒫 there is a node v with L(v) = P, and
4. the label L(v) of any leaf v equals some P ∈ 𝒫

Example of a Keyword Tree
A keyword tree for 𝒫 = {he, she, his, hers}:
[Figure: keyword tree with edges labeled h, e, r, s, i; the terminal nodes store the pattern identifiers 1 to 4]
A keyword tree is an efficient implementation of a dictionary of strings

Keyword Tree: Construction
Construction for 𝒫 = {P1, ..., Pk}:
Begin with a root node only;
Insert each pattern Pi, one after the other, as follows:
Starting at the root, follow the path labeled by the characters of Pi;
If the path ends before Pi does, continue it by adding new edges and nodes for the remaining characters of Pi
Store the identifier i of Pi at the terminal node of the path
This clearly takes O(|P1| + ··· + |Pk|) = O(n) time

Keyword Tree: Lookup
Lookup of a string P:
Starting at the root, follow the path labeled by the characters of P as long as possible;
If all of P is consumed and the path leads to a node with an identifier, P is a keyword in the dictionary
If the path terminates before P does, the string is not in the dictionary
This clearly takes O(|P|) time: an efficient look-up method!
Naive application to pattern matching would lead to Θ(nm) time
Next we extend a keyword tree into an automaton, to support linear-time matching

Aho-Corasick Automaton (1)
States: the nodes of the keyword tree
Initial state: 0 = the root
Actions are determined by three functions:
1. the goto function g(q, a) gives the state entered from the current state q by matching the target character a
   if edge (q, v) is labeled by a, then g(q, a) = v;
   g(0, a) = 0 for each a that does not label an edge out of the root (the automaton stays at the initial state while scanning non-matching characters);
   otherwise g(q, a) = ∅

Aho-Corasick Automaton (2)
2. the failure function f(q), for q ≠ 0, gives the state entered at a mismatch
   f(q) is the node labeled by the longest proper suffix w of L(q) such that w is a prefix of some pattern
   → a fail transition does not miss any potential occurrences
   NB: f(q) is always defined, since L(0) = ε is a prefix of any pattern
3. the output function out(q) gives the set of patterns recognized when entering state q
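The keyword tree construction and lookup described on the slides can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function names `build_keyword_tree` and `lookup` and the list-of-dicts node layout are assumptions made for the example. Identifiers are 1-based, following the order in which the patterns are inserted.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree (trie); returns a list of nodes, node 0 is the root.

    Each node is a dict (a layout assumed for this sketch):
      'next' maps an edge label (character) to the index of the child node,
      'id'   holds the 1-based pattern identifier at a terminal node, else None.
    """
    nodes = [{"next": {}, "id": None}]
    for ident, pattern in enumerate(patterns, start=1):
        v = 0  # start at the root
        for ch in pattern:
            # Extend the path only when no edge labeled ch exists yet.
            if ch not in nodes[v]["next"]:
                nodes.append({"next": {}, "id": None})
                nodes[v]["next"][ch] = len(nodes) - 1
            v = nodes[v]["next"][ch]
        nodes[v]["id"] = ident  # store the identifier at the terminal node
    return nodes


def lookup(nodes, s):
    """Return the identifier of s if s is a keyword, otherwise None."""
    v = 0
    for ch in s:
        v = nodes[v]["next"].get(ch)
        if v is None:  # the path terminates before s does
            return None
    return nodes[v]["id"]


tree = build_keyword_tree(["he", "she", "his", "hers"])
print(len(tree))            # number of nodes, root included
print(lookup(tree, "she"))  # identifier of "she"
print(lookup(tree, "her"))  # a prefix of a keyword, but not a keyword itself
```

Both functions touch each character of their input once, which matches the O(n) construction and O(|P|) lookup bounds stated above (treating dictionary access as constant time).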
Example of an AC Automaton
[Figure: the AC automaton for {he, she, his, hers}, with states 0 to 9, goto edges labeled h, e, r, s, i, a loop at the root on all characters not in {h, s}, and output sets {he}, {he, she}, {his}, {hers}]
Dashed arrows are fail transitions

AC Search of Target T[1 ... m]
    q := 0; // initial state (root)
    for i := 1 to m do
        while g(q, T[i]) = ∅ do
            q := f(q); // follow a fail
        q := g(q, T[i]); // follow a goto
        if out(q) ≠ ∅ then print i, out(q);
    endfor;
Example: Search the text “ushers” with the preceding automaton

AC Search Example
[Figure: the same automaton as in the preceding example]
    i: 1 2 3 4 5 6
    T: u s h e r s
    q:

Complexity of AC Search
Theorem. Searching a target T[1 ... m] with an AC automaton takes time O(m + z), where z is the number of pattern occurrences
Proof. For each target character, the automaton performs 0 or more fail transitions, followed by a goto.
Each goto either stays at the root, or increases the depth of q by 1 ⇒ the depth of q is increased ≤ m times
Each fail moves q closer to the root ⇒ the total number of fail transitions is ≤ m
The z occurrences can be reported in z × O(1) = O(z) time (say, as pattern identifiers and start positions of occurrences)

Constructing an AC Automaton
The AC automaton can be constructed in two phases
Phase I:
1. Construct the keyword tree for 𝒫; for each P ∈ 𝒫 added to the tree, set out(v) := {P} for the node v such that L(v) = P
2. Complete the goto function for the root by setting g(0, a) := 0 for each a ∈ Σ that does not label an edge out of the root
If the alphabet Σ is fixed, Phase I takes time O(n)

Result of Phase I
[Figure: the keyword tree for {he, she, his, hers} with states 0 to 9, the root loop on all characters not in {h, s}, and output sets {he}, {she}, {his}, {hers}; no fail transitions yet]
After Phase I
goto is complete
out() may be incomplete (e.g., “he” is missing for state 5)
fail links are missing
The automaton is completed in Phase II

Phase II of the AC Construction
    Q := emptyQueue();
    for a ∈ Σ do
        if g(0, a) = q ≠ 0 then
            f(q) := 0; enqueue(q, Q);
    while not isEmpty(Q) do
        r := dequeue(Q);
        for a ∈ Σ do
            if g(r, a) = u ≠ ∅ then
                enqueue(u, Q); v := f(r);
                while g(v, a) = ∅ do v := f(v); // (*)
                f(u) := g(v, a);
                out(u) := out(u) ∪ out(f(u));
What does this do?
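The search loop and the two construction phases can be tried out with the following Python sketch. It is an illustration under stated assumptions, not the lecture's implementation: the names `build_ac` and `ac_search` are made up for the example, goto is stored as one dictionary per state, the root convention g(0, a) = 0 is applied implicitly through `.get(a, 0)` rather than by filling in a full row over Σ, and output sets are Python sets rather than linked lists, so the union here is not constant-time.

```python
from collections import deque


def build_ac(patterns):
    """Return (goto, fail, out) for the given patterns; state 0 is the root."""
    goto = [{}]     # goto[q][a] = v for the tree edge (q, v) labeled a
    out = [set()]   # out[q] = patterns recognized on entering q

    # Phase I: keyword tree, with out(v) = {P} at the node where L(v) = P.
    for p in patterns:
        q = 0
        for a in p:
            if a not in goto[q]:
                goto.append({})
                out.append(set())
                goto[q][a] = len(goto) - 1
            q = goto[q][a]
        out[q].add(p)

    # Phase II: fail links and completed outputs, in breadth-first order.
    fail = [0] * len(goto)            # depth-1 states fail to the root
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for a, u in goto[r].items():
            queue.append(u)
            v = fail[r]
            # Line (*): at the root g(0, a) is always defined, so stop there.
            while v != 0 and a not in goto[v]:
                v = fail[v]
            fail[u] = goto[v].get(a, 0)
            out[u] |= out[fail[u]]
    return goto, fail, out


def ac_search(text, goto, fail, out):
    """Return (end position, pattern) pairs, with 1-based end positions."""
    hits = []
    q = 0
    for i, a in enumerate(text, start=1):
        while q != 0 and a not in goto[q]:
            q = fail[q]              # follow a fail
        q = goto[q].get(a, 0)        # follow a goto (root loop if no edge)
        hits.extend((i, p) for p in sorted(out[q]))
    return hits


goto, fail, out = build_ac(["he", "she", "his", "hers"])
print(ac_search("ushers", goto, fail, out))
```

With the patterns inserted in the order he, she, his, hers, the states are created in the same order as the numbering 0 to 9 used on the slides, so `out[5]` holds both "she" and "he" after Phase II.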
Explanation of Phase II
Functions fail and output are computed for the nodes of the trie in a breadth-first order
nodes closer to the root have already been processed
Consider nodes r and u = g(r, a), that is, r is the parent of u and L(u) = L(r)a
Now what should f(u) be?
A: The deepest node labeled by a proper suffix of L(u).
The executions of line (*) find this, by locating the deepest node v such that L(v) is a proper suffix of L(r) and g(v, a) (= f(u)) is defined.
(Notice that v and g(v, a) may both be the root.)

Completing the Output Functions
What about out(u) := out(u) ∪ out(f(u)); ?
This is done because the patterns recognized at f(u) (if any), and only those, are proper suffixes of L(u), and shall thus be recognized at state u also.

Complexity of the AC Construction
Phase II can be implemented to run in time O(n), too:
The breadth-first traversal alone takes time proportional to the size of the tree, which is O(n); OK; ...
Is there also an O(n) bound for the number of times that the f transitions are followed (on line (*) of Phase II)?
A: Yes! See next

AC Construction: Number of fail transitions
Consider nodes u1, ..., ul created by entering a pattern a1 ... al;
Let df(ui) denote the depth of their f nodes:
[Figure: the path of nodes u1, ..., ul with edges a1, ..., al, annotated with the depths of the fail nodes: df(u1) = 0, df(u2) = 0 or 1, ..., df(ui), df(ui+1), ..., df(ul)]
When computing f(ui+1), each repeat of line (*) decrements the value of df(ui+1)

Number of fail transitions (cont)
Since df(ui+1) ≤ df(ui) + 1 and the depths never drop below 0, the values df(u1), ..., df(ul) can increase by at most l in total
So, for a pattern of length l, line (*) cannot be repeated more than l times
Thus, in total over all patterns, line (*) cannot be repeated more than n times

AC Construction: Unions of output functions
Is it costly to perform out(u) := out(u) ∪ out(f(u)); ?
No: Before the assignment, out(u) = ∅ or out(u) = {L(u)}.
Any patterns in out(f(u)) are shorter than L(u) ⇒ the sets are disjoint
→ Output sets can be implemented as linked lists, and united in constant time

Biological Applications
1. Matching against a library of known patterns
A sequence-tagged site (STS) is, roughly, a DNA string of 200–300 bases whose left and right ends occur only once in the entire genome
ESTs (expressed sequence tags) are STSs that participate in gene expression, and thus belong to genes
Hundreds of thousands of STSs and tens of thousands of ESTs (by the mid-1990s) are stored in databases, and used to compare against new DNA sequences
The ability to search for occurrences of patterns in time that is independent of their number is very useful

2. Matching with Wild Cards
Let φ be a wild card that matches any single character
For example, abφφcφ occurs at positions 2 and 7 of
    1234567890123
    xabvccababcax
A transcription factor is a protein that binds to specific locations of DNA and regulates its transcription to RNA
Many families of transcription factors are characterized by substrings with wild cards
Example: The transcription factor Zinc Finger has the signature
CφφCφφφφφφφφφφφφφHφφH
(C and H are amino acids: cysteine and histidine)
Matching with Wild Cards (2)
If the number of wild cards is bounded by a constant, patterns with wild cards can be matched in linear time, by counting occurrences of the non-wild-card substrings of P:
Let 𝒫 = {P1, ..., Pk} be the substrings of P separated by wild cards (the same substring may appear several times), and let l1, ..., lk be their end positions in P
Preprocess: Build an AC automaton for 𝒫;
Initiate the occurrence counts: for i := 1 to |T| do C[i] := 0;
Search the target T with the AC automaton
When pattern Pj is found to end at position i ≥ lj of T, increment C[i − lj + 1] by one;
Then P occurs in T starting at position i iff C[i] = k

Example
Let P = φATCφφTCφATC
Then 𝒫 = {ATC, TC, ATC} with l1 = 4, l2 = 8 and l3 = 12
[Figure: AC automaton for {ATC, TC} with states 0 to 5; the root loops on C and G; the state for TC outputs the end position {8}, the state for ATC outputs {4, 12, 8}]
Search on
    i: 12345678901234...
    T: ACGATCTCTCGATC...
C[1] = C[7] = C[11] = 1 and C[3] = 3 (∼ occurrence)

Complexity of AC Wild-Card Matching
Let |P| = n and |T| = m
Preprocessing: O(n + m) (← $\sum_{i=1}^{k} |P_i| \le n$)
Search: O(m + z), where z is the number of occurrences of the patterns of 𝒫 in T
Each of them increments a cell of C by one, and each of C[1], ..., C[m] is incremented at most k times ⇒ z ≤ km
We have derived the following result:
Theorem. Matching of a pattern P with k wild cards can be performed in time O(|P| + k|T|)
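The counting scheme for wild-card matching can be sketched in Python as follows. This is a hedged illustration: the function name `wildcard_match` is hypothetical, the wild card is written as "?" instead of φ purely for ASCII convenience, and the automaton stores end positions l_j as its outputs, as in the example figure. The final bound on start positions is an addition of this sketch, so that a pattern ending in wild cards cannot run past the end of the target.

```python
from collections import deque


def wildcard_match(pattern, text, wildcard="?"):
    """Return the 1-based start positions of pattern in text."""
    # Split the pattern into wild-card-free substrings P_j with end positions l_j.
    pieces, k, start = {}, 0, None
    for pos, ch in enumerate(pattern + wildcard, start=1):  # sentinel flushes the tail
        if ch != wildcard:
            if start is None:
                start = pos
        elif start is not None:
            pieces.setdefault(pattern[start - 1:pos - 1], []).append(pos - 1)
            k += 1
            start = None

    # Phase I: keyword tree over the distinct substrings; outputs are end positions.
    goto, ends = [{}], [[]]
    for piece, positions in pieces.items():
        q = 0
        for a in piece:
            if a not in goto[q]:
                goto.append({})
                ends.append([])
                goto[q][a] = len(goto) - 1
            q = goto[q][a]
        ends[q].extend(positions)

    # Phase II: fail links and completed outputs, breadth-first.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for a, u in goto[r].items():
            queue.append(u)
            v = fail[r]
            while v != 0 and a not in goto[v]:
                v = fail[v]
            fail[u] = goto[v].get(a, 0)
            ends[u] = ends[u] + ends[fail[u]]

    # Search: a substring ending at i with end position l votes for start i - l + 1.
    m = len(text)
    count = [0] * (m + 2)  # 1-based counters C[1..m]
    q = 0
    for i, a in enumerate(text, start=1):
        while q != 0 and a not in goto[q]:
            q = fail[q]
        q = goto[q].get(a, 0)
        for l in ends[q]:
            if i >= l:
                count[i - l + 1] += 1

    # Extra guard (assumption of this sketch): the whole pattern must fit in text.
    return [i for i in range(1, m - len(pattern) + 2) if count[i] == k]


print(wildcard_match("?ATC??TC?ATC", "ACGATCTCTCGATC"))
print(wildcard_match("ab??c?", "xabvccababcax"))
```

The two calls reproduce the examples from the slides: the first pattern is φATCφφTCφATC searched in ACGATCTCTCGATC, the second is abφφcφ searched in xabvccababcax.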