Biosequence Algorithms, Spring 2012
Lecture 4: Set Matching and
Aho-Corasick Algorithm
Pekka Kilpeläinen
University of Eastern Finland
School of Computing
BSA Lecture 4: Aho-Corasick matching – p.1/27
Exact Set Matching Problem
In the exact set matching problem we locate occurrences
of any pattern of a set P = {P1, . . . , Pk}, in target T[1 . . . m]
Let $n = \sum_{i=1}^{k} |P_i|$. Exact set matching can be solved in time
O(|P1| + m + · · · + |Pk| + m) = O(n + km)
by applying any linear-time exact matching k times
The Aho-Corasick algorithm (AC) is a classic solution to
exact set matching. It works in time O(n + m + z), where z
is the number of pattern occurrences in T
(Main reference here [Aho and Corasick, 1975])
AC is based on a refinement of a keyword tree
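The O(n + km) baseline above can be sketched as follows. This is only an illustration: the helper names are hypothetical, and Python's built-in `str.find` stands in for a linear-time single-pattern matcher (an assumption; its worst-case running time is not claimed here).

```python
def find_all(pattern, text):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)          # slides index T from 1
        start = text.find(pattern, start + 1)  # allow overlapping occurrences
    return positions


def match_each_pattern(patterns, text):
    """Run one single-pattern search per pattern: k passes over the target."""
    return {p: find_all(p, text) for p in patterns}


result = match_each_pattern(["he", "she", "his", "hers"], "ushers")
```

Each pattern costs a separate pass over T, which is the km term that AC avoids.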
Keyword Trees
A keyword tree (or a trie) for a set of patterns P is a
rooted tree K such that
1. each edge of K is labeled by a character
2. any two edges out of a node have different labels
Define the label of a node v as the concatenation of edge
labels on the path from the root to v, and denote it by L(v)
3. for each P ∈ P there’s a node v with L(v) = P, and
4. the label L(v) of any leaf v equals some P ∈ P
Example of a Keyword Tree
A keyword tree for P = {he, she, his, hers}:


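[Figure: keyword tree whose root-to-node paths spell he, hers, his and she; the nodes where the patterns end store the identifiers 1 (he), 2 (she), 3 (his) and 4 (hers).]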
A keyword tree is an efficient implementation of a
dictionary of strings
Keyword Tree: Construction
Construction for P = {P1, . . . , Pk}:
Begin with a root node only;
Insert each pattern Pi, one after the other, as follows:
Starting at the root, follow the path labeled by chars of Pi;
If the path ends before Pi, continue it by adding new
edges and nodes for the remaining characters of Pi
Store identifier i of Pi at the terminal node of the path
This clearly takes O(|P1| + · · · + |Pk|) = O(n) time
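A minimal sketch of the construction above, assuming nodes are numbered in order of creation and each node's outgoing edges are kept in a dict (both are illustrative choices, not prescribed by the slides):

```python
def build_keyword_tree(patterns):
    """Return (children, ident).

    children[v] maps an edge label (character) to the child node it leads to;
    ident[v] is the 1-based identifier i of the pattern P_i that ends at v.
    """
    children = [{}]   # node 0 is the root
    ident = {}
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                # The existing path ended: add a new node and edge.
                children.append({})
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        ident[v] = i   # store the identifier at the terminal node
    return children, ident


children, ident = build_keyword_tree(["he", "she", "his", "hers"])
```

Every character of every pattern is handled once, which matches the O(n) bound for a fixed alphabet.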
Keyword Tree: Lookup
Lookup of a string P: Starting at root, follow the path
labeled by characters of P as long as possible;
If the path leads to a node with an identifier, P is a
keyword in the dictionary
If the path terminates before P, the string is not in the
dictionary
This clearly takes O(|P|) time: an efficient look-up method!
Applied naively to pattern matching (restarting the lookup at every position of T), it would lead to Θ(nm) time
Next we extend a keyword tree into an automaton,
to support linear-time matching
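The lookup and its naive use for matching might look like this. The sketch uses nested dicts as the tree, and the marker key `END` is a hypothetical device for storing pattern identifiers; neither comes from the slides.

```python
END = "$id"   # hypothetical marker key; cannot clash with single-character edge labels


def build(patterns):
    """Keyword tree as nested dicts; node[END] holds the pattern identifier."""
    root = {}
    for i, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[END] = i
    return root


def lookup(root, word):
    """Return the identifier of word if it is a keyword, else None."""
    node = root
    for ch in word:
        if ch not in node:
            return None          # the path terminates before the word does
        node = node[ch]
    return node.get(END)         # None if the final node carries no identifier


def naive_search(root, text):
    """Restart at the root for every start position of text."""
    hits = []
    for start in range(len(text)):
        node = root
        for j in range(start, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if END in node:
                hits.append((start + 1, node[END]))   # (1-based start, identifier)
    return hits


root = build(["he", "she", "his", "hers"])
hits = naive_search(root, "ushers")
```

The inner loop can run for up to the length of the longest pattern at each of the m start positions, which is what the automaton is meant to avoid.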
Aho-Corasick Automaton (1)
States: nodes of the keyword tree
initial state: 0 = the root
Actions are determined by three functions:
1. the goto function g(q, a) gives the state entered from
current state q by matching target char a
if edge (q, v) is labeled by a, then g(q, a) = v;
g(0, a) = 0 for each a that does not label an edge
out of the root
the automaton stays at the initial state while
scanning non-matching characters
Otherwise g(q, a) = ∅
Aho-Corasick Automaton (2)
2. the failure function f(q) for q ≠ 0 gives the state
entered at a mismatch
f(q) is the node labeled by the longest proper suffix
w of L(q) s.t. w is a prefix of some pattern
→ a fail transition does not miss any potential
occurrences
NB: f(q) is always defined, since L(0) = ε is a
prefix of any pattern
3. the output function out(q) gives the set of patterns
recognized when entering state q
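The definition of f can be checked directly on small examples. In the sketch below a state is identified with its label L(q), so the states are exactly the prefixes of the patterns; this brute-force version is for illustration only and is not the efficient construction.

```python
def prefixes(patterns):
    """All prefixes of all patterns, i.e. the labels of the keyword-tree nodes."""
    return {p[:i] for p in patterns for i in range(len(p) + 1)}


def failure_by_definition(patterns):
    """Map each non-root label to its longest proper suffix that is also a node label."""
    states = prefixes(patterns)
    fail = {}
    for label in states:
        if label == "":
            continue                      # f is not defined for the root
        for start in range(1, len(label) + 1):
            suffix = label[start:]        # proper suffixes, longest first
            if suffix in states:
                fail[label] = suffix      # the empty suffix always qualifies
                break
    return fail


fail = failure_by_definition(["he", "she", "his", "hers"])
```

For instance, the state labeled she fails to the state labeled he, so an occurrence of he ending at the same position is not missed.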
Example of an AC Automaton
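[Figure: AC automaton for {he, she, his, hers} with states 0–9; the root loops to itself on every character other than h and s; the output sets shown are {he}, {he, she}, {his} and {hers}.]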
Dashed arrows are fail transitions
AC Search of Target T[1 . . . m]
q := 0; // initial state (root)
for i := 1 to m do
    while g(q, T[i]) = ∅ do
        q := f(q); // follow a fail
    q := g(q, T[i]); // follow a goto
    if out(q) ≠ ∅ then print i, out(q);
endfor;
Example:
Search text “ushers” with the preceding automaton
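The search loop can be tried out on hand-written tables. The state numbering below is an assumption made for illustration (states numbered in order of insertion of he, she, his, hers); the tables `GOTO`, `FAIL` and `OUT` are hypothetical names for g, f and out.

```python
# Tree edges of the automaton for {he, she, his, hers} (assumed numbering).
GOTO = {
    (0, "h"): 1, (1, "e"): 2, (2, "r"): 8, (8, "s"): 9,
    (1, "i"): 6, (6, "s"): 7,
    (0, "s"): 3, (3, "h"): 4, (4, "e"): 5,
}
FAIL = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUT = {2: {"he"}, 5: {"he", "she"}, 7: {"his"}, 9: {"hers"}}


def goto(q, a):
    """g(q, a): a tree edge if present; the root loops on all other characters."""
    if (q, a) in GOTO:
        return GOTO[(q, a)]
    return 0 if q == 0 else None      # None plays the role of the empty value


def ac_search(text):
    """Return (i, patterns) for every 1-based position i where patterns end."""
    q = 0
    hits = []
    for i, a in enumerate(text, start=1):
        while goto(q, a) is None:
            q = FAIL[q]               # follow a fail
        q = goto(q, a)                # follow a goto
        if OUT.get(q):
            hits.append((i, sorted(OUT[q])))
    return hits


hits = ac_search("ushers")
```

The reported positions are end positions of occurrences, as in the pseudocode.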
AC Search Example
[Figure: the AC automaton of the preceding example, repeated for tracing the search.]
i: 1 2 3 4 5 6
T: u s h e r s
q:
Complexity of AC Search
Theorem Searching target T[1 . . . m] with an AC
automaton takes time O(m + z), where z is the number of
pattern occurrences
Proof. For each target character, the automaton performs
0 or more fail transitions, followed by a goto.
Each goto either stays at the root, or increases the depth
of q by 1 ⇒ the depth of q is increased ≤ m times
Each fail moves q closer to the root
⇒ the total number of fail transitions is ≤ m
The z occurrences can be reported in z × O(1) = O(z) time
(say, as pattern identifiers and start positions of
occurrences)
Constructing an AC Automaton
The AC automaton can be constructed in two phases
Phase I:
1. Construct the keyword tree for P
for each P ∈ P added to the tree, set
out(v) := {P} for the node v s.t. L(v) = P
2. complete the goto function for the root by setting
g(0, a) := 0
for each a ∈ Σ that doesn’t label an edge out of the root
If the alphabet Σ is fixed, Phase I takes time O(n)
Result of Phase I
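[Figure: result of Phase I: the keyword tree with states 0–9, the root loop on characters other than h and s, and the output sets {he}, {she}, {his} and {hers}; fail links are not yet present.]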
Result of Phase I
After Phase I
goto is complete
out() may be incomplete (e.g., "he" is still missing from out(5))
fail links are missing
The automaton is completed in Phase II
Phase II of the AC Construction
Q := emptyQueue();
for a ∈ Σ do
    if g(0, a) = q ≠ 0 then
        f(q) := 0; enqueue(q, Q);
while not isEmpty(Q) do
    r := dequeue(Q);
    for a ∈ Σ do
        if g(r, a) = u ≠ ∅ then
            enqueue(u, Q); v := f(r);
            while g(v, a) = ∅ do v := f(v); // (*)
            f(u) := g(v, a);
            out(u) := out(u) ∪ out(f(u));
What does this do?
Explanation of Phase II
Functions fail and output are computed for the nodes of
the trie in a breadth-first order
nodes closer to the root have already been processed
Consider nodes r and u = g(r, a), that is, r is the parent of
u and L(u) = L(r)a
Now what should f(u) be?
A: The deepest node labeled by a proper suffix of L(u).
The executions of line (*) find this, by locating the deepest
node v s.t. L(v) is a proper suffix of L(r) and g(v, a)
(= f(u)) is defined.
(Notice that v and g(v, a) may both be the root.)
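Phases I and II together can be sketched in Python as follows. This is one possible rendering under stated assumptions: the root loop g(0, a) = 0 is left implicit (a missing edge at the root is read as a transition to 0), and output sets are plain lists that are copied on union, which is simpler but not constant-time.

```python
from collections import deque


def build_ac(patterns):
    """Return (goto, fail, out) for the given patterns; state 0 is the root."""
    # Phase I: keyword tree, with out(v) = {P} at the node where P ends.
    goto, out = [{}], [[]]
    for p in patterns:
        v = 0
        for ch in p:
            if ch not in goto[v]:
                goto.append({})
                out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append(p)

    # Phase II: fail links and completed outputs, in breadth-first order.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        r = queue.popleft()
        for a, u in goto[r].items():
            queue.append(u)
            v = fail[r]
            while v != 0 and a not in goto[v]:
                v = fail[v]                  # corresponds to line (*)
            fail[u] = goto[v].get(a, 0)      # missing edge at the root means g(0, a) = 0
            out[u] = out[u] + out[fail[u]]   # list copy; see the linked-list remark
    return goto, fail, out


goto, fail, out = build_ac(["he", "she", "his", "hers"])
```

With this numbering, the state for she ends up with fail link 2 (the state for he) and output ["she", "he"].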
Completing the Output Functions
What about
out(u) := out(u) ∪ out(f(u)); ?
This is done because the patterns recognized at f(u) (if
any), and only those, are proper suffixes of L(u), and shall
thus be recognized at state u also.
Complexity of the AC Construction
Phase II can be implemented to run in time O(n), too:
The breadth-first traversal alone takes time proportional to
the size of the tree, which is O(n);
OK; . . .
Is there also an O(n) bound for the number of times that
the f transitions are followed (on line (*) of Phase II)?
A: Yes! See next
AC Construction: Number of fail transitions
Consider nodes u1, . . . , ul created by entering a pattern
a1 . . . al; Let df(ui) denote the depth of their f nodes:
[Figure: the path of nodes u1, . . . , ul spelled by a1 . . . al, each node ui annotated with the depth df(ui) of its fail node; df is 0 at u1, 0 or 1 at u2, and at most l at ul.]
When computing f(ui+1), each repeat of line (*)
decrements the value of df(ui+1)
Number of fail transitions (cont)
So, for a pattern of length l,
the line (*) cannot be repeated over l times
Thus, in total over all patterns,
line (*) cannot be repeated over n times
AC Construction: Unions of output functions
Is it costly to perform
out(u) := out(u) ∪ out(f(u)); ?
No: Before the assignment, out(u) = ∅ or out(u) = {L(u)}.
Any patterns in out(f(u)) are shorter than L(u)
⇒ the sets are disjoint
→ Output sets can be implemented as linked lists, and
united in constant time
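One way to realize the constant-time union is to let out(u) be a linked list whose tail is shared with out(f(u)). The sketch below uses hand-written tables for {he, she, his, hers}; the state numbering, the names `FAIL`, `OWN` and `BFS_ORDER`, and the use of tuples as list cells are all assumptions made for illustration.

```python
FAIL = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OWN = {2: "he", 5: "she", 7: "his", 9: "hers"}   # pattern ending exactly at the state
BFS_ORDER = [1, 3, 2, 6, 4, 8, 7, 5, 9]          # states by non-decreasing depth


def link_outputs(order, fail, own):
    """out[u] is None or a cell (pattern, rest); rest is shared with out[fail[u]]."""
    out = {0: None}
    for u in order:                  # out[fail[u]] is already final in this order
        tail = out[fail[u]]
        out[u] = (own[u], tail) if u in own else tail   # O(1): no copying
    return out


def patterns_at(out, q):
    """Walk the linked list of state q and collect its patterns."""
    result = []
    cell = out[q]
    while cell is not None:
        result.append(cell[0])
        cell = cell[1]
    return result


out = link_outputs(BFS_ORDER, FAIL, OWN)
```

Because the sets are disjoint, prepending the state's own pattern to the shared tail yields exactly the union, and reporting still costs time proportional to the number of patterns listed.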
Biological Applications
1. Matching against a library of known patterns
A Sequence-tagged-site (STS) is, roughly, a DNA string
of 200–300 bases whose left and right ends occur only
once in the entire genome
ESTs (expressed sequence tags) are STSs that participate
in gene expression, and thus belong to genes
Hundreds of thousands of STSs and tens of thousands of
ESTs (by mid-90’s) are stored in databases, and used to
compare against new DNA sequences
Ability to search for occurrences of patterns in
time that is independent of their number is very useful
2. Matching with Wild Cards
Let φ be a wild card that matches any single character
For example, abφφcφ occurs at positions 2 and 7 of
1234567890123
xabvccababcax
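The example can be verified with a direct position-by-position check. In this sketch the character `?` stands in for the wild card φ (an assumption for ease of typing), and the function name is hypothetical.

```python
WILD = "?"   # stands in for the wild card φ


def wildcard_occurrences(pattern, text):
    """Return 1-based start positions where pattern matches text; WILD matches any character."""
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        window = text[start:start + len(pattern)]
        if all(p == WILD or p == t for p, t in zip(pattern, window)):
            hits.append(start + 1)
    return hits


positions = wildcard_occurrences("ab??c?", "xabvccababcax")
```

This check takes time proportional to |P| at each position of T; the counting method described next avoids that.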
A transcription factor is a protein that binds to specific
locations of DNA and regulates its transcription to RNA
Many families of transcription factors are characterized by
substrings with wild cards
Example: Transcription factor Zinc Finger has signature
CφφCφφφφφφφφφφφφφHφφH
(C and H: amino acids; cysteine and histidine)
Matching with Wild Cards (2)
If the number of wild cards is bounded by a constant,
patterns with wild-cards can be matched in linear time, by
counting occurrences of non-wild-card substrings of P:
Let 𝒫 = {P1, . . . , Pk} be the maximal wild-card-free substrings of P (listed with repetitions),
and let l1, . . . , lk be their end positions in P
Preprocess: Build an AC automaton for 𝒫;
Initiate occurrence counts: for i := 1 to |T| do C[i] := 0;
Search target T with the AC automaton
When pattern Pj is found to end at position i ≥ lj of T,
increment C[i − lj + 1] by one;
Then P occurs at T[i] iff C[i] = k
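The counting method can be sketched as follows. Assumptions made for illustration: `?` stands in for φ, the pattern contains at least one non-wild-card character, output sets hold indices j of the substrings Pj, and a final length check (not spelled out on the slide) discards start positions where P would run past the end of T because of trailing wild cards.

```python
from collections import deque

WILD = "?"   # stands in for the wild card φ


def build_ac(patterns):
    """AC automaton; out[q] lists the indices j of the patterns recognized at q."""
    goto, out = [{}], [[]]
    for j, p in enumerate(patterns):
        v = 0
        for ch in p:
            if ch not in goto[v]:
                goto.append({})
                out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append(j)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for a, u in goto[r].items():
            queue.append(u)
            v = fail[r]
            while v != 0 and a not in goto[v]:
                v = fail[v]
            fail[u] = goto[v].get(a, 0)
            out[u] = out[u] + out[fail[u]]
    return goto, fail, out


def wildcard_match(pattern, text):
    """Return the 1-based start positions of pattern in text."""
    # Split P into maximal wild-card-free substrings with end positions l_j.
    pieces, ends, start = [], [], None
    for pos, ch in enumerate(pattern + WILD, start=1):   # sentinel closes the last piece
        if ch != WILD and start is None:
            start = pos
        elif ch == WILD and start is not None:
            pieces.append(pattern[start - 1:pos - 1])
            ends.append(pos - 1)
            start = None
    k, m = len(pieces), len(text)

    goto, fail, out = build_ac(pieces)
    count = [0] * (m + 2)                  # count[i] plays the role of C[i]
    q = 0
    for i, a in enumerate(text, start=1):
        while q != 0 and a not in goto[q]:
            q = fail[q]
        q = goto[q].get(a, 0)
        for j in out[q]:
            if i >= ends[j]:
                count[i - ends[j] + 1] += 1
    # P occurs at position s iff count[s] = k (and P fits into T from s).
    return [s for s in range(1, m - len(pattern) + 2) if count[s] == k]


matches = wildcard_match("?ATC??TC?ATC", "ACGATCTCTCGATC")
```

Repeated substrings such as ATC share one automaton state whose output lists both indices, so a single recognition increments one counter per end position.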
Example
Let P = φATCφφTCφATC
Then 𝒫 = {ATC, TC, ATC} with l1 = 4, l2 = 8 and l3 = 12
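[Figure: AC automaton for {ATC, TC} with states 0–5; the root loops to itself on C and G; the state reached by TC outputs the end position {8}, and the state reached by ATC outputs {4, 12, 8}.]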
Search on
i: 12345678901234...
T: ACGATCTCTCGATC...
C[1] = C[7] = C[11] = 1 and C[3] = 3 = k, so P occurs at position 3
Complexity of AC Wild-Card Matching
Let |P| = n and |T| = m
Preprocessing: O(n + m) (← $\sum_{i=1}^{k} |P_i| \le n$)
Search: O(m + z), where z is the number of occurrences of the substrings in 𝒫 in T
Each of them increments a cell of C by one, and each of
C[1], . . . , C[m] is incremented at most k times ⇒ z ≤ km
We have derived the following result:
Theorem Matching of a pattern P with k wild-cards can be
performed in time O(|P| + k|T|)
BSA Lecture 4: Aho-Corasick matching – p.27/27

More Related Content

What's hot (20)

Protein databases
Protein databasesProtein databases
Protein databases
 
PPT ON ALGORITHM
PPT ON ALGORITHMPPT ON ALGORITHM
PPT ON ALGORITHM
 
Semantic analysis
Semantic analysisSemantic analysis
Semantic analysis
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Unit 1-introduction to perl
Unit 1-introduction to perlUnit 1-introduction to perl
Unit 1-introduction to perl
 
Syntactic analysis in NLP
Syntactic analysis in NLPSyntactic analysis in NLP
Syntactic analysis in NLP
 
RNA structure analysis
RNA structure analysis RNA structure analysis
RNA structure analysis
 
Natural Language Processing: Parsing
Natural Language Processing: ParsingNatural Language Processing: Parsing
Natural Language Processing: Parsing
 
Comparative genomics in eukaryotes, organelles
Comparative genomics in eukaryotes, organellesComparative genomics in eukaryotes, organelles
Comparative genomics in eukaryotes, organelles
 
Gen bank
Gen bankGen bank
Gen bank
 
Perl
PerlPerl
Perl
 
Msa
MsaMsa
Msa
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
Protein Database
Protein DatabaseProtein Database
Protein Database
 
Lesson 03
Lesson 03Lesson 03
Lesson 03
 
Swiss prot
Swiss protSwiss prot
Swiss prot
 
Lexical analyzer generator lex
Lexical analyzer generator lexLexical analyzer generator lex
Lexical analyzer generator lex
 
Deterministic context free grammars &non-deterministic
Deterministic context free grammars &non-deterministicDeterministic context free grammars &non-deterministic
Deterministic context free grammars &non-deterministic
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function prediction
 
sequence of file formats in bioinformatics
sequence of file formats in bioinformaticssequence of file formats in bioinformatics
sequence of file formats in bioinformatics
 

Similar to Aho corasick-lecture

Class1
 Class1 Class1
Class1issbp
 
Nondeterministic Finite Automata AFN.pdf
Nondeterministic Finite Automata AFN.pdfNondeterministic Finite Automata AFN.pdf
Nondeterministic Finite Automata AFN.pdfSergioUlisesRojasAla
 
Design and Analysis of Algorithms Assignment Help
Design and Analysis of Algorithms Assignment HelpDesign and Analysis of Algorithms Assignment Help
Design and Analysis of Algorithms Assignment HelpProgramming Homework Help
 
Programmable PN Sequence Generators
Programmable PN Sequence GeneratorsProgrammable PN Sequence Generators
Programmable PN Sequence GeneratorsRajesh Singh
 
Characterizing the Distortion of Some Simple Euclidean Embeddings
Characterizing the Distortion of Some Simple Euclidean EmbeddingsCharacterizing the Distortion of Some Simple Euclidean Embeddings
Characterizing the Distortion of Some Simple Euclidean EmbeddingsDon Sheehy
 
Automata
AutomataAutomata
AutomataGaditek
 
Automata
AutomataAutomata
AutomataGaditek
 
Volume and edge skeleton computation in high dimensions
Volume and edge skeleton computation in high dimensionsVolume and edge skeleton computation in high dimensions
Volume and edge skeleton computation in high dimensionsVissarion Fisikopoulos
 
Practical volume estimation of polytopes by billiard trajectories and a new a...
Practical volume estimation of polytopes by billiard trajectories and a new a...Practical volume estimation of polytopes by billiard trajectories and a new a...
Practical volume estimation of polytopes by billiard trajectories and a new a...Apostolos Chalkis
 
Chapter 3 REGULAR EXPRESSION.pdf
Chapter 3 REGULAR EXPRESSION.pdfChapter 3 REGULAR EXPRESSION.pdf
Chapter 3 REGULAR EXPRESSION.pdfdawod yimer
 
Lecture_10_Parallel_Algorithms_Part_II.ppt
Lecture_10_Parallel_Algorithms_Part_II.pptLecture_10_Parallel_Algorithms_Part_II.ppt
Lecture_10_Parallel_Algorithms_Part_II.pptWahyuAde4
 
Consistency proof of a feasible arithmetic inside a bounded arithmetic
Consistency proof of a feasible arithmetic inside a bounded arithmeticConsistency proof of a feasible arithmetic inside a bounded arithmetic
Consistency proof of a feasible arithmetic inside a bounded arithmeticYamagata Yoriyuki
 

Similar to Aho corasick-lecture (20)

Class1
 Class1 Class1
Class1
 
Ch02
Ch02Ch02
Ch02
 
Nondeterministic Finite Automata AFN.pdf
Nondeterministic Finite Automata AFN.pdfNondeterministic Finite Automata AFN.pdf
Nondeterministic Finite Automata AFN.pdf
 
Volume computation and applications
Volume computation and applications Volume computation and applications
Volume computation and applications
 
Design and Analysis of Algorithms Assignment Help
Design and Analysis of Algorithms Assignment HelpDesign and Analysis of Algorithms Assignment Help
Design and Analysis of Algorithms Assignment Help
 
Lec1.pptx
Lec1.pptxLec1.pptx
Lec1.pptx
 
Programmable PN Sequence Generators
Programmable PN Sequence GeneratorsProgrammable PN Sequence Generators
Programmable PN Sequence Generators
 
Characterizing the Distortion of Some Simple Euclidean Embeddings
Characterizing the Distortion of Some Simple Euclidean EmbeddingsCharacterizing the Distortion of Some Simple Euclidean Embeddings
Characterizing the Distortion of Some Simple Euclidean Embeddings
 
Automata
AutomataAutomata
Automata
 
Automata
AutomataAutomata
Automata
 
QB104541.pdf
QB104541.pdfQB104541.pdf
QB104541.pdf
 
Volume and edge skeleton computation in high dimensions
Volume and edge skeleton computation in high dimensionsVolume and edge skeleton computation in high dimensions
Volume and edge skeleton computation in high dimensions
 
Automata theory
Automata theoryAutomata theory
Automata theory
 
Practical volume estimation of polytopes by billiard trajectories and a new a...
Practical volume estimation of polytopes by billiard trajectories and a new a...Practical volume estimation of polytopes by billiard trajectories and a new a...
Practical volume estimation of polytopes by billiard trajectories and a new a...
 
Algorithms on Strings
Algorithms on StringsAlgorithms on Strings
Algorithms on Strings
 
Chapter 3 REGULAR EXPRESSION.pdf
Chapter 3 REGULAR EXPRESSION.pdfChapter 3 REGULAR EXPRESSION.pdf
Chapter 3 REGULAR EXPRESSION.pdf
 
Lecture_10_Parallel_Algorithms_Part_II.ppt
Lecture_10_Parallel_Algorithms_Part_II.pptLecture_10_Parallel_Algorithms_Part_II.ppt
Lecture_10_Parallel_Algorithms_Part_II.ppt
 
Fsa
FsaFsa
Fsa
 
Consistency proof of a feasible arithmetic inside a bounded arithmetic
Consistency proof of a feasible arithmetic inside a bounded arithmeticConsistency proof of a feasible arithmetic inside a bounded arithmetic
Consistency proof of a feasible arithmetic inside a bounded arithmetic
 
Graphs
GraphsGraphs
Graphs
 

Recently uploaded

Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 

Recently uploaded (20)

Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 

Aho corasick-lecture

  • 1. Biosequence Algorithms, Spring 2012 Lecture 4: Set Matching and Aho-Corasick Algorithm Pekka Kilpel¨ainen University of Eastern Finland School of Computing BSA Lecture 4: Aho-Corasick matching – p.1/27 Exact Set Matching Problem In the exact set matching problem we locate occurrences of any pattern of a set P = {P1, . . . , Pk}, in target T[1 . . . m] Let n = k i=1 |Pi|. Exact set matching can be solved in time O(|P1| + m + · · · + |Pk| + m) = O(n + km) by applying any linear-time exact matching k times Aho-Corasick algorithm (AC) is a classic solution to exact set matching. It works in time O(n + m + z), where z is number of pattern occurrences in T (Main reference here [Aho and Corasick, 1975]) AC is based on a refinement of a keyword tree BSA Lecture 4: Aho-Corasick matching – p.2/27 Keyword Trees A keyword tree (or a trie) for a set of patterns P is a rooted tree K such that 1. each edge of K is labeled by a character 2. any two edges out of a node have different labels Define the label of a node v as the concatenation of edge labels on the path from the root to v, and denote it by L(v) 3. for each P ∈ P there’s a node v with L(v) = P, and 4. the label L(v) of any leaf v equals some P ∈ P BSA Lecture 4: Aho-Corasick matching – p.3/27 Example of a Keyword Tree A keyword tree for P = {he, she, his, hers}: - - - - - - - Z Z ZZ~ J J J J J J^ h e sr s s h e i 2 41 3 A keyword tree is an efficient implementation of a dictionary of strings BSA Lecture 4: Aho-Corasick matching – p.4/27 Keyword Tree: Construction Construction for P = {P1, . . . 
, Pk}: Begin with a root node only; Insert each pattern Pi, one after the other, as follows: Starting at the root, follow the path labeled by chars of Pi; If the path ends before Pi, continue it by adding new edges and nodes for the remaining characters of Pi Store identifier i of Pi at the terminal node of the path This takes clearly O(|P1| + · · · + |Pk|) = O(n) time BSA Lecture 4: Aho-Corasick matching – p.5/27 Keyword Tree: Lookup Lookup of a string P: Starting at root, follow the path labeled by characters of P as long as possible; If the path leads to a node with an identifier, P is a keyword in the dictionary If the path terminates before P, the string is not in the dictionary Takes clearly O(|P|) time — An efficient look-up method! Naive application to pattern matching would lead to Θ(nm) time Next we extend a keyword tree into an automaton, to support linear-time matching BSA Lecture 4: Aho-Corasick matching – p.6/27 Aho-Corasick Automaton (1) States: nodes of the keyword tree initial state: 0 = the root Actions are determined by three functions: 1. the goto function g(q, a) gives the state entered from current state q by matching target char a if edge (q, v) is labeled by a, then g(q, a) = v; g(0, a) = 0 for each a that does not label an edge out of the root the automaton stays at the initial state while scanning non-matching characters Otherwise g(q, a) = ∅ BSA Lecture 4: Aho-Corasick matching – p.7/27 Aho-Corasick Automaton (2) 2. the failure function f(q) for q = 0 gives the state entered at a mismatch f(q) is the node labeled by the longest proper suffix w of L(q) s.t. w is a prefix of some pattern → a fail transition does not miss any potential occurrences NB: f(q) is always defined, since L(0) = ǫ is a prefix of any pattern 3. the output function out(q) gives the set of patterns recognized when entering state q BSA Lecture 4: Aho-Corasick matching – p.8/27
  • 2. Example of an AC Automaton 0 1 6 2 3 4 5 7 8 9 {he} {hers} {his} h ={h, s} h {he, she} s e r s i e s Dashed arrows are fail transitions BSA Lecture 4: Aho-Corasick matching – p.9/27 AC Search of Target T[1 . . . m] q := 0; // initial state (root) for i := 1 to m do while g(q, T[i]) = ∅ do q := f(q); // follow a fail q := g(q, T[i]); // follow a goto if out(q) = ∅ then print i, out(q); endfor; Example: Search text “ushers” with the preceding automaton BSA Lecture 4: Aho-Corasick matching – p.10/27 AC Search Example 0 1 6 2 3 4 5 7 8 9 {he} {hers} {his} h ={h, s} h {he, she} s e r s i e s 1 2 3 4 5 6 T: u s h e r s q: BSA Lecture 4: Aho-Corasick matching – p.11/27 Complexity of AC Search Theorem Searching target T[1 . . . m] with an AC automaton takes time O(m + z), where z is the number of pattern occurrences Proof. For each target character, the automaton performs 0 or more fail transitions, followed by a goto. Each goto either stays at the root, or increases the depth of q by 1 ⇒ the depth of q is increased ≤ m times Each fail moves q closer to the root ⇒ the total number of fail transitions is ≤ m The z occurrences can be reported in z × O(1) = O(z) time (say, as pattern identifiers and start positions of occurrences) BSA Lecture 4: Aho-Corasick matching – p.12/27 Constructing an AC Automaton The AC automaton can be constructed in two phases Phase I: 1. Construct the keyword tree for P for each P ∈ P added to the tree, set out(v) := {P} for the node v s.t. L(v) = P 2. 
complete the goto function for the root by setting g(0, a) := 0 for each a ∈ Σ that doesn’t label an edge out of the root If the alphabet Σ is fixed, Phase I takes time O(n) BSA Lecture 4: Aho-Corasick matching – p.13/27 Result of Phase I sr s s e 0 1 6 2 3 4 5 7 8 9 {he} {hers} {his} {she} h ={h, s} i e h BSA Lecture 4: Aho-Corasick matching – p.14/27 Result of Phase I After Phase I goto is complete out() may be incomplete (E.g., ”he” for state 5) fail links are missing The automaton is completed in Phase II BSA Lecture 4: Aho-Corasick matching – p.15/27 Phase II of the AC Construction Q := emptyQueue(); for a ∈ Σ do if q(0, a) = q = 0 then f(q) := 0; enqueue(q, Q); while not isEmpty(Q) do r:= dequeue(Q); for a ∈ Σ do if g(r, a) = u = ∅ then enqueue(u, Q); v := f(r); while g(v, a) = ∅ do v := f(v); // (*) f(u) := g(v, a); out(u) := out(u) ∪ out(f(u)); What does this do? BSA Lecture 4: Aho-Corasick matching – p.16/27
  • 3. Explanation of Phase II Functions fail and output are computed for the nodes of the trie in a breadth-first order nodes closer to the root have already been processed Consider nodes r and u = g(r, a), that is, r is the parent of u and L(u) = L(r)a Now what should f(u) be? A: The deepest node labeled by a proper suffix of L(u). The executions of line (*) find this, by locating the deepest node v s.t. L(v) is a proper suffix of L(r) and g(v, a) (= f(u)) is defined. (Notice that v and g(v, a) may both be the root.) BSA Lecture 4: Aho-Corasick matching – p.17/27 Completing the Output Functions What about out(u) := out(u) ∪ out(f(u)); ? This is done because the patterns recognized at f(u) (if any), and only those, are proper suffixes of L(u), and shall thus be recognized at state u also. BSA Lecture 4: Aho-Corasick matching – p.18/27 Complexity of the AC Construction Phase II can be implemented to run in time O(n), too: The breadth-first traversal alone takes time proportional to the size of the tree, which is O(n); OK; . . . Is there also an O(n) bound for the number of times that the f transitions are followed (on line (*) of Phase II)? A: Yes! See next BSA Lecture 4: Aho-Corasick matching – p.19/27 AC Construction: Number of fail transitions Consider nodes u1, . . . , ul created by entering a pattern a1 . . . al; Let df(ui) denote the depth of their f nodes: u 1 u 2 u i u i+1a 2 a i a i+1 a 1 df(u )i df(u )i+1 a i+1 u l a l df(u ) ll 0 df=0 df=0 or 1 When computing f(ui+1), each repeat of line (*) decrements the value of df(ui+1) BSA Lecture 4: Aho-Corasick matching – p.20/27 Number of fail transitions (cont) So, for a pattern of length l, the line (*) cannot be repeated over l times Thus, in total over all patterns, line (*) cannot be repeated over n times BSA Lecture 4: Aho-Corasick matching – p.21/27 AC Construction: Unions of output functions Is it costly to perform out(u) := out(u) ∪ out(f(u)); ? No: Before the assignment, out(u) = ∅ or out(u) = {L(u)}. 
Any patterns in out(f(u)) are shorter than L(u) ⇒ the sets are disjoint
→ Output sets can be implemented as linked lists, and united in constant time

Biological Applications
1. Matching against a library of known patterns
A sequence-tagged site (STS) is, roughly, a DNA string of 200–300 bases whose left and right ends occur only once in the entire genome
ESTs (expressed sequence tags) are STSs that participate in gene expression, and thus belong to genes
Hundreds of thousands of STSs and tens of thousands of ESTs (by the mid-90s) are stored in databases, and used to compare against new DNA sequences
The ability to search for occurrences of patterns in time that is independent of their number is very useful

2. Matching with Wild Cards
Let φ be a wild card that matches any single character
For example, abφφcφ occurs at positions 2 and 7 of
    1234567890123
    xabvccababcax
A transcription factor is a protein that binds to specific locations of DNA and regulates its transcription to RNA
Many families of transcription factors are characterized by substrings with wild cards
Example: Transcription factor Zinc Finger has signature
CφφCφφφφφφφφφφφφφHφφH
(C and H: amino acids; cysteine and histidine)
Matching with Wild Cards (2)
If the number of wild cards is bounded by a constant, patterns with wild cards can be matched in linear time, by counting occurrences of non-wild-card substrings of P:
Let 𝒫 = {P1, . . . , Pk} be the substrings of P separated by wild cards, and let l1, . . . , lk be their end positions in P
Preprocess: Build an AC automaton for 𝒫;
Initiate occurrence counts: for i := 1 to |T| do C[i] := 0;
Search target T with the AC automaton
When pattern Pj is found to end at position i ≥ lj of T, increment C[i − lj + 1] by one;
Then P occurs in T starting at position i iff C[i] = k

Example
Let P = φATCφφTCφATC
Then 𝒫 = {ATC, TC, ATC} with l1 = 4, l2 = 8 and l3 = 12
[Figure: AC automaton for {ATC, TC} with states 0–5; the state for ATC outputs the end positions {4, 12, 8}, the state for TC outputs {8}, and the root loops on C and G]
Search on
    i: 12345678901234...
    T: ACGATCTCTCGATC...
C[1] = C[7] = C[11] = 1 and C[3] = 3 (an occurrence of P starting at position 3)

Complexity of AC Wild-Card Matching
Let |P| = n and |T| = m
Preprocessing: O(n + m) (← |P1| + · · · + |Pk| ≤ n)
Search: O(m + z), where z is the number of occurrences of the patterns of 𝒫 in T
Each of them increments a cell of C by one, and each of C[1], . . . , C[m] is incremented at most k times ⇒ z ≤ km
We have derived the following result:
Theorem: Matching of a pattern P with k wild-cards can be performed in time O(|P| + k|T|)
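The counting scheme above can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture's implementation: the function name wildcard_match and its wild parameter are invented here, and repeated str.find calls stand in for the AC automaton, so the sketch shows the counting idea but does not achieve the O(|P| + k|T|) bound. It also checks that the whole pattern fits inside T, which matters only when P ends in wild cards.

```python
def wildcard_match(pattern, text, wild="φ"):
    """Return the 1-based start positions of pattern in text (illustrative sketch)."""
    # Split the pattern into its maximal non-wild-card substrings P_j,
    # remembering each 1-based end position l_j within the pattern.
    pieces = []
    start = None
    for pos, ch in enumerate(pattern + wild, start=1):  # trailing sentinel flushes the last piece
        if ch == wild:
            if start is not None:
                pieces.append((pattern[start - 1:pos - 1], pos - 1))
                start = None
        elif start is None:
            start = pos

    k = len(pieces)
    m = len(text)
    count = [0] * (m + 2)  # count[i] plays the role of C[i] for 1-based i

    for sub, l in pieces:
        # Stand-in for the AC search: visit every occurrence of sub in text.
        idx = text.find(sub)
        while idx != -1:
            end = idx + len(sub)           # 1-based end position of this occurrence
            if end >= l:                   # the slides' condition i >= l_j
                count[end - l + 1] += 1
            idx = text.find(sub, idx + 1)

    # P starts at i iff all k pieces voted for i and the pattern fits inside the text.
    return [i for i in range(1, m - len(pattern) + 2) if count[i] == k]


print(wildcard_match("φATCφφTCφATC", "ACGATCTCTCGATC"))  # prints [3]
print(wildcard_match("abφφcφ", "xabvccababcax"))         # prints [2, 7]
```

Both calls reproduce the examples from the slides: the first finds the single cell with C[3] = 3, the second the occurrences at positions 2 and 7.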