Horspool Algorithm in Design and Analysis of Algorithms in VTU
1. Advanced Data Structure: Bioinformatics
• First week: algorithms for exact string matching.
• Second week: approximate search and alignment of short sequences.
• Third week: dealing with long sequences.
2. Advanced Data Structure: Bibliography
• David W. Mount, Bioinformatics: Sequence and Genome Analysis
• Gonzalo Navarro and Mathieu Raffinot, Flexible Pattern Matching in Strings (2002)
• http://www-igm.univ-mlv.fr/~lecroq/string/index.html
• http://www.ncbi.nlm.nih.gov/
3. First week
• First week: algorithms for exact string matching:
One pattern: the algorithm depends on |p| and |Σ|.
k patterns: the algorithm depends on k, |p| and |Σ|.
• Second week: approximate search and alignment of short sequences.
• Third week: dealing with long sequences.
4. Exact string matching for one pattern
For instance, given the sequence
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
search for the pattern ACTGA, and for the pattern TACTACGGTATGACTAA.
How do string matching algorithms perform the search?
5. Exact string matching: Brute force algorithm
Example: given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A
6. Exact string matching: Brute force algorithm
• How is the comparison made?
From left to right: prefix search.
• What is the next position of the window?
The window is shifted by only one cell.
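The two answers above (compare from left to right, shift by one cell) can be sketched as follows. This is a minimal illustration rather than code from the course; the function name `brute_force_search` is an assumption.

```python
def brute_force_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    occurrences = []
    # The window always moves exactly one cell to the right.
    for pos in range(n - m + 1):
        j = 0
        # Compare the window with the pattern from left to right.
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:
            occurrences.append(pos)
    return occurrences


if __name__ == "__main__":
    print(brute_force_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

Positions are 0-based, so an occurrence reported at 14 starts at the fifteenth character of the text.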
7. Exact string matching: one pattern
How do the matching algorithms perform the search?
There is a sliding window along the text against which the pattern is compared.
At each step the comparison is made and the window is shifted to the right.
Which facts differentiate the algorithms?
1. How the comparison is made.
2. The length of the shift.
9. Horspool algorithm
• How is the comparison made?
Suffix search: the window is compared with the pattern from right to left.
• What is the next position of the window?
Let a be the text character under the last cell of the window. The window is shifted until a is aligned with its next occurrence in the pattern.
We need a preprocessing phase to construct the shift table.
10. Horspool algorithm: example
Given the pattern ATGTA
• The shift table is filled in one character at a time. Each entry is the distance from the rightmost occurrence of the character among the first four cells of the pattern to the last cell; a character that does not occur there gets the pattern length:
A 4 (A occurs in cell 1)
C 5 (C does not occur)
G 2 (G occurs in cell 3)
T 1 (the rightmost T is in cell 4)
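The preprocessing phase can be sketched as follows. This is a minimal illustration; the function name `horspool_shift_table` and the explicit `alphabet` argument are assumptions, not part of the slides.

```python
def horspool_shift_table(pattern, alphabet="ACGT"):
    """Build the Horspool shift table for pattern over alphabet."""
    m = len(pattern)
    # A character that is absent from the first m-1 cells shifts by m.
    table = {c: m for c in alphabet}
    # Later (more to the right) occurrences overwrite earlier ones,
    # so each entry ends up as the distance from the rightmost
    # occurrence among the first m-1 cells to the last cell.
    for i in range(m - 1):
        table[pattern[i]] = m - 1 - i
    return table


if __name__ == "__main__":
    print(horspool_shift_table("ATGTA"))
```

The last cell of the pattern is deliberately left out of the loop: otherwise the final A would get shift 0 and the window would never move.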
16. Horspool algorithm: example
Given the pattern ATGTA
• The shift table is:
A 4
C 5
G 2
T 1
• The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A
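The searching phase above can be reproduced with a short sketch. The function name `horspool_search` and the returned list of visited window positions are assumptions added for illustration.

```python
def horspool_search(text, pattern):
    """Return (occurrences, windows): match positions and every
    window position visited, both 0-based."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    # Shift table: distance from the rightmost occurrence among the
    # first m-1 pattern cells to the last cell; default is m.
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    occurrences, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        # Suffix search: compare from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            occurrences.append(pos)
        # The shift depends only on the text character under the
        # last cell of the window.
        pos += shift.get(text[pos + m - 1], m)
    return occurrences, windows


if __name__ == "__main__":
    print(horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
```

On the 22 characters shown in the slide, the visited windows start at 0, 1, 5, 7, 12 and 14. The next window would start at 18 and needs text beyond the part shown.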
17. Some questions about Horspool algorithm
Given a random text over an equally likely probability distribution (EPD), and given the pattern ATGTA, whose shift table is
A 4
C 5
G 2
T 1
1. Determine the expected shift of the window. What if the probability distribution is not equally likely?
2. Determine the expected number of shifts, assuming a text of length n.
3. Determine the expected number of comparisons in the suffix search phase.
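One possible way to approach these questions is sketched below. It assumes the text characters are independent of each other, and the helper names `expected_shift` and `expected_comparisons` are hypothetical. Dividing n by the expected shift then gives only an approximation of the expected number of shifts.

```python
from fractions import Fraction


def expected_shift(shift_table, probs):
    """Expected shift: each table entry weighted by the probability
    of its character appearing under the last cell of the window."""
    return sum(probs[c] * shift_table[c] for c in shift_table)


def expected_comparisons(m, sigma):
    """Expected comparisons per window for a uniform random text
    over sigma characters: comparison i+1 is made only if the first
    i comparisons matched, which has probability (1/sigma)**i."""
    return sum(Fraction(1, sigma) ** i for i in range(m))


if __name__ == "__main__":
    table = {"A": 4, "C": 5, "G": 2, "T": 1}
    uniform = {c: Fraction(1, 4) for c in "ACGT"}
    print(expected_shift(table, uniform))   # 3
    print(expected_comparisons(5, 4))       # 341/256
    # A non-uniform distribution, chosen here only as an example.
    skewed = {"A": Fraction(1, 2), "C": Fraction(1, 6),
              "G": Fraction(1, 6), "T": Fraction(1, 6)}
    print(expected_shift(table, skewed))
```

Under the equally likely distribution the expected shift is (4 + 5 + 2 + 1) / 4 = 3, so a text of length n needs roughly n / 3 shifts.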
19. BNDM algorithm
• How is the comparison made?
Search for suffixes of the text window that are factors of the pattern P. The positions of the pattern where the suffix read so far occurs are denoted as a bit vector, for instance
D2 = ( 1 0 0 0 1 0 0 )
Once the next character x is read,
D3 = (D2 << 1) & B(x)
B(x): mask of x in the pattern P.
For instance, if B(x) = ( 0 0 1 1 0 0 0 ), then
D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 )
• What is the next position of the window?
It depends on the value of the leftmost bit of D.
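The update of D can be sketched with ordinary integers used as bit vectors. The helper names `char_masks`, `bndm_step` and `bits` are assumptions for illustration. Bit vectors are written here with the leftmost pattern cell as the most significant bit, as in the example above.

```python
def char_masks(pattern):
    """B(x) for every character x of the pattern: the bit for cell i
    (counted from the left) is set when pattern[i] == x."""
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    return masks


def bndm_step(d, mask, m):
    """One update D' = (D << 1) & B(x), keeping only m bits so that
    the leftmost bit falls off when it is shifted."""
    return ((d << 1) & ((1 << m) - 1)) & mask


def bits(value, m):
    """Render an m-bit vector as a string of 0s and 1s."""
    return format(value, "0{}b".format(m))


if __name__ == "__main__":
    d2 = 0b1000100
    b_x = 0b0011000
    print(bits(bndm_step(d2, b_x, 7), 7))  # 0001000
```

Python integers are unbounded, so the explicit mask with `(1 << m) - 1` plays the role of the fixed-width machine word.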
20. BNDM algorithm: example
Given the pattern ATGTA
• The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
• The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A
First window:
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window:
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
Third window:
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
21. BNDM algorithm: example of window shift
• Given the pattern ATGTA
• The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
• The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
                    A T G T A
                            A T G T A
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 ) Found
22. BNDM algorithm: example
Given the pattern ATGTA
• The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
• The searching phase:
G T A C T A G A A T A C G T A T G T A C T G ...
A T G T A
          A T G T A
                A T G T A
First window:
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window:
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D3 = ( 0 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )
How is the shift determined?
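A sketch of the whole BNDM search helps to answer this question: each time the leftmost bit of D is set before the whole window has been read, a prefix of the pattern has been seen, and the remaining distance is remembered as the next shift. The code follows the usual textbook formulation; the name `bndm_search` and the returned list of window positions are assumptions for illustration.

```python
def bndm_search(text, pattern):
    """Return (occurrences, windows), both as 0-based positions."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    full = (1 << m) - 1        # m ones
    top = 1 << (m - 1)         # leftmost bit
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << (m - 1 - i))
    occurrences, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j, last, d = m, m, full
        while d != 0:
            # Read the window from right to left.
            d &= masks.get(text[pos + j - 1], 0)
            j -= 1
            if d & top:
                if j > 0:
                    # A pattern prefix ends here: remember the shift.
                    last = j
                else:
                    occurrences.append(pos)
            d = (d << 1) & full
        pos += last
    return occurrences, windows


if __name__ == "__main__":
    print(bndm_search("GTACTAGAATACGTATGTACTG", "ATGTA"))
```

In the second window of the example above, the leftmost bit of D2 is set after two characters have been read, so `last` becomes 3 and the window moves three cells, from position 5 to position 8.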
23. Extended string matching
• Classes of characters: some DNA files or patterns contain additional characters such as N or R, where N = {A,C,G,T} and R = {G,A}.
• Bounded length gaps: patterns such as ATx(2,3)TA, where x(2,3) means any 2 or 3 characters.
• Optional characters: patterns such as AC?ACT?T?A, where C? means that C may or may not appear in the text.
• Wild cards: patterns such as AT*TA, where * means an arbitrarily long string.
• Repeatable characters: patterns such as AT[TA]*AT, where [TA]* means that TA can appear zero or more times.
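To make the first four kinds of pattern concrete, they can be written as regular expressions for Python's `re` module. This is a stand-in for illustration, not the bit-parallel algorithms of the course. The `CLASSES` table and the `expand_classes` helper are assumptions, and only the classes N and R come from the slide.

```python
import re

# Character classes named in the slide, written as regex classes.
CLASSES = {"N": "[ACGT]", "R": "[GA]"}


def expand_classes(pattern):
    """Replace each class letter by its regex character class."""
    return "".join(CLASSES.get(c, c) for c in pattern)


# Classes of characters: A, then any base, then G or A, then T.
classes = re.compile(expand_classes("ANRT"))
# Bounded length gap: ATx(2,3)TA.
bounded_gap = re.compile(r"AT[ACGT]{2,3}TA")
# Optional characters: AC?ACT?T?A.
optional = re.compile(r"AC?ACT?T?A")
# Wild card: AT*TA, with * read as an arbitrarily long string.
wild_card = re.compile(r"AT[ACGT]*TA")


def find_all(regex, text):
    """Start positions of the non-overlapping matches in text."""
    return [match.start() for match in regex.finditer(text)]


if __name__ == "__main__":
    print(find_all(bounded_gap, "GGATCCTAGG"))
```

Note that `finditer` reports non-overlapping matches only, which differs from a search that reports every end position of an occurrence.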
25. Factor Oracle automaton: properties
Factor Oracle of the word G T A T G T A
[Figure: factor oracle automaton of GTATGTA]
All states are accepting states.
It recognizes all factors … but also more words. Which ones?
If a word is rejected, it isn't a factor, then
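These properties can be checked on a small sketch of the automaton. It uses the usual online construction with supply links; the names `build_factor_oracle` and `accepts` are assumptions for illustration.

```python
def build_factor_oracle(word):
    """Return the transitions of the factor oracle of word as a
    list of dicts: transitions[state][char] -> next state."""
    m = len(word)
    transitions = [{} for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = word[i - 1]
        transitions[i - 1][c] = i          # internal transition
        k = supply[i - 1]
        # Follow supply links, adding external transitions to the
        # new state where no transition by c exists yet.
        while k > -1 and c not in transitions[k]:
            transitions[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else transitions[k][c]
    return transitions


def accepts(transitions, candidate):
    """All states are accepting, so a word is accepted as long as
    every character has a transition."""
    state = 0
    for c in candidate:
        if c not in transitions[state]:
            return False
        state = transitions[state][c]
    return True


if __name__ == "__main__":
    oracle = build_factor_oracle("GTATGTA")
    print(accepts(oracle, "TGT"), accepts(oracle, "GTG"))
```

For GTATGTA the oracle accepts GTG, which is not a factor of the word. The converse direction is the useful one for searching: a rejected word is certainly not a factor.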
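26. BOM algorithm (Backward Oracle Matching)
• How is the comparison made?
The window is checked from right to left with the factor oracle automaton of the reversed pattern.
• How many cells are shifted?
Let a be the text character being read:
• If there is no transition for a in the automaton, the window is shifted just past a.
• If we reach the last state of the automaton with a after reading the whole window, an occurrence of the pattern is found.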
27. BOM algorithm: example
• Given the pattern ATGTATG, the automaton of the reversed pattern GTATGTA is built:
[Figure: factor oracle automaton of GTATGTA]
• And the search is:
G T A C T A G A A T G T G T A G A C A T G T A T G G G A ...
A T G T A T G
How is the comparison made?
32. BOM algorithm: example
• Given the pattern ATGTATG, the automaton of the reversed pattern GTATGTA is built:
[Figure: factor oracle automaton of GTATGTA]
• And the search is:
G T A C T A G A A T G T G T A G A C A T G T A T G G G A ...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
                                      A T G T A T G
How is the comparison made?
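The search above can be reproduced with a short sketch of BOM. It builds the factor oracle of the reversed pattern with the usual online construction and shifts by one cell after an occurrence; the function names and the returned list of window positions are assumptions for illustration.

```python
def build_factor_oracle(word):
    """Factor oracle transitions: list of dicts, one per state."""
    m = len(word)
    transitions = [{} for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = word[i - 1]
        transitions[i - 1][c] = i
        k = supply[i - 1]
        while k > -1 and c not in transitions[k]:
            transitions[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else transitions[k][c]
    return transitions


def bom_search(text, pattern):
    """Return (occurrences, windows), both as 0-based positions."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    oracle = build_factor_oracle(pattern[::-1])
    occurrences, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        state, j = 0, m
        # Read the window from right to left until a transition
        # is missing or the whole window has been read.
        while j > 0 and state is not None:
            state = oracle[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:
            occurrences.append(pos)
        # On failure the window restarts just past the failing
        # character; after an occurrence it moves one cell.
        pos += j + 1
    return occurrences, windows


if __name__ == "__main__":
    print(bom_search("GTACTAGAATGTGTAGACATGTATGGGA", "ATGTATG"))
```

On the 28 characters shown in the slide, the windows start at positions 0, 6, 8, 11, 18 and 19, and the occurrence is found at position 18.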
33. Factor Oracle automaton
Given the pattern GTATA, in which state are the factors accepted?
[Figure: factor oracle of GTATA under construction, with the accepted factors listed under each state]
State 1: G
State 2: GT, T
State 3: GTA, TA, A
• When the new T is read, 4 factors should be accepted: GTAT, TAT, AT and T. How can it be reached?
• When the new A is read, 5 factors should be accepted: GTATA, TATA, ATA, TA and A. How can it be reached?
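34. Factor Oracle automaton
[Figure: factor oracle of GTATAG, with the accepted factors listed under each state]
State 1: G
State 2: GT, T
State 3: GTA, TA, A
State 4: GTAT, TAT, AT, T
State 5: GTATA, TATA, ATA, TA, A
• When the new G is read, 6 factors should be accepted: GTATAG, TATAG, ATAG, TAG, AG and G.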
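One way to explore how each new character is handled is to record the transitions added at every step of the usual online construction. The function name `oracle_steps` is an assumption for illustration.

```python
def oracle_steps(word):
    """For each character of word, list the transitions
    (source state, character, target state) added when it is read."""
    transitions = [{}]
    supply = [-1]
    steps = []
    for i, c in enumerate(word, start=1):
        transitions.append({})
        # Internal transition from the previous state.
        transitions[i - 1][c] = i
        added = [(i - 1, c, i)]
        # External transitions found by following supply links.
        k = supply[i - 1]
        while k > -1 and c not in transitions[k]:
            transitions[k][c] = i
            added.append((k, c, i))
            k = supply[k]
        supply.append(0 if k == -1 else transitions[k][c])
        steps.append(added)
    return steps


if __name__ == "__main__":
    for number, added in enumerate(oracle_steps("GTATAG"), start=1):
        print(number, added)
```

For GTATAG, reading the second T and the second A adds only the internal transition, because the shorter suffixes are already recognized through the transitions leaving state 0. Reading the final G also adds an external transition from state 3 to state 6, so that AG and TAG are recognized.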