Advanced Data Structure: Bioinformatics
•First week: Algorithms for exact string matching.
•Second week: Approximate search and alignment
of short sequences.
•Third week: Dealing with long sequences.
Advanced Data Structure: Bibliography
•Bioinformatics: Sequence and Genome Analysis. David W. Mount
•Flexible Pattern Matching in Strings (2002). Gonzalo Navarro and Mathieu Raffinot
•http://www-igm.univ-mlv.fr/~lecroq/string/index.html
•http://www.ncbi.nlm.nih.gov/
First week
•First week: algorithms for exact string matching:
One pattern: The algorithm depends on |p| and |Σ|
k patterns: The algorithm depends on k, |p| and |Σ|
•Second week: approximate search and alignment
of short sequences.
•Third week: dealing with long sequences.
Exact string matching for one pattern
For instance, given the sequence
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
search for the pattern ACTGA.
How do string matching algorithms perform the search?
And for the pattern TACTACGGTATGACTAA?
Exact string matching: Brute force algorithm
Example: given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A
Exact string matching: Brute force algorithm
• How is the comparison made?
From left to right: prefix search.
• Which is the next position of the window?
The window is shifted only one cell.
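The brute force search above can be sketched in a few lines of Python. This is only an illustration; the function name brute_force_search is an assumption, not something defined in the slides.

```python
def brute_force_search(text, pattern):
    """Return the start positions of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    for pos in range(n - m + 1):      # the window is shifted one cell at a time
        j = 0
        # Prefix search: compare the window and the pattern from left to right.
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:                    # every character of the pattern matched
            matches.append(pos)
    return matches


text = "GTACTAGAGGACGTATGTACTG"
print(brute_force_search(text, "ATGTA"))  # [14]
```

Positions are counted from 0, so the occurrence starts at the fifteenth character of the text.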
Exact string matching: one pattern
How do matching algorithms perform the search?
There is a sliding window along the text against which the pattern is compared.
At each step the comparison is made and the window is shifted to the right.
Which facts differentiate the algorithms?
1. How the comparison is made.
2. The length of the shift.
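Exact string matching for one pattern
Experimental efficiency (Navarro & Raffinot)
[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) by pattern length (horizontal axis: 2, 4, 8, ..., 256) and alphabet size |Σ| (vertical axis: 2, 4, 8, ..., 64); a mark w is also shown.]
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching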
Horspool algorithm
• How is the comparison made?
Suffix search: the window and the pattern are compared from right to left.
• Which is the next position of the window?
Let a be the last character of the window in the text.
Shift until the next occurrence of “a” in the pattern is aligned with it.
We need a preprocessing phase to construct the shift table.
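A minimal sketch of the preprocessing phase follows, assuming the DNA alphabet ACGT. The name horspool_shift_table and the alphabet parameter are illustrative choices, not part of the slides.

```python
def horspool_shift_table(pattern, alphabet="ACGT"):
    """Return the Horspool shift for every character of the alphabet."""
    m = len(pattern)
    # A character that does not occur in the pattern allows a full shift of m.
    shift = {c: m for c in alphabet}
    # The last pattern position is skipped; later occurrences overwrite earlier
    # ones, so each entry ends up as the distance from the rightmost occurrence
    # (before the last position) to the end of the pattern.
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    return shift


print(horspool_shift_table("ATGTA"))  # {'A': 4, 'C': 5, 'G': 2, 'T': 1}
```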
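Horspool algorithm: example
Given the pattern ATGTA
• The shift table is (distance from the rightmost occurrence of each character, excluding the last position of the pattern, to the end of the pattern; 5 if the character does not occur):
A 4
C 5
G 2
T 1
• The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A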
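The searching phase can be sketched as follows. This is an illustration under the same assumptions as the example (DNA alphabet, positions counted from 0). The function also returns the window positions it visited, which is an addition made here only so the run can be compared with the alignment above.

```python
def horspool_search(text, pattern, alphabet="ACGT"):
    """Return (matches, windows): occurrence positions and visited window positions."""
    n, m = len(text), len(pattern)
    # Preprocessing: the shift table.
    shift = {c: m for c in alphabet}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    matches, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        # Suffix search: compare from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        # The shift depends only on the last character of the window.
        pos += shift.get(text[pos + m - 1], m)
    return matches, windows


matches, windows = horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA")
print(matches)  # [14]
print(windows)  # [0, 1, 5, 7, 12, 14]
```

The seventh window of the slide (position 18) is not visited here because the 22 characters typed above end before it; on the slide the text continues.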
Some questions about Horspool algorithm
Given the pattern ATGTA, the shift table is
A 4
C 5
G 2
T 1
Given a random text over an equally likely probability distribution (EPD):
1.- Determine the expected shift of the window. What if the probability distribution is not equally likely?
2.- Determine the expected number of shifts, assuming a text of length n.
3.- Determine the expected number of comparisons in the suffix search phase.
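For question 1, a small sketch that computes the expected shift as a weighted average of the table entries. It assumes the last character of each window follows the given distribution. The non-uniform probabilities are made up for illustration and do not come from the slides.

```python
from fractions import Fraction


def expected_shift(shift, prob):
    """Expected window shift: sum over characters of P(character) * shift(character)."""
    return sum(prob[c] * shift[c] for c in shift)


shift = {"A": 4, "C": 5, "G": 2, "T": 1}

# Equally likely distribution: each character has probability 1/4.
uniform = {c: Fraction(1, 4) for c in shift}
print(expected_shift(shift, uniform))  # 3

# Hypothetical non-uniform distribution (illustrative values only).
skewed = {"A": Fraction(3, 10), "C": Fraction(2, 10),
          "G": Fraction(2, 10), "T": Fraction(3, 10)}
print(expected_shift(shift, skewed))   # 29/10
```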
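Exact string matching for one pattern
Experimental efficiency (Navarro & Raffinot)
[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) by pattern length (horizontal axis: 2, 4, 8, ..., 256) and alphabet size |Σ| (vertical axis: 2, 4, 8, ..., 64); a mark w is also shown.]
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching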
BNDM algorithm
• How is the comparison made?
Search for suffixes of the window of T that are factors of P.
The state of the search is kept in a bit vector D, denoted for instance as
D2 = ( 1 0 0 0 1 0 0 )
Once the next character x is read,
D3 = (D2 << 1) & B(x)
B(x): mask of x in the pattern P.
For instance, if B(x) = ( 0 0 1 1 0 0 0 ), then
D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 )
• Which is the next position of the window?
It depends on the value of the leftmost bit of D.
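The update step can be checked with Python integers. This is a sketch: the width M = 7 matches the vectors of this slide, and the helper names bits and bndm_step are illustrative.

```python
M = 7                       # number of bits, as in the vectors above
FULL = (1 << M) - 1         # keeps only the M low-order bits after a shift


def bits(value, width=M):
    """Format an integer as space-separated bits, leftmost bit first."""
    return " ".join(format(value, f"0{width}b"))


def bndm_step(d, mask_x):
    """One update: shift D left by one cell, then AND with the mask B(x)."""
    return ((d << 1) & FULL) & mask_x


d2 = 0b1000100
b_x = 0b0011000
print(bits(bndm_step(d2, b_x)))  # 0 0 0 1 0 0 0
```

The AND with FULL is needed because Python integers are unbounded, so the bit shifted out on the left has to be dropped explicitly.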
BNDM algorithm: example
Given the pattern ATGTA
• The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A
• The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
First window (G T A C T), read from right to left:
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window (A G A G G):
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
Third window (A C G T A):
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
BNDM algorithm: example of window shift
• Given the pattern ATGTA
• The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
• The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
                            A T G T A
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 )
Found: the leftmost bit of D5 is set once the whole window has been read.
BNDM algorithm: example
Given the pattern ATGTA
• The searching phase:
G T A C T A G A A T A C G T A T G T A C T G ...
A T G T A
          A T G T A
                A T G T A
• The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
First window (G T A C T), read from right to left:
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window (A G A A T):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D3 = ( 0 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )
How is the shift determined?
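A sketch of the whole BNDM search, written to follow the examples above. It assumes the pattern fits in one bit vector, stores the mask so that the leftmost bit corresponds to the first pattern character (as in the B(x) tables above), and counts positions from 0. The names bndm_search, last and windows are illustrative. The variable last records the window offset at which the leftmost bit of D was most recently set, which is what determines the shift.

```python
def bndm_search(text, pattern):
    """Return (matches, windows) for a BNDM-style search of pattern in text."""
    n, m = len(text), len(pattern)
    full = (1 << m) - 1          # m ones
    high = 1 << (m - 1)          # the leftmost bit
    # B[c]: bit for pattern position i (from the left) is set when pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))

    matches, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j, last, D = m, m, full
        while D != 0:
            # Read the window from right to left.
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & high:
                if j > 0:
                    last = j     # a prefix of the pattern starts at offset j
                else:
                    matches.append(pos)   # the whole window was read
            D = (D << 1) & full
        pos += last              # shift to the last recognized prefix
    return matches, windows


print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
# ([14], [0, 5, 10, 14])
print(bndm_search("GTACTAGAATACGTATGTACTG", "ATGTA"))
# ([14], [0, 5, 8, 13, 14])
```

In the second run the window moves from 5 to 8, a shift of 3, because the leftmost bit of D was set after reading two characters of the second window.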
Extended string matching
• Classes of characters: some DNA files or patterns contain additional characters such as N or R, where N = {A,C,G,T} and R = {G,A}.
• Bounded length gaps: patterns such as ATx(2,3)TA, where x(2,3) means any 2 or 3 characters.
• Optional characters: patterns such as AC?ACT?T?A, where C? means that C may or may not appear in the text.
• Wild cards: patterns such as AT*TA, where * means an arbitrarily long string.
• Repeatable characters: patterns such as AT[TA]*AT, where [TA]* means that TA can appear zero or more times.
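To experiment with these extended patterns, they can be translated by hand into regular expressions for Python's re module. This is only a way to test the semantics with a general regex engine, not one of the specialized algorithms of the course. The sample pattern ANRT and the sample texts are made up for illustration.

```python
import re

# Character classes named above; other class characters are not covered here.
CLASSES = {"N": "[ACGT]", "R": "[GA]"}


def expand_classes(pattern):
    """Replace class characters such as N or R by regex character classes."""
    return "".join(CLASSES.get(c, c) for c in pattern)


# Hand-written translations; the slide notation is not regex syntax
# (for example, the wild card * becomes [ACGT]*).
PATTERNS = {
    "bounded gap ATx(2,3)TA": "AT[ACGT]{2,3}TA",
    "optional AC?ACT?T?A": "AC?ACT?T?A",
    "wild card AT*TA": "AT[ACGT]*TA",
    "repeatable AT[TA]*AT": "AT(?:TA)*AT",
}


def find_all(regex, text):
    """Return (start, matched string) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in re.finditer(regex, text)]


print(find_all(PATTERNS["bounded gap ATx(2,3)TA"], "GGATCGTACC"))  # [(2, 'ATCGTA')]
print(find_all(expand_classes("ANRT"), "CCACGTCC"))                # [(2, 'ACGT')]
```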
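Exact string matching for one pattern
Most efficient algorithms (Navarro & Raffinot)
[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) by pattern length (horizontal axis: 2, 4, 8, ..., 256) and alphabet size |Σ| (vertical axis: 2, 4, 8, ..., 64); a mark w is also shown.]
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching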
Factor Oracle automaton: properties
Factor Oracle of the word G T A T G T A
[Figure: factor oracle automaton of GTATGTA]
All states are accepting states.
It recognizes all factors … and some more words. Which ones?
If a word is rejected, it isn't a factor, then ...
BOM algorithm (Backward Oracle Matching)
• How is the comparison made?
The window is checked from right to left with an automaton: the Factor Oracle of the reversed pattern.
• How many cells are shifted? Let a be the character just read:
• If a has no transition in the automaton, the window is shifted past a.
• If we reach the last state of the automaton with a, an occurrence has been found.
BOM algorithm: example
• The automaton of the reversed pattern is built: given the pattern ATGTATG
[Figure: factor oracle automaton of the reversed pattern GTATGTA]
• And the search is:
G T A C T A G A A T G T G T A G A C A T G T A T G G G A...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
How is the comparison made?
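The example can be reproduced with the sketch below. It includes its own compact oracle builder so that it runs on its own, and positions are counted from 0. The names build_factor_oracle and bom_search are illustrative, and the list of visited windows is returned only for comparison with the alignment above.

```python
def build_factor_oracle(word):
    """Return the transitions of the factor oracle of word (state 0 is initial)."""
    trans = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, c in enumerate(word, start=1):
        trans[i - 1][c] = i
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def bom_search(text, pattern):
    """Return (matches, windows) for a BOM-style search of pattern in text."""
    n, m = len(text), len(pattern)
    oracle = build_factor_oracle(pattern[::-1])   # oracle of the reversed pattern
    matches, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        state, j = 0, m
        # Read the window from right to left while a transition exists.
        while j > 0 and state is not None:
            state = oracle[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:
            # All m characters were read, so the last state was reached.
            matches.append(pos)
        pos += j + 1          # shift past the character that failed
    return matches, windows


text = "GTACTAGAATGTGTAGACATGTATGGGA"
print(bom_search(text, "ATGTATG"))  # ([18], [0, 6, 8, 11, 18, 19])
```

In the second window the oracle reads five characters before failing, although ATGTG is not a factor of the pattern. This is the price of the oracle accepting more than the factors: the search stays correct, but some shifts are shorter than they could be.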
Automata Factor Oracle
Given the pattern GTATA, in which state are the factors accepted?
After G: G
After GT: GT, T
After GTA: GTA, TA, A
When the new T is read, 4 factors should be accepted: GTAT, TAT, AT, T. How can they be reached?
When the new A is read, 5 factors should be accepted: GTATA, TATA, ATA, TA, A. How can they be reached?
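Automata Factor Oracle
When the new G is read, 6 factors should be accepted: GTATAG, TATAG, ATAG, TAG, AG, G.
[Figure: factor oracle automaton of GTATAG, each state labeled with the factors accepted there: G; GT, T; GTA, TA, A; GTAT, TAT, AT, T; GTATA, TATA, ATA, TA, A; GTATAG, TATAG, ATAG, TAG, AG, G]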
Factor Oracle automaton: linear algorithm?
Factor Oracle automaton: algorithm
If there is a T transition ...
… and recursively continue ...
But if there isn't a T transition ...
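The construction hinted at above can be sketched with supply links: while a state on the chain of links has no transition for the new character, one is added and the chain is followed; the chain stops at the first state that already has one. The code below is an illustration of this standard on-line construction; the names trans, supply and accepted_words are assumptions made here.

```python
def build_factor_oracle(word):
    """Build the factor oracle of word; return (transitions, supply links)."""
    m = len(word)
    trans = [{} for _ in range(m + 1)]   # trans[state][char] -> state
    supply = [-1] * (m + 1)              # the initial state has no supply link
    for i, c in enumerate(word, start=1):
        trans[i - 1][c] = i              # transition along the word itself
        k = supply[i - 1]
        # But if there isn't a c transition: add one to the new state
        # and recursively continue along the supply links.
        while k > -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        # If there is a c transition, its target is the new supply link.
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans, supply


def accepted_words(trans, state=0, prefix=""):
    """Yield every non-empty word spelled by a path from the initial state."""
    for c, target in sorted(trans[state].items()):
        yield prefix + c
        yield from accepted_words(trans, target, prefix + c)


word = "GTATGTA"
trans, supply = build_factor_oracle(word)
# Words accepted by the oracle that are not factors of the word.
extras = sorted(w for w in accepted_words(trans) if w not in word)
print(extras)  # ['GTG', 'GTGT', 'GTGTA']
```

Each character adds one new state, and all states are accepting, so every factor of the word is recognized along with the few extra words listed.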

Horspool Algorithm in Design and Analysis of Algorithms in VTU
