Advanced Data Structure: Bioinformatics
•First week: Algorithms for exact string matching.
•Second week: Approximate search and alignment
of short sequences.
•Third week: Dealing with long sequences.
Advanced Data Structure: Bibliography
•Bioinformatics, Sequence and Genome Analysis
David W. Mount
•Flexible Pattern Matching in Strings (2002)
Gonzalo Navarro and Mathieu Raffinot
•http://www-igm.univ-mlv.fr/~lecroq/string/index.html
•http://www.ncbi.nlm.nih.gov/
First week
•First week: algorithms for exact string matching:
One pattern: the algorithm depends on |p| and |Σ|
k patterns: the algorithm depends on k, |p| and |Σ|
•Second week: approximate search and alignment
of short sequences.
•Third week: dealing with long sequences.
Exact string matching for one pattern
For instance, given the sequence
CTACTACTACGTCTATACTGATCGTAGCTACTACATGC
search for the pattern ACTGA,
and for the pattern TACTACGGTATGACTAA.
How do string algorithms carry out the search?
Exact string matching: Brute force algorithm
Example: given the pattern ATGTA, the search is
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
    A T G T A
      A T G T A
        A T G T A
          A T G T A
Exact string matching: Brute force algorithm
• How is the comparison made?
From left to right: prefix search.
• What is the next position of the window?
The window is shifted by only one cell.
[Figure: the pattern aligned under the text, before and after the shift.]
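The two answers above (left-to-right comparison, shift of one cell) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `brute_force_search` is an assumption, not something taken from the slides, and the text is the one used in the example with the spaces removed.

```python
def brute_force_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    for pos in range(n - m + 1):            # the window moves one cell at a time
        j = 0
        # prefix search: compare from left to right until a mismatch
        while j < m and text[pos + j] == pattern[j]:
            j += 1
        if j == m:                          # the whole pattern was matched
            matches.append(pos)
    return matches


text = "GTACTAGAGGACGTATGTACTG"             # text of the slide, spaces removed
print(brute_force_search(text, "ATGTA"))    # [14]
```

Every one of the n - m + 1 window positions is visited, which is the cost the later algorithms try to avoid by shifting further.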
Exact string matching: one pattern
How do the matching algorithms carry out the search?
There is a sliding window along the text
against which the pattern is compared.
At each step the comparison is made and
the window is shifted to the right.
[Figure: the pattern aligned under a window of the text.]
Which features differentiate the algorithms?
1. How the comparison is made.
2. The length of the shift.
Exact string matching for one pattern
Experimental efficiency (Navarro & Raffinot)
[Figure: map of the most efficient algorithm (Horspool, BNDM or BOM) as a function of the pattern length (axis from 2 to 256, with a mark at w) and of the alphabet size |Σ| (axis from 2 to 64).]
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching
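Horspool algorithm
• How is the comparison made?
Suffix search: the window is compared from right to left.
• What is the next position of the window?
Let "a" be the text character under the last cell of the window.
Shift until the next occurrence of "a" in the pattern.
[Figure: the pattern aligned under the text, before and after the shift that lines up "a".]
We need a preprocessing phase to construct the shift table.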
Horspool algorithm: example
Given the pattern ATGTA
• The shift table is:
A 4
C 5
G 2
T 1
Horspool algorithm: example
Given the pattern ATGTA
• The shift table is:
A 4
C 5
G 2
T 1
• The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
  A T G T A
          A T G T A
              A T G T A
                        A T G T A
                            A T G T A
                                    A T G T A
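The preprocessing and searching phases of the example can be sketched as follows. The names `horspool_shift_table` and `horspool_search`, and the default alphabet "ACGT", are assumptions made for illustration. The function also returns the list of window positions so that the result can be compared with the alignments drawn above.

```python
def horspool_shift_table(pattern: str, alphabet: str = "ACGT") -> dict[str, int]:
    """Shift for each character when it is found under the last window cell."""
    m = len(pattern)
    shift = {c: m for c in alphabet}        # characters absent from the pattern shift by m
    for i, c in enumerate(pattern[:-1]):    # the last pattern cell is not considered
        shift[c] = m - 1 - i                # later (rightmost) occurrences overwrite earlier ones
    return shift


def horspool_search(text: str, pattern: str, alphabet: str = "ACGT"):
    """Return (window positions visited, positions where the pattern occurs)."""
    n, m = len(text), len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    shift = horspool_shift_table(pattern, alphabet)
    windows, matches = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:   # suffix search, right to left
            j -= 1
        if j < 0:
            matches.append(pos)
        # the shift depends only on the text character under the last window cell
        pos += shift.get(text[pos + m - 1], m)
    return windows, matches


print(horspool_shift_table("ATGTA"))        # {'A': 4, 'C': 5, 'G': 2, 'T': 1}
print(horspool_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))
# ([0, 1, 5, 7, 12, 14], [14])
```

The seventh alignment of the slide (position 18) does not appear in the output because the sketch only sees the 22 characters shown before the "...".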
Some questions about Horspool algorithm
Given the pattern ATGTA, the shift table is
A 4
C 5
G 2
T 1
Given a random text over an equally likely
probability distribution (EPD):
1.- Determine the expected shift of the window. And
what if the PD is not equally likely?
2.- Determine the expected number of shifts,
assuming a text of length n.
3.- Determine the expected number of comparisons
in the suffix search phase.
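One possible way to approach question 1 is sketched below. It assumes that the character under the last cell of each window is drawn independently from the text distribution, so the expected shift is the sum over the alphabet of Pr(c) times shift(c). The function name `expected_shift` and the non-uniform probabilities are hypothetical, chosen only for illustration.

```python
from fractions import Fraction


def expected_shift(shift_table, probabilities=None):
    """Expected window shift: sum over characters of Pr(c) * shift(c)."""
    if probabilities is None:               # equally likely characters
        p = Fraction(1, len(shift_table))
        probabilities = {c: p for c in shift_table}
    return sum(probabilities[c] * d for c, d in shift_table.items())


table = {"A": 4, "C": 5, "G": 2, "T": 1}    # shift table of ATGTA from the slide
print(expected_shift(table))                # 3, that is (4 + 5 + 2 + 1) / 4

# Hypothetical non-uniform distribution, for illustration only.
skewed = {"A": Fraction(3, 10), "C": Fraction(2, 10),
          "G": Fraction(2, 10), "T": Fraction(3, 10)}
print(expected_shift(table, skewed))        # 29/10
```

Under the same assumption, a text of length n would need on the order of n/3 shifts for this pattern, which gives a starting point for question 2.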
Exact string matching for one pattern
Experimental efficiency (Navarro & Raffinot)
[Figure: the same efficiency map of Horspool, BNDM and BOM by pattern length and alphabet size.]
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching
BNDM algorithm
• How is the comparison made?
Search for suffixes of the text window T that are factors of the pattern P.
The state of the search is denoted as a bit vector, for instance
D2 = ( 1 0 0 0 1 0 0 )
Once the next character x is read
D3 = (D2 << 1) & B(x)
B(x): mask of x in the pattern P.
For instance, if B(x) = ( 0 0 1 1 0 0 0 )
D3 = ( 0 0 0 1 0 0 0 ) & ( 0 0 1 1 0 0 0 ) = ( 0 0 0 1 0 0 0 )
• What is the next position of the window?
It depends on the value of the leftmost bit of D.
[Figure: text window and pattern, with the character x being read.]
BNDM algorithm: example
Given the pattern ATGTA
• The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
• The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
A T G T A
          A T G T A
                    A T G T A
                            A T G T A
First window (G T A C T):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window (A G A G G):
D1 = ( 0 0 1 0 0 )
D2 = ( 0 1 0 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 0 0 0 )
Third window (A C G T A):
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
BNDM algorithm: example of window shift
• Given the pattern ATGTA
• The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
• The searching phase:
G T A C T A G A G G A C G T A T G T A C T G ...
                    A T G T A
                            A T G T A
D1 = ( 1 0 0 0 1 )
D2 = ( 0 0 0 1 0 ) & ( 0 1 0 1 0 ) = ( 0 0 0 1 0 )
D3 = ( 0 0 1 0 0 ) & ( 0 0 1 0 0 ) = ( 0 0 1 0 0 )
D4 = ( 0 1 0 0 0 ) & ( 0 1 0 1 0 ) = ( 0 1 0 0 0 )
D5 = ( 1 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D6 = ( 0 0 0 0 0 ) & ( * * * * * ) = ( 0 0 0 0 0 ) Found
BNDM algorithm: example
Given the pattern ATGTA
• The mask of characters is:
B(A) = ( 1 0 0 0 1 )
B(C) = ( 0 0 0 0 0 )
B(G) = ( 0 0 1 0 0 )
B(T) = ( 0 1 0 1 0 )
• The searching phase:
G T A C T A G A A T A C G T A T G T A C T G ...
A T G T A
          A T G T A
                A T G T A
First window (G T A C T):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 0 0 0 0 0 ) = ( 0 0 0 0 0 )
Second window (A G A A T):
D1 = ( 0 1 0 1 0 )
D2 = ( 1 0 1 0 0 ) & ( 1 0 0 0 1 ) = ( 1 0 0 0 0 )
D3 = ( 0 0 0 0 0 ) & ( 1 0 0 0 1 ) = ( 0 0 0 0 0 )
How is the shift determined?
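A minimal sketch of the search, following the bit convention of the slides (leftmost bit = first cell of the pattern), may help with the question above. The function name `bndm_search` and the variables `top` and `last` are assumptions. `last` remembers the most recent point at which the leftmost bit of D was set, that is, where a prefix of the pattern ended the window, and it becomes the shift.

```python
def bndm_search(text: str, pattern: str):
    """Return (window positions visited, positions where the pattern occurs)."""
    n, m = len(text), len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # B[c]: mask of c in the pattern, leftmost bit = first pattern cell
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    full = (1 << m) - 1                     # m ones
    top = 1 << (m - 1)                      # leftmost bit
    windows, matches = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j, last = m, m
        D = full
        while D != 0:
            D &= B.get(text[pos + j - 1], 0)    # read the window from right to left
            j -= 1
            if D & top:                     # the suffix read so far is a prefix of the pattern
                if j > 0:
                    last = j                # remember it: it bounds the next shift
                else:
                    matches.append(pos)     # the whole window was read: occurrence
            D = (D << 1) & full             # keep only m bits
        pos += last
    return windows, matches


print(bndm_search("GTACTAGAGGACGTATGTACTG", "ATGTA"))   # ([0, 5, 10, 14], [14])
print(bndm_search("GTACTAGAATACGTATGTACTG", "ATGTA"))   # ([0, 5, 8, 13, 14], [14])
```

In the second text the second window ends in "AT", a prefix of ATGTA of length 2, so the leftmost bit is set when j = 3 and the window moves by 3 cells instead of 5.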
Extended string matching
• Classes of characters: some DNA files or patterns contain additional
characters such as N or R, where N = {A,C,G,T} and R = {G,A}.
• Bounded length gaps: patterns such as ATx(2,3)TA, where x(2,3)
means any 2 or 3 characters.
• Optional characters: patterns such as AC?ACT?T?A, where C?
means that C may or may not appear in the text.
• Wild cards: patterns such as AT*TA, where * means an arbitrarily long
string.
• Repeatable characters: patterns such as AT[TA]*AT, where [TA]*
means that TA can appear zero or more times.
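To make the five extensions concrete, the sketch below rewrites each pattern of the list as a regular expression for Python's standard `re` module. This swaps in a general-purpose regex engine rather than the dedicated matching algorithms the course refers to. The helper `classes_to_regex`, the pattern "ANRT" and all sample sequences are hypothetical, chosen only for illustration.

```python
import re

# Only the two class codes named in the list above.
CLASSES = {"N": "[ACGT]", "R": "[AG]"}


def classes_to_regex(pattern: str) -> str:
    """Replace each class code by a regex character class."""
    return "".join(CLASSES.get(c, c) for c in pattern)


# extension name -> (regular expression, sample sequence expected to match)
examples = {
    "classes of characters": (classes_to_regex("ANRT"), "AGGT"),
    "bounded length gap":    (r"AT[ACGT]{2,3}TA", "ATCGTA"),      # ATx(2,3)TA
    "optional characters":   (r"AC?ACT?T?A", "AACTA"),            # AC?ACT?T?A
    "wild card":             (r"AT[ACGT]*TA", "ATGGGGTA"),        # AT*TA
    "repeatable characters": (r"AT(?:TA)*AT", "ATTATAAT"),        # AT[TA]*AT
}

for name, (regex, sample) in examples.items():
    print(name, regex, bool(re.fullmatch(regex, sample)))         # all True
```

Note that the slide's [TA]* repeats the string TA, so it is written here as the group (?:TA)* and not as the regex character class [TA]*.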
Exact string matching for one pattern
Most efficient algorithms (Navarro & Raffinot)
[Figure: the same efficiency map of Horspool, BNDM and BOM by pattern length and alphabet size.]
BNDM : Backward Nondeterministic Dawg Matching
BOM : Backward Oracle Matching
Factor Oracle automaton: properties
Factor Oracle of the word G T A T G T A
[Figure: the factor oracle automaton, with its transitions labelled G, T and A.]
All states are accepting states.
It recognizes all factors … but also some other words; which ones?
If a word is rejected, it isn't a factor, then ...
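BOM algorithm (Backward Oracle Matching)
• How is the comparison made?
The window is checked from right to left with an automaton built from the pattern: the Factor Oracle.
• How many cells are shifted?
Let "a" be the text character being read:
• If "a" has no transition in the automaton ...
• If we reach the last state of the automaton with "a" ...
[Figure: text and pattern window, with the position of "a" in each of the two cases.]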
BOM algorithm: example
• The automaton of the reversed pattern is built: given the pattern ATGTATG
[Figure: factor oracle of G T A T G T A, with its transitions labelled G, T and A.]
• And the search is:
G T A C T A G A A T G T G T A G A C A T G T A T G G G A ...
A T G T A T G
            A T G T A T G
                A T G T A T G
                      A T G T A T G
                                    A T G T A T G
                                      A T G T A T G
How is the comparison made?
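The example can be reproduced with the sketch below. It is written under assumptions: the names `build_factor_oracle` and `bom_search` are illustrative, and the shift rule used (move the window just past the character on which the automaton fails, or by one cell after an occurrence) is that of the basic BOM algorithm. A compact oracle builder is included so that the block runs on its own.

```python
def build_factor_oracle(word: str):
    """Transitions of the factor oracle of word; state 0 is the initial state."""
    delta = [dict() for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, c in enumerate(word, start=1):
        delta[i - 1][c] = i                 # transition along the word itself
        k = supply[i - 1]
        while k > -1 and c not in delta[k]:
            delta[k][c] = i                 # extra transition towards the new state
            k = supply[k]
        supply[i] = 0 if k == -1 else delta[k][c]
    return delta


def bom_search(text: str, pattern: str):
    """Return (window positions visited, positions where the pattern occurs)."""
    n, m = len(text), len(pattern)
    delta = build_factor_oracle(pattern[::-1])      # oracle of the reversed pattern
    windows, matches = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        state, j = 0, m
        while j > 0 and state is not None:          # read the window from right to left
            state = delta[state].get(text[pos + j - 1])
            j -= 1
        if state is not None:                       # all m characters were accepted
            matches.append(pos)
        pos += j + 1                                # skip past the failing character
    return windows, matches


print(bom_search("GTACTAGAATGTGTAGACATGTATGGGA", "ATGTATG"))
# ([0, 6, 8, 11, 18, 19], [18])
```

The second window shows why the oracle is only an oracle: it accepts the five characters A T G T G read backwards, although ATGTG is not a factor of ATGTATG, and fails one character later.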
Factor Oracle automaton
Given the pattern GTATA, in which state are the factors accepted?
[Figure: factor oracle built so far for G T A; the first state accepts G, the second accepts GT and T, the third accepts GTA, TA and A.]
When the new T is read, 4 factors should be accepted: GTAT, TAT, AT, T. How can they be reached?
When the new A is read, 5 factors should be accepted: GTATA, TATA, ATA, TA, A. How can they be reached?
Factor Oracle automaton
When the new G is read, 6 factors should be accepted: GTATAG, TATAG, ATAG, TAG, AG, G.
[Figure: factor oracle of G T A T A extended with G; each state is annotated with the factors it accepts, from G, GT, T up to GTATA, TATA, ATA, TA, A and the six new factors.]
Factor Oracle automaton: linear algorithm
?
Factor Oracle automaton: algorithm
If there is a T transition ...
[Figure: automaton with the T transitions involved.]
Factor Oracle automaton: algorithm
… and recursively continue ...
But if there isn't a T transition ...
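The construction hinted at above can be sketched as follows, assuming the usual on-line construction with a supply link per state: if the state reached through the link already has a transition for the new character the process stops, otherwise a transition to the new state is added and the link is followed again. The names `build_factor_oracle`, `accepts` and `factors` are illustrative, not taken from the slides.

```python
def build_factor_oracle(word: str):
    """Return (delta, supply): transitions and supply links of the factor oracle."""
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    supply = [-1] * (m + 1)                 # supply link of the initial state is undefined
    for i, c in enumerate(word, start=1):
        delta[i - 1][c] = i                 # new state i, reached by reading c
        k = supply[i - 1]
        while k > -1 and c not in delta[k]:
            delta[k][c] = i                 # no c transition yet: add one and continue
            k = supply[k]
        supply[i] = 0 if k == -1 else delta[k][c]
    return delta, supply


def accepts(delta, w: str) -> bool:
    """True if w can be read from the initial state (all states are accepting)."""
    state = 0
    for c in w:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return True


def factors(word: str) -> set:
    """All non-empty substrings of word."""
    return {word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)}


delta, supply = build_factor_oracle("GTATGTA")
print(delta)
print(all(accepts(delta, f) for f in factors("GTATGTA")))        # True
print(accepts(delta, "GTGTA"), "GTGTA" in factors("GTATGTA"))    # True False
```

The last line gives one answer to "recognizes all factors, but which other words?": for G T A T G T A the oracle also accepts GTGTA, through the extra G transition that leaves the state reached by GT.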

Horspool Algorithm in Design and Analysis of Algorithms in VTU
