These slides cover string algorithms and data structures. They recap the Knuth-Morris-Pratt algorithm, which finds a pattern in a text in O(n+m) time, where n is the length of the text and m is the length of the pattern, and then introduce tries, suffix trees, and suffix arrays. Suffix trees and suffix arrays index all suffixes of a text and can be built in linear time. A suffix tree then answers a pattern query in O(m) time; a suffix array answers it in O(m log n) time, or O(m + log n) with extra stored information.
Advances in discrete energy minimisation for computer vision
1. String algorithms
Last time:
• Knuth-Morris-Pratt algorithm
- find pattern P in string T
- preprocess P (construct Fail array)
- search time: O(n+m), n=|T|, m=|P|
Today:
• Data structures for strings
- trie
- suffix tree & generalized suffix tree
- suffix array
• Many applications
- e.g. computing the longest common substring
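As a reminder of the recap above, here is a minimal Python sketch of Knuth-Morris-Pratt. The function names `build_fail` and `kmp_search` are illustrative choices, not taken from the slides; the array `fail` plays the role of the Fail array mentioned above.

```python
def build_fail(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    fail = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    if not pattern:
        return list(range(len(text) + 1))
    fail = build_fail(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches


print(kmp_search("mississippi", "iss"))
```

The text pointer `i` never moves backwards, which is where the O(n+m) bound comes from.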
2. Trie data structure
• Data structure for storing a set of strings
• Name comes from “retrieval”, usually pronounced as “try”
• Rooted tree
• Edges labeled with letters
• Leaves correspond to input strings
[Figure: trie for { had, held, help, hi }; the root edge h branches into a, e, i, and the leaves are had, held, help, hi]
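10. Trie data structure
• Cannot represent both a string T and a longer string TT' that has T as a prefix (e.g. h and hi), because T would end at an internal node
• Solution: append special symbol $ to the end of each word
[Figure: trie for { had$, held$, help$, hi$, h$ }; every word now ends at its own leaf]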
11. Trie data structure
• Application: storing dictionaries
• Space: O(m1+...+mk), mi=length of string Ti
• Looking up word of length n: O(n)
- assuming alphabet has constant size
• Alternative to binary trees
- pros and cons (see wikipedia)
- in a binary tree, all nodes store items, not just leaves
- in a binary tree, edges are unlabeled
[Figure: trie for { had$, held$, help$, hi$, h$ } next to a binary search tree storing h, had, held, help, hi]
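The trie operations described on the slides above can be sketched in a few lines of Python. This is only an illustration: the nested-dict representation and the names `END`, `trie_insert`, `trie_contains`, and `build_trie` are assumptions made here, not part of the slides.

```python
END = "$"  # end-of-word marker; assumed not to occur inside any word


def trie_insert(root, word):
    """Insert word + END; each dict maps an edge letter to a child node."""
    node = root
    for ch in word + END:
        node = node.setdefault(ch, {})


def trie_contains(root, word):
    """Look up a word in time proportional to its length."""
    node = root
    for ch in word + END:
        if ch not in node:
            return False
        node = node[ch]
    return True


def build_trie(words):
    root = {}
    for w in words:
        trie_insert(root, w)
    return root


trie = build_trie(["had", "held", "help", "hi", "h"])
print(trie_contains(trie, "h"), trie_contains(trie, "he"))
```

Thanks to the end marker, the word h and its extension hi are both stored, while the bare prefix he is correctly reported as absent.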
12. Tries for storing suffixes
• This lecture: use tries for storing the set of suffixes of string T
[Figure: trie of the suffixes of cacao$: cacao$, acao$, cao$, ao$, o$, $]
14. Reducing space requirements
• # leaves: n+1
• # internal nodes: at most n
• # edges: at most 2n
• => O(n) storage
• Trick 2: Store edge labels via two pointers into the string (begin & end)
[Figure: compressed trie of the suffixes of cacao$; the root has edges ca, a, o$, $, and the nodes below ca and a each have edges cao$ and o$]
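To make the two-pointer trick concrete, here is a hedged Python sketch that builds the compressed tree by inserting the suffixes one at a time. It is a naive O(n^2) construction chosen for brevity, not one of the linear-time algorithms; the class `Node` and the functions `build_suffix_tree` and `count_nodes` are names invented for this example. Each edge label is stored only as the pair (begin, end), meaning `text[begin:end]`.

```python
class Node:
    def __init__(self, begin=0, end=0):
        # The edge entering this node is labeled text[begin:end].
        self.begin = begin
        self.end = end
        self.children = {}  # first character of an edge label -> child Node


def build_suffix_tree(text):
    """Naive O(n^2) construction. Assumes text ends with a unique symbol
    such as '$', so no suffix is a prefix of another suffix."""
    root = Node()
    n = len(text)
    for start in range(n):
        node, i = root, start
        while True:
            child = node.children.get(text[i])
            if child is None:
                node.children[text[i]] = Node(i, n)  # new leaf edge
                break
            # Walk along the edge while the characters agree.
            j = child.begin
            while j < child.end and text[i] == text[j]:
                i += 1
                j += 1
            if j == child.end:
                node = child  # whole edge matched, continue below it
                continue
            # Mismatch inside the edge: split it at position j.
            mid = Node(child.begin, j)
            node.children[text[mid.begin]] = mid
            child.begin = j
            mid.children[text[j]] = child
            mid.children[text[i]] = Node(i, n)
            break
    return root


def count_nodes(node):
    """Return (leaves, internal nodes) of the subtree rooted at node."""
    if not node.children:
        return 1, 0
    leaves, internal = 0, 1
    for child in node.children.values():
        sub_leaves, sub_internal = count_nodes(child)
        leaves += sub_leaves
        internal += sub_internal
    return leaves, internal


text = "cacao$"
tree = build_suffix_tree(text)
print(sorted(text[c.begin:c.end] for c in tree.children.values()))
print(count_nodes(tree))
```

For cacao$ (n = 5) the sketch yields n+1 = 6 leaves and 3 internal nodes including the root, consistent with the bounds above.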
15. Suffix trees
• The compressed trie of all suffixes of T is called the suffix tree for string T [Weiner’73]
• D. Knuth: “algorithm of the year 1973”
• Can be constructed in linear time
- e.g. [McCreight’76], [Ukkonen’95]
- algorithms are complicated [will not be covered]
• Used for all sorts of string problems
[Figure: suffix tree for cacao$]
16. Suffix trees for string matching
Does text T contain pattern P?
• Construct suffix tree for T in O(|T|) time
• Query takes O(|P|) time!
• same as Knuth-Morris-Pratt, but...
[Figure: suffix tree for cacao$]
17. Suffix trees for string matching
• Scenario: text T is fixed, patterns P1,...,Pk come one at a time
- T is very long
• Time: O(|T| + |P1| + |P2| + ... + |Pk|)
• Knuth-Morris-Pratt can be generalized to multiple patterns (Aho-Corasick alg.). Same complexity as above, but requires knowledge of P1,...,Pk in advance
[Figure: suffix tree for cacao$]
18. Getting extra information
Task: Get all occurrences of P in T (give a list of start positions)
• Can be done in O(|P|+c) time where c is # of occurrences
- Locate the node corresponding to P
- Traverse all leaves of the subtree rooted at this node
[Figure: suffix tree for cacao$]
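The two steps above (locate the node for P, then collect the leaves below it) can be sketched as follows. As a simplification, this sketch uses an uncompressed trie of suffixes held in nested dicts rather than a real suffix tree, so it needs O(n^2) space and the leaf traversal is not bounded by O(c); only the query logic is meant to match. The names `build_suffix_trie` and `occurrences` and the leaf key `"start"` are assumptions made for this example, and the text is assumed not to contain `$`.

```python
def build_suffix_trie(text):
    """Uncompressed trie of all suffixes of text + '$' (simplification)."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # Leaf marker; the key "start" cannot clash with a single-letter edge.
        node["start"] = start
    return root


def occurrences(root, pattern):
    """Return the sorted 0-based start positions of pattern in the text."""
    # Step 1: locate the node corresponding to the pattern.
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Step 2: traverse all leaves of the subtree rooted at that node.
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "start":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("cacao")
print(occurrences(trie, "ca"))
```

In a real suffix tree every internal node has at least two children, so a subtree with c leaves has fewer than 2c nodes and the traversal costs O(c).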
19. Getting extra information
Task: Get # of occurrences of P in T
• Can be done in O(|P|) time
- With O(|T|) preprocessing
[Figure: suffix tree for cacao$]
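One way to read the preprocessing step is to store, at every node, the number of leaves in its subtree, so that a count query only has to walk down along P. The Python sketch below illustrates this on an uncompressed trie of suffixes, a simplification that costs O(n^2) space and build time instead of the O(|T|) preprocessing of a real suffix tree. The names `TrieNode`, `leaf_count`, `build_counting_trie`, and `count_occurrences` are hypothetical.

```python
class TrieNode:
    __slots__ = ("children", "leaf_count")

    def __init__(self):
        self.children = {}
        self.leaf_count = 0  # number of suffixes (leaves) below this node


def build_counting_trie(text):
    """Insert every suffix of text + '$' and count the suffixes that
    pass through each node (assumes '$' does not occur in text)."""
    text += "$"
    root = TrieNode()
    for start in range(len(text)):
        node = root
        node.leaf_count += 1
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
            node.leaf_count += 1
    return root


def count_occurrences(root, pattern):
    """Number of occurrences of pattern, found in O(|pattern|) steps."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.leaf_count


trie = build_counting_trie("mississippi")
print(count_occurrences(trie, "ss"), count_occurrences(trie, "i"))
```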
20. Generalized suffix tree
• Input: a set of strings {T1,...,Tk}
• Generalized suffix tree: tree containing all suffixes of all strings
• Tree for { abba, bac }:
[Figure: generalized suffix tree for { abba, bac }; suffixes of abba end in $1 and suffixes of bac end in $2]
21. Construction of generalized suffix tree for {X, Y}
• Construct suffix tree for X $1 Y $2
• Edges leading to red leaves (the suffixes that start inside X) are labeled as “...$1 Y $2”; delete “Y $2” from these labels
[Figure: suffix tree for abba$1bac$2, before and after trimming the red-leaf edge labels so that they end at $1]
23. Construction of generalized suffix tree for {T1,...,Tk}
• Can use the same trick
- construct suffix tree for T1 $1 ... Tk $k
• Linear runtime: O(|T1|+...+|Tk|)
• In practice, modify the algorithm (e.g. Ukkonen’s alg.) to get a slightly faster performance
24. Computing longest common substring of {T1,T2}
• Mark each internal node with red (blue) if it has at least one red (blue) child
• Nodes with both colors correspond to common substrings
- the two-colored node with the longest string gives the longest common substring
[Figure: generalized suffix tree for { abba, bac } with colored nodes]
26. Computing longest common substring of {T1,...,TK}
Task: For each r = 2,...,K compute the length of the longest string contained in at least r input strings
1. Construct generalized suffix tree for {T1,...,TK}
2. Mark each node v with color k=1,...,K if its subtree has a leaf with that color
3. Compute c(v) = # of colors at v
- Naive computation: O(nK), n=|T1|+...+|TK|. Can also be done in O(n)
- The string corresponding to v is contained in exactly c(v) input strings
...
...
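The coloring idea can be tried out with a small Python sketch. It does not build a real generalized suffix tree: for brevity it inserts every suffix of every input string into an uncompressed trie and records the colors directly on each node along the way, which takes quadratic time and space. The function name `longest_common_substrings` and the dict layout are assumptions for this example. The elided steps of the slide are not reconstructed here; the sketch simply scans all nodes for the longest string whose color count c(v) is at least r.

```python
def longest_common_substrings(strings):
    """For each r = 2..K return one longest string that occurs in at
    least r of the input strings (ties broken arbitrarily)."""
    K = len(strings)
    root = {"colors": set(), "children": {}}
    # Insert all suffixes; color k marks every node reached from string k.
    for color, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                children = node["children"]
                if ch not in children:
                    children[ch] = {"colors": set(), "children": {}}
                node = children[ch]
                node["colors"].add(color)
    best = {r: "" for r in range(2, K + 1)}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            label = prefix + ch
            c = len(child["colors"])  # c(v): number of strings containing label
            for r in range(2, c + 1):
                if len(label) > len(best[r]):
                    best[r] = label
            if c >= 2:  # descendants can only have fewer colors
                stack.append((child, label))
    return best


print(longest_common_substrings(["abba", "bac"]))
```

For the pair { abba, bac } used on the slides, the longest common substring found is ba.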
27. Suffix array for text T[1..n]
• Suffix array SA[0..n]: stores suffixes of string T in a lexicographic order
• SA[i] = id of the i’th smallest suffix
• Number k corresponds to suffix T[k+1..n]
• Lexicographic order:
- X < Xα...
- Xα... < Xβ... if α<β
Example for T = mississippi:
i    SA[i]  suffix
0    11     ε
1    10     i
2    7      ippi
3    4      issippi
4    1      ississippi
5    0      mississippi
6    9      pi
7    8      ppi
8    6      sippi
9    3      sissippi
10   5      ssippi
11   2      ssissippi
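The definition above translates almost directly into Python. The sketch below sorts the suffix start positions by comparing the suffixes themselves; this is a naive construction (worst-case O(n^2 log n) time), shown only to illustrate what the array contains. The function name `suffix_array` is an illustrative choice, and positions are 0-based, so id k stands for `text[k:]`.

```python
def suffix_array(text):
    """Return SA[0..n]: the start positions of all suffixes of text,
    including the empty suffix at position n, in lexicographic order."""
    n = len(text)
    return sorted(range(n + 1), key=lambda k: text[k:])


sa = suffix_array("mississippi")
print(sa)
for k in sa:
    print(k, "mississippi"[k:] or "ε")
```

For mississippi this produces the ids 11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2, the same order as in the example.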
28. Pattern search from a suffix array
Task: Find pattern P = iss in text T = mississippi (m=|P|, n=|T|)
• All occurrences appear contiguously in the array
• Computing start & end indexes: binary search
• Worst-case: O(m log n)
• In practice, often O(m + log n)
• Can be improved to O(m + log n) worst-case
- need to store extra 2n numbers
[Figure: suffix array of mississippi; the suffixes starting with iss (issippi 4, ississippi 1) are adjacent]
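The binary search described above might look like the following Python sketch. It implements the plain O(m log n) variant, without the extra 2n numbers needed for the O(m + log n) bound. The suffix array is built naively by sorting, and the function names `suffix_array` and `find_occurrences` are assumptions for this example. Each probe compares the pattern with the first m characters of a suffix, which stay in sorted order along the array.

```python
def suffix_array(text):
    """Naive construction by sorting the suffixes (for illustration only)."""
    return sorted(range(len(text) + 1), key=lambda k: text[k:])


def find_occurrences(text, sa, pattern):
    """Return the sorted 0-based start positions of pattern in text."""
    m = len(pattern)

    def prefix(idx):
        # First m characters of the idx-th smallest suffix.
        return text[sa[idx]:sa[idx] + m]

    # Start index: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # End index: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "iss"))
```

Each of the O(log n) probes compares at most m characters, which gives the O(m log n) worst case quoted above.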
29. Construction of suffix arrays
• SA[0..n] can be constructed in linear time
- from a suffix tree
- or directly [Kim,Sim,Park,Park’03] [Kärkkäinen&Sanders’03] [Ko & Aluru’03]
[Figure: suffix array of mississippi]
30. Summary
• Suffix trees & suffix arrays: two powerful data structures for strings
• Space requirements:
- suffix trees: linear but with a large constant, e.g. 20 |T|
- suffix arrays: more efficient, e.g. 5 |T|
• Suffix arrays are more suited to large alphabets
• Practitioners like suffix arrays (simplicity, space efficiency)
• Theoreticians like suffix trees (explicit structure)
[Figure: suffix array of mississippi]
31. Further information on string algorithms
• Dan Gusfield, “Algorithms on Strings, Trees and Sequences:
Computer Science and Computational Biology”, 1997
• Slides mentioning more recent results:
http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt