# Csr2011 june18 11_00_tiskin

### Csr2011 june18 11_00_tiskin

1. Approximate matching in grammar-compressed strings
    - Alexander Tiskin, Department of Computer Science, University of Warwick
    - http://www.dcs.warwick.ac.uk/~tiskin
    - Alexander Tiskin (Warwick) Approximate matching in GC-strings 1 / 52
2. Outline
    1. Introduction
    2. Semi-local string comparison
    3. Matrix distance multiplication
    4. Compressed string comparison
    5. Conclusions and future work
5. Introduction
    - String matching: finding an exact pattern in a string
    - String comparison: finding similar patterns in two strings
    - Applications: computational biology, image recognition, ...
    - Standard types of string comparison:
        - global: whole string vs whole string
        - local: substrings vs substrings
    - Main focus of this work:
        - semi-local: whole string vs substrings; prefixes vs suffixes
    - Closely related to approximate string matching (no relation to approximation algorithms!)
    - Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices)
6. Introduction: Terminology and notation
    - Integers: $\ldots, -2, -1, 0, 1, 2, \ldots$
    - Odd half-integers: $\ldots, -\tfrac52, -\tfrac32, -\tfrac12, \tfrac12, \tfrac32, \tfrac52, \ldots$
    - $(i, j) \ll (i', j')$ iff $i < i'$ and $j < j'$
    - $(i, j) \gtrless (i', j')$ iff $i < i'$ and $j > j'$
    - We consider finite and infinite integer matrices over integer and odd half-integer indices. For simplicity, index range will usually be ignored.
    - A permutation matrix is a 0/1 matrix with exactly one nonzero per row and per column, e.g.
      $$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
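10. Introduction: Terminology and notation
    - Given matrix $D$, its distribution matrix is
      $$D^\Sigma(i, j) = \sum_{\hat\imath > i,\ \hat\jmath < j} D(\hat\imath, \hat\jmath)$$
    - In other words, $D^\Sigma(i, j) = \sum D(\hat\imath, \hat\jmath)$, where $(\hat\imath, \hat\jmath)$ is $\gtrless$-dominated by $(i, j)$
    - Given matrix $E$, its density matrix is
      $$E^\square(\hat\imath, \hat\jmath) = E(\hat\imath - \tfrac12, \hat\jmath + \tfrac12) - E(\hat\imath - \tfrac12, \hat\jmath - \tfrac12) - E(\hat\imath + \tfrac12, \hat\jmath + \tfrac12) + E(\hat\imath + \tfrac12, \hat\jmath - \tfrac12)$$
    - where $D^\Sigma$, $E$ over integers; $D$, $E^\square$ over odd half-integers
    - Example:
      $$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^{\Sigma} = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix} \qquad \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}^{\square} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
    - $(D^\Sigma)^\square = D$ for all $D$
    - Matrix $E$ is simple, if $(E^\square)^\Sigma = E$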
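The two definitions above can be checked on the slide's 3×3 example with a minimal Python sketch. The function names `distribution` and `density` are illustrative, not from the talk, and the sketch assumes finite matrices stored as lists, where row `r` of `D` stands for the half-integer index `r + 1/2`.

```python
def distribution(D):
    """Return D^Sigma, where D^Sigma(i, j) sums D over rows > i and columns < j.

    Row r of D has half-integer index r + 1/2, so "row index > i" means r >= i
    and "column index < j" means c < j.
    """
    rows, cols = len(D), len(D[0])
    E = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(1, cols + 1):
            # Inclusion-exclusion step that adds the single cell D[i][j - 1].
            E[i][j] = E[i + 1][j] + E[i][j - 1] - E[i + 1][j - 1] + D[i][j - 1]
    return E


def density(E):
    """Return the density matrix of E, following the four-term formula above."""
    return [[E[r][c + 1] - E[r][c] - E[r + 1][c + 1] + E[r + 1][c]
             for c in range(len(E[0]) - 1)]
            for r in range(len(E) - 1)]


P = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
P_sigma = distribution(P)
```

Applying `density` to `P_sigma` returns `P` again, which is the identity $(D^\Sigma)^\square = D$ stated on the slide.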
12. Introduction: Terminology and notation
    - Matrix $E$ is Monge, if $E^\square$ is nonnegative
        - Intuition: border-to-border distances in a (weighted) planar graph
    - Matrix $E$ is unit-Monge, if $E^\square$ is a permutation matrix
        - Intuition: border-to-border distances in a grid-like graph
    - Simple unit-Monge matrix: $P^\Sigma$, where $P$ is a permutation matrix
    - Seaweed matrix: $P^\Sigma$, represented implicitly by $P$
    - Example:
      $$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^{\Sigma} = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
13. Introduction: Implicit unit-Monge matrices
    - Efficient $P^\Sigma$ queries: range tree on nonzeros of $P$ [Bentley: 1980]
        - binary search tree by $i$-coordinate
        - under every node, binary search tree by $j$-coordinate
    - [Figure: recursive decomposition of the nonzeros of $P$ in the range tree]
15. Introduction: Implicit unit-Monge matrices
    - Efficient $P^\Sigma$ queries (contd.)
    - Every node of the range tree represents a canonical range (rectangular region), and stores its nonzero count
    - Overall, $\le n \log n$ canonical ranges are non-empty
    - A $P^\Sigma$ query means dominance counting: how many nonzeros are dominated by query point? Answered by decomposing query range into $\le \log^2 n$ disjoint canonical ranges.
    - Total size $O(n \log n)$, query time $O(\log^2 n)$
    - There are asymptotically more efficient (but less practical) data structures
        - Total size $O(n)$, query time $O\bigl(\frac{\log n}{\log\log n}\bigr)$ [JáJá+: 2004] [Chan, Pătraşcu: 2010]
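A dominance-counting query can be sketched as follows. This is not the range tree from the slide but a closely related merge-sort tree: a segment tree over row indices whose nodes store sorted column lists, with the same $O(n \log n)$ size and $O(\log^2 n)$ query time. The class name `DominanceCounter` and the encoding of $P$ as a list `perm` (row `r` has its nonzero in column `perm[r]`) are assumptions made for illustration.

```python
from bisect import bisect_left


class DominanceCounter:
    """Answer P^Sigma(i, j) = #{rows r >= i whose nonzero column is < j}."""

    def __init__(self, perm):
        self.n = len(perm)
        self.size = 1
        while self.size < max(self.n, 1):
            self.size *= 2
        # tree[node] holds the sorted columns of all rows below that node.
        self.tree = [[] for _ in range(2 * self.size)]
        for r, c in enumerate(perm):
            self.tree[self.size + r] = [c]
        for node in range(self.size - 1, 0, -1):
            # Re-sorting keeps the sketch short; merging would be faster.
            self.tree[node] = sorted(self.tree[2 * node] + self.tree[2 * node + 1])

    def query(self, i, j):
        count = 0
        lo, hi = max(i, 0) + self.size, self.n + self.size  # rows [i, n)
        while lo < hi:
            if lo & 1:
                count += bisect_left(self.tree[lo], j)  # columns < j
                lo += 1
            if hi & 1:
                hi -= 1
                count += bisect_left(self.tree[hi], j)
            lo //= 2
            hi //= 2
        return count


counter = DominanceCounter([1, 0, 2])  # the 3x3 example permutation matrix
```

Each query visits $O(\log n)$ tree nodes and runs one binary search per node, which gives the $O(\log^2 n)$ bound quoted above.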
18. Semi-local string comparison: Semi-local LCS and edit distance
    - Consider strings (= sequences) over an alphabet of size $\sigma$
    - Distinguish contiguous substrings and not necessarily contiguous subsequences
    - Special cases of substring: prefix, suffix
    - Notation: strings $a$, $b$ of length $m$, $n$ respectively
    - Assume where necessary: $m \le n$; $m$, $n$ reasonably close
    - The longest common subsequence (LCS) score:
        - length of longest string that is a subsequence of both $a$ and $b$
        - equivalently, alignment score, where score(match) = 1 and score(mismatch) = 0
    - In biological terms, "loss-free alignment" (unlike "lossy" BLAST)
20. Semi-local string comparison: Semi-local LCS and edit distance
    - The LCS problem: give the LCS score for $a$ vs $b$
    - LCS: running time
        - $O(mn)$ [Wagner, Fischer: 1974]
        - $O\bigl(\frac{mn}{\log^2 n}\bigr)$ for $\sigma = O(1)$ [Masek, Paterson: 1980] [Crochemore+: 2003]
        - $O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$ [Paterson, Dančík: 1994] [Bille, Farach-Colton: 2008]
    - Running time varies depending on the RAM model
    - We assume word-RAM with word size $\log n$
21. Semi-local string comparison: Semi-local LCS and edit distance
    - LCS on the alignment graph (directed, acyclic)
    - [Figure: alignment graph with a = "baabcbca" along the side and b = "baabcabcabaca" along the top; blue = 0, red = 1]
    - LCS("baabcbca", "baabcabcabaca") = "baabcbca"
    - LCS = highest-score corner-to-corner path
22. Semi-local string comparison: Semi-local LCS and edit distance
    - LCS: dynamic programming [WF: 1974]
    - Sweep alignment graph by cells
    - Cell update: time $O(1)$
    - Overall time $O(mn)$
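The cell-by-cell sweep can be sketched in a few lines of Python. This is the textbook $O(mn)$ recurrence kept to two rows of the table; the function name `lcs_score` is an illustrative choice.

```python
def lcs_score(a, b):
    """LCS length of a and b by sweeping the alignment graph row by row."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of cells
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)         # diagonal (match) edge, score 1
            else:
                cur.append(max(prev[j], cur[j - 1]))  # vertical or horizontal edge, score 0
        prev = cur
    return prev[-1]


score = lcs_score("baabcbca", "baabcabcabaca")
```

On the talk's running example the whole of "baabcbca" is a subsequence of "baabcabcabaca", so the score equals the length of the shorter string.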
23. Semi-local string comparison: Semi-local LCS and edit distance
    - LCS: micro-block dynamic programming [MP: 1980; BF: 2008]
    - Sweep alignment graph by micro-blocks
    - Micro-block size: $t = O(\log n)$ when $\sigma = O(1)$; $t = O\bigl(\frac{\log n}{\log\log n}\bigr)$ otherwise
    - Micro-block interface:
        - $O(t)$ characters, each $O(\log \sigma)$ bits, can be reduced to $O(\log t)$ bits
        - $O(t)$ small integers, each $O(1)$ bits
    - Micro-block update: time $O(1)$, via table of all possible interfaces
    - Overall time $O\bigl(\frac{mn}{\log^2 n}\bigr)$ when $\sigma = O(1)$, $O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$ otherwise
26. Semi-local string comparison: Semi-local LCS and edit distance
    - The semi-local LCS problem: give the (implicit) matrix of $O(m^2 + n^2)$ LCS scores:
        - string-substring LCS: string $a$ vs every substring of $b$
        - prefix-suffix LCS: every prefix of $a$ vs every suffix of $b$
        - symmetrically, substring-string and suffix-prefix LCS
    - The three-way semi-local LCS problem: give the (implicit) matrix of $O(n^2)$ LCS scores:
        - string-substring, prefix-suffix, suffix-prefix LCS
        - no substring-string LCS
        - Suitable for $m \ll n$
    - Cf.: dynamic programming gives prefix-prefix LCS
27. Semi-local string comparison: Semi-local LCS and edit distance
    - Semi-local LCS on the alignment graph
    - [Figure: alignment graph with a = "baabcbca" along the side and b = "baabcabcabaca" along the top; blue = 0, red = 1]
    - score("baabcbca", "cabcaba") = 5 ("abcba")
    - Semi-local LCS = all highest-score border-to-border paths (string-substring = top-to-bottom, etc.)
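28. Semi-local string comparison: Score matrices and seaweed matrices
    - The score matrix $H$
    - $a$ = "baabcbca", $b$ = "baabcabcabaca"
    - $b' = b\langle 4 : 11 \rangle$ = "cabcaba"; $H(4, 11) = \mathrm{LCS}(a, b') = 5$
    - $H(i, j) = j - i$ if $i > j$

    $$H = \left(\begin{array}{rrrrrrrrrrrrrr}
    0 & 1 & 2 & 3 & 4 & 5 & 6 & 6 & 7 & 8 & 8 & 8 & 8 & 8 \\
    -1 & 0 & 1 & 2 & 3 & 4 & 5 & 5 & 6 & 7 & 7 & 7 & 7 & 7 \\
    -2 & -1 & 0 & 1 & 2 & 3 & 4 & 4 & 5 & 6 & 6 & 6 & 6 & 7 \\
    -3 & -2 & -1 & 0 & 1 & 2 & 3 & 3 & 4 & 5 & 5 & 6 & 6 & 7 \\
    -4 & -3 & -2 & -1 & 0 & 1 & 2 & 2 & 3 & 4 & 4 & 5 & 5 & 6 \\
    -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 & 4 & 4 & 5 & 5 & 6 \\
    -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 & 3 & 4 & 4 & 5 \\
    -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 2 & 3 & 3 & 4 \\
    -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 & 3 & 4 \\
    -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 & 4 \\
    -10 & -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 \\
    -11 & -10 & -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 \\
    -12 & -11 & -10 & -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 \\
    -13 & -12 & -11 & -10 & -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0
    \end{array}\right)$$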
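For small inputs, the string-substring part of $H$ can be recomputed by brute force, which is a handy cross-check of the matrix above. The sketch below is the naive $O(mn^2)$ method (one dynamic programming pass per starting position), not the seaweed algorithm; the name `string_substring_scores` is an assumption for illustration.

```python
def string_substring_scores(a, b):
    """H[i][j] = LCS(a, b[i:j]) for i <= j, and j - i for i > j."""
    n = len(b)
    H = [[j - i for j in range(n + 1)] for i in range(n + 1)]
    for i in range(n + 1):
        width = n - i
        prev = [0] * (width + 1)  # prev[k] = LCS(prefix of a, b[i:i+k])
        for x in a:
            cur = [0]
            for k in range(1, width + 1):
                if x == b[i + k - 1]:
                    cur.append(prev[k - 1] + 1)
                else:
                    cur.append(max(prev[k], cur[k - 1]))
            prev = cur
        for k in range(width + 1):
            H[i][i + k] = prev[k]
    return H


H = string_substring_scores("baabcbca", "baabcabcabaca")
```

Entry `H[4][11]` is the score of "baabcbca" against the substring "cabcaba" highlighted on the slide.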
29. Semi-local string comparison: Score matrices and seaweed matrices
    - Semi-local LCS: output representation and running time

    | size | query time | |
    |---|---|---|
    | $O(n^2)$ | $O(1)$ | trivial |
    | $O(m^{1/2} n)$ | $O(\log n)$ | string-substring [Alves+: 2003] |
    | $O(n)$ | $O(n)$ | string-substring [Alves+: 2005] |
    | $O(n \log n)$ | $O(\log^2 n)$ | [T: 2006] |

    - ... or any 2D orthogonal range counting data structure

    | running time | |
    |---|---|
    | $O(mn^2)$ | naive |
    | $O(mn)$ | string-substring [Schmidt: 1998; Alves+: 2005] |
    | $O(mn)$ | [T: 2006] |
    | $O\bigl(\frac{mn}{\log^{0.5} n}\bigr)$ | [T: 2006] |
    | $O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$ | [T: 2007] |
30. Semi-local string comparison: Score matrices and seaweed matrices
    - The score matrix $H$ and the seaweed matrix $P$
    - $H(i, j)$: the number of matched characters for $a$ vs substring $b\langle i : j \rangle$
    - $j - i - H(i, j)$: the number of unmatched characters
    - Properties of matrix $j - i - H(i, j)$:
        - simple unit-Monge
        - therefore, $= P^\Sigma$, where $P = -H^\square$ is a permutation matrix
    - $P$ is the seaweed matrix, giving an implicit representation of $H$
    - Range tree for $P$: memory $O(n \log n)$, query time $O(\log^2 n)$
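The step from "$j - i - H(i,j)$ is simple unit-Monge" to $P = -H^\square$ relies on the term $j - i$ having zero density. A short check, assuming only the four-term density definition from the terminology slides and its linearity:

```latex
\begin{align*}
(j - i)^\square(\hat\imath, \hat\jmath)
  &= \bigl((\hat\jmath + \tfrac12) - (\hat\imath - \tfrac12)\bigr)
   - \bigl((\hat\jmath - \tfrac12) - (\hat\imath - \tfrac12)\bigr) \\
  &\quad - \bigl((\hat\jmath + \tfrac12) - (\hat\imath + \tfrac12)\bigr)
   + \bigl((\hat\jmath - \tfrac12) - (\hat\imath + \tfrac12)\bigr) = 0, \\
P &= \bigl(j - i - H(i, j)\bigr)^\square = -H^\square .
\end{align*}
```

The four bracketed terms are $\hat\jmath - \hat\imath + 1$, $\hat\jmath - \hat\imath$, $\hat\jmath - \hat\imath$ and $\hat\jmath - \hat\imath - 1$, which cancel with the signs $+, -, -, +$.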
33. Semi-local string comparison: Score matrices and seaweed matrices
    - The score matrix $H$ and the seaweed matrix $P$
    - $a$ = "baabcbca", $b$ = "baabcabcabaca"
    - $b' = b\langle 4 : 11 \rangle$ = "cabcaba"; $H(4, 11) = \mathrm{LCS}(a, b') = 5$
    - $H(i, j) = j - i$ if $i > j$
    - [Figure: the 14×14 score matrix $H$ for $a$ vs $b$, overlaid with colour marks. Blue: difference in $H$ is 0. Red: difference in $H$ is 1. Green: $P(i, j) = 1$]
    - $H(i, j) = j - i - P^\Sigma(i, j)$
34. Semi-local string comparison: Score matrices and seaweed matrices
    - The score matrix $H$ and the seaweed matrix $P$
    - $a$ = "baabcbca", $b$ = "baabcabcabaca"
    - $b' = b\langle 4 : 11 \rangle$ = "cabcaba"
    - $H(4, 11) = \mathrm{LCS}(a, b') = 11 - 4 - P^\Sigma(4, 11) = 11 - 4 - 2 = 5$
    - [Figure: nonzeros of $P$, with the two nonzeros counted by $P^\Sigma(4, 11)$]
36. Semi-local string comparison: Score matrices and seaweed matrices
    - The seaweeds in the alignment graph
    - [Figure: seaweeds drawn across the alignment graph of a = "baabcbca" vs b = "baabcabcabaca"]
    - $b' = b\langle 4 : 11 \rangle$ = "cabcaba"; $H(4, 11) = \mathrm{LCS}(a, b') = 11 - 4 - P^\Sigma(4, 11) = 11 - 4 - 2 = 5$
    - $P(i, j) = 1$ corresponds to seaweed (top, $i$) $\leadsto$ (bottom, $j$)
    - Also define top $\leadsto$ right, left $\leadsto$ right, left $\leadsto$ bottom seaweeds
    - Gives bijection between top-left and bottom-right borders
39. Matrix distance multiplication: Seaweed braids
    - Distance algebra (a.k.a. (min, +) or tropical algebra): $\oplus$ is min, $\odot$ is +
    - Matrix $\odot$-multiplication: $A \odot B = C$, where
      $$C(i, k) = \bigoplus_j A(i, j) \odot B(j, k) = \min_j \bigl(A(i, j) + B(j, k)\bigr)$$
    - Matrix classes closed under $\odot$-multiplication (for given $n$):
        - general numerical (integer, real) matrices
        - Monge matrices
        - simple unit-Monge matrices
    - $P_A^\Sigma \odot P_B^\Sigma = P_C^\Sigma$, written as $P_A \boxdot P_B = P_C$
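The closure of simple unit-Monge matrices under distance multiplication can be observed directly on small examples. The sketch below is a naive cubic-time reference, not the fast algorithm of the talk: it expands both permutation matrices to their distribution matrices, multiplies them in the (min, +) algebra, and takes the density of the result. All function names are illustrative assumptions.

```python
def distribution(P):
    """P^Sigma(i, j) = number of nonzeros of P in rows >= i and columns < j."""
    n = len(P)
    E = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            E[i][j] = E[i + 1][j] + E[i][j - 1] - E[i + 1][j - 1] + P[i][j - 1]
    return E


def density(E):
    """Four-term density formula from the terminology slides."""
    return [[E[r][c + 1] - E[r][c] - E[r + 1][c + 1] + E[r + 1][c]
             for c in range(len(E[0]) - 1)]
            for r in range(len(E) - 1)]


def min_plus(A, B):
    """Distance product: C(i, k) = min over j of A(i, j) + B(j, k)."""
    return [[min(A[i][j] + B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]


def implicit_product(PA, PB):
    """Naive reference for the implicit product of two permutation matrices."""
    return density(min_plus(distribution(PA), distribution(PB)))


PA = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
PB = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
PC = implicit_product(PA, PB)
```

For these two inputs the result is again a permutation matrix, as the closure property predicts; multiplying `PA` by itself returns `PA`, a first glimpse of the idempotence discussed on the following slides.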
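40. Matrix distance multiplication: Seaweed braids
    - The seaweed monoid $T_n$:
        - simple unit-Monge matrices under $\odot$-multiplication
        - permutation matrices under $\boxdot$-multiplication
    - Identity: $1 \boxdot x = x$
    - Zero: $0 \boxdot x = 0$

    $$1 = \begin{pmatrix} \bullet & \cdot & \cdot & \cdot \\ \cdot & \bullet & \cdot & \cdot \\ \cdot & \cdot & \bullet & \cdot \\ \cdot & \cdot & \cdot & \bullet \end{pmatrix} \qquad 0 = \begin{pmatrix} \cdot & \cdot & \cdot & \bullet \\ \cdot & \cdot & \bullet & \cdot \\ \cdot & \bullet & \cdot & \cdot \\ \bullet & \cdot & \cdot & \cdot \end{pmatrix}$$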
44. Matrix distance multiplication: Seaweed braids
    - $P_A \boxdot P_B = P_C$ can be seen as $\boxdot$-multiplication of seaweed braids
    - [Figure: permutation matrices $P_A$, $P_B$, $P_C$ and the corresponding seaweed braids]
45. Matrix distance multiplication: Seaweed braids
    - Seaweed braids: similar to standard braids, generated by crossings
    - Unlike in standard braids, all seaweed crossings are
        - transversal, i.e. on one level (not underpass/overpass)
        - idempotent, i.e. two seaweeds can cross at most once
    - Seaweed braid $\boxdot$-multiplication: associative, no inverse (a crossing cannot be undone)
    - Identity: $1 \boxdot x = x$
    - Zero: $0 \boxdot x = 0$
    - [Figure: the braids $1$ and $0$]
46. Matrix distance multiplication: Seaweed braids
    - The seaweed monoid $T_n$:
        - $n!$ elements (permutations of size $n$)
        - $n - 1$ generators $g_1, g_2, \ldots, g_{n-1}$ (elementary crossings)
    - idempotence: $g_i^2 = g_i$ for all $i$
    - far commutativity: $g_i g_j = g_j g_i$ for $j - i > 1$
    - braid relations: $g_i g_j g_i = g_j g_i g_j$ for $j - i = 1$
    - [Figure: braid diagrams for the three relations]
47. Matrix distance multiplication: Seaweed braids
    - The seaweed monoid $T_n$
    - Also known as the 0-Hecke monoid of the symmetric group $H_0(S_n)$
    - Generalisations:
        - general 0-Hecke monoids [Fomin, Greene: 1998; Buch+: 2008]
        - Coxeter monoids [Tsaranov: 1990; Richardson, Springer: 1990]
51. Matrix distance multiplication: Seaweed braids
    - Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP)
    - $T_3$: 1, $a = g_1$, $b = g_2$; $ab$, $ba$, $aba = 0$
        - $aa \to a$, $bb \to b$, $bab \to 0$, $aba \to 0$
    - $T_4$: 1, $a = g_1$, $b = g_2$, $c = g_3$; $ab$, $ac$, $ba$, $bc$, $cb$, $aba$, $abc$, $acb$, $bac$, $bcb$, $cba$, $abac$, $abcb$, $acba$, $bacb$, $bcba$, $abacb$, $abcba$, $bacba$, $abacba = 0$
        - $aa \to a$, $ca \to ac$, $bab \to aba$, $cbac \to bcba$
        - $bb \to b$, $cc \to c$, $cbc \to bcb$, $abacba \to 0$
    - Easy to use, but not an efficient algorithm
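The element counts and relations listed above can be reproduced by brute force. The sketch below does not use a rewriting system: it represents each element as a permutation tuple, multiplies by naive distance multiplication of distribution matrices, and closes the set of generators under the product. The names `multiply`, `generators` and `seaweed_monoid` are assumptions for illustration.

```python
def distribution(perm):
    """Distribution matrix of the permutation matrix with nonzeros (r, perm[r])."""
    n = len(perm)
    E = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            E[i][j] = (E[i + 1][j] + E[i][j - 1] - E[i + 1][j - 1]
                       + (1 if perm[i] == j - 1 else 0))
    return E


def multiply(pa, pb):
    """Seaweed product of two permutations via naive (min, +) multiplication."""
    n = len(pa)
    A, B = distribution(pa), distribution(pb)
    C = [[min(A[i][j] + B[j][k] for j in range(n + 1)) for k in range(n + 1)]
         for i in range(n + 1)]
    # Read the nonzero of each row off the density of C.
    return tuple(next(c for c in range(n)
                      if C[r][c + 1] - C[r][c] - C[r + 1][c + 1] + C[r + 1][c] == 1)
                 for r in range(n))


def generators(n):
    """Elementary crossings g_1, ..., g_{n-1} as adjacent transpositions."""
    gens = []
    for i in range(n - 1):
        g = list(range(n))
        g[i], g[i + 1] = g[i + 1], g[i]
        gens.append(tuple(g))
    return gens


def seaweed_monoid(n):
    """All elements reachable from the identity by multiplying with generators."""
    identity = tuple(range(n))
    seen, frontier = {identity}, [identity]
    while frontier:
        x = frontier.pop()
        for g in generators(n):
            y = multiply(x, g)
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen


a, b = generators(3)
aba = multiply(multiply(a, b), a)
bab = multiply(multiply(b, a), b)
```

Both `aba` and `bab` come out as the reversal permutation, the zero of $T_3$, matching the rewriting rules $aba \to 0$ and $bab \to 0$.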
53. Matrix distance multiplication: Implicit unit-Monge $\boxdot$-multiplication
    - The implicit unit-Monge matrix $\boxdot$-multiplication problem: given permutation matrices $P_A$, $P_B$, compute $P_C$, such that $P_A^\Sigma \odot P_B^\Sigma = P_C^\Sigma$ (equivalently, $P_A \boxdot P_B = P_C$)
    - Matrix $\odot$-multiplication: running time

    | type | time | |
    |---|---|---|
    | general | $O(n^3)$ | standard |
    | | $O\bigl(\frac{n^3 (\log\log n)^3}{\log^2 n}\bigr)$ | [Chan: 2007] |
    | Monge | $O(n^2)$ | via [Aggarwal+: 1987] |
    | implicit unit-Monge | $O(n^{1.5})$ | [T: 2006] |
    | | $O(n \log n)$ | [T: 2010] |
54. Matrix distance multiplication: Implicit unit-Monge $\boxdot$-multiplication
    - [Figure: nonzeros of $P_A$ and $P_B$; $P_C$ = ?]
58. Matrix distance multiplication: Implicit unit-Monge $\boxdot$-multiplication
    - [Figure: nonzeros of $P_{A,lo}$, $P_{A,hi}$ and $P_{B,lo}$, $P_{B,hi}$, with the combined nonzeros of $P_{C,lo} + P_{C,hi}$]
59. Matrix distance multiplication: Implicit unit-Monge $\boxdot$-multiplication
    - [Figure: two configurations compared ("vs")]
61. Matrix distance multiplication: Implicit unit-Monge $\boxdot$-multiplication
    - [Figure: nonzeros of $P_{A,lo}$, $P_{A,hi}$ and $P_{B,lo}$, $P_{B,hi}$, with the resulting nonzeros of $P_C$]
62. Matrix distance multiplication: Implicit unit-Monge $\boxdot$-multiplication
    - Implicit unit-Monge matrix $\boxdot$-multiplication: the algorithm
      $$P_C^\Sigma(i, k) = \min_j \bigl(P_A^\Sigma(i, j) + P_B^\Sigma(j, k)\bigr)$$
    - Divide-and-conquer on the range of $j$
    - Divide $P_A$ horizontally, $P_B$ vertically; two subproblems of effective size $n/2$:
      $$P_{A,lo}^\Sigma \odot P_{B,lo}^\Sigma = P_{C,lo}^\Sigma \qquad P_{A,hi}^\Sigma \odot P_{B,hi}^\Sigma = P_{C,hi}^\Sigma$$
    - Conquer: most (but not all!) nonzeros of $P_{C,lo}$, $P_{C,hi}$ appear in $P_C$
    - Missing nonzeros can be obtained in time $O(n)$ using the Monge property
    - Overall time $O(n \log n)$
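The overall bound follows from the usual divide-and-conquer recurrence: two subproblems of effective size $n/2$ plus a linear-time conquer step. A brief derivation, where the constant $c$ is an assumption standing for the cost per element of the divide and conquer phases:

```latex
\begin{align*}
T(n) &= 2\,T(n/2) + c\,n \\
     &= 4\,T(n/4) + 2\,c\,n \\
     &= \cdots = n\,T(1) + c\,n \log_2 n = O(n \log n).
\end{align*}
```

Each of the $\log_2 n$ recursion levels contributes $c\,n$ work in total, which is where the logarithmic factor comes from.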
64. Compressed string comparison: Grammar compression
    - Notation: text $t$ of length $n$; pattern $p$ of length $m$
    - A GC-string (grammar-compressed string) $t$ is a straight-line program (context-free grammar) generating $t = t_{\bar n}$ by $\bar n$ assignments of the form
        - $t_k = \alpha$, where $\alpha$ is an alphabet character
        - $t_k = t_i t_j$, where $i, j < k$
    - In general, $n = O(2^{\bar n})$
    - Example: Fibonacci string "abaababaabaab"
        - $t_1$ = 'b', $t_2$ = 'a'
        - $t_3 = t_2 t_1$, $t_4 = t_3 t_2$, $t_5 = t_4 t_3$, $t_6 = t_5 t_4$, $t_7 = t_6 t_5$
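The Fibonacci example can be written down as a small straight-line program in Python. The dictionary encoding (`k -> character` or `k -> (i, j)`) and the names `FIBONACCI_SLP`, `slp_lengths` and `slp_expand` are assumptions made for this sketch.

```python
# Straight-line program for the slide's example: t1 = 'b', t2 = 'a', t_k = t_{k-1} t_{k-2}.
FIBONACCI_SLP = {1: 'b', 2: 'a', 3: (2, 1), 4: (3, 2), 5: (4, 3), 6: (5, 4), 7: (6, 5)}


def slp_lengths(slp):
    """Length of every t_k, computed without expanding any string."""
    length = {}
    for k in sorted(slp):  # right-hand sides only refer to smaller indices
        rhs = slp[k]
        length[k] = 1 if isinstance(rhs, str) else length[rhs[0]] + length[rhs[1]]
    return length


def slp_expand(slp, k):
    """Fully expand t_k; exponential in the grammar size in the worst case."""
    text = {}
    for idx in sorted(slp):
        rhs = slp[idx]
        text[idx] = rhs if isinstance(rhs, str) else text[rhs[0]] + text[rhs[1]]
    return text[k]


fib = slp_expand(FIBONACCI_SLP, 7)
```

A doubling grammar ($t_1$ = 'a', $t_k = t_{k-1} t_{k-1}$) produces a string of length $2^{k-1}$ from $k$ assignments, which illustrates the bound $n = O(2^{\bar n})$.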
65. Compressed string comparison: Grammar compression
    - Grammar-compression covers various compression types, e.g. LZ78, LZW (not LZ77 directly)
    - Simplifying assumption: arithmetic up to $n$ runs in $O(1)$
    - This assumption can be removed by careful index remapping
66. Compressed string comparison: Three-way semi-local LCS on GC-strings
    - LCS: running time

    | $t$ | $p$ | time | | |
    |---|---|---|---|---|
    | plain | plain | $O(mn)$ | | [Wagner, Fischer: 1974] |
    | | | $O\bigl(\frac{mn}{\log^2 m}\bigr)$ | | [Masek, Paterson: 1980] [Crochemore+: 2003] |
    | GC | plain | $O(m^3 \bar n + \ldots)$ | general CFG | [Myers: 1995] |
    | | | $O(m^{1.5} \bar n)$ | 3-way semi | [T: 2008] |
    | | | $O(m \log m \cdot \bar n)$ | 3-way semi | [T: NEW] |
    | GC | GC | NP-hard | | [Lifshits: 2005] |
    | | | $O(r^{1.2} \bar r^{1.4})$ | | [Hermelin+: 2009] |
    | | | $O(r \log r \cdot \bar r)$ | | [T: NEW] |

    - $r = m + n$, $\bar r = \bar m + \bar n$
67. Compressed string comparison: Three-way semi-local LCS on GC-strings
    - Three-way semi-local LCS (GC text, plain pattern): the algorithm
    - For every $k$, compute by recursion the three-way seaweed matrix for $p$ vs $t_k$, using seaweed matrix $\boxdot$-multiplication: time $O(m \log m \cdot \bar n)$
    - Overall time $O(m \log m \cdot \bar n)$
68. Compressed string comparison: Subsequence recognition on GC-strings
    - The global subsequence recognition problem: does text $t$ contain pattern $p$ as a subsequence?
    - Global subsequence recognition: running time

    | $t$ | $p$ | time | |
    |---|---|---|---|
    | plain | plain | $O(n)$ | greedy |
    | GC | plain | $O(m \bar n)$ | greedy |
    | GC | GC | NP-hard | [Lifshits: 2005] |
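One way to realise the $O(m \bar n)$ greedy bound for a GC text and a plain pattern is to tabulate, for every grammar symbol, how far a greedy left-to-right match advances from each pattern position. The sketch below is one such formulation and is not claimed to be the exact method behind the slide; the grammar encoding and the name `is_subsequence_gc` are assumptions.

```python
def is_subsequence_gc(slp, p):
    """Decide whether p is a subsequence of the text generated by the grammar.

    slp maps k to a character or to a pair (i, j) with i, j < k; the text is
    the expansion of the symbol with the largest index.
    """
    m = len(p)
    advance = {}  # advance[k][i] = pattern position after greedily reading t_k from position i
    for k in sorted(slp):
        rhs = slp[k]
        if isinstance(rhs, str):
            advance[k] = [i + 1 if i < m and p[i] == rhs else i for i in range(m + 1)]
        else:
            left, right = advance[rhs[0]], advance[rhs[1]]
            # Reading t_i t_j greedily: first t_i, then t_j from wherever t_i stopped.
            advance[k] = [right[left[i]] for i in range(m + 1)]
    return advance[max(slp)][0] == m


# Fibonacci grammar for "abaababaabaab" (8 letters 'a', 5 letters 'b').
FIB = {1: 'b', 2: 'a', 3: (2, 1), 4: (3, 2), 5: (4, 3), 6: (5, 4), 7: (6, 5)}
found = is_subsequence_gc(FIB, "bbbbb")
```

Each of the $\bar n$ grammar symbols gets a table of $m + 1$ entries, and each entry costs constant time, so the text is never expanded.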
69. Compressed string comparison: Subsequence recognition on GC-strings
    - The local subsequence recognition problem: find all minimally matching substrings of $t$ with respect to $p$
    - A substring of $t$ is matching, if $p$ is a subsequence of that substring
    - A matching substring of $t$ is minimally matching, if none of its proper substrings are matching
70. Compressed string comparison: Subsequence recognition on GC-strings
    - Local subsequence recognition: running time (+ output)

    | $t$ | $p$ | time | |
    |---|---|---|---|
    | plain | plain | $O(mn)$ | [Mannila+: 1995] |
    | | | $O\bigl(\frac{mn}{\log m}\bigr)$ | [Das+: 1997] |
    | | | $O(c^m + n)$ | [Boasson+: 2001] |
    | | | $O(m + n\sigma)$ | [Troníček: 2001] |
    | GC | plain | $O(m^2 \log m \cdot \bar n)$ | [Cégielski+: 2006] |
    | | | $O(m^{1.5} \bar n)$ | [T: 2008] |
    | | | $O(m \log m \cdot \bar n)$ | [T: NEW] |
    | GC | GC | NP-hard | [Lifshits: 2005] |
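71. Compressed string comparison: Subsequence recognition on GC-strings
    - [Figure: seaweeds with endpoints $\hat\imath_{1/2}, \hat\imath_{3/2}, \hat\imath_{5/2}, \ldots$ and $\hat\jmath_{1/2}, \hat\jmath_{3/2}, \hat\jmath_{5/2}, \ldots$ on the range $0$ to $n$]
    - $b\langle i : j \rangle$ matching iff box $[i : j]$ not pierced left-to-right
    - $\gtrless$-maximal seaweeds: $\ll$-chain
      $$(\hat\imath_{1/2}, \hat\jmath_{1/2}) \ll (\hat\imath_{3/2}, \hat\jmath_{3/2}) \ll \cdots \ll (\hat\imath_{s-1/2}, \hat\jmath_{s-1/2})$$
    - $b\langle i : j \rangle$ minimally matching iff $(i, j)$ is in the interleaved $\ll$-chain
      $$(\hat\imath_{3/2}, \hat\jmath_{1/2}) \ll (\hat\imath_{5/2}, \hat\jmath_{3/2}) \ll \cdots \ll (\hat\imath_{s-1/2}, \hat\jmath_{s-3/2})$$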
74. Compressed string comparison: Subsequence recognition on GC-strings
    - Local subsequence recognition (GC text, plain pattern): the algorithm
    - For every $k$, compute by recursion the three-way seaweed matrix for $p$ vs $t_k$, using seaweed matrix $\boxdot$-multiplication: time $O(m \log m \cdot \bar n)$
    - Given an assignment $t = t' t''$, count by recursion
        - minimally matching substrings in $t'$
        - minimally matching substrings in $t''$
    - Then, find $\ll$-chain of $\gtrless$-maximal seaweeds in time $\bar n \cdot O(m) = O(m \bar n)$
    - The interleaved $\ll$-chain defines minimally matching substrings in $t$ overlapping both $t'$ and $t''$
    - Overall time $O(m \log m \cdot \bar n) + O(m \bar n) = O(m \log m \cdot \bar n)$
75. Compressed string comparison: Subsequence recognition on GC-strings
    - The threshold approximate matching problem: find all matching substrings of $t$ with respect to $p$, according to a threshold $k$
    - A substring of $t$ is matching, if the edit distance for $p$ vs that substring is at most $k$
76. Compressed string comparison: Subsequence recognition on GC-strings
    - Threshold approximate matching: running time (+ output)

    | $t$ | $p$ | time | |
    |---|---|---|---|
    | plain | plain | $O(mn)$ | [Sellers: 1980] |
    | | | $O(nk)$ | [Landau, Vishkin: 1989] |
    | | | $O\bigl(m + n + \frac{nk^4}{m}\bigr)$ | [Cole, Hariharan: 2002] |
    | GC | plain | $O(m \bar n k^2)$ | [Kärkkäinen+: 2003] |
    | | | $O(m \bar n k + \bar n \log n)$ | [LV: 1989] via [Bille+: 2010] |
    | | | $O(m \bar n + \bar n k^4 + \bar n \log n)$ | [CH: 2002] via [Bille+: 2010] |
    | | | $O(m \log m \cdot \bar n)$ | [T: NEW] |
    | GC | GC | NP-hard | [Lifshits: 2005] |

    - (Also many specialised variants for LZ compression)
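For plain text and plain pattern, the $O(mn)$ baseline in the first row of the table is the classical column-by-column dynamic programme in the style of Sellers. The sketch below reports the end positions of matching substrings rather than the substrings themselves; the name `threshold_match_ends` is an assumption, and the compressed-text algorithm of the talk is not reproduced here.

```python
def threshold_match_ends(p, t, k):
    """End positions j >= 1 such that some substring of t ending at j is within edit distance k of p."""
    m = len(p)
    # col[i] = minimum edit distance between p[:i] and a substring of t ending
    # at the current text position; initially that position is 0.
    col = list(range(m + 1))
    ends = []
    for j, y in enumerate(t, start=1):
        new = [0]  # the empty pattern prefix matches the empty substring anywhere
        for i in range(1, m + 1):
            new.append(min(col[i - 1] + (p[i - 1] != y),  # match or substitution
                           col[i] + 1,                    # extra text character
                           new[i - 1] + 1))               # unmatched pattern character
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends


ends = threshold_match_ends("abc", "xxabcxx", 1)
```

With threshold 0 the function degenerates to exact matching and reports only the end of the single occurrence of "abc".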
77. Compressed string comparison: Subsequence recognition on GC-strings
    - Threshold approximate matching (GC text, plain pattern): the algorithm
    - Algorithm structure similar to local subsequence recognition by seaweed matrix $\boxdot$-multiplication and seaweed $\ll$-chains
    - Extra ingredients:
        - the blow-up technique: reduction of edit distances to LCS scores
        - the "implicit SMAWK" technique: row minima in an implicit Monge matrix by an extension of the classical "SMAWK" algorithm; replaces $\ll$-chain interleaving
    - Overall time $O(m \log m \cdot \bar n) + O(m \bar n) = O(m \log m \cdot \bar n)$
81. Conclusions and future work
    - A powerful alternative to dynamic programming
    - Implicit unit-Monge matrices:
        - the seaweed monoid
        - distance multiplication in time $O(n \log n)$
        - next: lower bound?
    - Semi-local LCS problem:
        - representation by implicit unit-Monge matrices
        - generalisation to rational alignment scores
        - next: real alignment scores?
    - Approximate matching in GC-text in time $O(m \log m \cdot \bar n)$
    - Other applications:
        - maximum clique in a circle graph in time $O(n \log^2 n)$
        - parallel LCS in time $O\bigl(\frac{mn}{p}\bigr)$, comm $O\bigl(\frac{m + n}{p^{1/2}}\bigr)$ per processor
        - identification of evolutionary-conserved regions in DNA