Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Csr2011 june18 11_00_tiskin

474 views

Published on

  • Be the first to comment

  • Be the first to like this

Csr2011 june18 11_00_tiskin

  1. 1. Approximate matching in grammar-compressed strings Alexander Tiskin Department of Computer Science University of Warwick http://www.dcs.warwick.ac.uk/~tiskinAlexander Tiskin (Warwick) Approximate matching in GC-strings 1 / 52
  2. 2. 1 Introduction2 Semi-local string comparison3 Matrix distance multiplication4 Compressed string comparison5 Conclusions and future work Alexander Tiskin (Warwick) Approximate matching in GC-strings 2 / 52
  3. 3. 1 Introduction2 Semi-local string comparison3 Matrix distance multiplication4 Compressed string comparison5 Conclusions and future work Alexander Tiskin (Warwick) Approximate matching in GC-strings 3 / 52
  4. 4. IntroductionString matching: finding an exact pattern in a stringString comparison: finding similar patterns in two stringsApplications: computational biology, image recognition, . . . Alexander Tiskin (Warwick) Approximate matching in GC-strings 4 / 52
  5. 5. IntroductionString matching: finding an exact pattern in a stringString comparison: finding similar patterns in two stringsApplications: computational biology, image recognition, . . .Standard types of string comparison: global: whole string vs whole string local: substrings vs substringsMain focus of this work: semi-local: whole string vs substrings; prefixes vs suffixesClosely related to approximate string matching (no relation toapproximation algorithms!)Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices) Alexander Tiskin (Warwick) Approximate matching in GC-strings 4 / 52
  6. 6. IntroductionTerminology and notationIntegers: . . . − 2, −1, 0, 1, 2, . . .Odd half-integers: . . . − 2 , − 3 , − 1 , 2 , 2 , 5 , . . . 5 2 2 1 3 2(i, j) (i , j ) iff i < i and j < j (i, j) (i , j ) iff i < i and j > jWe consider finite and infinite integer matrices over integer and oddhalf-integer indices. For simplicity, index range will usually be ignored.A permutation matrix is a 0/1 matrix with exactly one nonzero per rowand per column  0 1 01 0 0 0 0 1 Alexander Tiskin (Warwick) Approximate matching in GC-strings 5 / 52
  7. 7. IntroductionTerminology and notationGiven matrix D, its distribution matrix is D Σ (i, j) = ˆ>i,ˆ<j ı  D(ˆ, ) ı ˆIn other words, D Σ (i, j) = D(ˆ, ), where (ˆ, ) is ı ˆ ı ˆ -dominated by (i, j) Alexander Tiskin (Warwick) Approximate matching in GC-strings 6 / 52
  8. 8. IntroductionTerminology and notationGiven matrix D, its distribution matrix is D Σ (i, j) = ˆ>i,ˆ<j ı  D(ˆ, ) ı ˆIn other words, D Σ (i, j) = D(ˆ, ), where (ˆ, ) is ı ˆ ı ˆ -dominated by (i, j)Given matrix E , its density matrix isE (ˆ, ) = E (ˆ− 1 ,  + 1 ) − E (ˆ− 1 ,  − 2 ) − E (ˆ+ 1 ,  + 2 ) + E (ˆ+ 1 ,  − 1 ) ı ˆ ı 2 ˆ 2 ı 2 ˆ 1 ı 2 ˆ 1 ı 2 ˆ 2where D Σ , E over integers; D, E over odd half-integers Alexander Tiskin (Warwick) Approximate matching in GC-strings 6 / 52
  9. 9. IntroductionTerminology and notationGiven matrix D, its distribution matrix is D Σ (i, j) = ˆ>i,ˆ<j ı  D(ˆ, ) ı ˆIn other words, D Σ (i, j) = D(ˆ, ), where (ˆ, ) is ı ˆ ı ˆ -dominated by (i, j)Given matrix E , its density matrix isE (ˆ, ) = E (ˆ− 1 ,  + 1 ) − E (ˆ− 1 ,  − 2 ) − E (ˆ+ 1 ,  + 2 ) + E (ˆ+ 1 ,  − 1 ) ı ˆ ı 2 ˆ 2 ı 2 ˆ 1 ı 2 ˆ 1 ı 2 ˆ 2where D Σ , E over integers; D, E over odd half-integers     Σ 0 1 2 3 0 1 2 3   0 1 0 0 1 1 2 0 1 1 2 0 1 01 0 0 =  0 0 0 1 = 1 0 0      0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 Alexander Tiskin (Warwick) Approximate matching in GC-strings 6 / 52
  10. 10. IntroductionTerminology and notationGiven matrix D, its distribution matrix is D Σ (i, j) = ˆ>i,ˆ<j ı  D(ˆ, ) ı ˆIn other words, D Σ (i, j) = D(ˆ, ), where (ˆ, ) is ı ˆ ı ˆ -dominated by (i, j)Given matrix E , its density matrix isE (ˆ, ) = E (ˆ− 1 ,  + 1 ) − E (ˆ− 1 ,  − 2 ) − E (ˆ+ 1 ,  + 2 ) + E (ˆ+ 1 ,  − 1 ) ı ˆ ı 2 ˆ 2 ı 2 ˆ 1 ı 2 ˆ 1 ı 2 ˆ 2where D Σ , E over integers; D, E over odd half-integers     Σ 0 1 2 3 0 1 2 3   0 1 0 0 1 1 2 0 1 1 2 0 1 01 0 0 =  0 0 0 1 = 1 0 0      0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0(D Σ ) = D for all DMatrix E is simple, if (E )Σ = E Alexander Tiskin (Warwick) Approximate matching in GC-strings 6 / 52
  11. 11. IntroductionTerminology and notationMatrix E is Monge, if E is nonnegativeIntuition: border-to-border distances in a (weighted) planar graphMatrix E is unit-Monge, if E is a permutation matrixIntuition: border-to-border distances in a grid-like graph Alexander Tiskin (Warwick) Approximate matching in GC-strings 7 / 52
  12. 12. IntroductionTerminology and notationMatrix E is Monge, if E is nonnegativeIntuition: border-to-border distances in a (weighted) planar graphMatrix E is unit-Monge, if E is a permutation matrixIntuition: border-to-border distances in a grid-like graphSimple unit-Monge matrix: P Σ , where P is a permutation matrixSeaweed matrix: P Σ , represented implicitly by P   Σ 0 1 2 3 0 1 01 0 0 = 0 1 1 2   0 0 0 1 0 0 1 0 0 0 0 Alexander Tiskin (Warwick) Approximate matching in GC-strings 7 / 52
  13. 13. IntroductionImplicit unit-Monge matricesEfficient P Σ queries: range tree on nonzeros of P [Bentley: 1980] binary search tree by i-coordinate under every node, binary search tree by j-coordinate • • • • −→ • −→ • • • • • • • ↓ • • • • −→ • −→ • • • • • • • ↓ • • • • −→ • −→ • • • • • • • Alexander Tiskin (Warwick) Approximate matching in GC-strings 8 / 52
  14. 14. IntroductionImplicit unit-Monge matricesEfficient P Σ queries: (contd.)Every node of the range tree represents a canonical range (rectangularregion), and stores its nonzero countOverall, ≤ n log n canonical ranges are non-emptyA P Σ query means dominance counting: how many nonzeros aredominated by query point? Answered by decomposing query range into≤ log2 n disjoint canonical ranges.Total size O(n log n), query time O(log2 n) Alexander Tiskin (Warwick) Approximate matching in GC-strings 9 / 52
  15. 15. IntroductionImplicit unit-Monge matricesEfficient P Σ queries: (contd.)Every node of the range tree represents a canonical range (rectangularregion), and stores its nonzero countOverall, ≤ n log n canonical ranges are non-emptyA P Σ query means dominance counting: how many nonzeros aredominated by query point? Answered by decomposing query range into≤ log2 n disjoint canonical ranges.Total size O(n log n), query time O(log2 n)There are asymptotically more efficient (but less practical) data structures log nTotal size O(n), query time O log log n [J´J´+: 2004] a a [Chan, Pˇtra¸cu: 2010] a s Alexander Tiskin (Warwick) Approximate matching in GC-strings 9 / 52
  16. 16. 1 Introduction2 Semi-local string comparison3 Matrix distance multiplication4 Compressed string comparison5 Conclusions and future work Alexander Tiskin (Warwick) Approximate matching in GC-strings 10 / 52
  17. 17. Semi-local string comparisonSemi-local LCS and edit distanceConsider strings (= sequences) over an alphabet of size σDistinguish contiguous substrings and not necessarily contiguoussubsequencesSpecial cases of substring: prefix, suffixNotation: strings a, b of length m, n respectivelyAssume where necessary: m ≤ n; m, n reasonably close Alexander Tiskin (Warwick) Approximate matching in GC-strings 11 / 52
  18. 18. Semi-local string comparisonSemi-local LCS and edit distanceConsider strings (= sequences) over an alphabet of size σDistinguish contiguous substrings and not necessarily contiguoussubsequencesSpecial cases of substring: prefix, suffixNotation: strings a, b of length m, n respectivelyAssume where necessary: m ≤ n; m, n reasonably closeThe longest common subsequence (LCS) score: length of longest string that is a subsequence of both a and b equivalently, alignment score, where score(match) = 1 and score(mismatch) = 0In biological terms, “loss-free alignment” (unlike “lossy” BLAST) Alexander Tiskin (Warwick) Approximate matching in GC-strings 11 / 52
  19. 19. Semi-local string comparisonSemi-local LCS and edit distanceThe LCS problemGive the LCS score for a vs b Alexander Tiskin (Warwick) Approximate matching in GC-strings 12 / 52
  20. 20. Semi-local string comparisonSemi-local LCS and edit distanceThe LCS problemGive the LCS score for a vs bLCS: running timeO(mn) [Wagner, Fischer: 1974] mnO log2 n σ = O(1) [Masek, Paterson: 1980] [Crochemore+: 2003] mn(log log n)2O log2 n [Paterson, Danˇ´cık: 1994] [Bille, Farach-Colton: 2008]Running time varies depending on the RAM modelWe assume word-RAM with word size log n Alexander Tiskin (Warwick) Approximate matching in GC-strings 12 / 52
  21. 21. Semi-local string comparisonSemi-local LCS and edit distanceLCS on the alignment graph (directed, acyclic) b a a b c a b c a b a c a blue = 0b red = 1aabcbcaLCS(“baabcbca”, “baabcabcabaca”) = “baabcbca”LCS = highest-score corner-to-corner path Alexander Tiskin (Warwick) Approximate matching in GC-strings 13 / 52
  22. 22. Semi-local string comparisonSemi-local LCS and edit distanceLCS: dynamic programming [WF: 1974]Sweep alignment graph by cellsCell update: time O(1)Overall time O(mn) Alexander Tiskin (Warwick) Approximate matching in GC-strings 14 / 52
  23. 23. Semi-local string comparisonSemi-local LCS and edit distanceLCS: micro-block dynamic programming [MP: 1980; BF: 2008]Sweep alignment graph by micro-blocksMicro-block size: t = O(log n) when σ = O(1) log n t=O log log n otherwiseMicro-block interface: O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits O(t) small integers, each O(1) bitsMicro-block update: time O(1), via table of all possible interfaces mn mn(log log n)2Overall time O log2 n when σ = O(1), O log2 n otherwise Alexander Tiskin (Warwick) Approximate matching in GC-strings 15 / 52
  24. 24. Semi-local string comparisonSemi-local LCS and edit distanceThe semi-local LCS problemGive the (implicit) matrix of O(m2 + n2 ) LCS scores: string-substring LCS: string a vs every substring of b prefix-suffix LCS: every prefix of a vs every suffix of b symmetrically, substring-string and suffix-prefix LCS Alexander Tiskin (Warwick) Approximate matching in GC-strings 16 / 52
  25. 25. Semi-local string comparisonSemi-local LCS and edit distanceThe semi-local LCS problemGive the (implicit) matrix of O(m2 + n2 ) LCS scores: string-substring LCS: string a vs every substring of b prefix-suffix LCS: every prefix of a vs every suffix of b symmetrically, substring-string and suffix-prefix LCSThe three-way semi-local LCS problemGive the (implicit) matrix of O(n2 ) LCS scores: string-substring, prefix-suffix, suffix-prefix LCS no substring-string LCSSuitable for m n Alexander Tiskin (Warwick) Approximate matching in GC-strings 16 / 52
  26. 26. Semi-local string comparisonSemi-local LCS and edit distanceThe semi-local LCS problemGive the (implicit) matrix of O(m2 + n2 ) LCS scores: string-substring LCS: string a vs every substring of b prefix-suffix LCS: every prefix of a vs every suffix of b symmetrically, substring-string and suffix-prefix LCSThe three-way semi-local LCS problemGive the (implicit) matrix of O(n2 ) LCS scores: string-substring, prefix-suffix, suffix-prefix LCS no substring-string LCSSuitable for m nCf.: dynamic programming gives prefix-prefix LCS Alexander Tiskin (Warwick) Approximate matching in GC-strings 16 / 52
  27. 27. Semi-local string comparisonSemi-local LCS and edit distanceSemi-local LCS on the alignment graph b a a b c a b c a b a c a blue = 0b red = 1aabcbcascore(“baabcbca”, “cabcaba”) = 5 (“abcba”)Semi-local LCS = all highest-score border-to-border paths(string-substring = top-to-bottom, etc.) Alexander Tiskin (Warwick) Approximate matching in GC-strings 17 / 52
  28. 28. Semi-local string comparisonScore matrices and seaweed matricesThe score matrix H 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “baabcbca” −1 0 1 2 3 4 5 5 6 7 7 7 7 7 −2 −1 0 1 2 3 4 4 5 6 6 6 6 7 b = “baabcabcabaca” −3 −2 −1 0 1 2 3 3 4 5 5 6 6 7 b = b 4 : 11 = “cabcaba” −4 −3 −2 −1 0 1 2 2 3 4 4 5 5 6 H(4, 11) = LCS(a, b ) = 5 −5 −4 −3 −2 −1 0 1 2 3 4 4 5 5 6 −6 −5 −4 −3 −2 −1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j −7 −6 −5 −4 −3 −2 −1 0 1 2 2 3 3 4 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3 3 4 −9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3 4−10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3−11 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2−12 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 −11 1−13 −11 −12 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 Alexander Tiskin (Warwick) Approximate matching in GC-strings 18 / 52
  29. 29. Semi-local string comparisonScore matrices and seaweed matricesSemi-local LCS: output representation and running timesize query timeO(n2 ) O(1) trivialO(m1/2 n) O(log n) string-substring [Alves+: 2003]O(n) O(n) string-substring [Alves+: 2005]O(n log n) O(log2 n) [T: 2006] . . . or any 2D orthogonal range counting data structurerunning timeO(mn2 ) naiveO(mn) string-substring [Schmidt: 1998; Alves+: 2005]O(mn) [T: 2006] mnO log0.5 n [T: 2006] mn(log log n)2O log2 n [T: 2007] Alexander Tiskin (Warwick) Approximate matching in GC-strings 19 / 52
  30. 30. Semi-local string comparisonScore matrices and seaweed matricesThe score matrix H and the seaweed matrix PH(i, j): the number of matched characters for a vs substring b i : jj − i − H(i, j): the number of unmatched charactersProperties of matrix j − i − H(i, j): simple unit-Monge therefore, = P Σ , where P = −H is a permutation matrixP is the seaweed matrix, giving an implicit representation of HRange tree for P: memory O(n log n), query time O(log2 n) Alexander Tiskin (Warwick) Approximate matching in GC-strings 20 / 52
  31. 31. Semi-local string comparisonScore matrices and seaweed matricesThe score matrix H and the seaweed matrix P 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “baabcbca” −1 0 1 2 3 4 5 5 6 7 7 7 7 7 • b = “baabcabcabaca” −2 −1 0 1 2 3 4 4 5 6 6 6 6 7 • −3 −2 −1 0 1 2 3 3 4 5 5 6 6 7 b = b 4 : 11 = “cabcaba” −4 −3 −2 −1 0 1 2 2 3 4 4 5 5 6 • H(4, 11) = LCS(a, b ) = 5 −5 −4 −3 −2 −1 0 1 2 3 4 4 5 5 6 −6 −5 −4 −3 −2 −1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j −7 −6 −5 −4 −3 −2 −1 0 1 2 2 3 3 4 • −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3 3 4 • −9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3 4−10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3−11 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2−12 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 −11 1−13 −11 −12 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 Alexander Tiskin (Warwick) Approximate matching in GC-strings 21 / 52
  32. 32. Semi-local string comparisonScore matrices and seaweed matricesThe score matrix H and the seaweed matrix P 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “baabcbca” −1 0 1 2 3 4 5 5 6 7 7 7 7 7 • b = “baabcabcabaca” −2 −1 0 1 2 3 4 4 5 6 6 6 6 7 • −3 −2 −1 0 1 2 3 3 4 5 5 6 6 7 b = b 4 : 11 = “cabcaba” −4 −3 −2 −1 0 1 2 2 3 4 4 5 5 6 • H(4, 11) = LCS(a, b ) = 5 −5 −4 −3 −2 −1 0 1 2 3 4 4 5 5 6 −6 −5 −4 −3 −2 −1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j −7 −6 −5 −4 −3 −2 −1 0 1 2 2 3 3 4 • blue: difference in H is 0 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3 3 4 • red: difference in H is 1 −9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3 4−10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3−11 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2−12 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 −11 1−13 −11 −12 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 Alexander Tiskin (Warwick) Approximate matching in GC-strings 21 / 52
  33. 33. Semi-local string comparisonScore matrices and seaweed matricesThe score matrix H and the seaweed matrix P 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “baabcbca” −1 0 1 2 3 4 5 5 6 7 7 7 7 7 • b = “baabcabcabaca” −2 −1 0 1 2 3 4 4 5 6 6 6 6 7 • −3 −2 −1 0 1 2 3 3 4 5 5 6 6 7 b = b 4 : 11 = “cabcaba” −4 −3 −2 −1 0 1 2 2 3 4 4 5 5 6 • H(4, 11) = LCS(a, b ) = 5 −5 −4 −3 −2 −1 0 1 2 3 4 4 5 5 6 −6 −5 −4 −3 −2 −1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j −7 −6 −5 −4 −3 −2 −1 0 1 2 2 3 3 4 • blue: difference in H is 0 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3 3 4 • red: difference in H is 1 −9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3 4−10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2 3 green: P(i, j) = 1−11 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 1 2−12 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 −11 1 H(i, j) = j − i − P Σ (i, j)−13 −11 −12 −10−9 −8 −7 −6 −5 −4 −3 −2 −1 0 Alexander Tiskin (Warwick) Approximate matching in GC-strings 21 / 52
  34. 34. Semi-local string comparisonScore matrices and seaweed matricesThe score matrix H and the seaweed matrix P a = “baabcbca” • b = “baabcabcabaca” • b = b 4 : 11 = “cabcaba” • H(4, 11) = LCS(a, b ) = 11 − 4 − P Σ (i, j) = 11 − 4 − 2 = 5 • • Alexander Tiskin (Warwick) Approximate matching in GC-strings 22 / 52
  35. 35. Semi-local string comparisonScore matrices and seaweed matricesThe seaweeds in the alignment graph b a a b c a b c a b a c a a = “baabcbca” •ba b = “baabcabcabaca”a b = b 4 : 11 = “cabcaba”b H(4, 11) = LCS(a, b ) =c 11 − 4 − P Σ (i, j) =b 11 − 4 − 2 = 5ca •P(i, j) = 1 corresponds to seaweed (top, i) (bottom, j) Alexander Tiskin (Warwick) Approximate matching in GC-strings 23 / 52
  36. 36. Semi-local string comparisonScore matrices and seaweed matricesThe seaweeds in the alignment graph b a a b c a b c a b a c a a = “baabcbca” •ba b = “baabcabcabaca”a b = b 4 : 11 = “cabcaba”b H(4, 11) = LCS(a, b ) =c 11 − 4 − P Σ (i, j) =b 11 − 4 − 2 = 5ca •P(i, j) = 1 corresponds to seaweed (top, i) (bottom, j)Also define top right, left right, left bottom seaweedsGives bijection between top-left and bottom-right borders Alexander Tiskin (Warwick) Approximate matching in GC-strings 23 / 52
  37. 37. 1 Introduction2 Semi-local string comparison3 Matrix distance multiplication4 Compressed string comparison5 Conclusions and future work Alexander Tiskin (Warwick) Approximate matching in GC-strings 24 / 52
  38. 38. Matrix distance multiplicationSeaweed braidsDistance algebra (a.k.a (min, +) or tropical algebra): ⊕ is min, is +Matrix -multiplicationA B=C C (i, k) = j A(i, j) B(j, k) = minj A(i, j) + B(j, k) Alexander Tiskin (Warwick) Approximate matching in GC-strings 25 / 52
  39. 39. Matrix distance multiplicationSeaweed braidsDistance algebra (a.k.a (min, +) or tropical algebra): ⊕ is min, is +Matrix -multiplicationA B=C C (i, k) = j A(i, j) B(j, k) = minj A(i, j) + B(j, k)Matrix classes closed under -multiplication (for given n): general numerical (integer, real) matrices Monge matrices simple unit-Monge matrices ΣPA Σ Σ PB = PC written as PA PB = PC Alexander Tiskin (Warwick) Approximate matching in GC-strings 25 / 52
  40. 40. Matrix distance multiplicationSeaweed braidsThe seaweed monoid Tn : simple unit-Monge matrices under -multiplication permutation matrices under -multiplicationIdentity: 1 x =x Zero: 0 x =0   · · · •   • · · · · • · · · · • ·1= 0= ·  • · ·  · · • · · · · • • · · · Alexander Tiskin (Warwick) Approximate matching in GC-strings 26 / 52
  41. 41. Matrix distance multiplicationSeaweed braidsPA PB = PC can be seen as -multiplication of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PC Alexander Tiskin (Warwick) Approximate matching in GC-strings 27 / 52
  42. 42. Matrix distance multiplicationSeaweed braidsPA PB = PC can be seen as -multiplication of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PCPAPB Alexander Tiskin (Warwick) Approximate matching in GC-strings 27 / 52
  43. 43. Matrix distance multiplicationSeaweed braidsPA PB = PC can be seen as -multiplication of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PCPAPB Alexander Tiskin (Warwick) Approximate matching in GC-strings 27 / 52
  44. 44. Matrix distance multiplicationSeaweed braidsPA PB = PC can be seen as -multiplication of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PCPA PCPB Alexander Tiskin (Warwick) Approximate matching in GC-strings 27 / 52
  45. 45. Matrix distance multiplicationSeaweed braidsSeaweed braids: similar to standard braids, generated by crossingsUnlike in standard braids, all seaweed crossings are transversal, i.e. on one level (not underpass/overpass) idempotent, i.e. two seaweeds can cross at most onceSeaweed braid -multiplication: associative, no inverse (a crossing cannotbe undone)Identity: 1 x = x Zero: 0 x = 01 0 Alexander Tiskin (Warwick) Approximate matching in GC-strings 28 / 52
  46. 46. Matrix distance multiplicationSeaweed braidsThe seaweed monoid Tn : n! elements (permutations of size n) n − 1 generators g1 , g2 , . . . , gn−1 (elementary crossings)idempotence:gi2 = gi for all i =far commutativity:gi gj = gj gi j − i > 1 ··· = ···braid relations:gi gj gi = gj gi gj j −i =1 = Alexander Tiskin (Warwick) Approximate matching in GC-strings 29 / 52
  47. 47. Matrix distance multiplicationSeaweed braidsThe seaweed monoid TnAlso known as the 0-Hecke monoid of the symmetric group H0 (Sn )Generalisations: general 0-Hecke monoids [Fomin, Greene: 1998; Buch+: 2008] Coxeter monoids [Tsaranov: 1990; Richardson, Springer: 1990] Alexander Tiskin (Warwick) Approximate matching in GC-strings 30 / 52
  48. 48. Matrix distance multiplicationSeaweed braidsComputation in the seaweed monoid: a confluent rewriting system can beobtained by software (Semigroupe, GAP) Alexander Tiskin (Warwick) Approximate matching in GC-strings 31 / 52
  49. 49. Matrix distance multiplicationSeaweed braidsComputation in the seaweed monoid: a confluent rewriting system can beobtained by software (Semigroupe, GAP)T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0aa → a bb → b bab → 0 aba → 0 Alexander Tiskin (Warwick) Approximate matching in GC-strings 31 / 52
  50. 50. Matrix distance multiplicationSeaweed braidsComputation in the seaweed monoid: a confluent rewriting system can beobtained by software (Semigroupe, GAP)T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0aa → a bb → b bab → 0 aba → 0T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac,bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0aa → a ca → ac bab → aba cbac → bcbabb → b cc → c cbc → bcb abacba → 0 Alexander Tiskin (Warwick) Approximate matching in GC-strings 31 / 52
  51. 51. Matrix distance multiplicationSeaweed braidsComputation in the seaweed monoid: a confluent rewriting system can beobtained by software (Semigroupe, GAP)T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0aa → a bb → b bab → 0 aba → 0T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac,bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0aa → a ca → ac bab → aba cbac → bcbabb → b cc → c cbc → bcb abacba → 0Easy to use, but not an efficient algorithm Alexander Tiskin (Warwick) Approximate matching in GC-strings 31 / 52
  52. 52. Matrix distance multiplicationImplicit unit-Monge -multiplicationThe implicit unit-Monge matrix -multiplication problemGiven permutation matrices PA , PB , compute PC , such that Σ Σ ΣPA PB = PC (equivalently, PA PB = PC ) Alexander Tiskin (Warwick) Approximate matching in GC-strings 32 / 52
  53. 53. Matrix distance multiplicationImplicit unit-Monge -multiplicationThe implicit unit-Monge matrix -multiplication problemGiven permutation matrices PA , PB , compute PC , such that Σ Σ ΣPA PB = PC (equivalently, PA PB = PC )Matrix -multiplication: running timetype timegeneral O(n3 ) standard 3 3 O n (log log n) log2 n [Chan: 2007]Monge O(n2 ) via [Aggarwal+: 1987]implicit unit-Monge O(n1.5 ) [T: 2006] O(n log n) [T: 2010] Alexander Tiskin (Warwick) Approximate matching in GC-strings 32 / 52
  54. 54. Matrix distance multiplicationImplicit unit-Monge -multiplication PB •• • •• •• • • • • • • • •• • • • • •• • • • • • • • •• •• • •• ? • • • • PA PC Alexander Tiskin (Warwick) Approximate matching in GC-strings 33 / 52
  55. 55. Matrix distance multiplicationImplicit unit-Monge -multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • • • • • • • •• •• •• • • • • • PA,lo , PA,hi Alexander Tiskin (Warwick) Approximate matching in GC-strings 34 / 52
  56. 56. Matrix distance multiplicationImplicit unit-Monge -multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • •• •• •• • • • • • • • • • • PA,lo , PA,hi Alexander Tiskin (Warwick) Approximate matching in GC-strings 34 / 52
  57. 57. Matrix distance multiplicationImplicit unit-Monge -multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • • • • • • • • • • • • •• • •• •• • • • • • • • • • PA,lo , PA,hi Alexander Tiskin (Warwick) Approximate matching in GC-strings 34 / 52
  58. 58. Matrix distance multiplicationImplicit unit-Monge -multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • • • • •• • • • •• •• • • • • • • • • • • •• • • PA,lo , PA,hi PC ,lo + PC ,hi Alexander Tiskin (Warwick) Approximate matching in GC-strings 34 / 52
  59. 59. Matrix distance multiplicationImplicit unit-Monge -multiplication vs Alexander Tiskin (Warwick) Approximate matching in GC-strings 35 / 52
  60. 60. Matrix distance multiplicationImplicit unit-Monge -multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • • • • •• • • • •• •• • • • • • • • • • • •• • • PA,lo , PA,hi PC ,lo + PC ,hi Alexander Tiskin (Warwick) Approximate matching in GC-strings 36 / 52
  61. 61. Matrix distance multiplicationImplicit unit-Monge -multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • •• • • • •• • •• •• • • • • • • • • • • •• • • PA,lo , PA,hi PC Alexander Tiskin (Warwick) Approximate matching in GC-strings 36 / 52
  62. 62. Matrix distance multiplicationImplicit unit-Monge -multiplicationImplicit unit-Monge matrix -multiplication: the algorithm Σ Σ ΣPC (i, k) = minj PA (i, j) + PB (j, k)Divide-and-conquer on the range of jDivide PA horizontally, PB vertically; two subproblems of effective size n/2: ΣPA,lo Σ Σ PB,lo = PC ,lo Σ PA,hi Σ Σ PB,hi = PC ,hiConquer: most (but not all!) nonzeros of PC ,lo , PC ,hi appear in PCMissing nonzeros can be obtained in time O(n) using the Monge propertyOverall time O(n log n) Alexander Tiskin (Warwick) Approximate matching in GC-strings 37 / 52
  63. 63. 1 Introduction2 Semi-local string comparison3 Matrix distance multiplication4 Compressed string comparison5 Conclusions and future work Alexander Tiskin (Warwick) Approximate matching in GC-strings 38 / 52
  64. 64. Compressed string comparisonGrammar compressionNotation: text t of length n; pattern p of length mA GC-string (grammar-compressed string) t is a straight-line program(context-free grammar) generating t = tn by n assignments of the form ¯ ¯ tk = α, where α is an alphabet character tk = ti tj , where i, j < kIn general, n = O(2n ) ¯Example: Fibonacci string “abaababaabaab”t1 = ‘b’ t2 = ‘a’t3 = t2 t1 t4 = t3 t2 t5 = t4 t3 t6 = t5 t4 t7 = t6 t5 Alexander Tiskin (Warwick) Approximate matching in GC-strings 39 / 52
  65. 65. Compressed string comparisonGrammar compressionGrammar-compression covers various compression types, e.g. LZ78, LZW(not LZ77 directly)Simplifying assumption: arithmetic up to n runs in O(1)This assumption can be removed by careful index remapping Alexander Tiskin (Warwick) Approximate matching in GC-strings 40 / 52
  66. 66. Compressed string comparisonThree-way semi-local LCS on GC-stringsLCS: running timet pplain plain O(mn) [Wagner, Fischer: 1974] mn O log2 m [Masek, Paterson: 1980] [Crochemore+: 2003]GC plain O(m3 n + . . .) ¯ general CFG [Myers: 1995] O(m1.5 n)¯ 3-way semi [T: 2008] O(m log m · n) ¯ 3-way semi [T: NEW]GC GC NP-hard [Lifshits: 2005] O(r 1.2¯1.4 ) r [Hermelin+: 2009] O(r log r · ¯) r [T: NEW]r =m+n ¯=m+n r ¯ ¯ Alexander Tiskin (Warwick) Approximate matching in GC-strings 41 / 52
  67. 67. Compressed string comparisonThree-way semi-local LCS on GC-stringsThree-way semi-local LCS (GC text, plain pattern): the algorithmFor every k, compute by recursion the three-way seaweed matrix for p vstk , using seaweed matrix -multiplication: time O(m log m · n) ¯Overall time O(m log m · n) ¯ Alexander Tiskin (Warwick) Approximate matching in GC-strings 42 / 52
  68. 68. Compressed string comparisonSubsequence recognition on GC-stringsThe global subsequence recognition problemDoes text t contain pattern p as a subsequence?Global subsequence recognition: running timet pplain plain O(n) greedyGC plain O(m¯) n greedyGC GC NP-hard [Lifshits: 2005] Alexander Tiskin (Warwick) Approximate matching in GC-strings 43 / 52
  69. 69. Compressed string comparisonSubsequence recognition on GC-stringsThe local subsequence recognition problemFind all minimally matching substrings of t with respect to pSubstring of t is matching, if p is a subsequence of tMatching substring of t is minimally matching, if none of its propersubstrings are matching Alexander Tiskin (Warwick) Approximate matching in GC-strings 44 / 52
  70. 70. Compressed string comparisonSubsequence recognition on GC-stringsLocal subsequence recognition: running time ( + output)t pplain plain O(mn) [Mannila+: 1995] mn O log m [Das+: 1997] O(c m + n) [Boasson+: 2001] O(m + nσ) [Troniˇek: 2001] cGC plain O(m2 log m¯) n [C´gielski+: 2006] e 1.5 n) O(m ¯ [T: 2008] O(m log m · n) ¯ [T: NEW]GC GC NP-hard [Lifshits: 2005] Alexander Tiskin (Warwick) Approximate matching in GC-strings 45 / 52
  71. 71. Compressed string comparisonSubsequence recognition on GC-strings 0 ˆ1 ı ˆ3 ı ˆ5 ı n n 2 2 2 1 ˆ 3 ˆ 5 ˆ 2 2 2b i : j matching iff box [i : j] not pierced left-to-right -maximal seaweeds: -chain ˆ1 ,  1 ı ˆ ˆ3 ,  3 ı ˆ ··· ˆs− 1 , s− 1 ı ˆ 2 2 2 2 2 2b i : j minimally matching iff (i, j) is in the interleaved -chain ˆ3 ,  1 ı ˆ ˆ5 ,  3 ı ˆ ··· ˆs− 1 , s− 3 ı ˆ 2 2 2 2 2 2 Alexander Tiskin (Warwick) Approximate matching in GC-strings 46 / 52
  72. 72. Compressed string comparisonSubsequence recognition on GC-stringsLocal subsequence recognition (GC text, plain pattern): the algorithmFor every k, compute by recursion the three-way seaweed matrix for p vstk , using seaweed matrix -multiplication: time O(m log m · n) ¯ Alexander Tiskin (Warwick) Approximate matching in GC-strings 47 / 52
  73. 73. Compressed string comparisonSubsequence recognition on GC-stringsLocal subsequence recognition (GC text, plain pattern): the algorithmFor every k, compute by recursion the three-way seaweed matrix for p vstk , using seaweed matrix -multiplication: time O(m log m · n) ¯Given an assignment t = t t , count by recursion minimally matching substrings in t minimally matching substrings in t Alexander Tiskin (Warwick) Approximate matching in GC-strings 47 / 52
  74. 74. Compressed string comparisonSubsequence recognition on GC-stringsLocal subsequence recognition (GC text, plain pattern): the algorithmFor every k, compute by recursion the three-way seaweed matrix for p vstk , using seaweed matrix -multiplication: time O(m log m · n) ¯Given an assignment t = t t , count by recursion minimally matching substrings in t minimally matching substrings in tThen, find -chain of -maximal seaweeds in time n · O(m) = O(m¯) ¯ nThe interleaved -chain defines minimally matching substrings in toverlapping both t and tOverall time O(m log m · n) + O(m¯) = O(m log m · n) ¯ n ¯ Alexander Tiskin (Warwick) Approximate matching in GC-strings 47 / 52
  75. 75. Compressed string comparisonSubsequence recognition on GC-stringsThe threshold approximate matching problemFind all matching substrings of t with respect to p, according to athreshold kSubstring of t is matching, if the edit distance for p vs t is at most k Alexander Tiskin (Warwick) Approximate matching in GC-strings 48 / 52
  76. 76. Compressed string comparisonSubsequence recognition on GC-stringsThreshold approximate matching: running time ( + output)t pplain plain O(mn) [Sellers: 1980] O(mk) [Landau, Vishkin: 1989] 4 O(m + n + nk ) m [Cole, Hariharan: 2002]GC plain O(m¯k 2 ) n [K¨rkk¨inen+: 2003] a a O(m¯k + n log n) n ¯ [LV: 1989] via [Bille+: 2010] O(m¯ + nk 4 + n log n) n ¯ ¯ [CH: 2002] via [Bille+: 2010] O(m log m · n) ¯ [T: NEW]GC GC NP-hard [Lifshits: 2005](Also many specialised variants for LZ compression) Alexander Tiskin (Warwick) Approximate matching in GC-strings 49 / 52
  77. 77. Compressed string comparisonSubsequence recognition on GC-stringsThreshold approximate matching (GC text, plain pattern): thealgorithmAlgorithm structure similar to local subsequence recognition by seaweedmatrix -multiplication and seaweed -chainsExtra ingredients: the blow-up technique: reduction of edit distances to LCS scores the “implicit SMAWK” technique: row minima in an implicit Monge matrix by an extension of the classical “SMAWK” algorithm; replaces -chain interleavingOverall time O(m log m · n) + O(m¯) = O(m log m · n) ¯ n ¯ Alexander Tiskin (Warwick) Approximate matching in GC-strings 50 / 52
  78. 78. 1 Introduction2 Semi-local string comparison3 Matrix distance multiplication4 Compressed string comparison5 Conclusions and future work Alexander Tiskin (Warwick) Approximate matching in GC-strings 51 / 52
  79. 79. Conclusions and future workA powerful alternative to dynamic programmingImplicit unit-Monge matrices: the seaweed monoid distance multiplication in time O(n log n) next: lower bound? Alexander Tiskin (Warwick) Approximate matching in GC-strings 52 / 52
  80. 80. Conclusions and future workA powerful alternative to dynamic programmingImplicit unit-Monge matrices: the seaweed monoid distance multiplication in time O(n log n) next: lower bound?Semi-local LCS problem: representation by implicit unit-Monge matrices generalisation to rational alignment scores next: real alignment scores? Alexander Tiskin (Warwick) Approximate matching in GC-strings 52 / 52
  81. 81. Conclusions and future workA powerful alternative to dynamic programmingImplicit unit-Monge matrices: the seaweed monoid distance multiplication in time O(n log n) next: lower bound?Semi-local LCS problem: representation by implicit unit-Monge matrices generalisation to rational alignment scores next: real alignment scores?Approximate matching in GC-text in time O(m log m · n) ¯Other applications: maximum clique in a circle graph in time O(n log2 n) parallel LCS in time O mn , comm O m+n per processor p p 1/2 identification of evolutionary-conserved regions in DNA Alexander Tiskin (Warwick) Approximate matching in GC-strings 52 / 52

×