# 20121020 semi local-string_comparison_tiskin

### 20121020 semi local-string_comparison_tiskin

1. Semi-local string comparison: Algorithmic techniques and applications. Alexander Tiskin, Department of Computer Science, University of Warwick. http://go.warwick.ac.uk/alextiskin
    Alexander Tiskin (Warwick), Semi-local string comparison, 1 / 132
2. Outline: 1 Introduction; 2 Matrix distance multiplication; 3 Semi-local string comparison; 4 The seaweed method; 5 Periodic string comparison; 6 The transposition network method; 7 Sparse string comparison; 8 Compressed string comparison; 9 Beyond semi-locality; 10 Conclusions and future work.
5. Introduction.
    - String matching: finding an exact pattern in a string.
    - String comparison: finding similar patterns in two strings.
    - Applications: computational biology, image recognition, ...
    - Standard types of string comparison: global (whole string vs whole string); local (substrings vs substrings).
    - Main focus of this work: semi-local (whole string vs substrings; prefixes vs suffixes).
    - Closely related to approximate string matching (no relation to approximation algorithms!).
    - Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices).
6. Introduction: terminology and notation.
    - $x^- = x - \tfrac12$, $x^+ = x + \tfrac12$.
    - Integers: $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$.
    - Half-integers: $\{\ldots, -\tfrac32, -\tfrac12, \tfrac12, \tfrac32, \tfrac52, \ldots\} = \{\ldots, (-2)^+, (-1)^+, 0^+, 1^+, 2^+, \ldots\}$.
    - Dominance: $(i, j) \ll (i', j')$ iff $i < i'$ and $j < j'$; $(i, j) \gtrless (i', j')$ iff $i > i'$ and $j < j'$.
    - A permutation matrix is a 0/1 matrix with exactly one nonzero per row and per column, e.g. $\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$.
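10. Introduction: terminology and notation.
    - Given matrix $D$, its distribution matrix $D^\Sigma$ is made up of $\gtrless$-dominance sums: $D^\Sigma(i, j) = \sum_{\hat\imath > i,\ \hat\jmath < j} D(\hat\imath, \hat\jmath)$.
    - Given matrix $E$, its density matrix $E^\square$ is made up of quadrangle differences: $E^\square(\hat\imath, \hat\jmath) = E(\hat\imath^-, \hat\jmath^+) - E(\hat\imath^-, \hat\jmath^-) - E(\hat\imath^+, \hat\jmath^+) + E(\hat\imath^+, \hat\jmath^-)$, where $D^\Sigma$, $E$ are over integers and $D$, $E^\square$ are over half-integers.
    - Example:
      $$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^{\Sigma} = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}^{\square} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
    - $(D^\Sigma)^\square = D$ for all $D$.
    - Matrix $E$ is simple if $(E^\square)^\Sigma = E$; equivalently, if it has all zeros in the left column and bottom row.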
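The two operators on this slide can be sketched directly in Python. This is a minimal illustration, not code from the slides: the function names `distribution` and `density` are hypothetical, and the half-integer index $r + \tfrac12$ is assumed to be stored at list position `r`.

```python
def distribution(D):
    """Distribution matrix: entry (i, j) sums D over rows > i and columns < j.

    D[r][c] is assumed to hold the entry at half-integer position (r + 1/2, c + 1/2).
    """
    rows, cols = len(D), len(D[0])
    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    # Bottom row and left column stay zero; fill upwards and rightwards.
    for i in range(rows - 1, -1, -1):
        for j in range(1, cols + 1):
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + D[i][j - 1]
    return S


def density(E):
    """Density matrix: quadrangle differences of neighbouring entries of E."""
    rows, cols = len(E) - 1, len(E[0]) - 1
    return [
        [E[r][c + 1] - E[r][c] - E[r + 1][c + 1] + E[r + 1][c] for c in range(cols)]
        for r in range(rows)
    ]


P = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(distribution(P))           # the 4 x 4 matrix from the slide's example
print(density(distribution(P)))  # recovers P
```

Applying `density` after `distribution` returns the original matrix, which is the identity $(D^\Sigma)^\square = D$ stated on the slide.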
12. Introduction: terminology and notation.
    - Matrix $E$ is Monge if $E^\square$ is nonnegative. Intuition: boundary-to-boundary distances in a (weighted) planar graph.
    - Matrix $E$ is unit-Monge if $E^\square$ is a permutation matrix. Intuition: boundary-to-boundary distances in a grid-like graph.
    - Simple unit-Monge matrix: $P^\Sigma$, where $P$ is a permutation matrix.
    - Seaweed matrix: $P$ used as an implicit representation of $P^\Sigma$.
    - Example:
      $$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^{\Sigma} = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
13. Introduction: implicit unit-Monge matrices. Efficient $P^\Sigma$ queries: range tree on the nonzeros of $P$ [Bentley: 1980]: a binary search tree by $i$-coordinate and, under every node, a binary search tree by $j$-coordinate. [Figure: range tree built on the nonzeros of $P$.]
15. Introduction: implicit unit-Monge matrices. Efficient $P^\Sigma$ queries (contd.)
    - Every node of the range tree represents a canonical range (rectangular region), and stores its nonzero count.
    - Overall, $\le n \log n$ canonical ranges are non-empty.
    - A $P^\Sigma$ query is equivalent to $\gtrless$-dominance counting: how many nonzeros are $\gtrless$-dominated by the query point?
    - Answer: sum up nonzero counts in $\le \log^2 n$ disjoint canonical ranges.
    - Total size $O(n \log n)$, query time $O(\log^2 n)$.
    - There are asymptotically more efficient (but less practical) data structures: total size $O(n)$, query time $O\bigl(\frac{\log n}{\log\log n}\bigr)$ [JáJá+: 2004] [Chan, Pătraşcu: 2010].
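A dominance-counting query of this kind can be sketched with a merge-sort tree, a close relative of the range tree described above with the same $O(n \log n)$ size and $O(\log^2 n)$ query time. This is an illustrative substitute, not the slides' data structure: the class name `DominanceCounter` is hypothetical, and the permutation is assumed to be given as a list `perm` with the nonzero of row `r` in column `perm[r]`.

```python
from bisect import bisect_left


class DominanceCounter:
    """Merge-sort tree over rows; every node keeps the sorted columns of its rows."""

    def __init__(self, perm):
        self.n = len(perm)
        self.size = 1
        while self.size < max(self.n, 1):
            self.size *= 2
        self.tree = [[] for _ in range(2 * self.size)]
        for r, c in enumerate(perm):
            self.tree[self.size + r] = [c]
        # Internal nodes: sorted union of the children (sorting keeps the sketch short).
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = sorted(self.tree[2 * node] + self.tree[2 * node + 1])

    def query(self, i, j):
        """Number of nonzeros in rows i..n-1 with column index below j."""
        count = 0
        lo, hi = i + self.size, self.n + self.size  # half-open range of leaves
        while lo < hi:
            if lo & 1:
                count += bisect_left(self.tree[lo], j)
                lo += 1
            if hi & 1:
                hi -= 1
                count += bisect_left(self.tree[hi], j)
            lo //= 2
            hi //= 2
        return count


counter = DominanceCounter([1, 0, 2])
print(counter.query(0, 3), counter.query(1, 3), counter.query(2, 2))
```

Each query touches $O(\log n)$ tree nodes and performs one binary search per node, which is where the $\log^2 n$ factor comes from.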
17. Matrix distance multiplication: seaweed braids.
    - Distance algebra (a.k.a. (min, +) or tropical algebra): addition $\oplus$ given by min, multiplication $\odot$ given by +.
    - Matrix $\odot$-multiplication: $A \odot B = C$, where $C(i, k) = \bigoplus_j A(i, j) \odot B(j, k) = \min_j \bigl(A(i, j) + B(j, k)\bigr)$.
    - Matrix classes closed under $\odot$-multiplication (for given $n$): general numerical (integer, real) matrices; Monge matrices; simple unit-Monge matrices (!).
18. Matrix distance multiplication: seaweed braids.
    - Recall that simple unit-Monge matrices are represented implicitly by permutation (seaweed) matrices.
    - Define $P_A \boxdot P_B = P_C$ as $P_A^\Sigma \odot P_B^\Sigma = P_C^\Sigma$.
    - The seaweed monoid $T_n$: simple unit-Monge matrices under $\odot$; equivalently, permutation (seaweed) matrices under $\boxdot$.
    - Also known as the 0-Hecke monoid of the symmetric group, $H_0(S_n)$.
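The definition of the seaweed product can be executed literally: build both distribution matrices, multiply them in the (min, +) algebra, and read the result back as a permutation. The sketch below takes cubic time and is meant only to illustrate the definition, not the fast algorithm discussed later in the deck; the names `distribution` and `sticky_multiply` are hypothetical, and permutations are assumed to be lists with the nonzero of row `r` in column `perm[r]`.

```python
def distribution(perm):
    """(n+1) x (n+1) matrix counting nonzeros in rows >= i and columns < j."""
    n = len(perm)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            hit = 1 if perm[i] == j - 1 else 0
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + hit
    return S


def sticky_multiply(pa, pb):
    """Seaweed product of two permutations, by brute-force (min, +) multiplication."""
    n = len(pa)
    A, B = distribution(pa), distribution(pb)
    C = [
        [min(A[i][j] + B[j][k] for j in range(n + 1)) for k in range(n + 1)]
        for i in range(n + 1)
    ]
    pc = [None] * n
    for r in range(n):
        for c in range(n):
            # Quadrangle difference equal to 1 marks a nonzero of the product.
            if C[r][c + 1] - C[r][c] - C[r + 1][c + 1] + C[r + 1][c] == 1:
                pc[r] = c
    return pc


g1, g2 = [1, 0, 2], [0, 2, 1]  # the two elementary crossings for n = 3
print(sticky_multiply(g1, g1))                       # idempotence: g1 again
print(sticky_multiply(sticky_multiply(g1, g2), g1))  # the reversal [2, 1, 0]
```

The product of two permutation matrices is again a permutation matrix, which is the closure property marked with (!) on the previous slide.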
19. Matrix distance multiplication: seaweed braids. $P_A \boxdot P_B = P_C$ can be seen as combing of seaweed braids. [Figure sequence, slides 19 to 22: the braids of $P_A$ and $P_B$ are stacked and combed into the braid of $P_C$.]
23. Matrix distance multiplication: seaweed braids. The seaweed monoid $T_n$:
    - $n!$ elements (permutations of size $n$);
    - $n - 1$ generators $g_1, g_2, \ldots, g_{n-1}$ (elementary crossings).
    - Idempotence: $g_i^2 = g_i$ for all $i$.
    - Far commutativity: $g_i g_j = g_j g_i$ for $j - i > 1$.
    - Braid relations: $g_i g_j g_i = g_j g_i g_j$ for $j - i = 1$.
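24. Matrix distance multiplication: seaweed braids.
    - Identity: $1 \boxdot x = x$, where
      $$1 = \begin{pmatrix} \bullet & \cdot & \cdot & \cdot \\ \cdot & \bullet & \cdot & \cdot \\ \cdot & \cdot & \bullet & \cdot \\ \cdot & \cdot & \cdot & \bullet \end{pmatrix}$$
    - Zero: $0 \boxdot x = 0$, where
      $$0 = \begin{pmatrix} \cdot & \cdot & \cdot & \bullet \\ \cdot & \cdot & \bullet & \cdot \\ \cdot & \bullet & \cdot & \cdot \\ \bullet & \cdot & \cdot & \cdot \end{pmatrix}$$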
25. Matrix distance multiplication: seaweed braids.
    - Related structures:
        - positive braids: far commutativity; braid relations
        - braids: $g_i g_i^{-1} = 1$; far commutativity; braid relations
        - Coxeter's presentation of $S_n$: $g_i^2 = 1$; far commutativity; braid relations
        - locally free idempotent monoid: idempotence; far commutativity [Vershik+: 2000]
    - Generalisations:
        - general 0-Hecke monoids [Fomin, Greene: 1998; Buch+: 2008]
        - Coxeter monoids [Tsaranov: 1990; Richardson, Springer: 1990]
        - $\mathcal{J}$-trivial monoids [Denton+: 2011]
29. Matrix distance multiplication: seaweed braids. Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP).
    - $T_3$: elements $1$, $a = g_1$, $b = g_2$; $ab$, $ba$, $aba = 0$. Rules: $aa \to a$, $bb \to b$, $bab \to 0$, $aba \to 0$.
    - $T_4$: elements $1$, $a = g_1$, $b = g_2$, $c = g_3$; $ab$, $ac$, $ba$, $bc$, $cb$, $aba$, $abc$, $acb$, $bac$, $bcb$, $cba$, $abac$, $abcb$, $acba$, $bacb$, $bcba$, $abacb$, $abcba$, $bacba$, $abacba = 0$. Rules: $aa \to a$, $bb \to b$, $cc \to c$, $ca \to ac$, $bab \to aba$, $cbc \to bcb$, $cbac \to bcba$, $abacba \to 0$.
    - Easy to use, but not an efficient algorithm.
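The listed rewriting rules can be checked mechanically. The sketch below uses the standard realisation of the 0-Hecke monoid acting on permutations: generator $g_i$ swaps positions $i$ and $i+1$ only when that creates an inversion. The letter-to-generator encoding and the function names `apply_word` and `elements` are assumptions made for this illustration.

```python
def apply_word(word, n):
    """Apply a word over a, b, c, ... (a = g1, b = g2, ...) to the identity permutation."""
    perm = list(range(n))
    for letter in word:
        i = ord(letter) - ord("a")
        # An elementary crossing acts only if the two entries are still in sorted order.
        if perm[i] < perm[i + 1]:
            perm[i], perm[i + 1] = perm[i + 1], perm[i]
    return tuple(perm)


def elements(n):
    """All monoid elements, found by closing the identity under the n - 1 generators."""
    letters = [chr(ord("a") + i) for i in range(n - 1)]
    seen = {apply_word("", n): ""}
    frontier = [""]
    while frontier:
        nxt = []
        for word in frontier:
            for letter in letters:
                image = apply_word(word + letter, n)
                if image not in seen:
                    seen[image] = word + letter
                    nxt.append(word + letter)
        frontier = nxt
    return seen


print(len(elements(3)), len(elements(4)))                # 6 and 24, i.e. n!
print(apply_word("cbac", 4) == apply_word("bcba", 4))    # both sides of a T4 rule agree
```

Both sides of every rule on the slide map to the same permutation under this action, and the words rewritten to $0$ map to the reversal.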
31. Matrix distance multiplication: seaweed matrix multiplication. The implicit unit-Monge matrix $\odot$-multiplication problem: given permutation matrices $P_A$, $P_B$, compute $P_C$ such that $P_A^\Sigma \odot P_B^\Sigma = P_C^\Sigma$ (equivalently, $P_A \boxdot P_B = P_C$). Matrix $\odot$-multiplication, running time:

    | type | time | reference |
    |---|---|---|
    | general | $O(n^3)$ | standard |
    | general | $O\bigl(\frac{n^3 (\log\log n)^3}{\log^2 n}\bigr)$ | [Chan: 2007] |
    | Monge | $O(n^2)$ | via [Aggarwal+: 1987] |
    | implicit unit-Monge | $O(n^{1.5})$ | [T: 2006] |
    | implicit unit-Monge | $O(n \log n)$ | [T: 2010] |
32. Matrix distance multiplication: seaweed matrix multiplication. [Figure sequence, slides 32 to 39: $P_A$ and $P_B$ with unknown $P_C$; the split into $P_{A,lo}$, $P_{A,hi}$ and $P_{B,lo}$, $P_{B,hi}$; the partial results $P_{C,lo} + P_{C,hi}$; the final $P_C$.]
40. Matrix distance multiplication: seaweed matrix multiplication. Implicit unit-Monge matrix $\odot$-multiplication: the algorithm.
    - $P_C^\Sigma(i, k) = \min_j \bigl(P_A^\Sigma(i, j) + P_B^\Sigma(j, k)\bigr)$.
    - Divide-and-conquer on the range of $j$.
    - Divide $P_A$ horizontally, $P_B$ vertically: two subproblems of effective size $n/2$, namely $P_{A,lo}^\Sigma \odot P_{B,lo}^\Sigma = P_{C,lo}^\Sigma$ and $P_{A,hi}^\Sigma \odot P_{B,hi}^\Sigma = P_{C,hi}^\Sigma$.
    - Conquer: the low nonzeros of $P_{C,lo}$ and the high nonzeros of $P_{C,hi}$ appear in $P_C$.
    - The remaining nonzeros of $P_{C,lo}$ and $P_{C,hi}$ are "wrong", and need to be corrected to obtain the remaining nonzeros of $P_C$.
    - Correction can be done in time $O(n)$ using the unit-Monge property.
    - Overall time $O(n \log n)$.
41. Matrix distance multiplication: Bruhat order. Comparing permutations by the "degree of sortedness".
    - Bruhat order: permutation $A$ is lower ("more sorted") than permutation $B$ in the Bruhat order ($A \preceq B$) if $B$ can be transformed to $A$ by successive pairwise sorting between arbitrary pairs of elements.
    - Permutation matrices: $P_A \preceq P_B$ if $P_B$ can be transformed to $P_A$ by successive submatrix substitution $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \to \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$.
42. Matrix distance multiplication: Bruhat order. Bruhat comparability, running time:

    | criterion | time | reference |
    |---|---|---|
    | $P_A \preceq P_B$ iff $P_A^\Sigma \le P_B^\Sigma$ elementwise | $O(n^2)$ | folklore |
    | $P_A \preceq P_B$ iff $P_A^R \boxdot P_B = \mathit{Id}^R$ | $O(n \log n)$ | [T: NEW] |

    Here $P^R$ denotes clockwise rotation of matrix $P$.
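The folklore $O(n^2)$ criterion is short enough to write out. This sketch assumes permutations are given as lists with the nonzero of row `r` in column `perm[r]`, so that the sorted permutation corresponds to the identity matrix; the names `distribution` and `bruhat_leq` are hypothetical.

```python
def distribution(perm):
    """(n+1) x (n+1) matrix counting nonzeros in rows >= i and columns < j."""
    n = len(perm)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            hit = 1 if perm[i] == j - 1 else 0
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + hit
    return S


def bruhat_leq(pa, pb):
    """True if pa is lower ("more sorted") than or equal to pb in the Bruhat order."""
    A, B = distribution(pa), distribution(pb)
    return all(x <= y for row_a, row_b in zip(A, B) for x, y in zip(row_a, row_b))


print(bruhat_leq([0, 1, 2], [2, 1, 0]))  # sorted vs reversed
print(bruhat_leq([1, 0, 2], [0, 2, 1]))  # an incomparable pair
```

The second call illustrates that the Bruhat order is only a partial order: neither of the two single crossings can be obtained from the other by pairwise sorting.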
45. Semi-local string comparison: semi-local LCS and edit distance.
    - Consider strings (= sequences) over an alphabet of size $\sigma$.
    - Distinguish contiguous substrings and not necessarily contiguous subsequences.
    - Special cases of substring: prefix, suffix.
    - Notation: strings $a$, $b$ of length $m$, $n$ respectively.
    - Assume where necessary: $m \le n$; $m$, $n$ reasonably close.
    - The longest common subsequence (LCS) score: length of the longest string that is a subsequence of both $a$ and $b$; equivalently, alignment score, where score(match) = 1 and score(mismatch) = 0.
    - In biological terms, "loss-free alignment" (unlike "lossy" BLAST).
47. Semi-local string comparison: semi-local LCS and edit distance. The LCS problem: give the LCS score for $a$ vs $b$. LCS, running time:

    | time | condition | reference |
    |---|---|---|
    | $O(mn)$ | | [Wagner, Fischer: 1974] |
    | $O\bigl(\frac{mn}{\log^2 n}\bigr)$ | $\sigma = O(1)$ | [Masek, Paterson: 1980] [Crochemore+: 2003] |
    | $O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$ | | [Paterson, Dančík: 1994] [Bille, Farach-Colton: 2008] |

    Running time varies depending on the RAM model version. We assume word-RAM with word size $\log n$ (where it matters).
48. Semi-local string comparison: semi-local LCS and edit distance. LCS on the alignment graph (directed, acyclic). [Figure: alignment graph with column labels B A A B C A B C A B A C A and row labels B A A B C B C A; blue edges have score 0, red edges have score 1.] score("BAABCBCA", "BAABCABCABACA") = len("BAABCBCA") = 8. LCS = highest-score path from top-left to bottom-right.
49. Semi-local string comparison: semi-local LCS and edit distance. LCS: dynamic programming [WF: 1974]. Sweep cells in any $\ll$-compatible order. Cell update: time $O(1)$. Overall time $O(mn)$.
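As a concrete reference point, the textbook dynamic programme can be sketched as follows. It sweeps cells row by row, which is one valid sweep order, and keeps only two rows of the table; the function name `lcs_score` is an assumption of this sketch.

```python
def lcs_score(a, b):
    """Length of the longest common subsequence of a and b, in O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)  # scores for the previous prefix of a
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)         # follow a match edge
            else:
                cur.append(max(prev[j], cur[j - 1]))  # follow a gap edge
        prev = cur
    return prev[-1]


print(lcs_score("BAABCBCA", "BAABCABCABACA"))  # 8, as on the alignment-graph slide
print(lcs_score("BAABCBCA", "CABCABA"))        # 5
```

Each cell update takes constant time, which gives the $O(mn)$ bound quoted on the slide.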
50. Semi-local string comparison: semi-local LCS and edit distance. "Begin at the beginning," the King said gravely, "and go on till you come to the end: then stop." (L. Carroll, Alice in Wonderland.) The standard approach in dynamic programming.
52. Semi-local string comparison: semi-local LCS and edit distance. Sometimes dynamic programming can be run from both ends for extra flexibility. Is there a better, fully flexible alternative (e.g. for comparing compressed strings, comparing strings dynamically or in parallel, etc.)?
53. Semi-local string comparison: semi-local LCS and edit distance. LCS: micro-block dynamic programming [MP: 1980; BF: 2008].
    - Sweep cells in micro-blocks, in any $\ll$-compatible order.
    - Micro-block size: $t = O(\log n)$ when $\sigma = O(1)$; $t = O\bigl(\frac{\log n}{\log\log n}\bigr)$ otherwise.
    - Micro-block interface: $O(t)$ characters, each $O(\log \sigma)$ bits, can be reduced to $O(\log t)$ bits; $O(t)$ small integers, each $O(1)$ bits.
    - Micro-block update: time $O(1)$, by precomputing all possible interfaces.
    - Overall time $O\bigl(\frac{mn}{\log^2 n}\bigr)$ when $\sigma = O(1)$, $O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$ otherwise.
55. Semi-local string comparison: semi-local LCS and edit distance. The semi-local LCS problem: give the (implicit) matrix of $O\bigl((m + n)^2\bigr)$ LCS scores:
    - string-substring LCS: string $a$ vs every substring of $b$
    - prefix-suffix LCS: every prefix of $a$ vs every suffix of $b$
    - suffix-prefix LCS: every suffix of $a$ vs every prefix of $b$
    - substring-string LCS: every substring of $a$ vs string $b$

    Cf.: dynamic programming gives prefix-prefix LCS.
56. Semi-local string comparison: semi-local LCS and edit distance. Semi-local LCS on the alignment graph. [Figure: alignment graph with column labels B A A B C A B C A B A C A and row labels B A A B C B C A; blue edges have score 0, red edges have score 1.] score("BAABCBCA", "CABCABA") = len("ABCBA") = 5. String-substring LCS: all highest-score top-to-bottom paths. Semi-local LCS: all highest-score boundary-to-boundary paths.
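57. Semi-local string comparison: score matrices and seaweed matrices. The score matrix $H$ for $a$ = "BAABCBCA", $b$ = "BAABCABCABACA": $H(i, j) = \mathrm{score}(a, b\langle i : j \rangle)$; $H(4, 11) = 5$; $H(i, j) = j - i$ if $i > j$.

    $$H = \begin{pmatrix}
    0 & 1 & 2 & 3 & 4 & 5 & 6 & 6 & 7 & 8 & 8 & 8 & 8 & 8 \\
    -1 & 0 & 1 & 2 & 3 & 4 & 5 & 5 & 6 & 7 & 7 & 7 & 7 & 7 \\
    -2 & -1 & 0 & 1 & 2 & 3 & 4 & 4 & 5 & 6 & 6 & 6 & 6 & 7 \\
    -3 & -2 & -1 & 0 & 1 & 2 & 3 & 3 & 4 & 5 & 5 & 6 & 6 & 7 \\
    -4 & -3 & -2 & -1 & 0 & 1 & 2 & 2 & 3 & 4 & 4 & 5 & 5 & 6 \\
    -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 & 4 & 4 & 5 & 5 & 6 \\
    -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 & 3 & 4 & 4 & 5 \\
    -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 2 & 3 & 3 & 4 \\
    -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 & 3 & 4 \\
    -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 & 4 \\
    -10 & -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 & 3 \\
    -11 & -10 & -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 & 2 \\
    -12 & -11 & -10 & -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0 & 1 \\
    -13 & -12 & -11 & -10 & -9 & -8 & -7 & -6 & -5 & -4 & -3 & -2 & -1 & 0
    \end{pmatrix}$$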
58. Semi-local string comparison: score matrices and seaweed matrices. Semi-local LCS: output representation and running time.

    | size | query time | notes |
    |---|---|---|
    | $O(n^2)$ | $O(1)$ | trivial |
    | $O(m^{1/2} n)$ | $O(\log n)$ | string-substring [Alves+: 2003] |
    | $O(n)$ | $O(n)$ | string-substring [Alves+: 2005] |
    | $O(n \log n)$ | $O(\log^2 n)$ | [T: 2006] |

    ... or any 2D orthogonal range counting data structure.

    | running time | notes |
    |---|---|
    | $O(mn^2)$ | naive |
    | $O(mn)$ | string-substring [Schmidt: 1998; Alves+: 2005] |
    | $O(mn)$ | [T: 2006] |
    | $O\bigl(\frac{mn}{\log^{0.5} n}\bigr)$ | [T: 2006] |
    | $O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$ | [T: 2007] |
59. Semi-local string comparison: score matrices and seaweed matrices. The score matrix $H$ and the seaweed matrix $P$.
    - $H(i, j)$: the number of matched characters for $a$ vs substring $b\langle i : j \rangle$.
    - $j - i - H(i, j)$: the number of unmatched characters.
    - Properties of matrix $j - i - H(i, j)$: simple unit-Monge; therefore, it equals $P^\Sigma$, where $P = -H^\square$ is a permutation matrix.
    - $P$ is the seaweed matrix, giving an implicit representation of $H$.
    - Range tree for $P$: memory $O(n \log n)$, query time $O(\log^2 n)$.
62. Semi-local string comparison: score matrices and seaweed matrices. The score matrix $H$ and the seaweed matrix $P$ for $a$ = "BAABCBCA", $b$ = "BAABCABCABACA": $H(i, j) = \mathrm{score}(a, b\langle i : j \rangle)$; $H(4, 11) = 5$; $H(i, j) = j - i$ if $i > j$; $H(i, j) = j - i - P^\Sigma(i, j)$. [Figure: the 14 × 14 score matrix $H$ of slide 57 with overlays. Blue: difference in $H$ is 0; red: difference in $H$ is 1; green: $P(i, j) = 1$.]
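The relation between $H$ and $P$ can be checked by brute force on the slide's example. The sketch below recomputes every string-substring score with a plain LCS routine and then takes quadrangle differences of $j - i - H(i, j)$; it is quadratic in the number of LCS calls and only meant as a sanity check. The names `lcs`, `score_matrix` and `seaweed_matrix` are hypothetical.

```python
def lcs(a, b):
    """Plain LCS score by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score_matrix(a, b):
    """H[i][j] = LCS score of a against b[i:j] for i <= j, and j - i otherwise."""
    n = len(b)
    return [
        [lcs(a, b[i:j]) if i <= j else j - i for j in range(n + 1)]
        for i in range(n + 1)
    ]


def seaweed_matrix(H):
    """Quadrangle differences of j - i - H(i, j), restricted to the string-substring part."""
    n = len(H) - 1
    M = [[j - i - H[i][j] for j in range(n + 1)] for i in range(n + 1)]
    return [
        [M[r][c + 1] - M[r][c] - M[r + 1][c + 1] + M[r + 1][c] for c in range(n)]
        for r in range(n)
    ]


a, b = "BAABCBCA", "BAABCABCABACA"
H = score_matrix(a, b)
P = seaweed_matrix(H)
dominated = sum(P[r][c] for r in range(4, len(b)) for c in range(11))
print(H[4][11], 11 - 4 - dominated)  # both equal 5
```

On this restricted index range $P$ has at most one nonzero per row and column; counting the nonzeros below row 4 and left of column 11 reproduces the number of unmatched characters in $H(4, 11)$.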
63. Semi-local string comparison: score matrices and seaweed matrices. The score matrix $H$ and the seaweed matrix $P$ for $a$ = "BAABCBCA", $b$ = "BAABCABCABACA": $H(4, 11) = 11 - 4 - P^\Sigma(4, 11) = 11 - 4 - 2 = 5$. [Figure: the nonzeros of $P$ shown on their own.]
65. Semi-local string comparison: score matrices and seaweed matrices. The seaweed braid in the alignment graph for $a$ = "BAABCBCA", $b$ = "BAABCABCABACA": $H(4, 11) = 11 - 4 - P^\Sigma(i, j) = 11 - 4 - 2 = 5$. [Figure: seaweed braid drawn over the alignment graph.]
    - $P(i, j) = 1$ corresponds to a seaweed from top $i$ to bottom $j$.
    - Also define top-to-right, left-to-right and left-to-bottom seaweeds.
    - Gives a bijection between the top-left and bottom-right graph boundaries.
66. Semi-local string comparison: score matrices and seaweed matrices. Seaweed braid: a highly symmetric object (element of the 0-Hecke monoid of the symmetric group). Can be built recursively by assembling subbraids from separate parts. Highly flexible: local alignment, compression, parallel computation, ...
69. Semi-local string comparison: weighted alignment.
    - The LCS problem is a special case of the weighted alignment score problem with weighted matches ($w_M$), mismatches ($w_X$) and gaps ($w_G$).
    - LCS score: $w_M = 1$, $w_X = w_G = 0$.
    - Levenshtein score: $w_M = 2$, $w_X = 1$, $w_G = 0$.
    - Alignment score is rational if $w_M$, $w_X$, $w_G$ are rational numbers. Equivalent to LCS score on blown-up strings.
    - Edit distance: minimum cost to transform $a$ into $b$ by weighted character edits (insertion, deletion, substitution). Corresponds to weighted alignment score with $w_M = 0$, insertion/deletion weight $-w_G$, substitution weight $-w_X$.
70. Semi-local string comparison: weighted alignment. Weighted alignment graph. [Figure: alignment graph with column labels B A A B C A B C A B A C A and row labels B A A B C B C A; blue = 0, red (solid) = 2, red (dotted) = 1.] Levenshtein score("BAABCBCA", "CABCABA") = 11.
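The weighted alignment score generalises the LCS recurrence by attaching a weight to each edge type. The sketch below is a plain quadratic dynamic programme, assuming that each gap edge contributes $w_G$ and each diagonal edge contributes $w_M$ or $w_X$; the function name `alignment_score` and its parameter names are hypothetical.

```python
def alignment_score(a, b, w_match, w_mismatch, w_gap):
    """Highest total weight of a global alignment of a against b."""
    prev = [j * w_gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * w_gap]
        for j, y in enumerate(b, 1):
            diagonal = prev[j - 1] + (w_match if x == y else w_mismatch)
            cur.append(max(diagonal, prev[j] + w_gap, cur[j - 1] + w_gap))
        prev = cur
    return prev[-1]


# Levenshtein score weights from the slides: match 2, mismatch 1, gap 0.
print(alignment_score("BAABCBCA", "CABCABA", 2, 1, 0))  # 11
# LCS score as the special case: match 1, mismatch 0, gap 0.
print(alignment_score("BAABCBCA", "CABCABA", 1, 0, 0))  # 5
```

With the LCS weights the routine reduces to the ordinary LCS dynamic programme, which matches the statement that LCS is a special case of weighted alignment.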
71. Semi-local string comparison: weighted alignment. Alignment graph for blown-up strings. [Figure: alignment graph with column labels \$B \$A \$A \$B \$C \$A \$B \$C \$A \$B \$A \$C \$A and row labels \$B \$A \$A \$B \$C \$B \$C \$A; blue = 0, red = 0.5 or 1.] Levenshtein score("BAABCBCA", "CABCABA") = 2 · 5.5.
72. Semi-local string comparison: weighted alignment. Rational-weighted semi-local alignment reduced to semi-local LCS. [Figure: alignment graph for the blown-up strings with column labels \$B \$A \$A \$B \$C \$A \$B \$C \$A \$B \$A \$C \$A and row labels \$B \$A \$A \$B \$C \$B \$C \$A.] Let $w_M = 1$, $w_X = \frac{\mu}{\nu}$, $w_G = 0$. Increase $\times \nu^2$ in complexity (can be reduced to $\nu$).
75. 75. The seaweed methodSeaweed combing B A A B C A B C A B A C ABAABCBCA Alexander Tiskin (Warwick) Semi-local string comparison 49 / 132
76. 76. The seaweed methodSeaweed combing B A A B C A B C A B A C ABAABCBCA Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
114. The seaweed method: Seaweed combing
Semi-local LCS: seaweed combing [T: 2006]
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells in any ≪-compatible order
Match cell: two seaweeds uncrossed; skip
Mismatch cell: two seaweeds cross
  if the same seaweeds crossed before, uncross them
  otherwise skip, keep seaweeds crossed
Cell update: time O(1)
Overall time O(mn)
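The combing rules above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names `comb` and `lcs_vs_substring`, the seaweed numbering (left edge from bottom to top, then top edge from left to right) and the row-by-row sweep are assumptions made for the example. Under that numbering, two seaweeds meeting in a cell have crossed before exactly when the one arriving from the left carries the larger number.

```python
def comb(a, b):
    """Comb the seaweed braid for a (rows) vs b (columns).

    Returns (h, v): h[i] is the seaweed leaving the right edge in row i,
    v[j] is the seaweed leaving the bottom edge in column j.
    """
    m, n = len(a), len(b)
    # Assumed numbering: left edge bottom-to-top gets 0..m-1, top edge gets m..m+n-1.
    h = [m - 1 - i for i in range(m)]
    v = [m + j for j in range(n)]
    for i in range(m):          # row-major order respects the left/top dependencies
        for j in range(n):
            # Match cell, or mismatch cell whose seaweeds already crossed:
            # the seaweeds do not cross, so the left one turns down and the top one turns right.
            if a[i] == b[j] or h[i] > v[j]:
                h[i], v[j] = v[j], h[i]
            # Otherwise (mismatch, not crossed before) they cross: nothing to swap.
    return h, v


def lcs_vs_substring(v, m, i, j):
    """LCS score of a vs b[i:j], read off the bottom-edge seaweeds in O(j - i) time."""
    # Count seaweeds that start on the top edge at column >= i and end on the bottom edge at column < j.
    crossing = sum(1 for c in range(i, j) if v[c] >= m + i)
    return (j - i) - crossing
```

The linear-time query is only for clarity; the talk's data structures answer such queries much faster.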
115. The seaweed method: Micro-block seaweed combing
[Figure: step-by-step animation of micro-block seaweed combing on the same alignment grid; string labels as extracted: B A A B C A B C A B A C and ABAABCBCA]
130. The seaweed method: Micro-block seaweed combing
Semi-local LCS: micro-block seaweed combing [T: 2007]
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells in micro-blocks, in any ≪-compatible order
Micro-block size: $t = O\left(\frac{\log n}{\log\log n}\right)$
Micro-block interface:
  O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits
  O(t) integers, each O(log n) bits, can be reduced to O(log t) bits
Micro-block update: time O(1), by precomputing all possible interfaces
Overall time $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$
132. The seaweed method: Cyclic LCS
The cyclic LCS problem
Give the maximum LCS score for a vs all cyclic rotations of b
Cyclic LCS: running time
  $O\left(\frac{mn^2}{\log n}\right)$  naive
  O(mn log m)  [Maes: 1990]
  O(mn)  [Bunke, Bühler: 1993; Landau+: 1998; Schmidt: 1998]
  $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$  [T: 2007]
133. The seaweed method: Cyclic LCS
Cyclic LCS: the algorithm
Micro-block seaweed combing on a vs bb, time $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$
Make n string-substring LCS queries, time negligible
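A rough sketch of this reduction, using plain O(mn) cell-by-cell combing instead of micro-blocks: comb a against bb once, then read off the score of a against each length-n window of bb, since those windows are exactly the rotations of b. The function name `cyclic_lcs` and the seaweed numbering are assumptions for illustration, and each query here is a simple linear scan rather than the fast query the slide has in mind.

```python
def cyclic_lcs(a, b):
    """Maximum LCS score of a vs any cyclic rotation of b (illustrative sketch)."""
    m, n = len(a), len(b)
    bb = b + b
    # Assumed numbering: left edge bottom-to-top gets 0..m-1, top edge gets m..m+2n-1.
    h = [m - 1 - i for i in range(m)]
    v = [m + j for j in range(2 * n)]
    for i in range(m):
        for j in range(2 * n):
            # No crossing on a match, or if these two seaweeds crossed earlier.
            if a[i] == bb[j] or h[i] > v[j]:
                h[i], v[j] = v[j], h[i]
    best = 0
    for i in range(n):
        # Window bb[i:i+n] is the rotation of b starting at position i.
        crossing = sum(1 for c in range(i, i + n) if v[c] >= m + i)
        best = max(best, n - crossing)
    return best
```

The point of the sketch is that the grid is combed only once; all n rotations are then answered from the same braid.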
135. The seaweed method: Longest repeating subsequence
The longest repeating subsequence problem
Find the longest subsequence of a that is a square (a repetition of two identical strings)
Longest repeating subsequence: running time
  $O(m^3)$  naive
  $O(m^2)$  [Kosowski: 2004]
  $O\left(\frac{m^2(\log\log m)^2}{\log^2 m}\right)$  [T: 2007]
136. The seaweed method: Longest repeating subsequence
Longest repeating subsequence: the algorithm
Micro-block seaweed combing on a vs a, time $O\left(\frac{m^2(\log\log m)^2}{\log^2 m}\right)$
Make m − 1 suffix-prefix LCS queries, time negligible
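The same idea can be sketched for squares: comb a against itself once, then for each split point k read off the LCS of the suffix a[k:] against the prefix a[:k]. The sketch below uses plain O(m^2) combing and linear-scan queries; the name `longest_square_subsequence`, the seaweed numbering, and the specific counting rule for suffix-prefix queries (count seaweeds that start in the top k rows or on the top edge and leave through the first k bottom columns) are assumptions of this example.

```python
def longest_square_subsequence(a):
    """Length of the longest subsequence of a of the form xx (illustrative sketch)."""
    m = len(a)
    # Assumed numbering: left edge bottom-to-top gets 0..m-1, top edge gets m..2m-1.
    h = [m - 1 - i for i in range(m)]
    v = [m + j for j in range(m)]
    for i in range(m):
        for j in range(m):
            # No crossing on a match, or if these two seaweeds crossed earlier.
            if a[i] == a[j] or h[i] > v[j]:
                h[i], v[j] = v[j], h[i]
    best = 0
    for k in range(1, m):
        # Suffix-prefix query: LCS of a[k:] vs a[:k].
        # Seaweeds numbered >= m - k start in rows 0..k-1 or on the top edge.
        crossing = sum(1 for c in range(k) if v[c] >= m - k)
        best = max(best, k - crossing)
    return 2 * best
```

For instance, splitting "ABAB" after two characters pairs "AB" with "AB", so the whole string is a square.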
137. The seaweed method: Approximate matching
The approximate pattern matching problem
Give the substring closest to a by alignment score, starting at each position in b
Assume rational alignment score
Approximate pattern matching: running time
  O(mn)  [Sellers: 1980]
  $O\left(\frac{mn}{\log n}\right)$  σ = O(1), via [Masek, Paterson: 1980]
  $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$  via [Bille, Farach-Colton: 2008]
138. The seaweed method: Approximate matching
Approximate pattern matching: the algorithm
Micro-block seaweed combing on a vs b (with blow-up), time $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$
The implicit semi-local edit score matrix:
  an anti-Monge matrix
  approximate pattern matching ∼ row minima
Row minima in O(n) element queries [Aggarwal+: 1987]
Each query in time $O(\log^2 n)$ using the range tree representation, combined query time negligible
Overall running time $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$, same as [Bille, Farach-Colton: 2008]
141. Periodic string comparison: Wraparound seaweed combing
The periodic string-substring LCS problem
Give (implicit) LCS scores for a vs each substring of b = . . . uuu . . . = $u^{\pm\infty}$
Let u be of length p
May assume that every character of a occurs in u
Only substrings of b of length at most mp need to be considered (for longer substrings the LCS score is m)
142. Periodic string comparison: Wraparound seaweed combing
[Figure: step-by-step animation of wraparound seaweed combing; string labels as extracted: B A A B C A B C A B A C and ABAABCBCA]
184. Periodic string comparison: Wraparound seaweed combing
Periodic string-substring LCS: wraparound seaweed combing
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells row-by-row: each row starts at match cell, wraps at boundary
Match cell: two seaweeds uncrossed; skip
Mismatch cell: two seaweeds cross
  if the same seaweeds crossed before (with wrapping), uncross them
  otherwise skip, keep seaweeds crossed
Cell update: time O(1)
Overall time O(mn)
String-substring LCS score: count seaweeds with multiplicities
185. Periodic string comparison: Wraparound seaweed combing
The tandem LCS problem
Give LCS score for a vs b = $u^k$
We have n = kp; may assume k ≤ m
Tandem LCS: running time
  O(mkp)  naive
  O(m(k + p))  [Landau, Ziv-Ukelson: 2001]
  O(mp)  [T: 2009]
Direct application of wraparound seaweed combing
186. Periodic string comparison: Wraparound seaweed combing
The tandem alignment problem
Give the substring closest to a by alignment score among certain substrings of b = $u^{\pm\infty}$:
  global: substrings $u^k$ of length kp across all k
  cyclic: substrings of length kp across all k
  local: substrings of any length
Tandem alignment: running time
  $O(m^2 p)$  all     naive
  O(mp)  global  [Myers, Miller: 1989]
  O(mp log p)  cyclic  [Benson: 2005]
  O(mp)  cyclic  [T: 2009]
  O(mp)  local   [Myers, Miller: 1989]
187. Periodic string comparison: Wraparound seaweed combing
Cyclic tandem alignment: the algorithm
Periodic seaweed combing for a vs b (with blow-up), time O(mp)
For each k ∈ [1 : m]:
  solve tandem LCS (under given alignment score) for a vs $u^k$
  obtain scores for a vs p successive substrings of b of length kp by LCS batch query: time O(1) per substring
Running time O(mp)
191. The transposition network method: Transposition networks
Comparison network: a circuit of comparators
A comparator sorts two inputs and outputs them in prescribed order
Comparison networks traditionally used for non-branching merging/sorting
Classical comparison networks: # comparators
  merging  O(n log n)  [Batcher: 1968]
  sorting  $O(n \log^2 n)$  [Batcher: 1968]
  sorting  O(n log n)  [Ajtai+: 1983]
Comparison networks are visualised by wire diagrams
Transposition network: all comparisons are between adjacent wires
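As a small illustration of the definition (not an example taken from the talk), odd-even transposition sort is a classical transposition network: every comparator acts on two adjacent wires, and the sequence of comparisons is fixed in advance, independent of the data. The function name below is hypothetical.

```python
def odd_even_transposition_sort(values):
    """Sort by a fixed network of comparators on adjacent wires only."""
    x = list(values)
    n = len(x)
    for rnd in range(n):                      # n rounds are enough for n wires
        # Even rounds compare wires (0,1), (2,3), ...; odd rounds compare (1,2), (3,4), ...
        for i in range(rnd % 2, n - 1, 2):
            if x[i] > x[i + 1]:               # comparator: smaller value to the lower-numbered wire
                x[i], x[i + 1] = x[i + 1], x[i]
    return x
```

The positions compared never depend on the values, which is what makes this a network rather than an ordinary branching sort.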
192. The transposition network method: Transposition networks
Seaweed combing as a transposition network
[Figure: alignment grid drawn as a wire diagram for strings over A, B, C; input labels −7, −5, −3, −1, +1, +3, +5, +7; output labels as extracted: −7, −1, +3, −3, −5, +7, +5, +1]
Character mismatches correspond to comparators
Inputs anti-sorted (sorted in reverse); each value traces a seaweed
193. The transposition network method: Transposition networks
Global LCS: transposition network with binary input
[Figure: the same wire diagram with binary input labels 0, 0, 0, 0, 1, 1, 1, 1; output labels as extracted: 0, 0, 1, 0, 0, 1, 1, 1]
Inputs still anti-sorted, but may not be distinct
Comparison between equal values is indeterminate
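A minimal sketch of the binary network for global LCS follows. The labelling is an assumption of this example and may be mirrored relative to the slide: wires entering from the left carry 0, wires entering from the top carry 1, a mismatch cell acts as a comparator, and a match cell always exchanges its two wires. With this labelling the LCS score is the number of 0s that leave through the bottom edge. The function name `lcs_binary_network` is hypothetical.

```python
def lcs_binary_network(a, b):
    """Global LCS score of a and b via a 0/1 transposition network (illustrative sketch)."""
    m, n = len(a), len(b)
    h = [0] * m          # assumed labelling: wires entering from the left carry 0
    v = [1] * n          # wires entering from the top carry 1
    for i in range(m):
        for j in range(n):
            # Match cell: wires always exchange (left wire turns down, top wire turns right).
            # Mismatch cell: comparator, exchanging only when the left value is larger.
            # Equal values: exchanging or not gives the same bits, so the choice is immaterial.
            if a[i] == b[j] or h[i] > v[j]:
                h[i], v[j] = v[j], h[i]
    return n - sum(v)    # number of 0s on the bottom edge
```

Because only the bits matter, a whole row of such cells can be packed into machine words, which is where the bit-parallel formulation comes from.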
194. The transposition network method: Parameterised string comparison
Parameterised string comparison
String comparison sensitive e.g. to
  low similarity: small λ = LCS(a, b)
  high similarity: small κ = $\mathrm{dist}_{LCS}(a, b)$ = m + n − 2λ
Can also use weighted alignment score or edit distance
Assume m = n, therefore κ = 2(n − λ)
195. The transposition network method: Parameterised string comparison
Low-similarity comparison: small λ
  sparse set of matches, may need to look at them all
  preprocess matches for fast searching, time O(n log σ)
High-similarity comparison: small κ
  set of matches may be dense, but only need to look at small subset
  no need to preprocess, linear search is OK
Flexible comparison: sensitive to both high and low similarity, e.g. by both comparison types running alongside each other
196. The transposition network method: Parameterised string comparison
Parameterised string comparison: running time
Low-similarity, after preprocessing in O(n log σ)
  O(nλ)  [Hirschberg: 1977] [Apostolico, Guerra: 1985] [Apostolico+: 1992]
High-similarity, no preprocessing
  O(n · κ)  [Ukkonen: 1985] [Myers: 1986]
Flexible
  O(λ · κ · log n)  no preproc  [Myers: 1986; Wu+: 1990]
  O(λ · κ)  after preproc  [Rick: 1995]
197. The transposition network method: Parameterised string comparison
Parameterised string comparison: the waterfall algorithm
Low-similarity: O(n · λ)    High-similarity: O(n · κ)
[Figure: two binary transposition networks, one per case, with 0/1 values on the wires]
Trace 0s through network in contiguous blocks and gaps
198. The transposition network method: Dynamic string comparison
The dynamic LCS problem
Maintain current LCS score under updates to one or both input strings
Both input strings are streams, updated on-line:
  appending characters at left or right
  deleting characters at left or right
Assume for simplicity m ≈ n, i.e. m = Θ(n)
Goal: linear time per update
  O(n) per update of a (n = |b|)
  O(m) per update of b (m = |a|)
199. The transposition network method: Dynamic string comparison
Dynamic LCS in linear time: update models
  left      right
  –         app+del   standard DP [Wagner, Fischer: 1974]
  app       app       a fixed [Landau+: 1998], [Kim, Park: 2004]
  app       app       [Ishida+: 2005]
  app+del   app+del   [T: NEW]
Main idea:
  for append only, maintain seaweed matrix $P_{a,b}$
  for append+delete, maintain partial seaweed layout by tracing a transposition network
200. The transposition network method: Bit-parallel string comparison
Bit-parallel string comparison
String comparison using standard instructions on words of size w
Bit-parallel string comparison: running time
  O(mn/w)  [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001]
202. The transposition network method: Bit-parallel string comparison
Bit-parallel string comparison: binary transposition network
In every cell: input bits s, c; output bits s', c'; match/mismatch flag µ

First table (agrees with the full cell formula below):
  s    0 1 0 1 0 1 0 1
  c    0 0 1 1 0 0 1 1
  µ    0 0 0 0 1 1 1 1
  s'   0 1 1 1 0 0 1 1
  c'   0 0 0 1 0 1 0 1

Second table (agrees with the sum s + (s ∧ µ) + c alone, without the ∨ (s ∧ ¬µ) term):
  s    0 1 0 1 0 1 0 1
  c    0 0 1 1 0 0 1 1
  µ    0 0 0 0 1 1 1 1
  s'   0 1 1 0 0 0 1 1
  c'   0 0 0 1 0 1 0 1

2c' + s' ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ)
S ← (S + (S ∧ M)) ∨ (S ∧ ¬M), where S, M are words of bits s, µ
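The word-level update S ← (S + (S ∧ M)) ∨ (S ∧ ¬M) can be tried out directly with Python's arbitrary-precision integers standing in for one long machine word. This is a sketch under assumptions: the name `lcs_bit_parallel`, the choice of one bit per character of b, the all-ones initial word, and reading the score as the number of zero bits at the end are conventions of this example, following the usual presentation of this family of algorithms.

```python
def lcs_bit_parallel(a, b):
    """Global LCS score of a and b using one big-integer word per row (illustrative sketch)."""
    n = len(b)
    full = (1 << n) - 1
    # Match masks: bit j of masks[ch] is set when b[j] == ch.
    masks = {}
    for j, ch in enumerate(b):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    S = full                                  # assumed initial state: all bits set
    for ch in a:
        M = masks.get(ch, 0)
        # The addition propagates carries along the word, playing the role of the c bits.
        S = ((S + (S & M)) | (S & ~M)) & full
    return n - bin(S).count("1")              # score = number of zero bits
```

On real hardware the word is split into ⌈n/w⌉ machine words with explicit carry handling between them, which gives the O(mn/w) bound quoted above.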
205. Sparse string comparison: Semi-local LCS between permutations
The LCS problem on permutation strings
Give LCS score for a vs b
In each of a, b all characters distinct: total m = n matches
Equivalent to
  longest increasing subsequence (LIS) in a string
  maximum clique in a permutation graph
  maximum planar matching in an embedded bipartite graph
LCS on permutation strings: running time
  O(n log n)  implicit in [Erdős, Szekeres: 1935] [Robinson: 1938; Knuth: 1970; Dijkstra: 1980]
  O(n log log n)  unit-RAM  [Chang, Wang: 1992] [Bespamyatnikh, Segal: 2000]
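The equivalence with LIS can be made concrete: replace each character of a by its position in b, and the LCS of the two permutation strings becomes the longest strictly increasing subsequence of that position sequence. The sketch below uses the textbook O(n log n) method with binary search over "smallest tail" values; the function name `lcs_of_permutations` is hypothetical, and the code assumes each string has distinct characters.

```python
from bisect import bisect_left


def lcs_of_permutations(a, b):
    """LCS score of two strings whose characters are all distinct (illustrative sketch)."""
    pos_in_b = {ch: j for j, ch in enumerate(b)}
    # Positions in b of the characters of a, in the order they appear in a.
    seq = [pos_in_b[ch] for ch in a if ch in pos_in_b]
    tails = []   # tails[k] = smallest possible last value of an increasing subsequence of length k + 1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)      # x extends the longest subsequence found so far
        else:
            tails[k] = x         # x gives a smaller tail for length k + 1
    return len(tails)
```

Each element costs one binary search, which is where the O(n log n) bound in the table comes from.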