    20121020 semi local-string_comparison_tiskin Presentation Transcript

    • Semi-local string comparison: Algorithmic techniques and applications
      Alexander Tiskin, Department of Computer Science, University of Warwick
      http://go.warwick.ac.uk/alextiskin
      Alexander Tiskin (Warwick) Semi-local string comparison 1 / 132
    • Outline
      1 Introduction
      2 Matrix distance multiplication
      3 Semi-local string comparison
      4 The seaweed method
      5 Periodic string comparison
      6 The transposition network method
      7 Sparse string comparison
      8 Compressed string comparison
      9 Beyond semi-locality
      10 Conclusions and future work
    • Introduction
      String matching: finding an exact pattern in a string
      String comparison: finding similar patterns in two strings
      Applications: computational biology, image recognition, ...
      Standard types of string comparison:
        • global: whole string vs whole string
        • local: substrings vs substrings
      Main focus of this work:
        • semi-local: whole string vs substrings; prefixes vs suffixes
      Closely related to approximate string matching (no relation to approximation algorithms!)
      Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices)
    • Introduction: Terminology and notation
      $x^- = x - \frac{1}{2}$, $x^+ = x + \frac{1}{2}$
      Integers: $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$
      Half-integers: $\{\ldots, -\frac{3}{2}, -\frac{1}{2}, \frac{1}{2}, \frac{3}{2}, \frac{5}{2}, \ldots\} = \{\ldots, (-2)^+, (-1)^+, 0^+, 1^+, 2^+, \ldots\}$
      $(i, j) \ll (i', j')$ iff $i < i'$ and $j < j'$
      $(i, j) \gtrless (i', j')$ iff $i > i'$ and $j < j'$
      A permutation matrix is a 0/1 matrix with exactly one nonzero per row and per column, e.g.
      $\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
    • Introduction: Terminology and notation
      Given matrix $D$, its distribution matrix $D^\Sigma$ is made up of $\gtrless$-dominance sums:
      $D^\Sigma(i, j) = \sum_{\hat\imath > i,\ \hat\jmath < j} D(\hat\imath, \hat\jmath)$
      Given matrix $E$, its density matrix $E^\square$ is made up of quadrangle differences:
      $E^\square(\hat\imath, \hat\jmath) = E(\hat\imath^-, \hat\jmath^+) - E(\hat\imath^-, \hat\jmath^-) - E(\hat\imath^+, \hat\jmath^+) + E(\hat\imath^+, \hat\jmath^-)$
      where $D^\Sigma$, $E$ are over integers; $D$, $E^\square$ are over half-integers
      Example:
      $\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^\Sigma = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$
      $\begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}^\square = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
      $(D^\Sigma)^\square = D$ for all $D$
      Matrix $E$ is simple, if $(E^\square)^\Sigma = E$; equivalently, if it has all zeros in the left column and bottom row
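The two definitions on this slide can be checked on the 3×3 example with a short sketch. This is an illustration only: the function names `distribution` and `density` are hypothetical, and half-integer index $r + \frac{1}{2}$ is stored at list position `r`.

```python
def distribution(D):
    """Distribution matrix: DS[i][j] = sum of D[r][c] over rows r >= i, columns c < j."""
    n_rows, n_cols = len(D), len(D[0])
    DS = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for i in range(n_rows - 1, -1, -1):      # sweep rows bottom-up
        for j in range(1, n_cols + 1):       # sweep columns left-to-right
            DS[i][j] = (DS[i + 1][j] + DS[i][j - 1]
                        - DS[i + 1][j - 1] + D[i][j - 1])
    return DS


def density(E):
    """Density matrix: quadrangle differences over every 2x2 block of E."""
    return [[E[i][j + 1] - E[i][j] - E[i + 1][j + 1] + E[i + 1][j]
             for j in range(len(E[0]) - 1)]
            for i in range(len(E) - 1)]


# The permutation matrix used as the example on the slide.
P = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
PS = distribution(P)
print(PS)           # [[0, 1, 2, 3], [0, 1, 1, 2], [0, 0, 0, 1], [0, 0, 0, 0]]
print(density(PS))  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Taking the density of the distribution matrix returns the original matrix, matching $(D^\Sigma)^\square = D$.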
    • Introduction: Terminology and notation
      Matrix $E$ is Monge, if $E^\square$ is nonnegative
      Intuition: boundary-to-boundary distances in a (weighted) planar graph
      Matrix $E$ is unit-Monge, if $E^\square$ is a permutation matrix
      Intuition: boundary-to-boundary distances in a grid-like graph
      Simple unit-Monge matrix: $P^\Sigma$, where $P$ is a permutation matrix
      Seaweed matrix: $P$ used as an implicit representation of $P^\Sigma$
      $\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^\Sigma = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$
    • Introduction: Implicit unit-Monge matrices
      Efficient $P^\Sigma$ queries: range tree on nonzeros of $P$ [Bentley: 1980]
        • binary search tree by $i$-coordinate
        • under every node, binary search tree by $j$-coordinate
      [Figure: recursive partitioning of the nonzeros of $P$ in the range tree]
    • Introduction: Implicit unit-Monge matrices
      Efficient $P^\Sigma$ queries (contd.)
      Every node of the range tree represents a canonical range (rectangular region), and stores its nonzero count
      Overall, $\le n \log n$ canonical ranges are non-empty
      A $P^\Sigma$ query is equivalent to $\gtrless$-dominance counting: how many nonzeros are $\gtrless$-dominated by the query point?
      Answer: sum up nonzero counts in $\le \log^2 n$ disjoint canonical ranges
      Total size $O(n \log n)$, query time $O(\log^2 n)$
      There are asymptotically more efficient (but less practical) data structures:
      total size $O(n)$, query time $O\bigl(\frac{\log n}{\log\log n}\bigr)$ [JáJá+: 2004] [Chan, Pătraşcu: 2010]
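A minimal sketch of dominance counting with the same size and query bounds follows. It is not the range tree of the slides: it swaps in a merge-sort tree (a segment tree over rows whose nodes keep sorted column lists), which is simpler to write down. The class name `DominanceCounter` and the input convention `perm[r] = column of the nonzero in row r` are assumptions made for the example.

```python
from bisect import bisect_left


class DominanceCounter:
    """Answers P^Sigma(i, j) = number of nonzeros in rows >= i and columns < j."""

    def __init__(self, perm):
        # perm[r] is the column of the single nonzero in row r (assumed convention).
        self.n = len(perm)
        self.size = 1
        while self.size < max(self.n, 1):
            self.size *= 2
        # tree[node] holds the sorted columns of all nonzeros in the node's row range.
        self.tree = [[] for _ in range(2 * self.size)]
        for r, c in enumerate(perm):
            self.tree[self.size + r] = [c]
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = sorted(self.tree[2 * node] + self.tree[2 * node + 1])

    def query(self, i, j):
        """Sum counts over O(log n) canonical row ranges, one binary search in each."""
        lo, hi = i + self.size, self.n + self.size   # half-open row range [i, n)
        total = 0
        while lo < hi:
            if lo & 1:
                total += bisect_left(self.tree[lo], j)
                lo += 1
            if hi & 1:
                hi -= 1
                total += bisect_left(self.tree[hi], j)
            lo //= 2
            hi //= 2
        return total


counter = DominanceCounter([1, 0, 2])   # the 3x3 example permutation matrix
print(counter.query(1, 3))              # 2
```

The structure stores $O(n \log n)$ column entries in total, and each query performs $O(\log n)$ binary searches, giving $O(\log^2 n)$ time. Re-sorting at every node makes this particular build slower than necessary; merging the children's lists would avoid that.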
    • Matrix distance multiplication: Seaweed braids
      Distance algebra (a.k.a. (min, +) or tropical algebra):
        • addition $\oplus$ given by min
        • multiplication $\odot$ given by +
      Matrix $\odot$-multiplication: $A \odot B = C$, where
      $C(i, k) = \bigoplus_j A(i, j) \odot B(j, k) = \min_j \bigl(A(i, j) + B(j, k)\bigr)$
      Matrix classes closed under $\odot$-multiplication (for given $n$):
        • general numerical (integer, real) matrices
        • Monge matrices
        • simple unit-Monge matrices (!)
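As a small illustration of the definition, the standard cubic-time (min, +) product can be written directly from the formula. The function name `distance_product` is hypothetical.

```python
def distance_product(A, B):
    """(min,+) product: C[i][k] = min over j of A[i][j] + B[j][k]."""
    return [[min(A[i][j] + B[j][k] for j in range(len(B)))
             for k in range(len(B[0]))]
            for i in range(len(A))]


A = [[0, 2],
     [1, 0]]
B = [[0, 5],
     [3, 0]]
print(distance_product(A, B))   # [[0, 2], [1, 0]]
```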
    • Matrix distance multiplication: Seaweed braids
      Recall that simple unit-Monge matrices are represented implicitly by permutation (seaweed) matrices
      Define $P_A \boxdot P_B = P_C$ as $P_A^\Sigma \odot P_B^\Sigma = P_C^\Sigma$
      The seaweed monoid $T_n$:
        • simple unit-Monge matrices under $\odot$
        • equivalently, permutation (seaweed) matrices under $\boxdot$
      Also known as the 0-Hecke monoid of the symmetric group, $H_0(S_n)$
    • Matrix distance multiplication: Seaweed braids
      $P_A \boxdot P_B = P_C$ can be seen as combing of seaweed braids
      [Figure sequence: the braids of $P_A$ and $P_B$ are stacked and combed into the braid of $P_C$]
    • Matrix distance multiplication: Seaweed braids
      The seaweed monoid $T_n$:
        • $n!$ elements (permutations of size $n$)
        • $n - 1$ generators $g_1, g_2, \ldots, g_{n-1}$ (elementary crossings)
      Idempotence: $g_i^2 = g_i$ for all $i$
      Far commutativity: $g_i g_j = g_j g_i$ for $j - i > 1$
      Braid relations: $g_i g_j g_i = g_j g_i g_j$ for $j - i = 1$
      [Figures: braid diagrams illustrating each relation]
    • Matrix distance multiplication: Seaweed braids
      Identity: $1 \boxdot x = x$
      $1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$
      Zero: $0 \boxdot x = 0$
      $0 = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}$
      [Figures: the corresponding braid diagrams]
    • Matrix distance multiplication: Seaweed braids
      Related structures:
        • positive braids: far commutativity; braid relations
        • braids: $g_i g_i^{-1} = 1$; far commutativity; braid relations
        • Coxeter's presentation of $S_n$: $g_i^2 = 1$; far commutativity; braid relations
        • locally free idempotent monoid: idempotence; far commutativity [Vershik+: 2000]
      Generalisations:
        • general 0-Hecke monoids [Fomin, Greene: 1998; Buch+: 2008]
        • Coxeter monoids [Tsaranov: 1990; Richardson, Springer: 1990]
        • $\mathcal{J}$-trivial monoids [Denton+: 2011]
    • Matrix distance multiplication: Seaweed braids
      Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP)
      $T_3$: 1, $a = g_1$, $b = g_2$; ab, ba, aba = 0
        aa → a    bb → b    bab → 0    aba → 0
      $T_4$: 1, $a = g_1$, $b = g_2$, $c = g_3$; ab, ac, ba, bc, cb, aba, abc, acb, bac, bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0
        aa → a    ca → ac    bab → aba    cbac → bcba
        bb → b    cc → c     cbc → bcb    abacba → 0
      Easy to use, but not an efficient algorithm
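One way to sanity-check the rewriting systems listed above is to enumerate the irreducible words, that is, the words containing no rule left-hand side as a factor. Together with the zero element they should account for all $n!$ elements. The sketch below assumes the rules exactly as transcribed from the slide; the helper name `irreducible_words` is hypothetical.

```python
def irreducible_words(generators, left_hand_sides):
    """List all words over `generators` containing no rule left-hand side as a factor."""
    words, frontier = [""], [""]
    while frontier:
        next_frontier = []
        for w in frontier:
            for g in generators:
                candidate = w + g
                # w is already irreducible, so a forbidden factor can only
                # appear as a suffix of the extended word.
                if not any(candidate.endswith(lhs) for lhs in left_hand_sides):
                    next_frontier.append(candidate)
        words.extend(next_frontier)
        frontier = next_frontier
    return words


T3 = irreducible_words("ab", ["aa", "bb", "bab", "aba"])
T4 = irreducible_words("abc", ["aa", "bb", "cc", "ca", "bab", "cbc", "cbac", "abacba"])
print(len(T3) + 1, len(T4) + 1)   # 6 24  (the +1 accounts for the zero element)
```

The enumeration terminates because every sufficiently long word contains a left-hand side. The non-zero elements it finds for $T_4$ are the 23 words listed on the slide (with the empty word standing for 1).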
    • Matrix distance multiplication: Seaweed matrix multiplication
      The implicit unit-Monge matrix $\odot$-multiplication problem:
      given permutation matrices $P_A$, $P_B$, compute $P_C$, such that $P_A^\Sigma \odot P_B^\Sigma = P_C^\Sigma$ (equivalently, $P_A \boxdot P_B = P_C$)
      Matrix $\odot$-multiplication: running time
        type                   time
        general                $O(n^3)$                                      standard
                               $O\bigl(\frac{n^3 (\log\log n)^3}{\log^2 n}\bigr)$   [Chan: 2007]
        Monge                  $O(n^2)$                                      via [Aggarwal+: 1987]
        implicit unit-Monge    $O(n^{1.5})$                                  [T: 2006]
                               $O(n \log n)$                                 [T: 2010]
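For reference, the problem statement can be turned into a naive cubic-time sketch: expand both permutation matrices into distribution matrices, multiply them in the (min, +) algebra, and take the density of the result. This is only a baseline for checking small cases, not the $O(n \log n)$ algorithm of the slides; the function names are hypothetical.

```python
def distribution(P):
    """S[i][j] = number of nonzeros of P in rows >= i and columns < j."""
    n = len(P)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + P[i][j - 1]
    return S


def seaweed_product_naive(PA, PB):
    """Implicit product of permutation matrices via explicit (min,+) multiplication."""
    n = len(PA)
    SA, SB = distribution(PA), distribution(PB)
    # (min,+) product of the two distribution matrices.
    SC = [[min(SA[i][j] + SB[j][k] for j in range(n + 1))
           for k in range(n + 1)]
          for i in range(n + 1)]
    # Density of the product gives the permutation matrix P_C.
    return [[SC[i][k + 1] - SC[i][k] - SC[i + 1][k + 1] + SC[i + 1][k]
             for k in range(n)]
            for i in range(n)]


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
zero = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
a = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]   # elementary crossing g1
b = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]   # elementary crossing g2

print(seaweed_product_naive(a, a) == a)                               # True (idempotence)
print(seaweed_product_naive(seaweed_product_naive(a, b), a) == zero)  # True (aba = 0 in T3)
```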
    • Matrix distance multiplication: Seaweed matrix multiplication
      [Figure sequence: given $P_A$ and $P_B$, find $P_C$; the inputs are split into $P_{A,lo}$, $P_{A,hi}$ and $P_{B,lo}$, $P_{B,hi}$; the recursive results $P_{C,lo} + P_{C,hi}$ are then corrected to give $P_C$]
    • Matrix distance multiplication: Seaweed matrix multiplication
      Implicit unit-Monge matrix $\odot$-multiplication: the algorithm
      $P_C^\Sigma(i, k) = \min_j \bigl(P_A^\Sigma(i, j) + P_B^\Sigma(j, k)\bigr)$
      Divide-and-conquer on the range of $j$
      Divide $P_A$ horizontally, $P_B$ vertically: two subproblems of effective size $n/2$
      $P_{A,lo}^\Sigma \odot P_{B,lo}^\Sigma = P_{C,lo}^\Sigma$
      $P_{A,hi}^\Sigma \odot P_{B,hi}^\Sigma = P_{C,hi}^\Sigma$
      Conquer: $\gtrless$-low nonzeros of $P_{C,lo}$ and $\gtrless$-high nonzeros of $P_{C,hi}$ appear in $P_C$
      The remaining nonzeros of $P_{C,lo}$ and $P_{C,hi}$ are “wrong”, and need to be corrected to obtain the remaining nonzeros of $P_C$
      Correction can be done in time $O(n)$ using the unit-Monge property
      Overall time $O(n \log n)$
    • Matrix distance multiplication: Bruhat order
      Comparing permutations by the “degree of sortedness”
      Bruhat order: permutation $A$ is lower (“more sorted”) than permutation $B$ in the Bruhat order ($A \preceq B$), if $B$ can be transformed to $A$ by successive pairwise sorting between arbitrary pairs of elements.
      Permutation matrices: $P_A \preceq P_B$, if $P_B$ can be transformed to $P_A$ by successive submatrix substitution:
      $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \to \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
    • Matrix distance multiplication: Bruhat order
      Bruhat comparability: running time
        $O(n^2)$         folklore
        $O(n \log n)$    [T: NEW]
      $P_A \preceq P_B$ iff $P_A^\Sigma \le P_B^\Sigma$ elementwise, time $O(n^2)$ (folklore)
      $P_A \preceq P_B$ iff $P_A^R \boxdot P_B = \mathrm{Id}^R$, time $O(n \log n)$ [T: NEW]
      where $P^R$ denotes clockwise rotation of matrix $P$
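The folklore $O(n^2)$ test can be sketched directly: build both distribution matrices and compare them elementwise. The sketch assumes the row-to-column convention `perm[r] = c` for the nonzero in row `r`, and the same dominance direction as in the earlier definitions; the function names are hypothetical.

```python
def distribution_from_perm(perm):
    """S[i][j] = number of rows r >= i whose nonzero column perm[r] is < j."""
    n = len(perm)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            hit = 1 if perm[i] == j - 1 else 0
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + hit
    return S


def bruhat_leq(perm_a, perm_b):
    """True if perm_a is lower than or equal to perm_b in the Bruhat order."""
    SA = distribution_from_perm(perm_a)
    SB = distribution_from_perm(perm_b)
    return all(x <= y for row_a, row_b in zip(SA, SB) for x, y in zip(row_a, row_b))


print(bruhat_leq([0, 1, 2], [1, 0, 2]))   # True: the identity is the lowest element
print(bruhat_leq([1, 0, 2], [0, 2, 1]))   # False: distinct, each with a single inversion
```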
    • Semi-local string comparison: Semi-local LCS and edit distance
      Consider strings (= sequences) over an alphabet of size $\sigma$
      Distinguish contiguous substrings and not necessarily contiguous subsequences
      Special cases of substring: prefix, suffix
      Notation: strings $a$, $b$ of length $m$, $n$ respectively
      Assume where necessary: $m \le n$; $m$, $n$ reasonably close
      The longest common subsequence (LCS) score:
        • length of longest string that is a subsequence of both $a$ and $b$
        • equivalently, alignment score, where score(match) = 1 and score(mismatch) = 0
      In biological terms, “loss-free alignment” (unlike “lossy” BLAST)
    • Semi-local string comparison: Semi-local LCS and edit distance
      The LCS problem: give the LCS score for $a$ vs $b$
      LCS: running time
        $O(mn)$                                          [Wagner, Fischer: 1974]
        $O\bigl(\frac{mn}{\log^2 n}\bigr)$, $\sigma = O(1)$       [Masek, Paterson: 1980] [Crochemore+: 2003]
        $O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$         [Paterson, Dančík: 1994] [Bille, Farach-Colton: 2008]
      Running time varies depending on the RAM model version
      We assume word-RAM with word size $\log n$ (where it matters)
    • Semi-local string comparison: Semi-local LCS and edit distance
      LCS on the alignment graph (directed, acyclic)
      [Figure: alignment graph for a = “BAABCBCA” (rows) vs b = “BAABCABCABACA” (columns); blue edges = 0, red edges = 1]
      score(“BAABCBCA”, “BAABCABCABACA”) = len(“BAABCBCA”) = 8
      LCS = highest-score path from top-left to bottom-right
    • Semi-local string comparison: Semi-local LCS and edit distance
      LCS: dynamic programming [WF: 1974]
      Sweep cells in any $\ll$-compatible order
      Cell update: time $O(1)$
      Overall time $O(mn)$
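The dynamic programming sweep can be sketched as follows, using a row-by-row order (one of the admissible orders) and keeping only the previous row. The function name `lcs_score` is hypothetical; the example strings are the ones used on the slides.

```python
def lcs_score(a, b):
    """LCS score by dynamic programming: O(len(a) * len(b)) time, O(len(b)) space."""
    prev = [0] * (len(b) + 1)            # scores for the previous prefix of a
    for ch_a in a:
        cur = [0]                        # empty prefix of b always scores 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)             # follow a match edge
            else:
                cur.append(max(prev[j], cur[j - 1]))    # skip a character
        prev = cur
    return prev[-1]


print(lcs_score("BAABCBCA", "BAABCABCABACA"))   # 8
print(lcs_score("BAABCBCA", "CABCABA"))         # 5
```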
    • Semi-local string comparison: Semi-local LCS and edit distance
      ‘Begin at the beginning,’ the King said gravely, ‘and go on till you come to the end: then stop.’
      L. Carroll, Alice in Wonderland
      (The standard approach in dynamic programming)
    • Semi-local string comparison: Semi-local LCS and edit distance
      Sometimes dynamic programming can be run from both ends for extra flexibility
      Is there a better, fully flexible alternative (e.g. for comparing compressed strings, comparing strings dynamically or in parallel, etc.)?
    • Semi-local string comparison: Semi-local LCS and edit distance
      LCS: micro-block dynamic programming [MP: 1980; BF: 2008]
      Sweep cells in micro-blocks, in any $\ll$-compatible order
      Micro-block size:
        • $t = O(\log n)$ when $\sigma = O(1)$
        • $t = O\bigl(\frac{\log n}{\log\log n}\bigr)$ otherwise
      Micro-block interface:
        • $O(t)$ characters, each $O(\log \sigma)$ bits, can be reduced to $O(\log t)$ bits
        • $O(t)$ small integers, each $O(1)$ bits
      Micro-block update: time $O(1)$, by precomputing all possible interfaces
      Overall time $O\bigl(\frac{mn}{\log^2 n}\bigr)$ when $\sigma = O(1)$, $O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$ otherwise
    • Semi-local string comparison: Semi-local LCS and edit distance
      The semi-local LCS problem: give the (implicit) matrix of $O\bigl((m + n)^2\bigr)$ LCS scores:
        • string-substring LCS: string $a$ vs every substring of $b$
        • prefix-suffix LCS: every prefix of $a$ vs every suffix of $b$
        • suffix-prefix LCS: every suffix of $a$ vs every prefix of $b$
        • substring-string LCS: every substring of $a$ vs string $b$
      Cf.: dynamic programming gives prefix-prefix LCS
    • Semi-local string comparison: Semi-local LCS and edit distance
      Semi-local LCS on the alignment graph
      [Figure: alignment graph for a = “BAABCBCA” (rows) vs b = “BAABCABCABACA” (columns); blue edges = 0, red edges = 1]
      score(“BAABCBCA”, “CABCABA”) = len(“ABCBA”) = 5
      String-substring LCS: all highest-score top-to-bottom paths
      Semi-local LCS: all highest-score boundary-to-boundary paths
    • Semi-local string comparison: Score matrices and seaweed matrices
      The score matrix $H$
      a = “BAABCBCA”, b = “BAABCABCABACA”
      $H(i, j) = \mathrm{score}(a, b\langle i : j \rangle)$; $H(4, 11) = 5$; $H(i, j) = j - i$ if $i > j$
      Rows $i = 0, \ldots, 13$ (top to bottom), columns $j = 0, \ldots, 13$ (left to right):
          0   1   2   3   4   5   6   6   7   8   8   8   8   8
         -1   0   1   2   3   4   5   5   6   7   7   7   7   7
         -2  -1   0   1   2   3   4   4   5   6   6   6   6   7
         -3  -2  -1   0   1   2   3   3   4   5   5   6   6   7
         -4  -3  -2  -1   0   1   2   2   3   4   4   5   5   6
         -5  -4  -3  -2  -1   0   1   2   3   4   4   5   5   6
         -6  -5  -4  -3  -2  -1   0   1   2   3   3   4   4   5
         -7  -6  -5  -4  -3  -2  -1   0   1   2   2   3   3   4
         -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   3   4
         -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4
        -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
        -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2
        -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1
        -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0
    • Semi-local string comparison: Score matrices and seaweed matrices
      Semi-local LCS: output representation and running time
        size              query time
        $O(n^2)$          $O(1)$           trivial
        $O(m^{1/2} n)$    $O(\log n)$      string-substring [Alves+: 2003]
        $O(n)$            $O(n)$           string-substring [Alves+: 2005]
        $O(n \log n)$     $O(\log^2 n)$    [T: 2006]
        ... or any 2D orthogonal range counting data structure
        running time
        $O(mn^2)$                                    naive
        $O(mn)$                                      string-substring [Schmidt: 1998; Alves+: 2005]
        $O(mn)$                                      [T: 2006]
        $O\bigl(\frac{mn}{\log^{0.5} n}\bigr)$                [T: 2006]
        $O\bigl(\frac{mn (\log\log n)^2}{\log^2 n}\bigr)$     [T: 2007]
    • Semi-local string comparison: Score matrices and seaweed matrices
      The score matrix $H$ and the seaweed matrix $P$
      $H(i, j)$: the number of matched characters for $a$ vs substring $b\langle i : j \rangle$
      $j - i - H(i, j)$: the number of unmatched characters
      Properties of matrix $j - i - H(i, j)$:
        • simple unit-Monge
        • therefore, $= P^\Sigma$, where $P = -H^\square$ is a permutation matrix
      $P$ is the seaweed matrix, giving an implicit representation of $H$
      Range tree for $P$: memory $O(n \log n)$, query time $O(\log^2 n)$
    • Semi-local string comparison: Score matrices and seaweed matrices
      The score matrix $H$ and the seaweed matrix $P$
      a = “BAABCBCA”, b = “BAABCABCABACA”
      $H(i, j) = \mathrm{score}(a, b\langle i : j \rangle)$; $H(4, 11) = 5$; $H(i, j) = j - i$ if $i > j$
      [Figure: the score matrix $H$ from the previous slide, overlaid with the nonzeros of $P$; blue: difference in $H$ is 0; red: difference in $H$ is 1; green: $P(i, j) = 1$]
      $H(i, j) = j - i - P^\Sigma(i, j)$
    • Semi-local string comparison: Score matrices and seaweed matrices
      The score matrix $H$ and the seaweed matrix $P$
      a = “BAABCBCA”, b = “BAABCABCABACA”
      $H(4, 11) = 11 - 4 - P^\Sigma(4, 11) = 11 - 4 - 2 = 5$
      [Figure: the nonzeros of $P$, with the two nonzeros counted by $P^\Sigma(4, 11)$]
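The relationship between $H$ and $P$ can be reproduced on the running example with a brute-force sketch: compute every string-substring score by plain dynamic programming, then take the density of $j - i - H(i, j)$. This takes far more time than the algorithms in the table of running times and is meant only as a check; all function names are hypothetical.

```python
def lcs_score(a, b):
    """Plain LCS dynamic programming, keeping one row at a time."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def score_matrix(a, b):
    """H[i][j] = LCS score of a vs b[i:j] for i <= j, and j - i for i > j."""
    n = len(b)
    return [[lcs_score(a, b[i:j]) if i <= j else j - i
             for j in range(n + 1)]
            for i in range(n + 1)]


def seaweed_matrix(H):
    """P = density of E, where E[i][j] = j - i - H[i][j]."""
    n = len(H) - 1
    E = [[j - i - H[i][j] for j in range(n + 1)] for i in range(n + 1)]
    return [[E[i][j + 1] - E[i][j] - E[i + 1][j + 1] + E[i + 1][j]
             for j in range(n)]
            for i in range(n)]


def dominance_sum(P, i, j):
    """P^Sigma(i, j): nonzeros in rows >= i and columns < j, counted naively."""
    return sum(P[r][c] for r in range(i, len(P)) for c in range(j))


a, b = "BAABCBCA", "BAABCABCABACA"
H = score_matrix(a, b)
P = seaweed_matrix(H)
print(H[4][11], 11 - 4 - dominance_sum(P, 4, 11))   # 5 5
```

Only the part of $P$ indexed by positions of $b$ is computed here, so it has fewer nonzeros than rows: five in this example, matching $13 - H(0, 13) = 5$.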
    • Semi-local string comparison: Score matrices and seaweed matrices
      The seaweed braid in the alignment graph
      [Figure: seaweed braid drawn over the alignment graph for a = “BAABCBCA” (rows) vs b = “BAABCABCABACA” (columns)]
      $H(4, 11) = 11 - 4 - P^\Sigma(4, 11) = 11 - 4 - 2 = 5$
      $P(i, j) = 1$ corresponds to seaweed top $i$ $\leadsto$ bottom $j$
      Also define top $\leadsto$ right, left $\leadsto$ right, left $\leadsto$ bottom seaweeds
      Gives bijection between top-left and bottom-right graph boundaries
    • Semi-local string comparison: Score matrices and seaweed matrices
      Seaweed braid: a highly symmetric object (element of the 0-Hecke monoid of the symmetric group)
      Can be built recursively by assembling subbraids from separate parts
      Highly flexible: local alignment, compression, parallel computation...
    • Semi-local string comparison: Weighted alignment
      The LCS problem is a special case of the weighted alignment score problem with weighted matches ($w_M$), mismatches ($w_X$) and gaps ($w_G$)
        • LCS score: $w_M = 1$, $w_X = w_G = 0$
        • Levenshtein score: $w_M = 2$, $w_X = 1$, $w_G = 0$
      Alignment score is rational, if $w_M$, $w_X$, $w_G$ are rational numbers
      Equivalent to LCS score on blown-up strings
      Edit distance: minimum cost to transform $a$ into $b$ by weighted character edits (insertion, deletion, substitution)
      Corresponds to weighted alignment score with $w_M = 0$, insertion/deletion weight $-w_G$, substitution weight $-w_X$
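The weighted alignment score can be computed by the same dynamic programming sweep as the LCS score, with the three weights as parameters. This is a sketch under the assumption that each gap weight is charged per unaligned character; the function name and argument names are hypothetical.

```python
def alignment_score(a, b, w_match, w_mismatch, w_gap):
    """Global weighted alignment score, maximised over all alignments of a and b."""
    prev = [j * w_gap for j in range(len(b) + 1)]       # a is empty: only gaps
    for i, ch_a in enumerate(a, start=1):
        cur = [i * w_gap]                               # b is empty: only gaps
        for j, ch_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (w_match if ch_a == ch_b else w_mismatch)
            cur.append(max(diagonal, prev[j] + w_gap, cur[j - 1] + w_gap))
        prev = cur
    return prev[-1]


# Levenshtein score weights (2, 1, 0), then LCS weights (1, 0, 0).
print(alignment_score("BAABCBCA", "CABCABA", 2, 1, 0))   # 11
print(alignment_score("BAABCBCA", "CABCABA", 1, 0, 0))   # 5
```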
    • Semi-local string comparison: Weighted alignment
      Weighted alignment graph
      [Figure: weighted alignment graph for a = “BAABCBCA” (rows) vs b = “BAABCABCABACA” (columns); blue edges = 0, solid red edges = 2, dotted red edges = 1]
      Levenshtein score(“BAABCBCA”, “CABCABA”) = 11
    • Semi-local string comparison: Weighted alignment
      Alignment graph for blown-up strings
      [Figure: alignment graph for the blown-up strings “$B$A$A$B$C$B$C$A” (rows) vs “$B$A$A$B$C$A$B$C$A$B$A$C$A” (columns); blue edges = 0, red edges = 0.5 or 1]
      Levenshtein score(“BAABCBCA”, “CABCABA”) = 2 · 5.5
    • Semi-local string comparison: Weighted alignment
      Rational-weighted semi-local alignment reduced to semi-local LCS
      [Figure: alignment graph for the blown-up strings “$B$A$A$B$C$B$C$A” (rows) vs “$B$A$A$B$C$A$B$C$A$B$A$C$A” (columns)]
      Let $w_M = 1$, $w_X = \frac{\mu}{\nu}$, $w_G = 0$
      Increase $\times \nu^2$ in complexity (can be reduced to $\nu$)
    • The seaweed method: Seaweed combing. [Figure, animated over several slides: a seaweed braid combed cell by cell on the alignment grid of the example strings "BAABCBCA" and "BAABCABCABACA"]
    • The seaweed method: Seaweed combing
      Semi-local LCS: seaweed combing [T: 2006]
      Initialise the seaweed braid with crossings in all mismatch cells
      Sweep cells in any order compatible with the dominance order ≪ on cells
      Match cell: the two seaweeds are uncrossed; skip
      Mismatch cell: the two seaweeds cross
        if the same two seaweeds have crossed before, uncross them
        otherwise skip, keeping the seaweeds crossed
      Cell update: time $O(1)$
      Overall time $O(mn)$
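The combing rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's code: the names `comb_seaweeds` and `string_substring_lcs` are hypothetical, seaweeds are numbered from the bottom of the left boundary up and then along the top boundary from left to right, and the grid is swept row by row (one valid order). Two seaweeds meeting in a cell have crossed before exactly when the one arriving from the left has the larger number.

```python
def comb_seaweeds(a, b):
    """Comb the m x n alignment grid of a (rows) and b (columns).

    Returns (h, v): h[i] is the seaweed leaving row i on the right,
    v[j] is the seaweed leaving column j at the bottom.
    """
    m, n = len(a), len(b)
    h = [m - 1 - i for i in range(m)]  # seaweeds entering on the left, numbered bottom to top
    v = [m + j for j in range(n)]      # seaweeds entering at the top, numbered left to right
    for i in range(m):
        for j in range(n):
            # Match cell: the seaweeds bend past each other without crossing.
            # Mismatch cell: they cross, unless they have crossed before (h[i] > v[j]),
            # in which case they are uncrossed, which is again a bend.
            if a[i] == b[j] or h[i] > v[j]:
                h[i], v[j] = v[j], h[i]
    return h, v


def string_substring_lcs(v, m, i, j):
    """LCS score of a against b[i:j], read off the combed braid.

    Counts seaweeds that enter at the top in columns >= i and leave at the
    bottom in columns < j; each such seaweed costs one unit of score.
    """
    through = sum(1 for d in range(i, j) if v[d] - m >= i)
    return (j - i) - through


a, b = "BAABCBCA", "BAABCABCABACA"
h, v = comb_seaweeds(a, b)
print(string_substring_lcs(v, len(a), 0, len(b)))  # global LCS score of a vs b
print(string_substring_lcs(v, len(a), 3, 9))       # LCS score of a vs b[3:9]
```

The query here counts seaweeds by a linear scan; the range-tree representation mentioned elsewhere in the talk would answer the same dominance count faster.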
    • The seaweed method: Micro-block seaweed combing. [Figure, animated over several slides: the same alignment grid of "BAABCBCA" and "BAABCABCABACA", combed one micro-block at a time]
    • The seaweed method: Micro-block seaweed combing
      Semi-local LCS: micro-block seaweed combing [T: 2007]
      Initialise the seaweed braid with crossings in all mismatch cells
      Sweep cells in micro-blocks, in any order compatible with the dominance order ≪
      Micro-block size: $t = O\left(\frac{\log n}{\log\log n}\right)$
      Micro-block interface:
        $O(t)$ characters, each $O(\log \sigma)$ bits, can be reduced to $O(\log t)$ bits
        $O(t)$ integers, each $O(\log n)$ bits, can be reduced to $O(\log t)$ bits
      Micro-block update: time $O(1)$, by precomputing all possible interfaces
      Overall time $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$
    • The seaweed method: Cyclic LCS
      The cyclic LCS problem: give the maximum LCS score for a vs all cyclic rotations of b
      Cyclic LCS: running time
        $O\left(\frac{mn^2}{\log n}\right)$  naive
        $O(mn \log m)$  [Maes: 1990]
        $O(mn)$  [Bunke, Bühler: 1993; Landau+: 1998; Schmidt: 1998]
        $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$  [T: 2007]
    • The seaweed method: Cyclic LCS
      Cyclic LCS: the algorithm
      Micro-block seaweed combing on a vs bb, time $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$
      Make n string-substring LCS queries, time negligible
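A hedged sketch of this reduction: comb a against the doubled string bb once, then read off the score of every length-n window, each window being one rotation of b. It uses plain cell-by-cell combing rather than micro-blocks, and a linear scan per query rather than a fast dominance-counting structure, so it illustrates the idea only and does not reach the stated bound. The names `comb_bottom` and `cyclic_lcs` are hypothetical.

```python
def comb_bottom(a, b):
    """Cell-by-cell seaweed combing; returns the seaweeds leaving the bottom boundary."""
    m, n = len(a), len(b)
    h = [m - 1 - i for i in range(m)]  # left-boundary seaweeds, numbered bottom to top
    v = [m + j for j in range(n)]      # top-boundary seaweeds, numbered left to right
    for i in range(m):
        for j in range(n):
            if a[i] == b[j] or h[i] > v[j]:  # match, or the pair has crossed before
                h[i], v[j] = v[j], h[i]
    return v


def cyclic_lcs(a, b):
    """Maximum LCS score of a against all cyclic rotations of b."""
    m, n = len(a), len(b)
    v = comb_bottom(a, b + b)
    best = 0
    for i in range(n):
        # Window (b + b)[i : i + n] is the rotation of b starting at position i.
        through = sum(1 for d in range(i, i + n) if v[d] - m >= i)
        best = max(best, n - through)
    return best


print(cyclic_lcs("BAABCBCA", "CABACABAABCAB"))
```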
    • The seaweed method: Longest repeating subsequence
      The longest repeating subsequence problem: find the longest subsequence of a that is a square (a repetition of two identical strings)
      Longest repeating subsequence: running time
        $O(m^3)$  naive
        $O(m^2)$  [Kosowski: 2004]
        $O\left(\frac{m^2(\log\log m)^2}{\log^2 m}\right)$  [T: 2007]
    • The seaweed method: Longest repeating subsequence
      Longest repeating subsequence: the algorithm
      Micro-block seaweed combing on a vs a, time $O\left(\frac{m^2(\log\log m)^2}{\log^2 m}\right)$
      Make m − 1 suffix-prefix LCS queries, time negligible
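The following sketch illustrates the idea with plain combing of a against itself. It assumes one particular indexing of the braid: a seaweed is recorded as a pair (start, end), with starts running from −m at the bottom of the left boundary to n − 1 at the right end of the top boundary, and ends running from 0 along the bottom boundary and then up the right boundary. Under that assumption, the LCS of the prefix a[:k] against the suffix a[k:] is (m − k) minus the number of seaweeds with start ≥ k and end < 2m − k. The function names are hypothetical and the queries are linear scans.

```python
def seaweed_pairs(a, b):
    """Comb a vs b and return every seaweed as a (start, end) pair."""
    m, n = len(a), len(b)
    h = [m - 1 - i for i in range(m)]
    v = [m + j for j in range(n)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j] or h[i] > v[j]:
                h[i], v[j] = v[j], h[i]
    pairs = [(v[j] - m, j) for j in range(n)]               # seaweeds leaving at the bottom
    pairs += [(h[i] - m, n + m - 1 - i) for i in range(m)]  # seaweeds leaving on the right
    return pairs


def longest_square_subsequence(a):
    """Length of the longest subsequence of a of the form ww."""
    m = len(a)
    pairs = seaweed_pairs(a, a)
    best = 0
    for k in range(1, m):
        # One query per split point k: LCS of a[:k] against a[k:].
        dominated = sum(1 for s, e in pairs if s >= k and e < 2 * m - k)
        best = max(best, (m - k) - dominated)
    return 2 * best


print(longest_square_subsequence("BAABCBCA"))
```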
    • The seaweed method: Approximate matching
      The approximate pattern matching problem: give the substring closest to a by alignment score, starting at each position in b
      Assume rational alignment score
      Approximate pattern matching: running time
        $O(mn)$  [Sellers: 1980]
        $O\left(\frac{mn}{\log n}\right)$  $\sigma = O(1)$  via [Masek, Paterson: 1980]
        $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$  via [Bille, Farach-Colton: 2008]
    • The seaweed method: Approximate matching
      Approximate pattern matching: the algorithm
      Micro-block seaweed combing on a vs b (with blow-up), time $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$
      The implicit semi-local edit score matrix:
        an anti-Monge matrix
        approximate pattern matching ∼ row minima
      Row minima in $O(n)$ element queries [Aggarwal+: 1987]
      Each query in time $O(\log^2 n)$ using the range tree representation, combined query time negligible
      Overall running time $O\left(\frac{mn(\log\log n)^2}{\log^2 n}\right)$, same as [Bille, Farach-Colton: 2008]
    • Periodic string comparison: Wraparound seaweed combing
      The periodic string-substring LCS problem: give (implicit) LCS scores for a vs each substring of $b = \ldots uuu \ldots = u^{\pm\infty}$
      Let u be of length p
      May assume that every character of a occurs in u
      Only substrings of b of length at most mp need to be considered (otherwise the LCS score is m)
    • Periodic string comparison: Wraparound seaweed combing. [Figure, animated over several slides: seaweeds combed row by row on the alignment grid of "BAABCBCA" against one period "BAABCABCABACA", wrapping around at the grid boundary]
    • Periodic string comparison: Wraparound seaweed combing
      Periodic string-substring LCS: wraparound seaweed combing
      Initialise the seaweed braid with crossings in all mismatch cells
      Sweep cells row by row: each row starts at a match cell and wraps at the boundary
      Match cell: the two seaweeds are uncrossed; skip
      Mismatch cell: the two seaweeds cross
        if the same two seaweeds have crossed before (with wrapping), uncross them
        otherwise skip, keeping the seaweeds crossed
      Cell update: time $O(1)$
      Overall time $O(mn)$
      String-substring LCS score: count seaweeds with multiplicities
    • Periodic string comparison: Wraparound seaweed combing
      The tandem LCS problem: give LCS score for a vs $b = u^k$
      We have n = kp; may assume k ≤ m
      Tandem LCS: running time
        $O(mkp)$  naive
        $O(m(k + p))$  [Landau, Ziv-Ukelson: 2001]
        $O(mp)$  [T: 2009]
      Direct application of wraparound seaweed combing
    • Periodic string comparison: Wraparound seaweed combing
      The tandem alignment problem: give the substring closest to a by alignment score among certain substrings of $b = u^{\pm\infty}$:
        global: substrings $u^k$ of length kp across all k
        cyclic: substrings of length kp across all k
        local: substrings of any length
      Tandem alignment: running time
        $O(m^2 p)$  all  naive
        $O(mp)$  global  [Myers, Miller: 1989]
        $O(mp \log p)$  cyclic  [Benson: 2005]
        $O(mp)$  cyclic  [T: 2009]
        $O(mp)$  local  [Myers, Miller: 1989]
    • Periodic string comparison: Wraparound seaweed combing
      Cyclic tandem alignment: the algorithm
      Periodic seaweed combing for a vs b (with blow-up), time $O(mp)$
      For each k ∈ [1 : m]:
        solve tandem LCS (under given alignment score) for a vs $u^k$
        obtain scores for a vs p successive substrings of b of length kp by LCS batch query: time $O(1)$ per substring
      Running time $O(mp)$
    • The transposition network method: Transposition networks
      Comparison network: a circuit of comparators
      A comparator sorts two inputs and outputs them in prescribed order
      Comparison networks are traditionally used for non-branching merging/sorting
      Classical comparison networks (number of comparators):
        merging  $O(n \log n)$  [Batcher: 1968]
        sorting  $O(n \log^2 n)$  [Batcher: 1968]
        sorting  $O(n \log n)$  [Ajtai+: 1983]
      Comparison networks are visualised by wire diagrams
      Transposition network: all comparisons are between adjacent wires
    • The transposition network method: Transposition networks
      Seaweed combing as a transposition network
      [Figure: wire diagram for the strings "ACBC" and "ABCA", with input values −7, −5, −3, −1, +1, +3, +5, +7]
      Character mismatches correspond to comparators
      Inputs are anti-sorted (sorted in reverse); each value traces a seaweed
    • The transposition network method: Transposition networks
      Global LCS: transposition network with binary input
      [Figure: the same wire diagram for "ACBC" and "ABCA", with input values 0, 0, 0, 0, 1, 1, 1, 1]
      Inputs are still anti-sorted, but may not be distinct
      Comparison between equal values is indeterminate
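A minimal sketch of the binary network, under the assumption (read off the figure residue) that wires entering along the top carry 0 and wires entering along the left carry 1. Each mismatch cell is a comparator, each match cell is an unconditional swap, and the global LCS score is the number of 1s that reach the bottom boundary. The function name `lcs_binary_network` is hypothetical.

```python
def lcs_binary_network(a, b):
    """Global LCS score of a and b via a transposition network on 0/1 values."""
    h = [1] * len(a)  # values travelling along the rows (enter on the left)
    v = [0] * len(b)  # values travelling down the columns (enter at the top)
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                # Match cell: the two wires always exchange their values.
                h[i], v[j] = v[j], h[i]
            else:
                # Mismatch cell: comparator; on equal inputs either outcome is the same.
                h[i], v[j] = max(h[i], v[j]), min(h[i], v[j])
    return sum(v)


print(lcs_binary_network("ACBC", "ABCA"))
```

Only the counts of 0s and 1s matter here, which is why the indeterminate comparison between equal values does no harm.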
    • The transposition network method: Parameterised string comparison
      Parameterised string comparison: string comparison sensitive e.g. to
        low similarity: small $\lambda = \mathrm{LCS}(a, b)$
        high similarity: small $\kappa = \mathrm{dist}_{LCS}(a, b) = m + n - 2\lambda$
      Can also use weighted alignment score or edit distance
      Assume m = n, therefore $\kappa = 2(n - \lambda)$
    • The transposition network method: Parameterised string comparison
      Low-similarity comparison: small λ
        sparse set of matches, may need to look at them all
        preprocess matches for fast searching, time $O(n \log \sigma)$
      High-similarity comparison: small κ
        set of matches may be dense, but only need to look at a small subset
        no need to preprocess, linear search is OK
      Flexible comparison: sensitive to both high and low similarity, e.g. by both comparison types running alongside each other
    • The transposition network method: Parameterised string comparison
      Parameterised string comparison: running time
      Low-similarity, after preprocessing in $O(n \log \sigma)$
        $O(n\lambda)$  [Hirschberg: 1977] [Apostolico, Guerra: 1985] [Apostolico+: 1992]
      High-similarity, no preprocessing
        $O(n \cdot \kappa)$  [Ukkonen: 1985] [Myers: 1986]
      Flexible
        $O(\lambda \cdot \kappa \cdot \log n)$  no preprocessing  [Myers: 1986; Wu+: 1990]
        $O(\lambda \cdot \kappa)$  after preprocessing  [Rick: 1995]
    • The transposition network method: Parameterised string comparison
      Parameterised string comparison: the waterfall algorithm
      Low-similarity: $O(n \cdot \lambda)$; high-similarity: $O(n \cdot \kappa)$
      [Figure: two binary transposition networks side by side, one per case, with 0/1 values on the wires]
      Trace 0s through the network in contiguous blocks and gaps
    • The transposition network method: Dynamic string comparison
      The dynamic LCS problem: maintain current LCS score under updates to one or both input strings
      Both input strings are streams, updated on-line:
        appending characters at left or right
        deleting characters at left or right
      Assume for simplicity m ≈ n, i.e. $m = \Theta(n)$
      Goal: linear time per update
        $O(n)$ per update of a (n = |b|)
        $O(m)$ per update of b (m = |a|)
    • The transposition network method: Dynamic string comparison
      Dynamic LCS in linear time: update models
        left      right
        –         app+del   standard DP [Wagner, Fischer: 1974]
        app       app       a fixed [Landau+: 1998], [Kim, Park: 2004]
        app       app       [Ishida+: 2005]
        app+del   app+del   [T: NEW]
      Main idea:
        for append only, maintain seaweed matrix $P_{a,b}$
        for append+delete, maintain partial seaweed layout by tracing a transposition network
    • The transposition network method: Bit-parallel string comparison
      Bit-parallel string comparison: string comparison using standard instructions on words of size w
      Bit-parallel string comparison: running time
        $O(mn/w)$  [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001]
    • The transposition network method: Bit-parallel string comparison
      Bit-parallel string comparison: binary transposition network
      In every cell: input bits s, c; output bits s′, c′; match/mismatch flag µ
      Cell truth table, first circuit diagram:
        s   0 1 0 1 0 1 0 1
        c   0 0 1 1 0 0 1 1
        µ   0 0 0 0 1 1 1 1
        s′  0 1 1 1 0 0 1 1
        c′  0 0 0 1 0 1 0 1
      Cell truth table, second circuit diagram (drawn with ∧ and + gates):
        s   0 1 0 1 0 1 0 1
        c   0 0 1 1 0 0 1 1
        µ   0 0 0 0 1 1 1 1
        s′  0 1 1 0 0 0 1 1
        c′  0 0 0 1 0 1 0 1
      $2c' + s' \leftarrow (s + (s \wedge \mu) + c) \vee (s \wedge \neg\mu)$
      $S \leftarrow (S + (S \wedge M)) \vee (S \wedge \neg M)$, where S, M are words of bits s, µ
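A sketch of the word update using Python integers as arbitrarily long words. Several conventions here are assumptions rather than statements from the slides: M is the mask of positions of a that match the current character of b (µ = 1 on a match, which agrees with the µ = 1 columns of the first truth table, where s and c are swapped), S starts as all ones, and the LCS score is the number of zero bits left in S. The name `lcs_bit_parallel` is hypothetical.

```python
def lcs_bit_parallel(a, b):
    """Global LCS score of a and b with one word update per character of b."""
    m = len(a)
    full = (1 << m) - 1  # keeps S to m bits; the carry out of the top bit is dropped
    masks = {}
    for i, ch in enumerate(a):
        masks[ch] = masks.get(ch, 0) | (1 << i)  # bit i is set where a[i] == ch
    S = full
    for ch in b:
        M = masks.get(ch, 0)
        # S <- (S + (S and M)) or (S and not M), as on the slide
        S = ((S + (S & M)) | (S & ~M)) & full
    return m - bin(S).count("1")  # number of zero bits among the m bits of S


print(lcs_bit_parallel("BAABCBCA", "BAABCABCABACA"))
```

With fixed-size machine words of w bits the same update would be applied to ⌈m/w⌉ words per character, propagating the carry between words.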
    • Sparse string comparison: Semi-local LCS between permutations
      The LCS problem on permutation strings: give LCS score for a vs b
      In each of a, b all characters are distinct: total m = n matches
      Equivalent to
        longest increasing subsequence (LIS) in a string
        maximum clique in a permutation graph
        maximum planar matching in an embedded bipartite graph
      LCS on permutation strings: running time
        $O(n \log n)$  implicit in [Erdős, Szekeres: 1935] [Robinson: 1938; Knuth: 1970; Dijkstra: 1980]
        $O(n \log\log n)$  unit-RAM  [Chang, Wang: 1992] [Bespamyatnikh, Segal: 2000]
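The equivalence with LIS can be sketched as follows: rewrite b as the sequence of positions its characters occupy in a, then find a longest strictly increasing subsequence with the standard binary-search technique, for $O(n \log n)$ time overall. The function name `lcs_permutations` is hypothetical; the example strings are the two permutations of A to H used in the figures of this section.

```python
from bisect import bisect_left


def lcs_permutations(a, b):
    """LCS score of two strings in which all characters are distinct."""
    pos = {ch: i for i, ch in enumerate(a)}
    seq = [pos[ch] for ch in b if ch in pos]  # b rewritten as positions in a
    tails = []  # tails[k] = smallest last element of an increasing subsequence of length k + 1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)


print(lcs_permutations("DEHCBAFG", "CFAEDHGB"))
```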
    • Sparse string comparison: Semi-local LCS between permutations
      The semi-local LCS problem on permutation strings: give semi-local LCS scores of a vs b
      In each of a, b all characters are distinct: total m = n matches
      Equivalent to longest increasing subsequence (LIS) in every substring of a string
      Semi-local LCS on permutation strings: running time
        $O(n^2 \log n)$  naive
        $O(n^2)$  restricted  [Albert+: 2003; Chen+: 2005]
        $O(n^{1.5} \log n)$  randomised, restricted  [Albert+: 2007]
        $O(n^{1.5})$  [T: 2006]
        $O(n \log^2 n)$  [T: NEW]
    • Sparse string comparison: Semi-local LCS between permutations. [Figure, shown in several stages: alignment grid of the permutation strings "DEHCBAFG" and "CFAEDHGB"]
    • Sparse string comparison: Semi-local LCS between permutations
      Semi-local LCS on permutation strings: the algorithm
      Divide-and-conquer on the alignment graph
      Divide the graph (say) horizontally; two subproblems of effective size n/2
      Conquer: seaweed matrix multiplication, time $O(n \log n)$
      Overall time $O(n \log^2 n)$
    • Sparse string comparison: Longest piecewise monotone subsequences
      A k-increasing sequence: a concatenation of k increasing sequences
      A k-modal sequence: a concatenation of k alternating increasing and decreasing sequences
      The longest k-increasing (k-modal) subsequence problem: give the longest k-increasing (k-modal) subsequence of string b
      Longest k-increasing (k-modal) subsequence: running time
        $O(nk \log n)$  k-modal  [Demange+: 2007]
        $O(nk \log n)$  via [Hunt, Szymanski: 1977]
        $O(n \log^2 n)$  [T: NEW]
      Main idea: LCS for $\mathrm{id}^k$ (respectively, $(\mathrm{id}\,\overline{\mathrm{id}})^{k/2}$, with $\overline{\mathrm{id}}$ the reversed identity) vs b
    • Sparse string comparison: Longest piecewise monotone subsequences
      Longest k-increasing subsequence: algorithm A
        Sparse LCS for $\mathrm{id}^k$ vs b: time $O(nk \log n)$
      Longest k-increasing subsequence: algorithm B
        Compute seaweed matrix for id vs b: time $O(n \log^2 n)$
        Extract three-way seaweed submatrix
        Compute three-way seaweed submatrix for $\mathrm{id}^k$ vs b by log k instances of seaweed matrix square/multiply: time $\log k \cdot O(n \log n) = O(n \log^2 n)$
        Query the LCS score for $\mathrm{id}^k$ vs b: time negligible
        Overall time $O(n \log^2 n)$
      Algorithm B is faster than Algorithm A for k ≥ log n
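Algorithm A can be sketched in the style of Hunt and Szymanski, assuming b has distinct elements: every element of b matches k positions of $\mathrm{id}^k$, those positions are fed in decreasing order for each element of b, and a longest strictly increasing subsequence of the resulting position sequence gives the LCS score. The name `longest_k_increasing` is hypothetical, and the sketch returns only the length.

```python
from bisect import bisect_left


def longest_k_increasing(b, k):
    """Length of the longest subsequence of b that splits into k increasing runs."""
    n = len(b)
    rank = {x: r for r, x in enumerate(sorted(b))}  # position of x within one copy of id
    tails = []
    for x in b:
        # Decreasing order ensures at most one of the k copies is used per element of b.
        for copy in range(k - 1, -1, -1):
            p = copy * n + rank[x]  # position of this match inside id^k
            idx = bisect_left(tails, p)
            if idx == len(tails):
                tails.append(p)
            else:
                tails[idx] = p
    return len(tails)


print(longest_k_increasing([3, 1, 2], 1))
print(longest_k_increasing([3, 1, 2], 2))
```

The loop performs nk binary searches, in line with the $O(nk \log n)$ bound quoted for algorithm A up to the size of the searched list.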
    • Sparse string comparison: Maximum clique in a circle graph
      The maximum clique problem in a circle graph: given a circle with n chords, find the maximum-size subset of pairwise intersecting chords
      Standard reduction to an interval model: cut the circle and lay it out on the line; chords become intervals (here drawn as square diagonals)
      Chords intersect iff intervals overlap, i.e. intersect without containment
    • Sparse string comparison: Maximum clique in a circle graph
      Maximum clique in a circle graph: running time
        $\exp(n)$  naive
        $O(n^3)$  [Gavril: 1973]
        $O(n^2)$  [Rotem, Urrutia: 1981; Hsu: 1985] [Masuda+: 1990; Apostolico+: 1992]
        $O(n^{1.5})$  [T: 2006]
        $O(n \log^2 n)$  [T: NEW]
    • Sparse string comparison: Maximum clique in a circle graph
      Maximum clique in a circle graph: the algorithm
      Helly property: if any set of intervals intersect pairwise, then they all intersect at a common point
      Compute seaweed matrix, build range tree: time $O(n \log^2 n)$
      Run through all 2n + 1 possible common intersection points
      For each point, find a maximum subset of covering overlapping segments by a prefix-suffix LCS query: time $(2n + 1) \cdot O(\log^2 n) = O(n \log^2 n)$
      Overall time $O(n \log^2 n) + O(n \log^2 n) = O(n \log^2 n)$
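The Helly-based outline can be illustrated with a much simpler, slower stand-in. The sketch below does not use the seaweed matrix or range tree: for each candidate common point it recomputes the answer from scratch as a longest increasing subsequence, which takes $O(n^2 \log n)$ time overall. It assumes chords are given as pairs of distinct endpoint labels in circular order; the name `max_clique_circle` is hypothetical.

```python
from bisect import bisect_left


def max_clique_circle(chords):
    """Size of a maximum set of pairwise intersecting chords.

    chords: list of (p, q) pairs; all endpoint labels are distinct numbers
    giving the order of the endpoints around the circle.
    """
    intervals = sorted((min(p, q), max(p, q)) for p, q in chords)
    endpoints = sorted(x for iv in intervals for x in iv)
    best = 0
    for x in endpoints:
        # Candidate common point: just to the right of endpoint x.
        # Intervals covering it, listed by increasing left endpoint:
        rights = [r for l, r in intervals if l <= x < r]
        # Covering intervals overlap pairwise without containment exactly when
        # their right endpoints increase together with their left endpoints.
        tails = []
        for r in rights:
            k = bisect_left(tails, r)
            if k == len(tails):
                tails.append(r)
            else:
                tails[k] = r
        best = max(best, len(tails))
    return best


print(max_clique_circle([(0, 3), (1, 4), (2, 5)]))
```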
    • Sparse string comparison: Maximum clique in a circle graph
      Parameterised maximum clique in a circle graph
      The maximum clique problem in a circle graph, sensitive e.g. to
        the number e of edges
        the size l of the maximum clique
        the thickness d of the interval model (the maximum number of intervals covering a point)
      We have $l \le d \le n$, $e \le n^2$
      Parameterised maximum clique in a circle graph: running time
        $O(n \log n + e)$ [Apostolico+: 1992]
        $O(n \log n + nl \log(n/l))$ [Apostolico+: 1992]
        $O(n \log n + n \log^2 d)$ NEW
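The thickness parameter d can be computed with a simple endpoint sweep. This is a minimal sketch, assuming intervals are given as `(left, right)` pairs with distinct endpoints; the function name is hypothetical.

```python
def thickness(intervals):
    """Maximum number of intervals covering a single point (endpoints assumed distinct)."""
    events = []
    for left, right in intervals:
        events.append((left, 1))    # interval opens
        events.append((right, -1))  # interval closes
    events.sort()
    depth = best = 0
    for _, delta in events:
        depth += delta
        best = max(best, depth)
    return best


# Nested and disjoint intervals: thickness 2, although the maximum clique has size 1.
print(thickness([(0, 10), (1, 2), (3, 4)]))  # 2
```

The example shows that l can be strictly smaller than d: nested intervals raise the thickness without being adjacent in the circle graph.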
    • Sparse string comparison: Maximum clique in a circle graph
      Parameterised maximum clique in a circle graph: the algorithm
      For each diagonal block of size d, compute the seaweed matrix and build a range tree: time $n/d \cdot O(d \log^2 d) = O(n \log^2 d)$
      Extend each diagonal block to a quadrant: time $O(n \log^2 d)$
      Run through all $2n + 1$ possible common intersection points
      For each point, find a maximum subset of pairwise overlapping segments covering it, by a prefix-suffix LCS query: time $O(n \log^2 d)$
      Overall time $O(n \log^2 d)$
    • Compressed string comparison: Grammar compression
      Notation: pattern p of length m; text t of length n
      A GC-string (grammar-compressed string) t is a straight-line program (context-free grammar) generating $t = t_{\bar n}$ by $\bar n$ assignments of the form
        $t_k = \alpha$, where $\alpha$ is an alphabet character
        $t_k = t_i t_j$, where $i, j < k$
      In general, $n = O(2^{\bar n})$
      Example: Fibonacci string "ABAABABAABAAB"
        $t_1$ = 'B', $t_2$ = 'A', $t_3 = t_2 t_1$, $t_4 = t_3 t_2$, $t_5 = t_4 t_3$, $t_6 = t_5 t_4$, $t_7 = t_6 t_5$
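The Fibonacci example can be checked with a small sketch. The representation is an assumption made for illustration: a straight-line program is a Python list in which each rule is either a single character or a pair `(i, j)` of earlier rule indices, 0-based rather than 1-based as on the slide.

```python
def slp_lengths(rules):
    """Length of the string generated by each rule, without expanding anything."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            i, j = rule
            lengths.append(lengths[i] + lengths[j])
    return lengths


def slp_expand(rules):
    """Fully expand the last rule (exponential in the worst case; for small demos only)."""
    texts = []
    for rule in rules:
        if isinstance(rule, str):
            texts.append(rule)
        else:
            i, j = rule
            texts.append(texts[i] + texts[j])
    return texts[-1]


# t1 = 'B', t2 = 'A', t3 = t2 t1, ..., t7 = t6 t5, shifted to 0-based indices.
FIB = ["B", "A", (1, 0), (2, 1), (3, 2), (4, 3), (5, 4)]

print(slp_expand(FIB))   # ABAABABAABAAB
print(slp_lengths(FIB))  # [1, 1, 2, 3, 5, 8, 13]
```

Computing lengths needs only one pass over the $\bar n$ rules, which is the kind of operation that stays cheap on compressed text.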
    • Compressed string comparison: Grammar compression
      Grammar compression covers various compression types, e.g. LZ78, LZW (not LZ77 directly)
      Simplifying assumption: arithmetic up to n runs in O(1)
      This assumption can be removed by careful index remapping
    • Compressed string comparison: Extended substring-string LCS on GC-strings
      LCS: running time ($r = m + n$, $\bar r = \bar m + \bar n$)
      p plain, t plain:
        $O(mn)$ [Wagner, Fischer: 1974]
        $O(mn / \log^2 m)$ [Masek, Paterson: 1980] [Crochemore+: 2003]
      p plain, t GC:
        $O(m^3 \bar n + \ldots)$, general CFG [Myers: 1995]
        $O(m^{1.5} \bar n)$, extended substring-string [T: 2008]
        $O(m \log m \cdot \bar n)$, extended substring-string [T: 2010]
      p GC, t GC:
        NP-hard [Lifshits: 2005]
        $O(r^{1.2} \bar r^{1.4})$, R weights [Hermelin+: 2009]
        $O(r \log r \cdot \bar r)$ [T: 2010]
        $O(r \log(r/\bar r) \cdot \bar r)$ [Hermelin+: 2010]
        $O(r \log^{1/2}(r/\bar r) \cdot \bar r)$ [Gawrychowski: NEW]
    • Compressed string comparison: Extended substring-string LCS on GC-strings
      Extended substring-string LCS (plain pattern, GC text): the algorithm
      For every k, compute by recursion the appropriate part of the seaweed matrix $P_{p,t_k}$, using matrix distance multiplication: time $O(m \log m \cdot \bar n)$
      Overall time $O(m \log m \cdot \bar n)$
    • Compressed string comparison: Subsequence recognition on GC-strings
      The global subsequence recognition problem
      Does pattern p appear in text t as a subsequence?
      Global subsequence recognition: running time
        p plain, t plain: $O(n)$ greedy
        p plain, t GC: $O(m \bar n)$ greedy
        p GC, t GC: NP-hard [Lifshits: 2005]
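Both greedy bounds can be sketched in a few lines. The slide does not spell out the $O(m \bar n)$ method, so the version below is one natural reading and should be taken as an assumption: for every grammar rule and every pattern position, it records how far the greedy scan advances through the pattern while reading that rule's string. The rule format (a character, or a pair of 0-based earlier rule indices) and the function names are hypothetical.

```python
def is_subsequence_plain(p, t):
    """Greedy scan over a plain text: O(n) time."""
    i = 0
    for c in t:
        if i < len(p) and p[i] == c:
            i += 1
    return i == len(p)


def is_subsequence_gc(p, rules):
    """Greedy scan over a grammar-compressed text: O(m * number of rules) time."""
    m = len(p)
    # advance[k][i]: pattern position reached after reading t_k, starting at position i.
    advance = []
    for rule in rules:
        if isinstance(rule, str):
            row = [i + 1 if i < m and p[i] == rule else i for i in range(m + 1)]
        else:
            left, right = advance[rule[0]], advance[rule[1]]
            row = [right[left[i]] for i in range(m + 1)]
        advance.append(row)
    return advance[-1][0] == m


# Fibonacci string "ABAABABAABAAB" as a straight-line program (0-based rule indices).
FIB = ["B", "A", (1, 0), (2, 1), (3, 2), (4, 3), (5, 4)]

print(is_subsequence_gc("BAAB", FIB))       # True
print(is_subsequence_gc("AAAAAAAAA", FIB))  # False: the text has only eight 'A's
```

Composing the two child tables for a rule $t_k = t_i t_j$ is what keeps the work per rule at $O(m)$, independent of the expanded length.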
    • Compressed string comparison: Subsequence recognition on GC-strings
      The local subsequence recognition problem
      Find all minimally matching substrings of t with respect to p
      A substring of t is matching if p is a subsequence of that substring
      A matching substring of t is minimally matching if none of its proper substrings is matching
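The definition can be made concrete with a naive sketch on plain strings. It is not one of the algorithms cited in the deck: for each start position it scans greedily for the earliest end of a match, then keeps the pair only if starting one position later would force a later end. The function name and the half-open index convention are assumptions.

```python
def minimal_matching_substrings(p, t):
    """Half-open pairs (i, j) such that t[i:j] is minimally matching for p. Naive O(n^2)."""
    n = len(t)
    if not p:
        return []

    def earliest_end(start):
        """Smallest j with p a subsequence of t[start:j], or None if there is none."""
        k = 0
        for j in range(start, n):
            if t[j] == p[k]:
                k += 1
                if k == len(p):
                    return j + 1
        return None

    ends = [earliest_end(i) for i in range(n)]
    result = []
    for i in range(n):
        if ends[i] is None:
            break  # no match starts here, so none starts later either
        later = ends[i + 1] if i + 1 < n else None
        # Minimal iff dropping the first character loses the match within t[i:ends[i]].
        if later is None or later > ends[i]:
            result.append((i, ends[i]))
    return result


print(minimal_matching_substrings("AB", "ABAAB"))  # [(0, 2), (3, 5)]
```

In the example, `t[1:5]` and `t[2:5]` are matching but not minimal, because they contain the shorter match `t[3:5]`.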
    • Compressed string comparison: Subsequence recognition on GC-strings
      Local subsequence recognition: running time (+ output)
      p plain, t plain:
        $O(mn)$ [Mannila+: 1995]
        $O(mn / \log m)$ [Das+: 1997]
        $O(c^m + n)$ [Boasson+: 2001]
        $O(m + n\sigma)$ [Troníček: 2001]
      p plain, t GC:
        $O(m^2 \log m \cdot \bar n)$ [Cégielski+: 2006]
        $O(m^{1.5} \bar n)$ [T: 2008]
        $O(m \log m \cdot \bar n)$ [T: NEW]
      p GC, t GC:
        NP-hard [Lifshits: 2005]
    • Compressed string comparison: Subsequence recognition on GC-strings
      [Figure: seaweeds crossing the alignment box; index labels not recoverable from the transcript]
      Substring i : j of the text is matching iff box [i : j] is not pierced left-to-right
      The maximal seaweeds form a chain
      Substring i : j of the text is minimally matching iff (i, j) is in the interleaved chain
    • Compressed string comparison: Subsequence recognition on GC-strings
      [Figure: seaweed matrix with marked entries; axis labels run from −m to n and from 0 to m + n]
    • Compressed string comparison: Subsequence recognition on GC-strings
      Local subsequence recognition (plain pattern, GC text): the algorithm
      For every k, compute by recursion the appropriate part of the seaweed matrix $P_{p,t_k}$, using matrix distance multiplication: time $O(m \log m \cdot \bar n)$
      Given an assignment $t = t' t''$, count by recursion
        minimally matching substrings in $t'$
        minimally matching substrings in $t''$
      Then, find the chain of maximal seaweeds in time $\bar n \cdot O(m) = O(m \bar n)$
      The interleaved chain defines the minimally matching substrings in t overlapping both $t'$ and $t''$
      Overall time $O(m \log m \cdot \bar n) + O(m \bar n) = O(m \log m \cdot \bar n)$
    • Compressed string comparison: Threshold approximate matching
      The threshold approximate matching problem
      Find all matching substrings of t with respect to p, according to a threshold k
      A substring of t is matching if the edit distance between p and that substring is at most k
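For plain strings, the classical $O(mn)$ dynamic programme credited to Sellers in the running-time table gives a compact illustration. The sketch below assumes unit-cost edit distance and reports only the end positions of matching substrings rather than the substrings themselves; the function name is hypothetical.

```python
def threshold_match_ends(p, t, k):
    """End positions j >= 1 such that some substring t[i:j] has edit distance <= k to p."""
    m = len(p)
    # col[i]: minimum edit distance between p[:i] and a substring of t ending at the
    # current position. Row 0 stays at 0 because a match may start anywhere in t.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(t, start=1):
        prev_diag = col[0]
        for i in range(1, m + 1):
            cur = min(
                col[i] + 1,                   # text character c left unmatched
                col[i - 1] + 1,               # pattern character p[i-1] left unmatched
                prev_diag + (p[i - 1] != c),  # match or substitution
            )
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            ends.append(j)
    return ends


print(threshold_match_ends("abc", "xxabcxx", 1))  # [4, 5, 6]
```

Only one column of the table is kept, so memory is $O(m)$ while time remains $O(mn)$.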
    • Compressed string comparison: Threshold approximate matching
      Threshold approximate matching: running time (+ output)
      p plain, t plain:
        $O(mn)$ [Sellers: 1980]
        $O(nk)$ [Landau, Vishkin: 1989]
        $O(m + n + nk^4/m)$ [Cole, Hariharan: 2002]
      p plain, t GC:
        $O(m \bar n k^2)$ [Kärkkäinen+: 2003]
        $O(m \bar n k + \bar n \log n)$ [LV: 1989] via [Bille+: 2010]
        $O(m \bar n + \bar n k^4 + \bar n \log n)$ [CH: 2002] via [Bille+: 2010]
        $O(m \log m \cdot \bar n)$ [T: NEW]
      p GC, t GC:
        NP-hard [Lifshits: 2005]
      (Also many specialised variants for LZ compression)
    • Compressed string comparison: Threshold approximate matching
      Threshold approximate matching (plain pattern, GC text): the algorithm
      Algorithm structure similar to local subsequence recognition, by matrix distance multiplication and seaweed chains
      Extra ingredients:
        the blow-up technique: reduction of edit distances to LCS scores
        implicit matrix searching, which replaces chain interleaving
      Monge row minima: "SMAWK" $O(m)$ [Aggarwal+: 1987]
      Implicit unit-Monge row minima: $O(m \log\log m)$ [T: NEW], $O(m)$ [Gawrychowski: NEW]
      Overall time $O(m \log m \cdot \bar n) + O(m \bar n) = O(m \log m \cdot \bar n)$
    • Compressed string comparison: Threshold approximate matching
      [Figure: seaweeds in the blown-up alignment box; index labels not recoverable from the transcript]
      Blow-up: weighted alignment on strings p, t of size m, n is equivalent to LCS on strings $\tilde p$, $\tilde t$ of size $\tilde m = \nu m$, $\tilde n = \nu n$
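A small numeric check can make the blow-up concrete. The slide does not give the encoding, so the one used here is an assumption: alignment weights are normalised to match 1, mismatch $\mu/\nu$, gap 0, and each character c is replaced by $\mu$ copies of a guard character followed by $\nu - \mu$ copies of c. Under that assumption, the LCS score of the blown-up strings is expected to equal $\nu$ times the weighted alignment score. All function names are hypothetical.

```python
from fractions import Fraction


def lcs(a, b):
    """Standard O(|a| * |b|) LCS score."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def alignment_score(a, b, mismatch):
    """Maximum alignment score with match 1, the given mismatch weight, and gap 0."""
    prev = [Fraction(0)] * (len(b) + 1)
    for x in a:
        cur = [Fraction(0)]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (1 if x == y else mismatch)
            cur.append(max(diag, prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def blow_up(s, mu, nu, guard="$"):
    """Replace each character c by guard * mu + c * (nu - mu); guard must not occur in s."""
    return "".join(guard * mu + c * (nu - mu) for c in s)


mu, nu = 1, 2  # mismatch weight 1/2
for a, b in [("ab", "ab"), ("a", "b"), ("ab", "b"), ("ab", "ba")]:
    weighted = alignment_score(a, b, Fraction(mu, nu))
    blown = lcs(blow_up(a, mu, nu), blow_up(b, mu, nu))
    print(a, b, weighted, blown)  # blown equals nu * weighted in each case
```

The blown-up strings are $\nu$ times longer, which is why the technique is stated for rational weights with a fixed denominator.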
    • Compressed string comparison: Threshold approximate matching
      [Figure: blown-up seaweed matrix with marked entries; axis labels run from $-\tilde m$ to $\tilde n$ and from 0 to $\tilde m + \tilde n$]
    • Beyond semi-locality: Window-local LCS and alignment plots
      Given window length w
      The window-local LCS problem
      Give semi-local LCS score for every w-substring of a vs whole b
      Window-local LCS: running time
        $O(mn^3 w)$ naive
        $O(mnw)$ multiple seaweed
        $O(mn)$ [Krusche, T: 2010]
    • Beyond semi-locality: Window-local LCS and alignment plots
      Window-local LCS: the algorithm
      Compute seaweed matrices for canonical substrings of a against b
      Build seaweed matrices for windows of a against b recursively
      Both stages use matrix distance multiplication, overall time $O(mn)$
    • Beyond semi-locality: Window-local LCS and alignment plots
      The window-window LCS problem
      Give LCS score for every w-substring of a vs every w-substring of b
      Provides an alignment plot: loss-free local comparison of genome sequences, important for studying conservation in the genome
      Window-window LCS: running time
        $O(mnw^2)$ naive
        $O(mnw)$ multiple seaweed
        $O(mn)$ [Krusche, T: 2010]
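The naive $O(mnw^2)$ bound corresponds to running an independent LCS computation for every pair of windows. A minimal sketch of that baseline follows; the function names are hypothetical, and the result is the alignment plot as a list of rows indexed by window start in a and in b.

```python
def lcs(a, b):
    """Standard O(|a| * |b|) LCS score."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def window_window_lcs(a, b, w):
    """plot[i][j] = LCS score of a[i:i+w] against b[j:j+w]; naive O(m * n * w^2)."""
    return [
        [lcs(a[i:i + w], b[j:j + w]) for j in range(len(b) - w + 1)]
        for i in range(len(a) - w + 1)
    ]


print(window_window_lcs("abcd", "bcda", 2))  # [[1, 0, 1], [2, 1, 0], [1, 2, 1]]
```

Adjacent windows share almost all of their characters, which is the redundancy the seaweed-based methods exploit.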
    • Beyond semi-locality: Window-local LCS and alignment plots
      An implementation of alignment plots [Krusche: 2010]
      Based on the $O(mnw)$ multiple seaweed combing, using some ideas from the optimal $O(mn)$ algorithm
      Resulting theoretical complexity $O(mnw^{0.5})$
      Alignment weights (match 1, mismatch 0, gap −0.5): ×4 slowdown
      C++, Intel assembly (x86, x86_64, MMX/SSE2 data parallelism)
      SMP parallelism (currently two processors)
    • Beyond semi-locality: Window-local LCS and alignment plots
      Krusche's implementation is used at Warwick Systems Biology Centre to study weak conservation in plant genomes (BLAST too inaccurate)
      Speedup on one processor:
        ×10 over heavily optimised, bit-parallel naive algorithm
        ×7 over biologists' heuristics
      Speedup on two processors: extra ×2, near-perfect parallelism
    • Beyond semi-locality: Window-local LCS and alignment plots
      Publications:
        [Picot+: 2010] Evolutionary Analysis of Regulatory Sequences (EARS) in Plants. Plant Journal.
        [Baxter+: accepted] Conserved Noncoding Sequences Highlight Shared Components of Regulatory Networks in Dicotyledonous Plants. Plant Cell.
      Web service: http://wsbc.warwick.ac.uk/ears
      Future work: genomic repeats; genome-scale comparison
    • Beyond semi-locality: Quasi-local LCS and sparse spliced alignment
      Given m (possibly overlapping) prescribed substrings in a
      The quasi-local LCS problem
      Give semi-local LCS score for every prescribed substring of a vs whole b
      Quasi-local LCS: running time
        $O(m^2 n)$ multiple seaweed
        $O(mn \log^2 m)$ [T: unpublished]
    • Beyond semi-locality: Quasi-local LCS and sparse spliced alignment
      Quasi-local LCS: the algorithm
      Compute seaweed matrices for canonical substrings of a against b
      Build seaweed matrices for prescribed substrings of a against b recursively
      Both stages use matrix distance multiplication, time $O(mn \log^2 m)$
    • Beyond semi-locality: Quasi-local LCS and sparse spliced alignment
      The sparse spliced alignment problem
      Give the chain of non-overlapping prescribed substrings in a, closest to b by alignment score
      Describes gene assembly: unknown gene from candidate exons, given a known similar gene
      Assume m = n; let s = total size of prescribed substrings
      Sparse spliced alignment: running time
        $O(ns) = O(n^3)$ [Gelfand+: 1996]
        $O(n^{2.5})$ [Kent+: 2006]
        $O(n^2 \log^2 n)$ [T: unpublished]
        $O(n^2 \log n)$ [Sakai: 2009]
    • Conclusions and future work
      Implicit unit-Monge matrices:
        the seaweed monoid
        distance multiplication in time $O(n \log n)$
        next: lower bound?
      Semi-local LCS problem:
        representation by implicit unit-Monge matrices
        generalisation to rational alignment scores
        next: real alignment scores?
      Seaweed combing and micro-block speedup:
        a simple algorithm for semi-local LCS
        semi-local LCS in time $o(mn)$
        improvements on related problems
    • Conclusions and future work
      Transposition networks:
        simple interpretation of high-similarity and dissimilarity LCS
        dynamic LCS
        simple interpretation of bit-parallel LCS
      Sparse string comparison:
        fast LCS on permutation strings
        fast max-clique in a circle graph
        next: lower bound? or even faster max-clique
      Sparse string comparison:
        semi-local LCS on permutations in time $O(n \log^2 n)$
        maximum clique in a circle graph in time $O(n \log^2 n)$
        improvements on related problems
    • Conclusions and future work
      Compressed string comparison:
        three-way semi-local LCS on GC text against plain pattern
        subsequence recognition in GC-strings
        next: approximate matching in GC-strings
      Beyond semi-locality:
        quasi-local string comparison
        sparse spliced alignment
        next: full locality