Semi-local string comparison:
              Algorithmic techniques and applications


                                    Alexander Tiskin

                                Department of Computer Science
                                     University of Warwick
                             http://go.warwick.ac.uk/alextiskin




Alexander Tiskin (Warwick)           Semi-local string comparison   1 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      2 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      3 / 132
Introduction

String matching: finding an exact pattern in a string
String comparison: finding similar patterns in two strings
Applications: computational biology, image recognition, . . .




  Alexander Tiskin (Warwick)   Semi-local string comparison     4 / 132
Introduction

String matching: finding an exact pattern in a string
String comparison: finding similar patterns in two strings
Applications: computational biology, image recognition, . . .
Standard types of string comparison:
     global: whole string vs whole string
     local: substrings vs substrings
Main focus of this work:
     semi-local: whole string vs substrings; prefixes vs suffixes
Closely related to approximate string matching (no relation to
approximation algorithms!)
Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices)

  Alexander Tiskin (Warwick)   Semi-local string comparison         4 / 132
Introduction
Terminology and notation


x− = x −       1
               2        x+ = x +     1
                                     2
Integers: {. . . − 2, −1, 0, 1, 2, . . .}
Half-integers:         . . . − 3 , − 1 , 1 , 2 , 2 , . . . = . . . (−2)+ , (−1)+ , 0+ , 1+ , 2+
                               2     2 2
                                             3 5

(i, j)      (i , j ) iff i < i and j < j                  (i, j)         (i , j ) iff i > i and j < j
A permutation matrix is a 0/1 matrix with exactly one nonzero per row
and per column
        
 0 1 0
1 0 0
 0 0 1




   Alexander Tiskin (Warwick)            Semi-local string comparison                                 5 / 132
Introduction
Terminology and notation


Given matrix D, its distribution matrix is made up of          -dominance sums:




   Alexander Tiskin (Warwick)   Semi-local string comparison                 6 / 132
Introduction
Terminology and notation


Given matrix D, its distribution matrix is made up of          -dominance sums:
Given matrix E , its density matrix is made up of quadrangle differences:
E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − )
    ı ˆ       ı ˆ            ı ˆ            ı ˆ            ı ˆ
where D Σ , E over integers; D, E         over half-integers




   Alexander Tiskin (Warwick)   Semi-local string comparison                 6 / 132
Introduction
Terminology and notation


Given matrix D, its distribution matrix is made up of          -dominance sums:
Given matrix E , its density matrix is made up of quadrangle differences:
E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − )
    ı ˆ       ı ˆ            ı ˆ            ı ˆ            ı ˆ
where D Σ , E over integers; D, E over half-integers
                                            
        Σ      0 1 2 3           0 1 2 3                 
 0 1 0          0 1 1 2         0 1 1 2           0 1 0
1 0 0 = 
                                  0 0 0 1 = 1 0 0
                                                        
                0 0 0 1
 0 0 1                                                0 0 1
                 0 0 0 0           0 0 0 0




   Alexander Tiskin (Warwick)   Semi-local string comparison                 6 / 132
Introduction
Terminology and notation


Given matrix D, its distribution matrix is made up of          -dominance sums:
Given matrix E , its density matrix is made up of quadrangle differences:
E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − )
    ı ˆ       ı ˆ            ı ˆ            ı ˆ            ı ˆ
where D Σ , E over integers; D, E over half-integers
                                            
        Σ      0 1 2 3           0 1 2 3                 
 0 1 0          0 1 1 2         0 1 1 2           0 1 0
1 0 0 = 
                                  0 0 0 1 = 1 0 0
                                                        
                0 0 0 1
 0 0 1                                                0 0 1
                 0 0 0 0           0 0 0 0
(D Σ ) = D for all D
Matrix E is simple, if (E )Σ = E ; equivalently, if it has all zeros in the left
column and bottom row



   Alexander Tiskin (Warwick)   Semi-local string comparison                 6 / 132
Introduction
Terminology and notation


Matrix E is Monge, if E         is nonnegative
Intuition: boundary-to-boundary distances in a (weighted) planar graph
Matrix E is unit-Monge, if E         is a permutation matrix
Intuition: boundary-to-boundary distances in a grid-like graph




   Alexander Tiskin (Warwick)     Semi-local string comparison           7 / 132
Introduction
Terminology and notation


Matrix E is Monge, if E               is nonnegative
Intuition: boundary-to-boundary distances in a (weighted) planar graph
Matrix E is unit-Monge, if E                is a permutation matrix
Intuition: boundary-to-boundary distances in a grid-like graph
Simple unit-Monge matrix: P Σ , where P is a permutation matrix
Seaweed matrix: P               used as an implicit representation of P Σ
                                        
       Σ      0                1 2 3
 0 1 0        0
1 0 0 =                       1 1 2  
              0                 0 0 1
 0 0 1
                0                0 0 0




   Alexander Tiskin (Warwick)            Semi-local string comparison       7 / 132
Introduction
Implicit unit-Monge matrices


Efficient P Σ queries: range tree on nonzeros of P                             [Bentley: 1980]
         binary search tree by i-coordinate
         under every node, binary search tree by j-coordinate
     •                            •                       •
 •               −→         •                 −→      •
         •                            •                       •
             •                            •                       •
      ↓
     •                            •                       •
 •               −→         •                 −→      •
         •                            •                       •
             •                            •                       •
      ↓
     •                            •                       •
 •               −→         •                 −→      •
         •                            •                       •
             •                            •                       •

     Alexander Tiskin (Warwick)               Semi-local string comparison               8 / 132
Introduction
Implicit unit-Monge matrices


Efficient P Σ queries: (contd.)
Every node of the range tree represents a canonical range (rectangular
region), and stores its nonzero count
Overall, ≤ n log n canonical ranges are non-empty
A P Σ query is equivalent to -dominance counting: how many nonzeros
are -dominated by query point?
Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges
Total size O(n log n), query time O(log2 n)




   Alexander Tiskin (Warwick)   Semi-local string comparison             9 / 132
Introduction
Implicit unit-Monge matrices


Efficient P Σ queries: (contd.)
Every node of the range tree represents a canonical range (rectangular
region), and stores its nonzero count
Overall, ≤ n log n canonical ranges are non-empty
A P Σ query is equivalent to -dominance counting: how many nonzeros
are -dominated by query point?
Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges
Total size O(n log n), query time O(log2 n)
There are asymptotically more efficient (but less practical) data structures
                                      log n
Total size O(n), query time O       log log n                          [J´J´+: 2004]
                                                                         a a
                                                               [Chan, Pˇtra¸cu: 2010]
                                                                       a s


   Alexander Tiskin (Warwick)   Semi-local string comparison                     9 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      10 / 132
Matrix distance multiplication
Seaweed braids


Distance algebra (a.k.a (min, +) or tropical algebra):
       addition ⊕ given by min
       multiplication            given by +
Matrix        -multiplication
A      B=C               C (i, k) =    j   A(i, j)       B(j, k) = minj A(i, j) + B(j, k)
Matrix classes closed under                -multiplication (for given n):
       general numerical (integer, real) matrices
       Monge matrices
       simple unit-Monge matrices (!)




    Alexander Tiskin (Warwick)         Semi-local string comparison                    11 / 132
Matrix distance multiplication
Seaweed braids


Recall that simple unit-Monge matrices are represented implicitly by
permutation (seaweed) matrices
Define PA                       Σ
                   PB = PC as PA        Σ    Σ
                                       PB = PC
The seaweed monoid Tn :
      simple unit-Monge matrices under
      equivalently, permutation (seaweed) matrices under
Also known as the 0-Hecke monoid of the symmetric group H0 (Sn )




   Alexander Tiskin (Warwick)      Semi-local string comparison        12 / 132
Matrix distance multiplication
Seaweed braids


PA           PB = PC can be seen as combing of seaweed braids

     •                            •                     •
         •               •                         •
                 •           •                                  •
 •                                    •        •
                     •                    •                 •
             •                   •                     •
         PA                      PB                    PC




     Alexander Tiskin (Warwick)               Semi-local string comparison   13 / 132
Matrix distance multiplication
Seaweed braids


PA           PB = PC can be seen as combing of seaweed braids

     •                            •                     •
         •               •                         •
                 •           •                                  •
 •                                    •        •
                     •                    •                 •
             •                   •                     •
         PA                      PB                    PC


PA



PB



     Alexander Tiskin (Warwick)               Semi-local string comparison   13 / 132
Matrix distance multiplication
Seaweed braids


PA           PB = PC can be seen as combing of seaweed braids

     •                            •                     •
         •               •                         •
                 •           •                                  •
 •                                    •        •
                     •                    •                 •
             •                   •                     •
         PA                      PB                    PC


PA



PB



     Alexander Tiskin (Warwick)               Semi-local string comparison   13 / 132
Matrix distance multiplication
Seaweed braids


PA           PB = PC can be seen as combing of seaweed braids

     •                            •                     •
         •               •                         •
                 •           •                                  •
 •                                    •        •
                     •                    •                 •
             •                   •                     •
         PA                      PB                    PC


PA

                                                                             PC

PB



     Alexander Tiskin (Warwick)               Semi-local string comparison        13 / 132
Matrix distance multiplication
Seaweed braids


The seaweed monoid Tn :

      n! elements (permutations of size n)
      n − 1 generators g1 , g2 , . . . , gn−1 (elementary crossings)

Idempotence:
gi2 = gi for all i                                                           =


Far commutativity:
gi gj = gj gi j − i > 1                                        ···   =       ···

Braid relations:
gi gj gi = gj gi gj j − i = 1                                            =


   Alexander Tiskin (Warwick)   Semi-local string comparison                       14 / 132
Matrix distance multiplication
Seaweed braids


Identity: 1        x =x
           
   • · · ·
  · • · ·
  · · • · =
1=         

    · · · •

Zero: 0        x =0
           
    · · · •
  · · • ·
0=
  · • · · =
            

   • · · ·


   Alexander Tiskin (Warwick)   Semi-local string comparison   15 / 132
Matrix distance multiplication
Seaweed braids


Related structures:
      positive braids: far comm; braid relations
      braids: gi gi−1 = 1; far comm; braid relations
      Coxeter’s presentation of Sn : gi2 = 1; far comm; braid relations
      locally free idempotent monoid: idem; far comm            [Vershik+: 2000]
Generalisations:
      general 0-Hecke monoids                 [Fomin, Greene: 1998; Buch+: 2008]
      Coxeter monoids           [Tsaranov: 1990; Richardson, Springer: 1990]
      J -trivial monoids                                        [Denton+: 2011]




   Alexander Tiskin (Warwick)   Semi-local string comparison                16 / 132
Matrix distance multiplication
Seaweed braids


Computation in the seaweed monoid: a confluent rewriting system can be
obtained by software (Semigroupe, GAP)




   Alexander Tiskin (Warwick)   Semi-local string comparison      17 / 132
Matrix distance multiplication
Seaweed braids


Computation in the seaweed monoid: a confluent rewriting system can be
obtained by software (Semigroupe, GAP)
T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0

aa → a                          bb → b                  bab → 0         aba → 0




   Alexander Tiskin (Warwick)            Semi-local string comparison             17 / 132
Matrix distance multiplication
Seaweed braids


Computation in the seaweed monoid: a confluent rewriting system can be
obtained by software (Semigroupe, GAP)
T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0

aa → a                          bb → b                   bab → 0         aba → 0

T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac,
bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0

aa → a                          ca → ac                  bab → aba       cbac → bcba
bb → b                          cc → c                   cbc → bcb       abacba → 0




   Alexander Tiskin (Warwick)             Semi-local string comparison             17 / 132
Matrix distance multiplication
Seaweed braids


Computation in the seaweed monoid: a confluent rewriting system can be
obtained by software (Semigroupe, GAP)
T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0

aa → a                          bb → b                   bab → 0         aba → 0

T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac,
bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0

aa → a                          ca → ac                  bab → aba       cbac → bcba
bb → b                          cc → c                   cbc → bcb       abacba → 0

Easy to use, but not an efficient algorithm


   Alexander Tiskin (Warwick)             Semi-local string comparison             17 / 132
Matrix distance multiplication
Seaweed matrix multiplication


The implicit unit-Monge matrix              -multiplication problem
Given permutation matrices PA , PB , compute PC , such that
 Σ     Σ     Σ
PA PB = PC (equivalently, PA PB = PC )




   Alexander Tiskin (Warwick)   Semi-local string comparison          18 / 132
Matrix distance multiplication
Seaweed matrix multiplication


The implicit unit-Monge matrix                      -multiplication problem
Given permutation matrices PA , PB , compute PC , such that
 Σ     Σ     Σ
PA PB = PC (equivalently, PA PB = PC )

Matrix        -multiplication: running time
type                            time
general                         O(n3 )                                               standard
                                    3          3
                                O n (log log n)
                                       log2 n
                                                                                [Chan: 2007]
Monge                           O(n2 )                                 via [Aggarwal+: 1987]
implicit unit-Monge             O(n1.5 )                                            [T: 2006]
                                O(n log n)                                          [T: 2010]



   Alexander Tiskin (Warwick)           Semi-local string comparison                     18 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                             PB
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •
   •      • •
     • •
                  •
           ••
 ••
      •
                ••                            ?
    •          •
       •     •
         PA                                  PC


   Alexander Tiskin (Warwick)                         Semi-local string comparison   19 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •
   •      • •
     • •
                  •
           ••
 ••             ••
      •        •
    •        •
       •
   PA,lo , PA,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   20 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                                    ••
   •      • •                            •
     • •                                           •        •
                  •
           ••
 ••             ••              • •
      •        •                  •
    •        •                   •
       •                                               •
   PA,lo , PA,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   20 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                               •
   •      • •                                     •
     • •                                                   •
                  •                                     • •
           ••                   •
 ••             ••                                    • •
      •        •                              •
    •        •                               •
       •
   PA,lo , PA,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   20 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                      ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                               •• •
                                             •
   •      • •                                 •
     • •                                       •         •
                  •                                       •
           ••                    •                      • •
 ••             ••              • •                   • •
      •        •                   •
    •        •                    • ••
       •                                               •
   PA,lo , PA,hi                    PC ,lo + PC ,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   20 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                      ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                               •• •
                                             •
   •      • •                                 •
     • •                                       •         •
                  •                                       •
           ••                    •                      • •
 ••             ••              • •                   • •
      •        •                   •
    •        •                    • ••
       •                                               •
   PA,lo , PA,hi                    PC ,lo + PC ,hi


   Alexander Tiskin (Warwick)                         Semi-local string comparison   21 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                    PB,lo , PB,hi
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                                    ••          •
   •      • •                            •             • •
     • •                                          ••
                  •                                     • •
           ••                        •
 ••             ••              • •                   • •
      •        •                   •
    •        •                    • ••
       •                         •
   PA,lo , PA,hi                             PC


   Alexander Tiskin (Warwick)                         Semi-local string comparison   21 / 132
Matrix distance multiplication
Seaweed matrix multiplication

                                             PB
                                     ••
                                •             ••
                                    ••             •
                                                       • •
                                         •        •
                                •                       •
                                             ••       • •
                                                         • •

    ••   •                                    ••        •
   •      • •                            •           • •
     • •                                          ••
                  •                                   • •
           ••                        •
 ••             ••              • •                   • •
      •        •                   •
    •        •                    • ••
       •                         •
         PA                                  PC


   Alexander Tiskin (Warwick)                         Semi-local string comparison   22 / 132
Matrix distance multiplication
Seaweed matrix multiplication


Implicit unit-Monge matrix              -multiplication: the algorithm
 Σ                Σ           Σ
PC (i, k) = minj PA (i, j) + PB (j, k)
Divide-and-conquer on the range of j
Divide PA horizontally, PB vertically: two subproblems of effective size n/2
 Σ
PA,lo       Σ       Σ
           PB,lo = PC ,lo        Σ
                                PA,hi      Σ       Σ
                                          PB,hi = PC ,hi
Conquer:
   -low nonzeros of PC ,lo and          -high nonzeros of PC ,hi appear in PC
The remaining nonzeros of PC ,lo and PC ,hi are “wrong”, and need to be
corrected to obtain the remaining nonzeros of PC
Correction can be done in time O(n) using the unit-Monge property
Overall time O(n log n)

   Alexander Tiskin (Warwick)      Semi-local string comparison                 23 / 132
Matrix distance multiplication
Bruhat order


Comparing permutations by the “degree of sortedness”
Bruhat order
Permutation A is lower (“more sorted”) than permutation B in the Bruhat
order (A B), if B can be transformed to A by successive pairwise sorting
between arbitrary pairs of elements.

Permutation matrices: PA PB , if PB can be transformed to PA by
successive submatrix substitution: ( 0 1 )
                                     10    (1 0)
                                            01




   Alexander Tiskin (Warwick)   Semi-local string comparison        24 / 132
Matrix distance multiplication
Bruhat order


Bruhat comparability: running time
O(n2 )                                                                  folklore
O(n log n)                                                            [T: NEW]

PA      PB iff PA ≤ PB elementwise, time O(n2 )
               Σ    Σ                                                   folklore
                   R               R
PA      PB iff     PA        PB = Id , time O(n log n)                 [T: NEW]
where P R denotes clockwise rotation of matrix P




   Alexander Tiskin (Warwick)          Semi-local string comparison         25 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      26 / 132
Semi-local string comparison
Semi-local LCS and edit distance


Consider strings (= sequences) over an alphabet of size σ
Distinguish contiguous substrings and not necessarily contiguous
subsequences
Special cases of substring: prefix, suffix
Notation: strings a, b of length m, n respectively
Assume where necessary: m ≤ n; m, n reasonably close




   Alexander Tiskin (Warwick)      Semi-local string comparison    27 / 132
Semi-local string comparison
Semi-local LCS and edit distance


Consider strings (= sequences) over an alphabet of size σ
Distinguish contiguous substrings and not necessarily contiguous
subsequences
Special cases of substring: prefix, suffix
Notation: strings a, b of length m, n respectively
Assume where necessary: m ≤ n; m, n reasonably close
The longest common subsequence (LCS) score:
      length of longest string that is a subsequence of both a and b
      equivalently, alignment score, where score(match) = 1 and
      score(mismatch) = 0
In biological terms, “loss-free alignment” (unlike “lossy” BLAST)

   Alexander Tiskin (Warwick)      Semi-local string comparison        27 / 132
Semi-local string comparison
Semi-local LCS and edit distance


The LCS problem
Give the LCS score for a vs b




   Alexander Tiskin (Warwick)      Semi-local string comparison   28 / 132
Semi-local string comparison
Semi-local LCS and edit distance


The LCS problem
Give the LCS score for a vs b

LCS: running time
O(mn)                                                                      [Wagner, Fischer:    1974]
    mn
O log2 n                    σ = O(1)                                      [Masek, Paterson:     1980]
                                                                              [Crochemore+:     2003]
     mn(log log n)2
O       log2 n
                                                                         [Paterson, Danˇ´cık:   1994]
                                                                      [Bille, Farach-Colton:    2008]

Running time varies depending on the RAM model version
We assume word-RAM with word size log n (where it matters)


    Alexander Tiskin (Warwick)         Semi-local string comparison                              28 / 132
Semi-local string comparison
Semi-local LCS and edit distance


LCS on the alignment graph (directed, acyclic)
  B A A B C A B C A B A C A                                       blue = 0
B                                                                  red = 1
A
A
B
C
B
C
A
score(“BAABCBCA”, “BAABCABCABACA”) = len(“BAABCBCA”) = 8
LCS = highest-score path from top-left to bottom-right


   Alexander Tiskin (Warwick)      Semi-local string comparison       29 / 132
Semi-local string comparison
Semi-local LCS and edit distance


LCS: dynamic programming                                              [WF: 1974]
Sweep cells in any              -compatible order
Cell update: time O(1)
Overall time O(mn)




   Alexander Tiskin (Warwick)          Semi-local string comparison         30 / 132
Semi-local string comparison
Semi-local LCS and edit distance



                                     ‘Begin at the beginning,’ the King said
                                     gravely, ‘and go on till you come to the end:
                                     then stop.’
                                                      L. Carroll, Alice in Wonderland
                                     (The standard approach in dynamic
                                     programming)




   Alexander Tiskin (Warwick)      Semi-local string comparison                31 / 132
Semi-local string comparison
Semi-local LCS and edit distance




Sometimes dynamic programming can be run from both ends for extra
flexibility




   Alexander Tiskin (Warwick)      Semi-local string comparison     32 / 132
Semi-local string comparison
Semi-local LCS and edit distance




Sometimes dynamic programming can be run from both ends for extra
flexibility
Is there a better, fully flexible alternative (e.g. for comparing compressed
strings, comparing strings dynamically or in parallel, etc.)?

   Alexander Tiskin (Warwick)      Semi-local string comparison          32 / 132
Semi-local string comparison
Semi-local LCS and edit distance


LCS: micro-block dynamic programming                                    [MP: 1980; BF: 2008]
Sweep cells in micro-blocks, in any                  -compatible order
Micro-block size:
      t = O(log n) when σ = O(1)
                    log n
      t=O         log log n     otherwise
Micro-block interface:
      O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits
      O(t) small integers, each O(1) bits
Micro-block update: time O(1), by precomputing all possible interfaces
                          mn                                  mn(log log n)2
Overall time O          log2 n
                                  when σ = O(1), O               log2 n
                                                                               otherwise

   Alexander Tiskin (Warwick)          Semi-local string comparison                        33 / 132
Semi-local string comparison
Semi-local LCS and edit distance


The semi-local LCS problem
Give the (implicit) matrix of O (m + n)2 LCS scores:
      string-substring LCS: string a vs every substring of b
      prefix-suffix LCS: every prefix of a vs every suffix of b
      suffix-prefix LCS: every suffix of a vs every prefix of b
      substring-string LCS: every substring of a vs string b




   Alexander Tiskin (Warwick)      Semi-local string comparison   34 / 132
Semi-local string comparison
Semi-local LCS and edit distance


The semi-local LCS problem
Give the (implicit) matrix of O (m + n)2 LCS scores:
      string-substring LCS: string a vs every substring of b
      prefix-suffix LCS: every prefix of a vs every suffix of b
      suffix-prefix LCS: every suffix of a vs every prefix of b
      substring-string LCS: every substring of a vs string b

Cf.: dynamic programming gives prefix-prefix LCS




   Alexander Tiskin (Warwick)      Semi-local string comparison   34 / 132
Semi-local string comparison
Semi-local LCS and edit distance


Semi-local LCS on the alignment graph
  B A A B C A B C A B A C A                                       blue = 0
B                                                                  red = 1
A
A
B
C
B
C
A
score(“BAABCBCA”, “CABCABA”) = len(“ABCBA”) = 5
String-substring LCS: all highest-score top-to-bottom paths
Semi-local LCS: all highest-score boundary-to-boundary paths

   Alexander Tiskin (Warwick)      Semi-local string comparison       35 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H
 0     1   2   3    4   5   6     6   7   8   8   8    8    8           a = “BAABCBCA”
 -1 0      1   2    3   4   5     5   6   7   7   7    7    7
 -2 -1 0       1    2   3   4     4   5   6   6   6    6    7
                                                                        b = “BAABCABCABACA”
 -3 -2 -1 0         1   2   3     3   4   5   5   6    6    7           H(i, j) = score(a, b i : j )
 -4 -3 -2 -1 0          1   2     2   3   4   4   5    5    6
                                                                        H(4, 11) = 5
 -5 -4 -3 -2 -1 0           1     2   3   4   4   5    5    6
 -6 -5 -4 -3 -2 -1 0              1   2   3   3   4    4    5           H(i, j) = j − i if i > j
 -7 -6 -5 -4 -3 -2 -1 0               1   2   2   3    3    4
 -8 -7 -6 -5 -4 -3 -2 -1 0                1   2   3    3    4
 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                 1   2    3    4
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                  1    2    3
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                   1    2
-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                    1
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

     Alexander Tiskin (Warwick)               Semi-local string comparison                             36 / 132
Semi-local string comparison
Score matrices and seaweed matrices


Semi-local LCS: output representation and running time
size               query time
O(n2 )             O(1)                                                                 trivial
O(m1/2 n)          O(log n)    string-substring                               [Alves+: 2003]
O(n)               O(n)        string-substring                               [Alves+: 2005]
O(n log n)         O(log2 n)                                                        [T: 2006]
  . . . or any    2D orthogonal range counting data structure
running time
O(mn2 )                                                                                 naive
O(mn)                       string-substring                   [Schmidt: 1998; Alves+: 2005]
O(mn)                                                                               [T: 2006]
     mn
O log0.5 n                                                                          [T: 2006]
     mn(log log n)2
O       log2 n
                                                                                    [T: 2007]

    Alexander Tiskin (Warwick)         Semi-local string comparison                       37 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
H(i, j): the number of matched characters for a vs substring b i : j
j − i − H(i, j): the number of unmatched characters
Properties of matrix j − i − H(i, j):
      simple unit-Monge
      therefore, = P Σ , where P = −H               is a permutation matrix
P is the seaweed matrix, giving an implicit representation of H
Range tree for P: memory O(n log n), query time O(log2 n)




   Alexander Tiskin (Warwick)    Semi-local string comparison                 38 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
 0     1   2   3    4   5   6       6   7   8       8       8       8       8   a = “BAABCBCA”
 -1 0      1   2    3   4   5       5   6   7       7       7       7       7
                                                                        •       b = “BAABCABCABACA”
 -2 -1 0       1    2   3   4       4   5   6       6       6       6       7
                                                        •
 -3 -2 -1 0         1   2   3       3   4   5       5       6       6       7   H(i, j) = score(a, b i : j )
 -4 -3 -2 -1 0          1   2       2   3   4       4       5       5       6
                                •                                               H(4, 11) = 5
 -5 -4 -3 -2 -1 0           1       2   3   4       4       5       5       6
 -6 -5 -4 -3 -2 -1 0                1   2   3       3       4       4       5   H(i, j) = j − i if i > j
 -7 -6 -5 -4 -3 -2 -1 0                 1   2       2       3       3       4
                                                •
 -8 -7 -6 -5 -4 -3 -2 -1 0                  1       2       3       3       4
                                                                •
 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                       1       2       3       4
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                            1       2       3
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                1       2
-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                    1
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

     Alexander Tiskin (Warwick)                     Semi-local string comparison                               39 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
 0     1   2   3    4   5   6       6   7   8       8       8       8       8   a = “BAABCBCA”
 -1 0      1   2    3   4   5       5   6   7       7       7       7       7
                                                                        •       b = “BAABCABCABACA”
 -2 -1 0       1    2   3   4       4   5   6       6       6       6       7
                                                        •
 -3 -2 -1 0         1   2   3       3   4   5       5       6       6       7   H(i, j) = score(a, b i : j )
 -4 -3 -2 -1 0          1   2       2   3   4       4       5       5       6
                                •                                               H(4, 11) = 5
 -5 -4 -3 -2 -1 0           1       2   3   4       4       5       5       6
 -6 -5 -4 -3 -2 -1 0                1   2   3       3       4       4       5   H(i, j) = j − i if i > j
 -7 -6 -5 -4 -3 -2 -1 0                 1   2       2       3       3       4
                                                •                               blue: difference in H is 0
 -8 -7 -6 -5 -4 -3 -2 -1 0                  1       2       3       3       4
                                                                •               red: difference in H is 1
 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                       1       2       3       4
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                            1       2       3
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                1       2
-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                    1
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

     Alexander Tiskin (Warwick)                     Semi-local string comparison                               39 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
 0     1   2   3    4   5   6       6   7   8       8       8       8       8   a = “BAABCBCA”
 -1 0      1   2    3   4   5       5   6   7       7       7       7       7
                                                                        •       b = “BAABCABCABACA”
 -2 -1 0       1    2   3   4       4   5   6       6       6       6       7
                                                        •
 -3 -2 -1 0         1   2   3       3   4   5       5       6       6       7   H(i, j) = score(a, b i : j )
 -4 -3 -2 -1 0          1   2       2   3   4       4       5       5       6
                                •                                               H(4, 11) = 5
 -5 -4 -3 -2 -1 0           1       2   3   4       4       5       5       6
 -6 -5 -4 -3 -2 -1 0                1   2   3       3       4       4       5   H(i, j) = j − i if i > j
 -7 -6 -5 -4 -3 -2 -1 0                 1   2       2       3       3       4
                                                •                               blue: difference in H is 0
 -8 -7 -6 -5 -4 -3 -2 -1 0                  1       2       3       3       4
                                                                •               red: difference in H is 1
 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                       1       2       3       4
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                            1       2       3   green: P(i, j) = 1
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                1       2
-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0                                    1   H(i, j) = j − i − P Σ (i, j)
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

     Alexander Tiskin (Warwick)                     Semi-local string comparison                               39 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The score matrix H and the seaweed matrix P
                                                              a = “BAABCBCA”
                                               •              b = “BAABCABCABACA”
                                      •
                                                              H(4, 11) =
                            •                                 11 − 4 − P Σ (i, j) =
                                                              11 − 4 − 2 = 5

                                •
                                          •




   Alexander Tiskin (Warwick)       Semi-local string comparison                      40 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The seaweed braid in the alignment graph
  B A A B C A B C A B A C A                                a = “BAABCBCA”
B
A                                                          b = “BAABCABCABACA”
A                                                          H(4, 11) =
B                                                          11 − 4 − P Σ (i, j) =
C                                                          11 − 4 − 2 = 5
B
C
A
P(i, j) = 1 corresponds to seaweed top i                  bottom j




   Alexander Tiskin (Warwick)    Semi-local string comparison                      41 / 132
Semi-local string comparison
Score matrices and seaweed matrices


The seaweed braid in the alignment graph
  B A A B C A B C A B A C A                                         a = “BAABCBCA”
B
A                                                                   b = “BAABCABCABACA”
A                                                                   H(4, 11) =
B                                                                   11 − 4 − P Σ (i, j) =
C                                                                   11 − 4 − 2 = 5
B
C
A
P(i, j) = 1 corresponds to seaweed top i                           bottom j
Also define top              right, left        right, left               bottom seaweeds
Gives bijection between top-left and bottom-right graph boundaries

   Alexander Tiskin (Warwick)             Semi-local string comparison                      41 / 132
Semi-local string comparison
Score matrices and seaweed matrices




Seaweed braid: a highly symmetric object (element of the 0-Hecke monoid
of the symmetric group)
Can be built recursively by assembling subbraids from separate parts
Highly flexible: local alignment, compression, parallel computation. . .

   Alexander Tiskin (Warwick)    Semi-local string comparison             42 / 132
Semi-local string comparison
Weighted alignment


The LCS problem is a special case of the weighted alignment score
problem with weighted matches (wM ), mismatches (wX ) and gaps (wG )

      LCS score: wM = 1, wX = wG = 0
      Levenshtein score: wM = 2, wX = 1, wG = 0




   Alexander Tiskin (Warwick)   Semi-local string comparison       43 / 132
Semi-local string comparison
Weighted alignment


The LCS problem is a special case of the weighted alignment score
problem with weighted matches (wM ), mismatches (wX ) and gaps (wG )

      LCS score: wM = 1, wX = wG = 0
      Levenshtein score: wM = 2, wX = 1, wG = 0

Alignment score is rational, if wM , wX , wG are rational numbers
Equivalent to LCS score on blown-up strings




   Alexander Tiskin (Warwick)   Semi-local string comparison        43 / 132
Semi-local string comparison
Weighted alignment


The LCS problem is a special case of the weighted alignment score
problem with weighted matches (wM ), mismatches (wX ) and gaps (wG )

      LCS score: wM = 1, wX = wG = 0
      Levenshtein score: wM = 2, wX = 1, wG = 0

Alignment score is rational, if wM , wX , wG are rational numbers
Equivalent to LCS score on blown-up strings
Edit distance: minimum cost to transform a into b by weighted character
edits (insertion, deletion, substitution)
Corresponds to weighted alignment score with wM = 0, insertion/deletion
weight −wG , substitution weight −wX



   Alexander Tiskin (Warwick)   Semi-local string comparison        43 / 132
Semi-local string comparison
Weighted alignment


Weighted alignment graph
  B A A B C A B C A B A C A                                            blue = 0
B                                                                red (solid) = 2
A                                                              red (dotted) = 1
A
B
C
B
C
A
Levenshtein score(“BAABCBCA”, “CABCABA”) = 11




   Alexander Tiskin (Warwick)   Semi-local string comparison                44 / 132
Semi-local string comparison
Weighted alignment


Alignment graph for blown-up strings
     $B $A $A $B $C $A $B $C $A $B $A $C $A                         blue = 0
$B                                                             red = 0.5 or 1
$A
$A
$B
$C
$B
$C
$A
Levenshtein score(“BAABCBCA”, “CABCABA”) = 2 · 5.5




   Alexander Tiskin (Warwick)   Semi-local string comparison             45 / 132
Semi-local string comparison
Weighted alignment


Rational-weighted semi-local alignment reduced to semi-local LCS
     $B $A $A $B $C $A $B $C $A $B $A $C $A
$B
$A
$A
$B
$C
$B
$C
$A

Let wM = 1, wX = µ , wG = 0
                 ν
Increase × ν 2 in complexity (can be reduced to ν)


   Alexander Tiskin (Warwick)   Semi-local string comparison       46 / 132
Alexander Tiskin (Warwick)   Semi-local string comparison   47 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      48 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   49 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   50 / 132
The seaweed method
Seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   51 / 132
The seaweed method
Seaweed combing


Semi-local LCS: seaweed combing                                       [T: 2006]
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells in any              -compatible order
Match cell: two seaweeds uncrossed; skip
Mismatch cell: two seaweeds cross
      if the same seaweeds crossed before, uncross them
      otherwise skip, keep seaweeds crossed
Cell update: time O(1)
Overall time O(mn)




   Alexander Tiskin (Warwick)          Semi-local string comparison        52 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   53 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   54 / 132
The seaweed method
Micro-block seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   55 / 132
The seaweed method
Micro-block seaweed combing


Semi-local LCS: micro-block seaweed combing                                [T: 2007]
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells in micro-blocks, in any                    -compatible order
                                     log n
Micro-block size: t = O            log log n
Micro-block interface:
      O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits
      O(t) integers, each O(log n) bits, can be reduced to O(log t) bits
Micro-block update: time O(1), by precomputing all possible interfaces
                        mn(log log n)2
Overall time O             log2 n




   Alexander Tiskin (Warwick)            Semi-local string comparison           56 / 132
The seaweed method
Cyclic LCS


The cyclic LCS problem
Give the maximum LCS score for a vs all cyclic rotations of b




   Alexander Tiskin (Warwick)   Semi-local string comparison    57 / 132
The seaweed method
Cyclic LCS


The cyclic LCS problem
Give the maximum LCS score for a vs all cyclic rotations of b

Cyclic LCS: running time
  mn    2
O log n                                                                       naive
O(mn log m)                                                            [Maes: 1990]
O(mn)                           [Bunke, B¨hler: 1993; Landau+: 1998; Schmidt: 1998]
                                         u
                2
O mn(log 2log n)
       log n
                                                                          [T: 2007]




   Alexander Tiskin (Warwick)           Semi-local string comparison           57 / 132
The seaweed method
Cyclic LCS


Cyclic LCS: the algorithm
                                                               mn(log log n)2
Micro-block seaweed combing on a vs bb, time O                    log2 n
Make n string-substring LCS queries, time negligible




   Alexander Tiskin (Warwick)   Semi-local string comparison                    58 / 132
The seaweed method
Longest repeating subsequence


The longest repeating subsequence problem
Find the longest subsequence of a that is a square (a repetition of two
identical strings)




   Alexander Tiskin (Warwick)   Semi-local string comparison              59 / 132
The seaweed method
Longest repeating subsequence


The longest repeating subsequence problem
Find the longest subsequence of a that is a square (a repetition of two
identical strings)

Longest repeating subsequence: running time
O(m3 )                                                                    naive
O(m2 )                                                         [Kosowski: 2004]
   2      log 2
O m (log 2 m m)
      log
                                                                      [T: 2007]




   Alexander Tiskin (Warwick)   Semi-local string comparison               59 / 132
The seaweed method
Longest repeating subsequence


Longest repeating subsequence: the algorithm
                                                               m2 (log log m)2
Micro-block seaweed combing on a vs a, time O                      log2 m
Make m − 1 suffix-prefix LCS queries, time negligible




   Alexander Tiskin (Warwick)   Semi-local string comparison                     60 / 132
The seaweed method
Approximate matching


The approximate pattern matching problem
Give the substring closest to a by alignment score, starting at each
position in b

Assume rational alignment score
Approximate pattern matching: running time
O(mn)                                                                              [Sellers: 1980]
   mn
O log n                     σ = O(1)                                  via [Masek, Paterson: 1980]
     mn(log log n)2
O       log2 n
                                                               via [Bille, Farach-Colton: 2008]




    Alexander Tiskin (Warwick)         Semi-local string comparison                           61 / 132
The seaweed method
Approximate matching


Approximate pattern matching: the algorithm
Micro-block seaweed combing on a vs b (with blow-up), time
                2
O mn(log 2log n)
      log n
The implicit semi-local edit score matrix:
      an anti-Monge matrix
      approximate pattern matching ∼ row minima
Row minima in O(n) element queries                                  [Aggarwal+: 1987]
Each query in time O(log2 n) using the range tree representation,
combined query time negligible
                                mn(log log n)2
Overall running time O             log2 n
                                                  , same as [Bille, Farach-Colton: 2008]


   Alexander Tiskin (Warwick)        Semi-local string comparison                   62 / 132
Alexander Tiskin (Warwick)   Semi-local string comparison   63 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      64 / 132
Periodic string comparison
Wraparound seaweed combing


The periodic string-substring LCS problem
Give (implicit) LCS scores for a vs each substring of b = . . . uuu . . . = u ±∞

Let u be of length p
May assume that every character of a occurs in u
Only substrings of b of length at most mp (otherwise LCS score is m)




   Alexander Tiskin (Warwick)   Semi-local string comparison               65 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   66 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   67 / 132
Periodic string comparison
Wraparound seaweed combing

    B A A B C A B C A B A C A
B
A
A
B
C
B
C
A




    Alexander Tiskin (Warwick)   Semi-local string comparison   68 / 132
Periodic string comparison
Wraparound seaweed combing


Periodic string-substring LCS: Wraparound seaweed combing
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells row-by-row: each row starts at match cell, wraps at boundary
Match cell: two seaweeds uncrossed; skip
Mismatch cell: two seaweeds cross
      if the same seaweeds crossed before (with wrapping), uncross them
      otherwise skip, keep seaweeds crossed
Cell update: time O(1)
Overall time O(mn)
String-substring LCS score: count seaweeds with multiplicities


   Alexander Tiskin (Warwick)   Semi-local string comparison         69 / 132
Periodic string comparison
Wraparound seaweed combing


The tandem LCS problem
Give LCS score for a vs b = u k

We have n = kp; may assume k ≤ m
Tandem LCS: running time
O(mkp)                                                                               naive
O m(k + p)                                                     [Landau, Ziv-Ukelson: 2001]
O(mp)                                                                            [T: 2009]

Direct application of wraparound seaweed combing




   Alexander Tiskin (Warwick)   Semi-local string comparison                          70 / 132
Periodic string comparison
Wraparound seaweed combing


The tandem alignment problem
Give the substring closest to a by alignment score among certain
substrings of b = u ±∞ :
      global: substrings u k of length kp across all k
      cyclic: substrings of length kp across all k
      local: substrings of any length

Tandem alignment: running time
O(m2 p)               all                                                        naive
O(mp)                 global                                   [Myers, Miller:   1989]
O(mp log p)           cyclic                                        [Benson:     2005]
O(mp)                 cyclic                                               [T:   2009]
O(mp)                 local                                    [Myers, Miller:   1989]

   Alexander Tiskin (Warwick)   Semi-local string comparison                      71 / 132
Periodic string comparison
Wraparound seaweed combing


Cyclic tandem alignment: the algorithm
Periodic seaweed combing for a vs b (with blow-up), time O(mp)
For each k ∈ [1 : m]:
      solve tandem LCS (under given alignment score) for a vs u k
      obtain scores for a vs p successive substrings of b of length kp by LCS
      batch query: time O(1) per substring
Running time O(mp)




   Alexander Tiskin (Warwick)   Semi-local string comparison             72 / 132
Alexander Tiskin (Warwick)   Semi-local string comparison   73 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      74 / 132
The transposition network method
Transposition networks


Comparison network: a circuit of comparators
A comparator sorts two inputs and outputs them in prescribed order
Comparison networks traditionally used for non-branching merging/sorting

Classical comparison networks
               # comparators
merging        O(n log n)                                      [Batcher: 1968]
sorting        O(n log2 n)                                     [Batcher: 1968]
               O(n log n)                                       [Ajtai+: 1983]




   Alexander Tiskin (Warwick)   Semi-local string comparison              75 / 132
The transposition network method
Transposition networks


Comparison network: a circuit of comparators
A comparator sorts two inputs and outputs them in prescribed order
Comparison networks traditionally used for non-branching merging/sorting

Classical comparison networks
               # comparators
merging        O(n log n)                                      [Batcher: 1968]
sorting        O(n log2 n)                                     [Batcher: 1968]
               O(n log n)                                       [Ajtai+: 1983]

Comparison networks are visualised by wire diagrams
Transposition network: all comparisons are between adjacent wires


   Alexander Tiskin (Warwick)   Semi-local string comparison              75 / 132
The transposition network method
Transposition networks


Seaweed combing as a transposition network
                                                         −7
                                                        −5
                                                      −3
                                                     −1
               A B C A                             +1
          A                                       +3
                                                +5
          C                                    +7

          B                                                              −7
          C                                                             −1
                                                                      +3
                                                                     −3
                                                                   −5
                                                                  +7
                                                                +5
                                                               +1

Character mismatches correspond to comparators
Inputs anti-sorted (sorted in reverse); each value traces a seaweed


   Alexander Tiskin (Warwick)   Semi-local string comparison                  76 / 132
The transposition network method
Transposition networks


Global LCS: transposition network with binary input
                                                                           0
                                                                       0
                                                                   0
                                                               0
               A B C A                                     1
          A                                            1
                                                   1
          C                                    1

          B                                                                                                0
                                                                                                       0
          C                                                                                        1
                                                                                               0
                                                                                           0
                                                                                       1
                                                                                   1
                                                                               1

Inputs still anti-sorted, but may not be distinct
Comparison between equal values is indeterminate


   Alexander Tiskin (Warwick)   Semi-local string comparison                                                   77 / 132
The transposition network method
Parameterised string comparison


Parameterised string comparison
String comparison sensitive e.g. to
      low similarity: small λ = LCS(a, b)
      high similarity: small κ = dist LCS (a, b) = m + n − 2λ
Can also use weighted alignment score or edit distance

Assume m = n, therefore κ = 2(n − λ)




   Alexander Tiskin (Warwick)     Semi-local string comparison   78 / 132
The transposition network method
Parameterised string comparison


Low-similarity comparison: small λ
      sparse set of matches, may need to look at them all
      preprocess matches for fast searching, time O(n log σ)
High-similarity comparison: small κ
      set of matches may be dense, but only need to look at small subset
      no need to preprocess, linear search is OK
Flexible comparison: sensitive to both high and low similarity, e.g. by both
comparison types running alongside each other




   Alexander Tiskin (Warwick)     Semi-local string comparison          79 / 132
The transposition network method
Parameterised string comparison


Parameterised string comparison: running time
Low-similarity, after preprocessing in O(n log σ)
O(nλ)                                                                       [Hirschberg: 1977]
                                                                    [Apostolico, Guerra: 1985]
                                                                          [Apostolico+: 1992]
High-similarity, no preprocessing
O(n · κ)                                                                      [Ukkonen: 1985]
                                                                                [Myers: 1986]
Flexible
O(λ · κ · log n)          no preproc                                [Myers: 1986; Wu+: 1990]
O(λ · κ)                  after preproc                                           [Rick: 1995]




   Alexander Tiskin (Warwick)        Semi-local string comparison                         80 / 132
The transposition network method
Parameterised string comparison


Parameterised string comparison: the waterfall algorithm
Low-similarity: O(n · λ)                                    High-similarity: O(n · κ)
      0    0    0    0    0       0   0   0                       0    0     0   0   0   0   0   0
 1                                             0             1                                       0
 1                                             1             1                                       0
 1                                             0             1                                       0
 1                                             1             1                                       0
 1                                             1             1                                       1
 1                                             0             1                                       1
 1                                             1             1                                       1
 1                                             1             1                                       0

      0    0    1    1    0       0   1   0                       1    1     1   0   1   1   0   0

Trace 0s through network in contiguous blocks and gaps

     Alexander Tiskin (Warwick)               Semi-local string comparison                               81 / 132
The transposition network method
Dynamic string comparison


The dynamic LCS problem
Maintain current LCS score under updates to one or both input strings

Both input strings are streams, updated on-line:
      appending characters at left or right
      deleting characters at left or right
Assume for simplicity m ≈ n, i.e. m = Θ(n)
Goal: linear time per update
      O(n) per update of a (n = |b|)
      O(m) per update of b (m = |a|)



   Alexander Tiskin (Warwick)   Semi-local string comparison        82 / 132
The transposition network method
Dynamic string comparison


Dynamic LCS in linear time: update models
left            right
–               app+del                               standard DP [Wagner, Fischer: 1974]
app             app             a fixed                 [Landau+: 1998], [Kim, Park: 2004]
app             app                                                       [Ishida+: 2005]
app+del         app+del                                                         [T: NEW]

Main idea:
      for append only, maintain seaweed matrix Pa,b
      for append+delete, maintain partial seaweed layout by tracing a
      transposition network



   Alexander Tiskin (Warwick)            Semi-local string comparison                83 / 132
The transposition network method
Bit-parallel string comparison


Bit-parallel string comparison
String comparison using standard instructions on words of size w

Bit-parallel string comparison: running time
O(mn/w )                        [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001]




   Alexander Tiskin (Warwick)           Semi-local string comparison            84 / 132
The transposition network method
Bit-parallel string comparison


Bit-parallel string comparison: binary transposition network
In every cell: input bits s, c; output bits s , c ; match/mismatch flag µ
                      c                               s      0       1   0   1   0   1   0   1
 µ       ¬                                            c      0       0   1   1   0   0   1   1
                                                      µ      0       0   0   0   1   1   1   1
 s                                s                   s      0       1   1   1   0   0   1   1
                                                      c      0       0   0   1   0   1   0   1

                      c
                      c                               s      0       1   0   1   0   1   0   1
 µ
           ∧                                          c      0       0   1   1   0   0   1   1
                                                      µ      0       0   0   0   1   1   1   1
 s                    +           s                   s      0       1   1   0   0   0   1   1
                                                      c      0       0   0   1   0   1   0   1

                      c




     Alexander Tiskin (Warwick)       Semi-local string comparison                               85 / 132
The transposition network method
Bit-parallel string comparison


Bit-parallel string comparison: binary transposition network
In every cell: input bits s, c; output bits s , c ; match/mismatch flag µ
                      c                               s      0       1   0   1   0   1   0   1
 µ       ¬                                            c      0       0   1   1   0   0   1   1
                                                      µ      0       0   0   0   1   1   1   1
 s                                s                   s      0       1   1   1   0   0   1   1
                                                      c      0       0   0   1   0   1   0   1

                      c
                      c                               s      0       1   0   1   0   1   0   1
 µ
           ∧                                          c      0       0   1   1   0   0   1   1
                                                      µ      0       0   0   0   1   1   1   1
 s                    +           s                   s      0       1   1   0   0   0   1   1
                                                      c      0       0   0   1   0   1   0   1

                      c

2c + s ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ)
S ← (S + (S ∧ M)) ∨ (S ∧ ¬M), where S, M are words of bits s, µ
     Alexander Tiskin (Warwick)       Semi-local string comparison                               85 / 132
Alexander Tiskin (Warwick)   Semi-local string comparison   86 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      87 / 132
Sparse string comparison
Semi-local LCS between permutations


The LCS problem on permutation strings
Give LCS score for a vs b
In each of a, b all characters distinct: total m = n matches
Equivalent to
      longest increasing subsequence (LIS) in a string
      maximum clique in a permutation graph
      maximum planar matching in an embedded bipartite graph

LCS on permutation strings: running time
O(n log n)                                   implicit in [Erd¨s, Szekeres:
                                                             o               1935]
                                 [Robinson: 1938; Knuth: 1970; Dijkstra:     1980]
O(n log log n)          unit-RAM                           [Chang, Wang:     1992]
                                                 [Bespamyatnikh, Segal:      2000]

   Alexander Tiskin (Warwick)       Semi-local string comparison              88 / 132
Sparse string comparison
Semi-local LCS between permutations


The semi-local LCS problem on permutation strings
Give semi-local LCS scores of a vs b
In each of a, b all characters distinct: total m = n matches
Equivalent to
      longest increasing subsequence (LIS) in every substring of a string

Semi-local LCS on permutation strings: running time
O(n2 log n)                                                                         naive
O(n2 )                restricted                           [Albert+: 2003; Chen+: 2005]
O(n1.5 log n)         randomised, restricted                             [Albert+: 2007]
O(n1.5 )                                                                        [T: 2006]
O(n log2 n)                                                                    [T: NEW]



   Alexander Tiskin (Warwick)      Semi-local string comparison                      89 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   90 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   90 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   91 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   91 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   92 / 132
Sparse string comparison
Semi-local LCS between permutations

    D E H C B A F G
C
F
A
E
D
H
G
B




    Alexander Tiskin (Warwick)   Semi-local string comparison   92 / 132
Sparse string comparison
Semi-local LCS between permutations


Semi-local LCS on permutation strings: the algorithm
Divide-and-conquer on the alignment graph
Divide graph (say) horizontally; two subproblems of effective size n/2
Conquer: seaweed matrix         -multiplication, time O(n log n)
Overall time O(n log2 n)




   Alexander Tiskin (Warwick)    Semi-local string comparison           93 / 132
Sparse string comparison
Longest piecewise monotone subsequences


A k-increasing sequence: a concatenation of k increasing sequences
A k-modal sequence: a concatenation of k alternating increasing and
decreasing sequences

The longest k-increasing (k-modal) subsequence problem
Give the longest k-increasing (k-modal) subsequence of string b

Longest k-increasing (k-modal) subsequence: running time
O(nk log n)          k-modal                                            [Demange+: 2007]
O(nk log n)                                                    via [Hunt, Szymanski: 1977]
O(n log2 n)                                                                     [T: NEW]

Main idea: LCS for id k (respectively, (idid)k/2 ) vs b

   Alexander Tiskin (Warwick)   Semi-local string comparison                          94 / 132
Sparse string comparison
Longest piecewise monotone subsequences


Longest k-increasing subsequence: algorithm A
Sparse LCS for id k vs b: time O(nk log n)




   Alexander Tiskin (Warwick)   Semi-local string comparison   95 / 132
Sparse string comparison
Longest piecewise monotone subsequences


Longest k-increasing subsequence: algorithm A
Sparse LCS for id k vs b: time O(nk log n)

Longest k-increasing subsequence: algorithm B
Compute seaweed matrix for id vs b: time O(n log2 n)
Extract three-way seaweed submatrix
Compute three-way seaweed submatrix for id k vs b by log k instances of
seaweed matrix -square/multiply: time log k · O(n log n) = O(n log2 n)
Query the LCS score for id k vs b: time negligible
Overall time O(n log2 n)

Algorithm B faster than Algorithm A for k ≥ log n

   Alexander Tiskin (Warwick)   Semi-local string comparison         95 / 132
Sparse string comparison
Maximum clique in a circle graph


The maximum clique problem in a circle graph
Given a circle with n chords, find the maximum-size subset of pairwise
intersecting chords




                                S




   Alexander Tiskin (Warwick)       Semi-local string comparison        96 / 132
Sparse string comparison
Maximum clique in a circle graph


The maximum clique problem in a circle graph
Given a circle with n chords, find the maximum-size subset of pairwise
intersecting chords




                                S


Standard reduction to an interval model: cut the circle and lay it out on
the line; chords become intervals (here drawn as square diagonals)
Chords intersect iff intervals overlap, i.e. intersect without containment
   Alexander Tiskin (Warwick)       Semi-local string comparison        96 / 132
Sparse string comparison
Maximum clique in a circle graph


Maximum clique in a circle graph: running time
exp(n)                                                                         naive
O(n3 )                                                                [Gavril: 1973]
O(n2 )                                            [Rotem, Urrutia: 1981; Hsu: 1985]
                                                [Masuda+: 1990; Apostolico+: 1992]
O(n1.5 )                                                                   [T: 2006]
O(n log2 n)                                                               [T: NEW]




   Alexander Tiskin (Warwick)      Semi-local string comparison                 97 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph




   Alexander Tiskin (Warwick)      Semi-local string comparison   98 / 132
Sparse string comparison
Maximum clique in a circle graph


Maximum clique in a circle graph: the algorithm
Helly property: if any set of intervals intersect pairwise, then they all
intersect at a common point
Compute seaweed matrix, build range tree: time O(n log2 n)
Run through all 2n + 1 possible common intersection points
For each point, find a maximum subset of covering overlapping segments
by a prefix-suffix LCS query: time (2n + 1) · O(log2 n) = O(n log2 n)
Overall time O(n log2 n) + O(n log2 n) = O(n log2 n)




   Alexander Tiskin (Warwick)      Semi-local string comparison             99 / 132
Sparse string comparison
Maximum clique in a circle graph


Parameterised maximum clique in a circle graph
The maximum clique problem in a circle graph, sensitive e.g. to
      the number e of edges
      the size l of maximum clique
      the thickness d of interval model (the maximum number of intervals
      covering a point)

We have l ≤ d ≤ n, e ≤ n2
Parameterised maximum clique in a circle graph: running time
O(n log n + e)                                                    [Apostolico+: 1992]
O(n log n + nl log(n/l))                                          [Apostolico+: 1992]
O(n log n + n log2 d)                                                           NEW

   Alexander Tiskin (Warwick)      Semi-local string comparison                 100 / 132
Sparse string comparison
Maximum clique in a circle graph


Parameterised maximum clique in a circle graph: the algorithm
For each diagonal block of size d, compute seaweed matrix, build range
tree: time n/d · O(d log2 d) = O(n log2 d)
Extend each diagonal block to a quadrant: time O(n log2 d)
Run through all 2n + 1 possible common intersection points
For each point, find a maximum subset of covering overlapping segments
by a prefix-suffix LCS query: time O(n log2 d)
Overall time O(n log2 d)




   Alexander Tiskin (Warwick)      Semi-local string comparison     101 / 132
Alexander Tiskin (Warwick)   Semi-local string comparison   102 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      103 / 132
Compressed string comparison
Grammar compression


Notation: pattern p of length m; text t of length n
A GC-string (grammar-compressed string) t is a straight-line program
(context-free grammar) generating t = tn by n assignments of the form
                                       ¯    ¯
      tk = α, where α is an alphabet character
      tk = ti tj , where i, j < k
In general, n = O(2n )
                   ¯

Example: Fibonacci string “ABAABABAABAAB”
t1 = ‘B’           t2 = ‘A’
t3 = t2 t1          t4 = t3 t2   t5 = t4 t3             t6 = t5 t4   t7 = t6 t5




   Alexander Tiskin (Warwick)       Semi-local string comparison                  104 / 132
Compressed string comparison
Grammar compression


Grammar-compression covers various compression types, e.g. LZ78, LZW
(not LZ77 directly)
Simplifying assumption: arithmetic up to n runs in O(1)
This assumption can be removed by careful index remapping




   Alexander Tiskin (Warwick)   Semi-local string comparison     105 / 132
Compressed string comparison
Extended substring-string LCS on GC-strings


LCS: running time (r = m + n, ¯ = m + n)
                              r   ¯ ¯
p         t
plain     plain       O(mn)                                           [Wagner, Fischer: 1974]
                          mn
                      O log2 m                                       [Masek, Paterson: 1980]
                                                                        [Crochemore+: 2003]
plain     GC          O(m3 n + . . .)
                             ¯                     gen. CFG                    [Myers: 1995]
                      O(m ¯1.5 n)                  ext subs-s                        [T: 2008]
                      O(m log m · n) ¯             ext subs-s                        [T: 2010]
GC        GC          NP-hard                                                 [Lifshits: 2005]
                      O(r 1.2¯1.4 )
                             r                     R weights               [Hermelin+: 2009]
                      O(r log r · ¯)
                                   r                                                 [T: 2010]
                      O r log(r /¯) · ¯
                                   r r                                     [Hermelin+: 2010]
                      O r log1/2 (r /¯) · ¯
                                       r r                             [Gawrychowski: NEW]


   Alexander Tiskin (Warwick)         Semi-local string comparison                       106 / 132
Compressed string comparison
Extended substring-string LCS on GC-strings


Extended substring-string LCS (plain pattern, GC text): the algorithm
For every k, compute by recursion the appropriate part of seaweed matrix
Pp,tk , using matrix -multiplication: time O(m log m · n)
                                                       ¯
Overall time O(m log m · n)
                         ¯




   Alexander Tiskin (Warwick)    Semi-local string comparison       107 / 132
Compressed string comparison
Subsequence recognition on GC-strings


The global subsequence recognition problem
Does pattern p appear in text t as a subsequence?

Global subsequence recognition: running time
p         t
plain     plain       O(n)                                                greedy
plain     GC          O(m¯)
                          n                                               greedy
GC        GC          NP-hard                                   [Lifshits: 2005]




   Alexander Tiskin (Warwick)    Semi-local string comparison              108 / 132
Compressed string comparison
Subsequence recognition on GC-strings


The local subsequence recognition problem
Find all minimally matching substrings of t with respect to p

Substring of t is matching, if p is a subsequence of t
Matching substring of t is minimally matching, if none of its proper
substrings are matching




   Alexander Tiskin (Warwick)    Semi-local string comparison          109 / 132
Compressed string comparison
Subsequence recognition on GC-strings


Local subsequence recognition: running time ( + output)
p         t
plain     plain       O(mn)                                        [Mannila+: 1995]
                          mn
                      O log m                                            [Das+: 1997]
                      O(c m + n)                                   [Boasson+: 2001]
                      O(m + nσ)                                      [Troniˇek: 2001]
                                                                            c
plain     GC          O(m2 log m¯)
                                 n                                [C´gielski+: 2006]
                                                                     e
                          1.5 n)
                      O(m ¯                                                   [T: 2008]
                      O(m log m · n)
                                  ¯                                          [T: NEW]
GC        GC          NP-hard                                          [Lifshits: 2005]




   Alexander Tiskin (Warwick)      Semi-local string comparison                  110 / 132
Compressed string comparison
Subsequence recognition on GC-strings

 0      ˆ0+
        ı            ˆ1+
                     ı       ˆ2+
                             ı      n                                     n




                                             0+
                                             ˆ         1+
                                                       ˆ         2+
                                                                ˆ



b i : j matching iff box [i : j] not pierced left-to-right
  -maximal seaweeds:               -chain ˆ0+ , 0+
                                          ı ˆ                          ˆ1+ , 1+
                                                                       ı ˆ         ···       ˆs − , s −
                                                                                             ı ˆ
b i : j minimally matching iff (i, j) is in the interleaved                          -chain
 ˆ− , +
 ı1+ ˆ1−     ˆ− , +
              ı2+ ˆ2−     ···      ˆ− + , + −
                                   ı(s−1) ˆ(s−1)


     Alexander Tiskin (Warwick)         Semi-local string comparison                                111 / 132
Compressed string comparison
Subsequence recognition on GC-strings

 −m




   0
       ×
 ˆ0+
 ı              •
            •


                  ×
 ˆ1+
 ı                        •
                            ×
 ˆ2+
 ı                              •
                      •


   n                                ×
       n        0+
                ˆ         1+
                          ˆ     2+
                                ˆ       n                   m+n


   Alexander Tiskin (Warwick)               Semi-local string comparison   112 / 132
Compressed string comparison
Subsequence recognition on GC-strings


Local subsequence recognition (plain pattern, GC text): the algorithm
For every k, compute by recursion the appropriate part of seaweed matrix
Pp,tk , using matrix -multiplication: time O(m log m · n)
                                                       ¯




   Alexander Tiskin (Warwick)    Semi-local string comparison       113 / 132
Compressed string comparison
Subsequence recognition on GC-strings


Local subsequence recognition (plain pattern, GC text): the algorithm
For every k, compute by recursion the appropriate part of seaweed matrix
Pp,tk , using matrix -multiplication: time O(m log m · n)
                                                       ¯
Given an assignment t = t t , count by recursion
      minimally matching substrings in t
      minimally matching substrings in t




   Alexander Tiskin (Warwick)    Semi-local string comparison       113 / 132
Compressed string comparison
Subsequence recognition on GC-strings


Local subsequence recognition (plain pattern, GC text): the algorithm
For every k, compute by recursion the appropriate part of seaweed matrix
Pp,tk , using matrix -multiplication: time O(m log m · n)
                                                       ¯
Given an assignment t = t t , count by recursion
      minimally matching substrings in t
      minimally matching substrings in t
Then, find           -chain of   -maximal seaweeds in time n · O(m) = O(m¯)
                                                          ¯             n
The interleaved -chain defines minimally matching substrings in t
overlapping both t and t
Overall time O(m log m · n) + O(m¯) = O(m log m · n)
                         ¯       n                ¯



   Alexander Tiskin (Warwick)      Semi-local string comparison         113 / 132
Compressed string comparison
Threshold approximate matching


The threshold approximate matching problem
Find all matching substrings of t with respect to p, according to a
threshold k

Substring of t is matching, if the edit distance for p vs t is at most k




   Alexander Tiskin (Warwick)    Semi-local string comparison              114 / 132
Compressed string comparison
Threshold approximate matching


Threshold approximate matching: running time ( + output)
p         t
plain     plain       O(mn)                                                   [Sellers: 1980]
                      O(mk)                                         [Landau, Vishkin: 1989]
                                    4
                      O(m + n + nk )
                                  m                                  [Cole, Hariharan: 2002]
plain     GC          O(m¯k 2 )
                          n                                            [K¨rkk¨inen+: 2003]
                                                                          a a
                      O(m¯k + n log n)
                          n     ¯                              [LV: 1989] via [Bille+: 2010]
                      O(m¯ + nk 4 + n log n)
                          n ¯         ¯                       [CH: 2002] via [Bille+: 2010]
                      O(m log m · n)
                                  ¯                                                [T: NEW]
GC        GC          NP-hard                                                [Lifshits: 2005]

(Also many specialised variants for LZ compression)



   Alexander Tiskin (Warwick)      Semi-local string comparison                         115 / 132
Compressed string comparison
Threshold approximate matching


Threshold approx matching (plain pattern, GC text): the algorithm
Algorithm structure similar to local subsequence recognition by matrix
 -multiplication and seaweed -chains
Extra ingredients:
      the blow-up technique: reduction of edit distances to LCS scores
      implicit matrix searching, replaces              -chain interleaving
Monge row minima: “SMAWK” O(m)                                       [Aggarwal+: 1987]
Implicit unit-Monge row minima O(m log log m)                               [T: NEW]
O(m)                                                             [Gawrychowski: NEW]
replaces        -chain interleaving
Overall time O(m log m · n) + O(m¯) = O(m log m · n)
                         ¯       n                ¯

   Alexander Tiskin (Warwick)     Semi-local string comparison                   116 / 132
Compressed string comparison
Threshold approximate matching

 0      ˆ0+ ˆ1+
        ı ı          ˆ2+
                     ı       ˆ3+ ˆ4+
                             ı ı       n
                                       ˜                                  n
                                                                          ˜




                                            0+ 1+ 2+ 3+
                                            ˆ ˆ ˆ ˆ                 4+
                                                                   ˆ


Blow up: weighted alignment on strings p, t of size m, n equivalent to
LCS on strings p , ˜ of size m = νm, n = νn
               ˜ t           ˜       ˜




     Alexander Tiskin (Warwick)            Semi-local string comparison       117 / 132
Compressed string comparison
Threshold approximate matching

 −m
  ˜




   0
       ×     × × × ×                      ×
 ˆ0+
 ı            •
       ×     × × × ×                      ×
 ˆ1+
 ı          •
       ×     × × × ×                      ×
 ˆ2+
 ı                              •
       ×     × × × ×                      ×
 ˆ3+
 ı                                      •
       ×     × × × ×                      ×
 ˆ4+
 ı                      •
  n
  ˜    ×        ×   ×       ×       ×       ×
       n
       ˜   0+ 1+ 2+ 3+
           ˆ ˆ ˆ ˆ                      4+
                                        ˆ       n
                                                ˜                   m+˜
                                                                    ˜ n


   Alexander Tiskin (Warwick)                       Semi-local string comparison   118 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      119 / 132
Beyond semi-locality
Window-local LCS and alignment plots


Given window length w
The window-local LCS problem
Give semi-local LCS score for every w -substring of a vs whole b




   Alexander Tiskin (Warwick)   Semi-local string comparison       120 / 132
Beyond semi-locality
Window-local LCS and alignment plots


Given window length w
The window-local LCS problem
Give semi-local LCS score for every w -substring of a vs whole b

Window-local LCS: running time
O(mn3 w )                                                                    naive
O(mnw )                                                          multiple seaweed
O(mn)                                                          [Krusche, T: 2010]




   Alexander Tiskin (Warwick)   Semi-local string comparison                 120 / 132
Beyond semi-locality
Window-local LCS and alignment plots


Quasi-local LCS: the algorithm
Compute seaweed matrices for canonical substrings of a against b
Build seaweed matrices for windows of a against b recursively
Both stages use matrix          -multiplication, overall time O(mn)




   Alexander Tiskin (Warwick)      Semi-local string comparison       121 / 132
Beyond semi-locality
Window-local LCS and alignment plots


The window-window LCS problem
Give semi-local LCS score for every w -substring of a vs every w -substring b

Provides an alignment plot: loss-free local comparison of genome
sequences, important for studying conservation in the genome




   Alexander Tiskin (Warwick)   Semi-local string comparison            122 / 132
Beyond semi-locality
Window-local LCS and alignment plots


The window-window LCS problem
Give semi-local LCS score for every w -substring of a vs every w -substring b

Provides an alignment plot: loss-free local comparison of genome
sequences, important for studying conservation in the genome
Window-window LCS: running time
O(mnw 2 )                                                                    naive
O(mnw )                                                          multiple seaweed
O(mn)                                                          [Krusche, T: 2010]




   Alexander Tiskin (Warwick)   Semi-local string comparison                 122 / 132
Beyond semi-locality
Window-local LCS and alignment plots


An implementation of alignment plots                           [Krusche: 2010]
Based on the O(mnw ) multiple seaweed combing, using some ideas from
the optimal O(mn) algorithm
Resulting theoretical complexity O(mnw 0.5 )
Alignment weights (match 1, mismatch 0, gap −0.5): ×4 slowdown
C++, Intel assembly (x86, x86 64, MMX/SSE2 data parallelism)
SMP parallelism (currently two processors)




   Alexander Tiskin (Warwick)   Semi-local string comparison             123 / 132
Beyond semi-locality
Window-local LCS and alignment plots


Krusche’s implementation used at Warwick Systems Biology Centre to
study weak conservation in plant genomes (BLAST too inaccurate)
Speedup on one processor:
      ×10 over heavily optimised, bit-parallel naive algorithm
      ×7 over biologists’ heuristics
Speedup on two processors: extra ×2, near-perfect parallelism




   Alexander Tiskin (Warwick)   Semi-local string comparison     124 / 132
Beyond semi-locality
Window-local LCS and alignment plots


Publications:
[Picot+: 2010] Evolutionary Analysis of Regulatory Sequences (EARS) in
Plants. Plant Journal.
[Baxter+: accepted] Conserved Noncoding Sequences Highlight Shared
Components of Regulatory Networks in Dicotyledonous Plants. Plant Cell.
Web service: http://wsbc.warwick.ac.uk/ears
Future work: genomic repeats; genome-scale comparison




   Alexander Tiskin (Warwick)   Semi-local string comparison      125 / 132
Beyond semi-locality
Quasi-local LCS and sparse spliced alignment


Given m (possibly overlapping) prescribed substrings in a

The quasi-local LCS problem
Give semi-local LCS score for every prescribed substring of a vs whole b




   Alexander Tiskin (Warwick)    Semi-local string comparison         126 / 132
Beyond semi-locality
Quasi-local LCS and sparse spliced alignment


Given m (possibly overlapping) prescribed substrings in a

The quasi-local LCS problem
Give semi-local LCS score for every prescribed substring of a vs whole b

Quasi-local LCS: running time
O(m2 n)                                                         multiple seaweed
O(mn log2 m)                                                    [T: unpublished]




   Alexander Tiskin (Warwick)    Semi-local string comparison              126 / 132
Beyond semi-locality
Quasi-local LCS and sparse spliced alignment


Window-local LCS: the algorithm
Compute seaweed matrices for canonical substrings of a against b
Build seaweed matrices for prescribed substrings of a against b recursively
Both stages use matrix          -multiplication, time O(mn log2 m)




   Alexander Tiskin (Warwick)      Semi-local string comparison        127 / 132
Beyond semi-locality
Quasi-local LCS and sparse spliced alignment


The sparse spliced alignment problem
Give the chain of non-overlapping prescribed substrings in a, closest to b
by alignment score

Describes gene assembly: unknown gene from candidate exons, given a
known similar gene




   Alexander Tiskin (Warwick)    Semi-local string comparison          128 / 132
Beyond semi-locality
Quasi-local LCS and sparse spliced alignment


The sparse spliced alignment problem
Give the chain of non-overlapping prescribed substrings in a, closest to b
by alignment score

Describes gene assembly: unknown gene from candidate exons, given a
known similar gene
Assume m = n; let s = total size of prescribed substrings
Sparse spliced alignment: running time
O(ns) = O(n3 )                                                  [Gelfand+: 1996]
O(n2.5 )                                                           [Kent+: 2006]
O(n2 log2 n)                                                     [T: unpublished]
O(n2 log n)                                                         [Sakai: 2009]


   Alexander Tiskin (Warwick)    Semi-local string comparison               128 / 132
1   Introduction                                7    Sparse string comparison

2   Matrix distance multiplication              8    Compressed string comparison

3   Semi-local string comparison                9    Beyond semi-locality

4   The seaweed method                          10   Conclusions and future work

5   Periodic string comparison

6   The transposition network
    method



    Alexander Tiskin (Warwick)   Semi-local string comparison                      129 / 132
Conclusions and future work

Implicit unit-Monge matrices:
     the seaweed monoid
     distance multiplication in time O(n log n)
     next: lower bound?
Semi-local LCS problem:
     representation by implicit unit-Monge matrices
     generalisation to rational alignment scores
     next: real alignment scores?
Seaweed combing and micro-block speedup:
     a simple algorithm for semi-local LCS
     semi-local LCS in time o(mn)
     improvements on related problems
  Alexander Tiskin (Warwick)   Semi-local string comparison   130 / 132
Conclusions and future work

Transposition networks:
     simple interpretation of high-similarity and dissimilarity LCS
     dynamic LCS
     simple interpretation of bit-parallel LCS
Sparse string comparison:
     fast LCS on permutation strings
     fast max-clique in a circle graph
     next: lower bound? or even faster max-clique
Sparse string comparison:
     semi-local LCS on permutations in time O(n log2 n)
     maximum clique in a circle graph in time O(n log2 n)
     improvements on related problems
  Alexander Tiskin (Warwick)   Semi-local string comparison           131 / 132
Conclusions and future work

Compressed string comparison:
     three-way semi-local LCS on GC text against plain pattern
     subsequence recognition in GC-strings
     next: approximate matching in GC-strings
Beyond semi-locality:
     quasi-local string comparison
     sparse spliced alignment
     next: full locality




  Alexander Tiskin (Warwick)    Semi-local string comparison     132 / 132

20121020 semi local-string_comparison_tiskin

  • 1.
    Semi-local string comparison: Algorithmic techniques and applications Alexander Tiskin Department of Computer Science University of Warwick http://go.warwick.ac.uk/alextiskin Alexander Tiskin (Warwick) Semi-local string comparison 1 / 132
  • 2.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 2 / 132
  • 3.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 3 / 132
  • 4.
    Introduction String matching: findingan exact pattern in a string String comparison: finding similar patterns in two strings Applications: computational biology, image recognition, . . . Alexander Tiskin (Warwick) Semi-local string comparison 4 / 132
  • 5.
    Introduction String matching: findingan exact pattern in a string String comparison: finding similar patterns in two strings Applications: computational biology, image recognition, . . . Standard types of string comparison: global: whole string vs whole string local: substrings vs substrings Main focus of this work: semi-local: whole string vs substrings; prefixes vs suffixes Closely related to approximate string matching (no relation to approximation algorithms!) Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices) Alexander Tiskin (Warwick) Semi-local string comparison 4 / 132
  • 6.
    Introduction Terminology and notation x−= x − 1 2 x+ = x + 1 2 Integers: {. . . − 2, −1, 0, 1, 2, . . .} Half-integers: . . . − 3 , − 1 , 1 , 2 , 2 , . . . = . . . (−2)+ , (−1)+ , 0+ , 1+ , 2+ 2 2 2 3 5 (i, j) (i , j ) iff i < i and j < j (i, j) (i , j ) iff i > i and j < j A permutation matrix is a 0/1 matrix with exactly one nonzero per row and per column   0 1 0 1 0 0 0 0 1 Alexander Tiskin (Warwick) Semi-local string comparison 5 / 132
  • 7.
    Introduction Terminology and notation Givenmatrix D, its distribution matrix is made up of -dominance sums: Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132
  • 8.
    Introduction Terminology and notation Givenmatrix D, its distribution matrix is made up of -dominance sums: Given matrix E , its density matrix is made up of quadrangle differences: E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − ) ı ˆ ı ˆ ı ˆ ı ˆ ı ˆ where D Σ , E over integers; D, E over half-integers Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132
  • 9.
    Introduction Terminology and notation Givenmatrix D, its distribution matrix is made up of -dominance sums: Given matrix E , its density matrix is made up of quadrangle differences: E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − ) ı ˆ ı ˆ ı ˆ ı ˆ ı ˆ where D Σ , E over integers; D, E over half-integers      Σ 0 1 2 3 0 1 2 3   0 1 0 0 1 1 2 0 1 1 2 0 1 0 1 0 0 =  0 0 0 1 = 1 0 0      0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132
  • 10.
    Introduction Terminology and notation Givenmatrix D, its distribution matrix is made up of -dominance sums: Given matrix E , its density matrix is made up of quadrangle differences: E (ˆ, ) = E (ˆ− , + ) − E (ˆ− , − ) − E (ˆ+ , + ) + E (ˆ+ , − ) ı ˆ ı ˆ ı ˆ ı ˆ ı ˆ where D Σ , E over integers; D, E over half-integers      Σ 0 1 2 3 0 1 2 3   0 1 0 0 1 1 2 0 1 1 2 0 1 0 1 0 0 =  0 0 0 1 = 1 0 0      0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 (D Σ ) = D for all D Matrix E is simple, if (E )Σ = E ; equivalently, if it has all zeros in the left column and bottom row Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132
  • 11.
    Introduction Terminology and notation MatrixE is Monge, if E is nonnegative Intuition: boundary-to-boundary distances in a (weighted) planar graph Matrix E is unit-Monge, if E is a permutation matrix Intuition: boundary-to-boundary distances in a grid-like graph Alexander Tiskin (Warwick) Semi-local string comparison 7 / 132
  • 12.
    Introduction Terminology and notation MatrixE is Monge, if E is nonnegative Intuition: boundary-to-boundary distances in a (weighted) planar graph Matrix E is unit-Monge, if E is a permutation matrix Intuition: boundary-to-boundary distances in a grid-like graph Simple unit-Monge matrix: P Σ , where P is a permutation matrix Seaweed matrix: P used as an implicit representation of P Σ    Σ 0 1 2 3 0 1 0 0 1 0 0 =  1 1 2  0 0 0 1 0 0 1 0 0 0 0 Alexander Tiskin (Warwick) Semi-local string comparison 7 / 132
  • 13.
    Introduction Implicit unit-Monge matrices EfficientP Σ queries: range tree on nonzeros of P [Bentley: 1980] binary search tree by i-coordinate under every node, binary search tree by j-coordinate • • • • −→ • −→ • • • • • • • ↓ • • • • −→ • −→ • • • • • • • ↓ • • • • −→ • −→ • • • • • • • Alexander Tiskin (Warwick) Semi-local string comparison 8 / 132
  • 14.
    Introduction Implicit unit-Monge matrices EfficientP Σ queries: (contd.) Every node of the range tree represents a canonical range (rectangular region), and stores its nonzero count Overall, ≤ n log n canonical ranges are non-empty A P Σ query is equivalent to -dominance counting: how many nonzeros are -dominated by query point? Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges Total size O(n log n), query time O(log2 n) Alexander Tiskin (Warwick) Semi-local string comparison 9 / 132
  • 15.
    Introduction Implicit unit-Monge matrices EfficientP Σ queries: (contd.) Every node of the range tree represents a canonical range (rectangular region), and stores its nonzero count Overall, ≤ n log n canonical ranges are non-empty A P Σ query is equivalent to -dominance counting: how many nonzeros are -dominated by query point? Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges Total size O(n log n), query time O(log2 n) There are asymptotically more efficient (but less practical) data structures log n Total size O(n), query time O log log n [J´J´+: 2004] a a [Chan, Pˇtra¸cu: 2010] a s Alexander Tiskin (Warwick) Semi-local string comparison 9 / 132
  • 16.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 10 / 132
  • 17.
    Matrix distance multiplication Seaweedbraids Distance algebra (a.k.a (min, +) or tropical algebra): addition ⊕ given by min multiplication given by + Matrix -multiplication A B=C C (i, k) = j A(i, j) B(j, k) = minj A(i, j) + B(j, k) Matrix classes closed under -multiplication (for given n): general numerical (integer, real) matrices Monge matrices simple unit-Monge matrices (!) Alexander Tiskin (Warwick) Semi-local string comparison 11 / 132
  • 18.
    Matrix distance multiplication Seaweedbraids Recall that simple unit-Monge matrices are represented implicitly by permutation (seaweed) matrices Define PA Σ PB = PC as PA Σ Σ PB = PC The seaweed monoid Tn : simple unit-Monge matrices under equivalently, permutation (seaweed) matrices under Also known as the 0-Hecke monoid of the symmetric group H0 (Sn ) Alexander Tiskin (Warwick) Semi-local string comparison 12 / 132
  • 19.
    Matrix distance multiplication Seaweedbraids PA PB = PC can be seen as combing of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PC Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132
  • 20.
    Matrix distance multiplication Seaweedbraids PA PB = PC can be seen as combing of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PC PA PB Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132
  • 21.
    Matrix distance multiplication Seaweedbraids PA PB = PC can be seen as combing of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PC PA PB Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132
  • 22.
    Matrix distance multiplication Seaweedbraids PA PB = PC can be seen as combing of seaweed braids • • • • • • • • • • • • • • • • • • PA PB PC PA PC PB Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132
  • 23.
    Matrix distance multiplication Seaweedbraids The seaweed monoid Tn : n! elements (permutations of size n) n − 1 generators g1 , g2 , . . . , gn−1 (elementary crossings) Idempotence: gi2 = gi for all i = Far commutativity: gi gj = gj gi j − i > 1 ··· = ··· Braid relations: gi gj gi = gj gi gj j − i = 1 = Alexander Tiskin (Warwick) Semi-local string comparison 14 / 132
  • 24.
    Matrix distance multiplication Seaweedbraids Identity: 1 x =x   • · · · · • · · · · • · = 1=  · · · • Zero: 0 x =0   · · · • · · • · 0= · • · · =  • · · · Alexander Tiskin (Warwick) Semi-local string comparison 15 / 132
  • 25.
    Matrix distance multiplication Seaweedbraids Related structures: positive braids: far comm; braid relations braids: gi gi−1 = 1; far comm; braid relations Coxeter’s presentation of Sn : gi2 = 1; far comm; braid relations locally free idempotent monoid: idem; far comm [Vershik+: 2000] Generalisations: general 0-Hecke monoids [Fomin, Greene: 1998; Buch+: 2008] Coxeter monoids [Tsaranov: 1990; Richardson, Springer: 1990] J -trivial monoids [Denton+: 2011] Alexander Tiskin (Warwick) Semi-local string comparison 16 / 132
  • 26.
    Matrix distance multiplication Seaweedbraids Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP) Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132
  • 27.
    Matrix distance multiplication Seaweedbraids Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP) T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0 aa → a bb → b bab → 0 aba → 0 Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132
  • 28.
    Matrix distance multiplication Seaweedbraids Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP) T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0 aa → a bb → b bab → 0 aba → 0 T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac, bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0 aa → a ca → ac bab → aba cbac → bcba bb → b cc → c cbc → bcb abacba → 0 Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132
  • 29.
    Matrix distance multiplication Seaweedbraids Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP) T3 : 1, a = g1 , b = g2 ; ab, ba, aba = 0 aa → a bb → b bab → 0 aba → 0 T4 : 1, a = g1 , b = g2 , c = g3 ; ab, ac, ba, bc, cb, aba, abc, acb, bac, bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0 aa → a ca → ac bab → aba cbac → bcba bb → b cc → c cbc → bcb abacba → 0 Easy to use, but not an efficient algorithm Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132
  • 30.
    Matrix distance multiplication Seaweedmatrix multiplication The implicit unit-Monge matrix -multiplication problem Given permutation matrices PA , PB , compute PC , such that Σ Σ Σ PA PB = PC (equivalently, PA PB = PC ) Alexander Tiskin (Warwick) Semi-local string comparison 18 / 132
  • 31.
    Matrix distance multiplication Seaweedmatrix multiplication The implicit unit-Monge matrix -multiplication problem Given permutation matrices PA , PB , compute PC , such that Σ Σ Σ PA PB = PC (equivalently, PA PB = PC ) Matrix -multiplication: running time type time general O(n3 ) standard 3 3 O n (log log n) log2 n [Chan: 2007] Monge O(n2 ) via [Aggarwal+: 1987] implicit unit-Monge O(n1.5 ) [T: 2006] O(n log n) [T: 2010] Alexander Tiskin (Warwick) Semi-local string comparison 18 / 132
  • 32.
    Matrix distance multiplication Seaweedmatrix multiplication PB •• • •• •• • • • • • • • •• • • • • •• • • • • • • • •• •• • •• ? • • • • PA PC Alexander Tiskin (Warwick) Semi-local string comparison 19 / 132
  • 33.
    Matrix distance multiplication Seaweedmatrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • • • • • • • •• •• •• • • • • • PA,lo , PA,hi Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132
  • 34.
    Matrix distance multiplication Seaweedmatrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • •• •• •• • • • • • • • • • • PA,lo , PA,hi Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132
  • 35.
    Matrix distance multiplication Seaweedmatrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • • • • • • • • • • • • •• • •• •• • • • • • • • • • PA,lo , PA,hi Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132
  • 36.
    Matrix distance multiplication Seaweedmatrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • • • • •• • • • •• •• • • • • • • • • • • •• • • PA,lo , PA,hi PC ,lo + PC ,hi Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132
  • 37.
    Matrix distance multiplication Seaweedmatrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • • • • •• • • • •• •• • • • • • • • • • • •• • • PA,lo , PA,hi PC ,lo + PC ,hi Alexander Tiskin (Warwick) Semi-local string comparison 21 / 132
  • 38.
    Matrix distance multiplication Seaweedmatrix multiplication PB,lo , PB,hi •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • •• • • • •• • •• •• • • • • • • • • • • •• • • PA,lo , PA,hi PC Alexander Tiskin (Warwick) Semi-local string comparison 21 / 132
  • 39.
    Matrix distance multiplication Seaweedmatrix multiplication PB •• • •• •• • • • • • • • •• • • • • •• • •• • • • • • • • • • •• • • • •• • •• •• • • • • • • • • • • •• • • PA PC Alexander Tiskin (Warwick) Semi-local string comparison 22 / 132
  • 40.
    Matrix distance multiplication Seaweedmatrix multiplication Implicit unit-Monge matrix -multiplication: the algorithm Σ Σ Σ PC (i, k) = minj PA (i, j) + PB (j, k) Divide-and-conquer on the range of j Divide PA horizontally, PB vertically: two subproblems of effective size n/2 Σ PA,lo Σ Σ PB,lo = PC ,lo Σ PA,hi Σ Σ PB,hi = PC ,hi Conquer: -low nonzeros of PC ,lo and -high nonzeros of PC ,hi appear in PC The remaining nonzeros of PC ,lo and PC ,hi are “wrong”, and need to be corrected to obtain the remaining nonzeros of PC Correction can be done in time O(n) using the unit-Monge property Overall time O(n log n) Alexander Tiskin (Warwick) Semi-local string comparison 23 / 132
  • 41.
    Matrix distance multiplication Bruhatorder Comparing permutations by the “degree of sortedness” Bruhat order Permutation A is lower (“more sorted”) than permutation B in the Bruhat order (A B), if B can be transformed to A by successive pairwise sorting between arbitrary pairs of elements. Permutation matrices: PA PB , if PB can be transformed to PA by successive submatrix substitution: ( 0 1 ) 10 (1 0) 01 Alexander Tiskin (Warwick) Semi-local string comparison 24 / 132
  • 42.
    Matrix distance multiplication Bruhatorder Bruhat comparability: running time O(n2 ) folklore O(n log n) [T: NEW] PA PB iff PA ≤ PB elementwise, time O(n2 ) Σ Σ folklore R R PA PB iff PA PB = Id , time O(n log n) [T: NEW] where P R denotes clockwise rotation of matrix P Alexander Tiskin (Warwick) Semi-local string comparison 25 / 132
  • 43.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 26 / 132
  • 44.
    Semi-local string comparison Semi-localLCS and edit distance Consider strings (= sequences) over an alphabet of size σ Distinguish contiguous substrings and not necessarily contiguous subsequences Special cases of substring: prefix, suffix Notation: strings a, b of length m, n respectively Assume where necessary: m ≤ n; m, n reasonably close Alexander Tiskin (Warwick) Semi-local string comparison 27 / 132
  • 45.
    Semi-local string comparison Semi-localLCS and edit distance Consider strings (= sequences) over an alphabet of size σ Distinguish contiguous substrings and not necessarily contiguous subsequences Special cases of substring: prefix, suffix Notation: strings a, b of length m, n respectively Assume where necessary: m ≤ n; m, n reasonably close The longest common subsequence (LCS) score: length of longest string that is a subsequence of both a and b equivalently, alignment score, where score(match) = 1 and score(mismatch) = 0 In biological terms, “loss-free alignment” (unlike “lossy” BLAST) Alexander Tiskin (Warwick) Semi-local string comparison 27 / 132
  • 46.
    Semi-local string comparison Semi-localLCS and edit distance The LCS problem Give the LCS score for a vs b Alexander Tiskin (Warwick) Semi-local string comparison 28 / 132
  • 47.
    Semi-local string comparison Semi-localLCS and edit distance The LCS problem Give the LCS score for a vs b LCS: running time O(mn) [Wagner, Fischer: 1974] mn O log2 n σ = O(1) [Masek, Paterson: 1980] [Crochemore+: 2003] mn(log log n)2 O log2 n [Paterson, Danˇ´cık: 1994] [Bille, Farach-Colton: 2008] Running time varies depending on the RAM model version We assume word-RAM with word size log n (where it matters) Alexander Tiskin (Warwick) Semi-local string comparison 28 / 132
  • 48.
    Semi-local string comparison Semi-localLCS and edit distance LCS on the alignment graph (directed, acyclic) B A A B C A B C A B A C A blue = 0 B red = 1 A A B C B C A score(“BAABCBCA”, “BAABCABCABACA”) = len(“BAABCBCA”) = 8 LCS = highest-score path from top-left to bottom-right Alexander Tiskin (Warwick) Semi-local string comparison 29 / 132
  • 49.
    Semi-local string comparison Semi-localLCS and edit distance LCS: dynamic programming [WF: 1974] Sweep cells in any -compatible order Cell update: time O(1) Overall time O(mn) Alexander Tiskin (Warwick) Semi-local string comparison 30 / 132
  • 50.
    Semi-local string comparison Semi-localLCS and edit distance ‘Begin at the beginning,’ the King said gravely, ‘and go on till you come to the end: then stop.’ L. Carroll, Alice in Wonderland (The standard approach in dynamic programming) Alexander Tiskin (Warwick) Semi-local string comparison 31 / 132
  • 51.
    Semi-local string comparison Semi-localLCS and edit distance Sometimes dynamic programming can be run from both ends for extra flexibility Alexander Tiskin (Warwick) Semi-local string comparison 32 / 132
  • 52.
    Semi-local string comparison Semi-localLCS and edit distance Sometimes dynamic programming can be run from both ends for extra flexibility Is there a better, fully flexible alternative (e.g. for comparing compressed strings, comparing strings dynamically or in parallel, etc.)? Alexander Tiskin (Warwick) Semi-local string comparison 32 / 132
  • 53.
    Semi-local string comparison Semi-localLCS and edit distance LCS: micro-block dynamic programming [MP: 1980; BF: 2008] Sweep cells in micro-blocks, in any -compatible order Micro-block size: t = O(log n) when σ = O(1) log n t=O log log n otherwise Micro-block interface: O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits O(t) small integers, each O(1) bits Micro-block update: time O(1), by precomputing all possible interfaces mn mn(log log n)2 Overall time O log2 n when σ = O(1), O log2 n otherwise Alexander Tiskin (Warwick) Semi-local string comparison 33 / 132
  • 54.
    Semi-local string comparison Semi-localLCS and edit distance The semi-local LCS problem Give the (implicit) matrix of O (m + n)2 LCS scores: string-substring LCS: string a vs every substring of b prefix-suffix LCS: every prefix of a vs every suffix of b suffix-prefix LCS: every suffix of a vs every prefix of b substring-string LCS: every substring of a vs string b Alexander Tiskin (Warwick) Semi-local string comparison 34 / 132
  • 55.
    Semi-local string comparison Semi-localLCS and edit distance The semi-local LCS problem Give the (implicit) matrix of O (m + n)2 LCS scores: string-substring LCS: string a vs every substring of b prefix-suffix LCS: every prefix of a vs every suffix of b suffix-prefix LCS: every suffix of a vs every prefix of b substring-string LCS: every substring of a vs string b Cf.: dynamic programming gives prefix-prefix LCS Alexander Tiskin (Warwick) Semi-local string comparison 34 / 132
  • 56.
    Semi-local string comparison Semi-localLCS and edit distance Semi-local LCS on the alignment graph B A A B C A B C A B A C A blue = 0 B red = 1 A A B C B C A score(“BAABCBCA”, “CABCABA”) = len(“ABCBA”) = 5 String-substring LCS: all highest-score top-to-bottom paths Semi-local LCS: all highest-score boundary-to-boundary paths Alexander Tiskin (Warwick) Semi-local string comparison 35 / 132
  • 57.
    Semi-local string comparison Scorematrices and seaweed matrices The score matrix H 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “BAABCBCA” -1 0 1 2 3 4 5 5 6 7 7 7 7 7 -2 -1 0 1 2 3 4 4 5 6 6 6 6 7 b = “BAABCABCABACA” -3 -2 -1 0 1 2 3 3 4 5 5 6 6 7 H(i, j) = score(a, b i : j ) -4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6 H(4, 11) = 5 -5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j -7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 Alexander Tiskin (Warwick) Semi-local string comparison 36 / 132
  • 58.
    Semi-local string comparison Scorematrices and seaweed matrices Semi-local LCS: output representation and running time size query time O(n2 ) O(1) trivial O(m1/2 n) O(log n) string-substring [Alves+: 2003] O(n) O(n) string-substring [Alves+: 2005] O(n log n) O(log2 n) [T: 2006] . . . or any 2D orthogonal range counting data structure running time O(mn2 ) naive O(mn) string-substring [Schmidt: 1998; Alves+: 2005] O(mn) [T: 2006] mn O log0.5 n [T: 2006] mn(log log n)2 O log2 n [T: 2007] Alexander Tiskin (Warwick) Semi-local string comparison 37 / 132
  • 59.
    Semi-local string comparison Scorematrices and seaweed matrices The score matrix H and the seaweed matrix P H(i, j): the number of matched characters for a vs substring b i : j j − i − H(i, j): the number of unmatched characters Properties of matrix j − i − H(i, j): simple unit-Monge therefore, = P Σ , where P = −H is a permutation matrix P is the seaweed matrix, giving an implicit representation of H Range tree for P: memory O(n log n), query time O(log2 n) Alexander Tiskin (Warwick) Semi-local string comparison 38 / 132
  • 60.
    Semi-local string comparison Scorematrices and seaweed matrices The score matrix H and the seaweed matrix P 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “BAABCBCA” -1 0 1 2 3 4 5 5 6 7 7 7 7 7 • b = “BAABCABCABACA” -2 -1 0 1 2 3 4 4 5 6 6 6 6 7 • -3 -2 -1 0 1 2 3 3 4 5 5 6 6 7 H(i, j) = score(a, b i : j ) -4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6 • H(4, 11) = 5 -5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j -7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4 • -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 • -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 Alexander Tiskin (Warwick) Semi-local string comparison 39 / 132
  • 61.
    Semi-local string comparison Scorematrices and seaweed matrices The score matrix H and the seaweed matrix P 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “BAABCBCA” -1 0 1 2 3 4 5 5 6 7 7 7 7 7 • b = “BAABCABCABACA” -2 -1 0 1 2 3 4 4 5 6 6 6 6 7 • -3 -2 -1 0 1 2 3 3 4 5 5 6 6 7 H(i, j) = score(a, b i : j ) -4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6 • H(4, 11) = 5 -5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j -7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4 • blue: difference in H is 0 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 • red: difference in H is 1 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 Alexander Tiskin (Warwick) Semi-local string comparison 39 / 132
  • 62.
    Semi-local string comparison Scorematrices and seaweed matrices The score matrix H and the seaweed matrix P 0 1 2 3 4 5 6 6 7 8 8 8 8 8 a = “BAABCBCA” -1 0 1 2 3 4 5 5 6 7 7 7 7 7 • b = “BAABCABCABACA” -2 -1 0 1 2 3 4 4 5 6 6 6 6 7 • -3 -2 -1 0 1 2 3 3 4 5 5 6 6 7 H(i, j) = score(a, b i : j ) -4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6 • H(4, 11) = 5 -5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5 H(i, j) = j − i if i > j -7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4 • blue: difference in H is 0 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4 • red: difference in H is 1 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 green: P(i, j) = 1 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 H(i, j) = j − i − P Σ (i, j) -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 Alexander Tiskin (Warwick) Semi-local string comparison 39 / 132
  • 63.
    Semi-local string comparison Scorematrices and seaweed matrices The score matrix H and the seaweed matrix P a = “BAABCBCA” • b = “BAABCABCABACA” • H(4, 11) = • 11 − 4 − P Σ (i, j) = 11 − 4 − 2 = 5 • • Alexander Tiskin (Warwick) Semi-local string comparison 40 / 132
  • 64.
    Semi-local string comparison Scorematrices and seaweed matrices The seaweed braid in the alignment graph B A A B C A B C A B A C A a = “BAABCBCA” B A b = “BAABCABCABACA” A H(4, 11) = B 11 − 4 − P Σ (i, j) = C 11 − 4 − 2 = 5 B C A P(i, j) = 1 corresponds to seaweed top i bottom j Alexander Tiskin (Warwick) Semi-local string comparison 41 / 132
  • 65.
    Semi-local string comparison Scorematrices and seaweed matrices The seaweed braid in the alignment graph B A A B C A B C A B A C A a = “BAABCBCA” B A b = “BAABCABCABACA” A H(4, 11) = B 11 − 4 − P Σ (i, j) = C 11 − 4 − 2 = 5 B C A P(i, j) = 1 corresponds to seaweed top i bottom j Also define top right, left right, left bottom seaweeds Gives bijection between top-left and bottom-right graph boundaries Alexander Tiskin (Warwick) Semi-local string comparison 41 / 132
  • 66.
    Semi-local string comparison Scorematrices and seaweed matrices Seaweed braid: a highly symmetric object (element of the 0-Hecke monoid of the symmetric group) Can be built recursively by assembling subbraids from separate parts Highly flexible: local alignment, compression, parallel computation. . . Alexander Tiskin (Warwick) Semi-local string comparison 42 / 132
  • 67.
    Semi-local string comparison Weightedalignment The LCS problem is a special case of the weighted alignment score problem with weighted matches (wM ), mismatches (wX ) and gaps (wG ) LCS score: wM = 1, wX = wG = 0 Levenshtein score: wM = 2, wX = 1, wG = 0 Alexander Tiskin (Warwick) Semi-local string comparison 43 / 132
  • 68.
    Semi-local string comparison Weightedalignment The LCS problem is a special case of the weighted alignment score problem with weighted matches (wM ), mismatches (wX ) and gaps (wG ) LCS score: wM = 1, wX = wG = 0 Levenshtein score: wM = 2, wX = 1, wG = 0 Alignment score is rational, if wM , wX , wG are rational numbers Equivalent to LCS score on blown-up strings Alexander Tiskin (Warwick) Semi-local string comparison 43 / 132
  • 69.
    Semi-local string comparison Weightedalignment The LCS problem is a special case of the weighted alignment score problem with weighted matches (wM ), mismatches (wX ) and gaps (wG ) LCS score: wM = 1, wX = wG = 0 Levenshtein score: wM = 2, wX = 1, wG = 0 Alignment score is rational, if wM , wX , wG are rational numbers Equivalent to LCS score on blown-up strings Edit distance: minimum cost to transform a into b by weighted character edits (insertion, deletion, substitution) Corresponds to weighted alignment score with wM = 0, insertion/deletion weight −wG , substitution weight −wX Alexander Tiskin (Warwick) Semi-local string comparison 43 / 132
  • 70.
    Semi-local string comparison Weightedalignment Weighted alignment graph B A A B C A B C A B A C A blue = 0 B red (solid) = 2 A red (dotted) = 1 A B C B C A Levenshtein score(“BAABCBCA”, “CABCABA”) = 11 Alexander Tiskin (Warwick) Semi-local string comparison 44 / 132
  • 71.
    Semi-local string comparison Weightedalignment Alignment graph for blown-up strings $B $A $A $B $C $A $B $C $A $B $A $C $A blue = 0 $B red = 0.5 or 1 $A $A $B $C $B $C $A Levenshtein score(“BAABCBCA”, “CABCABA”) = 2 · 5.5 Alexander Tiskin (Warwick) Semi-local string comparison 45 / 132
  • 72.
    Semi-local string comparison Weightedalignment Rational-weighted semi-local alignment reduced to semi-local LCS $B $A $A $B $C $A $B $C $A $B $A $C $A $B $A $A $B $C $B $C $A Let wM = 1, wX = µ , wG = 0 ν Increase × ν 2 in complexity (can be reduced to ν) Alexander Tiskin (Warwick) Semi-local string comparison 46 / 132
  • 73.
    Alexander Tiskin (Warwick) Semi-local string comparison 47 / 132
  • 74.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 48 / 132
  • 75.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 49 / 132
  • 76.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 77.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 78.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 79.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 80.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 81.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 82.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 83.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 84.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 85.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 86.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 87.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 88.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 89.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 90.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 91.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 92.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 93.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 94.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 95.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 96.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 97.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 98.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 99.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 100.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 101.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 102.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 103.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 104.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 105.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 106.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 107.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 108.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 109.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 110.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 111.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 112.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132
  • 113.
    The seaweed method Seaweedcombing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 51 / 132
  • 114.
    The seaweed method Seaweedcombing Semi-local LCS: seaweed combing [T: 2006] Initialise seaweed braid: crossings in all mismatch cells Sweep cells in any -compatible order Match cell: two seaweeds uncrossed; skip Mismatch cell: two seaweeds cross if the same seaweeds crossed before, uncross them otherwise skip, keep seaweeds crossed Cell update: time O(1) Overall time O(mn) Alexander Tiskin (Warwick) Semi-local string comparison 52 / 132
  • 115.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 53 / 132
  • 116.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 117.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 118.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 119.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 120.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 121.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 122.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 123.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 124.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 125.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 126.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 127.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 128.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132
  • 129.
    The seaweed method Micro-blockseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 55 / 132
  • 130.
    The seaweed method Micro-blockseaweed combing Semi-local LCS: micro-block seaweed combing [T: 2007] Initialise seaweed braid: crossings in all mismatch cells Sweep cells in micro-blocks, in any -compatible order log n Micro-block size: t = O log log n Micro-block interface: O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits O(t) integers, each O(log n) bits, can be reduced to O(log t) bits Micro-block update: time O(1), by precomputing all possible interfaces mn(log log n)2 Overall time O log2 n Alexander Tiskin (Warwick) Semi-local string comparison 56 / 132
  • 131.
    The seaweed method CyclicLCS The cyclic LCS problem Give the maximum LCS score for a vs all cyclic rotations of b Alexander Tiskin (Warwick) Semi-local string comparison 57 / 132
  • 132.
    The seaweed method CyclicLCS The cyclic LCS problem Give the maximum LCS score for a vs all cyclic rotations of b Cyclic LCS: running time mn 2 O log n naive O(mn log m) [Maes: 1990] O(mn) [Bunke, B¨hler: 1993; Landau+: 1998; Schmidt: 1998] u 2 O mn(log 2log n) log n [T: 2007] Alexander Tiskin (Warwick) Semi-local string comparison 57 / 132
  • 133.
    The seaweed method CyclicLCS Cyclic LCS: the algorithm mn(log log n)2 Micro-block seaweed combing on a vs bb, time O log2 n Make n string-substring LCS queries, time negligible Alexander Tiskin (Warwick) Semi-local string comparison 58 / 132
  • 134.
    The seaweed method Longestrepeating subsequence The longest repeating subsequence problem Find the longest subsequence of a that is a square (a repetition of two identical strings) Alexander Tiskin (Warwick) Semi-local string comparison 59 / 132
  • 135.
    The seaweed method Longestrepeating subsequence The longest repeating subsequence problem Find the longest subsequence of a that is a square (a repetition of two identical strings) Longest repeating subsequence: running time O(m3 ) naive O(m2 ) [Kosowski: 2004] 2 log 2 O m (log 2 m m) log [T: 2007] Alexander Tiskin (Warwick) Semi-local string comparison 59 / 132
  • 136.
    The seaweed method Longestrepeating subsequence Longest repeating subsequence: the algorithm m2 (log log m)2 Micro-block seaweed combing on a vs a, time O log2 m Make m − 1 suffix-prefix LCS queries, time negligible Alexander Tiskin (Warwick) Semi-local string comparison 60 / 132
  • 137.
    The seaweed method Approximatematching The approximate pattern matching problem Give the substring closest to a by alignment score, starting at each position in b Assume rational alignment score Approximate pattern matching: running time O(mn) [Sellers: 1980] mn O log n σ = O(1) via [Masek, Paterson: 1980] mn(log log n)2 O log2 n via [Bille, Farach-Colton: 2008] Alexander Tiskin (Warwick) Semi-local string comparison 61 / 132
  • 138.
    The seaweed method Approximatematching Approximate pattern matching: the algorithm Micro-block seaweed combing on a vs b (with blow-up), time 2 O mn(log 2log n) log n The implicit semi-local edit score matrix: an anti-Monge matrix approximate pattern matching ∼ row minima Row minima in O(n) element queries [Aggarwal+: 1987] Each query in time O(log2 n) using the range tree representation, combined query time negligible mn(log log n)2 Overall running time O log2 n , same as [Bille, Farach-Colton: 2008] Alexander Tiskin (Warwick) Semi-local string comparison 62 / 132
  • 139.
    Alexander Tiskin (Warwick) Semi-local string comparison 63 / 132
  • 140.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 64 / 132
  • 141.
    Periodic string comparison Wraparoundseaweed combing The periodic string-substring LCS problem Give (implicit) LCS scores for a vs each substring of b = . . . uuu . . . = u ±∞ Let u be of length p May assume that every character of a occurs in u Only substrings of b of length at most mp (otherwise LCS score is m) Alexander Tiskin (Warwick) Semi-local string comparison 65 / 132
  • 142.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 66 / 132
  • 143.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 144.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 145.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 146.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 147.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 148.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 149.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 150.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 151.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 152.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 153.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 154.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 155.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 156.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 157.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 158.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 159.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 160.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 161.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 162.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 163.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 164.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 165.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 166.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 167.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 168.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 169.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 170.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 171.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 172.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 173.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 174.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 175.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 176.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 177.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 178.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 179.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 180.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 181.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 182.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132
  • 183.
    Periodic string comparison Wraparoundseaweed combing B A A B C A B C A B A C A B A A B C B C A Alexander Tiskin (Warwick) Semi-local string comparison 68 / 132
  • 184.
    Periodic string comparison Wraparoundseaweed combing Periodic string-substring LCS: Wraparound seaweed combing Initialise seaweed braid: crossings in all mismatch cells Sweep cells row-by-row: each row starts at match cell, wraps at boundary Match cell: two seaweeds uncrossed; skip Mismatch cell: two seaweeds cross if the same seaweeds crossed before (with wrapping), uncross them otherwise skip, keep seaweeds crossed Cell update: time O(1) Overall time O(mn) String-substring LCS score: count seaweeds with multiplicities Alexander Tiskin (Warwick) Semi-local string comparison 69 / 132
  • 185.
    Periodic string comparison Wraparoundseaweed combing The tandem LCS problem Give LCS score for a vs b = u k We have n = kp; may assume k ≤ m Tandem LCS: running time O(mkp) naive O m(k + p) [Landau, Ziv-Ukelson: 2001] O(mp) [T: 2009] Direct application of wraparound seaweed combing Alexander Tiskin (Warwick) Semi-local string comparison 70 / 132
  • 186.
    Periodic string comparison Wraparoundseaweed combing The tandem alignment problem Give the substring closest to a by alignment score among certain substrings of b = u ±∞ : global: substrings u k of length kp across all k cyclic: substrings of length kp across all k local: substrings of any length Tandem alignment: running time O(m2 p) all naive O(mp) global [Myers, Miller: 1989] O(mp log p) cyclic [Benson: 2005] O(mp) cyclic [T: 2009] O(mp) local [Myers, Miller: 1989] Alexander Tiskin (Warwick) Semi-local string comparison 71 / 132
  • 187.
    Periodic string comparison Wraparoundseaweed combing Cyclic tandem alignment: the algorithm Periodic seaweed combing for a vs b (with blow-up), time O(mp) For each k ∈ [1 : m]: solve tandem LCS (under given alignment score) for a vs u k obtain scores for a vs p successive substrings of b of length kp by LCS batch query: time O(1) per substring Running time O(mp) Alexander Tiskin (Warwick) Semi-local string comparison 72 / 132
  • 188.
    Alexander Tiskin (Warwick) Semi-local string comparison 73 / 132
  • 189.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 74 / 132
  • 190.
    The transposition networkmethod Transposition networks Comparison network: a circuit of comparators A comparator sorts two inputs and outputs them in prescribed order Comparison networks traditionally used for non-branching merging/sorting Classical comparison networks # comparators merging O(n log n) [Batcher: 1968] sorting O(n log2 n) [Batcher: 1968] O(n log n) [Ajtai+: 1983] Alexander Tiskin (Warwick) Semi-local string comparison 75 / 132
  • 191.
    The transposition networkmethod Transposition networks Comparison network: a circuit of comparators A comparator sorts two inputs and outputs them in prescribed order Comparison networks traditionally used for non-branching merging/sorting Classical comparison networks # comparators merging O(n log n) [Batcher: 1968] sorting O(n log2 n) [Batcher: 1968] O(n log n) [Ajtai+: 1983] Comparison networks are visualised by wire diagrams Transposition network: all comparisons are between adjacent wires Alexander Tiskin (Warwick) Semi-local string comparison 75 / 132
  • 192.
    The transposition networkmethod Transposition networks Seaweed combing as a transposition network −7 −5 −3 −1 A B C A +1 A +3 +5 C +7 B −7 C −1 +3 −3 −5 +7 +5 +1 Character mismatches correspond to comparators Inputs anti-sorted (sorted in reverse); each value traces a seaweed Alexander Tiskin (Warwick) Semi-local string comparison 76 / 132
  • 193.
    The transposition networkmethod Transposition networks Global LCS: transposition network with binary input 0 0 0 0 A B C A 1 A 1 1 C 1 B 0 0 C 1 0 0 1 1 1 Inputs still anti-sorted, but may not be distinct Comparison between equal values is indeterminate Alexander Tiskin (Warwick) Semi-local string comparison 77 / 132
  • 194.
    The transposition networkmethod Parameterised string comparison Parameterised string comparison String comparison sensitive e.g. to low similarity: small λ = LCS(a, b) high similarity: small κ = dist LCS (a, b) = m + n − 2λ Can also use weighted alignment score or edit distance Assume m = n, therefore κ = 2(n − λ) Alexander Tiskin (Warwick) Semi-local string comparison 78 / 132
  • 195.
    The transposition networkmethod Parameterised string comparison Low-similarity comparison: small λ sparse set of matches, may need to look at them all preprocess matches for fast searching, time O(n log σ) High-similarity comparison: small κ set of matches may be dense, but only need to look at small subset no need to preprocess, linear search is OK Flexible comparison: sensitive to both high and low similarity, e.g. by both comparison types running alongside each other Alexander Tiskin (Warwick) Semi-local string comparison 79 / 132
  • 196.
    The transposition networkmethod Parameterised string comparison Parameterised string comparison: running time Low-similarity, after preprocessing in O(n log σ) O(nλ) [Hirschberg: 1977] [Apostolico, Guerra: 1985] [Apostolico+: 1992] High-similarity, no preprocessing O(n · κ) [Ukkonen: 1985] [Myers: 1986] Flexible O(λ · κ · log n) no preproc [Myers: 1986; Wu+: 1990] O(λ · κ) after preproc [Rick: 1995] Alexander Tiskin (Warwick) Semi-local string comparison 80 / 132
  • 197.
    The transposition networkmethod Parameterised string comparison Parameterised string comparison: the waterfall algorithm Low-similarity: O(n · λ) High-similarity: O(n · κ) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 0 1 0 1 1 1 0 1 1 0 0 Trace 0s through network in contiguous blocks and gaps Alexander Tiskin (Warwick) Semi-local string comparison 81 / 132
  • 198.
    The transposition networkmethod Dynamic string comparison The dynamic LCS problem Maintain current LCS score under updates to one or both input strings Both input strings are streams, updated on-line: appending characters at left or right deleting characters at left or right Assume for simplicity m ≈ n, i.e. m = Θ(n) Goal: linear time per update O(n) per update of a (n = |b|) O(m) per update of b (m = |a|) Alexander Tiskin (Warwick) Semi-local string comparison 82 / 132
  • 199.
    The transposition networkmethod Dynamic string comparison Dynamic LCS in linear time: update models left right – app+del standard DP [Wagner, Fischer: 1974] app app a fixed [Landau+: 1998], [Kim, Park: 2004] app app [Ishida+: 2005] app+del app+del [T: NEW] Main idea: for append only, maintain seaweed matrix Pa,b for append+delete, maintain partial seaweed layout by tracing a transposition network Alexander Tiskin (Warwick) Semi-local string comparison 83 / 132
  • 200.
    The transposition networkmethod Bit-parallel string comparison Bit-parallel string comparison String comparison using standard instructions on words of size w Bit-parallel string comparison: running time O(mn/w ) [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001] Alexander Tiskin (Warwick) Semi-local string comparison 84 / 132
  • 201.
    The transposition networkmethod Bit-parallel string comparison Bit-parallel string comparison: binary transposition network In every cell: input bits s, c; output bits s , c ; match/mismatch flag µ c s 0 1 0 1 0 1 0 1 µ ¬ c 0 0 1 1 0 0 1 1 µ 0 0 0 0 1 1 1 1 s s s 0 1 1 1 0 0 1 1 c 0 0 0 1 0 1 0 1 c c s 0 1 0 1 0 1 0 1 µ ∧ c 0 0 1 1 0 0 1 1 µ 0 0 0 0 1 1 1 1 s + s s 0 1 1 0 0 0 1 1 c 0 0 0 1 0 1 0 1 c Alexander Tiskin (Warwick) Semi-local string comparison 85 / 132
  • 202.
    The transposition networkmethod Bit-parallel string comparison Bit-parallel string comparison: binary transposition network In every cell: input bits s, c; output bits s , c ; match/mismatch flag µ c s 0 1 0 1 0 1 0 1 µ ¬ c 0 0 1 1 0 0 1 1 µ 0 0 0 0 1 1 1 1 s s s 0 1 1 1 0 0 1 1 c 0 0 0 1 0 1 0 1 c c s 0 1 0 1 0 1 0 1 µ ∧ c 0 0 1 1 0 0 1 1 µ 0 0 0 0 1 1 1 1 s + s s 0 1 1 0 0 0 1 1 c 0 0 0 1 0 1 0 1 c 2c + s ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ) S ← (S + (S ∧ M)) ∨ (S ∧ ¬M), where S, M are words of bits s, µ Alexander Tiskin (Warwick) Semi-local string comparison 85 / 132
  • 203.
    Alexander Tiskin (Warwick) Semi-local string comparison 86 / 132
  • 204.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 87 / 132
  • 205.
    Sparse string comparison Semi-localLCS between permutations The LCS problem on permutation strings Give LCS score for a vs b In each of a, b all characters distinct: total m = n matches Equivalent to longest increasing subsequence (LIS) in a string maximum clique in a permutation graph maximum planar matching in an embedded bipartite graph LCS on permutation strings: running time O(n log n) implicit in [Erd¨s, Szekeres: o 1935] [Robinson: 1938; Knuth: 1970; Dijkstra: 1980] O(n log log n) unit-RAM [Chang, Wang: 1992] [Bespamyatnikh, Segal: 2000] Alexander Tiskin (Warwick) Semi-local string comparison 88 / 132
  • 206.
    Sparse string comparison Semi-localLCS between permutations The semi-local LCS problem on permutation strings Give semi-local LCS scores of a vs b In each of a, b all characters distinct: total m = n matches Equivalent to longest increasing subsequence (LIS) in every substring of a string Semi-local LCS on permutation strings: running time O(n2 log n) naive O(n2 ) restricted [Albert+: 2003; Chen+: 2005] O(n1.5 log n) randomised, restricted [Albert+: 2007] O(n1.5 ) [T: 2006] O(n log2 n) [T: NEW] Alexander Tiskin (Warwick) Semi-local string comparison 89 / 132
  • 207.
    Sparse string comparison Semi-localLCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 90 / 132
  • 208.
    Sparse string comparison Semi-localLCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 90 / 132
  • 209.
    Sparse string comparison Semi-localLCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 91 / 132
  • 210.
    Sparse string comparison Semi-localLCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 91 / 132
  • 211.
    Sparse string comparison Semi-localLCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 92 / 132
  • 212.
    Sparse string comparison Semi-localLCS between permutations D E H C B A F G C F A E D H G B Alexander Tiskin (Warwick) Semi-local string comparison 92 / 132
  • 213.
    Sparse string comparison Semi-localLCS between permutations Semi-local LCS on permutation strings: the algorithm Divide-and-conquer on the alignment graph Divide graph (say) horizontally; two subproblems of effective size n/2 Conquer: seaweed matrix -multiplication, time O(n log n) Overall time O(n log2 n) Alexander Tiskin (Warwick) Semi-local string comparison 93 / 132
  • 214.
    Sparse string comparison Longestpiecewise monotone subsequences A k-increasing sequence: a concatenation of k increasing sequences A k-modal sequence: a concatenation of k alternating increasing and decreasing sequences The longest k-increasing (k-modal) subsequence problem Give the longest k-increasing (k-modal) subsequence of string b Longest k-increasing (k-modal) subsequence: running time O(nk log n) k-modal [Demange+: 2007] O(nk log n) via [Hunt, Szymanski: 1977] O(n log2 n) [T: NEW] Main idea: LCS for id k (respectively, (idid)k/2 ) vs b Alexander Tiskin (Warwick) Semi-local string comparison 94 / 132
  • 215.
    Sparse string comparison Longestpiecewise monotone subsequences Longest k-increasing subsequence: algorithm A Sparse LCS for id k vs b: time O(nk log n) Alexander Tiskin (Warwick) Semi-local string comparison 95 / 132
  • 216.
    Sparse string comparison Longestpiecewise monotone subsequences Longest k-increasing subsequence: algorithm A Sparse LCS for id k vs b: time O(nk log n) Longest k-increasing subsequence: algorithm B Compute seaweed matrix for id vs b: time O(n log2 n) Extract three-way seaweed submatrix Compute three-way seaweed submatrix for id k vs b by log k instances of seaweed matrix -square/multiply: time log k · O(n log n) = O(n log2 n) Query the LCS score for id k vs b: time negligible Overall time O(n log2 n) Algorithm B faster than Algorithm A for k ≥ log n Alexander Tiskin (Warwick) Semi-local string comparison 95 / 132
  • 217.
    Sparse string comparison Maximumclique in a circle graph The maximum clique problem in a circle graph Given a circle with n chords, find the maximum-size subset of pairwise intersecting chords S Alexander Tiskin (Warwick) Semi-local string comparison 96 / 132
  • 218.
    Sparse string comparison Maximumclique in a circle graph The maximum clique problem in a circle graph Given a circle with n chords, find the maximum-size subset of pairwise intersecting chords S Standard reduction to an interval model: cut the circle and lay it out on the line; chords become intervals (here drawn as square diagonals) Chords intersect iff intervals overlap, i.e. intersect without containment Alexander Tiskin (Warwick) Semi-local string comparison 96 / 132
  • 219.
    Sparse string comparison Maximumclique in a circle graph Maximum clique in a circle graph: running time exp(n) naive O(n3 ) [Gavril: 1973] O(n2 ) [Rotem, Urrutia: 1981; Hsu: 1985] [Masuda+: 1990; Apostolico+: 1992] O(n1.5 ) [T: 2006] O(n log2 n) [T: NEW] Alexander Tiskin (Warwick) Semi-local string comparison 97 / 132
  • 220.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 221.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 222.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 223.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 224.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 225.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 226.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 227.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 228.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 229.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 230.
    Sparse string comparison Maximumclique in a circle graph Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132
  • 231.
    Sparse string comparison Maximumclique in a circle graph Maximum clique in a circle graph: the algorithm Helly property: if any set of intervals intersect pairwise, then they all intersect at a common point Compute seaweed matrix, build range tree: time O(n log2 n) Run through all 2n + 1 possible common intersection points For each point, find a maximum subset of covering overlapping segments by a prefix-suffix LCS query: time (2n + 1) · O(log2 n) = O(n log2 n) Overall time O(n log2 n) + O(n log2 n) = O(n log2 n) Alexander Tiskin (Warwick) Semi-local string comparison 99 / 132
  • 232.
    Sparse string comparison Maximumclique in a circle graph Parameterised maximum clique in a circle graph The maximum clique problem in a circle graph, sensitive e.g. to the number e of edges the size l of maximum clique the thickness d of interval model (the maximum number of intervals covering a point) We have l ≤ d ≤ n, e ≤ n2 Parameterised maximum clique in a circle graph: running time O(n log n + e) [Apostolico+: 1992] O(n log n + nl log(n/l)) [Apostolico+: 1992] O(n log n + n log2 d) NEW Alexander Tiskin (Warwick) Semi-local string comparison 100 / 132
  • 233.
    Sparse string comparison Maximumclique in a circle graph Parameterised maximum clique in a circle graph: the algorithm For each diagonal block of size d, compute seaweed matrix, build range tree: time n/d · O(d log2 d) = O(n log2 d) Extend each diagonal block to a quadrant: time O(n log2 d) Run through all 2n + 1 possible common intersection points For each point, find a maximum subset of covering overlapping segments by a prefix-suffix LCS query: time O(n log2 d) Overall time O(n log2 d) Alexander Tiskin (Warwick) Semi-local string comparison 101 / 132
  • 234.
    Alexander Tiskin (Warwick) Semi-local string comparison 102 / 132
  • 235.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 103 / 132
  • 236.
    Compressed string comparison Grammarcompression Notation: pattern p of length m; text t of length n A GC-string (grammar-compressed string) t is a straight-line program (context-free grammar) generating t = tn by n assignments of the form ¯ ¯ tk = α, where α is an alphabet character tk = ti tj , where i, j < k In general, n = O(2n ) ¯ Example: Fibonacci string “ABAABABAABAAB” t1 = ‘B’ t2 = ‘A’ t3 = t2 t1 t4 = t3 t2 t5 = t4 t3 t6 = t5 t4 t7 = t6 t5 Alexander Tiskin (Warwick) Semi-local string comparison 104 / 132
  • 237.
    Compressed string comparison Grammarcompression Grammar-compression covers various compression types, e.g. LZ78, LZW (not LZ77 directly) Simplifying assumption: arithmetic up to n runs in O(1) This assumption can be removed by careful index remapping Alexander Tiskin (Warwick) Semi-local string comparison 105 / 132
  • 238.
    Compressed string comparison Extendedsubstring-string LCS on GC-strings LCS: running time (r = m + n, ¯ = m + n) r ¯ ¯ p t plain plain O(mn) [Wagner, Fischer: 1974] mn O log2 m [Masek, Paterson: 1980] [Crochemore+: 2003] plain GC O(m3 n + . . .) ¯ gen. CFG [Myers: 1995] O(m ¯1.5 n) ext subs-s [T: 2008] O(m log m · n) ¯ ext subs-s [T: 2010] GC GC NP-hard [Lifshits: 2005] O(r 1.2¯1.4 ) r R weights [Hermelin+: 2009] O(r log r · ¯) r [T: 2010] O r log(r /¯) · ¯ r r [Hermelin+: 2010] O r log1/2 (r /¯) · ¯ r r [Gawrychowski: NEW] Alexander Tiskin (Warwick) Semi-local string comparison 106 / 132
  • 239.
    Compressed string comparison Extendedsubstring-string LCS on GC-strings Extended substring-string LCS (plain pattern, GC text): the algorithm For every k, compute by recursion the appropriate part of seaweed matrix Pp,tk , using matrix -multiplication: time O(m log m · n) ¯ Overall time O(m log m · n) ¯ Alexander Tiskin (Warwick) Semi-local string comparison 107 / 132
  • 240.
    Compressed string comparison Subsequencerecognition on GC-strings The global subsequence recognition problem Does pattern p appear in text t as a subsequence? Global subsequence recognition: running time p t plain plain O(n) greedy plain GC O(m¯) n greedy GC GC NP-hard [Lifshits: 2005] Alexander Tiskin (Warwick) Semi-local string comparison 108 / 132
  • 241.
    Compressed string comparison Subsequencerecognition on GC-strings The local subsequence recognition problem Find all minimally matching substrings of t with respect to p Substring of t is matching, if p is a subsequence of t Matching substring of t is minimally matching, if none of its proper substrings are matching Alexander Tiskin (Warwick) Semi-local string comparison 109 / 132
  • 242.
    Compressed string comparison Subsequencerecognition on GC-strings Local subsequence recognition: running time ( + output) p t plain plain O(mn) [Mannila+: 1995] mn O log m [Das+: 1997] O(c m + n) [Boasson+: 2001] O(m + nσ) [Troniˇek: 2001] c plain GC O(m2 log m¯) n [C´gielski+: 2006] e 1.5 n) O(m ¯ [T: 2008] O(m log m · n) ¯ [T: NEW] GC GC NP-hard [Lifshits: 2005] Alexander Tiskin (Warwick) Semi-local string comparison 110 / 132
  • 243.
    Compressed string comparison Subsequencerecognition on GC-strings 0 ˆ0+ ı ˆ1+ ı ˆ2+ ı n n 0+ ˆ 1+ ˆ  2+ ˆ b i : j matching iff box [i : j] not pierced left-to-right -maximal seaweeds: -chain ˆ0+ , 0+ ı ˆ ˆ1+ , 1+ ı ˆ ··· ˆs − , s − ı ˆ b i : j minimally matching iff (i, j) is in the interleaved -chain ˆ− , + ı1+ ˆ1− ˆ− , + ı2+ ˆ2− ··· ˆ− + , + − ı(s−1) ˆ(s−1) Alexander Tiskin (Warwick) Semi-local string comparison 111 / 132
  • 244.
    Compressed string comparison Subsequencerecognition on GC-strings −m 0 × ˆ0+ ı • • × ˆ1+ ı • × ˆ2+ ı • • n × n 0+ ˆ 1+ ˆ 2+ ˆ n m+n Alexander Tiskin (Warwick) Semi-local string comparison 112 / 132
  • 245.
    Compressed string comparison Subsequencerecognition on GC-strings Local subsequence recognition (plain pattern, GC text): the algorithm For every k, compute by recursion the appropriate part of seaweed matrix Pp,tk , using matrix -multiplication: time O(m log m · n) ¯ Alexander Tiskin (Warwick) Semi-local string comparison 113 / 132
  • 246.
    Compressed string comparison Subsequencerecognition on GC-strings Local subsequence recognition (plain pattern, GC text): the algorithm For every k, compute by recursion the appropriate part of seaweed matrix Pp,tk , using matrix -multiplication: time O(m log m · n) ¯ Given an assignment t = t t , count by recursion minimally matching substrings in t minimally matching substrings in t Alexander Tiskin (Warwick) Semi-local string comparison 113 / 132
  • 247.
    Compressed string comparison Subsequencerecognition on GC-strings Local subsequence recognition (plain pattern, GC text): the algorithm For every k, compute by recursion the appropriate part of seaweed matrix Pp,tk , using matrix -multiplication: time O(m log m · n) ¯ Given an assignment t = t t , count by recursion minimally matching substrings in t minimally matching substrings in t Then, find -chain of -maximal seaweeds in time n · O(m) = O(m¯) ¯ n The interleaved -chain defines minimally matching substrings in t overlapping both t and t Overall time O(m log m · n) + O(m¯) = O(m log m · n) ¯ n ¯ Alexander Tiskin (Warwick) Semi-local string comparison 113 / 132
  • 248.
    Compressed string comparison Thresholdapproximate matching The threshold approximate matching problem Find all matching substrings of t with respect to p, according to a threshold k Substring of t is matching, if the edit distance for p vs t is at most k Alexander Tiskin (Warwick) Semi-local string comparison 114 / 132
  • 249.
    Compressed string comparison Thresholdapproximate matching Threshold approximate matching: running time ( + output) p t plain plain O(mn) [Sellers: 1980] O(mk) [Landau, Vishkin: 1989] 4 O(m + n + nk ) m [Cole, Hariharan: 2002] plain GC O(m¯k 2 ) n [K¨rkk¨inen+: 2003] a a O(m¯k + n log n) n ¯ [LV: 1989] via [Bille+: 2010] O(m¯ + nk 4 + n log n) n ¯ ¯ [CH: 2002] via [Bille+: 2010] O(m log m · n) ¯ [T: NEW] GC GC NP-hard [Lifshits: 2005] (Also many specialised variants for LZ compression) Alexander Tiskin (Warwick) Semi-local string comparison 115 / 132
  • 250.
    Compressed string comparison Thresholdapproximate matching Threshold approx matching (plain pattern, GC text): the algorithm Algorithm structure similar to local subsequence recognition by matrix -multiplication and seaweed -chains Extra ingredients: the blow-up technique: reduction of edit distances to LCS scores implicit matrix searching, replaces -chain interleaving Monge row minima: “SMAWK” O(m) [Aggarwal+: 1987] Implicit unit-Monge row minima O(m log log m) [T: NEW] O(m) [Gawrychowski: NEW] replaces -chain interleaving Overall time O(m log m · n) + O(m¯) = O(m log m · n) ¯ n ¯ Alexander Tiskin (Warwick) Semi-local string comparison 116 / 132
  • 251.
    Compressed string comparison Thresholdapproximate matching 0 ˆ0+ ˆ1+ ı ı ˆ2+ ı ˆ3+ ˆ4+ ı ı n ˜ n ˜ 0+ 1+ 2+ 3+ ˆ ˆ ˆ ˆ  4+ ˆ Blow up: weighted alignment on strings p, t of size m, n equivalent to LCS on strings p , ˜ of size m = νm, n = νn ˜ t ˜ ˜ Alexander Tiskin (Warwick) Semi-local string comparison 117 / 132
  • 252.
    Compressed string comparison Thresholdapproximate matching −m ˜ 0 × × × × × × ˆ0+ ı • × × × × × × ˆ1+ ı • × × × × × × ˆ2+ ı • × × × × × × ˆ3+ ı • × × × × × × ˆ4+ ı • n ˜ × × × × × × n ˜ 0+ 1+ 2+ 3+ ˆ ˆ ˆ ˆ 4+ ˆ n ˜ m+˜ ˜ n Alexander Tiskin (Warwick) Semi-local string comparison 118 / 132
  • 253.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 119 / 132
  • 254.
    Beyond semi-locality Window-local LCSand alignment plots Given window length w The window-local LCS problem Give semi-local LCS score for every w -substring of a vs whole b Alexander Tiskin (Warwick) Semi-local string comparison 120 / 132
  • 255.
    Beyond semi-locality Window-local LCSand alignment plots Given window length w The window-local LCS problem Give semi-local LCS score for every w -substring of a vs whole b Window-local LCS: running time O(mn3 w ) naive O(mnw ) multiple seaweed O(mn) [Krusche, T: 2010] Alexander Tiskin (Warwick) Semi-local string comparison 120 / 132
  • 256.
    Beyond semi-locality Window-local LCSand alignment plots Quasi-local LCS: the algorithm Compute seaweed matrices for canonical substrings of a against b Build seaweed matrices for windows of a against b recursively Both stages use matrix -multiplication, overall time O(mn) Alexander Tiskin (Warwick) Semi-local string comparison 121 / 132
  • 257.
    Beyond semi-locality Window-local LCSand alignment plots The window-window LCS problem Give semi-local LCS score for every w -substring of a vs every w -substring b Provides an alignment plot: loss-free local comparison of genome sequences, important for studying conservation in the genome Alexander Tiskin (Warwick) Semi-local string comparison 122 / 132
  • 258.
    Beyond semi-locality Window-local LCSand alignment plots The window-window LCS problem Give semi-local LCS score for every w -substring of a vs every w -substring b Provides an alignment plot: loss-free local comparison of genome sequences, important for studying conservation in the genome Window-window LCS: running time O(mnw 2 ) naive O(mnw ) multiple seaweed O(mn) [Krusche, T: 2010] Alexander Tiskin (Warwick) Semi-local string comparison 122 / 132
  • 259.
    Beyond semi-locality Window-local LCSand alignment plots An implementation of alignment plots [Krusche: 2010] Based on the O(mnw ) multiple seaweed combing, using some ideas from the optimal O(mn) algorithm Resulting theoretical complexity O(mnw 0.5 ) Alignment weights (match 1, mismatch 0, gap −0.5): ×4 slowdown C++, Intel assembly (x86, x86 64, MMX/SSE2 data parallelism) SMP parallelism (currently two processors) Alexander Tiskin (Warwick) Semi-local string comparison 123 / 132
  • 260.
    Beyond semi-locality Window-local LCSand alignment plots Krusche’s implementation used at Warwick Systems Biology Centre to study weak conservation in plant genomes (BLAST too inaccurate) Speedup on one processor: ×10 over heavily optimised, bit-parallel naive algorithm ×7 over biologists’ heuristics Speedup on two processors: extra ×2, near-perfect parallelism Alexander Tiskin (Warwick) Semi-local string comparison 124 / 132
  • 261.
    Beyond semi-locality Window-local LCSand alignment plots Publications: [Picot+: 2010] Evolutionary Analysis of Regulatory Sequences (EARS) in Plants. Plant Journal. [Baxter+: accepted] Conserved Noncoding Sequences Highlight Shared Components of Regulatory Networks in Dicotyledonous Plants. Plant Cell. Web service: http://wsbc.warwick.ac.uk/ears Future work: genomic repeats; genome-scale comparison Alexander Tiskin (Warwick) Semi-local string comparison 125 / 132
  • 262.
    Beyond semi-locality Quasi-local LCSand sparse spliced alignment Given m (possibly overlapping) prescribed substrings in a The quasi-local LCS problem Give semi-local LCS score for every prescribed substring of a vs whole b Alexander Tiskin (Warwick) Semi-local string comparison 126 / 132
  • 263.
    Beyond semi-locality Quasi-local LCSand sparse spliced alignment Given m (possibly overlapping) prescribed substrings in a The quasi-local LCS problem Give semi-local LCS score for every prescribed substring of a vs whole b Quasi-local LCS: running time O(m2 n) multiple seaweed O(mn log2 m) [T: unpublished] Alexander Tiskin (Warwick) Semi-local string comparison 126 / 132
  • 264.
    Beyond semi-locality Quasi-local LCSand sparse spliced alignment Window-local LCS: the algorithm Compute seaweed matrices for canonical substrings of a against b Build seaweed matrices for prescribed substrings of a against b recursively Both stages use matrix -multiplication, time O(mn log2 m) Alexander Tiskin (Warwick) Semi-local string comparison 127 / 132
  • 265.
    Beyond semi-locality Quasi-local LCSand sparse spliced alignment The sparse spliced alignment problem Give the chain of non-overlapping prescribed substrings in a, closest to b by alignment score Describes gene assembly: unknown gene from candidate exons, given a known similar gene Alexander Tiskin (Warwick) Semi-local string comparison 128 / 132
  • 266.
    Beyond semi-locality Quasi-local LCSand sparse spliced alignment The sparse spliced alignment problem Give the chain of non-overlapping prescribed substrings in a, closest to b by alignment score Describes gene assembly: unknown gene from candidate exons, given a known similar gene Assume m = n; let s = total size of prescribed substrings Sparse spliced alignment: running time O(ns) = O(n3 ) [Gelfand+: 1996] O(n2.5 ) [Kent+: 2006] O(n2 log2 n) [T: unpublished] O(n2 log n) [Sakai: 2009] Alexander Tiskin (Warwick) Semi-local string comparison 128 / 132
  • 267.
    1 Introduction 7 Sparse string comparison 2 Matrix distance multiplication 8 Compressed string comparison 3 Semi-local string comparison 9 Beyond semi-locality 4 The seaweed method 10 Conclusions and future work 5 Periodic string comparison 6 The transposition network method Alexander Tiskin (Warwick) Semi-local string comparison 129 / 132
  • 268.
    Conclusions and futurework Implicit unit-Monge matrices: the seaweed monoid distance multiplication in time O(n log n) next: lower bound? Semi-local LCS problem: representation by implicit unit-Monge matrices generalisation to rational alignment scores next: real alignment scores? Seaweed combing and micro-block speedup: a simple algorithm for semi-local LCS semi-local LCS in time o(mn) improvements on related problems Alexander Tiskin (Warwick) Semi-local string comparison 130 / 132
  • 269.
    Conclusions and futurework Transposition networks: simple interpretation of high-similarity and dissimilarity LCS dynamic LCS simple interpretation of bit-parallel LCS Sparse string comparison: fast LCS on permutation strings fast max-clique in a circle graph next: lower bound? or even faster max-clique Sparse string comparison: semi-local LCS on permutations in time O(n log2 n) maximum clique in a circle graph in time O(n log2 n) improvements on related problems Alexander Tiskin (Warwick) Semi-local string comparison 131 / 132
  • 270.
    Conclusions and futurework Compressed string comparison: three-way semi-local LCS on GC text against plain pattern subsequence recognition in GC-strings next: approximate matching in GC-strings Beyond semi-locality: quasi-local string comparison sparse spliced alignment next: full locality Alexander Tiskin (Warwick) Semi-local string comparison 132 / 132