2. Why do we need sketching and Locality Sensitive Hashing for
alignment?
1
3. Large scale alignment problems
Cluster N samples based on
sequence similarity
• → N2/2 alignment problems
• Speed-up pairwise alignment task?
• Skip hopeless alignments?
Sequence search in large database
• Avoid aligning to all sequences in database?
• Approximate nearer neighbor search
• High dimension, non-geometric space
2
4. Large scale alignment problems
Cluster N samples based on
sequence similarity
• → N2/2 alignment problems
• Speed-up pairwise alignment task?
• Skip hopeless alignments?
Sequence search in large database
• Avoid aligning to all sequences in database?
• Approximate nearer neighbor search
• High dimension, non-geometric space
?
2
5. Fast growth of sequence databases
1x1010
1x1011
1x1012
1x1013
1x1014
1x1015
1x1016
1x1017
2006
2008
2010
2012
2014
2016
2018
2020
2022
Number
of
bases
Year
SRA open accessible bases
• Exponential growth in public and private databases (SRA: 1.5×/year)
• =⇒ hidden exponential slow down in large scale analysis
3
6. Sequence alignment is hard
No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015)
Computing the edit distance Ed in time O(n2−δ), δ > 0 violates the Strong Exponential
Time Hypothesis (SETH).
• Usual dynamic programming: O(n2)
• 1Masek and Paterson: O
n2
log(n)
• n2−δ ≪ n2
log(n) ≪ n2
• Can’t fundamentally improve
1
A faster algorithm computing string edit distances (1980)
4
7. Sequence alignment is hard
No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015)
Computing the edit distance Ed in time O(n2−δ), δ 0 violates the Strong Exponential
Time Hypothesis (SETH).
• Usual dynamic programming: O(n2)
• 1Masek and Paterson: O
n2
log(n)
• n2−δ ≪ n2
log(n) ≪ n2
• Can’t fundamentally improve
1
A faster algorithm computing string edit distances (1980)
4
8. Sequence alignment is hard
No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015)
Computing the edit distance Ed in time O(n2−δ), δ 0 violates the Strong Exponential
Time Hypothesis (SETH).
• Usual dynamic programming: O(n2)
• 1Masek and Paterson: O
n2
log(n)
• n2−δ ≪ n2
log(n) ≪ n2
• Can’t fundamentally improve
1
A faster algorithm computing string edit distances (1980)
4
9. Seed and extend paradigm
Main paradigm:
• Find seeds (small exact matches)
• Cluster “coherent” seeds
• Extend between seeds using DP
• Used since the 90s’ (Blast, MUMmer)
• Still computationally intensive for large scale
• Many ways to find seeds:
• k-mers
• Suffix trees/arrays, FM Index
• LSH / sketching
Reference
Query
Seed
Extend
5
11. Locality Sensitive Hashing: Make collisions matters
U: universe. T: hash table. |T| ≪ |U|. h : U → [0, |T| − 1].
H = {h : U → [0, |T| − 1]}
0
1
2
3
4
Universal Hashing
• Collisions as rare as possible
• ∀x, y ∈ U, x ̸= y,
Pr
h∈H
[h(x) = h(y)] =
1
|T|
Locality Sensitive Hashing
• Collision between similar elements
• ∀x, y ∈ U
Ed(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
Ed(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
7
12. Locality Sensitive Hashing: Make collisions matters
U: universe. T: hash table. |T| ≪ |U|. h : U → [0, |T| − 1].
H = {h : U → [0, |T| − 1]}
0
1
2
3
4
Universal Hashing
• Collisions as rare as possible
• ∀x, y ∈ U, x ̸= y,
Pr
h∈H
[h(x) = h(y)] =
1
|T|
Locality Sensitive Hashing
• Collision between similar elements
• ∀x, y ∈ U
Ed(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
Ed(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
7
13. Locality Sensitive Hashing: Make collisions matters
U: universe. T: hash table. |T| ≪ |U|. h : U → [0, |T| − 1].
H = {h : U → [0, |T| − 1]}
0
1
2
3
4
Universal Hashing
• Collisions as rare as possible
• ∀x, y ∈ U, x ̸= y,
Pr
h∈H
[h(x) = h(y)] =
1
|T|
Locality Sensitive Hashing
• Collision between similar elements
• ∀x, y ∈ U
Ed(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
Ed(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
7
14. Locality Sensitive Hashing Definition
The family H is “(d1, d2, p1, p2)-sensitive” for
distance D if there exists d1 d2, p1 p2 such
that for all x, y ∈ U
D(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
D(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
• Low distance ⇐⇒ High collisions
• High distance ⇐⇒ Low collisions
• In between d1, d2: No guarantee
Locality sensitive hash family
Family H of hash functions where
similar elements are more likely to
have the same value than distant
elements.
8
15. Locality Sensitive Hashing Definition
The family H is “(d1, d2, p1, p2)-sensitive” for
distance D if there exists d1 d2, p1 p2 such
that for all x, y ∈ U
D(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
D(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
• Probability over choice of h ∈ H,
not over the elements x, y
Locality sensitive hash family
Family H of hash functions where
similar elements are more likely to
have the same value than distant
elements.
8
16. Locality Sensitive Hashing Definition
The family H is “(d1, d2, p1, p2)-sensitive” for
distance D if there exists d1 d2, p1 p2 such
that for all x, y ∈ U
D(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
D(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
• d1 d2: “gapped” LSH
• d1 = d2, “ungapped” LSH
• Gap not desirable but not always
avoidable.
Locality sensitive hash family
Family H of hash functions where
similar elements are more likely to
have the same value than distant
elements.
8
17. Overlap computation
• Compute overlaps between reads (MHAP2)
• Instance of “Nearest Neighbor Problem” for
edit distance
• Use multiple hash tables
• Orange ellipse in same location as yellow circle
Reads
Overlap?
2
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
9
18. Overlap computation
• Compute overlaps between reads (MHAP2)
• Instance of “Nearest Neighbor Problem” for
edit distance
• Use multiple hash tables
• Orange ellipse in same location as yellow circle
Reads
Overlap?
2
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
9
19. Overlap computation
• Compute overlaps between reads (MHAP2)
• Instance of “Nearest Neighbor Problem” for
edit distance
• Use multiple hash tables
• Orange ellipse in same location as yellow circle
Reads
Overlap?
Hash Tables
2
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
9
20. Overlap computation
• Compute overlaps between reads (MHAP2)
• Instance of “Nearest Neighbor Problem” for
edit distance
• Use multiple hash tables
• Orange ellipse in same location as yellow circle
Reads
Overlap?
Hash Tables
2
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
9
21. LSH for the edit distance
How to design an LSH for edit distance?
• minHash: LSH for k-mer Jaccard distance
• OMH: Ordered Min Hash
10
22. LSH for the edit distance
How to design an LSH for edit distance?
• minHash: LSH for k-mer Jaccard distance
• OMH: Ordered Min Hash
10
24. Jaccard distance
Jaccard distance between sets A, B:
Jd(A, B) = 1 −
|A ∩ B|
|A ∪ B|
Jaccard between sequences x, y:
Jaccard distance of their k-mer sets
Jd(x, y) = Jd(K(x), K(y))
• Low Ed(x, y) =⇒ Low Jd(x, y)
• High Ed(x, y) ̸
=⇒ High Jd(x, y)
• Can have false positive, few false
negative
11
25. MinHash: an LSH for the Jaccard distance
• Permutation of k-mers: π : 4k → 4k one-to-one
H = {hπ(S) = argmin
m∈K(S)
π(m) | π permutation of k-mers}
• Fix π, every k-mer of A ∪ B equally likely to be the
minimum for π
Pr
h∈H
[h(A) = h(B)] =
|A ∩ B|
|A ∪ B|
• Unbiased estimator, ungapped LSH
12
26. minHash sketch: dimensionality reduction
• Choose L hash functions from H: hi, 1 ≤ i ≤ L
• Sketch of S: vector Sk(S) = (hi(S))1≤i≤L
• Big compression: Mash3 L = 1000, k = 21, 7000 × compression
• Very fast pairwise comparison (Hamming distance between sketches)
Sk(A) =
CGAG
TTAC
CATC
CCAT
CATG
ACAA
, Sk(B) =
GTTT
TTAC
GTAG
ATTT
ACCC
ACAA
→ Jd(K(A), K(B)) ≈ 1 −
2
6
3
Mash: fast genome and metagenome distance estimation using MinHash
13
27. OMH: LSH for the edit distance
• minHash: LSH for k-mer Jaccard distance
• OMH: Ordered Min Hash
14
28. Jaccard ignores k-mer repetition
x =
n−k
z }| {
AAAAAAAAAAAAAAA
k
z }| {
CCCCC
y = AAAAA
| {z }
k
CCCCCCCCCCCCCCC
| {z }
n−k
15
29. Jaccard ignores k-mer repetition
x =
n−k
z }| {
AAAAAAAAAAAAAAA
k
z }| {
CCCCC → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
y = AAAAA
| {z }
k
CCCCCCCCCCCCCCC
| {z }
n−k
→ {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
15
30. Jaccard ignores k-mer repetition
x =
n−k
z }| {
AAAAAAAAAAAAAAA
k
z }| {
CCCCC → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
y = AAAAA
| {z }
k
CCCCCCCCCCCCCCC
| {z }
n−k
→ {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
Jaccard distance Jd(x, y) = 0 Edit distance Ed(x, y) ≥ 1 − 2k
n
Identical k-mer content and high edit distance
15
31. Weighted Jaccard: Jaccard on multi-set
• χA : U → {0, 1},
χA(x) = 1 ⇐⇒ x ∈ A
• χw
A : U → N,
χw
A(x) = # of instances of x in A
J(A, B) =
|A ∩ B|
|A ∪ B|
=
P
x∈U min(χA(x), χB(x))
P
x∈U max(χA(x), χB(x))
16
32. Weighted Jaccard: Jaccard on multi-set
• χA : U → {0, 1},
χA(x) = 1 ⇐⇒ x ∈ A
• χw
A : U → N,
χw
A(x) = # of instances of x in A
J(A, B) =
|A ∩ B|
|A ∪ B|
=
P
x∈U min(χA(x), χB(x))
P
x∈U max(χA(x), χB(x))
Jw
(A, B) =
P
x∈U min(χw
A(x), χw
B(x))
P
x∈U max(χw
A(x), χw
B(x))
16
33. Weighted Jaccard handles repetitions
x =
n−k
z }| {
AAAAAAAAAAAAAAA
k
z }| {
CCCCC →
n
(AAAAA,1),(AAAAA,2),...,(AAAAA,11)
(AAAAC,1),(AAACC,1),(AACCC,1),(ACCCC,1),(CCCCC,1)
o
y = AAAAA
| {z }
k
CCCCCCCCCCCCCCC
| {z }
n−k
→
n
(AAAAA,1),(AAAAC,1),(AAACC,1),(AACCC,1),(ACCCC,1),
(CCCCC,1),(CCCCC,2),...,(CCCCC,11)
o
Weighted Jaccard Jw
d (x, y) = 1 − k+2
n Edit distance Ed(x, y) ≥ 1 − 2k
n
Weighted Jaccard = Jaccard for multi-sets
17
34. Jaccard and weighted Jaccard ignore relative order
x = CCCCACCAACACAAAACCC
y = AAAACACAACCCCACCAAA
18
35. Jaccard and weighted Jaccard ignore relative order
x = CCCCACCAACACAAAACCC →
n
AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC
CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC
o
y = AAAACACAACCCCACCAAA →
n
AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC
CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC
o
x, y: de Bruijn sequences,
contain all 16 possible 4-mers once
(σ!)σk−1
de Bruijn sequences of length σk + σ − 1
18
36. Jaccard and weighted Jaccard ignore relative order
x = CCCCACCAACACAAAACCC →
n
AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC
CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC
o
y = AAAACACAACCCCACCAAA →
n
AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC
CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC
o
x, y: de Bruijn sequences,
contain all 16 possible 4-mers once
(σ!)σk−1
de Bruijn sequences of length σk + σ − 1
Jd(x, y) = Jw
d (x, y) = 0 Ed(x, y) = 0.63
18
37. Jaccard is different from edit distance
Unlike edit distance, k-mer Jaccard is insensitive to:
1. k-mer repetitions
2. relative positions of k-mers
• k-mer Jaccard is not an LSH for the edit distance
• Still provides big computation saving: asymmetric error model
19
38. Jaccard is different from edit distance
Unlike edit distance, k-mer Jaccard is insensitive to:
1. k-mer repetitions
2. relative positions of k-mers
• k-mer Jaccard is not an LSH for the edit distance
• Still provides big computation saving: asymmetric error model
19
39. OMH: Order Min Hash
• minHash is an LSH for Jaccard
• OMH is a refinement of minHash
• OMH is sensitive to
• repeated k-mers
• relative order of k-mers
20
40. minHash OMH sketches
S = AGTTGAGCGGAAGGTG, k = 2
21
41. minHash OMH sketches
S = AGTTGAGCGGAAGGTG, k = 2
AG
GT
TT
TG
GA
GC
CG
GG
AA
AG
GT
TT
TG
GA
GC
CG
GG
AA
AC
AT
CA
CT
GT
TA
TC
21
42. minHash OMH sketches
S = AGTTGAGCGGAAGGTG, k = 2, L = 3
AG
GT
TT
TG
GA
GC
CG
GG
AA
AG
GT
TT
TG
GA
CG
GG
AA
2
AG
GT
TT
TG
GA
GC
CG
GG
AA
3
1
21
43. minHash OMH sketches
S = AGTTGAGCGGAAGGTG, k = 2, L = 3
AG
GT
TT
TG
GA
GC
CG
GG
AA
AG
GT
TT
TG
GA
CG
GG
AA
2
AG
GT
TT
TG
GA
GC
CG
GG
AA
3
1
21
44. minHash OMH sketches
S = AGTTGAGCGGAAGGTG, k = 2, L = 3
AG
GT
TT
TG
GA
GC
CG
GG
AA
AG
GT
TT
TG
GA
CG
GG
AA
2
AG
GT
TT
TG
GA
GC
CG
GG
AA
3
1 AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG,11
AG, 0
GT, 1
TG,14
GA, 9
GG,12
21
48. minHash OMH sketches
S = AGTTGAGCGGAAGGTG, k = 2, L = 3, ℓ = 2
Jaccard:
Sk(S) =
GC
TG
GT
OMH:
Sk(S) =
GC CA
AG GG
AG TG
21
49. OMH is a LSH for edit distance
Theorem: OMH is a LSH for edit distance
There exists (d1, d2, p1, p2) such that OMH is sensitive for the edit distance.
• p1: related to probability of hash collisions of weighted Jaccard
• p2: related to length of increasing sequence given weighted Jaccard
22
50. Practical considerations with Jaccard sketches
Jaccard:
• Can use canonical k-mers
• Difficult to find independent hashes:
use bottom sketches (L ≪ n)
OMH:
• ℓ times as large (cost to encode order)
• ℓ = 1: LSH / unbiased estimator of
weighted Jaccard
• Can’t use canonical k-mers: double
sketch
23
51. OMH has a large gap
• |S| = 100, k = 5
• Current proof has a large gap
• What is smallest gap
possible?
• OMH/minHash similar to
embedding in Hamming
space: gap probably
unavoidable
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
` = 2
` = 5
` = 10
` = 2
` = 5
` = 10
probability
p
similarity s
p1
p2
24