Sketching and locality sensitive hashing for alignment

Sketching and locality sensitive hashing for alignment
Guillaume Marçais
2/15/23

Why do we need sketching and Locality Sensitive Hashing for
alignment?
1

Large scale alignment problems
Cluster N samples based on
sequence similarity
• → N2/2 alignment problems
• Speed-up pairwise alignment task?
• Skip hopeless alignments?
Sequence search in large database
• Avoid aligning to all sequences in database?
• Approximate nearer neighbor search
• High dimension, non-geometric space
2

Large scale alignment problems
Cluster N samples based on
sequence similarity
• → N2/2 alignment problems
• Speed-up pairwise alignment task?
• Skip hopeless alignments?
Sequence search in large database
• Avoid aligning to all sequences in database?
• Approximate nearer neighbor search
• High dimension, non-geometric space
?
2

Fast growth of sequence databases
1x1010
1x1011
1x1012
1x1013
1x1014
1x1015
1x1016
1x1017
2006
2008
2010
2012
2014
2016
2018
2020
2022
Number
of
bases
Year
SRA open accessible bases
• Exponential growth in public and private databases (SRA: 1.5×/year)
• =⇒ hidden exponential slow down in large scale analysis
3

Sequence alignment is hard
No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015)
Computing the edit distance Ed in time O(n2−δ), δ > 0 violates the Strong Exponential
Time Hypothesis (SETH).
• Usual dynamic programming: O(n2)
• 1Masek and Paterson: O

n2
log(n)

• n2−δ ≪ n2
log(n) ≪ n2
• Can’t fundamentally improve
1
A faster algorithm computing string edit distances (1980)
4

Sequence alignment is hard
No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015)
Computing the edit distance Ed in time O(n2−δ), δ 0 violates the Strong Exponential
Time Hypothesis (SETH).
• Usual dynamic programming: O(n2)
• 1Masek and Paterson: O

n2
log(n)

• n2−δ ≪ n2
log(n) ≪ n2
• Can’t fundamentally improve
1
A faster algorithm computing string edit distances (1980)
4

Seed and extend paradigm
Main paradigm:
• Find seeds (small exact matches)
• Cluster “coherent” seeds
• Extend between seeds using DP
• Used since the 90s’ (Blast, MUMmer)
• Still computationally intensive for large scale
• Many ways to find seeds:
• k-mers
• Suffix trees/arrays, FM Index
• LSH / sketching
Reference
Query
Seed
Extend
5

Sketching / Locality Sensitive Hashing
Avoid computing edit distance directly, use proxy measures easier to compute
• LSH: hashing method to avoid fruitless comparisons
• Sketching: sparse representation allowing quick comparison
6

Locality Sensitive Hashing: Make collisions matters
U: universe. T: hash table. |T| ≪ |U|. h : U → [0, |T| − 1].
H = {h : U → [0, |T| − 1]}
0
1
2
3
4
Universal Hashing
• Collisions as rare as possible
• ∀x, y ∈ U, x ̸= y,
Pr
h∈H
[h(x) = h(y)] =
1
|T|
Locality Sensitive Hashing
• Collision between similar elements
• ∀x, y ∈ U
Ed(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
Ed(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
7

Locality Sensitive Hashing Definition
The family H is “(d1, d2, p1, p2)-sensitive” for
distance D if there exists d1 d2, p1 p2 such
that for all x, y ∈ U
D(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
D(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
• Low distance ⇐⇒ High collisions
• High distance ⇐⇒ Low collisions
• In between d1, d2: No guarantee
Locality sensitive hash family
Family H of hash functions where
similar elements are more likely to
have the same value than distant
elements.
8

D(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
D(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
• Probability over choice of h ∈ H,
not over the elements x, y
elements.
8

D(x, y) ≤ d1 =⇒ Pr
h∈H
[h(x) = h(y)] ≥ p1
D(x, y) ≥ d2 =⇒ Pr
h∈H
[h(x) = h(y)] ≤ p2
• d1 d2: “gapped” LSH
• d1 = d2, “ungapped” LSH
• Gap not desirable but not always
avoidable.
elements.
8

Overlap computation
• Compute overlaps between reads (MHAP2)
• Instance of “Nearest Neighbor Problem” for
edit distance
• Use multiple hash tables
• Orange ellipse in same location as yellow circle
Reads
Overlap?
2
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
9

Overlap computation
• Compute overlaps between reads (MHAP2)
• Instance of “Nearest Neighbor Problem” for
edit distance
• Use multiple hash tables
• Orange ellipse in same location as yellow circle
Reads
Overlap?
Hash Tables
2
Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
9

LSH for the edit distance
How to design an LSH for edit distance?
• minHash: LSH for k-mer Jaccard distance
• OMH: Ordered Min Hash
10

Jaccard distance
Jaccard distance between sets A, B:
Jd(A, B) = 1 −
|A ∩ B|
|A ∪ B|
11

Jaccard distance
Jaccard distance between sets A, B:
Jd(A, B) = 1 −
|A ∩ B|
|A ∪ B|
Jaccard between sequences x, y:
Jaccard distance of their k-mer sets
Jd(x, y) = Jd(K(x), K(y))
• Low Ed(x, y) =⇒ Low Jd(x, y)
• High Ed(x, y) ̸
=⇒ High Jd(x, y)
• Can have false positive, few false
negative
11

MinHash: an LSH for the Jaccard distance
• Permutation of k-mers: π : 4k → 4k one-to-one
H = {hπ(S) = argmin
m∈K(S)
π(m) | π permutation of k-mers}
• Fix π, every k-mer of A ∪ B equally likely to be the
minimum for π
Pr
h∈H
[h(A) = h(B)] =
|A ∩ B|
|A ∪ B|
• Unbiased estimator, ungapped LSH
12

minHash sketch: dimensionality reduction
• Choose L hash functions from H: hi, 1 ≤ i ≤ L
• Sketch of S: vector Sk(S) = (hi(S))1≤i≤L
• Big compression: Mash3 L = 1000, k = 21, 7000 × compression
• Very fast pairwise comparison (Hamming distance between sketches)
Sk(A) =











CGAG
TTAC
CATC
CCAT
CATG
ACAA











, Sk(B) =











GTTT
TTAC
GTAG
ATTT
ACCC
ACAA











→ Jd(K(A), K(B)) ≈ 1 −
2
6
3
Mash: fast genome and metagenome distance estimation using MinHash
13

OMH: LSH for the edit distance
• minHash: LSH for k-mer Jaccard distance
• OMH: Ordered Min Hash
14

Jaccard ignores k-mer repetition
x =
n−k
z }| {
AAAAAAAAAAAAAAA
k
z }| {
CCCCC
y = AAAAA
| {z }
k
CCCCCCCCCCCCCCC
| {z }
n−k
15

x =
n−k
z }| {
AAAAAAAAAAAAAAA
k
z }| {
CCCCC → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
y = AAAAA
| {z }
k
CCCCCCCCCCCCCCC
| {z }
n−k
→ {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
15

x =
n−k
z }| {
AAAAAAAAAAAAAAA
k
z }| {
CCCCC → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
y = AAAAA
| {z }
k
CCCCCCCCCCCCCCC
| {z }
n−k
→ {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
Jaccard distance Jd(x, y) = 0 Edit distance Ed(x, y) ≥ 1 − 2k
n
Identical k-mer content and high edit distance
15

Weighted Jaccard: Jaccard on multi-set
• χA : U → {0, 1},
χA(x) = 1 ⇐⇒ x ∈ A
• χw
A : U → N,
χw
A(x) = # of instances of x in A
J(A, B) =
|A ∩ B|
|A ∪ B|
=
P
x∈U min(χA(x), χB(x))
P
x∈U max(χA(x), χB(x))
16

Weighted Jaccard: Jaccard on multi-set
• χA : U → {0, 1},
χA(x) = 1 ⇐⇒ x ∈ A
• χw
A : U → N,
χw
A(x) = # of instances of x in A
J(A, B) =
|A ∩ B|
|A ∪ B|
=
P
x∈U min(χA(x), χB(x))
P
x∈U max(χA(x), χB(x))
Jw
(A, B) =
P
x∈U min(χw
A(x), χw
B(x))
P
x∈U max(χw
A(x), χw
B(x))
16

Weighted Jaccard handles repetitions
x =
n−k
z }| {
AAAAAAAAAAAAAAA
k
z }| {
CCCCC →
n
(AAAAA,1),(AAAAA,2),...,(AAAAA,11)
(AAAAC,1),(AAACC,1),(AACCC,1),(ACCCC,1),(CCCCC,1)
o
y = AAAAA
| {z }
k
CCCCCCCCCCCCCCC
| {z }
n−k
→
n
(AAAAA,1),(AAAAC,1),(AAACC,1),(AACCC,1),(ACCCC,1),
(CCCCC,1),(CCCCC,2),...,(CCCCC,11)
o
Weighted Jaccard Jw
d (x, y) = 1 − k+2
n Edit distance Ed(x, y) ≥ 1 − 2k
n
Weighted Jaccard = Jaccard for multi-sets
17

Jaccard and weighted Jaccard ignore relative order
x = CCCCACCAACACAAAACCC
y = AAAACACAACCCCACCAAA
18

x = CCCCACCAACACAAAACCC →
n
AAAA,AAAC,AACA,AACC,ACAA,ACAC,ACCA,ACCC
CAAA,CAAC,CACA,CACC,CCAA,CCAC,CCCA,CCCC
o
y = AAAACACAACCCCACCAAA →
n
o
x, y: de Bruijn sequences,
contain all 16 possible 4-mers once
(σ!)σk−1
de Bruijn sequences of length σk + σ − 1
18

x = CCCCACCAACACAAAACCC →
n
o
y = AAAACACAACCCCACCAAA →
n
o
x, y: de Bruijn sequences,
contain all 16 possible 4-mers once
(σ!)σk−1
de Bruijn sequences of length σk + σ − 1
Jd(x, y) = Jw
d (x, y) = 0 Ed(x, y) = 0.63
18

Jaccard is different from edit distance
Unlike edit distance, k-mer Jaccard is insensitive to:
1. k-mer repetitions
2. relative positions of k-mers
• k-mer Jaccard is not an LSH for the edit distance
• Still provides big computation saving: asymmetric error model
19

OMH: Order Min Hash
• minHash is an LSH for Jaccard
• OMH is a refinement of minHash
• OMH is sensitive to
• repeated k-mers
• relative order of k-mers
20

minHash OMH sketches
S = AGTTGAGCGGAAGGTG, k = 2
21

S = AGTTGAGCGGAAGGTG, k = 2
AG
GT
TT
TG
GA
GC
CG
GG
AA
AG
GT
TT
TG
GA
GC
CG
GG
AA
AC
AT
CA
CT
GT
TA
TC
21

S = AGTTGAGCGGAAGGTG, k = 2, L = 3
AG
GT
TT
TG
GA
GC
CG
GG
AA
AG
GT
TT
TG
GA
CG
GG
AA
2
AG
GT
TT
TG
GA
GC
CG
GG
AA
3
1
21

S = AGTTGAGCGGAAGGTG, k = 2, L = 3
AG
GT
TT
TG
GA
GC
CG
GG
AA
AG
GT
TT
TG
GA
CG
GG
AA
2
AG
GT
TT
TG
GA
GC
CG
GG
AA
3
1 AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG,11
AG, 0
GT, 1
TG,14
GA, 9
GG,12
21

S = AGTTGAGCGGAAGGTG, k = 2, L = 3, ℓ = 2
AG
GT
TT
TG
GA
GC
CG
GG
AA
AG
GT
TT
TG
GA
CG
GG
AA
2
AG
GT
TT
TG
GA
GC
CG
GG
AA
3
1 AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG,11
AG, 0
GT, 1
TG,14
GA, 9
GG,12
2
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG,11
AG, 0
GT, 1
TG,14
GA, 9
GG,12
3
AG,11
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG, 0
GT, 1
TG,14
GA, 9
GG,12
4
AG,11
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG, 0
GT, 1
TG,14
GA, 9
GG,12
AG,11
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG, 0
GT, 1
TG,14
GA, 9
GG,12
5
AG,11
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG, 0
GT, 1
TG,14
GA, 9
GG,12
6
1
21

AG
GT
TT
TG
GA
GC
CG
GG
AA
AG
GT
TT
TG
GA
CG
GG
AA
2
AG
GT
TT
TG
GA
GC
CG
GG
AA
3
1 AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG,11
AG, 0
GT, 1
TG,14
GA, 9
GG,12
2
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG,11
AG, 0
GT, 1
TG,14
GA, 9
GG,12
3
AG,11
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG, 0
GT, 1
TG,14
GA, 9
GG,12
4
AG,11
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG, 0
GT, 1
TG,14
GA, 9
GG,12
AG,11
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG, 0
GT, 1
TG,14
GA, 9
GG,12
5
AG,11
AG, 5
GT,13
TT, 2
TG, 3
GA, 4
GC, 6
CG, 7
GG, 8
AA,10
AG, 0
GT, 1
TG,14
GA, 9
GG,12
6
1
GA AG GG
GC AG TG
21

Jaccard:
Sk(S) =




GC
TG
GT




OMH:
Sk(S) =




GC CA
AG GG
AG TG




21

OMH is a LSH for edit distance
Theorem: OMH is a LSH for edit distance
There exists (d1, d2, p1, p2) such that OMH is sensitive for the edit distance.
• p1: related to probability of hash collisions of weighted Jaccard
• p2: related to length of increasing sequence given weighted Jaccard
22

Practical considerations with Jaccard sketches
Jaccard:
• Can use canonical k-mers
• Difficult to find independent hashes:
use bottom sketches (L ≪ n)
OMH:
• ℓ times as large (cost to encode order)
• ℓ = 1: LSH / unbiased estimator of
weighted Jaccard
• Can’t use canonical k-mers: double
sketch
23

OMH has a large gap
• |S| = 100, k = 5
• Current proof has a large gap
• What is smallest gap
possible?
• OMH/minHash similar to
embedding in Hamming
space: gap probably
unavoidable
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
` = 2
` = 5
` = 10
` = 2
` = 5
` = 10
probability
p
similarity s
p1
p2
24

Sketching and locality sensitive hashing for alignment

Recommended

Recommended

More Related Content

Similar to Sketching and locality sensitive hashing for alignment

Similar to Sketching and locality sensitive hashing for alignment (20)

Recently uploaded

Recently uploaded (20)

Sketching and locality sensitive hashing for alignment