Sketching and locality sensitive hashing for alignment
Guillaume Marçais
2/15/23

Why do we need sketching and Locality Sensitive Hashing for alignment?
Large scale alignment problems

Cluster N samples based on sequence similarity
• → N²/2 alignment problems
• Speed up the pairwise alignment task?
• Skip hopeless alignments?

Sequence search in a large database
• Avoid aligning to all sequences in the database?
• Approximate nearest neighbor search
• High dimension, non-geometric space
Fast growth of sequence databases

[Plot: SRA open-accessible bases per year, 2006–2022; number of bases on a log scale from 10^10 to 10^17]

• Exponential growth in public and private databases (SRA: 1.5×/year)
• =⇒ hidden exponential slow-down in large scale analysis
Sequence alignment is hard

No strongly subquadratic time algorithm, most likely (Backurs, Indyk 2015)
Computing the edit distance Ed in time O(n^(2−δ)), δ > 0, violates the Strong Exponential Time Hypothesis (SETH).

• Usual dynamic programming: O(n²) (toy sketch below)
• Masek and Paterson¹: O(n²/log(n))
• n^(2−δ) ≪ n²/log(n) ≪ n²
• Can't fundamentally improve

¹ A faster algorithm computing string edit distances (1980)
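For reference, a minimal Python sketch of the quadratic dynamic program mentioned above (illustrative only, not code from the talk):

```python
def edit_distance(x, y):
    """Classic O(n^2) dynamic programming for the edit (Levenshtein) distance."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))              # row for the empty prefix of x
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[n]

print(edit_distance("AGTTGAGCGG", "AGTTGACCGG"))   # 1
```

Both loops run over the full lengths of the two strings, which is exactly the O(n²) cost that sketching and LSH try to avoid paying for every pair.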
Seed and extend paradigm

Main paradigm:
• Find seeds (small exact matches)
• Cluster "coherent" seeds
• Extend between seeds using DP

• Used since the 90s (BLAST, MUMmer)
• Still computationally intensive at large scale
• Many ways to find seeds:
  • k-mers (toy example below)
  • Suffix trees/arrays, FM-index
  • LSH / sketching

[Figure: seed matches between a reference and a query, with DP extension between the seeds]
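A small illustration of k-mer seeding (a toy sketch, not how BLAST or MUMmer index sequences internally): index the reference's k-mers, then report exact k-mer matches found in the query.

```python
from collections import defaultdict

def kmer_seeds(reference, query, k):
    """Toy seed finder: exact k-mer matches between reference and query.
    Returns (query_pos, ref_pos) pairs; a real seed-and-extend tool then
    chains coherent seeds and runs DP only between them."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    seeds = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], []):
            seeds.append((j, i))
    return seeds

print(kmer_seeds("AGTTGAGCGGAAGGTG", "TTGAGCGG", 5))   # [(0, 2), (1, 3), (2, 4), (3, 5)]
```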
Sketching / Locality Sensitive Hashing

Avoid computing the edit distance directly; use proxy measures that are easier to compute
• LSH: hashing method to avoid fruitless comparisons
• Sketching: sparse representation allowing quick comparisons
Locality Sensitive Hashing: Make collisions matter

U: universe. T: hash table. |T| ≪ |U|. h : U → [0, |T| − 1].
H = {h : U → [0, |T| − 1]}

[Figure: elements of U hashed into a table T with buckets 0–4]

Universal Hashing
• Collisions as rare as possible
• ∀x, y ∈ U, x ≠ y:  Pr_{h∈H}[h(x) = h(y)] = 1/|T|

Locality Sensitive Hashing
• Collisions between similar elements
• ∀x, y ∈ U:
  Ed(x, y) ≤ d1 =⇒ Pr_{h∈H}[h(x) = h(y)] ≥ p1
  Ed(x, y) ≥ d2 =⇒ Pr_{h∈H}[h(x) = h(y)] ≤ p2
Locality Sensitive Hashing Definition

Locality sensitive hash family: a family H of hash functions where similar elements are more likely to have the same hash value than distant elements.

The family H is "(d1, d2, p1, p2)-sensitive" for distance D if there exist d1 ≤ d2 and p1 > p2 such that for all x, y ∈ U
  D(x, y) ≤ d1 =⇒ Pr_{h∈H}[h(x) = h(y)] ≥ p1
  D(x, y) ≥ d2 =⇒ Pr_{h∈H}[h(x) = h(y)] ≤ p2

• Low distance ⇐⇒ high collision probability
• High distance ⇐⇒ low collision probability
• In between d1 and d2: no guarantee
• The probability is over the choice of h ∈ H, not over the elements x, y (illustrated below)
• d1 < d2: "gapped" LSH
• d1 = d2: "ungapped" LSH
• A gap is not desirable, but not always avoidable.
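To make the "probability over the choice of h" point concrete, here is a small Monte-Carlo sketch; the hash family, names, and parameters are toy choices made up for illustration, not part of the talk:

```python
import random

def collision_probability(sample_hash, x, y, trials=10000):
    """Estimate Pr_{h in H}[h(x) = h(y)]: the probability is over the draw
    of h from the family, for a fixed pair of elements x, y."""
    hits = 0
    for _ in range(trials):
        h = sample_hash()
        if h(x) == h(y):
            hits += 1
    return hits / trials

def sample_universal_hash(table_size=5, p=2**61 - 1):
    """Toy near-universal family over integers: ((a*v + b) mod p) mod table_size."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda v: ((a * v + b) % p) % table_size

print(collision_probability(sample_universal_hash, 42, 43))   # close to 1/|T| = 0.2
```

An LSH family would instead make this estimate depend on the distance between the two elements.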
Overlap computation

• Compute overlaps between reads (MHAP²)
• Instance of the "Nearest Neighbor Problem" for the edit distance
• Use multiple hash tables (toy lookup sketch below)
• Orange ellipse in same location as yellow circle

[Figure: reads hashed into several hash tables; a pair of reads landing in the same bucket of some table becomes a candidate overlap]

² Assembling large genomes with single-molecule sequencing and locality-sensitive hashing
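The multi-table lookup can be sketched as follows; this is an illustrative toy, not MHAP's actual pipeline, and the key functions and reads are invented for the example:

```python
from collections import defaultdict
import random

def candidate_pairs(reads, table_keys):
    """table_keys[t] maps a read to its bucket key in table t; reads that share
    a bucket in any table become candidate overlap pairs, verified later by
    alignment."""
    candidates = set()
    for key_fn in table_keys:
        table = defaultdict(list)
        for idx, read in enumerate(reads):
            table[key_fn(read)].append(idx)
        for a_bucket in table.values():
            for a in range(len(a_bucket)):
                for b in range(a + 1, len(a_bucket)):
                    candidates.add((a_bucket[a], a_bucket[b]))
    return candidates

def min_kmer_key(seed, k=8):
    """One toy key per table: the read's smallest k-mer under a per-table
    seeded hash (anticipating the MinHash functions described later)."""
    salt = random.Random(seed).getrandbits(64)
    return lambda r: min((r[i:i + k] for i in range(len(r) - k + 1)),
                         key=lambda m: hash((salt, m)))

reads = ["AGTTGAGCGGAAGGTGACGT", "GCGGAAGGTGACGTTTACAG", "TTTTTTTTTTCCCCCCCCCC"]
print(candidate_pairs(reads, [min_kmer_key(s) for s in range(8)]))  # usually {(0, 1)}
```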
LSH for the edit distance

How to design an LSH for the edit distance?
• minHash: LSH for the k-mer Jaccard distance
• OMH: Ordered Min Hash
Jaccard distance

Jaccard distance between sets A, B:
  Jd(A, B) = 1 − |A ∩ B| / |A ∪ B|

Jaccard between sequences x, y: the Jaccard distance of their k-mer sets K(x), K(y) (sketch below)
  Jd(x, y) = Jd(K(x), K(y))

• Low Ed(x, y) =⇒ low Jd(x, y)
• High Ed(x, y) does not imply high Jd(x, y)
• Can have false positives, few false negatives
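A direct computation of the k-mer Jaccard distance, as a point of reference (a minimal sketch; K(x) is simply the set of length-k substrings):

```python
def kmers(s, k):
    """K(s): the set of k-mers (length-k substrings) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard_distance(x, y, k):
    """Jd(x, y) = 1 - |K(x) ∩ K(y)| / |K(x) ∪ K(y)|."""
    a, b = kmers(x, k), kmers(y, k)
    return 1.0 - len(a & b) / len(a | b)

x, y = "AGTTGAGCGGAAGGTG", "AGTTGACCGGAAGGTG"
print(jaccard_distance(x, y, 4))   # small: the sequences differ by one substitution
```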
MinHash: an LSH for the Jaccard distance

• Permutation of k-mers: π : 4^k → 4^k, one-to-one (toy sketch below)
  H = {h_π(S) = argmin_{m∈K(S)} π(m) | π permutation of k-mers}
• For a fixed π, every k-mer of A ∪ B is equally likely to be the minimum for π:
  Pr_{h∈H}[h(A) = h(B)] = |A ∩ B| / |A ∪ B|
• Unbiased estimator, ungapped LSH
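One member of this family can be sketched as follows, with a seeded hash standing in for a random permutation π (reuses kmers() from the sketch above):

```python
import random

def make_minhash(k, seed):
    """One h_pi from the MinHash family: h_pi(S) is the k-mer of S that
    minimizes a seeded hash acting as the random permutation pi."""
    salt = random.Random(seed).getrandbits(64)
    def h(seq):
        return min(kmers(seq, k), key=lambda m: hash((salt, m)))
    return h

h = make_minhash(k=4, seed=1)
x, y = "AGTTGAGCGGAAGGTG", "AGTTGACCGGAAGGTG"
print(h(x), h(y))   # over random seeds, equal with probability |K(x) ∩ K(y)| / |K(x) ∪ K(y)|
```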
minHash sketch: dimensionality reduction

• Choose L hash functions from H: h_i, 1 ≤ i ≤ L
• Sketch of S: the vector Sk(S) = (h_i(S))_{1≤i≤L} (toy sketch below)
• Big compression: Mash³ with L = 1000, k = 21: ≈7000× compression
• Very fast pairwise comparison (Hamming distance between sketches)

  Sk(A) = (CGAG, TTAC, CATC, CCAT, CATG, ACAA)
  Sk(B) = (GTTT, TTAC, GTAG, ATTT, ACCC, ACAA)
  → Jd(K(A), K(B)) ≈ 1 − 2/6

³ Mash: fast genome and metagenome distance estimation using MinHash
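Putting the previous sketches together (reusing make_minhash() from above), building the L-coordinate sketch and estimating the Jaccard distance by position-wise agreement might look like this:

```python
def minhash_sketch(seq, k, L):
    """Sk(S) = (h_1(S), ..., h_L(S)), one seeded MinHash per coordinate."""
    return [make_minhash(k, seed=i)(seq) for i in range(L)]

def estimated_jaccard_distance(sk_a, sk_b):
    """The fraction of agreeing coordinates estimates |A ∩ B| / |A ∪ B|,
    so comparing two sketches costs O(L) whatever the sequence lengths."""
    matches = sum(a == b for a, b in zip(sk_a, sk_b))
    return 1.0 - matches / len(sk_a)

x, y = "AGTTGAGCGGAAGGTG", "AGTTGACCGGAAGGTG"
print(estimated_jaccard_distance(minhash_sketch(x, 4, 200), minhash_sketch(y, 4, 200)))
```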
OMH: LSH for the edit distance

• minHash: LSH for the k-mer Jaccard distance
• OMH: Ordered Min Hash
Jaccard ignores k-mer repetition

x = AAAAAAAAAAAAAAACCCCC  (n − k A's followed by k C's)
  → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}
y = AAAAACCCCCCCCCCCCCCC  (k A's followed by n − k C's)
  → {AAAAA, AAAAC, AAACC, AACCC, ACCCC, CCCCC}

Jaccard distance Jd(x, y) = 0        Edit distance Ed(x, y) ≥ 1 − 2k/n

Identical k-mer content and high edit distance (checked in the sketch below)
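Checking this example with the helpers defined earlier (kmers, jaccard_distance, edit_distance), here with n = 20 and k = 5:

```python
x = "A" * 15 + "C" * 5      # n - k A's then k C's (n = 20, k = 5)
y = "A" * 5 + "C" * 15      # k A's then n - k C's

print(kmers(x, 5) == kmers(y, 5))   # True: identical 5-mer sets
print(jaccard_distance(x, y, 5))    # 0.0
print(edit_distance(x, y))          # 10 = n - 2k substitutions
```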
Weighted Jaccard: Jaccard on multi-sets

• χ_A : U → {0, 1}, χ_A(x) = 1 ⇐⇒ x ∈ A
• χ^w_A : U → N, χ^w_A(x) = number of instances of x in A (sketch below)

  J(A, B) = |A ∩ B| / |A ∪ B| = Σ_{x∈U} min(χ_A(x), χ_B(x)) / Σ_{x∈U} max(χ_A(x), χ_B(x))

  J^w(A, B) = Σ_{x∈U} min(χ^w_A(x), χ^w_B(x)) / Σ_{x∈U} max(χ^w_A(x), χ^w_B(x))
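In code, the weighted Jaccard over k-mer counts is a min/max sum over a Counter (a self-contained sketch for illustration):

```python
from collections import Counter

def weighted_jaccard_distance(x, y, k):
    """1 - sum_m min(count_x(m), count_y(m)) / sum_m max(count_x(m), count_y(m))
    over all k-mers m, i.e. the multi-set version of the Jaccard distance."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    keys = cx.keys() | cy.keys()
    num = sum(min(cx[m], cy[m]) for m in keys)
    den = sum(max(cx[m], cy[m]) for m in keys)
    return 1.0 - num / den

x = "A" * 15 + "C" * 5
y = "A" * 5 + "C" * 15
print(weighted_jaccard_distance(x, y, 5))   # > 0, unlike the plain Jaccard distance
```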
Weighted Jaccard handles repetitions

x = AAAAAAAAAAAAAAACCCCC  (n − k A's followed by k C's)
  → { (AAAAA,1), (AAAAA,2), …, (AAAAA,11),
      (AAAAC,1), (AAACC,1), (AACCC,1), (ACCCC,1), (CCCCC,1) }
y = AAAAACCCCCCCCCCCCCCC  (k A's followed by n − k C's)
  → { (AAAAA,1), (AAAAC,1), (AAACC,1), (AACCC,1), (ACCCC,1),
      (CCCCC,1), (CCCCC,2), …, (CCCCC,11) }

Weighted Jaccard J^w_d(x, y) = 1 − (k+2)/n        Edit distance Ed(x, y) ≥ 1 − 2k/n

Weighted Jaccard = Jaccard for multi-sets
Jaccard and weighted Jaccard ignore relative order

x = CCCCACCAACACAAAACCC → { AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ACCC,
                            CAAA, CAAC, CACA, CACC, CCAA, CCAC, CCCA, CCCC }
y = AAAACACAACCCCACCAAA → { AAAA, AAAC, AACA, AACC, ACAA, ACAC, ACCA, ACCC,
                            CAAA, CAAC, CACA, CACC, CCAA, CCAC, CCCA, CCCC }

x, y: de Bruijn sequences, each containing all 16 possible 4-mers exactly once
There are (σ!)^(σ^(k−1)) such de Bruijn sequences of length σ^k + k − 1

Jd(x, y) = J^w_d(x, y) = 0        Ed(x, y) = 0.63 (checked below)
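The same check on this de Bruijn pair, reusing the distance helpers from the earlier sketches:

```python
x = "CCCCACCAACACAAAACCC"
y = "AAAACACAACCCCACCAAA"

print(jaccard_distance(x, y, 4))            # 0.0: same 4-mer set
print(weighted_jaccard_distance(x, y, 4))   # 0.0: same 4-mer multi-set
print(edit_distance(x, y) / len(x))         # large (≈ 0.63 on the slide)
```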
Jaccard is different from edit distance

Unlike the edit distance, k-mer Jaccard is insensitive to:
1. k-mer repetitions
2. relative positions of k-mers

• k-mer Jaccard is not an LSH for the edit distance
• Still provides big computation savings: asymmetric error model
OMH: Order Min Hash

• minHash is an LSH for Jaccard
• OMH is a refinement of minHash
• OMH is sensitive to:
  • repeated k-mers
  • relative order of k-mers
minHash & OMH sketches

S = AGTTGAGCGGAAGGTG, k = 2, L = 3, ℓ = 2

[Figure: step-by-step construction of the two sketches from S. For minHash, each of the L hash functions ranks the distinct 2-mers of S and only the minimum is kept. For OMH, each hash function ranks the (k-mer, occurrence) pairs of S, the ℓ smallest are kept, and they are recorded in the order in which they appear in S.]

Jaccard: Sk(S) = (GC, TG, GT)

OMH: Sk(S) = ((GC, CA), (AG, GG), (AG, TG))   (a simplified implementation follows)
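A simplified sketch of the OMH construction (illustrative, not the reference implementation from the paper): for each of the L hash functions, hash the (k-mer, occurrence rank) pairs, keep the ℓ smallest, and record their k-mers in sequence order.

```python
import random
from collections import defaultdict

def omh_sketch(seq, k, L, ell):
    """Simplified Order Min Hash sketch. Each coordinate keeps the ell
    (k-mer, occurrence rank) pairs with the smallest seeded hash, listed
    in the order they appear in seq, so repetition and order both matter."""
    occ = []                       # (position, k-mer, occurrence rank)
    seen = defaultdict(int)
    for i in range(len(seq) - k + 1):
        m = seq[i:i + k]
        occ.append((i, m, seen[m]))
        seen[m] += 1
    sketch = []
    for l in range(L):
        salt = random.Random(l).getrandbits(64)
        chosen = sorted(occ, key=lambda t: hash((salt, t[1], t[2])))[:ell]
        chosen.sort(key=lambda t: t[0])        # restore sequence order
        sketch.append(tuple(m for _, m, _ in chosen))
    return sketch

print(omh_sketch("AGTTGAGCGGAAGGTG", k=2, L=3, ell=2))
```

With ℓ = 1 this degenerates to a (weighted) minHash-style sketch, which matches the practical note below.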
OMH is an LSH for the edit distance

Theorem: OMH is an LSH for the edit distance
There exist (d1, d2, p1, p2) such that OMH is (d1, d2, p1, p2)-sensitive for the edit distance.

• p1: related to the probability of hash collisions of the weighted Jaccard
• p2: related to the length of an increasing sequence given the weighted Jaccard
Practical considerations with Jaccard sketches

Jaccard:
• Can use canonical k-mers (helper below)
• Difficult to find independent hashes: use bottom sketches (L ≪ n)

OMH:
• ℓ times as large (cost of encoding order)
• ℓ = 1: LSH / unbiased estimator of the weighted Jaccard
• Can't use canonical k-mers: double sketch
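For completeness, a canonical k-mer is the smaller of a k-mer and its reverse complement (a minimal helper, assuming a DNA alphabet over ACGT):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """The lexicographically smaller of the k-mer and its reverse complement,
    so that both strands map to the same key."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)

print(canonical("AGTT"), canonical("AACT"))   # both AACT
```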
OMH has a large gap

• |S| = 100, k = 5
• The current proof has a large gap
• What is the smallest gap possible?
• OMH/minHash is similar to an embedding into Hamming space: a gap is probably unavoidable

[Plot: collision probability p versus similarity s for ℓ = 2, 5, 10, showing the p1 (lower-bound) and p2 (upper-bound) curves]