Locality Sensitive Hashing
with Application to Near Neighbor Reporting
Hsiao-Fei Liu
2015.3.4
Motivation
• Real world applications
– Recommendation system
• Searching for similar items and users
– Malicious website detection
• Searching for websites similar to some known malicious websites
• The underlying core problem
Given:
• A large set P of high-dimensional data points in a metric space M
• A large set Q of high-dimensional query points in a metric space M
Goal:
• Find near neighbors in P for each query point in Q
• Avoid linearly scanning P for each query
Related Work
• Nearest Neighbor Searching
– Given: a set P of n points in a metric space M
– Goal: for any query q return a point p ∈ P minimizing dist(p,q)
• Classic Result
– Point location in arrangements of hyperplanes
• Meiser, IC’93
• In a d-dimensional Euclidean space under some Lp norm
• d^O(1) · log n query time and n^O(d) space
Related Work
• Approximate Nearest Neighbor
– Given: a set P of n points in a metric space M and ε > 0
– Goal: for any query q return a point p ∈ P s.t. dist(p,q) ≤ (1+ε)·dist(p*,q),
where p* is the nearest neighbor to q
• Classic Result
– Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality
• Har-Peled, Indyk and Motwani, STOC’98, FOCS’01, ToC’12
• In d-dimensional Euclidean space under the Lp norm
• O(d·n^(1/(1+ε))) query time and O(d·n + n^(1 + 1/(1+ε))) space
• Technique 1: Approximate Nearest Neighbor reduces to
Approximate Near Neighbor with little overhead
• Technique 2: Locality Sensitive Hashing for Approximate Near Neighbor
Overview
1. Near Neighbor Reporting
– Formal problem formulation
2. Locality Sensitive Hashing
– Definition and example
– Algorithms for NNR based on LSH
– Query time decomposition
3. Performance tuning
– Gap amplification
– Parameter optimization
Near Neighbor Reporting
Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A deterministic algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_q[cvg(q|T)] ≥ c, where q ~ f and cvg(q|T) = |T.nbrs(q)| / |nbrs(q)|
hard to achieve!
(Relaxed) Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A randomized algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_T[E_q[cvg(q|T)]] = Σ_T Pr(T is built) · E_q[cvg(q|T)] ≥ c, where q ~ f
• Fact:
– If E_T[cvg(q|T)] ≥ c for every q,
then E_T[E_q[cvg(q|T)]] = E_q[E_T[cvg(q|T)]] ≥ c, where q ~ f
(swap the order of the two expectations and apply the pointwise bound)
Locality Sensitive Hashing
Locality Sensitive Hashing
• Informally, an LSH H is a set of hash functions over a metric
space satisfying the following condition.
• Let h be chosen uniformly at random from H
– p and q are closer => Pr(h(p) = h(q)) is higher
• Formally, H is (r, r+, c, c')-sensitive if
– r < r+ and c > c'
– Let h be chosen uniformly at random from H
– If dist(p, q) ≤ r, then Pr(h(p) = h(q)) ≥ c
– If dist(p, q) ≥ r+, then Pr(h(p) = h(q)) ≤ c'
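One way to read this definition in code is to represent a family H by a sampler that draws a random h, and to estimate Pr(h(p) = h(q)) by Monte Carlo. The sketch below is only illustrative and not from the slides (the shifted-grid family for 1-D points and all names are assumptions); the families on the next two slides can be used as samplers in exactly the same way.

```python
import random

def estimate_collision_prob(sample_hash, p, q, trials=20_000):
    """Monte-Carlo estimate of Pr_h(h(p) = h(q)) when h is drawn uniformly at random
    from the family; for an LSH family, closer p and q give larger estimates."""
    hits = 0
    for _ in range(trials):
        h = sample_hash()
        if h(p) == h(q):
            hits += 1
    return hits / trials

def sample_grid_hash(width=4.0):
    """Toy family for points on the real line: bucket by a randomly shifted grid of
    cell size `width`; nearby numbers land in the same cell more often than far ones."""
    shift = random.uniform(0, width)
    return lambda x: int((x + shift) // width)

print(estimate_collision_prob(sample_grid_hash, 1.0, 2.0))   # high (distance 1 < width)
print(estimate_collision_prob(sample_grid_hash, 1.0, 5.0))   # lower (distance 5 > width)
```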
LSH for angular distance
• Distance function: d(u, v) = arccos(u·v / (‖u‖‖v‖)) = θ(u, v), the angle between u and v
• Random Projection:
– Choose a random unit vector w
– h(u) = sgn(u·w)
– Pr(h(u) = h(v)) = 1 − θ(u, v)/π
(Figure: vectors u and v separated by angle θ.)
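A possible Python sketch of this random-projection family, assuming vectors are given as plain lists; the helper names are illustrative. Drawing w from a spherically symmetric Gaussian is equivalent to drawing a random direction, so only the sign of u·w matters.

```python
import math
import random

def sample_random_projection(dim):
    """Draw h(u) = sgn(u . w) for a random Gaussian vector w; only w's direction matters,
    so this is equivalent to choosing a random unit vector."""
    w = [random.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda u: 1 if sum(ui * wi for ui, wi in zip(u, w)) >= 0 else -1

def angle(u, v):
    """theta(u, v) in radians for nonzero vectors u and v."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))

# Empirically, the collision rate should be close to 1 - theta(u, v)/pi.
u, v = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]
hits, trials = 0, 20_000
for _ in range(trials):
    h = sample_random_projection(3)
    if h(u) == h(v):
        hits += 1
print(hits / trials, 1 - angle(u, v) / math.pi)   # both approximately 0.75
```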
LSH for Jaccard distance
• Distance function: d(A, B) = 1 − |A ∩ B| / |A ∪ B|
• MinHash:
– Pick a random permutation π on the universe U
– h(A) = argmin_{a ∈ A} π(a)
– Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 − d(A, B)
• Note:
– Finding the Jaccard median
• Very easy to understand, very hard to compute
• Studied since 1981
• Chierichetti, Kumar, Pandey, Vassilvitskii, SODA'10
– NP-hard, and there is no FPTAS
– There is a PTAS
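A minimal MinHash sketch in Python, assuming small explicit sets so that a full random permutation of the universe is affordable; the names are illustrative. In practice the explicit permutation is usually replaced by random hash functions, but the collision-probability identity above is the same.

```python
import random

def sample_minhash(universe):
    """Draw one MinHash function: a random permutation pi of the universe,
    with h(A) = argmin_{a in A} pi(a)."""
    order = list(universe)
    random.shuffle(order)
    rank = {a: i for i, a in enumerate(order)}   # rank[a] plays the role of pi(a)
    return lambda A: min(A, key=lambda a: rank[a])

def jaccard_similarity(A, B):
    return len(A & B) / len(A | B)

# Pr(h(A) = h(B)) should be close to |A ∩ B| / |A ∪ B| = 1 - d(A, B).
U = set(range(100))
A, B = set(range(0, 60)), set(range(40, 100))   # Jaccard similarity 20/100 = 0.2
hits, trials = 0, 20_000
for _ in range(trials):
    h = sample_minhash(U)
    if h(A) == h(B):
        hits += 1
print(hits / trials, jaccard_similarity(A, B))   # both approximately 0.2
```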
Algorithm for NNR Based on LSH
• Let H be an (r, r+, c, c')-sensitive LSH over a metric space M
• Consider the following randomized algorithm for NNR
– Uniformly at random choose a hash function h from H
– Build a hash table Th s.t. Th(q) = {p ∈ P : h(p) = h(q)}
– Define Th.nbrs(q) to return the points p in Th(q) with dist(p,q) ≤ r
• For any q,
• E_Th[cvg(q|Th)]
  = Σ_Th Pr(Th is built) · cvg(q|Th)
  = Σ_h Pr(h is chosen) · ( Σ_{p ∈ nbrs(q)} δ(h(p) = h(q)) ) / |nbrs(q)|
  = ( Σ_{p ∈ nbrs(q)} Σ_h Pr(h is chosen) · δ(h(p) = h(q)) ) / |nbrs(q)|
  = ( Σ_{p ∈ nbrs(q)} Pr(h(p) = h(q)) ) / |nbrs(q)|
  ≥ c
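A minimal sketch of this single-table scheme in Python, assuming `sample_hash` draws h from a family such as the MinHash or random-projection samplers above and `dist` is the metric; the names are illustrative, not a fixed API.

```python
from collections import defaultdict

def build_table(P, sample_hash):
    """Pick one h from H at random and bucket every p in P under h(p)."""
    h = sample_hash()
    table = defaultdict(list)
    for p in P:
        table[h(p)].append(p)
    return h, table

def table_nbrs(h, table, q, r, dist):
    """T_h.nbrs(q): scan only q's bucket and keep the points within distance r."""
    return [p for p in table.get(h(q), []) if dist(p, q) <= r]
```

Per the argument above, each true neighbor of q lands in q's bucket with probability at least c, so the expected coverage of this randomized structure is at least c.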
Query Time
• Query time = time for computing h(q)
  + |{p ∈ P : h(p) = h(q) & dist(p,q) > r}| · d
  + |{p ∈ P : h(p) = h(q) & dist(p,q) ≤ r}| · d
  = time_c + time_FP + |output|
Performance Tuning
Why Gap Amplification
• To lift the coverage rate c
• To reduce the false-positive rate and thus improve time_FP
(Figure: collision probability vs. Jaccard distance, before and after amplification.
Before: an (r, r+, c, c')-sensitive LSH with r = 0.2, r+ = 0.4, c = 0.8, c' = 0.6.
After gap amplification: c is lifted to 0.9 while c' drops to 0.1.)
How Gap Amplification
• Construct LSH G from the original LSH H
– LSH B = {b(x; h1, h2, …, hk) : hi ∈ H}
• b(p; h1, h2, …, hk) = b(q; h1, h2, …, hk) iff hi(p) = hi(q) for all i = 1, …, k (AND)
– LSH G = {g(x; b1, b2, …, bL) : bi ∈ B}
• g(p; b1, b2, …, bL) = g(q; b1, b2, …, bL) iff bi(p) = bi(q) for some i = 1, …, L (OR)
• Intuition (a code sketch follows below):
– AND increases the gap
• Collision probabilities of distant points decrease
exponentially faster than those of near points
– OR increases the collision probabilities approx. linearly
• Let P = Pr_{h ∈ H}(h(p) = h(q))
⇒ Pr_{b ∈ B}(b(p) = b(q)) = P^k
⇒ Pr_{g ∈ G}(g(p) ≠ g(q)) = (1 − P^k)^L
⇒ Pr_{g ∈ G}(g(p) = g(q)) = 1 − (1 − P^k)^L ≈ L·P^k for small P^k
(Figure: g consists of L buckets, each an AND of k hash functions, combined by OR.)
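A possible Python rendering of the AND/OR construction, building on a `sample_hash` sampler for the base family H as in the earlier sketches; the names are illustrative.

```python
def sample_band(sample_hash, k):
    """AND construction: b = (h1, ..., hk); b(p) = b(q) iff all k component hashes agree."""
    hs = [sample_hash() for _ in range(k)]
    return lambda p: tuple(h(p) for h in hs)

def sample_amplified(sample_hash, k, L):
    """OR construction over L bands: g(p) = g(q) iff some band agrees,
    so Pr(g(p) = g(q)) = 1 - (1 - P^k)^L."""
    return [sample_band(sample_hash, k) for _ in range(L)]

def amplified_collide(bands, p, q):
    return any(b(p) == b(q) for b in bands)
```

In an index, each band would get its own hash table, and a query probes the L buckets it maps to.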
Parameter Optimization
• Situation
– A (r, r+, c, c’)-sensitive LSH H is given
– After gap amplification we want c to be lifted to c
• Let c = 1 – (1-ck)L
⇒ 𝐿 =
log 1 – c
log 1 − ck
⇒ L is a strictly increasing function of k
• So we only need to select a good k
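As a quick sketch of the formula above, with illustrative values c = 0.8 lifted to a target C = 0.9, the number of buckets L needed grows with k:

```python
import math

def buckets_needed(c, C, k):
    """Smallest integer L with 1 - (1 - c**k)**L >= C."""
    return math.ceil(math.log(1 - C) / math.log(1 - c ** k))

for k in range(1, 6):
    print(k, buckets_needed(0.8, 0.9, k))   # L increases as k increases
```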
How to select a good k
• How to measure the “goodness”?
– Minimize time_c + E[time_FP] under the space constraint
• Let’s investigate how the query time and space usage will
react when we increase k
k ↑ => space ↑
• Space = O(dn + nL) = O(n(d + log(1 − C) / log(1 − c^k)))
k ↑ => time_c ↑
• time_c = O(dkL) = O(dk · log(1 − C) / log(1 − c^k))
– Interested readers can refer to E2LSH for how to reduce time_c to O(dkL^(1/2))
k ↑ => time_FP ↓
(Figure: s-curves Y = 1 − (1 − P^k)^L for k = 2 and k = 3, plotted against the original
collision probability P of distant points; both curves pass through (c, C).)
• Consider the s-curves Y = 1 − (1 − P^k)^L passing through (c, C)
• Larger k => steeper s-curve
=> collision prob. drops faster for distant points
=> fewer false positives
=> time_FP ↓
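A short sketch evaluating the s-curve, assuming the same illustrative numbers as the earlier figure (c = 0.8, target C = 0.9, distant-point collision probability c' = 0.6): with L chosen from k as above, the collision probability of distant points falls as k grows.

```python
import math

def s_curve(P, k, L):
    return 1 - (1 - P ** k) ** L

def buckets_needed(c, C, k):
    return math.ceil(math.log(1 - C) / math.log(1 - c ** k))

c, C, c_far = 0.8, 0.9, 0.6
for k in (1, 2, 3, 4):
    L = buckets_needed(c, C, k)
    print(k, L, round(s_curve(c_far, k, L), 3))   # distant-point collision prob. drops with k
```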
Procedure for optimizing k
1. Determine the largest possible value k_max for k without
violating the space constraint
2. Find k* in [1 .. k_max] minimizing time_c(k) + E[time_FP(k)]
– In practice, time_c and time_FP are measured experimentally by
• constructing a data structure T
• running several queries sampled from S on T
(A sketch of this two-step loop follows below.)
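A minimal sketch of the procedure, assuming a hypothetical `build(k)` that constructs the amplified structure with the corresponding L, and a `sample_queries(n)` helper drawing queries from S; both names are assumptions, not a fixed API. The measured time lumps time_c and time_FP (plus output time) together, which is the quantity one actually observes.

```python
import time

def optimize_k(k_max, build, sample_queries, n_queries=100):
    """Try k = 1 .. k_max (k_max from the space constraint), build the structure for each k,
    measure the average query time on queries sampled from S, and keep the best k."""
    best_k, best_time = None, float("inf")
    for k in range(1, k_max + 1):
        T = build(k)
        start = time.perf_counter()
        for q in sample_queries(n_queries):
            T.nbrs(q)
        elapsed = (time.perf_counter() - start) / n_queries
        if elapsed < best_time:
            best_k, best_time = k, elapsed
    return best_k
```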
Observation & Question
• Let Δ_k time_c = time_c(k) − time_c(k−1)
• Let Δ_k time_FP = E[time_FP(k−1)] − E[time_FP(k)]
• Observation:
– If Δ_k time_c is increasing in k and Δ_k time_FP is decreasing in k, then k*
is the largest k such that Δ_k time_FP > Δ_k time_c, and it can be found using
binary search (see the sketch below).
• Question:
– In which situation will Δ_k time_FP = E[time_FP(k−1)] − E[time_FP(k)] be increasing?
(Figure: Δ_k time_c and Δ_k time_FP plotted against k.)
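Under the observation above, a binary-search sketch for k*; the delta functions are hypothetical callbacks (e.g. backed by the measurements from the previous sketch).

```python
def optimize_k_binary(k_max, delta_time_c, delta_time_fp):
    """Assuming delta_time_c(k) is increasing and delta_time_fp(k) is decreasing in k,
    k* is the largest k in [1 .. k_max] with delta_time_fp(k) > delta_time_c(k);
    binary search finds it with O(log k_max) probes."""
    lo, hi, best = 1, k_max, 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if delta_time_fp(mid) > delta_time_c(mid):
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best
```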
Summary
• Near Neighbor Reporting
– Finds many applications in practice
• Locality Sensitive Hashing
– Hash near points to the same value
– One of the most useful techniques for NNR
• Performance tuning
– Gap amplification for higher coverage and lower FP
– Parameter optimization for query time
Further Reading
• Dimensionality Reduction
– Variance preserving
• Principal Component Analysis
• Singular Value Decomposition
– Distance preserving
• Random Projection and the Johnson–Lindenstrauss lemma
– Locality preserving
• Locally Linear Embedding
• Multi-dimensional Scaling
• ISOMAP
References
1. Indyk and Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality. STOC'98. (Journal version: Har-Peled, Indyk and Motwani, Theory of Computing, 2012.)
2. Gionis, Indyk and Motwani. Similarity Search in High Dimensions via Hashing. VLDB'99.
3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
4. E2LSH (the Exact Euclidean LSH package by Andoni and Indyk)
Appendix
• Suppose that we have an (r, r+, c, err)-sensitive LSH H and want to
amplify H to get an (r, r+, C, Err)-sensitive LSH G.
• How do the bucket number L and the collision error Err change with k?
Increasing rate of bucket number
Let 1 − (1 − c^k)^L = C
⇒ L·c^k ≥ C and C ≥ 1 − e^(−L·c^k)
⇒ C / c^k ≤ L ≤ −ln(1 − C) / c^k
⇒ L = Θ(1 / c^k)
Decreasing rate of collision error
• err = 1 − 1 − err 𝑘 𝐿  𝐿err 𝑘
 (
err
𝑐
) 𝑘 for some constant  −− −(1)
• err = 1 − 1 − err 𝑘 𝐿  1 − 𝑒−𝐿err 𝑘
 1 − 𝑒
−𝛽
err
𝑐
𝑘
for some constant 0 < 𝛽 < 1
= 𝛽
err
𝑐
𝑘 −
𝛽2 err
𝑐
2𝑘
2!
+
𝛽3 err
𝑐
3𝑘
3!
… = 
err
𝑐
𝑘 −− −(2)
• By (1) and (2), we have err = θ
err
𝑐
𝑘
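A quick numeric sanity check of the bound, with illustrative values err = 0.6, c = 0.8, C = 0.9: the ratio Err / (err/c)^k stays within constant bounds as k grows, consistent with Err = Θ((err/c)^k).

```python
import math

def amplified_err(err, c, C, k):
    """Err = 1 - (1 - err**k)**L with L chosen (as a real number) so that coverage hits C."""
    L = math.log(1 - C) / math.log(1 - c ** k)
    return 1 - (1 - err ** k) ** L

err, c, C = 0.6, 0.8, 0.9
for k in (2, 4, 6, 8):
    ratio = amplified_err(err, c, C, k) / (err / c) ** k
    print(k, round(ratio, 3))   # ratio stays bounded by constants as k grows
```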