Locality sensitive hashing (LSH) is a technique for speeding up near neighbor search in high-dimensional spaces. LSH uses families of hash functions that map similar items to the same bucket with high probability. These slides motivate near neighbor searching, define the near neighbor reporting problem, and introduce LSH. They also cover gap amplification, which improves the quality guarantees of an LSH family, and parameter optimization to minimize query time.
Motivation
• Real-world applications
– Recommendation system
• Searching for similar items and users
– Malicious website detection
• Searching for websites similar to some known malicious websites
• The underlying core problem
Given:
• A large set P of high-dimensional data points in a metric space M
• A large set Q of high-dimensional query points in a metric space M
Goal:
• Find near neighbors in P for each query point in Q
• Avoid linearly scanning P for each query
Related Work
• Nearest Neighbor Searching
– Given: a set P of n points in a metric space M
– Goal: for any query q return a point p ∈ P minimizing dist(p,q)
• Classic Result
– Point location in arrangements of hyperplanes
• Meiser, IC’93
• In a d-dimensional Euclidean space under some Lp norm
• d^O(1) · log n query time and n^O(d) space
Related Work
• Approximate Nearest Neighbor
– Given: a set P of n points in a metric space M and ε > 0
– Goal: for any query q return a point p ∈ P s.t. dist(p,q) ≤ (1+ε)·dist(p*,q),
where p* is the nearest neighbor to q
• Classic Result
– Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality
• Har-Peled, Indyk and Motwani, STOC’98, FOCS’01, ToC’12
• In d-dimensional Euclidean space under Lp norm
• O(d·n^(1/(1+ε))) query time and O(dn + n^(1+1/(1+ε))) space
• Technique 1: Approximate Nearest Neighbor reduces to Approximate Near Neighbor with little overhead
• Technique 2: Locality Sensitive Hashing for Approximate Near Neighbor
Overview
1. Near Neighbor Reporting
– Formal problem formulation
2. Locality Sensitive Hashing
– Definition and example
– Algorithms for NNR based on LSH
– Query time decomposition
3. Performance tuning
– Gap amplification
– Parameter optimization
Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A deterministic algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_q[cvg(q|T)] ≥ c where q ~ f, and cvg(q|T) = |T.nbrs(q)| / |nbrs(q)|
– Note: this deterministic guarantee is hard to achieve
(Relaxed) Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A randomized algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_T[E_q[cvg(q|T)]] = Σ_T Pr(T is built) · E_q[cvg(q|T)] ≥ c, where q ~ f
• Fact:
– if E_T[cvg(q|T)] ≥ c for every q,
then E_T[E_q[cvg(q|T)]] = E_q[E_T[cvg(q|T)]] ≥ c where q ~ f
(swap the order of the two expectations; T and q are independent)
Locality Sensitive Hashing
• Informally, an LSH H is a set of hash functions over a metric
space satisfying the following condition.
• Let h be chosen uniformly at random from H
– the closer p and q are, the higher Pr(h(p) = h(q))
• Formally, H is (r, r+, c, c')-sensitive if
– r < r+ and c > c'
– Let h be chosen uniformly at random from H
– If dist(p, q) ≤ r, then Pr(h(p) = h(q)) ≥ c
– If dist(p, q) ≥ r+, then Pr(h(p) = h(q)) ≤ c'
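To make the definition concrete, here is a minimal Python sketch (my own illustration, not from the slides) that estimates Pr(h(p) = h(q)) for a candidate family by sampling hash functions. Checking sensitivity amounts to verifying that the estimate is at least c for pairs within distance r and at most c' for pairs beyond r+.

    def estimate_collision_prob(sample_h, p, q, trials=10000):
        # sample_h() draws one hash function uniformly from the family H
        hits = 0
        for _ in range(trials):
            h = sample_h()
            if h(p) == h(q):
                hits += 1
        return hits / trials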
LSH for angular distance
• Distance function: d(u, v) = arccos(u·v) = θ(u, v), the angle between unit vectors u and v
• Random Projection:
– Choose a random unit vector w
– h(u) = sgn(u·w)
– Pr(h(u) = h(v)) = 1 − θ(u,v)/π
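A minimal sketch of this hash family (the function name is mine): draw a Gaussian vector, normalize it to a random unit vector w, and hash u to the sign of u·w.

    import numpy as np

    def sample_random_projection_hash(dim, rng=None):
        rng = rng or np.random.default_rng()
        # A normalized Gaussian vector is a uniformly random unit vector w
        w = rng.standard_normal(dim)
        w /= np.linalg.norm(w)
        # h(u) = sgn(u . w): which side of the hyperplane normal to w the point falls on
        return lambda u: 1 if np.dot(u, w) >= 0 else -1

Plugged into the estimator above, e.g. estimate_collision_prob(lambda: sample_random_projection_hash(3), u, v), the estimate should approach 1 − θ(u,v)/π.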
LSH for Jaccard distance
• Distance function: d(A, B) = 1 − |A ∩ B| / |A ∪ B|
• MinHash:
– Pick a random permutation π on the universe U
– h(A) = argmin_{a∈A} π(a)
– Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 − d(A, B)
• Note:
– Finding the Jaccard median
• Very easy to understand, very hard to compute
• Studied since 1981
• Chierichetti, Kumar, Pandey, Vassilvitskii, SODA'10
– NP-hard and admits no FPTAS
– but a PTAS exists
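A minimal MinHash sketch matching the definition above (names mine). It materializes the permutation explicitly, so it assumes a small universe U; real implementations replace the permutation with a random hash function.

    import random

    def sample_minhash(universe, rng=random):
        # Pick a random permutation pi of the universe U
        perm = list(universe)
        rng.shuffle(perm)
        rank = {x: i for i, x in enumerate(perm)}
        # h(A) = argmin_{a in A} pi(a)
        return lambda A: min(A, key=rank.__getitem__)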
Algorithm for NNR Based on LSH
• Let H be a (r, r+, c, c’)-sensitive LSH over a metric space M
• Consider the following randomized algorithm for NNR
– Uniformly at random choose a hash function h from H
– Build a hash table Th s.t. Th(q) = {p ∈ P : h(p) = h(q)}
– Define Th.nbrs(q) to return the points p in Th(q) with dist(p,q) ≤ r
• For any q,
E_{Th}[cvg(q|Th)]
= Σ_{Th} Pr(Th is built) · cvg(q|Th)
= Σ_h Pr(h is chosen) · (Σ_{p∈nbrs(q)} δ(h(p) = h(q))) / |nbrs(q)|
= (Σ_{p∈nbrs(q)} Σ_h Pr(h is chosen) · δ(h(p) = h(q))) / |nbrs(q)|
= (Σ_{p∈nbrs(q)} Pr(h(p) = h(q))) / |nbrs(q)|
≥ c
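A minimal sketch of this algorithm in Python (names mine):

    from collections import defaultdict

    def build_table(P, h):
        # T_h(q) = {p in P : h(p) = h(q)}, stored as buckets keyed by hash value
        T = defaultdict(list)
        for p in P:
            T[h(p)].append(p)
        return T

    def table_nbrs(T, h, q, r, dist):
        # T_h.nbrs(q): scan only q's bucket and keep points within distance r
        return [p for p in T.get(h(q), []) if dist(p, q) <= r]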
Query Time
• Query time = time for computing h(q)
+ |{p ∈ P : h(p) = h(q) & dist(p,q) > r}| · d
+ |{p ∈ P : h(p) = h(q) & dist(p,q) ≤ r}| · d
= time_c + time_FP + |output|
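The decomposition can be seen directly by instrumenting the query (a sketch; the counters are mine): every point sharing q's bucket costs one O(d) distance computation, and the ones farther than r are pure false-positive overhead.

    def query_with_counters(T, h, q, r, dist):
        bucket = T.get(h(q), [])       # computing h(q) and the lookup: time_c
        out, false_positives = [], 0
        for p in bucket:               # each distance test costs O(d)
            if dist(p, q) <= r:
                out.append(p)          # |output|
            else:
                false_positives += 1   # these tests make up time_FP
        return out, false_positives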
Gap Amplification
• Construct LSH G from the original LSH H
– LSH B = {b(x; h1, h2, …, hk) : hi ∈ H}
• b(p; h1, …, hk) = b(q; h1, …, hk) iff AND_{i=1,…,k} hi(p) = hi(q)
– LSH G = {g(x; b1, b2, …, bL) : bi ∈ B}
• g(p; b1, …, bL) = g(q; b1, …, bL) iff OR_{i=1,…,L} bi(p) = bi(q)
• Intuition:
– AND increases the gap
• Collision probabilities of distant points decrease exponentially faster than those of near points
– OR increases the collision probabilities approx. linearly
• Let P = Pr_{h∈H}(h(p) = h(q))
⇒ Pr_{b∈B}(b(p) = b(q)) = P^k
⇒ Pr_{g∈G}(g(p) ≠ g(q)) = (1 − P^k)^L
⇒ Pr_{g∈G}(g(p) = g(q)) = 1 − (1 − P^k)^L ≈ L·P^k (when P^k is small)
[Figure: L hash tables (bands) OR-ed together, each band built from k AND-ed hashes]
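A minimal sketch of the AND-OR construction (names mine): a band b concatenates k independent base hashes, so b(p) = b(q) iff all k agree (AND); g keeps L independent bands and declares a collision if any band collides (OR). In practice each band gets its own hash table and the query unions the L buckets.

    def sample_band(sample_h, k):
        # AND: b(x) is the tuple of k base hash values
        hs = [sample_h() for _ in range(k)]
        return lambda x: tuple(h(x) for h in hs)

    def sample_amplified(sample_h, k, L):
        bands = [sample_band(sample_h, k) for _ in range(L)]
        def collide(p, q):
            # OR over bands: p and q collide under g iff some band agrees
            return any(b(p) == b(q) for b in bands)
        return bands, collide

If P = Pr(h(p) = h(q)), a band collides with probability P^k and g collides with probability 1 − (1 − P^k)^L, matching the formulas above.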
Parameter Optimization
• Situation
– A (r, r+, c, c')-sensitive LSH H is given
– After gap amplification we want c to be lifted to a target c̃
• Let c̃ = 1 − (1 − c^k)^L
⇒ L = log(1 − c̃) / log(1 − c^k)
⇒ L is a strictly increasing function of k
• So we only need to select a good k
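Rearranging 1 − (1 − c^k)^L ≥ c̃ gives the smallest sufficient L for a given k (a small sketch; the function name is mine):

    import math

    def bands_needed(c, c_target, k):
        # Smallest integer L with 1 - (1 - c**k)**L >= c_target
        # (both logs are negative, so the inequality direction flips)
        return math.ceil(math.log(1 - c_target) / math.log(1 - c ** k))

For example, bands_needed(0.8, 0.99, 4) returns 9: nine bands of four hashes lift a per-hash collision probability of 0.8 to about 0.991.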
How to select a good k
• How to measure the “goodness”?
– Minimize time_c + E[time_FP] under the space constraint
• Let’s investigate how the query time and space usage will
react when we increase k
k ⇒ space
• Space = O(dn + nL) = O(n(d + log(1 − c̃) / log(1 − c^k)))
k ⇒ time_c
• time_c = O(dkL) = O(dk · log(1 − c̃) / log(1 − c^k))
– Interested readers can refer to E2LSH for how to reduce time_c to O(dkL^(1/2))
k ⇒ time_FP
[Figure: s-curves Y = 1 − (1 − P^k)^L through (c, c̃) for k = 2 and k = 3, plotted against the original collision probability P of a pair of points]
• Consider the s-curves Y = 1 − (1 − P^k)^L passing through (c, c̃)
• Larger k ⇒ steeper s-curve
⇒ collision prob. drops faster for distant points
⇒ fewer false positives
⇒ smaller time_FP
Procedure for optimizing k
1. Determine the largest possible value k_max for k without violating the space constraint
2. Find k* in [1..k_max] minimizing time_c(k) + E[time_FP(k)]
– In practice, time_c and time_FP are measured experimentally by
• constructing a data structure T
• running several queries sampled from S on T
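A sketch of this selection procedure (the two timing functions stand in for the experimental measurements described above; names mine):

    def optimize_k(k_max, time_c, time_fp):
        # time_c(k), time_fp(k): measured on a structure built with
        # parameter k, using queries sampled from S
        return min(range(1, k_max + 1),
                   key=lambda k: time_c(k) + time_fp(k))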
Observation & Question
• Let Δ_k time_c = time_c(k) − time_c(k−1)
• Let Δ_k time_FP = E[time_FP(k−1)] − E[time_FP(k)]
• Observation:
– If Δ_k time_c is increasing and Δ_k time_FP is decreasing, then k* would be the largest k such that Δ_k time_FP > Δ_k time_c, and can be found using binary search.
• Question:
– In which situation will Δ_k time_FP = E[time_FP(k−1)] − E[time_FP(k)] be increasing?
[Figure: Δ_k time_c and Δ_k time_FP plotted against k]
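Under the monotonicity assumptions in the observation, the predicate Δ_k time_FP > Δ_k time_c is true for small k and false for large k, so the largest k satisfying it can be found by binary search (a sketch; names mine):

    def optimize_k_binary(k_max, delta_time_fp, delta_time_c):
        # Largest k in [1, k_max] with delta_time_fp(k) > delta_time_c(k),
        # assuming the predicate is monotone (true, then false)
        lo, hi, best = 1, k_max, 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if delta_time_fp(mid) > delta_time_c(mid):
                best, lo = mid, mid + 1
            else:
                hi = mid - 1
        return best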
Summary
• Near Neighbor Reporting
– Finds many applications in practice
• Locality Sensitive Hashing
– Hash near points to the same value
– One of the most useful techniques for NNR
• Performance tuning
– Gap amplification for higher coverage and lower FP
– Parameter optimization for query time
Further Reading
• Dimensionality Reduction
– Variance preserving
• Principal Component Analysis
• Singular Value Decomposition
– Distance preserving
• Random Projection and the Johnson–Lindenstrauss lemma
– Locality preserving
• Locally Linear Embedding
• Multi-dimensional Scaling
• ISOMAP
References
1. Har-Peled, Indyk and Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC'98, FOCS'01, ToC'12.
2. Gionis, Indyk and Motwani. Similarity Search in High Dimensions via Hashing. VLDB'99.
3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
4. E2LSH
Appendix
• Suppose that we have a (r, r+, c, err)-sensitive LSH H and want to amplify H to get a (r, r+, c̃, ẽrr)-sensitive LSH G.
• How do the bucket number L and the collision error ẽrr change with k?
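A sketch of the answer (not spelled out in the slides), by the same calculation used on the gap-amplification slide: L = log(1 − c̃) / log(1 − c^k) grows with k, and since distant points collide under a base hash with probability at most err, under G they collide with probability ẽrr = 1 − (1 − err^k)^L ≤ L·err^k. Since L ≈ ln(1/(1 − c̃)) / c^k for large k, this gives ẽrr ≲ ln(1/(1 − c̃)) · (err/c)^k, which tends to 0 as k grows because err < c.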