Locality Sensitive Hashing
with Application to Near Neighbor Reporting
Hsiao-Fei Liu
2015.3.4
Motivation
• Real world applications
– Recommendation system
• Searching for similar items and users
– Malicious website detection
• Searching for websites similar to some known malicious websites
• The underlying core problem
Given:
• A large set P of high-dimensional data points in a metric space M
• A large set Q of high-dimensional query points in a metric space M
Goal:
• Find near neighbors in P for each query point in Q
• Avoid linearly scanning P for each query
Related Work
• Nearest Neighbor Searching
– Given: a set P of n points in a metric space M
– Goal: for any query q return a point p ∈ P minimizing dist(p,q)
• Classic Result
– Point location in arrangements of hyperplanes
• Meiser, IC’93
• In a d-dimensional Euclidean space under some Lp norm
• d^O(1) · log n query time and n^O(d) space
Related Work
• Approximate Nearest Neighbor
– Given: a set P of n points in a metric space M and ε > 0
– Goal: for any query q return a point p ∈ P s.t. dist(p,q) ≤ (1+ε)·dist(p*,q),
where p* is the nearest neighbor to q
• Classic Result
– Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality
• Har-Peled, Indyk and Motwani, STOC’98, FOCS’01, ToC’12
• In d-dimensional Euclidean space under the Lp norm
• O(d·n^(1/(1+ε))) query time and O(d·n + n^(1 + 1/(1+ε))) space
• Technique 1: Approximate Nearest Neighbor reduces to
Approximate Near Neighbor with little overhead
• Technique 2: Locality Sensitive Hashing for Approximate Near Neighbor
Overview
1. Near Neighbor Reporting
– Formal problem formulation
2. Locality Sensitive Hashing
– Definition and example
– Algorithms for NNR based on LSH
– Query time decomposition
3. Performance tuning
– Gap amplification
– Parameter optimization
Near Neighbor Reporting
Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A deterministic algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_q[cvg(q|T)] ≥ c, where q ~ f and cvg(q|T) = |T.nbrs(q)| / |nbrs(q)|
hard to achieve!
(Relaxed) Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A randomized algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_T[E_q[cvg(q|T)]] = Σ_T Pr(T is built) · E_q[cvg(q|T)] ≥ c, where q ~ f
• Fact:
– If E_T[cvg(q|T)] ≥ c for every q,
then E_T[E_q[cvg(q|T)]] = E_q[E_T[cvg(q|T)]] ≥ c, where q ~ f
(swap the order of the two expectations and apply the pointwise bound)
Locality Sensitive Hashing
Locality Sensitive Hashing
• Informally, an LSH H is a set of hash functions over a metric
space satisfying the following condition.
• Let h be chosen uniformly at random from H
– p and q are closer => Pr(h(p) = h(q)) is higher
• Formally, H is (r, r+, c, c')-sensitive if
– r < r+ and c > c'
– Let h be chosen uniformly at random from H
– If dist(p, q) ≤ r, then Pr(h(p) = h(q)) ≥ c
– If dist(p, q) ≥ r+, then Pr(h(p) = h(q)) ≤ c'
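One way to read this definition in code is to represent a family H by a sampler that draws a random h, and to estimate Pr(h(p) = h(q)) by Monte Carlo. The sketch below is only illustrative and not from the slides (the shifted-grid family for 1-D points and all names are assumptions); the families on the next two slides can be used as samplers in exactly the same way.

```python
import random

def estimate_collision_prob(sample_hash, p, q, trials=20_000):
    """Monte-Carlo estimate of Pr_h(h(p) = h(q)) when h is drawn uniformly at random
    from the family; for an LSH family, closer p and q give larger estimates."""
    hits = 0
    for _ in range(trials):
        h = sample_hash()
        if h(p) == h(q):
            hits += 1
    return hits / trials

def sample_grid_hash(width=4.0):
    """Toy family for points on the real line: bucket by a randomly shifted grid of
    cell size `width`; nearby numbers land in the same cell more often than far ones."""
    shift = random.uniform(0, width)
    return lambda x: int((x + shift) // width)

print(estimate_collision_prob(sample_grid_hash, 1.0, 2.0))   # high (distance 1 < width)
print(estimate_collision_prob(sample_grid_hash, 1.0, 5.0))   # lower (distance 5 > width)
```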
LSH for angular distance
• Distance function: d(u, v) = arccos(u·v / (‖u‖‖v‖)) = θ(u, v), the angle between u and v
• Random Projection:
– Choose a random unit vector w
– h(u) = sgn(u·w)
– Pr(h(u) = h(v)) = 1 − θ(u, v)/π
(Figure: vectors u and v separated by angle θ.)
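A possible Python sketch of this random-projection family, assuming vectors are given as plain lists; the helper names are illustrative. Drawing w from a spherically symmetric Gaussian is equivalent to drawing a random direction, so only the sign of u·w matters.

```python
import math
import random

def sample_random_projection(dim):
    """Draw h(u) = sgn(u . w) for a random Gaussian vector w; only w's direction matters,
    so this is equivalent to choosing a random unit vector."""
    w = [random.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda u: 1 if sum(ui * wi for ui, wi in zip(u, w)) >= 0 else -1

def angle(u, v):
    """theta(u, v) in radians for nonzero vectors u and v."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))

# Empirically, the collision rate should be close to 1 - theta(u, v)/pi.
u, v = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]
hits, trials = 0, 20_000
for _ in range(trials):
    h = sample_random_projection(3)
    if h(u) == h(v):
        hits += 1
print(hits / trials, 1 - angle(u, v) / math.pi)   # both approximately 0.75
```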
LSH for Jaccard distance
• Distance function: d(A, B) = 1 − |A ∩ B| / |A ∪ B|
• MinHash:
– Pick a random permutation π on the universe U
– h(A) = argmin_{a ∈ A} π(a)
– Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 − d(A, B)
• Note:
– Finding the Jaccard median
• Very easy to understand, very hard to compute
• Studied since 1981
• Chierichetti, Kumar, Pandey, Vassilvitskii, SODA'10
– NP-hard, and there is no FPTAS
– There is a PTAS
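A minimal MinHash sketch in Python, assuming small explicit sets so that a full random permutation of the universe is affordable; the names are illustrative. In practice the explicit permutation is usually replaced by random hash functions, but the collision-probability identity above is the same.

```python
import random

def sample_minhash(universe):
    """Draw one MinHash function: a random permutation pi of the universe,
    with h(A) = argmin_{a in A} pi(a)."""
    order = list(universe)
    random.shuffle(order)
    rank = {a: i for i, a in enumerate(order)}   # rank[a] plays the role of pi(a)
    return lambda A: min(A, key=lambda a: rank[a])

def jaccard_similarity(A, B):
    return len(A & B) / len(A | B)

# Pr(h(A) = h(B)) should be close to |A ∩ B| / |A ∪ B| = 1 - d(A, B).
U = set(range(100))
A, B = set(range(0, 60)), set(range(40, 100))   # Jaccard similarity 20/100 = 0.2
hits, trials = 0, 20_000
for _ in range(trials):
    h = sample_minhash(U)
    if h(A) == h(B):
        hits += 1
print(hits / trials, jaccard_similarity(A, B))   # both approximately 0.2
```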
Algorithm for NNR Based on LSH
• Let H be an (r, r+, c, c')-sensitive LSH over a metric space M
• Consider the following randomized algorithm for NNR
– Uniformly at random choose a hash function h from H
– Build a hash table Th s.t. Th(q) = {p ∈ P : h(p) = h(q)}
– Define Th.nbrs(q) to return the points p in Th(q) with dist(p,q) ≤ r
• For any q,
• E_Th[cvg(q|Th)]
  = Σ_Th Pr(Th is built) · cvg(q|Th)
  = Σ_h Pr(h is chosen) · ( Σ_{p ∈ nbrs(q)} δ(h(p) = h(q)) ) / |nbrs(q)|
  = ( Σ_{p ∈ nbrs(q)} Σ_h Pr(h is chosen) · δ(h(p) = h(q)) ) / |nbrs(q)|
  = ( Σ_{p ∈ nbrs(q)} Pr(h(p) = h(q)) ) / |nbrs(q)|
  ≥ c
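A minimal sketch of this single-table scheme in Python, assuming `sample_hash` draws h from a family such as the MinHash or random-projection samplers above and `dist` is the metric; the names are illustrative, not a fixed API.

```python
from collections import defaultdict

def build_table(P, sample_hash):
    """Pick one h from H at random and bucket every p in P under h(p)."""
    h = sample_hash()
    table = defaultdict(list)
    for p in P:
        table[h(p)].append(p)
    return h, table

def table_nbrs(h, table, q, r, dist):
    """T_h.nbrs(q): scan only q's bucket and keep the points within distance r."""
    return [p for p in table.get(h(q), []) if dist(p, q) <= r]
```

Per the argument above, each true neighbor of q lands in q's bucket with probability at least c, so the expected coverage of this randomized structure is at least c.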
Query Time
• Query time = time for computing h(q)
  + |{p ∈ P : h(p) = h(q) & dist(p,q) > r}| · d
  + |{p ∈ P : h(p) = h(q) & dist(p,q) ≤ r}| · d
  = time_c + time_FP + |output|
Performance Tuning
Why Gap Amplification
• To lift the coverage rate c
• To reduce the false-positive rate and thus improve time_FP
(Figure: collision probability vs. Jaccard distance, before and after amplification.
Before: an (r, r+, c, c')-sensitive LSH with r = 0.2, r+ = 0.4, c = 0.8, c' = 0.6.
After gap amplification: c is lifted to 0.9 while c' drops to 0.1.)
How Gap Amplification
• Construct LSH G from the original LSH H
– LSH B = {b(x; h1, h2, …, hk) : hi ∈ H}
• b(p; h1, h2, …, hk) = b(q; h1, h2, …, hk) iff hi(p) = hi(q) for all i = 1, …, k (AND)
– LSH G = {g(x; b1, b2, …, bL) : bi ∈ B}
• g(p; b1, b2, …, bL) = g(q; b1, b2, …, bL) iff bi(p) = bi(q) for some i = 1, …, L (OR)
• Intuition (a code sketch follows below):
– AND increases the gap
• Collision probabilities of distant points decrease
exponentially faster than those of near points
– OR increases the collision probabilities approx. linearly
• Let P = Pr_{h ∈ H}(h(p) = h(q))
⇒ Pr_{b ∈ B}(b(p) = b(q)) = P^k
⇒ Pr_{g ∈ G}(g(p) ≠ g(q)) = (1 − P^k)^L
⇒ Pr_{g ∈ G}(g(p) = g(q)) = 1 − (1 − P^k)^L ≈ L·P^k for small P^k
(Figure: g consists of L buckets, each an AND of k hash functions, combined by OR.)
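A possible Python rendering of the AND/OR construction, building on a `sample_hash` sampler for the base family H as in the earlier sketches; the names are illustrative.

```python
def sample_band(sample_hash, k):
    """AND construction: b = (h1, ..., hk); b(p) = b(q) iff all k component hashes agree."""
    hs = [sample_hash() for _ in range(k)]
    return lambda p: tuple(h(p) for h in hs)

def sample_amplified(sample_hash, k, L):
    """OR construction over L bands: g(p) = g(q) iff some band agrees,
    so Pr(g(p) = g(q)) = 1 - (1 - P^k)^L."""
    return [sample_band(sample_hash, k) for _ in range(L)]

def amplified_collide(bands, p, q):
    return any(b(p) == b(q) for b in bands)
```

In an index, each band would get its own hash table, and a query probes the L buckets it maps to.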
Parameter Optimization
• Situation
– A (r, r+, c, c’)-sensitive LSH H is given
– After gap amplification we want c to be lifted to c
• Let c = 1 – (1-ck)L
⇒ 𝐿 =
log 1 – c
log 1 − ck
⇒ L is a strictly increasing function of k
• So we only need to select a good k
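As a quick sketch of the formula above, with illustrative values c = 0.8 lifted to a target C = 0.9, the number of buckets L needed grows with k:

```python
import math

def buckets_needed(c, C, k):
    """Smallest integer L with 1 - (1 - c**k)**L >= C."""
    return math.ceil(math.log(1 - C) / math.log(1 - c ** k))

for k in range(1, 6):
    print(k, buckets_needed(0.8, 0.9, k))   # L increases as k increases
```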
How to select a good k
• How to measure the “goodness”?
– Minimize time_c + E[time_FP] under the space constraint
• Let’s investigate how the query time and space usage will
react when we increase k
k ↑ => space ↑
• Space = O(dn + nL) = O(n(d + log(1 − C) / log(1 − c^k)))
k ↑ => time_c ↑
• time_c = O(dkL) = O(dk · log(1 − C) / log(1 − c^k))
– Interested readers can refer to E2LSH for how to reduce time_c to O(dkL^(1/2))
k ↑ => time_FP ↓
(Figure: s-curves Y = 1 − (1 − P^k)^L for k = 2 and k = 3, plotted against the original
collision probability P of distant points; both curves pass through (c, C).)
• Consider the s-curves Y = 1 − (1 − P^k)^L passing through (c, C)
• Larger k => steeper s-curve
=> collision prob. drops faster for distant points
=> fewer false positives
=> time_FP ↓
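A short sketch evaluating the s-curve, assuming the same illustrative numbers as the earlier figure (c = 0.8, target C = 0.9, distant-point collision probability c' = 0.6): with L chosen from k as above, the collision probability of distant points falls as k grows.

```python
import math

def s_curve(P, k, L):
    return 1 - (1 - P ** k) ** L

def buckets_needed(c, C, k):
    return math.ceil(math.log(1 - C) / math.log(1 - c ** k))

c, C, c_far = 0.8, 0.9, 0.6
for k in (1, 2, 3, 4):
    L = buckets_needed(c, C, k)
    print(k, L, round(s_curve(c_far, k, L), 3))   # distant-point collision prob. drops with k
```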
Procedure for optimizing k
1. Determine the largest possible value k_max for k without
violating the space constraint
2. Find k* in [1 .. k_max] minimizing time_c(k) + E[time_FP(k)]
– In practice, time_c and time_FP are measured experimentally by
• constructing a data structure T
• running several queries sampled from S on T
(A sketch of this two-step loop follows below.)
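A minimal sketch of the procedure, assuming a hypothetical `build(k)` that constructs the amplified structure with the corresponding L, and a `sample_queries(n)` helper drawing queries from S; both names are assumptions, not a fixed API. The measured time lumps time_c and time_FP (plus output time) together, which is the quantity one actually observes.

```python
import time

def optimize_k(k_max, build, sample_queries, n_queries=100):
    """Try k = 1 .. k_max (k_max from the space constraint), build the structure for each k,
    measure the average query time on queries sampled from S, and keep the best k."""
    best_k, best_time = None, float("inf")
    for k in range(1, k_max + 1):
        T = build(k)
        start = time.perf_counter()
        for q in sample_queries(n_queries):
            T.nbrs(q)
        elapsed = (time.perf_counter() - start) / n_queries
        if elapsed < best_time:
            best_k, best_time = k, elapsed
    return best_k
```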
Observation & Question
• Let Δ_k time_c = time_c(k) − time_c(k−1)
• Let Δ_k time_FP = E[time_FP(k−1)] − E[time_FP(k)]
• Observation:
– If Δ_k time_c is increasing in k and Δ_k time_FP is decreasing in k, then k*
is the largest k such that Δ_k time_FP > Δ_k time_c, and it can be found using
binary search (see the sketch below).
• Question:
– In which situation will Δ_k time_FP = E[time_FP(k−1)] − E[time_FP(k)] be increasing?
(Figure: Δ_k time_c and Δ_k time_FP plotted against k.)
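Under the observation above, a binary-search sketch for k*; the delta functions are hypothetical callbacks (e.g. backed by the measurements from the previous sketch).

```python
def optimize_k_binary(k_max, delta_time_c, delta_time_fp):
    """Assuming delta_time_c(k) is increasing and delta_time_fp(k) is decreasing in k,
    k* is the largest k in [1 .. k_max] with delta_time_fp(k) > delta_time_c(k);
    binary search finds it with O(log k_max) probes."""
    lo, hi, best = 1, k_max, 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if delta_time_fp(mid) > delta_time_c(mid):
            best, lo = mid, mid + 1
        else:
            hi = mid - 1
    return best
```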
Summary
• Near Neighbor Reporting
– Finds many applications in practice
• Locality Sensitive Hashing
– Hash near points to the same value
– One of the most useful techniques for NNR
• Performance tuning
– Gap amplification for higher coverage and lower FP
– Parameter optimization for query time
Further Reading
• Dimensionality Reduction
– Variance preserving
• Principal Component Analysis
• Singular Value Decomposition
– Distance preserving
• Random Projection and the Johnson–Lindenstrauss lemma
– Locality preserving
• Locally Linear Embedding
• Multi-dimensional Scaling
• ISOMAP
References
1. Indyk and Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality. STOC'98. (Journal version: Har-Peled, Indyk and Motwani, Theory of Computing, 2012.)
2. Gionis, Indyk and Motwani. Similarity Search in High Dimensions via Hashing. VLDB'99.
3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
4. E2LSH (the Exact Euclidean LSH package by Andoni and Indyk)
Appendix
• Suppose that we have an (r, r+, c, err)-sensitive LSH H and want to
amplify H to get an (r, r+, C, Err)-sensitive LSH G.
• How do the bucket number L and the collision error Err change with k?
Increasing rate of bucket number
Let 1 − (1 − c^k)^L = C
⇒ L·c^k ≥ C and C ≥ 1 − e^(−L·c^k)
⇒ C / c^k ≤ L ≤ −ln(1 − C) / c^k
⇒ L = Θ(1 / c^k)
Decreasing rate of collision error
• err = 1 − 1 − err 𝑘 𝐿  𝐿err 𝑘
 (
err
𝑐
) 𝑘 for some constant  −− −(1)
• err = 1 − 1 − err 𝑘 𝐿  1 − 𝑒−𝐿err 𝑘
 1 − 𝑒
−𝛽
err
𝑐
𝑘
for some constant 0 < 𝛽 < 1
= 𝛽
err
𝑐
𝑘 −
𝛽2 err
𝑐
2𝑘
2!
+
𝛽3 err
𝑐
3𝑘
3!
… = 
err
𝑐
𝑘 −− −(2)
• By (1) and (2), we have err = θ
err
𝑐
𝑘
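A quick numeric sanity check of the bound, with illustrative values err = 0.6, c = 0.8, C = 0.9: the ratio Err / (err/c)^k stays within constant bounds as k grows, consistent with Err = Θ((err/c)^k).

```python
import math

def amplified_err(err, c, C, k):
    """Err = 1 - (1 - err**k)**L with L chosen (as a real number) so that coverage hits C."""
    L = math.log(1 - C) / math.log(1 - c ** k)
    return 1 - (1 - err ** k) ** L

err, c, C = 0.6, 0.8, 0.9
for k in (2, 4, 6, 8):
    ratio = amplified_err(err, c, C, k) / (err / c) ** k
    print(k, round(ratio, 3))   # ratio stays bounded by constants as k grows
```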