2. Motivation
• Real-world applications
– Recommendation systems
• Searching for similar items and users
– Malicious website detection
• Searching for websites similar to some known malicious websites
• The underlying core problem
Given:
• A large set P of high-dimensional data points in a metric space M
• A large set Q of high-dimensional query points in a metric space M
Goal:
• Find near neighbors in P for each query point in Q
• Avoid linearly scanning P for each query
3. Related Work
• Nearest Neighbor Search
– Given: a set P of n points in a metric space M
– Goal: for any query q, return a point p ∈ P minimizing dist(p, q)
• Classic Result
– Point location in arrangements of hyperplanes
• Meiser, IC'93
• In d-dimensional Euclidean space under some Lp norm
• d^O(1) log n query time and n^O(d) space
4. Related Work
• Approximate Nearest Neighbor
– Given: a set P of n points in a metric space M and ε > 0
– Goal: for any query q, return a point p ∈ P s.t. dist(p, q) ≤ (1+ε) dist(p*, q),
where p* is the nearest neighbor of q
• Classic Result
– Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
• Har-Peled, Indyk and Motwani, STOC'98, FOCS'01, ToC'12
• In d-dimensional Euclidean space under an Lp norm
• O(d · n^{1/(1+ε)}) query time and O(dn + n^{1+1/(1+ε)}) space
• Technique 1: Approximate Nearest Neighbor reduces to Approximate Near Neighbor with little overhead
• Technique 2: Locality Sensitive Hashing for Approximate Near Neighbor
5. Overview
1. Near Neighbor Reporting
– Formal problem formulation
2. Locality Sensitive Hashing
– Definition and example
– Algorithms for NNR based on LSH
– Query time decomposition
3. Performance tuning
– Gap amplification
– Parameter optimization
7. Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– A radius r > 0
– A coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A deterministic algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p, q) ≤ r}
• E_q[cvg(q|T)] ≥ c, where q ~ f and cvg(q|T) = |T.nbrs(q)| / |nbrs(q)|
– hard to achieve ☹
2015/3/4
8. (Relaxed) Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– A radius r > 0
– A coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A randomized algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p, q) ≤ r}
• E_q[E_T[cvg(q|T)]] = Σ_T Pr(T is built) · E_q[cvg(q|T)] ≥ c, where q ~ f
• Fact:
– If E_T[cvg(q|T)] ≥ c for every q,
then E_q[E_T[cvg(q|T)]] ≥ c, where q ~ f
10. Locality Sensitive Hashing
• Informally, an LSH H is a set of hash functions over a metric space satisfying the following condition.
• Let h be chosen uniformly at random from H
– p and q are closer => Pr(h(p) = h(q)) is higher
• Formally, H is (r, r+ε, c, c')-sensitive if
– r < r+ε and c > c'
– Let h be chosen uniformly at random from H
– If dist(p, q) ≤ r, then Pr(h(p) = h(q)) ≥ c
– If dist(p, q) ≥ r+ε, then Pr(h(p) = h(q)) ≤ c'
11. LSH for angular distance
• Distance function: d(u, v) = arccos(u·v) = θ(u, v) (for unit vectors u, v)
• Random Projection:
– Choose a random unit vector w
– h(u) = sgn(u·w)
– Pr(h(u) = h(v)) = 1 − θ(u, v)/π
[figure: vectors u and v separated by angle θ, split by a random hyperplane]
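As a sanity check, the random-projection family above is easy to simulate. This is an illustrative sketch (the function names are chosen here, not taken from any library); it empirically verifies the collision probability 1 − θ(u, v)/π for two vectors at angle π/4.

```python
import math
import random

def make_random_projection_hash(dim):
    """Sample h(u) = sgn(u . w) for a random Gaussian w
    (the direction of w is a uniformly random unit vector)."""
    w = [random.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda u: 1 if sum(ui * wi for ui, wi in zip(u, w)) >= 0 else -1

def angular_distance(u, v):
    """theta(u, v) = arccos(u . v / (|u| |v|))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

# Empirical check of Pr(h(u) = h(v)) = 1 - theta(u, v) / pi
random.seed(0)
u, v = [1.0, 0.0], [1.0, 1.0]   # theta = pi / 4, so the target is 0.75
trials = 20000
hits = 0
for _ in range(trials):
    h = make_random_projection_hash(2)
    hits += h(u) == h(v)
print(hits / trials)
```

The printed frequency should land close to 1 − (π/4)/π = 0.75.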
12. LSH for Jaccard distance
• Distance function: d(A, B) = 1 − |A ∩ B| / |A ∪ B|
• MinHash:
– Pick a random permutation π on the universe U
– h(A) = argmin_{a∈A} π(a)
– Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 − d(A, B)
• Note:
– Finding the Jaccard median
• Very easy to understand, very hard to compute
• Studied since 1981
• Chierichetti, Kumar, Pandey, Vassilvitskii, SODA'10
– NP-hard and no FPTAS
– PTAS
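The MinHash collision probability can be checked directly. A minimal sketch, assuming a small integer universe and representing the random permutation π as a rank dictionary:

```python
import random

def minhash(perm, A):
    """h(A) = argmin_{a in A} perm(a); perm maps element -> rank."""
    return min(A, key=lambda a: perm[a])

def jaccard_similarity(A, B):
    return len(A & B) / len(A | B)

random.seed(1)
universe = list(range(100))
A = set(range(0, 60))
B = set(range(30, 90))   # |A ∩ B| / |A ∪ B| = 30 / 90 = 1/3

trials = 20000
hits = 0
for _ in range(trials):
    order = universe[:]
    random.shuffle(order)                 # a random permutation of U
    perm = {a: rank for rank, a in enumerate(order)}
    hits += minhash(perm, A) == minhash(perm, B)
print(hits / trials)
```

The printed frequency should be close to the Jaccard similarity 1/3, i.e. 1 − d(A, B).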
13. Algorithm for NNR Based on LSH
• Let H be a (r, r+ε, c, c')-sensitive LSH over a metric space M
• Consider the following randomized algorithm for NNR
– Uniformly at random choose a hash function h from H
– Build a hash table T_h s.t. T_h(q) = {p ∈ P : h(p) = h(q)}
– Define T_h.nbrs(q) to return the points p in T_h(q) with dist(p, q) ≤ r
• For any q,
E_h[cvg(q|T_h)]
= Σ_h Pr(h is chosen) · cvg(q|T_h)
= Σ_h Pr(h is chosen) · (Σ_{p∈nbrs(q)} δ(h(p) = h(q))) / |nbrs(q)|
= (Σ_{p∈nbrs(q)} Σ_h Pr(h is chosen) · δ(h(p) = h(q))) / |nbrs(q)|
= (Σ_{p∈nbrs(q)} Pr(h(p) = h(q))) / |nbrs(q)|
≥ c
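The table-based algorithm above can be sketched in a few lines. The 1-D grid hash used here is only an assumed stand-in so the example is self-contained (a random-shift grid of width w is locality sensitive under |x − y|); any LSH family plugs into the same `build_table` / `nbrs` skeleton.

```python
import random
from collections import defaultdict

def build_table(P, h):
    """T_h: bucket every data point by its hash value."""
    T = defaultdict(list)
    for p in P:
        T[h(p)].append(p)
    return T

def nbrs(T, h, q, r, dist):
    """T_h.nbrs(q): points colliding with q that really lie within r."""
    return [p for p in T.get(h(q), []) if dist(p, q) <= r]

# Toy 1-D metric space: points within r < w collide w.p. >= 1 - r/w
random.seed(2)
w, r = 4.0, 1.0
shift = random.uniform(0, w)
h = lambda x: int((x + shift) // w)
dist = lambda x, y: abs(x - y)

P = [0.5, 1.2, 7.9, 8.1, 15.0]
T = build_table(P, h)
result = nbrs(T, h, q=8.0, r=r, dist=dist)
print(result)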
14. Query Time
• Query time = time for computing h(q)
+ |{p ∈ P : h(p) = h(q) and dist(p, q) > r}| × d
+ |{p ∈ P : h(p) = h(q) and dist(p, q) ≤ r}| × d
= timec + timeFP + |output|
16. Why Gap Amplification
• To lift the coverage rate c
• To reduce the false positive rate and thereby improve timeFP
[figure: collision probability vs. Jaccard distance, before and after amplification; c = 0.8 and c' = 0.6 (at distances 0.2 and 0.4) become c = 0.9 and c' = 0.1]
(r, r+ε, c, c')-sensitive LSH → [gap amplification] → (r, r+ε, c↑, c'↓)-sensitive LSH
17. How Gap Amplification
• Construct LSH G from the original LSH H
– LSH B = {b(x; h1, h2, …, hk) : hi ∈ H}
• b(p; h1, …, hk) = b(q; h1, …, hk) iff hi(p) = hi(q) for ALL i = 1, …, k
– LSH G = {g(x; b1, b2, …, bL) : bi ∈ B}
• g(p; b1, …, bL) = g(q; b1, …, bL) iff bi(p) = bi(q) for SOME i = 1, …, L
• Intuition:
– AND increases the gap
• Collision probabilities of distant points decrease exponentially faster than those of near points
– OR increases the collision probabilities approximately linearly
• Let P = Pr_{h∈H}(h(p) = h(q))
=> Pr_{b∈B}(b(p) = b(q)) = P^k
=> Pr_{g∈G}(g(p) ≠ g(q)) = (1 − P^k)^L
=> Pr_{g∈G}(g(p) = g(q)) = 1 − (1 − P^k)^L ≈ LP^k (when P^k is small)
[figure: L buckets of k hashes each, combined by OR]
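The AND/OR construction and its analytic collision probability fit in a short sketch (`make_band`, `make_or_family`, and `amplified` are names chosen here). The two prints show the gap widening for an assumed near/distant pair with base collision probabilities 0.8 and 0.4:

```python
def make_band(sample_h, k):
    """AND-construction: b(x) = (h1(x), ..., hk(x));
    two points collide iff all k base hashes agree."""
    hs = [sample_h() for _ in range(k)]
    return lambda x: tuple(h(x) for h in hs)

def make_or_family(sample_h, k, L):
    """OR-construction: p and q collide under g iff some band agrees."""
    bands = [make_band(sample_h, k) for _ in range(L)]
    return lambda p, q: any(b(p) == b(q) for b in bands)

def amplified(P, k, L):
    """Analytic collision probability 1 - (1 - P^k)^L."""
    return 1 - (1 - P ** k) ** L

# With k = 5, L = 10 the base gap (0.8, 0.4) widens substantially:
print(amplified(0.8, 5, 10))   # near pair: lifted well above 0.8
print(amplified(0.4, 5, 10))   # distant pair: pushed below 0.1
```

This mirrors the picture on the previous slide: AND sharpens the contrast, OR lifts the near-pair probability back up.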
18. Parameter Optimization
• Situation
– A (r, r+ε, c, c')-sensitive LSH H is given
– After gap amplification we want c to be lifted to c↑
• Let c↑ = 1 − (1 − c^k)^L
– L = log(1 − c↑) / log(1 − c^k)
– L is a strictly increasing function of k
• So we only need to select a good k
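The formula for L turns into a one-line helper. The numbers below (lifting c = 0.8 to c↑ = 0.95) are assumed purely for illustration; the loop shows L strictly increasing in k, as stated above:

```python
import math

def bucket_count(c_up, c, k):
    """Smallest integer L with 1 - (1 - c^k)^L >= c_up."""
    return math.ceil(math.log(1 - c_up) / math.log(1 - c ** k))

# Assumed illustration: lift c = 0.8 to c_up = 0.95
Ls = [bucket_count(0.95, 0.8, k) for k in range(1, 6)]
for k, L in enumerate(Ls, start=1):
    print(k, L)
```

Taking the ceiling keeps L integral while still meeting the target coverage.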
19. How to select a good k
• How to measure the "goodness"?
– Minimize timec + E[timeFP] under the space constraint
• Let's investigate how the query time and space usage react when we increase k
20. k ↑ => space ↑
• Space = O(dn + nL) = O(n(d + log(1 − c↑) / log(1 − c^k)))
21. k ↑ => timec ↑
• timec = O(dkL) = O(dk · log(1 − c↑) / log(1 − c^k))
– Interested readers can refer to E2LSH for how to reduce timec to O(dkL^{1/2})
22. k ↑ => timeFP ↓
• Consider the s-curves Y = 1 − (1 − P^k)^L passing through (c, c↑)
• Larger k => steeper s-curve
=> collision prob. drops faster for distant points
=> fewer false positives
=> timeFP ↓
[figure: s-curves for k = 2 and k = 3; the original collision prob. of distant points maps to a much lower amplified prob. under the steeper curve]
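The s-curve steepening can be made concrete: for each k, choose L so the curve passes (just) above the assumed target point (c, c↑) = (0.8, 0.95), then evaluate the amplified collision probability of a distant pair with assumed base probability 0.4. All numbers here are illustrative:

```python
import math

def bucket_count(c_up, c, k):
    """Smallest integer L with 1 - (1 - c^k)^L >= c_up."""
    return math.ceil(math.log(1 - c_up) / math.log(1 - c ** k))

def false_positive_prob(p_far, c_up, c, k):
    """Collision prob. of a distant pair (base prob. p_far) under the
    s-curve for this k, with L chosen to reach (c, c_up)."""
    L = bucket_count(c_up, c, k)
    return 1 - (1 - p_far ** k) ** L

# Assumed numbers: c = 0.8 lifted to 0.95; distant base prob. 0.4
fps = [false_positive_prob(0.4, 0.95, 0.8, k) for k in (2, 3, 5)]
print(fps)   # strictly decreasing in k
```

Even though L grows with k, the distant pair's collision probability still falls, which is exactly the timeFP ↓ claim.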
23. Procedure for optimizing k
1. Determine the largest possible value kmax for k without violating the space constraint
2. Find k* in [1..kmax] minimizing timec(k) + E[timeFP(k)]
– In practice, timec and timeFP are measured experimentally by
• constructing a data structure T
• running several queries sampled from S on T
24. Observation & Question
• Let Δtimec(k) = timec(k) − timec(k−1)
• Let ΔtimeFP(k) = E[timeFP(k−1)] − E[timeFP(k)]
• Observation:
– If Δtimec(k) is increasing and ΔtimeFP(k) is decreasing in k, then k* is the largest k such that ΔtimeFP(k) > Δtimec(k), and it can be found using binary search.
• Question:
– In which situations will ΔtimeFP(k) = E[timeFP(k−1)] − E[timeFP(k)] be decreasing?
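Under the monotonicity assumption in the observation, the binary search is straightforward. The marginal-cost curves below (`2k` for hashing, `100/k` for false-positive savings) are hypothetical, chosen only to exercise the search:

```python
def best_k(d_timec, d_timeFP, k_max):
    """Largest k in [1, k_max] with d_timeFP(k) > d_timec(k), assuming
    d_timec is increasing and d_timeFP is decreasing in k."""
    lo, hi, best = 1, k_max, 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if d_timeFP(mid) > d_timec(mid):
            best, lo = mid, mid + 1   # condition holds: look right
        else:
            hi = mid - 1              # condition fails: look left
    return best

# Hypothetical marginal costs: 100/k > 2k holds up to k = 7
k_star = best_k(lambda k: 2 * k, lambda k: 100 / k, 20)
print(k_star)
```

Past k*, each extra AND-hash costs more in hashing time than it saves in false-positive time, so the sum timec + E[timeFP] stops improving.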
25. Summary
• Near Neighbor Reporting
– Finds many applications in practice
• Locality Sensitive Hashing
– Hashes near points to the same value
– One of the most useful techniques for NNR
• Performance tuning
– Gap amplification for higher coverage and a lower FP rate
– Parameter optimization for query time
26. Further Reading
• Dimensionality Reduction
– Variance preserving
– Principal Component Analysis
– Singular Value Decomposition
• Distance preserving
– Random Projection and the Johnson–Lindenstrauss lemma
• Locality preserving
– Locally Linear Embedding
– Multi-dimensional Scaling
– ISOMAP
27. References
1. Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality
2. Similarity Search in High Dimensions via Hashing
3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
4. E2LSH
28. Appendix
• Suppose that we have a (r, r+ε, c, err)-sensitive LSH H and want to amplify H to get a (r, r+ε, c↑, err↓)-sensitive LSH G.
• How do the bucket number L and the collision error err↓ change with k?