Locality Sensitive Hashing
Randomized Algorithm
A presentation on the Locality Sensitive Hashing technique, demonstrating the feasibility of randomized algorithms.
1. Locality Sensitive Hashing: Randomized Algorithm
2. Problem Statement
β€’ Given a query point q, find the items closest to it with probability 1 βˆ’ 𝛿.
β€’ Why not iterative (exhaustive) methods?
  β€’ Large volume of data
  β€’ Curse of dimensionality
3. Taxonomy: Near Neighbor Query (NN)
β€’ Trees: k-d tree, range tree, B-tree, cover tree
β€’ Grid
β€’ Voronoi diagram
β€’ Hash
β€’ Approximate: LSH
4. Approximate LSH
β€’ Simple idea: if two points are close together, then after a β€œprojection” operation these two points will remain close together.
5. LSH Requirement
β€’ For any given points p, q ∈ R^d:
  P_H[h(p) = h(q)] β‰₯ P1  for  β€–p βˆ’ qβ€– ≀ d1
  P_H[h(p) = h(q)] ≀ P2  for  β€–p βˆ’ qβ€– β‰₯ c Β· d1 = d2
β€’ Such a hash function h is (d1, d2, P1, P2)-sensitive. Ideally we need
  β€’ (P1 βˆ’ P2) to be large
  β€’ (d2 βˆ’ d1) to be small
6. [Figure: collision probability P versus distance for a query point q; P(d) β‰₯ P(2d) β‰₯ P(c Β· d), i.e. closer pairs collide with higher probability.]
7. Probability vs. Distance on Candidate Pairs
8. Hash Function (Random)
β€’ Locality-preserving
β€’ Independent
β€’ Deterministic
β€’ A family of hash functions exists per distance measure:
  β€’ Euclidean
  β€’ Jaccard
  β€’ Cosine similarity
  β€’ Hamming
9. LSH Family for Euclidean Distance (2D)
β€’ Project each point onto a random line partitioned into buckets of width a. Two points at distance d, whose connecting line makes angle πœƒ with the projection line, can collide only when d Β· cos πœƒ ≀ a.
β€’ Collision is a chance, not a certainty, but we can guarantee:
  β€’ If d ≀ a/2: the projected gap d Β· cos πœƒ is at most a/2, so the chance that a bucket boundary separates the two points is at most 1/2, ∴ P1 β‰₯ 1/2.
  β€’ If d β‰₯ 2a: collision requires cos πœƒ ≀ 1/2, i.e. 60Β° ≀ πœƒ ≀ 90Β°, which happens with probability at most 1/3, ∴ P2 ≀ 1/3.
β€’ As an LSH family, this is (a, 2a, 1/2, 1/3)-sensitive.
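The (a, 2a, 1/2, 1/3) guarantee can be checked empirically. Below is a minimal Monte Carlo sketch (the function name `collision_prob` and all parameter values are ours, not from the slides): it places two 2D points at distance d apart, projects them onto a random line with a random bucket offset, and counts how often they land in the same bucket of width a.

```python
import math
import random

def collision_prob(d, a, trials=200_000, seed=0):
    """Estimate the probability that two 2D points at distance d
    share a bucket of width a after a random-line projection."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        theta = rng.uniform(0.0, 2.0 * math.pi)  # random projection direction
        b = rng.uniform(0.0, a)                  # random bucket offset
        # Put p at the origin and q at distance d along the x-axis;
        # their projections then differ by d * cos(theta).
        p_proj, q_proj = 0.0, d * math.cos(theta)
        hits += math.floor((p_proj + b) / a) == math.floor((q_proj + b) / a)
    return hits / trials

a = 1.0
p_close = collision_prob(a / 2, a)  # points at distance a/2
p_far = collision_prob(2 * a, a)    # points at distance 2a
```

With enough trials, the close-pair estimate should land above 1/2 and the far-pair estimate below 1/3, consistent with the bounds on the slide.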
10. How to Define the Projection?
β€’ Scalar projection (dot product):
  h(v) = v Β· x
  v = a point (e.g. the query) in d-dimensional space
  x = a vector with random components drawn from N(0, 1)
β€’ Quantized into bins:
  h(v) = ⌊(v Β· x + b) / w⌋
  w = width of the quantization bin
  b = a random variable uniformly distributed between 0 and w
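The quantized projection above translates directly into code. This is a minimal sketch; the helper name `make_hash` and the parameter values are assumptions for illustration:

```python
import math
import random

def make_hash(dim, w, seed=None):
    """Build one quantized random-projection hash:
    h(v) = floor((v . x + b) / w), where x has N(0,1) components
    and b is drawn uniformly from [0, w)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random projection vector
    b = rng.uniform(0.0, w)                        # random bin offset
    def h(v):
        dot = sum(vi * xi for vi, xi in zip(v, x))  # scalar projection v . x
        return math.floor((dot + b) / w)            # quantization bin index
    return h

h = make_hash(dim=3, w=4.0, seed=42)
bucket = h([1.0, 2.0, 0.5])
```

Each call to `make_hash` draws a fresh random direction; a fixed seed makes the hash reproducible. Nearby points tend to (but need not) share a bin, which is exactly the locality-sensitivity property.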
11. How to Define the Projection?
β€’ Use k dot products, so that (P1/P2)^k > (P1/P2): points at different separations become much less likely to fall into the same quantization bin.
β€’ Perform k independent dot products.
β€’ Success: the query and the nearest neighbor land in the same bin in all k dot products.
β€’ Success probability = P1^k, which decreases as we include more dot products.
12. Multiple Projections
β€’ Use L independent projections.
β€’ The true near neighbor is unlikely to be unlucky in all L projections.
β€’ By increasing L, we can find the true nearest neighbor with arbitrarily high probability.
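Combining the two amplification steps, k dot products per table and L independent tables, gives the usual index structure. A minimal sketch under the slides' definitions (the class name `LSHIndex` and all parameter values are ours):

```python
import math
import random

class LSHIndex:
    """Minimal L-table, k-projection Euclidean LSH index (illustrative sketch)."""
    def __init__(self, dim, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.points = []
        self.tables = [{} for _ in range(L)]  # bucket key -> list of point indices
        # For each table, draw k random projection vectors and bin offsets.
        self.projs = [
            ([[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)],
             [rng.uniform(0.0, w) for _ in range(k)])
            for _ in range(L)
        ]

    def _key(self, v, t):
        xs, bs = self.projs[t]
        # Tuple of k quantized dot products = the bucket key in table t.
        return tuple(
            math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / self.w)
            for x, b in zip(xs, bs)
        )

    def insert(self, v):
        idx = len(self.points)
        self.points.append(v)
        for t in range(len(self.tables)):
            self.tables[t].setdefault(self._key(v, t), []).append(idx)

    def query(self, q):
        # Union of the buckets q falls into, across all L tables.
        candidates = set()
        for t in range(len(self.tables)):
            candidates.update(self.tables[t].get(self._key(q, t), ()))
        return [self.points[i] for i in candidates]

index = LSHIndex(dim=2, k=2, L=4, w=2.0, seed=1)
for pt in ([0.0, 0.0], [0.1, 0.1], [50.0, 50.0]):
    index.insert(pt)
cands = index.query([0.0, 0.0])
```

A point always collides with itself in every table, and with suitable w a nearby point usually shares a bucket in at least one of the L tables, while distant points are usually filtered out.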
13. Accuracy
β€’ For two close points p and q separated by u = β€–p βˆ’ qβ€–, the probability of collision is
  P_H(u) = Pr[H(p) = H(q)] = ∫_0^w (1/u) Β· f_s(t/u) Β· (1 βˆ’ t/w) dt
  where f_s is the probability density function underlying the hash projections.
β€’ As the distance u increases, P_H(u) decreases.
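The integral can be evaluated numerically to see the claimed monotonic decrease. The sketch below assumes f_s is the density of the absolute value of a standard Gaussian projection (the Euclidean case); the function name `p_collision` and the chosen values of u and w are ours:

```python
import math

def p_collision(u, w, steps=10_000):
    """Numerically evaluate P_H(u) = integral over [0, w] of
    (1/u) * f_s(t/u) * (1 - t/w) dt, with f_s the density of |N(0,1)|."""
    def f_abs_gauss(t):
        # Density of the absolute value of a standard normal variable.
        return math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0)
    dt = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt  # midpoint rule
        total += (1.0 / u) * f_abs_gauss(t / u) * (1.0 - t / w) * dt
    return total

w = 4.0
probs = [p_collision(u, w) for u in (1.0, 2.0, 4.0)]
```

Doubling the separation u each time should yield strictly smaller collision probabilities, matching the slide's claim.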
14. Time Complexity
β€’ For a query point q, finding the near neighbor costs T_g + T_c:
β€’ Calculate and hash the projections (T_g): O(D Β· k Β· L), with dimension D and kL projections.
β€’ Search the buckets for collisions (T_c): O(D Β· L Β· N_c), with dimension D and L projections, where
  N_c = Ξ£_{q' ∈ D} p^k(β€–q βˆ’ q'β€–)
  is the expected number of collisions for a single projection.
β€’ Analysis:
  β€’ T_g increases as k and L increase.
  β€’ T_c decreases as k increases, since p^k < p.
15. How Many Projections (L)?
β€’ For a query point p and its neighbor q:
β€’ With a single projection, the success probability of collision is β‰₯ P1^k.
β€’ With L projections, the failure probability is ≀ (1 βˆ’ P1^k)^L.
β€’ Setting (1 βˆ’ P1^k)^L = 𝛿 gives
  L = log 𝛿 / log(1 βˆ’ P1^k)
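The formula for L translates directly into code. Rounding up to an integer is our addition, since L must be a whole number of tables; the example values P1 = 0.5, k = 4, 𝛿 = 0.05 are ours:

```python
import math

def num_projections(p1, k, delta):
    """Smallest L such that the failure probability
    (1 - p1**k)**L drops to at most delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

L = num_projections(p1=0.5, k=4, delta=0.05)
```

For instance, with P1 = 0.5 and k = 4 the per-table success probability is only 0.5^4 = 0.0625, so dozens of tables are needed to push the overall failure probability below 5%.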
16. LSH in MAXDIVREL Diversity
β€’ One bit table per projection (rows: points 1 … w; columns: the k dot products; four of the projections shown):

     #1 #2 #3 … #k          #1 #2 #3 … #k
  1   1  0  0 …  1       1   1  1  0 …  1
  2   0  1  1 …  1       2   1  0  1 …  1
  w   0  0  1 …  0       w   0  1  1 …  0

     #1 #2 #3 … #k          #1 #2 #3 … #k
  1   1  0  1 …  0       1   1  0  0 …  1
  2   0  0  1 …  0       2   0  1  1 …  1
  w   0  1  0 …  0       w   0  0  1 …  0
17. REFERENCES
[1] A. Rajaraman and J. Ullman, β€œChapter Three of β€˜Mining of Massive Datasets,’” pp. 72–130.
[2] M. Slaney and M. Casey, β€œLecture Note: LSH,” 2008.
[3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey, β€œStreaming similarity search over one billion tweets using parallel locality-sensitive hashing,” Proc. VLDB Endow., vol. 6, no. 14, pp. 1930–1941, Sep. 2013.
