# Locality sensitive hashing

An overview of the Locality Sensitive Hashing technique, used to demonstrate the feasibility of randomized algorithms.

### Locality sensitive hashing

1. Locality Sensitive Hashing: Randomized Algorithm
2. Problem Statement
   • Given a query point $q$,
   • Find the closest items to the query point with probability $1 - \delta$
   • Iterative methods?
     • Large volume of data
     • Curse of dimensionality
3. Taxonomy: Near Neighbor Query (NN)
   • Trees: K-d Tree, Range Tree, B Tree, Cover Tree
   • Grid
   • Voronoi Diagram
   • Hash: Approximate LSH
4. Approximate LSH
   • Simple idea
     • If two points are close together, then after a "projection" operation these two points will remain close together
5. LSH Requirement
   • For any given points $p, q \in \mathbb{R}^d$:
     $P_H[h(p) = h(q)] \ge P_1$ for $\lVert p - q \rVert \le d_1$
     $P_H[h(p) = h(q)] \le P_2$ for $\lVert p - q \rVert \ge c\,d_1 = d_2$
   • Hash function $h$ is $(d_1, d_2, P_1, P_2)$-sensitive. Ideally we need
     • $(P_1 - P_2)$ to be large
     • $(d_1 - d_2)$ to be small
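6. [Figure: points around a query point $q$ at distances $d$, $2d$, and $c \cdot d$; the collision probabilities satisfy $P(1) \ge P(2) \ge P(c)$ and $P(1) \ge P(2) \ge P(3)$.]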
7. [Figure: Probability vs. distance on candidate pairs]
8. Hash Function (Random)
   • Locality-preserving
   • Independent
   • Deterministic
   • Family of hash functions per distance measure
     • Euclidean
     • Jaccard
     • Cosine similarity
     • Hamming
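9. LSH Family for Euclidean distance (2D)
   Here $d$ is the distance between the two points, $a$ is the bucket width on the random line, and $\theta$ is the angle between the random line and the line through the two points.
   • When $d \cos\theta \le a$,
     • Chance of colliding
     • But not certain
   • But can guarantee,
     • If $d \le a/2$,
       • $90^\circ \ge \theta \ge 45^\circ$ to have $d \cos\theta \le a$
       • $\therefore P_1 \ge 1/2$
     • If $d \ge 2a$,
       • $90^\circ \ge \theta \ge 60^\circ$ to have $d \cos\theta \le a$
       • $\therefore P_2 \le 1/3$
   • As LSH is $(d_1, d_2, P_1, P_2)$-sensitive, the family is
     • $(a/2, 2a, 1/2, 1/3)$-sensitive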
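The two bounds on the Euclidean (2D) slide can be checked numerically. The sketch below is an illustration, not the deck's own method. It assumes the random line's angle is uniform over 0° to 90° and that bucket boundaries have a uniformly random offset, so two points whose projections are a gap `s` apart share a bucket with probability `max(0, 1 - s/a)`. The function name `bucket_collision_probability` is hypothetical.

```python
import math

def bucket_collision_probability(d, a, steps=100_000):
    """Average chance that two points at distance d share a bucket of width a.

    Assumptions (not stated in the slides): the line angle theta is uniform
    on [0, 90] degrees and the bucket grid has a uniformly random offset.
    """
    total = 0.0
    for i in range(steps):
        # Midpoint rule over theta in [0, pi/2].
        theta = (i + 0.5) * (math.pi / 2) / steps
        projected_gap = d * math.cos(theta)
        # With a random grid offset, a gap s lands in one bucket w.p. max(0, 1 - s/a).
        total += max(0.0, 1.0 - projected_gap / a)
    return total / steps

a = 1.0
p_near = bucket_collision_probability(a / 2, a)  # case d <= a/2
p_far = bucket_collision_probability(2 * a, a)   # case d >= 2a
print(f"d = a/2: {p_near:.4f}   d = 2a: {p_far:.4f}")
```

Under these assumptions the near case stays above 1/2 and the far case stays below 1/3, which is consistent with the stated sensitivity parameters.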
10. How to define the projection?
    • Scalar projection (dot product)
      $h(v) = v \cdot x$
      $v$: query point in $d$-dimensional space
      $x$: vector with random components drawn from $N(0, 1)$
    • With quantization
      $h(v) = \left\lfloor \dfrac{v \cdot x + b}{w} \right\rfloor$
      $w$: width of the quantization bin
      $b$: random variable uniformly distributed between 0 and $w$
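A minimal sketch of the quantized projection from the slide above, in plain Python. The helper name `make_hash` and the sample values (`dim=3`, `w=4.0`, the two points) are illustrative assumptions, not taken from the deck.

```python
import math
import random

def make_hash(dim, w, rng):
    """Return one hash function h(v) = floor((v . x + b) / w).

    x has independent N(0, 1) components and b is uniform on [0, w),
    as on the slide; make_hash itself is a hypothetical helper name.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)

    def h(v):
        dot = sum(vi * xi for vi, xi in zip(v, x))
        return math.floor((dot + b) / w)

    return h

rng = random.Random(0)            # fixed seed so runs are repeatable
h = make_hash(dim=3, w=4.0, rng=rng)

p = [1.0, 2.0, 3.0]
q = [1.01, 2.0, 3.0]              # very close to p
print(h(p), h(q))                 # close points usually share a bin, but not always
```

Each call to `make_hash` draws a fresh `x` and `b`, so different hash functions quantize the same pair of points differently. The later slides rely on this when they combine several projections.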
11. How to define the projection?
    • $k$ dot products increase the ratio of the probabilities that points at different separations will fall into the same quantization bin: $\left(\dfrac{P_1}{P_2}\right)^k > \dfrac{P_1}{P_2}$
    • Perform $k$ independent dot products
    • Achieve success
      • if the query and the nearest neighbor are in the same bin in all $k$ dot products
      • Success probability $= P_1^k$; it decreases as we include more dot products
12. Multiple projections
    • $L$ independent projections
    • The true near neighbor is unlikely to be unlucky in all the projections
    • By increasing $L$,
      • we can find the true nearest neighbor with arbitrarily high probability
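The combination of $k$ dot products per projection and $L$ independent projections can be sketched as a small index. This is an illustrative sketch under stated assumptions, not a reference implementation: the class name `LSHIndex`, its methods, and all parameter values are hypothetical. Each of the $L$ tables keys a point by the tuple of its $k$ quantized dot products.

```python
import math
import random
from collections import defaultdict

class LSHIndex:
    """Hypothetical sketch: L hash tables, each keyed by k quantized dot products."""

    def __init__(self, dim, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        # For every table, k pairs (x, b): x ~ N(0, 1)^dim, b ~ Uniform[0, w).
        self.params = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, table_id, v):
        # Bucket key: the k bin numbers of v in this table.
        return tuple(
            math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / self.w)
            for x, b in self.params[table_id]
        )

    def add(self, v):
        idx = len(self.points)
        self.points.append(v)
        for t, table in enumerate(self.tables):
            table[self._key(t, v)].append(idx)

    def query(self, q):
        # Collect every point that collides with q in at least one table,
        # then return the index of the closest candidate (or None).
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, q), []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: math.dist(self.points[i], q))

rng = random.Random(1)
data = [[rng.uniform(-10.0, 10.0) for _ in range(5)] for _ in range(200)]
index = LSHIndex(dim=5, k=4, L=8, w=4.0)
for v in data:
    index.add(v)
print(index.query(data[17]))
```

A query only compares distances against the points that share a bucket with it in at least one table, which is why raising `L` lowers the chance of missing the true neighbor while raising `k` shrinks the candidate set.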
13. Accuracy
    • Two close points $p$ and $q$,
      • separated by $u = \lVert p - q \rVert$
    • Probability of collision $P_H(u)$:
      $P_H(u) = P_H\big(H(p) = H(q)\big) = \displaystyle\int_0^w \frac{1}{u}\, f_s\!\left(\frac{t}{u}\right) \left(1 - \frac{t}{w}\right) dt$
      $f_s$: probability density function of $H$
    • As the distance $u$ increases, $P_H(u)$ decreases
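The integral on the accuracy slide can be evaluated numerically to see the decrease with distance. The sketch below makes one assumption the slide does not spell out: $f_s$ is taken to be the density of the absolute value of a standard normal variable, which matches projections drawn from $N(0, 1)$. The function names and the bin width `w = 4.0` are illustrative.

```python
import math

def half_normal_pdf(x):
    """Density of |Z| for Z ~ N(0, 1); assumed here to play the role of f_s."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)

def p_collision(u, w, steps=20_000):
    """Midpoint-rule estimate of the integral of (1/u) f_s(t/u) (1 - t/w) over [0, w]."""
    dt = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += (1.0 / u) * half_normal_pdf(t / u) * (1.0 - t / w)
    return total * dt

w = 4.0
for u in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"u = {u:4.1f}   P_H(u) ~ {p_collision(u, w):.4f}")
```

Under the Gaussian assumption, the printed values fall as `u` grows, in line with the last bullet of the slide.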
14. Time complexity
    • For a query point $q$,
    • To find the near neighbor: $(T_g + T_c)$
    • Calculate and hash the projections ($T_g$)
      • $O(DkL)$; $D$: dimension, $kL$ projections
    • Search the bucket for collisions ($T_c$)
      • $O(DLN_c)$; $D$: dimension, $L$ projections, and
      • where $N_c = \sum_{q' \in \mathcal{D}} p^k\big(\lVert q - q' \rVert\big)$, summing over the data set $\mathcal{D}$; $N_c$: expected number of collisions for a single projection
    • Analyze
      • $T_g$ increases as $k$ and $L$ increase
      • $T_c$ decreases as $k$ increases, since $p^k < p$
15. How many projections ($L$)?
    • For query point $p$ and neighbor $q$,
    • For a single projection,
      • Success probability of collision: $\ge P_1^k$
    • For $L$ projections,
      • Failure probability of collision: $\le (1 - P_1^k)^L$
      $\therefore\ (1 - P_1^k)^L = \delta \quad\Rightarrow\quad L = \dfrac{\log \delta}{\log(1 - P_1^k)}$
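The formula for $L$ is easy to evaluate directly. The sketch below rounds up to the next whole projection. The function name `projections_needed` and the sample values ($P_1 = 0.9$, $k = 10$, $\delta = 0.1$) are illustrative assumptions, not figures from the deck.

```python
import math

def projections_needed(p1, k, delta):
    """Smallest integer L with (1 - p1**k)**L <= delta.

    p1: per-dot-product collision probability for a near neighbor,
    k: dot products per projection, delta: allowed failure probability.
    """
    if not (0.0 < p1 < 1.0 and 0.0 < delta < 1.0 and k >= 1):
        raise ValueError("need 0 < p1 < 1, 0 < delta < 1 and k >= 1")
    p_success = p1 ** k                      # all k dot products collide
    return math.ceil(math.log(delta) / math.log(1.0 - p_success))

print(projections_needed(p1=0.9, k=10, delta=0.1))  # -> 6
```

With these sample values a single projection succeeds with probability $0.9^{10} \approx 0.349$, so six projections bring the failure probability down to about 0.076, while five would leave it near 0.117.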
16. LSH in MAXDIVREL Diversity
    [Four binary tables, each with columns #1, #2, #3, …, #k and rows labelled "dot product" 1, 2, …, w:]
    • Table 1: row 1 = 1 0 0 … 1; row 2 = 0 1 1 … 1; row w = 0 0 1 … 0
    • Table 2: row 1 = 1 1 0 … 1; row 2 = 1 0 1 … 1; row w = 0 1 1 … 0
    • Table 3: row 1 = 1 0 1 … 0; row 2 = 0 0 1 … 0; row w = 0 1 0 … 0
    • Table 4: row 1 = 1 0 0 … 1; row 2 = 0 1 1 … 1; row w = 0 0 1 … 0
17. REFERENCES
    [1] Anand Rajaraman and Jeff Ullman, Chapter Three of "Mining of Massive Datasets," pp. 72–130.
    [2] M. Slaney and M. Casey, "Lecture Note: LSH," 2008.
    [3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey, "Streaming similarity search over one billion tweets using parallel locality-sensitive hashing," Proc. VLDB Endow., vol. 6, no. 14, pp. 1930–1941, Sep. 2013.