Locality Sensitive Hashing technique to demonstrate the feasibility of randomized algorithms


K-d tree algorithm - The problem with multidimensional structures such as k-d trees is that they break down when the dimensionality of the search space grows beyond a few dimensions, and search time degrades toward O(N).

Grid: Close points should fall in the same grid cell, but some will always lie across a boundary (no matter how close), while others may be more than one grid cell away yet still close. And in high dimensions, the number of neighboring grid cells grows exponentially. One option is to randomly shift (and rotate) the grid and try again.

Hash → O(1) search, with O(N) memory

However, suppose d is larger than a. In order for there to be any chance of the two points falling in the same bucket, we need d cos θ ≤ a.

- 1. Locality Sensitive Hashing Randomized Algorithm
- 2. Problem Statement • Given a query point q, • find the closest items to the query point with probability 1 − δ • Iterative methods? • Large volume of data • Curse of dimensionality
- 3. Taxonomy: Near Neighbor Query (NN) • Trees: K-d Tree, Range Tree, B Tree, Cover Tree • Grid • Voronoi Diagram • Hash: Approximate (LSH)
- 4. Approximate LSH • Simple idea: • if two points are close together, then after a "projection" operation these two points will remain close together
- 5. LSH Requirement • For any given points p, q ∈ R^d: • Pr[h(p) = h(q)] ≥ p1 when d(p, q) ≤ d1 • Pr[h(p) = h(q)] ≤ p2 when d(p, q) ≥ c·d1 = d2 • Hash function h is (d1, d2, p1, p2)-sensitive; ideally we need • (p1 − p2) to be large • (d2 − d1) to be small
- 6. [Figure: query q with neighbors at distances d, 2d, …, c·d; collision probability decreases with distance: P(1) ≥ P(2) ≥ P(c)]
- 7. Probability vs. Distance on candidate pairs
- 8. Hash Function (Random) • Locality-preserving • Independent • Deterministic • A family of hash functions per distance measure: • Euclidean • Jaccard • Cosine similarity • Hamming
- 9. LSH Family for Euclidean distance (2d) • When d·cos θ ≤ a, • there is a chance of colliding, • but it is not certain • But we can guarantee: • if d ≤ a/2, • any 90° ≥ θ ≥ 45° gives d·cos θ ≤ a, • ∴ p1 ≥ 1/2 • if d ≥ 2a, • only 90° ≥ θ ≥ 60° can give d·cos θ ≤ a, • ∴ p2 ≤ 1/3 • As LSH, the family is (d1, d2, p1, p2) = (a/2, 2a, 1/2, 1/3)-sensitive
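The a/2-versus-2a bounds on slide 9 can be checked numerically. The sketch below is an illustration, not from the slides (the function name and parameters are mine): it Monte Carlo estimates the chance that two points at distance d share a bucket of width a on a randomly angled line with a randomly shifted bucket grid.

```python
import math
import random

def collision_prob(d, a, trials=200_000, seed=0):
    """Estimate Pr[same bucket] for two points at distance d, projected
    onto a line at a uniform random angle theta in [0, 90] degrees,
    with bucket width a and a uniformly random grid shift."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        theta = rng.uniform(0.0, math.pi / 2)  # angle between segment p-q and the line
        sep = d * math.cos(theta)              # separation of the two projections
        offset = rng.uniform(0.0, a)           # position of the first point inside its bucket
        if offset + sep < a:                   # second projection stays in the same bucket
            hits += 1
    return hits / trials

p1 = collision_prob(0.5, 1.0)  # d = a/2: at least 1/2, as slide 9 claims
p2 = collision_prob(2.0, 1.0)  # d = 2a: needs theta > 60 degrees, so at most 1/3
```

The estimates land comfortably inside the slide's bounds, since the bounds are deliberately loose.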
- 10. How to define the projection? • Scalar projection (dot product): h(v) = v · x, where v is the query point in R^D and x is a vector with random components drawn from N(0,1) • Quantized: h(v) = ⌊(v · x + b) / w⌋, where w is the width of the quantization bin and b is a random variable uniformly distributed between 0 and w
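A minimal sketch of the quantized projection h(v) = ⌊(v · x + b)/w⌋ in plain Python; `make_hash` and its parameter names are my own, not from the slides.

```python
import math
import random

def make_hash(dim, w, rng):
    """One quantized projection h(v) = floor((v . x + b) / w),
    with x having i.i.d. N(0,1) components and b ~ Uniform[0, w)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    def h(v):
        dot = sum(vi * xi for vi, xi in zip(v, x))  # scalar projection v . x
        return math.floor((dot + b) / w)            # bucket index
    return h

rng = random.Random(7)
h = make_hash(dim=8, w=4.0, rng=rng)
p = [rng.gauss(0.0, 1.0) for _ in range(8)]
q = [pi + 0.01 * rng.gauss(0.0, 1.0) for pi in p]  # a very close neighbor
# close points land in the same bucket, or at worst an adjacent one
```

Because the perturbation is tiny relative to w, the two projections can differ by at most a fraction of one bin width.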
- 11. How to define the projection? • Use k dot products, so that (p1/p2)^k > (p1/p2) and points at different separations rarely fall into the same quantization bin • Perform k independent dot products • Achieve success • if the query and the nearest neighbor are in the same bin in all k dot products • Success probability = p1^k; decreases as we include more dot products
- 12. Multiple projections • L independent projections • The true near neighbor is unlikely to be unlucky in all the projections • By increasing L, • we can find the true nearest neighbor with arbitrarily high probability
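Putting the k dot products of slide 11 and the L independent projections of slide 12 together, a toy index might look like the sketch below. It is an illustration under my own naming (`make_table`, `build_index`, `query`), using plain Python dicts as buckets.

```python
import math
import random

def make_table(dim, k, w, rng):
    """One table = k independent dot-product hashes; the key is the
    tuple of k bucket indices, so all k bins must match to collide."""
    hs = []
    for _ in range(k):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random direction
        b = rng.uniform(0.0, w)                        # random offset
        hs.append((x, b))
    def key(v):
        return tuple(
            math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / w)
            for x, b in hs
        )
    return key

def build_index(points, dim, k, L, w, seed=0):
    """L independent tables: a true neighbor is unlikely to miss in all of them."""
    rng = random.Random(seed)
    keys = [make_table(dim, k, w, rng) for _ in range(L)]
    tables = [{} for _ in range(L)]
    for idx, p in enumerate(points):
        for key, table in zip(keys, tables):
            table.setdefault(key(p), []).append(idx)
    return keys, tables

def query(q, keys, tables):
    """Union of bucket contents over the L tables = candidate set."""
    cand = set()
    for key, table in zip(keys, tables):
        cand.update(table.get(key(q), []))
    return cand

points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
keys, tables = build_index(points, dim=2, k=4, L=6, w=2.0)
# querying with points[0] itself always returns index 0 (identical keys);
# the close point 1 is returned with high probability, the far point 2 rarely
```

A real implementation would then rank the candidates by exact distance to q, which is where the T_c term of the time-complexity slide comes from.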
- 13. Accuracy • Two close points p and q, • separated by u = ‖p − q‖ • Probability of collision: p_H(u) = Pr[H(p) = H(q)] = ∫₀^w (1/u) f(t/u) (1 − t/w) dt, • where f is the probability density function underlying H • As the distance u increases, p_H(u) decreases
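The collision integral can be evaluated numerically. The sketch below assumes the Euclidean (2-stable) case, where f is the density of |N(0,1)|, and uses a simple midpoint rule; the function names are mine.

```python
import math

def f_abs_gauss(t):
    """Density of |N(0,1)|: the projected-distance density in the Euclidean case."""
    return math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0)

def p_collision(u, w, n=10_000):
    """Midpoint-rule evaluation of p_H(u) = integral_0^w (1/u) f(t/u) (1 - t/w) dt."""
    dt = w / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt
        total += (1.0 / u) * f_abs_gauss(t / u) * (1.0 - t / w) * dt
    return total

probs = [p_collision(u, w=4.0) for u in (0.5, 1.0, 2.0, 4.0)]
# p_H(u) falls monotonically as the separation u grows, as the slide states
```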
- 14. Time complexity • For a query point q, • the time to find the near neighbor is (T_g + T_c) • Calculate & hash the projections (T_g): • O(DkL); D = dimension, kL projections • Search the buckets for collisions (T_c): • O(D·L·N_c); D = dimension, L projections, • where N_c = Σ_{p′ ∈ D} p^k(‖q − p′‖) is the expected number of collisions for a single projection • Analysis • T_g increases as k & L increase • T_c decreases as k increases, since p^k < p
- 15. How many projections (L)? • For query point p & neighbor q: • for a single projection, the success probability of a collision is ≥ p1^k • for L projections, the failure probability is ≤ (1 − p1^k)^L • ∴ setting (1 − p1^k)^L = δ gives L = log δ / log(1 − p1^k)
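The closing formula gives a concrete way to size L. A small helper, with illustrative naming and the (assumed) convention of rounding up to the nearest integer:

```python
import math

def num_projections(p1, k, delta):
    """Smallest integer L with (1 - p1**k)**L <= delta,
    i.e. overall failure probability at most delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

# e.g. p1 = 1/2 (the Euclidean family of slide 9), k = 10 dot products,
# and a 1% tolerated failure probability
L = num_projections(p1=0.5, k=10, delta=0.01)
```

Note how quickly L grows with k: the per-table success probability p1^k shrinks geometrically, so sharper tables demand many more of them.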
- 16. LSH in MAXDIVREL Diversity • [Figure: multiple hash tables, each a grid with rows "dot product 1 … w" and columns #1 … #k holding the binary bucket codes]
- 17. REFERENCES [1] A. Rajaraman and J. Ullman, "Chapter Three of 'Mining of Massive Datasets,'" pp. 72–130. [2] M. Slaney and M. Casey, "Lecture Note: LSH," 2008. [3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey, "Streaming similarity search over one billion tweets using parallel locality-sensitive hashing," Proc. VLDB Endow., vol. 6, no. 14, pp. 1930–1941, Sep. 2013.
