Mining of massive datasets using locality sensitive hashing (LSH)
Published in: Technology, Education

  • 1. Mining of Massive Datasets using Locality Sensitive Hashing (LSH)
      J Singh, January 9, 2014
  • 2. The problems
      • Large scale image search:
          – We have a candidate image
          – Search the internet to find similar images
      • Large scale source repo search:
          – We have a candidate source repo
          – Search GitHub to find similar source repos
      • Large scale document search:
          – We have a candidate document
          – Search for similar documents to find possible plagiarism
      • Large scale X search:
          – We have a candidate X
          – Search for similar X’s
      © DataThinks 2013-14
  • 3. A Motivating Example
      • People Like You
          – Characterize your Facebook friends
          – Find Facebook friends and friends-of-friends who like the same things you do
      • Disclosure
          – This is a pedagogical example, loosely patterned after ShoutFlow
          – I have no knowledge of how ShoutFlow actually worked
          – I have no connection with the people involved
  • 4. A Likeness Score is…
      • A number from 0 to 100%
          – Likeness between Harry and Sally is 100% if they like exactly the same things
          – Technically, the Jaccard similarity = |LikesHarry ∩ LikesSally| / |LikesHarry ∪ LikesSally|
      • But mind the n² problem: 1 billion users means about 5 × 10^17 pairs!
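The likeness score and the pair count on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names and the sample likes are made up for illustration, and treating two empty like-sets as identical is a convention chosen here, not something the slide specifies.

```python
def jaccard_similarity(likes_a, likes_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of likes."""
    a, b = set(likes_a), set(likes_b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def pair_count(n):
    """Number of unordered pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical likes: 3 shared items out of 5 distinct items overall.
harry = {"jazz", "chess", "hiking", "sushi"}
sally = {"jazz", "chess", "opera", "sushi"}

print(jaccard_similarity(harry, sally))
print(pair_count(10**9))
```

With one billion users the pair count comes out just under 5 × 10^17, which is why comparing every pair directly is not practical.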
  • 5. Basic Algorithm
      1. Walk the graph
          – Build a data set of all users and their friends
          – If access denied, skip
      2. Cluster all billion users into “hash buckets” with similar likes
      3. When a new user logs in, hash their likes and compare their similarity with other users in that bucket.
      • The magic is in the hashing!
  • 6. The LSH Idea
      • Treat n-valued items as vectors in n-dimensional space.
      • Draw k random hyperplanes in that space.
      • For each hyperplane:
          – Is each vector above it (1) or below it (0)?
      • Hash(Item1) = 011
      • Hash(Item2) = 001
      • The magic is in choosing h1, h2, etc.
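The hyperplane idea can be sketched in plain Python. The assumptions here are mine, not the slide's: each hyperplane passes through the origin and is described by a random Gaussian normal vector, and "above" means a non-negative dot product with that normal. The names `random_hyperplanes`, `hyperplane_hash`, and `hamming` are illustrative.

```python
import random


def random_hyperplanes(k, dim, seed=0):
    """Draw k random normal vectors, one per hyperplane through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def hyperplane_hash(vector, planes):
    """One bit per hyperplane: '1' if the vector is on the positive side."""
    bits = []
    for normal in planes:
        dot = sum(v * w for v, w in zip(vector, normal))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def hamming(code_a, code_b):
    """Count the bit positions where two hash codes differ."""
    return sum(x != y for x, y in zip(code_a, code_b))


planes = random_hyperplanes(k=16, dim=4)
item1 = [1.0, 0.9, 0.0, 0.1]
item2 = [0.9, 1.0, 0.1, 0.0]       # points in nearly the same direction as item1
item3 = [-1.0, -0.9, 0.0, -0.1]    # points in the opposite direction

h1, h2, h3 = (hyperplane_hash(v, planes) for v in (item1, item2, item3))
print(h1, h2, h3)
print(hamming(h1, h2), hamming(h1, h3))
```

Vectors that point in similar directions tend to agree on most bits, while a vector pointing the opposite way lands on the other side of every hyperplane.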
  • 7. The LSH Hash Code was a Lie…
      • …But the idea of boiling down a complex object into something that is quickly and easily compared with other complex objects is what matters.
      • [Figure: Buckets] Each purple block represents a person
          – Each bucket represents a group of people who are alike
      • Members within each bucket still need to be compared to see which ones are the “closest”
  • 8. Choosing hash functions
      • Introducing minhash
          1. Gather the LikeIDs for a person.
          2. Calculate the hash value for every LikeID.
          3. Store the minimum hash value found in step 2.
          4. Repeat steps 2 and 3 with different hash functions 199 more times to get a total of 200 minhash values.
      • The resulting minhashes are 200 integer values representing a random selection of Likes.
          – Property of minhashes: if the minhashes for two people are the same, their Likes are likely to be the same. For any single minhash, the probability of a match equals the Jaccard similarity of the two sets of Likes.
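The four minhash steps can be sketched as follows. The slide does not say which hash functions were used, so this example assumes a family built by salting SHA-1 with a seed number; `salted_hash`, `minhash_signature`, and `estimated_similarity` are illustrative names, and the likes are made up.

```python
import hashlib

NUM_HASHES = 200  # the figure used on the slide


def salted_hash(seed, like_id):
    """One member of an assumed hash family: SHA-1 of 'seed:like_id', as a 64-bit int."""
    data = f"{seed}:{like_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(like_ids, num_hashes=NUM_HASHES):
    """Steps 2 to 4: for each hash function, keep the minimum hash over all likes."""
    if not like_ids:
        raise ValueError("need at least one like to build a signature")
    return [min(salted_hash(seed, like) for like in like_ids)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


harry_sig = minhash_signature({"jazz", "chess", "hiking", "sushi"})
sally_sig = minhash_signature({"jazz", "chess", "opera", "sushi"})
stranger_sig = minhash_signature({"golf", "poker"})

print(estimated_similarity(harry_sig, sally_sig))     # true Jaccard similarity is 0.6
print(estimated_similarity(harry_sig, stranger_sig))  # no likes in common
```

The fraction of matching minhashes is only an estimate of the true similarity; using more hash functions tightens it at the cost of more computation.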
  • 9. All 200 minhashes must match?
      • There is a lot of sampling going on in the algorithm.
      • Make sure we catch most cases
          – Don’t compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for ≥ 1 band.
          – Sometimes one band will reject a pair and another band will consider it a candidate.
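Banding can be illustrated with a toy example. As a simplification, this sketch uses the tuple of values in a band directly as the bucket key instead of hashing it to a bucket number; the tiny six-value signatures and the name `candidate_pairs` are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands, rows):
    """Return pairs of names whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            if len(sig) != bands * rows:
                raise ValueError("signature length must equal bands * rows")
            # Simplification: the band's values themselves act as the bucket key.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        for members in buckets.values():
            for pair in combinations(sorted(members), 2):
                candidates.add(pair)
    return candidates


# Toy signatures: 3 bands of 2 rows each.
signatures = {
    "harry": [1, 2, 3, 4, 5, 6],
    "sally": [1, 2, 9, 9, 5, 7],   # agrees with harry on band 0 only
    "tom":   [8, 8, 8, 8, 8, 8],   # agrees with nobody
}

print(candidate_pairs(signatures, bands=3, rows=2))
```

Here bands 1 and 2 reject the harry/sally pair, but band 0 accepts it, so the pair still becomes a candidate for a full comparison.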
  • 10. But 200 was just a guess, no?
      • Actually, the parameters of the algorithm need to be tuned
          – Tune b (number of bands) and r (number of hash functions per band) to catch most similar pairs, but few non-similar pairs.
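One way to explore the tuning space is to list every way of splitting the minhashes into b bands of r rows and look at the similarity level where a pair starts to become a likely candidate. The sketch below uses the rule of thumb t ≈ (1/b)^(1/r) from Chapter 3 of Mining of Massive Datasets; the function names are illustrative.

```python
def approximate_threshold(bands, rows):
    """Rule-of-thumb similarity at which pairs start becoming likely candidates."""
    return (1.0 / bands) ** (1.0 / rows)


def band_layouts(num_hashes):
    """All (bands, rows) splits that use every minhash exactly once."""
    return [(b, num_hashes // b)
            for b in range(1, num_hashes + 1)
            if num_hashes % b == 0]


for bands, rows in band_layouts(200):
    threshold = approximate_threshold(bands, rows)
    print(f"b={bands:3d}  r={rows:3d}  threshold≈{threshold:.3f}")
```

More bands with fewer rows each lowers the threshold, so more pairs become candidates; fewer bands with more rows raises it.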
  • 11. LSH Involves a Tradeoff
      • Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives and false negatives.
          – False positives: we examine more pairs that are not really similar. More processing resources, more time.
          – False negatives: we fail to examine pairs that were similar, so we don’t find all similar results. But we finish faster!
  • 12. LSH Tradeoff Example
      • If we had fewer than 20 bands (and more rows per band):
          – fewer pairs would be selected for comparison,
          – the number of false positives would go down,
          – but the number of false negatives would go up.
          – Performance would go up, but so would the error rate!
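The tradeoff can be made concrete with the standard banding formula: a pair with similarity s becomes a candidate with probability 1 − (1 − s^r)^b. The sketch below assumes the 200 minhashes were split into 20 bands of 10 rows (the slide mentions only the 20 bands; the 10 rows are inferred) and compares that with 10 bands of 20 rows. The function name is illustrative.

```python
def candidate_probability(similarity, bands, rows):
    """Probability that a pair agrees on all rows of at least one band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Assumed baseline: 20 bands x 10 rows. Alternative: 10 bands x 20 rows.
for s in (0.5, 0.7, 0.8, 0.9):
    baseline = candidate_probability(s, bands=20, rows=10)
    fewer_bands = candidate_probability(s, bands=10, rows=20)
    print(f"s={s:.1f}  20x10: {baseline:.3f}  10x20: {fewer_bands:.3f}")
```

For a pair with similarity 0.8, the chance of being examined drops from roughly 90% to roughly 11% when moving to 10 bands of 20 rows: far fewer comparisons, but many more similar pairs are missed.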
  • 13. Running LSH on a cluster of machines
      • Can be implemented on a MapReduce architecture
      • [Figure: Buckets, Map Step, Reduce Step]
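The slide does not spell out what the map and reduce steps do, so the following single-process sketch is one plausible arrangement rather than the author's design: the map step emits one (band, bucket) key per band for each user, a shuffle groups users by key, and the reduce step emits candidate pairs from each bucket. All names and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_step(user, signature, bands, rows):
    """Emit one ((band index, band values), user) record per band."""
    for band in range(bands):
        key = (band, tuple(signature[band * rows:(band + 1) * rows]))
        yield key, user


def shuffle(records):
    """Group values by key, as a MapReduce framework would between the two steps."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def reduce_step(key, users):
    """Emit every pair of users that landed in the same bucket."""
    return list(combinations(sorted(users), 2))


signatures = {
    "harry": [1, 2, 3, 4, 5, 6],
    "sally": [1, 2, 9, 9, 5, 7],
    "tom":   [8, 8, 8, 8, 8, 8],
}

records = [record
           for user, sig in signatures.items()
           for record in map_step(user, sig, bands=3, rows=2)]

pairs = set()
for key, users in shuffle(records).items():
    pairs.update(reduce_step(key, users))

print(pairs)
```

Because each map call looks at one user and each reduce call looks at one bucket, both steps can run in parallel across machines.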
  • 14. Summary
      • Mine the data and place members into hash buckets
      • When you need to find a match, hash it; possible nearest neighbors will be in one of b buckets.
      • Algorithm performance: O(n)
  • 15. Thank you
      • J Singh
          – Principal, DataThinks
          – Adj. Prof, WPI
      • References:
          – Mining of Massive Datasets, Chapter 3, by Anand Rajaraman and Jeff Ullman.
          – Matt’s Blog, Minhash for Dummies