- 1. Mining of Massive Datasets using Locality Sensitive Hashing (LSH), by J Singh, January 9, 2014
- 2. The problems (© DataThinks 2013-14)
  - Large scale image search:
    - We have a candidate image
    - Search the internet to find similar images
  - Large scale source repo search:
    - We have a candidate source repo
    - Search GitHub to find similar source repos
  - Large scale document search:
    - We have a candidate document
    - Search for similar documents to find possible plagiarism
  - Large scale X search:
    - We have a candidate X
    - Search for similar X’s
- 3. A Motivating Example
  - People Like You
    - Characterize your Facebook friends
    - Find Facebook friends and friends-of-friends who like the same things you do.
  - Disclosure
    - This is a pedagogical example, loosely patterned after ShoutFlow
    - I have no knowledge of how ShoutFlow actually worked
    - I have no connection with the people involved
- 4. A Likeness Score is…
  - A number from 0 to 100%
    - Likeness between Harry and Sally is 100% if they like exactly the same things
    - Technically, the Jaccard similarity = $|\text{Likes}_{\text{Harry}} \cap \text{Likes}_{\text{Sally}}| \,/\, |\text{Likes}_{\text{Harry}} \cup \text{Likes}_{\text{Sally}}|$
  - But mind the $n^2$ problem: 1 billion users means about $5 \times 10^{17}$ pairs!
- 5. Basic Algorithm
  1. Walk the graph
     - Build a data set of all users and their friends
     - If access is denied, skip
  2. Cluster all billion users into “hash buckets” with similar likes
  3. When a new user logs in, hash their likes and compare their similarity with other users in that bucket.
  - The magic is in the hashing!
- 6. The LSH Idea
  - Treat n-valued items as vectors in n-dimensional space.
  - Draw k random hyperplanes in that space.
  - For each hyperplane:
    - Is each vector above it (1) or below it (0)?
  - The k bits, one per hyperplane, form the item's hash. For example, with k = 3:
    - Hash(Item1) = 011
    - Hash(Item2) = 001
  - The magic is in choosing h1, h2, etc.
- 7. The LSH Hash Code was a Lie…
  - …But the idea of boiling down a complex object into something that is quickly and easily compared with other complex objects is what matters.
  - [Figure: purple blocks grouped into buckets]
  - Each purple block represents a person
    - Each bucket represents a group of people who are alike
  - Members within each bucket still need to be compared to see which ones are the “closest”
- 8. Choosing hash functions
  - Introducing minhash
    1. Gather the LikeIDs for a person.
    2. Calculate the hash value for every LikeID.
    3. Store the minimum hash value found in step 2.
    4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 minhash values.
  - The resulting minhashes are 200 integer values representing a random selection of Likes.
    - Property of minhashes: for any one hash function, the chance that two people get the same minhash is approximately the Jaccard similarity of their Likes, so the more minhashes two people share, the more alike their Likes are likely to be
- 9. All 200 minhashes must match?
  - There is a lot of sampling going on in the algorithm.
  - Make sure we catch most cases
    - Don’t compare all minhashes at once; compare them in bands. Candidate pairs are those that hash to the same bucket for ≥ 1 band.
    - Sometimes one band will reject a pair and another band will consider it a candidate.
- 10. But 200 was just a guess, no?
  - Actually, the parameters of the algorithm need to be tuned
    - Tune b (number of bands) and r (number of hash functions per band) to catch most similar pairs, but few non-similar pairs.
- 11. LSH Involves a Tradeoff
  - Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives and false negatives.
    - False positives: we examine pairs that are not really similar. More processing resources, more time.
    - False negatives: we fail to examine pairs that were similar, so we don’t find all similar results. But we get done faster!
- 12. LSH Tradeoff Example
  - If we had fewer than 20 bands (and more rows per band):
    - fewer pairs would be selected for comparison,
    - the number of false positives would go down,
    - but the number of false negatives would go up.
    - Performance would go up, but so would the error rate!
- 13. Running LSH on a cluster of machines
  - Can be implemented on a MapReduce architecture
  - [Figure: Map Step, Buckets, Reduce Step]
- 14. Summary
  - Mine the data and place members into hash buckets
  - When you need to find a match, hash it and possible nearest neighbors will be in one of b buckets.
  - Algorithm performance: O(n)
- 15. Thank you
  - J Singh
    - Principal, DataThinks (j.singh@datathinks.org)
    - Adj. Prof, WPI
  - References:
    - Mining of Massive Datasets, Chapter 3, by Anand Rajaraman and Jeff Ullman. http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
    - Matt’s Blog, Minhash for Dummies. http://matthewcasperson.blogspot.com/2013/11/minhash-fordummies.html
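
Slide 4 describes the likeness score as a ratio of shared Likes to all Likes, and warns about the number of pairs among a billion users. The sketch below illustrates both points. The function names and the sample Likes are made up for illustration and do not come from the slides.

```python
def jaccard_similarity(likes_a: set, likes_b: set) -> float:
    """Return |A ∩ B| / |A ∪ B|. Two empty sets are treated as 0.0 (an assumption)."""
    union = likes_a | likes_b
    if not union:
        return 0.0
    return len(likes_a & likes_b) / len(union)


def pair_count(n: int) -> int:
    """Number of unordered pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical Likes for the two people named on the slide.
harry = {"jazz", "hiking", "chess", "sushi"}
sally = {"jazz", "hiking", "opera", "sushi"}

print(jaccard_similarity(harry, sally))  # 3 shared out of 5 distinct Likes
print(pair_count(10**9))                 # roughly 5 x 10^17
```

Comparing every pair directly is what the hashing steps on the later slides are meant to avoid.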
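
Slide 6 sketches hashing with random hyperplanes: one bit per hyperplane, depending on which side the vector falls. A minimal version might look like the following. It assumes hyperplanes through the origin with Gaussian-distributed normal vectors, which is a common choice but is not stated on the slide; the function names are hypothetical.

```python
import random


def random_hyperplanes(k: int, dim: int, seed: int = 0) -> list:
    """Draw k random normal vectors, one per hyperplane through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def hyperplane_hash(vector: list, planes: list) -> str:
    """Emit '1' if the vector lies on the positive side of a plane, else '0'."""
    bits = []
    for normal in planes:
        dot = sum(v * w for v, w in zip(vector, normal))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


planes = random_hyperplanes(k=3, dim=4, seed=42)
item1 = [1.0, 2.0, 0.5, -1.0]
print(hyperplane_hash(item1, planes))  # a 3-bit string such as the ones on slide 6
```

Vectors pointing in similar directions tend to fall on the same side of most hyperplanes, so their bit strings tend to agree.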
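
The four minhash steps on slide 8 can be sketched as below. The slide calls for 200 "different hash algorithms"; this sketch stands in for them with a single salted BLAKE2b hash from Python's `hashlib`, using the hash index as the salt. That substitution, the function names, and the sample Likes are assumptions for illustration.

```python
import hashlib


def seeded_hash(seed: int, item: str) -> int:
    """One member of a family of hash functions, selected by `seed`."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(like_ids: set, num_hashes: int = 200) -> list:
    """Steps 2 to 4: keep the minimum hash per function. Requires a non-empty set."""
    return [
        min(seeded_hash(seed, like) for like in like_ids)
        for seed in range(num_hashes)
    ]


def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


harry = {"jazz", "hiking", "chess", "sushi"}
sally = {"jazz", "hiking", "opera", "sushi"}
estimate = estimate_similarity(minhash_signature(harry), minhash_signature(sally))
print(estimate)  # an estimate of the Jaccard similarity (true value: 0.6)
```

The estimate is a sample average, so it varies around the true similarity; more hash functions reduce that variation at the cost of more work per person.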
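
Slide 9 says to compare minhashes in bands and to treat two people as a candidate pair if they share a bucket in at least one band. One way to sketch that, assuming each signature is a list of `bands * rows` integers and using the band's values directly as the bucket key (a simplification; the names and toy signatures are hypothetical):

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures: dict, bands: int, rows: int) -> set:
    """Return pairs of names whose signatures agree on every row of some band."""
    candidates = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            assert len(sig) == bands * rows, "signature length must equal bands * rows"
            # The tuple of this band's minhashes acts as the bucket key.
            buckets[tuple(sig[start:start + rows])].append(name)
        for members in buckets.values():
            candidates.update(combinations(sorted(members), 2))
    return candidates


# Toy signatures with 2 bands of 2 rows each.
toy = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 9, 9],  # matches "a" in band 0 only
    "c": [7, 7, 3, 5],  # matches nobody: band 1 differs from "a" in one row
    "e": [5, 5, 3, 4],  # matches "a" in band 1 only
}
print(sorted(candidate_pairs(toy, bands=2, rows=2)))
```

Only the candidate pairs then need a full similarity comparison, which matches the remark on slide 7 that members of a bucket still have to be compared with each other.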
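
The tradeoff on slides 10 to 12 can be made concrete with the standard banding formula from Chapter 3 of Mining of Massive Datasets: a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^b. The sketch below compares two ways to split 200 minhashes. The split of 20 bands by 10 rows is an assumption inferred from the "20 bands" mentioned on slide 12; the slides do not state r.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that a pair with Jaccard similarity s shares at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


for s in (0.3, 0.5, 0.8, 0.9):
    p_20_bands = candidate_probability(s, bands=20, rows=10)
    p_10_bands = candidate_probability(s, bands=10, rows=20)
    print(f"s={s:.1f}  20x10: {p_20_bands:.4f}  10x20: {p_10_bands:.4f}")
```

With fewer bands and more rows per band, the probability drops at every similarity level: fewer dissimilar pairs slip through (fewer false positives), but more genuinely similar pairs are missed (more false negatives), as slide 12 describes.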
