# MinHashing: Fast Similarity Search

How do you find similar movies, articles, or users? Calculating similarity in a database of 2M movies and 40M check-ins in milliseconds, using the MinHashing trick with a full-text search engine.

http://lanyrd.com/2013/rubyslava-july/sckrzm/

• Blog on similar topic: http://deduplication.tumblr.com/

Are you sure you want to  Yes  No
Your message goes here
No Downloads
Views
Total views
5,097
On SlideShare
0
From Embeds
0
Number of Embeds
165
Actions
Shares
0
Downloads
55
Comments
1
Likes
15
Embeds 0
No embeds

No notes for slide

### MinHashing: Fast Similarity Search

1. MinHashing: Fast Similarity Search (@jsuchal, @rubyslava #30, @synopsitv)
2. Problem: Users that like this also like...
3. Jaccard Similarity
   - movie1 = {user1, user2, user3, user4}
   - movie2 = {user3, user4, user5, user6}
   - J(A, B) = |A ∩ B| / |A ∪ B|
   - J(m1, m2) = |{user3, user4}| / |{user1 ... user6}| = 2 / 6 ≈ 0.33
4. Jaccard Similarity
   - movie1 = {user1, user2, user3, user4} (30K, oops!)
   - movie2 = {user3, user4, user5, user6} (2M, oops!)
   - J(A, B) = |A ∩ B| / |A ∪ B|
   - J(m1, m2) = |{user3, user4}| / |{user1 ... user6}| = 2 / 6 ≈ 0.33
5. MinHashing
   - Key idea: What is the probability that two sets have the same minimum value?
6. MinHashing
   - Key idea: What is the probability that two sets have the same minimum value?
   - P( s(A) = s(B) ) = J(A, B), where s(X) is the minimum hash value over the items of set X
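7. MinHashing

   ```ruby
   # Smallest hash value over all items in the set, for one hash function.
   def calculate_minhash(hash_function, set)
     minhash = Float::INFINITY
     set.each do |item|
       value = hash_function.call(item)
       minhash = value if value < minhash
     end
     minhash
   end

   # One minhash per hash function; together they form the set's signature.
   def generate_signature(set)
     @hash_functions.map do |hash_function|
       calculate_minhash(hash_function, set)
     end
   end
   ```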
8. MinHashing

   ```ruby
   # Fraction of signature positions where both minhashes agree.
   def similarity(signature1, signature2)
     matches = 0
     signature1.each_with_index do |minhash, idx|
       matches += 1 if minhash == signature2[idx]
     end
     matches / signature1.size.to_f
   end
   ```
9. Demo: http://j.mp/194KB7X
10. Top-k search
    - fast top-k search = fulltext search
      - elasticsearch.org
    - `curl -XGET localhost:9200/movies/_search?q=likes.signature:(123 456 789)`
11. Features
    - easily updatable on insertion
      - calculate minhash of new element
      - update existing set minhash if new minhash is lower
    - tunable precision at query time!
      - use bigger/smaller part of precalculated signature
12. Extensions
    - Weighting
      - not all items in set have same weight
    - Locality sensitive hashing
      - shingling
      - boosting high similarity matching
      - e.g. near duplicate detection
13. Resources
    - Finding similar items using minhashing: http://www.toao.com/posts/finding-similar-items-key-store-minhashing.html
    - Mining of Massive Datasets: http://infolab.stanford.edu/~ullman/mmds.html
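
The Jaccard and MinHashing slides can be tried end to end with a small sketch. The slides do not show how `@hash_functions` is built, so `build_hash_functions` below is an assumption: it salts MD5 with a seed to get a family of hash functions, and any well-mixing family could stand in. The hash functions are passed as an argument here rather than read from an instance variable, purely to keep the example self-contained.

```ruby
require "set"
require "digest/md5"

# Exact Jaccard similarity: |A ∩ B| / |A ∪ B|.
def jaccard(a, b)
  union = a | b
  return 0.0 if union.empty?
  (a & b).size / union.size.to_f
end

# Hypothetical hash family (not from the slides): the first 32 bits of
# MD5("seed:item"), one lambda per seed.
def build_hash_functions(count)
  (0...count).map do |seed|
    ->(item) { Digest::MD5.hexdigest("#{seed}:#{item}")[0, 8].to_i(16) }
  end
end

# Smallest hash value over all items, for one hash function.
def calculate_minhash(hash_function, items)
  items.map { |item| hash_function.call(item) }.min
end

# One minhash per hash function.
def generate_signature(hash_functions, items)
  hash_functions.map { |h| calculate_minhash(h, items) }
end

# Fraction of positions where the two signatures agree.
def similarity(signature1, signature2)
  matches = signature1.zip(signature2).count { |a, b| a == b }
  matches / signature1.size.to_f
end

movie1 = Set["user1", "user2", "user3", "user4"]
movie2 = Set["user3", "user4", "user5", "user6"]

hash_functions = build_hash_functions(200)
sig1 = generate_signature(hash_functions, movie1)
sig2 = generate_signature(hash_functions, movie2)

exact = jaccard(movie1, movie2)    # 2 / 6
estimate = similarity(sig1, sig2)  # approximates the exact value

puts "exact: #{exact.round(3)}, estimate: #{estimate.round(3)}"
```

With more hash functions the estimate tends to sit closer to the exact value, at the cost of a longer signature per set.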
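
The two points on the Features slide (updating on insertion, and trading precision for speed at query time) might look like the sketch below. The helper names `empty_signature` and `add_item`, the optional `size` argument, and the MD5-based hash family are all assumptions made for illustration; none of them come from the slides.

```ruby
require "digest/md5"

# Hypothetical hash family (not from the slides): the first 32 bits of
# MD5("seed:item"), one lambda per seed.
def build_hash_functions(count)
  (0...count).map do |seed|
    ->(item) { Digest::MD5.hexdigest("#{seed}:#{item}")[0, 8].to_i(16) }
  end
end

# Signature of an empty set: every position starts at infinity.
def empty_signature(hash_functions)
  Array.new(hash_functions.size, Float::INFINITY)
end

# Insertion: hash the new item and keep the lower value at each position.
# Returns a new signature; the stored set itself is never rescanned.
def add_item(signature, hash_functions, item)
  signature.each_with_index.map do |minhash, idx|
    [minhash, hash_functions[idx].call(item)].min
  end
end

# Compare only the first `size` positions of the precalculated signatures.
def similarity(signature1, signature2, size = signature1.size)
  pairs = signature1.first(size).zip(signature2.first(size))
  pairs.count { |a, b| a == b } / size.to_f
end

hash_functions = build_hash_functions(100)

movie1 = empty_signature(hash_functions)
%w[user1 user2 user3 user4].each do |user|
  movie1 = add_item(movie1, hash_functions, user)
end

movie2 = empty_signature(hash_functions)
%w[user3 user4 user5 user6].each do |user|
  movie2 = add_item(movie2, hash_functions, user)
end

coarse = similarity(movie1, movie2, 20)  # cheaper, noisier
fine   = similarity(movie1, movie2)      # all 100 positions
puts "coarse: #{coarse}, fine: #{fine}"
```

Note that this only covers insertion. Removing an item cannot be handled the same way, because the signature does not remember which item produced each minimum.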
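
The query on the Top-k search slide can be assembled from a signature as sketched below. The host, index name (`movies`) and field (`likes.signature`) are taken from the slide's `curl` example; the helper name `signature_query_url` is hypothetical, and the sketch only builds the URL without contacting a server. How the signatures are indexed is not shown in the slides, so this assumes each minhash value is stored as a searchable term in that field.

```ruby
require "uri"

# Builds a query-string search URL that matches documents sharing any of the
# given minhash values. Hypothetical helper, not part of any client library.
def signature_query_url(signature, host: "localhost:9200",
                        index: "movies", field: "likes.signature")
  query = "#{field}:(#{signature.join(' ')})"
  "http://#{host}/#{index}/_search?" + URI.encode_www_form(q: query)
end

url = signature_query_url([123, 456, 789])
puts url
```

The intent, as the slide suggests, is that documents sharing more minhash values with the query rank higher, so the full-text engine's relevance ranking doubles as a top-k similarity search.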