MinHashing: Fast Similarity Search
 


How do you find similar movies, articles, or users? Calculating similarity in a database of 2M movies and 40M check-ins in milliseconds, using the minhashing trick with a full-text search engine.

http://lanyrd.com/2013/rubyslava-july/sckrzm/

© All Rights Reserved


    Presentation Transcript

    • MinHashing: Fast Similarity Search @jsuchal @rubyslava #30 @synopsitv
    • Problem: "Users that like this also like..."
    • Jaccard Similarity
        movie1 = {user1, user2, user3, user4}
        movie2 = {user3, user4, user5, user6}
        J(A, B) = |A ∩ B| / |A ∪ B|
        J(m1, m2) = |{user3, user4}| / |{user1 ... user6}| = 2 / 6 ≈ 0.33
    • Jaccard Similarity (same example, annotated with real-world sizes)
        movie1 = {user1, user2, user3, user4}   (30K, oops!)
        movie2 = {user3, user4, user5, user6}   (2M, oops!)
        J(A, B) = |A ∩ B| / |A ∪ B|
        J(m1, m2) = |{user3, user4}| / |{user1 ... user6}| = 2 / 6 ≈ 0.33
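
The exact computation from the Jaccard slide can be sketched in a few lines of Ruby. This is only an illustration: the `jaccard` helper name and the choice to return 0.0 for two empty sets are assumptions, not something shown on the slides.

```ruby
require 'set'

# Exact Jaccard similarity: |A ∩ B| / |A ∪ B|.
def jaccard(a, b)
  union_size = (a | b).size
  return 0.0 if union_size.zero? # convention assumed here for two empty sets
  (a & b).size / union_size.to_f
end

movie1 = Set['user1', 'user2', 'user3', 'user4']
movie2 = Set['user3', 'user4', 'user5', 'user6']

puts jaccard(movie1, movie2).round(2) # 2 / 6, prints 0.33
```

Both the intersection and the union touch every member of both sets, which is why this direct approach becomes expensive once the sets hold tens of thousands or millions of users.
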
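    • MinHashing
        Key idea: What is the probability that two sets have the same minimum value?
        P( s(A) = s(B) ) = J(A, B)
        where s(X) is the minimum hash value over the items of X, using the same random hash function for both sets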
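
The key idea can be checked empirically with a small simulation. The sketch below assumes that a random permutation of the combined items is an acceptable stand-in for one random hash function; the seed and the trial count are arbitrary choices.

```ruby
require 'set'

a = Set['user1', 'user2', 'user3', 'user4']
b = Set['user3', 'user4', 'user5', 'user6']
universe = (a | b).to_a

rng = Random.new(42) # fixed seed, only so that runs are repeatable
trials = 10_000

same_minimum = trials.times.count do
  # Each item's "hash" is its position in a freshly shuffled order.
  rank = universe.shuffle(random: rng).each_with_index.to_h
  a.min_by { |item| rank[item] } == b.min_by { |item| rank[item] }
end

estimate = same_minimum / trials.to_f
exact = (a & b).size / (a | b).size.to_f
puts "estimated P(same minimum) = #{estimate.round(3)}, J(A, B) = #{exact.round(3)}"
```

The two sets share a minimum exactly when the lowest-ranked item of the union lies in the intersection, which happens with probability |A ∩ B| / |A ∪ B|; the estimate should therefore land close to 0.333.
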
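    • MinHashing
        # Smallest hash value over all items of the set, for one hash function.
        def calculate_minhash(hash_function, set)
          minhash = Float::INFINITY
          set.each do |item|
            value = hash_function.call(item)
            minhash = value if value < minhash
          end
          minhash
        end

        # A signature holds one minhash per hash function.
        # @hash_functions is not defined on the slides; it must hold callables.
        def generate_signature(set)
          @hash_functions.map do |hash_function|
            calculate_minhash(hash_function, set)
          end
        end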
    • MinHashing
        # Fraction of signature positions where both minhashes agree;
        # this fraction estimates J(A, B).
        def similarity(signature1, signature2)
          matches = 0
          signature1.each_with_index do |minhash, idx|
            matches += 1 if minhash == signature2[idx]
          end
          matches / signature1.size.to_f
        end
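
The slides do not show how the hash functions are built, so the following end-to-end sketch fills that gap with assumptions: `build_hash_functions` is a hypothetical helper, and the hash family (a * crc32(item) + b) mod p with the Mersenne prime 2**61 - 1 is one common choice, not necessarily the one used in the talk. The hash functions are passed as a parameter here instead of living in `@hash_functions`.

```ruby
require 'set'
require 'zlib'

# Hypothetical helper: `count` random hash functions of the form
# (a * crc32(item) + b) mod prime.
def build_hash_functions(count, seed: 42)
  prime = 2**61 - 1 # a Mersenne prime, assumed modulus
  rng = Random.new(seed)
  Array.new(count) do
    a = rng.rand(1...prime)
    b = rng.rand(0...prime)
    ->(item) { (a * Zlib.crc32(item.to_s) + b) % prime }
  end
end

def calculate_minhash(hash_function, set)
  set.map { |item| hash_function.call(item) }.min
end

def generate_signature(hash_functions, set)
  hash_functions.map { |hash_function| calculate_minhash(hash_function, set) }
end

def similarity(signature1, signature2)
  matches = signature1.zip(signature2).count { |x, y| x == y }
  matches / signature1.size.to_f
end

hash_functions = build_hash_functions(200)
movie1 = Set['user1', 'user2', 'user3', 'user4']
movie2 = Set['user3', 'user4', 'user5', 'user6']

sig1 = generate_signature(hash_functions, movie1)
sig2 = generate_signature(hash_functions, movie2)
puts similarity(sig1, sig2) # an estimate of J = 1/3; the exact value depends on the seed
```

Each signature has a fixed length (200 integers here) no matter how large the underlying set is, so comparing two movies costs the same whether they have four likes or two million.
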
    • Demo http://j.mp/194KB7X
    • Top-k search
        ● fast top-k search = fulltext search
            ○ elasticsearch.org
        ● curl -XGET 'localhost:9200/movies/_search?q=likes.signature:(123%20456%20789)'
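
To connect a signature to the search request on the Top-k slide, the query string can be assembled from the minhash values. The sketch below only builds the URL and sends nothing; the helper name `signature_search_url` is hypothetical, and the host, index, and field defaults are taken from the slide's curl example.

```ruby
require 'uri'

# Hypothetical helper: build a URI-search URL whose query lists every
# minhash value of the signature as a term on one field.
def signature_search_url(signature, host: 'localhost:9200', index: 'movies', field: 'likes.signature')
  query = URI.encode_www_form(q: "#{field}:(#{signature.join(' ')})")
  "http://#{host}/#{index}/_search?#{query}"
end

puts signature_search_url([123, 456, 789])
```

With the query string's default OR operator, a document matches if it shares any minhash value with the query, and documents sharing more values tend to score higher. That is what lets a full-text engine serve as a top-k similarity index.
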
    • Features
        ● easily updatable on insertion
            ○ calculate minhash of new element
            ○ update existing set minhash if new minhash is lower
        ● tunable precision at query time!
            ○ use bigger/smaller part of precalculated signature
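
Both features can be sketched briefly. The helper names (`build_hash_functions`, `add_item`) and the hash family are assumptions made for illustration; the slides only describe the behaviour.

```ruby
require 'set'
require 'zlib'

# Hypothetical helper: random hash functions (a * crc32(item) + b) mod prime.
def build_hash_functions(count, seed: 7)
  prime = 2**61 - 1 # a Mersenne prime, assumed modulus
  rng = Random.new(seed)
  Array.new(count) do
    a = rng.rand(1...prime)
    b = rng.rand(0...prime)
    ->(item) { (a * Zlib.crc32(item.to_s) + b) % prime }
  end
end

def generate_signature(hash_functions, set)
  hash_functions.map { |h| set.map { |item| h.call(item) }.min }
end

# Insertion: hash only the new item and keep the lower value per position.
def add_item(hash_functions, signature, item)
  signature.zip(hash_functions).map { |minhash, h| [minhash, h.call(item)].min }
end

# Query-time precision: compare only the first `length` positions.
def similarity(signature1, signature2, length = signature1.size)
  matches = signature1.first(length).zip(signature2.first(length)).count { |x, y| x == y }
  matches / length.to_f
end

hash_functions = build_hash_functions(100)
likes = Set['user1', 'user2', 'user3']
signature = generate_signature(hash_functions, likes)

updated = add_item(hash_functions, signature, 'user4')
rebuilt = generate_signature(hash_functions, likes | ['user4'])
puts updated == rebuilt # true: the incremental update equals a full recomputation

puts similarity(signature, updated, 25)  # coarser estimate from a quarter of the signature
puts similarity(signature, updated, 100) # estimate from the full signature
```

The update is exact because the minimum over a set plus one new item is the smaller of the old minimum and the new item's hash. Removing an item has no such shortcut in this sketch; it would require recomputing from the full set.
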
    • Extensions
        ● Weighting
            ○ not all items in set have same weight
        ● Locality sensitive hashing
            ○ shingling
            ○ boosting high similarity matching
            ○ e.g. near duplicate detection
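
Shingling is what turns a text into a set that minhashing can work on. As a rough sketch, assuming character shingles of length 5 and simple whitespace and case normalization (neither is specified on the slides), two near-duplicate titles end up with heavily overlapping sets:

```ruby
require 'set'

# Character k-shingles: every run of k consecutive characters in the text.
def shingles(text, k = 5)
  normalized = text.downcase.gsub(/\s+/, ' ').strip
  return Set[normalized] if normalized.size <= k
  Set.new(normalized.each_char.each_cons(k).map(&:join))
end

def jaccard(a, b)
  (a & b).size / (a | b).size.to_f
end

doc1 = shingles('MinHashing: fast similarity search')
doc2 = shingles('MinHashing:  Fast similarity searches')

puts jaccard(doc1, doc2) # close to 1.0 for near duplicates
```

The resulting shingle sets can be fed to the same signature and similarity functions used for sets of users.
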
    • Resources
        ● Finding similar items using minhashing
            http://www.toao.com/posts/finding-similar-items-key-store-minhashing.html
        ● Mining of Massive Datasets
            http://infolab.stanford.edu/~ullman/mmds.html