Minwise hashing (MinHash) has become a standard tool for calculating signatures (fingerprints) of sets that is used in many applications for similarity estimation and nearest neighbor search. Generalizations have been proposed that are able to calculate signatures for weighted sets and allow estimating either the weighted Jaccard similarity or the probability Jaccard similarity. While there are already very fast algorithms for calculating signatures of unweighted sets, until recently there were no such algorithms for weighted sets. In this talk, the basic ideas of the latest weighted minwise hashing algorithms BagMinHash, DartMinHash, TreeMinHash, and ProbMinHash are presented. All of them have been developed only in the last two years and can reduce the computation costs by many orders of magnitude.
4. 4
• Useful if objects can be represented as sets of features
• and Jaccard similarity is an appropriate similarity measure
coronavirus
hate
the
“I hate the coronavirus!”
I
“I hate lockdowns!”
25 21 18 41 98 12 15 41
25 32 18 11 98 56 33 72
Set representation
lockdowns
hateI
Object Signature Similarity estimation
Minwise hashing
Minwise hashing
used for deduplication of similar web pages
5. 5
I 25 63 98
hate 67 41 18
the 79 34 35
coronavirus 36 21 52
25 21 18
input set
signature
minimum hash value
defines signature component
independent hash functions