• Minwise hashing is useful if objects can be represented as sets of features and the Jaccard similarity is an appropriate similarity measure
Example (figure): object → set representation → minwise hashing → signature → similarity estimation
  “I hate the coronavirus!” → {I, hate, the, coronavirus} → 25 21 18 41 98 12 15 41
  “I hate lockdowns!” → {I, hate, lockdowns} → 25 32 18 11 98 56 33 72
• Minwise hashing is used for deduplication of similar web pages
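The pipeline on this slide (set representation, minwise hashing, signature comparison) can be sketched in a few lines of Python. This is a minimal illustration and not the implementation behind the slides: the salted BLAKE2b hash and the names `hash_value`, `minhash_signature`, and `estimate_jaccard` are assumptions chosen for the example. The last lines compare the two 8-component signatures printed on the slide.

```python
import hashlib


def hash_value(element: str, index: int) -> int:
    """Hash `element` with the index-th hash function (salted BLAKE2b, an assumption)."""
    digest = hashlib.blake2b(
        element.encode("utf-8"),
        digest_size=8,
        salt=index.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(features: set[str], m: int) -> list[int]:
    """Component i is the minimum of the i-th hash function over all features."""
    return [min(hash_value(f, i) for f in features) for i in range(m)]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature components that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


set_a = {"I", "hate", "the", "coronavirus"}
set_b = {"I", "hate", "lockdowns"}

# Exact Jaccard similarity: 2 shared features out of 5 distinct ones.
exact = len(set_a & set_b) / len(set_a | set_b)

# Estimate from two signatures with 1024 components each.
estimate = estimate_jaccard(minhash_signature(set_a, 1024),
                            minhash_signature(set_b, 1024))

# The two signatures shown on the slide agree in 3 of 8 positions.
slide_a = [25, 21, 18, 41, 98, 12, 15, 41]
slide_b = [25, 32, 18, 11, 98, 56, 33, 72]
slide_estimate = estimate_jaccard(slide_a, slide_b)  # 3 / 8 = 0.375
```

The slide's signatures give an estimate of 0.375, close to the exact similarity of 0.4. Longer signatures reduce the estimation error at the cost of more hash evaluations.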
Signature computation with 3 independent hash functions:

input set      hash 1   hash 2   hash 3
I                25       63       98
hate             67       41       18
the              79       34       35
coronavirus      36       21       52
signature        25       21       18

The minimum hash value of each hash function defines the corresponding signature component.
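The signature on this slide can be reproduced directly from the listed hash values: each component is the column-wise minimum over the elements of the input set. The dictionary name `hash_table` is only an illustrative choice.

```python
# Hash values of each element under the three independent hash functions,
# copied from the slide.
hash_table = {
    "I": [25, 63, 98],
    "hate": [67, 41, 18],
    "the": [79, 34, 35],
    "coronavirus": [36, 21, 52],
}

# zip(*rows) transposes the rows into one column per hash function;
# the minimum of each column is one signature component.
signature = [min(column) for column in zip(*hash_table.values())]
print(signature)  # [25, 21, 18]
```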
claims that Ioffe’s algorithm is wrong!
“BagMinHash - Minwise Hashing Algorithm for Weighted Sets” (Ertl, KDD 2018)
“DartMinHash: Fast Sketching for Weighted Sets” (Christiani, 2020)
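The extracted slide text does not include the definition of the similarity measure for weighted sets. As a hedged reminder, the weighted Jaccard similarity commonly used with weighted minwise hashing is the sum of element-wise minimum weights divided by the sum of element-wise maximum weights. The sketch below computes it exactly; the function name `weighted_jaccard` and the example weights are hypothetical.

```python
def weighted_jaccard(wa: dict[str, float], wb: dict[str, float]) -> float:
    """Sum of element-wise minima divided by sum of element-wise maxima.

    Elements missing from a weighted set are treated as having weight 0.
    """
    keys = set(wa) | set(wb)
    numerator = sum(min(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    denominator = sum(max(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical weights, e.g. term frequencies.
doc_a = {"I": 1.0, "hate": 2.0, "the": 1.0, "coronavirus": 1.0}
doc_b = {"I": 1.0, "hate": 1.0, "lockdowns": 3.0}

similarity = weighted_jaccard(doc_a, doc_b)  # (1 + 1) / (1 + 2 + 1 + 1 + 3) = 0.25
```

With all weights equal to 1, this reduces to the ordinary Jaccard similarity of the underlying sets.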
https://github.com/oertl/treeminhash
http://www.nrbook.com/devroye/Devroye_files/chapter_five.pdf
DartMinHash performs best if weights are normalized
Performance of DartMinHash depends on total weight
“Maximally consistent sampling and the Jaccard index of probability distributions” (Moulton & Jiang, ICDMW 2018)
“ProbMinHash - A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity” (Ertl, TKDE 2020)
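The definition of the probability Jaccard similarity did not survive in the extracted slide text. A minimal sketch, assuming the definition from the papers cited above: for every element d with positive weight in both sets, add the reciprocal of the sum over all elements e of max(w_A(e)/w_A(d), w_B(e)/w_B(d)). The function name `probability_jaccard` and the example weights are hypothetical.

```python
def probability_jaccard(wa: dict[str, float], wb: dict[str, float]) -> float:
    """Exact probability Jaccard similarity J_P of two weighted sets.

    Only elements with positive weight in both sets contribute.
    """
    keys = set(wa) | set(wb)
    total = 0.0
    for d in keys:
        a, b = wa.get(d, 0.0), wb.get(d, 0.0)
        if a > 0 and b > 0:
            denominator = sum(
                max(wa.get(e, 0.0) / a, wb.get(e, 0.0) / b) for e in keys
            )
            total += 1.0 / denominator
    return total


# Hypothetical weights.
doc_a = {"I": 1.0, "hate": 2.0, "the": 1.0, "coronavirus": 1.0}
doc_b = {"I": 1.0, "hate": 1.0, "lockdowns": 3.0}

jp = probability_jaccard(doc_a, doc_b)

# Scaling all weights of one set by a constant leaves J_P unchanged.
jp_scaled = probability_jaccard(doc_a, {k: 10.0 * w for k, w in doc_b.items()})
```

Unlike the weighted Jaccard similarity, this measure depends only on the relative weights within each set, so it compares the sets as probability distributions.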
ProbMinHash variants by label sampling and point sampling:

                                Label sampling
Point sampling      with replacement     w/o replacement
uncorrelated        ProbMinHash1         ProbMinHash2
correlated          ProbMinHash3         ProbMinHash4
Correlated point generation of ProbMinHash3/4 may reduce estimation error for small sets!
Speeding Up Minwise Hashing for Weighted Sets