Efficient Estimation for High Similarities
using Odd Sketches
Michael Mitzenmacher Rasmus Pagh Ninh Pham
Harvard University IT University of Copenhagen IT University of Copenhagen
Reported by
Souop Fotso Jocelyn Axel
Softskills Seminar, January 2018
Abstract
This paper presents the implementation and evaluation of the Odd Sketch,
a compact binary sketch for estimating the Jaccard similarity of two sets.
The method provides a highly space-efficient and time-efficient estimator for
sets of high similarity, which is relevant in applications such as web duplicate
detection, collaborative filtering, and association rule learning. The method
extends to weighted Jaccard similarity. Experimental results show that the
Odd Sketch is more efficient than b-bit minwise hashing schemes on association
rule learning and web duplicate detection tasks.
1. Introduction
The estimation of the Jaccard similarity is a fundamental problem in
many computer applications that deal with collections of sets containing
thousands (sometimes even billions) of items.
Given two sets S1 and S2 (S1, S2 ⊆ Ω = {0, 1, ..., D − 1}), their similarity
can be quantified using the Jaccard similarity coefficient:
J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
The main challenge in many computer applications is to obtain a quick estimate
of J. Existing solutions, while highly efficient in general, are not optimal
when J is close to 1. The paper presents a novel solution, the Odd Sketch,
that yields improved precision in the high similarity regime.
2. Previous work
2.1. Minwise Hashing
Minwise hashing is a powerful algorithmic technique to estimate set
similarities, originally proposed by Broder et al. [1].
Given a random permutation π : Ω → Ω, the Jaccard similarity of S1 and S2 is
J(S1, S2) = Pr[min(π(S1)) = min(π(S2))]
where min(π(S1)) denotes the minhash of S1. We therefore get an estimator
for J by considering a sequence of permutations π1, ..., πk and storing
the annotated minhashes
S̄1 = {(i, min(πi(S1))) | i = 1, ..., k},
S̄2 = {(i, min(πi(S2))) | i = 1, ..., k}.
We estimate J by the fraction
Ĵ = |S̄1 ∩ S̄2| / k
This estimator is unbiased, and by independence of the permutations it
can be shown that
Var(Ĵ) = J(1 − J) / k
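The estimator above can be illustrated with a minimal Python sketch. It assumes a small universe so that explicit random permutations are affordable; the function names (`jaccard`, `minhash_signature`, `minhash_estimate`) and the example sets are illustrative choices, not taken from the paper. Counting the positions where two signatures agree is the same as computing |S̄1 ∩ S̄2|, because each minhash is annotated with its permutation index.

```python
import random


def jaccard(s1, s2):
    """Exact Jaccard similarity |S1 ∩ S2| / |S1 ∪ S2|."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 1.0


def minhash_signature(s, k, universe_size, seed=0):
    """Return [min(pi_i(S)) for i = 1..k] for a non-empty set s.

    Explicit permutations are only practical for small universes; this is
    an assumption of the sketch. Both sets must use the same seed so that
    they are hashed with the same k permutations.
    """
    rng = random.Random(seed)
    signature = []
    for _ in range(k):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        signature.append(min(perm[x] for x in s))
    return signature


def minhash_estimate(sig1, sig2):
    """Fraction of permutations on which the two minhashes agree."""
    matches = sum(a == b for a, b in zip(sig1, sig2))
    return matches / len(sig1)


# Illustrative sets with |S1 ∩ S2| = 80 and |S1 ∪ S2| = 100, so J = 0.8.
S1 = set(range(0, 90))
S2 = set(range(10, 100))
k = 500
sig1 = minhash_signature(S1, k, universe_size=100, seed=42)
sig2 = minhash_signature(S2, k, universe_size=100, seed=42)

print("exact J     :", jaccard(S1, S2))
print("estimated J :", minhash_estimate(sig1, sig2))
print("predicted standard deviation:", (0.8 * (1 - 0.8) / k) ** 0.5)
```

The last line evaluates the square root of J(1 − J)/k, which gives a rough idea of how far the estimate is expected to deviate from the exact value for this choice of k.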
2.2. b-bit Minwise Hashing
Li and König [2] proposed a time- and space-efficient version of the original
minwise hashing scheme. Instead of storing 32 or 64 bits for each
minhash, this approach stores only the lowest b bits, for a small b. It is based on
the intuition that equal hash values give the same lowest b bits, whereas
different hash values give different lowest b bits with probability 1 − 1/2^b.
Proceeding as for minhash but saving only the lowest b
bits of each minhash, we obtain an estimate of J and its variance.
However, for similarity close to 1, b-bit minhash will produce almost identical
sketches, which reveal very little about *how* close to 1 the similarity is.
Therefore this approach is not optimal in a high similarity regime.
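The intuition behind b-bit minwise hashing can be sketched in Python as follows. This is a simplified illustration, not Li and König's exact estimator: it assumes that the lowest b bits of two different minhashes collide with probability exactly 1/2^b, so that the observed match rate P satisfies P = J + (1 − J)/2^b, and it inverts that relation. The helper names and the use of BLAKE2b as a stand-in for random permutations are assumptions of this sketch.

```python
import hashlib


def h64(i, x):
    """64-bit hash of element x under the i-th hash function.

    A keyed hash is used here as a stand-in for the permutation pi_i.
    """
    digest = hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bbit_signature(s, k, b):
    """Keep only the lowest b bits of each of the k minhashes of s."""
    mask = (1 << b) - 1
    return [min(h64(i, x) for x in s) & mask for i in range(k)]


def bbit_estimate(sig1, sig2, b):
    """Simplified estimate of J obtained by inverting P = J + (1 - J) / 2**b."""
    p_hat = sum(a == c for a, c in zip(sig1, sig2)) / len(sig1)
    r = 1.0 / (1 << b)
    return (p_hat - r) / (1.0 - r)


# Illustrative sets with |S1 ∩ S2| = 90 and |S1 ∪ S2| = 100, so J = 0.9.
S1 = set(range(0, 95))
S2 = set(range(5, 100))
k, b = 1000, 1
sig1 = bbit_signature(S1, k, b)
sig2 = bbit_signature(S2, k, b)

print("match rate  :", sum(a == c for a, c in zip(sig1, sig2)) / k)
print("estimated J :", bbit_estimate(sig1, sig2, b))
```

With b = 1 and J = 0.9 the expected match rate is already 0.95, which illustrates why the sketches of highly similar sets are almost identical and carry little information about how close to 1 the similarity is.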
3. Proposed solution
The authors propose the Odd Sketch, a compact binary sketch similar
to a Bloom filter with one hash function, constructed on the original
minhashes with the "odd" feature that the usual disjunction is replaced by an
exclusive-or operation.
Given a set S, the odd sketch of S, denoted odd(S), is a binary
array of size n (n > 2) that records in the ith position the parity of the number
of elements of S that are hashed (by a fully random hash function) to
position i.
Here is pseudocode for the Odd Sketch construction:
Algorithm 1 Odd sketch (S, n)
Require: the set S and the sketch size n in bits
Initialize the array A of size n to zero
Pick a random hash function h: Ω → [n]
for each set element x ∈ S do
    A[h(x)] = A[h(x)] ⊕ 1 // flip the bit at position h(x)
end for
return A
Because odd(S) records the parity of the number of elements that hash
to each location, it follows that:
odd(S1) ⊕ odd(S2) = odd(S1 Δ S2)
where Δ denotes the symmetric difference of the two sets.
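A minimal Python sketch of the construction may help here. It follows Algorithm 1, with a BLAKE2b-based `bucket` function standing in for the fully random hash function h (an assumption of this sketch, as are the function names and the example sets). It then checks the consequence of the parity argument: XOR-ing the sketches of two sets gives the sketch of their symmetric difference, because elements present in both sets flip the same bit twice and cancel out.

```python
import hashlib


def bucket(x, n, seed=0):
    """Map element x to a position in [0, n); stand-in for the random hash h."""
    digest = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % n


def odd_sketch(s, n, seed=0):
    """Odd sketch of s: bit i is the parity of the number of elements hashed to i."""
    a = [0] * n
    for x in s:
        a[bucket(x, n, seed)] ^= 1  # flip the bit at position h(x)
    return a


S1 = set(range(0, 60))
S2 = set(range(40, 100))
n = 64

sk1 = odd_sketch(S1, n)
sk2 = odd_sketch(S2, n)
xor = [a ^ b for a, b in zip(sk1, sk2)]

# The XOR of the two sketches equals the sketch of the symmetric difference.
print(xor == odd_sketch(S1 ^ S2, n))
```

In Python, `S1 ^ S2` on sets is the symmetric difference, so the final comparison is exactly the identity between the two sides.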
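The authors proved that if we construct the Odd Sketches odd(S̄1) and
odd(S̄2) from the minhash sets S̄1 and S̄2 (the annotated minhashes of the original sets S1
and S2), we can estimate the Jaccard similarity coefficient J(S1, S2) as follows:
Ĵ = 1 + n / (4k) · ln(1 − 2 |odd(S̄1) Δ odd(S̄2)| / n)
where k is the number of permutations used during the minhash step, n is the sketch size in bits, and |odd(S̄1) Δ odd(S̄2)| is the number of positions in which the two sketches differ.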
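The whole pipeline (minhash step, Odd Sketch construction, estimation) can be sketched in Python as below. The estimator used is Ĵ = 1 + n/(4k) · ln(1 − 2d/n), where d is the number of positions in which the two sketches differ. Hash functions built on BLAKE2b stand in for the random permutations and for the fully random hash function; the helper names, the parameter values, and the choice to raise an error when the sketches are saturated (2d ≥ n, where the logarithm is undefined) are assumptions of this sketch rather than details from the paper.

```python
import hashlib
import math


def h64(*parts):
    """Deterministic 64-bit hash of a tuple of values."""
    data = ":".join(map(str, parts)).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def annotated_minhashes(s, k):
    """Set of pairs (i, minhash of s under the i-th hash function)."""
    return {(i, min(h64("perm", i, x) for x in s)) for i in range(k)}


def odd_sketch(items, n):
    """Odd sketch of a set of annotated minhashes."""
    a = [0] * n
    for item in items:
        a[h64("odd", *item) % n] ^= 1
    return a


def odd_estimate(sk1, sk2, k):
    """Estimate J from two odd sketches built on k annotated minhashes each."""
    n = len(sk1)
    d = sum(a != b for a, b in zip(sk1, sk2))
    if 2 * d >= n:
        raise ValueError("sketches differ in at least half of the bits; increase n")
    return 1.0 + n / (4.0 * k) * math.log(1.0 - 2.0 * d / n)


# Illustrative sets with |S1 ∩ S2| = 90 and |S1 ∪ S2| = 100, so J = 0.9.
S1 = set(range(0, 95))
S2 = set(range(5, 100))
k, n = 1000, 1024

sk1 = odd_sketch(annotated_minhashes(S1, k), n)
sk2 = odd_sketch(annotated_minhashes(S2, k), n)
print("estimated J :", odd_estimate(sk1, sk2, k))
```

Here each set is summarized by n = 1024 bits, whatever the value of k, which is where the space saving over storing k full minhashes comes from.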
Both Odd Sketches and b-bit minwise hashing can be viewed as variations of
the original minwise hashing scheme that reduce the number of bits used. The
quality of their estimators is dependent on the quality of the original minwise
estimators. In practice, both Odd Sketches and b-bit minwise hashing need
to use more permutations but less storage space than the original minwise
hashing scheme.
4. Evaluation Highlights
In order to evaluate performance, the authors implemented b-bit minwise
hashing and the Odd Sketch in MATLAB and compared
both approaches on association rule learning and web duplicate detection
tasks. The main findings are:
• Comparing the accuracy (−log(MSE)) of both approaches on a sparse
data set, the Odd Sketch provides a smaller error than the
b-bit minwise approach even when both approaches use the same
number of permutations. The difference is more dramatic when J is very
high.
• Association rule learning: The authors measured the precision
and recall of both approaches on detecting the pairs of items that
have Jaccard similarity larger than a threshold J0 = 0.9. The results
demonstrate the superiority of the Odd Sketch over 1/2-bit
minwise hashing with respect to precision. The Odd Sketch achieved
up to 20% higher precision while providing similar recall.
• Web duplicate detection:
In this experiment, the authors compared the performance of the two
approaches on web duplicate detection tasks on the bag-of-words dataset.
They picked three high-dimensional datasets, computed all pairwise
Jaccard similarities among documents, and retrieved every pair
with J ≥ J0. For the sake of comparison, they used the same number
of permutations and considered the thresholds J0 = 0.85 and J0 = 0.90.
Precision and recall were again used as the standard measures. The
Odd Sketch is still better in precision but slightly worse
in recall.
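The precision and recall measures used in both experiments can be computed as in the following Python sketch. The function name and the similarity values are made up for illustration and are not results from the paper: a pair is relevant if its exact similarity reaches the threshold J0, and retrieved if its estimated similarity does.

```python
def precision_recall(true_sims, est_sims, threshold):
    """Precision and recall of retrieving pairs with similarity >= threshold.

    true_sims and est_sims map a pair identifier to its exact and
    estimated Jaccard similarity, respectively.
    """
    relevant = {p for p, j in true_sims.items() if j >= threshold}
    retrieved = {p for p, j in est_sims.items() if j >= threshold}
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical similarities for four pairs of items (not data from the paper).
true_sims = {("a", "b"): 0.95, ("a", "c"): 0.88, ("b", "c"): 0.91, ("c", "d"): 0.50}
est_sims = {("a", "b"): 0.93, ("a", "c"): 0.91, ("b", "c"): 0.89, ("c", "d"): 0.45}

precision, recall = precision_recall(true_sims, est_sims, threshold=0.9)
print(precision, recall)
```

In this toy example one of the two retrieved pairs is truly above the threshold and one of the two relevant pairs is retrieved, so both measures equal 0.5. An estimator with lower variance near J0 misclassifies fewer borderline pairs, which is the effect the experiments measure.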
5. Conclusion
The paper presented the Odd Sketch, a compact binary sketch for estimating
the similarity of two sets. The Odd Sketch is time- and space-efficient and gives
good results even in the high similarity regime. Experiments on synthetic
and real-world datasets demonstrate the efficiency of Odd Sketches in comparison
with b-bit minwise hashing schemes on association rule learning and
web duplicate detection tasks. The authors expect
that the Odd Sketch will be used in other applications.
6. References
[1] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise
independent permutations. J. Comput. Syst. Sci., 60(3):630–659, 2000.
[2] P. Li and A. C. König. b-bit minwise hashing. In WWW, pages 671–680,
2010.
5

More Related Content

What's hot

Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization Techniques
Ajay Bidyarthy
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
Laura Petrosanu
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...
ANIRBANMAJUMDAR18
 
Dijkstra s algorithm
Dijkstra s algorithmDijkstra s algorithm
Dijkstra s algorithm
mansab MIRZA
 
Dijkstra's Algorithm
Dijkstra's Algorithm Dijkstra's Algorithm
Dijkstra's Algorithm
Rashik Ishrak Nahian
 
Optics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structureOptics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structure
Rajesh Piryani
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
talktoharry
 
d
dd
Dijkstra & flooding ppt(Routing algorithm)
Dijkstra & flooding ppt(Routing algorithm)Dijkstra & flooding ppt(Routing algorithm)
Dijkstra & flooding ppt(Routing algorithm)
Anshul gour
 
Dijkstra's Algorithm
Dijkstra's AlgorithmDijkstra's Algorithm
Dijkstra's Algorithm
ArijitDhali
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
SSA KPI
 
Ashish garg research paper 660_CamReady
Ashish garg research paper 660_CamReadyAshish garg research paper 660_CamReady
Ashish garg research paper 660_CamReady
Ashish Garg
 
Vector quantization
Vector quantizationVector quantization
Vector quantization
Rajani Sharma
 
Color
ColorColor
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
Anna Fensel
 
A NEW PARALLEL ALGORITHM FOR COMPUTING MINIMUM SPANNING TREE
A NEW PARALLEL ALGORITHM FOR COMPUTING MINIMUM SPANNING TREEA NEW PARALLEL ALGORITHM FOR COMPUTING MINIMUM SPANNING TREE
A NEW PARALLEL ALGORITHM FOR COMPUTING MINIMUM SPANNING TREE
ijscmc
 
Networks dijkstra's algorithm- pgsr
Networks  dijkstra's algorithm- pgsrNetworks  dijkstra's algorithm- pgsr
Networks dijkstra's algorithm- pgsr
Linawati Adiman
 
Graph Based Clustering
Graph Based ClusteringGraph Based Clustering
Graph Based Clustering
SSA KPI
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
IJERA Editor
 
Dijkstra algorithm
Dijkstra algorithmDijkstra algorithm
Dijkstra algorithm
A. S. M. Shafi
 

What's hot (20)

Optimization Techniques
Optimization TechniquesOptimization Techniques
Optimization Techniques
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...
 
Dijkstra s algorithm
Dijkstra s algorithmDijkstra s algorithm
Dijkstra s algorithm
 
Dijkstra's Algorithm
Dijkstra's Algorithm Dijkstra's Algorithm
Dijkstra's Algorithm
 
Optics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structureOptics ordering points to identify the clustering structure
Optics ordering points to identify the clustering structure
 
Clustering techniques
Clustering techniquesClustering techniques
Clustering techniques
 
d
dd
d
 
Dijkstra & flooding ppt(Routing algorithm)
Dijkstra & flooding ppt(Routing algorithm)Dijkstra & flooding ppt(Routing algorithm)
Dijkstra & flooding ppt(Routing algorithm)
 
Dijkstra's Algorithm
Dijkstra's AlgorithmDijkstra's Algorithm
Dijkstra's Algorithm
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
Ashish garg research paper 660_CamReady
Ashish garg research paper 660_CamReadyAshish garg research paper 660_CamReady
Ashish garg research paper 660_CamReady
 
Vector quantization
Vector quantizationVector quantization
Vector quantization
 
Color
ColorColor
Color
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
A NEW PARALLEL ALGORITHM FOR COMPUTING MINIMUM SPANNING TREE
A NEW PARALLEL ALGORITHM FOR COMPUTING MINIMUM SPANNING TREEA NEW PARALLEL ALGORITHM FOR COMPUTING MINIMUM SPANNING TREE
A NEW PARALLEL ALGORITHM FOR COMPUTING MINIMUM SPANNING TREE
 
Networks dijkstra's algorithm- pgsr
Networks  dijkstra's algorithm- pgsrNetworks  dijkstra's algorithm- pgsr
Networks dijkstra's algorithm- pgsr
 
Graph Based Clustering
Graph Based ClusteringGraph Based Clustering
Graph Based Clustering
 
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed ArithmeticLow Power Adaptive FIR Filter Based on Distributed Arithmetic
Low Power Adaptive FIR Filter Based on Distributed Arithmetic
 
Dijkstra algorithm
Dijkstra algorithmDijkstra algorithm
Dijkstra algorithm
 

Similar to Report on Efficient Estimation for High Similarities using Odd Sketches

Joint3DShapeMatching
Joint3DShapeMatchingJoint3DShapeMatching
Joint3DShapeMatching
Mamoon Ismail Khalid
 
Joint3DShapeMatching - a fast approach to 3D model matching using MatchALS 3...
Joint3DShapeMatching  - a fast approach to 3D model matching using MatchALS 3...Joint3DShapeMatching  - a fast approach to 3D model matching using MatchALS 3...
Joint3DShapeMatching - a fast approach to 3D model matching using MatchALS 3...
Mamoon Ismail Khalid
 
Performance Improvement of Vector Quantization with Bit-parallelism Hardware
Performance Improvement of Vector Quantization with Bit-parallelism HardwarePerformance Improvement of Vector Quantization with Bit-parallelism Hardware
Performance Improvement of Vector Quantization with Bit-parallelism Hardware
CSCJournals
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
lakshmidkurup
 
Bag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse CodinBag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse Codin
Karlos Svoboda
 
Scalable Graph Clustering with Pregel
Scalable Graph Clustering with PregelScalable Graph Clustering with Pregel
Scalable Graph Clustering with Pregel
Sqrrl
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
matrixMultiplication
matrixMultiplicationmatrixMultiplication
matrixMultiplication
CNP Slagle
 
240603_Thanh_LabSeminar[Imagine That! Abstract-to-Intricate Text-to-Image Syn...
240603_Thanh_LabSeminar[Imagine That! Abstract-to-Intricate Text-to-Image Syn...240603_Thanh_LabSeminar[Imagine That! Abstract-to-Intricate Text-to-Image Syn...
240603_Thanh_LabSeminar[Imagine That! Abstract-to-Intricate Text-to-Image Syn...
thanhdowork
 
SUBGRAPH MATCHING WITH SET SIMILARITY IN A LARGE GRAPH DATABASE - IEEE PROJE...
SUBGRAPH MATCHING WITH SET SIMILARITY IN A LARGE GRAPH DATABASE  - IEEE PROJE...SUBGRAPH MATCHING WITH SET SIMILARITY IN A LARGE GRAPH DATABASE  - IEEE PROJE...
SUBGRAPH MATCHING WITH SET SIMILARITY IN A LARGE GRAPH DATABASE - IEEE PROJE...
Nexgen Technology
 
Subgraph matching with set similarity in a
Subgraph matching with set similarity in aSubgraph matching with set similarity in a
Subgraph matching with set similarity in a
Nexgen Technology
 
Subgraph matching with set similarity in a
Subgraph matching with set similarity in aSubgraph matching with set similarity in a
Subgraph matching with set similarity in a
nexgentech15
 
50120130406039
5012013040603950120130406039
50120130406039
IAEME Publication
 
High-Dimensional Network Estimation using ECL
High-Dimensional Network Estimation using ECLHigh-Dimensional Network Estimation using ECL
High-Dimensional Network Estimation using ECL
HPCC Systems
 
RS
RSRS
10.1.1.630.8055
10.1.1.630.805510.1.1.630.8055
10.1.1.630.8055
Christian Uldall Pedersen
 
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
Kyong-Ha Lee
 
Face recognition using laplacianfaces (synopsis)
Face recognition using laplacianfaces (synopsis)Face recognition using laplacianfaces (synopsis)
Face recognition using laplacianfaces (synopsis)
Mumbai Academisc
 
Bt9301, computer graphics
Bt9301, computer graphicsBt9301, computer graphics
Bt9301, computer graphics
smumbahelp
 
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
SSA KPI
 

Similar to Report on Efficient Estimation for High Similarities using Odd Sketches (20)

Joint3DShapeMatching
Joint3DShapeMatchingJoint3DShapeMatching
Joint3DShapeMatching
 
Joint3DShapeMatching - a fast approach to 3D model matching using MatchALS 3...
Joint3DShapeMatching  - a fast approach to 3D model matching using MatchALS 3...Joint3DShapeMatching  - a fast approach to 3D model matching using MatchALS 3...
Joint3DShapeMatching - a fast approach to 3D model matching using MatchALS 3...
 
Performance Improvement of Vector Quantization with Bit-parallelism Hardware
Performance Improvement of Vector Quantization with Bit-parallelism HardwarePerformance Improvement of Vector Quantization with Bit-parallelism Hardware
Performance Improvement of Vector Quantization with Bit-parallelism Hardware
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
Bag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse CodinBag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse Codin
 
Scalable Graph Clustering with Pregel
Scalable Graph Clustering with PregelScalable Graph Clustering with Pregel
Scalable Graph Clustering with Pregel
 
Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)Welcome to International Journal of Engineering Research and Development (IJERD)
Welcome to International Journal of Engineering Research and Development (IJERD)
 
matrixMultiplication
matrixMultiplicationmatrixMultiplication
matrixMultiplication
 
240603_Thanh_LabSeminar[Imagine That! Abstract-to-Intricate Text-to-Image Syn...
240603_Thanh_LabSeminar[Imagine That! Abstract-to-Intricate Text-to-Image Syn...240603_Thanh_LabSeminar[Imagine That! Abstract-to-Intricate Text-to-Image Syn...
240603_Thanh_LabSeminar[Imagine That! Abstract-to-Intricate Text-to-Image Syn...
 
SUBGRAPH MATCHING WITH SET SIMILARITY IN A LARGE GRAPH DATABASE - IEEE PROJE...
SUBGRAPH MATCHING WITH SET SIMILARITY IN A LARGE GRAPH DATABASE  - IEEE PROJE...SUBGRAPH MATCHING WITH SET SIMILARITY IN A LARGE GRAPH DATABASE  - IEEE PROJE...
SUBGRAPH MATCHING WITH SET SIMILARITY IN A LARGE GRAPH DATABASE - IEEE PROJE...
 
Subgraph matching with set similarity in a
Subgraph matching with set similarity in aSubgraph matching with set similarity in a
Subgraph matching with set similarity in a
 
Subgraph matching with set similarity in a
Subgraph matching with set similarity in aSubgraph matching with set similarity in a
Subgraph matching with set similarity in a
 
50120130406039
5012013040603950120130406039
50120130406039
 
High-Dimensional Network Estimation using ECL
High-Dimensional Network Estimation using ECLHigh-Dimensional Network Estimation using ECL
High-Dimensional Network Estimation using ECL
 
RS
RSRS
RS
 
10.1.1.630.8055
10.1.1.630.805510.1.1.630.8055
10.1.1.630.8055
 
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...
 
Face recognition using laplacianfaces (synopsis)
Face recognition using laplacianfaces (synopsis)Face recognition using laplacianfaces (synopsis)
Face recognition using laplacianfaces (synopsis)
 
Bt9301, computer graphics
Bt9301, computer graphicsBt9301, computer graphics
Bt9301, computer graphics
 
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
Efficient Solution of Two-Stage Stochastic Linear Programs Using Interior Poi...
 

Recently uploaded

Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
fredae14
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 

Recently uploaded (20)

Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 

Report on Efficient Estimation for High Similarities using Odd Sketches

  • 1. Efficient Estimation for High Similarities using Odd Sketches Michael Mitzenmacher Rasmus Pagh Ninh Pham Harvard University IT University of Copenhagen IT University of Copenhagen Reported by Souop Fotso Jocelyn Axel Softskills Seminar, January 2018 Abstract This paper present the implementation and the evaluation of Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. This method provide a highly space-efficient and time-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity. Experimental results show that the Odd Sketche is more efficient than b-bit minwise hashing schemes on associ- ation rule learning and web duplicate detection tasks. 1. Introduction The estimation of the Jaccard similarity is a fondamental problem in many computer applications in which we deal with collections of sets con- taining thousands (sometimes even billions) of items. Given two sets S1 and S1 ( S1, S2 ⊆ Ω={0, 1, ..., D − 1} ) their similarity can be quantified using the Jaccard similarity coeffcient: J(S1, S2) = |S1 ∩ S2| |S1 ∪ S2| The main challenge in many computer applications is to have an quick esti- mate of J. Existing solutions while highly efficient in general, are not optimal 1
  • 2. when J is close to 1. The paper present a novel solution, the Odd Sketch, that yields improved precision in the high similarity regime. 2. Previous works 2.1. Minwise Hashing Minwise hashing is a powerful algorithmic technique to estimate set sim- ilarities, originally proposed by Broder et al. [1]. Given a random permutation π : Ω → Ω, the Jaccard similarity of S1 and S2 is J(S1, S2) = Pr[min(π(S1)) = min(π(S2))] where min(π (S1)) denotes the minhash of S1. Therefore we get an esti- mator for J by considering a sequence of permutations π1,...,πk and storing the annotated minhashes. S1 = (i, min(πi(S1))) | i = 1, . . . , k , S1 = (i, min(πi(S2))) | i = 1, . . . , k . We estimate J by the fraction: ˆJ = |S1 ∩ S2| k This estimator is unbiased, and by independence of the permutations it can be shown that V ar(ˆJ) = J(J − 1) k 2.2. b-bit Minwise Hashing Li and Konig [2] proposed a time and space efficient version of the original minwise hashing scheme. Instead of storing b = 32 or b = 64 bits for each minhashes, this approach suggested using the lowest b bits. It is based on the intuition that the same hash values give the same lowest b bits whereas the different hash values give different lowest b bits with probability 1-1/2b . 2
  • 3. Proceeding as for minwise hashing but keeping only the lowest b bits of each minhash, we obtain an estimate of J and its variance. Writing $\bar{S}_1^b$ and $\bar{S}_2^b$ for the sets of b-bit annotated minhashes over k permutations, and correcting for the $1/2^b$ chance that two different minhashes agree on their lowest b bits:
       $$\hat{J}_b = \frac{|\bar{S}_1^b \cap \bar{S}_2^b| / k - 1/2^b}{1 - 1/2^b}, \qquad \mathrm{Var}(\hat{J}_b) = \frac{1 - J}{k}\left(J + \frac{1}{2^b - 1}\right)$$
       However, for similarities close to 1, b-bit minwise hashing produces almost identical sketches, which reveal very little about *how* close to 1 the similarity is. This approach is therefore not optimal in the high similarity regime.
       3. Proposed solution
       The authors propose the Odd Sketch, a compact binary sketch similar to a Bloom filter with one hash function, constructed on the original minhashes, with the "odd" feature that the usual disjunction is replaced by an exclusive-or operation. Given a set S, the odd sketch of S, denoted odd(S), is a binary array of size n (n > 2) that records in position i the parity of the number of elements of S that are hashed (by a fully random hash function) to position i. Pseudocode for the Odd Sketch construction:
       Algorithm 1 Odd sketch (S, n)
       Require: the set S and the sketch size in bits n
         Initialize the array A of size n to zero
         Pick a random hash function h : Ω → [n]
         for each set element x ∈ S do
           A[h(x)] = A[h(x)] ⊕ 1   // flip the bit at position i = h(x)
         end for
         return A
       Because odd(S) records the parity of the number of elements that hash to each location, it follows that
       $$\mathrm{odd}(S_1) \oplus \mathrm{odd}(S_2) = \mathrm{odd}(S_1 \,\triangle\, S_2)$$
       that is, the exclusive-or of two odd sketches built with the same hash function is the odd sketch of the symmetric difference of the two sets.
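The construction in Algorithm 1 and the parity property of odd(S) can be checked with a short Python sketch. The hash function below is an assumption: a SHA-256 digest reduced modulo n stands in for the fully random hash function h, and the names `h`, `odd_sketch` and `xor_sketches` are hypothetical, not taken from the paper.

```python
import hashlib


def h(x, n, seed=0):
    """Stand-in for a fully random hash function from the universe to [n]."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


def odd_sketch(s, n, seed=0):
    """Binary array of size n holding, per position, the parity of hits."""
    a = [0] * n
    for x in s:
        a[h(x, n, seed)] ^= 1  # flip the bit at position h(x)
    return a


def xor_sketches(a, b):
    """Bitwise exclusive-or of two sketches of equal size."""
    return [u ^ v for u, v in zip(a, b)]


if __name__ == "__main__":
    s1, s2 = set(range(0, 100)), set(range(10, 110))
    combined = xor_sketches(odd_sketch(s1, 64), odd_sketch(s2, 64))
    # Elements common to both sets flip the same bit twice and cancel out,
    # so only the symmetric difference s1 ^ s2 leaves a trace.
    print(combined == odd_sketch(s1 ^ s2, 64))
    print(sum(combined), "bits differ between the two sketches")
```

Both sketches must use the same hash function (here, the same `seed` and `n`), otherwise shared elements land in different positions and no longer cancel.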
  • 4. The authors proved that if we construct the Odd Sketches $\mathrm{odd}(\bar{S}_1)$ and $\mathrm{odd}(\bar{S}_2)$ from the minhashes $\bar{S}_1$ and $\bar{S}_2$ derived from the original sets $S_1$ and $S_2$, we can estimate the Jaccard similarity coefficient $J(S_1, S_2)$ as follows:
       $$\hat{J}^{odd} = 1 + \frac{n}{4k} \ln\left(1 - \frac{2\,|\mathrm{odd}(\bar{S}_1) \,\triangle\, \mathrm{odd}(\bar{S}_2)|}{n}\right)$$
       where k is the number of permutations used in the minhash step, n is the sketch size in bits, and $|\mathrm{odd}(\bar{S}_1) \,\triangle\, \mathrm{odd}(\bar{S}_2)|$ is the number of positions in which the two sketches differ.
       Both Odd Sketches and b-bit minwise hashing can be viewed as variations of the original minwise hashing scheme that reduce the number of bits used. The quality of their estimators depends on the quality of the original minwise estimators. In practice, both Odd Sketches and b-bit minwise hashing need to use more permutations but less storage space than the original minwise hashing scheme.
       4. Evaluation Highlights
       To evaluate performance, the authors implemented b-bit minwise hashing and the Odd Sketch in MATLAB and compared the two approaches on association rule learning and web duplicate detection tasks. The main findings are:
         • Comparing the accuracy (-log(MSE)) of both approaches on a sparse data set, the Odd Sketch provides a smaller error than b-bit minwise hashing even when both approaches use the same number of permutations. The difference is more dramatic when J is very high.
         • Association rule learning: The authors measured the precision-recall ratio of both approaches at detecting pairs of items whose Jaccard similarity exceeds a threshold J0 = 0.9. The results demonstrate the superiority of the Odd Sketch over 1/2-bit minwise hashing with respect to precision. The Odd Sketch achieved up to 20% higher precision while providing similar recall.
  • 5.   • Web duplicate detection: In this experiment, the authors compared the two approaches on web duplicate detection tasks using the bag-of-words dataset. They picked three high-dimensional datasets, computed all pairwise Jaccard similarities among documents, and retrieved every pair with J ≥ J0. For the sake of comparison, they used the same number of permutations and considered the thresholds J0 = 0.85 and J0 = 0.90. The precision-recall ratio was again used as the standard measure. The Odd Sketch is still better in precision but slightly worse in recall.
       5. Conclusion
       The paper presented the Odd Sketch, a compact binary sketch for estimating the similarity of two sets. The Odd Sketch is time- and space-efficient and gives good results even in the high similarity regime. Experiments on synthetic and real-world datasets demonstrate the efficiency of Odd Sketches in comparison with b-bit minwise hashing schemes on association rule learning and web duplicate detection tasks. The authors expect the Odd Sketch to be used in other applications.
       6. References
       [1] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630-659, 2000.
       [2] P. Li and A. C. König. b-bit minwise hashing. In WWW, pages 671-680, 2010.