3.1.2 MinHash for similarities
The adjacency matrix is helpful, but it becomes difficult to use when the data is large, because its
size and the cost of building it grow quickly. Estimating the similarity of all pairs takes O(n^2) time
and space. This is problematic at scale. Take, for example, a commerce site with 12 million products
for which we want to compute and rank similarity scores. With 12 million elements there are
144 x 10^12 pairs. If each pair is stored as a 64-bit float (8 bytes), we need 1.152 x 10^15 bytes to
hold the adjacency matrix in memory. Data of this size is difficult to work with, and things get worse
with larger datasets such as social network or web data. Besides, the data is highly likely to have
many features (columns). It is therefore exceedingly challenging to store this data in memory and
perform similarity checks on every pair, so we need an alternative technique to locate groups of highly
similar pairs without checking all pairs. MinHash allows us to compress all these features into a much
lower-dimensional space while still preserving the similarities found in the original high-dimensional
space [20,9].
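
The storage estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic. The variable names are illustrative, and the count of distinct unordered pairs is an extra figure that the text does not state.

```python
# Back-of-the-envelope cost of a dense pairwise similarity matrix.
n_products = 12_000_000      # number of products, as in the example above
BYTES_PER_FLOAT64 = 8        # a 64-bit float occupies 8 bytes

all_pairs = n_products ** 2                   # 144 x 10^12 matrix entries
matrix_bytes = all_pairs * BYTES_PER_FLOAT64  # 1.152 x 10^15 bytes

# Assumption beyond the text: if only distinct unordered pairs were stored
# (no self-pairs, no mirrored entries), the count would be roughly halved.
distinct_pairs = n_products * (n_products - 1) // 2

print(f"matrix entries : {all_pairs:.3e}")
print(f"matrix bytes   : {matrix_bytes:.3e}")
print(f"distinct pairs : {distinct_pairs:.3e}")
```

Even the halved figure is far beyond what fits in memory, which is why the all-pairs approach is abandoned.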
The basic idea is that the compressed feature space preserves the similarity between any two objects.
Each object is represented by a small signature that is much smaller than its full feature vector, and
the similarity between two signatures is equal, or very close, to the similarity between the full
feature vectors. Because each object is treated as a set, we use Jaccard similarity to find similar
sets. MinHash lets us evaluate similarity in a low-dimensional space, and locality-sensitive hashing
(LSH) lets us deal with the pair problem: we only evaluate similarity for a set of candidate pairs.
Pairs only matter if their similarity exceeds a threshold, which lets us skip a lot of pair checking.
When working with the small signatures, we do not have to keep the full feature vectors in memory. The
similarity of a pair is approximately equal to the similarity of their signatures. The final step is to
take the pairs with similar signatures and measure their similarity on the full feature vectors. The
key idea is to hash each element with a hash function.
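
The candidate-then-verify flow described above can be sketched as follows. The text does not say how candidates are selected, so the banding scheme used here (split each signature into bands and treat objects that share a whole band as candidates) is an assumption based on the usual LSH construction for MinHash. The toy sets, the hand-written signatures, and the threshold are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def candidate_pairs(signatures, bands):
    """Pairs of keys whose signatures agree on every row of at least one band."""
    length = len(next(iter(signatures.values())))
    rows = length // bands
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            # Objects with an identical slice of the signature share a bucket.
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(key)
        for keys in buckets.values():
            pairs.update(combinations(sorted(keys), 2))
    return pairs

# Hypothetical data: full feature sets and made-up length-4 signatures.
items = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {7, 8, 9}}
signatures = {"a": [3, 1, 4, 1], "b": [3, 1, 5, 9], "c": [2, 6, 5, 3]}

THRESHOLD = 0.5  # hypothetical similarity threshold
candidates = candidate_pairs(signatures, bands=2)
# Final step: verify each candidate against the full feature sets.
similar = {(x, y) for x, y in candidates
           if jaccard(items[x], items[y]) >= THRESHOLD}
```

Only the pair ("a", "b") shares a band here, so only that pair is compared on the full sets. The other two pairs are never checked.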
Hashing converts an input of any length into a fixed-size string using a mathematical function. Any
text can be converted into a fixed-length sequence of numbers and letters by the algorithm. The message
to be hashed is the input, the algorithm is called the hash function, and the output is called the hash
value. Hash values should be as distinct as possible: since the output has a fixed size, collisions
cannot be ruled out, but it should be very unlikely for two different inputs to produce the same hash
value. The same message should always produce the same hash value. Speed is also an essential factor:
the hash function should compute hash values quickly.
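
These properties can be observed with Python's standard hashlib module. The text does not name a particular hash function, so the choice of SHA-256 is an assumption made for illustration, and hash_value is a hypothetical helper name.

```python
import hashlib

def hash_value(message: str) -> str:
    """Hash a text message to a fixed-size hexadecimal string (SHA-256)."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()

short_digest = hash_value("a")
long_digest = hash_value("a" * 10_000)

# Fixed size: both digests are 64 hexadecimal characters, whatever the input length.
print(len(short_digest), len(long_digest))
# Deterministic: the same message always yields the same hash value.
print(hash_value("a") == short_digest)
# Different messages yield different hash values here (collisions are possible
# in principle, but extremely unlikely for a function like SHA-256).
print(hash_value("a") != hash_value("b"))
```

A cryptographic hash is used here only because it is readily available. MinHash itself usually relies on much cheaper hash functions.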
The hash value h(C) has to be small enough that the signature fits in memory, and Sim(C1,C2) should be
the same as the similarity of the signatures h(C1) and h(C2). In addition, if Sim(C1,C2) is high, then
the probability that h(C1) = h(C2) is high; if Sim(C1,C2) is low, then the probability that h(C1)
differs from h(C2) is high. Note that not every similarity measure has a suitable hash function. For
Jaccard similarity, the suitable hash function is MinHash. The similarity of two signatures is the
fraction of hash functions on which they agree. Finally, with MinHash, we have compressed long vectors
into short signatures [20,9,21].
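
A minimal sketch of a MinHash signature, assuming elements are non-negative integers smaller than a fixed prime and that each hash function has the form (a*x + b) mod p with random a and b. This family, the prime, the signature length of 256, and all function names are illustrative choices and are not taken from the text.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; assumed larger than every element id

def make_hash_functions(k, seed=0):
    """Draw k random (a, b) pairs, each defining h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

def minhash_signature(elements, hash_functions):
    """Signature = for each hash function, the minimum hash over the set."""
    return [min((a * x + b) % PRIME for x in elements) for a, b in hash_functions]

def signature_similarity(sig1, sig2):
    """Fraction of hash functions on which the two signatures agree."""
    agree = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return agree / len(sig1)

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

# Hypothetical data: two sets of 80 ids sharing 60 of 100 distinct ids.
ids = random.Random(1).sample(range(1_000_000), 100)
c1, c2 = set(ids[:80]), set(ids[20:])

funcs = make_hash_functions(256)
estimate = signature_similarity(minhash_signature(c1, funcs),
                                minhash_signature(c2, funcs))
exact = jaccard(c1, c2)  # 60 / 100 = 0.6
print(f"exact Jaccard = {exact:.3f}, MinHash estimate = {estimate:.3f}")
```

Each set of 80 ids is replaced by 256 integers regardless of how large the set is. The estimate is close to the exact value and becomes more accurate as the number of hash functions grows.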
