Advanced Topics in Artificial Intelligence
Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presenter
Maruf Aytekin
PhD Student
Computer Engineering Department
Bahcesehir University
Apr 21, 2015
Outline
• LSH
• Locality-Sensitive Functions
• Banding Technique
• LSH Families for Cosine
• Applications of LSH
• Conclusion
LSH
One general approach to LSH
• “Hash” items several times, in such a way that similar
items are more likely to be hashed to the same bucket
than dissimilar items are.
• We then consider any pair that hashed to the same
bucket for any of the hashings to be a candidate pair.
• We check only the candidate pairs for similarity.
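The three steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `candidate_pairs` and `similar_pairs`, the toy word sets, and the two crude hashings (`min` and `max` of a set) are all assumptions made for the example and are not real locality-sensitive families.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(items, hash_functions):
    """Return every pair of item ids sharing a bucket under at least one hashing."""
    candidates = set()
    for h in hash_functions:
        buckets = defaultdict(list)
        for item_id, item in items.items():
            buckets[h(item)].append(item_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


def similar_pairs(items, hash_functions, similarity, threshold):
    """Check only the candidate pairs against the real similarity measure."""
    return {
        (a, b)
        for a, b in candidate_pairs(items, hash_functions)
        if similarity(items[a], items[b]) >= threshold
    }


# Toy data (hypothetical): sets of words compared with Jaccard similarity.
items = {
    "a": {"red", "green", "blue"},
    "b": {"red", "green", "cyan"},
    "c": {"apple", "pear"},
}
hashings = [lambda s: min(s), lambda s: max(s)]  # crude stand-ins for LSH functions


def jaccard(x, y):
    return len(x & y) / len(x | y)


print(candidate_pairs(items, hashings))
print(similar_pairs(items, hashings, jaccard, threshold=0.5))
```

Only the pairs that share a bucket are ever passed to `jaccard`; the pair ("a", "c") is never compared.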
LSH
• Most of the dissimilar pairs will never hash to the same
bucket, and therefore will never be checked.
• Those dissimilar pairs that do hash to the same bucket are
false positives; we hope these are only a small fraction of all pairs.
• We also hope that most of the truly similar pairs will hash to
the same bucket under at least one of the hash functions.
• Those that do not are false negatives; we hope these are only a
small fraction of the truly similar pairs.


Locality-Sensitive Functions
In many cases, the function f will “hash” items, and the
decision will be based on whether or not the result is equal.
• f(x) = f(y) means “yes; make x and y a candidate pair.”
• f(x) ≠ f(y) means “no; do not make x and y a candidate pair.”
A collection of functions of this form will be called a family of
functions.
Locality-Sensitive Functions
Let d1 < d2 be two distances according to some distance
measure d. A family F of functions is said to be
(d1, d2, p1, p2)-sensitive if for every f in F:
1. If d(x, y) ≤ d1, then the probability that f(x) = f(y) is at
least p1.
2. If d(x, y) ≥ d2, then the probability that f(x) = f(y) is at
most p2.
Locality-Sensitive Functions
Behavior of a (d1, d2, p1, p2)-sensitive function
• d1 and d2 can be made as close as possible.
• The penalty is that p1 and p2 become close as well.
Banding Technique
An effective way to choose the hashings is to divide the
signature matrix into b bands consisting of r rows each.
Dividing a signature matrix into four bands of three rows per band
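A rough sketch of the banding step, assuming each item's signature is a plain list of b·r values. The function name `banded_candidates` and the signature values are made up for illustration; the layout matches the four bands of three rows mentioned above.

```python
from collections import defaultdict
from itertools import combinations


def banded_candidates(signatures, b, r):
    """signatures maps an item id to a list of b * r signature values."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for item_id, sig in signatures.items():
            if len(sig) != b * r:
                raise ValueError("signature length must equal b * r")
            # The r values of this band form the bucket key.
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(item_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


# Hypothetical signatures of length 12, split into 4 bands of 3 rows.
signatures = {
    "x": [1, 0, 2, 3, 1, 1, 0, 5, 2, 4, 4, 0],
    "y": [1, 0, 2, 9, 9, 9, 7, 5, 2, 8, 8, 8],
    "z": [6, 6, 6, 3, 1, 2, 0, 5, 1, 4, 4, 1],
}
print(banded_candidates(signatures, b=4, r=3))
```

Here "x" and "y" agree on every row of the first band, so they become a candidate pair even though their other bands differ; "z" never matches a whole band with either of them.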
Analysis of the Banding Technique
The probability that two signatures with similarity s become a candidate
pair in at least one band: 1 − (1 − s^r)^b
This function has the form of an S-curve.
The threshold (the value of similarity s at which the probability of
becoming a candidate is 1/2) is a function of b and r (here b = 16, r = 4).
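As a small worked check, the similarity at which the candidate probability 1 − (1 − s^r)^b equals 1/2 can be obtained by solving for s, which gives s = (1 − (1/2)^(1/b))^(1/r). The helper name `half_probability_similarity` is an assumption for this sketch.

```python
def half_probability_similarity(b, r):
    """Solve 1 - (1 - s**r)**b == 1/2 for s."""
    return (1.0 - 0.5 ** (1.0 / b)) ** (1.0 / r)


b, r = 16, 4
s_half = half_probability_similarity(b, r)
print("similarity with probability 1/2:", round(s_half, 3))

# Plug the value back into the S-curve formula to confirm it lands on 1/2.
print("check:", 1.0 - (1.0 - s_half ** r) ** b)

# The commonly used approximation of the threshold, for comparison.
print("approximation (1/b)^(1/r):", (1.0 / b) ** (1.0 / r))
```

For these parameters the exact half-probability point comes out somewhat below the approximation (1/b)^(1/r), which is why the latter is only described as approximate.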
Analysis of the Banding Technique
Values of the S-curve for b = 20 and r = 5
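The table itself did not survive in this transcript, but its values follow directly from the formula 1 − (1 − s^r)^b. A minimal sketch that recomputes them; the choice of similarity values 0.2 to 0.8 is an assumption about what the table covered.

```python
def candidate_probability(s, b, r):
    """Probability that a pair with similarity s shares at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


b, r = 20, 5
table = [(k / 10, candidate_probability(k / 10, b, r)) for k in range(2, 9)]

print(" s     probability")
for s, p in table:
    print(f"{s:.1f}    {p:.4f}")
```

The probability stays close to 0 for low similarities, rises steeply around s = 0.5, and is close to 1 by s = 0.8, which is the S shape described earlier.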
Analysis of the Banding Technique
• Choose a threshold t that defines how similar items
have to be in order for them to be a “candidate pair.”
• Pick b and r such that br = n (the signature length), and the
threshold t is approximately (1/b)^(1/r).
• If avoiding false negatives is important, select b and r
to produce a threshold lower than t.
• If speed is important and you wish to limit false
positives, select b and r to produce a higher threshold.
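One way to follow these guidelines is to enumerate every factorization br = n and keep the one whose approximate threshold (1/b)^(1/r) is closest to the target. This is only a sketch under that assumption; `choose_bands` and `approximate_threshold` are illustrative names, and n = 100 is an arbitrary example length.

```python
def approximate_threshold(b, r):
    """Approximate similarity threshold for b bands of r rows."""
    return (1.0 / b) ** (1.0 / r)


def choose_bands(n, t):
    """Among all (b, r) with b * r == n, pick the threshold closest to t."""
    options = [(b, n // b) for b in range(1, n + 1) if n % b == 0]
    return min(options, key=lambda br: abs(approximate_threshold(*br) - t))


b, r = choose_bands(n=100, t=0.5)
print(b, r, round(approximate_threshold(b, r), 3))
```

To favor fewer false negatives, pass a target slightly below the desired t so that more bands with fewer rows are chosen; to favor speed, pass a slightly higher target.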
LSH for Cosine
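Let u be user u's rating vector, v be user v's rating vector, and r be a
randomly generated vector. The family H of hash functions h_r is defined by:
h_r(u) = 1 if r · u ≥ 0, and 0 otherwise,
where
Pr[h_r(u) = h_r(v)] = 1 − θ(u, v)/π
is the probability of u and v being declared a candidate pair, and
θ(u, v) is the angle between u and v.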
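A minimal sketch of one hash function from this family, assuming the usual random-hyperplane construction: h_r(u) is 1 when the dot product r · u is non-negative and 0 otherwise, with r drawn from a standard Gaussian. Under that assumption the collision probability is 1 − θ(u, v)/π, where θ is the angle between u and v; the code compares this value with an empirical estimate. The two rating vectors, the trial count, and all helper names are illustrative.

```python
import math
import random


def random_hyperplane(dim, rng):
    """Draw a random direction with independent standard Gaussian components."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def h(r, u):
    """One bit: which side of the hyperplane with normal r the vector u lies on."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0


def angle(u, v):
    """Angle between u and v in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


rng = random.Random(0)
u = [5, 4, 0, 4, 1]
v = [4, 3, 0, 5, 2]
trials = 20000

agree = 0
for _ in range(trials):
    r = random_hyperplane(len(u), rng)
    agree += h(r, u) == h(r, v)

empirical = agree / trials
expected = 1 - angle(u, v) / math.pi
print("empirical collision rate:", empirical)
print("1 - theta/pi:            ", expected)
```

The two printed numbers should agree to within sampling noise; vectors pointing in nearly the same direction collide on almost every draw.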
LSH for Cosine
A new family G of hash functions g is defined, where
each function g is obtained by concatenating (AND) r
functions h1, h2, ..., hr from the family H:
g(t) = [h1(t), ..., hr(t)].
We then generate a random function g for each band
(hash table) and construct b hash tables.
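The construction can be sketched as follows, assuming the base functions h are random-hyperplane bits (1 when the dot product is non-negative). Concatenating r bits is the AND step; keeping b independent tables and taking the union of matching buckets at query time is the OR step. `make_g`, `build_tables`, `query`, and the parameter values are hypothetical names and choices for illustration.

```python
import random
from collections import defaultdict


def make_g(dim, r, rng):
    """Build one g by concatenating r random-hyperplane bits (the AND step)."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(r)]

    def g(t):
        return tuple(
            1 if sum(p_i * t_i for p_i, t_i in zip(p, t)) >= 0 else 0
            for p in planes
        )

    return g


def build_tables(vectors, b, r, seed=0):
    """Build b hash tables, each keyed by its own independently drawn g."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    tables = []
    for _ in range(b):
        g = make_g(dim, r, rng)
        buckets = defaultdict(set)
        for name, vec in vectors.items():
            buckets[g(vec)].add(name)
        tables.append((g, buckets))
    return tables


def query(tables, q):
    """Union of the buckets q falls into across all tables (the OR step)."""
    found = set()
    for g, buckets in tables:
        found |= buckets.get(g(q), set())
    return found


users = {
    "u1": [5, 4, 0, 4, 1],
    "u2": [2, 1, 1, 1, 4],
    "u3": [4, 3, 0, 5, 2],
    "u4": [2, 1, 2, 1, 4],
}
tables = build_tables(users, b=3, r=4)
print(query(tables, users["u1"]))
```

Increasing r makes each table stricter (fewer false positives), while increasing b gives similar vectors more chances to collide (fewer false negatives).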
LSH for Cosine
Example:
u1 = [5, 4, 0, 4, 1]
u2 = [2, 1, 1, 1, 4]
u3 = [4, 3, 0, 5, 2]
u4 = [2, 1, 2, 1, 4]
r1 = [-1, 1, 1, -1, -1]
r2 = [1, 1, 1, -1, -1]
r3 = [-1, -1, 1, -1, 1]
r4 = [-1, 1, -1, 1, -1]
u1.r1 = -6 => hr1(u1) = 0
u1.r2 = 4 => hr2(u1) = 1
u1.r3 = -12 => hr3(u1) = 0
u1.r4 = 2 => hr4(u1) = 1
g(u1) = 0101
g(u2) = 0010
g(u3) = 0101
g(u4) = 0110
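The example can be reproduced with a few lines of code. One assumption is needed: a dot product of exactly zero maps to bit 1, which is what the listed values for u3 and u4 imply (both have a zero dot product with r2). The function names `h` and `g` mirror the slide notation; everything else is scaffolding for the sketch.

```python
def h(r, u):
    """One bit: 1 if the dot product r . u is non-negative, else 0."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0


def g(u, planes):
    """Concatenate the bits produced by each random vector."""
    return "".join(str(h(r, u)) for r in planes)


planes = [
    [-1, 1, 1, -1, -1],   # r1
    [1, 1, 1, -1, -1],    # r2
    [-1, -1, 1, -1, 1],   # r3
    [-1, 1, -1, 1, -1],   # r4
]
users = {
    "u1": [5, 4, 0, 4, 1],
    "u2": [2, 1, 1, 1, 4],
    "u3": [4, 3, 0, 5, 2],
    "u4": [2, 1, 2, 1, 4],
}

signatures = {name: g(u, planes) for name, u in users.items()}
for name, sig in signatures.items():
    print(name, sig)
# u1 0101
# u2 0010
# u3 0101
# u4 0110
```

u1 and u3 receive the same key, so they land in the same bucket of this hash table and become a candidate pair.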
Applications of LSH
• Near neighbor search
• Entity Resolution
• Matching Fingerprints
• Matching Newspaper Articles
Thank You
Q & A
