Locality Sensitive Hashing
with Application to Near Neighbor Reporting
Hsiao-Fei Liu
2015.3.4
Motivation
• Real-world applications
– Recommendation system
• Searching for similar items and users
– Malicious website detection
• Searching for websites similar to some known malicious websites
• The underlying core problem
Given:
• A large set P of high-dimensional data points in a metric space M
• A large set Q of high-dimensional query points in a metric space M
Goal:
• Find near neighbors in P for each query point in Q
• Avoid linearly scanning P for each query
Related Work
• Nearest Neighbor Searching
– Given: a set P of n points in a metric space M
– Goal: for any query q return a point p ∈ P minimizing dist(p,q)
• Classic Result
– Point location in arrangements of hyperplanes
• Meiser, IC’93
• In a d-dimensional Euclidean space under some Lp norm
• d^O(1)·log n query time and n^O(d) space
Related Work
• Approximate Nearest Neighbor
– Given: a set P of n points in a metric space M and ε > 0
– Goal: for any query q return a point p ∈ P s.t. dist(p,q) ≤ (1+ε)·dist(p*,q),
where p* is the nearest neighbor to q
• Classic Result
– Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality
• Har-Peled, Indyk and Motwani, STOC’98, FOCS’01, ToC’12
• In d-dimensional Euclidean space under Lp norm
• O(dn^(1/(1+ε))) query time and O(dn + n^(1+1/(1+ε))) space
• Technique 1: Approximate Nearest Neighbor reduces to Approximate Near Neighbor with little overhead
• Technique 2: Locality Sensitive Hashing for Approximate Near Neighbor
Overview
1. Near Neighbor Reporting
– Formal problem formulation
2. Locality Sensitive Hashing
– Definition and example
– Algorithms for NNR based on LSH
– Query time decomposition
3. Performance tuning
– Gap amplification
– Parameter optimization
Near Neighbor Reporting
Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A deterministic algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_q[cvg(q|T)] ≥ c, where q ~ f and cvg(q|T) = |T.nbrs(q)| / |nbrs(q)|
hard to achieve ☹
(Relaxed) Near Neighbor Reporting
• Input:
– A set P of points in a metric space M
– radius r > 0
– coverage rate c
– A set S of points sampled from an unknown distribution f
• Goal:
– A randomized algorithm for building a data structure T s.t.
• T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p,q) ≤ r}
• E_T[E_q[cvg(q|T)]] = Σ_T Pr(T is built)·E_q[cvg(q|T)] ≥ c, where q ~ f
• Fact:
– if E_T[cvg(q|T)] ≥ c for any q,
then E_T[E_q[cvg(q|T)]] ≥ c where q ~ f
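Below is a minimal Python sketch of the coverage measure (our own illustrative code; `report` and `true_nbrs` are hypothetical callables, and treating a query with no r-near neighbors as fully covered is our assumption):

```python
def cvg(reported, true_nbrs):
    # cvg(q|T) = |T.nbrs(q)| / |nbrs(q)|; taken to be 1.0 when nbrs(q) is
    # empty (an assumption -- the slides leave this corner case open).
    return len(reported) / len(true_nbrs) if true_nbrs else 1.0

def mean_coverage(report, true_nbrs, sample_queries):
    # Empirical stand-in for E_q[cvg(q|T)], with q drawn from the sample set S.
    return sum(cvg(report(q), true_nbrs(q)) for q in sample_queries) / len(sample_queries)
```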
Locality Sensitive Hashing
Locality Sensitive Hashing
• Informally, an LSH H is a set of hash functions over a metric
space satisfying the following condition.
• Let h be chosen uniformly at random from H
– p and q are closer => Pr(h(p) = h(q)) is higher
• Formally, H is (r, r+, c, c')-sensitive if
– r < r+ and c > c'
– Let h be chosen uniformly at random from H
– If dist(p, q) ≤ r, then Pr(h(p) = h(q)) ≥ c
– If dist(p, q) ≥ r+, then Pr(h(p) = h(q)) ≤ c'
LSH for angular distance
• Distance function: d(u, v) = arccos(⟨u, v⟩) = θ(u, v)
• Random Projection:
– Choose a random unit vector w
– h(u) = sgn(u·w)
– Pr(h(u) = h(v)) = 1 − θ(u,v)/π
(figure: two vectors u and v separated by the angle θ)
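A minimal Python sketch of this family (our own code, not from the slides): hash by the sign of a projection onto a random direction, then estimate the collision probability by sampling many hash functions.

```python
import numpy as np

def sample_hash(dim, rng):
    # One h from the family: h(u) = sgn(u . w) for a random direction w
    # (scaling w does not change the sign, so w need not be normalized).
    w = rng.standard_normal(dim)
    return lambda u: np.dot(u, w) >= 0

# Empirical check of Pr(h(u) = h(v)) = 1 - theta(u, v) / pi.
rng = np.random.default_rng(0)
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0]) / np.sqrt(2.0)     # theta(u, v) = pi/4
hs = [sample_hash(2, rng) for _ in range(20000)]
print(sum(h(u) == h(v) for h in hs) / len(hs))  # ~ 1 - (pi/4)/pi = 0.75
```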
LSH for Jaccard distance
• Distance function: d(A, B) = 1 − |A ∩ B| / |A ∪ B|
• MinHash:
– Pick a random permutation π on the universe U
– h(A) = argmin_{a∈A} π(a)
– Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 − d(A, B)
• Note:
– Finding the Jaccard median
• Very easy to understand, very hard to compute
• Studied since 1981
• Chierichetti, Kumar, Pandey, Vassilvitskii, SODA'10
– NP-hard and no FPTAS
– PTAS
(figure: Venn diagram of sets A and B)
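A matching Python sketch for MinHash (our own illustration; the universe here is a small integer range):

```python
import random

def sample_minhash(universe, rng):
    # One MinHash h: a random permutation pi of U; h(A) = argmin_{a in A} pi(a).
    order = rng.sample(sorted(universe), len(universe))
    pi = {x: rank for rank, x in enumerate(order)}
    return lambda A: min(A, key=pi.__getitem__)

# Empirical check of Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 - d(A, B).
rng = random.Random(0)
U = range(100)
A, B = set(range(60)), set(range(30, 90))   # overlap 30, union 90 => similarity 1/3
hs = [sample_minhash(U, rng) for _ in range(5000)]
print(sum(h(A) == h(B) for h in hs) / len(hs))  # ~ 0.33
```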
Algorithm for NNR Based on LSH
• Let H be a (r, r+, c, c’)-sensitive LSH over a metric space M
• Consider the following randomized algorithm for NNR
– Uniformly at random choose a hash function h from H
– Build a hash table Th s.t. Th(q) = { p ∈ P : h(p) = h(q) }
– Define Th.nbrs(q) to return the points p in Th(q) with dist(p,q) ≤ r
• For any q,
E_Th[cvg(q|Th)]
= Σ_Th Pr(Th is built) · cvg(q|Th)
= Σ_h Pr(h is chosen) · Σ_{p∈nbrs(q)} δ(h(p) = h(q)) / |nbrs(q)|
= Σ_{p∈nbrs(q)} Σ_h Pr(h is chosen) · δ(h(p) = h(q)) / |nbrs(q)|
= Σ_{p∈nbrs(q)} Pr(h(p) = h(q)) / |nbrs(q)|
≥ c
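The algorithm above in executable form, as a sketch with our own naming (any of the earlier hash families can be plugged in as `h`):

```python
from collections import defaultdict

def build_table(P, h):
    # T_h: bucket every data point by its hash value.
    T = defaultdict(list)
    for p in P:
        T[h(p)].append(p)
    return T

def table_nbrs(T, h, q, r, dist):
    # T_h.nbrs(q): keep only the points in q's bucket with dist(p, q) <= r.
    return [p for p in T.get(h(q), []) if dist(p, q) <= r]
```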
Query Time
• Query time = time for computing h(q)
+ |{p ∈ P : h(p) = h(q) & dist(p,q) > r}| · d
+ |{p ∈ P : h(p) = h(q) & dist(p,q) ≤ r}| · d
= time_c + time_FP + |output|
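Continuing the sketch above, one query's work can be split exactly along this decomposition (our own instrumentation; `false_pos` counts the distance checks behind time_FP):

```python
def query_cost(T, h, q, r, dist):
    bucket = T.get(h(q), [])                 # computing h(q): time_c
    output = [p for p in bucket if dist(p, q) <= r]
    false_pos = len(bucket) - len(output)    # wasted checks on far points: time_FP
    return output, false_pos
```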
Performance Tuning
Why Gap Amplification
• To lift the coverage rate c
• To reduce the false-positive rate and so improve time_FP
(figure: collision probability vs. Jaccard distance, before and after amplification;
before: c = 0.8 at distance 0.2, c' = 0.6 at distance 0.4;
after: c̃ = 0.9 at distance 0.2, c̃' = 0.1 at distance 0.4)
(r, r+, c, c')-sensitive LSH → gap amplification → (r, r+, c̃, c̃')-sensitive LSH
How Gap Amplification
• Construct LSH G from the original LSH H
– LSH B = { b(x; h_1, h_2, …, h_k) : h_i ∈ H }
• b(p; h_1, …, h_k) = b(q; h_1, …, h_k) iff AND_{i=1,…,k} h_i(p) = h_i(q)
– LSH G = { g(x; b_1, b_2, …, b_L) : b_i ∈ B }
• g(p; b_1, …, b_L) = g(q; b_1, …, b_L) iff OR_{i=1,…,L} b_i(p) = b_i(q)
• Intuition:
– AND increases the gap
• Collision probabilities of distant points decrease exponentially faster than those of near points
– OR increases the collision probabilities approx. linearly
• Let P = Pr_{h∈H}(h(p) = h(q))
⇒ Pr_{b∈B}(b(p) = b(q)) = P^k
⇒ Pr_{g∈G}(g(p) ≠ g(q)) = (1 − P^k)^L
⇒ Pr_{g∈G}(g(p) = g(q)) = 1 − (1 − P^k)^L ≈ L·P^k
(figure: L buckets combined with OR, each built from an AND of k hashes)
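A sketch of the AND/OR construction in Python (illustrative names; `sampler(rng)` stands for drawing one h uniformly from H):

```python
def sample_band(sampler, k, rng):
    # AND of k base hashes: b(p) = (h_1(p), ..., h_k(p)),
    # so b(p) = b(q) iff all k coordinates agree and Pr = P**k.
    hs = [sampler(rng) for _ in range(k)]
    return lambda p: tuple(h(p) for h in hs)

def sample_amplified(sampler, k, L, rng):
    # OR over L bands: g(p) and g(q) "collide" iff some band collides,
    # so Pr = 1 - (1 - P**k)**L.
    bands = [sample_band(sampler, k, rng) for _ in range(L)]
    return bands, (lambda p, q: any(b(p) == b(q) for b in bands))
```

In practice one builds L hash tables, one per band, and a query probes its bucket in each of them.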
Parameter Optimization
• Situation
– A (r, r+, c, c’)-sensitive LSH H is given
– After gap amplification we want c to be lifted to c̃
• Let c̃ = 1 − (1 − c^k)^L
⇒ L = log(1 − c̃) / log(1 − c^k)
⇒ L is a strictly increasing function of k
• So we only need to select a good k
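This formula gives L directly from k (a sketch; `c_amp` is our name for the lifted coverage c̃):

```python
import math

def buckets_needed(k, c, c_amp):
    # Smallest integer L with 1 - (1 - c**k)**L >= c_amp.
    return math.ceil(math.log(1 - c_amp) / math.log(1 - c**k))

# L grows strictly with k, e.g. for c = 0.8 lifted to c_amp = 0.9:
print([buckets_needed(k, 0.8, 0.9) for k in range(1, 6)])  # [2, 3, 4, 5, 6]
```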
How to select a good k
• How to measure the “goodness”?
– Minimize time_c + E[time_FP] under the space constraint
• Let’s investigate how the query time and space usage will
react when we increase k
k  => space 
• Space = 𝑂(𝑑𝑛 + 𝑛𝐿) = 𝑂(𝑛(𝑑 +
log 1 – c
log 1 − ck
)
))
k  => timec 
• timec =O (dkL) = 𝑂(𝑑𝑘
log 1 – c
log 1 − ck )
– Interested reader can refer to E2LSH for how to reduce timec to
O(dkL1/2)
k  => timeFP 
c
P
Y
c
k=2
k=3
• Consider s-curves Y = 1 – (1-Pk)L passing through (c, c)
• Larger k => steeper s-curve,
=> collision prob. drop faster for distant points
=> less false positive
=> timeFP 
original collision prob.
of distant points
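A quick numeric look at the steepening, reusing `buckets_needed` from the sketch above (c = 0.8 lifted to c̃ = 0.9; distant points with original collision probability P = 0.4):

```python
def s_curve(P, k, L):
    # Amplified collision probability Y = 1 - (1 - P**k)**L.
    return 1 - (1 - P**k) ** L

for k in (2, 3):
    L = buckets_needed(k, 0.8, 0.9)
    print(k, L, round(s_curve(0.4, k, L), 3))
# Near points stay above 0.9 by construction, while the distant points'
# collision probability falls from ~0.407 (k = 2) to ~0.232 (k = 3).
```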
Procedure for optimizing k
1. Determine the largest possible value k_max for k without violating the space constraint
2. Find k* in [1..k_max] minimizing time_c(k) + E[time_FP(k)], as sketched below
– In practice, time_c and time_FP are measured experimentally by
• constructing a data structure T
• running several queries sampled from S on T
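A sketch of the two-step procedure (our own code; the space bound follows the earlier O(n(d + L)) slide, and `measured_cost(k)` is a hypothetical helper that builds T for this k and times sample queries from S):

```python
def optimize_k(n, d, space_budget, c, c_amp, measured_cost, k_hi=64):
    # Step 1: largest k whose bucket count L still fits the space budget
    # (feasible k form a prefix because L is increasing in k).
    k_max = max(k for k in range(1, k_hi + 1)
                if n * (d + buckets_needed(k, c, c_amp)) <= space_budget)
    # Step 2: k* in [1..k_max] minimizing measured time_c(k) + E[time_FP(k)].
    return min(range(1, k_max + 1), key=measured_cost)
```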
Observation & Question
• Let Δ_k time_c = time_c(k) − time_c(k−1)
• Let Δ_k time_FP = E[time_FP(k−1)] − E[time_FP(k)]
• Observation:
– If Δ_k time_c is increasing and Δ_k time_FP is decreasing, then k* would be the largest k such that Δ_k time_FP > Δ_k time_c, and it can be found using binary search, as sketched below.
• Question:
– In which situation will Δ_k time_FP = E[time_FP(k−1)] − E[time_FP(k)] be decreasing?
(figure: Δ_k time_c increasing and Δ_k time_FP decreasing in k, crossing at k*)
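Under the observation's assumptions the crossover can be binary-searched (a sketch; `delta_c(k)` and `delta_fp(k)` are hypothetical callables returning the measured differences):

```python
def best_k(k_max, delta_c, delta_fp):
    # Largest k in [1..k_max] with delta_fp(k) > delta_c(k); the predicate
    # flips from true to false exactly once under the monotonicity assumptions.
    lo, hi = 1, k_max
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if delta_fp(mid) > delta_c(mid):
            lo = mid        # raising k is still profitable
        else:
            hi = mid - 1    # past the crossover
    return lo
```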
Summary
• Near Neighbor Reporting
– Finds many applications in practice
• Locality Sensitive Hashing
– Hashes near points to the same value
– One of the most useful techniques for NNR
• Performance tuning
– Gap amplification for higher coverage and lower FP
– Parameter optimization for query time
Further Reading
• Dimensionality Reduction
– Variance preserving
• Principal Component Analysis
• Singular Value Decomposition
– Distance preserving
• Random Projection and the Johnson–Lindenstrauss lemma
– Locality preserving
• Locally Linear Embedding
• Multi-dimensional Scaling
• ISOMAP
References
1. Approximate Nearest Neighbors: Towards Removing the Curse of
Dimensionality
2. Similarity Search in High Dimensions via Hashing
3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
4. E2LSH
Appendix
• Suppose that we have an (r, r+, c, err)-sensitive LSH H and want to amplify H to get an (r, r+, c̃, ẽrr)-sensitive LSH G.
• How do the bucket number L and the collision error ẽrr change with k?
Increasing rate of bucket number
Let 1 − (1 − c^k)^L = c̃
⇒ L·c^k ≥ c̃ and c̃ ≥ 1 − e^(−L·c^k)
⇒ c̃/c^k ≤ L ≤ −ln(1 − c̃)/c^k
⇒ L = Θ(1/c^k)
Decreasing rate of collision error
• err = 1 − 1 − err 𝑘 𝐿  𝐿err 𝑘
 (
err
𝑐
) 𝑘 for some constant  −− −(1)
• err = 1 − 1 − err 𝑘 𝐿  1 − 𝑒−𝐿err 𝑘
 1 − 𝑒
−𝛽
err
𝑐
𝑘
for some constant 0 < 𝛽 < 1
= 𝛽
err
𝑐
𝑘 −
𝛽2 err
𝑐
2𝑘
2!
+
𝛽3 err
𝑐
3𝑘
3!
… = 
err
𝑐
𝑘 −− −(2)
• By (1) and (2), we have err = θ
err
𝑐
𝑘

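A small numeric check of both Θ-bounds (our own, with illustrative values c = 0.8, err = 0.3, c̃ = 0.9); both printed ratios stay bounded as k grows:

```python
import math

c, err, c_amp = 0.8, 0.3, 0.9
for k in range(1, 8):
    L = math.log(1 - c_amp) / math.log(1 - c**k)   # real-valued L from the slide
    err_amp = 1 - (1 - err**k) ** L                # resulting collision error
    print(k, round(L * c**k, 3),                   # L = Theta(1/c^k)
          round(err_amp / (err / c) ** k, 3))      # err~ = Theta((err/c)^k)
```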