Locality Sensitive Hashing
with Application to Near Neighbor Reporting
Hsiao-Fei Liu
2015.3.4
Motivation
• Real-world applications
  – Recommendation systems
    • Searching for similar items and users
  – Malicious website detection
    • Searching for websites similar to known malicious websites
• The underlying core problem
  Given:
    • A large set P of high-dimensional data points in a metric space M
    • A large set Q of high-dimensional query points in the same metric space M
  Goal:
    • Find near neighbors in P for each query point in Q
    • Avoid linearly scanning P for each query
Related Work
• Nearest Neighbor Searching
  – Given: a set P of n points in a metric space M
  – Goal: for any query q, return a point p ∈ P minimizing dist(p, q)
• Classic Result
  – Point location in arrangements of hyperplanes
    • Meiser, IC'93
    • In a d-dimensional Euclidean space under some L_p norm
    • $d^{O(1)} \log n$ query time and $n^{O(d)}$ space
Related Work
• Approximate Nearest Neighbor
  – Given: a set P of n points in a metric space M and ε > 0
  – Goal: for any query q, return a point p ∈ P s.t. dist(p, q) ≤ (1+ε) dist(p*, q), where p* is the nearest neighbor of q
• Classic Result
  – Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
    • Har-Peled, Indyk and Motwani, STOC'98, FOCS'01, ToC'12
    • In d-dimensional Euclidean space under an L_p norm
    • $O(d\,n^{1/(1+\varepsilon)})$ query time and $O(dn + n^{1 + 1/(1+\varepsilon)})$ space
    • Technique 1: Approximate Nearest Neighbor reduces to Approximate Near Neighbor with little overhead
    • Technique 2: Locality Sensitive Hashing for Approximate Near Neighbor
Overview
1. Near Neighbor Reporting
  – Formal problem formulation
2. Locality Sensitive Hashing
  – Definition and example
  – Algorithms for NNR based on LSH
  – Query time decomposition
3. Performance tuning
  – Gap amplification
  – Parameter optimization
Near Neighbor Reporting
Near Neighbor Reporting
• Input:
  – A set P of points in a metric space M
  – Radius r > 0
  – Coverage rate c
  – A set S of points sampled from an unknown distribution f
• Goal:
  – A deterministic algorithm for building a data structure T s.t.
    • T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p, q) ≤ r}
    • $E_q[\mathrm{cvg}(q \mid T)] \ge c$, where q ~ f and $\mathrm{cvg}(q \mid T) = |T.\mathrm{nbrs}(q)| \,/\, |\mathrm{nbrs}(q)|$
• Note: this deterministic guarantee is hard to achieve
(Relaxed) Near Neighbor Reporting
• Input:
  – A set P of points in a metric space M
  – Radius r > 0
  – Coverage rate c
  – A set S of points sampled from an unknown distribution f
• Goal:
  – A randomized algorithm for building a data structure T s.t.
    • T.nbrs(q) ⊆ nbrs(q) = {p ∈ P : dist(p, q) ≤ r}
    • $E_T[E_q[\mathrm{cvg}(q \mid T)]] = \sum_T \Pr(T \text{ is built})\, E_q[\mathrm{cvg}(q \mid T)] \ge c$, where q ~ f
• Fact:
  – If $E_T[\mathrm{cvg}(q \mid T)] \ge c$ for any q, then $E_T[E_q[\mathrm{cvg}(q \mid T)]] \ge c$, where q ~ f
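The fact follows from exchanging the two expectations. A short derivation, assuming T is built independently of the query q (the symbols are those of the slide):

```latex
\begin{aligned}
E_T\big[E_{q \sim f}[\mathrm{cvg}(q \mid T)]\big]
  &= E_{q \sim f}\big[E_T[\mathrm{cvg}(q \mid T)]\big]
     && \text{(swap expectations; } 0 \le \mathrm{cvg} \le 1\text{)} \\
  &\ge E_{q \sim f}[c]
     && \text{(per-query bound } E_T[\mathrm{cvg}(q \mid T)] \ge c\text{)} \\
  &= c .
\end{aligned}
```

So it is enough to prove the coverage bound for each fixed query, without knowing the distribution f.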
Locality Sensitive Hashing
Locality Sensitive Hashing
• Informally, an LSH H is a set of hash functions over a metric space satisfying the following condition.
• Let h be chosen uniformly at random from H
  – The closer p and q are, the higher Pr(h(p) = h(q)) is
• Formally, H is (r, r+ε, c, c')-sensitive if
  – r < r+ε and c > c'
  – Let h be chosen uniformly at random from H
  – If dist(p, q) ≤ r, then Pr(h(p) = h(q)) ≥ c
  – If dist(p, q) ≥ r+ε, then Pr(h(p) = h(q)) ≤ c'
LSH for angular distance
• Distance function: d(u, v) = arccos(u·v / (‖u‖ ‖v‖)) = θ(u, v)
• Random Projection:
  – Choose a random unit vector w
  – h(u) = sgn(u·w)
  – Pr(h(u) = h(v)) = 1 - θ(u, v)/π
[Figure: vectors u and v separated by the angle θ]
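The collision probability of the random-projection hash can be checked empirically. The sketch below is illustrative only: the function names are made up here, and w is drawn with Gaussian coordinates instead of being normalized, which gives the same signs as a uniformly random unit vector.

```python
import math
import random

def random_hyperplane_hash(dim, rng):
    """Return h(u) = sgn(u . w) for a random direction w."""
    # Gaussian coordinates give a uniformly random direction; the length of w
    # does not affect the sign of the dot product.
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    def h(u):
        return 1 if sum(ui * wi for ui, wi in zip(u, w)) >= 0 else -1
    return h

def angle(u, v):
    """Angular distance theta(u, v) = arccos(u.v / (|u| |v|))."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def empirical_collision_rate(u, v, trials=20000, seed=0):
    """Fraction of independently drawn hash functions with h(u) == h(v)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = random_hyperplane_hash(len(u), rng)
        hits += h(u) == h(v)
    return hits / trials

u, v = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]     # the angle between u and v is pi/4
predicted = 1 - angle(u, v) / math.pi        # 1 - theta/pi
observed = empirical_collision_rate(u, v)
print(f"predicted={predicted:.3f} observed={observed:.3f}")
```

The two printed values should be close to each other, with the gap shrinking as the number of trials grows.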
LSH for Jaccard distance
• Distance function: d(A, B) = 1 - |A ∩ B| / |A ∪ B|
• MinHash:
  – Pick a random permutation π on the universe U
  – h(A) = argmin_{a ∈ A} π(a)
  – Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 - d(A, B)
• Note:
  – Finding the Jaccard median
    • Very easy to understand, very hard to compute
    • Studied since 1981
    • Chierichetti, Kumar, Pandey, Vassilvitskii, SODA'10
      – NP-hard and no FPTAS
      – PTAS
[Figure: Venn diagram of two overlapping sets A and B]
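A minimal MinHash sketch, assuming a small explicit universe so that a full random permutation can be stored (the helper names are hypothetical). It compares the empirical collision rate with the Jaccard similarity |A ∩ B| / |A ∪ B|.

```python
import random

def make_minhash(universe, rng):
    """Pick a random permutation pi of the universe; h(A) = argmin over a in A of pi(a)."""
    order = list(universe)
    rng.shuffle(order)
    rank = {x: i for i, x in enumerate(order)}   # rank[x] plays the role of pi(x)
    return lambda A: min(A, key=rank.__getitem__)

def jaccard_similarity(A, B):
    return len(A & B) / len(A | B)

def empirical_collision_rate(A, B, universe, trials=20000, seed=0):
    """Fraction of independently drawn permutations with h(A) == h(B)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_minhash(universe, rng)
        hits += h(A) == h(B)
    return hits / trials

universe = range(10)
A, B = {0, 1, 2, 3}, {2, 3, 4, 5}
similarity = jaccard_similarity(A, B)            # |A & B| / |A | B| = 2 / 6
observed = empirical_collision_rate(A, B, universe)
print(f"similarity={similarity:.3f} observed={observed:.3f}")
```

Practical implementations usually replace the explicit permutation with a random hash function over the elements; that substitution is not covered in the slides.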
Algorithm for NNR Based on LSH
• Let H be an (r, r+ε, c, c')-sensitive LSH over a metric space M
• Consider the following randomized algorithm for NNR
  – Choose a hash function h from H uniformly at random
  – Build a hash table T_h s.t. T_h(q) = {p ∈ P : h(p) = h(q)}
  – Define T_h.nbrs(q) to return the points p in T_h(q) with dist(p, q) ≤ r
• For any q,
$$
\begin{aligned}
E_{T_h}[\mathrm{cvg}(q \mid T_h)]
 &= \sum_{T_h} \Pr(T_h \text{ is built})\, \mathrm{cvg}(q \mid T_h) \\
 &= \sum_{h} \Pr(h \text{ is chosen})\, \frac{\sum_{p \in \mathrm{nbrs}(q)} \delta(h(p) = h(q))}{|\mathrm{nbrs}(q)|} \\
 &= \frac{\sum_{p \in \mathrm{nbrs}(q)} \sum_{h} \Pr(h \text{ is chosen})\, \delta(h(p) = h(q))}{|\mathrm{nbrs}(q)|} \\
 &= \frac{\sum_{p \in \mathrm{nbrs}(q)} \Pr(h(p) = h(q))}{|\mathrm{nbrs}(q)|} \;\ge\; c
\end{aligned}
$$
Query Time
• Query time = time for computing h(q)
  + |{p ∈ P : h(p) = h(q) and dist(p, q) > r}| · d
  + |{p ∈ P : h(p) = h(q) and dist(p, q) ≤ r}| · d
  = time_c + time_FP + |output| · d
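The single-table algorithm and its query-time decomposition can be sketched as follows. This is a toy illustration under stated assumptions: the points are one-dimensional, the hash is a fixed grid of width 1 (one fixed member standing in for a randomly chosen h), and the class and variable names are invented here.

```python
import math
from collections import defaultdict

class SingleTableIndex:
    """One hash table T_h: bucket points by h, then filter candidates by distance."""

    def __init__(self, points, h, dist, r):
        self.h, self.dist, self.r = h, dist, r
        self.table = defaultdict(list)
        for p in points:
            self.table[h(p)].append(p)

    def nbrs(self, q):
        """Return (reported near neighbors, number of false positives scanned)."""
        candidates = self.table.get(self.h(q), [])
        near = [p for p in candidates if self.dist(p, q) <= self.r]
        return near, len(candidates) - len(near)

# Toy setup (assumed for illustration): 1-D points, grid cells of width 1 as the hash.
points = [0.1, 0.4, 0.9, 1.1, 2.5]
h = lambda x: math.floor(x)
dist = lambda p, q: abs(p - q)
r, q = 0.3, 0.95

index = SingleTableIndex(points, h, dist, r)
near, false_positives = index.nbrs(q)
true_nbrs = [p for p in points if dist(p, q) <= r]   # linear scan, for comparison only
coverage = len(near) / len(true_nbrs)
print(near, false_positives, coverage)
```

Each false positive costs one distance computation (the time_FP term), each reported point costs one more (the output term), and the neighbor 1.1 is missed because it falls into a different bucket than q.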
Performance Tuning
Why Gap Amplification
• To lift the coverage rate c
• To reduce the false positive rate and so improve time_FP
• (r, r+ε, c, c')-sensitive LSH => gap amplification => (r, r+ε, c↑, c'↓)-sensitive LSH
[Figure: collision probability versus Jaccard distance. Before amplification: c = 0.8 at distance 0.2 and c' = 0.6 at distance 0.4. After amplification: c = 0.9 at distance 0.2 and c' = 0.1 at distance 0.4.]
How Gap Amplification
• Construct LSH G from the original LSH H
  – LSH B = {b(x; h_1, h_2, …, h_k) : h_i ∈ H}
    • b(p; h_1, …, h_k) = b(q; h_1, …, h_k) iff AND_{i=1,…,k} h_i(p) = h_i(q)
  – LSH G = {g(x; b_1, b_2, …, b_L) : b_i ∈ B}
    • g(p; b_1, …, b_L) = g(q; b_1, …, b_L) iff OR_{i=1,…,L} b_i(p) = b_i(q)
• Intuition:
  – AND increases the gap
    • Collision probabilities of distant points decrease exponentially faster than those of near points
  – OR increases the collision probabilities approximately linearly
• Let $P = \Pr_{h \in H}(h(p) = h(q))$
  => $\Pr_{b \in B}(b(p) = b(q)) = P^k$
  => $\Pr_{g \in G}(g(p) \ne g(q)) = (1 - P^k)^L$
  => $\Pr_{g \in G}(g(p) = g(q)) = 1 - (1 - P^k)^L \approx L P^k$ (when $L P^k$ is small)
[Figure: L buckets combined by OR, each bucket built from k hashes combined by AND]
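The effect of the AND/OR construction on collision probabilities can be tabulated directly from the formula 1 - (1 - P^k)^L. The sketch below uses 0.8 and 0.6 as assumed collision probabilities for near and distant points; the (k, L) pairs are arbitrary illustrations.

```python
def amplified_collision_prob(p, k, L):
    """Collision probability of g when a single h collides with probability p.

    AND over k hashes gives p**k; OR over L buckets gives 1 - (1 - p**k)**L.
    """
    return 1.0 - (1.0 - p ** k) ** L

# Assumed setting: near points collide with prob. 0.8, distant points with prob. 0.6.
for k, L in [(1, 1), (5, 5), (10, 20), (20, 200)]:
    near = amplified_collision_prob(0.8, k, L)
    far = amplified_collision_prob(0.6, k, L)
    print(f"k={k:2d} L={L:3d}  near={near:.3f}  far={far:.3f}")
```

Larger k pushes the distant-point probability down much faster than the near-point one, and a larger L then restores the near-point probability.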
Parameter Optimization
• Situation
  – An (r, r+ε, c, c')-sensitive LSH H is given
  – After gap amplification we want c to be lifted to c↑
• Let $c^{\uparrow} = 1 - (1 - c^k)^L$
  $\Rightarrow L = \dfrac{\log(1 - c^{\uparrow})}{\log(1 - c^k)}$
  ⇒ L is a strictly increasing function of k
• So we only need to select a good k
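A small worked example of the relation between k and L, rounding L up to an integer. The values c = 0.8 and target coverage 0.9 are assumptions chosen for illustration, and `tables_needed` is a hypothetical helper name.

```python
import math

def tables_needed(c, k, target):
    """Smallest integer L with 1 - (1 - c**k)**L >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - c ** k))

# L grows quickly with k, since L is roughly proportional to 1 / c**k.
for k in range(1, 11):
    print(k, tables_needed(0.8, k, 0.9))
```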
How to select a good k
• How to measure the "goodness"?
  – Minimize time_c + E[time_FP] under the space constraint
• Let's investigate how the query time and space usage react when we increase k
k ↑ => space ↑
• Space = $O(dn + nL) = O\left(n\left(d + \dfrac{\log(1 - c^{\uparrow})}{\log(1 - c^k)}\right)\right)$
k ↑ => time_c ↑
• time_c = $O(dkL) = O\left(dk\,\dfrac{\log(1 - c^{\uparrow})}{\log(1 - c^k)}\right)$
  – Interested readers can refer to E2LSH for how to reduce time_c to $O(dkL^{1/2})$
k ↑ => time_FP ↓
• Consider s-curves $Y = 1 - (1 - P^k)^L$ passing through $(c, c^{\uparrow})$
• Larger k => steeper s-curve
  => collision probability drops faster for distant points
  => fewer false positives
  => time_FP ↓
[Figure: s-curves of Y versus P for k = 2 and k = 3, both passing through (c, c↑); the original collision probability of distant points is marked on the P axis]
Procedure for optimizing k
1. Determine the largest possible value k_max for k that does not violate the space constraint
2. Find k* in [1..k_max] minimizing time_c(k) + E[time_FP(k)]
  – In practice, time_c and time_FP are measured experimentally by
    • constructing a data structure T
    • running several queries sampled from S on T
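The procedure can be sketched with an analytic cost model standing in for the experimental measurements. Everything in the model is an assumption made for illustration: time_c is taken as d·k·L, E[time_FP] as d times the expected number of colliding distant points, every distant point is given the same collision probability `c_far`, and the space constraint is expressed as a cap `max_tables` on L.

```python
import math

def tables_needed(c, k, target):
    """Smallest integer L with 1 - (1 - c**k)**L >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - c ** k))

def choose_k(c, c_far, target, d, n_far, max_tables):
    """Scan k = 1..k_max and return (k*, modeled cost, L) with minimal cost.

    Modeled cost (an assumption, not a measurement):
      time_c  = d * k * L
      time_FP = d * n_far * (1 - (1 - c_far**k)**L)
    """
    best = None
    k = 1
    # Step 1: k_max is the largest k whose table count fits the space cap.
    while tables_needed(c, k, target) <= max_tables:
        L = tables_needed(c, k, target)
        time_c = d * k * L
        time_fp = d * n_far * (1.0 - (1.0 - c_far ** k) ** L)
        cost = time_c + time_fp
        # Step 2: keep the k with the smallest modeled query cost.
        if best is None or cost < best[1]:
            best = (k, cost, L)
        k += 1
    return best

k_star, cost, L = choose_k(0.8, 0.6, 0.9, d=100, n_far=100_000, max_tables=500)
print(k_star, L, round(cost))
```

A linear scan over k is used here because it needs no assumption about the shape of the cost curve; with measured costs, each evaluation would build T and run sample queries from S instead.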
Observation & Question
• Let $\Delta_k \mathrm{time}_c = \mathrm{time}_c(k) - \mathrm{time}_c(k-1)$
• Let $\Delta_k \mathrm{time}_{FP} = E[\mathrm{time}_{FP}(k-1)] - E[\mathrm{time}_{FP}(k)]$
• Observation:
  – If Δ_k time_c is increasing and Δ_k time_FP is decreasing, then k* is the largest k such that Δ_k time_FP > Δ_k time_c, and it can be found using binary search.
• Question:
  – In which situations will Δ_k time_FP = E[time_FP(k-1)] - E[time_FP(k)] be increasing?
[Figure: Δ_k time_c and Δ_k time_FP plotted against k]
Summary
• Near Neighbor Reporting
  – Has many applications in practice
• Locality Sensitive Hashing
  – Hashes near points to the same value with high probability
  – One of the most useful techniques for NNR
• Performance tuning
  – Gap amplification for higher coverage and fewer false positives
  – Parameter optimization for query time
Further Reading
• Dimensionality Reduction
  – Variance preserving
    • Principal Component Analysis
    • Singular Value Decomposition
  – Distance preserving
    • Random Projection and the Johnson–Lindenstrauss lemma
  – Locality preserving
    • Locally Linear Embedding
    • Multi-dimensional Scaling
    • ISOMAP
References
1. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
2. Similarity Search in High Dimensions via Hashing
3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
4. E2LSH
Appendix
• Suppose that we have an (r, r+ε, c, err)-sensitive LSH H and want to amplify H to get an (r, r+ε, c↑, err↓)-sensitive LSH G.
• How do the bucket number L and the collision error err↓ change with k?
Increasing rate of bucket number
Let $1 - (1 - c^k)^L = c^{\uparrow}$
$\Rightarrow c^{\uparrow} \le L c^k$ and $c^{\uparrow} \ge 1 - e^{-L c^k}$
$\Rightarrow \dfrac{c^{\uparrow}}{c^k} \le L \le \dfrac{-\ln(1 - c^{\uparrow})}{c^k}$
$\Rightarrow L = \Theta\left(\dfrac{1}{c^k}\right)$
Decreasing rate of collision error
• $\mathrm{err}^{\downarrow} = 1 - (1 - \mathrm{err}^k)^L \le L\,\mathrm{err}^k \le \alpha \left(\dfrac{\mathrm{err}}{c}\right)^k$ for some constant α   (1)
• $\mathrm{err}^{\downarrow} = 1 - (1 - \mathrm{err}^k)^L \ge 1 - e^{-L\,\mathrm{err}^k} \ge 1 - e^{-\beta (\mathrm{err}/c)^k}$ for some constant 0 < β < 1
  $= \beta \left(\dfrac{\mathrm{err}}{c}\right)^k - \dfrac{\beta^2}{2!} \left(\dfrac{\mathrm{err}}{c}\right)^{2k} + \dfrac{\beta^3}{3!} \left(\dfrac{\mathrm{err}}{c}\right)^{3k} - \dots = \Omega\left(\left(\dfrac{\mathrm{err}}{c}\right)^k\right)$   (2)
• By (1) and (2), we have $\mathrm{err}^{\downarrow} = \Theta\left(\left(\dfrac{\mathrm{err}}{c}\right)^k\right)$
More Related Content

What's hot

Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language ProcessingPranav Gupta
Β 
Chaps 1-3-ai-prolog
Chaps 1-3-ai-prologChaps 1-3-ai-prolog
Chaps 1-3-ai-prologsaru40
Β 
ProLog (Artificial Intelligence) Introduction
ProLog (Artificial Intelligence) IntroductionProLog (Artificial Intelligence) Introduction
ProLog (Artificial Intelligence) Introductionwahab khan
Β 
Randomized algorithms ver 1.0
Randomized algorithms ver 1.0Randomized algorithms ver 1.0
Randomized algorithms ver 1.0Dr. C.V. Suresh Babu
Β 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent methodSanghyuk Chun
Β 
Randomized Algorithm- Advanced Algorithm
Randomized Algorithm- Advanced AlgorithmRandomized Algorithm- Advanced Algorithm
Randomized Algorithm- Advanced AlgorithmMahbubur Rahman
Β 
5 csp
5 csp5 csp
5 cspMhd Sb
Β 
Introduction to Neural Networks
Introduction to Neural NetworksIntroduction to Neural Networks
Introduction to Neural NetworksDatabricks
Β 
Logic Programming and ILP
Logic Programming and ILPLogic Programming and ILP
Logic Programming and ILPPierre de Lacaze
Β 
Algorithm analysis and design
Algorithm analysis and designAlgorithm analysis and design
Algorithm analysis and designMegha V
Β 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayesDhwaj Raj
Β 
Genetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial IntelligenceGenetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial IntelligenceSahil Kumar
Β 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesankit_ppt
Β 
Transition Based Dependency Parsing
Transition Based Dependency ParsingTransition Based Dependency Parsing
Transition Based Dependency ParsingDavid Przybilla
Β 
16890 unit 2 heuristic search techniques
16890 unit 2 heuristic  search techniques16890 unit 2 heuristic  search techniques
16890 unit 2 heuristic search techniquesJais Balta
Β 
Artificial Neural Network Lecture 6- Associative Memories & Discrete Hopfield...
Artificial Neural Network Lecture 6- Associative Memories & Discrete Hopfield...Artificial Neural Network Lecture 6- Associative Memories & Discrete Hopfield...
Artificial Neural Network Lecture 6- Associative Memories & Discrete Hopfield...Mohammed Bennamoun
Β 

What's hot (20)

Natural language processing
Natural language processingNatural language processing
Natural language processing
Β 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
Β 
Chaps 1-3-ai-prolog
Chaps 1-3-ai-prologChaps 1-3-ai-prolog
Chaps 1-3-ai-prolog
Β 
ProLog (Artificial Intelligence) Introduction
ProLog (Artificial Intelligence) IntroductionProLog (Artificial Intelligence) Introduction
ProLog (Artificial Intelligence) Introduction
Β 
Randomized algorithms ver 1.0
Randomized algorithms ver 1.0Randomized algorithms ver 1.0
Randomized algorithms ver 1.0
Β 
AI: AI & Searching
AI: AI & SearchingAI: AI & Searching
AI: AI & Searching
Β 
Branch and bound
Branch and boundBranch and bound
Branch and bound
Β 
Gradient descent method
Gradient descent methodGradient descent method
Gradient descent method
Β 
Randomized Algorithm- Advanced Algorithm
Randomized Algorithm- Advanced AlgorithmRandomized Algorithm- Advanced Algorithm
Randomized Algorithm- Advanced Algorithm
Β 
5 csp
5 csp5 csp
5 csp
Β 
Introduction to Neural Networks
Introduction to Neural NetworksIntroduction to Neural Networks
Introduction to Neural Networks
Β 
Logic Programming and ILP
Logic Programming and ILPLogic Programming and ILP
Logic Programming and ILP
Β 
Algorithm analysis and design
Algorithm analysis and designAlgorithm analysis and design
Algorithm analysis and design
Β 
Introduction to text classification using naive bayes
Introduction to text classification using naive bayesIntroduction to text classification using naive bayes
Introduction to text classification using naive bayes
Β 
Genetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial IntelligenceGenetic Algorithms - Artificial Intelligence
Genetic Algorithms - Artificial Intelligence
Β 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
Β 
DP
DPDP
DP
Β 
Transition Based Dependency Parsing
Transition Based Dependency ParsingTransition Based Dependency Parsing
Transition Based Dependency Parsing
Β 
16890 unit 2 heuristic search techniques
16890 unit 2 heuristic  search techniques16890 unit 2 heuristic  search techniques
16890 unit 2 heuristic search techniques
Β 
Artificial Neural Network Lecture 6- Associative Memories & Discrete Hopfield...
Artificial Neural Network Lecture 6- Associative Memories & Discrete Hopfield...Artificial Neural Network Lecture 6- Associative Memories & Discrete Hopfield...
Artificial Neural Network Lecture 6- Associative Memories & Discrete Hopfield...
Β 

Similar to LSH

Sparksummitny2016
Sparksummitny2016Sparksummitny2016
Sparksummitny2016Ram Sriharsha
Β 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25MapR Technologies
Β 
DBMS ArchitectureQuery ExecutorBuffer ManagerStora
DBMS ArchitectureQuery ExecutorBuffer ManagerStoraDBMS ArchitectureQuery ExecutorBuffer ManagerStora
DBMS ArchitectureQuery ExecutorBuffer ManagerStoraLinaCovington707
Β 
Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3Charles Martin
Β 
Sketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignmentSketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignmentssuser2be88c
Β 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford MapR Technologies
Β 
Paper2_CSE6331_Vivek_1001053883
Paper2_CSE6331_Vivek_1001053883Paper2_CSE6331_Vivek_1001053883
Paper2_CSE6331_Vivek_1001053883Vivek Sharma
Β 
1 chayes
1 chayes1 chayes
1 chayesYandex
Β 
StringMatching-Rabikarp algorithmddd.pdf
StringMatching-Rabikarp algorithmddd.pdfStringMatching-Rabikarp algorithmddd.pdf
StringMatching-Rabikarp algorithmddd.pdfbhagabatijenadukura
Β 
High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...Vissarion Fisikopoulos
Β 
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...Sean Moran
Β 
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaMagellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaSpark Summit
Β 
ClosestPairClosestPairClosestPairClosestPair
ClosestPairClosestPairClosestPairClosestPairClosestPairClosestPairClosestPairClosestPair
ClosestPairClosestPairClosestPairClosestPairShanmuganathan C
Β 
Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)Hwa Pyung Kim
Β 

Similar to LSH (20)

Sparksummitny2016
Sparksummitny2016Sparksummitny2016
Sparksummitny2016
Β 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
Β 
DBMS ArchitectureQuery ExecutorBuffer ManagerStora
DBMS ArchitectureQuery ExecutorBuffer ManagerStoraDBMS ArchitectureQuery ExecutorBuffer ManagerStora
DBMS ArchitectureQuery ExecutorBuffer ManagerStora
Β 
Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3Applied machine learning for search engine relevance 3
Applied machine learning for search engine relevance 3
Β 
Optim_methods.pdf
Optim_methods.pdfOptim_methods.pdf
Optim_methods.pdf
Β 
Lecture24
Lecture24Lecture24
Lecture24
Β 
Sketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignmentSketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignment
Β 
Dp idp exploredb
Dp idp exploredbDp idp exploredb
Dp idp exploredb
Β 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
Β 
Paper2_CSE6331_Vivek_1001053883
Paper2_CSE6331_Vivek_1001053883Paper2_CSE6331_Vivek_1001053883
Paper2_CSE6331_Vivek_1001053883
Β 
1 chayes
1 chayes1 chayes
1 chayes
Β 
Paris Data Geeks
Paris Data GeeksParis Data Geeks
Paris Data Geeks
Β 
StringMatching-Rabikarp algorithmddd.pdf
StringMatching-Rabikarp algorithmddd.pdfStringMatching-Rabikarp algorithmddd.pdf
StringMatching-Rabikarp algorithmddd.pdf
Β 
High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...
Β 
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...
Learning to Project and Binarise for Hashing-based Approximate Nearest Neighb...
Β 
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram SriharshaMagellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Magellan-Spark as a Geospatial Analytics Engine by Ram Sriharsha
Β 
Fp12_Efficient_SCM
Fp12_Efficient_SCMFp12_Efficient_SCM
Fp12_Efficient_SCM
Β 
ClosestPairClosestPairClosestPairClosestPair
ClosestPairClosestPairClosestPairClosestPairClosestPairClosestPairClosestPairClosestPair
ClosestPairClosestPairClosestPairClosestPair
Β 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
Β 
Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)
Β 

Recently uploaded

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraΓΊjo
Β 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
Β 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
Β 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
Β 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
Β 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
Β 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
Β 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
Β 
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
Β 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
Β 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
Β 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
Β 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
Β 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
Β 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
Β 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
Β 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
Β 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
Β 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
Β 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
Β 

Recently uploaded (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Β 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
Β 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
Β 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
Β 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Β 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
Β 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Β 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
Β 
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 βœ“Call Girls In Kalyan ( Mumbai ) secure service
Β 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
Β 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
Β 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
Β 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
Β 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Β 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
Β 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
Β 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
Β 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
Β 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Β 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
Β 

LSH

  • 1. Locality Sensitive Hashing with Application to Near Neighbor Reporting Hsiao-Fei Liu 2015.3.4
  • 2. Motivation β€’ Real word applications – Recommendation system β€’ Searching for similar items and users – Malicious website detection β€’ Searching for websites similar to some know malicious websites β€’ The underlying core problem Given: β€’ A large set P of high-dimensional data points in a metric space M β€’ A large set Q of high-dimensional query points in a metric space M Goal: β€’ Find near neighbors in P for each query point in Q β€’ Avoid linearly scanning P for each query
  • 3. Related Work
      • Nearest Neighbor Searching
          – Given: a set P of n points in a metric space M
          – Goal: for any query q, return a point $p \in P$ minimizing dist(p, q)
      • Classic result
          – Point location in arrangements of hyperplanes
              • Meiser, IC'93
              • In a d-dimensional Euclidean space under some $L_p$ norm
              • $d^{O(1)} \log n$ query time and $n^{O(d)}$ space
  • 4. Related Work
      • Approximate Nearest Neighbor
          – Given: a set P of n points in a metric space M and $\varepsilon > 0$
          – Goal: for any query q, return a point $p \in P$ s.t. $\mathrm{dist}(p, q) \le (1+\varepsilon)\,\mathrm{dist}(p^*, q)$, where $p^*$ is the nearest neighbor to q
      • Classic result
          – Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
              • Har-Peled, Indyk and Motwani, STOC'98, FOCS'01, ToC'12
              • In d-dimensional Euclidean space under the $L_p$ norm
              • $O(d\,n^{1/(1+\varepsilon)})$ query time and $O(dn + n^{1 + 1/(1+\varepsilon)})$ space
              • Technique 1: Approximate Nearest Neighbor reduces to Approximate Near Neighbor with little overhead
              • Technique 2: Locality Sensitive Hashing for Approximate Near Neighbor
  • 5. Overview
      1. Near Neighbor Reporting
          – Formal problem formulation
      2. Locality Sensitive Hashing
          – Definition and example
          – Algorithms for NNR based on LSH
          – Query time decomposition
      3. Performance tuning
          – Gap amplification
          – Parameter optimization
  • 7. Near Neighbor Reporting
      • Input:
          – A set P of points in a metric space M
          – Radius r > 0
          – Coverage rate c
          – A set S of points sampled from an unknown distribution f
      • Goal:
          – A deterministic algorithm for building a data structure T s.t.
              • $T.\mathrm{nbrs}(q) \subseteq \mathrm{nbrs}(q) = \{p \in P : \mathrm{dist}(p, q) \le r\}$
              • $E_q[\mathrm{cvg}(q \mid T)] \ge c$, where $q \sim f$ and $\mathrm{cvg}(q \mid T) = |T.\mathrm{nbrs}(q)| \,/\, |\mathrm{nbrs}(q)|$
      • Slide annotation: hard to achieve
  • 8. (Relaxed) Near Neighbor Reporting
      • Input:
          – A set P of points in a metric space M
          – Radius r > 0
          – Coverage rate c
          – A set S of points sampled from an unknown distribution f
      • Goal:
          – A randomized algorithm for building a data structure T s.t.
              • $T.\mathrm{nbrs}(q) \subseteq \mathrm{nbrs}(q) = \{p \in P : \mathrm{dist}(p, q) \le r\}$
              • $E_T[E_q[\mathrm{cvg}(q \mid T)]] = \sum_T \Pr(T \text{ is built})\, E_q[\mathrm{cvg}(q \mid T)] \ge c$, where $q \sim f$
      • Fact:
          – If $E_T[\mathrm{cvg}(q \mid T)] \ge c$ for any q, then $E_T[E_q[\mathrm{cvg}(q \mid T)]] \ge c$, where $q \sim f$
  • 10. Locality Sensitive Hashing
      • Informally, an LSH H is a set of hash functions over a metric space satisfying the following condition.
          – Let h be chosen uniformly at random from H
          – The closer p and q are, the higher $\Pr(h(p) = h(q))$ is
      • Formally, H is $(r, r+\varepsilon, c, c')$-sensitive if
          – $r < r+\varepsilon$ and $c > c'$
          – Let h be chosen uniformly at random from H
          – If $\mathrm{dist}(p, q) \le r$, then $\Pr(h(p) = h(q)) \ge c$
          – If $\mathrm{dist}(p, q) \ge r+\varepsilon$, then $\Pr(h(p) = h(q)) \le c'$
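  • 11. LSH for angular distance
      • Distance function: $d(u, v) = \theta(u, v) = \arccos\big(u \cdot v \,/\, (\|u\|\,\|v\|)\big)$, the angle between u and v
      • Random Projection:
          – Choose a random unit vector w
          – $h(u) = \mathrm{sgn}(u \cdot w)$
          – $\Pr(h(u) = h(v)) = 1 - \theta(u, v)/\pi$
      • [Figure: vectors u and v separated by the angle θ]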
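The random-projection hash on this slide can be checked empirically. The following is a minimal sketch, not the author's code: the function names are made up for illustration, and w is drawn from an isotropic Gaussian (an assumption; only its direction matters, and that direction is uniform on the sphere).

```python
import math
import random

def random_hyperplane_hash(dim, rng):
    """Sample h(u) = sgn(u . w) for a random direction w (hypothetical helper)."""
    # Assumption: a Gaussian vector stands in for the "random unit vector" w;
    # normalising it would not change the sign of u . w.
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def h(u):
        return 1 if sum(ui * wi for ui, wi in zip(u, w)) >= 0 else -1

    return h

def angle(u, v):
    """Angle theta(u, v) in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))

def empirical_collision_rate(u, v, trials=20000, seed=0):
    """Fraction of freshly sampled hash functions on which u and v collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = random_hyperplane_hash(len(u), rng)
        hits += h(u) == h(v)
    return hits / trials

u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]                      # 45 degrees away from u
predicted = 1 - angle(u, v) / math.pi    # formula from the slide
observed = empirical_collision_rate(u, v)
print(f"predicted={predicted:.3f} observed={observed:.3f}")
```

With enough trials the observed rate should sit close to the predicted value of 1 − θ/π.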
  • 12. LSH for Jaccard distance
      • Distance function: $d(A, B) = 1 - |A \cap B| \,/\, |A \cup B|$
      • MinHash:
          – Pick a random permutation $\pi$ on the universe U
          – $h(A) = \arg\min_{a \in A} \pi(a)$
          – $\Pr(h(A) = h(B)) = |A \cap B| \,/\, |A \cup B| = 1 - d(A, B)$
      • Note:
          – Finding the Jaccard median
              • Very easy to understand, very hard to compute
              • Studied since 1981
              • Chierichetti, Kumar, Pandey, Vassilvitskii, SODA'10
                  – NP-hard and no FPTAS
                  – PTAS
      • [Figure: Venn diagram of sets A and B]
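A small simulation can illustrate the MinHash collision probability stated above. This is a sketch under assumptions: the helper names are hypothetical, and the permutation is materialised explicitly, which is only practical for a tiny universe.

```python
import random

def minhash_function(universe, rng):
    """Sample h(A) = argmin_{a in A} pi(a) for a random permutation pi (hypothetical helper)."""
    order = list(universe)
    rng.shuffle(order)                       # random permutation of the universe
    rank = {item: position for position, item in enumerate(order)}
    return lambda A: min(A, key=rank.__getitem__)

def jaccard_similarity(A, B):
    return len(A & B) / len(A | B)

universe = range(20)
A = set(range(0, 10))
B = set(range(5, 15))                        # |A ∩ B| = 5, |A ∪ B| = 15

rng = random.Random(1)
trials = 20000
hits = 0
for _ in range(trials):
    h = minhash_function(universe, rng)
    hits += h(A) == h(B)

minhash_estimate = hits / trials
exact = jaccard_similarity(A, B)
print(f"exact={exact:.3f} estimate={minhash_estimate:.3f}")
```

The two sets collide exactly when the minimum-ranked element of A ∪ B lies in A ∩ B, which is why the collision rate tracks the Jaccard similarity.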
  • 13. Algorithm for NNR Based on LSH
      • Let H be a $(r, r+\varepsilon, c, c')$-sensitive LSH over a metric space M
      • Consider the following randomized algorithm for NNR
          – Choose a hash function h from H uniformly at random
          – Build a hash table $T_h$ s.t. $T_h(q) = \{p \in P : h(p) = h(q)\}$
          – Define $T_h.\mathrm{nbrs}(q)$ to return the points p in $T_h(q)$ with $\mathrm{dist}(p, q) \le r$
      • For any q,
          $E_{T_h}[\mathrm{cvg}(q \mid T_h)]$
          $= \sum_{T_h} \Pr(T_h \text{ is built})\, \mathrm{cvg}(q \mid T_h)$
          $= \sum_{h} \Pr(h \text{ is chosen})\, \frac{\sum_{p \in \mathrm{nbrs}(q)} \delta(h(p) = h(q))}{|\mathrm{nbrs}(q)|}$
          $= \frac{\sum_{p \in \mathrm{nbrs}(q)} \sum_{h} \Pr(h \text{ is chosen})\, \delta(h(p) = h(q))}{|\mathrm{nbrs}(q)|}$
          $= \frac{\sum_{p \in \mathrm{nbrs}(q)} \Pr(h(p) = h(q))}{|\mathrm{nbrs}(q)|} \ge c$
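The single-table algorithm on this slide might be sketched as follows. The class name, the `coverage` helper, and the toy 1-D grid hash are all illustrative assumptions, not part of the slides; the hash is passed in rather than sampled so that the example stays deterministic.

```python
import math
from collections import defaultdict

class SingleTableIndex:
    """One hash table T_h built from a single hash function h (hypothetical sketch)."""

    def __init__(self, points, h, dist, r):
        self.h, self.dist, self.r = h, dist, r
        self.buckets = defaultdict(list)
        for p in points:
            self.buckets[h(p)].append(p)

    def nbrs(self, q):
        # Candidates share q's bucket; false positives are filtered by true distance.
        candidates = self.buckets.get(self.h(q), [])
        return [p for p in candidates if self.dist(p, q) <= self.r]

def coverage(index, points, q):
    """cvg(q | T) = |T.nbrs(q)| / |nbrs(q)|, assuming q has at least one true neighbor."""
    true_nbrs = [p for p in points if index.dist(p, q) <= index.r]
    return len(index.nbrs(q)) / len(true_nbrs)

# Toy setting (assumption): points on the real line, hashed to unit-width grid cells.
points = [0.1, 0.4, 0.9, 1.6, 2.5]
index = SingleTableIndex(points,
                         h=lambda x: math.floor(x),
                         dist=lambda a, b: abs(a - b),
                         r=0.5)

print(index.nbrs(0.05))              # 0.9 shares the bucket but is filtered out
print(index.nbrs(1.2))               # 0.9 is a true neighbor but lives in another bucket
print(coverage(index, points, 1.2))
```

The second query shows why a single table gives limited coverage: a true neighbor that hashes to a different bucket is never examined.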
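  • 14. Query Time
      • Query time
          = time for computing h(q)
          + $|\{p \in P : h(p) = h(q) \text{ and } \mathrm{dist}(p, q) > r\}| \cdot d$
          + $|\{p \in P : h(p) = h(q) \text{ and } \mathrm{dist}(p, q) \le r\}| \cdot d$
          = $\mathrm{time}_c + \mathrm{time}_{FP} + |\mathrm{output}| \cdot d$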
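  • 16. Why Gap Amplification
      • To lift the coverage rate c
      • To reduce the false positive rate and so improve $\mathrm{time}_{FP}$
      • Gap amplification turns a $(r, r+\varepsilon, c, c')$-sensitive LSH into a $(r, r+\varepsilon, c^{\uparrow}, c')$-sensitive LSH
      • [Figure: collision probability vs. Jaccard distance. Before amplification: c = 0.8 at distance 0.2 and c' = 0.6 at distance 0.4. After amplification: c = 0.9 at distance 0.2 and c' = 0.1 at distance 0.4]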
  • 17. How Gap Amplification
      • Construct LSH G from the original LSH H
          – LSH $B = \{b(x; h_1, h_2, \dots, h_k) : h_i \in H\}$
              • $b(p; h_1, \dots, h_k) = b(q; h_1, \dots, h_k)$ iff $\mathrm{AND}_{i=1,\dots,k}\; h_i(p) = h_i(q)$
          – LSH $G = \{g(x; b_1, b_2, \dots, b_L) : b_i \in B\}$
              • $g(p; b_1, \dots, b_L) = g(q; b_1, \dots, b_L)$ iff $\mathrm{OR}_{i=1,\dots,L}\; b_i(p) = b_i(q)$
      • Intuition:
          – AND increases the gap
              • Collision probabilities of distant points decrease exponentially faster than those of near points
          – OR increases the collision probabilities approximately linearly
      • Let $P = \Pr_{h \in H}(h(p) = h(q))$
          ⇒ $\Pr_{b \in B}(b(p) = b(q)) = P^k$
          ⇒ $\Pr_{g \in G}(g(p) \ne g(q)) = (1 - P^k)^L$
          ⇒ $\Pr_{g \in G}(g(p) = g(q)) = 1 - (1 - P^k)^L \approx L P^k$
      • [Figure: L buckets, each an AND of k hashes, combined by OR]
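The AND/OR construction and its collision probability can be sketched in a few lines. Everything named here is an illustrative assumption: in particular, the base family (sampling one coordinate of a bit vector) is a stand-in chosen so the example is self-contained, and k = 10, L = 20 are arbitrary values.

```python
import random

def amplified_collision_prob(P, k, L):
    """1 - (1 - P^k)^L: collision probability after AND over k, then OR over L."""
    return 1.0 - (1.0 - P ** k) ** L

def sample_amplified_hash(sample_h, k, L):
    """Draw L bands of k base hashes each; g(x) is the list of L band signatures."""
    bands = [[sample_h() for _ in range(k)] for _ in range(L)]

    def g(x):
        return [tuple(h(x) for h in band) for band in bands]

    return g

def collide(gp, gq):
    """OR over bands: collide if any band signature matches (AND inside the band)."""
    return any(bp == bq for bp, bq in zip(gp, gq))

# Base family (assumption): h_i(x) = x[i] for a uniformly random coordinate i.
rng = random.Random(0)

def sample_bit_hash(dim=4):
    i = rng.randrange(dim)
    return lambda x: x[i]

g = sample_amplified_hash(sample_bit_hash, k=3, L=5)
p = (0, 1, 1, 0)
print(collide(g(p), g(p)))                        # identical points always collide
print(collide(g((0, 0, 0, 0)), g((1, 1, 1, 1))))  # points differing everywhere never do

# Base probabilities 0.8 (near) and 0.6 (far), as in the "Why Gap Amplification" figure.
near = amplified_collision_prob(0.8, k=10, L=20)
far = amplified_collision_prob(0.6, k=10, L=20)
print(f"near={near:.3f} far={far:.3f}")
```

With these example parameters the near-point probability rises to roughly 0.9 while the far-point probability falls to roughly 0.1, which is the widening gap the slides describe.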
  • 18. Parameter Optimization
      • Situation
          – A $(r, r+\varepsilon, c, c')$-sensitive LSH H is given
          – After gap amplification we want c to be lifted to $c^{\uparrow}$
      • Let $c^{\uparrow} = 1 - (1 - c^k)^L$
          ⇒ $L = \dfrac{\log(1 - c^{\uparrow})}{\log(1 - c^k)}$
          ⇒ L is a strictly increasing function of k
      • So we only need to select a good k
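To make the dependence of L on k concrete, the formula can be evaluated directly. This is a minimal sketch: the function name is hypothetical, L is rounded up to an integer (an assumption, since the slide treats L as real-valued), and c = 0.8 with target 0.9 are example values only.

```python
import math

def buckets_needed(c, c_target, k):
    """Smallest integer L with 1 - (1 - c^k)^L >= c_target."""
    # log1p(-x) computes log(1 - x) accurately when c^k is small.
    return math.ceil(math.log(1.0 - c_target) / math.log1p(-(c ** k)))

c, c_target = 0.8, 0.9
table = [(k, buckets_needed(c, c_target, k)) for k in range(1, 11)]
for k, L in table:
    # k * L is the number of base hash evaluations per query.
    print(f"k={k:2d}  L={L:3d}  k*L={k * L}")
```

The table grows quickly in k, which is the cost side of the trade-off examined on the following slides.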
  • 19. How to select a good k
      • How to measure the "goodness"?
          – Minimize $\mathrm{time}_c + E[\mathrm{time}_{FP}]$ under the space constraint
      • Let's investigate how the query time and space usage react when we increase k
  • 20. k ↑ ⇒ space ↑
      • Space $= O(dn + nL) = O\!\left(n\left(d + \dfrac{\log(1 - c^{\uparrow})}{\log(1 - c^k)}\right)\right)$
  • 21. k ↑ ⇒ $\mathrm{time}_c$ ↑
      • $\mathrm{time}_c = O(dkL) = O\!\left(dk\,\dfrac{\log(1 - c^{\uparrow})}{\log(1 - c^k)}\right)$
          – Interested readers can refer to E2LSH for how to reduce $\mathrm{time}_c$ to $O(dkL^{1/2})$
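  • 22. k ↑ ⇒ $\mathrm{time}_{FP}$ ↓
      • Consider S-curves $Y = 1 - (1 - P^k)^L$ passing through $(c, c^{\uparrow})$
      • Larger k ⇒ steeper S-curve ⇒ collision probability drops faster for distant points ⇒ fewer false positives ⇒ $\mathrm{time}_{FP}$ ↓
      • [Figure: S-curves for k = 2 and k = 3 through the point $(c, c^{\uparrow})$; x-axis P, the original collision probability, with the range for distant points marked; y-axis Y]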
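The steepening of the S-curve can be tabulated numerically. In this sketch L is kept real-valued so that every curve passes exactly through the anchor point; the function name and the values c = 0.8, c↑ = 0.9 and P = 0.6 for a distant pair are assumptions for illustration.

```python
import math

def s_curve_through(c, c_target, k):
    """Return Y(P) = 1 - (1 - P^k)^L with real-valued L chosen so that Y(c) = c_target."""
    L = math.log(1.0 - c_target) / math.log1p(-(c ** k))
    return lambda P: 1.0 - (1.0 - P ** k) ** L

c, c_target, far_prob = 0.8, 0.9, 0.6
far_values = []
for k in range(1, 11):
    Y = s_curve_through(c, c_target, k)
    far_values.append(Y(far_prob))
    print(f"k={k:2d}  Y(c)={Y(c):.3f}  Y(0.6)={Y(far_prob):.3f}")
```

Every row keeps Y(c) pinned at the target while the value at the distant-pair probability shrinks as k grows.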
  • 23. Procedure for optimizing k
      1. Determine the largest possible value $k_{max}$ for k without violating the space constraint
      2. Find $k^*$ in $[1..k_{max}]$ minimizing $\mathrm{time}_c(k) + E[\mathrm{time}_{FP}(k)]$
          – In practice, $\mathrm{time}_c$ and $\mathrm{time}_{FP}$ are measured experimentally by
              • constructing a data structure T
              • running several queries sampled from S on T
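The two-step procedure can be sketched with an analytic cost model standing in for the experimental measurements. This substitution is an assumption: the slides measure both terms by running sampled queries, whereas here time_c is modeled as d·k·L and E[time_FP] as (number of distant points) × (amplified far-collision probability) × d. The space constraint is expressed as a cap `max_L` on the number of buckets, and all names and numbers are hypothetical.

```python
import math

def buckets_needed(c, c_target, k):
    """Smallest integer L with 1 - (1 - c^k)^L >= c_target."""
    return math.ceil(math.log(1.0 - c_target) / math.log1p(-(c ** k)))

def modeled_query_cost(k, c, c_far, c_target, d, n_far):
    """Model (assumption) of time_c(k) + E[time_FP(k)]; not a measurement."""
    L = buckets_needed(c, c_target, k)
    time_c = d * k * L                               # hashing the query
    fp_prob = 1.0 - (1.0 - c_far ** k) ** L          # a distant point collides
    expected_time_fp = n_far * fp_prob * d           # distance checks on false positives
    return time_c + expected_time_fp

def optimize_k(c, c_far, c_target, d, n_far, max_L):
    # Step 1: largest k whose bucket count still fits the space budget
    # (k = 1 is assumed to fit).
    k_max = 1
    while buckets_needed(c, c_target, k_max + 1) <= max_L:
        k_max += 1
    # Step 2: linear scan over [1..k_max] for the cheapest k.
    best_k = min(range(1, k_max + 1),
                 key=lambda k: modeled_query_cost(k, c, c_far, c_target, d, n_far))
    return best_k, k_max

params = dict(c=0.8, c_far=0.6, c_target=0.9, d=100, n_far=1_000_000)
best_k, k_max = optimize_k(max_L=200, **params)
print(best_k, k_max, modeled_query_cost(best_k, **params))
```

A linear scan is used because it needs no assumption about the shape of the cost curve; swapping `modeled_query_cost` for a function that times real queries would follow the slide more closely.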
  • 24. Observation & Question
      • Let $\Delta_k \mathrm{time}_c = \mathrm{time}_c(k) - \mathrm{time}_c(k-1)$
      • Let $\Delta_k \mathrm{time}_{FP} = E[\mathrm{time}_{FP}(k-1)] - E[\mathrm{time}_{FP}(k)]$
      • Observation:
          – If $\Delta_k \mathrm{time}_c$ is increasing and $\Delta_k \mathrm{time}_{FP}$ is decreasing, then $k^*$ is the largest k such that $\Delta_k \mathrm{time}_{FP} > \Delta_k \mathrm{time}_c$, and it can be found using binary search.
      • Question:
          – In which situations will $\Delta_k \mathrm{time}_{FP} = E[\mathrm{time}_{FP}(k-1)] - E[\mathrm{time}_{FP}(k)]$ be increasing?
      • [Figure: $\Delta_k \mathrm{time}_c$ and $\Delta_k \mathrm{time}_{FP}$ plotted against k]
  • 25. Summary
      • Near Neighbor Reporting
          – Finds many applications in practice
      • Locality Sensitive Hashing
          – Hashes near points to the same value
          – One of the most useful techniques for NNR
      • Performance tuning
          – Gap amplification for higher coverage and fewer false positives
          – Parameter optimization for query time
  • 26. Further Reading
      • Dimensionality Reduction
          – Variance preserving
              • Principal Component Analysis
              • Singular Value Decomposition
          – Distance preserving
              • Random Projection and the Johnson–Lindenstrauss lemma
          – Locality preserving
              • Locally Linear Embedding
              • Multi-dimensional Scaling
              • ISOMAP
  • 27. References
      1. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
      2. Similarity Search in High Dimensions via Hashing
      3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
      4. E2LSH
  • 28. Appendix
      • Suppose that we have a $(r, r+\varepsilon, c, \mathrm{err})$-sensitive LSH H and want to amplify H to get a $(r, r+\varepsilon, c^{\uparrow}, \mathrm{err}^{\downarrow})$-sensitive LSH G.
      • How do the number of buckets L and the collision error $\mathrm{err}^{\downarrow}$ change with k?
  • 29. Increasing rate of bucket number
      • Let $1 - (1 - c^k)^L = c^{\uparrow}$
          ⇒ $c^{\uparrow} \le L c^k$ and $c^{\uparrow} \ge 1 - e^{-L c^k}$
          ⇒ $\dfrac{c^{\uparrow}}{c^k} \le L \le \dfrac{-\ln(1 - c^{\uparrow})}{c^k}$
          ⇒ $L = \Theta\!\left(\dfrac{1}{c^k}\right)$
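  • 30. Decreasing rate of collision error
      • $\mathrm{err}^{\downarrow} = 1 - (1 - \mathrm{err}^k)^L \le L\,\mathrm{err}^k \le \alpha \left(\dfrac{\mathrm{err}}{c}\right)^k$ for some constant $\alpha$   (1)
      • $\mathrm{err}^{\downarrow} = 1 - (1 - \mathrm{err}^k)^L \ge 1 - e^{-L\,\mathrm{err}^k} \ge 1 - e^{-\beta (\mathrm{err}/c)^k}$ for some constant $0 < \beta < 1$
          $= \beta \left(\dfrac{\mathrm{err}}{c}\right)^k - \dfrac{\beta^2}{2!} \left(\dfrac{\mathrm{err}}{c}\right)^{2k} + \dfrac{\beta^3}{3!} \left(\dfrac{\mathrm{err}}{c}\right)^{3k} - \dots = \Omega\!\left(\left(\dfrac{\mathrm{err}}{c}\right)^k\right)$   (2)
      • By (1) and (2), we have $\mathrm{err}^{\downarrow} = \Theta\!\left(\left(\dfrac{\mathrm{err}}{c}\right)^k\right)$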
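The appendix bounds can be sanity-checked numerically. This sketch assumes example values c = 0.8, err = 0.6 and c↑ = 0.9, keeps L real-valued as in the derivation, and takes α = −ln(1 − c↑), which is the constant implied by the upper bound on L; the variable names are hypothetical.

```python
import math

c, err, c_target = 0.8, 0.6, 0.9
alpha = -math.log(1.0 - c_target)      # constant from L <= -ln(1 - c_target) / c^k

ratios = []
for k in range(1, 31):
    # Real-valued L solving 1 - (1 - c^k)^L = c_target.
    L = math.log(1.0 - c_target) / math.log1p(-(c ** k))

    # Slide 29: c_target / c^k <= L <= -ln(1 - c_target) / c^k.
    assert c_target / c ** k <= L <= alpha / c ** k * (1 + 1e-12)

    # Slide 30: err_down = 1 - (1 - err^k)^L, computed stably via expm1/log1p.
    err_down = -math.expm1(L * math.log1p(-(err ** k)))
    ratios.append(err_down / (err / c) ** k)

print(f"min ratio={min(ratios):.3f}  max ratio={max(ratios):.3f}  alpha={alpha:.3f}")
```

The ratio of the amplified error to (err/c)^k stays between two positive constants over the whole range, which is what the Θ bound asserts.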