Locality Sensitive Hashing
Randomized Algorithm
Problem Statement
• Given a query point q,
• Find the items closest to the query point with probability at least 1 − 𝛿
• Iterative methods?
• Large volume of data
• Curse of dimensionality
Taxonomy – Near Neighbor Query (NN)
NN:
• Trees: K-d Tree, Range Tree, B Tree, Cover Tree
• Grid
• Voronoi Diagram
• Hash
• Approximate: LSH
Approximate LSH
• Simple Idea
• If two points are close together, then after a “projection” operation they will remain close together
LSH Requirement
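• For any given points $p, q \in \mathbb{R}^d$:
$P_H[h(p) = h(q)] \ge P_1$ for $\lVert p - q \rVert \le d_1$
$P_H[h(p) = h(q)] \le P_2$ for $\lVert p - q \rVert \ge c\,d_1 = d_2$
• A hash function h that satisfies both conditions is $(d_1, d_2, P_1, P_2)$-sensitive. Ideally we need
• $(P_1 - P_2)$ to be large
• $(d_2 - d_1)$ to be small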
[Figure: Probability vs. Distance on candidate pairs. Query point q with neighborhoods of radius d, 2d, and c·d; the collision probabilities are ordered P(1) ≥ P(2) ≥ P(3).]
Hash Function (Random)
• Locality-preserving
• Independent
• Deterministic
• A family of hash functions exists for each distance measure
• Euclidean
• Jaccard
• Cosine Similarity
• Hamming
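LSH Family for Euclidean distance (2D)
• Project two points at distance $d$ onto a random line cut into buckets of width $a$; $\theta$ is the angle between that line and the segment joining the points
• When $d \cos\theta \le a$,
• Chance of colliding
• But not certain
• But can guarantee,
• If $d \le a/2$,
• the projected separation $d \cos\theta \le a/2$ for every $\theta$, so a bucket boundary separates the points with probability at most 1/2
• ∴ $P_1 \ge 1/2$
• If $d \ge 2a$,
• $d \cos\theta \le a$ requires $60° \le \theta \le 90°$, a necessary condition that a random angle meets with probability 1/3
• ∴ $P_2 \le 1/3$
• So the family is $(d_1, d_2, P_1, P_2)$-sensitive with
• $(a/2,\ 2a,\ 1/2,\ 1/3)$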
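The two bounds above (at least 1/2 when d ≤ a/2, at most 1/3 when d ≥ 2a) can be checked empirically. The following is a minimal Monte Carlo sketch, not part of the original slides; the function name `same_bucket_rate`, the trial count, and the seed are illustrative assumptions.

```python
import math
import random


def same_bucket_rate(d, a, trials=20000, seed=0):
    """Estimate how often two points at distance d share a bucket of width a."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Angle between the random line and the segment joining the points.
        theta = rng.uniform(0.0, math.pi)
        # Random position of the first point's projection inside its bucket.
        first = rng.uniform(0.0, a)
        # Projection of the second point onto the same line.
        second = first + d * math.cos(theta)
        if math.floor(first / a) == math.floor(second / a):
            hits += 1
    return hits / trials


a = 1.0
p_close = same_bucket_rate(d=a / 2, a=a)  # bound says at least 1/2
p_far = same_bucket_rate(d=2 * a, a=a)    # bound says at most 1/3
print(p_close, p_far)
```

The estimates should sit comfortably inside the bounds, since the bounds are worst-case statements rather than exact probabilities.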
How to define the projection?
• Scalar projection (dot product)
$h(\vec{v}) = \vec{v} \cdot \vec{x}$
$\vec{v}$ = query point in $d$-dimensional space
$\vec{x}$ = vector with random components drawn from $N(0, 1)$
$h(\vec{v}) = \left\lfloor \dfrac{\vec{v} \cdot \vec{x} + b}{w} \right\rfloor$
$w$ = width of the quantization bin
$b$ = random variable uniformly distributed between 0 and $w$
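As a minimal sketch of the quantized projection (pure standard library; the helper names `quantized_projection` and `random_hash` are assumptions made for illustration, not names from the slides), the hash can be written as:

```python
import math
import random


def quantized_projection(v, x, b, w):
    """Return floor((v . x + b) / w), the index of the bin that v falls into."""
    dot = sum(vi * xi for vi, xi in zip(v, x))
    return math.floor((dot + b) / w)


def random_hash(dim, w, rng):
    """Draw each component of x from N(0, 1) and b from U(0, w); return h(v)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    return lambda v: quantized_projection(v, x, b, w)


# Fixed numbers make the arithmetic easy to follow:
# v . x = 1*0.5 + 2*(-1.0) = -1.5, then (-1.5 + 0.3) / 1.0 = -1.2, floor gives -2.
example_bin = quantized_projection([1.0, 2.0], [0.5, -1.0], b=0.3, w=1.0)

h = random_hash(dim=3, w=4.0, rng=random.Random(42))
print(example_bin, h([0.1, 0.2, 0.3]))
```

Once x and b are drawn, the function is deterministic: the same point always lands in the same bin.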
How to define the projection?
• k dot products: because $(P_1/P_2)^k > P_1/P_2$, they increase the ratio between the probabilities that points at different separations fall into the same quantization bin
• Perform k independent dot products
• Achieve success,
• if the query and the nearest neighbor are in the same bin in all k dot products
• Success probability = $P_1^k$; decreases as we include more dot products
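The trade-off can be made concrete with a short sketch. It uses P1 = 1/2 and P2 = 1/3 from the Euclidean example purely as sample values; `amplify` and `composite_key` are hypothetical helper names.

```python
def amplify(p1, p2, k):
    """Collision probabilities after requiring agreement on k independent hashes."""
    return p1 ** k, p2 ** k, (p1 / p2) ** k


def composite_key(v, hashes):
    """Concatenate k single-projection hashes into one bucket key."""
    return tuple(h(v) for h in hashes)


for k in (1, 2, 4, 8):
    near, far, ratio = amplify(0.5, 1.0 / 3.0, k)
    # near shrinks with k, but the near/far ratio grows.
    print(k, near, far, ratio)
```

Two points share a bucket only if their composite keys are equal, which is why the success probability drops to the k-th power while the separation between near and far pairs improves.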
Multiple-projections
• L independent projections
• True near neighbor will be unlikely to be unlucky in all the projections
• By increasing L,
• we can find the true nearest neighbor with arbitrarily high probability
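Putting k dot products and L independent projections together gives an index of L hash tables. The class below is a minimal sketch under assumed names and parameters (`LSHIndex`, `add`, `query`, the bin width and seed); it is one straightforward way to organize the tables, not the implementation used by the slides' author.

```python
import math
import random
from collections import defaultdict


class LSHIndex:
    """L hash tables, each keyed by k quantized random projections."""

    def __init__(self, dim, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        # One (x, b) pair per dot product: L tables times k projections.
        self.params = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, v, table_params):
        return tuple(
            math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / self.w)
            for x, b in table_params
        )

    def add(self, v):
        idx = len(self.points)
        self.points.append(v)
        for table, table_params in zip(self.tables, self.params):
            table[self._key(v, table_params)].append(idx)

    def query(self, q):
        """Return the index of the closest point sharing a bucket with q, or None."""
        candidates = set()
        for table, table_params in zip(self.tables, self.params):
            candidates.update(table.get(self._key(q, table_params), []))
        if not candidates:
            return None
        return min(candidates, key=lambda i: math.dist(q, self.points[i]))


index = LSHIndex(dim=2, k=2, L=8, w=1.0)
for point in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (9.0, 1.0)]:
    index.add(point)
nearest = index.query((0.0, 0.0))
print(nearest)
```

A query only compares against points that collide with it in at least one table, so a true near neighbor is missed only if it is unlucky in all L tables.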
Accuracy
• Two close points p and q,
• Separated by $u = \lVert p - q \rVert$
• Probability of collision $P_H(u)$,
$P_H(u) = P_H[H(p) = H(q)] = \int_0^w \frac{1}{u}\, f_s\!\left(\frac{t}{u}\right) \left(1 - \frac{t}{w}\right) dt$
$f_s$: probability density function of H
• As distance u increases, $P_H(u)$ decreases
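The integral can be evaluated numerically to see the decay with distance. This sketch assumes that $f_s$ is the density of the absolute value of a standard normal variable, which matches projections drawn from N(0, 1); the slides do not spell out $f_s$, so treat that choice and the function names as assumptions.

```python
import math


def abs_normal_pdf(t):
    """Density of |Z| for Z ~ N(0, 1); an assumed choice for f_s."""
    return 2.0 / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * t * t)


def collision_probability(u, w, steps=10000):
    """Midpoint-rule estimate of the integral of (1/u) f_s(t/u) (1 - t/w) on [0, w]."""
    dt = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * dt
        total += abs_normal_pdf(t / u) / u * (1.0 - t / w) * dt
    return total


# Collision probability at increasing separations u, with bin width w = 4.
probabilities = [collision_probability(u, w=4.0) for u in (0.5, 1.0, 2.0, 4.0, 8.0)]
print(probabilities)
```

The printed values fall as u grows, in line with the last bullet above.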
Time complexity
• For a query point q,
• To find the near neighbor: $T_g + T_c$
• Calculate and hash the projections ($T_g$)
• $O(DkL)$; D = dimension, kL projections
• Search the buckets for collisions ($T_c$)
• $O(DLN_c)$; D = dimension, L projections, and
• $N_c = \sum_{q' \in \mathcal{D}} p^k(\lVert q - q' \rVert)$, summed over the data set $\mathcal{D}$, is the expected number of collisions for a single projection
• Analyze
• $T_g$ increases as k and L increase
• $T_c$ decreases as k increases since $p^k < p$
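A rough cost model illustrates the opposing trends. The sketch below counts operations up to constant factors; `toy_p` is a hypothetical stand-in for the single-projection collision probability and is used only to produce numbers, and the data points are made up for the example.

```python
import math


def expected_collisions(q, data, k, p):
    """N_c: sum over data points of p(distance)**k for one k-projection table."""
    return sum(p(math.dist(q, other)) ** k for other in data)


def query_cost(q, data, k, L, p):
    """Return (T_g, T_c) up to constant factors, following O(DkL) and O(D L N_c)."""
    dim = len(q)
    t_g = dim * k * L
    t_c = dim * L * expected_collisions(q, data, k, p)
    return t_g, t_c


def toy_p(u, w=4.0):
    """Hypothetical collision probability that falls linearly with distance."""
    return max(0.0, 1.0 - u / w)


data = [(0.5, 0.0), (1.0, 1.0), (2.0, 0.0), (3.0, 3.0)]
costs = {k: query_cost((0.0, 0.0), data, k, L=10, p=toy_p) for k in (1, 2, 4)}
for k, (t_g, t_c) in costs.items():
    print(k, t_g, t_c)
```

Hashing cost rises linearly in k while the expected bucket-scanning cost shrinks, so k is usually chosen to balance the two terms.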
How many projections(L)?
• For query point p and neighbor q,
• For a single projection,
• Success probability of collision: $\ge P_1^k$
• For L projections,
• Failure probability of collision: $\le (1 - P_1^k)^L$
$\therefore (1 - P_1^k)^L = \delta$
$L = \dfrac{\log \delta}{\log(1 - P_1^k)}$
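The formula for L translates directly into code. This is a small sketch with an assumed function name (`tables_needed`) and sample values P1 = 0.5, k = 3, δ = 0.1; the result is rounded up because L must be a whole number of projections.

```python
import math


def tables_needed(p1, k, delta):
    """Smallest integer L with (1 - p1**k)**L <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))


L = tables_needed(p1=0.5, k=3, delta=0.1)
failure = (1.0 - 0.5 ** 3) ** L
print(L, failure)
```

With these sample values the per-table success probability is 0.125, and 18 tables bring the failure probability just below 0.1.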
LSH in MAXDIVREL Diversity
      #1  #2  #3  …  #k   (dot product)
  1    1   0   0  …   1
  2    0   1   1  …   1
  w    0   0   1  …   0

      #1  #2  #3  …  #k   (dot product)
  1    1   1   0  …   1
  2    1   0   1  …   1
  w    0   1   1  …   0

      #1  #2  #3  …  #k   (dot product)
  1    1   0   1  …   0
  2    0   0   1  …   0
  w    0   1   0  …   0

      #1  #2  #3  …  #k   (dot product)
  1    1   0   0  …   1
  2    0   1   1  …   1
  w    0   0   1  …   0

More Related Content

What's hot

Backtracking
Backtracking  Backtracking
Backtracking
Vikas Sharma
 
Fractional knapsack problem
Fractional knapsack problemFractional knapsack problem
Fractional knapsack problem
Learning Courses Online
 
Multi Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back PropagationMulti Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back Propagation
Sung-ju Kim
 
Graph coloring using backtracking
Graph coloring using backtrackingGraph coloring using backtracking
Graph coloring using backtracking
shashidharPapishetty
 
Faster R-CNN - PR012
Faster R-CNN - PR012Faster R-CNN - PR012
Faster R-CNN - PR012
Jinwon Lee
 
convex hull
convex hullconvex hull
convex hull
ravikirankalal
 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
Mahbubur Rahman Shimul
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms
Hakky St
 
LeNet-5
LeNet-5LeNet-5
LeNet-5
佳蓉 倪
 
Graph coloring
Graph coloringGraph coloring
Graph coloring
Rashika Ahuja
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
Mark Chang
 
Greedy algorithms
Greedy algorithmsGreedy algorithms
Greedy algorithms
sandeep54552
 
strassen matrix multiplication algorithm
strassen matrix multiplication algorithmstrassen matrix multiplication algorithm
strassen matrix multiplication algorithm
evil eye
 
MACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMMACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHM
Puneet Kulyana
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
sathish sak
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
Amit Kumar Rathi
 
Informed and Uninformed search Strategies
Informed and Uninformed search StrategiesInformed and Uninformed search Strategies
Informed and Uninformed search Strategies
Amey Kerkar
 
GraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDBGraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDB
ArangoDB Database
 
Graph coloring problem(DAA).pptx
Graph coloring problem(DAA).pptxGraph coloring problem(DAA).pptx
Graph coloring problem(DAA).pptx
Home
 
Chap4
Chap4Chap4
Chap4
nathanurag
 

What's hot (20)

Backtracking
Backtracking  Backtracking
Backtracking
 
Fractional knapsack problem
Fractional knapsack problemFractional knapsack problem
Fractional knapsack problem
 
Multi Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back PropagationMulti Layer Perceptron & Back Propagation
Multi Layer Perceptron & Back Propagation
 
Graph coloring using backtracking
Graph coloring using backtrackingGraph coloring using backtracking
Graph coloring using backtracking
 
Faster R-CNN - PR012
Faster R-CNN - PR012Faster R-CNN - PR012
Faster R-CNN - PR012
 
convex hull
convex hullconvex hull
convex hull
 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
 
An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms An overview of gradient descent optimization algorithms
An overview of gradient descent optimization algorithms
 
LeNet-5
LeNet-5LeNet-5
LeNet-5
 
Graph coloring
Graph coloringGraph coloring
Graph coloring
 
Variational Autoencoder
Variational AutoencoderVariational Autoencoder
Variational Autoencoder
 
Greedy algorithms
Greedy algorithmsGreedy algorithms
Greedy algorithms
 
strassen matrix multiplication algorithm
strassen matrix multiplication algorithmstrassen matrix multiplication algorithm
strassen matrix multiplication algorithm
 
MACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHMMACHINE LEARNING - GENETIC ALGORITHM
MACHINE LEARNING - GENETIC ALGORITHM
 
Classification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision TreesClassification: Basic Concepts and Decision Trees
Classification: Basic Concepts and Decision Trees
 
Dynamic programming
Dynamic programmingDynamic programming
Dynamic programming
 
Informed and Uninformed search Strategies
Informed and Uninformed search StrategiesInformed and Uninformed search Strategies
Informed and Uninformed search Strategies
 
GraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDBGraphSage vs Pinsage #InsideArangoDB
GraphSage vs Pinsage #InsideArangoDB
 
Graph coloring problem(DAA).pptx
Graph coloring problem(DAA).pptxGraph coloring problem(DAA).pptx
Graph coloring problem(DAA).pptx
 
Chap4
Chap4Chap4
Chap4
 

Similar to Locality sensitive hashing

Sketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignmentSketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignment
ssuser2be88c
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
MapR Technologies
 
Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit sat
ChenYiHuang5
 
Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)
Hwa Pyung Kim
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
Ted Dunning
 
Data Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxData Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptx
Subrata Kumer Paul
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
Ted Dunning
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28
Ted Dunning
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
Albert Y. C. Chen
 
SVD.ppt
SVD.pptSVD.ppt
SVD.ppt
cmpt cmpt
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNS
LiemNguyenDuy
 
cnn.pptx
cnn.pptxcnn.pptx
cnn.pptx
sghorai
 
Paris Data Geeks
Paris Data GeeksParis Data Geeks
Paris Data Geeks
MapR Technologies
 
A compact zero knowledge proof to restrict message space in homomorphic encry...
A compact zero knowledge proof to restrict message space in homomorphic encry...A compact zero knowledge proof to restrict message space in homomorphic encry...
A compact zero knowledge proof to restrict message space in homomorphic encry...
MITSUNARI Shigeo
 
A short introduction to Quantum Computing and Quantum Cryptography
A short introduction to Quantum Computing and Quantum CryptographyA short introduction to Quantum Computing and Quantum Cryptography
A short introduction to Quantum Computing and Quantum Cryptography
Facultad de Informática UCM
 
Bounded arithmetic in free logic
Bounded arithmetic in free logicBounded arithmetic in free logic
Bounded arithmetic in free logic
Yamagata Yoriyuki
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
Aritra Sarkar
 
Data Mining Lecture_9.pptx
Data Mining Lecture_9.pptxData Mining Lecture_9.pptx
Data Mining Lecture_9.pptx
Subrata Kumer Paul
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
MapR Technologies
 
Deep learning study 2
Deep learning study 2Deep learning study 2
Deep learning study 2
San Kim
 

Similar to Locality sensitive hashing (20)

Sketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignmentSketching and locality sensitive hashing for alignment
Sketching and locality sensitive hashing for alignment
 
Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford Fast Single-pass K-means Clusterting at Oxford
Fast Single-pass K-means Clusterting at Oxford
 
Paper study: Learning to solve circuit sat
Paper study: Learning to solve circuit satPaper study: Learning to solve circuit sat
Paper study: Learning to solve circuit sat
 
Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)Tutorial on Object Detection (Faster R-CNN)
Tutorial on Object Detection (Faster R-CNN)
 
Oxford 05-oct-2012
Oxford 05-oct-2012Oxford 05-oct-2012
Oxford 05-oct-2012
 
Data Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptxData Mining Lecture_10(b).pptx
Data Mining Lecture_10(b).pptx
 
ACM 2013-02-25
ACM 2013-02-25ACM 2013-02-25
ACM 2013-02-25
 
Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28Paris data-geeks-2013-03-28
Paris data-geeks-2013-03-28
 
Machine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional ManagersMachine Learning Foundations for Professional Managers
Machine Learning Foundations for Professional Managers
 
SVD.ppt
SVD.pptSVD.ppt
SVD.ppt
 
SPATIAL POINT PATTERNS
SPATIAL POINT PATTERNSSPATIAL POINT PATTERNS
SPATIAL POINT PATTERNS
 
cnn.pptx
cnn.pptxcnn.pptx
cnn.pptx
 
Paris Data Geeks
Paris Data GeeksParis Data Geeks
Paris Data Geeks
 
A compact zero knowledge proof to restrict message space in homomorphic encry...
A compact zero knowledge proof to restrict message space in homomorphic encry...A compact zero knowledge proof to restrict message space in homomorphic encry...
A compact zero knowledge proof to restrict message space in homomorphic encry...
 
A short introduction to Quantum Computing and Quantum Cryptography
A short introduction to Quantum Computing and Quantum CryptographyA short introduction to Quantum Computing and Quantum Cryptography
A short introduction to Quantum Computing and Quantum Cryptography
 
Bounded arithmetic in free logic
Bounded arithmetic in free logicBounded arithmetic in free logic
Bounded arithmetic in free logic
 
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
HiPEAC'19 Tutorial on Quantum algorithms using QX - 2019-01-23
 
Data Mining Lecture_9.pptx
Data Mining Lecture_9.pptxData Mining Lecture_9.pptx
Data Mining Lecture_9.pptx
 
Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25Clustering - ACM 2013 02-25
Clustering - ACM 2013 02-25
 
Deep learning study 2
Deep learning study 2Deep learning study 2
Deep learning study 2
 

More from Sameera Horawalavithana

Data-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and SimulationData-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and Simulation
Sameera Horawalavithana
 
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
 Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
Sameera Horawalavithana
 
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
 Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
Sameera Horawalavithana
 
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Sameera Horawalavithana
 
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHubMentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Sameera Horawalavithana
 
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
Sameera Horawalavithana
 
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
Sameera Horawalavithana
 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy Dataset
Sameera Horawalavithana
 
Dancing with Stream Processing
Dancing with Stream ProcessingDancing with Stream Processing
Dancing with Stream Processing
Sameera Horawalavithana
 
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation [ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
Sameera Horawalavithana
 
Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015
Sameera Horawalavithana
 
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
Sameera Horawalavithana
 
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
Sameera Horawalavithana
 
Zipf distribution
Zipf distributionZipf distribution
Zipf distribution
Sameera Horawalavithana
 
Query personalization
Query personalizationQuery personalization
Query personalization
Sameera Horawalavithana
 
Dancing with publish/subscribe
Dancing with publish/subscribeDancing with publish/subscribe
Dancing with publish/subscribe
Sameera Horawalavithana
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Sameera Horawalavithana
 

More from Sameera Horawalavithana (17)

Data-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and SimulationData-driven Studies on Social Networks: Privacy and Simulation
Data-driven Studies on Social Networks: Privacy and Simulation
 
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
 Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
Drivers of Polarized Discussions on Twitter during Venezuela Political Crisis
 
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
 Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
Twitter Is the Megaphone of Cross-platform Messaging on the White Helmets
 
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
Behind the Mask: Understanding the Structural Forces That Make Social Graphs ...
 
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHubMentions of Security Vulnerabilities on Reddit, Twitter and GitHub
Mentions of Security Vulnerabilities on Reddit, Twitter and GitHub
 
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
[MLNS | NetSci] A Generative/ Discriminative Approach to De-construct Cascadi...
 
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
[Compex Network 18] Diversity, Homophily, and the Risk of Node Re-identificat...
 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy Dataset
 
Dancing with Stream Processing
Dancing with Stream ProcessingDancing with Stream Processing
Dancing with Stream Processing
 
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation [ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
 
Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015Be Elastic: Leapset Innovation session 06-08-2015
Be Elastic: Leapset Innovation session 06-08-2015
 
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe ...
 
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
 
Zipf distribution
Zipf distributionZipf distribution
Zipf distribution
 
Query personalization
Query personalizationQuery personalization
Query personalization
 
Dancing with publish/subscribe
Dancing with publish/subscribeDancing with publish/subscribe
Dancing with publish/subscribe
 
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand StreamingTalk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
Talk on Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming
 

Recently uploaded

Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 

Recently uploaded (20)

Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 

Locality sensitive hashing

  • 2. Problem Statement • Given a query point q, • Find closest items to the query point with the probability of 1 − 𝛿 • Iterative methods? • Large volume of data • Curse of dimensionality
  • 3. Taxonomy – Near Neighbor Query (NN) • NN • Trees: K-d Tree, Range Tree, B Tree, Cover Tree • Grid • Voronoi Diagram • Hash • Approximate • LSH
  • 4. Approximate LSH • Simple Idea • if two points are close together, then after a “projection” operation these two points will remain close together
  • 5. LSH Requirement • For any given points $p, q \in \mathbb{R}^d$: • $P_H[h(p) = h(q)] \ge P_1$ for $\|p - q\| \le d_1$ • $P_H[h(p) = h(q)] \le P_2$ for $\|p - q\| \ge c\,d_1 = d_2$ • Hash function h is $(d_1, d_2, P_1, P_2)$-sensitive. Ideally we need • $(P_1 - P_2)$ to be large • $(d_1 - d_2)$ to be small
  • 6. [Figure: query point q with rings at distances d, 2d, and c·d; collision probabilities ordered P(1) ≥ P(2) ≥ P(c)]
  • 7. Probability vs. Distance on candidate pairs
  • 8. Hash Function (Random) • Locality-preserving • Independent • Deterministic • A family of hash functions for each distance measure • Euclidean • Jaccard • Cosine Similarity • Hamming
  • 9. LSH Family for Euclidean distance (2d) • When $d \cos\theta \le a$, • Chance of colliding • But not certain • But can guarantee, • If $d \le a/2$, • $90^\circ \ge \theta \ge 45^\circ$ to have $d \cos\theta \le a$ • ∴ $P_1 \ge 1/2$ • If $d \ge 2a$, • $90^\circ \ge \theta \ge 60^\circ$ to have $d \cos\theta \le a$ • ∴ $P_2 \le 1/3$ • As LSH $(d_1, d_2, P_1, P_2)$-sensitive • $(a/2, 2a, 1/2, 1/3)$
  • 10. How to define the projection? • Scalar projection (dot product): $h(\vec{v}) = \vec{v} \cdot \vec{x}$; $\vec{v}$ = query point in d-dimensional space; $\vec{x}$ = vector with random components from $N(0,1)$ • Quantized projection: $h(\vec{v}) = \left\lfloor \frac{\vec{v} \cdot \vec{x} + b}{w} \right\rfloor$; $w$ = width of quantization bin; $b$ = random variable uniformly distributed between 0 and $w$
  • 11. How to define the projection? • k dot products increase the ratio of the probabilities that points at different separations will fall into the same quantization bin: $(P_1/P_2)^k > P_1/P_2$ • Perform k independent dot products • Achieve success, • if the query and the nearest neighbor are in the same bin in all k dot products • Success probability = $P_1^k$; decreases as we include more dot products
  • 12. Multiple-projections • L independent projections • True near neighbor will be unlikely to be unlucky in all the projections • By increasing L, • we can find the true nearest neighbor with arbitrarily high probability
  • 13. Accuracy • Two close points p and q, • Separated by $u = \|p - q\|$ • Probability of collision $P_H(u)$: $P_H(u) = P_H[H(p) = H(q)] = \int_0^w \frac{1}{u}\, f_s\!\left(\frac{t}{u}\right) \left(1 - \frac{t}{w}\right) dt$ • $f_s$ = probability density function of H • As distance u increases, $P_H(u)$ decreases
  • 14. Time complexity • For a query point q, • To find the near neighbor: $(T_g + T_c)$ • Calculate & hash the projections ($T_g$) • $O(DkL)$; D = dimension, kL projections • Search the bucket for collisions ($T_c$) • $O(DLN_c)$; D = dimension, L projections, and • where $N_c = \sum_{q' \in D} p^k(\|q - q'\|)$; $N_c$ = expected number of collisions for a single projection • Analyze • $T_g$ increases as k & L increase • $T_c$ decreases as k increases since $p^k < p$
  • 15. How many projections (L)? • For query point p & neighbor q, • For a single projection, • Success probability of collisions: $\ge P_1^k$ • For L projections, • Failure probability of collisions: $\le (1 - P_1^k)^L$ • ∴ $(1 - P_1^k)^L = \delta$, so $L = \frac{\log \delta}{\log(1 - P_1^k)}$
  • 16. LSH in MAXDIVREL Diversity • Four binary tables, each with columns #1, #2, #3, …, #k (dot product) and rows 1, 2, …, w • Table 1: row 1 = 1 0 0 … 1; row 2 = 0 1 1 … 1; row w = 0 0 1 … 0 • Table 2: row 1 = 1 1 0 … 1; row 2 = 1 0 1 … 1; row w = 0 1 1 … 0 • Table 3: row 1 = 1 0 1 … 0; row 2 = 0 0 1 … 0; row w = 0 1 0 … 0 • Table 4: row 1 = 1 0 0 … 1; row 2 = 0 1 1 … 1; row w = 0 0 1 … 0
  • 17. REFERENCES [1] Anand Rajaraman and Jeff Ullman, “Chapter Three of ‘Mining of Massive Datasets,’” pp. 72–130. [2] M. Slaney and M. Casey, “Lecture Note: LSH,” 2008. [3] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey, “Streaming similarity search over one billion tweets using parallel locality-sensitive hashing,” Proc. VLDB Endow., vol. 6, no. 14, pp. 1930–1941, Sep. 2013.
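
The scheme on slides 10 to 12 (quantized random projections, k dot products per table, L independent tables) can be sketched in a few lines of Python. This is a minimal illustration and not the deck's implementation. It assumes the floor-quantized hash floor((v · x + b) / w). The class name `EuclideanLSH`, its parameters, and the `seed` argument are hypothetical names chosen for this example.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Toy LSH index: L tables, each keyed by k quantized random projections."""

    def __init__(self, dim, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        # For each of the L tables, draw k (x, b) pairs:
        # x has N(0, 1) components, b is uniform between 0 and w.
        self.projections = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = []

    def _key(self, table_index, v):
        # h(v) = floor((v . x + b) / w), concatenated over the k projections.
        return tuple(
            math.floor((sum(vi * xi for vi, xi in zip(v, x)) + b) / self.w)
            for x, b in self.projections[table_index]
        )

    def add(self, v):
        idx = len(self.points)
        self.points.append(v)
        for t, table in enumerate(self.tables):
            table[self._key(t, v)].append(idx)

    def query(self, q):
        # Collect the union of the L matching buckets, then do an exact
        # linear search over those candidates only.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, q), []))
        if not candidates:
            return None  # no collision in any table
        return min(candidates, key=lambda i: math.dist(self.points[i], q))


index = EuclideanLSH(dim=2, k=2, L=4, w=4.0)
for point in [[0.0, 0.0], [0.5, 0.2], [10.0, 10.0]]:
    index.add(point)
# An identical point hashes to the same key in every table, so this prints 0.
print(index.query([0.0, 0.0]))
```

A query that collides with nothing returns `None`. Raising L makes that outcome less likely for true near neighbors, at the cost of more hashing work per query.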

Editor's Notes

  1. A randomized algorithm does not guarantee an exact answer but instead provides a high-probability guarantee that it will return the correct answer or one close to it.
  2. Trees: O(log N), where N is the number of objects. When d is one-dimensional this is binary search, but multidimensional algorithms such as k-d trees break down when the dimensionality of the search space is greater than a few dimensions, degrading to O(N). Grid: close points should be in the same grid cell, but some can always lie across a boundary (no matter how close), and some may be further than one grid cell apart but still close. In high dimensions, the number of neighboring grid cells grows exponentially. One option is to randomly shift (and rotate) the grid and try again. Hash: O(1) search, with O(N) memory.
  3. Notice that we say nothing about what happens when the distance between the items is strictly between d1 and d2, but we can make d1 and d2 as close as we wish. The penalty is that typically p1 and p2 are then close as well. As we shall see, it is possible to drive p1 and p2 apart while keeping d1 and d2 fixed - according to a Chernoff-Hoeffding bound
  4. The probability that p and q collide under a random choice of hash function depends only on the distance between p and q.
  5. In fact, if the angle θ between the randomly chosen line and the line connecting the points is large, then there is an even greater chance that the two points will fall in the same bucket. For instance, if θ is 90 degrees, then the two points are certain to fall in the same bucket. However, suppose d is larger than a. In order for there to be any chance of the two points falling in the same bucket, we need d cos θ ≤ a
  6. Finding a good hash implementation, and analyzing the hash performance
  7. Increasing the quantization bucket width w will increase the number of points that fall into each bucket. To obtain our final nearest neighbor result we will have to perform a linear search through all the points that fall into the same bucket as the query, so varying w effects a trade-off between a larger table with a smaller final linear search, or a more compact table with more points to consider in the final search
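
Note 3 says that p1 and p2 can be driven apart while d1 and d2 stay fixed, and slide 15 gives L = log δ / log(1 − P1^k). The sketch below works through that arithmetic, assuming the single-projection probabilities 1/2 and 1/3 from slide 9. The function names `tables_needed` and `candidate_probability` are hypothetical, and k = 4 and δ = 0.1 are arbitrary example values.

```python
import math


def tables_needed(p1, k, delta):
    """Smallest integer L with (1 - p1**k)**L <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))


def candidate_probability(p, k, L):
    """Probability that a pair whose single-projection collision probability
    is p shares a bucket (all k projections agree) in at least one of L tables."""
    return 1.0 - (1.0 - p ** k) ** L


k, delta = 4, 0.1
L = tables_needed(0.5, k, delta)
near = candidate_probability(0.5, k, L)      # pairs within d1
far = candidate_probability(1.0 / 3.0, k, L)  # pairs beyond d2
print(L, round(near, 3), round(far, 3))
```

With these example values the near-pair probability rises to at least 1 − δ while the far-pair probability stays well below it, so the gap is wider than the original 1/2 versus 1/3.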