SlideShare a Scribd company logo
Variable Bit Quantisation for LSH
Variable Bit Quantisation for LSH
Sean Moran
Victor Lavrenko, Miles Osborne
School of Informatics
University of Edinburgh
ACL Sofia, August 2013
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 1/41
Variable Bit Quantisation for LSH
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
Locality Sensitive Hashing (LSH)
Variable Bit Quantisation
Evaluation
Summary
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 2/41
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
Problem: retrieve nearest neighbour(s) to a given query item
q?
y
x
1-NN
1-NN
q
x
y
Na¨ıve approach: compare query to all N database items
Scales linearly O(N) with the size of the database
Impractical for all but the smallest of databases
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 3/41
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
[1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11.
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 4/41
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
[1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11.
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 5/41
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
[1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11.
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 6/41
Variable Bit Quantisation for LSH
Hashing-based search using binary codes
Database
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 7/41
Variable Bit Quantisation for LSH
Hashing-based search using binary codes
110101
010111
H Database
Hash Table
010101
111101
.....
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 8/41
Variable Bit Quantisation for LSH
Hashing-based search using binary codes
110101
010111
H
H
Query
Database
Hash Table
010101
111101
.....
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 9/41
Variable Bit Quantisation for LSH
Hashing-based search using binary codes
110101
010111
H
H
Query
Database
Hash Table
010101
111101
.....
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 10/41
Variable Bit Quantisation for LSH
Hashing-based search using binary codes
110101
010111
H
H
Query Compute
Similarity
Nearest
Neighbours
Query
Database
Hash Table
010101
111101
.....
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 11/41
Variable Bit Quantisation for LSH
Why transform our data to binary codes?
Constant O(1) query time with respect to the dataset size.
Binary code comparison requires few machine instructions
(XOR followed by popcount).
Binary codes are extremely compact e.g. 16Mb to store 1
million, 128 bit encoded data points.
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 12/41
Variable Bit Quantisation for LSH
Applications
Recommendation
Near duplicate detection
Content based IR
Recommendation
Near Duplicate Detection
Noun Clustering
Location Recognition
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 13/41
Variable Bit Quantisation for LSH
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
Locality Sensitive Hashing (LSH)
Variable Bit Quantisation
Evaluation
Summary
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 14/41
Variable Bit Quantisation for LSH
Computing similarity preserving binary codes
Locality Sensitive Hashing (LSH):
Randomised algorithm for approximate nearest neighbour
search
Generates similarity preserving binary codes for a dataset
Preserved similarity depends on the selected hash family
Probabilistic guarantee on accuracy versus search time
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 15/41
Variable Bit Quantisation for LSH
Locality Sensitive Hashing
x
y
n2
n1
h1
h2
11
0100
10
h1(p) (p)h2
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 16/41
Variable Bit Quantisation for LSH
Low Dimensional Projection
n2
n1
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 17/41
Variable Bit Quantisation for LSH
Single Bit Quantisation (SBQ)
0 1
t
n2
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 18/41
Variable Bit Quantisation for LSH
Plus many more...
Kernel methods [1]
Spectral methods [2] [3]
Neural networks [4]
Loss based methods [5]
All use single bit quantisation (SBQ)...
[1] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS ’09.
[2] Y. Weiss and A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08.
[3] J. Wang and S. Kumar and SF. Chang. Semi-supervised hashing for large-scale search. PAMI ’12.
[4] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08.
[5] B. Kulis and T. Darrell. Learning to Hash with Binary Reconstructive Embeddings. NIPS ’09.
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 19/41
Variable Bit Quantisation for LSH
Problem 1: SBQ leads to high quantisation errors
Highest point density typically occurs around zero:
−1.5 −1 −0.5 0 0.5 1 1.5
0
1000
2000
3000
4000
5000
6000
Projected Value
Count
Closer points can have a greater Hamming distance than
distant points:
0 1
t
n2
Same codeDifferent code
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 20/41
Variable Bit Quantisation for LSH
Problem 2: some hyperplanes are better than others
n1
n2Projected Dimension 2
Projected Dimension 1
x
y
n2
n1
h1
h2
11
0100
10
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 21/41
Variable Bit Quantisation for LSH
Problem 2: some hyperplanes are better than others
n1
n2Projected Dimension 2
Projected Dimension 1
x
y
n2
n1
h1
h2
11
0100
10
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 22/41
Variable Bit Quantisation for LSH
Problem 2: some hyperplanes are better than others
n1
n2Projected Dimension 2
Projected Dimension 1
x
y
n2
n1
h1
h2
11
0100
10
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 23/41
Variable Bit Quantisation for LSH
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
Locality Sensitive Hashing (LSH)
Variable Bit Quantisation
Evaluation
Summary
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 24/41
Variable Bit Quantisation for LSH
Our Solution: Variable Bit Quantisation
Threshold
Optimisation
Variable Bit
Allocation
Variable Bit Quantisation
Continuous
data
Binary
string
Multiple Bit Encoding
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 25/41
Variable Bit Quantisation for LSH
Our Solution: Variable Bit Quantisation
Threshold
Optimisation
Variable Bit
Allocation
Variable Bit Quantisation
Continuous
data
Binary
string
Multiple Bit Encoding
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 26/41
Variable Bit Quantisation for LSH
Multiple Bit Encoding: Natural Binary Code
0 1
t
n2
00 01 10 11
t1 t2 t3
n2
1 bit
2 bits
000 001 010
t2 t4 t6
n2
3 bits
etc...
t1 t3 t5 t7
011 111101 110100
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 27/41
Variable Bit Quantisation for LSH
Multiple Bit Encoding [1]
00 01 10 11
t1 t2 t3
n2
2 bits
Decimal distance = 3-1 = 2
Decimal distance = 2-0 = 2
[1] W. Kong and W. Li and M. Guo. Manhattan hashing for large-scale image
retrieval. SIGIR ’12.
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 28/41
Variable Bit Quantisation for LSH
Our Solution: Variable Bit Quantisation
Threshold
Optimisation
Variable Bit
Allocation
Variable Bit Quantisation
Continuous
data
Binary
string
Multiple Bit Encoding
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 29/41
Variable Bit Quantisation for LSH
Threshold Optimisation
Multiple bits per hyperplane requires multiple thresholds.
F1-based optimisation using pairwise constraints matrix S:
TP: # Sij = 1 pairs in the same region. FP: # Sij = 0 pairs
in the same region. FN: # Sij = 1 pairs in different regions.
F1 score: 1.00
00 01 10 11
t1 t2 t3
n2
F1 score: 0.44
00 01 10 11
t1 t2 t3
n2
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 30/41
Variable Bit Quantisation for LSH
Our Solution: Variable Bit Quantisation
Threshold
Optimisation
Variable Bit
Allocation
Variable Bit Quantisation
Continuous
data
Binary
string
Multiple Bit Encoding
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 31/41
Variable Bit Quantisation for LSH
Variable Bit Allocation
F1 score is a measure of the neighbourhood preserving quality
of a hyperplane:
F1 score: 1.00
F1 score: 0.25
00 01 10 11
t1 t2 t3
0 bits assigned: 0 thresholds do just as well as one or more thresholds
n1
n2
2 bits assigned: 3 thresholds perfectly preserve the neighbourhood structure
Compute bit allocation that maximises the cumulative F1
score across all hyperplanes subject to a bit budget B.
Bit allocation solved as a binary integer linear program (BILP).
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 32/41
Variable Bit Quantisation for LSH
Variable Bit Allocation
max F ◦ Z
subject to Zh = 1 h ∈ {1 . . . B}
Z ◦ D ≤ B
Z is binary
F contains the F1 scores per hyperplane, per bit count
Z is an indicator matrix specifying the bit allocation
D is a constraint matrix
B is the bit budget
. denotes the Frobenius L1 norm
◦ the Hadamard product
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 33/41
Variable Bit Quantisation for LSH
Variable Bit Allocation
max F ◦ Z
subject to Zh = 1 h ∈ {1 . . . B}
Z ◦ D ≤ B
Z is binary


F h1 h2
b0 0.25 0.25
b1 0.35 0.50
b2 0.40 1.00




D
0 0
1 1
2 2




Z
1 0
0 0
0 1


Sparse solution possible: lower quality hyperplanes can be
discarded.
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 34/41
Variable Bit Quantisation for LSH
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
Locality Sensitive Hashing (LSH)
Variable Bit Quantisation
Evaluation
Summary
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 35/41
Variable Bit Quantisation for LSH
Evaluation Protocol
Task: Text and image retrieval on three standard datasets:
CIFAR-10, TDT-2 and Reuters-21578.
Projections: LSH [1], Shift-invariant kernel hashing (SIKH)
[2], Binary LSI (BLSI) [3], Spectral Hashing (SH) [4] and
PCA-Hashing (PCAH) [5].
Baselines: Single Bit Quantisation (SBQ), Manhattan
Hashing (MQ)[6], Double-Bit quantisation (DBQ) [7].
Hamming Ranking: how well do we retrieve the −NN of
query points?
[1] P. Indyk and R. Motwani. Approximate nearest neighbors: removing the curse of dimensionality. In STOC ’98.
[2] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS ’09.
[3] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08.
[4] Y. Weiss and A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08.
[5] J. Wang and S. Kumar and SF. Chang. Semi-supervised hashing for large-scale search. PAMI ’12.
[6] W. Kong and W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12.
[7] W. Kong and W. Li. Double Bit Quantisation for Hashing. AAAI ’12.
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 36/41
Variable Bit Quantisation for LSH
AUPRC across different projection methods
Dataset CIFAR-10 (32 bits) Reuters-21578 (128 bits)
SBQ MQ DBQ VBQ SBQ MQ DBQ VBQ
SIKH 0.042 0.046 0.047 0.161 0.102 0.112 0.087 0.389
LSH 0.119 0.091 0.066 0.207 0.276 0.201 0.175 0.538
BLSI 0.038 0.135 0.111 0.231 0.100 0.030 0.030 0.156
SH 0.051 0.144 0.111 0.202 0.033 0.028 0.030 0.154
PCAH 0.036 0.132 0.107 0.219 0.095 0.034 0.027 0.154
VBQ is an effective quantisation scheme for both image and
text datasets.
VBQ and a cheap projection (e.g. LSH) can outperform SBQ
and an expensive projection (e.g. PCA).
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 37/41
Variable Bit Quantisation for LSH
AUPRC for LSH across a broad bit range
32 48 64 96 128 256
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
SBQ MQ VBQ
Number of Bits
AUPRC
8 16 24 32 40 48 56 64
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
SBQ MQ VBQ
Number of Bits
AUPRC
(a) Reuters-21578 (b) CIFAR-10
VBQ is effective across a wide range of bits.
See paper for further results.
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 38/41
Variable Bit Quantisation for LSH
Variable Bit Quantisation for LSH
Fast search in large-scale datasets
Locality Sensitive Hashing (LSH)
Variable Bit Quantisation
Evaluation
Summary
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 39/41
Variable Bit Quantisation for LSH
Summary
Proposed a general data-driven technique to adaptively assign
variable bits per LSH hyperplane
Hyperplanes better preserving the neighbourhood structure
are afforded more bits from the bit budget
VBQ substantially increased retrieval performance across
standard text and image datasets
Future work: evaluate with an LSH system that uses hash
tables for fast retrieval
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 40/41
Variable Bit Quantisation for LSH
Thank you for your attention
Sean Moran
sean.moran@ed.ac.uk
www.seanjmoran.com
Sean Moran
Victor Lavrenko, Miles Osborne
ACL Sofia, August 2013 41/41

More Related Content

Similar to ACL Variable Bit Quantisation Talk

PhD thesis defence slides
PhD thesis defence slidesPhD thesis defence slides
PhD thesis defence slides
Sean Moran
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
xlight
 
2015 agbt sv_poster_justin
2015 agbt sv_poster_justin2015 agbt sv_poster_justin
2015 agbt sv_poster_justin
GenomeInABottle
 
Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...
Universitat Politècnica de Catalunya
 
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Thomas Keane
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Seonghyun Kim
 
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
Venkat Projects
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
Dimitris Kontokostas
 
Utilise Multipath Propagation to Improve the performance of BCH and RS Codes
Utilise Multipath Propagation to Improve the performance of BCH and RS CodesUtilise Multipath Propagation to Improve the performance of BCH and RS Codes
Utilise Multipath Propagation to Improve the performance of BCH and RS Codes
ALYAA AL-BARRAK
 

Similar to ACL Variable Bit Quantisation Talk (9)

PhD thesis defence slides
PhD thesis defence slidesPhD thesis defence slides
PhD thesis defence slides
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
 
2015 agbt sv_poster_justin
2015 agbt sv_poster_justin2015 agbt sv_poster_justin
2015 agbt sv_poster_justin
 
Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...Faster R-CNN: Towards real-time object detection with region proposal network...
Faster R-CNN: Towards real-time object detection with region proposal network...
 
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
4.LOCAL DYNAMIC NEIGHBORHOOD BASED OUTLIER DETECTION APPROACH AND ITS FRAMEWO...
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Utilise Multipath Propagation to Improve the performance of BCH and RS Codes
Utilise Multipath Propagation to Improve the performance of BCH and RS CodesUtilise Multipath Propagation to Improve the performance of BCH and RS Codes
Utilise Multipath Propagation to Improve the performance of BCH and RS Codes
 

Recently uploaded

Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
Zilliz
 
Uncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in LibrariesUncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in Libraries
Brian Pichman
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Zilliz
 
Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3
DianaGray10
 
Tailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer InsightsTailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer Insights
SynapseIndia
 
Mastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for SuccessMastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for Success
David Wilson
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
Zilliz
 
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
alexjohnson7307
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Nicolás Lopéz
 
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
alexjohnson7307
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Zilliz
 
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
Bhajan Mehta
 
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptxMAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
janagijoythi
 
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
FIDO Alliance
 
COVID-19 and the Level of Cloud Computing Adoption: A Study of Sri Lankan Inf...
COVID-19 and the Level of Cloud Computing Adoption: A Study of Sri Lankan Inf...COVID-19 and the Level of Cloud Computing Adoption: A Study of Sri Lankan Inf...
COVID-19 and the Level of Cloud Computing Adoption: A Study of Sri Lankan Inf...
AimanAthambawa1
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
Stephanie Beckett
 
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
DianaGray10
 
Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
ldtexsolbl
 
Sonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdfSonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdf
SubhamMandal40
 
Improving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning ContentImproving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning Content
Enterprise Knowledge
 

Recently uploaded (20)

Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
 
Uncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in LibrariesUncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in Libraries
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
 
Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3Communications Mining Series - Zero to Hero - Session 3
Communications Mining Series - Zero to Hero - Session 3
 
Tailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer InsightsTailored CRM Software Development for Enhanced Customer Insights
Tailored CRM Software Development for Enhanced Customer Insights
 
Mastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for SuccessMastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for Success
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
 
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
leewayhertz.com-AI agents for healthcare Applications benefits and implementa...
 
Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024Vertex AI Agent Builder - GDG Alicante - Julio 2024
Vertex AI Agent Builder - GDG Alicante - Julio 2024
 
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
 
Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17Mule Experience Hub and Release Channel with Java 17
Mule Experience Hub and Release Channel with Java 17
 
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptxMAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
MAKE MONEY ONLINE Unlock Your Income Potential Today.pptx
 
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
UX Webinar Series: Essentials for Adopting Passkeys as the Foundation of your...
 
COVID-19 and the Level of Cloud Computing Adoption: A Study of Sri Lankan Inf...
COVID-19 and the Level of Cloud Computing Adoption: A Study of Sri Lankan Inf...COVID-19 and the Level of Cloud Computing Adoption: A Study of Sri Lankan Inf...
COVID-19 and the Level of Cloud Computing Adoption: A Study of Sri Lankan Inf...
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
 
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
 
Types of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technologyTypes of Weaving loom machine & it's technology
Types of Weaving loom machine & it's technology
 
Sonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdfSonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdf
 
Improving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning ContentImproving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning Content
 

ACL Variable Bit Quantisation Talk

  • 1. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Sean Moran Victor Lavrenko, Miles Osborne School of Informatics University of Edinburgh ACL Sofia, August 2013 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 1/41
  • 2. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 2/41
  • 3. Variable Bit Quantisation for LSH Fast search in large-scale datasets Problem: retrieve nearest neighbour(s) to a given query item q? y x 1-NN 1-NN q x y Na¨ıve approach: compare query to all N database items Scales linearly O(N) with the size of the database Impractical for all but the smallest of databases Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 3/41
  • 4. Variable Bit Quantisation for LSH Fast search in large-scale datasets [1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 4/41
  • 5. Variable Bit Quantisation for LSH Fast search in large-scale datasets [1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 5/41
  • 6. Variable Bit Quantisation for LSH Fast search in large-scale datasets [1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 6/41
  • 7. Variable Bit Quantisation for LSH Hashing-based search using binary codes Database Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 7/41
  • 8. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 8/41
  • 9. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H H Query Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 9/41
  • 10. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H H Query Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 10/41
  • 11. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H H Query Compute Similarity Nearest Neighbours Query Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 11/41
  • 12. Variable Bit Quantisation for LSH Why transform our data to binary codes? Constant O(1) query time with respect to the dataset size. Binary code comparison requires few machine instructions (XOR followed by popcount). Binary codes are extremely compact e.g. 16Mb to store 1 million, 128 bit encoded data points. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 12/41
  • 13. Variable Bit Quantisation for LSH Applications Recommendation Near duplicate detection Content based IR Recommendation Near Duplicate Detection Noun Clustering Location Recognition Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 13/41
  • 14. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 14/41
  • 15. Variable Bit Quantisation for LSH Computing similarity preserving binary codes Locality Sensitive Hashing (LSH): Randomised algorithm for approximate nearest neighbour search Generates similarity preserving binary codes for a dataset Preserved similarity depends on the selected hash family Probabilistic guarantee on accuracy versus search time Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 15/41
  • 16. Variable Bit Quantisation for LSH Locality Sensitive Hashing x y n2 n1 h1 h2 11 0100 10 h1(p) (p)h2 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 16/41
  • 17. Variable Bit Quantisation for LSH Low Dimensional Projection n2 n1 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 17/41
  • 18. Variable Bit Quantisation for LSH Single Bit Quantisation (SBQ) 0 1 t n2 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 18/41
  • 19. Variable Bit Quantisation for LSH Plus many more... Kernel methods [1] Spectral methods [2] [3] Neural networks [4] Loss based methods [5] All use single bit quantisation (SBQ)... [1] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS ’09. [2] Y. Weiss and A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08. [3] J. Wang and S. Kumar and SF. Chang. Semi-supervised hashing for large-scale search. PAMI ’12. [4] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08. [5] B. Kulis and T. Darrell. Learning to Hash with Binary Reconstructive Embeddings. NIPS ’09. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 19/41
  • 20. Variable Bit Quantisation for LSH Problem 1: SBQ leads to high quantisation errors Highest point density typically occurs around zero: −1.5 −1 −0.5 0 0.5 1 1.5 0 1000 2000 3000 4000 5000 6000 Projected Value Count Closer points can have a greater Hamming distance than distant points: 0 1 t n2 Same codeDifferent code Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 20/41
  • 21. Variable Bit Quantisation for LSH Problem 2: some hyperplanes are better than others n1 n2Projected Dimension 2 Projected Dimension 1 x y n2 n1 h1 h2 11 0100 10 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 21/41
  • 22. Variable Bit Quantisation for LSH Problem 2: some hyperplanes are better than others n1 n2Projected Dimension 2 Projected Dimension 1 x y n2 n1 h1 h2 11 0100 10 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 22/41
  • 23. Variable Bit Quantisation for LSH Problem 2: some hyperplanes are better than others n1 n2Projected Dimension 2 Projected Dimension 1 x y n2 n1 h1 h2 11 0100 10 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 23/41
  • 24. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 24/41
  • 25. Variable Bit Quantisation for LSH Our Solution: Variable Bit Quantisation Threshold Optimisation Variable Bit Allocation Variable Bit Quantisation Continuous data Binary string Multiple Bit Encoding Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 25/41
  • 26. Variable Bit Quantisation for LSH Our Solution: Variable Bit Quantisation Threshold Optimisation Variable Bit Allocation Variable Bit Quantisation Continuous data Binary string Multiple Bit Encoding Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 26/41
  • 27. Variable Bit Quantisation for LSH Multiple Bit Encoding: Natural Binary Code 0 1 t n2 00 01 10 11 t1 t2 t3 n2 1 bit 2 bits 000 001 010 t2 t4 t6 n2 3 bits etc... t1 t3 t5 t7 011 111101 110100 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 27/41
  • 28. Variable Bit Quantisation for LSH Multiple Bit Encoding [1] 00 01 10 11 t1 t2 t3 n2 2 bits Decimal distance = 3-1 = 2 Decimal distance = 2-0 = 2 [1] W. Kong and W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 28/41
  • 29. Variable Bit Quantisation for LSH Our Solution: Variable Bit Quantisation Threshold Optimisation Variable Bit Allocation Variable Bit Quantisation Continuous data Binary string Multiple Bit Encoding Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 29/41
  • 30. Variable Bit Quantisation for LSH Threshold Optimisation Multiple bits per hyperplane requires multiple thresholds. F1-based optimisation using pairwise constraints matrix S: TP: # Sij = 1 pairs in the same region. FP: # Sij = 0 pairs in the same region. FN: # Sij = 1 pairs in different regions. F1 score: 1.00 00 01 10 11 t1 t2 t3 n2 F1 score: 0.44 00 01 10 11 t1 t2 t3 n2 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 30/41
  • 31. Variable Bit Quantisation for LSH Our Solution: Variable Bit Quantisation Threshold Optimisation Variable Bit Allocation Variable Bit Quantisation Continuous data Binary string Multiple Bit Encoding Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 31/41
  • 32. Variable Bit Quantisation for LSH Variable Bit Allocation F1 score is a measure of the neighbourhood preserving quality of a hyperplane: F1 score: 1.00 F1 score: 0.25 00 01 10 11 t1 t2 t3 0 bits assigned: 0 thresholds do just as well as one or more thresholds n1 n2 2 bits assigned: 3 thresholds perfectly preserve the neighbourhood structure Compute bit allocation that maximises the cumulative F1 score across all hyperplanes subject to a bit budget B. Bit allocation solved as a binary integer linear program (BILP). Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 32/41
  • 33. Variable Bit Quantisation for LSH Variable Bit Allocation max F ◦ Z subject to Zh = 1 h ∈ {1 . . . B} Z ◦ D ≤ B Z is binary F contains the F1 scores per hyperplane, per bit count Z is an indicator matrix specifying the bit allocation D is a constraint matrix B is the bit budget . denotes the Frobenius L1 norm ◦ the Hadamard product Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 33/41
  • 34. Variable Bit Quantisation for LSH Variable Bit Allocation max F ◦ Z subject to Zh = 1 h ∈ {1 . . . B} Z ◦ D ≤ B Z is binary   F h1 h2 b0 0.25 0.25 b1 0.35 0.50 b2 0.40 1.00     D 0 0 1 1 2 2     Z 1 0 0 0 0 1   Sparse solution possible: lower quality hyperplanes can be discarded. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 34/41
  • 35. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 35/41
  • 36. Variable Bit Quantisation for LSH Evaluation Protocol Task: Text and image retrieval on three standard datasets: CIFAR-10, TDT-2 and Reuters-21578. Projections: LSH [1], Shift-invariant kernel hashing (SIKH) [2], Binary LSI (BLSI) [3], Spectral Hashing (SH) [4] and PCA-Hashing (PCAH) [5]. Baselines: Single Bit Quantisation (SBQ), Manhattan Hashing (MQ)[6], Double-Bit quantisation (DBQ) [7]. Hamming Ranking: how well do we retrieve the −NN of query points? [1] P. Indyk and R. Motwani. Approximate nearest neighbors: removing the curse of dimensionality. In STOC ’98. [2] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS ’09. [3] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08. [4] Y. Weiss and A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08. [5] J. Wang and S. Kumar and SF. Chang. Semi-supervised hashing for large-scale search. PAMI ’12. [6] W. Kong and W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12. [7] W. Kong and W. Li. Double Bit Quantisation for Hashing. AAAI ’12. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 36/41
  • 37. Variable Bit Quantisation for LSH AUPRC across different projection methods Dataset CIFAR-10 (32 bits) Reuters-21578 (128 bits) SBQ MQ DBQ VBQ SBQ MQ DBQ VBQ SIKH 0.042 0.046 0.047 0.161 0.102 0.112 0.087 0.389 LSH 0.119 0.091 0.066 0.207 0.276 0.201 0.175 0.538 BLSI 0.038 0.135 0.111 0.231 0.100 0.030 0.030 0.156 SH 0.051 0.144 0.111 0.202 0.033 0.028 0.030 0.154 PCAH 0.036 0.132 0.107 0.219 0.095 0.034 0.027 0.154 VBQ is an effective quantisation scheme for both image and text datasets. VBQ and a cheap projection (e.g. LSH) can outperform SBQ and an expensive projection (e.g. PCA). Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 37/41
  • 38. Variable Bit Quantisation for LSH AUPRC for LSH across a broad bit range 32 48 64 96 128 256 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 SBQ MQ VBQ Number of Bits AUPRC 8 16 24 32 40 48 56 64 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 SBQ MQ VBQ Number of Bits AUPRC (a) Reuters-21578 (b) CIFAR-10 VBQ is effective across a wide range of bits. See paper for further results. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 38/41
  • 39. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 39/41
  • 40. Variable Bit Quantisation for LSH Summary Proposed a general data-driven technique to adaptively assign variable bits per LSH hyperplane Hyperplanes better preserving the neighbourhood structure are afforded more bits from the bit budget VBQ substantially increased retrieval performance across standard text and image datasets Future work: evaluate with an LSH system that uses hash tables for fast retrieval Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 40/41
  • 41. Variable Bit Quantisation for LSH Thank you for your attention Sean Moran sean.moran@ed.ac.uk www.seanjmoran.com Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 41/41