ACL Variable Bit Quantisation Talk

523 views

Published on

ACL Variable Bit Quantisation Talk

Published in: Technology, Education
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
523
On SlideShare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
7
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

ACL Variable Bit Quantisation Talk

  1. 1. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Sean Moran Victor Lavrenko, Miles Osborne School of Informatics University of Edinburgh ACL Sofia, August 2013 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 1/41
  2. 2. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 2/41
  3. 3. Variable Bit Quantisation for LSH Fast search in large-scale datasets Problem: retrieve nearest neighbour(s) to a given query item q? y x 1-NN 1-NN q x y Na¨ıve approach: compare query to all N database items Scales linearly O(N) with the size of the database Impractical for all but the smallest of databases Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 3/41
  4. 4. Variable Bit Quantisation for LSH Fast search in large-scale datasets [1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 4/41
  5. 5. Variable Bit Quantisation for LSH Fast search in large-scale datasets [1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 5/41
  6. 6. Variable Bit Quantisation for LSH Fast search in large-scale datasets [1] M. Norouzi and D. Fleet. Minimal Loss Hashing. ICML ’11. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 6/41
  7. 7. Variable Bit Quantisation for LSH Hashing-based search using binary codes Database Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 7/41
  8. 8. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 8/41
  9. 9. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H H Query Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 9/41
  10. 10. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H H Query Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 10/41
  11. 11. Variable Bit Quantisation for LSH Hashing-based search using binary codes 110101 010111 H H Query Compute Similarity Nearest Neighbours Query Database Hash Table 010101 111101 ..... Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 11/41
  12. 12. Variable Bit Quantisation for LSH Why transform our data to binary codes? Constant O(1) query time with respect to the dataset size. Binary code comparison requires few machine instructions (XOR followed by popcount). Binary codes are extremely compact e.g. 16Mb to store 1 million, 128 bit encoded data points. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 12/41
  13. 13. Variable Bit Quantisation for LSH Applications Recommendation Near duplicate detection Content based IR Recommendation Near Duplicate Detection Noun Clustering Location Recognition Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 13/41
  14. 14. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 14/41
  15. 15. Variable Bit Quantisation for LSH Computing similarity preserving binary codes Locality Sensitive Hashing (LSH): Randomised algorithm for approximate nearest neighbour search Generates similarity preserving binary codes for a dataset Preserved similarity depends on the selected hash family Probabilistic guarantee on accuracy versus search time Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 15/41
  16. 16. Variable Bit Quantisation for LSH Locality Sensitive Hashing x y n2 n1 h1 h2 11 0100 10 h1(p) (p)h2 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 16/41
  17. 17. Variable Bit Quantisation for LSH Low Dimensional Projection n2 n1 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 17/41
  18. 18. Variable Bit Quantisation for LSH Single Bit Quantisation (SBQ) 0 1 t n2 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 18/41
  19. 19. Variable Bit Quantisation for LSH Plus many more... Kernel methods [1] Spectral methods [2] [3] Neural networks [4] Loss based methods [5] All use single bit quantisation (SBQ)... [1] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS ’09. [2] Y. Weiss and A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08. [3] J. Wang and S. Kumar and SF. Chang. Semi-supervised hashing for large-scale search. PAMI ’12. [4] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08. [5] B. Kulis and T. Darrell. Learning to Hash with Binary Reconstructive Embeddings. NIPS ’09. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 19/41
  20. 20. Variable Bit Quantisation for LSH Problem 1: SBQ leads to high quantisation errors Highest point density typically occurs around zero: −1.5 −1 −0.5 0 0.5 1 1.5 0 1000 2000 3000 4000 5000 6000 Projected Value Count Closer points can have a greater Hamming distance than distant points: 0 1 t n2 Same codeDifferent code Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 20/41
  21. 21. Variable Bit Quantisation for LSH Problem 2: some hyperplanes are better than others n1 n2Projected Dimension 2 Projected Dimension 1 x y n2 n1 h1 h2 11 0100 10 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 21/41
  22. 22. Variable Bit Quantisation for LSH Problem 2: some hyperplanes are better than others n1 n2Projected Dimension 2 Projected Dimension 1 x y n2 n1 h1 h2 11 0100 10 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 22/41
  23. 23. Variable Bit Quantisation for LSH Problem 2: some hyperplanes are better than others n1 n2Projected Dimension 2 Projected Dimension 1 x y n2 n1 h1 h2 11 0100 10 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 23/41
  24. 24. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 24/41
  25. 25. Variable Bit Quantisation for LSH Our Solution: Variable Bit Quantisation Threshold Optimisation Variable Bit Allocation Variable Bit Quantisation Continuous data Binary string Multiple Bit Encoding Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 25/41
  26. 26. Variable Bit Quantisation for LSH Our Solution: Variable Bit Quantisation Threshold Optimisation Variable Bit Allocation Variable Bit Quantisation Continuous data Binary string Multiple Bit Encoding Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 26/41
  27. 27. Variable Bit Quantisation for LSH Multiple Bit Encoding: Natural Binary Code 0 1 t n2 00 01 10 11 t1 t2 t3 n2 1 bit 2 bits 000 001 010 t2 t4 t6 n2 3 bits etc... t1 t3 t5 t7 011 111101 110100 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 27/41
  28. 28. Variable Bit Quantisation for LSH Multiple Bit Encoding [1] 00 01 10 11 t1 t2 t3 n2 2 bits Decimal distance = 3-1 = 2 Decimal distance = 2-0 = 2 [1] W. Kong and W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 28/41
  29. 29. Variable Bit Quantisation for LSH Our Solution: Variable Bit Quantisation Threshold Optimisation Variable Bit Allocation Variable Bit Quantisation Continuous data Binary string Multiple Bit Encoding Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 29/41
  30. 30. Variable Bit Quantisation for LSH Threshold Optimisation Multiple bits per hyperplane requires multiple thresholds. F1-based optimisation using pairwise constraints matrix S: TP: # Sij = 1 pairs in the same region. FP: # Sij = 0 pairs in the same region. FN: # Sij = 1 pairs in different regions. F1 score: 1.00 00 01 10 11 t1 t2 t3 n2 F1 score: 0.44 00 01 10 11 t1 t2 t3 n2 Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 30/41
  31. 31. Variable Bit Quantisation for LSH Our Solution: Variable Bit Quantisation Threshold Optimisation Variable Bit Allocation Variable Bit Quantisation Continuous data Binary string Multiple Bit Encoding Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 31/41
  32. 32. Variable Bit Quantisation for LSH Variable Bit Allocation F1 score is a measure of the neighbourhood preserving quality of a hyperplane: F1 score: 1.00 F1 score: 0.25 00 01 10 11 t1 t2 t3 0 bits assigned: 0 thresholds do just as well as one or more thresholds n1 n2 2 bits assigned: 3 thresholds perfectly preserve the neighbourhood structure Compute bit allocation that maximises the cumulative F1 score across all hyperplanes subject to a bit budget B. Bit allocation solved as a binary integer linear program (BILP). Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 32/41
  33. 33. Variable Bit Quantisation for LSH Variable Bit Allocation max F ◦ Z subject to Zh = 1 h ∈ {1 . . . B} Z ◦ D ≤ B Z is binary F contains the F1 scores per hyperplane, per bit count Z is an indicator matrix specifying the bit allocation D is a constraint matrix B is the bit budget . denotes the Frobenius L1 norm ◦ the Hadamard product Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 33/41
  34. 34. Variable Bit Quantisation for LSH Variable Bit Allocation max F ◦ Z subject to Zh = 1 h ∈ {1 . . . B} Z ◦ D ≤ B Z is binary   F h1 h2 b0 0.25 0.25 b1 0.35 0.50 b2 0.40 1.00     D 0 0 1 1 2 2     Z 1 0 0 0 0 1   Sparse solution possible: lower quality hyperplanes can be discarded. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 34/41
  35. 35. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 35/41
  36. 36. Variable Bit Quantisation for LSH Evaluation Protocol Task: Text and image retrieval on three standard datasets: CIFAR-10, TDT-2 and Reuters-21578. Projections: LSH [1], Shift-invariant kernel hashing (SIKH) [2], Binary LSI (BLSI) [3], Spectral Hashing (SH) [4] and PCA-Hashing (PCAH) [5]. Baselines: Single Bit Quantisation (SBQ), Manhattan Hashing (MQ)[6], Double-Bit quantisation (DBQ) [7]. Hamming Ranking: how well do we retrieve the −NN of query points? [1] P. Indyk and R. Motwani. Approximate nearest neighbors: removing the curse of dimensionality. In STOC ’98. [2] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. In NIPS ’09. [3] R. Salakhutdinov and G. Hinton. Semantic Hashing. NIPS ’08. [4] Y. Weiss and A. Torralba and R. Fergus. Spectral Hashing. NIPS ’08. [5] J. Wang and S. Kumar and SF. Chang. Semi-supervised hashing for large-scale search. PAMI ’12. [6] W. Kong and W. Li and M. Guo. Manhattan hashing for large-scale image retrieval. SIGIR ’12. [7] W. Kong and W. Li. Double Bit Quantisation for Hashing. AAAI ’12. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 36/41
  37. 37. Variable Bit Quantisation for LSH AUPRC across different projection methods Dataset CIFAR-10 (32 bits) Reuters-21578 (128 bits) SBQ MQ DBQ VBQ SBQ MQ DBQ VBQ SIKH 0.042 0.046 0.047 0.161 0.102 0.112 0.087 0.389 LSH 0.119 0.091 0.066 0.207 0.276 0.201 0.175 0.538 BLSI 0.038 0.135 0.111 0.231 0.100 0.030 0.030 0.156 SH 0.051 0.144 0.111 0.202 0.033 0.028 0.030 0.154 PCAH 0.036 0.132 0.107 0.219 0.095 0.034 0.027 0.154 VBQ is an effective quantisation scheme for both image and text datasets. VBQ and a cheap projection (e.g. LSH) can outperform SBQ and an expensive projection (e.g. PCA). Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 37/41
  38. 38. Variable Bit Quantisation for LSH AUPRC for LSH across a broad bit range 32 48 64 96 128 256 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 SBQ MQ VBQ Number of Bits AUPRC 8 16 24 32 40 48 56 64 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 SBQ MQ VBQ Number of Bits AUPRC (a) Reuters-21578 (b) CIFAR-10 VBQ is effective across a wide range of bits. See paper for further results. Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 38/41
  39. 39. Variable Bit Quantisation for LSH Variable Bit Quantisation for LSH Fast search in large-scale datasets Locality Sensitive Hashing (LSH) Variable Bit Quantisation Evaluation Summary Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 39/41
  40. 40. Variable Bit Quantisation for LSH Summary Proposed a general data-driven technique to adaptively assign variable bits per LSH hyperplane Hyperplanes better preserving the neighbourhood structure are afforded more bits from the bit budget VBQ substantially increased retrieval performance across standard text and image datasets Future work: evaluate with an LSH system that uses hash tables for fast retrieval Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 40/41
  41. 41. Variable Bit Quantisation for LSH Thank you for your attention Sean Moran sean.moran@ed.ac.uk www.seanjmoran.com Sean Moran Victor Lavrenko, Miles Osborne ACL Sofia, August 2013 41/41

×