Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Skipping

Modern search engines have to keep up with the enormous growth in the number of documents and in the queries submitted by users. One of the problems to deal with is finding the k most relevant documents for a given query. This operation has to be fast, and that is possible only by using specialised technologies.
BlockMax WAND is one of the best-known algorithms for solving this problem without any degradation in ranking effectiveness.
After a brief introduction, in this talk I’m going to show a strategy introduced in “Faster BlockMax WAND with Variable-sized Blocks” (SIGIR 2017) that, applied to BlockMax WAND data, made it possible to speed up the algorithm by almost 2x.
I will then present another optimisation of the BlockMax WAND algorithm (“Faster BlockMax WAND with Longer Skipping”, ECIR 2019) that reduces the execution time of short queries.


  1. London Information Retrieval Meetup, 19 February 2019. Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Skipping. Elia Porciani, Software Engineer.
  2. Introduction ● Top-k retrieval and inverted index ● Introduction to early termination techniques ● Block max wand. References: A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, R. Venturini. Faster BlockMax WAND with Variable-sized Blocks. SIGIR, 2017. A. Mallia, E. Porciani. Faster BlockMax WAND with Longer Skipping. ECIR, 2019.
  3. Queries over search engines
  4. Inverted index (figure: documents mapped to the posting lists of term1 to term5)
  5. Inverted index compression. We compressed posting lists with partitioned Elias-Fano (Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14). Inverted list: 1 2 5 7 12 13 14 20. D-gaps: 1 1 3 2 5 1 1 6. Only a few bits are necessary to store each item of an inverted list.
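The gap transform on this slide can be sketched in a few lines of Python. This is only the d-gap step, not partitioned Elias-Fano itself; the function names `to_dgaps` and `from_dgaps` are made up for the illustration.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Turn a strictly increasing doc-id list into gaps between neighbours."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_dgaps(gaps):
    """Rebuild the doc-ids with a running (prefix) sum of the gaps."""
    return list(accumulate(gaps))


inverted_list = [1, 2, 5, 7, 12, 13, 14, 20]  # the list shown on the slide
dgaps = to_dgaps(inverted_list)
print(dgaps)  # [1, 1, 3, 2, 5, 1, 1, 6]
```

The gaps are much smaller than the doc-ids themselves, which is why a compressor needs only a few bits per posting.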
  6. Top-K Retrieval. We are interested only in the first K documents, with K small.
  7. Tf-Idf. In detail, we use Okapi BM25. Term frequency: $tf_{ij} = |n_{ij}| / |d_j|$. Inverse document frequency: $idf_i = |D| / |\{d : i \in d\}|$. Doc-ids: 1 2 5 7 12 13 14 20. Term frequencies: 3 2 1 8 2 4 6 2.
  8. Inverted list iterator operations. next(): find the next document id. nextGEQ(k): find the next document id in the list with id >= k. score(): compute the score of the current document id from its associated frequency. The score() function involves floating-point computations. Iterating over the inverted index is expensive because it is compressed.
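A minimal sketch of the three iterator operations, assuming an uncompressed, in-memory posting list. The class name `PostingIterator`, the `END` sentinel and the raw-frequency `score()` are assumptions for the example; a real index would decode compressed blocks and apply BM25.

```python
from bisect import bisect_left


class PostingIterator:
    END = float("inf")  # sentinel doc-id returned once the list is exhausted

    def __init__(self, doc_ids, freqs):
        self.doc_ids = doc_ids  # sorted doc-ids
        self.freqs = freqs      # term frequency of each posting
        self.pos = 0

    def docid(self):
        if self.pos < len(self.doc_ids):
            return self.doc_ids[self.pos]
        return self.END

    def next(self):
        """Move to the next posting and return its doc-id."""
        if self.pos < len(self.doc_ids):
            self.pos += 1
        return self.docid()

    def next_geq(self, k):
        """Move to the first posting at or after the cursor with doc-id >= k."""
        self.pos = bisect_left(self.doc_ids, k, lo=self.pos)
        return self.docid()

    def score(self):
        """Placeholder score: the raw frequency (assumption, not BM25)."""
        return float(self.freqs[self.pos])


it = PostingIterator([1, 2, 5, 7, 12, 13, 14, 20], [3, 2, 1, 8, 2, 4, 6, 2])
print(it.next_geq(6), it.score())  # 7 8.0
```

Here `next_geq` is a binary search because the list is a plain array; on a compressed list the same call has to locate and decode a block, which is the cost the slide refers to.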
  9. Ranked-Or (figure: the posting lists of T1 to T5 laid out along the doc-id axis)
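As a baseline for the techniques that follow, exhaustive Ranked-Or can be sketched as below. It assumes each posting list is a sorted list of `(doc_id, score)` pairs with the per-term score already computed; the function name `ranked_or` and the sample data are made up for the illustration.

```python
import heapq


def ranked_or(posting_lists, k):
    """Score every document that appears in any list and keep the k best."""
    cursors = [0] * len(posting_lists)
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest of the top k
    while True:
        current = [plist[c][0]
                   for plist, c in zip(posting_lists, cursors) if c < len(plist)]
        if not current:
            break
        doc_id = min(current)  # smallest doc-id not yet processed
        score = 0.0
        for i, plist in enumerate(posting_lists):
            c = cursors[i]
            if c < len(plist) and plist[c][0] == doc_id:
                score += plist[c][1]
                cursors[i] += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)


t1 = [(1, 1.0), (3, 2.0), (7, 0.5)]
t2 = [(3, 1.5), (5, 4.0)]
t3 = [(1, 0.5), (5, 0.25), (7, 2.0)]
print(ranked_or([t1, t2, t3], k=2))  # [(4.25, 5), (3.5, 3)]
```

Every posting is scored here, which is exactly the work that early termination tries to avoid.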
  10. Early termination techniques ● It is not necessary to compute the score function on all the postings. ● MaxScore ● WAND ● BlockMax WAND. These algorithms compute the exact top-k documents.
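  11. Wand (figure: the posting lists of T1 to T5 along the doc-id axis, with the pivot list marked). Threshold θ = 15.8. Max scores (Ms) of the lists: 5.4, 5.0, 4.2, 4.3, 2.3. Running sums of the max scores: 5.4, 9.6, 14.6, 16.9. The pivot is the first list at which the running sum exceeds θ (16.9 > 15.8). Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient query evaluation using a two-level retrieval process. CIKM ’03.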
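The pivot selection on this slide can be sketched as follows. The threshold and max scores are the slide's values; the current doc-ids (10, 14, 21, 30, 42) and the assignment of max scores to lists are assumptions, chosen so that the running sums match the ones shown. The function name `select_pivot` is made up for the example.

```python
def select_pivot(cursors, threshold):
    """cursors: one (current_doc_id, max_score) pair per query term.

    Lists are visited in increasing order of current doc-id; the pivot is the
    doc-id of the first list at which the running sum of max scores exceeds
    the threshold. Returns (pivot_doc_id or None, running sums seen so far).
    """
    running = 0.0
    sums = []
    for doc_id, max_score in sorted(cursors, key=lambda c: c[0]):
        running += max_score
        sums.append(round(running, 1))  # rounded only for display
        if running > threshold:
            return doc_id, sums
    return None, sums


# Hypothetical doc-ids, slide's max scores and threshold.
cursors = [(10, 5.4), (14, 4.2), (21, 5.0), (30, 2.3), (42, 4.3)]
pivot, sums = select_pivot(cursors, threshold=15.8)
print(pivot, sums)
```

No document before the pivot can reach the threshold, because even the sum of the max scores of the lists positioned there stays below θ, so those lists can jump straight to the pivot with nextGEQ.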
  12. Block-Max-Wand (figure: the posting lists of T1 to T5 along the doc-id axis, each split into blocks labelled with their block max, with the pivot list marked). Threshold θ = 15.8. Block maxes shown: 3.2 4.2 4.5 5.4 2.1 2.1 3.2 2.1 2.3 0.8 3.5 1.4 4.1 2.7 2.0. Block-max upper estimation = 10.2. Max scores (Ms) of the lists: 5.4, 5.0, 4.2, 4.3, 2.3. Running sums of the max scores: 5.4, 9.6, 14.6, 16.9. The smaller the average approximation error, the better the performance. Shuai Ding and Torsten Suel. Faster top-k document retrieval using block-max indexes. SIGIR ’11.
  13. Block Max Wand
      1. Select the pivot as in WAND.
      2. Compute the sum of the block-max contributions (block-max sum) for the pivot doc-id.
      3. If the block-max sum exceeds the threshold:
         1. Fully evaluate the pivot document.
         2. Move the iterator to pivot.docid + 1.
      4. Otherwise, move the iterator to the leftmost boundary of the blocks evaluated.
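Steps 2 to 4 can be sketched as a single decision function. This is a simplified illustration, not the reference implementation: the name `block_max_step`, the block boundaries and the per-list block maxes are assumptions (the maxes were picked to add up to the 10.2 estimate of the earlier slide), and the skip target is taken as one past the nearest block boundary.

```python
def block_max_step(pivot_doc, blocks, threshold, next_list_doc=None):
    """Decide what to do once a pivot has been selected.

    blocks: one (last_doc_id, block_max) pair per list up to and including
    the pivot list, describing the block of that list that covers pivot_doc.
    next_list_doc: current doc-id of the first list after the pivot, if any;
    the skip is capped there so that no candidate is jumped over (assumption).
    """
    block_max_sum = sum(block_max for _, block_max in blocks)
    if block_max_sum > threshold:
        # Step 3: the caller fully scores pivot_doc, then resumes at pivot_doc + 1.
        return "evaluate", pivot_doc + 1
    # Step 4: nothing up to the nearest block boundary can beat the threshold.
    next_doc = min(last_doc_id for last_doc_id, _ in blocks) + 1
    if next_list_doc is not None:
        next_doc = min(next_doc, next_list_doc)
    return "skip", next_doc


# Hypothetical blocks covering a pivot at doc-id 30.
blocks = [(45, 3.2), (38, 2.1), (60, 3.5), (52, 1.4)]
print(block_max_step(30, blocks, threshold=15.8))  # ('skip', 39)
```

With a global max-score sum of 16.9, WAND alone would have scored the pivot; the tighter block-level bound of about 10.2 shows that the work can be skipped.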
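  14. Partitioning: fixed-size blocks vs. variable-size blocks. Approximation error of a partitioning $B$: $\sum_{b \in B} \left( |b| \max_{s \in b} s - \sum_{s \in b} s \right)$. Objective: $\min \sum_{b \in B} \left( |b| \max_{s \in b} s \right) + S$.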
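The approximation error of a partitioning is easy to compute directly. The sketch below uses a made-up score sequence and compares fixed-size blocks with variable-size blocks at the same number of blocks; `approximation_error` is a name chosen for the example.

```python
def approximation_error(scores, boundaries):
    """Sum over blocks of |b| * max(b) - sum(b).

    boundaries: exclusive end index of each block, in increasing order,
    with the last one equal to len(scores).
    """
    error = 0.0
    start = 0
    for end in boundaries:
        block = scores[start:end]
        error += max(block) * len(block) - sum(block)
        start = end
    return error


scores = [1, 1, 1, 9, 1, 1]  # hypothetical per-posting scores
fixed = approximation_error(scores, [2, 4, 6])     # three blocks of size 2
variable = approximation_error(scores, [3, 4, 6])  # three variable-size blocks
print(fixed, variable)  # 8.0 0.0
```

Isolating the outlier score in its own block removes the over-estimation that a fixed grid spreads over its neighbours.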
  15. Shortest Path Problem ● V: the postings, sorted by their position in the list ● E: every possible block in the list ● C(i, j): the approximation error of the block from i to j. We add a fixed cost F to the cost function C(i, j). Complexity: $O(n^2)$.
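Because every edge goes from a smaller position to a larger one, the shortest path can be found with a quadratic dynamic program. The sketch below is a plain exact version under that reading, not the approximation algorithm; `optimal_partition` and the sample scores are assumptions for the illustration.

```python
def optimal_partition(scores, fixed_cost):
    """Return (total cost, exclusive block end indices) of the cheapest partitioning.

    The cost of a block is its approximation error plus fixed_cost.
    """
    n = len(scores)
    best = [0.0] + [float("inf")] * n  # best[j]: cheapest way to cover scores[:j]
    prev = [0] * (n + 1)               # prev[j]: start index of the last block
    for i in range(n):                 # block starts at position i
        block_max = float("-inf")
        block_sum = 0.0
        for j in range(i, n):          # block ends at position j (inclusive)
            block_max = max(block_max, scores[j])
            block_sum += scores[j]
            cost = block_max * (j - i + 1) - block_sum + fixed_cost
            if best[i] + cost < best[j + 1]:
                best[j + 1] = best[i] + cost
                prev[j + 1] = i
    boundaries = []
    j = n
    while j > 0:                       # walk the chosen edges backwards
        boundaries.append(j)
        j = prev[j]
    boundaries.reverse()
    return best[n], boundaries


scores = [1, 1, 1, 9, 1, 1]  # hypothetical per-posting scores
print(optimal_partition(scores, fixed_cost=1))   # (3.0, [3, 4, 6])
print(optimal_partition(scores, fixed_cost=20))  # (56.0, [3, 6])
```

The fixed cost F acts as a knob on the number of blocks: with a small F the outlier gets its own block, with a large F fewer and wider blocks win.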
  16. Approximation algorithm
      ● Monotonicity: $C(i, j) \le C(i, j + 1)$ and $C(i, j) \le C(i - 1, j)$. Pruned graph: $G_1 = \{(i, j) \in G \mid \exists k .\; C(i, j) \le F(1 + \alpha)^k \le C(i, j + 1)\}$. Complexity: $O(n^2) \to O(n \log U)$.
      ● Quasi-subadditivity: $C(i, k) + C(k + 1, j) \le C(i, j) + F + 1$. Pruned graph: $G_2 = \{(i, j) \in G_1 \mid C(i, j) \le F / \beta\}$. Complexity: $O(n \log U) \to O(n)$.
      ● $sp(G_2) = (1 + \alpha)(1 + \beta)\, sp(G)$
  17. Experimental analysis
      | Collection | Size | Size after compression | # lists | # postings | # documents |
      |---|---|---|---|---|---|
      | Gov2 | 44 GiB | 4.4 GiB | 35 million | 6 billion | 24 million |
      | Clueweb09 | 120 GiB | 15 GiB | 92 million | 15 billion | 50 million |
      ● Queries: Trec2005 and Trec2006 query collections.
      ● The code is written in C++11, compiled with GCC 5.3.1 at the highest optimization settings, and executed on an 8-core i7-4790K with 32 GiB of RAM running Linux kernel v. 4.4.0.
  18. Choosing block size (figure: two plots, each with block size on the axis)
  19. Block-Max-Wand Compression (figure labels: maximum impact element, boundary doc-id)
  20. Block-Max-Wand Compression (score quantization): uniform partitioning, opt partitioning, sort
  21. Compression algorithms comparisons
  22. Results
      Time in ms (ratios relative to BMW):
      | | Gov2 Trec2005 | Gov2 Trec2006 | Clueweb09 Trec2005 | Clueweb09 Trec2006 |
      |---|---|---|---|---|
      | Wand | 7.06 (1.92x) | 12.92 (1.55x) | 28.85 (2.25x) | 37.55 (1.40x) |
      | MaxScore | 6.59 (1.79x) | 11.35 (1.36x) | 23.58 (1.84x) | 32.28 (1.21x) |
      | BMW | 3.67 | 8.33 | 12.81 | 26.64 |
      Space in bits per posting (ratios relative to the plain index):
      | | Gov2 | Clueweb09 |
      |---|---|---|
      | Plain index | 6.91 | 8.36 |
      | Wand/MaxScore | 7.24 (1.04x) | 8.65 (1.03x) |
      | BMW/VBMW | 9.14 (1.32x) | 10.68 (1.27x) |
      | VBMW c. | 8.07 (1.16x) | 9.51 (1.13x) |
      Time in ms (ratios relative to VBMW):
      | | Gov2 Trec2005 | Gov2 Trec2006 | Clueweb09 Trec2005 | Clueweb09 Trec2006 |
      |---|---|---|---|---|
      | Wand | 7.06 (3.34x) | 12.92 (2.72x) | 28.85 (3.98x) | 37.55 (2.55x) |
      | MaxScore | 6.59 (3.12x) | 11.35 (2.39x) | 23.58 (3.25x) | 32.28 (2.11x) |
      | BMW | 3.67 (1.73x) | 8.33 (1.75x) | 12.81 (1.77x) | 26.64 (1.74x) |
      | VBMW | 2.11 | 4.75 | 7.25 | 15.30 |
      | VBMW c. | 2.35 (1.11x) | 5.29 (1.11x) | 8.21 (1.13x) | 17.00 (1.11x) |
  23. Longer skipping. We can do better than skipping to the block boundary (figure: the block boundary compared with the farther longer-skipping boundary, "LS-boundary"). Two options: ● Iterate over the blocks at runtime ● Add a pointer per block
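One way to read the "iterate over the blocks at runtime" option is sketched below. It is an interpretation of the slide, not the authors' implementation: the function `longer_skip`, the block layout and the values are all assumptions. Each list is a sequence of `(last_doc_id, block_max)` blocks, and a block is taken to cover every doc-id after the previous block's end up to its own `last_doc_id`.

```python
def longer_skip(lists, start_doc, threshold):
    """Return the first doc-id >= start_doc that still needs inspection.

    Keeps jumping over block boundaries while the sum of the block maxes
    covering the current position stays at or below the threshold.
    Returns None if no remaining position can beat the threshold.
    """
    doc = start_doc
    idx = [0] * len(lists)  # current block of each list
    while True:
        total = 0.0
        bounds = []
        for t, blocks in enumerate(lists):
            # advance to the block of list t that covers doc
            while idx[t] < len(blocks) and blocks[idx[t]][0] < doc:
                idx[t] += 1
            if idx[t] < len(blocks):
                last_doc_id, block_max = blocks[idx[t]]
                total += block_max
                bounds.append(last_doc_id)
        if not bounds:
            return None              # every list is exhausted
        if total > threshold:
            return doc               # a document here might enter the top k
        doc = min(bounds) + 1        # jump past the nearest boundary and retry


# Hypothetical block-max data for two lists.
a = [(10, 3.0), (20, 1.0), (30, 6.0)]
b = [(15, 2.0), (40, 5.0)]
print(longer_skip([a, b], start_doc=1, threshold=7.0))  # 21
```

Skipping only to the first boundary would have stopped at doc-id 11; walking the following blocks moves the cursors to 21 in one step. The pointer-per-block option on the slide would presumably trade index space for avoiding this runtime loop.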
  24. Block-Max-Wand (same figure and values as slide 12). Shuai Ding and Torsten Suel. Faster top-k document retrieval using block-max indexes. SIGIR ’11.
  25. Longer skipping (ClueWeb, Trec2005)
      | Query length | 2 | 3 | 4 | 5 | 6+ |
      |---|---|---|---|---|---|
      | VBMW | 3.17 (1.45x) | 6.39 (1.13x) | 8.92 (1.04x) | 14.46 (1.00x) | 32.04 (1.03x) |
      | VBMW LS | 2.18 | 5.66 | 8.57 | 14.44 | 31.05 (1.04x) |
      | VBMW c. | 3.53 (1.31x) | 6.97 (1.15x) | 9.86 (1.04x) | 16.06 (1.00x) | 36.26 (1.01x) |
      | VBMW LSP c. | 2.68 | 6.3 | 9.52 | 16.07 | 36.01 |
  26. Thank you
