London Information Retrieval Meetup
19 Feb 2019
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Skipping
Elia Porciani, Software Engineer
Introduction
● Top-k retrieval and inverted index
● Introduction to early termination techniques
● Block Max Wand
A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, R. Venturini. Faster BlockMax WAND with Variable-sized Blocks. SIGIR, 2017.
A. Mallia, E. Porciani. Faster BlockMax WAND with Longer Skipping. ECIR, 2019.
Queries over search engines
Inverted index
[Figure: documents and the terms they contain, term1 to term5]
Inverted index compression
We compressed posting lists with partitioned Elias-Fano.
Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14.
Inverted list:  1  2  5  7  12  13  14  20
D-gaps:         1  1  3  2   5   1   1   6
Only a few bits are needed to store each item of an inverted list.
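The d-gap row above can be reproduced with a few lines of Python. This is only a sketch of the gap transformation, not of partitioned Elias-Fano itself. The function names are made up for illustration, and measuring the first gap from 0 is an assumption that happens to match the slide.

```python
from itertools import accumulate

def to_dgaps(doc_ids):
    """Turn a strictly increasing list of doc-ids into gaps between neighbours."""
    gaps = []
    previous = 0  # assumption: the first gap is measured from 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_dgaps(gaps):
    """Rebuild the doc-ids as the running sum of the gaps."""
    return list(accumulate(gaps))

inverted_list = [1, 2, 5, 7, 12, 13, 14, 20]
dgaps = to_dgaps(inverted_list)

# Bits needed for the largest value in each representation.
bits_for_doc_ids = max(inverted_list).bit_length()
bits_for_dgaps = max(dgaps).bit_length()
print(dgaps, bits_for_doc_ids, bits_for_dgaps)
```

The largest gap is smaller than the largest doc-id, which is why the gaps need fewer bits per item.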
Top-K Retrieval
We are interested only in the top k documents, with k small.
Tf-Idf
In detail, we use Okapi BM25.
Posting list of a term:
Doc-id:       1  2  5  7  12  13  14  20
Frequencies:  3  2  1  8   2   4   6   2
Term frequency: $tf_{ij} = \frac{|n_{ij}|}{|d_j|}$
Inverse document frequency: $idf_i = \frac{|D|}{|\{d : i \in d\}|}$
Inverted list iterator operations
next()       Find the next document id
nextGEQ(k)   Find the next document id in the list with id >= k
score()      Compute the score of the current document id, given its associated frequency
The score() function involves floating-point computations.
Iterating over an inverted index is expensive because it is compressed.
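The three operations can be illustrated with a minimal sketch over an uncompressed list. The class name, the END sentinel and the weight parameter are hypothetical, and score() uses a plain weight times frequency as a stand-in for BM25. A real index would decode compressed blocks instead of reading Python lists.

```python
from bisect import bisect_left

class PostingListIterator:
    """Hypothetical cursor over an uncompressed posting list."""

    END = float("inf")  # sentinel doc-id returned once the list is exhausted

    def __init__(self, doc_ids, frequencies, weight=1.0):
        self.doc_ids = doc_ids
        self.frequencies = frequencies
        self.weight = weight
        self.position = 0

    def docid(self):
        """Doc-id under the cursor, or END past the last posting."""
        if self.position < len(self.doc_ids):
            return self.doc_ids[self.position]
        return self.END

    def next(self):
        """Advance by one posting."""
        self.position += 1
        return self.docid()

    def next_geq(self, k):
        """Advance to the first posting at or after the cursor with doc-id >= k."""
        self.position = bisect_left(self.doc_ids, k, self.position)
        return self.docid()

    def score(self):
        """Placeholder score (weight * frequency), standing in for BM25."""
        return self.weight * self.frequencies[self.position]

cursor = PostingListIterator([1, 2, 5, 7, 12, 13, 14, 20], [3, 2, 1, 8, 2, 4, 6, 2])
landed_on = cursor.next_geq(6)      # skips 1, 2 and 5
score_there = cursor.score()        # uses the frequency stored for that doc-id
after_next = cursor.next()
past_the_end = cursor.next_geq(21)  # no such posting
```

Here next_geq is the operation that makes skipping possible: the early termination algorithms rely on it to jump over postings without scoring them.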
Ranked-Or
[Figure: posting lists T1 to T5 laid out along the doc-id axis]
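As a baseline for the techniques that follow, exhaustive disjunctive (ranked OR) evaluation can be sketched as below. The function name and the toy data are hypothetical. Each posting list is simplified to a dict from doc-id to an already computed score contribution, so the sketch accumulates term by term rather than walking the lists in doc-id order.

```python
import heapq
from collections import defaultdict

def ranked_or(posting_lists, k):
    """Score every posting of every list and return the k best (doc_id, score) pairs."""
    totals = defaultdict(float)
    for postings in posting_lists:
        for doc_id, contribution in postings.items():
            totals[doc_id] += contribution  # no posting is skipped
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

# Hypothetical contributions for three query terms.
toy_lists = [
    {1: 1.0, 5: 2.0},
    {5: 0.5, 7: 3.0},
    {1: 0.25, 7: 0.5},
]
top_two = ranked_or(toy_lists, 2)
print(top_two)
```

Every posting is touched once, which is the cost that early termination tries to avoid.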
Early termination techniques
● It is not necessary to compute the score function on all the postings.
● Max score
● Wand
● BlockMaxWand
These algorithms compute the exact top-k documents.
Wand
[Figure: posting lists T1 to T5 along the doc-id axis, with the pivot list marked]
Threshold: 𝜭 = 15.8
Maximum scores (Ms) of the lists: 5.4, 5.0, 4.2, 4.3, 2.3
Running sums of the maximum scores: 5.4, 14.6, 9.6, 16.9
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient query evaluation using a two-level retrieval process. CIKM ’03.
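Pivot selection in Wand can be sketched as a running sum over the lists' maximum scores. The function name is hypothetical, and the order of the lists is an assumption: the sums in the figure (5.4, 9.6, 14.6, 16.9) are consistent with maximum scores 5.4, 4.2, 5.0, 2.3 taken in current doc-id order, so that order is used here.

```python
def find_pivot(max_scores, threshold):
    """Index of the first list whose running sum of maximum scores reaches the threshold.

    max_scores must be ordered by the current doc-id of each list.
    Returns None when even the sum over all lists stays below the threshold.
    """
    running_sum = 0.0
    for index, max_score in enumerate(max_scores):
        running_sum += max_score
        if running_sum >= threshold:
            return index
    return None

# Assumed doc-id order of the lists, reconstructed from the sums in the figure.
max_scores_in_docid_order = [5.4, 4.2, 5.0, 2.3, 4.3]
pivot = find_pivot(max_scores_in_docid_order, 15.8)
print(pivot)
```

Documents before the pivot list's current doc-id cannot reach the threshold, so the preceding lists can jump straight to that doc-id.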
Block-Max-Wand
[Figure: posting lists T1 to T5 split into blocks, each block labelled with its maximum score, with the pivot list marked]
Threshold: 𝜭 = 15.8
Maximum scores (Ms) of the lists: 5.4, 5.0, 4.2, 4.3, 2.3
Running sums of the maximum scores: 5.4, 14.6, 9.6, 16.9
Block max upper estimation = 10.2
The smaller the average approximation error, the better the performance.
Shuai Ding and Torsten Suel. Faster top-k document retrieval using block-max indexes. SIGIR ’11.
Block Max Wand
1. Select the pivot as in Wand.
2. Compute the sum of the block max contributions (blockmaxsum) for the pivot doc-id.
3. If blockmaxsum exceeds the threshold:
   1. Fully evaluate the pivot document.
   2. Move the iterator to pivot.docid + 1.
4. Otherwise, move the iterator to the leftmost boundary of the blocks evaluated.
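Steps 2 to 4 can be sketched as a single decision function. The function name and all values are hypothetical, the block maxima are chosen to sum to roughly 10.2 (the upper estimation in the earlier figure), and treating a block boundary as the last doc-id of the block, so that the skip target is that doc-id plus one, is an assumption.

```python
def block_max_step(block_maxes, block_ends, pivot_doc_id, threshold):
    """Decide what to do with the pivot document.

    block_maxes: maximum score of the block containing the pivot doc-id,
                 one value per list up to and including the pivot list.
    block_ends:  last doc-id of each of those blocks (assumed boundary definition).
    """
    block_max_sum = sum(block_maxes)
    if block_max_sum > threshold:
        # The refined bound still allows the document into the top k: score it fully.
        return ("evaluate", pivot_doc_id)
    # No document up to the leftmost boundary can beat the threshold.
    return ("skip_to", min(block_ends) + 1)

# Hypothetical block maxima and boundaries for four lists, pivot at doc-id 21.
skip_decision = block_max_step([3.2, 2.1, 2.8, 2.1], [40, 25, 60, 33], 21, 15.8)
evaluate_decision = block_max_step([6.0, 5.0, 5.0], [40, 25, 60], 21, 15.8)
print(skip_decision, evaluate_decision)
```

The list-level maximum scores pass the pivot test, but the tighter block-level bound rules the document out without decoding or scoring it.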
Partitioning
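Fixed size blocks vs. variable size blocks
Approximation error of a partitioning B:
$\sum_{b \in B} \left( |b| \max_{s \in b} s - \sum_{s \in b} s \right)$
Variable size blocks are chosen by solving:
$\min \sum_{b \in B} \left( |b| \max_{s \in b} s \right) + S$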
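The approximation error of a partitioning is the total gap between each block's maximum and the actual scores it covers. The sketch below compares a fixed-size and a variable-size partitioning of the same list. The function name and the scores are hypothetical.

```python
def approximation_error(scores, block_ends):
    """Sum over blocks of (block maximum * block length - sum of the block's scores).

    block_ends holds the exclusive end index of each block; the last one
    must equal len(scores).
    """
    error = 0.0
    start = 0
    for end in block_ends:
        block = scores[start:end]
        error += max(block) * len(block) - sum(block)
        start = end
    return error

# Hypothetical scores: two low postings followed by four high ones.
scores = [1.0, 1.0, 4.0, 4.0, 4.0, 4.0]
fixed_error = approximation_error(scores, [3, 6])     # two blocks of size 3
variable_error = approximation_error(scores, [2, 6])  # two blocks following the data
print(fixed_error, variable_error)
```

With the same number of blocks, placing the boundary where the scores change removes the error entirely in this toy case.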
Shortest Path Problem
• V: the postings, sorted by their position in the list
• E: every possible block in the list
• C(i, j): the approximation error of the block from i to j
We add a fixed cost F to the cost function C(i, j).
Complexity: $O(n^2)$
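The exact quadratic solution can be sketched as a dynamic program over the postings, where every block is an edge whose cost is its approximation error plus F. This is a minimal illustration, not the authors' implementation. The function name and the data are hypothetical, and scores are assumed to be non-negative.

```python
def optimal_partition(scores, fixed_cost):
    """Cheapest partitioning into blocks, as a shortest path over all O(n^2) blocks.

    Returns (total cost, exclusive end index of each block).
    Assumes non-negative scores.
    """
    n = len(scores)
    best = [0.0] + [float("inf")] * n  # best[j]: cheapest way to cover scores[:j]
    previous = [0] * (n + 1)           # previous[j]: start of the last block in that solution
    for i in range(n):                 # block starts at index i
        block_max = 0.0
        block_sum = 0.0
        for j in range(i, n):          # block ends at index j, inclusive
            block_max = max(block_max, scores[j])
            block_sum += scores[j]
            cost = block_max * (j - i + 1) - block_sum + fixed_cost
            if best[i] + cost < best[j + 1]:
                best[j + 1] = best[i] + cost
                previous[j + 1] = i
    # Walk the predecessor links back from the end to recover the boundaries.
    block_ends = []
    end = n
    while end > 0:
        block_ends.append(end)
        end = previous[end]
    block_ends.reverse()
    return best[n], block_ends

total_cost, block_ends = optimal_partition([1.0, 1.0, 4.0, 4.0, 4.0, 4.0], 1.0)
print(total_cost, block_ends)
```

Without the fixed cost, the cheapest solution would put every posting in its own block at zero error, so F is what keeps the number of blocks under control.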
Approximation algorithm
● Monotonicity: $O(n^2) \to O(n \log U)$
$C(i, j) \le C(i, j + 1)$
$C(i, j) \le C(i - 1, j)$
$G_1 = \{(i, j) \in G \mid \exists k .\ C(i, j) \le F(1 + \alpha)^k \le C(i, j + 1)\}$
● Quasi-subadditivity: $O(n \log U) \to O(n)$
$C(i, k) + C(k + 1, j) \le C(i, j) + F + 1$
$G_2 = \{(i, j) \in G_1 \mid C(i, j) \le F/\beta\}$
$sp(G_2) \le (1 + \alpha)(1 + \beta)\, sp(G)$
Experimental analysis
Collection   Size      Size after compression   # lists      # postings   # documents
Gov2         44 GiB    4.4 GiB                  35 million   6 billion    24 million
Clueweb09    120 GiB   15 GiB                   92 million   15 billion   50 million
● Trec2005 and Trec2006 query collections.
● The code is written in C++11 and compiled with GCC 5.3.1 with the highest optimization settings. It is executed on an 8-core i7-4790K with 32 GiB of RAM running Linux kernel v. 4.4.0.
Choosing block size
[Two plots, each with block size on the x-axis]
Block-Max-Wand Compression
● Maximum impact element
● Boundary doc-id
Block-Max-Wand Compression (score quantization)
Uniform partitioning
Opt partitioning
Sort
Compression algorithms comparisons
Time in ms
            Gov2                            Clueweb09
            Trec2005       Trec2006         Trec2005        Trec2006
Wand        7.06 (1.92x)   12.92 (1.55x)    28.85 (2.25x)   37.55 (1.40x)
MaxScore    6.59 (1.79x)   11.35 (1.36x)    23.58 (1.84x)   32.28 (1.21x)
BMW         3.67           8.33             12.81           26.64

Space in bits per posting
                Gov2           Clueweb09
Plain index     6.91           8.36
Wand/MaxScore   7.24 (1.04x)   8.65 (1.03x)
BMW/VBMW        9.14 (1.32x)   10.68 (1.27x)
VBMW c.         8.07 (1.16x)   9.51 (1.13x)

Time in ms
            Gov2                            Clueweb09
            Trec2005       Trec2006         Trec2005        Trec2006
Wand        7.06 (3.34x)   12.92 (2.72x)    28.85 (3.98x)   37.55 (2.55x)
MaxScore    6.59 (3.12x)   11.35 (2.39x)    23.58 (3.25x)   32.28 (2.11x)
BMW         3.67 (1.73x)   8.33 (1.75x)     12.81 (1.77x)   26.64 (1.74x)
VBMW        2.11           4.75             7.25            15.30
VBMW c.     2.35 (1.11x)   5.29 (1.11x)     8.21 (1.13x)    17.00 (1.11x)
Longer skipping
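We can do better than skipping to the block boundary.
[Figure: a posting list with the block boundary and the longer-skipping boundary (LS-boundary) marked]
Two options:
● Iterate over the blocks at runtime
● Add a pointer per block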
Block-Max-Wand
[Figure: the Block-Max-Wand example shown earlier, with threshold 𝜭 = 15.8 and block max upper estimation = 10.2]
Longer skipping
ClueWeb - Trec2005
              2              3              4              5               6+
VBMW          3.17 (1.45x)   6.39 (1.13x)   8.92 (1.04x)   14.46 (1.00x)   32.04 (1.03x)
VBMW LS       2.18           5.66           8.57           14.44           31.05 (1.04x)
VBMW c.       3.53 (1.31x)   6.97 (1.15x)   9.86 (1.04x)   16.06 (1.00x)   36.26 (1.01x)
VBMW LSP c.   2.68           6.3            9.52           16.07           36.01
Thank you
