Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Skipping

London Information Retrieval Meetup
19 Feb 2019
Elia Porciani, Software Engineer

Introduction
● Top-k retrieval and inverted index
● Introduction to early termination techniques
● Block-Max WAND
A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, R. Venturini. Faster BlockMax WAND with Variable-sized Blocks. SIGIR, 2017.
A. Mallia, E. Porciani. Faster BlockMax WAND with Longer Skipping. ECIR, 2019.

Inverted index
Documents
term1 term2
term3 term4
term5

Inverted index compression
We compressed posting lists with partitioned Elias-Fano.
Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’14.
Inverted list: 1 2 5 7 12 13 14 20
D-gaps:        1 1 3 2 5 1 1 6
Only a few bits are needed to store each item of an inverted list.
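The d-gap transformation on this slide can be sketched in a few lines. This is only an illustration of why the stored numbers become small; it is not partitioned Elias-Fano, and the function names `to_dgaps` and `from_dgaps` are made up for this example.

```python
from typing import List


def to_dgaps(doc_ids: List[int]) -> List[int]:
    """Replace each doc id with its distance from the previous one.

    The first id is kept as its distance from 0.
    """
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_dgaps(gaps: List[int]) -> List[int]:
    """Rebuild the original doc ids with a running sum over the gaps."""
    doc_ids = []
    current = 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


inverted_list = [1, 2, 5, 7, 12, 13, 14, 20]
dgaps = to_dgaps(inverted_list)
print(dgaps)  # [1, 1, 3, 2, 5, 1, 1, 6]
```

The gaps are much smaller than the doc ids they replace, which is what lets a compressor spend only a few bits per posting.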

Top-K Retrieval
We are interested only in the top k documents, with k small.

Tf-Idf
In detail, we use Okapi BM25.
Term frequency: tf_{ij} = \frac{n_{ij}}{|d_j|}
Inverse document frequency: idf_i = \frac{|D|}{|\{d : i \in d\}|}
Posting list of a term:
Doc-id:      1 2 5 7 12 13 14 20
Frequencies: 3 2 1 8 2 4 6 2
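As a small worked example, the two ratios on this slide can be computed directly. The sketch follows the formulas exactly as shown here; it is not Okapi BM25. The document length (100) and collection size (1000) are hypothetical values chosen for illustration, and practical idf variants usually also take a logarithm of the ratio.

```python
def term_frequency(occurrences: int, doc_length: int) -> float:
    """tf_ij: occurrences of term i in document j, divided by the document length."""
    return occurrences / doc_length


def inverse_document_frequency(num_docs: int, docs_with_term: int) -> float:
    """idf_i: collection size divided by the number of documents containing term i."""
    return num_docs / docs_with_term


# Doc-id 7 holds the term 8 times (from the frequencies row on the slide).
# The posting list has 8 entries, so 8 documents contain the term.
tf = term_frequency(occurrences=8, doc_length=100)                        # hypothetical length
idf = inverse_document_frequency(num_docs=1000, docs_with_term=8)         # hypothetical collection size
print(tf, idf, tf * idf)
```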

Inverted list iterator operations
next()      Find the next document id
nextGEQ(k)  Find the next document id in the list with id >= k
score()     Compute the score of the current document id from its associated frequency
The score() function involves floating-point computations.
Iterating over the inverted index is expensive because it is compressed.
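The three operations can be illustrated with a minimal iterator over an uncompressed list. This is a sketch under simplifying assumptions: a real index decodes compressed blocks on the fly, and `score()` below is a placeholder (frequency times a hypothetical `weight`), not BM25. The class name `PostingIterator` and the `END` sentinel are invented for this example.

```python
from bisect import bisect_left
from typing import List

END = float("inf")  # sentinel doc id returned once the list is exhausted


class PostingIterator:
    def __init__(self, doc_ids: List[int], freqs: List[int], weight: float = 1.0):
        self.doc_ids = doc_ids    # sorted document ids
        self.freqs = freqs        # term frequency for each document id
        self.weight = weight      # hypothetical per-term weight
        self.pos = 0

    def docid(self):
        """Current document id, or END when the list is exhausted."""
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else END

    def next(self):
        """Advance by one posting."""
        if self.pos < len(self.doc_ids):
            self.pos += 1
        return self.docid()

    def next_geq(self, k: int):
        """Advance to the first posting at or after the current one with id >= k."""
        self.pos = bisect_left(self.doc_ids, k, lo=self.pos)
        return self.docid()

    def score(self) -> float:
        """Placeholder score for the current posting (floating-point work)."""
        return self.weight * self.freqs[self.pos]


it = PostingIterator([1, 2, 5, 7, 12, 13, 14, 20], [3, 2, 1, 8, 2, 4, 6, 2])
print(it.next_geq(6), it.score())  # 7 8.0
```

Here `next_geq` uses binary search because the list is a plain array; on a compressed list the same call is where skipping structures pay off.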

Ranked-Or
[Figure: posting lists T1 to T5 aligned along the doc-id axis]

Early termination techniques
● It is not necessary to compute the score function on all the postings.
● MaxScore
● WAND
● Block-Max WAND
These algorithms compute the exact top-k documents.

Wand
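[Figure: WAND pivot selection over the posting lists T1 to T5, aligned along the doc-id axis]
Threshold: θ = 15.8
Max scores (Ms) of the lists: 5.4, 5.0, 4.2, 4.3, 2.3
Running sums of the max scores: 5.4, 9.6, 14.6, 16.9
The pivot list is the first list at which the running sum exceeds θ.
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient query evaluation using a two-level retrieval process. CIKM ’03.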
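Pivot selection can be sketched as follows, using the threshold and max scores from this slide. The order of the max scores is inferred from the running sums (5.4, 9.6, 14.6, 16.9), and the current doc ids are hypothetical. The function name `find_pivot` and the strict `>` comparison are choices made for this example.

```python
from typing import List, Optional, Tuple


def find_pivot(cursors: List[Tuple[int, float]], threshold: float) -> Optional[int]:
    """Return the pivot doc id, or None if no document can beat the threshold.

    cursors: one (current_docid, list_max_score) pair per query term.
    """
    # Visit the lists in increasing order of their current doc id.
    ordered = sorted(cursors, key=lambda cursor: cursor[0])
    running_sum = 0.0
    for docid, max_score in ordered:
        running_sum += max_score
        # The first list at which the accumulated upper bound exceeds the
        # threshold is the pivot list; its current doc id is the pivot.
        if running_sum > threshold:
            return docid
    return None


# (current doc id, Ms); the doc ids are hypothetical, the Ms values are from the slide.
cursors = [(3, 5.4), (8, 4.2), (11, 5.0), (17, 2.3), (25, 4.3)]
print(find_pivot(cursors, threshold=15.8))  # 17
```

No document before the pivot can reach the threshold, so the lists positioned before it can be moved forward with nextGEQ(pivot) without scoring anything in between.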

Block-Max-Wand
[Figure: Block-Max WAND over the posting lists T1 to T5; each list is split into blocks annotated with their maximum score]
Threshold: θ = 15.8
Max scores (Ms) of the lists: 5.4, 5.0, 4.2, 4.3, 2.3
Running sums of the max scores: 5.4, 9.6, 14.6, 16.9
Block max scores shown in the figure: 3.2 4.2 | 4.5 5.4 2.1 | 2.1 3.2 | 2.1 2.3 0.8 | 3.5 1.4 | 4.1 | 2.7 | 2.0
Block max upper estimation = 10.2
The smaller the average approximation error, the better the performance.
Shuai Ding and Torsten Suel. Faster top-k document retrieval using block-max indexes. SIGIR ’11.

Block Max Wand
1. Select the pivot as in WAND.
2. Compute the sum of the block max contributions (block max sum) for the pivot doc-id.
3. If the block max sum exceeds the threshold:
   1. Fully evaluate the pivot document.
   2. Move the iterator to pivot.docid + 1.
4. Otherwise, move the iterator to the leftmost boundary of the evaluated blocks.
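The decision in steps 2 to 4 can be sketched as a single function. This is a simplified illustration, not the implementation from the papers: it assumes each block is described by its max score and the last doc id it contains, and that skipping "to the boundary" means resuming at that doc id plus one. The block values are hypothetical, chosen so that they sum to the 10.2 estimate from the previous slide.

```python
from typing import List, Tuple


def bmw_step(pivot_docid: int,
             blocks: List[Tuple[float, int]],
             threshold: float) -> Tuple[str, int]:
    """Decide what to do at the pivot.

    blocks: for every list up to the pivot list, the (block_max_score,
    block_last_docid) of the block that contains the pivot doc id.
    Returns the action taken and the doc id to move to next.
    """
    block_max_sum = sum(block_max for block_max, _ in blocks)
    if block_max_sum > threshold:
        # The tighter bound still passes: score the pivot, then step past it.
        return ("evaluate", pivot_docid + 1)
    # No document can qualify before the nearest block boundary,
    # so skip just past the leftmost one.
    leftmost_boundary = min(last_docid for _, last_docid in blocks)
    return ("skip", leftmost_boundary + 1)


blocks = [(4.5, 40), (3.2, 35), (2.5, 52)]  # hypothetical; block max sum is 10.2
print(bmw_step(pivot_docid=30, blocks=blocks, threshold=15.8))  # ('skip', 36)
```

With the list-wide max scores the pivot looked promising (16.9 > 15.8), but the block-level estimate of 10.2 rules it out without any score() call.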

Partitioning
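[Figure panels: fixed size blocks, variable size blocks]
Approximation error of a partition B:
\sum_{b \in B} \left( |b| \cdot \max_{s \in b} s - \sum_{s \in b} s \right)
Optimization problem:
\min \sum_{b \in B} \left( |b| \cdot \max_{s \in b} s \right) + S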
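To make the approximation error concrete, the sketch below computes it for a fixed-size and a variable-size partition of the same list. The scores are hypothetical, and representing a partition by its exclusive end indices is a choice made for this example.

```python
from typing import List


def approximation_error(scores: List[float], boundaries: List[int]) -> float:
    """Sum over blocks of (block max * block length - sum of the block's scores).

    boundaries: exclusive end index of each block; the last one equals len(scores).
    """
    error = 0.0
    start = 0
    for end in boundaries:
        block = scores[start:end]
        error += max(block) * len(block) - sum(block)
        start = end
    return error


scores = [1.0, 1.0, 1.0, 5.0, 5.0, 1.0, 1.0, 1.0]  # hypothetical posting scores

# Two fixed blocks of four postings: each block max (5.0) overestimates three postings by 4.0.
print(approximation_error(scores, [4, 8]))     # 24.0
# Variable blocks aligned with the score changes: no overestimation at all.
print(approximation_error(scores, [3, 5, 8]))  # 0.0
```

Both partitions cover the same postings, but the variable-size one isolates the high scores, so its block maxes are exact upper bounds.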

Shortest Path Problem
• V: the postings, sorted by their position in the list
• E: every possible block in the list
• C(i,j): the approximation error of the block from i to j
We add a fixed cost F to the cost function C(i,j).
Complexity: O(n^2)
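The shortest-path formulation can be solved exactly with a quadratic dynamic program, sketched below. It assumes the edge cost is the approximation error of the block plus the fixed cost F; the function name `optimal_partition` and the example scores are hypothetical. This is the exact O(n^2) version only, without any of the pruning discussed next.

```python
from typing import List, Tuple


def optimal_partition(scores: List[float], fixed_cost: float) -> Tuple[float, List[int]]:
    """Return (minimum total cost, exclusive end index of each block)."""
    n = len(scores)
    best = [0.0] + [float("inf")] * n  # best[j]: cheapest way to partition scores[:j]
    prev = [0] * (n + 1)               # prev[j]: start index of the last block in that solution
    for i in range(n):                 # the block starts at posting i
        block_max = float("-inf")
        block_sum = 0.0
        for j in range(i, n):          # the block ends at posting j (inclusive)
            block_max = max(block_max, scores[j])
            block_sum += scores[j]
            # Edge cost: approximation error of the block plus the fixed cost F.
            cost = block_max * (j - i + 1) - block_sum + fixed_cost
            if best[i] + cost < best[j + 1]:
                best[j + 1] = best[i] + cost
                prev[j + 1] = i
    # Walk the predecessor links back from the end to recover the boundaries.
    boundaries = []
    j = n
    while j > 0:
        boundaries.append(j)
        j = prev[j]
    boundaries.reverse()
    return best[n], boundaries


scores = [1.0, 1.0, 1.0, 5.0, 5.0, 1.0, 1.0, 1.0]
print(optimal_partition(scores, fixed_cost=1.0))  # (3.0, [3, 5, 8])
```

Without the fixed cost the cheapest solution would be one block per posting; F is what pushes the optimum toward fewer, larger blocks.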

Approximation algorithm
● Monotonicity: O(n^2) \to O(n \log U)
C(i, j) \le C(i, j + 1)
C(i, j) \le C(i - 1, j)
G_1 = \{ (i, j) \in G \mid \exists k .\ C(i, j) \le F(1 + \alpha)^k \le C(i, j + 1) \}
● Quasi-subadditivity: O(n \log U) \to O(n)
C(i, k) + C(k + 1, j) \le C(i, j) + F + 1
G_2 = \{ (i, j) \in G_1 \mid C(i, j) \le F / \beta \}
sp(G_2) \le (1 + \alpha)(1 + \beta) \, sp(G)

Experimental analysis
Collection   Size      Size after compression   # lists      # postings   # documents
Gov2         44 GiB    4.4 GiB                  35 million   6 billion    24 million
Clueweb09    120 GiB   15 GiB                   92 million   15 billion   50 million
● Trec2005 and Trec2006 query collections.
● The code is written in C++11 and compiled with GCC 5.3.1 with the highest optimization settings. It is executed on an 8-core i7-4790K with 32 GiB of RAM running Linux kernel v. 4.4.0.

Choosing block size
[Figure: two plots, each with a "Block size" axis]

Block-Max-Wand Compression
[Figure labels: maximum impact element, boundary doc-id]

Block-Max-Wand Compression (score quantization)
Uniform partitioning
Opt partitioning
Sort

Compression algorithms comparisons

Time in ms (ratio to BMW in parentheses)
            Gov2                            Clueweb09
            Trec2005       Trec2006         Trec2005        Trec2006
Wand        7.06 (1.92x)   12.92 (1.55x)    28.85 (2.25x)   37.55 (1.40x)
MaxScore    6.59 (1.79x)   11.35 (1.36x)    23.58 (1.84x)   32.28 (1.21x)
BMW         3.67           8.33             12.81           26.64

Space in bits per posting (ratio to the plain index in parentheses)
                Gov2           Clueweb09
Plain index     6.91           8.36
Wand/MaxScore   7.24 (1.04x)   8.65 (1.03x)
BMW/VBMW        9.14 (1.32x)   10.68 (1.27x)
VBMW c.         8.07 (1.16x)   9.51 (1.13x)

Time in ms (ratio to VBMW in parentheses)
            Gov2                            Clueweb09
            Trec2005       Trec2006         Trec2005        Trec2006
Wand        7.06 (3.34x)   12.92 (2.72x)    28.85 (3.98x)   37.55 (2.55x)
MaxScore    6.59 (3.12x)   11.35 (2.39x)    23.58 (3.25x)   32.28 (2.11x)
BMW         3.67 (1.73x)   8.33 (1.75x)     12.81 (1.77x)   26.64 (1.74x)
VBMW        2.11           4.75             7.25            15.30
VBMW c.     2.35 (1.11x)   5.29 (1.11x)     8.21 (1.13x)    17.00 (1.11x)

Longer skipping
We can do better than skipping to the block boundary.
[Figure labels: Boundary, LS-boundary]
Two ways to find the longer-skip boundary (LS-boundary):
● Iterate over the blocks at runtime
● Add a pointer per block

Longer skipping
ClueWeb - Trec2005
              2              3              4              5               6+
VBMW          3.17 (1.45x)   6.39 (1.13x)   8.92 (1.04x)   14.46 (1.00x)   32.04 (1.03x)
VBMW LS       2.18           5.66           8.57           14.44           31.05 (1.04x)
VBMW c.       3.53 (1.31x)   6.97 (1.15x)   9.86 (1.04x)   16.06 (1.00x)   36.26 (1.01x)
VBMW LSP c.   2.68           6.3            9.52           16.07           36.01
