Faster and smaller inverted indices with Treaps Research Paper
1. Faster and Smaller Inverted Indices with Treaps
SD Nelson 148232M
DMI De Silva 148207R
2. Outline
• Introduction
• Basic Concepts
• Related Work
• Treap Usage
• Experiments & Results
• Conclusions
3. Introduction
• New Representation of inverted index, based on
Treap data structure
• Two main challenges in Modern Information retrieval
systems
• Manage huge amounts of data
• Return very precise results in response to user queries
• Two-stage ranking process
• A fast, simple first stage extracts hundreds or thousands of candidates from billions of documents
• A complex learned ranking then reduces the candidate set
• Focus on improving the efficiency of the first stage
4. Introduction
• Two approaches for first stage
• Ranked intersection
Boolean intersection & computation of scores for documents
• Ranked union
Approximate form, avoiding a costly Boolean union
• New compressed representation for posting lists
• Performs ranked intersections & (exact) unions directly
• Based on the Treap data structure
• Allows both document identifiers and weights to be differentially encoded
5. Basic Concepts
• Inverted index for efficient processing of ranked and
Boolean queries
• Index stores the vocabulary of the collection and a posting list per term, each posting holding:
• Document identifier (docid)
• Weight of the term
• Compression is achieved by differentially encoding either the document identifiers or the weights
• New in-memory posting list implementation instead of traditional on-disk storage.
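As an illustration of the differential encoding mentioned above, here is a minimal sketch for a plain docid-sorted posting list. The function names and the sample docids are assumptions made for this example; they do not come from the paper.

```python
def encode_gaps(docids):
    """Replace each docid in an increasing list by its gap to the previous one."""
    gaps = []
    prev = 0
    for d in docids:
        gaps.append(d - prev)  # small numbers compress better than raw ids
        prev = d
    return gaps


def decode_gaps(gaps):
    """Rebuild the original docids with a running sum over the gaps."""
    docids = []
    total = 0
    for g in gaps:
        total += g
        docids.append(total)
    return docids


postings = [3, 7, 8, 15, 42]      # hypothetical docids
gaps = encode_gaps(postings)      # [3, 4, 1, 7, 27]
restored = decode_gaps(gaps)      # same list as postings
```

The gaps are smaller than the raw identifiers, which is what lets a variable-length code store them in fewer bits.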
6. Related Work
• Two query processing strategies
• Term-at-a-time (TAAT) - one posting list after the other, shortest
to longest
• Document-at-a-time (DAAT) - lists are processed in parallel
looking for the same document in all.
• Ranked intersection strategies employ full Boolean
intersection
• followed by a post processing step for ranking
• Strategies used for ranked union and intersection queries
in the paper can be classified as DAAT
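The DAAT idea above can be sketched as a parallel walk over docid-sorted lists. This is a simplified illustration, not the paper's algorithm: the function name is hypothetical, and scoring a document by the sum of its weights is an assumption made to keep the example short.

```python
def daat_intersect(lists):
    """Intersect docid-sorted posting lists of (docid, weight) pairs.

    All lists are advanced in parallel; a document is reported only when
    every list is positioned on the same docid. The score is the sum of
    the weights (an assumed, simplistic scoring function).
    """
    pointers = [0] * len(lists)
    results = []
    while all(p < len(l) for p, l in zip(pointers, lists)):
        current = [l[p][0] for p, l in zip(pointers, lists)]
        target = max(current)
        if all(c == target for c in current):
            score = sum(l[p][1] for p, l in zip(pointers, lists))
            results.append((target, score))
            pointers = [p + 1 for p in pointers]
        else:
            # Advance every list that is behind the largest current docid.
            for i, l in enumerate(lists):
                while pointers[i] < len(l) and l[pointers[i]][0] < target:
                    pointers[i] += 1
    return results


list_a = [(1, 2), (3, 1), (5, 4), (9, 1)]    # hypothetical postings
list_b = [(3, 3), (4, 1), (5, 1), (10, 2)]
matches = daat_intersect([list_a, list_b])   # [(3, 4), (5, 5)]
```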
7. Related Work
• Two approaches: Block-Max
• Special-purpose structure for ranked intersections and unions
• Sorts the list by increasing docid, cuts lists into blocks, and stores the maximum weight for each block
• Enables skipping whole blocks whose maximum possible contribution is very low, by comparing the block's maximum weight with a threshold
• Obtains considerable performance gains over the previous techniques for exact ranked unions and ranked intersections
• New technique can be seen as a generalization of the Block-Max concept
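The block-skipping idea can be illustrated with a small sketch. It assumes a single posting list, a fixed block size, and a plain weight threshold; the function names and sample data are hypothetical and the real Block-Max index is considerably more involved.

```python
def build_block_max(postings, block_size):
    """Cut a docid-sorted list of (docid, weight) into blocks and
    record the maximum weight of each block."""
    blocks = []
    for i in range(0, len(postings), block_size):
        block = postings[i:i + block_size]
        blocks.append((max(w for _, w in block), block))
    return blocks


def scan_with_skips(blocks, threshold):
    """Return postings whose weight exceeds the threshold, skipping every
    block whose stored maximum cannot exceed it."""
    hits, skipped = [], 0
    for block_max, block in blocks:
        if block_max <= threshold:
            skipped += 1          # whole block skipped without decoding it
            continue
        hits.extend((d, w) for d, w in block if w > threshold)
    return hits, skipped


postings = [(1, 1), (2, 2), (5, 1), (7, 9), (8, 1), (9, 1), (12, 3), (15, 2)]
blocks = build_block_max(postings, block_size=2)
hits, skipped = scan_with_skips(blocks, threshold=2)
# hits == [(7, 9), (12, 3)], skipped == 2
```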
8. Related Work
• Two approaches: Dual-sorted inverted lists
• Sorted by decreasing frequency, using a wavelet tree data structure
• TAAT processing for approximate ranked unions, DAAT-like processing for (exact) ranked intersections.
• Able to sort by both docids and weights simultaneously
• Not aware of the frequencies until reaching the individual documents
• Treaps give an upper bound on the frequencies in the current interval
• Treaps use less space, since Dual-Sorted can't use differential encoding on docids.
9. TREAPS - Basic Usage
• Treap representation of a posting list.
• Search key – document id
• Max heap property – term frequency (weight)
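A minimal sketch of how a posting list could be turned into such a treap: the posting with the highest frequency becomes the root, and the postings with smaller and larger docids form the left and right subtrees. The `Node` class, the `build_treap` name, and the leftmost tie-break are assumptions for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    docid: int
    freq: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_treap(postings, lo=0, hi=None):
    """Build a treap from postings[lo:hi], a docid-sorted list of (docid, freq).

    Binary-search-tree order follows docid; max-heap order follows freq.
    """
    if hi is None:
        hi = len(postings)
    if lo >= hi:
        return None
    # Root of this interval: the leftmost posting with maximum frequency.
    m = max(range(lo, hi), key=lambda i: postings[i][1])
    docid, freq = postings[m]
    return Node(docid, freq,
                build_treap(postings, lo, m),
                build_treap(postings, m + 1, hi))


def inorder(node):
    """In-order traversal; yields docids in increasing order."""
    if node is None:
        return []
    return inorder(node.left) + [node.docid] + inorder(node.right)


postings = [(2, 1), (5, 3), (8, 2), (11, 7), (14, 1), (20, 4)]  # hypothetical
root = build_treap(postings)
# root is (11, 7); its children are (5, 3) and (20, 4)
```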
10. TREAPS - Compacted Tree
• More compact tree topology representation via a general tree
• Introduce a fake root node in the general tree
• Treap root is the first child of the fake root node
• Left child of a Treap node becomes its first child in the general tree
• Right child of a Treap node becomes its next sibling
• Dashed lines show the original tree
• Represent the topology using a balanced parentheses representation.
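The mapping above can be sketched in a few lines. The treap is written here as nested tuples `(docid, freq, left, right)`, which is an assumed representation for the example, and the parentheses are produced as a string rather than a bit vector.

```python
def parens(node):
    """Balanced parentheses for a treap node under the mapping:
    left child -> first child, right child -> next sibling."""
    if node is None:
        return ""
    _docid, _freq, left, right = node
    # The node's own pair encloses its left subtree (its children in the
    # general tree); its right subtree follows as the next sibling.
    return "(" + parens(left) + ")" + parens(right)


def treap_to_bp(root):
    """Wrap everything in one extra pair for the fake root."""
    return "(" + parens(root) + ")"


# Hypothetical treap with 6 nodes, given as (docid, freq, left, right).
treap = (11, 7,
         (5, 3, (2, 1, None, None), (8, 2, None, None)),
         (20, 4, (14, 1, None, None), None))

bp = treap_to_bp(treap)   # "(((())())(()))"
```

With n treap nodes the sequence has 2(n + 1) parentheses, so the topology costs about two bits per posting.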
11. TREAPS - Differential Encoding
• Calculate docid and frequency differences for each node
• For a left child VL of node U:
• docid -> id(U) - id(VL)
• freq -> f(U) - f(VL)
• For a right child VR of node U:
• docid -> id(VR) - id(U)
• freq -> f(U) - f(VR)
• Store the differences instead of the actual values using DAC (Directly Addressable Codes)
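The differences defined above can be computed with a short traversal. This sketch assumes the treap is given as nested tuples `(docid, freq, left, right)` and that the root is stored with its absolute values; it outputs the pairs in preorder and does not implement the DAC itself.

```python
def encode(node, parent=None, is_left=False, out=None):
    """Append (docid difference, freq difference) for each node in preorder.

    parent is the (docid, freq) of the parent node U, or None for the root,
    whose values are kept as they are (an assumption of this sketch).
    """
    if out is None:
        out = []
    if node is None:
        return out
    docid, freq, left, right = node
    if parent is None:
        out.append((docid, freq))
    else:
        parent_id, parent_freq = parent
        # Left child: id(U) - id(VL); right child: id(VR) - id(U).
        id_diff = parent_id - docid if is_left else docid - parent_id
        # The heap property guarantees f(U) - f(V) >= 0 for both children.
        out.append((id_diff, parent_freq - freq))
    encode(left, (docid, freq), True, out)
    encode(right, (docid, freq), False, out)
    return out


# Hypothetical treap, given as (docid, freq, left, right).
treap = (11, 7,
         (5, 3, (2, 1, None, None), (8, 2, None, None)),
         (20, 4, (14, 1, None, None), None))

diffs = encode(treap)
# [(11, 7), (6, 4), (3, 2), (3, 1), (9, 3), (6, 3)]
```

Every difference is non-negative: the docid differences by the search-tree order, the frequency differences by the heap order.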
12. TREAPS - Improvements
• Use a single DAC for both docids and frequencies
• Keep the tree balanced: among the postings with the maximum frequency, choose the one closest to the center of the interval
• Omit all nodes having frequency below some threshold
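The effect of the balancing rule can be seen on a list where all frequencies are equal. The sketch below compares the resulting tree height for a leftmost tie-break and a center tie-break, and shows a threshold split; all function names are hypothetical, and what is done with the omitted postings is not stated in these slides.

```python
def pick_leftmost(postings, lo, hi):
    """Root choice without balancing: first posting with maximum frequency."""
    return max(range(lo, hi), key=lambda i: postings[i][1])


def pick_centered(postings, lo, hi):
    """Root choice with balancing: among the postings with maximum
    frequency in [lo, hi), take the one closest to the interval center."""
    top = max(f for _, f in postings[lo:hi])
    center = (lo + hi - 1) / 2
    candidates = [i for i in range(lo, hi) if postings[i][1] == top]
    return min(candidates, key=lambda i: abs(i - center))


def height(postings, lo, hi, choose):
    """Height of the treap that the given root-choice rule would build."""
    if lo >= hi:
        return 0
    m = choose(postings, lo, hi)
    return 1 + max(height(postings, lo, m, choose),
                   height(postings, m + 1, hi, choose))


def split_by_threshold(postings, threshold):
    """Separate postings kept in the treap from those omitted."""
    kept = [(d, f) for d, f in postings if f >= threshold]
    omitted = [(d, f) for d, f in postings if f < threshold]
    return kept, omitted


flat = [(d, 1) for d in range(1, 8)]                  # 7 postings, equal freq
h_left = height(flat, 0, len(flat), pick_leftmost)    # 7 (a path)
h_center = height(flat, 0, len(flat), pick_centered)  # 3 (balanced)

kept, omitted = split_by_threshold([(2, 1), (5, 3), (8, 2), (11, 7)], 2)
# kept == [(5, 3), (8, 2), (11, 7)], omitted == [(2, 1)]
```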
13. TREAPS – Query Processing
• Given a query Q composed of q terms t (t ∈ Q)
• Traverse the q treaps, accumulating the weight of each term t for each document
• Insert each document into a priority queue of size k
• If the queue size reaches k+1, remove the minimum
• Once the queue holds k documents, use its minimum score as a lower bound to discard documents that would otherwise be checked during intersection.
• Since each treap node holds the maximum frequency of its subtree, the whole subtree below a node can be discarded when that node cannot beat the lower bound.
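The pruning idea in the last two bullets can be sketched for the simplest case, a single-term query (q = 1), where a document's score is just its frequency. This is a deliberate simplification of the multi-term processing described above; the nested-tuple treap and the `top_k` name are assumptions for the example.

```python
import heapq


def top_k(root, k):
    """Return the k highest-frequency postings of one treap, plus the
    number of nodes actually visited.

    The treap is given as nested tuples (docid, freq, left, right).
    """
    if k <= 0:
        return [], 0
    heap = []        # min-heap of (freq, docid), never larger than k
    visited = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        docid, freq, left, right = node
        # Heap property: nothing below this node has a higher frequency,
        # so if it cannot beat the current minimum, skip the whole subtree.
        if len(heap) == k and freq <= heap[0][0]:
            continue
        visited += 1
        heapq.heappush(heap, (freq, docid))
        if len(heap) > k:
            heapq.heappop(heap)   # queue grew to k+1: remove the minimum
        stack.append(right)
        stack.append(left)
    return sorted(heap, reverse=True), visited


# Hypothetical treap with 6 nodes.
treap = (11, 7,
         (5, 3, (2, 1, None, None), (8, 2, None, None)),
         (20, 4, (14, 1, None, None), None))

best, visited = top_k(treap, 2)
# best == [(7, 11), (4, 20)], visited == 3
```

Only 3 of the 6 nodes are visited here; the subtrees under docids 5 and 20 are cut off by the lower bound.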
14. Experiments & Results
• Experimental setup
• TREC GOV2 collection – 25.2 million documents, 32.8 million
terms, 4.9 billion postings
• Intel Xeon 2.4GHz / 96GB RAM / 12MB cache
• Compared against other implementations
• Block-Max
• Dual-Sorted
• Traditional docid-sorted inverted index
• Traditional frequency-sorted inverted index
15. Experiments & Results
• Using differential encoding alone is not sufficient: 'Treap w/o f0' still has high space usage
• Omitting low-frequency items from treaps offers the lowest space usage (Treap)
• 22% less than Block-Max
• 18% less than Dual-Sorted
16. Experiments & Results
• Treaps are effective for small k (k < 30): 3x faster for ranked intersection.
• Treap query time is affected by k, unlike Block-Max and Dual-Sorted.
• Explained by the number of documents accessed: with k=10, only 2.6% of the documents a full intersection accesses.
17. Experiments & Results
• For ranked union queries, the time
taken increases with k & q. Treaps
outperform Block-Max up to k=130
18. Conclusions
• New inverted index representation based on treaps, an elegant and flexible tool
• Simultaneous representation of docid / weight ordering
of posting list
• Both docids & frequencies in differential form
• Significant gains in space and time
• 20% less space / 3x faster