A paper about a nice index compression method from latest WSDM'13 proceeding.
The paper uses Elias-Fano representation & a ranked characteristic function to compress inverted index, and both compression rate & speed are very good.
1. Quasi Succinct IndicesQuasi Succinct Indices ((WSDM’13)WSDM’13)
Author:Author: Sebastiano VignaSebastiano Vigna
Slides By:Slides By: Han JiangHan Jiang
2. AgendaAgenda
Related workRelated work
Representation of monotone sequencesRepresentation of monotone sequences
Practical examplePractical example
Theoretical estimationTheoretical estimation
Implementation detailsImplementation details
Index structureIndex structure
MiscellaneousMiscellaneous
ExperimentsExperiments
DiscussionsDiscussions
3. Related workRelated work
Why index compression:Why index compression:
Saves disk spaceSaves disk space
Reduce overhead between disk & memoryReduce overhead between disk & memory
[Index compression is good, especially for random access, CIKM’07]
Two tricks at the basis of index compression:Two tricks at the basis of index compression:
Instantaneous codes (or prefix codes)Instantaneous codes (or prefix codes)
e.g. Variable byte
Gap encodingGap encoding
e.g. [1, 3, 9]e.g. [1, 3, 9] [1, 2, 6][1, 2, 6]
4. Related work +Related work +
Popular approaches:Popular approaches:
Variable BytesVariable Bytes
(VB, previously used in Lucene)
Gamma/Delta encodingGamma/Delta encoding
(at most 2*Theoretical lower bound)
Golomb codeGolomb code
(near theoretical lower bound)
PForDeltaPForDelta
(block encoding, efficient and cache friendly)
Unary: 8Unary: 8 000,000,001000,000,001
(stupidest, but efficient when combined with others, we’ll see this again)
……
5. AgendaAgenda
Related work √Related work √
Representation of monotone sequencesRepresentation of monotone sequences
Practical examplePractical example
Theoretical estimationTheoretical estimation
Implementation detailsImplementation details
Index structureIndex structure
MiscellaneousMiscellaneous
ExperimentsExperiments
DiscussionsDiscussions
7. Assume uu is the upper bound of this list (e.g. u=36)
Then lower width l is: (e.g. l=log(36/5)=2)
5 88 15 32
1 01 0010 0010 1111 1000 00
List = { }
101 01 01 000001 00110001 00High: Low:
Representation of monotone sequences +Representation of monotone sequences +
How to decide when splitting high/low bits?
Why don’t we operate d-gap before encoding?
We’ll leave it as implementation details
8. X0=5
1 01 0010 0010 1111 1000 00
List = { }
Theoretical estimationTheoretical estimation
101 01 01 000001 00110001 00High:High: Low:
For each value, we need:
n*L bits for lower part;
n bits for stop ‘1’ in unary code
But non-stop ‘0’s ?
X1=8 X2=8 X3=15 X4=32
Note that we only unary encode higher bits,
For each ‘0’, the value increases 2^l
This increment will only happen q times:
So the upper bound for this part is:
Then in total:
9. Theoretical estimation +Theoretical estimation +
So what?So what?
Let’s see the lower bound with ‘best’ format :Let’s see the lower bound with ‘best’ format :
Upper bound for Quasi-succinct encoding:Upper bound for Quasi-succinct encoding:
And it is proved that QS can achieve a ‘quasi’ optimalAnd it is proved that QS can achieve a ‘quasi’ optimal
resultresult : “: “ less than half a bit per element away”.less than half a bit per element away”.
That’s why it’s called ‘quasi’ succinct…That’s why it’s called ‘quasi’ succinct…
The information-theoretical lower bound for a non-strict monotoneThe information-theoretical lower bound for a non-strict monotone
list of n elements, within interval [0,u]: (thelist of n elements, within interval [0,u]: (the ≈ cancan
also be replaced byalso be replaced by >))
10. Short conclusionShort conclusion
No distribution of document gapsNo distribution of document gaps
Document reordering won’t affect index size muchDocument reordering won’t affect index size much
GeneralGeneral
Works for sequences both monotonic or notWorks for sequences both monotonic or not
Unary code is enoughUnary code is enough
And we’ll see it works well for skipping
SimpleSimple
A few unary reads and bit shifts
11. AgendaAgenda
Related work √Related work √
Representation of monotone sequences √Representation of monotone sequences √
Practical example √Practical example √
Theoretical estimation √Theoretical estimation √
Implementation detailsImplementation details
Index structureIndex structure
MiscellaneousMiscellaneous
ExperimentsExperiments
DiscussionsDiscussions
12. Index structure (no skipping)Index structure (no skipping)
Given bound ‘b’, advance to xGiven bound ‘b’, advance to xii so that xso that xii >= b>= b
X0=5
1 01 0010 0010 1111 1000 00
List = { }
101 01 01 000001 00110001 00High:High: Low:
X1=8 X2=8 X3=15 X4=32
It is easy to see that, xIt is easy to see that, xii must be after zeros.must be after zeros.
So, walking on the high bits list, when we reach bit position p, andSo, walking on the high bits list, when we reach bit position p, and
have already past zeros, we must be in the middle ofhave already past zeros, we must be in the middle of
This is why we don’t need d-gap on original List: the unary highThis is why we don’t need d-gap on original List: the unary high
bits should act as a ‘skip table’, with skip interval=2^lbits should act as a ‘skip table’, with skip interval=2^l
13. Index structure + (with skipping)Index structure + (with skipping)
X0=5
1 01 0010 0010 1111 1000 00
List = { }
101 01 01 000001 00110001 00High:High: Low:
X1=8 X2=8 X3=15 X4=32
The skipper can be surprisingly simple…The skipper can be surprisingly simple…
So, the skipper only need to store theSo, the skipper only need to store the locationlocation for everyfor every
q unary codes. (and the value j = p - i = p - q)q unary codes. (and the value j = p - i = p - q)
Note that, when scanning in the higher bits tableNote that, when scanning in the higher bits table
p = current bit locationp = current bit location
i = number of ‘1’s we read, telling us we’re reading Xi = number of ‘1’s we read, telling us we’re reading Xii
j = number of ‘0’s we read, telling us the value of higher bits isj = number of ‘0’s we read, telling us the value of higher bits is
i + j = pi + j = p
14. Index structure ++ (example)Index structure ++ (example)
X0=5
1 01 0010 0010 1111 1000 00
List = { }
1
00110001 00
High:High:
Low:
X1=8 X2=8 X3=15 X4=32
0 10 01 01 00 00 1
Skip interval=4, next pos=7
value before next skip = (pos – interval) * 2^l = 3 * 4 = 12
Advance Target = 22
so we can skip, and should walk three bits to get 24 > 22
complete current unary, then read lower bits, got result X4 = 32
15. Index structure +++ (conceptual layout)Index structure +++ (conceptual layout)
Size of each sectionSize of each section
Metadata sectionMetadata section records n: num of elements, u: value upper bound, etcrecords n: num of elements, u: value upper bound, etc
Skip tableSkip table p*w bits, (p: skip interval, w: data width)p*w bits, (p: skip interval, w: data width)
Lower bitsLower bits n*l bits, (l: estimated width)n*l bits, (l: estimated width)
Upper bitsUpper bits unknown without metadata, so put in last sectionunknown without metadata, so put in last section
For doc ids, the sequence is strictly monotonicFor doc ids, the sequence is strictly monotonic
For doc freqs, the sequence is ‘prefix sum of freq’, i.e.For doc freqs, the sequence is ‘prefix sum of freq’, i.e.
For positions, the format is a little different, and we’ll leave this for nowFor positions, the format is a little different, and we’ll leave this for now
16. Index structure ++++ (for dense sequence)Index structure ++++ (for dense sequence)
However it’s not efficient when the sequence is very dense…However it’s not efficient when the sequence is very dense…
Here we’ll encode the sequence as a bit sequence insteadHere we’ll encode the sequence as a bit sequence instead
where: Bit k is set when Xwhere: Bit k is set when Xii == k== k
10 11 10 10 0
X0=1List = { }X1=2 X2=3 X3=5 X4=7
This is only for ‘strictly monotone sequence’This is only for ‘strictly monotone sequence’
Skipper will be set for every q positions, and store num of ‘1’ s before that.Skipper will be set for every q positions, and store num of ‘1’ s before that.
We’ll cutover to this format when n > u/3We’ll cutover to this format when n > u/3
17. AgendaAgenda
Related work √Related work √
Representation of monotone sequences √Representation of monotone sequences √
Practical example √Practical example √
Theoretical estimation √Theoretical estimation √
Implementation detailsImplementation details
Index structure √Index structure √
MiscellaneousMiscellaneous
ExperimentsExperiments
DiscussionsDiscussions
18. Miscellaneous (design of position list)Miscellaneous (design of position list)
For a term t, all its position lists are stored as one sequence:For a term t, all its position lists are stored as one sequence:
The length of this sequence is total_term_freq, and the upper bound is:The length of this sequence is total_term_freq, and the upper bound is:
To revive positions from document i, we need:To revive positions from document i, we need:
Sum of frq from previous documentsSum of frq from previous documents
Sum of p from previous documentsSum of p from previous documents
(also from current document, if we need more frequent skip)(also from current document, if we need more frequent skip)
These will be store in skipper for position listThese will be store in skipper for position list
19. Miscellaneous + (reuse logic)Miscellaneous + (reuse logic)
101 01 01 000001High:High:
To read past 4 values, we need unary decodingTo read past 4 values, we need unary decoding
To read past 4 ‘zero’s, we simply need ‘negated unary decoding’To read past 4 ‘zero’s, we simply need ‘negated unary decoding’
Another aspect of higher bits:Another aspect of higher bits:
0 10High:High: 110 10 0 0 0 0 1
20. AgendaAgenda
Related work √Related work √
Representation of monotone sequences √Representation of monotone sequences √
Practical example √Practical example √
Theoretical estimation √Theoretical estimation √
Implementation details √Implementation details √
Index structure √Index structure √
Miscellaneous √Miscellaneous √
ExperimentsExperiments
DiscussionsDiscussions
21. ExperimentsExperiments
Five competitors:Five competitors:
Lucene 3.6 (VB)Lucene 3.6 (VB) [sigh, not the latest version]
MG4J (gamma/delta)MG4J (gamma/delta) [an old version written by the author]
Zettair (VB)Zettair (VB)
Kamikaze (PForDelta)Kamikaze (PForDelta)
Optimized PForDelta implementation in COptimized PForDelta implementation in C
Four datasets with different statistics:Four datasets with different statistics:
TREC GOV2 (25M documents)TREC GOV2 (25M documents)
.uk dataset (132M documents).uk dataset (132M documents)
Mimir index (1M documents)Mimir index (1M documents)
Tweet data (13M documents)Tweet data (13M documents)
Aside from whole HTML index, title field is also extracted as another test groupAside from whole HTML index, title field is also extracted as another test group
To make sure the tests is fair enough between competitors, input data is a pre-parsedTo make sure the tests is fair enough between competitors, input data is a pre-parsed
stream of UTF-8 text documents.stream of UTF-8 text documents.
23. Experiments ++ (speed)Experiments ++ (speed)
Design of queries:Design of queries:
150 Queries from Terabyte track (04~06), as150 Queries from Terabyte track (04~06), as
Conjunctive QueryConjunctive Query
Phrasal QueryPhrasal Query
Proximity Query (query words must appear within a window of 16)Proximity Query (query words must appear within a window of 16)
Term Scanning Query (pure test)Term Scanning Query (pure test)
Design of task:Design of task:
All engines will be set up to return exactly one resultAll engines will be set up to return exactly one result
The QS format is implemented with both Java and C++ for fair testThe QS format is implemented with both Java and C++ for fair test
Since both Lucene and MG4J interleaves doc id and freq, pure boolean query willSince both Lucene and MG4J interleaves doc id and freq, pure boolean query will
hurt when reading unused freq data, the QS* is a modified version to make test fairhurt when reading unused freq data, the QS* is a modified version to make test fair
25. Experiments ++++ (examples from old paper)Experiments ++++ (examples from old paper)
Almost pure unary reads
Without skipping
With heavy skipping
Heavy position addressing,
Hmm… however note that Lucene doesn’t
have skip table for position list…
26. DiscussionDiscussion
A DocIdSet with this representation is already implemented in LuceneA DocIdSet with this representation is already implemented in Lucene
(https://issues.apache.org/jira/browse/LUCENE-5084)(https://issues.apache.org/jira/browse/LUCENE-5084)
We’ll see performance comparison soon!We’ll see performance comparison soon!
Drawbacks?Drawbacks?
It might take more time during index construction:It might take more time during index construction:
Many statistics needed for encoding (upper bound, total_term_frq, etc)Many statistics needed for encoding (upper bound, total_term_frq, etc)
It is possible to pre-store a postings list with VB in memory, then translated as QSIt is possible to pre-store a postings list with VB in memory, then translated as QS
To be digested…To be digested…
““storing positions with PForDelta codes is know to give a compression rate close to thatstoring positions with PForDelta codes is know to give a compression rate close to that
provided by VB coding” ?provided by VB coding” ?
Consider there are u numbers in a basket, each time after we pick up one, we then put the number back into the basket, so the possible combinations should be C(u+n, n), it is also the number of solutions for this: X1 + X2 + X3 + … + Xu = n ( Xi >= 0) And, when the sequence is strictly monotonic, the lower bound’s lower bound Z ~ nlog(u/n), So QS will achieves an index size with Z + O(n) here
Later when discuss about position list, we’ll mention why doc freq is encoded like this
That’s why we need to encode frq list as a monotone sequence