# Quasi succinct indices

A paper about a nice index compression method from the WSDM'13 proceedings.

The paper uses the Elias-Fano representation and a ranked characteristic function to compress inverted indexes; both the compression rate and the speed are very good.

• Introduction?
• And of course, IPC
• Consider a basket holding the u+1 numbers 0..u. Each time we pick a number we put it back into the basket, and we pick n times. The number of possible combinations is C(u+n, n), which is also the number of solutions of X0 + X1 + … + Xu = n (Xi ≥ 0). When the sequence is strictly monotonic, the lower bound is Z ≈ n·log(u/n), so QS achieves an index size of Z + O(n) here.
• Later, when we discuss the position list, we'll mention why doc freq is encoded like this.
• That's why we need to encode the frq list as a monotone sequence.
### Quasi succinct indices

1. Quasi Succinct Indices (WSDM'13). Author: Sebastiano Vigna. Slides by: Han Jiang.
2. Agenda: Related work; Representation of monotone sequences; Practical example; Theoretical estimation; Implementation details; Index structure; Miscellaneous; Experiments; Discussions.
3. Related work
   • Why index compression: it saves disk space and reduces the overhead between disk and memory [Index compression is good, especially for random access, CIKM'07].
   • Two tricks at the basis of index compression: instantaneous codes (or prefix codes), e.g. Variable Byte; and gap encoding, e.g. [1, 3, 9] → [1, 2, 6].
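The two tricks on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the byte layout (7 payload bits per byte, high bit set when more bytes follow) is one common Variable Byte convention, not necessarily the one any particular engine uses.

```python
def to_gaps(doc_ids):
    """Gap encoding: replace each value by its distance from the previous one."""
    prev, gaps = 0, []
    for x in doc_ids:
        gaps.append(x - prev)
        prev = x
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: a running sum restores the original values."""
    total, values = 0, []
    for g in gaps:
        total += g
        values.append(total)
    return values


def vbyte_encode(n):
    """Variable Byte: 7 payload bits per byte, least significant group first.

    Assumed convention: the high bit is set on every byte except the last.
    """
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


print(to_gaps([1, 3, 9]))        # the slide's example
print(vbyte_encode(300).hex())   # small gaps fit in a single byte, larger ones need more
```

Small gaps are what make a byte-aligned prefix code pay off: every gap below 128 costs exactly one byte.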
4. Related work +
   • Popular approaches:
   • Variable Byte (VB, previously used in Lucene)
   • Gamma/Delta encoding (at most 2 × the theoretical lower bound)
   • Golomb code (near the theoretical lower bound)
   • PForDelta (block encoding, efficient and cache friendly)
   • Unary, e.g. 8 → 000,000,001 (the most naive code, but efficient when combined with others; we'll see this again)
   • …
5. Agenda: Related work √; Representation of monotone sequences; Practical example; Theoretical estimation; Implementation details; Index structure; Miscellaneous; Experiments; Discussions.
6. Representation of monotone sequences
   • List = {5, 8, 8, 15, 32}. In binary, split as high | low bits: 1|01, 10|00, 10|00, 11|11, 1000|00.
   • High values: 1, 2, 2, 3, 8. Their d-gaps: 1, 1, 0, 1, 5. In unary: 01 01 1 01 000001.
   • Low bits: 01 00 00 11 00.
   • Total: 23 bits. For comparison, Gamma: 23 bits; Delta: 22 bits; VB: 40 bits.
7. Representation of monotone sequences +
   • Assume u is the upper bound of the list (e.g. u = 36) and n is the number of elements (here n = 5). Then the lower width is l = floor(log2(u/n)), e.g. l = floor(log2(36/5)) = 2.
   • List = {5, 8, 8, 15, 32}. High: 01 01 1 01 000001. Low: 01 00 00 11 00.
   • How do we decide where to split the high/low bits? Why don't we apply d-gaps to the list before encoding? We'll leave these as implementation details.
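The high/low split on this slide can be reproduced with a minimal sketch. The function name and the use of '0'/'1' strings instead of a packed bit array are assumptions made for readability; the sketch also assumes a non-empty, non-decreasing list with all values in [0, u].

```python
from math import floor, log2


def elias_fano_encode(values, u):
    """Split each value into l low bits and a unary-coded gap of the high part.

    Returns (l, high_bits, low_bits) as '0'/'1' strings. A real index would
    pack these into machine words; strings just keep the example readable.
    """
    n = len(values)
    l = max(0, floor(log2(u / n)))       # lower width, as on the slide
    mask = (1 << l) - 1
    high_bits, low_bits = [], []
    prev_high = 0
    for x in values:
        high = x >> l
        # Unary code of the gap between consecutive high parts: zeros, then a stop '1'.
        high_bits.append("0" * (high - prev_high) + "1")
        if l:
            low_bits.append(format(x & mask, "0{}b".format(l)))
        prev_high = high
    return l, "".join(high_bits), "".join(low_bits)


l, high, low = elias_fano_encode([5, 8, 8, 15, 32], u=36)
print(l, high, low, len(high) + len(low))
```

For the slide's list this gives l = 2 and a total of 13 + 10 = 23 bits, matching the count on the previous slide.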
8. Theoretical estimation
   • List = {X0=5, X1=8, X2=8, X3=15, X4=32}. High: 01 01 1 01 000001. Low: 01 00 00 11 00.
   • For each value we need l bits for the lower part (n*l bits in total) and one bit for the stop '1' of its unary code (n bits in total).
   • But what about the non-stop '0's? Note that we only unary encode the higher bits, so each '0' increases the value by 2^l. This increment can only happen q = floor(u / 2^l) times, so the upper bound for this part is floor(u / 2^l) <= 2n bits.
   • Then in total: n*l + n + floor(u / 2^l) <= n * (2 + ceil(log2(u/n))) bits.
9. Theoretical estimation +
   • So what? Let's see the lower bound with the 'best' format. The information-theoretical lower bound for a non-strict monotone list of n elements within the interval [0, u] is ceil(log2 C(u+n, n)) ≈ n * log2((u+n)/n) bits (the ≈ can also be replaced by >).
   • Upper bound for quasi-succinct encoding: n * (2 + ceil(log2(u/n))) bits.
   • It is proved that QS achieves a 'quasi' optimal result: "less than half a bit per element away". That's why it's called 'quasi' succinct.
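To make the comparison concrete, the sketch below evaluates the counting lower bound and the Elias-Fano size for the running example. The function names are illustrative, and the size and upper-bound formulas used are the standard Elias-Fano ones (n*l low bits, n stop bits, and one '0' per step of the high part), which is an assumption about what the slide's missing formulas showed.

```python
from math import ceil, comb, floor, log2


def lower_bound_bits(n, u):
    """ceil(log2 C(u+n, n)): bits needed to tell apart all non-decreasing
    lists of n values in [0, u]. (N - 1).bit_length() is an exact ceil(log2 N)."""
    return (comb(u + n, n) - 1).bit_length()


def elias_fano_bits(values, u):
    """Actual size: n*l low bits, n stop bits, one '0' per step of the high part."""
    n = len(values)
    l = max(0, floor(log2(u / n)))
    return n * l + n + (values[-1] >> l)


def elias_fano_upper_bound_bits(n, u):
    """Worst-case size n * (2 + ceil(log2(u / n)))."""
    return n * (2 + ceil(log2(u / n)))


values, u = [5, 8, 8, 15, 32], 36
print(lower_bound_bits(len(values), u))             # counting bound
print(elias_fano_bits(values, u))                   # size of this particular list
print(elias_fano_upper_bound_bits(len(values), u))  # worst case for n = 5, u = 36
```

On such a tiny list the constants dominate, so the numbers only illustrate how the three quantities relate; they are not a check of the per-element gap quoted on the slide.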
10. Short conclusion
   • Independent of the distribution of document gaps: document reordering won't affect the index size much.
   • General: works for sequences whether they are monotonic or not.
   • Unary code is enough, and we'll see it works well for skipping.
   • Simple: a few unary reads and bit shifts.
11. Agenda: Related work √; Representation of monotone sequences √; Practical example √; Theoretical estimation √; Implementation details; Index structure; Miscellaneous; Experiments; Discussions.
12. Index structure (no skipping)
   • Task: given a bound b, advance to the first x_i such that x_i >= b.
   • List = {X0=5, X1=8, X2=8, X3=15, X4=32}. High: 01 01 1 01 000001. Low: 01 00 00 11 00.
   • It is easy to see that x_i must come after floor(b / 2^l) zeros in the high bits. So, walking on the high bits list, when we reach bit position p and have already passed k zeros, we must be in the middle of the unary code of x_(p-k).
   • This is why we don't need d-gaps on the original list: the unary high bits act as a 'skip table' with skip interval 2^l.
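A minimal sketch of this search on the running example follows. The bit strings are hard-coded from the example list, and `next_geq` is a hypothetical name; a real implementation would scan machine words with bit tricks rather than a Python string.

```python
HIGH = "0101101000001"   # unary-coded gaps of the high parts of {5, 8, 8, 15, 32}
LOW = "0100001100"       # l = 2 low bits per element
L = 2


def next_geq(b):
    """Return (i, x_i) for the first element with x_i >= b, or None."""
    target_high = b >> L
    p = zeros = ones = 0
    # Step 1: pass floor(b / 2^l) zeros; every element met on the way is < b.
    while zeros < target_high and p < len(HIGH):
        if HIGH[p] == "0":
            zeros += 1
        else:
            ones += 1
        p += 1
    # Step 2: from here each '1' is a candidate; rebuild its value and test it.
    while p < len(HIGH):
        if HIGH[p] == "0":
            zeros += 1
        else:
            low = int(LOW[ones * L:(ones + 1) * L], 2)
            value = (zeros << L) | low
            if value >= b:
                return ones, value
            ones += 1
        p += 1
    return None


print(next_geq(6))    # first element >= 6
print(next_geq(22))   # first element >= 22
```

At any point the number of '1's read is the element index and the number of '0's read is the current high part, which is all the state the scan needs.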
13. Index structure + (with skipping)
   • List = {X0=5, X1=8, X2=8, X3=15, X4=32}. High: 01 01 1 01 000001. Low: 01 00 00 11 00.
   • Note that, when scanning the higher bits table: p = current bit location; i = number of '1's we have read, telling us we're reading x_i; j = number of '0's we have read, telling us the value of the higher bits is j; and i + j = p.
   • The skipper can be surprisingly simple: it only needs to store the location p for every q unary codes. The value j then follows as j = p - i (j = p - q for the first entry).
14. Index structure ++ (example)
   • List = {X0=5, X1=8, X2=8, X3=15, X4=32}. High: 01 01 1 01 | 000001, with the skip pointer placed after the 4th '1'. Low: 01 00 00 11 00.
   • Skip interval = 4, next pos = 7; value before next skip = (pos - interval) * 2^l = 3 * 4 = 12.
   • Advance target = 22 > 12, so we can skip. From there we walk three bits to get 24 > 22, complete the current unary code, then read the lower bits and get the result X4 = 32.
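The example on this slide can be traced with a small sketch. The names `build_skips` and `next_geq_with_skips` are illustrative, the bit strings are hard-coded from the example list, and storing only the bit position per entry is one possible reading of the slide, not a description of the paper's exact layout.

```python
HIGH = "0101101000001"   # high bits of {5, 8, 8, 15, 32}
LOW = "0100001100"       # l = 2 low bits per element
L = 2


def build_skips(high, q):
    """Bit position right after every q-th '1' (i.e. after every q unary codes)."""
    skips, ones = [], 0
    for p, bit in enumerate(high):
        if bit == "1":
            ones += 1
            if ones % q == 0:
                skips.append(p + 1)
    return skips


def next_geq_with_skips(b, skips, q):
    """Like a plain scan, but start from the furthest safe skip entry."""
    target_high = b >> L
    p = zeros = ones = 0
    for k, pos in enumerate(skips):
        i = (k + 1) * q            # elements before this entry
        j = pos - i                # zeros before this entry = high part reached
        if j < target_high:        # everything before the entry is < b
            p, ones, zeros = pos, i, j
        else:
            break
    while zeros < target_high and p < len(HIGH):
        if HIGH[p] == "0":
            zeros += 1
        else:
            ones += 1
        p += 1
    while p < len(HIGH):
        if HIGH[p] == "0":
            zeros += 1
        else:
            value = (zeros << L) | int(LOW[ones * L:(ones + 1) * L], 2)
            if value >= b:
                return ones, value
            ones += 1
        p += 1
    return None


skips = build_skips(HIGH, q=4)
print(skips)                                 # position after the 4th unary code
print(next_geq_with_skips(22, skips, q=4))   # the slide's target
```

Because the zero count is recoverable as position minus element count, the skip entry needs no separate value field in this sketch.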
15. Index structure +++ (conceptual layout)
   • Size of each section:
   • Metadata section: records n (number of elements), u (value upper bound), etc.
   • Skip table: p*w bits (p: skip interval, w: data width).
   • Lower bits: n*l bits (l: estimated width).
   • Upper bits: size unknown without the metadata, so they are put in the last section.
   • For doc ids, the sequence is strictly monotonic.
   • For doc freqs, the sequence is the 'prefix sum of freq', i.e. x_i = f_0 + f_1 + … + f_i.
   • For positions, the format is a little different, and we'll leave this for now.
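The 'prefix sum of freq' idea is easy to illustrate: frequencies are not monotone, but their running totals are. A minimal sketch, with illustrative function names and made-up frequencies:

```python
from itertools import accumulate


def freqs_to_monotone(freqs):
    """Running totals of the frequencies; strictly increasing when every freq >= 1."""
    return list(accumulate(freqs))


def monotone_to_freqs(sums):
    """Recover each frequency as the difference of two adjacent totals."""
    return [cur - prev for prev, cur in zip([0] + sums[:-1], sums)]


freqs = [3, 1, 4, 1]                 # made-up per-document frequencies
sums = freqs_to_monotone(freqs)
print(sums)
print(monotone_to_freqs(sums))
```

Reading one frequency then costs two lookups in the monotone sequence, and the running total itself tells how many occurrences precede a given document.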
16. Index structure ++++ (for dense sequences)
   • However, the representation above is not efficient when the sequence is very dense. Here we encode the sequence as a bit sequence instead, where bit k is set when some x_i == k.
   • Example: List = {X0=1, X1=2, X2=3, X3=5, X4=7} → 0 11 10 10 10.
   • This is only for strictly monotone sequences.
   • The skipper is set for every q positions and stores the number of '1's before that position.
   • We'll cut over to this format when n > u/3.
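The dense format can be sketched as a bitmap plus sampled rank counts. The function names, the choice q = 4 and the upper bound u = 8 (read off the 9-bit example) are assumptions for illustration; the search assumes b >= 0.

```python
def build_bitmap(values, u):
    """Characteristic bit vector over [0, u]: bit k is '1' when k is in the list."""
    bits = ["0"] * (u + 1)
    for x in values:
        bits[x] = "1"
    return "".join(bits)


def build_rank_samples(bitmap, q):
    """For every q-th position, the number of '1's strictly before it."""
    samples, ones = [], 0
    for k, bit in enumerate(bitmap):
        if k % q == 0:
            samples.append(ones)
        if bit == "1":
            ones += 1
    return samples


def next_geq_dense(bitmap, samples, q, b):
    """Return (i, x_i) for the first element >= b, or None."""
    if b >= len(bitmap):
        return None
    block = b // q
    # Index of the answer = sampled count + '1's between the sample point and b.
    rank = samples[block] + bitmap[block * q:b].count("1")
    for k in range(b, len(bitmap)):
        if bitmap[k] == "1":
            return rank, k
    return None


bitmap = build_bitmap([1, 2, 3, 5, 7], u=8)
samples = build_rank_samples(bitmap, q=4)
print(bitmap, samples)
print(next_geq_dense(bitmap, samples, 4, 4))
```

The samples are what turn a bit position back into an element index, which is needed to look up the matching frequency or position data.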
17. Agenda: Related work √; Representation of monotone sequences √; Practical example √; Theoretical estimation √; Implementation details; Index structure √; Miscellaneous; Experiments; Discussions.
18. Miscellaneous (design of position list)
   • For a term t, all its position lists are stored as one sequence [formula not captured in the transcript].
   • The length of this sequence is total_term_freq, and it has an upper bound [formula not captured in the transcript].
   • To recover the positions for document i, we need the sum of frq from the previous documents and the sum of p from the previous documents (also from the current document, if we need more frequent skips).
   • These will be stored in the skipper for the position list.
19. Miscellaneous + (reuse logic)
   • High: 01 01 1 01 000001. To read past 4 values, we need unary decoding.
   • Another aspect of the higher bits, grouped by '0' instead: High: 0 10 110 10 0 0 0 0 1. To read past 4 zeros, we simply need 'negated unary decoding'.
20. Agenda: Related work √; Representation of monotone sequences √; Practical example √; Theoretical estimation √; Implementation details √; Index structure √; Miscellaneous √; Experiments; Discussions.
21. Experiments
   • Five competitors: Lucene 3.6 (VB) [sigh, not the latest version]; MG4J (gamma/delta) [an old version written by the author]; Zettair (VB); Kamikaze (PForDelta); an optimized PForDelta implementation in C.
   • Four datasets with different statistics: TREC GOV2 (25M documents); .uk dataset (132M documents); Mimir index (1M documents); Tweet data (13M documents).
   • Aside from the whole HTML index, the title field is also extracted as another test group.
   • To make sure the tests are fair between competitors, the input data is a pre-parsed stream of UTF-8 text documents.
22. Experiments + (compression)
23. Experiments ++ (speed)
   • Design of queries: 150 queries from the Terabyte track (04~06), run as conjunctive queries, phrasal queries, proximity queries (query words must appear within a window of 16), and term scanning queries (pure test).
   • Design of task: all engines are set up to return exactly one result.
   • The QS format is implemented in both Java and C++ for a fair test.
   • Since both Lucene and MG4J interleave doc ids and freqs, pure boolean queries are hurt by reading unused freq data; QS* is a modified version that makes the test fair.
24. Experiments +++ (speed)
25. Experiments ++++ (examples from old paper)
   • Figure annotations: almost pure unary reads; without skipping; with heavy skipping; heavy position addressing.
   • Hmm… however, note that Lucene doesn't have a skip table for the position list…
26. Discussion
   • A DocIdSet with this representation is already implemented in Lucene (https://issues.apache.org/jira/browse/LUCENE-5084). We'll see a performance comparison soon!
   • Drawbacks? It might take more time during index construction: many statistics are needed for encoding (upper bound, total_term_frq, etc.). It is possible to first store a postings list with VB in memory and then translate it to QS.
   • To be digested… "storing positions with PForDelta codes is known to give a compression rate close to that provided by VB coding"?
27. Thank You!