Quasi Succinct Indices (WSDM’13)
Author: Sebastiano Vigna
Slides By: Han Jiang
Agenda
Related work
Representation of monotone sequences
Practical example
Theoretical estimation
Implementation details
Index structure
Miscellaneous
Experiments
Discussions
Related work
Why index compression:
Saves disk space
Reduces transfer overhead between disk and memory
[Index compression is good, especially for random access, CIKM’07]
Two tricks at the basis of index compression:
Instantaneous codes (or prefix codes)
e.g. Variable byte
Gap encoding
e.g. [1, 3, 9] → [1, 2, 6]
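A minimal sketch of gap encoding, using the example above. The function names `to_gaps` and `from_gaps` are made up for illustration; the input is assumed to be a non-decreasing Python list of non-negative integers.

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Replace each value by its difference from the previous one (the first is kept)."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]

def from_gaps(gaps):
    """Undo gap encoding with a running sum."""
    return list(accumulate(gaps))

print(to_gaps([1, 3, 9]))    # [1, 2, 6]
print(from_gaps([1, 2, 6]))  # [1, 3, 9]
```

The gaps are smaller than the original values, which is what makes the instantaneous codes effective.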
Related work +
Popular approaches:
Variable Bytes
(VB, previously used in Lucene)
Gamma/Delta encoding
(at most 2 * theoretical lower bound)
Golomb code
(near theoretical lower bound)
PForDelta
(block encoding, efficient and cache friendly)
Unary: 8 → 000,000,001
(the most naive code, but efficient when combined with others; we’ll see this again)
…
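To make two of these codes concrete, here is a small sketch of unary and Elias gamma coding. The helper names are hypothetical, bits are kept as Python strings for readability, and unary follows the convention shown above (x zeros followed by a stop ‘1’).

```python
def unary(x):
    """x zeros followed by a stop bit."""
    return "0" * x + "1"

def gamma(x):
    """Elias gamma: (length - 1) zeros, then x in binary. Defined for x >= 1."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary

print(unary(8))  # 000000001
print(gamma(8))  # 0001000
```

Gamma spends 2*floor(log2 x) + 1 bits on x, which is where the "at most 2 * lower bound" remark comes from.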
Representation of monotone sequences
List = { 5, 8, 8, 15, 32 }
Assume u is the upper bound of this list (e.g. u = 36) and n is its length (n = 5)
Then the lower width l is: $l = \lfloor \log_2(u/n) \rfloor$ (e.g. $l = \lfloor \log_2(36/5) \rfloor = 2$)
Split each value into its high bits and its l low bits:
Value        5      8      8      15     32
High|Low     1|01   10|00  10|00  11|11  1000|00
High value   1      2      2      3      8
d-gap        1      1      0      1      5
unary        01     01     1      01     000001
High: 01 01 1 01 000001    Low: 01 00 00 11 00
Total bits: 23 bits (13 high + 10 low)
Gamma: 23 bits
Delta: 22 bits
VB: 40 bits
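The split into high and low bits can be sketched as follows for the list {5, 8, 8, 15, 32} with u = 36. This is an illustration under stated assumptions, not the author's implementation: `qs_encode` is a made-up name, bits are Python strings rather than packed words, and l is taken as floor(log2(u/n)).

```python
from math import floor, log2

def qs_encode(values, u):
    """Return (l, high_bits, low_bits) for a non-decreasing list bounded by u."""
    n = len(values)
    l = floor(log2(u / n)) if u > n else 0
    mask = (1 << l) - 1
    # Low part: the l least significant bits of every value, stored verbatim.
    low = "".join(format(v & mask, f"0{l}b") for v in values) if l else ""
    # High part: gaps between consecutive high values, written in unary.
    high_parts, prev_high = [], 0
    for v in values:
        h = v >> l
        high_parts.append("0" * (h - prev_high) + "1")
        prev_high = h
    return l, "".join(high_parts), low

l, high, low = qs_encode([5, 8, 8, 15, 32], 36)
print(l, high, low)  # 2 0101101000001 0100001100
```

The two strings have 13 and 10 bits, which matches the 23 bits counted on the slide.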
Representation of monotone sequences +
How do we decide where to split high/low bits?
Why don’t we apply d-gaps to the original list before encoding?
We’ll leave these as implementation details
Theoretical estimation
List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
High: 01 01 1 01 000001    Low: 01 00 00 11 00
For the whole list, we need:
n*l bits for the lower part;
n bits for the stop ‘1’ of each unary code
But what about the non-stop ‘0’s?
Note that we only unary encode the higher bits,
so each ‘0’ increases the value by 2^l
Since no value exceeds u, this increment can happen at most $\lfloor u / 2^l \rfloor$ times
So the upper bound for this part is: $\lfloor u / 2^l \rfloor$ bits
Then in total: $n \cdot l + n + \lfloor u / 2^l \rfloor$ bits
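As a quick check of this accounting on the running example (n = 5, u = 36, l = 2), the three parts can be added up directly. The helper name `qs_size_bound` is made up, and the zero count uses the assumption that the high part of any value is at most floor(u / 2^l).

```python
def qs_size_bound(n, u, l):
    """Upper bound on the encoded size in bits."""
    lower = n * l    # l low bits per element
    stops = n        # one stop '1' per element
    zeros = u >> l   # at most floor(u / 2^l) non-stop zeros
    return lower + stops + zeros

print(qs_size_bound(5, 36, 2))  # 24
```

The example list needs 23 bits, one below this bound, because its largest high value is 8 rather than 9.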
Theoretical estimation +
So what?
Let’s see the lower bound with the ‘best’ format:
The information-theoretical lower bound for a non-strict monotone list of n elements within the interval [0,u] is $\lceil \log_2 \binom{u+n}{n} \rceil \approx n \log_2 \frac{u+n}{n}$ bits (the ≈ can also be replaced by >)
Upper bound for Quasi-succinct encoding: $n \, (2 + \lceil \log_2(u/n) \rceil)$ bits
And it is proved that QS achieves a ‘quasi’ optimal result: “less than half a bit per element away”.
That’s why it’s called ‘quasi’ succinct…
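The two bounds can be compared numerically. The sketch below assumes the lower bound is ceil(log2 C(u+n, n)), i.e. the number of non-decreasing lists of n values in [0, u], and the quasi-succinct upper bound is n * (2 + ceil(log2(u/n))). Both function names are made up for illustration.

```python
from math import ceil, comb, log2

def lower_bound_bits(n, u):
    """Bits needed to tell apart all non-decreasing lists of n values in [0, u]."""
    return ceil(log2(comb(u + n, n)))

def qs_upper_bound_bits(n, u):
    """Assumed quasi-succinct bound: 2 + ceil(log2(u/n)) bits per element."""
    return n * (2 + ceil(log2(u / n)))

print(lower_bound_bits(5, 36))     # 20
print(qs_upper_bound_bits(5, 36))  # 25
```

For the running example the actual encoding (23 bits) sits between the two numbers.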
Short conclusion
No assumption about the distribution of document gaps
Document reordering won’t affect index size much
General
Works for sequences whether monotone or not (non-monotone ones via prefix sums)
Unary code is enough
And we’ll see it works well for skipping
Simple
A few unary reads and bit shifts
Index structure (no skipping)
Given a bound ‘b’, advance to $x_i$ so that $x_i \ge b$
List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
High: 01 01 1 01 000001    Low: 01 00 00 11 00
It is easy to see that $x_i$ must come after $\lfloor b / 2^l \rfloor$ zeros.
So, walking on the high bits list, when we reach bit position p and have already passed j zeros, we must be in the middle of reading $x_{p-j}$
This is why we don’t need d-gaps on the original list: the unary high bits act as a ‘skip table’, with skip interval = 2^l
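A sketch of this walk without any skip table. It assumes the high bits are a Python string of unary codes (zeros followed by a stop ‘1’) and the low bits a string of l-bit fields; `next_geq` is a hypothetical name, and a real implementation would scan machine words instead of characters.

```python
def next_geq(high, low, l, b):
    """Return (index, value) of the first element >= b, or None if there is none."""
    target_zeros = b >> l  # the answer must come after this many zeros
    zeros = 0              # zeros read so far = current high value
    index = 0              # ones read so far = index of the current element
    for bit in high:
        if bit == "0":
            zeros += 1
            continue
        # A stop bit: only look at the low bits once enough zeros were passed.
        if zeros >= target_zeros:
            low_part = int(low[index * l:(index + 1) * l], 2) if l else 0
            value = (zeros << l) | low_part
            if value >= b:
                return index, value
        index += 1
    return None

print(next_geq("0101101000001", "0100001100", 2, 22))  # (4, 32)
```

Elements whose high value is too small are passed without touching their low bits at all.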
Index structure + (with skipping)
List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
High: 01 01 1 01 000001    Low: 01 00 00 11 00
The skipper can be surprisingly simple…
Note that, when scanning in the higher bits table:
p = current bit location
i = number of ‘1’s we have read, telling us we’re reading X_i
j = number of ‘0’s we have read, telling us the value of the higher bits is j
i + j = p
So, the skipper only needs to store the bit location p after every q unary codes (the value j then follows as j = p - i, e.g. j = p - q for the first entry)
Index structure ++ (example)
List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
High: 01 01 1 01 000001
Low: 01 00 00 11 00
Skip interval = 4, next pos = 7
value before next skip = (pos - interval) * 2^l = 3 * 4 = 12
Advance target = 22
so we can skip to position 7, and should then walk three more bits (zeros) to get 6 * 4 = 24 > 22
complete the current unary code, then read the lower bits: the result is X4 = 32
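The example can be replayed with a small skip table. This is a sketch under assumptions: `build_skips` and `advance` are made-up names, bits are Python strings, and a skip is only taken when every skipped element is guaranteed to be below the target (zeros at the skip point < floor(b / 2^l)), which is slightly more conservative than comparing 12 with 22.

```python
def build_skips(high, q):
    """Bit position just after every q-th stop bit in the high bits."""
    skips, ones = [], 0
    for pos, bit in enumerate(high):
        if bit == "1":
            ones += 1
            if ones % q == 0:
                skips.append(pos + 1)
    return skips

def advance(high, low, l, q, skips, b):
    """Return (index, value) of the first element >= b, using the skip table."""
    target_zeros = b >> l
    pos = ones = 0
    for k, p in enumerate(skips, start=1):
        # j = p - i zeros precede position p; skipped values are < (j + 1) * 2^l.
        if p - k * q < target_zeros:
            pos, ones = p, k * q
        else:
            break
    zeros = pos - ones
    while pos < len(high):
        if high[pos] == "0":
            zeros += 1
        else:
            if zeros >= target_zeros:
                low_part = int(low[ones * l:(ones + 1) * l], 2) if l else 0
                value = (zeros << l) | low_part
                if value >= b:
                    return ones, value
            ones += 1
        pos += 1
    return None

high, low = "0101101000001", "0100001100"
skips = build_skips(high, 4)
print(skips)                                # [7]
print(advance(high, low, 2, 4, skips, 22))  # (4, 32)
```

Only the bit position is stored per entry; the number of zeros is recovered as p - i, as noted on the previous slide.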
Index structure +++ (conceptual layout)
Size of each section
Metadata section   records n (number of elements), u (value upper bound), etc.
Skip table         p*w bits (p: skip interval, w: data width)
Lower bits         n*l bits (l: estimated width)
Upper bits         size unknown without metadata, so put in the last section
For doc ids, the sequence is strictly monotonic
For doc freqs, the sequence is the ‘prefix sum of freq’, i.e. $s_i = f_0 + f_1 + \dots + f_i$
For positions, the format is a little different, and we’ll leave this for now
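Frequencies are not monotone, so they are turned into a monotone sequence by prefix sums before encoding. A minimal sketch, with made-up helper names and assuming every frequency is at least 1:

```python
from itertools import accumulate

def freqs_to_prefix_sums(freqs):
    """Monotone sequence s_i = f_0 + ... + f_i that can be fed to the encoder."""
    return list(accumulate(freqs))

def freq_at(prefix_sums, i):
    """Recover f_i as the difference of two consecutive prefix sums."""
    return prefix_sums[i] - (prefix_sums[i - 1] if i > 0 else 0)

sums = freqs_to_prefix_sums([3, 1, 4])
print(sums)              # [3, 4, 8]
print(freq_at(sums, 2))  # 4
```

Reading one frequency therefore costs two lookups in the encoded sequence instead of one.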
Index structure ++++ (for dense sequence)
However, it’s not efficient when the sequence is very dense…
Here we’ll encode the sequence as a bit sequence instead
where: bit k is set when X_i == k for some i
List = { X0=1, X1=2, X2=3, X3=5, X4=7 }
Bits (k = 0..8): 0 1 1 1 0 1 0 1 0
This is only for ‘strictly monotone sequences’
A skipper entry will be set for every q positions, storing the number of ‘1’s before that position.
We’ll cut over to this format when n > u/3
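A sketch of the dense format for the list {1, 2, 3, 5, 7}, assuming u = 8 and a skip entry every q = 4 bit positions. The function names are hypothetical, and the bitmap is a plain Python list rather than packed words.

```python
def dense_encode(values, u, q):
    """Return (bits, ranks): a bitmap over [0, u] plus the number of ones before every q-th bit."""
    bits = [0] * (u + 1)
    for v in values:
        bits[v] = 1
    ranks = [sum(bits[:start]) for start in range(0, u + 1, q)]
    return bits, ranks

def dense_next_geq(bits, ranks, q, b):
    """Return (index, value) of the first element >= b, or None."""
    if b >= len(bits):
        return None
    block = b // q
    # Index of the candidate = ones before the block + ones inside it before b.
    index = ranks[block] + sum(bits[block * q:b])
    for k in range(b, len(bits)):
        if bits[k]:
            return index, k
    return None

bits, ranks = dense_encode([1, 2, 3, 5, 7], 8, 4)
print(bits)                                 # [0, 1, 1, 1, 0, 1, 0, 1, 0]
print(ranks)                                # [0, 3, 5]
print(dense_next_geq(bits, ranks, 4, 4))    # (3, 5)
```

Here n = 5 > u/3, so this list would take the dense path under the cut-over rule above.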
Miscellaneous (design of position list)
For a term t, all its position lists are stored as one sequence:
The length of this sequence is total_term_freq, and the upper bound is:
To recover the positions of document i, we need:
Sum of freq from previous documents
Sum of p from previous documents
(also from the current document, if we need more frequent skips)
These will be stored in the skipper for the position list
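One way to picture this layout is sketched below. It is an assumption, not necessarily the exact scheme of the paper: each document's positions are turned into gaps (the first gap is p_0 + 1), the gaps of all documents are concatenated, and the monotone sequence is their running sum. The names `encode_positions` and `positions_of` are made up.

```python
from itertools import accumulate

def encode_positions(per_doc_positions):
    """Concatenate per-document position gaps and take prefix sums."""
    gaps = []
    for positions in per_doc_positions:
        prev = -1
        for p in positions:
            gaps.append(p - prev)
            prev = p
    return list(accumulate(gaps))

def positions_of(sequence, freqs, i):
    """Recover the positions of document i from the shared sequence."""
    start = sum(freqs[:i])                           # sum of freq of previous documents
    base = sequence[start - 1] if start > 0 else 0   # running sum before document i
    return [sequence[k] - base - 1 for k in range(start, start + freqs[i])]

seq = encode_positions([[0, 3, 7], [2, 4]])
print(seq)                           # [1, 4, 8, 11, 13]
print(positions_of(seq, [3, 2], 1))  # [2, 4]
```

In this sketch `start` plays the role of the "sum of freq" and `base` the role of the "sum of p" from previous documents.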
Miscellaneous + (reuse logic)
High: 01 01 1 01 000001
To read past 4 values, we need unary decoding
To read past 4 ‘zero’s, we simply need ‘negated unary decoding’
Another aspect of higher bits:
High: 0 10 110 10 0 0 0 0 1
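The reuse can be seen by writing one scanning routine that is parameterised by the stop bit. A sketch with a hypothetical `skip_codes` helper, again over a bit string:

```python
def skip_codes(bits, count, stop):
    """Return the position just after the count-th occurrence of the stop bit, or None."""
    seen = 0
    for pos, bit in enumerate(bits):
        if bit == stop:
            seen += 1
            if seen == count:
                return pos + 1
    return None

high = "0101101000001"
print(skip_codes(high, 4, "1"))  # 7: read past 4 values (unary decoding)
print(skip_codes(high, 4, "0"))  # 8: read past 4 zeros (negated unary decoding)
```

The same loop serves both purposes; only the bit treated as the terminator changes.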
Experiments
Five competitors:
Lucene 3.6 (VB) [sigh, not the latest version]
MG4J (gamma/delta) [an old version written by the author]
Zettair (VB)
Kamikaze (PForDelta)
Optimized PForDelta implementation in C
Four datasets with different statistics:
TREC GOV2 (25M documents)
.uk dataset (132M documents)
Mimir index (1M documents)
Tweet data (13M documents)
Aside from the whole HTML index, the title field is also extracted as another test group
To make sure the tests are fair between competitors, the input data is a pre-parsed stream of UTF-8 text documents.
Experiments + (compression)
Experiments ++ (speed)
Design of queries:
150 queries from the Terabyte track (04~06), run as
Conjunctive Query
Phrasal Query
Proximity Query (query words must appear within a window of 16)
Term Scanning Query (pure test)
Design of task:
All engines are set up to return exactly one result
The QS format is implemented in both Java and C++ for a fair test
Since both Lucene and MG4J interleave doc ids and freqs, pure boolean queries are slowed down by reading unused freq data; QS* is a modified version that makes the test fair
Experiments +++ (speed)
Experiments ++++ (examples from old paper)
Almost pure unary reads
Without skipping
With heavy skipping
Heavy position addressing
Hmm… however, note that Lucene doesn’t have a skip table for position lists…
Discussion
A DocIdSet with this representation is already implemented in Lucene
(https://issues.apache.org/jira/browse/LUCENE-5084)
We’ll see a performance comparison soon!
Drawbacks?
It might take more time during index construction:
Many statistics are needed for encoding (upper bound, total_term_freq, etc.)
It is possible to first buffer a postings list in memory with VB, then translate it to QS
To be digested…
“storing positions with PForDelta codes is known to give a compression rate close to that provided by VB coding”?
Thank You!

Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 

Quasi succinct indices

  • 1. Quasi Succinct Indices (WSDM'13). Author: Sebastiano Vigna. Slides by: Han Jiang
  • 2. Agenda: Related work; Representation of monotone sequences; Practical example; Theoretical estimation; Implementation details; Index structure; Miscellaneous; Experiments; Discussions
  • 3. Related work
      Why index compression:
      • Saves disk space
      • Reduces transfer overhead between disk and memory [Index compression is good, especially for random access, CIKM'07]
      Two tricks at the basis of index compression:
      • Instantaneous codes (or prefix codes), e.g. variable byte
      • Gap encoding, e.g. [1, 3, 9] → [1, 2, 6]
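The gap-encoding trick on this slide can be sketched in a few lines. This is only an illustration; the function names `to_gaps` and `from_gaps` are made up for the example, and the first gap is taken relative to 0 so that the slide's numbers come out.

```python
def to_gaps(doc_ids):
    """Replace each id of a sorted list by its distance from the previous id."""
    gaps = []
    prev = 0  # assumption: the first gap is measured from 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


print(to_gaps([1, 3, 9]))    # [1, 2, 6]
print(from_gaps([1, 2, 6]))  # [1, 3, 9]
```

The gaps are smaller numbers than the ids, which is what lets a prefix code spend fewer bits on them.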
  • 4. Related work +
      Popular approaches:
      • Variable Bytes (VB, previously used in Lucene)
      • Gamma/Delta encoding (at most 2 × the theoretical lower bound)
      • Golomb code (near the theoretical lower bound)
      • PForDelta (block encoding, efficient and cache friendly)
      • Unary: 8 → 000,000,001 (the most naive code, but efficient when combined with others; we'll see this again)
      • …
  • 5. Agenda: Related work √; Representation of monotone sequences; Practical example; Theoretical estimation; Implementation details; Index structure; Miscellaneous; Experiments; Discussions
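To make two of the codes above concrete, here is a small sketch of unary and Elias gamma coding. It follows the slide's unary convention (x zeros followed by a stop '1') and the textbook definition of gamma; bit strings are plain Python `str` values purely for readability, and the function names are illustrative.

```python
def unary(x):
    """x zeros followed by a stop bit '1' (the convention used on the slide)."""
    return "0" * x + "1"


def gamma(x):
    """Elias gamma code for x >= 1: floor(log2 x) zeros, then x in binary."""
    if x < 1:
        raise ValueError("gamma is defined for x >= 1")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


print(unary(8))  # 000000001
print(gamma(5))  # 00101
```

Gamma spends 2*floor(log2 x) + 1 bits on x, which is where the "at most 2 × the lower bound" remark comes from.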
  • 6. Representation of monotone sequences
      List = { 5, 8, 8, 15, 32 }
      Binary (high | low):  1|01   10|00   10|00   11|11   1000|00
      Low bits:             01     00      00      11      00
      High bits (value):    1      2       2       3       8
      d-gap of high bits:   1      1       0       1       5
      Unary of d-gap:       01     01      1       01      000001
      Total bits: 23 bits (Gamma: 23 bits, Delta: 22 bits, VB: 40 bits)
  • 7. Representation of monotone sequences +
      Assume u is the upper bound of this list (e.g. u = 36).
      Then the lower width l is l = ⌊log2(u/n)⌋ (e.g. l = ⌊log2(36/5)⌋ = 2).
      List = { 5, 8, 8, 15, 32 }, binary (high | low): 1|01  10|00  10|00  11|11  1000|00
      High: 01 01 1 01 000001
      Low:  01 00 00 11 00
      How do we decide where to split the high and low bits? Why don't we apply d-gap before encoding? We'll leave these as implementation details.
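The split into high and low bits can be reproduced with a short encoder. This is a minimal sketch under stated assumptions, not the author's implementation: it assumes n ≥ 1, takes l = floor(log2(u/n)) as in the example, and returns the two bit streams as Python strings so they can be compared with the slide by eye. The name `qs_encode` is made up.

```python
def qs_encode(values, u):
    """Encode a non-decreasing list with upper bound u; return (l, high, low)."""
    n = len(values)
    # floor(log2(u / n)) computed on integers; 0 when u < n
    l = max(0, (u // n).bit_length() - 1)
    mask = (1 << l) - 1
    high_parts = []
    low_parts = []
    prev_high = 0
    for v in values:
        h = v >> l
        # unary code of the gap between consecutive high parts
        high_parts.append("0" * (h - prev_high) + "1")
        prev_high = h
        if l:
            low_parts.append(format(v & mask, "0{}b".format(l)))
    return l, "".join(high_parts), "".join(low_parts)


l, high, low = qs_encode([5, 8, 8, 15, 32], 36)
print(l, high, low)          # 2 0101101000001 0100001100
print(len(high) + len(low))  # 23
```

The 13 high bits plus 10 low bits give the 23 bits quoted on the previous slide.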
  • 8. Theoretical estimation
      List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
      High: 01 01 1 01 000001
      Low:  01 00 00 11 00
      For the n values, we need:
      • n*l bits for the lower part
      • n bits for the stop '1's of the unary codes
      But what about the non-stop '0's?
      Note that we only unary-encode the higher bits, so each '0' increases the value by 2^l.
      This increment can only happen q times: q ≤ ⌊u / 2^l⌋
      So the upper bound for this part is: ⌊u / 2^l⌋ bits
      Then in total: n*l + n + ⌊u / 2^l⌋ bits
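The counting argument on this slide can be written out as a short derivation. It is a reading of the slide rather than a quotation from the paper, and it assumes every value is at most u and that l = ⌊log2(u/n)⌋ as in the running example.

```latex
\begin{align*}
\#\text{zeros} &= \left\lfloor \frac{x_{n-1}}{2^{l}} \right\rfloor
                \le \left\lfloor \frac{u}{2^{l}} \right\rfloor \\
\text{total bits} &= \underbrace{n\,l}_{\text{low bits}}
                   + \underbrace{n}_{\text{stop ones}}
                   + \#\text{zeros}
                  \le n\,l + n + \left\lfloor \frac{u}{2^{l}} \right\rfloor
\end{align*}
```

For the example list (n = 5, u = 36, l = 2, largest value 32) this gives 8 zeros against a bound of 9, and 10 + 5 + 8 = 23 bits against a bound of 24.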
  • 9. Theoretical estimation +
      So what? Let's look at the lower bound of the 'best' format.
      The information-theoretical lower bound for a non-strict monotone list of n elements within the interval [0, u]:
          ⌈log2 C(u+n, n)⌉ ≈ n * log2((u+n)/n)   (the ≈ can also be replaced by >)
      Upper bound for quasi-succinct encoding:
          n * (2 + ⌈log2(u/n)⌉)
      It is proved that QS achieves a 'quasi'-optimal result: "less than half a bit per element away".
      That's why it's called 'quasi' succinct…
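For a feel of the numbers, the two bounds can be evaluated on the running example (n = 5, u = 36). The sketch below assumes the lower bound is ceil(log2 C(u+n, n)), as in the editor's note on this slide, and the upper bound is n * (2 + ceil(log2(u/n))); both function names are invented for the illustration, and everything is computed on integers to avoid floating-point rounding.

```python
from math import comb


def lower_bound_bits(n, u):
    """ceil(log2(C(u+n, n))): bits needed to distinguish all non-strict monotone lists."""
    count = comb(u + n, n)
    return (count - 1).bit_length()


def qs_upper_bound_bits(n, u):
    """n * (2 + ceil(log2(u / n))), assuming n >= 1."""
    if n == 0:
        return 0
    k = 0
    while n * (1 << k) < u:  # smallest k with 2^k >= u / n
        k += 1
    return n * (2 + k)


print(lower_bound_bits(5, 36))     # 20
print(qs_upper_bound_bits(5, 36))  # 25
```

The actual encoding of the example takes 23 bits, between the two figures; on a list this small the ceilings dominate, so the numbers only illustrate how the bounds are computed.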
  • 10. Short conclusion
      • No dependence on the distribution of document gaps: document reordering won't affect index size much
      • General: works for sequences whether they are monotone or not
      • Unary code is enough, and we'll see it works well for skipping
      • Simple: a few unary reads and bit shifts
  • 11. Agenda: Related work √; Representation of monotone sequences √; Practical example √; Theoretical estimation √; Implementation details; Index structure; Miscellaneous; Experiments; Discussions
  • 12. Index structure (no skipping)
      Given a bound b, advance to x_i so that x_i >= b.
      List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
      High: 01 01 1 01 000001
      Low:  01 00 00 11 00
      It is easy to see that x_i must come after ⌊b / 2^l⌋ zeros.
      So, walking along the high-bits list, when we reach bit position p and have already passed ⌊b / 2^l⌋ zeros, we must be in the middle of the unary code of x_i with i = p - ⌊b / 2^l⌋.
      This is why we don't need d-gap on the original list: the unary high bits act as a 'skip table' with skip interval = 2^l.
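The "advance to the first x_i >= b" operation can be sketched directly on the two bit streams. This is a simplified illustration, not the paper's code: bit streams are Python strings, the scan is bit by bit rather than word by word, and the name `next_geq` is an assumption.

```python
HIGH = "0101101000001"  # high bits of {5, 8, 8, 15, 32} with l = 2
LOW = "0100001100"      # low bits, l bits per element


def next_geq(high, low, l, b):
    """Return (index, value) of the first element >= b, or None if there is none."""
    target_zeros = b >> l
    p = zeros = 0
    # Pass target_zeros zeros: every element before that point is < b.
    while p < len(high) and zeros < target_zeros:
        if high[p] == "0":
            zeros += 1
        p += 1
    i = p - zeros  # ones read so far = index of the next element
    # Decode elements one at a time until one reaches b.
    while p < len(high):
        if high[p] == "0":
            zeros += 1
        else:
            lo = int(low[i * l:(i + 1) * l], 2) if l else 0
            value = (zeros << l) | lo
            if value >= b:
                return i, value
            i += 1
        p += 1
    return None


print(next_geq(HIGH, LOW, 2, 22))  # (4, 32)
print(next_geq(HIGH, LOW, 2, 6))   # (1, 8)
```

After the first loop the index of the current element falls out of the position for free (i = p - zeros), which is the point the slide makes about not needing d-gaps.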
  • 13. Index structure + (with skipping)
      List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
      High: 01 01 1 01 000001
      Low:  01 00 00 11 00
      The skipper can be surprisingly simple…
      Note that, when scanning the higher-bits table:
      • p = current bit location
      • i = number of '1's we have read, telling us we're reading X_i
      • j = number of '0's we have read, telling us the value of the higher bits is j
      • i + j = p
      So the skipper only needs to store the location p for every q unary codes (and the value j = p - i = p - q).
  • 14. Index structure ++ (example)
      List = { X0=5, X1=8, X2=8, X3=15, X4=32 }
      High: 01 01 1 01 000001
      Low:  01 00 00 11 00
      Skip interval = 4, next pos = 7
      Value before next skip = (pos - interval) * 2^l = 3 * 4 = 12
      Advance target = 22, so we can skip, and should walk three more bits to get 24 > 22
      Complete the current unary code, then read the lower bits: the result is X4 = 32
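The skipping example can be replayed with a toy skip table that stores, for every q-th stop bit, the bit position right after it. This is a sketch under assumptions (string bit streams, a linear pass over the skip entries, hypothetical names `build_skips` and `next_geq_with_skips`); it skips only when the zeros passed at a skip point are strictly fewer than b >> l, so no element >= b can be jumped over.

```python
HIGH = "0101101000001"
LOW = "0100001100"


def build_skips(high, q):
    """skips[k] = bit position right after the (k+1)*q-th '1' in the high bits."""
    skips = []
    ones = 0
    for p, bit in enumerate(high):
        if bit == "1":
            ones += 1
            if ones % q == 0:
                skips.append(p + 1)
    return skips


def next_geq_with_skips(high, low, l, skips, q, b):
    """Return (index, value) of the first element >= b, or None."""
    target_zeros = b >> l
    p = zeros = 0
    for k, pos in enumerate(skips):
        zeros_before = pos - (k + 1) * q  # j = p - i at the skip point
        if zeros_before >= target_zeros:
            break
        p, zeros = pos, zeros_before
    i = p - zeros
    while p < len(high):
        if high[p] == "0":
            zeros += 1
        else:
            lo = int(low[i * l:(i + 1) * l], 2) if l else 0
            value = (zeros << l) | lo
            if value >= b:
                return i, value
            i += 1
        p += 1
    return None


skips = build_skips(HIGH, 4)
print(skips)                                            # [7]
print(next_geq_with_skips(HIGH, LOW, 2, skips, 4, 22))  # (4, 32)
```

With q = 4 the single entry is position 7, where 3 zeros have been passed (value 12), matching the numbers on the slide.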
  • 15. Index structure +++ (conceptual layout)
      Size of each section:
      • Metadata section: records n (number of elements), u (value upper bound), etc.
      • Skip table: p*w bits (p: skip interval, w: data width)
      • Lower bits: n*l bits (l: estimated width)
      • Upper bits: size unknown without the metadata, so put in the last section
      For doc ids, the sequence is strictly monotonic.
      For doc freqs, the sequence is the 'prefix sum of freq', i.e. x_i = f_0 + f_1 + … + f_i.
      For positions, the format is a little different, and we'll leave this for now.
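Turning frequencies into a monotone sequence is just a running sum, and the original values come back as differences of neighbours. A minimal sketch with made-up function names and made-up frequencies:

```python
from itertools import accumulate


def freqs_to_monotone(freqs):
    """Prefix sums: since every freq is >= 1, the result is strictly increasing."""
    return list(accumulate(freqs))


def monotone_to_freqs(sums):
    """Recover each frequency as the difference of consecutive prefix sums."""
    return [s - prev for prev, s in zip([0] + sums[:-1], sums)]


sums = freqs_to_monotone([3, 1, 2])
print(sums)                     # [3, 4, 6]
print(monotone_to_freqs(sums))  # [3, 1, 2]
```

The last prefix sum is the total term frequency, so it can serve as the upper bound u of this sequence.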
  • 16. Index structure ++++ (for dense sequences)
      However, it's not efficient when the sequence is very dense…
      Here we'll encode the sequence as a bit sequence instead, where bit k is set when X_i == k for some i.
      List = { X0=1, X1=2, X2=3, X3=5, X4=7 }
      Bits (k = 0..8): 0 1 1 1 0 1 0 1 0
      This is only for strictly monotone sequences.
      A skipper entry is stored for every q positions, holding the number of '1's before that position.
      We'll cut over to this format when n > u/3.
  • 17. Agenda: Related work √; Representation of monotone sequences √; Practical example √; Theoretical estimation √; Implementation details; Index structure √; Miscellaneous; Experiments; Discussions
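The dense format can be sketched as a plain bitmap plus sampled counts of ones. The code below is an illustration only: the bitmap is a Python list of 0/1 integers, u = 8 is assumed for the slide's list, and the names `build_bitmap`, `build_rank_samples` and `rank` are invented.

```python
def build_bitmap(values, u):
    """bits[k] == 1 exactly when k occurs in the (strictly increasing) list."""
    bits = [0] * (u + 1)
    for v in values:
        bits[v] = 1
    return bits


def build_rank_samples(bits, q):
    """samples[t] = number of ones in bits[0 : t*q]."""
    samples = []
    ones = 0
    for k, bit in enumerate(bits):
        if k % q == 0:
            samples.append(ones)
        ones += bit
    return samples


def rank(bits, samples, q, k):
    """Number of list elements smaller than k, using the nearest sample."""
    block = k // q
    return samples[block] + sum(bits[block * q:k])


bits = build_bitmap([1, 2, 3, 5, 7], 8)
samples = build_rank_samples(bits, 4)
print(bits)                       # [0, 1, 1, 1, 0, 1, 0, 1, 0]
print(samples)                    # [0, 3, 5]
print(rank(bits, samples, 4, 6))  # 4
```

Here n = 5 and u = 8, so n > u/3 holds and this list would use the bitmap format under the slide's cutover rule.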
  • 18. Miscellaneous (design of position list)
      For a term t, all its position lists are stored as one sequence: [formula not recovered]
      The length of this sequence is total_term_freq, and the upper bound is: [formula not recovered]
      To recover the positions for document i, we need:
      • Sum of freq from previous documents
      • Sum of p from previous documents (also from the current document, if we need more frequent skips)
      These will be stored in the skipper for the position list.
  • 19. Miscellaneous + (reuse logic)
      High (grouped by stop '1'): 01 01 1 01 000001
      To read past 4 values, we need unary decoding.
      Another aspect of the higher bits:
      High (grouped by '0'): 0 10 110 10 0 0 0 0 1
      To read past 4 zeros, we simply need 'negated unary decoding'.
  • 20. Agenda: Related work √; Representation of monotone sequences √; Practical example √; Theoretical estimation √; Implementation details √; Index structure √; Miscellaneous √; Experiments; Discussions
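The "reuse logic" remark amounts to one routine that counts occurrences of a chosen bit: counting '1's reads past values, counting '0's is the negated variant. A small sketch, assuming string bit streams and a hypothetical helper name `skip_bits`:

```python
HIGH = "0101101000001"


def skip_bits(high, start, count, bit):
    """Advance from `start` past `count` occurrences of `bit`; return the new position."""
    p = start
    seen = 0
    while seen < count:
        if p >= len(high):
            raise ValueError("not enough occurrences of bit " + bit)
        if high[p] == bit:
            seen += 1
        p += 1
    return p


print(skip_bits(HIGH, 0, 4, "1"))  # 7  (past 4 values: unary decoding)
print(skip_bits(HIGH, 0, 4, "0"))  # 8  (past 4 zeros: negated unary decoding)
```

A real implementation would presumably work on machine words with bit tricks instead of scanning characters; only the symmetry between the two reads is the point here.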
  • 21. Experiments
      Five competitors:
      • Lucene 3.6 (VB) [sigh, not the latest version]
      • MG4J (gamma/delta) [an old version written by the author]
      • Zettair (VB)
      • Kamikaze (PForDelta)
      • Optimized PForDelta implementation in C
      Four datasets with different statistics:
      • TREC GOV2 (25M documents)
      • .uk dataset (132M documents)
      • Mimir index (1M documents)
      • Tweet data (13M documents)
      Aside from the whole HTML index, the title field is also extracted as another test group.
      To make sure the tests are fair between competitors, the input data is a pre-parsed stream of UTF-8 text documents.
  • 23. Experiments ++ (speed)
      Design of queries: 150 queries from the Terabyte track (04~06), run as:
      • Conjunctive query
      • Phrasal query
      • Proximity query (query words must appear within a window of 16)
      • Term scanning query (pure test)
      Design of task:
      • All engines are set up to return exactly one result
      • The QS format is implemented in both Java and C++ for a fair test
      • Since both Lucene and MG4J interleave doc ids and freqs, a pure boolean query is slowed down by reading unused freq data; QS* is a modified version to make the test fair
  • 25. Experiments ++++ (examples from old paper)
      Cases shown: almost pure unary reads; without skipping; with heavy skipping; heavy position addressing.
      Hmm… however, note that Lucene doesn't have a skip table for position lists…
  • 26. Discussion
      A DocIdSet with this representation is already implemented in Lucene (https://issues.apache.org/jira/browse/LUCENE-5084). We'll see a performance comparison soon!
      Drawbacks?
      • It might take more time during index construction: many statistics are needed for encoding (upper bound, total_term_frq, etc.)
      • It is possible to pre-store a postings list with VB in memory, then translate it to QS
      To be digested…
      • "storing positions with PForDelta codes is known to give a compression rate close to that provided by VB coding"?

Editor's Notes

  1. Introduction?
  2. And of course, IPC
  3. Consider the u+1 numbers 0..u in a basket: each time we pick one, we put it back, so the number of possible combinations after n picks is C(u+n, n). This is also the number of solutions of X0 + X1 + … + Xu = n (Xi >= 0). When the sequence is strictly monotonic, the lower bound itself is bounded below by Z ~ n log(u/n), so QS achieves an index size of Z + O(n) here.
  4. Later, when we discuss the position list, we'll mention why doc freq is encoded like this
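The counting claim in note 3 is easy to check by brute force on small cases. The sketch below assumes the values range over 0..u inclusive (u+1 distinct numbers), which is what makes the count C(u+n, n); the helper name is made up.

```python
from itertools import combinations_with_replacement
from math import comb


def count_monotone_lists(n, u):
    """Count non-decreasing lists of n values drawn from 0..u by enumeration."""
    return sum(1 for _ in combinations_with_replacement(range(u + 1), n))


print(count_monotone_lists(2, 3), comb(3 + 2, 2))  # 10 10
```

Each non-decreasing list corresponds to exactly one multiset of n values, which is why picking with replacement gives the same count.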
  5. That's why we need to encode the frq list as a monotone sequence