- 1. x86/x64 Optimization Study Group #4 (x86/x64最適化勉強会#4): An x86-Optimized Rank/Select Dictionary for Bit Sequences. 2012/6/16, Takeshi Yamamuro
- 2. What Is a Succinct Data Structure?
- 3. SDS: Succinct Data Structure • Recently becoming popular in some areas – Research & engineering • Not a data structure, but a data representation – A compression method for other data structures – e.g., alphabets, trees, and graphs • Transparent operations without explicit unpacking – e.g., succinct LZ77 compression*1 *1 Kreft, S. and Navarro, G.: LZ77-Like Compression with Fast Random Access, In Proceedings of DCC, 2010
- 4. More Details • SDS = Succinct Data + Succinct Index • Succinct Data – Compact representation of the target data – Close to the information-theoretic lower bound, e.g., with N possible patterns the lower bound is log N bits • Succinct Index – O(1) operations on the target data – o(N) space cost: asymptotically negligible
- 5. More Details • If you need more information, ... cited from: http://goo.gl/rkQ5z
- 6. A Rank/Select Dictionary for SDS
- 7. Rank/Select Operations • An SDS is composed of rank/select operations – Many internal calls to rank/select • Rank/select for succinct bit sequences B[i] – rank_x(n, B): the number of x's in B[0...n] – select_x(n, B): the position of the n-th x in B[] (n counts from 1) • Example: for i = 0 1 2 3 4 5 6 7 8 and B[i] = 1 0 1 1 0 0 1 1 0, rank_1(5, B) = 3 and select_1(4, B) = 6
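The definitions on this slide can be checked with a naive linear scan. The following is only a minimal sketch, not the implementation presented in the talk; the function names `rank` and `select` and the inclusive range `B[0...n]` are taken from the slide's wording, and counting the n-th occurrence from 1 is an assumption chosen to match the slide's example.

```python
def rank(bits, n, x=1):
    """Count occurrences of x in bits[0..n], including position n."""
    return sum(1 for b in bits[: n + 1] if b == x)


def select(bits, n, x=1):
    """Return the position of the n-th occurrence of x (n counts from 1)."""
    seen = 0
    for i, b in enumerate(bits):
        if b == x:
            seen += 1
            if seen == n:
                return i
    raise ValueError("fewer than n occurrences of x")


# The bit sequence from the slide.
B = [1, 0, 1, 1, 0, 0, 1, 1, 0]
print(rank(B, 5))    # 3
print(select(B, 4))  # 6
```

Both functions take O(N) time here; the rest of the deck is about answering the same queries in O(1) with a small index.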
- 8. Rank/Select Operations • Available rank/select implementations – ux-trie: http://code.google.com/p/ux-trie/ – rx: http://code.google.com/p/mozc/ – marisa-trie: http://code.google.com/p/marisa-trie/ • Today's contribution – x86-optimized rank/select – https://github.com/maropu/dbitv
- 9. Performance Results • Benchmark setup*1 – Generate a random bit sequence with 50% density – Issue random rank/select queries over the bits – CPU: Intel Core-i5 U470@1.33GHz • Observed latency – Median latency over 11 trials *1 Reference: http://d.hatena.ne.jp/s-yata/20111216/1324032373
- 10. Performance Results: Rank [Figure: averaged rank latency (ns, log scale from 1.E+00 to 1.E+03) versus bit length, for ux, rx, marisa, and opt]
- 11. Performance Results: Select [Figure: averaged select latency (ns, log scale from 1.E+00 to 1.E+04) versus bit length, for ux, rx, marisa, and opt]
- 12. Implementation Details
- 13. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space • B[] = a sequence of N bits
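- 14. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space • B[] = a sequence of N bits, L[] = l1, l2, ... (one entry per block of $\log^2 N$ bits) • Split B[] into fixed-length blocks of $\log^2 N$ bits • Total counts pre-computed in L[] • With $q = \lfloor x / \log^2 N \rfloor$: $\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i] = \sum_{i=1}^{q \log^2 N} B[i] + \sum_{i=q \log^2 N + 1}^{x} B[i]$, where the first sum is $L[q]$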
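- 15. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space • B[] = a sequence of N bits, L[] = l1, l2, ... (one entry per block of $\log^2 N$ bits) • Split B[] into fixed-length blocks of $\log^2 N$ bits • Total counts pre-computed in L[] • With $q = \lfloor x / \log^2 N \rfloor$: $\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i] = \sum_{i=1}^{q \log^2 N} B[i] + \sum_{i=q \log^2 N + 1}^{x} B[i]$, where the first sum is the lookup $L[q]$, costing $O(1)$, and the second sum scans the remaining bits, costing $O(\log^2 N)$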
- 16. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space • B[] = a sequence of N bits, L[] = l1, l2, ... (one entry per block of $\log^2 N$ bits) • L[]: o(N) space cost: $\frac{N}{\log^2 N} \cdot \log N = O\!\left(\frac{N}{\log N}\right) = o(N)$
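The single-level scheme of the preceding slides (one pre-computed count per block of about log² N bits, then a short scan) can be sketched as follows. This is an illustrative sketch only: the names `build_large_counts` and `rank1` are hypothetical, positions are 0-based with an inclusive upper end (the slide's sums are 1-based), and the block length is rounded to an integer.

```python
import math


def build_large_counts(bits):
    """Return (block_len, L) where L[k] is the number of 1s in the first k blocks."""
    n = len(bits)
    block_len = max(1, int(math.log2(max(n, 2))) ** 2)  # roughly log^2 N bits
    L = [0]
    for start in range(0, n, block_len):
        L.append(L[-1] + sum(bits[start:start + block_len]))
    return block_len, L


def rank1(bits, block_len, L, x):
    """Count 1s in bits[0..x] (inclusive): one table lookup plus a scan of one block."""
    q = (x + 1) // block_len  # number of whole blocks fully covered by the query
    return L[q] + sum(bits[q * block_len:x + 1])
```

The scan in `rank1` touches at most one block, which is the O(log² N) term that the second level of the index removes.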
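- 17. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space • B[] = a sequence of N bits, L[] = l1, l2, ... (one entry per block of $\log^2 N$ bits), S[] = s1, s2, ... (one entry per block of $\frac{1}{2}\log N$ bits) • Split again into fixed-length blocks of $\frac{1}{2}\log N$ bits • Total counts pre-computed in S[] • With $q = \lfloor x / \log^2 N \rfloor$ and $r = \lfloor x / \tfrac{1}{2}\log N \rfloor$: $\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i] = \sum_{i=1}^{q \log^2 N} B[i] + \sum_{i=q \log^2 N + 1}^{r \cdot \frac{1}{2}\log N} B[i] + \sum_{i=r \cdot \frac{1}{2}\log N + 1}^{x} B[i]$, where the first sum is $L[q]$ and the second sum is $S[r]$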
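- 18. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space • B[] = a sequence of N bits, L[] = l1, l2, ... (one entry per block of $\log^2 N$ bits), S[] = s1, s2, ... (one entry per block of $\frac{1}{2}\log N$ bits) • Split again into fixed-length blocks of $\frac{1}{2}\log N$ bits • Total counts pre-computed in S[] • With $q = \lfloor x / \log^2 N \rfloor$ and $r = \lfloor x / \tfrac{1}{2}\log N \rfloor$: $\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i] = \sum_{i=1}^{q \log^2 N} B[i] + \sum_{i=q \log^2 N + 1}^{r \cdot \frac{1}{2}\log N} B[i] + \sum_{i=r \cdot \frac{1}{2}\log N + 1}^{x} B[i]$, where the first sum is $L[q]$, costing $O(1)$, the second sum is $S[r]$, costing $O(1)$, and the last sum costs $O(\log N)$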
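- 19. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space • B[] = a sequence of N bits, L[] = l1, l2, ... (one entry per block of $\log^2 N$ bits), S[] = s1, s2, ... (one entry per block of $\frac{1}{2}\log N$ bits) • S[]: o(N) space cost: $\frac{N}{\frac{1}{2}\log N} \cdot \log(\log^2 N) = O\!\left(\frac{N \log\log N}{\log N}\right) = o(N)$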
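- 20. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space • B[] = a sequence of N bits, L[] = l1, l2, ... (one entry per block of $\log^2 N$ bits), S[] = s1, s2, ... (one entry per block of $\frac{1}{2}\log N$ bits) • O(1) popcount/table lookup for the last term: $O(\log N) \rightarrow O(1)$ • With $q = \lfloor x / \log^2 N \rfloor$ and $r = \lfloor x / \tfrac{1}{2}\log N \rfloor$: $\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i] = \sum_{i=1}^{q \log^2 N} B[i] + \sum_{i=q \log^2 N + 1}^{r \cdot \frac{1}{2}\log N} B[i] + \sum_{i=r \cdot \frac{1}{2}\log N + 1}^{x} B[i]$, where the first sum is $L[q]$, costing $O(1)$, and the second sum is $S[r]$, costing $O(1)$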
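The table-lookup option for the last term can be illustrated by pre-computing the number of 1s for every bit pattern of a small fixed width. This is a minimal sketch under assumptions: the names `build_popcount_table` and `rank_in_block` are hypothetical, the width of 8 bits in the usage lines is arbitrary, and bit 0 of the integer is treated as the first position of the block. On x86, a hardware POPCNT instruction can take the place of the table.

```python
def build_popcount_table(width):
    """table[p] = number of 1 bits in the width-bit pattern p."""
    table = [0] * (1 << width)
    for p in range(1, 1 << width):
        # Reuse the count of p without its lowest bit, then add that bit back.
        table[p] = table[p >> 1] + (p & 1)
    return table


def rank_in_block(word, offset, table):
    """Count 1s among bits 0..offset of word (inclusive) with one table lookup."""
    mask = (1 << (offset + 1)) - 1
    return table[word & mask]


table = build_popcount_table(8)
print(rank_in_block(0b10110101, 3, table))  # 2
```

The table has 2^width entries, so the width has to stay around half of log N for the table itself to remain o(N).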
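- 21. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space • B[] = a sequence of N bits, L[] = l1, l2, ... (one entry per block of $\log^2 N$ bits), S[] = s1, s2, ... (one entry per block of $\frac{1}{2}\log N$ bits) • As a result, o(N) space cost: $\underbrace{\frac{N}{\log N}}_{L[]\ \text{size}} + \underbrace{\frac{4N \log\log N}{\log N}}_{S[]\ \text{size}} = O\!\left(\frac{N \log\log N}{\log N}\right) = o(N)$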
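To get a feel for the space bound, the two terms can be evaluated numerically. The sketch below simply plugs values of N into the expressions N / log N (for L[]) and 4 N log log N / log N (for S[]) with base-2 logarithms; the function name `overhead_ratio` is hypothetical and constant factors hidden by the O-notation are not modelled.

```python
import math


def overhead_ratio(n):
    """Index size in bits divided by N, using the two terms from the slide."""
    lg = math.log2(n)
    large = n / lg                      # (N / log^2 N) entries of log N bits each
    small = 4 * n * math.log2(lg) / lg  # (N / (0.5 log N)) entries of 2 log log N bits each
    return (large + small) / n


for exp in (16, 32, 64):
    print(exp, overhead_ratio(2 ** exp))
```

Under these formulas the ratio is about 1.06 for N = 2^16, 0.66 for N = 2^32, and 0.39 for N = 2^64: it does tend to zero, but slowly, which is one reason practical implementations pick fixed block sizes instead.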
- 22. Implementation: Method of Four Russians • Rule: O(1) operation cost with o(N) space
- 23. Implementation: Practice • Low computation cost & high cache penalties – 3 cache/TLB misses per rank • e.g., rank1(402 = 256*1 + 32*4 + 18, B) – B[]: 01..000000....101......0 0110....001...............0 0000100 ... (256-bit large blocks, 32-bit small blocks; popcount the remaining left bits) – L[]: 18 21 … – S[]: 1 3 4 6 7 9 10 13 2 5 7 9 12 13 18 19 1 3 7 …
- 24. Implementation: Practice • Low computation cost & high cache penalties – 3 cache/TLB misses per rank • e.g., rank1(402 = 256*1 + 32*4 + 18, B) – B[]: 01..000000....101......0 0110....001...............0 0000100 ... (256-bit large blocks, 32-bit small blocks; popcount the remaining left bits) (Miss!) – L[]: 18 21 … (Miss!) – S[]: 1 3 4 6 7 9 10 13 2 5 7 9 12 13 18 19 1 3 7 … (Miss!)
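The three separate memory accesses behind the "3 misses" can be made visible with a sketch that keeps B[], L[], and S[] in three separate arrays, using the 256-bit and 32-bit block sizes from the example. The names `build` and `rank1` are hypothetical, and two details are assumptions of this sketch: S[] stores the count of 1s before each 32-bit word inside its large block, and rank includes position x itself.

```python
LARGE = 256  # bits per large block
SMALL = 32   # bits per small block (one 32-bit word)


def build(words):
    """words: 32-bit ints; bit 0 of words[0] is position 0 of the sequence."""
    L, S = [], []
    total = in_block = 0
    for idx, w in enumerate(words):
        if idx % (LARGE // SMALL) == 0:
            L.append(total)  # 1s before this large block
            in_block = 0
        S.append(in_block)   # 1s before this word inside its large block
        ones = bin(w).count("1")
        total += ones
        in_block += ones
    return L, S


def rank1(words, L, S, x):
    """Count 1s in positions 0..x (inclusive); reads L, S, and words once each."""
    w_idx, offset = divmod(x, SMALL)
    mask = (1 << (offset + 1)) - 1
    return L[x // LARGE] + S[w_idx] + bin(words[w_idx] & mask).count("1")


# Decomposition used on the slide: 402 = 256*1 + 32*4 + 18
print(402 // LARGE, (402 % LARGE) // SMALL, 402 % SMALL)  # 1 4 18
```

Because `L`, `S`, and `words` live in different regions of memory, a random query can touch three different cache lines.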
- 25. Implementation: Practice • Packing the required data into a single cache line [Figure: a 56B chunk inside a 64B cache line; labels: 4B, 1B ・・・, 32B (0110....001..........0), 12B padding, padding]
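The idea of co-locating the counts with the bits they describe can be sketched with Python's `struct` module. The layout below is an assumption made for illustration, not the layout of the talk's code: each 64-byte record holds one 4-byte large count, eight 1-byte small counts, 32 bytes of bits, and 20 bytes of padding, which does not reproduce the slide's 56B chunk and 12B padding. The names `CHUNK`, `pack_chunks`, and `rank1` are hypothetical.

```python
import struct

# Assumed layout: 4B large count + 8 x 1B small counts + 8 x 4B bit words + 20B padding.
CHUNK = struct.Struct("<I8B8I20x")


def pack_chunks(words):
    """Pack 32-bit words, eight per record, into consecutive 64-byte records."""
    out = bytearray()
    total = 0
    for start in range(0, len(words), 8):
        group = list(words[start:start + 8])
        group += [0] * (8 - len(group))  # zero-fill the last record
        small, run = [], 0
        for w in group:
            small.append(run)            # 1s before this word inside the record
            run += bin(w).count("1")
        out += CHUNK.pack(total, *small, *group)
        total += run
    return bytes(out)


def rank1(buf, x):
    """Count 1s in positions 0..x (inclusive) by reading a single 64-byte record."""
    chunk_idx, r = divmod(x, 256)
    fields = CHUNK.unpack_from(buf, chunk_idx * CHUNK.size)
    large, small, group = fields[0], fields[1:9], fields[9:17]
    w, offset = divmod(r, 32)
    mask = (1 << (offset + 1)) - 1
    return large + small[w] + bin(group[w] & mask).count("1")
```

A small count never exceeds 224 (seven full words), so one byte per entry is enough; every query reads exactly one record, which is the point of the packing.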
- 26. Implementation: Practice • Packing the required data into a single cache line
- 28. Implementation: Practice • By the way, where is select? – Omitted due to time limits – Please see the code ... • Two implementation approaches – O(log N) complexity • ux-trie, rx, and marisa-trie • Binary searches with rank • Suffers many cache/TLB misses – O(1) complexity • My implementation, designed to minimize these penalties • 1 rank, 1 SIMD comparison, and O(1) bsf • Only 2 cache/TLB misses • Not implemented yet ...
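The O(log N) approach, select by binary search over rank, can be sketched as follows. This is a generic illustration rather than the code of ux-trie, rx, or marisa-trie; the names `make_rank1` and `select1` are hypothetical, rank is answered from a plain prefix-sum array for brevity, and the n-th occurrence is counted from 1.

```python
from itertools import accumulate


def make_rank1(bits):
    """Return a function giving the number of 1s in bits[0..x] (inclusive)."""
    prefix = [0] + list(accumulate(bits))
    return lambda x: prefix[x + 1]


def select1(rank1, length, n):
    """Return the smallest position p with rank1(p) >= n, i.e. the n-th 1."""
    if n < 1 or rank1(length - 1) < n:
        raise ValueError("fewer than n ones")
    lo, hi = 0, length - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if rank1(mid) >= n:
            hi = mid      # the n-th 1 is at mid or to its left
        else:
            lo = mid + 1  # the n-th 1 is strictly to the right
    return lo


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]
print(select1(make_rank1(B), len(B), 4))  # 6
```

Each probe of the binary search is an independent rank query at a distant position, which is where the many cache/TLB misses mentioned on the slide come from.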