An x86-optimized rank & select dictionary for bit sequences

x86/x64 Optimization Study Group #4 (x86/x64最適化勉強会#4)
An x86-optimized rank/select dictionary for bit sequences
2012/6/16
Takeshi Yamamuro

What’s a Succinct Data Structure?

SDS: Succinct Data Structure
• Recently Getting Popular in Some Areas
– Research & Engineering

• Not a Data Structure, But a Data Representation
– A compression method for other data structures
– e.g., alphabets, trees, and graphs

• Transparent Operations w/o Explicit Unpacking
– e.g., succinct LZ77 compression*1

*1 Kreft, S. and Navarro, G.: LZ77-Like Compression with Fast Random Access, In Proceedings of DCC, 2010

More Details
• SDS = Succinct Data + Succinct Index

• Succinct Data
– Compact representation for target data
– Close to the information-theoretic lower bound
e.g., if there are N possible patterns, the lower bound is log N bits

• Succinct Index
– O(1) operations for target data
– o(N) space costs: ignored asymptotically


More Details

If you need more information, ...

cited from: http://goo.gl/rkQ5z

A rank/select dictionary for SDS


Rank/Select Operations
• SDS Composed of Rank/Select Operations
– Many calls of rank/select inside

• Rank/Select for Succinct Bit Sequences: B[i]
– rank_x(n, B): the number of x's in B[0...n]
– select_x(n, B): the position of the n-th x in B[]

i     0 1 2 3 4 5 6 7 8
B[i]  1 0 1 1 0 0 1 1 0

rank_1(5, B) = 3, select_1(4, B) = 6
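
The two definitions can be checked against the example with a naive linear scan. This is only a reference sketch, not the optimized structure discussed later. It assumes that rank counts over the inclusive range B[0...n] and that select counts occurrences from 1, which is consistent with the numbers on the slide; the function names are illustrative.

```python
def rank(bits, n, x=1):
    """Count occurrences of x in bits[0..n], inclusive."""
    return sum(1 for b in bits[: n + 1] if b == x)


def select(bits, n, x=1):
    """Return the position of the n-th x (counting from 1), or -1 if absent."""
    seen = 0
    for i, b in enumerate(bits):
        if b == x:
            seen += 1
            if seen == n:
                return i
    return -1


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]
print(rank(B, 5))    # 3
print(select(B, 4))  # 6
```

Both functions take O(N) time per query; the rest of the talk is about answering the same queries in O(1) with o(N) extra space.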


Rank/Select Operations
• Available Rank/Select Implementations
– ux-trie: http://code.google.com/p/ux-trie/
– rx: http://code.google.com/p/mozc/
– marisa-trie: http://code.google.com/p/marisa-trie/

• Today’s Contribution
– x86-optimized rank/select
– https://github.com/maropu/dbitv


Performance Results
• Benchmark Setup*1
– Generate a random sequence of bits: 50% density
– Random rank/select queries over the bits
– CPU: Intel Core-i5 U470@1.33GHz

• Latency Observed
– Median latency over 11 trials

*1 Reference: http://d.hatena.ne.jp/s-yata/20111216/1324032373
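
A measurement loop in the spirit of this setup might look like the sketch below. It is not the benchmark used for the slides: the helper names, the number of queries, and the naive rank used as the structure under test are all assumptions made for illustration.

```python
import random
import statistics
import time


def make_bits(n, density=0.5, seed=0):
    """Generate n random bits, each set with probability `density`."""
    rng = random.Random(seed)
    return [1 if rng.random() < density else 0 for _ in range(n)]


def naive_rank1(bits, n):
    # Stand-in for the structure under test; counts 1s in bits[0..n].
    return sum(bits[: n + 1])


def median_latency_ns(rank_fn, bits, queries=1000, trials=11, seed=1):
    """Average per-query latency of each trial, then take the median over trials."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(bits)) for _ in range(queries)]
    per_trial = []
    for _ in range(trials):
        start = time.perf_counter_ns()
        for p in positions:
            rank_fn(bits, p)
        per_trial.append((time.perf_counter_ns() - start) / queries)
    return statistics.median(per_trial)


bits = make_bits(1 << 12)
print(median_latency_ns(naive_rank1, bits))
```

Taking the median over an odd number of trials keeps a single slow run (for example, one disturbed by another process) from skewing the reported latency.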

Performance Results: Rank

[Figure: averaged rank latency (ns, log scale from 1.E+00 to 1.E+03) versus bit length for ux, rx, marisa, and opt]

Performance Results: Select

[Figure: averaged select latency (ns, log scale from 1.E+00 to 1.E+04) versus bit length for ux, rx, marisa, and opt]

Implementation Details


Implementation: Method of Four Russians
• Rule: O(1) operation costs with o(N) space

B[] = a sequence of N bits


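Block size: log² N bits
L[] = l1, l2, ...

• Split B[] into fixed-length blocks of log² N bits
• Total Counts Pre-computed in L[]

$$\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i] = \underbrace{\sum_{i=1}^{\lfloor x/\log^2 N \rfloor} B[i]}_{L[\lfloor x/\log^2 N \rfloor]:\ O(1)} + \underbrace{\sum_{i=\lfloor x/\log^2 N \rfloor + 1}^{x} B[i]}_{O(\log^2 N)}$$

In the summation limits, ⌊x / log² N⌋ is shorthand for the last bit position covered by the complete blocks before x.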
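
The one-level scheme can be sketched as follows. This is an illustrative reading of the slide rather than the author's code: it assumes 0-based positions, an inclusive rank, at least one bit in the input, and a block size of roughly log² N rounded to an integer. The names `build_large` and `rank1` are hypothetical.

```python
import math


def build_large(bits):
    """Return (block size, L) where L[k] counts the 1s before large block k."""
    n = len(bits)
    block = max(1, int(math.log2(n)) ** 2)  # about log^2 N bits per block
    L = [0]
    for start in range(0, n, block):
        L.append(L[-1] + sum(bits[start:start + block]))
    return block, L


def rank1(bits, block, L, x):
    """Count the 1s in bits[0..x], inclusive."""
    k = (x + 1) // block                        # complete blocks before the end
    return L[k] + sum(bits[k * block : x + 1])  # O(1) lookup + O(log^2 N) scan


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]
block, L = build_large(B)
print(rank1(B, block, L, 5))  # 3
```

The lookup in `L` is constant time, but the remaining scan still touches up to log² N bits, which is what the second level of blocks removes.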

• L[]: o(N) space costs

$$\frac{N}{\log^2 N} \cdot \log N = O\!\left(\frac{N}{\log N}\right) = o(N)$$


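Large block size: log² N bits; small block size: ½ log N bits
L[] = l1, l2, ...
S[] = s1, s2, ...

• Split each block again into fixed-length blocks of ½ log N bits
• Total Counts Pre-computed in S[]

$$\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i] = \underbrace{\sum_{i=1}^{\lfloor x/\log^2 N \rfloor} B[i]}_{L[\lfloor x/\log^2 N \rfloor]:\ O(1)} + \underbrace{\sum_{i=\lfloor x/\log^2 N \rfloor + 1}^{\lfloor x/(\frac{1}{2}\log N) \rfloor} B[i]}_{S[\lfloor x/(\frac{1}{2}\log N) \rfloor]:\ O(1)} + \underbrace{\sum_{i=\lfloor x/(\frac{1}{2}\log N) \rfloor + 1}^{x} B[i]}_{O(\log N)}$$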

• S[]: o(N) space costs

$$\frac{N}{\frac{1}{2}\log N} \cdot \log(\log^2 N) = O\!\left(\frac{N \log\log N}{\log N}\right) = o(N)$$

• O(1) Popcount/Table-Lookup in Last Term

$$\mathrm{rank}_1(x, B) = \sum_{i=1}^{x} B[i] = \underbrace{\sum_{i=1}^{\lfloor x/\log^2 N \rfloor} B[i]}_{L[\lfloor x/\log^2 N \rfloor]:\ O(1)} + \underbrace{\sum_{i=\lfloor x/\log^2 N \rfloor + 1}^{\lfloor x/(\frac{1}{2}\log N) \rfloor} B[i]}_{S[\lfloor x/(\frac{1}{2}\log N) \rfloor]:\ O(1)} + \underbrace{\sum_{i=\lfloor x/(\frac{1}{2}\log N) \rfloor + 1}^{x} B[i]}_{O(\log N) \to O(1)}$$
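
Putting the three terms together, a two-level rank could be sketched as below. The code is an assumption-laden illustration, not the talk's implementation: positions are 0-based and inclusive, the input is non-empty, the block sizes are integer approximations of ½ log N and log² N, and `bin(...).count("1")` stands in for a hardware popcount or a lookup table. `build` and `rank1` are hypothetical names.

```python
import math


def build(bits):
    """Build L (per large block) and S (per small block, relative to its large block)."""
    log_n = max(2, int(math.log2(len(bits))))
    small = log_n // 2                 # about (1/2) log N bits
    large = small * 2 * log_n          # about log^2 N bits, a multiple of small
    words, L, S = [], [], []
    total = in_large = 0
    for start in range(0, len(bits), small):
        if start % large == 0:
            L.append(total)            # 1s before this large block
            in_large = 0
        S.append(in_large)             # 1s before this small block, within the large block
        chunk = bits[start:start + small]
        word = 0
        for j, b in enumerate(chunk):
            word |= b << j             # bit j of the word is bits[start + j]
        words.append(word)
        ones = sum(chunk)
        total += ones
        in_large += ones
    return small, large, words, L, S


def rank1(index, x):
    """Count the 1s in positions 0..x, inclusive, with three O(1) steps."""
    small, large, words, L, S = index
    s = x // small
    mask = (1 << (x % small + 1)) - 1        # keep bits 0..(x mod small) of the word
    last = bin(words[s] & mask).count("1")   # popcount stand-in for the last term
    return L[x // large] + S[s] + last


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]
index = build(B)
print(rank1(index, 5))  # 3
```

Because each small block fits in one machine word, the last term reduces to a mask and a popcount instead of a bit-by-bit scan.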

• As a result, o(N) Space Costs

$$\underbrace{\frac{N}{\log N}}_{L[]\ \text{size}} + \underbrace{\frac{4N\log\log N}{\log N}}_{S[]\ \text{size}} = O\!\left(\frac{N\log\log N}{\log N}\right) = o(N)$$
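
To get a feeling for the constants behind these bounds, the two sizes can be evaluated for a concrete N. The sketch below simply plugs N = 2^32 into the formulas above; the function name and the choice of N are illustrative assumptions.

```python
def index_sizes(log_n):
    """Return (N, L bits, S bits) for N = 2**log_n, following the block sizes above."""
    n = 1 << log_n
    large = log_n * log_n               # log^2 N bits per large block
    small = log_n // 2                  # (1/2) log N bits per small block
    l_bits = (n // large) * log_n       # one log N-bit counter per large block
    s_entry = (large - 1).bit_length()  # bits for a count below log^2 N (about 2 log log N)
    s_bits = (n // small) * s_entry
    return n, l_bits, s_bits


n, l_bits, s_bits = index_sizes(32)
print(l_bits / n, s_bits / n)  # 0.03125 0.625
```

Under these assumptions L[] adds about 3% and S[] about 62.5% on top of the bit sequence, so the index is asymptotically negligible but far from free at this size.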



Implementation: Practice

• Low Computation Costs & High Cache Penalties
– 3 cache/TLB misses per rank

ex. rank_1(402 = 256*1 + 32*4 + 18, B)

B[] (256-bit large blocks, 32-bit small blocks): 01..000000....101......0 0110....001...............0 0000100 ...
– Miss! Popcount the bits to the left of the target position inside its 32-bit block
L[]: 18 21 …
– Miss!
S[]: 1 3 4 6 7 9 10 13 2 5 7 9 12 13 18 19 1 3 7 …
– Miss!
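
The decomposition in the example can be reproduced with a few integer operations. The sketch assumes the 256-bit and 32-bit block sizes shown on the slide and eight small blocks per large block; `locate` is a hypothetical helper name.

```python
LARGE, SMALL = 256, 32


def locate(x):
    """Split position x into (large block, small block inside it, remaining bits)."""
    return x // LARGE, (x % LARGE) // SMALL, x % SMALL


large_idx, small_idx, rest = locate(402)
print(large_idx, small_idx, rest)                  # 1 4 18
print(large_idx * (LARGE // SMALL) + small_idx)    # 12, the index into S[]
```

With L[], S[], and B[] stored as three separate arrays, these three indices point into three unrelated memory regions, which is where the three misses come from.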

• Packing the required data into a single cacheline

[Figure: a 64B cache line holding a 56B chunk plus padding; the chunk fields are labeled 4B, 1B ・・・, 32B (the bits 0110....001..........0), and 12B padding]
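
The idea of keeping the large count, the small counts, and the bits of one block side by side can be sketched with Python's `struct` module. The exact field layout below is an assumption, not the layout of the actual implementation: a 4-byte count of 1s before the chunk, eight 1-byte cumulative counts for the 32-bit sub-blocks, 32 bytes of bits, and padding up to 64 bytes. Python gives no control over cache-line alignment, so this only illustrates the layout.

```python
import struct

# Assumed layout: 4-byte count of 1s before the chunk, eight 1-byte cumulative
# counts for the 32-bit sub-blocks, 32 bytes of bits, then padding to 64 bytes.
CHUNK = struct.Struct("<I8B32s20x")


def pack_chunks(bits):
    """Pack a 0/1 list into 64-byte chunks, each covering 256 bits."""
    out = bytearray()
    total = 0
    for start in range(0, len(bits), 256):
        block = bits[start:start + 256]
        block = block + [0] * (256 - len(block))         # zero-fill the final block
        raw = bytearray(32)
        for j, b in enumerate(block):
            raw[j // 8] |= b << (j % 8)                  # bit j lives in byte j // 8
        small = [sum(block[:32 * k]) for k in range(8)]  # 1s before sub-block k
        out += CHUNK.pack(total, *small, bytes(raw))
        total += sum(block)
    return bytes(out)


def rank1(packed, x):
    """Count the 1s in positions 0..x, inclusive, reading one 64-byte chunk."""
    fields = CHUNK.unpack_from(packed, (x // 256) * CHUNK.size)
    base, small, raw = fields[0], fields[1:9], fields[9]
    k = (x % 256) // 32
    word = int.from_bytes(raw[4 * k:4 * k + 4], "little")
    mask = (1 << (x % 32 + 1)) - 1
    return base + small[k] + bin(word & mask).count("1")


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]
print(rank1(pack_chunks(B), 5))  # 3
```

Every value needed for one rank query now sits in the same 64-byte record, so a query touches one region of memory instead of three.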



• BTW, where is select?
– Omitted because of my time limit
– Please see the code ...

• Two Ways to Implement It
– O(logN) complexity
• ux-trie, rx, and marisa-trie
• Binary searches with rank
• Suffers many cache/TLB misses

– O(1) complexity
• My implementation to minimize these penalties
• 1 rank, 1 SIMD comparison, and an O(1) bsf
• Only 2 cache/TLB misses
• Not implemented yet ...
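
The O(logN) approach, a binary search driven by rank, can be sketched as below. This is a generic illustration and not code from ux-trie, rx, or marisa-trie; it assumes an inclusive rank over 0-based positions and a select that counts from 1, and the function name is hypothetical.

```python
def select1_by_rank(rank1, n_bits, n):
    """Smallest position p with rank1(p) >= n, or -1 if fewer than n 1s exist."""
    if n < 1 or n_bits == 0 or rank1(n_bits - 1) < n:
        return -1
    lo, hi = 0, n_bits - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if rank1(mid) >= n:
            hi = mid          # the n-th 1 is at mid or to its left
        else:
            lo = mid + 1      # the n-th 1 is strictly to the right
    return lo


B = [1, 0, 1, 1, 0, 0, 1, 1, 0]
print(select1_by_rank(lambda p: sum(B[:p + 1]), len(B), 4))  # 6
```

Each step of the search calls rank at a distant position, which is why this approach pays a cache/TLB miss on most iterations.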
