X86opti 05 s5yata

  1. Remove Branches in BitVector Select Operations - marisa 0.2.2 -
     Susumu Yata (@s5yata), Brazil, Inc.
     30 March 2013
  2. Who I Am
     Job
       Brazil, Inc. (groonga developer)
       We need R&D software engineers.
     Personal research & development
       Tries: darts-clone, marisa-trie, etc.
       Corpus: Nihongo Web Corpus 2010 (NWC 2010)
  3. BitVector and Marisa
     Relationships between BitVector and Marisa.
  4. BitVector
     What's BitVector?
       A sequence of bits
     Operations
       BitVector::get(i)
       BitVector::rank(i)
       BitVector::select(i)
  5. BitVector – Get Operations
     Interface
       BitVector::get(i)
     Description
       The i-th bit (“0” or “1”)

       index:  0  1  2  …  i-1  i  i+1  …  n-2  n-1
       bit:    0  0  1  …  0    1  1    …  0    0
       Get! (the bit at index i)
  6. BitVector – Rank Operations
     Interface
       BitVector::rank(i)
     Description
       The number of “1”s up to the i-th bit

       index:  0  1  2  …  i-1  i  i+1  …  n-2  n-1
       bit:    0  0  1  …  0    1  1    …  0    0
       How many “1”s? (counted from index 0 up to index i)
  7. BitVector – Select Operations
     Interface
       BitVector::select(i)
     Description
       The position of the i-th “1”

       index:  0  1  2  …  n-2  n-1
       bit:    0  0  1  …  0    0
       Where is the i-th “1”?
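The three operations on slides 5 to 7 can be pinned down with a deliberately naive reference version. This is only a sketch for checking faster implementations against: `NaiveBitVector` is a hypothetical name, not marisa's class, and two conventions are assumptions because the slides do not fix them: `rank(i)` counts the “1”s strictly before index i, and `select(i)` counts from 0.

```cpp
#include <stddef.h>
#include <vector>

// Hypothetical reference type for illustration; not marisa's BitVector.
struct NaiveBitVector {
  std::vector<bool> bits;

  // The i-th bit.
  bool get(size_t i) const { return bits[i]; }

  // Number of 1s in bits[0 .. i-1] (assumed exclusive convention).
  size_t rank(size_t i) const {
    size_t count = 0;
    for (size_t k = 0; k < i; ++k) {
      if (bits[k]) ++count;
    }
    return count;
  }

  // Position of the i-th 1, counting from 0 (assumed convention).
  // Returns bits.size() if there are not enough 1s.
  size_t select(size_t i) const {
    for (size_t k = 0; k < bits.size(); ++k) {
      if (!bits[k]) continue;
      if (i == 0) return k;
      --i;
    }
    return bits.size();
  }
};
```

With these conventions the two operations are inverses on the set bits: `rank(select(k)) == k` whenever the k-th “1” exists. Both loops are O(n), which is what the index structures later in the deck avoid.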
  8. Marisa
     Who's Marisa?
       An ordinary human magician
     What's Marisa?
       A static and space-efficient dictionary
     Data structure
       Recursive LOUDS-based Patricia tries
     Site
       http://code.google.com/p/marisa-trie
  9. Marisa – Patricia
     Patricia is a labeled tree.
       Keys = Tree + Labels

     Tree (node numbers, children indented under their parent):
       0
         1
           4
           5
         2
         3
           6
           7

     Node  Label
     1     “Ar”
     2     “Brazil”
     3     'C'
     4     “gentina”
     5     “menia”
     6     “anada”
     7     “yprus”

     ID  Key
     0   “Argentina”
     1   “Armenia”
     2   “Brazil”
     3   “Canada”
     4   “Cyprus”
  10. Marisa – Recursiveness
      Unfortunately, this margin is too small…
        Keys   = Tree + Labels
        Labels = Tree + Labels
        Labels = Tree + Labels   <- Reasonable
        Labels = Tree + Labels
        Labels = Tree + Labels
        Labels = Tree + Labels
        Labels = Tree + Labels
        …
  11. Marisa – BitVector Usage
      LOUDS
        Level-Order Unary Degree Sequence
      Terminal flags
        A node is terminal (“1”) or not (“0”).
      Link flags
        A node has a link to its multi-byte label (“1”) or has a built-in single-byte label (“0”).
  12. Marisa – BitVector Usage
      LOUDS
        BitVector::get(), select()
      Terminal flags
        BitVector::get(), rank(), select()
      Link flags
        BitVector::get(), rank()
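To make "Level-Order Unary Degree Sequence" concrete, here is a minimal sketch that encodes a tree as a LOUDS bit string: each node, visited in level order, contributes one “1” per child followed by a “0”. The function name `louds_bits`, the child-list input format, and the leading "10" super-root are assumptions for illustration; marisa's exact bit convention may differ.

```cpp
#include <string>
#include <vector>

// children[v] lists the children of node v. Node numbers are assumed to be
// assigned in level order with the root as node 0, so iterating v = 0, 1, 2, ...
// visits the nodes in level order.
std::string louds_bits(const std::vector<std::vector<int> >& children) {
  std::string bits = "10";  // assumed super-root that has the root as its only child
  for (const std::vector<int>& c : children) {
    bits.append(c.size(), '1');  // unary degree: one "1" per child
    bits += '0';                 // terminator
  }
  return bits;
}
```

For the tree on slide 9 (node 0 has children 1, 2, 3; node 1 has 4, 5; node 3 has 6, 7), the child lists are `{{1,2,3},{4,5},{},{6,7},{},{},{},{}}` and the result is `10111011001100000`: 17 bits for 8 nodes. Tree navigation then reduces to get/select queries on this string, which is why select speed matters for lookups.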
  13. Implementations
      How to implement Rank/Select operations.
  14. Rank Dictionary
      Index structures
        r_idx[x].abs = rank(512・x)                               x = 0, 1, 2, …
        r_idx[x].rel[y] = rank(512・x + 64・y) – rank(512・x)     y = 1, 2, 3, …, 7
      Calculation
        abs + rel + popcnt()
  15. Rank Operations
      Time complexity = O(1)
      [Figure: the bit vector is divided into 512-bit blocks, each covered by r_idx.abs; every block is divided into eight 64-bit words, covered by r_idx.rel; the last 64-bit word is counted with popcnt().]
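The abs + rel + popcnt() calculation can be sketched as follows. This is an illustration under stated assumptions, not marisa's code: `RankDict` and `RankBlock` are hypothetical names, `rel` has eight entries with `rel[0] = 0` so that no special case is needed for y = 0, `rank(i)` counts the “1”s strictly before bit i, and `__builtin_popcountll` (a GCC/Clang builtin) stands in for popcnt().

```cpp
#include <stddef.h>
#include <stdint.h>
#include <utility>
#include <vector>

struct RankBlock {
  uint64_t abs;     // rank(512 * x)
  uint16_t rel[8];  // rel[y] = rank(512 * x + 64 * y) - rank(512 * x); rel[0] = 0
};

struct RankDict {
  std::vector<uint64_t> words;   // the bits, 64 per word
  std::vector<RankBlock> r_idx;  // one entry per 512-bit block (8 words)

  explicit RankDict(std::vector<uint64_t> bits) : words(std::move(bits)) {
    uint64_t total = 0;  // number of 1s seen so far
    for (size_t w = 0; w < words.size(); ++w) {
      if (w % 8 == 0) r_idx.push_back(RankBlock{total, {}});
      RankBlock& block = r_idx.back();
      block.rel[w % 8] = static_cast<uint16_t>(total - block.abs);
      total += __builtin_popcountll(words[w]);
    }
  }

  // Number of 1s in bits [0, i). Requires i < 64 * words.size().
  uint64_t rank(uint64_t i) const {
    const RankBlock& block = r_idx[i / 512];
    // Keep only the bits below position i % 64 in the final word.
    const uint64_t low = words[i / 64] & ((1ULL << (i % 64)) - 1);
    return block.abs + block.rel[(i / 64) % 8] + __builtin_popcountll(low);
  }
};
```

A query touches one index entry and one word regardless of the vector length, which is where the O(1) bound on slide 15 comes from.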
  16. Select Dictionary
      Index structure
        s_idx[x] = select(512・x)     x = 0, 1, 2, …
      Calculation
        Limit the range by using s_idx.
        Limit the range by using r_idx[x].abs.
        Limit the range by using r_idx[x].rel[y].
        Find the i-th “1” in the range.
  17. Select Operations
      [Figure: two neighboring s_idx entries bound a run of 512-bit blocks; r_idx.abs narrows the run to one 512-bit block; r_idx.rel narrows that block to one 64-bit word; the final round finds the bit inside that word.]
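The four narrowing steps on slide 16 can be sketched as below. Several simplifications are assumptions of this sketch rather than marisa's design: `SelectDict` is a hypothetical name, `s_idx[k]` stores the 512-bit block that holds the (512・k)-th “1” instead of its bit position, the `rel` step is replaced by a walk over the eight words of the block, and the final round uses a plain loop instead of the branch-free code discussed later. `__builtin_popcountll` and `__builtin_ctzll` are GCC/Clang builtins.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <utility>
#include <vector>

struct SelectDict {
  std::vector<uint64_t> words;  // the bits, 64 per word
  std::vector<uint64_t> abs;    // abs[x] = number of 1s before bit 512 * x
  std::vector<size_t> s_idx;    // s_idx[k] = block holding the (512 * k)-th 1
  uint64_t num_ones = 0;

  explicit SelectDict(std::vector<uint64_t> bits) : words(std::move(bits)) {
    for (size_t w = 0; w < words.size(); ++w) {
      if (w % 8 == 0) abs.push_back(num_ones);
      const uint64_t count = __builtin_popcountll(words[w]);
      // Sample the block whenever this word holds the next (512 * k)-th 1.
      if (512 * s_idx.size() < num_ones + count) s_idx.push_back(w / 8);
      num_ones += count;
    }
  }

  // Position of the i-th 1, counting from 0. Requires i < num_ones.
  uint64_t select(uint64_t i) const {
    // 1. Limit the range of 512-bit blocks by using s_idx.
    size_t lo = s_idx[i / 512];
    size_t hi = (i / 512 + 1 < s_idx.size()) ? s_idx[i / 512 + 1] : abs.size() - 1;
    // 2. Binary search abs for the last block with at most i 1s before it.
    while (lo < hi) {
      const size_t mid = lo + (hi - lo + 1) / 2;
      if (abs[mid] <= i) lo = mid; else hi = mid - 1;
    }
    // 3. Walk the 64-bit words of that block (the slides use r_idx.rel here).
    uint64_t remaining = i - abs[lo];
    size_t w = lo * 8;
    for (;;) {
      const uint64_t count = __builtin_popcountll(words[w]);
      if (remaining < count) break;
      remaining -= count;
      ++w;
    }
    // 4. Final round: clear the lowest `remaining` 1s, then locate the next one.
    uint64_t word = words[w];
    for (; remaining > 0; --remaining) word &= word - 1;
    return w * 64 + __builtin_ctzll(word);
  }
};
```

Step 4 is the "final round" that the rest of the deck optimizes; everything before it only decides which 64-bit word to look at.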
  18. Select Final Round
      Binary search & table lookup
        Three-level branches
      [Figure: a three-level binary tree of seven "if" branches picks one of the eight 8-bit bytes of the final 64-bit word; a table lookup then finds the bit inside that byte.]
  19. Improvements
      How to remove the branches in the final round.
  20. Original
        // x is the final 64-bit block (uint64_t).
        x = x - ((x >> 1) & MASK_55);
        x = (x & MASK_33) + ((x >> 2) & MASK_33);
        x = (x + (x >> 4)) & MASK_0F;
        x *= MASK_01;  // Tricky popcount
        if (i < ((x >> 24) & 0xFF)) {     // The first-level branch
          if (i < ((x >> 8) & 0xFF)) {    // The second-level branch
            if (i < (x & 0xFF)) {         // The third-level branch
              // The first byte contains the i-th “1”.
            } else {
              // The second byte contains the i-th “1”.
        // …
  21. Tips – Tricky PopCount
        bits:                                        0 1 1 1 0 0 1 0
        x = x - ((x >> 1) & MASK_55);                1   2   0   1     (count per 2 bits)
        x = (x & MASK_33) + ((x >> 2) & MASK_33);    3       1         (count per 4 bits)
        x = (x + (x >> 4)) & MASK_0F;                4                 (count per byte)
  22. Tips – Tricky PopCount
        // MASK_01 = 0x0101010101010101ULL;
        // x = x + (x << 8) + (x << 16) + (x << 24) + …;
        x *= MASK_01;

        byte (high to low):             7   6   5   4   3   2   1   0
        before (count per byte):        4   1   3   5   2   6   3   4
        after (cumulative from byte 0): 28  24  23  20  15  13  7   4
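Slides 21 and 22 together form one small function: three masking steps produce a count per byte, and the multiplication turns those counts into running totals. The sketch below wraps them up; the function name `byte_prefix_counts` is hypothetical, and the mask values are the usual ones implied by the names MASK_55, MASK_33, MASK_0F and MASK_01.

```cpp
#include <stdint.h>

const uint64_t MASK_55 = 0x5555555555555555ULL;
const uint64_t MASK_33 = 0x3333333333333333ULL;
const uint64_t MASK_0F = 0x0F0F0F0F0F0F0F0FULL;
const uint64_t MASK_01 = 0x0101010101010101ULL;

// Returns a word whose byte k holds the number of 1s in bytes 0..k of x.
// Byte 7 therefore holds the popcount of the whole word (at most 64).
inline uint64_t byte_prefix_counts(uint64_t x) {
  x = x - ((x >> 1) & MASK_55);              // count per 2 bits
  x = (x & MASK_33) + ((x >> 2) & MASK_33);  // count per 4 bits
  x = (x + (x >> 4)) & MASK_0F;              // count per byte
  return x * MASK_01;                        // running totals, low byte first
}
```

Using the per-byte counts from slide 22 (4 1 3 5 2 6 3 4, high byte first), the input `0x0F01071F033F070F` yields `0x1C1817140F0D0704`, that is 28 24 23 20 15 13 7 4. No byte overflows into its neighbor because the largest total is 64.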
  23. + SSE2 (After PopCount)
        // y[0 … 7] = i + 1;
        __m128i y = _mm_cvtsi64_si128((i + 1) * MASK_01);
        __m128i z = _mm_cvtsi64_si128(x);
        // Compare the 16 8-bit signed integers in y and z.
        // y[k] = (y[k] > z[k]) ? 0xFF : 0x00;
        y = _mm_cmpgt_epi8(y, z);  // PCMPGTB
        // The j-th byte contains the i-th “1”.
        // TABLE is a 128-byte pre-computed table.
        uint8_t j = TABLE[_mm_movemask_epi8(y)];
  24. Tips – PCMPGTB
        byte (high to low):                           7     6     5     4     3     2     1     0
        y = _mm_cvtsi64_si128((i + 1) * MASK_01);     20    20    20    20    20    20    20    20
        z = _mm_cvtsi64_si128(x);                     28    24    23    20    15    13    7     4
        // y[k] = (y[k] > z[k]) ? 0xFF : 0x00;
        y = _mm_cmpgt_epi8(y, z);                     0x00  0x00  0x00  0x00  0xFF  0xFF  0xFF  0xFF
  25. + Tricks (After Comparison)
        uint64_t j = _mm_cvtsi128_si64(y);
        // Calculation without TABLE
        j = ((j & MASK_01) * MASK_01) >> 56;
        // Calculation with BSR
        j = (63 - __builtin_clzll(j + 1)) / 8;
        // Calculation with popcnt (SSE4.2 or SSE4a)
        j = __builtin_popcountll(j) / 8;
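The three alternatives on slide 25 all convert the same comparison result into a byte index, so they can be checked against each other without any SSE code. In this sketch the function names are hypothetical, the input is assumed to have its low j bytes equal to 0xFF and the rest 0x00 with j between 0 and 7 (as produced by the comparison on slide 24), and the builtins are the GCC/Clang ones already used on the slide.

```cpp
#include <stdint.h>

const uint64_t MASK_01 = 0x0101010101010101ULL;

// Without TABLE: keep one bit per 0xFF byte, then sum the bytes into the top byte.
inline unsigned byte_id_by_multiply(uint64_t mask) {
  return static_cast<unsigned>(((mask & MASK_01) * MASK_01) >> 56);
}

// With BSR: mask + 1 has a single set bit at position 8 * j.
inline unsigned byte_id_by_clz(uint64_t mask) {
  return static_cast<unsigned>((63 - __builtin_clzll(mask + 1)) / 8);
}

// With popcnt: the mask has exactly 8 * j set bits.
inline unsigned byte_id_by_popcount(uint64_t mask) {
  return static_cast<unsigned>(__builtin_popcountll(mask) / 8);
}
```

For the mask from slide 24, `0x00000000FFFFFFFF`, all three return 4. The `mask + 1` form relies on j being at most 7, since an all-ones mask would wrap to zero.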
  26. – SSE2 (Simple and Fast)
        // x is the final 64-bit block (uint64_t).
        x = x - ((x >> 1) & MASK_55);
        x = (x & MASK_33) + ((x >> 2) & MASK_33);
        x = (x + (x >> 4)) & MASK_0F;
        x *= MASK_01;  // Tricky popcount
        uint64_t y = (i + 1) * MASK_01;
        uint64_t z = x | MASK_80;
        // Compare the 8 7-bit unsigned integers in y and z.
        z = (z - y) & MASK_80;
        uint8_t j = __builtin_ctzll(z) / 8;
  27. Tips – Comparison
        byte (high to low):                 7     6     5     4     3     2     1     0
        uint64_t y = (i + 1) * MASK_01;     0x14  0x14  0x14  0x14  0x14  0x14  0x14  0x14
        uint64_t z = x | MASK_80;           0x9C  0x98  0x97  0x94  0x8F  0x8D  0x87  0x84
        // Compare the 8 7-bit unsigned integers in y and z.
        z = (z - y) & MASK_80;              0x80  0x80  0x80  0x80  0x00  0x00  0x00  0x00
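Putting slides 26 and 27 into one function gives a complete final round without the three-level branches. The byte search follows the slide; everything after it is an assumption of this sketch: `select_in_block` is a hypothetical name, the count of “1”s below the chosen byte is read back from the running totals, and the position inside the byte is found with a short loop where the original design uses a lookup table. `i` counts from 0 and the block must contain more than i “1”s.

```cpp
#include <stdint.h>

const uint64_t MASK_55 = 0x5555555555555555ULL;
const uint64_t MASK_33 = 0x3333333333333333ULL;
const uint64_t MASK_0F = 0x0F0F0F0F0F0F0F0FULL;
const uint64_t MASK_01 = 0x0101010101010101ULL;
const uint64_t MASK_80 = 0x8080808080808080ULL;

// Bit position (0..63) of the i-th 1 in block, counting from 0.
// Requires that block has more than i set bits.
inline unsigned select_in_block(uint64_t block, unsigned i) {
  uint64_t x = block;
  x = x - ((x >> 1) & MASK_55);
  x = (x & MASK_33) + ((x >> 2) & MASK_33);
  x = (x + (x >> 4)) & MASK_0F;
  x *= MASK_01;  // byte k = number of 1s in bytes 0..k

  // Bit 7 of byte k survives the subtraction iff (count in bytes 0..k) >= i + 1.
  const uint64_t y = static_cast<uint64_t>(i + 1) * MASK_01;
  const uint64_t z = ((x | MASK_80) - y) & MASK_80;
  const unsigned byte_id = static_cast<unsigned>(__builtin_ctzll(z)) / 8;

  // Number of 1s below the chosen byte: shift the totals up by one byte.
  const unsigned before = static_cast<unsigned>(((x << 8) >> (byte_id * 8)) & 0xFF);
  unsigned remaining = i - before;

  // Assumed stand-in for the table lookup: scan the 8 bits of the byte.
  const unsigned byte = static_cast<unsigned>((block >> (byte_id * 8)) & 0xFF);
  for (unsigned bit = 0; bit < 8; ++bit) {
    if ((byte >> bit) & 1) {
      if (remaining == 0) return byte_id * 8 + bit;
      --remaining;
    }
  }
  return 64;  // unreachable when the precondition holds
}
```

The subtraction never borrows across bytes because every byte of `x | MASK_80` is at least 0x80 while every byte of `y` is at most 64. For the block `0x0F01071F033F070F` and i = 19 (the values behind slide 27), the byte search lands on byte 4 and the function returns 36.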
  28. + SSSE3 (For PopCount)
        // Get lower nibbles and upper nibbles of x.
        __m128i lower = _mm_cvtsi64_si128(x & MASK_0F);
        __m128i upper = _mm_cvtsi64_si128(x & MASK_F0);
        upper = _mm_srli_epi32(upper, 4);
        // Use PSHUFB for counting “1”s in each nibble.
        __m128i table = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0);
        lower = _mm_shuffle_epi8(table, lower);
        upper = _mm_shuffle_epi8(table, upper);
        // Merge the counts to get the number of “1”s in each byte.
        x = _mm_cvtsi128_si64(_mm_add_epi8(lower, upper));
        x *= MASK_01;
  29. Tips – PSHUFB
        lower = _mm_cvtsi64_si128(x & MASK_0F);
          12  8  7  4  15  13  7  4
        table = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, …);
          4  3  3  2  3  2  2  1  3  2  2  1  2  1  1  0
        // Perform a parallel 16-way lookup.
        lower = _mm_shuffle_epi8(table, lower);
          2  1  3  1  4  3  3  1
  30. Evaluation
      How effective the improvements are.
  31. Environment
      OS
        Mac OS X 10.8.3 (64-bit)
      CPU
        Core i7 3720QM – Ivy Bridge
        2.6GHz – up to 3.6GHz
      Compiler
        Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)
  32. Data
      Source
        Japanese Wikipedia page titles
        gzip -cd jawiki-20130328-all-titles-in-ns0.gz | LC_ALL=C sort -R > data
      Details
        Number of keys: 1,367,750
        Average length: 21.14 bytes
        Total length: 28,919,893 bytes
  33. Binaries
      marisa 0.2.1
        ./configure CXX=clang++ --enable-popcnt
        make
        tools/marisa-benchmark < data
      marisa 0.2.2
        ./configure CXX=clang++ --enable-sse4
        make
        tools/marisa-benchmark < data
  34. Results – marisa 0.2.1
      Without improvements (baseline)

      #Tries  Size [KB]  Build [Kqps]  Lookup [Kqps]  Reverse [Kqps]  Prefix [Kqps]  Predict [Kqps]
      1       11,811     724           1,105          1,223           1,038          711
      2       8,639      632           790            877             753            453
      3       8,001      621           750            816             708            406
      4       7,788      591           723            791             687            391
      5       7,701      590           712            781             680            384
  35. Results – marisa 0.2.2
      With improvements (same size, faster operations)

      #Tries  Size [KB]  Build [Kqps]  Lookup [Kqps]  Reverse [Kqps]  Prefix [Kqps]  Predict [Kqps]
      1       11,811     757           1,198          1,359           1,115          772
      2       8,639      657           873            1,000           820            503
      3       8,001      621           817            924             770            453
      4       7,788      613           797            900             752            438
      5       7,701      610           787            884             737            427
  36. Results – Improvements
      Improvement ratios (same size, faster operations)

      #Tries  Size [%]  Build [%]  Lookup [%]  Reverse [%]  Prefix [%]  Predict [%]
      1       0.00      +4.56      +8.42       +11.12       +7.42       +8.58
      2       0.00      +3.96      +10.52      +14.03       +8.90       +11.04
      3       0.00      0.00       +8.93       +13.24       +8.76       +11.58
      4       0.00      +3.72      +10.24      +13.78       +9.46       +12.02
      5       0.00      +3.39      +10.53      +13.19       +8.38       +11.20
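The ratios on slide 36 appear to be the relative change in throughput between the two result tables. As a small worked check, assuming that definition (the function name `improvement_percent` is hypothetical):

```cpp
// Relative change from the 0.2.1 figure to the 0.2.2 figure, in percent.
inline double improvement_percent(double before_kqps, double after_kqps) {
  return (after_kqps - before_kqps) / before_kqps * 100.0;
}
```

For a single trie, Lookup goes from 1,105 to 1,198 Kqps, which gives (1198 - 1105) / 1105 = about 8.42%, matching the table. Recomputing from the rounded Kqps values can differ in the last digit from the slide (790 to 873 gives about 10.51% against the listed +10.52), presumably because the slide was computed from unrounded measurements.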
  37. Conclusion
      “Any sufficiently advanced technology is indistinguishable from magic.”
      “Any sufficiently advanced technique is indistinguishable from magic.”
      “You are a magician.”