# X86opti 05 s5yata

Remove Branches in BitVector Select Operations - marisa 0.2.2 -


Published in: Career
### Transcript

• 1. Remove Branches in BitVector Select Operations - marisa 0.2.2 - Susumu Yata (@s5yata), Brazil, Inc. 30 March 2013.
• 2. Who I Am
  Job: Brazil, Inc. (groonga developer). We need R&D software engineers.
  Personal research & development:
  Tries: darts-clone, marisa-trie, etc.
  Corpus: Nihongo Web Corpus 2010 (NWC 2010)
• 3. BitVector and Marisa: relationships between BitVector and Marisa.
• 4. BitVector
  What's BitVector? A sequence of bits.
  Operations: BitVector::get(i), BitVector::rank(i), BitVector::select(i)
• 5. BitVector - Get Operations
  Interface: BitVector::get(i)
  Description: the i-th bit (“0” or “1”)
  Position: 0 1 2 … i-1 i i+1 … n-2 n-1
  Bit:      0 0 1 …  0  1  1  …  0   0
  Get: the bit at position i.
• 6. BitVector - Rank Operations
  Interface: BitVector::rank(i)
  Description: the number of “1”s up to the i-th bit
  Position: 0 1 2 … i-1 i i+1 … n-2 n-1
  Bit:      0 0 1 …  0  1  1  …  0   0
  Rank: how many “1”s are there up to position i?
• 7. BitVector - Select Operations
  Interface: BitVector::select(i)
  Description: the position of the i-th “1”
  Position: 0 1 2 … n-2 n-1
  Bit:      0 0 1 …  0   0
  Select: where is the i-th “1”?
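To make the three interfaces concrete, here is a naive reference sketch. It is not marisa's implementation. The struct name `NaiveBitVector` is hypothetical, and two conventions are assumptions because the slides do not fix them: positions and `i` are 0-based, and `rank(i)` counts the “1”s in positions [0, i).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive O(n) reference versions of get/rank/select.
// Assumptions: 0-based positions; rank(i) counts "1"s in [0, i).
struct NaiveBitVector {
  std::vector<bool> bits;

  bool get(std::size_t i) const {
    assert(i < bits.size());
    return bits[i];
  }

  std::size_t rank(std::size_t i) const {
    assert(i <= bits.size());
    std::size_t count = 0;
    for (std::size_t k = 0; k < i; ++k) count += bits[k];
    return count;
  }

  // Position of the i-th "1" (0-based); returns bits.size() if there is none.
  std::size_t select(std::size_t i) const {
    for (std::size_t k = 0; k < bits.size(); ++k) {
      if (bits[k] && i-- == 0) return k;
    }
    return bits.size();
  }
};
```

Under these conventions, `rank(select(i)) == i` holds whenever the i-th “1” exists, which is a handy check for any faster implementation.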
• 8. Marisa
  Who's Marisa? An ordinary human magician.
  What's Marisa? A static and space-efficient dictionary.
  Data structure: recursive LOUDS-based Patricia tries
  Site: http://code.google.com/p/marisa-trie
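• 9. Marisa - Patricia
  Patricia is a labeled tree. Keys = Tree + Labels

  | Node | Label |
  |---|---|
  | 1 | “Ar” |
  | 2 | “Brazil” |
  | 3 | 'C' |
  | 4 | “gentina” |
  | 5 | “menia” |
  | 6 | “anada” |
  | 7 | “yprus” |

  | ID | Key |
  |---|---|
  | 0 | “Argentina” |
  | 1 | “Armenia” |
  | 2 | “Brazil” |
  | 3 | “Canada” |
  | 4 | “Cyprus” |

  [Figure: tree with root 0; nodes 1, 2, 3 are children of the root, nodes 4 and 5 are children of node 1, and nodes 6 and 7 are children of node 3.]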
• 10. Marisa - Recursiveness
  Unfortunately, this margin is too small…
  Keys = Tree + Labels
  Labels = Tree + Labels
  Labels = Tree + Labels  <- Reasonable
  Labels = Tree + Labels
  Labels = Tree + Labels
  Labels = Tree + Labels
  Labels = Tree + Labels
  …
• 11. Marisa - BitVector Usage
  LOUDS: Level-Order Unary Degree Sequence.
  Terminal flags: a node is terminal (“1”) or not (“0”).
  Link flags: a node has a link to its multi-byte label (“1”) or has a built-in single-byte label (“0”).
• 12. Marisa - BitVector Usage
  LOUDS: BitVector::get(), select()
  Terminal flags: BitVector::get(), rank(), select()
  Link flags: BitVector::get(), rank()
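As an illustration of what a Level-Order Unary Degree Sequence looks like, the sketch below encodes a tree shaped like the Patricia example on slide 9 (root 0 with children 1, 2, 3; node 1 with children 4, 5; node 3 with children 6, 7). The function name `louds_bits` is hypothetical, and the leading “10” super-root is a common LOUDS convention assumed here; the slides do not say which variant marisa uses.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// children[v] lists the children of node v. Nodes are assumed to be numbered
// in level order (root = 0), so iterating v = 0, 1, 2, ... visits them in
// level order.
std::string louds_bits(const std::vector<std::vector<int>>& children) {
  std::string bits = "10";  // Super-root (assumed convention).
  for (const auto& c : children) {
    bits.append(c.size(), '1');  // One "1" per child (unary degree) ...
    bits.push_back('0');         // ... terminated by a "0".
  }
  return bits;
}
```

A tree with n nodes yields 2n + 1 bits this way. Navigation (parent, first child, next sibling) is then expressed with rank and select on this sequence, which is why select performance matters for lookups.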
• 13. Implementations: how to implement Rank/Select operations.
• 14. Rank Dictionary
  Index structures:
  r_idx[x].abs = rank(512 * x), x = 0, 1, 2, …
  r_idx[x].rel[y] = rank(512 * x + 64 * y) - rank(512 * x), y = 1, 2, 3, …, 7
  Calculation: abs + rel + popcnt()
• 15. Rank Operations
  Time complexity = O(1)
  [Figure: the bit vector split into 512-bit blocks covered by r_idx.abs; each block split into eight 64-bit units covered by r_idx.rel; popcnt() applied to the final 64-bit unit.]
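The abs + rel + popcnt() calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not marisa's code: the names `RankEntry` and `RankDict` are hypothetical, bit i is assumed to live in bit (i % 64) of word (i / 64), rank(i) is assumed to count “1”s in [0, i), and a `rel[0] = 0` slot is stored to keep the lookup uniform (the slide only defines y = 1 … 7).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct RankEntry {
  std::uint64_t abs;     // rank(512 * x)
  std::uint16_t rel[8];  // rel[y] = rank(512 * x + 64 * y) - rank(512 * x)
};

struct RankDict {
  std::vector<std::uint64_t> words;  // bit i = bit (i % 64) of words[i / 64]
  std::vector<RankEntry> r_idx;      // one entry per 512-bit block

  explicit RankDict(std::vector<std::uint64_t> w) : words(std::move(w)) {
    std::uint64_t total = 0;
    for (std::size_t k = 0; k < words.size(); ++k) {
      if (k % 8 == 0) r_idx.push_back(RankEntry{total, {}});
      RankEntry& e = r_idx.back();
      e.rel[k % 8] = static_cast<std::uint16_t>(total - e.abs);
      total += __builtin_popcountll(words[k]);  // GCC/Clang builtin
    }
  }

  // Number of "1"s in positions [0, i); requires i < 64 * words.size().
  std::uint64_t rank(std::size_t i) const {
    assert(i < 64 * words.size());
    const RankEntry& e = r_idx[i / 512];
    // Keep only the bits below position i inside the final 64-bit word.
    std::uint64_t below = words[i / 64] & ((std::uint64_t{1} << (i % 64)) - 1);
    return e.abs + e.rel[(i / 64) % 8] + __builtin_popcountll(below);
  }
};
```

Each query touches one index entry and one data word, which is where the O(1) bound comes from.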
• 16. Select Dictionary
  Index structure: s_idx[x] = select(512 * x), x = 0, 1, 2, …
  Calculation:
  1. Limit the range by using s_idx.
  2. Limit the range by using r_idx[x].abs.
  3. Limit the range by using r_idx[x].rel[y].
  4. Find the i-th “1” in the range.
• 17. Select Operations
  [Figure: two consecutive s_idx entries bound a run of 512-bit blocks; two consecutive r_idx.abs entries bound one 512-bit block; two consecutive r_idx.rel entries bound one 64-bit unit, which is handled in the final round.]
• 18. Select Final Round
  Binary search & table lookup.
  [Figure: three levels of branches (seven "if" nodes in total) narrow the 64-bit unit down to one of eight 8-bit bytes, followed by a table lookup within that byte.]
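The range-narrowing steps can be sketched in simplified form. This is an assumption-laden illustration rather than marisa's algorithm: it skips `s_idx` and binary-searches the whole `abs` array, it replaces the `rel` lookups with per-word popcounts, and the final round is a plain loop instead of the branch-and-table scheme above. The name `select_sketch` and the parameter layout are hypothetical; `abs[x]` must hold rank(512 * x) for every 512-bit block, and `i` is 0-based.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Position of the i-th "1" (0-based), or 64 * words.size() if it does not exist.
std::size_t select_sketch(const std::vector<std::uint64_t>& words,
                          const std::vector<std::uint64_t>& abs,
                          std::uint64_t i) {
  if (abs.empty()) return 0;
  // Step 1/2: last 512-bit block whose abs count is <= i.
  // (s_idx would narrow this binary search to a short range first.)
  auto it = std::upper_bound(abs.begin(), abs.end(), i);
  std::size_t block = static_cast<std::size_t>(it - abs.begin()) - 1;
  i -= abs[block];
  // Step 3: find the 64-bit word inside the block (rel[y] would avoid the popcounts).
  for (std::size_t w = block * 8; w < words.size() && w < block * 8 + 8; ++w) {
    std::uint64_t c = __builtin_popcountll(words[w]);
    if (i < c) {
      // Step 4 (final round): clear the lowest set bit i times, then locate the next one.
      std::uint64_t x = words[w];
      while (i--) x &= x - 1;
      return w * 64 + __builtin_ctzll(x);
    }
    i -= c;
  }
  return words.size() * 64;
}
```

The final round is the part the rest of the deck optimizes: the loop (or the three-level branch) is replaced by branch-free arithmetic on the 64-bit word.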
• 19. Improvements: how to remove the branches in the final round.
• 20. Original
  // x is the final 64-bit block (uint64_t).
  x = x - ((x >> 1) & MASK_55);
  x = (x & MASK_33) + ((x >> 2) & MASK_33);
  x = (x + (x >> 4)) & MASK_0F;
  x *= MASK_01;  // Tricky popcount
  if (i < ((x >> 24) & 0xFF)) {  // The first-level branch
    if (i < ((x >> 8) & 0xFF)) {  // The second-level branch
      if (i < (x & 0xFF)) {  // The third-level branch
        // The first byte contains the i-th "1".
      } else {
        // The second byte contains the i-th "1".
• 21. Tips - Tricky PopCount
  Input bits:                                       0 1 1 1 0 0 1 0
  x = x - ((x >> 1) & MASK_55);              2-bit counts: 1 2 0 1
  x = (x & MASK_33) + ((x >> 2) & MASK_33);  4-bit counts: 3 1
  x = (x + (x >> 4)) & MASK_0F;              8-bit count:  4
• 22. Tips - Tricky PopCount
  // MASK_01 = 0x0101010101010101ULL;
  // x = x + (x << 8) + (x << 16) + (x << 24) + …;
  x *= MASK_01;
  Per-byte counts (high byte to low byte):  4  1  3  5  2  6  3  4
  After the multiplication (running totals): 28 24 23 20 15 13  7  4
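The popcount steps and the multiplication can be combined into one small function. This is a sketch: the function name `byte_prefix_counts` is hypothetical, and the values of `MASK_55`, `MASK_33`, and `MASK_0F` are assumed from their names (only `MASK_01` is spelled out on the slide).

```cpp
#include <cstdint>

constexpr std::uint64_t MASK_55 = 0x5555555555555555ULL;  // assumed value
constexpr std::uint64_t MASK_33 = 0x3333333333333333ULL;  // assumed value
constexpr std::uint64_t MASK_0F = 0x0F0F0F0F0F0F0F0FULL;  // assumed value
constexpr std::uint64_t MASK_01 = 0x0101010101010101ULL;

// Byte k of the result holds the number of "1"s in bytes 0..k of x.
std::uint64_t byte_prefix_counts(std::uint64_t x) {
  x = x - ((x >> 1) & MASK_55);              // counts per 2 bits
  x = (x & MASK_33) + ((x >> 2) & MASK_33);  // counts per 4 bits
  x = (x + (x >> 4)) & MASK_0F;              // counts per byte
  return x * MASK_01;                        // running totals per byte
}
```

For a word whose bytes contain 4, 1, 3, 5, 2, 6, 3, 4 ones (high byte to low byte), for example `0x0F01071F033F070F`, the result is `0x1C1817140F0D0704`, that is 28, 24, 23, 20, 15, 13, 7, 4. The top byte never exceeds 64, so no byte overflows.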
• 23. + SSE2 (After PopCount)
  // y[0 … 7] = i + 1;
  __m128i y = _mm_cvtsi64_si128((i + 1) * MASK_01);
  __m128i z = _mm_cvtsi64_si128(x);
  // Compare the 16 8-bit signed integers in y and z.
  // y[k] = (y[k] > z[k]) ? 0xFF : 0x00;
  y = _mm_cmpgt_epi8(y, z);  // PCMPGTB
  // The j-th byte contains the i-th "1".
  // TABLE is a 128-byte pre-computed table.
  uint8_t j = TABLE[_mm_movemask_epi8(y)];
• 24. Tips - PCMPGTB
  y = _mm_cvtsi64_si128((i + 1) * MASK_01);
    y: 20 20 20 20 20 20 20 20
  z = _mm_cvtsi64_si128(x);
    z: 28 24 23 20 15 13 7 4
  // y[k] = (y[k] > z[k]) ? 0xFF : 0x00;
  y = _mm_cmpgt_epi8(y, z);
    y: 0x00 0x00 0x00 0x00 0xFF 0xFF 0xFF 0xFF
• 25. + Tricks (After Comparison)
  uint64_t j = _mm_cvtsi128_si64(y);
  // Calculation without TABLE
  j = ((j & MASK_01) * MASK_01) >> 56;
  // Calculation with BSR
  j = (63 - __builtin_clzll(j + 1)) / 8;
  // Calculation with popcnt (SSE4.2 or SSE4a)
  j = __builtin_popcountll(j) / 8;
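The three alternatives all count how many low bytes of the comparison result are 0xFF. A portable sketch that takes the comparison result as a plain 64-bit value may make this easier to check; the function names are hypothetical, and the input is assumed to consist of k low bytes of 0xFF followed by zero bytes with k at most 7 (the top byte is never 0xFF when the i-th “1” exists in the block, which also keeps `j + 1` nonzero for the count-leading-zeros variant).

```cpp
#include <cstdint>

constexpr std::uint64_t MASK_01 = 0x0101010101010101ULL;

// Each function returns the number of 0xFF bytes in j, given the assumed shape of j.
unsigned count_ff_mul(std::uint64_t j) {
  // Keep one bit per 0xFF byte, then sum those bits into the top byte.
  return static_cast<unsigned>(((j & MASK_01) * MASK_01) >> 56);
}

unsigned count_ff_bsr(std::uint64_t j) {
  // j + 1 is a power of two whose exponent is 8 * k.
  return static_cast<unsigned>(63 - __builtin_clzll(j + 1)) / 8;
}

unsigned count_ff_popcnt(std::uint64_t j) {
  // Every 0xFF byte contributes eight set bits.
  return static_cast<unsigned>(__builtin_popcountll(j)) / 8;
}
```

With the result from slide 24, `0x00000000FFFFFFFF`, all three return 4, so byte 4 contains the requested “1”.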
• 26. - SSE2 (Simple and Fast)
  // x is the final 64-bit block (uint64_t).
  x = x - ((x >> 1) & MASK_55);
  x = (x & MASK_33) + ((x >> 2) & MASK_33);
  x = (x + (x >> 4)) & MASK_0F;
  x *= MASK_01;  // Tricky popcount
  uint64_t y = (i + 1) * MASK_01;
  uint64_t z = x | MASK_80;
  // Compare the 8 7-bit unsigned integers in y and z.
  z = (z - y) & MASK_80;
  uint8_t j = __builtin_ctzll(z) / 8;
• 27. Tips - Comparison
  uint64_t y = (i + 1) * MASK_01;
    y: 0x14 0x14 0x14 0x14 0x14 0x14 0x14 0x14
  uint64_t z = x | MASK_80;
    z: 0x9C 0x98 0x97 0x94 0x8F 0x8D 0x87 0x84
  // Compare the 8 7-bit unsigned integers in y and z.
  z = (z - y) & MASK_80;
    z: 0x80 0x80 0x80 0x80 0x00 0x00 0x00 0x00
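Putting the SSE2-free comparison together with the popcount gives a complete branch-free search for the byte, sketched below. Several parts are assumptions: the mask values other than `MASK_01` are inferred from their names, the function name `select_in_word` is hypothetical, `i` is 0-based, and the last step inside the chosen byte uses a short loop for brevity where the slides use a table lookup.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t MASK_55 = 0x5555555555555555ULL;  // assumed value
constexpr std::uint64_t MASK_33 = 0x3333333333333333ULL;  // assumed value
constexpr std::uint64_t MASK_0F = 0x0F0F0F0F0F0F0F0FULL;  // assumed value
constexpr std::uint64_t MASK_01 = 0x0101010101010101ULL;
constexpr std::uint64_t MASK_80 = 0x8080808080808080ULL;  // assumed value

// Bit position of the i-th "1" (0-based) in x; requires i < popcount(x).
unsigned select_in_word(std::uint64_t x, unsigned i) {
  assert(i < static_cast<unsigned>(__builtin_popcountll(x)));
  // Running per-byte totals, as in the "tricky popcount".
  std::uint64_t counts = x - ((x >> 1) & MASK_55);
  counts = (counts & MASK_33) + ((counts >> 2) & MASK_33);
  counts = (counts + (counts >> 4)) & MASK_0F;
  counts *= MASK_01;
  // Bit 7 of byte k survives the subtraction iff counts[k] >= i + 1.
  std::uint64_t y = (i + 1) * MASK_01;
  std::uint64_t z = ((counts | MASK_80) - y) & MASK_80;
  unsigned j = static_cast<unsigned>(__builtin_ctzll(z)) / 8;  // byte holding the i-th "1"
  // Number of "1"s in the bytes below byte j (0 when j == 0).
  unsigned before = static_cast<unsigned>((counts << 8) >> (8 * j)) & 0xFF;
  // Finish inside the byte: drop the lowest set bit (i - before) times.
  unsigned byte = static_cast<unsigned>(x >> (8 * j)) & 0xFF;
  for (unsigned r = i - before; r > 0; --r) byte &= byte - 1;
  return 8 * j + static_cast<unsigned>(__builtin_ctz(byte));
}
```

The subtraction never borrows across bytes because every byte of `counts | MASK_80` is at least 0x80 and every byte of `y` is at most 64. For the example word `0x0F01071F033F070F` and i = 19 (so i + 1 = 0x14, as on the slide), the byte index is 4 and the function returns bit position 36.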
• 28. + SSSE3 (For PopCount)
  // Get lower nibbles and upper nibbles of x.
  __m128i lower = _mm_cvtsi64_si128(x & MASK_0F);
  __m128i upper = _mm_cvtsi64_si128(x & MASK_F0);
  upper = _mm_srli_epi32(upper, 4);
  // Use PSHUFB for counting "1"s in each nibble.
  __m128i table = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0);
  lower = _mm_shuffle_epi8(table, lower);
  upper = _mm_shuffle_epi8(table, upper);
  // Merge the counts to get the number of "1"s in each byte.
  x = _mm_cvtsi128_si64(_mm_add_epi8(lower, upper));
  x *= MASK_01;
• 29. Tips - PSHUFB
  lower = _mm_cvtsi64_si128(x & MASK_0F);
    lower: 12 8 7 4 15 13 7 4
  table = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, …);
    table: 4 3 3 2 3 2 2 1 3 2 2 1 2 1 1 0
  // Perform a parallel 16-way lookup.
  lower = _mm_shuffle_epi8(table, lower);
    lower: 2 1 3 1 4 3 3 1
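For readers without SSSE3 at hand, the nibble-table idea can be emulated with a scalar loop. This sketch is only a stand-in for the PSHUFB version: it performs the same 16-entry lookups one byte at a time instead of in parallel, and the names `NIBBLE_COUNT` and `byte_counts_by_nibble_table` are hypothetical.

```cpp
#include <cstdint>

// Popcount of each 4-bit value 0..15 (same contents as the PSHUFB table,
// listed here from index 0 upward).
constexpr std::uint8_t NIBBLE_COUNT[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                           1, 2, 2, 3, 2, 3, 3, 4};

// Byte k of the result holds the number of "1"s in byte k of x.
std::uint64_t byte_counts_by_nibble_table(std::uint64_t x) {
  std::uint64_t counts = 0;
  for (int k = 0; k < 8; ++k) {
    std::uint8_t byte = static_cast<std::uint8_t>(x >> (8 * k));
    // Look up the lower and upper nibble separately, then merge.
    std::uint64_t c = NIBBLE_COUNT[byte & 0x0F] + NIBBLE_COUNT[byte >> 4];
    counts |= c << (8 * k);
  }
  return counts;
}
```

With the lower nibbles from the slide (12, 8, 7, 4, 15, 13, 7, 4, high byte to low byte) the per-byte counts are 2, 1, 3, 1, 4, 3, 3, 1. Multiplying the result by `0x0101010101010101` then yields the same running totals as the shift-and-mask popcount.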
• 30. Evaluation: how effective the improvements are.
• 31. Environment
  OS: Mac OS X 10.8.3 (64-bit)
  CPU: Core i7 3720QM (Ivy Bridge), 2.6 GHz, up to 3.6 GHz
  Compiler: Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)
• 32. Data
  Source: Japanese Wikipedia page titles
  gzip -cd jawiki-20130328-all-titles-in-ns0.gz | LC_ALL=C sort -R > data
  Details:
  Number of keys: 1,367,750
  Average length: 21.14 bytes
  Total length: 28,919,893 bytes
• 33. Binaries
  marisa 0.2.1:
  ./configure CXX=clang++ --enable-popcnt
  make
  tools/marisa-benchmark < data
  marisa 0.2.2:
  ./configure CXX=clang++ --enable-sse4
  make
  tools/marisa-benchmark < data
• 34. Results - marisa 0.2.1 (without improvements; baseline)

  | #Tries | Size [KB] | Build [Kqps] | Lookup [Kqps] | Reverse [Kqps] | Prefix [Kqps] | Predict [Kqps] |
  |---|---|---|---|---|---|---|
  | 1 | 11,811 | 724 | 1,105 | 1,223 | 1,038 | 711 |
  | 2 | 8,639 | 632 | 790 | 877 | 753 | 453 |
  | 3 | 8,001 | 621 | 750 | 816 | 708 | 406 |
  | 4 | 7,788 | 591 | 723 | 791 | 687 | 391 |
  | 5 | 7,701 | 590 | 712 | 781 | 680 | 384 |
• 35. Results - marisa 0.2.2 (with improvements; same size, faster operations)

  | #Tries | Size [KB] | Build [Kqps] | Lookup [Kqps] | Reverse [Kqps] | Prefix [Kqps] | Predict [Kqps] |
  |---|---|---|---|---|---|---|
  | 1 | 11,811 | 757 | 1,198 | 1,359 | 1,115 | 772 |
  | 2 | 8,639 | 657 | 873 | 1,000 | 820 | 503 |
  | 3 | 8,001 | 621 | 817 | 924 | 770 | 453 |
  | 4 | 7,788 | 613 | 797 | 900 | 752 | 438 |
  | 5 | 7,701 | 610 | 787 | 884 | 737 | 427 |
• 36. Results - Improvements (improvement ratios; same size, faster operations)

  | #Tries | Size [%] | Build [%] | Lookup [%] | Reverse [%] | Prefix [%] | Predict [%] |
  |---|---|---|---|---|---|---|
  | 1 | 0.00 | +4.56 | +8.42 | +11.12 | +7.42 | +8.58 |
  | 2 | 0.00 | +3.96 | +10.52 | +14.03 | +8.90 | +11.04 |
  | 3 | 0.00 | 0.00 | +8.93 | +13.24 | +8.76 | +11.58 |
  | 4 | 0.00 | +3.72 | +10.24 | +13.78 | +9.46 | +12.02 |
  | 5 | 0.00 | +3.39 | +10.53 | +13.19 | +8.38 | +11.20 |
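The improvement ratios appear to be the relative change in throughput between the two versions. A small sketch of that arithmetic follows; the function name `improvement_percent` is hypothetical, and the formula is inferred from the tables rather than stated on the slides.

```cpp
// Relative change from `before` to `after`, in percent.
double improvement_percent(double before, double after) {
  return (after - before) / before * 100.0;
}
```

For example, the single-trie Lookup throughput goes from 1,105 to 1,198 Kqps, which is roughly +8.4%. The published figures may differ in the last digit because the Kqps values shown are already rounded.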
• 37. Conclusion
  “Any sufficiently advanced technology is indistinguishable from magic.”
  “Any sufficiently advanced technique is indistinguishable from magic.”
  “You are a magician.”