VAST-Tree, EDBT'12

  1. VAST-Tree: A Vector-Advanced and Compressed Structure for Massive Data Tree Traversal
     EDBT 2012, March 27th-29th, 2012, Humboldt University, Berlin
  2. Outline
     • Backgrounds & Motivation
       – Modern HW and HW-aware algorithms
     • Prerequisite Knowledge
       – Search keys with SIMD instructions
     • Proposed Technique
       – Branch compression for high parallelization
     • Experimental Results
       – Twitter Public Timeline as a real data set
       – Compression ratio and throughput
  3. Backgrounds: Modern Hardware
     • Fast and highly functional hardware
       – Multi-/many-core CPUs: Intel Ivy Bridge/Haswell/Skylake/Knights Ferry
       – GPUs for general-purpose computing
       – ...
     • New algorithms advanced by this hardware
       – Sorts, searches, compression, and DB kernels
     Today's topic: tree searches on multi-core CPUs
  4. Backgrounds: Multi-core CPUs
     • Highly advanced instructions
       – 128/256/512-bit SIMD, Transactional Memory, ...
     • Branch prediction
       – Processes "if-then" paths efficiently
       – High penalties for branch misses
     • Parallelism & memory
       – Many cores on a single processor
       – Limited by memory accesses [5][14][15]
  5. Backgrounds: Tree Traversal
     • Search a key from a sequence of values
     • Fundamental operation
       – Used everywhere, and well-known
     Code snippet:
       if (search_key > node->compare_key)
           node = node->right;
       else
           node = node->left;
     [Figure: Ex.) binary tree with keys 48, 12, 68, 7, 20 and an incoming search_key]
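The snippet on slide 5 can be expanded into a complete lookup loop. The following is a minimal sketch in C, not code from the talk: the `struct node` layout and the `descend` function name are assumptions for illustration, since the slide only shows the `compare_key`, `left`, and `right` fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout; only the three field names appear on the slide. */
struct node {
    int32_t compare_key;
    struct node *left;
    struct node *right;
};

/* Walk from the root until a child pointer is NULL and return the last
 * node visited. Every level takes one data-dependent branch, which is the
 * source of the branch penalties discussed on the next slide. */
const struct node *descend(const struct node *root, int32_t search_key)
{
    const struct node *node = root;
    const struct node *last = NULL;
    while (node != NULL) {
        last = node;
        if (search_key > node->compare_key)
            node = node->right;
        else
            node = node->left;
    }
    return last;
}
```

With a tree built from the slide's keys (assuming 48 is the root, 12 and 68 are its children, and 7 and 20 are the children of 12), searching for 20 ends at the node holding 20.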
  6. Backgrounds: Tree Traversal
     • But legacy algorithms are too inefficient
     Actual execution time: 20-40%
     [Chart: ratio of execution time (complete instructions, stall time, branch penalties; 0% to 100%) and # of instructions (0.0E+00 to 6.0E+03) against log2(# of keys); x-axis labels: 22 (0.161), 24 (0.167), 26 (0.206), 28 (0.319)]
  7. Backgrounds: Existing Algorithms
     • Cache-conscious B+Tree [4][10][11][19][20]
       – Realigning, prefetching, and buffering nodes
     • FAST [14]
       – Cache-conscious and branch-free techniques
       – SIMD instructions used for branch-free searches
     • PALM [24]
       – Supports incremental updates for FAST
  8. Prerequisite Knowledge: Tree traversal with SIMD instructions
  9. Prerequisite Knowledge: Searches with SIMD
     • Process multiple data with SIMD instructions
       – Most x86 processors support 128-bit SIMD
       – Returns 1 or 0 for an inequality comparison
     • FAST compares 3 keys simultaneously
     128-bit registers with 32-bit elements:
       Register A: 34 78 91 x
       Register B: 79 79 79 x
       Register C:  1  1  0 x
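The register example on slide 9 can be reproduced with SSE2 intrinsics. This is a hedged sketch rather than FAST's actual code: the function name `compare_block` and the choice to return a bit mask are assumptions, and a scalar fallback is included so the example also builds where SSE2 is unavailable.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Compare one search key against the 3 keys of a SIMD block.
 * Bit i of the result is 1 when search_key > keys[i], for lanes 0..2.
 * keys must point to 4 readable int32_t values; lane 3 is padding ("x"). */
unsigned compare_block(const int32_t keys[4], int32_t search_key)
{
#if defined(__SSE2__)
    __m128i block = _mm_loadu_si128((const __m128i *)keys); /* Register A */
    __m128i probe = _mm_set1_epi32(search_key);             /* Register B */
    __m128i gt    = _mm_cmpgt_epi32(probe, block);          /* Register C */
    /* Collect one sign bit per 32-bit lane, then drop the padding lane. */
    return (unsigned)_mm_movemask_ps(_mm_castsi128_ps(gt)) & 0x7u;
#else
    unsigned mask = 0;
    for (int i = 0; i < 3; i++)
        mask |= (unsigned)(search_key > keys[i]) << i;
    return mask;
#endif
}
```

For the keys 34, 78, 91 and the search key 79, lanes 0 and 1 are set and lane 2 is clear, which corresponds to the "1 1 0" shown in Register C.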
  10. Logical Example: Searches with SIMD
      [Figure legend: highlighted boxes are SIMD blocks compared simultaneously; 79 is a search key]
  11. Logical Example: Searches with SIMD
      [Figure legend: highlighted boxes are SIMD blocks compared simultaneously; 79 is a search key]
      Compare keys with SIMD
  12. Logical Example: Searches with SIMD
      [Figure legend: highlighted boxes are SIMD blocks compared simultaneously; 79 is a search key]
      Compare keys with SIMD
      A lookup table:
        Returned Values | Offset Blocks
        ...             | ...
        1 1 0 x         | 3
        ...             | ...
      [Figure: child blocks numbered 1 2 3 4]
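The lookup-table step on slide 12 can be sketched as a small array indexed by the comparison result. This is an illustration under stated assumptions: the keys inside a block are taken to be sorted as drawn on the slide, the names `child_of_mask`, `scalar_mask`, and `next_child` are hypothetical, and child indices are 0-based here while the slide numbers blocks from 1.

```c
#include <stdint.h>

/* Maps the 3-bit comparison mask (bit i = search_key > keys[i]) to a
 * 0-based child index. With sorted keys only 4 masks can occur;
 * -1 marks the combinations that cannot happen. */
static const int8_t child_of_mask[8] = {
    /* 000 */  0, /* 001 */  1, /* 010 */ -1, /* 011 */  2,
    /* 100 */ -1, /* 101 */ -1, /* 110 */ -1, /* 111 */  3
};

/* Scalar stand-in for the SIMD comparison, to keep the sketch portable. */
unsigned scalar_mask(const int32_t keys[3], int32_t search_key)
{
    unsigned mask = 0;
    for (int i = 0; i < 3; i++)
        mask |= (unsigned)(search_key > keys[i]) << i;
    return mask;
}

/* Pick the child block to visit next without any if/else on the keys. */
int next_child(const int32_t keys[3], int32_t search_key)
{
    return child_of_mask[scalar_mask(keys, search_key)];
}
```

For keys 34, 78, 91 and search key 79 the mask is "1 1 0" in lane order, and the table returns index 2, the third of the four child blocks, which matches the "3" in the slide's table.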
  13. Logical Example: Searches with SIMD
      [Figure legend: highlighted boxes are SIMD blocks compared simultaneously; 79 is a search key]
      Move to the next SIMD block
  14. Physical Example: Searches with SIMD
      • Arrange SIMD blocks in breadth-first order in physically consecutive memory
  15. Physical Example: Searches with SIMD
      • Arrange SIMD blocks in breadth-first order in physically consecutive memory
      [34, 78, 91], [2, 11, 23], [35, 39, 49], [80, 87, 88], ...  (toward high addresses in memory)
      Each SIMD block is 12B
      36B offset jump!
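The 36B jump on slide 15 follows from simple index arithmetic over the breadth-first array. The sketch below assumes an implicit 4-ary layout in which the children of block b are blocks 4b+1 to 4b+4, and it assumes sorted keys within each block; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { KEYS_PER_BLOCK = 3, FANOUT = KEYS_PER_BLOCK + 1 };

/* Index of the child-th child (0-based) of a block in breadth-first order. */
size_t child_block(size_t block, unsigned child)
{
    return block * FANOUT + 1 + child;
}

/* Byte offset of a block: 3 keys of 4 bytes each, so 12B per block. */
size_t block_byte_offset(size_t block)
{
    return block * KEYS_PER_BLOCK * sizeof(int32_t);
}

/* Descend through `levels` levels of blocks and return the index of the
 * block reached on the last level. The child index is the number of keys
 * smaller than the search key, computed without branches. */
size_t descend_blocks(const int32_t *keys, size_t levels, int32_t search_key)
{
    size_t block = 0;
    for (size_t level = 1; level < levels; level++) {
        const int32_t *k = keys + block * KEYS_PER_BLOCK;
        unsigned child = (unsigned)(search_key > k[0])
                       + (unsigned)(search_key > k[1])
                       + (unsigned)(search_key > k[2]);
        block = child_block(block, child);
    }
    return block;
}
```

With the slide's array and search key 79, the root block [34, 78, 91] selects child 2, which is block 3, [80, 87, 88], at byte offset 3 × 12 = 36.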
  16. Issue: Number of Comparison Keys
      • More keys compared simultaneously!
        – SIMD supports 1-byte and 2-byte elements
      [Figure: a SIMD register split into 16 elements of 1 byte each, or 8 elements of 2 bytes each]
  17. A proposed technique: Branch compression for high parallelization
  18. VAST-Tree: Designing Data Structure
      • Classify branches into 3 layers
        – Apply FAST to P32, and compress keys in P16 and P8
      [Figure legend: SBs = SIMD blocks; CBs = compression blocks]
      [Figure labels: (H32) SIMD blocks; (H16) 2-byte keys, 7 keys compared simultaneously; (H8) 1-byte keys, 15 keys compared simultaneously]
  19. Detail Outline: VAST-Tree
      • Branch Lossy Compression
        – Comparison Errors
      • SIMD-Aligned Layouts
      • Other topics ...
  20. Proposed: Branch Lossy Compression
      • Apply to each compression block
        – Prefix and suffix bit truncation
      • Transform 'search' keys similarly for comparison
        – Extracted bit location stored in the header of CBs
      [Figure: keys in ascending order within a CB; the partial bits shown with a red background are extracted and the lower bits are removed, giving 1-byte keys]
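The prefix and suffix truncation on slide 20 amounts to a shift and a mask. The following is a minimal sketch, not the paper's implementation: the `cb_header` struct, its field names, and `lossy_key` are assumptions standing in for "the extracted bit location stored in the header of CBs".

```c
#include <stdint.h>

/* Hypothetical per-CB header describing which bits are kept. */
struct cb_header {
    uint8_t shift; /* number of low (suffix) bits removed; must be < 32 */
    uint8_t width; /* number of bits kept, e.g. 8 or 16 */
};

/* Extract the kept bits of a 32-bit key. The same transform is applied to
 * stored keys and to the search key so that they remain comparable. */
uint32_t lossy_key(uint32_t key, struct cb_header h)
{
    uint32_t keep = (h.width >= 32) ? UINT32_MAX
                                    : ((UINT32_C(1) << h.width) - 1);
    return (key >> h.shift) & keep;
}
```

Dropping the 4 low bits and keeping 8 bits turns 3220 (1100 1001 0100 in binary) into 201 (1100 1001 in binary).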
  21. Penalty: Comparison Errors
      • But some lossy keys are compared incorrectly
        Example)
          value1 - 3220 (1100 1001 0100₂; kept upper 8 bits 1100 1001₂ = 201₁₀)
          value2 - 3219 (1100 1001 0011₂; kept upper 8 bits 1100 1001₂ = 201₁₀)
          Original values: 3220 > 3219 --> Return 0
          Compressed values: 201 ≤ 201 --> Return 1
          An error happens!
      • Check and correct errors after tree traversal
        – Scan leaf nodes sequentially
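The error on slide 21 and the sequential correction can be illustrated as follows. This is a sketch under assumptions: the leaves are modelled as one sorted array, `le` is the "less than or equal" comparison implied by the slide's example, and `correct_position` is a hypothetical helper that repairs the position reached by the lossy traversal.

```c
#include <stddef.h>
#include <stdint.h>

/* Comparison used while descending: returns 1 when a <= b, else 0. */
int le(uint32_t a, uint32_t b)
{
    return a <= b;
}

/* Sequential correction in a sorted leaf array. `guess` is the position
 * the lossy traversal ended at, which may be slightly off. Returns the
 * first index whose key is >= search_key, or n if there is none. */
size_t correct_position(const uint32_t *leaves, size_t n,
                        size_t guess, uint32_t search_key)
{
    size_t i = guess < n ? guess : n;
    while (i > 0 && leaves[i - 1] >= search_key)
        i--;                      /* traversal overshot: step back */
    while (i < n && leaves[i] < search_key)
        i++;                      /* traversal undershot: step forward */
    return i;
}
```

On the uncompressed values `le(3220, 3219)` is 0, but after dropping the 4 low bits both become 201 and the comparison returns 1. The scan then moves from the wrong leaf position to the right one; its cost grows with how far off the guess is.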
  22. Proposed: SIMD-Aligned Layouts
      • Load data efficiently to SIMD registers
      • A few padding spaces between blocks
        – Many blanks caused by page alignment in FAST
      [Figure: SBs and CBs laid out in memory; each block is SIMD-length aligned]
  23. Proposed: SIMD-Aligned Layouts
      • Load data efficiently to SIMD registers
      • A few padding spaces between blocks
        – Many blanks caused by page alignment in FAST
      [Figure: SBs and CBs laid out in memory with padding spaces between them; each block is SIMD-length aligned]
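Aligning each block to the SIMD length can be sketched as rounding block offsets up to a multiple of the register width. The code below assumes 128-bit (16-byte) registers; `align_up` and `layout_blocks` are hypothetical names, and the exact padding rules of VAST-Tree are not given on the slides.

```c
#include <stddef.h>

enum { SIMD_BYTES = 16 }; /* assumed 128-bit SIMD register */

/* Round an offset up to the next multiple of the SIMD length. */
size_t align_up(size_t offset)
{
    return (offset + SIMD_BYTES - 1) & ~(size_t)(SIMD_BYTES - 1);
}

/* Place `count` blocks of `block_bytes` each so that every block starts
 * on a SIMD-length boundary. Writes the start offsets and returns the
 * total footprint including padding. */
size_t layout_blocks(size_t block_bytes, size_t count, size_t *offsets)
{
    size_t cursor = 0;
    for (size_t i = 0; i < count; i++) {
        cursor = align_up(cursor);
        offsets[i] = cursor;
        cursor += block_bytes;
    }
    return align_up(cursor);
}
```

Three 12-byte SIMD blocks would start at offsets 0, 16, and 32, with 4 bytes of padding after each, so every block can be loaded with a single aligned 16-byte read.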
  24. Proposed: Other Topics
      • Linear search optimization
        – Remove bottom SBs
      • Apply P4Delta to leaf nodes
        – A lossless compression method
        – Compress fixed k keys into a chunk
      [Figure: keys in leaf nodes divided into single chunks]
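The chunked leaf compression can be illustrated with its first step, turning a chunk of sorted keys into d-gaps. This sketch covers only the delta transform and a count of values that would not fit a chosen bit width; it is not a full P4Delta encoder (bit packing and exception patching are omitted), and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Replace k sorted keys by the first key followed by k-1 differences. */
void to_dgaps(const uint32_t *keys, size_t k, uint32_t *gaps)
{
    for (size_t i = 0; i < k; i++)
        gaps[i] = (i == 0) ? keys[0] : keys[i] - keys[i - 1];
}

/* Inverse transform: a running sum restores the original keys exactly. */
void from_dgaps(const uint32_t *gaps, size_t k, uint32_t *keys)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < k; i++) {
        acc += gaps[i];
        keys[i] = acc;
    }
}

/* Count the gaps (skipping the leading key) that need more than b bits.
 * A patched scheme would store these separately as exceptions. */
size_t count_exceptions(const uint32_t *gaps, size_t k, unsigned b)
{
    size_t n = 0;
    if (b >= 32)
        return 0;
    for (size_t i = 1; i < k; i++)
        if ((gaps[i] >> b) != 0)
            n++;
    return n;
}
```

For the chunk 100, 101, 103, 104, 200 the gaps are 100, 1, 2, 1, 96; with 2 bits per gap only the last gap is an exception. Small d-gaps are what make this kind of chunk compress well.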
  25. Experimental Results
  26. Setup: Synthetic and Realistic Data Sets
      • Twitter Public Timeline data
        – May 2010 to Apr. 2011
        – Twitter IDs and timestamps
        – 36,068,948 entries (nearly equal to 2^25)
      • Synthetic data
        – Follows a Poisson distribution
      [Chart: ratio (0.0 to 1.0) against d-gaps (0 to 10) for Twitter IDs and Twitter timestamps]
  27. Results: Compression Ratio – Branch Nodes
      [Chart: compression ratio of branch nodes for VAST-Tree parameters (H32, H16); annotation: Best]
  28. Results: Compression Ratio – Leaf Nodes
      [Chart annotations: Minimize Error Penalty; Improve Compression]
  29. Results: Throughput – Synthetic Data
      [Chart: throughput (0.0E+00 to 1.0E+08; higher is better) against log2(# of keys) = 22, 24, 26, 28 for VAST-Tree w/o P4Delta, VAST-Tree w/ P4Delta, FAST, and binary trees]
  30. Results: Throughput – Twitter Data
      [Chart annotations: Better; Worse]
  31. Results: Error Ratio
      [Chart: error ratio (0.0 to 1.0; annotation: Better) against Δw ranges 0-, 10-, 100-, 1000-, 10000- for 1/λ = 16, 1/λ = 64, 1/λ = 256, Twitter IDs, and Twitter timestamps]
  32. Summary & Future Work
      • Proposed lossy compression for high parallelization
        – Linear search optimization, leaf compression, and others
      • Experimental evaluation
        – Compresses branch nodes dynamically
        – Improves throughput and compression ratio
        – Throughput worsened by leaf compression
      • Future work
        – Update support and larger numbers of keys