HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU," by Mayank Daga and Mark Nutter
Presentation, HC-4019, "Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU," by Mayank Daga and Mark Nutter at the AMD Developer Summit (APU13) Nov. 11-13.

Transcript

  • 1. EXPLOITING COARSE-GRAINED PARALLELISM IN B+ TREE SEARCHES ON APUs
    Mayank Daga and Mark Nutter, AMD Research
  • 2. RELEVANCE OF B+ TREE SEARCHES
     B+ Trees are a special case of B Trees
    ‒ A fundamental data structure used in several popular database management systems
    [Slide table: B Tree vs. B+ Tree adoption across mongoDB, MySQL, CouchDB, and SQLite]
     High-throughput, read-only index searches are gaining traction in
    ‒ Video-copy detection
    ‒ Audio search
    ‒ Online Transaction Processing (OLTP) benchmarks
     Increases in memory capacity allow many database tables to reside in memory
    ‒ Brings computational performance to the forefront
    2 | APU ’13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public
  • 3. A REAL-WORLD USE-CASE OF READ-ONLY SEARCHES
     Mobile (app on a smartphone):
    ‒ Step 1. Record audio
    ‒ Step 2. Generate audio fingerprint
    ‒ Step 3. Send search request to server
     Server:
    ‒ Step 1. Receive search requests
    ‒ Step 2. Query database
    ‒ Step 3. Return search results to client
     Thousands of clients send requests against a music library of millions of songs
    [Slide figure: a B+ Tree database on the server, queried per request]
  • 4. DATABASE PRIMITIVES ON ACCELERATORS
     Discrete graphics processing units (dGPUs) provide a compelling mix of
    ‒ Performance per watt
    ‒ Performance per dollar
     dGPUs have been used to accelerate critical database primitives
    ‒ scan, sort, join, aggregation
    ‒ B+ Tree searches?
  • 5. B+ TREE SEARCHES ON ACCELERATORS
     B+ Tree searches present significant challenges
    ‒ Irregular representation in memory, an artifact of malloc() and new()
    ‒ Today’s dGPUs do not have a direct mapping to the CPU virtual address space, so indirect links need to be converted to relative offsets
    ‒ The tree must be copied to the dGPU, so one is always bound by the amount of GPU device memory
  • 6. OUR SOLUTION
     Accelerated B+ Tree searches on a fused CPU+GPU processor (an APU [1])
    ‒ Eliminates data copies by combining x86 CPU and vector GPU cores on the same silicon die
     Developed a memory allocator that forms a regular representation of the tree in memory
    ‒ The fundamental data structure is not altered; merely parts of its layout are changed
    [1] www.hsafoundation.com
  • 7. OUTLINE
     Motivation and Contribution
     Background: AMD APU Architecture, B+ Trees
     Approach: Transforming the Memory Layout, Eliminating the Divergence
     Results: Performance, Analysis
     Summary and Next Steps
  • 8. AMD APU ARCHITECTURE
    [Slide figure: block diagram of the AMD 2nd Gen. A-series APU: x86 cores and GPU vector cores sharing system memory via the DRAM controllers, System Request Interface (SRI), crossbar, and Unified Northbridge (UNB); MCT = Memory Controller]
     The APU includes dedicated IOMMUv2 hardware, which
    ‒ Provides a direct mapping between the GPU and CPU virtual address (VA) space
    ‒ Enables GPUs to access system memory
    ‒ Enables GPUs to track whether pages are resident in memory
     Today, GPU cores can access the VA space at the granularity of contiguous chunks
  • 9. B+ TREES
    [Slide figure: an example B+ Tree with keys 1–8 and data pointers d1–d8]
     A B+ Tree
    ‒ is a dynamic, multi-level index
    ‒ is efficient for retrieval of data stored in a block-oriented context
    ‒ has a high fan-out to reduce disk I/O operations
     The order (b) of a B+ Tree measures the capacity of its nodes
     The number of children (m) in an internal node satisfies ⌈b/2⌉ <= m <= b
    ‒ The root node can have as few as two children
     The number of keys in an internal node = (m – 1)
  • 10. APPROACH FOR PARALLELIZATION
     Fine-grained (accelerate a single query)
    ‒ Replace binary search in each node with K-ary search
    ‒ Maximum performance improvement = log(k)/log(2)
    ‒ Results in poor occupancy of the GPU cores
     Coarse-grained (perform many queries in parallel)
    ‒ Enables data parallelism
    ‒ Increases memory-bandwidth utilization with parallel reads
    ‒ Increases throughput (transactions per second for OLTP)
  • 11. TRANSFORMING THE MEMORY LAYOUT
    [Slide figure: one contiguous buffer laid out as: nodes w/ metadata .. keys .. values ..]
     Metadata per node
    ‒ Number of keys in the node
    ‒ Offset to the keys/values in the buffer
    ‒ Offset to the first child node
    ‒ Whether the node is a leaf
     Pass a pointer to this memory buffer to the accelerator
  • 12. ELIMINATING THE DIVERGENCE
     Each work-item/thread executes a single query
     This may increase divergence within a wave-front
    ‒ Every query may follow a different path in the B+ Tree
    [Slide figure: work-items WI-1 and WI-2 following different root-to-leaf paths through the tree]
     Sort the keys to be searched
    ‒ Increases the chances that work-items within a wave-front follow similar paths in the B+ Tree
    ‒ We use radix sort [1] to sort the keys on the GPU
    [1] D. G. Merrill and A. S. Grimshaw, “Revisiting sorting for GPGPU stream architectures,” in Proceedings of the 19th Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT ’10), New York, NY, USA: ACM, 2010.
  • 13. IMPACT OF DIVERGENCE IN B+ TREE SEARCHES
    [Slide chart: impact of divergence vs. number of queries (16K–128K) and B+ Tree order (8–128), for the AMD Radeon HD 7660 and the AMD Phenom II X6 1090T]
     Impact of divergence on the GPU: 3.7-fold (average)
     Impact of divergence on the CPU: 1.8-fold (average)
  • 14. OUTLINE
     Motivation and Contribution
     Background: AMD APU Architecture, B+ Trees
     Approach: Transforming the Memory Layout, Eliminating the Divergence
     Results: Performance, Analysis
     Summary and Next Steps
  • 15. EXPERIMENTAL SETUP
     Software
    ‒ A B+ Tree with 4M (4,194,304) records is used
    ‒ Search queries are created using normal_distribution() (a C++11 feature)
    ‒ The queries have been sorted
    ‒ CPU implementation from http://www.amittai.com/prose/bplustree.html
    ‒ Driver: AMD Catalyst v12.8
    ‒ Programming model: OpenCL
     Hardware
    ‒ AMD Radeon HD 7660 APU (Trinity): 4 cores w/ 6GB DDR3, 6 CUs w/ 2GB DDR3
    ‒ AMD Phenom II X6 1090T + AMD Radeon HD 7970 (Tahiti): 6 cores w/ 8GB DDR3, 32 CUs w/ 3GB GDDR5
    ‒ Device-memory timings do not include data-copy time
    [Slide chart: frequency histogram of the generated query keys over the 4M-record ID (PKey) range]
  • 16. RESULTS – QUERIES PER SECOND
    [Slide chart: queries/second (log scale, peaking near 700M queries/sec) vs. number of queries (16K–128K) and B+ Tree order (4–128), comparing the AMD Phenom II X6 1090T (6 threads + SSE), the AMD Radeon HD 7970 (device memory), and the AMD Radeon HD 7970 (pinned memory)]
     dGPU (device memory): ~350M queries/sec. (avg.)
     dGPU (pinned memory): ~9M queries/sec. (avg.)
     AMD Phenom CPU: ~18M queries/sec. (avg.)
  • 17. RESULTS – QUERIES PER SECOND
    [Slide chart: queries/second (up to 140M) vs. number of queries (16K–128K) and B+ Tree order (4–128), comparing the AMD Phenom II X6 1090T (6 threads + SSE), the AMD Radeon HD 7660 (device memory), and the AMD Radeon HD 7660 (pinned memory)]
     APU (device memory): ~66M queries/sec. (avg.)
     APU (pinned memory): ~40M queries/sec. (avg.)
     The APU (pinned memory) is faster than the CPU implementation
  • 18. RESULTS – SPEEDUP
    [Slide chart: speedup vs. number of queries (16K–128K) and B+ Tree order (4–128) for the AMD Radeon HD 7660 with device and pinned memory; peak of 4.9-fold]
     Baseline: six-threaded, hand-tuned, SSE-optimized CPU implementation
     Average speedup: 4.3-fold (device memory), 2.5-fold (pinned memory)
     Efficacy of IOMMUv2 + HSA on the APU platform, by size of the B+ Tree:

        Size of the B+ Tree | Discrete GPU (memory size = 3GB) | APU (prototype software)
        < 1.5GB             | ✓                                | ✓
        1.5GB – 2.7GB       | ✓                                | ✓
        > 2.7GB             | ✗                                | ✓
  • 19. ANALYSIS
     The accelerators and the CPU yield their best performance at different orders of the B+ Tree
     CPU: order = 64
    ‒ The ability of CPUs to prefetch data is beneficial at higher orders
     APU and dGPU: order = 16
    ‒ GPUs do not have a prefetcher, so each cache line should be utilized as efficiently as possible
    ‒ GPUs have a cache-line size of 64 bytes
    ‒ Order = 16 is most beneficial (16 keys * 4 bytes = 64 bytes)
  • 20. ANALYSIS
     Minimum batch size to match the CPU performance:

                             | Order = 64  | Order = 16
        dGPU (device memory) | 4K queries  | 2K queries
        dGPU (pinned memory) | N.A.        | N.A.
        APU (device memory)  | 10K queries | 4K queries
        APU (pinned memory)  | 20K queries | 16K queries

     reuse_factor: amortizing the cost of data-copies to the GPU

             | 90% Queries | 100% Queries
        dGPU | 15          | 54
        APU  | 100         | N.A.
  • 21. PROGRAMMABILITY
     GPU (OpenCL):

        typedef global unsigned int g_uint;
        typedef global mynode g_mynode;

        int tid = get_global_id(0);
        int i = 0;
        g_mynode *c = (g_mynode *)root;

        /* find the leaf node */
        while (!c->is_leaf) {
            while (i < c->num_keys) {
                if (keys[tid] >= ((g_uint *)((intptr_t)root + c->keys))[i])
                    i++;
                else
                    break;
            }
            c = (g_mynode *)((intptr_t)root + c->ptr + i * sizeof(mynode));
        }

        /* match the key in the leaf node */
        for (i = 0; i < c->num_keys; i++) {
            if ((((g_uint *)((intptr_t)root + c->keys))[i]) == keys[tid])
                break;
        }

        /* retrieve the record */
        if (i != c->num_keys)
            records[tid] = ((g_uint *)((intptr_t)root + c->is_leaf + i * sizeof(g_uint)))[0];

     CPU-SSE:

        int i = 0, j;
        node *c = root;
        __m128i vkey = _mm_set1_epi32(key);
        __m128i vnodekey, *vptr;
        short int mask;

        /* find the leaf node */
        while (!c->is_leaf) {
            for (i = 0; i < (c->num_keys - 3); i += 4) {
                vptr = (__m128i *)&(c->keys[i]);
                vnodekey = _mm_load_si128(vptr);
                mask = _mm_movemask_ps(_mm_cvtepi32_ps(
                           _mm_cmplt_epi32(vkey, vnodekey)));
                if ((mask) & 8) break;
            }
            for (j = i; j < c->num_keys; j++) {
                if (key < c->keys[j]) break;
            }
            c = (node *)c->pointers[j];
        }

        /* match the key in the leaf node */
        for (i = 0; i < c->num_keys; i++)
            if (c->keys[i] == key) break;

        /* retrieve the record */
        if (i != c->num_keys)
            return (record *)c->pointers[i];
  • 22. RELATED WORK
     J. Fix, A. Wilkes, and K. Skadron, “Accelerating Braided B+ Tree Searches on a GPU with CUDA,” in Proceedings of the 2nd Workshop on Applications for Multi and Many Core Processors: Analysis, Implementation, and Performance, in conjunction with ISCA, 2011.
    ‒ The authors report a ~10-fold speedup over a single-threaded, non-SSE CPU implementation, using a discrete NVIDIA GTX 480 GPU (data copies are not taken into account)
     C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. A. Brandt, and P. Dubey, “FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs,” SIGMOD Conference, 2010.
    ‒ The authors report ~100M queries per second using a discrete NVIDIA GTX 280 GPU (data copies are not taken into account)
     J. Sewall, J. Chhugani, C. Kim, N. Satish, and P. Dubey, “PALM: Parallel, Architecture-Friendly, Latch-Free Modifications to B+ Trees on Multi-Core Processors,” Proceedings of the VLDB Endowment (VLDB 2011).
    ‒ Applicable for B+ Tree modifications on the GPU
  • 23. POSSIBILITIES WITH HSA
    [Slide figure: x86 cores and GPU vector cores sharing one B+ Tree in memory; the GPU vector cores perform parallel reads for searches while the x86 cores perform serial modifications]
     Graphics Core Next, user-mode queuing, shared virtual memory
  • 24. SUMMARY
     The B+ Tree is a fundamental data structure in many RDBMSs
    ‒ Accelerating B+ Tree searches is critical
    ‒ Doing so presents significant challenges on discrete GPUs
     We have accelerated B+ Tree searches by exploiting coarse-grained parallelism on an APU
    ‒ 2.5-fold (avg.) speedup over a 6-thread+SSE CPU implementation
     Possible next steps
    ‒ HSA + IOMMUv2 would reduce the need to modify the B+ Tree representation
    ‒ Investigate CPU-GPU co-scheduling
    ‒ Investigate modifications of the B+ Tree
  • 25. DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Phenom, Radeon, Catalyst and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. 25 | APU ’13 AMD Developer Summit, San Jose, California, USA | November 14, 2013 | Public