GPU Join Presentation


This presentation covers algorithms for performing database joins on a GPU. It is interesting work that may eventually lead to databases implemented on GPGPU platforms such as CUDA.

Published in: Technology

  1. 1. Relational Joins on Graphics Processors Suman Karumuri Jamie Jablin
  2. 2. Background
  3. 3. Introduction • Utilizing hardware features of the GPU – Massive thread parallelism – Fast inter-processor communication – High memory bandwidth – Coalesced access
  4. 4. Relational Joins • Non-indexed nested-loop join (NINLJ) • Indexed nested-loop join (INLJ) • Sort-merge join (SMJ) • Hash join (HJ)
  5. 5. Block non-indexed nested-loop join (NINLJ)
        foreach r in R:
          foreach s in S:
            if condition(r, s): output += <r, s>
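A minimal CPU-side sketch of the block nested-loop idea may make the loop structure more concrete. The function name, the `block_size` parameter, and the use of Python lists as relations are assumptions made for illustration; they are not taken from the slides.

```python
def block_nested_loop_join(R, S, condition, block_size=2):
    """Return all (r, s) pairs that satisfy condition(r, s).

    R is processed one block at a time, so S is scanned once per block
    of R instead of once per tuple of R.
    """
    out = []
    for start in range(0, len(R), block_size):
        r_block = R[start:start + block_size]  # block of R held in fast memory
        for s in S:                            # one pass over S per block
            for r in r_block:
                if condition(r, s):
                    out.append((r, s))
    return out


pairs = block_nested_loop_join([1, 2, 3], [2, 3, 4], lambda r, s: r == s)
```

With an equality predicate, the call above yields the two matching pairs for keys 2 and 3.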
  6. 6. Block indexed nested-loop join (INLJ)
        foreach r in R:
          foreach s in S.index.lookup(r.f1):
            if condition(r, s): output += <r, s>
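The indexed variant can be sketched in the same style. Here a sorted key array searched with `bisect` stands in for the index on S; that choice, the function name, and the `key` parameter are illustrative assumptions, not the index structure used in the paper.

```python
from bisect import bisect_left, bisect_right


def indexed_nested_loop_join(R, S, key=lambda t: t[0]):
    """Equi-join R and S on key(t), probing a sorted index built over S."""
    S_sorted = sorted(S, key=key)           # stand-in "index": S ordered by key
    s_keys = [key(s) for s in S_sorted]
    out = []
    for r in R:
        lo = bisect_left(s_keys, key(r))    # first S tuple with a matching key
        hi = bisect_right(s_keys, key(r))   # one past the last matching S tuple
        out.extend((r, s) for s in S_sorted[lo:hi])
    return out


R = [(1, "a"), (2, "b")]
S = [(2, "x"), (2, "y"), (3, "z")]
pairs = indexed_nested_loop_join(R, S)
```

Each probe costs a binary search rather than a full scan of S, which is the whole point of the index.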
  7. 7. Hash join (HJ)
        Hr = Hashtable()
        foreach r in R:
          Hr.add(r)
          if Hr.size() == MAX_MEMORY or r is the last tuple of R:
            foreach s in S:
              if Hr.has(s): output += <Hr.get(s), s>
            Hr.clear()
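A runnable sketch of this chunked hash join follows. It builds a hash table over at most `max_entries` tuples of R, probes it with all of S, and then clears it; the final, partially filled table is probed as well so that no tuples of R are skipped. The names `chunked_hash_join`, `max_entries`, and `key` are hypothetical.

```python
from collections import defaultdict


def chunked_hash_join(R, S, max_entries, key=lambda t: t[0]):
    """Equi-join R and S while never holding more than max_entries R tuples."""
    out = []
    table = defaultdict(list)
    count = 0

    def probe():
        # Scan all of S against the R tuples currently in the table.
        for s in S:
            for r in table.get(key(s), ()):
                out.append((r, s))

    for r in R:
        table[key(r)].append(r)
        count += 1
        if count == max_entries:   # table is "full": probe, then start over
            probe()
            table.clear()
            count = 0
    if count:                      # probe the last, partially filled table
        probe()
    return out


pairs = chunked_hash_join([(1,), (2,), (3,)], [(3,), (1,), (1,)], max_entries=2)
```

The cost of the memory limit is one extra scan of S per chunk of R.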
  8. 8. Sort-merge join (SMJ)
        Sort(R); Sort(S); i, j = 0, 0
        while i < |R| and j < |S|:
          if R[i] == S[j]: output += R[i]; i++; j++
          elif R[i] < S[j]: i++
          else: j++
        (advancing both i and j on a match assumes unique join keys)
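The merge loop on the slide advances both cursors on a match, which is only correct when join keys are unique. A sketch that also handles duplicate keys could look like the following; the relations are assumed to be plain lists of comparable keys, and the function name is illustrative.

```python
def sort_merge_join(R, S):
    """Equi-join two lists of keys, handling duplicates on both sides."""
    R, S = sorted(R), sorted(S)
    i = j = 0
    out = []
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            i += 1
        elif R[i] > S[j]:
            j += 1
        else:
            k = R[i]
            i_end = i
            while i_end < len(R) and R[i_end] == k:   # run of k in R
                i_end += 1
            j_end = j
            while j_end < len(S) and S[j_end] == k:   # run of k in S
                j_end += 1
            # Emit the cross product of the two runs.
            out.extend((r, s) for r in R[i:i_end] for s in S[j:j_end])
            i, j = i_end, j_end
    return out


pairs = sort_merge_join([3, 1, 2, 2], [2, 2, 4, 3])
```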
  9. 9. Algorithms on GPU • Tips for algorithm design – Use the inherent concurrency. – Keep SIMD nature in mind. – Algorithms should be side-effect free. – Memory properties: • High memory bandwidth. • Coalesced access (for spatial locality) • Cache in local memory (for temporal locality) • Access memory via indices and offsets.
  10. 10. GPU Primitives
  11. 11. Design and Implementation • A complete set of parallel primitives: – Map, scatter, gather, prefix scan, split, and sort • Low synchronization overhead. • Scalable to hundreds of processors. • Applicable to joins as well as other relational query operators.
  12. 12. Map
  13. 13. Scatter • Indexed writes to a relation.
  14. 14. Gather • Performs indexed reads from a relation.
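Scatter and gather are easiest to see side by side. The sketch below uses Python lists and a location array `loc`; both function signatures are assumptions for illustration. On a GPU each loop iteration would be handled by its own thread.

```python
def scatter(r_in, loc):
    """Indexed write: r_out[loc[i]] = r_in[i]."""
    r_out = [None] * len(r_in)
    for i, x in enumerate(r_in):
        r_out[loc[i]] = x
    return r_out


def gather(r_in, loc):
    """Indexed read: r_out[i] = r_in[loc[i]]."""
    return [r_in[loc[i]] for i in range(len(loc))]


data = ["a", "b", "c"]
loc = [2, 0, 1]
scattered = scatter(data, loc)       # ['b', 'c', 'a']
restored = gather(scattered, loc)    # ['a', 'b', 'c']
```

When `loc` is a permutation, gathering with the same `loc` undoes the scatter.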
  15. 15. Prefix Scan • A prefix scan applies a binary operator on the input of size n and produces an output of size n. • Ex: Prefix sum: cumulative sum of all elements to the left of the current element. – Exclusive (used in paper) – Inclusive
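The difference between the exclusive and inclusive variants is easy to miss, so here is a small sequential sketch of both for the prefix sum. The function names are illustrative; a GPU implementation would compute the same result in parallel.

```python
from itertools import accumulate


def inclusive_scan(xs):
    """out[i] = xs[0] + ... + xs[i]."""
    return list(accumulate(xs))


def exclusive_scan(xs):
    """out[i] = xs[0] + ... + xs[i-1], with out[0] = 0."""
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out


values = [3, 1, 7, 0, 4]
inc = inclusive_scan(values)   # [3, 4, 11, 11, 15]
exc = exclusive_scan(values)   # [0, 3, 4, 11, 11]
```

The exclusive form is the useful one for computing write offsets: element i of the result says where item i starts, given the sizes of everything before it.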
  16. 16. Split
  17. 17. Each thread constructs tHist from Rin
  18. 18. L[(p-1)*#thread+t] = tHist[t][p]
  19. 19. Prefix sum: L[i] = sum(L[0…i-1]), which gives the start location of each partition
  20. 20. tOffset[t][p] = L[(p-1)*#thread+t]
  21. 21. Scatter tuples to Rout based on offset.
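The split steps on the preceding slides (per-thread histogram, layout into L, prefix sum, per-thread offsets, scatter) can be sketched sequentially as follows. Threads are simulated with a loop, partitions are numbered from 0 so the index into L is `p * num_threads + t` rather than the 1-based form on the slides, and all names are assumptions for illustration.

```python
def split(r_in, part_of, num_parts, num_threads):
    """Group r_in by partition; return (r_out, start index of each partition)."""
    # Simulated thread t handles elements t, t + T, t + 2T, ...
    chunks = [r_in[t::num_threads] for t in range(num_threads)]

    # Step 1: each thread builds its own histogram tHist[t][p].
    t_hist = [[0] * num_parts for _ in range(num_threads)]
    for t, chunk in enumerate(chunks):
        for x in chunk:
            t_hist[t][part_of(x)] += 1

    # Step 2: lay the histograms out partition-major: L[p * T + t].
    L = [t_hist[t][p] for p in range(num_parts) for t in range(num_threads)]

    # Step 3: exclusive prefix sum over L gives every write offset.
    offsets, running = [], 0
    for c in L:
        offsets.append(running)
        running += c

    # Steps 4 and 5: each thread scatters its tuples from its own offsets,
    # so no two threads ever write to the same location.
    r_out = [None] * len(r_in)
    for t, chunk in enumerate(chunks):
        cursor = [offsets[p * num_threads + t] for p in range(num_parts)]
        for x in chunk:
            p = part_of(x)
            r_out[cursor[p]] = x
            cursor[p] += 1

    part_starts = [offsets[p * num_threads] for p in range(num_parts)]
    return r_out, part_starts


r_out, starts = split([5, 2, 7, 4, 1, 8], lambda x: x % 2, num_parts=2, num_threads=2)
```

Because every thread owns a disjoint range of output slots per partition, the scatter needs no locks or atomics.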
  22. 22. Sort • Bitonic sort – Uses sorting networks, O(N log² N). • Quick sort – Partition using a random pivot until each partition fits in local memory. – Sort each partition using bitonic sort. – Partitioning can be parallelized using split. – Complexity is O(N log N). – 30% faster than bitonic sort in experiments. – Quick sort is therefore used as the sort primitive.
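For reference, a sequential sketch of the bitonic sorting network is shown below. It assumes the input length is a power of two and simulates the network with nested loops; on a GPU, each compare-and-swap within one `(k, j)` stage is independent and could run in its own thread. The function name is illustrative.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two using a bitonic network."""
    a = list(values)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance between compared elements
            for i in range(n):    # every iteration of this loop is independent
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = a[i] > a[partner] if ascending else a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


result = bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])
```

The comparison pattern depends only on the indices, never on the data, which is what makes the network a good fit for SIMD hardware.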
  23. 23. Spatial and Temporal locality
  24. 24. Memory Optimizations • Coalesced memory access improves memory bandwidth utilization (spatial locality)
  25. 25. Local Memory Optimization • Quick sort – Temporal locality – Use the bitonic sort to sort each chunk after the partitioning step.
  26. 26. Joins on GPGPU
  27. 27. NINLJ on GPU • Block nested • Uses Map primitive on both relations – Partition R into R’ and S into S’ blocks respectively. – Create R’ x S’ thread groups – A thread in a thread group processes one tuple from R’ and matches all tuples from S’. – All tuples in S’ are in local cache.
  28. 28. B+ Tree vs CSS Tree • B+ tree imposes – Memory stalls when traversed (no spatial locality) – Can’t perform multiple searches ( loses temporal locality). • CSS-Tree (Cache optimized search tree) – One dimensional array where nodes are indexed. – Replaces traversal with computation. – Can also perform parallel key lookups.
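To illustrate "replaces traversal with computation", here is a simplified, pointer-free directory over a sorted key array. It is written in the spirit of a CSS-tree but is not the exact layout from the paper: each level stores the largest key of every node in the level below, and a child's position is computed from the parent position instead of being followed through a pointer. The names and the `fanout` parameter are assumptions.

```python
from bisect import bisect_left


def build_directory(sorted_keys, fanout):
    """levels[0] holds the keys; each higher level holds per-node maxima."""
    levels = [list(sorted_keys)]
    while len(levels[-1]) > fanout:
        below = levels[-1]
        levels.append([below[min(i + fanout, len(below)) - 1]
                       for i in range(0, len(below), fanout)])
    return levels


def search(levels, fanout, key):
    """Return the index of key in levels[0], or -1 if it is absent."""
    node = 0
    pos = -1
    for level in reversed(levels):          # root first, leaves last
        lo = node * fanout                  # node boundaries are computed,
        hi = min(lo + fanout, len(level))   # not stored
        pos = bisect_left(level, key, lo, hi)
        if pos == hi:                       # key is larger than this node's max
            return -1
        node = pos                          # entry position == child node index
    return pos if levels[0][pos] == key else -1


levels = build_directory([1, 3, 5, 7, 9, 11, 13], fanout=2)
idx = search(levels, 2, 9)
```

Since the structure is a handful of flat arrays, many lookups can proceed independently, one per thread, and the small upper levels are natural candidates for caching.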
  29. 29. Indexed Nested Loop Join (INLJ) • Uses Map primitive on outer relation • Uses CSS tree for index. • For each block in outer relation R – Start with a root node to find the next level • Binary search is shown to be better than sequential search. – Go down until you find the data node. • Upper level nodes are cached in local memory since they are frequently accessed.
  30. 30. Sort Merge Join • Sort the relations R, S using the sort primitive • Merge phase – Break S into chunks (s’) of size M. – Find first and last key values of each chunk in s’ and partition R into that many chunks. – Merge all chunks in parallel using map • Each thread group handles a pair • Each thread compares 1 tuple in R with s’ using binary search. • Chunk size is chosen to fit in local memory.
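The merge phase described on this slide can be sketched sequentially. S is cut into chunks of size `M`, the matching range of R is located from each chunk's first and last key, and every R key in that range is binary-searched in the chunk. Relations are assumed to be already-sorted lists of keys, and the function name is hypothetical; on a GPU each (R range, S chunk) pair would go to one thread group.

```python
from bisect import bisect_left, bisect_right


def chunked_merge(R_sorted, S_sorted, M):
    """Equi-join two sorted key lists, one S chunk of size M at a time."""
    out = []
    for start in range(0, len(S_sorted), M):
        s_chunk = S_sorted[start:start + M]          # sized to fit local memory
        # R keys that can match this chunk lie between its first and last key.
        lo = bisect_left(R_sorted, s_chunk[0])
        hi = bisect_right(R_sorted, s_chunk[-1])
        for r in R_sorted[lo:hi]:                    # one "thread" per R key
            a = bisect_left(s_chunk, r)
            b = bisect_right(s_chunk, r)
            out.extend((r, s) for s in s_chunk[a:b])
    return out


pairs = chunked_merge([1, 2, 2, 5], [2, 2, 2, 3, 5], M=2)
```

A run of duplicate keys in S may straddle a chunk boundary; this still yields each result pair exactly once, because every S tuple belongs to exactly one chunk.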
  31. 31. Hash Join • Uses split primitive on both relations • Developed a parallel version of radix hash join – Partitioning • Split R and S into the same number of partitions, so S partitions fit into the local memory – Matching • Choose smaller one of R and S partitions as inner partition to be loaded into local memory • Larger relation will be used as the outer relation • Each tuple from outer relation uses a search on the inner relation for matching.
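A compact sketch of the two phases is given below. It partitions integer keys on their low-order bits, then joins matching partitions, using the smaller side as the inner partition. The sorted-list binary search used for matching, the `bits` parameter, and the function names are illustrative assumptions rather than details confirmed by the slides.

```python
from bisect import bisect_left, bisect_right


def radix_partition(keys, bits):
    """Partition integer keys by their lowest `bits` bits."""
    mask = (1 << bits) - 1
    parts = [[] for _ in range(1 << bits)]
    for k in keys:
        parts[k & mask].append(k)
    return parts


def radix_hash_join(R, S, bits=2):
    """Equi-join two lists of integer keys; returns (r, s) pairs."""
    out = []
    for r_part, s_part in zip(radix_partition(R, bits), radix_partition(S, bits)):
        # The smaller partition becomes the inner one (held in local memory).
        r_is_inner = len(r_part) <= len(s_part)
        inner = sorted(r_part if r_is_inner else s_part)
        outer = s_part if r_is_inner else r_part
        for o in outer:                       # each outer key searches the inner
            matches = bisect_right(inner, o) - bisect_left(inner, o)
            out.extend([(o, o)] * matches)    # keys are equal, so r == s == o
    return out


pairs = sorted(radix_hash_join([1, 2, 3, 4, 5, 6], [2, 4, 4, 7]))
```

Keys that are equal always land in the same partition, so only partitions with the same index ever need to be compared.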
  32. 32. Lock-Free Scheme for Result Output • Problems – Unknown join result size. Max size of joins doesn’t fit in memory. – Concurrent writes are not atomic.
  33. 33. Lock-Free Scheme for Result Output • Solution: Three-phase scheme – Each thread counts the number of join results it generates. – Compute a prefix sum on the counts to get an array of write locations and the total number of results generated by the join. Host code then allocates memory on the device. – Run the join again, this time writing the outputs. • The join runs twice. That’s OK, GPUs are fast.
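The count, scan, and write phases can be sketched as follows, with threads simulated by a loop over slices of R. The function name and the `num_threads` parameter are assumptions; the list preallocation stands in for the host-side allocation of device memory.

```python
def join_with_preallocated_output(R, S, condition, num_threads):
    """Join R and S, writing results into a buffer sized before any write."""
    # Simulated thread t handles tuples t, t + T, t + 2T, ... of R.
    chunks = [R[t::num_threads] for t in range(num_threads)]

    # Phase 1: each thread only counts its results.
    counts = [sum(1 for r in chunk for s in S if condition(r, s))
              for chunk in chunks]

    # Phase 2: exclusive prefix sum gives each thread's write offset;
    # the running total is the exact output size to allocate.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    out = [None] * total

    # Phase 3: run the join again, each thread writing into its own range.
    for t, chunk in enumerate(chunks):
        pos = offsets[t]
        for r in chunk:
            for s in S:
                if condition(r, s):
                    out[pos] = (r, s)
                    pos += 1
    return out


pairs = join_with_preallocated_output([1, 2, 3, 4], [2, 3, 3],
                                      lambda r, s: r == s, num_threads=2)
```

Because the write ranges are disjoint by construction, no locks or atomic operations are needed, at the price of evaluating the join predicate twice.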
  34. 34. Experimental Results
  35. 35. Hardware Configuration • Theoretical Memory bandwidth – GPU: 86.4 GB/s – CPU: 10.4 GB/s • Practical Memory bandwidth – GPU: 69.2 GB/s – CPU: 5.6 GB/s
  36. 36. Workload • R and S tables with 2 integer columns. • SELECT R.rid, S.rid FROM R, S WHERE <predicate> • SELECT R.rid, S.rid FROM R, S WHERE R.rid=S.rid • SELECT R.rid, S.rid FROM R, S WHERE R.rid<=S.rid<=R.rid + k • Tested on all combinations: – Fix R, vary S. All values uniformly distributed. |R| = 1M – Performance impact of varying join selectivity. |R| = |S| = 16M – Non-uniform distribution of data sizes and also varying join selectivity. |R| = |S| = 16M • Also tested with columns as strings.
  37. 37. Implementation Details on CPU • Highly optimized primitives and join algorithms matching hardware architecture • Tuned for cache performance. • Compiled programs using MSVC 8.0 with full optimizations. • Used OpenMP for threading mechanisms. • 2-6X faster than their sequential counterparts.
  38. 38. Implementation Details on GPU • CUDA parameters – Number of thread groups (128) – Number of threads for each thread group (64) – Block size is 4MB (main memory to device memory)
  39. 39. Memory Optimizations Work
  40. 40. Works when join selectivity is varied
  41. 41. Better than in-memory database
  42. 42. CUDA vs. DirectX10 • DirectX10 is difficult to program, because the data is stored as textures. • NINLJ and INLJ have similar performance. • HJ and SMJ are slower because of texture decoding. • Summary: low level primitives on GPGPU are better than graphics primitives on GPU.
  43. 43. Criticisms • Applications of skew handling are unclear. • Primitives are sufficient to implement the given joins, but they do not prove the set of primitives to be minimal.
  44. 44. Limitations and future research directions • Lack of synchronization mechanisms for handling read/write conflicts on GPU. • More primitives. • More open GPGPU hardware spec for optimizations. • Power consumption on GPU. • Lack of support for complex data types. • In-memory database on the GPU. • Automatic detection of thread groups and number of threads using program analysis techniques.
  45. 45. Conclusion • GPU-based primitives and join algorithms achieve a speedup of 2-27X over optimized CPU-based counterparts. • NINLJ, 7.0X; INLJ, 6.1X; SMJ, 2.4X; HJ, 1.9X
  46. 46. References • Scan Primitives for GPU Computing, Sengupta et al.
  47. 47. Thank You.
  48. 48. Scan Primitives for GPU Computing, Sengupta et al
  49. 49. Skew Handling • Skew in data results in an imbalanced partition size in partitioned-based algorithms (SMJ and HJ) • Solution – Identify partitions that do not fit into the local memory – Decompose partitions into multiple chunks the size of local memory
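The decomposition step for skewed partitions is small enough to sketch directly. Partitions are assumed to be Python lists and `capacity` stands in for the number of tuples that fit in local memory; both names are illustrative.

```python
def decompose_partitions(partitions, capacity):
    """Split any partition larger than capacity into capacity-sized chunks."""
    out = []
    for part in partitions:
        if len(part) <= capacity:
            out.append(part)                       # already fits in local memory
        else:
            out.extend(part[i:i + capacity]        # oversized: cut into chunks
                       for i in range(0, len(part), capacity))
    return out


chunks = decompose_partitions([[1, 2], [3, 4, 5, 6, 7], []], capacity=2)
```

After decomposition every unit of work fits in local memory, so a single heavily populated key range no longer stalls one thread group while the others sit idle.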
  50. 50. Implementation Details on GPU • CUDA parameters – Number of threads for each thread group – Number of thread groups • DirectX10 – Join algorithms implemented using programmable pipeline • Vertex shader, geometry shader, and pixel shader