
Gpu Join Presentation


This presentation covers algorithms for performing database joins on a GPU. It is interesting work that may someday lead to implementing databases on a GPGPU platform such as CUDA.

Published in: Technology

  1. 1. Relational Joins on Graphics Processors Suman Karumuri Jamie Jablin
  2. 2. Background
  3. 3. Introduction • Utilizing hardware features of the GPU – Massive thread parallelism – Fast inter-processor communication – High memory bandwidth – Coalesced access
  4. 4. Relational Joins • Non-indexed nested-loop join (NINLJ) • Indexed nested-loop join (INLJ) • Sort-merge join (SMJ) • Hash join (HJ)
  5. 5. Block Non-indexed nested-loop join (NINLJ)
        foreach r in R:
          foreach s in S:
            if condition(r, s): output += <r, s>
  6. 6. Block Indexed nested-loop join (INLJ)
        foreach r in R:
          if S.index.has(r.f1):
            s = S.index.get(r.f1)
            if condition(r, s): output += <r, s>
  7. 7. Hash join (HJ)
        Hr = Hashtable()
        foreach r in R:
          Hr.add(r)
          if Hr.size() == MAX_MEMORY or r is the last tuple in R:
            foreach s in S:
              if Hr.has(s): output += s
            Hr.clear()
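The block-wise hash join on this slide can be sketched in runnable form as follows. This is only an illustration under stated assumptions: `block_hash_join`, the `key` extractor, and `max_memory` (counted in tuples, not bytes) are hypothetical names that do not come from the slides.

```python
from collections import defaultdict

def block_hash_join(R, S, key, max_memory):
    """Equi-join R and S on key(), hashing one block of R at a time.

    max_memory is an assumed limit on the number of R tuples hashed at once.
    """
    output = []
    for start in range(0, len(R), max_memory):
        # Build phase: hash at most max_memory tuples of R.
        table = defaultdict(list)
        for r in R[start:start + max_memory]:
            table[key(r)].append(r)
        # Probe phase: scan all of S against the current block.
        for s in S:
            for r in table.get(key(s), []):
                output.append((r, s))
    return output
```

Iterating over fixed-size blocks also covers the final, partially filled block of R, which a check of the form `size == MAX_MEMORY` alone would miss.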
  8. 8. Sort-merge join (SMJ) (assumes unique join keys)
        Sort(R); Sort(S); i, j = 0
        while i < |R| && j < |S|:
          if R[i] == S[j]: output += R[i]; i++; j++
          elif R[i] < S[j]: i++
          else: j++
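A runnable sketch of the sort-merge join follows. Unlike the slide's pseudocode, it handles duplicate keys by pairing up whole runs of equal keys; that extension, and the name `sort_merge_join`, are assumptions made for illustration. Inputs are plain lists of join keys.

```python
def sort_merge_join(R, S):
    """Equi-join two lists of keys; emits one (r, s) pair per matching combination."""
    R, S = sorted(R), sorted(S)
    i = j = 0
    output = []
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            i += 1
        elif R[i] > S[j]:
            j += 1
        else:
            key = R[i]
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(R) and R[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(S) and S[j_end] == key:
                j_end += 1
            # Every tuple in the R run matches every tuple in the S run.
            output.extend((key, key) for _ in range((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return output
```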
  9. 9. Algorithms on GPU • Tips for algorithm design – Use the inherent concurrency. – Keep SIMD nature in mind. – Algorithms should be side-effect free. – Memory properties: • High memory bandwidth. • Coalesced access (for spatial locality) • Cache in local memory (for temporal locality) • Access memory via indices and offsets.
  10. 10. GPU Primitives
  11. 11. Design and Implementation • A complete set of parallel primitives: – Map, scatter, gather, prefix scan, split, and sort • Low synchronization overhead. • Scalable to hundreds of processors. • Applicable to joins as well as other relational query operators.
  12. 12. Map
  13. 13. Scatter • Indexed writes to a relation.
  14. 14. Gather • Performs indexed reads from a relation.
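The map, scatter, and gather primitives can be sketched sequentially as below. On a GPU each loop iteration would be handled by its own thread; the function names and the `loc` location array are illustrative assumptions.

```python
def gpu_map(fn, rin):
    """Map: apply fn independently to every input element."""
    return [fn(x) for x in rin]

def scatter(rin, loc):
    """Scatter: indexed write, rout[loc[i]] = rin[i]."""
    rout = [None] * len(rin)
    for i, x in enumerate(rin):
        rout[loc[i]] = x
    return rout

def gather(rin, loc):
    """Gather: indexed read, rout[i] = rin[loc[i]]."""
    return [rin[loc[i]] for i in range(len(loc))]
```

When `loc` is a permutation, gathering with the same locations undoes a scatter: `gather(scatter(x, loc), loc)` returns `x`.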
  15. 15. Prefix Scan • A prefix scan applies a binary operator on the input of size n and produces an output of size n. • Ex: Prefix sum: cumulative sum of all elements to the left of the current element. – Exclusive (used in paper) – Inclusive
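The difference between the exclusive and inclusive variants is easiest to see in code. The sketch below is sequential and only illustrates the semantics; a GPU implementation would compute the same result in parallel. The function names and the `identity` parameter are assumptions.

```python
from operator import add

def exclusive_scan(values, op=add, identity=0):
    """out[i] combines values[0..i-1]; out[0] is the identity."""
    out = []
    acc = identity
    for v in values:
        out.append(acc)
        acc = op(acc, v)
    return out

def inclusive_scan(values, op=add, identity=0):
    """out[i] combines values[0..i]."""
    out = []
    acc = identity
    for v in values:
        acc = op(acc, v)
        out.append(acc)
    return out
```

For the input `[3, 1, 4, 1, 5]`, the exclusive prefix sum is `[0, 3, 4, 8, 9]` and the inclusive one is `[3, 4, 8, 9, 14]`.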
  16. 16. Split
  17. 17. Each thread constructs tHist from Rin
  18. 18. L[(p-1)*#thread+t] = tHist[t][p]
  19. 19. Prefix sum L(i) = sum(L[0…i-1]) Gives the start location of partitions
  20. 20. tOffset[t][p] = L[(p-1)*#thread+t]
  21. 21. Scatter tuples to Rout based on offset.
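The split steps on slides 17 to 21 can be put together in one sequential sketch. It simulates `num_threads` threads that each own a contiguous slice of the input, and it uses 0-based partition numbers, so the slide's `L[(p-1)*#thread+t]` becomes `L[p*num_threads + t]`. The function name, the `part_of` callback, and the slicing scheme are assumptions.

```python
def split(rin, part_of, num_parts, num_threads):
    """Reorder rin so tuples of the same partition are contiguous."""
    n = len(rin)
    chunk = (n + num_threads - 1) // num_threads
    slices = [range(t * chunk, min(n, (t + 1) * chunk)) for t in range(num_threads)]

    # Step 1: each thread builds its own histogram tHist[t][p].
    t_hist = [[0] * num_parts for _ in range(num_threads)]
    for t, sl in enumerate(slices):
        for i in sl:
            t_hist[t][part_of(rin[i])] += 1

    # Step 2: flatten partition-major, L[p*num_threads + t] = tHist[t][p].
    L = [t_hist[t][p] for p in range(num_parts) for t in range(num_threads)]

    # Step 3: exclusive prefix sum gives each (partition, thread) start offset.
    offsets, total = [], 0
    for count in L:
        offsets.append(total)
        total += count

    # Steps 4 and 5: each thread scatters its tuples using its own offsets,
    # so no two threads ever write to the same location.
    rout = [None] * n
    for t, sl in enumerate(slices):
        cursor = [offsets[p * num_threads + t] for p in range(num_parts)]
        for i in sl:
            p = part_of(rin[i])
            rout[cursor[p]] = rin[i]
            cursor[p] += 1
    return rout
```

Because every thread receives a private write range per partition, the scheme needs no locks or atomic operations.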
  22. 22. Sort • Bitonic sort – Uses sorting networks, O(N log² N). • Quick sort – Partition using a random pivot until each partition fits in local memory. – Sort each partition using bitonic sort. – Partitioning can be parallelized using split. – Complexity is O(N log N). – 30% faster than bitonic sort in experiments. – Use quick sort for sorting.
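For reference, a bitonic sorting network can be sketched as below. The inner loop over `i` is written sequentially, but its compare-exchange steps are independent within a stage, which is why the network suits SIMD hardware. The sketch assumes the input length is a power of two; the function name is illustrative.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(values)
    n = len(a)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:           # compare-exchange distance within this stage
            for i in range(n): # independent per i: one GPU thread each
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The two nested `while` loops run O(log² N) stages of N/2 comparisons each, which gives the O(N log² N) cost quoted on the slide.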
  23. 23. Spatial and Temporal locality
  24. 24. Memory Optimizations • Coalesced memory access improves memory bandwidth utilization (spatial locality)
  25. 25. Local Memory Optimization • Quick sort – Temporal locality – Use the bitonic sort to sort each chunk after the partitioning step.
  26. 26. Joins on GPGPU
  27. 27. NINLJ on GPU • Block nested • Uses Map primitive on both relations – Partition R into R’ and S into S’ blocks respectively. – Create R’ x S’ thread groups – A thread in a thread group processes one tuple from R’ and matches all tuples from S’. – All tuples in S’ are in local cache.
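The blocking scheme can be sketched sequentially as follows: each (R’, S’) pair stands in for one thread group, and each iteration over `r` stands in for one thread. The function name and the single `block_size` parameter for both relations are assumptions.

```python
def block_nested_loop_join(R, S, condition, block_size):
    """Blocked nested-loop join; one (R', S') block pair per thread group."""
    r_blocks = [R[i:i + block_size] for i in range(0, len(R), block_size)]
    s_blocks = [S[i:i + block_size] for i in range(0, len(S), block_size)]
    output = []
    for r_block in r_blocks:
        for s_block in s_blocks:   # one thread group; S' would sit in local memory
            for r in r_block:      # one thread per tuple of R'
                for s in s_block:  # each thread scans all of S'
                    if condition(r, s):
                        output.append((r, s))
    return output
```

Because the predicate is an arbitrary function, the same sketch covers both the equi-join and the band join used in the workload.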
  28. 28. B+ Tree vs CSS Tree • B+ tree imposes – Memory stalls when traversed (no spatial locality) – Can’t perform multiple searches ( loses temporal locality). • CSS-Tree (Cache optimized search tree) – One dimensional array where nodes are indexed. – Replaces traversal with computation. – Can also perform parallel key lookups.
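The idea of replacing pointer traversal with index computation can be illustrated with a simplified binary analogue. This is not the multi-way CSS-tree itself: it is an implicit binary search tree stored in a flat array, where the children of node `i` are found at `2*i + 1` and `2*i + 2`. Both function names are hypothetical.

```python
def build_implicit_tree(sorted_keys):
    """Lay out sorted keys as an implicit binary search tree in an array."""
    n = len(sorted_keys)
    tree = [None] * n
    keys = iter(sorted_keys)

    def fill(i):
        # In-order fill makes the array a valid search tree.
        if i < n:
            fill(2 * i + 1)
            tree[i] = next(keys)
            fill(2 * i + 2)

    fill(0)
    return tree

def contains(tree, key):
    """Search by computing child indices instead of following pointers."""
    i = 0
    while i < len(tree):
        if key == tree[i]:
            return True
        i = 2 * i + 1 if key < tree[i] else 2 * i + 2
    return False
```

Since the tree is a read-only array with no pointers, many lookups can proceed over it at the same time, one per thread.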
  29. 29. Indexed Nested Loop Join (INLJ) • Uses Map primitive on outer relation • Uses CSS tree for index. • For each block in outer relation R – Start with a root node to find the next level • Binary search is shown to be better than sequential search. – Go down until you find the data node. • Upper level nodes are cached in local memory since they are frequently accessed.
  30. 30. Sort Merge Join • Sort the relations R, S using the sort primitive • Merge phase – Break S into chunks (s’) of size M. – Find the first and last key values of each chunk s’ and partition R into that many chunks. – Merge all chunks in parallel using map • Each thread group handles a pair • Each thread compares 1 tuple in R with s’ using binary search. • Chunk size is chosen to fit in local memory.
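The merge phase can be sketched as below, using Python's standard `bisect` module for the binary searches. Each S chunk paired with its matching slice of R corresponds to one thread group, and each `r` to one thread. The function name is an assumption, `chunk_size` plays the role of M, and inputs are lists of join keys.

```python
from bisect import bisect_left, bisect_right

def chunked_merge_join(R, S, chunk_size):
    """Equi-join sorted relations by pairing each S chunk with a slice of R."""
    R, S = sorted(R), sorted(S)
    output = []
    for start in range(0, len(S), chunk_size):
        s_chunk = S[start:start + chunk_size]   # sized to fit local memory
        lo_key, hi_key = s_chunk[0], s_chunk[-1]
        # Slice of R whose keys fall within this chunk's key range.
        r_lo = bisect_left(R, lo_key)
        r_hi = bisect_right(R, hi_key)
        for r in R[r_lo:r_hi]:                  # one thread per R tuple
            first = bisect_left(s_chunk, r)
            last = bisect_right(s_chunk, r)
            output.extend((r, s) for s in s_chunk[first:last])
    return output
```

A key that straddles a chunk boundary is matched once in each chunk that holds it, so no result is lost or duplicated.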
  31. 31. Hash Join • Uses split primitive on both relations • Developed a parallel version of radix hash join – Partitioning • Split R and S into the same number of partitions, so S partitions fit into the local memory – Matching • Choose smaller one of R and S partitions as inner partition to be loaded into local memory • Larger relation will be used as the outer relation • Each tuple from outer relation uses a search on the inner relation for matching.
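A minimal sketch of the two phases follows, assuming integer join keys and a single partitioning pass on the low-order `bits` of the key. The function names are hypothetical, and a `Counter` stands in for the search over the inner partition.

```python
from collections import Counter

def radix_partition(keys, bits):
    """Partition keys by their low-order bits."""
    mask = (1 << bits) - 1
    parts = [[] for _ in range(1 << bits)]
    for k in keys:
        parts[k & mask].append(k)
    return parts

def radix_hash_join(R, S, bits):
    """Equi-join: partition both inputs, then match partition pairs."""
    output = []
    for r_part, s_part in zip(radix_partition(R, bits), radix_partition(S, bits)):
        # The smaller partition is the inner one (loaded into local memory).
        inner, outer = (r_part, s_part) if len(r_part) <= len(s_part) else (s_part, r_part)
        inner_counts = Counter(inner)
        for k in outer:
            # One result pair per matching inner tuple.
            output.extend([(k, k)] * inner_counts[k])
    return output
```

Equal keys always land in the same partition number on both sides, so only partition pairs with the same index need to be compared.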
  32. 32. Lock-Free Scheme for Result Output • Problems – Unknown join result size. Max size of joins doesn’t fit in memory. – Concurrent writes are not atomic.
  33. 33. Lock-Free Scheme for Result Output • Solution: Three-phase scheme – Each thread counts the number of join results it generates. – Compute a prefix sum on the counts to get an array of write locations and the total number of results generated by the join. – Host code allocates memory on the device. – Run the join again, writing the outputs. • The join runs twice. That’s OK, GPUs are fast.
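The count, scan, and write phases can be sketched sequentially as below, with one simulated thread per tuple of R. The function name and the use of a plain nested-loop predicate are assumptions made for illustration.

```python
def join_with_preallocated_output(R, S, condition):
    """Three-phase output: count, prefix sum, then rerun the join and write."""
    # Phase 1: each thread counts its own join results.
    counts = [sum(1 for s in S if condition(r, s)) for r in R]

    # Phase 2: exclusive prefix sum yields write offsets and the total size.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c

    # The host allocates exactly `total` output slots.
    output = [None] * total

    # Phase 3: rerun the join; each thread writes only into its own range.
    for t, r in enumerate(R):
        pos = offsets[t]
        for s in S:
            if condition(r, s):
                output[pos] = (r, s)
                pos += 1
    return output
```

Since the write ranges are disjoint, no locks or atomic writes are required, and the output buffer is exactly as large as the join result.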
  34. 34. Experimental Results
  35. 35. Hardware Configuration • Theoretical Memory bandwidth – GPU: 86.4 GB/s – CPU: 10.4 GB/s • Practical Memory bandwidth – GPU: 69.2 GB/s – CPU: 5.6 GB/s
  36. 36. Workload • R and S tables with 2 integer columns. • SELECT R.rid, S.rid FROM R, S WHERE <predicate> • SELECT R.rid, S.rid FROM R, S WHERE R.rid = S.rid • SELECT R.rid, S.rid FROM R, S WHERE R.rid <= S.rid AND S.rid <= R.rid + k • Tested on all combinations: – Fix R, vary S. All values uniformly distributed. |R| = 1M – Performance impact of varying join selectivity. |R| = |S| = 16M – Non-uniform distribution of data sizes, also with varying join selectivity. |R| = |S| = 16M • Also tested with string columns.
  37. 37. Implementation Details on CPU • Highly optimized primitives and join algorithms matching hardware architecture • Tuned for cache performance. • Compiled programs using MSVC 8.0 with full optimizations. • Used OpenMP for threading. • 2-6X faster than their sequential counterparts.
  38. 38. Implementation Details on GPU • CUDA parameters – Number of thread groups (128) – Number of threads for each thread group (64) – Block size is 4MB (main memory to device memory)
  39. 39. Memory Optimizations Work
  40. 40. Works when join selectivity is varied
  41. 41. Better than in-memory database
  42. 42. CUDA vs. DirectX10 • DirectX10 is difficult to program, because the data is stored as textures. • NINLJ and INLJ have similar performance. • HJ and SMJ are slower because of texture decoding. • Summary: low level primitives on GPGPU are better than graphics primitives on GPU.
  43. 43. Criticisms • Applications of skew handling are unclear. • Primitives are sufficient to implement the given joins, but they do not prove the set of primitives to be minimal.
  44. 44. Limitations and future research directions • Lack of synchronization mechanisms for handling read/write conflicts on GPU. • More primitives. • More open GPGPU hardware spec for optimizations. • Power consumption on GPU. • Lack of support for complex data types. • In-memory database on the GPU. • Automatic detection of thread groups and number of threads using program analysis techniques.
  45. 45. Conclusion • GPU-based primitives and join algorithms achieve a speedup of 2-27X over optimized CPU-based counterparts. • NINLJ, 7.0X; INLJ, 6.1X; SMJ, 2.4X; HJ, 1.9X
  46. 46. References • Scan Primitives for GPU Computing, Sengupta et al.
  47. 47. Thank You.
  48. 48. Scan Primitives for GPU Computing, Sengupta et al
  49. 49. Skew Handling • Skew in data results in an imbalanced partition size in partitioned-based algorithms (SMJ and HJ) • Solution – Identify partitions that do not fit into the local memory – Decompose partitions into multiple chunks the size of local memory
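The decomposition step can be sketched as follows, with `local_memory_size` counted in tuples. Both the function name and that unit are assumptions.

```python
def decompose_partitions(partitions, local_memory_size):
    """Break partitions larger than local memory into local-memory-sized chunks."""
    chunks = []
    for part in partitions:
        if len(part) <= local_memory_size:
            chunks.append(part)  # already fits, keep as is
        else:
            for start in range(0, len(part), local_memory_size):
                chunks.append(part[start:start + local_memory_size])
    return chunks
```

For example, with a local memory of two tuples, the partitions `[[1, 2], [3, 4, 5, 6, 7], [8]]` become `[[1, 2], [3, 4], [5, 6], [7], [8]]`.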
  50. 50. Implementation Details on GPU • CUDA parameters – Number of threads for each thread group – Number of thread groups • DirectX10 – Join algorithms implemented using programmable pipeline • Vertex shader, geometry shader, and pixel shader