Relational Joins on Graphics Processors
Suman Karumuri
Jamie Jablin
Background
Introduction
• Utilizing hardware features of the GPU
  – Massive thread parallelism
  – Fast inter-processor communication
  – High memory bandwidth
  – Coalesced access
Relational Joins
•   Non-indexed nested-loop join (NINLJ)
•   Indexed nested-loop join (INLJ)
•   Sort-merge join (SMJ)
•   Hash join (HJ)
Block Non-indexed nested-loop join (NINLJ)

foreach r in R:
  foreach s in S:
    if condition(r, s):
      output += <r,s>
Block Indexed nested-loop join (INLJ)

foreach r in R:
  if S.index.has(r.f1):
    foreach s in S.index.lookup(r.f1):
      if condition(r, s):
        output += <r,s>
Hash join (HJ)

Hr = Hashtable()
foreach r in R:
  Hr.add(r)
  if Hr.size() == MAX_MEMORY:
    foreach s in S:
      if Hr.has(s):
        output += <Hr.get(s), s>
    Hr.clear()
foreach s in S:   # probe once more for the final, partially filled table
  if Hr.has(s):
    output += <Hr.get(s), s>
Sort-merge join (SMJ)

Sort(R); Sort(S)
i, j = 0
while i < |R| && j < |S|:
  if R[i] == S[j]:
    output += <R[i], S[j]>
    i++; j++
  elif R[i] < S[j]:
    i++
  else:
    j++
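The pseudocode above can be turned into a small runnable sketch. This is a plain sequential Python version, not the paper's GPU code, and it assumes keys are unique within each relation, since the slide's merge advances both cursors on a match.

```python
def sort_merge_join(R, S):
    """Join two lists of (key, payload) tuples on key.

    Assumes unique keys per relation, as in the slide's
    pseudocode; duplicate keys would need an extra inner
    loop over runs of equal keys.
    """
    R, S = sorted(R), sorted(S)
    i, j, out = 0, 0, []
    while i < len(R) and j < len(S):
        if R[i][0] == S[j][0]:
            out.append((R[i], S[j]))
            i += 1
            j += 1
        elif R[i][0] < S[j][0]:
            i += 1
        else:
            j += 1
    return out
```

For example, `sort_merge_join([(1, 'a'), (3, 'b')], [(3, 'x'), (4, 'y')])` yields the single matching pair on key 3.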
Algorithms on GPU
• Tips for algorithm design
  – Use the inherent concurrency.
  – Keep the SIMD nature in mind.
  – Algorithms should be side-effect free.
  – Memory properties:
    • High memory bandwidth.
    • Coalesced access (for spatial locality).
    • Cache in local memory (for temporal locality).
    • Access memory via indices and offsets.
GPU Primitives
Design and Implementation
• A complete set of parallel primitives:
  – Map, scatter, gather, prefix scan, split, and sort
• Low synchronization overhead.
• Scalable to hundreds of processors.
• Applicable to joins as well as other relational
  query operators.
Map
Scatter
• Indexed writes to a relation.
Gather
• Performs indexed reads from a relation.
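Sequential sketches of these two primitives follow; on the GPU one thread would handle each element, and the function names here are illustrative, not the paper's API.

```python
def scatter(relation, indices):
    # Indexed write: tuple i of the input goes to slot indices[i].
    out = [None] * len(relation)
    for i, t in enumerate(relation):
        out[indices[i]] = t
    return out

def gather(relation, indices):
    # Indexed read: output i is the tuple found at slot indices[i].
    return [relation[i] for i in indices]
```

Scatter followed by gather with the same index array is the identity, which is why the two are natural inverses in the primitive set.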
Prefix Scan
• A prefix scan applies a binary operator on an input
  of size n and produces an output of size n.
• Ex: Prefix sum: cumulative sum of all elements to
  the left of the current element.
  – Exclusive (used in the paper)
  – Inclusive
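A minimal sketch of the exclusive variant used in the paper: output i holds the sum of everything strictly to the left, so the first element is always 0.

```python
def exclusive_prefix_sum(xs):
    # out[i] = xs[0] + ... + xs[i-1]; out[0] = 0.
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out
```

For example, `exclusive_prefix_sum([3, 1, 4, 1])` is `[0, 3, 4, 8]`; the inclusive variant would instead end each position with its own element included.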
Split
• Each thread constructs its histogram tHist from Rin.
• Write the histograms to a flat array:
  L[(p-1)*#thread + t] = tHist[t][p]
• Prefix sum: L(i) = sum(L[0…i-1])
  – Gives the start location of each partition.
• Read back the per-thread write offsets:
  tOffset[t][p] = L[(p-1)*#thread + t]
• Scatter tuples to Rout based on the offsets.
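The steps above can be sketched sequentially. The thread assignment and names are illustrative; a real GPU split runs the histogram, scan, and scatter phases in parallel, with the 0-based index `L[p * nthreads + t]` playing the role of the slide's `L[(p-1)*#thread + t]`.

```python
def split(Rin, part_of, nparts, nthreads):
    """Partition Rin into nparts contiguous runs, simulating
    nthreads GPU threads. part_of maps a tuple to its partition
    id. A hypothetical sequential model of the split primitive."""
    # Assign each input tuple to a thread (block distribution).
    chunk = (len(Rin) + nthreads - 1) // nthreads
    owner = [min(i // chunk, nthreads - 1) for i in range(len(Rin))]
    # 1. Each thread builds a private histogram tHist[t][p].
    tHist = [[0] * nparts for _ in range(nthreads)]
    for i, r in enumerate(Rin):
        tHist[owner[i]][part_of(r)] += 1
    # 2. Flatten, partition-major: L[p * nthreads + t] = tHist[t][p].
    L = [tHist[t][p] for p in range(nparts) for t in range(nthreads)]
    # 3. Exclusive prefix sum gives each (thread, partition) pair
    #    its write offset; partition p starts at offs[p * nthreads].
    offs, running = [], 0
    for x in L:
        offs.append(running)
        running += x
    tOffset = [[offs[p * nthreads + t] for p in range(nparts)]
               for t in range(nthreads)]
    # 4. Scatter tuples to Rout using the per-thread offsets.
    Rout = [None] * len(Rin)
    for i, r in enumerate(Rin):
        t, p = owner[i], part_of(r)
        Rout[tOffset[t][p]] = r
        tOffset[t][p] += 1
    return Rout
```

Splitting `[3, 1, 4, 1, 5, 9]` by parity with two threads gives `[4, 3, 1, 1, 5, 9]`: the even partition first, then the odds, each in stable thread order.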
Sort
• Bitonic sort
  – Uses sorting networks, O(N log² N).
• Quick sort
  – Partition using a random pivot until each partition
    fits in local memory.
  – Sort each partition using bitonic sort.
  – Partitioning can be parallelized using split.
  – Complexity is O(N log N).
  – 30% faster than bitonic sort in experiments.
  – Quick sort is therefore used for sorting.
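The bitonic network can be sketched as follows. Each inner pass is a data-independent compare-exchange stage, which is what makes it run well as a parallel GPU kernel; here the stages run sequentially, and the input length must be a power of two.

```python
def bitonic_sort(xs):
    """In-place bitonic sorting network (ascending). Each pass
    over i is one parallel compare-exchange stage on the GPU."""
    n = len(xs)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:                 # size of the bitonic sequences
        j = k // 2
        while j > 0:              # compare-exchange distance
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (xs[i] > xs[partner]) == ascending:
                        xs[i], xs[partner] = xs[partner], xs[i]
            j //= 2
        k *= 2
    return xs
```

Because the compare-exchange pattern is fixed in advance (a sorting network), the same O(N log² N) schedule sorts every input, with no data-dependent branching between threads.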
Spatial and Temporal locality
Memory Optimizations
• Coalesced memory access improves memory
  bandwidth utilization (spatial locality).
Local Memory Optimization
• Quick sort
  – Temporal locality
  – Use bitonic sort to sort each chunk after the
    partitioning step.
Joins on GPGPU
NINLJ on GPU
• Block nested-loop join.
• Uses the map primitive on both relations.
  – Partition R into R' blocks and S into S' blocks
    respectively.
  – Create R' x S' thread groups.
  – A thread in a thread group processes one tuple
    from R' and matches all tuples from S'.
  – All tuples in S' are in the local cache.
B+ Tree vs CSS Tree
• B+ tree imposes
  – Memory stalls when traversed (no spatial locality).
  – Can't perform multiple searches (loses temporal
    locality).
• CSS tree (cache-optimized search tree)
  – One-dimensional array in which nodes are indexed.
  – Replaces pointer traversal with computation.
  – Can also perform parallel key lookups.
Indexed Nested Loop Join (INLJ)
• Uses the map primitive on the outer relation.
• Uses a CSS tree for the index.
• For each block in the outer relation R
  – Start at the root node and find the next level.
    • Binary search is shown to be better than
      sequential search.
  – Go down until you reach the data node.
• Upper-level nodes are cached in local memory since
  they are frequently accessed.
Sort Merge Join
• Sort the relations R and S using the sort primitive.
• Merge phase
  – Break S into chunks (s') of size M.
  – Find the first and last key values of each chunk
    in s' and partition R into that many chunks.
  – Merge all chunks in parallel using map.
    • Each thread group handles one pair.
    • Each thread compares one tuple in R with s'
      using binary search.
  – The chunk size is chosen to fit in local memory.
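The merge phase can be sketched sequentially as below, assuming unique integer keys; `bisect` stands in for the per-thread binary search, and each (R-span, chunk) pair is the unit that a thread group would process in parallel.

```python
from bisect import bisect_left, bisect_right

def chunked_merge(R, S, M):
    """Merge-phase sketch: R and S are sorted key lists. S is cut
    into chunks of size M; binary search on the chunk's first and
    last keys finds the span of R each chunk must be joined with."""
    pairs = []
    for lo in range(0, len(S), M):
        chunk = S[lo:lo + M]
        # Span of R covered by this chunk's key range.
        r_lo = bisect_left(R, chunk[0])
        r_hi = bisect_right(R, chunk[-1])
        for r in R[r_lo:r_hi]:
            # Each "thread" binary-searches the chunk for one R tuple.
            k = bisect_left(chunk, r)
            if k < len(chunk) and chunk[k] == r:
                pairs.append((r, r))
    return pairs
```

For example, joining `[1, 2, 3, 5, 8]` with `[2, 3, 4, 5, 6, 8, 9]` in chunks of 3 finds the matches on keys 2, 3, 5, and 8.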
Hash Join
• Uses the split primitive on both relations.
• Developed a parallel version of radix hash join.
  – Partitioning
    • Split R and S into the same number of partitions,
      so that the S partitions fit into local memory.
  – Matching
    • Choose the smaller of the R and S partitions as
      the inner partition, loaded into local memory.
    • The larger one is used as the outer partition.
    • Each tuple from the outer partition searches the
      inner partition for matches.
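A sequential sketch of the radix idea: partition both sides on the low bits of the key, then build on the smaller partition and probe with the larger, as the slide describes. The names and the Python dict standing in for a local-memory hash table are illustrative.

```python
def radix_hash_join(R, S, bits=2):
    """Sketch of radix hash join over (key, payload) tuples.
    Output pairs are always ordered (R tuple, S tuple)."""
    nparts = 1 << bits
    parts_R = [[] for _ in range(nparts)]
    parts_S = [[] for _ in range(nparts)]
    # Partitioning: route each tuple by the low `bits` of its key.
    for t in R:
        parts_R[t[0] & (nparts - 1)].append(t)
    for t in S:
        parts_S[t[0] & (nparts - 1)].append(t)
    out = []
    for pr, ps in zip(parts_R, parts_S):
        # Build on the smaller partition (the "inner" one that
        # would live in local memory), probe with the larger.
        if len(pr) <= len(ps):
            build, probe, r_is_build = pr, ps, True
        else:
            build, probe, r_is_build = ps, pr, False
        table = {}
        for t in build:
            table.setdefault(t[0], []).append(t)
        for t in probe:
            for m in table.get(t[0], []):
                out.append((m, t) if r_is_build else (t, m))
    return out
```

Only tuples that land in the same partition pair can ever match, which is what lets each partition pair be handled by an independent thread group.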
Lock-Free Scheme for Result Output
• Problems
  – The join result size is unknown in advance, and the
    maximum possible size does not fit in memory.
  – Concurrent writes are not atomic.
Lock-Free Scheme for Result Output
• Solution: Three-phase scheme
  – Each thread counts the number of join results it
    produces.
  – Compute a prefix sum on the counts to get an array
    of write locations and the total number of results.
  – The host code allocates memory on the device.
  – Run the join again, writing the output.
• The join runs twice; that is acceptable, since GPUs
  are fast.
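The three phases can be sketched as follows, with threads simulated by slices of the R x S pair space; the function name and the equal-slice thread assignment are illustrative.

```python
def lock_free_join_output(R, S, condition, nthreads=4):
    """Three-phase result-output sketch: count, prefix-sum,
    then re-run the join writing into a preallocated buffer.
    No thread ever writes to another thread's slots, so no
    locks or atomics are needed."""
    pairs = [(r, s) for r in R for s in S]
    chunk = (len(pairs) + nthreads - 1) // nthreads
    slices = [pairs[t * chunk:(t + 1) * chunk] for t in range(nthreads)]
    # Phase 1: each thread counts its matches.
    counts = [sum(1 for r, s in sl if condition(r, s)) for sl in slices]
    # Phase 2: exclusive prefix sum -> per-thread write offsets,
    # and the grand total tells the host how much to allocate.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    out = [None] * total
    # Phase 3: run the join again; each thread writes at its offset.
    for t, sl in enumerate(slices):
        pos = offsets[t]
        for r, s in sl:
            if condition(r, s):
                out[pos] = (r, s)
                pos += 1
    return out
```

The buffer ends up exactly full, with results in deterministic pair order, because phase 2 reserved precisely one disjoint slot range per thread.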
Experimental Results
Hardware Configuration
• Theoretical memory bandwidth
  – GPU: 86.4 GB/s
  – CPU: 10.4 GB/s
• Practical memory bandwidth
  – GPU: 69.2 GB/s
  – CPU: 5.6 GB/s
Workload
• R and S tables with 2 integer columns.
• SELECT R.rid, S.rid FROM R, S WHERE <predicate>
• SELECT R.rid, S.rid FROM R, S WHERE R.rid = S.rid
• SELECT R.rid, S.rid FROM R, S WHERE R.rid <= S.rid <= R.rid + k
• Tested on all combinations:
  – Fix R, vary S. All values uniformly distributed. |R| = 1M.
  – Performance impact of varying join selectivity. |R| = |S| = 16M.
  – Non-uniform distribution of data sizes, also with varying
    join selectivity. |R| = |S| = 16M.
• Also tested with string columns.
Implementation Details on CPU
• Highly optimized primitives and join algorithms
  matching the hardware architecture.
• Tuned for cache performance.
• Compiled with MSVC 8.0 with full optimizations.
• Used OpenMP for threading.
• 2-6X faster than their sequential counterparts.
Implementation Details on GPU
• CUDA parameters
  – Number of thread groups (128)
  – Number of threads per thread group (64)
  – Block size is 4MB (main memory to device memory)
Memory Optimizations Work
Works when join selectivity is varied
Better than an in-memory database
CUDA vs. DirectX10
• DirectX10 is difficult to program, because the data
  is stored as textures.
• NINLJ and INLJ have similar performance.
• HJ and SMJ are slower because of texture decoding.
• Summary: low-level primitives on a GPGPU beat
  graphics primitives on the GPU.
Criticisms
• Applications of skew handling are unclear.
• The primitives are sufficient to implement the given
  joins, but the authors do not prove the set of
  primitives to be minimal.
Limitations and future research directions
• Lack of synchronization mechanisms for handling
  read/write conflicts on the GPU.
• More primitives.
• A more open GPGPU hardware spec for optimizations.
• Power consumption on the GPU.
• Lack of support for complex data types.
• An on-GPU in-memory database.
• Automatic detection of thread-group count and thread
  count using program analysis techniques.
Conclusion
• GPU-based primitives and join algorithms achieve a
  speedup of 2-27X over optimized CPU-based
  counterparts.
• NINLJ: 7.0X; INLJ: 6.1X; SMJ: 2.4X; HJ: 1.9X.
References
• Scan Primitives for GPU Computing, Sengupta et al.
• wikipedia.org
• monetdb.cwi.nl
Thank You.
Scan Primitives for GPU Computing, Sengupta et al
Skew Handling
• Skew in the data results in imbalanced partition
  sizes in partition-based algorithms (SMJ and HJ).
• Solution
  – Identify partitions that do not fit into local
    memory.
  – Decompose those partitions into multiple chunks
    the size of local memory.
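The decomposition step can be sketched as below; the function name is illustrative, with `local_mem` standing in for the local-memory capacity in tuples.

```python
def decompose_skewed(partitions, local_mem):
    """Skew fix sketch: any partition larger than local memory
    is cut into local_mem-sized chunks, so every resulting work
    unit fits on-chip regardless of the key distribution."""
    chunks = []
    for p in partitions:
        if len(p) <= local_mem:
            chunks.append(p)
        else:
            chunks.extend(p[i:i + local_mem]
                          for i in range(0, len(p), local_mem))
    return chunks
```

After decomposition, thread groups receive roughly equal-sized chunks even when one key dominates the input.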
Implementation Details on GPU
• CUDA parameters
  – Number of threads per thread group
  – Number of thread groups
• DirectX10
  – Join algorithms implemented using the programmable
    pipeline
    • Vertex shader, geometry shader, and pixel shader