Gpu Join Presentation

This paper presents algorithms for performing database joins on a GPU. It is interesting work that may someday lead to implementing databases on a GPGPU platform such as CUDA.

    Presentation Transcript

    • Relational Joins on Graphics Processors Suman Karumuri Jamie Jablin
    • Background
    • Introduction • Utilizing hardware features of the GPU – Massive thread parallelism – Fast inter-processor communication – High memory bandwidth – Coalesced access
    • Relational Joins • Non-indexed nested-loop join (NINLJ) • Indexed nested-loop join (INLJ) • Sort-merge join (SMJ) • Hash join (HJ)
    • Block Non-indexed nested-loop join (NINLJ)
      foreach r in R:
        foreach s in S:
          if condition(r,s):
            output = <r,s>
    • Block Indexed nested-loop join (INLJ)
      foreach r in R:
        foreach s in S.index.lookup(r.f1):
          if condition(r,s):
            output = <r,s>
    • Hash join (HJ)
      Hr = Hashtable()
      foreach r in R:
        Hr.add(r)
        if Hr.size() == MAX_MEMORY:
          foreach s in S:
            if Hr.has(s):
              output s
          Hr.clear()
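The slide's chunked build-and-probe idea can be made runnable. Below is a sequential Python sketch (no parallelism); the `max_memory` threshold, the key function, and the tuple layout are illustrative, not the paper's values:

```python
def hash_join(R, S, key=lambda t: t[0], max_memory=2):
    """Chunked hash join: build a hash table on R a chunk at a time,
    probe each full chunk with every tuple of S, then reset."""
    out = []
    Hr = {}

    def probe():
        for s in S:
            for r in Hr.get(key(s), []):
                out.append((r, s))
        Hr.clear()

    for r in R:
        Hr.setdefault(key(r), []).append(r)
        if len(Hr) == max_memory:   # chunk no longer fits: probe and reset
            probe()
    if Hr:                          # probe the final partial chunk
        probe()
    return out

R = [(1, 'a'), (2, 'b'), (3, 'c')]
S = [(2, 'x'), (3, 'y'), (4, 'z')]
result = hash_join(R, S)            # matches on keys 2 and 3
```

Note the final `if Hr: probe()` handles the last partial chunk, which the slide's pseudocode leaves implicit.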
    • Sort-merge join (SMJ)
      Sort(R); Sort(S)
      i = j = 0
      while i < |R| && j < |S|:
        if R[i] == S[j]:
          output += R[i]
          i++; j++
        elif R[i] < S[j]:
          i++
        else:
          j++
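A minimal runnable version of the merge loop, assuming unique join keys in each relation (as the slide's pseudocode implicitly does):

```python
def sort_merge_join(R, S):
    """Sort both inputs, then advance two cursors, emitting equal keys."""
    R, S = sorted(R), sorted(S)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i] == S[j]:
            out.append(R[i])
            i += 1
            j += 1
        elif R[i] < S[j]:
            i += 1
        else:
            j += 1
    return out

joined = sort_merge_join([3, 1, 2], [2, 4, 3])  # [2, 3]
```

With duplicate keys, the equal-key case would need an inner loop over each run of equal values; that is omitted here for brevity.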
    • Algorithms on GPU • Tips for algorithm design – Use the inherent concurrency. – Keep SIMD nature in mind. – Algorithms should be side-effect free. – Memory properties: • High memory bandwidth. • Coalesced access (for spatial locality) • Cache in local memory (for temporal locality) • Access memory via indices and offsets.
    • GPU Primitives
    • Design and Implementation • A complete set of parallel primitives: – Map, scatter, gather, prefix scan, split, and sort • Low synchronization overhead. • Scalable to hundreds of processors. • Applicable to joins as well as other relational query operators.
    • Map
    • Scatter • Indexed writes to a relation.
    • Gather • Performs indexed reads from a relation.
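Scatter and gather are inverses of each other when the index array is a permutation. A minimal sequential Python sketch (the GPU versions run one element per thread):

```python
def scatter(rel, idx):
    """Indexed write: out[idx[i]] = rel[i]."""
    out = [None] * len(rel)
    for i, t in enumerate(rel):
        out[idx[i]] = t
    return out

def gather(rel, idx):
    """Indexed read: out[i] = rel[idx[i]]."""
    return [rel[idx[i]] for i in range(len(idx))]

vals = ['a', 'b', 'c']
perm = [2, 0, 1]
# scattering then gathering with the same permutation round-trips
round_trip = gather(scatter(vals, perm), perm)
```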
    • Prefix Scan • A prefix scan applies a binary operator on the input of size n and produces an output of size n. • Ex: Prefix sum: cumulative sum of all elements to the left of the current element. – Exclusive (used in paper) – Inclusive
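The exclusive/inclusive distinction can be shown in a few lines of sequential Python (a GPU implementation would use a tree-structured parallel scan instead of this loop):

```python
def inclusive_scan(xs, op=lambda a, b: a + b):
    """out[i] combines all elements up to AND including xs[i]."""
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

def exclusive_scan(xs, identity=0, op=lambda a, b: a + b):
    """out[i] combines all elements strictly to the left of xs[i]
    (the variant used in the paper)."""
    out, acc = [], identity
    for x in xs:
        out.append(acc)
        acc = op(acc, x)
    return out

inc = inclusive_scan([3, 1, 4])   # [3, 4, 8]
exc = exclusive_scan([3, 1, 4])   # [0, 3, 4]
```

The exclusive variant is what turns per-partition counts into write offsets in the split primitive and the lock-free output scheme later in the deck.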
    • Split
    • Each thread constructs tHist from Rin
    • L[(p-1)*#thread+t] = tHist[t][p]
    • Prefix sum: L(i) = sum(L[0…i-1]), which gives the start location of each partition.
    • tOffset[t][p] = L[(p-1)*#thread+t]
    • Scatter tuples to Rout based on offset.
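The histogram / prefix-sum / scatter steps above can be emulated sequentially in Python. The round-robin thread assignment and the partition function are illustrative; indices are 0-based here, where the slide's `(p-1)*#thread+t` formula is 1-based in `p`:

```python
def split(Rin, n_threads, n_parts, part_of):
    """Split Rin into n_parts contiguous partitions using per-thread
    histograms, a prefix sum, and a scatter, emulated sequentially."""
    # deal tuples to "threads" round-robin for this sketch
    chunks = [Rin[t::n_threads] for t in range(n_threads)]

    # 1. each thread builds a histogram tHist of its chunk
    tHist = [[0] * n_parts for _ in range(n_threads)]
    for t, chunk in enumerate(chunks):
        for r in chunk:
            tHist[t][part_of(r)] += 1

    # 2. L[p * n_threads + t] = tHist[t][p]
    L = [tHist[t][p] for p in range(n_parts) for t in range(n_threads)]

    # 3. exclusive prefix sum of L gives each (thread, partition) offset
    offsets, acc = [], 0
    for x in L:
        offsets.append(acc)
        acc += x

    # 4. tOffset[t][p] = offsets[p * n_threads + t]
    tOffset = [[offsets[p * n_threads + t] for p in range(n_parts)]
               for t in range(n_threads)]

    # 5. scatter tuples to Rout based on the offsets
    Rout = [None] * len(Rin)
    for t, chunk in enumerate(chunks):
        for r in chunk:
            p = part_of(r)
            Rout[tOffset[t][p]] = r
            tOffset[t][p] += 1
    return Rout

data = [5, 2, 8, 1, 6, 3]
# partition by parity: evens (partition 0) end up before odds (partition 1)
out = split(data, n_threads=2, n_parts=2, part_of=lambda x: x % 2)
```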
    • Sort • Bitonic sort – Uses sorting networks, O(N log²N). • Quick sort – Partition using a random pivot until each partition fits in local memory. – Sort each partition using bitonic sort. – Partitioning can be parallelized using split. – Complexity is O(N log N). – 30% faster than bitonic sort in experiments. – Quick sort is therefore used for sorting.
    • Spatial and Temporal locality
    • Memory Optimizations • Coalesced memory improves memory bandwidth utilization (spatial locality)
    • Local Memory Optimization • Quick sort – Temporal locality – Use the bitonic sort to sort each chunk after the partitioning step.
    • Joins on GPGPU
    • NINLJ on GPU • Block nested-loop join • Uses Map primitive on both relations – Partition R into R’ blocks and S into S’ blocks respectively. – Create R’ x S’ thread groups. – A thread in a thread group processes one tuple from R’ and matches all tuples from S’. – All tuples in S’ are in local cache.
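The blocking scheme can be sketched sequentially: each (R-block, S-block) pair plays the role of one thread group, and the inner loops stand in for the threads. Block sizes and the join condition are illustrative:

```python
def block_ninlj(R, S, cond, r_block=2, s_block=2):
    """Block nested-loop join: iterate over all (R-block, S-block)
    pairs; on the GPU each pair is one thread group, with the S-block
    held in local memory."""
    out = []
    for rb in range(0, len(R), r_block):
        for sb in range(0, len(S), s_block):
            # one "thread group"
            for r in R[rb:rb + r_block]:
                for s in S[sb:sb + s_block]:
                    if cond(r, s):
                        out.append((r, s))
    return out

pairs = block_ninlj([1, 2, 3], [2, 3, 4], lambda r, s: r == s)
```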
    • B+ Tree vs CSS-Tree • B+ tree imposes – Memory stalls when traversed (no spatial locality) – Can’t perform multiple searches (loses temporal locality). • CSS-Tree (cache-optimized search tree) – One-dimensional array where nodes are indexed. – Replaces pointer traversal with computation. – Can also perform parallel key lookups.
    • Indexed Nested Loop Join (INLJ) • Uses Map primitive on outer relation • Uses CSS tree for index. • For each block in outer relation R – Start with a root node to find the next level • Binary search is shown to be better than sequential search. – Go down until you find the data node. • Upper-level nodes are cached in local memory since they are frequently accessed.
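The core CSS-tree idea — nodes laid out in one array, with children found by index arithmetic instead of pointer chasing — can be illustrated with a simplified two-level variant (this is not the paper's exact CSS-tree layout; node size and key set are illustrative):

```python
import bisect

NODE_SIZE = 4  # keys per leaf node (illustrative)

def build_directory(sorted_keys, node_size=NODE_SIZE):
    """Top level holds the last key of each node; a child's position
    is computed from the node index, so no pointers are stored."""
    return [sorted_keys[min(i + node_size - 1, len(sorted_keys) - 1)]
            for i in range(0, len(sorted_keys), node_size)]

def lookup(sorted_keys, directory, key, node_size=NODE_SIZE):
    """Binary-search the directory, compute the leaf's offset, then
    binary-search within that one cache-friendly leaf."""
    node = bisect.bisect_left(directory, key)
    lo = node * node_size                       # child offset is computed
    hi = min(lo + node_size, len(sorted_keys))
    i = bisect.bisect_left(sorted_keys, key, lo, hi)
    return i if i < hi and sorted_keys[i] == key else -1

keys = list(range(0, 40, 2))   # sorted keys 0, 2, ..., 38
d = build_directory(keys)
```

Because the directory and each leaf are small contiguous arrays, many independent lookups (one per GPU thread) touch cache-resident data, which is the temporal-locality win the slide describes.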
    • Sort Merge Join • Sort the relations R, S using the sort primitive • Merge phase – Break S into chunks (s’) of size M. – Find first and last key values of each chunk in s’ and partition R into those many chunks. – Merge all chunks in parallel using map • Each thread group handles a pair • Each thread compares 1 tuple in R with s’ using binary search. • Chunk size is chosen to fit in local memory.
    • Hash Join • Uses split primitive on both relations • Developed a parallel version of radix hash join – Partitioning • Split R and S into the same number of partitions, so S partitions fit into the local memory – Matching • Choose smaller one of R and S partitions as inner partition to be loaded into local memory • Larger relation will be used as the outer relation • Each tuple from outer relation uses a search on the inner relation for matching.
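The partitioning step of radix hash join can be sketched as follows; partitioning on the low bits of the key guarantees that matching keys from R and S land in partitions with the same index, so matching proceeds partition-pair by partition-pair (the bit count and key function are illustrative):

```python
def radix_partition(rel, bits, key=lambda t: t):
    """Partition tuples by the low `bits` bits of their key."""
    n_parts = 1 << bits
    parts = [[] for _ in range(n_parts)]
    for t in rel:
        parts[key(t) & (n_parts - 1)].append(t)
    return parts

R = [4, 9, 6, 3]
S = [9, 6, 12]
pR = radix_partition(R, bits=1)   # partition 0: even keys, 1: odd keys
pS = radix_partition(S, bits=1)

# matching: probe each R partition only against the matching S partition
matches = [r for p in range(len(pS)) for r in pR[p] if r in pS[p]]
```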
    • Lock-Free Scheme for Result Output • Problems – Unknown join result size. Max size of joins doesn’t fit in memory. – Concurrent writes are not atomic.
    • Lock-Free Scheme for Result Output • Solution: Three-phase scheme – Each thread counts the number of join results. – Compute a prefix sum on the counts to get an array of write locations and the total number of results generated by the join. – Host code allocates memory on device. – Run join again, writing outputs at the computed locations. • The join runs twice. That’s OK; GPUs are fast.
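The three-phase scheme can be emulated sequentially; the per-thread work assignment and join condition are illustrative. Because every thread gets a disjoint, precomputed write range, no write ever needs a lock:

```python
def lock_free_join_output(R, S, cond, n_threads=3):
    """Count per thread, exclusive-prefix-sum the counts into write
    offsets, then re-run the join writing into a preallocated buffer."""
    chunks = [R[t::n_threads] for t in range(n_threads)]

    # phase 1: each thread counts its join results
    counts = [sum(1 for r in c for s in S if cond(r, s)) for c in chunks]

    # phase 2: exclusive prefix sum -> per-thread write offset + total size
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c

    # phase 3: "allocate" the result buffer and run the join again,
    # each thread writing at its own disjoint offset range
    out = [None] * total
    for t, c in enumerate(chunks):
        pos = offsets[t]
        for r in c:
            for s in S:
                if cond(r, s):
                    out[pos] = (r, s)
                    pos += 1
    return out

out = lock_free_join_output([1, 2, 3, 4], [2, 4], lambda r, s: r == s)
```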
    • Experimental Results
    • Hardware Configuration • Theoretical Memory bandwidth – GPU: 86.4 GB/s – CPU: 10.4 GB/s • Practical Memory bandwidth – GPU: 69.2 GB/s – CPU: 5.6 GB/s
    • Workload • R and S tables with 2 integer columns. • SELECT R.rid, S.rid FROM R, S WHERE <predicate> • SELECT R.rid, S.rid FROM R, S WHERE R.rid=S.rid • SELECT R.rid, S.rid FROM R, S WHERE R.rid<=S.rid<=R.rid + k • Tested on all combinations: – Fix R, vary S. All values uniformly distributed. |R| = 1M – Performance impact of varying join selectivity. |R| = |S| = 16M – Non-uniform distribution of data sizes, also with varying join selectivity. |R| = |S| = 16M • Also tested with string columns.
    • Implementation Details on CPU • Highly optimized primitives and join algorithms matching the hardware architecture • Tuned for cache performance. • Programs compiled using MSVC 8.0 with full optimizations. • Used OpenMP for threading. • 2-6X faster than their sequential counterparts.
    • Implementation Details on GPU • CUDA parameters – Number of thread groups (128) – Number of threads for each thread group (64) – Block size is 4MB (main memory to device memory)
    • Memory Optimizations Work
    • Works when join selectivity is varied
    • Better than in-memory database
    • CUDA vs. DirectX10 • DirectX10 is difficult to program, because the data is stored as textures. • NINLJ and INLJ have similar performance. • HJ and SMJ are slower because of texture decoding. • Summary: low level primitives on GPGPU are better than graphics primitives on GPU.
    • Criticisms • Applications of skew handling are unclear. • Primitives are sufficient to implement the given joins, but they do not prove the set of primitives to be minimal.
    • Limitations and future research directions • Lack of synchronization mechanisms for handling read/write conflicts on GPU. • More primitives. • More open GPGPU hardware spec for optimizations. • Power consumption on GPU. • Lack of support for complex data types. • On GPU in-memory database. • Automatic detection of thread groups and number of threads using program analysis techniques.
    • Conclusion • GPU-based primitives and join algorithms achieve a speedup of 2-27X over optimized CPU-based counterparts. • NINLJ, 7.0X; INLJ, 6.1X; SMJ, 2.4X; HJ, 1.9X
    • References • Scan Primitives for GPU Computing, Sengupta et al. • wikipedia.org • monetdb.cwi.nl
    • Thank You.
    • Scan Primitives for GPU Computing, Sengupta et al
    • Skew Handling • Skew in data results in an imbalanced partition size in partitioned-based algorithms (SMJ and HJ) • Solution – Identify partitions that do not fit into the local memory – Decompose partitions into multiple chunks the size of local memory
    • Implementation Details on GPU • CUDA parameters – Number of threads for each thread group – Number of thread groups • DirectX10 – Join algorithms implemented using programmable pipeline • Vertex shader, geometry shader, and pixel shader