[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data Structures (John Owens, UC Davis)

Published on http://cs264.org (http://j.mp/gUKodD)
  1. 1. Irregular Parallelism on the GPU: Algorithms and Data Structures. John Owens, Associate Professor, Electrical and Computer Engineering; Institute for Data Analysis and Visualization; SciDAC Institute for Ultrascale Visualization; University of California, Davis
  2. 2. Outline• Irregular parallelism • Get a new data structure! (sparse matrix-vector) • Reuse a good data structure! (composite) • Get a new algorithm! (hash tables) • Convert serial to parallel! (task queue) • Concentrate on communication not computation! (MapReduce)• Broader thoughts
  3. 3. Sparse Matrix-Vector Multiply
  4. 4. Sparse Matrix-Vector Multiply: What’s Hard? • Dense approach is wasteful • Unclear how to map work to parallel processors • Irregular data access “Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors”, Bell & Garland, SC ’09
  5. 5. Sparse Matrix Formats [diagram: formats arranged from structured to unstructured: (DIA) Diagonal, (ELL) ELLPACK, (CSR) Compressed Row, (HYB) Hybrid, (COO) Coordinate]
  6. 6. Diagonal Matrices [figure: matrix with diagonals at offsets -2, 0, 1] We’ve recently done work on tridiagonal systems—PPoPP ’10 • Diagonals should be mostly populated • Map one thread per row • Good parallel efficiency • Good memory behavior [column-major storage]
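A minimal CUDA sketch of the thread-per-row DIA mapping described on this slide. The array names and the exact layout (each diagonal stored column-major, offsets[] giving its distance from the main diagonal) are illustrative assumptions, not the paper's code.

    // Thread-per-row SpMV for a DIA (diagonal) matrix.
    // Assumed layout: data[] holds num_diags diagonals column-major
    // (row r of diagonal d at data[d * num_rows + r]); offsets[d] is
    // the column offset of diagonal d from the main diagonal.
    __global__ void spmv_dia(int num_rows, int num_cols, int num_diags,
                             const int *offsets, const float *data,
                             const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows) return;

        float sum = 0.0f;
        for (int d = 0; d < num_diags; ++d) {
            int col = row + offsets[d];
            if (col >= 0 && col < num_cols)
                sum += data[d * num_rows + row] * x[col];  // coalesced across the warp
        }
        y[row] = sum;
    }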
  7. 7. Irregular Matrices: ELL [figure: per-row column indices padded to the longest row’s length] • Assign one thread per row again • But now: load imbalance hurts parallel efficiency
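A matching sketch for ELL, again one thread per row. The padded, column-major indices/data layout and the -1 padding sentinel are assumptions for illustration.

    // Thread-per-row ELL SpMV. Assumed layout: every row is padded to
    // max_cols_per_row entries; indices[] and data[] are stored column-major
    // (entry k of row r at [k * num_rows + r]) so a warp reads contiguous memory.
    __global__ void spmv_ell(int num_rows, int max_cols_per_row,
                             const int *indices, const float *data,
                             const float *x, float *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows) return;

        float sum = 0.0f;
        for (int k = 0; k < max_cols_per_row; ++k) {
            int col = indices[k * num_rows + row];
            if (col >= 0)                      // skip padding entries
                sum += data[k * num_rows + row] * x[col];
        }
        // one very long row inflates max_cols_per_row, and thus the work, for every thread
        y[row] = sum;
    }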
  8. 8. Irregular Matrices: COO [figure: explicit (row, column) index pairs for every nonzero, no padding!] • General format; insensitive to sparsity pattern, but ~3x slower than ELL • Assign one thread per element, combine results from all elements in a row to get output element • Requires segmented reduction, communication between threads
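The COO mapping assigns one thread per nonzero element. The Bell & Garland kernel combines the per-element products with a segmented reduction across threads; as a simpler (and slower) stand-in, this hypothetical sketch just adds each product into its output row atomically.

    // Thread-per-element COO SpMV, simplified: atomicAdd instead of the
    // segmented reduction used in the paper. rows[], cols[], vals[] hold
    // one nonzero each; y must be zeroed before the kernel runs.
    __global__ void spmv_coo_atomic(int num_nonzeros,
                                    const int *rows, const int *cols,
                                    const float *vals,
                                    const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_nonzeros) return;
        atomicAdd(&y[rows[i]], vals[i] * x[cols[i]]);
    }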
  9. 9. Thread-per-{element,row}
  10. 10. Irregular Matrices: HYB [figure: the matrix split into a “typical” ELL part and an “exceptional” COO part] • Combine regularity of ELL + flexibility of COO
  11. 11. SpMV: Summary• Ample parallelism for large matrices • Structured matrices (dense, diagonal): straightforward • Sparse matrices: Tradeoff between parallel efficiency and load balance• Take-home message: Use data structure appropriate to your matrix • Insight: Irregularity is manageable if you regularize the common case
  12. 12. Compositing
  13. 13. The Programmable Pipeline [diagram: three pipelines; raster (Input Assembly, Vertex Shading, Tess. Shading, Geom. Shading, Raster, Frag. Shading, Compose), Reyes (Split, Dice, Shading, Sampling, Composition), and ray tracing (Ray Generation, Ray Traversal, Ray-Primitive Intersection, Shading)]
  14. 14. Composition [diagram: Fragment Generation produces fragments (position, depth, color) at subpixel sample locations; Composite produces subpixels (position, color); Filter produces final pixels] Anjul Patney, Stanley Tzeng, and John D. Owens. Fragment-Parallel Composite and Filter. Computer Graphics Forum (Proceedings of the Eurographics Symposium on Rendering), 29(4):1251–1258, June 2010.
  15. 15. Why?
  16. 16. Pixel-Parallel Composition Subpixel 0 Subpixel 1 Subpixel 0 Subpixel 1 Pixel i Pixel i+1
  17. 17. Sample-Parallel Composition Subpixel 0 Subpixel 1 Subpixel 0 Subpixel 1 Segmented Scan Segmented Reduction Pixel i Pixel i+1 Subpixel 0 Subpixel 1 Subpixel 0 Subpixel 1 Pixel i Pixel i+1
  18. 18. M randomly-distributed fragments
  19. 19. Nested Data Parallelism [figure: the COO sparse-matrix example shown as equivalent to sample-parallel composition; subpixels per pixel handled with segmented scan and segmented reduction]
  20. 20. Hash Tables
  21. 21. Hash Tables & Sparsity Lefebvre and Hoppe, Siggraph 2006
  22. 22. Scalar Hashing [diagram: three classic schemes, each probing slots #1, #2, … from a key’s initial position: Linear Probing, Double Probing, Chaining]
  23. 23. Scalar Hashing: Parallel Problems • Construction and Lookup: variable time/work per entry • Construction: synchronization / shared access to the data structure
  24. 24. Parallel Hashing: The Problem• Hash tables are good for sparse data.• Input: Set of key-value pairs to place in the hash table• Output: Data structure that allows: • Determining if key has been placed in hash table • Given the key, fetching its value• Could also: • Sort key-value pairs by key (construction) • Binary-search sorted list (lookup)• Recalculate at every change
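The sort-then-binary-search alternative mentioned above, sketched with Thrust; the function and variable names are hypothetical.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/binary_search.h>

    // "Construction" = sort key-value pairs by key; "lookup" = binary search.
    // O(n log n) build, O(log n) per query, but simple and GPU-friendly.
    void build_and_query(thrust::device_vector<unsigned> &keys,
                         thrust::device_vector<unsigned> &values,
                         const thrust::device_vector<unsigned> &queries,
                         thrust::device_vector<unsigned> &positions)
    {
        thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
        // For each query key, find where it would land in the sorted key list;
        // a hit is confirmed by comparing keys[positions[i]] against queries[i].
        thrust::lower_bound(keys.begin(), keys.end(),
                            queries.begin(), queries.end(),
                            positions.begin());
    }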
  25. 25. Parallel Hashing: What We Want• Fast construction time• Fast access time • O(1) for any element, O(n) for n elements in parallel• Reasonable memory usage• Algorithms and data structures may sit at different places in this space • Perfect spatial hashing has good lookup times and memory usage but is very slow to construct
  26. 26. Level 1: Distribute into buckets [diagram: keys are hashed (h) to bucket ids; an atomic add per key yields bucket sizes and local offsets; a prefix sum of the bucket sizes yields global offsets; data is then distributed into buckets] Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. Real-Time Parallel Hashing on the GPU. Siggraph Asia 2009 (ACM TOG 28(5)), pp. 154:1–154:9, Dec. 2009.
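A rough sketch of the Level 1 bucket distribution as read off this slide: an atomic add per key yields bucket sizes and local offsets, a prefix sum of the sizes yields global offsets, and a scatter places the keys. The hash function and all names are illustrative, not the paper's.

    #include <thrust/device_ptr.h>
    #include <thrust/scan.h>

    // Phase 1: each key atomically grabs a slot in its bucket, producing
    // per-key local offsets and per-bucket counts (bucket_size zero-initialized).
    __global__ void count_and_offset(int n, const unsigned *keys, unsigned num_buckets,
                                     unsigned *bucket_of, unsigned *local_offset,
                                     unsigned *bucket_size)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned b = (keys[i] * 2654435761u) % num_buckets;   // assumed hash h
        bucket_of[i] = b;
        local_offset[i] = atomicAdd(&bucket_size[b], 1u);
    }

    // Phase 2: scatter each key to global_offset[bucket] + local_offset.
    __global__ void scatter(int n, const unsigned *keys,
                            const unsigned *bucket_of, const unsigned *local_offset,
                            const unsigned *global_offset, unsigned *out_keys)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        out_keys[global_offset[bucket_of[i]] + local_offset[i]] = keys[i];
    }

    // Host side: global offsets are an exclusive prefix sum of the bucket sizes.
    void bucket_offsets(unsigned *bucket_size, unsigned *global_offset, int num_buckets)
    {
        thrust::device_ptr<unsigned> sizes(bucket_size), offsets(global_offset);
        thrust::exclusive_scan(sizes, sizes + num_buckets, offsets);
    }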
  27. 27. Parallel Hashing: Level 1 • FKS (Level 1) is good for a coarse categorization • Possible performance issue: atomics • Bad for a fine categorization • Space requirement for n elements to (probabilistically) guarantee no collisions is O(n^2)
  28. 28. Hashing in Parallel [figure: the same keys hashed in parallel by two different hash functions, ha and hb]
  29. 29. Cuckoo Hashing Construction [figure: items placed into two tables T1 and T2 using hash functions h1 and h2, with evicted items reinserted into the other table] • Lookup procedure: in parallel, for each element: • Calculate h1 & look in T1; • Calculate h2 & look in T2; still O(1) lookup
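A minimal sketch of the two-table cuckoo lookup described above; the hash-function constants and the table representation are placeholder assumptions.

    // An item with a given key can only live in T1[h1(key)] or T2[h2(key)],
    // so a lookup is at most two probes: O(1).
    __device__ unsigned cuckoo_lookup(unsigned key,
                                      const unsigned *T1_keys, const unsigned *T1_vals,
                                      const unsigned *T2_keys, const unsigned *T2_vals,
                                      unsigned table_size, unsigned not_found)
    {
        unsigned s1 = (key * 2654435761u) % table_size;              // h1 (placeholder)
        if (T1_keys[s1] == key) return T1_vals[s1];
        unsigned s2 = ((key ^ 0x5bd1e995u) * 2246822519u) % table_size;  // h2 (placeholder)
        if (T2_keys[s2] == key) return T2_vals[s2];
        return not_found;
    }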
  30. 30. Cuckoo Construction Mechanics • Level 1 created buckets of no more than 512 items • Average: 409; probability of overflow: < 10^-6 • Level 2: Assign each bucket to a thread block, construct cuckoo hash per bucket entirely within shared memory • Semantic: Multiple writes to same location must have one and only one winner • Our implementation uses 3 tables of 192 elements each (load factor: 71%) • What if it fails? New hash functions & start over.
  31. 31. Timings on random voxel data
  32. 32. Why did we pick this implementation? • Alternative: single stage, cuckoo hash only • Problem 1: Uncoalesced reads/writes to global memory (and may require atomics) • Problem 2: Penalty for failure is large
  33. 33. But it turns out … Thanks to expert reviewer (and current collaborator) Vasily Volkov! • … this simpler approach is faster. • Single hash table in global memory (1.4x data size), 4 different hash functions into it • For each thread: • Atomically exchange my item with item in hash table • If my new item is empty, I’m done • If my new item used hash function n, use hash function n+1 to reinsert
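A sketch of the single-table scheme this slide describes: one global table of packed 64-bit key-value items, four hash functions into it, and an atomic exchange per insertion attempt. The packing, hash constants, retry cap, and failure handling are assumptions; the table is assumed pre-filled with EMPTY.

    #define EMPTY     0xFFFFFFFFFFFFFFFFull
    #define NUM_HASH  4
    #define MAX_ITERS 100

    __device__ unsigned hash_n(unsigned key, int n, unsigned table_size)
    {
        const unsigned c[NUM_HASH] = { 2654435761u, 2246822519u, 3266489917u, 668265263u };
        return (key * c[n]) % table_size;   // placeholder hash family
    }

    __global__ void cuckoo_insert(int n, const unsigned *keys, const unsigned *vals,
                                  unsigned long long *table, unsigned table_size,
                                  int *failed)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        unsigned long long item = ((unsigned long long)keys[i] << 32) | vals[i];
        int fn = 0;                                  // start with hash function 0
        for (int iter = 0; iter < MAX_ITERS; ++iter) {
            unsigned key  = (unsigned)(item >> 32);
            unsigned slot = hash_n(key, fn, table_size);
            item = atomicExch(&table[slot], item);   // swap my item with the occupant
            if (item == EMPTY) return;               // displaced nothing: done
            // The evicted item was sitting in a slot given by one of its hash
            // functions; find which one and reinsert it with the next function.
            unsigned evicted_key = (unsigned)(item >> 32);
            int used = 0;
            for (int h = 0; h < NUM_HASH; ++h)
                if (hash_n(evicted_key, h, table_size) == slot) { used = h; break; }
            fn = (used + 1) % NUM_HASH;
        }
        *failed = 1;   // too many evictions: rebuild with new hash functions
    }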
  34. 34. Hashing: Big Ideas• Classic serial hashing techniques are a poor fit for a GPU. • Serialization, load balance• Solving this problem required a different algorithm • Both hashing algorithms were new to the parallel literature • Hybrid algorithm was entirely new
  35. 35. Task Queues
  36. 36. Representing Smooth Surfaces• Polygonal Meshes • Lack view adaptivity • Inefficient storage/transfer• Parametric Surfaces • Collections of smooth patches • Hard to animate• Subdivision Surfaces • Widely popular, easy for modeling • Flexible animation [Boubekeur and Schlick 2007]
  37. 37. Parallel Traversal Anjul Patney and John D. Owens. Real-Time Reyes- Style Adaptive Surface Subdivision. ACM Transactions on Graphics, 27(5):143:1–143:8, December 2008.
  38. 38. Cracks & T-Junctions Uniform Refinement Adaptive Refinement Crack T-JunctionPrevious work was post-process after all steps; ours was inherent within subdivision process
  39. 39. Recursive Subdivision is Irregular [figure: a patch recursively split into sub-patches of differing subdivision depth] Patney, Ebeida, and Owens. “Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces”. HPG ’09.
  40. 40. Static Task List [diagram: per-SM input task lists feed the SMs; outputs are appended to a single output list through an atomic pointer, and the kernel is restarted on the output] Daniel Cederman and Philippas Tsigas. On Dynamic Load Balancing on Graphics Processors. Graphics Hardware 2008, June 2008.
  41. 41. Private Work Queue Approach • Allocate private work queue of tasks per core • Each core can add to or remove work from its local queue • Cores mark self as idle if {queue exhausts storage, queue is empty} • Cores periodically check global idle counter • If global idle counter reaches threshold, rebalance work. gProximity: Fast Hierarchy Operations on GPU Architectures, Lauterbach, Mo, and Manocha, EG ’10
  42. 42. Work Stealing & Donating [diagram: each SM owns a deque with locked I/O at both ends; idle SMs steal from, or donate to, other deques] • Cederman and Tsigas: Stealing == best performance and scalability (follows Arora CPU-based work) • We showed how to do this with multiple kernels in an uberkernel and persistent-thread programming style • We added donating to minimize memory usage
  43. 43. Ingredients for Our Scheme: implementation questions that we need to address: What is the proper granularity for tasks? (Warp-Size Work Granularity) How many threads to launch? (Persistent Threads) How to avoid global synchronizations? (Uberkernels) How to distribute tasks evenly? (Task Donation)
  44. 44. Warp Size Work Granularity • Problem: We want to emulate task level parallelism on the GPU without loss in efficiency. • Solution: we choose block sizes of 32 threads / block. • Removes messy synchronization barriers. • Can view each block as a MIMD thread. We call these blocks processors. [diagram: several 32-thread processors P resident on each physical processor]
  45. 45. Uberkernel Processor Utilization • Problem: Want to eliminate global kernel barriers for better processor utilization • Uberkernels pack multiple execution routes into one kernel. [diagram: a two-stage pipeline (Stage 1, Stage 2, with data flow between them) folded into a single uberkernel containing both stages]
  46. 46. Persistent Thread Scheduler Emulation • Problem: if the input is irregular, how many threads do we launch? • Launch enough to fill the GPU, and keep them alive so they keep fetching work. Life of a persistent thread: Spawn → Fetch Data → Process Data → Write Output → (repeat) → Death. How do we know when to stop? When there is no more work left.
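A minimal sketch of how the pieces above (warp-sized blocks, uberkernel, persistent threads) fit together. Task fields, names, and the two execution routes are placeholders; a real scheduler also pushes newly generated tasks and must detect that every queue is empty before retiring.

    // Persistent-thread uberkernel: launch only enough 32-thread blocks to fill
    // the machine, and let each block loop, claiming tasks from a global queue
    // until the queue is drained.
    struct Task { int type; int patch_id; };
    enum { TASK_SPLIT = 0, TASK_DICE = 1 };

    __global__ void uberkernel(const Task *queue, int num_tasks, int *next_task)
    {
        __shared__ int my_task;
        while (true) {
            if (threadIdx.x == 0)
                my_task = atomicAdd(next_task, 1);   // fetch: claim the next task index
            __syncthreads();
            if (my_task >= num_tasks) return;        // no more work: this block retires

            Task t = queue[my_task];
            if (t.type == TASK_SPLIT) {
                // execution route 1: e.g., test a patch and emit child tasks
            } else {                                 // TASK_DICE
                // execution route 2: e.g., tessellate the patch and write output
            }
            __syncthreads();                         // all 32 threads finish before the next fetch
        }
    }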
  47. 47. Task Donation • When a bin is full, a processor can give work to someone else. [diagram: processors P1–P4; this processor’s bin is full and it donates its work to someone else’s bin] Smaller memory usage, but more complicated.
  48. 48. Queuing Schemes [figure: work processed and idle time per processor for four schemes: (a) Block Queuing, (b) Distributed Queuing, (c) Task Stealing, (d) Task Donating; gray bars show load balance, the black line shows idle time] Stanley Tzeng, Anjul Patney, and John D. Owens. Task Management for Irregular-Parallel Workloads on the GPU. In Proceedings of High Performance Graphics 2010, pages 29–37, June 2010.
  49. 49. Smooth Surfaces, High Detail 16 samples per pixel >15 frames per second on GeForce GTX280
  50. 50. MapReduce
  51. 51. MapReduce http://m.blog.hu/dw/dwbi/image/2009/Q4/mapreduce_small.png
  52. 52. Why MapReduce?• Simple programming model• Parallel programming model• Scalable• Previous GPU work: neither multi-GPU nor out-of-core
  53. 53. Block Diagram [figure: GPMR block diagram] Jeff A. Stuart and John D. Owens. Multi-GPU MapReduce on GPU Clusters. In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium, May 2011.
  54. 54. Keys to Performance • Process data in chunks • More efficient transmission & computation • Also allows out of core • Overlap computation and communication • Accumulate • Partial Reduce
  55. 55. Partial Reduce • User-specified function combines pairs with the same key • Each map chunk spawns a partial reduction on that chunk • Only good when cost of reduction is less than cost of transmission • Good for ~larger number of keys • Reduces GPU->CPU bw
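A Thrust sketch of a partial reduce over one map chunk, using + as a stand-in for the user-specified combining function; names and types are illustrative, and out_keys/out_vals must be pre-sized by the caller.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>

    // Combine this chunk's key-value pairs that share a key before they
    // leave the GPU; returns the number of distinct keys emitted.
    size_t partial_reduce(thrust::device_vector<int> &keys,
                          thrust::device_vector<float> &vals,
                          thrust::device_vector<int> &out_keys,
                          thrust::device_vector<float> &out_vals)
    {
        thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
        auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                          out_keys.begin(), out_vals.begin());
        return ends.first - out_keys.begin();
    }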
  56. 56. Accumulate • Mapper has explicit knowledge about nature of key-value pairs • Each map chunk accumulates its key-value outputs with GPU-resident key-value accumulated outputs • Good for small number of keys • Also reduces GPU->CPU bw
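A sketch of accumulation for a small, known key range (word-occurrence style): each map chunk adds its counts directly into a GPU-resident accumulation array rather than emitting one pair per occurrence. The fixed key range and names are assumptions.

    // accum[] has one slot per possible key and stays resident on the GPU;
    // each chunk's mapper folds its output straight into it.
    __global__ void map_and_accumulate(const int *chunk_keys, int chunk_len,
                                       unsigned int *accum /* size = NUM_KEYS */)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= chunk_len) return;
        atomicAdd(&accum[chunk_keys[i]], 1u);   // accumulated output never leaves the GPU
    }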
  57. 57. Benchmarks—Which• Matrix Multiplication (MM)• Word Occurrence (WO)• Sparse-Integer Occurrence (SIO)• Linear Regression (LR)• K-Means Clustering (KMC)• (Volume Renderer—presented @ MapReduce ’10) Jeff A. Stuart, Cheng-Kai Chen, Kwan-Liu Ma, and John D. Owens. Multi-GPU Volume Rendering using MapReduce. In HPDC 10: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing / MAPREDUCE 10: The First International Workshop on MapReduce and its Applications, pages 841–848, June 2010
  58. 58. Benchmarks—Why• Needed to stress aspects of GPMR • Unbalanced work (WO) • Multiple emits/Non-uniform number of emits (LR, KMC, WO) • Sparsity of keys (SIO) • Accumulation (WO, LR, KMC) • Many key-value pairs (SIO) • Compute Bound Scalability (MM)
  59. 59. Benchmarks—Results
  60. 60. Benchmarks—Results
    Speedup for GPMR over Phoenix (vs. CPU) on our large (second-biggest) input data from our first set. The exception is MM, for which we use our small input set (Phoenix required almost twenty seconds to multiply two 1024 × 1024 matrices).
                      MM        KMC      LR      SIO     WO
    1-GPU Speedup     162.712   2.991    1.296   1.450   11.080
    4-GPU Speedup     559.209   11.726   4.085   2.322   18.441
    Speedup for GPMR over Mars (vs. GPU) on 4096 × 4096 Matrix Multiplication, an 8M-point K-Means Clustering, and a 512 MB Word Occurrence. These sizes represent the largest problems that can meet the in-core memory requirements of Mars.
                      MM        KMC       WO
    1-GPU Speedup     2.695     37.344    3.098
    4-GPU Speedup     10.760    129.425   11.709
  61. 61. Benchmarks—Results Good
  62. 62. Benchmarks—Results Good
  63. 63. Benchmarks—Results Good
  64. 64. MapReduce Summary• Time is right to explore diverse programming models on GPUs • Few threads -> many, heavy threads -> light• Lack of {bandwidth, GPU primitives for communication, access to NIC} are challenges• What happens if GPUs get a lightweight serial processor? • Future hybrid hardware is exciting!
  65. 65. Broader Thoughts
  66. 66. Nested Data Parallelism How do we generalize this? How do we abstract it? What are design alternatives? [figure: the COO matrix and sample-parallel composition examples from earlier, both handled with segmented scan and segmented reduction]
  67. 67. Hashing Tradeoffs between construction time, access time, and space? Incremental data structures?
  68. 68. Work Stealing & Donating Generalize and abstract task queues? [structural pattern?] Persistent threads / other sticky CUDA constructs? Whole pipelines? Priorities / real-time constraints? [diagram: per-SM deques with locked I/O, as before] • Cederman and Tsigas: Stealing == best performance and scalability • We showed how to do this with an uberkernel and persistent thread programming style • We added donating to minimize memory usage
  69. 69. Horizontal vs. Vertical Development [diagram: three software stacks. CPU: Applications → Higher-Level Libraries → Algorithm/Data Structure Libraries → Programming System Primitives → Hardware. GPU (usual): App 1, App 2, App 3 sit directly on Programming System Primitives → Hardware: little code reuse! GPU (our goal): Apps → Primitive Libraries (domain-specific, algorithm, data structure, etc.) → Programming System Primitives → Hardware] [CUDPP library]
  70. 70. Energy Cost per Operation (from Bill Dally talk, September 2009)
    Operation                          Energy (pJ)
    64b Floating FMA (2 ops)           100
    64b Integer Add                    1
    Write 64b DFF                      0.5
    Read 64b Register (64 x 32 bank)   3.5
    Read 64b RAM (64 x 2K)             25
    Read tags (24 x 2K)                8
    Move 64b 1mm                       6
    Move 64b 20mm                      120
    Move 64b off chip                  256
    Read 64b from DRAM                 2000
  71. 71. Dwarfs • 1. Dense Linear Algebra • 2. Sparse Linear Algebra • 3. Spectral Methods • 4. N-Body Methods • 5. Structured Grids • 6. Unstructured Grids • 7. MapReduce • 8. Combinational Logic • 9. Graph Traversal • 10. Dynamic Programming • 11. Backtrack and Branch-and-Bound • 12. Graphical Models • 13. Finite State Machines
  72. 72. Dwarfs • 1. Dense Linear Algebra • 2. Sparse Linear Algebra • 3. Spectral Methods • 4. N-Body Methods • 5. Structured Grids • 6. Unstructured Grids • 7. MapReduce • 8. Combinational Logic • 9. Graph Traversal • 10. Dynamic Programming • 11. Backtrack and Branch-and-Bound • 12. Graphical Models • 13. Finite State Machines
  73. 73. The Programmable Pipeline Bricks & mortar: how do we allow programmers to build stages without worrying about assembling them together? [diagram: the same three pipelines as before: raster (Input Assembly, Vertex Shading, Tess. Shading, Geom. Shading, Raster, Frag. Shading, Compose), Reyes (Split, Dice, Shading, Sampling, Composition), and ray tracing (Ray Generation, Ray Traversal, Ray-Primitive Intersection, Shading)]
  74. 74. The Programmable Pipeline [diagram: the same three pipelines, annotated with “[Irregular] Parallelism” and “Heterogeneity”] What does it look like when you build applications with GPUs as the primary processor?
  75. 75. Owens Group Interests • Irregular parallelism • Fundamental data structures and algorithms • Computational patterns for parallelism • Optimization: self-tuning and tools • Abstract models of GPU computation • Multi-GPU computing: programming models and abstractions
  76. 76. Thanks to ...• Nathan Bell, Michael Garland, David Luebke, Sumit Gupta, Dan Alcantara, Anjul Patney, and Stanley Tzeng for helpful comments and slide material.• Funding agencies: Department of Energy (SciDAC Institute for Ultrascale Visualization, Early Career Principal Investigator Award), NSF, Intel Science and Technology Center for Visual Computing, LANL, BMW, NVIDIA, HP, UC MICRO, Microsoft, ChevronTexaco, Rambus
  77. 77. “If you were plowing a field, whichwould you rather use? Two strong oxen or 1024 chickens?” —Seymour Cray
  78. 78. Parallel Prefix Sum (Scan)• Given an array A = [a0, a1, …, an-1] and a binary associative operator ⊕ with identity I,• scan(A) = [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]• Example: if ⊕ is addition, then scan on the set • [3 1 7 0 4 1 6 3]• returns the set • [0 3 4 11 11 15 16 22]
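The slide's example, run through Thrust's exclusive scan with + as ⊕ and 0 as the identity:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <cstdio>

    int main()
    {
        int a[8] = {3, 1, 7, 0, 4, 1, 6, 3};
        thrust::device_vector<int> d(a, a + 8);
        thrust::exclusive_scan(d.begin(), d.end(), d.begin());   // in-place
        for (int i = 0; i < 8; ++i) printf("%d ", (int)d[i]);    // 0 3 4 11 11 15 16 22
        printf("\n");
        return 0;
    }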
  79. 79. Segmented Scan• Example: if ⊕ is addition, then scan on the set • [3 1 7 | 0 4 1 | 6 3]• returns the set • [0 3 4 | 0 0 4 | 0 6]• Same computational complexity as scan, but additionally have to keep track of segments (we use head flags to mark which elements are segment heads)• Useful for nested data parallelism (quicksort)
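One way to build a segmented exclusive scan from stock Thrust primitives, reproducing the slide's example; head flags are converted to segment ids first. This is an illustrative sketch, not the CUDPP implementation.

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // data  [3 1 7 | 0 4 1 | 6 3], head flags [1 0 0 1 0 0 1 0]
    // result: [0 3 4 | 0 0 4 | 0 6]
    void segmented_exclusive_scan(thrust::device_vector<int> &data,
                                  const thrust::device_vector<int> &head_flags)
    {
        thrust::device_vector<int> segment_ids(data.size());
        thrust::inclusive_scan(head_flags.begin(), head_flags.end(),
                               segment_ids.begin());               // 1 1 1 2 2 2 3 3
        thrust::exclusive_scan_by_key(segment_ids.begin(), segment_ids.end(),
                                      data.begin(), data.begin()); // restarts at each segment
    }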
