[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

http://cs264.org

http://goo.gl/1K2fI

Transcript of "[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)"

  1. –  Data-independent tasks
     –  Tasks with statically-known data dependences
     –  SIMD divergence
     –  Lacking fine-grained synchronization
     –  Lacking writeable, coherent caches
  3. DEVICE            32-bit Key-value Sorting (10^6 pairs/sec)    Keys-only Sorting (10^6 keys/sec)
     NVIDIA GTX 280    449 (3.8x speedup*)                          534 (2.9x speedup*)

     * Satish et al., "Designing efficient sorting algorithms for manycore GPUs," in IPDPS 09
  4. DEVICE            32-bit Key-value Sorting (10^6 pairs/sec)    Keys-only Sorting (10^6 keys/sec)
     NVIDIA GTX 480    775                                          1005
     NVIDIA GTX 280    449                                          534
     NVIDIA 8800 GT    129                                          171
  5. DEVICE                              32-bit Key-value Sorting (10^6 pairs/sec)    Keys-only Sorting (10^6 keys/sec)
     NVIDIA GTX 480                      775                                          1005
     NVIDIA GTX 280                      449                                          534
     NVIDIA 8800 GT                      129                                          171
     Intel Knights Ferry MIC 32-core*                                                 560
     Intel Core i7 quad-core*                                                         240
     Intel Core-2 quad-core*                                                          138

     * Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.
  6. 6.  
  7. Input → [Thread] [Thread] [Thread] [Thread] → Output
     –  Each output is dependent upon a finite subset of the input
        •  Threads are decomposed by output element
        •  The output (and at least one input) index is a static function of thread-id
  8. Input → ? → Output
     –  Each output element has dependences upon any / all input elements
     –  E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.
  9. –  Threads are decomposed by output element
     –  Repeatedly iterate over recycled input streams
     –  Output stream size is statically known before each pass
     (Diagram: two successive passes of [Thread] [Thread] [Thread] [Thread])
  10. –  O(n) global work from passes of pairwise-neighbor-reduction ( + + + + )
      –  Static dependences, uniform output
  11. –  Repeated pairwise swapping
         •  Bubble sort is O(n^2)
         •  Bitonic sort is O(n log^2 n)
      –  Need partitioning: dynamic, cooperative allocation
      –  Repeatedly check each vertex or edge
         •  Breadth-first search becomes O(V^2)
         •  O(V+E) is work-optimal
      –  Need queue: dynamic, cooperative allocation
  13. –  Variable output per thread
      –  Need dynamic, cooperative allocation
  14. Input → twelve [Thread]s → ? → Output
      •  Where do I put something in a list? Where do I enqueue something?
         –  Duplicate removal
         –  Search space exploration
         –  Sorting
         –  Graph traversal
         –  Histogram compilation
         –  General work queues
  15. •  For 30,000 producers and consumers?
      –  Locks serialize everything
  16-18. Allocation via prefix scan:

      Input (& allocation requirement):   2   1   0   3   2
      Result of prefix scan (sum):        0   2   3   3   6
      Output:                             0   1   2   3   4   5   6   7

      –  O(n) work
      –  For allocation: use scan results as a scattering vector
      –  Popularized by Blelloch et al. in the '90s
      –  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009
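A minimal sketch of this pattern, using Thrust for the scan and a hypothetical scatter kernel (names such as alloc_req, offsets, and allocate_and_scatter are illustrative, not from the lecture):

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // Each thread writes its variable-length output starting at the offset produced by
    // an exclusive prefix sum of the per-thread allocation requirements.
    __global__ void scatter(const int *alloc_req, const int *offsets,
                            const int *input, int *output, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            int base = offsets[tid];                   // where my outputs begin
            for (int i = 0; i < alloc_req[tid]; ++i)   // variable output per thread
                output[base + i] = input[tid];         // illustrative payload
        }
    }

    // Caller must size `output` to hold the total (last offset + last requirement).
    void allocate_and_scatter(const thrust::device_vector<int> &alloc_req,
                              const thrust::device_vector<int> &input,
                              thrust::device_vector<int> &output)
    {
        int n = alloc_req.size();
        thrust::device_vector<int> offsets(n);
        // Exclusive scan turns requirements {2,1,0,3,2} into offsets {0,2,3,3,6}
        thrust::exclusive_scan(alloc_req.begin(), alloc_req.end(), offsets.begin());
        scatter<<<(n + 255) / 256, 256>>>(
            thrust::raw_pointer_cast(alloc_req.data()),
            thrust::raw_pointer_cast(offsets.data()),
            thrust::raw_pointer_cast(input.data()),
            thrust::raw_pointer_cast(output.data()), n);
    }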
  19-21. Example: splitting keys on one digit bit into the "0s" and "1s" bins:

      Key sequence:                1110  0011  1010  0111  1100  1000  0101  0001

      Allocation requirements
        0s flags:                  1     0     1     0     1     1     0     0
        1s flags:                  0     1     0     1     0     0     1     1

      Scanned allocations (bin relocation offsets)
        0s:                        0     1     1     2     2     3     4     4
        1s:                        0     0     1     1     2     2     2     3

      Adjusted allocations (global relocation offsets)
        0s:                        0     1     1     2     2     3     4     4
        1s:                        4     4     5     5     6     6     6     7

      Scatter offsets per key:     0     4     1     5     2     3     6     7
      Output key sequence:         1110  1010  1100  1000  0011  0111  0101  0001
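A minimal single-block sketch of one such binary split step (one bit of the radix digit); the shared-memory Hillis-Steele scan and the names split_by_bit, scan, etc. are illustrative, not the lecture's kernels:

    // One binary "split" pass over a single tile of keys, executed by one threadblock.
    // flag = 1 if the key's selected bit is 0; a scan of the flags gives each 0-bin key
    // its offset, and (num_zeros + rank among 1s) gives each 1-bin key its offset.
    template <int TILE_SIZE>
    __global__ void split_by_bit(const unsigned *keys_in, unsigned *keys_out, int bit)
    {
        __shared__ int scan[TILE_SIZE];
        int tid = threadIdx.x;
        unsigned key = keys_in[tid];
        int is_zero = ((key >> bit) & 1) ? 0 : 1;

        // Inclusive Hillis-Steele scan of the 0-flags in shared memory
        scan[tid] = is_zero;
        __syncthreads();
        for (int offset = 1; offset < TILE_SIZE; offset *= 2) {
            int val = (tid >= offset) ? scan[tid - offset] : 0;
            __syncthreads();
            scan[tid] += val;
            __syncthreads();
        }

        int num_zeros = scan[TILE_SIZE - 1];     // total keys going to the 0s bin
        int rank_zero = scan[tid] - is_zero;     // exclusive rank among the 0s
        int rank_one  = tid - rank_zero;         // exclusive rank among the 1s
        int dest = is_zero ? rank_zero : num_zeros + rank_one;
        keys_out[dest] = key;
    }
    // e.g., split_by_bit<256><<<1, 256>>>(d_in, d_out, 0);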
  22. 22.  
  23-24. Un-fused vs. fused (Host Program ↔ GPU, with data staged in Global Device Memory):

      Un-fused:  Determine allocation size → CUDPP scan → CUDPP scan → CUDPP scan → Distribute output
      Fused:     Determine allocation → Scan → Scan → Scan → Distribute output
  25. 1.  Heavy SMT (over-threading) yields usable "bubbles" of free computation
      2.  Propagate live data between steps in fast registers / smem
      3.  Use scan (or variant) as a "runtime" for everything
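As a rough illustration of points 1-2 (a sketch under assumed names like fused_tile_scan, not the lecture's code): fusing a per-tile reduce and scan into one kernel keeps the loaded items live in registers and stages only the per-thread partials through smem, instead of writing intermediates back to global memory between separate kernels.

    // Fused "reduce then scan" over one tile: each thread keeps its loaded values in
    // registers across both steps; only the final results touch global memory.
    // Note: this scans within a tile only; a full scan would also combine tiles.
    template <int THREADS, int ITEMS_PER_THREAD>
    __global__ void fused_tile_scan(const int *in, int *out)
    {
        __shared__ int partials[THREADS];
        int base = blockIdx.x * THREADS * ITEMS_PER_THREAD + threadIdx.x * ITEMS_PER_THREAD;

        // Step 1: serial reduction in registers (no global-memory round trip afterwards)
        int items[ITEMS_PER_THREAD];
        int thread_total = 0;
        #pragma unroll
        for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
            items[i] = in[base + i];
            thread_total += items[i];
        }

        // Step 2: block-wide exclusive scan of per-thread totals, staged through smem
        partials[threadIdx.x] = thread_total;
        __syncthreads();
        int exclusive = 0;
        for (int t = 0; t < threadIdx.x; ++t)    // naive serial scan, for clarity only
            exclusive += partials[t];

        // Step 3: each thread scans its own items locally and writes the fused result
        #pragma unroll
        for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
            out[base + i] = exclusive;
            exclusive += items[i];
        }
    }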
  27. Device            Memory Bandwidth    Compute Throughput        Memory wall      Memory wall
                        (10^9 bytes/s)      (10^9 thread-cycles/s)    (bytes/cycle)    (instrs/word)
      GTX 480           169.0               672.0                     0.251            15.9
      GTX 285           159.0               354.2                     0.449            8.9
      GTX 280           141.7               311.0                     0.456            8.8
      Tesla C1060       102.0               312.0                     0.327            12.2
      9800 GTX+         70.4                235.0                     0.300            13.4
      8800 GT           57.6                168.0                     0.343            11.7
      9800 GT           57.6                168.0                     0.343            11.7
      8800 GTX          86.4                172.8                     0.500            8.0
      Quadro FX 5600    76.8                152.3                     0.504            7.9
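The two "memory wall" columns follow directly from the first two; a small helper (illustrative only) makes the arithmetic explicit for 32-bit words:

    // Bytes the device can move per thread-cycle, and the resulting instruction budget
    // per 32-bit word if a kernel is to stay memory-bound rather than compute-bound.
    static double bytes_per_cycle(double bw_gbytes_s, double compute_gcycles_s) {
        return bw_gbytes_s / compute_gcycles_s;
    }
    static double instrs_per_word(double bw_gbytes_s, double compute_gcycles_s) {
        return 4.0 / bytes_per_cycle(bw_gbytes_s, compute_gcycles_s);   // 4 bytes per word
    }
    // GTX 480: bytes_per_cycle(169.0, 672.0) ≈ 0.251, instrs_per_word(169.0, 672.0) ≈ 15.9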
  29-32. (Chart: Thread-Instructions / 32-bit scan element vs. Problem Size (millions); series: GTX285 r+w memory wall (17.8 instructions per input word), Data Movement Skeleton, Our Scan Kernel; annotation: "Insert work here")
      –  Increase granularity / redundant computation
         •  ghost cells
         •  radix bits
      –  Orthogonal kernel fusion
  33. (Chart: Thread-Instructions / 32-bit scan element vs. Problem Size (millions); series: CUDPP Scan Kernel vs. Our Scan Kernel)
  34. –  Partially-coalesced writes
      –  2x write overhead
      –  4 total concurrent scan operations (radix 16)
      (Chart: Thread-Instructions / 32-bit scan element vs. Problem Size (millions); series: GTX285 Radix Scatter Kernel Wall, GTX285 Scan Kernel Wall, Our Scan Kernel; annotation: "Insert work here")
  35. –  Need kernels with tunable local (or redundant) work
         •  ghost cells
         •  radix bits
      (Chart: Thread-instructions / 32-bit word vs. Problem Size (millions); series: GTX480 Radix Scatter Kernel Wall, GTX285 Radix Scatter Kernel Wall)
  36. 36.  
  37. –  Virtual processors abstract a diversity of hardware configurations
      –  Leads to a host of inefficiencies
      –  E.g., only several hundred CTAs
  39. Grid A:  … threadblocks, grid-size = (N / tilesize) CTAs
      Grid B:  … threadblocks, grid-size = 150 CTAs (or other small constant)
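A sketch of the "Grid B" style (the 150-CTA constant is from the slide; the kernel name and the simple round-robin tile assignment are illustrative — the even-share assignment on the later slides is an alternative): launch a small, fixed number of CTAs and let each one loop over many tiles, instead of launching one CTA per tile.

    // "Grid B": a fixed, small grid where each CTA serially processes many tiles.
    __global__ void process_tiles(const int *in, int *out, int num_tiles, int tile_size)
    {
        // The tile loop replaces the huge one-CTA-per-tile grid of "Grid A".
        for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
            int idx = tile * tile_size + threadIdx.x;
            out[idx] = in[idx] * 2;   // placeholder per-element work
        }
    }
    // Grid A style:  process_tiles<<<num_tiles, tile_size>>>(d_in, d_out, num_tiles, tile_size);
    // Grid B style:  process_tiles<<<150, tile_size>>>(d_in, d_out, num_tiles, tile_size);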
  40-42. (… threadblock)
      –  Thread-dependent predicates
      –  Setup and initialization code (notably for smem)
      –  Offset calculations (notably for smem)
         –  Common values are hoisted and kept live
         –  Spills are really bad
  43. log_tilesize(N)-level tree  vs.  Two-level tree
      –  O(N / tilesize) gmem accesses
      –  2-4 instructions per access (offset calcs, load, store)
      –  GPU is least efficient here: get it over with as quick as possible
  45. (Chart: Thread-instructions / Element vs. Grid Size (# of threadblocks); series: Compute Load, 285 Scan Kernel Wall)
  46. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
      –  16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
      –  conditional evaluation
      –  singleton loads
  48. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
      –  floor(16.1M / (1024 * 150)) = 109 tiles per CTA
      –  16.1M % (1024 * 150) = 136.4 extra tiles
  49. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
      –  floor(16.1M / (1024 * 150)) = 109 tiles per CTA (14 CTAs)
      –  109 + 1 = 110 tiles per CTA (136 CTAs)
      –  16.1M % (1024 * 150) = 0.4 extra tiles
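A small host-side sketch of this "even-share" assignment (the function and variable names are illustrative): split the tiles so that some CTAs take one extra tile, leaving at most a fraction of a tile of leftover work.

    #include <cstdio>

    // Even-share grain assignment: C CTAs, N elements, T-element tiles.
    // The first `big_ctas` CTAs process (tiles_per_cta + 1) tiles; the rest process tiles_per_cta.
    void even_share(long long N, int C, int T)
    {
        long long total_tiles   = (N + T - 1) / T;    // ceil(N / T)
        long long tiles_per_cta = total_tiles / C;    // e.g., floor(16.1M / (1024 * 150)) = 109
        long long big_ctas      = total_tiles % C;    // CTAs taking one extra tile, e.g., ~136
        printf("%lld CTAs x %lld tiles, %lld CTAs x %lld tiles\n",
               big_ctas, tiles_per_cta + 1, C - big_ctas, tiles_per_cta);
    }
    // even_share((long long)(16.1 * (1 << 20)), 150, 1024);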
  50. 50.  
  51. –  If you breathe on your code, run it through the VP
         •  Kernel runtimes
         •  Instruction counts
      –  Indispensable for tuning
         •  Host-side timing requires too many iterations
         •  Only 1-2 cudaprof iterations for consistent counter-based perf data
      –  Write tools to parse the output
         •  "Dummy" kernels useful for demarcation (see the sketch below)
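A tiny illustration of the "dummy kernel" trick (entirely illustrative, not the lecture's code): an empty, distinctively named kernel launched between phases shows up as its own row in the profiler log, which makes it easy for parsing scripts to delimit phases.

    // Empty marker kernel: contributes ~nothing to runtime, but its name appears in the
    // profiler output, demarcating the kernels launched before it from those after it.
    __global__ void phase_marker_sort_pass() {}

    // ... launch the real kernels for one phase ...
    // phase_marker_sort_pass<<<1, 1>>>();   // fence line in the cudaprof log
    // ... launch the real kernels for the next phase ...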
  52. (Chart: Sorting Rate (10^6 keys/sec) vs. Problem size (millions); devices: GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, 9800 GTX+)
  53. (Chart: Sorting Rate (millions of pairs/sec) vs. Problem size (millions); devices: GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, 9800 GTX+)
  54. (Chart: Kernel Bandwidth (GiBytes/sec) vs. Problem Size (millions); series: merrill_tree Reduce, merrill_rts Scan)
  55. (Chart: Kernel Bandwidth (Bytes x10^9 / sec) vs. Problem Size (millions); series: merrill_linear Reduce, merrill_linear Scan)
  56. –  Implement device "memcpy" for tile-processing
         •  Optimize for "full tiles"
      –  Specialize for different SM versions, input types, etc.
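A sketch of such a tile-processing "memcpy" (illustrative names, not the lecture's kernels): the unguarded full-tile path is fully unrolled, and only the final, possibly partial tile pays for per-element bounds checks.

    // Device "memcpy" structured as tile processing: full tiles take the unguarded,
    // unrolled fast path; only the last partial tile uses guarded loads/stores.
    template <int THREADS, int ITEMS_PER_THREAD>
    __global__ void tile_copy(const int *in, int *out, int num_elements)
    {
        const int TILE = THREADS * ITEMS_PER_THREAD;

        for (int tile_base = blockIdx.x * TILE; tile_base < num_elements;
             tile_base += gridDim.x * TILE)
        {
            if (tile_base + TILE <= num_elements) {
                // Full tile: no per-element conditionals
                #pragma unroll
                for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
                    int idx = tile_base + i * THREADS + threadIdx.x;
                    out[idx] = in[idx];
                }
            } else {
                // Partial (last) tile: guarded
                for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
                    int idx = tile_base + i * THREADS + threadIdx.x;
                    if (idx < num_elements) out[idx] = in[idx];
                }
            }
        }
    }
    // e.g., tile_copy<128, 4><<<150, 128>>>(d_in, d_out, n);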
  57. –  Use templated code to generate various instances
      –  Run with cudaprof env vars to collect data
  58. (Charts: Bandwidth (GiBytes/sec) vs. Words Copied (millions), 128-thread CTAs.
       One-way, 64B ld: Single, Single (no-overlap), Double, Double (no-overlap), Quad, Quad (no-overlap), vs. cudaMemcpy().
       Two-way, 128B ld/st: Single, Single (no-overlap), Double, Double (no-overlap), Quad, Quad (no-overlap), Intrinsic Copy.)
  59. 59.  
  60. (Diagram: two warp-scan circuits over shared memory cells m0..m11 — a Brent-Kung-style network on the left, a Kogge-Stone-style network on the right, the latter padded with identity values i)
      –  SIMD lanes wasted on O(n)-work Brent-Kung (left), but less work when n > warp size
      –  Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size
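For reference, a modern shuffle-based rendition of the Kogge-Stone-style warp scan on the right (the 2011-era original staged partials through identity-padded shared memory; this sketch is illustrative, not the lecture's code):

    // Inclusive Kogge-Stone scan across one warp of 32 lanes using warp shuffles.
    // Each step doubles the reach; out-of-range contributions are masked by the lane test.
    __device__ int warp_scan_inclusive(int x)
    {
        const unsigned FULL_MASK = 0xffffffffu;
        int lane = threadIdx.x & 31;
        #pragma unroll
        for (int offset = 1; offset < 32; offset *= 2) {
            int y = __shfl_up_sync(FULL_MASK, x, offset);
            if (lane >= offset) x += y;
        }
        return x;   // lane i now holds x0 + ... + xi
    }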
  62. (Diagram: tree-based cooperation, with multiple barriers among all T threads, vs. raking-based cooperation, with a single barrier followed by a small set of raking threads t0..t3 working serially)
  64. –  Barriers make O(n) code O(n log n)
      –  The rest are "DMA engine" threads
      –  Use threadblocks to cover pipeline latencies, e.g., for Fermi:
         •  2 worker warps per CTA
         •  6-7 CTAs
      (Diagram: a DMA stage in which all T threads t0..tT-1 load data, followed by a small worker/raking stage of threads t0..t3)
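A sketch of this raking arrangement (illustrative, not the lecture's kernels): every thread acts as a "DMA engine" to stage the tile into shared memory, and after a single barrier only a few worker threads rake serially over shared-memory segments, e.g., before handing their partials to a warp scan like the one above.

    // Raking reduction of one TILE-element tile: all THREADS threads load ("DMA"),
    // then only RAKING_THREADS of them reduce serially over contiguous smem segments.
    template <int THREADS, int RAKING_THREADS, int ITEMS_PER_THREAD>
    __global__ void raking_reduce(const int *in, int *block_sums)
    {
        const int TILE = THREADS * ITEMS_PER_THREAD;
        const int SEG  = TILE / RAKING_THREADS;        // smem elements per raking thread
        __shared__ int smem[TILE];
        __shared__ int partials[RAKING_THREADS];

        // DMA phase: every thread stages its items into shared memory
        int base = blockIdx.x * TILE;
        for (int i = threadIdx.x; i < TILE; i += THREADS)
            smem[i] = in[base + i];
        __syncthreads();                               // the single barrier

        // Raking phase: a few worker threads reduce their segments serially in registers
        if (threadIdx.x < RAKING_THREADS) {
            int sum = 0;
            for (int i = 0; i < SEG; ++i)
                sum += smem[threadIdx.x * SEG + i];
            partials[threadIdx.x] = sum;               // would normally feed a warp scan
        }

        // Simplified finish: one thread combines the raking partials into the block total
        __syncthreads();
        if (threadIdx.x == 0) {
            int total = 0;
            for (int r = 0; r < RAKING_THREADS; ++r) total += partials[r];
            block_sums[blockIdx.x] = total;
        }
    }
    // e.g., raking_reduce<128, 32, 8><<<150, 128>>>(d_in, d_block_sums);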
  65. 65.  
  66. –  Different SMs (varied local storage: registers/smem)
      –  Different input types (e.g., sorting chars vs. ulongs)
      –  # of steps for each algorithm phase is configuration-driven
      –  Template expansion + Constant-propagation + Static loop unrolling + Preprocessor macros
      –  Compiler produces a target assembly that is well-tuned for the specifically targeted hardware and problem
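A small sketch of this metaprogramming style (the policy fields, numbers, and names are illustrative): a compile-time tuning policy drives loop trip counts, so constant propagation and static unrolling specialize the generated code per architecture and per key type.

    // A compile-time "tuning policy": its constants become literals after template
    // expansion, so loops fully unroll and per-configuration branches disappear.
    template <int _THREADS, int _ITEMS_PER_THREAD>
    struct TuningPolicy
    {
        enum { THREADS = _THREADS, ITEMS_PER_THREAD = _ITEMS_PER_THREAD,
               TILE = _THREADS * _ITEMS_PER_THREAD };
    };

    template <typename Policy, typename KeyType>
    __global__ void tuned_copy(const KeyType *in, KeyType *out)
    {
        int base = blockIdx.x * Policy::TILE + threadIdx.x;
        KeyType items[Policy::ITEMS_PER_THREAD];

        #pragma unroll                               // static unrolling: trip count is a constant
        for (int i = 0; i < Policy::ITEMS_PER_THREAD; ++i)
            items[i] = in[base + i * Policy::THREADS];

        #pragma unroll
        for (int i = 0; i < Policy::ITEMS_PER_THREAD; ++i)
            out[base + i * Policy::THREADS] = items[i];
    }

    // Different specializations for different SMs / key types, selected at compile time:
    typedef TuningPolicy<128, 8> Sm20LargeKeys;      // illustrative numbers
    typedef TuningPolicy<256, 4> Sm13LargeKeys;
    // tuned_copy<Sm20LargeKeys, unsigned long long><<<grid, Sm20LargeKeys::THREADS>>>(d_in, d_out);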
  67. (Diagram: a subsequence of keys — 211 122 302 232 300 021 022 013 021 123 330 023 130 220 020 301 112 221 023 322 003 012 022 130 010 121 323 020 101 212 220 333 — decoded into per-digit flag vectors (0s, 1s, 2s, 3s), which threads t0..tT-1 then reduce serially in registers)