
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

npinto
Apr. 30, 2011

  1. –  Data-independent tasks –  Tasks with statically-known data dependences –  SIMD divergence –  Lacking fine-grained synchronization –  Lacking writeable, coherent caches
  2. (Same as slide 1.)
  3. 32-bit sorting rates:
     Device            32-bit key-value sorting (10^6 pairs/sec)   Keys-only sorting (10^6 keys/sec)
     NVIDIA GTX 280    449 (3.8x speedup*)                         534 (2.9x speedup*)
     * Satish et al., "Designing efficient sorting algorithms for manycore GPUs," IPDPS '09
  4. 32-bit sorting rates:
     Device            32-bit key-value sorting (10^6 pairs/sec)   Keys-only sorting (10^6 keys/sec)
     NVIDIA GTX 480    775                                         1005
     NVIDIA GTX 280    449                                         534
     NVIDIA 8800 GT    129                                         171
  5. 32-bit sorting rates, GPUs vs. CPUs/MIC:
     Device                              32-bit key-value sorting (10^6 pairs/sec)   Keys-only sorting (10^6 keys/sec)
     NVIDIA GTX 480                      775                                         1005
     NVIDIA GTX 280                      449                                         534
     NVIDIA 8800 GT                      129                                         171
     Intel Knight's Ferry MIC 32-core*                                               560
     Intel Core i7 quad-core*                                                        240
     Intel Core-2 quad-core*                                                         138
     * Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.
  6.  
  7. [Diagram: Input → one thread per output element → Output]  –  Each output is dependent upon a finite subset of the input  •  Threads are decomposed by output element  •  The output (and at least one input) index is a static function of thread-id
  8. [Diagram: Input → ? → Output]  –  Each output element has dependences upon any / all input elements  –  E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.
  9. [Diagram: threads make repeated passes over recycled streams]  –  Threads are decomposed by output element  –  Repeatedly iterate over recycled input streams  –  Output stream size is statically known before each pass
  10. [Diagram: passes of pairwise-neighbor additions]  –  O(n) global work from passes of pairwise-neighbor-reduction  –  Static dependences, uniform output (see the reduction sketch below)
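A minimal host-side sketch (mine, not from the talk) of the pairwise-neighbor-reduction pattern above: each pass halves the stream, so the work summed over all passes is n/2 + n/4 + ... < n, i.e. O(n) total.

```cpp
#include <cstdio>
#include <vector>

// Repeated pairwise-neighbor reduction: outputs of pass k depend only on
// statically-known neighbors 2i and 2i+1 of the previous (recycled) stream.
int reduce_by_passes(std::vector<int> v) {
    while (v.size() > 1) {
        std::vector<int> next((v.size() + 1) / 2);   // output size known before the pass
        for (size_t i = 0; i < next.size(); ++i) {
            int a = v[2 * i];
            int b = (2 * i + 1 < v.size()) ? v[2 * i + 1] : 0;  // pad with identity
            next[i] = a + b;
        }
        v.swap(next);
    }
    return v.empty() ? 0 : v[0];
}

int main() {
    std::vector<int> data = {3, 1, 4, 1, 5, 9, 2, 6};
    printf("sum = %d\n", reduce_by_passes(data));    // prints 31
    return 0;
}
```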
  11. Sorting:  –  Repeated pairwise swapping  •  Bubble sort is O(n^2)  •  Bitonic sort is O(n log^2 n)  –  Need partitioning: dynamic, cooperative allocation.  Graph traversal:  –  Repeatedly check each vertex or edge  •  Breadth-first search becomes O(V^2)  •  O(V+E) is work-optimal  –  Need queue: dynamic, cooperative allocation
  12. (Same as slide 11.)
  13.    –  Variable output per thread –  Need dynamic, cooperative allocation
  14. [Diagram: Input → many threads → ? → Output]  •  Where do I put something in a list?  Where do I enqueue something?  –  Duplicate removal  –  Search space exploration  –  Sorting  –  Graph traversal  –  Histogram compilation  –  General work queues
  15. • For 30,000 producers and consumers? –  Locks serialize everything
  16. Input (& allocation requirement):  2  1  0  3  2
      Prefix sum:                        0  2  3  3  6
      –  O(n) work  –  For allocation: use scan results as a scattering vector  –  Popularized by Blelloch et al. in the '90s  –  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009
  17. [Diagram: one thread per input element]
      Input (& allocation requirement):  2  1  0  3  2
      Result of prefix scan (sum):       0  2  3  3  6
      –  O(n) work  –  For allocation: use scan results as a scattering vector  –  Popularized by Blelloch et al. in the '90s  –  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009
  18. [Diagram: threads scatter into the allocated output]
      Input (& allocation requirement):  2  1  0  3  2
      Result of prefix scan (sum):       0  2  3  3  6
      Output slots:                      0  1  2  3  4  5  6  7
      –  O(n) work  –  For allocation: use scan results as a scattering vector  –  Popularized by Blelloch et al. in the '90s  –  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009
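A minimal host-side sketch (not the author's code) of slides 16-18: an exclusive prefix sum over the per-thread allocation requirements yields both the total amount to allocate and each thread's scatter offset. The values match the slide's example.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Per-thread allocation requirements (from the slide's example).
    std::vector<int> need = {2, 1, 0, 3, 2};

    // Exclusive prefix sum: offset[i] = sum of need[0..i-1].
    std::vector<int> offset(need.size());
    int running = 0;
    for (size_t i = 0; i < need.size(); ++i) {
        offset[i] = running;   // thread i writes its items starting here
        running += need[i];
    }

    // 'running' is now the total output size to allocate (8 here).
    printf("total = %d, offsets =", running);
    for (int o : offset) printf(" %d", o);   // 0 2 3 3 6
    printf("\n");
    return 0;
}
```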
  19. Key sequence:          1110 0011 1010 0111 1100 1000 0101 0001
      Output key sequence:   1110 1010 1100 1000 0011 0111 0101 0001   (0s bin, then 1s bin)
  20. Key sequence:                                  1110 0011 1010 0111 1100 1000 0101 0001
      Allocation requirements (0s bin):              1    0    1    0    1    1    0    0
      Allocation requirements (1s bin):              0    1    0    1    0    0    1    1
      Scanned allocations (relocation offsets, 0s):  0    1    1    2    2    3    4    4
      Scanned allocations (relocation offsets, 1s):  0    0    1    1    2    2    2    3
  21. Adjusted allocations (global relocation offsets, 0s):  0  1  1  2  2  3  4  4
      Adjusted allocations (global relocation offsets, 1s):  4  4  5  5  6  6  6  7
      Per-key scatter offsets:  0  4  1  5  2  3  6  7
      Key sequence:             1110 0011 1010 0111 1100 1000 0101 0001
      Output key sequence:      1110 1010 1100 1000 0011 0111 0101 0001
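A host-side sketch (my variable names, not the GPU implementation) of the binary split in slides 19-21: scan the 0-flags and 1-flags, offset the 1s bin by the total number of 0s, and scatter. A real radix pass would do this per multi-bit digit, in parallel, on the device.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // 4-bit keys from the slide, split stably on the least-significant bit.
    std::vector<int> keys = {0xE, 0x3, 0xA, 0x7, 0xC, 0x8, 0x5, 0x1};
    int n = (int)keys.size();

    // Allocation requirement per bin, scanned exclusively.
    std::vector<int> off0(n), off1(n);
    int count0 = 0, count1 = 0;
    for (int i = 0; i < n; ++i) {
        off0[i] = count0;
        off1[i] = count1;
        count0 += ((keys[i] & 1) == 0);
        count1 += ((keys[i] & 1) == 1);
    }

    // Scatter: 0s keep their scanned offset, 1s are shifted past all the 0s.
    std::vector<int> out(n);
    for (int i = 0; i < n; ++i) {
        int dst = ((keys[i] & 1) == 0) ? off0[i] : count0 + off1[i];
        out[dst] = keys[i];
    }
    for (int k : out) printf("%X ", k);   // E A C 8 3 7 5 1, as on slide 21
    printf("\n");
    return 0;
}
```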
  22.  
  23. [Diagram: un-fused pipeline — the host program launches "Determine allocation size", three CUDPP scan kernels, and "Distribute output"; each step reads and writes global device memory]
  24. [Diagram: the same pipeline shown un-fused (host program plus separate CUDPP scan kernels) and fused (allocation, scan, and output-distribution steps combined into fewer kernel launches)]
  25. [Diagram: fused pipeline — Determine allocation → Scan → Scan → Scan → Distribute output, with the host program launching kernels against global device memory]
      1.  Heavy SMT (over-threading) yields usable "bubbles" of free computation
      2.  Propagate live data between steps in fast registers / smem
      3.  Use scan (or variant) as a "runtime" for everything
  26. (Same diagram and points as slide 25.)
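To make the fusion idea concrete, here is an illustrative CUDA sketch (my naming, not the author's code) contrasting an un-fused step that materializes every per-element allocation requirement in global memory with a fused upsweep that computes the same requirements and reduces them in shared memory, writing only one partial per CTA.

```cuda
#include <cuda_runtime.h>

constexpr int THREADS = 256;   // CTA size assumed by both kernels below

// Un-fused (illustrative): this kernel only materializes per-element
// allocation requirements in global memory; a separate scan kernel must
// then re-read all n values.  Every intermediate round-trips through DRAM.
__global__ void compute_requirements(const int *in, int *need, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) need[i] = (in[i] & 1);               // e.g. "keep odd elements"
}

// Fused upsweep sketch: each CTA computes its elements' requirements and
// reduces them in shared memory, emitting one partial per CTA.  Live data
// stays in registers / smem; only gridDim.x words (not n) reach DRAM.
__global__ void fused_requirement_reduce(const int *in, int *cta_partials, int n) {
    __shared__ int smem[THREADS];
    int i = blockIdx.x * THREADS + threadIdx.x;
    smem[threadIdx.x] = (i < n) ? (in[i] & 1) : 0;  // same predicate, never stored globally
    __syncthreads();
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {   // smem tree reduction
        if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) cta_partials[blockIdx.x] = smem[0];
}
// Launch sketch: fused_requirement_reduce<<<num_ctas, THREADS>>>(d_in, d_partials, n);
```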
  27. Device            Memory Bandwidth    Compute Throughput        Memory wall      Memory wall
                        (10^9 bytes/s)      (10^9 thread-cycles/s)    (bytes/cycle)    (instrs/word)
      GTX 480           169.0               672.0                     0.251            15.9
      GTX 285           159.0               354.2                     0.449            8.9
      GTX 280           141.7               311.0                     0.456            8.8
      Tesla C1060       102.0               312.0                     0.327            12.2
      9800 GTX+         70.4                235.0                     0.300            13.4
      8800 GT           57.6                168.0                     0.343            11.7
      9800 GT           57.6                168.0                     0.343            11.7
      8800 GTX          86.4                172.8                     0.500            8.0
      Quadro FX 5600    76.8                152.3                     0.504            7.9
  28. (Same table as slide 27.)
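The two "memory wall" columns can be re-derived from the first two (my arithmetic, shown for the GTX 480 row): bytes/cycle = memory bandwidth / compute throughput, and instructions per 32-bit word = 4 bytes / (bytes/cycle).

```cpp
#include <cstdio>

int main() {
    // GTX 480 row of the table, recomputed.
    double bandwidth  = 169.0;   // 10^9 bytes/s
    double throughput = 672.0;   // 10^9 thread-cycles/s

    double bytes_per_cycle = bandwidth / throughput;                 // ~0.251
    double instrs_per_word = 4.0 /* bytes per 32-bit word */ / bytes_per_cycle;  // ~15.9

    printf("bytes/cycle = %.3f, instrs per 32-bit word = %.1f\n",
           bytes_per_cycle, instrs_per_word);
    return 0;
}
```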
  29. [Plot: thread-instructions per 32-bit scan element vs. problem size (millions); a horizontal line marks the GTX 285 r+w memory wall at 17.8 instructions per input word, with the region below it labeled "insert work here"]
  30. [Plot: as above, adding a "data movement skeleton" curve well below the memory wall]
  31. [Plot: as above, adding "our scan kernel", which sits between the data-movement skeleton and the memory wall]
  32. [Plot: as above]  –  Increase granularity / redundant computation  •  ghost cells  •  radix bits  –  Orthogonal kernel fusion
  33. [Plot: thread-instructions per 32-bit scan element vs. problem size (millions); the CUDPP scan kernel sits well above our scan kernel]
  34. [Plot: as above, adding the GTX 285 radix scatter kernel wall and GTX 285 scan kernel wall]  –  Partially-coalesced writes  –  2x write overhead  –  4 total concurrent scan operations (radix 16)
  35. [Plot: thread-instructions per 32-bit word vs. problem size (millions), showing the GTX 480 and GTX 285 radix scatter kernel walls]  –  Need kernels with tunable local (or redundant) work  •  ghost cells  •  radix bits
  36.  
  37. –  Virtual processors abstract a diversity of hardware configurations –  Leads to a host of inefficiencies –  E.g., only several hundred CTAs
  38. (Same as slide 37.)
  39. [Diagram]  Grid A: grid-size = (N / tilesize) CTAs.  Grid B: grid-size = 150 CTAs (or other small constant).  (A launch sketch of both styles follows.)
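A hedged launch sketch of the two grid styles (illustrative kernels, not the talk's code): Grid A uses one CTA per tile, so the grid size depends on N; Grid B uses a small fixed grid (e.g. 150 CTAs) and each CTA loops over tiles, amortizing its setup costs across many tiles.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 1024;   // elements per tile (illustrative)

// Grid A style: one CTA per tile; grid size depends on N.
__global__ void one_tile_per_cta(const float *in, float *out, int n) {
    int tile_begin = blockIdx.x * TILE;
    int tile_end   = min(tile_begin + TILE, n);
    for (int j = tile_begin + threadIdx.x; j < tile_end; j += blockDim.x)
        out[j] = in[j] * 2.0f;
}

// Grid B style: a small fixed number of CTAs; each CTA loops over as many
// tiles as needed, so per-CTA setup and offset calculations are amortized.
__global__ void fixed_grid(const float *in, float *out, int n) {
    for (int tile_begin = blockIdx.x * TILE; tile_begin < n;
         tile_begin += gridDim.x * TILE) {
        int tile_end = min(tile_begin + TILE, n);
        for (int j = tile_begin + threadIdx.x; j < tile_end; j += blockDim.x)
            out[j] = in[j] * 2.0f;
    }
}

// Launch sketch:
//   one_tile_per_cta<<<(n + TILE - 1) / TILE, 128>>>(d_in, d_out, n);   // Grid A
//   fixed_grid      <<<150,                   128>>>(d_in, d_out, n);   // Grid B
```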
  40. –  Thread-dependent predicates  –  Setup and initialization code (notably for smem)  –  Offset calculations (notably for smem)  –  Common values are hoisted and kept live
  41. (Same as slide 40.)
  42. –  Thread-dependent predicates  –  Setup and initialization code (notably for smem)  –  Offset calculations (notably for smem)  –  Common values are hoisted and kept live  –  Spills are really bad
  43. log_tilesize(N)-level tree vs. two-level tree  –  O(N / tilesize) gmem accesses  –  GPU is least efficient here: get it over with as quickly as possible  –  2-4 instructions per access (offset calcs, load, store)
  44. (Same as slide 43.)
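A simplified three-kernel sketch of the two-level "reduce-then-scan" structure (one element per thread, n a multiple of the tile size, and at most one CTA's worth of tiles; an illustration, not Merrill's implementation): an upsweep reduces each tile to a partial, a single-CTA spine kernel scans the partials, and a downsweep re-reads each tile and emits its exclusive scan seeded with the scanned partial.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK = 256;   // threads per CTA == elements per tile (simplification)

// Block-wide exclusive scan in shared memory (simple Hillis-Steele sketch;
// O(n log n) work but easy to verify).  Returns this thread's exclusive
// prefix; *total receives the block-wide sum.
__device__ int block_exclusive_scan(int x, int *total) {
    __shared__ int buf[2][BLOCK];
    int cur = 0;
    buf[cur][threadIdx.x] = x;
    __syncthreads();
    for (int offset = 1; offset < BLOCK; offset <<= 1) {
        int v = buf[cur][threadIdx.x];
        if (threadIdx.x >= offset) v += buf[cur][threadIdx.x - offset];
        buf[1 - cur][threadIdx.x] = v;
        cur = 1 - cur;
        __syncthreads();
    }
    *total = buf[cur][BLOCK - 1];
    return buf[cur][threadIdx.x] - x;           // inclusive -> exclusive
}

// Upsweep: each CTA reduces its tile to one partial.
__global__ void upsweep(const int *in, int *partials) {
    int total;
    block_exclusive_scan(in[blockIdx.x * BLOCK + threadIdx.x], &total);
    if (threadIdx.x == 0) partials[blockIdx.x] = total;
}

// Spine: a single CTA scans the per-tile partials (num_tiles <= BLOCK here).
__global__ void spine(int *partials, int num_tiles) {
    int total;
    int x = (threadIdx.x < num_tiles) ? partials[threadIdx.x] : 0;
    int ex = block_exclusive_scan(x, &total);
    if (threadIdx.x < num_tiles) partials[threadIdx.x] = ex;
}

// Downsweep: each CTA re-reads its tile and writes its exclusive scan,
// seeded with the scanned partial of all preceding tiles.
__global__ void downsweep(const int *in, const int *partials, int *out) {
    int total;
    int i = blockIdx.x * BLOCK + threadIdx.x;
    int ex = block_exclusive_scan(in[i], &total);
    out[i] = ex + partials[blockIdx.x];
}

// Launch sketch (n = num_tiles * BLOCK, num_tiles <= BLOCK):
//   upsweep  <<<num_tiles, BLOCK>>>(d_in, d_partials);
//   spine    <<<1,         BLOCK>>>(d_partials, num_tiles);
//   downsweep<<<num_tiles, BLOCK>>>(d_in, d_partials, d_out);
```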
  45. [Plot: thread-instructions per element vs. grid size (# of threadblocks), comparing the compute load against the GTX 285 scan kernel wall]
  46. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
      –  16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA  –  conditional evaluation  –  singleton loads
  47. (Same as slide 46.)
  48. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
      –  floor(16.1M / (1024 * 150)) = 109 tiles per CTA  –  16.1M % (1024 * 150) = 136.4 extra tiles
  49. C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
      –  floor(16.1M / (1024 * 150)) = 109 tiles per CTA (14 CTAs)  –  109 + 1 = 110 tiles per CTA (136 CTAs)  –  16.1M % (1024 * 150) = 0.4 extra tiles
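The even-share numbers on slides 48-49 fall out of a few integer divisions; a small host sketch (my code, using the slide's N = 16.1M, T = 1024, C = 150) reproduces them.

```cpp
#include <cstdio>

int main() {
    long long n = (long long)(16.1 * 1024 * 1024);   // ~16.1M elements
    int tile = 1024, ctas = 150;

    long long whole_tiles = n / tile;                // full tiles in the input
    long long tail        = n % tile;                // the "0.4 of a tile" left over
    int base  = (int)(whole_tiles / ctas);           // 109 tiles per CTA
    int extra = (int)(whole_tiles % ctas);           // 136 CTAs take one extra tile

    printf("%d CTAs process %d tiles, %d CTAs process %d tiles, "
           "plus a partial tile of %lld elements\n",
           extra, base + 1, ctas - extra, base, tail);
    return 0;
}
```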
  50.  
  51. –  If you breathe on your code, run it through the VP (Visual Profiler)  •  Kernel runtimes  •  Instruction counts  –  Indispensable for tuning  •  Host-side timing requires too many iterations  •  Only 1-2 cudaprof iterations for consistent counter-based perf data  –  Write tools to parse the output  •  "Dummy" kernels useful for demarcation
  52. [Plot: sorting rate (10^6 keys/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, and 9800 GTX+]
  53. [Plot: sorting rate (millions of pairs/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, and 9800 GTX+]
  54. [Plot: kernel bandwidth (GiB/sec) vs. problem size (millions) for the merrill_tree Reduce and merrill_rts Scan kernels]
  55. [Plot: kernel bandwidth (10^9 bytes/sec) vs. problem size (millions) for the merrill_linear Reduce and merrill_linear Scan kernels]
  56. –  Implement device “memcpy” for tile-processing •  Optimize for “full tiles” –  Specialize for different SM versions, input types, etc.
  57. –  Use templated code to generate various instances –  Run with cudaprof env vars to collect data
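An illustrative templated tile-copy kernel in that spirit (the parameter names are mine, not the library's): the tile shape is a compile-time template parameter, the full-tile path is fully unrolled with no bounds checks, and only the final partial tile takes the guarded path. Different instantiations can then be generated and profiled under cudaprof.

```cuda
#include <cuda_runtime.h>

// Illustrative "device memcpy" for tile processing.  THREADS and
// ELEMENTS_PER_THREAD are template parameters so many variants can be
// generated and benchmarked; the full-tile path has no conditionals.
template <typename T, int THREADS, int ELEMENTS_PER_THREAD>
__global__ void tile_copy(const T *in, T *out, int n) {
    constexpr int TILE = THREADS * ELEMENTS_PER_THREAD;
    for (int tile_begin = blockIdx.x * TILE; tile_begin < n;
         tile_begin += gridDim.x * TILE) {
        if (tile_begin + TILE <= n) {
            // Full tile: fully unrolled, no per-element bounds checks.
            #pragma unroll
            for (int i = 0; i < ELEMENTS_PER_THREAD; ++i) {
                int idx = tile_begin + i * THREADS + threadIdx.x;
                out[idx] = in[idx];
            }
        } else {
            // Last, partial tile: guarded loads/stores.
            for (int i = 0; i < ELEMENTS_PER_THREAD; ++i) {
                int idx = tile_begin + i * THREADS + threadIdx.x;
                if (idx < n) out[idx] = in[idx];
            }
        }
    }
}

// Example instantiation, e.g. for profiling under cudaprof:
//   tile_copy<float, 128, 4><<<150, 128>>>(d_in, d_out, n);
```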
  58. [Plots: copy bandwidth (GiB/sec) vs. words copied (millions).  Top: one-way, 128-thread CTA with 64B loads — single, double, and quad loads, each with and without overlap, vs. cudaMemcpy().  Bottom: two-way, 128-thread CTA with 128B loads/stores — the same variants plus an intrinsic copy]
  59.  
  60. [Diagram: two warp-scan networks over elements x0..x7 — a work-efficient Brent-Kung circuit (left) and a Kogge-Stone circuit (right) whose rows accumulate running prefixes ⊕(x0..xk)]
      –  SIMD lanes wasted on O(n)-work Brent-Kung (left), but less work when n > warp size
      –  Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size
  61. (Same diagram and points as slide 60; a Kogge-Stone sketch follows.)
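A sketch of the Kogge-Stone pattern within one warp. Note this uses warp shuffles (__shfl_up_sync), which did not exist on the GPUs in this talk; the original code uses shared-memory warpscans, but the dataflow is the same.

```cuda
#include <cuda_runtime.h>

// Kogge-Stone inclusive scan across one warp: O(n log n) work for n = 32
// lanes, but only log2(32) = 5 steps and no barriers or shared memory.
__device__ int warp_inclusive_scan(int x) {
    const unsigned FULL_MASK = 0xffffffffu;
    int lane = threadIdx.x & 31;
    #pragma unroll
    for (int offset = 1; offset < 32; offset <<= 1) {
        int y = __shfl_up_sync(FULL_MASK, x, offset);
        if (lane >= offset) x += y;   // lanes below 'offset' keep their value
    }
    return x;                         // lane i now holds the sum of lanes 0..i
}

__global__ void scan_one_warp(const int *in, int *out) {
    out[threadIdx.x] = warp_inclusive_scan(in[threadIdx.x]);   // launch with <<<1, 32>>>
}
```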
  62. [Diagram: tree-based scan — every level is separated by a barrier across all T threads — vs. raking-based scan — a single barrier, after which a small group of threads (t0..t3) rakes serially]
  63. (Same as slide 62.)
  64. [Diagram: after the single barrier, only the worker threads rake; the remaining threads are idle]
      –  Barriers make O(n) code O(n log n)
      –  The rest are "DMA engine" threads
      –  Use threadblocks to cover pipeline latencies, e.g., for Fermi:  •  2 worker warps per CTA  •  6-7 CTAs
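A hedged sketch of the raking idea for a CTA-wide reduction (my simplification, not the talk's code): all T threads deposit one value into shared memory and meet at a single barrier; afterwards one "worker" warp serially rakes consecutive shared-memory segments and finishes with a warp reduction, so no further barriers are needed.

```cuda
#include <cuda_runtime.h>

constexpr int CTA_THREADS    = 128;                    // T
constexpr int RAKING_THREADS = 32;                     // one worker warp
constexpr int SEG            = CTA_THREADS / RAKING_THREADS;   // elements raked per worker

__global__ void raking_reduce(const int *in, int *out) {
    __shared__ int grid[CTA_THREADS];

    // All T threads place one element and meet at a single barrier.
    grid[threadIdx.x] = in[blockIdx.x * CTA_THREADS + threadIdx.x];
    __syncthreads();

    // Only the worker warp continues; conceptually the rest were just
    // "DMA engine" threads that existed to issue the loads above.
    if (threadIdx.x < RAKING_THREADS) {
        // Serially rake a contiguous segment of shared memory: O(SEG) adds,
        // no further barriers.
        int partial = 0;
        for (int i = 0; i < SEG; ++i)
            partial += grid[threadIdx.x * SEG + i];

        // Finish with a warp-level reduction (shuffle-based sketch).
        for (int offset = 16; offset > 0; offset >>= 1)
            partial += __shfl_down_sync(0xffffffffu, partial, offset);

        if (threadIdx.x == 0) out[blockIdx.x] = partial;
    }
}
// Launch sketch: raking_reduce<<<num_ctas, CTA_THREADS>>>(d_in, d_partials);
```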
  65.  
  66. –  Different SMs (varied local storage: registers/smem)  –  Different input types (e.g., sorting chars vs. ulongs)  –  # of steps for each algorithm phase is configuration-driven  –  Template expansion + constant propagation + static loop unrolling + preprocessor macros  –  The compiler produces target assembly that is well-tuned for the specific hardware and problem
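A minimal sketch of configuration-driven code generation (the type and policy names are invented, not from the talk's codebase): a tuning-policy struct carries the compile-time knobs, specialized per SM version and key type, and the kernel's loops and offset arithmetic fold to constants under template expansion.

```cuda
#include <cuda_runtime.h>

// Illustrative tuning policy: compile-time knobs selected per SM version
// and key type.  Template expansion + constant propagation + unrolling
// let the compiler emit code tuned for one specific configuration.
template <int SM_ARCH, typename KeyType>
struct TuningPolicy {                         // generic fallback
    static const int CTA_THREADS     = 128;
    static const int KEYS_PER_THREAD = 4;
    static const int RADIX_BITS      = 4;
};

template <>                                   // e.g. an sm_20 / 32-bit-key special case
struct TuningPolicy<200, unsigned int> {
    static const int CTA_THREADS     = 256;
    static const int KEYS_PER_THREAD = 8;
    static const int RADIX_BITS      = 5;
};

template <typename Policy, typename KeyType>
__global__ void tuned_kernel(const KeyType *in, KeyType *out, int n) {
    // The per-thread loop bound is a compile-time constant, so it is
    // fully unrolled and all offset arithmetic folds to constants.
    #pragma unroll
    for (int i = 0; i < Policy::KEYS_PER_THREAD; ++i) {
        int idx = (blockIdx.x * Policy::CTA_THREADS * Policy::KEYS_PER_THREAD)
                + i * Policy::CTA_THREADS + threadIdx.x;
        if (idx < n) out[idx] = in[idx];
    }
}

// Instantiation sketch:
//   typedef TuningPolicy<200, unsigned int> Policy;
//   tuned_kernel<Policy, unsigned int><<<grid, Policy::CTA_THREADS>>>(d_in, d_out, n);
```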