[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

http://cs264.org
http://goo.gl/1K2fI
Transcript

  • 1-2.
– Data-independent tasks
– Tasks with statically-known data dependences
– SIMD divergence
– Lacking fine-grained synchronization
– Lacking writeable, coherent caches
  • 3-5. Sorting throughput:

    Device                              32-bit key-value sorting   32-bit keys-only sorting
                                        (10^6 pairs/sec)           (10^6 keys/sec)
    NVIDIA GTX 480                      775                        1005
    NVIDIA GTX 280                      449 (3.8x speedup*)        534 (2.9x speedup*)
    NVIDIA 8800 GT                      129                        171
    Intel Knights Ferry MIC 32-core**                              560
    Intel Core i7 quad-core**                                      240
    Intel Core 2 quad-core**                                       138

    *  Satish et al., "Designing efficient sorting algorithms for manycore GPUs," IPDPS '09
    ** Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010
  • 6.  
  • 7.
[Diagram: Input → Thread, Thread, Thread, Thread → Output]
– Each output is dependent upon a finite subset of the input
  • Threads are decomposed by output element
  • The output (and at least one input) index is a static function of thread-id
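A minimal CUDA sketch of this static decomposition (the kernel name and the squaring operation are illustrative, not from the talk): each thread's output index is a fixed arithmetic function of its thread and block ids, so no inter-thread coordination is needed.

    // Static case: output index is a fixed function of thread-id.
    __global__ void ElementwiseSquare(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // static index function
        if (i < n) out[i] = in[i] * in[i];              // one output per thread
    }
    // Launch: ElementwiseSquare<<<(n + 255) / 256, 256>>>(d_in, d_out, n);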
  • 8.
[Diagram: Input → ? → Output]
– Each output element has dependences upon any / all input elements
– E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.
  • 9.
[Diagram: two passes, each with threads decomposed over the stream]
– Threads are decomposed by output element
– Repeatedly iterate over recycled input streams
– Output stream size is statically known before each pass
  • 10.
[Diagram: pairwise-neighbor reduction tree]
– O(n) global work from passes of pairwise-neighbor-reduction
– Static dependences, uniform output
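A minimal sketch of one such pairwise-neighbor-reduction pass, assuming a single 256-thread block (illustrative, not the talk's tuned kernel): each level halves the number of active threads, for O(n) total work.

    // Pairwise-neighbor reduction within one 256-thread block.
    __global__ void BlockReduceSum(const int *in, int *out)
    {
        __shared__ int s[256];
        int tid = threadIdx.x;
        s[tid] = in[tid];                      // one element per thread
        __syncthreads();
        for (int stride = 128; stride > 0; stride >>= 1) {
            if (tid < stride) s[tid] += s[tid + stride];  // pairwise combine
            __syncthreads();
        }
        if (tid == 0) *out = s[0];             // root of the reduction tree
    }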
  • 11-12.
Sorting:
– Repeated pairwise swapping
  • Bubble sort is O(n^2)
  • Bitonic sort is O(n log^2 n)
– Need partitioning: dynamic, cooperative allocation
Graph traversal:
– Repeatedly check each vertex or edge
  • Breadth-first search becomes O(V^2)
  • O(V+E) is work-optimal
– Need queue: dynamic, cooperative allocation
  • 13.
– Variable output per thread
– Need dynamic, cooperative allocation
  • 14.
[Diagram: Input → many threads → ? → Output]
• Where do I put something in a list? Where do I enqueue something?
– Duplicate removal
– Sorting
– Histogram compilation
– Search space exploration
– Graph traversal
– General work queues
  • 15.
– For 30,000 producers and consumers?
– Locks serialize everything
  • 16-18. Allocation via prefix sum:

    Input (& allocation requirement):   2   1   0   3   2
    Result of prefix scan (sum):        0   2   3   3   6
    Output slots:                       0   1   2   3   4   5   6   7

– O(n) work
– For allocation: use scan results as a scattering vector
– Popularized by Blelloch et al. in the '90s
– Merrill et al., "Parallel Scan for Stream Architectures," Technical Report CS2009-14, University of Virginia, 2009
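A minimal CUDA sketch of this scan-then-scatter allocation, assuming a single block and the slide's five-element example; the kernel name and the simple Hillis-Steele scan are illustrative stand-ins for the talk's tuned kernels.

    #include <cstdio>
    #define N 5

    __global__ void AllocateAndScatter(const int *need, int *offsets, int *total)
    {
        __shared__ int s[N];
        int tid = threadIdx.x;
        s[tid] = need[tid];
        __syncthreads();
        // Inclusive Hillis-Steele scan in shared memory
        for (int d = 1; d < N; d <<= 1) {
            int v = (tid >= d) ? s[tid - d] : 0;
            __syncthreads();
            s[tid] += v;
            __syncthreads();
        }
        offsets[tid] = s[tid] - need[tid];   // exclusive result = scatter offset
        if (tid == N - 1) *total = s[tid];   // total allocation size
    }

    int main()
    {
        int h_need[N] = {2, 1, 0, 3, 2};     // the slide's example
        int *d_need, *d_off, *d_total, h_off[N], h_total;
        cudaMalloc(&d_need, sizeof(h_need));
        cudaMalloc(&d_off, sizeof(h_need));
        cudaMalloc(&d_total, sizeof(int));
        cudaMemcpy(d_need, h_need, sizeof(h_need), cudaMemcpyHostToDevice);
        AllocateAndScatter<<<1, N>>>(d_need, d_off, d_total);
        cudaMemcpy(h_off, d_off, sizeof(h_off), cudaMemcpyDeviceToHost);
        cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < N; ++i) printf("%d ", h_off[i]);  // 0 2 3 3 6
        printf("(total %d)\n", h_total);
        return 0;
    }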
  • 19-21. One-bit split via scanned flag vectors:

    Key sequence:              1110  0011  1010  0111  1100  1000  0101  0001
    Allocation flags (0s):        1     0     1     0     1     1     0     0
    Allocation flags (1s):        0     1     0     1     0     0     1     1
    Scanned allocations (0s):     0     1     1     2     2     3     4     4
    Scanned allocations (1s):     0     0     1     1     2     2     2     3
    Adjusted (1s + #0s = 4):      4     4     5     5     6     6     6     7
    Global relocation offsets:    0     4     1     5     2     3     6     7
    Output key sequence:       1110  1010  1100  1000  0011  0111  0101  0001
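A hedged sketch of the one-bit split above, again assuming one block and the slide's eight keys; SplitByBit is a hypothetical name, and the flag scan reuses a simple shared-memory scan rather than the talk's multi-scan.

    #include <cstdio>
    #define N 8

    __global__ void SplitByBit(const unsigned *in, unsigned *out, int bit)
    {
        __shared__ int zeros[N];           // flag-scan storage for the 0s bin
        int tid = threadIdx.x;
        unsigned key = in[tid];
        int is_zero = ((key >> bit) & 1) == 0;
        zeros[tid] = is_zero;
        __syncthreads();
        // Inclusive Hillis-Steele scan of the 0-flags
        for (int d = 1; d < N; d <<= 1) {
            int v = (tid >= d) ? zeros[tid - d] : 0;
            __syncthreads();
            zeros[tid] += v;
            __syncthreads();
        }
        int total_zeros = zeros[N - 1];
        int rank0 = zeros[tid] - is_zero;                // rank among the 0s
        int rank1 = total_zeros + tid - zeros[tid];      // rank among the 1s,
        out[is_zero ? rank0 : rank1] = key;              //   past the 0s bin
    }

    int main()
    {
        // 1110 0011 1010 0111 1100 1000 0101 0001 (the slide's keys)
        unsigned h_in[N] = {0xE, 0x3, 0xA, 0x7, 0xC, 0x8, 0x5, 0x1};
        unsigned *d_in, *d_out, h_out[N];
        cudaMalloc(&d_in, sizeof(h_in));  cudaMalloc(&d_out, sizeof(h_in));
        cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
        SplitByBit<<<1, N>>>(d_in, d_out, 0);            // split on the low bit
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        for (int i = 0; i < N; ++i) printf("%x ", h_out[i]); // e a c 8 3 7 5 1
        printf("\n");
        return 0;
    }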
  • 22.  
  • 23-24. Un-fused vs. fused pipelines (host program at left, GPU kernels at right, with global device memory between launches):

Un-fused:
  Determine allocation size → CUDPP scan → CUDPP scan → CUDPP scan → Distribute output
Fused:
  [Determine allocation + Scan] → Scan → [Scan + Distribute output]
  • 25-26. Fused:
1. Heavy SMT (over-threading) yields usable "bubbles" of free computation
2. Propagate live data between steps in fast registers / smem
3. Use scan (or variant) as a "runtime" for everything
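To illustrate points 2 and 3, a minimal fused kernel, assuming single-tile stream compaction as the workload: the allocation flags, the scan, and the scattered output all happen in one launch, with intermediates held in registers and shared memory instead of round-tripping through global device memory. FusedCompactTile is an illustrative name, not the talk's code.

    #include <cstdio>
    #define TILE 8

    __global__ void FusedCompactTile(const int *in, int *out, int *out_count)
    {
        __shared__ int s[TILE];
        int tid = threadIdx.x;
        int x = in[tid];               // live data stays in a register
        int keep = (x != 0);           // step 1: determine allocation size
        s[tid] = keep;
        __syncthreads();
        for (int d = 1; d < TILE; d <<= 1) {   // step 2: scan as the "runtime"
            int v = (tid >= d) ? s[tid - d] : 0;
            __syncthreads();
            s[tid] += v;
            __syncthreads();
        }
        if (keep) out[s[tid] - 1] = x;          // step 3: distribute output
        if (tid == TILE - 1) *out_count = s[tid];
    }

    int main()
    {
        int h_in[TILE] = {4, 0, 7, 0, 0, 1, 9, 0}, h_out[TILE], h_n;
        int *d_in, *d_out, *d_n;
        cudaMalloc(&d_in, sizeof(h_in)); cudaMalloc(&d_out, sizeof(h_in));
        cudaMalloc(&d_n, sizeof(int));
        cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
        FusedCompactTile<<<1, TILE>>>(d_in, d_out, d_n);  // one launch, not three
        cudaMemcpy(&h_n, d_n, sizeof(int), cudaMemcpyDeviceToHost);
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        for (int i = 0; i < h_n; ++i) printf("%d ", h_out[i]);  // 4 7 1 9
        printf("\n");
        return 0;
    }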
  • 27-28. Memory wall by device:

    Device           Memory bandwidth   Compute throughput       Memory wall     Memory wall
                     (10^9 bytes/s)     (10^9 thread-cycles/s)   (bytes/cycle)   (instrs/word)
    GTX 480          169.0              672.0                    0.251           15.9
    GTX 285          159.0              354.2                    0.449            8.9
    GTX 280          141.7              311.0                    0.456            8.8
    Tesla C1060      102.0              312.0                    0.327           12.2
    9800 GTX+         70.4              235.0                    0.300           13.4
    8800 GT           57.6              168.0                    0.343           11.7
    9800 GT           57.6              168.0                    0.343           11.7
    8800 GTX          86.4              172.8                    0.500            8.0
    Quadro FX 5600    76.8              152.3                    0.504            7.9

(Here bytes/cycle = bandwidth / compute throughput, and instrs/word = 4 bytes / (bytes/cycle): the number of thread-instructions the SMs can issue in the time one 32-bit word streams through memory.)
  • 29-32.
[Plot: thread-instructions per 32-bit scan element vs. problem size (millions); the GTX 285 read+write memory wall sits at 17.8 instructions per input word, while a bare data-movement skeleton and our scan kernel both run below it, leaving room to "insert work here"]
– Increase granularity / redundant computation
  • ghost cells
  • radix bits
– Orthogonal kernel fusion
  • 33.
[Plot: thread-instructions per 32-bit scan element vs. problem size (millions); the CUDPP scan kernel runs well above our scan kernel]
  • 34.
[Plot: GTX 285 radix scatter kernel wall vs. GTX 285 scan kernel wall, with our scan kernel below both and room to "insert work here"]
– Partially-coalesced writes
– 2x write overhead
– 4 total concurrent scan operations (radix 16)
  • 35.
[Plot: GTX 480 vs. GTX 285 radix scatter kernel walls (thread-instructions per 32-bit word vs. problem size in millions)]
– Need kernels with tunable local (or redundant) work
  • ghost cells
  • radix bits
  • 36.  
  • 37-38.
– Virtual processors abstract a diversity of hardware configurations
– Leads to a host of inefficiencies
– E.g., only several hundred CTAs
  • 39.
Grid A: grid-size = (N / tilesize) CTAs
Grid B: grid-size = 150 CTAs (or other small constant)
  • 40-42.
– Thread-dependent predicates
– Setup and initialization code (notably for smem)
– Offset calculations (notably for smem)
  – Common values are hoisted and kept live
  – Spills are really bad
  • 43-44. log_tilesize(N)-level tree vs. two-level tree:
– O(N / tilesize) gmem accesses
– 2-4 instructions per access (offset calcs, load, store)
– GPU is least efficient here: get it over with as quickly as possible
  • 45.
[Plot: thread-instructions per element vs. grid size (number of threadblocks), showing compute load against the GTX 285 scan kernel wall]
  • 46-47. (C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA)
– 16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
– conditional evaluation
– singleton loads
  • 48. (C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA)
– floor(16.1M / (1024 * 150)) = 109 tiles per CTA
– 16.1M % (1024 * 150) = 136.4 extra tiles
  • 49. (C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA)
– floor(16.1M / (1024 * 150)) = 109 tiles per CTA (14 CTAs)
– 109 + 1 = 110 tiles per CTA (136 CTAs)
– 16.1M % (1024 * 150) = 0.4 extra tiles
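A host-side sketch of this even-share decomposition under the slides' constants (N = 16.1M, T = 1024, C = 150); the printed tile counts differ slightly from the slides' rounded figures, and the helper layout is illustrative.

    #include <cstdio>

    int main()
    {
        long long N = 16100000;   // problem size (the slides' "16.1M")
        int T = 1024;             // tile size
        int C = 150;              // number of CTAs

        long long total_tiles = (N + T - 1) / T;   // tiles overall
        long long B = total_tiles / C;             // base tiles per CTA
        long long extra = total_tiles % C;         // leftover tiles

        // The first `extra` CTAs process B+1 tiles; the rest process B.
        // Each CTA can compute its own contiguous tile range up front,
        // so the inner loop needs no per-tile conditionals.
        for (int cta = 0; cta < C; ++cta) {
            long long tiles = B + (cta < extra ? 1 : 0);
            long long first = cta * B + (cta < extra ? cta : extra);
            if (cta < 2 || cta == C - 1)           // print a small sample
                printf("CTA %3d: tiles [%lld, %lld)\n", cta, first, first + tiles);
        }
        return 0;
    }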
  • 50.  
  • 51.
– If you breathe on your code, run it through the VP
  • Kernel runtimes
  • Instruction counts
– Indispensable for tuning
  • Host-side timing requires too many iterations
  • Only 1-2 cudaprof iterations for consistent counter-based perf data
– Write tools to parse the output
  • "Dummy" kernels useful for demarcation
  • 52.
[Plot: keys-only sorting rate (10^6 keys/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, and 9800 GTX+]
  • 53.
[Plot: key-value sorting rate (millions of pairs/sec) vs. problem size (millions) for the same devices]
  • 54.
[Plot: kernel bandwidth (GiB/sec) vs. problem size (millions) for merrill_tree Reduce and merrill_rts Scan]
  • 55.
[Plot: kernel bandwidth (10^9 bytes/sec) vs. problem size (millions) for merrill_linear Reduce and merrill_linear Scan]
  • 56.
– Implement device "memcpy" for tile-processing
  • Optimize for "full tiles"
– Specialize for different SM versions, input types, etc.
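A hedged sketch of such a tile-processing copy kernel: a small constant grid strides over tiles, with an unrolled, unguarded fast path for full tiles and a guarded path only for the final partial tile. The parameters and kernel name are illustrative, not the talk's tuned values.

    #define CTA_THREADS 128
    #define ELEMS_PER_THREAD 4
    #define TILE (CTA_THREADS * ELEMS_PER_THREAD)

    __global__ void TileCopy(const int *in, int *out, long long n)
    {
        long long tiles = (n + TILE - 1) / TILE;
        // Each CTA strides through tiles; grid size is a small constant.
        for (long long t = blockIdx.x; t < tiles; t += gridDim.x) {
            long long base = t * TILE;
            if (base + TILE <= n) {
                // Full tile: no per-element bounds checks, unrolled
                #pragma unroll
                for (int i = 0; i < ELEMS_PER_THREAD; ++i) {
                    long long idx = base + i * CTA_THREADS + threadIdx.x;
                    out[idx] = in[idx];
                }
            } else {
                // Partial last tile: guarded accesses
                for (int i = 0; i < ELEMS_PER_THREAD; ++i) {
                    long long idx = base + i * CTA_THREADS + threadIdx.x;
                    if (idx < n) out[idx] = in[idx];
                }
            }
        }
    }

    int main()
    {
        const long long n = 1000000;
        int *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(int));
        cudaMalloc(&d_out, n * sizeof(int));
        TileCopy<<<150, CTA_THREADS>>>(d_in, d_out, n);  // small constant grid
        cudaDeviceSynchronize();
        return 0;
    }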
  • 57.
– Use templated code to generate various instances
– Run with cudaprof env vars to collect data
  • 58.
[Plots: copy bandwidth (GiB/sec) vs. words copied (millions) for 128-thread CTAs; one-way (64B loads) and two-way (128B load/store) variants at single / double / quad granularity, with and without overlap, compared against cudaMemcpy() and the intrinsic copy]
  • 59.  
  • 60-61.
[Diagrams: Brent-Kung scan network (left) vs. Kogge-Stone scan network (right)]
– SIMD lanes wasted on O(n)-work Brent-Kung (left), but less work when n > warp size
– Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size
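A minimal Kogge-Stone inclusive scan at warp width, the variant the slide favors when n ≤ warp size. This sketch uses __shfl_up_sync for clarity on current GPUs; the era's implementations used warp-synchronous shared memory instead.

    #include <cstdio>

    __global__ void WarpInclusiveScan(const int *in, int *out)
    {
        int lane = threadIdx.x;            // one warp: 32 threads
        int x = in[lane];
        #pragma unroll
        for (int d = 1; d < 32; d <<= 1) { // log2(32) = 5 steps
            int up = __shfl_up_sync(0xffffffff, x, d);
            if (lane >= d) x += up;        // combine (x_{lane-d} .. x_lane)
        }
        out[lane] = x;
    }

    int main()
    {
        int h[32], r[32], *d_in, *d_out;
        for (int i = 0; i < 32; ++i) h[i] = 1;    // scan of ones -> 1..32
        cudaMalloc(&d_in, sizeof(h)); cudaMalloc(&d_out, sizeof(h));
        cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);
        WarpInclusiveScan<<<1, 32>>>(d_in, d_out);
        cudaMemcpy(r, d_out, sizeof(r), cudaMemcpyDeviceToHost);
        printf("%d %d %d\n", r[0], r[15], r[31]); // 1 16 32
        return 0;
    }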
  • 62-63.
[Diagrams: tree-based scan, with barriers between tree levels across all T threads, vs. raking-based scan, where a small set of raking threads scans serially with a single barrier on either side]
  • 64.
– Barriers make O(n) code O(n log n)
– The rest are "DMA engine" threads
– Use threadblocks to cover pipeline latencies, e.g., for Fermi:
  • 2 worker warps per CTA
  • 6-7 CTAs
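A hedged sketch of the raking arrangement with a single worker warp: 128 threads stage the tile ("DMA engine" work), one warp rakes four elements apiece with serial reduce and scan around a warp-width Kogge-Stone, and two barriers replace the log-depth tree's many. Sizes and names are illustrative, not the tuned Fermi configuration.

    #include <cstdio>
    #define THREADS 128
    #define SEG (THREADS / 32)   // 4 elements raked per worker thread

    __global__ void RakingExclusiveScan(const int *in, int *out)
    {
        __shared__ int s[THREADS];
        int tid = threadIdx.x;
        s[tid] = in[tid];                    // all threads "DMA" the tile in
        __syncthreads();
        if (tid < 32) {                      // one worker warp rakes
            int base = tid * SEG, sum = 0;
            for (int i = 0; i < SEG; ++i) sum += s[base + i];  // serial reduce
            int x = sum;                     // Kogge-Stone across partials
            for (int d = 1; d < 32; d <<= 1) {
                int up = __shfl_up_sync(0xffffffff, x, d);
                if (tid >= d) x += up;
            }
            int seed = x - sum;              // exclusive prefix for my segment
            for (int i = 0; i < SEG; ++i) {  // serial scan back through smem
                int v = s[base + i];
                s[base + i] = seed;
                seed += v;
            }
        }
        __syncthreads();
        out[tid] = s[tid];                   // all threads "DMA" the tile out
    }

    int main()
    {
        int h[THREADS], r[THREADS], *din, *dout;
        for (int i = 0; i < THREADS; ++i) h[i] = 1;
        cudaMalloc(&din, sizeof(h)); cudaMalloc(&dout, sizeof(h));
        cudaMemcpy(din, h, sizeof(h), cudaMemcpyHostToDevice);
        RakingExclusiveScan<<<1, THREADS>>>(din, dout);
        cudaMemcpy(r, dout, sizeof(r), cudaMemcpyDeviceToHost);
        printf("%d %d %d\n", r[0], r[1], r[127]);  // 0 1 127
        return 0;
    }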
  • 65.  
  • 66.
– Different SMs (varied local storage: registers / smem)
– Different input types (e.g., sorting chars vs. ulongs)
– # of steps for each algorithm phase is configuration-driven
– Template expansion + constant propagation + static loop unrolling + preprocessor macros
– Compiler produces a target assembly that is well-tuned for the specifically targeted hardware and problem
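A minimal sketch of this meta-programming style with hypothetical names (TuneConfig, ProcessTile; not Back40's actual types): the configuration is entirely compile-time constants, so unrolling and constant propagation yield straight-line code specialized per instantiation.

    template <int LOG_THREADS, int LOG_LOADS_PER_THREAD>
    struct TuneConfig
    {
        enum {
            THREADS          = 1 << LOG_THREADS,
            LOADS_PER_THREAD = 1 << LOG_LOADS_PER_THREAD,
            TILE_ELEMENTS    = THREADS * LOADS_PER_THREAD
        };
    };

    template <typename Config, typename T>
    __global__ void ProcessTile(const T *in, T *out)
    {
        // Static loop bounds: constant propagation + unrolling produce
        // straight-line code with no loop overhead.
        #pragma unroll
        for (int i = 0; i < Config::LOADS_PER_THREAD; ++i) {
            int idx = i * Config::THREADS + threadIdx.x;
            out[idx] = in[idx] + 1;    // stand-in for real per-key work
        }
    }

    // Distinct instantiations for different hardware generations:
    typedef TuneConfig<7, 2> LargeSmConfig;   // 128 threads x 4 loads
    typedef TuneConfig<8, 1> SmallSmConfig;   // 256 threads x 2 loads

    int main()
    {
        unsigned int *d;
        cudaMalloc(&d, LargeSmConfig::TILE_ELEMENTS * sizeof(unsigned int));
        cudaMemset(d, 0, LargeSmConfig::TILE_ELEMENTS * sizeof(unsigned int));
        ProcessTile<LargeSmConfig><<<1, LargeSmConfig::THREADS>>>(d, d);
        cudaDeviceSynchronize();
        return 0;
    }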
  • 67.
[Worked diagram of the local multi-scan for one radix-4 digit place: a subsequence of keys is decoded into per-digit flag vectors (0s / 1s / 2s / 3s); serial reduction in registers, then in shared memory; a SIMD Kogge-Stone scan of the per-digit totals; serial scan back through shared memory and registers; the resulting local exchange offsets scatter the keys into digit order, and per-digit carry-in / carry-out values produce the global scatter offsets]
  • 68.
• Resource-allocation as runtime
1. Kernel memory-wall analysis (kernel fusion)
2. Algorithm serialization
3. Tune for data-movement
4. Warp-synchronous programming
5. Flexible granularity via meta-programming
  • 69.
– Back40Computing (a Google Code project)
  • http://code.google.com/p/back40computing/
– Default sorting method for Thrust
  • http://code.google.com/p/thrust/
  • 70.
– A single host-side procedure call launches a kernel that performs orthogonal program steps:
  MyUberKernel<<<grid_size, num_threads>>>(d_device_storage);
– No existing public repositories of kernel "subroutines" for scavenging
  • 71.
[Diagram: fused radix sorting kernel (digit extraction, local prefix scan, scatter): GATHER (key) / GATHER (value) → Extract radix digit → Encode flag bit (into flag vectors) → LOCAL MULTI-SCAN (flag vectors) → Decode local rank (from flag vectors) → EXCHANGE (key) / EXCHANGE (value) → Extract radix digit (again) → Update global radix digit partition offsets → SCATTER (key) / SCATTER (value)]
– Callbacks, iterators, visitors, functors, etc.
  • ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100));
– E.g., the fused kernel at left can't be composed using a callback-based pattern
  • 72.
– Compiled libraries suffer from code bloat
  • CUDPP primitives library is 100s of MBs, yet still doesn't support all built-in numeric types
  • Specializing for device configurations makes it even worse
– The alternative is to ship source for #include'ing
  • Have to be willing to share source
– Need a way to fit meta-programming in at the JIT / bytecode level to help avoid expansion / mismatch-by-omission
– Can leverage fundamentally different algorithms for different phases
  • How to teach the compiler to do this?