Similar to [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)
[Figure: regular parallelism (each thread maps a fixed set of Input elements to one Output element)]
– Each output is dependent upon a finite subset of the input
• Threads are decomposed by output element
• The output (and at least one input) index is a static function of thread-id
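As a minimal sketch of this regular case (illustrative, not from the talk; the kernel name and the elementwise operation are assumptions), one thread is assigned per output element:

__global__ void ElementwiseAdd(const float *a, const float *b, float *out, int n)
{
    // The output index is a static function of the thread id, and each output
    // depends on a finite, statically-known subset of the input (a[i], b[i]).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + b[i];
}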
[Figure: irregular parallelism (the dependences between Input and Output elements are unknown, "?")]
– Each output element has dependences upon any / all input elements
– E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.
– Threads are decomposed by output element
[Figure: threads performing repeated passes of pairwise-neighbor reduction (+) over a recycled input stream]
– Repeatedly iterate over recycled input streams
– Output stream size is statically known before each pass
– O(n) global work from passes of pairwise-neighbor-reduction
– Static dependences, uniform output allocation
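A minimal sketch of one such pass (illustrative names, not the talk's code): each thread owns one output, the output size n/2 is known before the launch, and the host relaunches the kernel on the shrinking stream until one element remains.

__global__ void PairwiseReducePass(const int *in, int *out, int out_n)
{
    // One reduction pass: out[i] = in[2i] + in[2i+1].
    // Static dependences, uniform (one-element) output allocation per thread.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < out_n)
        out[i] = in[2 * i] + in[2 * i + 1];
}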
– Repeated pairwise swapping
• Bubble sort is O(n²)
• Bitonic sort is O(n log² n)
– Need partitioning: dynamic, cooperative allocation
– Repeatedly check each vertex or edge
• Breadth-first search becomes O(V²)
• O(V+E) is work-optimal
– Need queue: dynamic, cooperative allocation
– Variable output per thread
– Need dynamic, cooperative allocation
[Figure: many threads with variable output per thread (the mapping from Input to Output is unknown, "?")]
• Where do I put something in a list? Where do I enqueue something?
– Duplicate removal
– Sorting
– Histogram compilation
– Search space exploration
– Graph traversal
– General work queues
[Figure: scan-based allocation example]
– Input (& allocation requirement): 2 1 0 3 2
– Result of prefix scan (sum): 0 2 3 3 6
– Output slots: 0 1 2 3 4 5 6 7
– O(n) work
– For allocation: use scan results as a scattering vector
– Popularized by Blelloch et al. in the '90s
– Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009
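A minimal sketch of that example (Thrust's exclusive_scan stands in for the CUDPP scan; the Distribute kernel and its payload are illustrative): the scanned offsets serve as the scattering vector that tells each producer where its variable-sized output begins.

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <vector>
#include <cstdio>

// Each producer i writes counts[i] outputs starting at offsets[i].
__global__ void Distribute(const int *counts, const int *offsets, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        for (int i = 0; i < counts[tid]; i++)
            out[offsets[tid] + i] = tid;   // payload here is just the producer's id
}

int main()
{
    std::vector<int> h_counts = {2, 1, 0, 3, 2};               // allocation requirements
    thrust::device_vector<int> counts(h_counts.begin(), h_counts.end());
    thrust::device_vector<int> offsets(counts.size());

    // Exclusive prefix sum: (2,1,0,3,2) -> (0,2,3,3,6), the scattering vector.
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    int total = (int)offsets.back() + (int)counts.back();      // 6 + 2 = 8 output slots
    thrust::device_vector<int> out(total);

    Distribute<<<1, 32>>>(thrust::raw_pointer_cast(counts.data()),
                          thrust::raw_pointer_cast(offsets.data()),
                          thrust::raw_pointer_cast(out.data()),
                          (int)counts.size());
    cudaDeviceSynchronize();

    for (int i = 0; i < total; i++)
        printf("%d ", (int)out[i]);                            // prints: 0 0 1 3 3 3 4 4
    printf("\n");
    return 0;
}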
[Figure: un-fused vs. fused pipelines. Un-fused: the host program launches "determine allocation size", a CUDPP scan, and "distribute output" as separate GPU kernels, each round-tripping through global device memory. Fused: the steps are combined into scan-centered kernels.]
1. Heavy SMT (over-threading) yields usable "bubbles" of free computation
2. Propagate live data between steps in fast registers / smem
3. Use scan (or variant) as a "runtime" for everything
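A minimal single-tile sketch of the fused style (illustrative, not Merrill's kernels): the allocation-sizing, scan, and distribution steps share one kernel, with the flags and offsets propagated through registers and shared memory rather than through global device memory.

#define CTA_THREADS 256

// Fused compaction for one tile of up to CTA_THREADS elements:
// selection, prefix scan, and scatter in a single kernel.
__global__ void FusedCompactTile(const int *in, int *out, int *out_count, int n)
{
    __shared__ int scan[CTA_THREADS];
    int tid = threadIdx.x;

    // Step 1 (determine allocation): does this element survive? (keep odd values)
    int flag = (tid < n && (in[tid] & 1)) ? 1 : 0;

    // Step 2 (scan): Kogge-Stone inclusive prefix sum of the flags, kept in smem.
    scan[tid] = flag;
    __syncthreads();
    for (int d = 1; d < CTA_THREADS; d <<= 1) {
        int addend = (tid >= d) ? scan[tid - d] : 0;
        __syncthreads();
        scan[tid] += addend;
        __syncthreads();
    }
    int offset = scan[tid] - flag;            // exclusive rank of this element

    // Step 3 (distribute output): scatter survivors using the scan as a scattering vector.
    if (flag)
        out[offset] = in[tid];
    if (tid == CTA_THREADS - 1)
        *out_count = scan[tid];
}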
[Figure: thread-instructions per 32-bit word vs. problem size (millions) for the radix scatter kernel, showing the instruction-overhead "wall" on the 480 and the 285]
– Need kernels with tunable local (or redundant) work
• ghost cells
• radix bits
– Virtual processors abstract a diversity of hardware configurations
– Leads to a host of inefficiencies
– E.g., only several hundred CTAs
[Figure: Grid A (one threadblock per tile, grid-size = (N / tilesize) CTAs) vs. Grid B (grid-size = 150 CTAs, or other small constant)]
– Thread-dependent predicates
– Setup and initialization code (notably for smem)
– Offset calculations (notably for smem)
– Common values are hoisted and kept live
– Spills are really bad
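A minimal sketch of this hoisting (assuming a persistent CTA launched with 256 threads; names are illustrative): the smem and gmem offsets are computed once before the tile loop and stay live in registers across every iteration instead of being recomputed (or spilled) per tile.

__global__ void PersistentCtaCopy(const int *in, int *out, int num_tiles)
{
    __shared__ int tile[256];                 // assumes the CTA is launched with 256 threads

    // Hoisted once per thread and kept live in registers for every tile:
    int *my_slot  = tile + threadIdx.x;                        // smem offset calculation
    int  stride   = gridDim.x * blockDim.x;                    // grid-wide stride between tiles
    int  gmem_idx = blockIdx.x * blockDim.x + threadIdx.x;     // gmem offset calculation

    for (int t = 0; t < num_tiles; t++, gmem_idx += stride) {
        *my_slot = in[gmem_idx];
        __syncthreads();
        // ... per-tile cooperative processing would go here, reusing the hoisted values ...
        out[gmem_idx] = *my_slot;
        __syncthreads();
    }
}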
– log_tilesize(N)-level tree vs. two-level tree
– O(N / tilesize) gmem accesses
– 2-4 instructions per access (offset calcs, load, store)
– GPU is least efficient here: get it over with as quickly as possible
– C = number of CTAs, N = problem size, T = tile size, B = tiles per CTA
– Naive: 16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA
• conditional evaluation
• singleton loads
– Whole tiles: floor(16.1M / (1024 * 150)) = 109 tiles per CTA, with (16.1M % (1024 * 150)) / 1024 ≈ 136.4 extra tiles
– Even-share (sketched below): 109 tiles per CTA (14 CTAs), 109 + 1 = 110 tiles per CTA (136 CTAs), leaving ~0.4 of a tile extra
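A minimal host-side sketch of this even-share accounting (illustrative names and an approximate N, not the talk's code): a few CTAs take one extra whole tile, and the sub-tile remainder is left for a final partial tile.

#include <cstdio>

struct EvenShare {
    int tiles_per_cta;        // baseline whole tiles for the "small" CTAs (e.g. 109)
    int big_ctas;             // CTAs that take one extra whole tile        (e.g. 136)
    int last_tile_elements;   // leftover elements, i.e. the final partial tile
};

EvenShare ComputeEvenShare(long long n, int tile_size, int num_ctas)
{
    EvenShare es;
    long long full_tiles  = n / tile_size;                // whole tiles only
    es.tiles_per_cta      = int(full_tiles / num_ctas);
    es.big_ctas           = int(full_tiles % num_ctas);
    es.last_tile_elements = int(n % tile_size);           // the "~0.4 of a tile" remainder
    return es;
}

int main()
{
    // Roughly the slide's example: ~16.1M keys, 1024-element tiles, 150 CTAs.
    EvenShare es = ComputeEvenShare(16882074LL, 1024, 150);
    printf("%d CTAs take %d tiles, %d CTAs take %d tiles, %d elements in the last partial tile\n",
           es.big_ctas, es.tiles_per_cta + 1,
           150 - es.big_ctas, es.tiles_per_cta,
           es.last_tile_elements);
    return 0;
}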
– If you breathe on your code, run it through the VP (visual profiler)
• Kernel runtimes
• Instruction counts
– Indispensable for tuning
• Host-side timing requires too many iterations
• Only 1-2 cudaprof iterations for consistent counter-based perf data
– Write tools to parse the output
• “Dummy” kernels useful for demarcation
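One way to get that demarcation (a sketch; the kernel names are made up): launch trivially empty, distinctly named kernels between phases so each phase boundary appears as its own row in the profiler output and is easy for parsing tools to key on.

// Empty kernels whose only purpose is to label phase boundaries in profiler output.
__global__ void MarkScanPhase()    {}
__global__ void MarkScatterPhase() {}

// Host-side usage sketch (scan_kernel / scatter_kernel are placeholders):
//   MarkScanPhase<<<1, 1>>>();
//   scan_kernel<<<grid, cta>>>(...);
//   MarkScatterPhase<<<1, 1>>>();
//   scatter_kernel<<<grid, cta>>>(...);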
[Figure: Brent-Kung (left) vs. Kogge-Stone (right) scan networks over x0..x7, with the ⊕ combining steps at each time step]
– SIMD lanes wasted on O(n)-work Brent-Kung (left), but less work when n > warp size
– Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size
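A minimal sketch of a Kogge-Stone-style scan at warp width (it uses warp shuffles, a mechanism newer than the shared-memory formulation from the era of these slides): with n equal to the 32-thread warp size there are only log2(32) = 5 steps, no barriers, and no idle SIMD lanes.

__device__ int WarpScanInclusive(int x)
{
    // Kogge-Stone inclusive prefix sum across one warp: O(n log n) work,
    // but every step keeps all 32 lanes busy.
    int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1) {
        int y = __shfl_up_sync(0xffffffffu, x, offset);   // value from (lane - offset)
        if (lane >= offset)
            x += y;
    }
    return x;
}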
[Figure: tree-based scan (a barrier after every level, with all T threads participating) vs. raking-based scan (a single barrier, after which only a few raking threads do the serial work)]
– Barriers make O(n) code O(n log n)
– Only the raking "worker" threads compute after the barrier; the rest are "DMA engine" threads
– Use threadblocks to cover pipeline latencies, e.g., for Fermi:
• 2 worker warps per CTA
• 6-7 CTAs
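A minimal sketch of the raking idea for a CTA-wide reduction (illustrative, not the talk's scan kernels): every thread acts as a DMA engine and deposits one element into shared memory, then a single worker warp rakes serially over smem segments and finishes warp-synchronously, so the kernel needs only one barrier.

#define CTA_THREADS 128
#define SEG_LENGTH  (CTA_THREADS / 32)   // smem elements raked by each worker lane

__global__ void RakingReduce(const int *in, int *block_sums)
{
    __shared__ int smem[CTA_THREADS];

    // All threads are "DMA engines": each loads one element into shared memory.
    smem[threadIdx.x] = in[blockIdx.x * CTA_THREADS + threadIdx.x];
    __syncthreads();                       // the only barrier in the kernel

    // Only one warp of "worker" threads continues; the rest are done.
    if (threadIdx.x < 32) {
        // Serial raking: each worker lane reduces a contiguous smem segment.
        int sum = 0;
        for (int i = 0; i < SEG_LENGTH; i++)
            sum += smem[threadIdx.x * SEG_LENGTH + i];

        // Warp-level reduction of the 32 partial sums (no further barriers).
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffffu, sum, offset);

        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = sum;
    }
}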
– Different SMs (varied local storage: registers/smem)
– Different input types (e.g., sorting chars vs. ulongs)
– # of steps for each algorithm phase is configuration-driven
– Template expansion + constant propagation + static loop unrolling + preprocessor macros
– Compiler produces target assembly that is well-tuned for the specifically targeted hardware and problem
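A minimal sketch of this style (hypothetical tuning parameters, not Merrill's actual kernels): the CTA width, items per thread, and number of radix bits are template parameters, so constant propagation and static unrolling let the compiler emit a variant specialized for one hardware / key-type configuration.

// Per-CTA histogram of the low-order radix digit, specialized at compile time.
template <int CTA_THREADS, int ITEMS_PER_THREAD, int RADIX_BITS>
__global__ void DigitHistogram(const unsigned int *keys, unsigned int *global_hist)
{
    constexpr int RADIX_DIGITS = 1 << RADIX_BITS;
    __shared__ unsigned int counts[RADIX_DIGITS];

    // Zero the shared-memory counters.
    for (int d = threadIdx.x; d < RADIX_DIGITS; d += CTA_THREADS)
        counts[d] = 0;
    __syncthreads();

    // ITEMS_PER_THREAD is a compile-time constant, so this loop unrolls statically.
    int base = (blockIdx.x * CTA_THREADS + threadIdx.x) * ITEMS_PER_THREAD;
    #pragma unroll
    for (int i = 0; i < ITEMS_PER_THREAD; i++) {
        unsigned int digit = keys[base + i] & (RADIX_DIGITS - 1);
        atomicAdd(&counts[digit], 1u);
    }
    __syncthreads();

    // Fold this CTA's counts into the global histogram.
    for (int d = threadIdx.x; d < RADIX_DIGITS; d += CTA_THREADS)
        atomicAdd(&global_hist[d], counts[d]);
}

// A configuration is chosen per (hardware, key type), e.g.:
//   DigitHistogram<128, 8, 4><<<grid, 128>>>(d_keys, d_hist);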