SlideShare a Scribd company logo
Optimizing Parallel Reduction in CUDA
Mark Harris
NVIDIA Developer Technology
2
Parallel Reduction
Common and important data parallel primitive
Easy to implement in CUDA
Harder to get it right
Serves as a great optimization example
We’ll walk step by step through 7 different versions
Demonstrates several important optimization strategies
3
Parallel Reduction
Tree-based approach used within each thread block
Need to be able to use multiple thread blocks
To process very large arrays
To keep all multiprocessors on the GPU busy
Each thread block reduces a portion of the array
But how do we communicate partial results between
thread blocks?
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4
Problem: Global Synchronization
If we could synchronize across all thread blocks, could easily
reduce very large arrays, right?
Global sync after each block produces its result
Once all blocks reach sync, continue recursively
But CUDA has no global synchronization. Why?
Expensive to build in hardware for GPUs with high processor
count
Would force programmer to run fewer blocks (no more than #
multiprocessors * # resident blocks / multiprocessor) to avoid
deadlock, which may reduce overall efficiency
Solution: decompose into multiple kernels
Kernel launch serves as a global synchronization point
Kernel launch has negligible HW overhead, low SW overhead
5
Solution: Kernel Decomposition
Avoid global sync by decomposing computation
into multiple kernel invocations
In the case of reductions, code for all levels is the
same
Recursive kernel invocation
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
Level 0:
8 blocks
Level 1:
1 block
// Already tried this
6
What is Our Optimization Goal?
We should strive to reach GPU peak performance
Choose the right metric:
GFLOP/s: for compute-bound kernels
Bandwidth: for memory-bound kernels
Reductions have very low arithmetic intensity
1 flop per element loaded (bandwidth-optimal)
Therefore we should strive for peak bandwidth
Will use G80 GPU for this example
384-bit memory interface, 900 MHz DDR
384 * 1800 / 8 = 86.4 GB/s
7
Reduction #1: Interleaved Addressing
__global__ void reduce0(int *g_idata, int *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
// do reduction in shared mem
for(unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
8
Parallel Reduction: Interleaved Addressing
10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2
Values (shared memory)
0 2 4 6 8 10 12 14
11 1 7 -1 -2 -2 8 5 -5 -3 9 7 11 11 2 2
Values
0 4 8 12
18 1 7 -1 6 -2 8 5 4 -3 9 7 13 11 2 2
Values
0 8
24 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2
Values
0
41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2
Values
Thread
IDs
Step 1
Stride 1
Step 2
Stride 2
Step 3
Stride 4
Step 4
Stride 8
Thread
IDs
Thread
IDs
Thread
IDs
9
Reduction #1: Interleaved Addressing
__global__ void reduce1(int *g_idata, int *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
Problem: highly divergent
warps are very inefficient, and
% operator is very slow
10
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Note: Block Size = 128 threads for all tests
Bandwidth
Time (222 ints)
11
for (unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
for (unsigned int s=1; s < blockDim.x; s *= 2) {
int index = 2 * s * tid;
if (index < blockDim.x) {
sdata[index] += sdata[index + s];
}
__syncthreads();
}
Reduction #2: Interleaved Addressing
Just replace divergent branch in inner loop:
With strided index and non-divergent branch:
12
Parallel Reduction: Interleaved Addressing
10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2
Values (shared memory)
0 1 2 3 4 5 6 7
11 1 7 -1 -2 -2 8 5 -5 -3 9 7 11 11 2 2
Values
0 1 2 3
18 1 7 -1 6 -2 8 5 4 -3 9 7 13 11 2 2
Values
0 1
24 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2
Values
0
41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2
Values
Thread
IDs
Step 1
Stride 1
Step 2
Stride 2
Step 3
Stride 4
Step 4
Stride 8
Thread
IDs
Thread
IDs
Thread
IDs
New Problem: Shared Memory Bank Conflicts
13
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
14
Parallel Reduction: Sequential Addressing
10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2
Values (shared memory)
0 1 2 3 4 5 6 7
8 -2 10 6 0 9 3 7 -2 -3 2 7 0 11 0 2
Values
0 1 2 3
8 7 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2
Values
0 1
21 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2
Values
0
41 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2
Values
Thread
IDs
Step 1
Stride 8
Step 2
Stride 4
Step 3
Stride 2
Step 4
Stride 1
Thread
IDs
Thread
IDs
Thread
IDs
Sequential addressing is conflict free
15
for (unsigned int s=1; s < blockDim.x; s *= 2) {
int index = 2 * s * tid;
if (index < blockDim.x) {
sdata[index] += sdata[index + s];
}
__syncthreads();
}
for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
Reduction #3: Sequential Addressing
Just replace strided indexing in inner loop:
With reversed loop and threadID-based indexing:
// Already using this
16
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
Insert text here
17
for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
Idle Threads
Problem:
Half of the threads are idle on first loop iteration!
This is wasteful…
18
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
Reduction #4: First Add During Load
Halve the number of blocks, and replace single load:
With two loads and first add of the reduction:
19
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Kernel 4:
first add during global load
0.965 ms 17.377 GB/s 1.78x 8.34x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
20
Instruction Bottleneck
At 17 GB/s, we’re far from bandwidth bound
And we know reduction has low arithmetic intensity
Therefore a likely bottleneck is instruction overhead
Ancillary instructions that are not loads, stores, or
arithmetic for the core computation
In other words: address arithmetic and loop overhead
Strategy: unroll loops
21
Unrolling the Last Warp
As reduction proceeds, # “active” threads decreases
When s <= 32, we have only one warp left
Instructions are SIMD synchronous within a warp
That means when s <= 32:
We don’t need to __syncthreads()
We don’t need “if (tid < s)” because it doesn’t save any
work
Let’s unroll the last 6 iterations of the inner loop
__device__ void warpReduce(volatile int* sdata, int tid) {
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
// later…
for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
if (tid < s)
sdata[tid] += sdata[tid + s];
__syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);
22
Reduction #5: Unroll the Last Warp
Note: This saves useless work in all warps, not just the last one!
Without unrolling, all warps execute every iteration of the for loop and if statement
IMPORTANT:
For this to be correct,
we must use the
“volatile” keyword!
23
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Kernel 4:
first add during global load
0.965 ms 17.377 GB/s 1.78x 8.34x
Kernel 5:
unroll last warp
0.536 ms 31.289 GB/s 1.8x 15.01x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
24
Complete Unrolling
If we knew the number of iterations at compile time,
we could completely unroll the reduction
Luckily, the block size is limited by the GPU to 512 threads
Also, we are sticking to power-of-2 block sizes
So we can easily unroll for a fixed block size
But we need to be generic – how can we unroll for block
sizes that we don’t know at compile time?
Templates to the rescue!
CUDA supports C++ template parameters on device and
host functions
25
Unrolling with Templates
Specify block size as a function template parameter:
template <unsigned int blockSize>
__global__ void reduce5(int *g_idata, int *g_odata)
26
Reduction #6: Completely Unrolled
if (blockSize >= 512) {
if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) {
if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) {
if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
if (tid < 32) warpReduce<blockSize>(sdata, tid);
Note: all code in RED will be evaluated at compile time.
Results in a very efficient inner loop!
Template <unsigned int blockSize>
__device__ void warpReduce(volatile int* sdata, int tid) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
27
Invoking Template Kernels
Don’t we still need block size at compile time?
Nope, just a switch statement for 10 possible block sizes:
switch (threads)
{
case 512:
reduce5<512><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 256:
reduce5<256><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 128:
reduce5<128><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 64:
reduce5< 64><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 32:
reduce5< 32><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 16:
reduce5< 16><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 8:
reduce5< 8><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 4:
reduce5< 4><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 2:
reduce5< 2><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 1:
reduce5< 1><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
}
28
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Kernel 4:
first add during global load
0.965 ms 17.377 GB/s 1.78x 8.34x
Kernel 5:
unroll last warp
0.536 ms 31.289 GB/s 1.8x 15.01x
Kernel 6:
completely unrolled
0.381 ms 43.996 GB/s 1.41x 21.16x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
29
Parallel Reduction Complexity
Log(N) parallel steps, each step S does N/2S
independent ops
Step Complexity is O(log N)
For N=2D, performs S[1..D]2D-S = N-1 operations
Work Complexity is O(N) – It is work-efficient
i.e. does not perform more operations than a sequential
algorithm
With P threads physically in parallel (P processors),
time complexity is O(N/P + log N)
Compare to O(N) for sequential reduction
In a thread block, N=P, so O(log N)
30
What About Cost?
Cost of a parallel algorithm is processors time
complexity
Allocate threads instead of processors: O(N) threads
Time complexity is O(log N), so cost is O(N log N) : not
cost efficient!
Brent’s theorem suggests O(N/log N) threads
Each thread does O(log N) sequential work
Then all O(N/log N) threads cooperate for O(log N) steps
Cost = O((N/log N) * log N) = O(N)  cost efficient
Sometimes called algorithm cascading
Can lead to significant speedups in practice
31
Algorithm Cascading
Combine sequential and parallel reduction
Each thread loads and sums multiple elements into
shared memory
Tree-based reduction in shared memory
Brent’s theorem says each thread should sum
O(log n) elements
i.e. 1024 or 2048 elements per block vs. 256
In my experience, beneficial to push it even further
Possibly better latency hiding with more work per thread
More threads per block reduces levels in tree of recursive
kernel invocations
High kernel launch overhead in last levels with few blocks
On G80, best perf with 64-256 blocks of 128 threads
1024-4096 elements per thread
32
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
Reduction #7: Multiple Adds / Thread
Replace load and add of two elements:
With a while loop to add as many as necessary:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;
}
__syncthreads();
33
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
Reduction #7: Multiple Adds / Thread
Replace load and add of two elements:
With a while loop to add as many as necessary:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;
}
__syncthreads();
Note: gridSize loop stride
to maintain coalescing!
34
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Kernel 4:
first add during global load
0.965 ms 17.377 GB/s 1.78x 8.34x
Kernel 5:
unroll last warp
0.536 ms 31.289 GB/s 1.8x 15.01x
Kernel 6:
completely unrolled
0.381 ms 43.996 GB/s 1.41x 21.16x
Kernel 7:
multiple elements per thread
0.268 ms 62.671 GB/s 1.42x 30.04x
Kernel 7 on 32M elements: 73 GB/s!
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
35
template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + tid;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
__syncthreads();
if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
if (tid < 32) warpReduce(sdata, tid);
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
Final Optimized Kernel
// I guess for global memory, 2 loads in 1 loop good enough
// For shared memory, better load as mush as possible at once (near instruction bottleneck)
36
Performance Comparison
0.01
0.1
1
10
131072
262144
524288
1048576
2097152
4194304
8388608
16777216
33554432
# Elements
Time
(ms)
1: Interleaved Addressing:
Divergent Branches
2: Interleaved Addressing:
Bank Conflicts
3: Sequential Addressing
4: First add during global
load
5: Unroll last warp
6: Completely unroll
7: Multiple elements per
thread (max 64 blocks)
37
Types of optimization
Interesting observation:
Algorithmic optimizations
Changes to addressing, algorithm cascading
11.84x speedup, combined!
Code optimizations
Loop unrolling
2.54x speedup, combined
38
Conclusion
Understand CUDA performance characteristics
Memory coalescing
Divergent branching
Bank conflicts
Latency hiding
Use peak performance metrics to guide optimization
Understand parallel algorithm complexity theory
Know how to identify type of bottleneck
e.g. memory, core computation, or instruction overhead
Optimize your algorithm, then unroll loops
Use template parameters to generate optimal code
Questions: mharris@nvidia.com

More Related Content

What's hot

Stock price prediction using Neural Net
Stock price prediction using Neural NetStock price prediction using Neural Net
Stock price prediction using Neural Net
Rajat Sharma
 
「plyrパッケージで君も前処理スタ☆」改め「plyrパッケージ徹底入門」
「plyrパッケージで君も前処理スタ☆」改め「plyrパッケージ徹底入門」「plyrパッケージで君も前処理スタ☆」改め「plyrパッケージ徹底入門」
「plyrパッケージで君も前処理スタ☆」改め「plyrパッケージ徹底入門」
Nagi Teramo
 
Scientific Calculator.pptx
Scientific Calculator.pptxScientific Calculator.pptx
Scientific Calculator.pptx
AnuragSingh91510
 
Unit 3 daa
Unit 3 daaUnit 3 daa
Unit 3 daa
Nv Thejaswini
 
Game playing (tic tac-toe), andor graph
Game playing (tic tac-toe), andor graphGame playing (tic tac-toe), andor graph
Game playing (tic tac-toe), andor graph
Syed Zaid Irshad
 
A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...
A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...
A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...
idescitation
 
Minimax
MinimaxMinimax
Minimax
Nagarajan
 
Introduction iii
Introduction iiiIntroduction iii
Introduction iiichandsek666
 
Artificial intelligence and first order logic
Artificial intelligence and first order logicArtificial intelligence and first order logic
Artificial intelligence and first order logic
parsa rafiq
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
fadwa_stuka
 
Dijkstra’S Algorithm
Dijkstra’S AlgorithmDijkstra’S Algorithm
Dijkstra’S Algorithm
ami_01
 
COMPUTER GRAPHICS LAB MANUAL
COMPUTER GRAPHICS LAB MANUALCOMPUTER GRAPHICS LAB MANUAL
COMPUTER GRAPHICS LAB MANUAL
Vivek Kumar Sinha
 
3D Display Method
3D Display Method3D Display Method
3D Display Method
Khaled Sany
 
Passes of compilers
Passes of compilersPasses of compilers
Passes of compilers
Vairavel C
 
Longest Common Subsequence
Longest Common SubsequenceLongest Common Subsequence
Longest Common Subsequence
Krishma Parekh
 
Anlysis and design of algorithms part 1
Anlysis and design of algorithms part 1Anlysis and design of algorithms part 1
Anlysis and design of algorithms part 1
Deepak John
 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
Shiwani Gupta
 
Traveling salesman problem
Traveling salesman problemTraveling salesman problem
Traveling salesman problem
Jayesh Chauhan
 
All pair shortest path
All pair shortest pathAll pair shortest path
All pair shortest path
Arafat Hossan
 
AI 6 | Adversarial Search
AI 6 | Adversarial SearchAI 6 | Adversarial Search
AI 6 | Adversarial Search
Mohammad Imam Hossain
 

What's hot (20)

Stock price prediction using Neural Net
Stock price prediction using Neural NetStock price prediction using Neural Net
Stock price prediction using Neural Net
 
「plyrパッケージで君も前処理スタ☆」改め「plyrパッケージ徹底入門」
「plyrパッケージで君も前処理スタ☆」改め「plyrパッケージ徹底入門」「plyrパッケージで君も前処理スタ☆」改め「plyrパッケージ徹底入門」
「plyrパッケージで君も前処理スタ☆」改め「plyrパッケージ徹底入門」
 
Scientific Calculator.pptx
Scientific Calculator.pptxScientific Calculator.pptx
Scientific Calculator.pptx
 
Unit 3 daa
Unit 3 daaUnit 3 daa
Unit 3 daa
 
Game playing (tic tac-toe), andor graph
Game playing (tic tac-toe), andor graphGame playing (tic tac-toe), andor graph
Game playing (tic tac-toe), andor graph
 
A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...
A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...
A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...
 
Minimax
MinimaxMinimax
Minimax
 
Introduction iii
Introduction iiiIntroduction iii
Introduction iii
 
Artificial intelligence and first order logic
Artificial intelligence and first order logicArtificial intelligence and first order logic
Artificial intelligence and first order logic
 
Lecture 2
Lecture 2Lecture 2
Lecture 2
 
Dijkstra’S Algorithm
Dijkstra’S AlgorithmDijkstra’S Algorithm
Dijkstra’S Algorithm
 
COMPUTER GRAPHICS LAB MANUAL
COMPUTER GRAPHICS LAB MANUALCOMPUTER GRAPHICS LAB MANUAL
COMPUTER GRAPHICS LAB MANUAL
 
3D Display Method
3D Display Method3D Display Method
3D Display Method
 
Passes of compilers
Passes of compilersPasses of compilers
Passes of compilers
 
Longest Common Subsequence
Longest Common SubsequenceLongest Common Subsequence
Longest Common Subsequence
 
Anlysis and design of algorithms part 1
Anlysis and design of algorithms part 1Anlysis and design of algorithms part 1
Anlysis and design of algorithms part 1
 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
 
Traveling salesman problem
Traveling salesman problemTraveling salesman problem
Traveling salesman problem
 
All pair shortest path
All pair shortest pathAll pair shortest path
All pair shortest path
 
AI 6 | Adversarial Search
AI 6 | Adversarial SearchAI 6 | Adversarial Search
AI 6 | Adversarial Search
 

Similar to Optimizing Parallel Reduction in CUDA : NOTES

GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel Computing
Jun Young Park
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
GiannisTsagatakis
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
jtsagata
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
inside-BigData.com
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Rakib Hossain
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
AnirudhGarg35
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
Haris456
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
inside-BigData.com
 
Speedrunning the Open Street Map osm2pgsql Loader
Speedrunning the Open Street Map osm2pgsql LoaderSpeedrunning the Open Street Map osm2pgsql Loader
Speedrunning the Open Street Map osm2pgsql Loader
GregSmith458515
 
Fatkulin presentation
Fatkulin presentationFatkulin presentation
Fatkulin presentationEnkitec
 
Programar para GPUs
Programar para GPUsProgramar para GPUs
Programar para GPUs
Alcides Fonseca
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
ceyifo9332
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
Amgad Muhammad
 
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacketCsw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
CanSecWest
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
Mail.ru Group
 
Js2517181724
Js2517181724Js2517181724
Js2517181724
IJERA Editor
 

Similar to Optimizing Parallel Reduction in CUDA : NOTES (20)

Reduction
ReductionReduction
Reduction
 
GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel Computing
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 
Speedrunning the Open Street Map osm2pgsql Loader
Speedrunning the Open Street Map osm2pgsql LoaderSpeedrunning the Open Street Map osm2pgsql Loader
Speedrunning the Open Street Map osm2pgsql Loader
 
Fatkulin presentation
Fatkulin presentationFatkulin presentation
Fatkulin presentation
 
Programar para GPUs
Programar para GPUsProgramar para GPUs
Programar para GPUs
 
002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Defense_Presentation
Defense_PresentationDefense_Presentation
Defense_Presentation
 
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacketCsw2016 wheeler barksdale-gruskovnjak-execute_mypacket
Csw2016 wheeler barksdale-gruskovnjak-execute_mypacket
 
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidiaRAPIDS: ускоряем Pandas и scikit-learn на GPU  Павел Клеменков, NVidia
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidia
 
Js2517181724
Js2517181724Js2517181724
Js2517181724
 
Js2517181724
Js2517181724Js2517181724
Js2517181724
 
Lecture 04
Lecture 04Lecture 04
Lecture 04
 

More from Subhajit Sahu

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
Subhajit Sahu
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Adjusting Bitset for graph : SHORT REPORT / NOTES
Adjusting Bitset for graph : SHORT REPORT / NOTESAdjusting Bitset for graph : SHORT REPORT / NOTES
Adjusting Bitset for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Experiments with Primitive operations : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTESExperiments with Primitive operations : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTES
Subhajit Sahu
 
PageRank Experiments : SHORT REPORT / NOTES
PageRank Experiments : SHORT REPORT / NOTESPageRank Experiments : SHORT REPORT / NOTES
PageRank Experiments : SHORT REPORT / NOTES
Subhajit Sahu
 
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Subhajit Sahu
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
Subhajit Sahu
 
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
Subhajit Sahu
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)
Subhajit Sahu
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
Subhajit Sahu
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Subhajit Sahu
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTES
Subhajit Sahu
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTES
Subhajit Sahu
 
Survey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSurvey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTES
Subhajit Sahu
 
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERDynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Subhajit Sahu
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Subhajit Sahu
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Subhajit Sahu
 

More from Subhajit Sahu (20)

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Adjusting Bitset for graph : SHORT REPORT / NOTES
Adjusting Bitset for graph : SHORT REPORT / NOTESAdjusting Bitset for graph : SHORT REPORT / NOTES
Adjusting Bitset for graph : SHORT REPORT / NOTES
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Experiments with Primitive operations : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTESExperiments with Primitive operations : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTES
 
PageRank Experiments : SHORT REPORT / NOTES
PageRank Experiments : SHORT REPORT / NOTESPageRank Experiments : SHORT REPORT / NOTES
PageRank Experiments : SHORT REPORT / NOTES
 
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
 
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTES
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTES
 
Survey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSurvey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTES
 
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERDynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTES
 

Recently uploaded

Drugs used in parkinsonism and other movement disorders.pptx
Drugs used in parkinsonism and other movement disorders.pptxDrugs used in parkinsonism and other movement disorders.pptx
Drugs used in parkinsonism and other movement disorders.pptx
ThalapathyVijay15
 
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
kywwoyk
 
web-tech-lab-manual-final-abhas.pdf. Jer
web-tech-lab-manual-final-abhas.pdf. Jerweb-tech-lab-manual-final-abhas.pdf. Jer
web-tech-lab-manual-final-abhas.pdf. Jer
freshgammer09
 
F5 LTM TROUBLESHOOTING Guide latest.pptx
F5 LTM TROUBLESHOOTING Guide latest.pptxF5 LTM TROUBLESHOOTING Guide latest.pptx
F5 LTM TROUBLESHOOTING Guide latest.pptx
ArjunJain44
 
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
Amil baba
 
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
kywwoyk
 
MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...
MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...
MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...
PinkySharma900491
 
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
eemet
 
Cyber Sequrity.pptx is life of cyber security
Cyber Sequrity.pptx is life of cyber securityCyber Sequrity.pptx is life of cyber security
Cyber Sequrity.pptx is life of cyber security
perweeng31
 

Recently uploaded (9)

Drugs used in parkinsonism and other movement disorders.pptx
Drugs used in parkinsonism and other movement disorders.pptxDrugs used in parkinsonism and other movement disorders.pptx
Drugs used in parkinsonism and other movement disorders.pptx
 
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
一比一原版UVM毕业证佛蒙特大学毕业证成绩单如何办理
 
web-tech-lab-manual-final-abhas.pdf. Jer
web-tech-lab-manual-final-abhas.pdf. Jerweb-tech-lab-manual-final-abhas.pdf. Jer
web-tech-lab-manual-final-abhas.pdf. Jer
 
F5 LTM TROUBLESHOOTING Guide latest.pptx
F5 LTM TROUBLESHOOTING Guide latest.pptxF5 LTM TROUBLESHOOTING Guide latest.pptx
F5 LTM TROUBLESHOOTING Guide latest.pptx
 
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...
 
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
 
MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...
MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...
MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...
 
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
一比一原版SDSU毕业证圣地亚哥州立大学毕业证成绩单如何办理
 
Cyber Sequrity.pptx is life of cyber security
Cyber Sequrity.pptx is life of cyber securityCyber Sequrity.pptx is life of cyber security
Cyber Sequrity.pptx is life of cyber security
 

Optimizing Parallel Reduction in CUDA : NOTES

  • 1. Optimizing Parallel Reduction in CUDA Mark Harris NVIDIA Developer Technology
  • 2. 2 Parallel Reduction Common and important data parallel primitive Easy to implement in CUDA Harder to get it right Serves as a great optimization example We’ll walk step by step through 7 different versions Demonstrates several important optimization strategies
  • 3. 3 Parallel Reduction Tree-based approach used within each thread block Need to be able to use multiple thread blocks To process very large arrays To keep all multiprocessors on the GPU busy Each thread block reduces a portion of the array But how do we communicate partial results between thread blocks? 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3
  • 4. 4 Problem: Global Synchronization If we could synchronize across all thread blocks, could easily reduce very large arrays, right? Global sync after each block produces its result Once all blocks reach sync, continue recursively But CUDA has no global synchronization. Why? Expensive to build in hardware for GPUs with high processor count Would force programmer to run fewer blocks (no more than # multiprocessors * # resident blocks / multiprocessor) to avoid deadlock, which may reduce overall efficiency Solution: decompose into multiple kernels Kernel launch serves as a global synchronization point Kernel launch has negligible HW overhead, low SW overhead
  • 5. 5 Solution: Kernel Decomposition Avoid global sync by decomposing computation into multiple kernel invocations In the case of reductions, code for all levels is the same Recursive kernel invocation 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3 4 7 5 9 11 14 25 3 1 7 0 4 1 6 3 Level 0: 8 blocks Level 1: 1 block // Already tried this
  • 6. 6 What is Our Optimization Goal? We should strive to reach GPU peak performance Choose the right metric: GFLOP/s: for compute-bound kernels Bandwidth: for memory-bound kernels Reductions have very low arithmetic intensity 1 flop per element loaded (bandwidth-optimal) Therefore we should strive for peak bandwidth Will use G80 GPU for this example 384-bit memory interface, 900 MHz DDR 384 * 1800 / 8 = 86.4 GB/s
  • 7. 7 Reduction #1: Interleaved Addressing __global__ void reduce0(int *g_idata, int *g_odata) { extern __shared__ int sdata[]; // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // do reduction in shared mem for(unsigned int s=1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) { sdata[tid] += sdata[tid + s]; } __syncthreads(); } // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0]; }
  • 8. 8 Parallel Reduction: Interleaved Addressing 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 Values (shared memory) 0 2 4 6 8 10 12 14 11 1 7 -1 -2 -2 8 5 -5 -3 9 7 11 11 2 2 Values 0 4 8 12 18 1 7 -1 6 -2 8 5 4 -3 9 7 13 11 2 2 Values 0 8 24 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2 Values 0 41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2 Values Thread IDs Step 1 Stride 1 Step 2 Stride 2 Step 3 Stride 4 Step 4 Stride 8 Thread IDs Thread IDs Thread IDs
  • 9. 9 Reduction #1: Interleaved Addressing __global__ void reduce1(int *g_idata, int *g_odata) { extern __shared__ int sdata[]; // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // do reduction in shared mem for (unsigned int s=1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) { sdata[tid] += sdata[tid + s]; } __syncthreads(); } // write result for this block to global mem if (tid == 0) g_odata[blockIdx.x] = sdata[0]; } Problem: highly divergent warps are very inefficient, and % operator is very slow
  • 10. 10 Performance for 4M element reduction Kernel 1: interleaved addressing with divergent branching 8.054 ms 2.083 GB/s Note: Block Size = 128 threads for all tests Bandwidth Time (222 ints)
  • 11. 11 for (unsigned int s=1; s < blockDim.x; s *= 2) { if (tid % (2*s) == 0) { sdata[tid] += sdata[tid + s]; } __syncthreads(); } for (unsigned int s=1; s < blockDim.x; s *= 2) { int index = 2 * s * tid; if (index < blockDim.x) { sdata[index] += sdata[index + s]; } __syncthreads(); } Reduction #2: Interleaved Addressing Just replace divergent branch in inner loop: With strided index and non-divergent branch:
  • 12. 12 Parallel Reduction: Interleaved Addressing 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 Values (shared memory) 0 1 2 3 4 5 6 7 11 1 7 -1 -2 -2 8 5 -5 -3 9 7 11 11 2 2 Values 0 1 2 3 18 1 7 -1 6 -2 8 5 4 -3 9 7 13 11 2 2 Values 0 1 24 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2 Values 0 41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2 Values Thread IDs Step 1 Stride 1 Step 2 Stride 2 Step 3 Stride 4 Step 4 Stride 8 Thread IDs Thread IDs Thread IDs New Problem: Shared Memory Bank Conflicts
  • 13. 13 Performance for 4M element reduction Kernel 1: interleaved addressing with divergent branching 8.054 ms 2.083 GB/s Kernel 2: interleaved addressing with bank conflicts 3.456 ms 4.854 GB/s 2.33x 2.33x Step Speedup Bandwidth Time (222 ints) Cumulative Speedup
  • 14. 14 Parallel Reduction: Sequential Addressing 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 Values (shared memory) 0 1 2 3 4 5 6 7 8 -2 10 6 0 9 3 7 -2 -3 2 7 0 11 0 2 Values 0 1 2 3 8 7 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2 Values 0 1 21 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2 Values 0 41 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2 Values Thread IDs Step 1 Stride 8 Step 2 Stride 4 Step 3 Stride 2 Step 4 Stride 1 Thread IDs Thread IDs Thread IDs Sequential addressing is conflict free
  • 15. 15 for (unsigned int s=1; s < blockDim.x; s *= 2) { int index = 2 * s * tid; if (index < blockDim.x) { sdata[index] += sdata[index + s]; } __syncthreads(); } for (unsigned int s=blockDim.x/2; s>0; s>>=1) { if (tid < s) { sdata[tid] += sdata[tid + s]; } __syncthreads(); } Reduction #3: Sequential Addressing Just replace strided indexing in inner loop: With reversed loop and threadID-based indexing: // Already using this
  • 16. 16 Performance for 4M element reduction Kernel 1: interleaved addressing with divergent branching 8.054 ms 2.083 GB/s Kernel 2: interleaved addressing with bank conflicts 3.456 ms 4.854 GB/s 2.33x 2.33x Kernel 3: sequential addressing 1.722 ms 9.741 GB/s 2.01x 4.68x Step Speedup Bandwidth Time (222 ints) Cumulative Speedup Insert text here
  • 17. 17 for (unsigned int s=blockDim.x/2; s>0; s>>=1) { if (tid < s) { sdata[tid] += sdata[tid + s]; } __syncthreads(); } Idle Threads Problem: Half of the threads are idle on first loop iteration! This is wasteful…
  • 18. 18 // each thread loads one element from global to shared mem unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; sdata[tid] = g_idata[i]; __syncthreads(); // perform first level of reduction, // reading from global memory, writing to shared memory unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x; sdata[tid] = g_idata[i] + g_idata[i+blockDim.x]; __syncthreads(); Reduction #4: First Add During Load Halve the number of blocks, and replace single load: With two loads and first add of the reduction:
  • 19. 19 Performance for 4M element reduction Kernel 1: interleaved addressing with divergent branching 8.054 ms 2.083 GB/s Kernel 2: interleaved addressing with bank conflicts 3.456 ms 4.854 GB/s 2.33x 2.33x Kernel 3: sequential addressing 1.722 ms 9.741 GB/s 2.01x 4.68x Kernel 4: first add during global load 0.965 ms 17.377 GB/s 1.78x 8.34x Step Speedup Bandwidth Time (222 ints) Cumulative Speedup
  • 20. 20 Instruction Bottleneck At 17 GB/s, we’re far from bandwidth bound And we know reduction has low arithmetic intensity Therefore a likely bottleneck is instruction overhead Ancillary instructions that are not loads, stores, or arithmetic for the core computation In other words: address arithmetic and loop overhead Strategy: unroll loops
  • 21. 21 Unrolling the Last Warp As reduction proceeds, # “active” threads decreases When s <= 32, we have only one warp left Instructions are SIMD synchronous within a warp That means when s <= 32: We don’t need to __syncthreads() We don’t need “if (tid < s)” because it doesn’t save any work Let’s unroll the last 6 iterations of the inner loop
  • 22. __device__ void warpReduce(volatile int* sdata, int tid) { sdata[tid] += sdata[tid + 32]; sdata[tid] += sdata[tid + 16]; sdata[tid] += sdata[tid + 8]; sdata[tid] += sdata[tid + 4]; sdata[tid] += sdata[tid + 2]; sdata[tid] += sdata[tid + 1]; } // later… for (unsigned int s=blockDim.x/2; s>32; s>>=1) { if (tid < s) sdata[tid] += sdata[tid + s]; __syncthreads(); } if (tid < 32) warpReduce(sdata, tid); 22 Reduction #5: Unroll the Last Warp Note: This saves useless work in all warps, not just the last one! Without unrolling, all warps execute every iteration of the for loop and if statement IMPORTANT: For this to be correct, we must use the “volatile” keyword!
  • 23. 23 Performance for 4M element reduction Kernel 1: interleaved addressing with divergent branching 8.054 ms 2.083 GB/s Kernel 2: interleaved addressing with bank conflicts 3.456 ms 4.854 GB/s 2.33x 2.33x Kernel 3: sequential addressing 1.722 ms 9.741 GB/s 2.01x 4.68x Kernel 4: first add during global load 0.965 ms 17.377 GB/s 1.78x 8.34x Kernel 5: unroll last warp 0.536 ms 31.289 GB/s 1.8x 15.01x Step Speedup Bandwidth Time (222 ints) Cumulative Speedup
  • 24. 24 Complete Unrolling If we knew the number of iterations at compile time, we could completely unroll the reduction Luckily, the block size is limited by the GPU to 512 threads Also, we are sticking to power-of-2 block sizes So we can easily unroll for a fixed block size But we need to be generic – how can we unroll for block sizes that we don’t know at compile time? Templates to the rescue! CUDA supports C++ template parameters on device and host functions
  • 25. 25 Unrolling with Templates Specify block size as a function template parameter: template <unsigned int blockSize> __global__ void reduce5(int *g_idata, int *g_odata)
  • 26. 26 Reduction #6: Completely Unrolled if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); } if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); } if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); } if (tid < 32) warpReduce<blockSize>(sdata, tid); Note: all code in RED will be evaluated at compile time. Results in a very efficient inner loop! Template <unsigned int blockSize> __device__ void warpReduce(volatile int* sdata, int tid) { if (blockSize >= 64) sdata[tid] += sdata[tid + 32]; if (blockSize >= 32) sdata[tid] += sdata[tid + 16]; if (blockSize >= 16) sdata[tid] += sdata[tid + 8]; if (blockSize >= 8) sdata[tid] += sdata[tid + 4]; if (blockSize >= 4) sdata[tid] += sdata[tid + 2]; if (blockSize >= 2) sdata[tid] += sdata[tid + 1]; }
  • 27. 27 Invoking Template Kernels Don’t we still need block size at compile time? Nope, just a switch statement for 10 possible block sizes: switch (threads) { case 512: reduce5<512><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; case 256: reduce5<256><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; case 128: reduce5<128><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; case 64: reduce5< 64><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; case 32: reduce5< 32><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; case 16: reduce5< 16><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; case 8: reduce5< 8><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; case 4: reduce5< 4><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; case 2: reduce5< 2><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; case 1: reduce5< 1><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break; }
  • 28. 28 Performance for 4M element reduction Kernel 1: interleaved addressing with divergent branching 8.054 ms 2.083 GB/s Kernel 2: interleaved addressing with bank conflicts 3.456 ms 4.854 GB/s 2.33x 2.33x Kernel 3: sequential addressing 1.722 ms 9.741 GB/s 2.01x 4.68x Kernel 4: first add during global load 0.965 ms 17.377 GB/s 1.78x 8.34x Kernel 5: unroll last warp 0.536 ms 31.289 GB/s 1.8x 15.01x Kernel 6: completely unrolled 0.381 ms 43.996 GB/s 1.41x 21.16x Step Speedup Bandwidth Time (222 ints) Cumulative Speedup
  • 29. 29 Parallel Reduction Complexity Log(N) parallel steps, each step S does N/2S independent ops Step Complexity is O(log N) For N=2D, performs S[1..D]2D-S = N-1 operations Work Complexity is O(N) – It is work-efficient i.e. does not perform more operations than a sequential algorithm With P threads physically in parallel (P processors), time complexity is O(N/P + log N) Compare to O(N) for sequential reduction In a thread block, N=P, so O(log N)
  • 30. 30 What About Cost? Cost of a parallel algorithm is processors time complexity Allocate threads instead of processors: O(N) threads Time complexity is O(log N), so cost is O(N log N) : not cost efficient! Brent’s theorem suggests O(N/log N) threads Each thread does O(log N) sequential work Then all O(N/log N) threads cooperate for O(log N) steps Cost = O((N/log N) * log N) = O(N)  cost efficient Sometimes called algorithm cascading Can lead to significant speedups in practice
  • 31. 31 Algorithm Cascading Combine sequential and parallel reduction Each thread loads and sums multiple elements into shared memory Tree-based reduction in shared memory Brent’s theorem says each thread should sum O(log n) elements i.e. 1024 or 2048 elements per block vs. 256 In my experience, beneficial to push it even further Possibly better latency hiding with more work per thread More threads per block reduces levels in tree of recursive kernel invocations High kernel launch overhead in last levels with few blocks On G80, best perf with 64-256 blocks of 128 threads 1024-4096 elements per thread
  • 32. 32 unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x; sdata[tid] = g_idata[i] + g_idata[i+blockDim.x]; __syncthreads(); Reduction #7: Multiple Adds / Thread Replace load and add of two elements: With a while loop to add as many as necessary: unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x; unsigned int gridSize = blockSize*2*gridDim.x; sdata[tid] = 0; while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; } __syncthreads();
  • 33. 33 unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x; sdata[tid] = g_idata[i] + g_idata[i+blockDim.x]; __syncthreads(); Reduction #7: Multiple Adds / Thread Replace load and add of two elements: With a while loop to add as many as necessary: unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x; unsigned int gridSize = blockSize*2*gridDim.x; sdata[tid] = 0; while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; } __syncthreads(); Note: gridSize loop stride to maintain coalescing!
  • 34. 34 Performance for 4M element reduction Kernel 1: interleaved addressing with divergent branching 8.054 ms 2.083 GB/s Kernel 2: interleaved addressing with bank conflicts 3.456 ms 4.854 GB/s 2.33x 2.33x Kernel 3: sequential addressing 1.722 ms 9.741 GB/s 2.01x 4.68x Kernel 4: first add during global load 0.965 ms 17.377 GB/s 1.78x 8.34x Kernel 5: unroll last warp 0.536 ms 31.289 GB/s 1.8x 15.01x Kernel 6: completely unrolled 0.381 ms 43.996 GB/s 1.41x 21.16x Kernel 7: multiple elements per thread 0.268 ms 62.671 GB/s 1.42x 30.04x Kernel 7 on 32M elements: 73 GB/s! Step Speedup Bandwidth Time (222 ints) Cumulative Speedup
  • 35. 35 template <unsigned int blockSize> __device__ void warpReduce(volatile int *sdata, unsigned int tid) { if (blockSize >= 64) sdata[tid] += sdata[tid + 32]; if (blockSize >= 32) sdata[tid] += sdata[tid + 16]; if (blockSize >= 16) sdata[tid] += sdata[tid + 8]; if (blockSize >= 8) sdata[tid] += sdata[tid + 4]; if (blockSize >= 4) sdata[tid] += sdata[tid + 2]; if (blockSize >= 2) sdata[tid] += sdata[tid + 1]; } template <unsigned int blockSize> __global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) { extern __shared__ int sdata[]; unsigned int tid = threadIdx.x; unsigned int i = blockIdx.x*(blockSize*2) + tid; unsigned int gridSize = blockSize*2*gridDim.x; sdata[tid] = 0; while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; } __syncthreads(); if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); } if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); } if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); } if (tid < 32) warpReduce(sdata, tid); if (tid == 0) g_odata[blockIdx.x] = sdata[0]; } Final Optimized Kernel // I guess for global memory, 2 loads in 1 loop good enough // For shared memory, better load as mush as possible at once (near instruction bottleneck)
  • 36. 36 Performance Comparison 0.01 0.1 1 10 131072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432 # Elements Time (ms) 1: Interleaved Addressing: Divergent Branches 2: Interleaved Addressing: Bank Conflicts 3: Sequential Addressing 4: First add during global load 5: Unroll last warp 6: Completely unroll 7: Multiple elements per thread (max 64 blocks)
  • 37. 37 Types of optimization Interesting observation: Algorithmic optimizations Changes to addressing, algorithm cascading 11.84x speedup, combined! Code optimizations Loop unrolling 2.54x speedup, combined
  • 38. 38 Conclusion Understand CUDA performance characteristics Memory coalescing Divergent branching Bank conflicts Latency hiding Use peak performance metrics to guide optimization Understand parallel algorithm complexity theory Know how to identify type of bottleneck e.g. memory, core computation, or instruction overhead Optimize your algorithm, then unroll loops Use template parameters to generate optimal code Questions: mharris@nvidia.com