CUDA

Introduction to the
CUDA Platform

CUDA Parallel Computing Platform
Hardware
Capabilities
GPUDirect
SMX
Dynamic
Parallelism
HyperQ
Programming
Approaches
Libraries
“Drop-in”
Acceleration
Programming
Languages
OpenACC
Directives
Maximum Flexibility
Easily Accelerate
Apps
Development
Environment
Nsight IDE
Linux, Mac and Windows
GPU Debugging and
Profiling
CUDA-GDB
debugger
NVIDIA Visual
Profiler
Open Compiler
Tool Chain
Enables compiling new languages to CUDA
platform, and CUDA languages to other
architectures
www.nvidia.com/getcuda
© NVIDIA 2013

Applications
Libraries
“Drop-in”
Acceleration
Programming
Languages
OpenACC
Directives
Easily Accelerate
Applications
3 Ways to Accelerate Applications
Maximum
Flexibility

Libraries: Easy, High-Quality
Acceleration
• Ease of use: Using libraries enables GPU acceleration without in-depth
knowledge of GPU programming
• “Drop-in”: Many GPU-accelerated libraries follow standard APIs, thus
enabling acceleration with minimal code changes
• Quality: Libraries offer high-quality implementations of functions
encountered in a broad range of applications
• Performance: NVIDIA libraries are tuned by experts

Some GPU-accelerated Libraries
• NVIDIA cuBLAS
• NVIDIA cuRAND
• NVIDIA cuSPARSE
• NVIDIA NPP
• NVIDIA cuFFT
• Vector Signal Image Processing
• GPU Accelerated Linear Algebra
• Matrix Algebra on GPU and Multicore
• C++ STL Features for CUDA
• IMSL Library
• Building-block Algorithms for CUDA
• ArrayFire Matrix Computations
• Sparse Linear Algebra

3 Steps to CUDA-accelerated
application
• Step 1: Substitute library calls with equivalent CUDA library calls
saxpy ( … ) → cublasSaxpy ( … )
• Step 2: Manage data locality
- with CUDA: cudaMalloc(), cudaMemcpy(), etc.
- with CUBLAS: cublasAlloc(), cublasSetVector(), etc.
• Step 3: Rebuild and link the CUDA-accelerated library
nvcc myobj.o -lcublas

Explore the CUDA (Libraries)
Ecosystem
• CUDA Tools and Ecosystem
described in detail on NVIDIA
Developer Zone:
developer.nvidia.com/cuda-tools-ecosystem

OpenACC Directives
Program myscience
... serial code ...
!$acc kernels
do k = 1,n1
do i = 1,n2
... parallel code ...
enddo
enddo
!$acc end kernels
...
End Program myscience
CPU GPU
OpenACC compiler hint
• Your original Fortran or C code
• Simple compiler hints
• Compiler parallelizes code
• Works on many-core GPUs & multicore CPUs

OpenACC
The Standard for GPU Directives
• Easy: Directives are the easy path to accelerate compute-intensive applications
• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
• Powerful: GPU directives allow complete access to the massive parallel power of a GPU

Directives: Easy & Powerful
• Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 Hours
• Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 Hours
• Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 Hours
“Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications.”
-- Developer at the Global Manufacturer of Navigation Systems

Start Now with OpenACC Directives
Free trial license to PGI
Accelerator
Tools for quick ramp
www.nvidia.com/gpudirectives
Sign up for a free trial of
the directives compiler
now!

GPU Programming Languages
Fortran: OpenACC, CUDA Fortran
C: OpenACC, CUDA C
C++: Thrust, CUDA C++
Python: PyCUDA, Copperhead
F#: Alea.cuBase
Numerical analytics: MATLAB, Mathematica, LabVIEW

// generate 32M random numbers on host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(),
h_vec.end(),
rand);
// transfer data to device (GPU)
thrust::device_vector<int> d_vec = h_vec;
// sort data on device
thrust::sort(d_vec.begin(), d_vec.end());
// transfer data back to host
thrust::copy(d_vec.begin(),
d_vec.end(),
h_vec.begin());
Rapid Parallel C++ Development
• Resembles C++ STL
• High-level interface
• Enhances developer
productivity
• Enables performance
portability between GPUs and
multicore CPUs
• Flexible
• CUDA, OpenMP, and TBB
backends
• Extensible and customizable
• Integrates with existing
software
• Open source
http://developer.nvidia.com/thrust or http://thrust.googlecode.com

Learn More
MATLAB
http://www.mathworks.com/discovery/matlab-gpu.html
These languages are supported on all CUDA-capable GPUs.
You might already have a CUDA-capable GPU in your laptop
or desktop PC!
CUDA C/C++
http://developer.nvidia.com/cuda-toolkit
Thrust C++ Template Library
http://developer.nvidia.com/thrust
CUDA Fortran
http://developer.nvidia.com/cuda-toolkit
GPU.NET
http://tidepowerd.com
PyCUDA (Python)
http://mathema.tician.de/software/pycuda
Mathematica
http://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/

Getting Started
• Download CUDA Toolkit & SDK: www.nvidia.com/getcuda
• Nsight IDE (Eclipse or Visual Studio): www.nvidia.com/nsight
• Programming Guide/Best Practices:
• docs.nvidia.com
• Questions:
• NVIDIA Developer forums: devtalk.nvidia.com
• Search or ask on: www.stackoverflow.com/tags/cuda
• General: www.nvidia.com/cudazone

Add GPUs: Accelerate Science Applications
GPU CPU

Small Changes, Big Speed-up
Application Code
+
GPU CPU
Use GPU to
Parallelize
Compute-Intensive
Functions
Rest of Sequential
CPU Code

Fastest Performance on Scientific Applications
Tesla K20X Speed-Up over Sandy Bridge CPUs
[Bar chart: speed-up, axis from 0.0x to 20.0x, for AMBER (Molecular Dynamics), SPECFEM3D (Earth Science), Chroma (Physics), and MATLAB (FFT)* (Engineering)]
CPU results: Dual socket E5-2687w, 3.10 GHz. GPU results: Dual socket E5-2687w + 2 Tesla K20X GPUs
*MATLAB results comparing one i7-2600K CPU alone vs. with a Tesla K20 GPU
Disclaimer: Non-NVIDIA implementations may not have been fully optimized

Why Does Computing Perf/Watt Matter?
• Traditional CPUs are not economically feasible
– 2.3 PFlops: 7.0 Megawatts
– 7000 homes: 7.0 Megawatts
• CPU: Optimized for Serial Tasks
• GPU Accelerator: Optimized for Many Parallel Tasks
– 10x performance/socket
– > 5x energy efficiency
• Era of GPU-accelerated computing is here

World’s Fastest, Most Energy Efficient Accelerator
[Plot: SGEMM (TFLOPS) and DGEMM (TFLOPS) for Tesla K20X, Tesla K20, Xeon CPU E5-2690, and Xeon Phi 225W]
Tesla K20X vs Xeon CPU
• 8x Faster SGEMM
• 6x Faster DGEMM
Tesla K20X vs Xeon Phi
• 90% Faster SGEMM
• 60% Faster DGEMM

CUDA C/C++ BASICS
NVIDIA Corporation

What is CUDA?
• CUDA Architecture
– Expose GPU parallelism for general-purpose computing
– Retain performance
• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous
programming
– Straightforward APIs to manage devices, memory etc.
• This session introduces CUDA C/C++

Introduction to CUDA C/C++
• What will you learn in this session?
– Start from “Hello World!”
– Write and launch CUDA C/C++ kernels
– Manage GPU memory
– Manage communication and synchronization

Prerequisites
• You (probably) need experience with C or C++
• You don’t need GPU experience
• You don’t need parallel programming
experience
• You don’t need graphics experience

Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
CONCEPTS

HELLO WORLD!
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Handling errors
Managing devices
CONCEPTS

• Terminology:
– Host: The CPU and its memory (host memory)
– Device: The GPU and its memory (device memory)

#include <iostream>
#include <algorithm>
using namespace std;
#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
// Read input elements into shared memory
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
// Synchronize (ensure all the data is available)
__syncthreads();
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
// Store the result
out[gindex] = result;
}
void fill_ints(int *x, int n) {
fill_n(x, n, 1);
}
int main(void) {
int *in, *out; // host copies of in, out
int *d_in, *d_out; // device copies of in, out
int size = (N + 2*RADIUS) * sizeof(int);
// Alloc space for host copies and setup values
in = (int *)malloc(size); fill_ints(in, N + 2*RADIUS);
out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);
// Alloc space for device copies
cudaMalloc((void **)&d_in, size);
cudaMalloc((void **)&d_out, size);
// Copy to device
cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);
// Launch stencil_1d() kernel on GPU
stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS,
d_out + RADIUS);
// Copy result back to host
cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
// Cleanup
free(in); free(out);
cudaFree(d_in); cudaFree(d_out);
return 0;
}
serial code
parallel code
serial code
parallel fn

Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
[Figure: CPU memory and GPU memory connected by the PCI Bus]

Hello World!
#include <stdio.h>
int main(void) {
printf("Hello World!\n");
return 0;
}
• Standard C that runs on the host
• NVIDIA compiler (nvcc) can be used to compile programs with no device code
Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

Hello World! with Device Code
__global__ void mykernel(void) {
}
int main(void) {
mykernel<<<1,1>>>();
return 0;
}
• Two new syntactic elements…

__global__ void mykernel(void) {
}
• CUDA C/C++ keyword __global__ indicates a function that:
– Runs on the device
– Is called from host code
• nvcc separates source code into host and device
components
– Device functions (e.g. mykernel()) processed by NVIDIA compiler
– Host functions (e.g. main()) processed by standard host compiler
• gcc, cl.exe

Hello World! with Device Code
• Triple angle brackets mark a call from host
code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment
• That’s all that is required to execute a function
on the GPU!

__global__ void mykernel(void){
}
int main(void) {
mykernel<<<1,1>>>();
printf("Hello World!\n");
return 0;
}
• mykernel() does nothing, somewhat anticlimactic!
Output:
$ nvcc hello.cu
$ a.out
Hello World!
$

Parallel Programming in CUDA C/C++
• But wait… GPU computing is about
massive parallelism!
• We need a more interesting example…
• We’ll start by adding two integers and
build up to vector addition
a b c

Addition on the Device
• A simple kernel to add two integers
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
• As before __global__ is a CUDA C/C++ keyword
meaning
– add() will execute on the device
– add() will be called from the host

Addition on the Device
• Note that we use pointers for the variables
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
• add() runs on the device, so a, b and c must
point to device memory
• We need to allocate memory on the GPU

Memory Management
• Host and device memory are separate entities
– Device pointers point to GPU memory
May be passed to/from host code
May not be dereferenced in host code
– Host pointers point to CPU memory
May be passed to/from device code
May not be dereferenced in device code
• Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()

Addition on the Device: add()
• Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
• Let’s take a look at main()…

Addition on the Device: main()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);
// Allocate space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Setup input values
a = 2;
b = 7;

Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
// Launch add() kernel on GPU
add<<<1,1>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

RUNNING IN
PARALLEL
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Handling errors
Managing devices
CONCEPTS

Moving to Parallel
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?
add<<< 1, 1 >>>();
add<<< N, 1 >>>();
• Instead of executing add() once, execute N
times in parallel

Vector Addition on the Device
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred to
as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
• By using blockIdx.x to index into the array, each block handles
a different index

Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
• On the device, each block can execute in parallel:
Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];

Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
• Let’s take a look at main()…

Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);
// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

Vector Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// Launch add() kernel on GPU with N blocks
add<<<N,1>>>(d_a, d_b, d_c);
// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}

Review (1 of 2)
• Difference between host and device
– Host CPU
– Device GPU
• Using __global__ to declare a function as device code
– Executes on the device
– Called from the host
• Passing parameters from host code to a device function

Review (2 of 2)
• Basic device memory management
– cudaMalloc()
– cudaMemcpy()
– cudaFree()
• Launching parallel kernels
– Launch N copies of add() with add<<<N,1>>>(…);
– Use blockIdx.x to access block index

INTRODUCING
THREADS
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Handling errors
Managing devices
CONCEPTS

CUDA Threads
• Terminology: a block can be split into parallel threads
• Let’s change add() to use parallel threads instead of
parallel blocks
• We use threadIdx.x instead of blockIdx.x
• Need to make one change in main()…
__global__ void add(int *a, int *b, int *c) {
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

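Vector Addition Using Threads: main()
#define N 512
int main(void) {
// ... declarations, allocation, and host-to-device copies as in the block version ...
// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d_a, d_b, d_c);
// ... copy result back to host ...
// Cleanup
// ... free host and device memory as in the block version ...
return 0;
}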

COMBINING THREADS
AND BLOCKS
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Handling errors
Managing devices
CONCEPTS

Combining Blocks and Threads
• We’ve seen parallel vector addition using:
– Many blocks with one thread each
– One block with many threads
• Let’s adapt vector addition to use both blocks and
threads
• Why? We’ll come to that…
• First let’s discuss data indexing…

Indexing Arrays with Blocks and Threads
• No longer as simple as using blockIdx.x and threadIdx.x
– Consider indexing an array with one element per thread (8 threads/block)
threadIdx.x: 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
blockIdx.x:  0               | 1               | 2               | 3
• With M threads/block a unique index for each thread is given by:
int index = threadIdx.x + blockIdx.x * M;

Indexing Arrays: Example
• Which thread will operate on the red element (element 21 of the 32-element array)?
array index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
threadIdx.x: 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
M = 8
threadIdx.x = 5
blockIdx.x = 2
int index = threadIdx.x + blockIdx.x * M;
= 5 + 2 * 8;
= 21;
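The index arithmetic above can be checked on the host with a short C++ sketch. This is not CUDA code: `global_index` and `enumerate_indices` are hypothetical helper names, and plain loops stand in for the blocks and threads that a GPU would run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for: threadIdx.x + blockIdx.x * M
int global_index(int thread_idx, int block_idx, int threads_per_block) {
    return thread_idx + block_idx * threads_per_block;
}

// Visit every (block, thread) pair of a 1D launch in order and record the
// array element that pair would operate on.
std::vector<int> enumerate_indices(int num_blocks, int threads_per_block) {
    std::vector<int> indices;
    indices.reserve(static_cast<std::size_t>(num_blocks) * threads_per_block);
    for (int block = 0; block < num_blocks; ++block)
        for (int thread = 0; thread < threads_per_block; ++thread)
            indices.push_back(global_index(thread, block, threads_per_block));
    return indices;
}
```

With 4 blocks of 8 threads, `enumerate_indices(4, 8)` yields each of the 32 array positions exactly once, and `global_index(5, 2, 8)` reproduces the worked example.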

Vector Addition with Blocks and
Threads
• Use the built-in variable blockDim.x for threads per block
int index = threadIdx.x + blockIdx.x * blockDim.x;
• Combined version of add() to use parallel threads and parallel blocks
__global__ void add(int *a, int *b, int *c) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
c[index] = a[index] + b[index];
}
• What changes need to be made in main()?

Addition with Blocks and Threads: main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
// ... declarations, allocation, and host-to-device copies as in the block version ...
// Launch add() kernel on GPU
add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
// ... copy result back to host ...
// Cleanup
// ... free host and device memory as in the block version ...
return 0;
}

Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of blockDim.x
• Avoid accessing beyond the end of the arrays:
__global__ void add(int *a, int *b, int *c, int n) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}
• Update the kernel launch:
add<<<(N + M-1) / M,M>>>(d_a, d_b, d_c, N);
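The rounded-up block count and the bounds guard can be illustrated on the host. The following is a minimal C++ sketch, not CUDA code; `blocks_needed` and `guarded_add` are hypothetical names, and it assumes `b` holds at least as many elements as `a`.

```cpp
#include <cassert>
#include <vector>

// Ceiling division: the (N + M-1) / M expression from the kernel launch.
int blocks_needed(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Emulate the guarded kernel: the last block may contain "threads" whose
// index falls past the end of the arrays, so each one checks index < n.
std::vector<int> guarded_add(const std::vector<int>& a,
                             const std::vector<int>& b,
                             int threads_per_block) {
    const int n = static_cast<int>(a.size());
    std::vector<int> c(n, 0);
    const int blocks = blocks_needed(n, threads_per_block);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            const int index = thread + block * threads_per_block;
            if (index < n)
                c[index] = a[index] + b[index];
        }
    }
    return c;
}
```

For example, 1025 elements with 512 threads per block need 3 blocks, and the third block has 511 threads that fail the `index < n` test and do nothing.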

Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
• Unlike parallel blocks, threads have mechanisms
to:
– Communicate
– Synchronize
• To look closer, we need a new example…

COOPERATING
THREADS
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Handling errors
Managing devices
CONCEPTS

1D Stencil
• Consider applying a 1D stencil to a 1D array of
elements
– Each output element is the sum of input elements within a
radius
• If radius is 3, then each output element is the sum of
7 input elements:
radius radius

Implementing Within a Block
• Each thread processes one output element
– blockDim.x elements per block
• Input elements are read several times
– With radius 3, each input element is read seven times
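A serial reference version makes the stencil definition concrete and is handy for checking GPU results. This is a minimal C++ sketch; `stencil_1d_reference` is a hypothetical name, and it assumes the input is padded with `radius` extra elements on each side, so an input of length n + 2*radius produces n outputs.

```cpp
#include <cassert>
#include <vector>

// Each output element is the sum of the 2*radius + 1 input elements
// centered on it. padded_in carries `radius` extra elements at each end.
std::vector<int> stencil_1d_reference(const std::vector<int>& padded_in,
                                      int radius) {
    const int n = static_cast<int>(padded_in.size()) - 2 * radius;
    if (n <= 0) return {};
    std::vector<int> out(n, 0);
    for (int i = 0; i < n; ++i) {
        int result = 0;
        for (int offset = -radius; offset <= radius; ++offset)
            result += padded_in[i + radius + offset];
        out[i] = result;
    }
    return out;
}
```

With radius 3 the inner loop reads seven inputs per output, which is why each interior input element ends up being read seven times.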

Sharing Data Between Threads
• Terminology: within a block, threads share data via
shared memory
• Extremely fast on-chip memory, user-managed
• Declare using __shared__, allocated per block
• Data is not visible to threads in other blocks

Implementing With Shared Memory
• Cache data in shared memory
– Read (blockDim.x + 2 * radius) input elements from global
memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
– Each block needs a halo of radius elements at each boundary
blockDim.x output elements
halo on left halo on right
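The tile-plus-halo scheme can be emulated on the host to see which elements each block loads. The sketch below is plain C++, not CUDA: `stencil_1d_tiled` is a hypothetical name, a local `std::array` stands in for the `__shared__` buffer, and the two sequential loops model the load phase and the compute phase. It assumes the output length is a multiple of `BLOCK_SIZE`; any remainder is left at zero.

```cpp
#include <array>
#include <cassert>
#include <vector>

constexpr int RADIUS = 3;
constexpr int BLOCK_SIZE = 16;

// padded_in holds n + 2*RADIUS elements; the result holds n elements.
std::vector<int> stencil_1d_tiled(const std::vector<int>& padded_in) {
    const int n = static_cast<int>(padded_in.size()) - 2 * RADIUS;
    if (n <= 0) return {};
    std::vector<int> out(n, 0);
    const int* in = padded_in.data() + RADIUS;  // skip the left padding

    for (int block = 0; block < n / BLOCK_SIZE; ++block) {
        // Stand-in for the per-block shared-memory tile.
        std::array<int, BLOCK_SIZE + 2 * RADIUS> temp{};

        // Load phase: every "thread" copies its own element, and the first
        // RADIUS threads also copy the left and right halo elements.
        for (int t = 0; t < BLOCK_SIZE; ++t) {
            const int gindex = t + block * BLOCK_SIZE;
            const int lindex = t + RADIUS;
            temp[lindex] = in[gindex];
            if (t < RADIUS) {
                temp[lindex - RADIUS] = in[gindex - RADIUS];
                temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
            }
        }

        // Compute phase: starts only after the whole tile is loaded.
        for (int t = 0; t < BLOCK_SIZE; ++t) {
            const int gindex = t + block * BLOCK_SIZE;
            const int lindex = t + RADIUS;
            int result = 0;
            for (int offset = -RADIUS; offset <= RADIUS; ++offset)
                result += temp[lindex + offset];
            out[gindex] = result;
        }
    }
    return out;
}
```

Because the load loop finishes before the compute loop begins, this serial emulation never reads a halo element that has not been written yet. On a GPU the threads of a block run concurrently, so that ordering is not automatic.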

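Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
// Read input elements into shared memory
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
// Store the result
out[gindex] = result;
}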

Data Race!
• The stencil example will not work…
• Suppose thread 15 reads the halo before thread 0 has fetched it…
temp[lindex] = in[gindex]; // Store at temp[18]
if (threadIdx.x < RADIUS) { // Skipped, threadIdx.x > RADIUS
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1]; // Load from temp[19]

__syncthreads()
• void __syncthreads();
• Synchronizes all threads within a block
– Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
– In conditional code, the condition must be
uniform across the block

Stencil Kernel
int lindex = threadIdx.x + RADIUS;
temp[lindex - RADIUS] = in[gindex - RADIUS];
}
// Synchronize (ensure all the data is available)
__syncthreads();

Stencil Kernel
int result = 0;
// Store the result
}

Review (1 of 2)
• Launching parallel threads
– Launch N blocks with M threads per block with
kernel<<<N,M>>>(…);
– Use blockIdx.x to access block index within grid
– Use threadIdx.x to access thread index within block
• Allocate elements to threads:
int index = threadIdx.x + blockIdx.x * blockDim.x;

Review (2 of 2)
• Use __shared__ to declare a variable/array in
shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks
• Use __syncthreads() as a barrier
– Use to prevent data hazards

MANAGING THE
DEVICE
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Handling errors
Managing devices
CONCEPTS

Coordinating Host & Device
• Kernel launches are asynchronous
– Control returns to the CPU immediately
• CPU needs to synchronize before consuming the
results
cudaMemcpy(): Blocks the CPU until the copy is complete. Copy begins when all preceding CUDA calls have completed
cudaMemcpyAsync(): Asynchronous, does not block the CPU
cudaDeviceSynchronize(): Blocks the CPU until all preceding CUDA calls have completed

Reporting Errors
• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself
OR
– Error in an earlier asynchronous operation (e.g. kernel)
• Get the error code for the last error:
cudaError_t cudaGetLastError(void)
• Get a string to describe the error:
const char *cudaGetErrorString(cudaError_t)
printf("%s\n", cudaGetErrorString(cudaGetLastError()));

Device Management
• Application can query and select GPUs
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)
• Multiple threads can share a device
• A single thread can manage multiple devices
cudaSetDevice(i) to select current device
cudaMemcpy(…) for peer-to-peer copies✝
✝ requires OS and device support

Introduction to CUDA C/C++
• What have we learned?
– Write and launch CUDA C/C++ kernels
• __global__, blockIdx.x, threadIdx.x, <<<>>>
– Manage GPU memory
• cudaMalloc(), cudaMemcpy(), cudaFree()
– Manage communication and synchronization
• __shared__, __syncthreads()
• cudaMemcpy() vs cudaMemcpyAsync(),
cudaDeviceSynchronize()

Compute Capability
• The compute capability of a device describes its architecture, e.g.
– Number of registers
– Sizes of memories
– Features & capabilities
• The following presentations concentrate on Fermi devices
– Compute Capability >= 2.0
Compute Capability | Selected Features (see CUDA C Programming Guide for complete list) | Tesla models
1.0 | Fundamental CUDA support | 870
1.3 | Double precision, improved memory accesses, atomics | 10-series
2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series

IDs and Dimensions
– A kernel is launched as a
grid of blocks of threads
• blockIdx and
threadIdx are 3D
• We showed only one
dimension (x)
• Built-in variables:
– threadIdx
– blockIdx
– blockDim
– gridDim
[Figure: a Device running Grid 1, which contains Blocks (0,0,0), (1,0,0), (2,0,0), (0,1,0), (1,1,0), and (2,1,0); Block (1,1,0) is expanded into Threads (0,0,0) through (4,2,0), five along x and three along y]
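When an algorithm needs a single linear ID from multi-dimensional indices, a common convention is to flatten them with x varying fastest, then y, then z. The C++ sketch below illustrates that arithmetic on the host. `Dim3` is a hypothetical stand-in for the CUDA built-in vector types, the function names are illustrative, and the block ordering is a convention chosen here rather than something the hardware requires.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for CUDA's three-component index types.
struct Dim3 {
    int x;
    int y;
    int z;
};

// Linear ID of a thread within its block (x fastest, then y, then z).
int thread_id_in_block(Dim3 thread_idx, Dim3 block_dim) {
    return thread_idx.x
         + thread_idx.y * block_dim.x
         + thread_idx.z * block_dim.x * block_dim.y;
}

// Linear ID of a block within the grid, using the same ordering.
int block_id_in_grid(Dim3 block_idx, Dim3 grid_dim) {
    return block_idx.x
         + block_idx.y * grid_dim.x
         + block_idx.z * grid_dim.x * grid_dim.y;
}

// Unique linear ID across the whole launch.
int global_thread_id(Dim3 thread_idx, Dim3 block_idx,
                     Dim3 block_dim, Dim3 grid_dim) {
    const int threads_per_block = block_dim.x * block_dim.y * block_dim.z;
    return block_id_in_grid(block_idx, grid_dim) * threads_per_block
         + thread_id_in_block(thread_idx, block_dim);
}
```

For a 3 x 2 grid of 5 x 3 blocks, Thread (4,2,0) of Block (1,1,0) gets in-block ID 14, block ID 4, and global ID 4 * 15 + 14 = 74.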

Textures
• Read-only object
– Dedicated cache
• Dedicated filtering hardware
(Linear, bilinear, trilinear)
• Addressable as 1D, 2D or 3D
• Out-of-bounds address handling
(Wrap, clamp)
[Figure: 2D texture grid with sample points at (2.5, 0.5) and (1.0, 1.0)]

Topics we skipped
• We skipped some details, you can learn more:
– CUDA Programming Guide
– CUDA Zone – tools, training, webinars and more
developer.nvidia.com/cuda
• Need a quick primer for later:
– Multi-dimensional indexing
– Textures

An Introduction to the
Thrust Parallel Algorithms Library

What is Thrust?
• High-Level Parallel Algorithms Library
• Parallel Analog of the C++ Standard Template
Library (STL)
• Performance-Portable Abstraction Layer
• Productive way to program CUDA

Example
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>
int main(void)
{
// generate 32M random numbers on the host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;
// sort data on the device
thrust::sort(d_vec.begin(), d_vec.end());
// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
return 0;
}

Easy to Use
• Distributed with CUDA Toolkit
• Header-only library
• Architecture agnostic
• Just compile and run!
$ nvcc -O2 -arch=sm_20 program.cu -o program

Productivity
• Containers
host_vector
device_vector
• Memory Management
– Allocation
– Transfers
• Algorithm Selection
– Location is implicit
// allocate host vector with two elements
thrust::host_vector<int> h_vec(2);
// copy host data to device memory
thrust::device_vector<int> d_vec = h_vec;
// write device values from the host
d_vec[0] = 27;
d_vec[1] = 13;
// read device values from the host
int sum = d_vec[0] + d_vec[1];
// invoke algorithm on device
thrust::sort(d_vec.begin(), d_vec.end());
// memory automatically released

Productivity
• Large set of algorithms
– ~75 functions
– ~125 variations
• Flexible
– User-defined types
– User-defined operators
Algorithm | Description
reduce | Sum of a sequence
find | First position of a value in a sequence
mismatch | First position where two sequences differ
inner_product | Dot product of two sequences
equal | Whether two sequences are equal
min_element | Position of the smallest value
count | Number of instances of a value
is_sorted | Whether sequence is in sorted order
transform_reduce | Sum of transformed sequence
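Since Thrust is described as a parallel analog of the STL, the serial STL counterparts give a feel for what several of these algorithms compute. The sketch below uses only the C++ standard library, not Thrust. The wrapper names are hypothetical, `sum_of_squares` uses `std::accumulate` with a lambda as a stand-in for a transform-then-reduce, and `first_mismatch` assumes the second sequence is at least as long as the first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// reduce: sum of a sequence
int sum_of(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

// inner_product: dot product of two sequences
int dot(const std::vector<int>& a, const std::vector<int>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0);
}

// transform_reduce analog: sum of a transformed (squared) sequence
int sum_of_squares(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0,
                           [](int acc, int x) { return acc + x * x; });
}

// find: first position of a value (equals v.size() when absent)
std::ptrdiff_t first_position_of(const std::vector<int>& v, int value) {
    return std::find(v.begin(), v.end(), value) - v.begin();
}

// mismatch: first position where two sequences differ
std::ptrdiff_t first_mismatch(const std::vector<int>& a,
                              const std::vector<int>& b) {
    return std::mismatch(a.begin(), a.end(), b.begin()).first - a.begin();
}

// min_element: position of the smallest value
std::ptrdiff_t position_of_min(const std::vector<int>& v) {
    return std::min_element(v.begin(), v.end()) - v.begin();
}
```

The Thrust versions take the same iterator-pair arguments, which is what lets code written against one be adapted to the other with few changes.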

Interoperability
[Figure: Thrust shown alongside CUDA C/C++, CUBLAS, CUFFT, NPP, STL, CUDA Fortran, C/C++, OpenMP, and TBB]

Portability
• Support for CUDA, TBB and OpenMP
– Just recompile!
GeForce GTX 280
$ time ./monte_carlo
pi is approximately 3.14159
real 0m6.190s
user 0m6.052s
sys 0m0.116s
NVIDIA GeForce GTX 580 Core2 Quad Q6600
$ time ./monte_carlo
pi is approximately 3.14159
real 1m26.217s
user 11m28.383s
sys 0m0.020s
Intel Core i7 2600K
nvcc -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP

Backend System Options
Device Systems
THRUST_DEVICE_SYSTEM_CUDA
THRUST_DEVICE_SYSTEM_OMP
THRUST_DEVICE_SYSTEM_TBB
Host Systems
THRUST_HOST_SYSTEM_CPP
THRUST_HOST_SYSTEM_OMP
THRUST_HOST_SYSTEM_TBB

Multiple Backend Systems
• Mix different backends freely within the same app
thrust::omp::vector<float> my_omp_vec(100);
thrust::cuda::vector<float> my_cuda_vec(100);
...
// reduce in parallel on the CPU
thrust::reduce(my_omp_vec.begin(), my_omp_vec.end());
// sort in parallel on the GPU
thrust::sort(my_cuda_vec.begin(), my_cuda_vec.end());

Potential Workflow
• Implement Application with Thrust
• Profile Application
• Specialize Components as Necessary
[Figure: workflow from Thrust Implementation, to Profile Application (find the Application Bottleneck), to Specialize Components, yielding Optimized Code]

Performance Portability
[Figure: Thrust provides Transform, Scan, Sort, and Reduce on both the CUDA and OpenMP backends; Sort is shown as Radix Sort and Merge Sort, each across G80, GT200, Fermi, and Kepler]

Extensibility
• Customize temporary allocation
• Create new backend systems
• Modify algorithm behavior
• New in Thrust v1.6

Robustness
• Reliable
– Supports all CUDA-capable GPUs
• Well-tested
– ~850 unit tests run daily
• Robust
– Handles many pathological use cases

Openness
• Open Source Software
– Apache License
– Hosted on GitHub
• Welcome to
– Suggestions
– Criticism
– Bug Reports
– Contributions
thrust.github.com

Resources
• Documentation
• Examples
• Mailing List
• Webinars
• Publications
thrust.github.com

CUDA-Accelerated
Libraries
Drop-in Acceleration

Drop-In Acceleration (Step 1)
int N = 1 << 20;
// Perform SAXPY on 1M elements: y[]=a*x[]+y[]
saxpy(N, 2.0, d_x, 1, d_y, 1);

int N = 1 << 20;
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
Add “cublas” prefix and use device variables

int N = 1 << 20;
cublasInit();
// Allocate device vectors
cublasAlloc(N, sizeof(float), (void**)&d_x);
cublasAlloc(N, sizeof(float), (void**)&d_y);
// Deallocate device vectors
cublasFree(d_x);
cublasFree(d_y);
cublasShutdown();

int N = 1 << 20;
cublasInit();
cublasAlloc(N, sizeof(float), (void**)&d_x);
cublasAlloc(N, sizeof(float), (void**)&d_y);
// Transfer data to GPU
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
// Perform SAXPY on 1M elements: d_y[]=a*d_x[]+d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
// Read data back from GPU
cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);
cublasFree(d_x);
cublasFree(d_y);
cublasShutdown();

GPU Computing with
OpenACC Directives

A Very Simple Exercise: SAXPY
SAXPY in C
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma acc kernels
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
SAXPY in Fortran
subroutine saxpy(n, a, x, y)
real :: x(:), y(:), a
integer :: n, i
!$acc kernels
do i=1,n
y(i) = a*x(i)+y(i)
enddo
!$acc end kernels
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

Directive Syntax
• Fortran
!$acc directive [clause [[,] clause] …]
Often paired with a matching end directive surrounding
a structured code block
!$acc end directive
• C
#pragma acc directive [clause [[,] clause] …]
Often followed by a structured code block

kernels: Your first OpenACC Directive
Each loop executed as a separate kernel on the
GPU.
!$acc kernels
do i=1,n
a(i) = 0.0
b(i) = 1.0
c(i) = 2.0
end do
do i=1,n
a(i) = b(i) + c(i)
end do
!$acc end kernels
kernel 1
kernel 2
Kernel:
A parallel
function that runs
on the GPU

Kernels Construct
Fortran
!$acc kernels [clause …]
structured block
!$acc end kernels
Clauses
if( condition )
async( expression )
Also, any data clause (more later)
C
#pragma acc kernels [clause …]
{ structured block }

C tip: the restrict keyword
• Declaration of intent given by the programmer to the compiler
Applied to a pointer, e.g.
float *restrict ptr
Meaning: “for the lifetime of ptr, only it or a value directly derived
from it (such as ptr + 1) will be used to access the object to which it
points”*
• Limits the effects of pointer aliasing
• OpenACC compilers often require restrict to determine
independence
– Otherwise the compiler can’t parallelize loops that access ptr
– Note: if programmer violates the declaration, behavior is undefined
http://en.wikipedia.org/wiki/Restrict

Complete SAXPY example code
• Trivial first example
– Apply a loop directive
– Learn compiler commands
#include <stdlib.h>
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma acc kernels
for (int i = 0; i < n; ++i)
y[i] = a * x[i] + y[i];
}
int main(int argc, char **argv)
{
int N = 1<<20; // 1 million floats
if (argc > 1)
N = atoi(argv[1]);
float *x = (float*)malloc(N * sizeof(float));
float *y = (float*)malloc(N * sizeof(float));
for (int i = 0; i < N; ++i)
{
x[i] = 2.0f;
y[i] = 1.0f;
}
saxpy(N, 3.0f, x, y);
return 0;
}
*restrict:
“I promise y does not
alias x”

Compile and run
• C:
pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c
• Fortran:
pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90
• Compiler output:
pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
saxpy:
8, Generating copyin(x[:n-1])
Generating copy(y[:n-1])
Generating compute capability 1.0 binary
9, Loop is parallelizable
Accelerator kernel generated
9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy

Example: Jacobi Iteration
• Iteratively converges to correct value (e.g.
Temperature), by computing new values at each
point from the average of neighboring points.
– Common, useful algorithm
– Example: Solve Laplace equation in 2D: $\nabla^2 f(x, y) = 0$
[Figure: five-point stencil showing A(i,j) and its neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1)]
$$A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$$

Jacobi Iteration C Code
// Iterate until converged (fmax and fabs come from <math.h>)
while ( error > tol && iter < iter_max )
{
  error=0.0;
  // Iterate across matrix elements
  for( int j = 1; j < n-1; j++) {
    for(int i = 1; i < m-1; i++) {
      // Calculate new value from neighbors
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      // Compute max error for convergence
      error = fmax(error, fabs(Anew[j][i] - A[j][i]));
    }
  }
  // Swap input/output arrays
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}
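The fragment above leaves out the declarations and setup. A minimal self-contained sketch of the same iteration follows, assuming a small fixed 8x8 grid and two hypothetical function names (`jacobi_sweep`, `jacobi_solve`) that do not appear in the slides. The max and absolute value are written out by hand so the code needs no math library.

```c
#define N 8   /* grid rows (illustrative size) */
#define M 8   /* grid columns (illustrative size) */

/* One Jacobi sweep over the interior points.
   Returns the largest absolute change of any point. */
double jacobi_sweep(double A[N][M], double Anew[N][M])
{
    double error = 0.0;
    for (int j = 1; j < N - 1; j++) {
        for (int i = 1; i < M - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            double diff = Anew[j][i] - A[j][i];
            if (diff < 0.0) diff = -diff;
            if (diff > error) error = diff;
        }
    }
    /* Copy the new interior values back into A. */
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < M - 1; i++)
            A[j][i] = Anew[j][i];
    return error;
}

/* Sweeps until the change drops to tol or iter_max is reached.
   Returns the number of sweeps performed. */
int jacobi_solve(double A[N][M], double tol, int iter_max)
{
    double Anew[N][M];          /* scratch; only interior points are used */
    double error = tol + 1.0;   /* force at least one sweep */
    int iter = 0;
    while (error > tol && iter < iter_max) {
        error = jacobi_sweep(A, Anew);
        iter++;
    }
    return iter;
}
```

With the left boundary column held at 1.0 and everything else starting at 0.0, the first sweep changes the neighboring interior column to 0.25, so the first returned error is 0.25.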

Jacobi Iteration Fortran Code
! Iterate until converged
do while ( err > tol .and. iter < iter_max )
  err=0._fp_kind
  ! Iterate across matrix elements
  do j=1,m
    do i=1,n
      ! Calculate new value from neighbors
      Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
                                 A(i , j-1) + A(i , j+1))
      ! Compute max error for convergence
      err = max(err, abs(Anew(i,j) - A(i,j)))
    end do
  end do
  ! Swap input/output arrays
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do
  iter = iter +1
end do

OpenMP C Code
while ( error > tol && iter < iter_max ) {
  error=0.0;
  // Parallelize loop across CPU threads
  // (the max reduction needs OpenMP 3.1 or later in C)
  #pragma omp parallel for shared(m, n, Anew, A) reduction(max:error)
  for( int j = 1; j < n-1; j++) {
    for(int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = fmax(error, fabs(Anew[j][i] - A[j][i]));
    }
  }
  // Parallelize loop across CPU threads
  #pragma omp parallel for shared(m, n, Anew, A)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}

OpenMP Fortran Code
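do while ( err > tol .and. iter < iter_max )
  err=0._fp_kind
  ! Parallelize loop across CPU threads
  !$omp parallel do shared(m,n,Anew,A) reduction(max:err)
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
                                 A(i , j-1) + A(i , j+1))
      err = max(err, abs(Anew(i,j) - A(i,j)))
    end do
  end do
  ! Parallelize loop across CPU threads
  !$omp parallel do shared(m,n,Anew,A)
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do
  iter = iter +1
end do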

GPU startup overhead
• If no other GPU process is running, the GPU driver may be
swapped out
– Linux specific
– Starting it up can take 1-2 seconds
• Two options
– Run nvidia-smi in persistence mode (requires root permissions)
– Run “nvidia-smi -q -l 30” in the background
• If your running time is off by ~2 seconds from results in
these slides, suspect this
– nvidia-smi should be running in persistence mode for these
exercises

First Attempt: OpenACC C
while ( error > tol && iter < iter_max ) {
  error=0.0;
  // Execute GPU kernel for loop nest
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for(int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = fmax(error, fabs(Anew[j][i] - A[j][i]));
    }
  }
  // Execute GPU kernel for loop nest
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}

First Attempt: OpenACC Fortran
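do while ( err > tol .and. iter < iter_max )
  err=0._fp_kind
  ! Generate GPU kernel for loop nest
  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
                                 A(i , j-1) + A(i , j+1))
      err = max(err, abs(Anew(i,j) - A(i,j)))
    end do
  end do
  !$acc end kernels
  ! Generate GPU kernel for loop nest
  !$acc kernels
  do j=1,m-2
    do i=1,n-2
      A(i,j) = Anew(i,j)
    end do
  end do
  !$acc end kernels
  iter = iter +1
end do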

First Attempt: Compiler output (C)
pgcc -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
main:
57, Generating copyin(A[:4095][:4095])
Generating copyout(Anew[1:4094][1:4094])
58, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */
Cached references to size [18x18] block of 'A'
64, Max reduction generated for error
69, Generating copyout(A[1:4094][1:4094])
Generating copyin(Anew[1:4094][1:4094])
70, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */

First Attempt: Performance
                        Execution Time (s)   Speedup
CPU 1 OpenMP thread     69.80                --
CPU 2 OpenMP threads    44.76                1.56x
OpenACC GPU             162.16               0.24x FAIL
CPU speedups are relative to 1 CPU core; the GPU speedup is relative to 6 CPU cores.
CPU: Intel Xeon X5680, 6 Cores @ 3.33GHz
GPU: NVIDIA Tesla M2070

Basic Concepts
[Figure: CPU with CPU memory and GPU with GPU memory, connected by the PCI bus; data is transferred and computation is offloaded across the bus]
For efficiency, decouple data movement and compute off-load

Excessive Data Transfers
while ( error > tol && iter < iter_max )
{
  error=0.0;
  // A, Anew resident on host
  // Copy (host to accelerator)
  #pragma acc kernels
  // A, Anew resident on accelerator
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
    }
  }
  // A, Anew resident on accelerator
  // Copy (accelerator to host)
  // A, Anew resident on host
  ...
}
These copies happen every iteration of the outer while loop!*
*Note: there are two #pragma acc kernels, so there are 4 copies per while loop iteration!
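To get a feel for the cost, the transfer volume per iteration can be estimated with a few lines of arithmetic. The sketch below assumes a 4096 x 4096 grid (the size suggested by the compiler output shown earlier), 8-byte double-precision elements, and that each of the 4 copies moves a whole array. The element type and the helper name `transfer_bytes_per_iteration` are assumptions, not taken from the slides.

```c
#include <stddef.h>

/* Rough estimate of bytes moved across the PCI bus in one
   while-loop iteration: every copy moves one rows x cols array. */
size_t transfer_bytes_per_iteration(size_t rows, size_t cols,
                                    size_t elem_size, size_t copies)
{
    return rows * cols * elem_size * copies;
}
```

Under those assumptions, `transfer_bytes_per_iteration(4096, 4096, 8, 4)` gives 536870912 bytes, that is 512 MiB per iteration, which is consistent with the GPU version running slower than the CPU.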

Data Construct
Fortran
!$acc data [clause …]
structured block
!$acc end data
General Clauses
if( condition )
async( expression )
C
#pragma acc data [clause …]
Manage data movement. Data regions may be
nested.

Data Clauses
copy ( list ): Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.
copyin ( list ): Allocates memory on GPU and copies data from host to GPU when entering region.
copyout ( list ): Allocates memory on GPU and copies data to the host when exiting region.
create ( list ): Allocates memory on GPU but does not copy.
present ( list ): Data is already present on GPU from another containing data region.
Also: present_or_copy[in|out], present_or_create, deviceptr.
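A minimal sketch of how these clauses combine in one data region. The function `scale_then_shift` and its parameters are hypothetical. The input is only copied in, the result is only copied out, and the intermediate array is created on the device without any transfer, so the caller should not rely on the contents of `tmp` afterwards. Without an OpenACC compiler the pragmas are ignored and the code runs on the host.

```c
/* Hypothetical example: out[i] = 2 * in[i] + 1, using tmp as scratch space.
   All three arrays have length n and must not overlap. */
void scale_then_shift(int n, const float *restrict in,
                      float *restrict out, float *restrict tmp)
{
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
    {
        /* Both kernels reuse the device copies set up by the data region. */
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * in[i];

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i] + 1.0f;
    }
}
```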

Array Shaping
• Compiler sometimes cannot determine size of arrays
– Must specify explicitly using data clauses and array “shape”
• C
#pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4])
• Fortran
!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
• Note: data clauses can be used on data, kernels or
parallel
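A short sketch of explicit shaping on pointer arguments, where the compiler has no way to know the array extents. It assumes the OpenACC convention that a C array section is written `[start:length]`. The function `negate_middle` is hypothetical and touches only the middle half of the arrays, so only that section is transferred.

```c
/* Hypothetical example: b[i] = -a[i] for the middle half of two
   arrays of length s. Elements outside that range are left alone. */
void negate_middle(int s, const float *restrict a, float *restrict b)
{
    int start = s / 4;   /* first element to process */
    int count = s / 2;   /* number of elements to process */

    /* Shape is [start:length] in C, so only count elements move. */
    #pragma acc kernels copyin(a[start:count]) copyout(b[start:count])
    for (int i = start; i < start + count; ++i)
        b[i] = -a[i];
}
```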

Update Construct
Fortran
!$acc update [clause …]
C
#pragma acc update [clause …]
Clauses
host( list )
device( list )
if( expression )
async( expression )
Used to update existing data after it has changed in its
corresponding copy (e.g. update device copy after host copy
changes)
Move data from GPU to host, or host to GPU.
Data movement can be conditional, and asynchronous.
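As a sketch of a typical use: data stays on the device for the whole data region, and `update host` refreshes just one element so the host can inspect progress between kernels. The function `step_and_peek` and its parameters are hypothetical. On a compiler without OpenACC the pragmas are ignored and the host copy is always current.

```c
/* Hypothetical example: adds 1.0 to every element of x, steps times,
   and returns the value of x[0] seen by the host after the last step. */
float step_and_peek(int n, float *restrict x, int steps)
{
    float first = 0.0f;
    #pragma acc data copy(x[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma acc kernels
            for (int i = 0; i < n; ++i)
                x[i] += 1.0f;

            /* Bring only x[0] back; the rest stays on the device. */
            #pragma acc update host(x[0:1])
            first = x[0];
        }
    }
    return first;
}
```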

Second Attempt: OpenACC C
// Copy A in at beginning of loop, out at end. Allocate Anew on accelerator
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max ) {
  error=0.0;
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for(int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      error = fmax(error, fabs(Anew[j][i] - A[j][i]));
    }
  }
  #pragma acc kernels
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}

Second Attempt: OpenACC Fortran
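! Copy A in at beginning of loop, out at end. Allocate Anew on accelerator
!$acc data copy(A), create(Anew)
do while ( err > tol .and. iter < iter_max )
  err=0._fp_kind
  !$acc kernels
  do j=1,m
    do i=1,n
      Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
                                 A(i , j-1) + A(i , j+1))
      err = max(err, abs(Anew(i,j) - A(i,j)))
    end do
  end do
  !$acc end kernels
  ...
  iter = iter +1
end do
!$acc end data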

Second Attempt: Performance
                        Execution Time (s)   Speedup
CPU 1 OpenMP thread     69.80                --
OpenACC GPU             13.65                2.9x
CPU speedups are relative to 1 CPU core; the GPU speedup is relative to 6 CPU cores.
CPU: Intel Xeon X5680, 6 Cores @ 3.33GHz
GPU: NVIDIA Tesla M2070
Note: same code runs in 9.78s on NVIDIA Tesla M2090 GPU

Further speedups
• OpenACC gives us more detailed control over
parallelization
– Via gang, worker, and vector clauses
• By understanding more about OpenACC execution
model and GPU hardware organization, we can get
higher speedups on this code
• By understanding bottlenecks in the code via profiling,
we can reorganize the code for higher performance
• Will tackle these in later exercises

Finding Parallelism in your code
• (Nested) for loops are best for parallelization
• Large loop counts needed to offset GPU/memcpy overhead
• Iterations of loops must be independent of each other
– To help compiler: restrict keyword (C), independent
clause
• Compiler must be able to figure out sizes of data regions
– Can use directives to explicitly control sizes
• Pointer arithmetic should be avoided if possible
– Use subscripted arrays, rather than pointer-indexed arrays.
• Function calls within accelerated region must be inlineable.
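A sketch of the `independent` clause for a loop the compiler cannot analyze on its own. The function `scatter_scale` is hypothetical. Because the writes go through an index array, the compiler cannot prove that two iterations never hit the same element. The clause is the programmer's promise that they do not, which here rests on the assumption that `idx` holds distinct values in the range 0 to n-1.

```c
/* Hypothetical example: dst[idx[i]] = 2 * src[i].
   Assumption: idx contains n distinct indices in [0, n). */
void scatter_scale(int n, const int *restrict idx,
                   const float *restrict src, float *restrict dst)
{
    #pragma acc kernels copyin(idx[0:n], src[0:n]) copy(dst[0:n])
    {
        /* Indirect writes defeat dependence analysis, so assert
           independence explicitly. */
        #pragma acc loop independent
        for (int i = 0; i < n; ++i)
            dst[idx[i]] = 2.0f * src[i];
    }
}
```

If the indices were not distinct, the promise would be false and the result would depend on execution order.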

Tips and Tricks
• (PGI) Use time option to learn where time is being
spent
-ta=nvidia,time
• Eliminate pointer arithmetic
• Inline function calls in directives regions
(PGI): -inline or -inline,levels(<N>)
• Use contiguous memory for multi-dimensional arrays
• Use data regions to avoid excessive memory transfers
• Conditional compilation with _OPENACC macro
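One way to follow the contiguous-memory tip is to allocate a 2D grid as a single block and compute the row offset by hand, rather than allocating one block per row. The helpers `alloc_grid` and `fill_grid` below are hypothetical names used only for illustration. A single block can be described by one array shape and moved in one transfer.

```c
#include <stdlib.h>

/* Hypothetical helper: one contiguous block for a rows x cols grid.
   Returns NULL on allocation failure; release with free(). */
float *alloc_grid(int rows, int cols)
{
    return malloc((size_t)rows * (size_t)cols * sizeof(float));
}

/* Fills element (j, i) with its linear index j * cols + i. */
void fill_grid(int rows, int cols, float *restrict g)
{
    #pragma acc kernels copyout(g[0:rows*cols])
    for (int j = 0; j < rows; ++j)
        for (int i = 0; i < cols; ++i)
            g[j * cols + i] = (float)(j * cols + i);
}
```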

OpenACC Learning Resources
• OpenACC info, specification, FAQ, samples,
and more
– http://openacc.org
• PGI OpenACC resources
– http://www.pgroup.com/resources/accel.htm

Parallel Construct
Fortran
!$acc parallel [clause …]
structured block
!$acc end parallel
C
#pragma acc parallel [clause …]
Clauses
if( condition )
async( expression )
num_gangs( expression )
num_workers( expression )
vector_length( expression )
private( list )
firstprivate( list )
reduction( operator:list )
Also any data clause

Parallel Clauses
num_gangs ( expression ): Controls how many parallel gangs are created (CUDA gridDim).
num_workers ( expression ): Controls how many workers are created in each gang (CUDA blockDim).
vector_length ( expression ): Controls vector length of each worker (SIMD execution).
private( list ): A copy of each variable in list is allocated to each gang.
firstprivate ( list ): private variables initialized from host.
reduction( operator:list ): private variables combined across gangs.
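A sketch that puts several of these clauses on one combined `parallel loop`. The function name `dot` and the values 64 and 128 are arbitrary illustrations, not tuned recommendations. Each gang accumulates a private partial sum, and the `reduction` clause combines them into `sum` at the end.

```c
/* Hypothetical example: dot product of two length-n vectors. */
float dot(int n, const float *restrict x, const float *restrict y)
{
    float sum = 0.0f;
    /* 64 gangs and a vector length of 128 are placeholder values. */
    #pragma acc parallel loop num_gangs(64) vector_length(128) \
            copyin(x[0:n], y[0:n]) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```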

Loop Construct
Fortran
!$acc loop [clause …]
loop
!$acc end loop
Combined directives
!$acc parallel loop [clause …]
!$acc kernels loop [clause …]
C
#pragma acc loop [clause …]
{ loop }
Combined directives
#pragma acc parallel loop [clause …]
#pragma acc kernels loop [clause …]
Detailed control of the parallel execution of the
following loop.

Loop Clauses
collapse( n ) Applies directive to the following
n nested loops.
seq Executes the loop sequentially on
the GPU.
private( list ) A copy of each variable in list is
created for each iteration of the
loop.
reduction( operator:list ) private variables combined
across iterations.
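A sketch combining `collapse` and `reduction` on a tightly nested loop pair. The grid size and the function name `max_abs` are illustrative assumptions. `collapse(2)` treats the two loops as one iteration space, and `reduction(max:m)` merges the per-iteration maxima into a single result.

```c
#define ROWS 4   /* illustrative size */
#define COLS 5   /* illustrative size */

/* Hypothetical example: largest absolute value in a ROWS x COLS grid. */
float max_abs(float a[ROWS][COLS])
{
    float m = 0.0f;
    #pragma acc parallel loop collapse(2) reduction(max:m) \
            copyin(a[0:ROWS][0:COLS])
    for (int j = 0; j < ROWS; ++j) {
        for (int i = 0; i < COLS; ++i) {
            float v = a[j][i];
            if (v < 0.0f) v = -v;   /* absolute value */
            if (v > m) m = v;       /* running maximum */
        }
    }
    return m;
}
```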

Loop Clauses Inside parallel Region
gang Shares iterations across the gangs
of the parallel region.
worker Shares iterations across the
workers of the gang.
vector Execute the iterations in SIMD
mode.

Loop Clauses Inside kernels Region
gang [( num_gangs )] Shares iterations across
at most num_gangs gangs.
worker [( num_workers )] Shares iterations across at
most num_workers of a single
gang.
vector [( vector_length )] Execute the iterations in SIMD
mode with maximum
vector_length.
independent Specify that the loop iterations
are independent.

Other Directives
cache construct Cache data in software
managed data cache (CUDA
shared memory).
host_data construct Makes the address of device
data available on the host.
wait directive Waits for asynchronous GPU
activity to complete.
declare directive Specify that data is to be
allocated in device memory
for the duration of an implicit
data region created during the
execution of a subprogram.

Runtime Library Routines
Fortran
use openacc
or
#include "openacc_lib.h"
C
#include "openacc.h"
Routines
acc_get_num_devices
acc_set_device_type
acc_get_device_type
acc_set_device_num
acc_get_device_num
acc_async_test
acc_async_test_all
acc_async_wait
acc_async_wait_all
acc_shutdown
acc_on_device
acc_malloc
acc_free

Environment and Conditional Compilation
ACC_DEVICE_TYPE device Specifies which device type to
connect to.
ACC_DEVICE_NUM num Specifies which device
number to connect to.
_OPENACC Preprocessor macro for
conditional compilation. Set
to the OpenACC version.
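A sketch of conditional compilation with `_OPENACC`, so the same source builds with or without an OpenACC compiler. The wrapper name `count_accelerators` is hypothetical. `acc_get_num_devices` and the device type `acc_device_not_host` come from the OpenACC runtime library.

```c
#ifdef _OPENACC
#include <openacc.h>   /* only available when compiling with OpenACC */
#endif

/* Hypothetical helper: number of accelerator devices visible to the
   program, or 0 when the code was built without OpenACC support. */
int count_accelerators(void)
{
#ifdef _OPENACC
    return acc_get_num_devices(acc_device_not_host);
#else
    return 0;
#endif
}
```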
