Tech Talk NVIDIA CUDA
1. NVIDIA CUDA
The Compute Unified Device Architecture
Jens Rühmkorf
Tech Talk, DLR Köln-Porz, July 22nd 2009
2. References
University of Illinois at Urbana-Champaign,
Wen-Mei Hwu & David Kirk, course ECE 498 AL, Spring 2009:
http://courses.ece.illinois.edu/ece498/al/
Website about General-Purpose Computation on Graphics Hardware:
http://gpgpu.org/developer/cuda
ACM Queue, Vol. 6 No. 2, March/April 2008 (Issue on GPGPU):
http://mags.acm.org/queue/20080304/
Dr. Dobb's: CUDA, Supercomputing for the Masses, Part 1-13:
http://www.ddj.com/architect/207200659
NVIDIA CUDA Best Practices Guide
http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide_2.3.pdf
Hubert Nguyen (ed.), GPU Gems 3, Addison-Wesley, 2007, online:
http://developer.nvidia.com/object/gpu-gems-3.html
3. Multi- and Manycore Architectures
A Difficult Road Lies Ahead
Don Knuth on Multicore Architectures
“[…] my personal unhappiness with the current trend toward multicore
architecture. To me, it looks more or less like the hardware designers
have run out of ideas, and that they’re trying to pass the blame for the
future demise of Moore’s Law to the software writers by giving us
machines that work faster only on a few key benchmarks! I won’t be
surprised at all if the whole multithreading idea turns out to be a flop”
In: InformIT, April 25th 2008
http://www.informit.com/articles/article.aspx?p=1193856
4. Overview
A high-level view of CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
5. Multicore and Manycore (1)
Structural Differences
Multicore: yoke of oxen
Each core optimized for executing a single thread
Manycore: flock of chickens
Cores optimized for aggregate throughput,
deemphasizing individual performance
6. Multicore and Manycore (2)
Technical Characteristics
Specifications            Core i7 960                        GTX285
Processing Elements       4 cores, 4-way SIMD @ 3.2 GHz      30 cores, 8-way SIMD @ 1.5 GHz
Resident Threads (max)    4 cores, 2 threads each,           30 cores, 32 SIMD vectors,
                          4-wide SIMD: 32 strands            32-wide SIMD: 30720 strands
SP GFLOP/s                102                                1080
Memory Bandwidth          25.6 GB/s                          159 GB/s
Register File             -                                  1.875 MB
Local Store               -                                  480 kB
7. Multicore and Manycore (3)
Performance Comparison: CPU vs. GPU
(Chart: single-precision floating-point operations per second on the y-axis, CPU vs. GPU)
8. An Example of the Physical Reality Behind CUDA
(Diagram: the CPU (host) and the GPU with its local DRAM (device))
10. CUDA in a Nutshell
Key Characteristics
CUDA is designed for wide SIMD/SPMD parallelism & scalability
CUDA provides three key abstractions:
a hierarchy of thread groups,
shared memories, and
barrier synchronization
CUDA programs are written in C + extensions
OpenCL is inspired by CUDA, but HW & SW vendor neutral
Programming model essentially identical
11. Hello World
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
int main() {
// Run N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
hello-world.cu
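The launch above leaves out the host-side setup. A minimal sketch of that setup (illustrative only; the names h_A/h_B/h_C and the choice N = 1024 are assumptions, and N is assumed to be a multiple of 256):

// Hypothetical host-side setup for vecAdd (a sketch, not from the original slide)
#include <cstdlib>

int main() {
  const int N = 1024;                      // assumed: N is a multiple of 256
  size_t size = N * sizeof(float);
  float* h_A = (float*)malloc(size);
  float* h_B = (float*)malloc(size);
  float* h_C = (float*)malloc(size);
  // ... fill h_A and h_B with input data ...
  float *d_A, *d_B, *d_C;
  cudaMalloc((void**)&d_A, size);
  cudaMalloc((void**)&d_B, size);
  cudaMalloc((void**)&d_C, size);
  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
  vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);   // run N/256 blocks of 256 threads each
  cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  free(h_A); free(h_B); free(h_C);
  return 0;
}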
12. Overview
A high-level view of CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
13. CUDA Programming Model
Structure of a CUDA application
Integrated host + device application C program
Serial or modestly parallel parts in host C code
Highly parallel parts in device SPMD kernel C code
Serial Code (host)
Parallel Kernel (device)
KernelA<<< nBlk, nTid >>>(args); ...
Serial Code (host)
Parallel Kernel (device)
KernelB<<< nBlk, nTid >>>(args); ...
14. CUDA Programming Model
CUDA Devices and Threads
A compute device
Is a coprocessor to the CPU or host
Has its own DRAM (device memory)
Runs many threads in parallel
Is typically a GPU but can also be another type of parallel
processing device
Express data-parallel portions as device kernels
(which run on many threads)
15. CUDA Programming Model
Arrays of Parallel Threads
Execute a Kernel by specifying arrays of threads
All threads run the same code (SPMD)
Use thread-ID to compute memory addresses & make control decisions
threadID 0 1 2 3 4 5 6 7
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
16. CUDA Programming Model
Use Thread Blocks for (Scalable) Cooperation
Divide monolithic thread array into multiple blocks
Threads within a block can cooperate via
shared memory,
atomic operations, and
barrier synchronization
Threads in different blocks cannot cooperate
(Figure: Thread Block 0, Thread Block 1, ..., Thread Block N-1; each block contains threads with threadID 0 1 2 3 4 5 6 7, and every block runs the same code:)
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
(A small sketch of intra-block cooperation follows below.)
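A sketch of such cooperation (not from the original slides; the kernel name and BLOCK_SIZE are illustrative): each block stages its elements in shared memory, waits at the barrier, then every thread reads a value written by another thread of the same block.

#define BLOCK_SIZE 256

// Sketch: threads of one block cooperate via shared memory and __syncthreads().
// Launch with BLOCK_SIZE threads per block.
__global__ void reverseWithinBlock(float* data) {
  __shared__ float buf[BLOCK_SIZE];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  buf[threadIdx.x] = data[i];                    // each thread loads one element
  __syncthreads();                               // barrier: the whole block has loaded
  data[i] = buf[blockDim.x - 1 - threadIdx.x];   // read an element loaded by another thread
}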
17. CUDA Programming Model
Organisation of Thread Blocks
Thread blocks can be one-, two-, or three-dimensional arrays
The host issues a sequence of kernel invocations (kernel 1, kernel 2) to the device
Each kernel is executed as a batch of threads
This batch is organized as a grid of thread blocks
(Figure: two kernel launches, each executed as a grid of 2-dimensional thread blocks)
18. CUDA Programming Model
Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on:
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data (see the sketch below)
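For example, a hypothetical kernel over a row-major width x height image can derive pixel coordinates directly from the 2D block and thread IDs (a sketch, not from the slides):

__global__ void scalePixels(float* img, int width, int height, float s) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
  int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
  if (x < width && y < height)
    img[y * width + x] *= s;                      // row-major addressing
}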
19. Overview
A high-level view of CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
20. CUDA Memory Model
Overview
Global memory
Main means of communicating R/W data between host and device
Contents visible to all threads
Long latency access
We will focus on global memory for now
Constant and texture memory will not be covered here
(Figure: a grid of thread blocks; each block has shared memory and per-thread registers; the host reads and writes global, constant, and texture memory)
21. CUDA Memory Model
CUDA Device Memory Allocation
cudaMalloc()
Allocates an object in device global memory
Requires two parameters:
Address of a pointer to the allocated object
Size of the allocated object
cudaFree()
Frees an object from device global memory
Takes the pointer to the freed object
(Figure: the same memory hierarchy diagram as before)
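A minimal usage sketch (illustrative; Md and width follow the naming of the matrix example later in this talk):

float* Md;                                 // will point to device global memory
int size = width * width * sizeof(float);
cudaMalloc((void**)&Md, size);             // address of the pointer, size in bytes
// ... use Md in kernels ...
cudaFree(Md);                              // release the allocation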
22. CUDA Memory Model
CUDA Host-Device Data Transfer
cudaMemcpy()
Memory data transfer
Requires four parameters:
Pointer to destination
Pointer to source
Number of bytes to copy
Type of transfer, one of:
Host to Host
Host to Device
Device to Host
Device to Device
Asynchronous transfer
(Figure: the same memory hierarchy diagram as before)
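A minimal usage sketch (illustrative, continuing the allocation example above; the transfer type is one of the symbolic constants cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, ...):

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host -> device
// ... launch kernels that read and write Md ...
cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);   // device -> host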
23. Overview
A high-level view of CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
24. CUDA API
Extended C
(Figure: compilation flow)
Integrated source (foo.cu)
-> cudacc (EDG C/C++ frontend, Open64 Global Optimizer)
-> GPU assembly (foo.s) and CPU host code (foo.cpp)
-> OCG generates G80 SASS (foo.sass); the host code is compiled with gcc / cl
Mark Murphy, "NVIDIA's Experience with Open64",
www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
25. CUDA API
C for CUDA
Function type specifiers:
__global__, __device__, __host__
Variable type specifiers:
__device__, __shared__, __constant__
Keywords:
threadIdx, blockIdx
Intrinsics / built-in functions:
__syncthreads()
Runtime API:
Memory, symbol, execution management
Function launch

image-convolution.cu:

__device__ float filter[N];

__global__ void convolve(float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);
// 100 blocks, 10 threads per block
convolve<<<100, 10>>>((float*)myimage);
26. CUDA API
CUDA Function Type Qualifiers (1)
                                 executed on:   only callable from:
__device__ float deviceFunc()    device         device
__global__ void  kernelFunc()    device         host
__host__   float hostFunc()      host           host
__global__:
defines a kernel function
must return void
__device__ and __host__ can be used together
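A small illustration (a sketch, not from the slides) of how the three qualifiers combine:

__device__ float square(float x) { return x * x; }            // device only
__host__ __device__ float twice(float x) { return 2.0f * x; } // compiled for host and device
__global__ void applyBoth(float* v) {                         // kernel, launched from the host
  int i = threadIdx.x;
  v[i] = square(twice(v[i]));
}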
27. CUDA API
CUDA Function Type Qualifiers (2)
__device__ functions cannot have their address taken
For functions executed on the device:
no recursion
no static variable declarations inside the function
no variable number of arguments
28. CUDA API
CUDA Variable Type Qualifiers
__device__
Resides in global memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through
the runtime library.
__shared__ (optionally used together with __device__)
Resides in the shared memory space of a thread block,
Has the lifetime of the block,
Is only accessible from all the threads within the block.
Not covered here:
__constant__ (optionally used together with __device__)
Resides in constant memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through
the runtime library.
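A small illustration (a sketch, not from the slides; the array sizes are arbitrary and the kernel assumes at most 128 threads per block):

__device__   float globalBuf[128];   // global memory, lifetime of the application
__constant__ float coeff[16];        // constant memory, read-only from kernels
__global__ void useQualifiers(float* out) {
  __shared__ float tile[128];        // shared memory, lifetime of the block
  tile[threadIdx.x] = globalBuf[threadIdx.x] * coeff[0];
  __syncthreads();
  out[threadIdx.x] = tile[threadIdx.x];
}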
29. CUDA API
Calling a Kernel Function – Execution Configuration
A kernel function ( == __global__ function) must be called with an
execution configuration:
__global__ void kernelFunc(...) {...};
dim3 dimGrid(100, 50); // 5000 thread blocks
dim3 dimBlock(4, 8, 8); // 256 threads per block
size_t sharedMemBytes = 64; // 64 bytes of shared memory
kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);
Any call to a kernel function is asynchronous from CUDA 1.0 on
Explicit synchronization is needed for blocking (see the sketch below)
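A minimal sketch of such explicit blocking, assuming the CUDA 2.x runtime call cudaThreadSynchronize():

kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);  // returns immediately
cudaThreadSynchronize();   // block the host until the kernel has finished

A blocking cudaMemcpy() issued after the launch has the same effect, since it waits for the preceding device work to complete.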
30. Overview
A high-level view of CUDA
CUDA programming model
CUDA memory model
CUDA application programming interface
Simple CUDA example
31. A Simple CUDA Example
Matrix Multiplication
A simple matrix multiplication example that illustrates the basic features
of memory and thread management in CUDA programs
Leave shared memory usage until later
Local, register usage
Thread ID usage
Memory data transfer API between host and device
Assume square matrix for simplicity
32. Simple CUDA Example
Square Matrix Multiplication
P = M * N of size WIDTH x WIDTH
Here: without tiling!
One thread calculates one element of P
M and N are loaded WIDTH times from global memory
(Figure: matrices M, N, and P, each WIDTH x WIDTH)
33. Memory Layout of a Matrix in C
M0,0 M1,0 M2,0 M3,0
M0,1 M1,1 M2,1 M3,1
M0,2 M1,2 M2,2 M3,2
M0,3 M1,3 M2,3 M3,3
M
M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3
34. Step 1: Matrix Multiplication
A Simple Host Version in C
(Figure: matrices M, N, P with loop indices i, j, k)

// Matrix multiplication on the host (CPU) in double precision
void matrixMulOnHost(float* M, float* N, float* P, int width) {
  for (int i = 0; i < width; ++i)
    for (int j = 0; j < width; ++j) {
      double sum = 0;
      for (int k = 0; k < width; ++k) {
        double a = M[i * width + k];
        double b = N[k * width + j];
        sum += a * b;
      }
      P[i * width + j] = sum;
    }
}
35. Step 2: Input Matrix Data Transfer
(Host-sided Code)
void matrixMulOnDevice(float* M, float* N, float* P, int width){
int size = width * width * sizeof(float);
float* Md, *Nd, *Pd;
…
1. // Allocate and load M, N to device memory
cudaMalloc(&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMalloc(&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
// Allocate P on the device
cudaMalloc(&Pd, size);
36. Step 3: Output Matrix Data Transfer
(Host-sided Code)
2. // Kernel invocation code – to be shown later
…
3. // Read P from the device
cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
// Free device matrices
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
}
37. Step 4: Kernel Function (1)
// Matrix multiplication kernel – per thread code
__global__ void matrixMulKernel(float* Md, float* Nd,
float* Pd, int width) {
// Pvalue is used to store the element of the matrix
// that is computed by the thread
float Pvalue = 0;
// see next page…
38. Step 4: Kernel Function (2)
(Figure: matrices Md, Nd, Pd with thread indices tx, ty)

  for (int k = 0; k < width; ++k) {
    float Melement = Md[threadIdx.y * width + k];
    float Nelement = Nd[k * width + threadIdx.x];
    Pvalue += Melement * Nelement;
  }

  int i = threadIdx.y * width + threadIdx.x;
  Pd[i] = Pvalue;
}
39. Step 5: Kernel Invocation
(Host-sided Code)
// Insert into step 2. from before
// Setup the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(width, width);
// Launch the device computation threads!
matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
40. Example Far from Ideal
Only One Thread Block Used
One block of threads computes the matrix Pd
Each thread computes one element of Pd
Each thread:
Loads a row of matrix Md
Loads a column of matrix Nd
Performs one multiply and one addition for each pair of Md and Nd elements
Compute to off-chip memory access ratio is close to 1:1 (not very high)
Size of the matrix is limited by the number of threads allowed in a thread block
(Figure: Grid 1 with a single Block 1 of size WIDTH x WIDTH; thread (2, 2) computes one element of Pd from a row of Md and a column of Nd)
41. CUDA: A Bright Future?
"But let your word be: Yes, yes; no, no. Whatever is more than this comes from evil."
Matthew 5:37
42. NVIDIA CUDA: Appendix
Best Practices & Things to Watch Out For
43. Appendix
Best Practices & Things to Watch Out For
Obtain relevant hardware data
Compiling a CUDA program
Linking
Debugging
C for CUDA vs. CUDA Driver API
Watch out: floating point computations
Unsupported C language elements
Branching of code
Coalesced access to device global memory
Access patterns to avoid bank conflicts
44. Obtain Relevant Hardware Data
Make sure to obtain relevant additional hardware data
Call cudaGetDeviceProperties()
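A minimal query sketch (assuming the CUDA runtime API; the fields shown are a small subset of cudaDeviceProp):

#include <cstdio>

int main() {
  int deviceCount = 0;
  cudaGetDeviceCount(&deviceCount);
  for (int dev = 0; dev < deviceCount; ++dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    printf("Device %d: %s, %d multiprocessors, max %d threads/block, %d KB shared mem/block\n",
           dev, prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock,
           (int)(prop.sharedMemPerBlock / 1024));
  }
  return 0;
}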
45. Compiling a CUDA Program (1)
(Figure: compilation flow)
C/C++ CUDA application
-> NVCC -> CPU code (host) and PTX code (Parallel Thread eXecution)
PTX is a virtual machine and ISA (instruction set architecture): it defines the programming model and the execution resources and state
The virtual PTX code is then translated by a PTX-to-target compiler into physical target code for the concrete GPU (e.g. G80)
Example -- the source lines
  float4 me = gx[gtid];
  me.x += me.y * me.z;
compile to PTX such as
  ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
  mad.f32 $f1, $f5, $f3, $f1;
46. Compiling a CUDA Program (2)
NVCC as a Compiler Driver
Any source file containing CUDA language extensions must be compiled
with NVCC
NVCC is a compiler driver
Works by invoking all the necessary tools and compilers like
cudacc, g++, cl, ...
NVCC outputs:
C code (host CPU code)
Must then be compiled with the rest of the application using another tool
PTX
Either object code directly,
Or PTX source, interpreted at runtime
47. Linking
Any executable with CUDA code requires two dynamic libraries:
The CUDA runtime library (cudart)
The CUDA core library (cuda)
48. Debugging Using the Device Emulation Mode
An executable compiled in device emulation mode (enabled via nvcc
-deviceemu) runs completely on the host using the CUDA runtime
No need of any device and CUDA driver
Each device thread is emulated with a host thread
Running in device emulation mode, one can:
Use host native debug support (breakpoints, inspection, etc.)
Access any device-specific data from host code and vice-versa
Call any host function from device code (e.g. printf) and vice-
versa
Detect deadlock situations caused by improper usage of
__syncthreads()
49. Device Emulation Mode Pitfalls
Emulated device threads execute sequentially, so simultaneous
accesses of the same memory location by multiple threads could
produce different results.
Dereferencing device pointers on the host or host pointers on the device
can produce correct results in device emulation mode, but will generate
an error in device execution mode
50. CUDA Driver API vs. C for CUDA (1)
(Figure: the same compilation flow as on slide 24 -- integrated source (foo.cu) -> cudacc (EDG C/C++ frontend, Open64 Global Optimizer) -> GPU assembly (foo.s) and CPU host code (foo.cpp) -> OCG -> G80 SASS (foo.sass); host code compiled with gcc / cl)
Mark Murphy, "NVIDIA's Experience with Open64",
www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
51. CUDA Driver API vs. C for CUDA (2)
Mutually Exclusive: Choose One or the Other
C runtime for CUDA:
The C runtime for CUDA handles kernel loading and setting up kernels before they are launched. The implicit code initialization, CUDA context management, CUDA module management (cubin and function mapping), kernel configuration, and parameter passing are all performed by the C runtime for CUDA.
It comprises two principal parts:
The low-level functions (cuda_runtime_api.h) have a C-style interface that does not require compilation with nvcc.
The high-level functions (cuda_runtime.h) have a C++-style interface built on top of the low-level functions.
Of these, the high-level functions are the most commonly used. They wrap some of the low-level functions, using overloading, references, and default arguments. These wrappers can be used from C++ code and can be compiled with any C++ compiler.

CUDA driver API:
The driver API is a lower-level API than the runtime API. When compared with the runtime API, the driver API has these advantages:
No dependency on the runtime library
More control over devices (for example, only the driver API enables one CPU thread to control multiple GPUs)
No C extensions in the host code, so compilers other than the default CPU compiler can be used
Its primary disadvantages:
Verbose code
Greater difficulty in debugging
No device emulation
A key point is that for every runtime API function, there is an equivalent driver API function. The driver API, however, includes other functions missing in the runtime API, such as those for migrating a context from one host thread to another.
52. CUDA Driver API vs. C for CUDA (3)
Example: Vector Addition Using C for CUDA
const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;

// create CUDA device & context
cudaSetDevice(0); // pick first device

// allocate host vectors
float* pA = new float[cnDimension];
float* pB = new float[cnDimension];
float* pC = new float[cnDimension];
// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);

// allocate device memory
float *pDeviceMemA, *pDeviceMemB, *pDeviceMemC;
cudaMalloc((void **)&pDeviceMemA, cnDimension * sizeof(float));
cudaMalloc((void **)&pDeviceMemB, cnDimension * sizeof(float));
cudaMalloc((void **)&pDeviceMemC, cnDimension * sizeof(float));

// copy host vectors to device
cudaMemcpy(pDeviceMemA, pA, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(pDeviceMemB, pB, cnDimension * sizeof(float), cudaMemcpyHostToDevice);

vectorAdd<<<cnBlocks, cnBlockSize>>>(pDeviceMemA, pDeviceMemB, pDeviceMemC);

// copy result from device to host
cudaMemcpy((void *)pC, pDeviceMemC, cnDimension * sizeof(float), cudaMemcpyDeviceToHost);

delete[] pA;
delete[] pB;
delete[] pC;
cudaFree(pDeviceMemA);
cudaFree(pDeviceMemB);
cudaFree(pDeviceMemC);
53. CUDA Driver API vs. C for CUDA (4)
Example: Vector Addition Using CUDA Driver API
const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;

CUdevice hDevice;
CUcontext hContext;
CUmodule hModule;
CUfunction hFunction;

// create CUDA device & context
cuInit(0);
cuDeviceGet(&hDevice, 0); // pick first device
cuCtxCreate(&hContext, 0, hDevice);
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");

// allocate host vectors
float* pA = new float[cnDimension];
float* pB = new float[cnDimension];
float* pC = new float[cnDimension];
// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);

// allocate memory on the device
CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));

// copy host vectors to device
cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));

// set up parameter values
cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);

#define ALIGN_UP(offset, alignment) \
  (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

int offset = 0;
void* ptr;
ptr = (void*)(size_t)pDeviceMemA;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
ptr = (void*)(size_t)pDeviceMemB;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
ptr = (void*)(size_t)pDeviceMemC;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
cuParamSetSize(hFunction, offset);

// execute kernel
cuLaunchGrid(hFunction, cnBlocks, 1);

// copy the result from device back to host
cuMemcpyDtoH((void *)pC, pDeviceMemC, cnDimension * sizeof(float));

delete[] pA;
delete[] pB;
delete[] pC;
cuMemFree(pDeviceMemA);
cuMemFree(pDeviceMemB);
cuMemFree(pDeviceMemC);
54. Watch Out: Floating Point Computations
Differing Results of FP Computations
Results of floating-point computations will slightly differ because of:
Different compiler outputs, instruction sets
Use of extended precision for intermediate results
There are various options to force strict single precision on
the host
55. Watch Out: Floating Point Computations
Single and Double Precision Operations
Double precision
No deviations from the IEEE 754 standard
Single precision
Denormals and signalling NaNs are not supported;
Only two IEEE rounding modes are supported (chop and round-to-
nearest even); and
The precision of division/square root is slightly lower than single
precision.
56. Limitations (1)
Only a Subset of C Available
C for CUDA offers only a subset of the C language:
Recursion-free
Function-pointer-free
Functions reside in the global device memory, therefore we cannot
obtain their addresses
57. Limitations (2)
Branching in Program Code
For best performance:
Threads should run in groups of 32 threads
32 threads = 1 warp
All threads of a warp should take the same execution path
Otherwise, branching will probably hurt (see the sketch below)
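A sketch of the difference (hypothetical kernels, not from the slides):

// Diverges: threads within one warp take different paths, so the warp
// executes both branches one after the other.
__global__ void divergentBranch(float* v) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (threadIdx.x % 2 == 0) v[i] *= 2.0f;
  else                      v[i] += 1.0f;
}

// Does not diverge: the condition is uniform within each warp of 32 threads,
// so every warp takes exactly one path.
__global__ void uniformBranch(float* v) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if ((threadIdx.x / 32) % 2 == 0) v[i] *= 2.0f;
  else                             v[i] += 1.0f;
}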
58. Coalesced Access to Device Global Memory
High Priority: Ensure global memory accesses are coalesced whenever
possible
Global memory loads and stores by threads of a half warp (16 threads)
are coalesced by the device in as few as one transaction (or two
transactions in the case of 128-bit words)
But: certain access requirements have to be met (see the sketch below)
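A sketch of the difference (hypothetical kernels; the exact coalescing rules depend on the compute capability):

// Coalesced: thread k of a half-warp accesses element base + k, so the 16
// accesses fall into one contiguous, aligned segment.
__global__ void copyCoalesced(float* out, const float* in) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  out[i] = in[i];
}

// Not coalesced: a stride of 16 floats spreads the half-warp's accesses
// over 16 separate segments and therefore 16 separate transactions.
__global__ void copyStrided(float* out, const float* in) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * 16;
  out[i] = in[i];
}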
61. Shared Memory – Bank Conflicts
Shared Memory
16 KB per multiprocessor
Organized in 16 banks, 1 KB each
As fast as a register …
… if no bank conflicts occur!
Bank conflict:
More than one thread of the same half-warp accesses the same bank
The access then needs to be serialized
Cost = max (# of simultaneous accesses to the same bank)
(A small illustration follows below.)
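Illustration (a sketch, not from the slides; launch with 256 threads per block so all indices stay inside the array):

__global__ void bankConflictDemo(float* out) {
  __shared__ float s[512];
  s[threadIdx.x] = (float)threadIdx.x;
  s[threadIdx.x + 256] = 0.0f;
  __syncthreads();
  // Stride 1: thread k of a half-warp reads bank (k mod 16) -- conflict-free.
  float a = s[threadIdx.x];
  // Stride 2: within a half-warp, threads k and k+8 hit the same bank
  // (a 2-way conflict), so the access is serialized.
  float b = s[2 * threadIdx.x];
  out[threadIdx.x] = a + b;
}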
62. Shared Memory – No Bank Conflicts
(Figure: conflict-free shared memory access patterns -- linear addressing with step size = 1 word; random permutation; linear addressing with step size = 3 words; broadcast of a single word)
63. Shared Memory – Bank Conflicts
(Figure: shared memory access patterns -- linear addressing with step size = 2 words; linear addressing with step size = 8 words; no conflict or 5-way conflict)