Cuda materials
CUDA Teaching Center, Department of Computer Applications, NIT, Trichy

  1. Course Outline
     - GPU Architecture Overview (this module)
     - Tools of the Trade
     - Introduction to CUDA C
     - Patterns of Parallel Computing
     - Thread Cooperation and Synchronization
     - The Many Types of Memory
     - Atomic Operations
     - Events and Streams
     - CUDA in Advanced Scenarios
     Note: to follow this course, you require a CUDA-capable GPU.
  2. History of GPU Computation
     - No GPU: the CPU handled graphic output
     - Dedicated GPU: separate graphics card (PCI, AGP)
     - Programmable GPU: shaders
  3. Shaders
     - Small program that runs on the GPU
     - Types
       - Vertex (used to calculate the location of a vertex)
       - Pixel (used to calculate the color components of a single pixel)
     - Shader languages
       - High Level Shader Language (HLSL, Microsoft DirectX)
       - OpenGL Shading Language (GLSL, OpenGL)
       - Both are C-like
       - Both are intended for graphics (i.e., not general-purpose)
     - Pixel shaders used for math
       - Convert data to a texture
       - Run the texture through a pixel shader
       - Get the result texture and convert it back to data
  4. Why GPGPU?
     - General-Purpose Computation on GPUs
     - Highly parallel architecture
       - Lots of concurrent threads of execution (SIMT)
       - Higher throughput compared to CPUs
       - Even taking into account multiple cores, hyperthreading, and SIMD
       - Thus more FLOPS (floating-point operations per second)
     - Commodity hardware
       - Commonly available (mainly used by gamers)
       - Relatively cheap compared to custom solutions (e.g., FPGAs)
     - Sensible programming model
       - Manufacturers realized GPGPU market potential
       - Graphics output made optional
       - NVIDIA offers the dedicated “Tesla” GPU platform
       - No display output connections
  5. GPGPU Frameworks
     - Compute Unified Device Architecture (CUDA)
       - Developed by NVIDIA Corporation
       - Extensions to programming languages (C/C++)
       - Wrappers for other languages/platforms (e.g., FORTRAN, PyCUDA, MATLAB)
     - Open Computing Language (OpenCL)
       - Supported by many manufacturers (incl. NVIDIA)
       - The high-level way to perform computation on ATI devices
     - C++ Accelerated Massive Parallelism (C++ AMP)
       - C++ superset
       - A standard by Microsoft, part of MSVC++
       - Supports both ATI and NVIDIA
     - Other frameworks and cross-compilers
       - Alea, Aparapi, Brook, etc.
  6. Graphics Processor Architecture
     - Warning: NVIDIA terminology ahead
     - Streaming Multiprocessor (SM)
       - Contains several CUDA cores
       - Can have more than one SM per card
     - CUDA Core (a.k.a. Streaming Processor, Shader Unit)
       - Number of cores per SM is tied to compute capability
     - Different types of memory
       - Means of access
       - Performance characteristics
  7. Graphics Processor Architecture
     [Diagram: each SM (SM1, SM2, ...) contains streaming processors (SP1–SP4) with
     registers and shared memory; all SMs access device memory, constant memory, and
     texture memory.]
  8. Compute Capability
     - A number indicating what the card can do
       - Current range: 1.0, 1.x, 2.x, 3.0, 3.5
     - Affects both hardware and API support
       - Number of CUDA cores per SM
       - Max number of 32-bit registers per SM
       - Max number of instructions per kernel
       - Support for double-precision ops
       - ... and many other parameters
     - Higher is better
       - See http://en.wikipedia.org/wiki/CUDA
  9. Choosing a Graphics Card
     - Look at performance specs (peak FLOPS)
       - Pay attention to single vs. double precision
       - E.g. http://www.nvidia.com/object/tesla-servers.html
     - Number of GPUs
     - Compute capability/architecture
       - COTS cards are cheaper than dedicated compute cards
     - Memory size
     - Ensure the PSU/motherboard is good enough to handle the card(s)
     - Can have more than one graphics card
       - YMMV (PCI saturation)
     - Can mix architectures (e.g. NVIDIA + ATI)
  10. Overview
     - Windows, Mac OS, Linux
       - This course uses Windows/Visual Studio
       - This course focuses on desktop GPU development
     - CUDA Toolkit
       - LLVM-based compiler
       - Headers & libraries
       - Documentation
       - Samples
     - NSight
       - Visual Studio: plugin that allows debugging
       - Eclipse IDE
     - http://developer.nvidia.com/cuda
  11. CUDA Tools
  12. What is NSight?
     - Many developers, few GPUs
       - No need for each developer to have a GPU
     - Client-server architecture
     - Server with GPU (target)
       - Runs the NSight service
       - Accepts connections from developers
     - Clients with Visual Studio + NSight plugin (host)
       - Use NSight to connect to the server to run the application and debug GPU code
     - Caveats
       - Need to use VS Remote Debugging to debug CPU code
       - Need your own mechanism for syncing output
     [Diagram: target machine with GPU(s) running the NSight service, connected to
     hosts running Visual Studio with the NSight plugin.]
  13. Overview
     - Compilation Process
     - Obligatory “Hello Cuda” demo
     - Location Qualifiers
     - Execution Model
     - Grid and Block Dimensions
     - Error Handling
     - Device Introspection
  14. NVIDIA CUDA Compiler (nvcc)
     - nvcc is used to compile CUDA-specific code
       - Not a compiler itself, but a compiler driver
       - Uses the host C or C++ compiler (MSVC, GCC)
       - Some aspects are C++ specific
     - Splits code into GPU and non-GPU parts
       - Host code is passed to the native compiler
     - Accepts project-defined GPU-specific settings
       - E.g., compute capability
     - Translates code written in CUDA C into PTX
       - The graphics driver turns PTX into binary code
  15. NVCC Compilation Process
     [Diagram: a source file containing both __global__ void a() and plain void b() is
     split by nvcc; the device code (a) is compiled by nvcc itself, the host code (b)
     is handed to the host compiler (e.g. MSVC’s cl.exe), and the results are linked
     into a single executable.]
  16. Parallel Thread Execution (PTX)
     - PTX is the ‘assembly language’ of CUDA
       - Similar to .NET IL or Java bytecode
       - Low-level GPU instructions
     - Can be generated from a project
     - Typically useful to compiler writers
       - E.g., GPU Ocelot https://code.google.com/p/gpuocelot/
     - Inline PTX (asm)
  17. Location Qualifiers
     - __global__
       - Defines a kernel: runs on the GPU, called from the CPU
       - Executed with <<<dim3>>> arguments
     - __device__
       - Runs on the GPU, called from the GPU
       - Can be used for variables too
     - __host__
       - Runs on the CPU, called from the CPU
     - Qualifiers can be mixed
       - E.g. __host__ __device__ foo()
       - Code is compiled for both CPU and GPU
       - Useful for testing
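
     A minimal sketch of the three qualifiers in one file (the function names, sizes,
     and launch configuration are illustrative, not from the slides):

        #include <cstdio>

        __host__ __device__ float square(float x)   // compiled for both CPU and GPU
        {
            return x * x;
        }

        __device__ float plusOne(float x)           // GPU only, callable from GPU code
        {
            return x + 1.0f;
        }

        __global__ void kernel(float *out)          // kernel: GPU code launched from the CPU
        {
            out[threadIdx.x] = plusOne(square((float)threadIdx.x));
        }

        int main()
        {
            float *d_out;
            cudaMalloc((void**)&d_out, 4 * sizeof(float));
            kernel<<<1, 4>>>(d_out);                // <<<blocks, threads>>> launch syntax
            cudaDeviceSynchronize();
            cudaFree(d_out);
            return 0;
        }
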
  18. Execution Model
     [Diagram: doSomething<<<2,3>>>() launches a grid of 2 thread blocks with 3 threads
     each; doSomethingElse<<<4,2>>>() launches a grid of 4 blocks of 2 threads each.
     Labels: thread, thread block, grid.]
  19. Execution Model
     - Thread blocks are scheduled to run on available SMs
     - Each SM executes one block at a time
     - A thread block is divided into warps
       - Number of threads per warp depends on compute capability
     - All warps are handled in parallel
     - CUDA Warp Watch
  20. Dimensions
     - We defined execution as
       - <<<a,b>>>
       - A grid of a blocks of b threads each
       - The grid and each block are 1D structures
     - In reality, these constructs are 3D
       - A 3-dimensional grid of 3-dimensional blocks
       - You can define (a×b×c) blocks of (x×y×z) threads
       - Can have 2D or 1D by setting the extra dimensions to 1
     - Defined in the dim3 structure
       - Simple container with x, y and z values
       - Some constructors defined for C++
       - Automatic conversion: <<<a,b>>> → (a,1,1) blocks of (b,1,1) threads
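
     A short sketch of a launch with explicit dim3 dimensions (the sizes are chosen
     arbitrarily for illustration):

        #include <cstdio>

        __global__ void kernel2D()
        {
            // each thread can inspect its 2D coordinates
            printf("block (%d,%d) thread (%d,%d)\n",
                   blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
        }

        int main()
        {
            dim3 blocks(4, 2);    // 4 x 2 x 1 grid of blocks
            dim3 threads(8, 8);   // 8 x 8 x 1 threads per block
            kernel2D<<<blocks, threads>>>();
            // kernel<<<a, b>>>() is shorthand for (a,1,1) blocks of (b,1,1) threads
            cudaDeviceSynchronize();
            return 0;
        }
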
  21. Thread Variables
     - Execution parameters & current position
     - blockIdx
       - Where we are in the grid
     - gridDim
       - The size of the grid
     - threadIdx
       - Position of the current thread in the thread block
     - blockDim
       - Size of the thread block
     - Limitations
       - Grid & block sizes
       - Number of threads
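
     A tiny sketch (not from the slides) showing the common 1D pattern that combines
     these variables into a global thread index:

        __global__ void whereAmI(int *globalIndex)
        {
            // flatten (block, thread) position into one global index, 1D case
            int idx = blockIdx.x * blockDim.x + threadIdx.x;
            // gridDim.x  = number of blocks in the grid
            // blockDim.x = number of threads per block
            globalIndex[idx] = idx;
        }
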
  22. Error Handling
     - CUDA does not throw
       - Silent failure
     - Core functions return cudaError_t
       - Can check against cudaSuccess
       - Get a description with cudaGetErrorString()
     - Libraries may have different error types
       - E.g. cuRAND has curandStatus_t
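
     One common way to check return codes is a small wrapper macro; the CUDA_CHECK name
     below is my own, hypothetical helper, not part of CUDA:

        #include <cstdio>
        #include <cstdlib>

        // Hypothetical helper: abort with a message if a CUDA call fails.
        #define CUDA_CHECK(call)                                              \
            do {                                                              \
                cudaError_t err = (call);                                     \
                if (err != cudaSuccess) {                                     \
                    fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                            __FILE__, __LINE__, cudaGetErrorString(err));     \
                    exit(EXIT_FAILURE);                                       \
                }                                                             \
            } while (0)

        int main()
        {
            int *d_data = nullptr;
            CUDA_CHECK(cudaMalloc((void**)&d_data, 1024 * sizeof(int)));
            CUDA_CHECK(cudaFree(d_data));
            return 0;
        }
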
  23. Rules of the Game
     - Different types of memory
       - Shared vs. private
       - Access speeds
     - Data is in arrays
       - No parallel data structures
       - No other data structures
       - No auto-parallelization/vectorization compiler support
       - No CPU-style SIMD equivalent
     - Compiler constraint
       - No C++11 support
  24. Overview
     - Data Access
     - Map
     - Gather
     - Reduce
     - Scan
  25. Data Access
     - A real problem!
     - Thread space can be up to 6D
       - 3D grid of 3D thread blocks
     - Input space is typically 1D
       - 2D arrays are possible
     - Need to map threads to inputs
     - Some examples
       - 1 block, N threads → threadIdx.x
       - 1 block, M×N threads → threadIdx.y * blockDim.x + threadIdx.x
       - N blocks, M threads → blockIdx.x * blockDim.x + threadIdx.x
       - ... and so on
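
     A sketch of the last mapping (N blocks of M threads over a 1D array), with a bounds
     check for when the array length is not a multiple of the block size; the kernel and
     names are illustrative:

        __global__ void scale(float *data, int n, float factor)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;  // global 1D index
            if (i < n)                  // guard: the last block may be only partly used
                data[i] *= factor;
        }

        // launch (host side): enough blocks of 256 threads to cover n elements
        // int threads = 256;
        // int blocks  = (n + threads - 1) / threads;
        // scale<<<blocks, threads>>>(d_data, n, 2.0f);
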
  26. Map
     [Diagram illustrating the map pattern.]
  27. Gather
     [Diagram illustrating the gather pattern.]
  28. Black-Scholes Formula
     [Slide presents the Black-Scholes formula as a map-style example.]
  29. Reduce
     [Diagram: elements combined pairwise with + in a tree until a single result remains.]
  30. Reduce in Practice
     - Adding up N data elements
     - Use 1 block of N/2 threads
     - Each thread does x[i] += x[j];
     - At each step
       - Number of threads is halved
       - Distance (j - i) is doubled
     - x[0] is the result
     Example: 1 2 3 4 5 6 7 8  →  3 7 11 15  →  10 26  →  36
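
     A sketch of this single-block reduction (assumes n is a power of two, the data fits
     in one block, and the kernel is launched as reduceSum<<<1, n/2>>>; names are mine):

        // Sum n elements in place; the result ends up in x[0].
        __global__ void reduceSum(float *x, int n)
        {
            for (int stride = 1; stride < n; stride *= 2) {
                int i = 2 * stride * threadIdx.x;   // active threads work 2*stride apart
                if (i + stride < n)
                    x[i] += x[i + stride];          // add the partner 'stride' away
                __syncthreads();                    // finish this step before the next
            }
        }
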
  31. Scan
     Example: input 4 2 5 3 6  →  inclusive scan (running sums) 4 6 11 14 20
  32. Scan in Practice
     - Similar to reduce
     - Requires N-1 threads
     - Step size keeps doubling
     - Number of active threads is reduced by the step size
     - Each thread n does x[n+step] += x[n];
     Example: 1 2 3 4  →  (step 1) 1 3 5 7  →  (step 2) 1 3 6 10
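
     A sketch of this inclusive scan for a single block of N threads (names are mine).
     Each thread reads its partner value first and only writes after a barrier, which
     avoids the read/write race that a literal in-place x[n+step] += x[n] would have:

        // In-place inclusive scan (prefix sum), one block of n threads.
        __global__ void scanInclusive(int *x, int n)
        {
            int tid = threadIdx.x;
            for (int step = 1; step < n; step *= 2) {
                int val = 0;
                if (tid >= step)
                    val = x[tid - step];   // read the partner value
                __syncthreads();           // everyone has read before anyone writes
                if (tid >= step)
                    x[tid] += val;         // equivalent to x[n+step] += x[n]
                __syncthreads();           // all writes visible before the next step
            }
        }
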
  33. Graphics Processor Architecture
     [Diagram: as before, each SM contains streaming processors (SP1–SP4), registers,
     and shared memory; all SMs access device memory, the constant cache, and the
     texture cache.]
  34. Device Memory
     - Grid scope (i.e., available to all threads in all blocks in the grid)
     - Application lifetime (exists until the app exits or it is explicitly deallocated)
     - Dynamic
       - cudaMalloc() to allocate
       - Pass the pointer to the kernel
       - cudaMemcpy() to copy to/from host memory
       - cudaFree() to deallocate
     - Static
       - Declare a global variable as device, e.g. __device__ int sum = 0;
       - Use freely within the kernel
       - Use cudaMemcpy[To/From]Symbol() to copy to/from host memory
       - No need to explicitly deallocate
     - Slowest and least efficient memory
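
     A sketch of the dynamic path (allocate, copy in, run, copy out, free); the array
     size and kernel are illustrative:

        #include <cstdio>

        __global__ void addOne(int *d, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) d[i] += 1;
        }

        int main()
        {
            const int n = 256;
            int h[n];                                   // host array
            for (int i = 0; i < n; ++i) h[i] = i;

            int *d = nullptr;
            cudaMalloc((void**)&d, n * sizeof(int));                     // allocate device memory
            cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);   // host -> device
            addOne<<<(n + 127) / 128, 128>>>(d, n);                      // pass the device pointer
            cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);   // device -> host
            cudaFree(d);                                                 // deallocate

            printf("h[10] = %d\n", h[10]);              // expect 11
            return 0;
        }
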
  35. Constant & Texture Memory
     - Read-only: useful for lookup tables, model parameters, etc.
     - Grid scope, application lifetime
     - Resides in device memory, but is cached in a dedicated constant cache
     - Constrained by MAX_CONSTANT_MEMORY
       - Expect 64 KB
     - Similar operation to statically-defined device memory
       - Declare as __constant__
       - Use freely within the kernel
       - Use cudaMemcpy[To/From]Symbol() to copy to/from host memory
     - Very fast, provided all threads read from the same location
     - Used for kernel arguments
     - Texture memory: similar to constant memory, optimized for 2D access patterns
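
     A sketch of statically declared constant memory used as a small coefficient table
     (the polynomial and values are illustrative):

        __constant__ float coeffs[4];      // lives in constant memory, grid scope

        __global__ void poly(float *x, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) {
                float v = x[i];
                // all threads read the same coeffs[k] -> served from the constant cache
                x[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
            }
        }

        // host side: copy to the symbol before launching
        // float h_coeffs[4] = { 1.0f, 0.5f, 0.25f, 0.125f };
        // cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
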
  36. Shared Memory
     - Block scope
       - Shared only within a thread block
       - Not shared between blocks
     - Kernel lifetime
     - Must be declared within the kernel function body
     - Very fast
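
     A sketch of a block-scoped shared buffer: each block stages its slice of the input
     into shared memory and then reverses it, which also shows why a barrier is needed.
     The kernel is my own example (assumes a block size of 256 and an input length that
     is a multiple of it):

        __global__ void reverseBlock(int *data)
        {
            __shared__ int tile[256];               // declared inside the kernel, block scope

            int i = blockIdx.x * blockDim.x + threadIdx.x;
            tile[threadIdx.x] = data[i];            // stage into fast shared memory
            __syncthreads();                        // whole block must finish writing tile

            // write the block's elements back in reverse order
            data[i] = tile[blockDim.x - 1 - threadIdx.x];
        }
        // launch with blockDim.x == 256 so the tile matches the block size
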
  37. Register & Local Memory
     - Memory can be allocated right within the kernel
       - Thread scope, kernel lifetime
     - Non-array variables
       - int tid = …
       - Stored in a register
       - Very fast
     - Array variables
       - Stored in ‘local memory’
       - Local memory is an abstraction; it actually resides in global memory
       - Thus as slow as global memory
  38. Summary

     Declaration             Memory     Scope    Lifetime      Slowdown
     int foo;                Register   Thread   Kernel        1x
     int foo[10];            Local      Thread   Kernel        100x
     __shared__ int foo;     Shared     Block    Kernel        1x
     __device__ int foo;     Global     Grid     Application   100x
     __constant__ int foo;   Constant   Grid     Application   1x
  39. Thread Synchronization
     - Threads can take different amounts of time to complete a part of a computation
     - Sometimes you want all threads to reach a particular point before continuing their work
     - CUDA provides a thread barrier function, __syncthreads()
     - A thread that calls __syncthreads() waits for the other threads to reach this line
     - Then all threads can continue executing
     [Diagram: threads A, B, C arrive at the barrier at different times, then continue together.]
  40. Restrictions
     - __syncthreads() only synchronizes threads within a block
     - A thread that calls __syncthreads() waits for the other threads to reach this location
     - All threads must reach the __syncthreads() call
       - if (x > 0.5) { __syncthreads(); }                              // bad idea
     - Each call to __syncthreads() is unique
       - if (x > 0.5) { __syncthreads(); } else { __syncthreads(); }    // also a bad idea
  41. Branch Divergence
     - Once a block is assigned to an SM, it is split into several warps
     - All threads within a warp must execute the same instruction at the same time
     - I.e., the if and else branches cannot be executed concurrently
     - Avoid branching if possible
     - Ensure if statements cut on warp boundaries
     [Diagram: a branch within a warp causes a performance loss.]
  42. Summary
     - Threads are synchronized with __syncthreads()
       - Block scope
       - All threads must reach the barrier
       - Each __syncthreads() creates a unique barrier
     - A block is split into warps
       - All threads in a warp execute the same instruction
       - Branching leads to warp divergence (performance loss)
       - Avoid branching, or ensure the branch cases fall in different warps
  43. Overview
     - Why atomics?
     - Atomic Functions
     - Atomic Sum
     - Monte Carlo Pi
  44. Why Atomics?
     - x++ is a read-modify-write operation
       - Read x into a register
       - Increment the register value
       - Write the register back into x
       - Effectively { temp = x; temp = temp + 1; x = temp; }
     - If two threads do x++
       - Each thread has its own temp (say t1 and t2)
       - { t1 = x; t1 = t1 + 1; x = t1; }  and  { t2 = x; t2 = t2 + 1; x = t2; }
       - Race condition: the thread that writes to x first wins
       - Whichever wins, x gets incremented only once (the other update is lost)
  45. Atomic Functions
     - Problem: many threads accessing the same memory location
     - Atomic operations ensure that only one thread can access the location at a time
     - Grid scope!
     - atomicOP(x,y) performs, as one indivisible step:
       - t1 = *x;       // read
       - t2 = t1 OP y;  // modify
       - *x = t2;       // write
     - Atomics need to be configured
       - #include "sm_20_atomic_functions.h"
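
     A sketch of an atomic sum over an array (names are mine): each thread adds its
     element into a single counter, so concurrent updates cannot be lost:

        __global__ void sumAll(const int *data, int n, int *total)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                atomicAdd(total, data[i]);   // read-modify-write done as one atomic step
        }

        // host side (illustrative): zero the counter before launching
        // int zero = 0;
        // cudaMemcpy(d_total, &zero, sizeof(int), cudaMemcpyHostToDevice);
        // sumAll<<<blocks, threads>>>(d_data, n, d_total);
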
  46. Monte Carlo Pi
     [Diagram: a circle of radius r inscribed in a square; the fraction of uniformly
     random points that fall inside the circle estimates π/4.]
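
     The slide itself only shows the figure. As a hedged sketch of the idea (not the
     course's implementation): generate random points on the host, let each thread test
     one point against the unit quarter circle, and count hits with an atomic. cuRAND
     would be the usual choice for on-device random numbers; plain rand() is used here
     only to keep the sketch short:

        #include <cstdio>
        #include <cstdlib>

        __global__ void countHits(const float *xs, const float *ys, int n, int *hits)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n && xs[i] * xs[i] + ys[i] * ys[i] <= 1.0f)
                atomicAdd(hits, 1);          // point fell inside the quarter circle
        }

        int main()
        {
            const int n = 1 << 20;
            float *hx = (float*)malloc(n * sizeof(float));
            float *hy = (float*)malloc(n * sizeof(float));
            for (int i = 0; i < n; ++i) {    // random points in the unit square (host side)
                hx[i] = rand() / (float)RAND_MAX;
                hy[i] = rand() / (float)RAND_MAX;
            }

            float *dx, *dy; int *dHits, hHits = 0;
            cudaMalloc((void**)&dx, n * sizeof(float));
            cudaMalloc((void**)&dy, n * sizeof(float));
            cudaMalloc((void**)&dHits, sizeof(int));
            cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(dHits, &hHits, sizeof(int), cudaMemcpyHostToDevice);

            countHits<<<(n + 255) / 256, 256>>>(dx, dy, n, dHits);
            cudaMemcpy(&hHits, dHits, sizeof(int), cudaMemcpyDeviceToHost);

            printf("pi ~= %f\n", 4.0 * hHits / n);   // inside/total ratio times 4
            cudaFree(dx); cudaFree(dy); cudaFree(dHits);
            free(hx); free(hy);
            return 0;
        }
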
  47. Summary
     - Atomic operations ensure operations on a variable cannot be interrupted by a different thread
     - CUDA supports several atomic operations
       - atomicAdd()
       - atomicOr()
       - atomicMin()
       - ... and others
     - Atomics incur a heavy performance penalty
  48. Overview
     - Events
     - Event API
     - Event example
     - Pinned memory
     - Streams
     - Stream API
     - Example (single stream)
     - Example (multiple streams)
  49. Events
     - How to measure performance?
     - Use OS timers
       - Too much noise
     - Use the profiler
       - Times only kernel duration + other invocations
     - CUDA Events
       - Event = timestamp
       - Timestamp recorded on the GPU
       - Invoked from the CPU side
  50. Event API
     - cudaEvent_t
       - The event handle
     - cudaEventCreate(&e)
       - Creates the event
     - cudaEventRecord(e, 0)
       - Records the event, i.e. the timestamp
       - The second parameter is the stream to which to record
     - cudaEventSynchronize(e)
       - CPU and GPU are asynchronous and can be doing things in parallel
       - cudaEventSynchronize() blocks all instruction processing until the GPU has reached the event
     - cudaEventElapsedTime(&f, start, stop)
       - Computes the elapsed time (in ms) between start and stop, stored as a float
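
     A sketch of timing a kernel with events (the kernel and sizes are illustrative):

        #include <cstdio>

        __global__ void busyKernel(float *x) { x[threadIdx.x] *= 2.0f; }   // illustrative work

        int main()
        {
            float *d;
            cudaMalloc((void**)&d, 256 * sizeof(float));

            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);

            cudaEventRecord(start, 0);            // timestamp on the GPU, default stream
            busyKernel<<<1, 256>>>(d);            // the work being timed
            cudaEventRecord(stop, 0);             // timestamp after the work

            cudaEventSynchronize(stop);           // block the CPU until the GPU reaches 'stop'
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("kernel took %.3f ms\n", ms);

            cudaEventDestroy(start);
            cudaEventDestroy(stop);
            cudaFree(d);
            return 0;
        }
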
  51. Pinned Memory
     - CPU memory is pageable
       - Can be swapped to disk
     - Pinned (page-locked) memory stays in place
     - Performance advantage when copying to/from the GPU
     - Use cudaHostAlloc() instead of malloc() or new
     - Use cudaFreeHost() to deallocate
     - Cannot be swapped out
       - Must have enough physical memory
       - Proactively deallocate
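
     A sketch of allocating and releasing a pinned host buffer (the size is illustrative):

        int main()
        {
            const size_t bytes = 64 * 1024 * 1024;   // 64 MB, illustrative

            float *h_pinned = nullptr;
            cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);  // page-locked host buffer

            float *d_buf = nullptr;
            cudaMalloc((void**)&d_buf, bytes);

            // copies involving pinned memory are faster and can be made asynchronous
            cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);

            cudaFree(d_buf);
            cudaFreeHost(h_pinned);                  // pair with cudaFreeHost, not free()
            return 0;
        }
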
  52. Streams
     - Remember cudaEventRecord(event, stream)?
     - A CUDA stream is a queue of GPU operations
       - Kernel launches
       - Memory copies
     - Streams allow a form of task-based parallelism
       - Performance improvement
     - To leverage streams you need device overlap support
       - GPU_OVERLAP
  53. Stream API
     - cudaStream_t
     - cudaStreamCreate(&stream)
     - kernel<<<blocks, threads, shared, stream>>>
     - cudaMemcpyAsync()
       - Must use pinned memory!
       - Takes a stream parameter
     - cudaStreamSynchronize(stream)
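
     A sketch of queuing copy-kernel-copy work on a single stream (pinned host memory is
     used as required above; the kernel and sizes are illustrative):

        __global__ void doubleAll(float *x, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) x[i] *= 2.0f;
        }

        int main()
        {
            const int n = 1 << 20;
            float *h, *d;
            cudaHostAlloc((void**)&h, n * sizeof(float), cudaHostAllocDefault);  // pinned: required for async copies
            cudaMalloc((void**)&d, n * sizeof(float));

            cudaStream_t stream;
            cudaStreamCreate(&stream);

            // all three operations are queued on the stream and run in order
            cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
            doubleAll<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
            cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

            cudaStreamSynchronize(stream);    // wait for everything queued on this stream
            cudaStreamDestroy(stream);
            cudaFree(d);
            cudaFreeHost(h);
            return 0;
        }
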
  54. Summary
     - CUDA events let you time your code on the GPU
     - Pinned memory speeds up data transfers to/from the device
     - CUDA streams allow you to queue up operations asynchronously
       - Lets you do different things in parallel on the GPU
       - Use of pinned memory is required
