SlideShare a Scribd company logo
1 of 54
Introduction to CUDA Programming
CUDA Programming Introduction
Andreas Moshovos
Winter 2009
Some slides/material from:
UIUC course by Wen-Mei Hwu and David Kirk
UCSB course by Andrea Di Blas
Universitat Jena by Waqar Saleem
NVIDIA by Simon Green
Programmer’s view
• GPU as a co-processor
CPU
Memory
GPU
GPU Memory
1GB on our systems
3GB/s – 8GB.s
6.4GB/sec – 31.92GB/sec
8B per transfer
141GB/sec
Target Applications
int a[N]; // N is large
for all elements of a compute
a[i] = a[i] * fade
• Lots of independent computations
– CUDA thread need not be independent
Programmer’s View of the GPU
• GPU: a compute device that:
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
• Data-parallel portions of an application are
executed on the device as kernels which run in
parallel on many threads
Why are threads useful?
• Concurrency:
– Do multiple things in parallel
– Put more hardware  Get higher performance
Needs more functional units
Why are threads useful #2 – Tolerating stalls
• Often a thread stalls, e.g., memory access
Multiplex the same functional unit
Get more performance at a fraction of the cost
GPU vs. CPU Threads
• GPU threads are extremely lightweight
• Very little creation overhead
• In the order of microseconds
• GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few
Execution Timeline
time
1. Copy to GPU mem
2. Launch GPU Kernel
GPU / Device
2’. Synchronize with GPU
3. Copy from GPU mem
CPU / Host
GBT: Grids of Blocks of Threads
Why? Realities of integrated circuits: need to cluster computation and storage
to achieve high speeds
Grids of Blocks of Threads: Dimension Limits
• Grid of Blocks 1D or 2D
– Max x: 65535
– Max y: 65535
• Block of Threads: 1D, 2D, or 3D
– Max number of threads: 512
– Max x: 512
– Max y: 512
– Max z: 64
• Limits apply to Compute Capability 1.0, 1.1, 1.2, and 1.3
– GTX280 = 1.3
Block and Thread IDs
• Threads and blocks have IDs
– So each thread can decide
what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
• Simplifies memory
addressing when processing
multidimensional data
Device
Grid 1
Block
(0, 0)
Block
(1, 0)
Block
(2, 0)
Block
(0, 1)
Block
(1, 1)
Block
(2, 1)
Block (1, 1)
Thread
(0, 1)
Thread
(1, 1)
Thread
(2, 1)
Thread
(3, 1)
Thread
(4, 1)
Thread
(0, 2)
Thread
(1, 2)
Thread
(2, 2)
Thread
(3, 2)
Thread
(4, 2)
Thread
(0, 0)
Thread
(1, 0)
Thread
(2, 0)
Thread
(3, 0)
Thread
(4, 0)
• IDs and dimensions are easily accessible through
predefined “variables”, e.g., blockDim.x and threadIdx.x
Thread Batching
• A kernel is executed as a
grid of thread blocks
– All threads share data
memory space
• A thread block: threads
that can cooperate with
each other by:
– Synchronizing their
execution
• For hazard-free shared
memory accesses
– Efficiently sharing data
through a low latency
shared memory
• Two threads from two
different blocks cannot
cooperate
Host
Kernel
1
Kernel
2
Device
Grid 1
Block
(0, 0)
Block
(1, 0)
Block
(2, 0)
Block
(0, 1)
Block
(1, 1)
Block
(2, 1)
Grid 2
Block (1, 1)
Thread
(0, 1)
Thread
(1, 1)
Thread
(2, 1)
Thread
(3, 1)
Thread
(4, 1)
Thread
(0, 2)
Thread
(1, 2)
Thread
(2, 2)
Thread
(3, 2)
Thread
(4, 2)
Thread
(0, 0)
Thread
(1, 0)
Thread
(2, 0)
Thread
(3, 0)
Thread
(4, 0)
Thread Coordination Overview
• Race-free access to data
Programmer’s view: Memory Model
Programmer’s View: Memory Detail – Thread and Host
• Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory
• The host can R/W:
• global, constant, and texture memories
Memory Model: Global, Constant, and Texture Memories
• Global memory
– Main means of communicating R/W Data between
host and device
– Contents visible to all threads
• Texture and Constant Memories
– Constants initialized by host
– Contents visible to all threads
Memory Model Summary
Memory Location Cached Access Scope
Local off-chip No R/W thread
Shared on-chip N/A R/W all threads in a block
Global off-chip No R/W all threads + host
Constant off-chip Yes RO all threads + host
Texture off-chip Yes RO all threads + host
Execution Model: Ordering
• Execution order is undefined
• Do not assume and use:
• block 0 executes before block 1
• Thread 10 executes before thread 20
• And any other ordering even if you can observe it
Execution Model Summary
• Grid of blocks of threads
– 1D/2D grid of blocks
– 1D/2D/3D blocks of threads
• All blocks are identical:
– same structure and # of threads
• Block execution order is undefined
• Same block threads:
– can synchronize and share data fast (shared memory)
• Threads from different blocks:
– Cannot cooperate
– Communication through global memory
• Threads and Blocks have IDs
– Simplifies data indexing
– Can be 1D, 2D, or 3D (threads)
• Blocks do not migrate: execute on the same processor
• Several blocks may run over the same processor
CUDA Software Architecture
cuda…()
cu…()
e.g., fft()
Reasoning about CUDA call ordering
• GPU communication via cuda…() calls and
kernel invocations
– cudaMalloc, cudaMemCpy,
• Asynchronous from the CPU’s perspective
– CPU places a request in a “CUDA” queue
– requests are handled in-order
• Streams allow for multiple queues
– More on this much later one
CUDA API: Example
int a[N];
for (i =0; i < N; i++)
a[i] = a[i] + x;
1. Allocate CPU Data Structure
2. Initialize Data on CPU
3. Allocate GPU Data Structure
4. Copy Data from CPU to GPU
5. Define Execution Configuration
6. Run Kernel
7. CPU synchronizes with GPU
8. Copy Data from GPU to CPU
9. De-allocate GPU and CPU memory
1. Allocate CPU Data
float *ha;
main (int argc, char *argv[])
{
int N = atoi (argv[1]);
ha = (float *) malloc (sizeof (float) * N);
...
}
• Pinned memory allocation results in faster CPU to/from GPU
copies
• More on this later
• cudaMallocHost (…)
2. Initialize CPU Data
float *ha;
int i;
for (i = 0; i < N; i++)
ha[i] = i;
3. Allocate GPU Data
float *da;
cudaMalloc ((void **) &da, sizeof (float) * N);
• Notice: no assignment side
– NOT: da = cudaMalloc (…)
• Assignment is done internally:
– That why we pass &da
• Allocated space in Global Memory
GPU Memory Allocation
• The host manages GPU memory allocation:
– cudaMalloc (void **ptr, size_t nbytes)
– Must explicitly cast to (void **)
• cudaMalloc ((void **) &da, sizeof (float) * N);
– cudaFree (void *ptr);
• cudaFree (da);
– cudaMemset (void *ptr, int value, size_t
nbytes);
– cudaMemset (da, 0, N * sizeof (int));
• Check the CUDA Reference Manual
4. Copy Initialized CPU data to GPU
float *da;
float *ha;
cudaMemCpy ((void *) da, // DESTINATION
(void *) ha, // SOURCE
sizeof (float) * N, // #bytes
cudaMemcpyHostToDevice); // DIRECTION
Host/Device Data Transfers
• The host initiates all transfers:
• cudaMemcpy( void *dst, void *src,
size_t nbytes,
enum cudaMemcpyKind direction)
• Asynchronous from the CPU’s perspective
– CPU thread continues
• In-order processing with other CUDA requests
• enum cudaMemcpyKind
– cudaMemcpyHostToDevice
– cudaMemcpyDeviceToHost
– cudaMemcpyDeviceToDevice
5. Define Execution Configuration
• How many blocks and threads/block
int threads_block = 64;
int blocks = N / threads_block;
if (blocks % N != 0) blocks += 1;
• Alternatively:
blocks = (N + threads_block – 1) /
threads_block;
6. Launch Kernel & 7. CPU/GPU Synchronization
• Instructs the GPU to launch blocks x thread_blocks
threads:
darradd <<<blocks, threads_block>> (da, 10f, N);
cudaThreadSynchronize ();
• darradd: kernel name
• <<<…>>> execution configuration
– More on this soon
• (da, x, N): arguments
– 256 – 8 byte limit / No variable arguments
CPU/GPU Synchronization
• CPU does not block on cuda…() calls
– Kernel/requests are queued and processed in-order
– Control returns to CPU immediately
• Good if there is other work to be done
– e.g., preparing for the next kernel invocation
• Eventually, CPU must know when GPU is done
• Then it can safely copy the GPU results
• cudaThreadSynchronize ()
– Block CPU until all preceding cuda…() and kernel requests
have completed
8. Copy data from GPU to CPU & 9. DeAllocate Memory
float *da;
float *ha;
cudaMemCpy ((void *) ha, // DESTINATION
(void *) da, // SOURCE
sizeof (float) * N, // #bytes
cudaMemcpyDeviceToHost); // DIRECTION
cudaFree (da);
// display or process results here
free (ha);
The GPU Kernel
__global__ darradd (float *da, float x, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) da[i] = da[i] + x;
}
• BlockIdx: Unique Block ID.
– Numerically asceding: 0, 1, …
• ThreadIdx: Unique per Block Index
– 0, 1, …
– Per Block
• BlockDim: Dimensions of Block
– BlockDim.x, BlockDim.y, BlockDim.z
– Unused dimensions default to 0
Array Index Calculation Example
int i = blockIdx.x * blockDim.x + threadIdx.x;
a[0] a[63] a[64] a[127]a[128] a[255]a[256]
blockIdx.x 0 blockIdx.x 1 blockIdx.x 2
i = 0 i = 63 i = 64 i = 127 i = 128 i = 255 i = 256
Assuming blockDim.x = 64
Generic Unique Thread and Block Index Calculations #1
• 1D Grid / 1D Blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x +
threadIdx.x;
• 1D Grid / 2D Blocks:
UniqueBlockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y
+ threadIdx.y * blockDim.x + threadIdx.x;
• 1D Grid / 3D Blocks:
UniqueBockIndex = blockIdx.x;
UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y
* blockDim.z + threadIdx.z * blockDim.y * blockDim.x +
threadIdx.y * blockDim.x + threadIdx.x;
• Source: http://forums.nvidia.com/lofiversion/index.php?t82040.html
Generic Unique Thread and Block Index Calculations #2
• 2D Grid / 1D Blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.x +
threadIdx.x;
• 2D Grid / 2D Blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex =UniqueBlockIndex * blockDim.y *
blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
• 2D Grid / 3D Blocks:
UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
UniqueThreadIndex = UniqueBlockIndex * blockDim.z *
blockDim.y * blockDim.x + threadIdx.z * blockDim.y *
blockDim.z + threadIdx.y * blockDim.x + threadIdx.x;
• UniqueThreadIndex means unique per grid.
CUDA Function Declarations
• __global__ defines a kernel function
– Must return void
– Can only call __device__ functions
• __device__ and __host__ can be used together
Executed
on the:
Only callable
from the:
__device__ float DeviceFunc() device device
__global__ void KernelFunc() device host
__host__ float HostFunc() host host
__device__ Example
• Add x to a[i] multiple times
__device__ float addmany (float a, float b, int count)
{
while (count--) a += b;
return a;
}
__global__ darradd (float *da, float x, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) da[i] = addmany (da[i], x, 10);
}
Kernel and Device Function Restrictions
• __device__ functions cannot have their address taken
– e.g., f = &addmany; *f(…);
• For functions executed on the device:
– No recursion
• darradd (…)
{
darradd (…)
}
– No static variable declarations inside the function
• darradd (…)
{
static int canthavethis;
}
– No variable number of arguments
• e.g., something like printf (…)
Predefined Vector Datatypes
• Can be used both in host and in device code.
– [u]char[1..4], [u]short[1..4],
[u]int[1..4], [u]long[1..4],
float[1..4]
• Structures accessed with .x, .y, .z, .w
fields
• default constructors, “make_TYPE (…)”:
– float4 f4 = make_float4 (1f, 10f,
1.2f, 0.5f);
• dim3
– type built on uint3
– Used to specify dimensions
– Default value is (1, 1, 1)
Execution Configuration
• Must specify when calling a __global__
function:
<<< Dg, Db [, Ns [, S]] >>>
• where:
– dim3 Dg: grid dimensions in blocks
– dim3 Db: block dimensions in threads
– size_t Ns: per block additional number of shared
memory bytes to allocate
• optional, defaults to 0
• more on this much later on
– cudaStream_t S: request stream(queue)
• optional, default to 0.
• Compute capability >= 1.1
Built-in Variables
• dim3 gridDim
– Number of blocks per grid, in 2D (.z always 1)
• uint3 blockIdx
– Block ID, in 2D (blockIdx.z = 1 always)
• dim3 blockDim
– Number of threads per block, in 3D
• uint3 threadIdx
– Thread ID in block, in 3D
Execution Configuration Examples
• 1D grid / 1D blocks
dim3 gd(1024)
dim3 bd(64)
akernel<<<gd, bd>>>(...)
gridDim.x = 1024, gridDim.y = 1,
blockDim.x = 64, blockDim.y = 1,
blockDim.z = 1
• 2D grid / 3D blocks
dim3 gd(4, 128)
dim3 bd(64, 16, 4)
akernel<<<gd, bd>>>(...)
gridDim.x = 4, gridDim.y = 128,
blockDim.x = 64, blockDim.y = 16,
blockDim.z = 4
Error Handling
• Most cuda…() functions return a cudaError_t
– If cudaSuccess: Request completed without a problem
• cudaGetLastError():
– returns the last error to the CPU
– Use with cudaThreadSynchronize():
cudaError_t code;
cudaThreadSynchronize ();
code = cudaGetLastError ();
• char *cudaGetErrorString(cudaError_t code);
– returns a human-readable description of the error code
Error Handling Utility Function
void
cudaDie (const char *msg)
{
cudaError_t err;
cudaThreadSynchronize ();
err = cudaGetLastError();
if (err == cudaSuccess) return;
fprintf (stderr, "CUDA error: %s: %s.n",
msg,
cudaGetErrorString (err));
exit(EXIT_FAILURE);
}
• adapted from: http://www.ddj.com/hpc-high-performance-computing/207603131
Error Handling Macros
• CUDA_SAFE_CALL ( some cuda call )
CUDA_SAFE_CALL (cudaMemcpy (a_h, a_d, arr_size,
cudaMemcpyDeviceToHost) );
• Must define #define _DEBUG
– No code emitted when undefined: Performance
• Use make dbg=1 under NVIDIA_CUDA_SDK
Measuring Time -- gettimeofday
• Unix-based:
#include <sys/time.h>
#include <time.h>
struct timeval start, end;
gettimeofday (&start, NULL);
WHAT WE ARE INTERESTED IN
gettimeofday (&end, NULL);
timeCpu = (float)(end.tv_sec - start.tv_sec);
if (end.tv_usec < start.tv_usec)
{
timeCpu -= 1.0;
timeCpu += (double)(1000000.0 + end.tv_usec -
start.tv_usec)/1000000.0;
} else
timeCpu += (double)(end.tv_usec -
start.tv_usec)/1000000.0;
Using CUDA clock ()
• clock_t clock ();
• Can be used in device code
• returns per multiprocessor counter which is incremented every clock cycle
• Sample at the beginning and end
• upper bound since threads are time-sliced
• uint start = clock();
... compute (less than 3 sec) ....
uint end = clock();
if (end > start)
time = end - start;
else
time = end + (0xffffffff - start)
• Look at the clock example under projects in SDK
• Using takes some effort
– Every thread measures start and end
– Then must find min start and max end
– Accurate
Using cutTimer…() library calls
#include <cuda.h>
#include <cutil.h>
unsigned int htimer;
cutCreateTimer (&htimer);
CudaThreadSynchronize ();
cutStartTimer(htimer);
WHAT WE ARE INTERESTED IN
cudaThreadSynchronize ();
cutStopTimer(htimer);
printf (“time: %fn", cutGetTimerValue(htimer));
Code Overview: Host side
#include <cuda.h>
#include <cutil.h>
unsigned int htimer;
float *ha, *da;
main (int argc, char *argv[]) {
int N = atoi (argv[1]);
ha = (float *) malloc (sizeof (float) * N);
for (int i = 0; i < N; i++) ha[i] = i;
cutCreateTimer (&htimer);
cudaMalloc ((void **) &da, sizeof (float) * N);
cudaMemCpy ((void *) da, (void *) ha, sizeof (float) * N,
cudaMemcpyHostToDevice);
cudaThreadSynchronize ();
cutStartTimer(htimer);
blocks = (N + threads_block – 1) / threads_block;
darradd <<<blocks, threads_block>> (da, 10f, N)
cudaThreadSynchronize ();
cutStopTimer(htimer);
cudaMemCpy ((void *) ha, (void *) da, sizeof (float) * N,
cudaMemcpyDeviceToHost);
cudaFree (da);
free (ha);
printf (“processing time: %fn", cutGetTimerValue(htimer));
}
Code Overview: Device Side
__device__ float addmany (float a, float b, int count)
{
while (count--) a += b;
return a;
}
__global__ darradd (float *da, float x, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) da[i] = addmany (da[i], x, 10);
}
Variable Declarations – Will revisit next time
• __device__
– stored in device memory (large, high latency, no cache)
– Allocated with cudaMalloc (__device__qualifier implied)
– accessible by all threads
– lifetime: application
• __constant__
– same as __device__, but cached and read-only by GPU
– written by CPU via cudaMemcpyToSymbol(...) call
– lifetime: application
• __shared__
– stored in on-chip shared memory (very low latency)
– accessible by all threads in the same thread block
– lifetime: kernel launch
• Unqualified variables:
– scalars and built-in vector types are stored in registers
– arrays of more than 4 elements or run-time indices stored in device memory
Measurement Methodology
• You will not get exactly the same time
measurements every time
– Other processes running / external events (e.g.,
network activity)
– Cannot control
– “Non-determinism”
• Must take sufficient samples
– say 10 or more
– There is theory on what the number of samples
must be
• Measure average
• Will discuss this next time or will provide a
handout online
Handling Large Input Data Sets – 1D Example
• Recall gridDim.[xy] <= 65535
• Host calls kernel multiple times:
float *dac = da; // starting offset for current kernel
while (n_blocks)
{
int bn = n_blocks;
int elems; // array elements processed in this kernel
if (bn > 65535) bn = 65535;
elems = bn * block_size;
darradd <<<bn, block_size>>> (dac, 10.0f, elems);
n_blocks -= bn;
dac += elems;
}

More Related Content

Similar to 002 - Introduction to CUDA Programming_1.ppt

Cuda introduction
Cuda introductionCuda introduction
Cuda introductionHanibei
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Rob Gillen
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architectureDhaval Kaneria
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule
 
Performance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudBrendan Gregg
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computingbakers84
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingJun Young Park
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computationjtsagata
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUsShree Kumar
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsnARUNACHALAM468781
 

Similar to 002 - Introduction to CUDA Programming_1.ppt (20)

Cuda introduction
Cuda introductionCuda introduction
Cuda introduction
 
Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)Intro to GPGPU with CUDA (DevLink)
Intro to GPGPU with CUDA (DevLink)
 
Lecture 04
Lecture 04Lecture 04
Lecture 04
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
Cuda
CudaCuda
Cuda
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance Computing
 
GPU Programming
GPU ProgrammingGPU Programming
GPU Programming
 
Performance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloud
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
GPU-Accelerated Parallel Computing
GPU-Accelerated Parallel ComputingGPU-Accelerated Parallel Computing
GPU-Accelerated Parallel Computing
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 

Recently uploaded

How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 

Recently uploaded (20)

How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 

002 - Introduction to CUDA Programming_1.ppt

  • 1. Introduction to CUDA Programming CUDA Programming Introduction Andreas Moshovos Winter 2009 Some slides/material from: UIUC course by Wen-Mei Hwu and David Kirk UCSB course by Andrea Di Blas Universitat Jena by Waqar Saleem NVIDIA by Simon Green
  • 2. Programmer’s view • GPU as a co-processor CPU Memory GPU GPU Memory 1GB on our systems 3GB/s – 8GB.s 6.4GB/sec – 31.92GB/sec 8B per transfer 141GB/sec
  • 3. Target Applications int a[N]; // N is large for all elements of a compute a[i] = a[i] * fade • Lots of independent computations – CUDA thread need not be independent
  • 4. Programmer’s View of the GPU • GPU: a compute device that: – Is a coprocessor to the CPU or host – Has its own DRAM (device memory) – Runs many threads in parallel • Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
  • 5. Why are threads useful? • Concurrency: – Do multiple things in parallel – Put more hardware  Get higher performance Needs more functional units
  • 6. Why are threads useful #2 – Tolerating stalls • Often a thread stalls, e.g., memory access Multiplex the same functional unit Get more performance at a fraction of the cost
  • 7. GPU vs. CPU Threads • GPU threads are extremely lightweight • Very little creation overhead • In the order of microseconds • GPU needs 1000s of threads for full efficiency • Multi-core CPU needs only a few
  • 8. Execution Timeline time 1. Copy to GPU mem 2. Launch GPU Kernel GPU / Device 2’. Synchronize with GPU 3. Copy from GPU mem CPU / Host
  • 9. GBT: Grids of Blocks of Threads Why? Realities of integrated circuits: need to cluster computation and storage to achieve high speeds
  • 10. Grids of Blocks of Threads: Dimension Limits • Grid of Blocks 1D or 2D – Max x: 65535 – Max y: 65535 • Block of Threads: 1D, 2D, or 3D – Max number of threads: 512 – Max x: 512 – Max y: 512 – Max z: 64 • Limits apply to Compute Capability 1.0, 1.1, 1.2, and 1.3 – GTX280 = 1.3
  • 11. Block and Thread IDs • Threads and blocks have IDs – So each thread can decide what data to work on – Block ID: 1D or 2D – Thread ID: 1D, 2D, or 3D • Simplifies memory addressing when processing multidimensional data Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Block (1, 1) Thread (0, 1) Thread (1, 1) Thread (2, 1) Thread (3, 1) Thread (4, 1) Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2) Thread (4, 2) Thread (0, 0) Thread (1, 0) Thread (2, 0) Thread (3, 0) Thread (4, 0) • IDs and dimensions are easily accessible through predefined “variables”, e.g., blockDim.x and threadIdx.x
  • 12. Thread Batching • A kernel is executed as a grid of thread blocks – All threads share data memory space • A thread block: threads that can cooperate with each other by: – Synchronizing their execution • For hazard-free shared memory accesses – Efficiently sharing data through a low latency shared memory • Two threads from two different blocks cannot cooperate Host Kernel 1 Kernel 2 Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Grid 2 Block (1, 1) Thread (0, 1) Thread (1, 1) Thread (2, 1) Thread (3, 1) Thread (4, 1) Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2) Thread (4, 2) Thread (0, 0) Thread (1, 0) Thread (2, 0) Thread (3, 0) Thread (4, 0)
  • 13. Thread Coordination Overview • Race-free access to data
  • 15. Programmer’s View: Memory Detail – Thread and Host • Each thread can: – R/W per-thread registers – R/W per-thread local memory – R/W per-block shared memory – R/W per-grid global memory – Read only per-grid constant memory – Read only per-grid texture memory • The host can R/W: • global, constant, and texture memories
  • 16. Memory Model: Global, Constant, and Texture Memories • Global memory – Main means of communicating R/W Data between host and device – Contents visible to all threads • Texture and Constant Memories – Constants initialized by host – Contents visible to all threads
  • 17. Memory Model Summary Memory Location Cached Access Scope Local off-chip No R/W thread Shared on-chip N/A R/W all threads in a block Global off-chip No R/W all threads + host Constant off-chip Yes RO all threads + host Texture off-chip Yes RO all threads + host
  • 18. Execution Model: Ordering • Execution order is undefined • Do not assume and use: • block 0 executes before block 1 • Thread 10 executes before thread 20 • And any other ordering even if you can observe it
  • 19. Execution Model Summary • Grid of blocks of threads – 1D/2D grid of blocks – 1D/2D/3D blocks of threads • All blocks are identical: – same structure and # of threads • Block execution order is undefined • Same block threads: – can synchronize and share data fast (shared memory) • Threads from different blocks: – Cannot cooperate – Communication through global memory • Threads and Blocks have IDs – Simplifies data indexing – Can be 1D, 2D, or 3D (threads) • Blocks do not migrate: execute on the same processor • Several blocks may run over the same processor
  • 21. Reasoning about CUDA call ordering • GPU communication via cuda…() calls and kernel invocations – cudaMalloc, cudaMemCpy, • Asynchronous from the CPU’s perspective – CPU places a request in a “CUDA” queue – requests are handled in-order • Streams allow for multiple queues – More on this much later one
  • 22. CUDA API: Example int a[N]; for (i =0; i < N; i++) a[i] = a[i] + x; 1. Allocate CPU Data Structure 2. Initialize Data on CPU 3. Allocate GPU Data Structure 4. Copy Data from CPU to GPU 5. Define Execution Configuration 6. Run Kernel 7. CPU synchronizes with GPU 8. Copy Data from GPU to CPU 9. De-allocate GPU and CPU memory
  • 23. 1. Allocate CPU Data float *ha; main (int argc, char *argv[]) { int N = atoi (argv[1]); ha = (float *) malloc (sizeof (float) * N); ... } • Pinned memory allocation results in faster CPU to/from GPU copies • More on this later • cudaMallocHost (…)
  • 24. 2. Initialize CPU Data float *ha; int i; for (i = 0; i < N; i++) ha[i] = i;
  • 25. 3. Allocate GPU Data float *da; cudaMalloc ((void **) &da, sizeof (float) * N); • Notice: no assignment side – NOT: da = cudaMalloc (…) • Assignment is done internally: – That why we pass &da • Allocated space in Global Memory
  • 26. GPU Memory Allocation • The host manages GPU memory allocation: – cudaMalloc (void **ptr, size_t nbytes) – Must explicitly cast to (void **) • cudaMalloc ((void **) &da, sizeof (float) * N); – cudaFree (void *ptr); • cudaFree (da); – cudaMemset (void *ptr, int value, size_t nbytes); – cudaMemset (da, 0, N * sizeof (int)); • Check the CUDA Reference Manual
  • 27. 4. Copy Initialized CPU data to GPU float *da; float *ha; cudaMemCpy ((void *) da, // DESTINATION (void *) ha, // SOURCE sizeof (float) * N, // #bytes cudaMemcpyHostToDevice); // DIRECTION
  • 28. Host/Device Data Transfers • The host initiates all transfers: • cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction) • Asynchronous from the CPU’s perspective – CPU thread continues • In-order processing with other CUDA requests • enum cudaMemcpyKind – cudaMemcpyHostToDevice – cudaMemcpyDeviceToHost – cudaMemcpyDeviceToDevice
  • 29. 5. Define Execution Configuration • How many blocks and threads/block int threads_block = 64; int blocks = N / threads_block; if (blocks % N != 0) blocks += 1; • Alternatively: blocks = (N + threads_block – 1) / threads_block;
  • 30. 6. Launch Kernel & 7. CPU/GPU Synchronization • Instructs the GPU to launch blocks x thread_blocks threads: darradd <<<blocks, threads_block>> (da, 10f, N); cudaThreadSynchronize (); • darradd: kernel name • <<<…>>> execution configuration – More on this soon • (da, x, N): arguments – 256 – 8 byte limit / No variable arguments
  • 31. CPU/GPU Synchronization • CPU does not block on cuda…() calls – Kernel/requests are queued and processed in-order – Control returns to CPU immediately • Good if there is other work to be done – e.g., preparing for the next kernel invocation • Eventually, CPU must know when GPU is done • Then it can safely copy the GPU results • cudaThreadSynchronize () – Block CPU until all preceding cuda…() and kernel requests have completed
  • 32. 8. Copy data from GPU to CPU & 9. DeAllocate Memory float *da; float *ha; cudaMemCpy ((void *) ha, // DESTINATION (void *) da, // SOURCE sizeof (float) * N, // #bytes cudaMemcpyDeviceToHost); // DIRECTION cudaFree (da); // display or process results here free (ha);
  • 33. The GPU Kernel __global__ darradd (float *da, float x, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) da[i] = da[i] + x; } • BlockIdx: Unique Block ID. – Numerically asceding: 0, 1, … • ThreadIdx: Unique per Block Index – 0, 1, … – Per Block • BlockDim: Dimensions of Block – BlockDim.x, BlockDim.y, BlockDim.z – Unused dimensions default to 0
  • 34. Array Index Calculation Example int i = blockIdx.x * blockDim.x + threadIdx.x; a[0] a[63] a[64] a[127]a[128] a[255]a[256] blockIdx.x 0 blockIdx.x 1 blockIdx.x 2 i = 0 i = 63 i = 64 i = 127 i = 128 i = 255 i = 256 Assuming blockDim.x = 64
  • 35. Generic Unique Thread and Block Index Calculations #1 • 1D Grid / 1D Blocks: UniqueBlockIndex = blockIdx.x; UniqueThreadIndex = blockIdx.x * blockDim.x + threadIdx.x; • 1D Grid / 2D Blocks: UniqueBlockIndex = blockIdx.x; UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x; • 1D Grid / 3D Blocks: UniqueBockIndex = blockIdx.x; UniqueThreadIndex = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x; • Source: http://forums.nvidia.com/lofiversion/index.php?t82040.html
  • 36. Generic Unique Thread and Block Index Calculations #2 • 2D Grid / 1D Blocks: UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x; UniqueThreadIndex = UniqueBlockIndex * blockDim.x + threadIdx.x; • 2D Grid / 2D Blocks: UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x; UniqueThreadIndex =UniqueBlockIndex * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x; • 2D Grid / 3D Blocks: UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x; UniqueThreadIndex = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x + threadIdx.z * blockDim.y * blockDim.z + threadIdx.y * blockDim.x + threadIdx.x; • UniqueThreadIndex means unique per grid.
  • 37. CUDA Function Declarations • __global__ defines a kernel function – Must return void – Can only call __device__ functions • __device__ and __host__ can be used together Executed on the: Only callable from the: __device__ float DeviceFunc() device device __global__ void KernelFunc() device host __host__ float HostFunc() host host
  • 38. __device__ Example • Add x to a[i] multiple times __device__ float addmany (float a, float b, int count) { while (count--) a += b; return a; } __global__ darradd (float *da, float x, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) da[i] = addmany (da[i], x, 10); }
  • 39. Kernel and Device Function Restrictions • __device__ functions cannot have their address taken – e.g., f = &addmany; *f(…); • For functions executed on the device: – No recursion • darradd (…) { darradd (…) } – No static variable declarations inside the function • darradd (…) { static int canthavethis; } – No variable number of arguments • e.g., something like printf (…)
  • 40. Predefined Vector Datatypes • Can be used both in host and in device code. – [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4] • Structures accessed with .x, .y, .z, .w fields • default constructors, “make_TYPE (…)”: – float4 f4 = make_float4 (1f, 10f, 1.2f, 0.5f); • dim3 – type built on uint3 – Used to specify dimensions – Default value is (1, 1, 1)
  • 41. Execution Configuration • Must specify when calling a __global__ function: <<< Dg, Db [, Ns [, S]] >>> • where: – dim3 Dg: grid dimensions in blocks – dim3 Db: block dimensions in threads – size_t Ns: per block additional number of shared memory bytes to allocate • optional, defaults to 0 • more on this much later on – cudaStream_t S: request stream(queue) • optional, default to 0. • Compute capability >= 1.1
  • 42. Built-in Variables • dim3 gridDim – Number of blocks per grid, in 2D (.z always 1) • uint3 blockIdx – Block ID, in 2D (blockIdx.z = 1 always) • dim3 blockDim – Number of threads per block, in 3D • uint3 threadIdx – Thread ID in block, in 3D
  • 43. Execution Configuration Examples • 1D grid / 1D blocks dim3 gd(1024) dim3 bd(64) akernel<<<gd, bd>>>(...) gridDim.x = 1024, gridDim.y = 1, blockDim.x = 64, blockDim.y = 1, blockDim.z = 1 • 2D grid / 3D blocks dim3 gd(4, 128) dim3 bd(64, 16, 4) akernel<<<gd, bd>>>(...) gridDim.x = 4, gridDim.y = 128, blockDim.x = 64, blockDim.y = 16, blockDim.z = 4
  • 44. Error Handling • Most cuda…() functions return a cudaError_t – If cudaSuccess: Request completed without a problem • cudaGetLastError(): – returns the last error to the CPU – Use with cudaThreadSynchronize(): cudaError_t code; cudaThreadSynchronize (); code = cudaGetLastError (); • char *cudaGetErrorString(cudaError_t code); – returns a human-readable description of the error code
  • 45. Error Handling Utility Function void cudaDie (const char *msg) { cudaError_t err; cudaThreadSynchronize (); err = cudaGetLastError(); if (err == cudaSuccess) return; fprintf (stderr, "CUDA error: %s: %s.n", msg, cudaGetErrorString (err)); exit(EXIT_FAILURE); } • adapted from: http://www.ddj.com/hpc-high-performance-computing/207603131
  • 46. Error Handling Macros • CUDA_SAFE_CALL ( some cuda call ) CUDA_SAFE_CALL (cudaMemcpy (a_h, a_d, arr_size, cudaMemcpyDeviceToHost) ); • Must define #define _DEBUG – No code emitted when undefined: Performance • Use make dbg=1 under NVIDIA_CUDA_SDK
  • 47. Measuring Time -- gettimeofday • Unix-based: #include <sys/time.h> #include <time.h> struct timeval start, end; gettimeofday (&start, NULL); WHAT WE ARE INTERESTED IN gettimeofday (&end, NULL); timeCpu = (float)(end.tv_sec - start.tv_sec); if (end.tv_usec < start.tv_usec) { timeCpu -= 1.0; timeCpu += (double)(1000000.0 + end.tv_usec - start.tv_usec)/1000000.0; } else timeCpu += (double)(end.tv_usec - start.tv_usec)/1000000.0;
  • 48. Using CUDA clock () • clock_t clock (); • Can be used in device code • returns per multiprocessor counter which is incremented every clock cycle • Sample at the beginning and end • upper bound since threads are time-sliced • uint start = clock(); ... compute (less than 3 sec) .... uint end = clock(); if (end > start) time = end - start; else time = end + (0xffffffff - start) • Look at the clock example under projects in SDK • Using takes some effort – Every thread measures start and end – Then must find min start and max end – Accurate
  • 49. Using cutTimer…() library calls #include <cuda.h> #include <cutil.h> unsigned int htimer; cutCreateTimer (&htimer); CudaThreadSynchronize (); cutStartTimer(htimer); WHAT WE ARE INTERESTED IN cudaThreadSynchronize (); cutStopTimer(htimer); printf (“time: %fn", cutGetTimerValue(htimer));
  • 50. Code Overview: Host side #include <cuda.h> #include <cutil.h> unsigned int htimer; float *ha, *da; main (int argc, char *argv[]) { int N = atoi (argv[1]); ha = (float *) malloc (sizeof (float) * N); for (int i = 0; i < N; i++) ha[i] = i; cutCreateTimer (&htimer); cudaMalloc ((void **) &da, sizeof (float) * N); cudaMemCpy ((void *) da, (void *) ha, sizeof (float) * N, cudaMemcpyHostToDevice); cudaThreadSynchronize (); cutStartTimer(htimer); blocks = (N + threads_block – 1) / threads_block; darradd <<<blocks, threads_block>> (da, 10f, N) cudaThreadSynchronize (); cutStopTimer(htimer); cudaMemCpy ((void *) ha, (void *) da, sizeof (float) * N, cudaMemcpyDeviceToHost); cudaFree (da); free (ha); printf (“processing time: %fn", cutGetTimerValue(htimer)); }
  • 51. Code Overview: Device Side __device__ float addmany (float a, float b, int count) { while (count--) a += b; return a; } __global__ darradd (float *da, float x, int N) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < N) da[i] = addmany (da[i], x, 10); }
  • 52. Variable Declarations – Will revisit next time • __device__ – stored in device memory (large, high latency, no cache) – Allocated with cudaMalloc (__device__qualifier implied) – accessible by all threads – lifetime: application • __constant__ – same as __device__, but cached and read-only by GPU – written by CPU via cudaMemcpyToSymbol(...) call – lifetime: application • __shared__ – stored in on-chip shared memory (very low latency) – accessible by all threads in the same thread block – lifetime: kernel launch • Unqualified variables: – scalars and built-in vector types are stored in registers – arrays of more than 4 elements or run-time indices stored in device memory
  • 53. Measurement Methodology • You will not get exactly the same time measurements every time – Other processes running / external events (e.g., network activity) – Cannot control – “Non-determinism” • Must take sufficient samples – say 10 or more – There is theory on what the number of samples must be • Measure average • Will discuss this next time or will provide a handout online
  • 54. Handling Large Input Data Sets – 1D Example • Recall gridDim.[xy] <= 65535 • Host calls kernel multiple times: float *dac = da; // starting offset for current kernel while (n_blocks) { int bn = n_blocks; int elems; // array elements processed in this kernel if (bn > 65535) bn = 65535; elems = bn * block_size; darradd <<<bn, block_size>>> (dac, 10.0f, elems); n_blocks -= bn; dac += elems; }