Cuda 2


Published in: Technology

  1. CUDA Programming, continued. ITCS 4145/5145, Nov 24, 2010. © Barry Wilkinson (revised).
  2. Timing GPU Execution
     Can use CUDA "events" – create two events and compute the time between them:

     cudaEvent_t start, stop;
     float elapsedTime;
     cudaEventCreate(&start);    // create event objects
     cudaEventCreate(&stop);
     cudaEventRecord(start, 0);  // record start event
     ...
     cudaEventRecord(stop, 0);   // record stop event
     cudaEventSynchronize(stop); // wait for work preceding the stop event to complete
     cudaEventElapsedTime(&elapsedTime, start, stop); // compute elapsed time (in ms) between events
     cudaEventDestroy(start);    // destroy start event
     cudaEventDestroy(stop);     // destroy stop event
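Put together, timing a kernel launch with events looks roughly like this (a sketch; the kernel name `busy` and the sizes are illustrative, not from the slides):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int N = 1 << 20;
    float *d;
    cudaMalloc(&d, N * sizeof(float));

    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // record start event on stream 0
    busy<<<(N + 255) / 256, 256>>>(d, N);    // the work being timed
    cudaEventRecord(stop, 0);                // record stop event on stream 0
    cudaEventSynchronize(stop);              // wait until the stop event has occurred

    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel took %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```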
  3. (Slides 3–7 contained figures only; no text was captured in this transcript.)
  8. Host Synchronization
     Kernels:
     • Control returns to the CPU immediately (asynchronous, non-blocking).
     • The kernel starts after all previous CUDA calls have completed.
     cudaMemcpy:
     • Returns after the copy is complete (synchronous).
     • The copy starts after all previous CUDA calls have completed.
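A minimal sketch of this behavior (the kernel name `scale` and the sizes are illustrative): the launch returns immediately, while the subsequent `cudaMemcpy` implicitly waits for the kernel before copying.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= factor;
}

int main() {
    const int N = 1024;
    float h[N];
    for (int i = 0; i < N; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(N + 255) / 256, 256>>>(d, 2.0f, N);
    // Control returns here immediately; the kernel may still be running.

    // cudaMemcpy starts only after the preceding kernel completes,
    // and does not return until the copy itself is done.
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    return 0;
}
```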
  9. CUDA Synchronization Routines
     Host:
     cudaThreadSynchronize()
     • Blocks until all previous CUDA calls complete.
     GPU:
     void __syncthreads()
     • Synchronizes all threads in a block.
     • Barrier – no thread can pass until all threads in the block reach it.
     • All threads in the block must reach the __syncthreads() call.
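A common use of __syncthreads() is a block-level reduction in shared memory. A sketch (the kernel name is illustrative, and it assumes blockDim.x is exactly 256, a power of two):

```cuda
__global__ void blockSum(const float *in, float *out) {
    __shared__ float s[256];           // one element per thread in the block
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                   // all loads done before anyone reads s[]

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();               // every thread must reach this, each iteration
    }
    if (tid == 0) out[blockIdx.x] = s[0];  // thread 0 writes the block's sum
}
```

Note that the __syncthreads() inside the loop is outside the `if`: placing a barrier on a divergent path where some threads in the block never reach it is an error.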
  10. GPU Atomic Operations
      Performs a read-modify-write atomic operation on one word residing in global or shared memory.
      Associative operations on signed/unsigned integers: add, sub, min, max, and, or, xor, increment, decrement, exchange, compare-and-swap.
      Requires a GPU with compute capability 1.1+ (shared memory operations and 64-bit words require higher capability).
      coit-grid06 Tesla C2050 has compute capability 2.0.
      See the CUDA documentation for GPU compute capabilities.
  11. Atomic Operation Example
      int atomicAdd(int* address, int val);
      Reads the word old located at address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations (read, compute, and write) are performed in one atomic transaction.* The function returns old.

      * Once started, it continues to completion without being interrupted by other processors. Other processors cannot read or write the memory location once the atomic operation starts. The mechanism is implemented in hardware.
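A typical use of atomicAdd is histogramming, where many threads may increment the same bin concurrently (a sketch; the kernel and parameter names are illustrative):

```cuda
__global__ void histogram(const int *data, int n, int *bins, int numBins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int b = data[i] % numBins;
        // A plain bins[b]++ would be a non-atomic read-modify-write:
        // two threads hitting the same bin could each read the old value
        // and lose one of the updates. atomicAdd makes the update safe.
        atomicAdd(&bins[b], 1);
    }
}
```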
  12. Other operations
      int atomicSub(int* address, int val);
      int atomicExch(int* address, int val);
      int atomicMin(int* address, int val);
      int atomicMax(int* address, int val);
      unsigned int atomicInc(unsigned int* address, unsigned int val);
      unsigned int atomicDec(unsigned int* address, unsigned int val);
      int atomicCAS(int* address, int compare, int val); // compare and swap
      int atomicAnd(int* address, int val);
      int atomicOr(int* address, int val);
      int atomicXor(int* address, int val);
      Source: NVIDIA CUDA C Programming Guide, version 3.2, 11/9/2010
  13. Compare and Swap (also called compare and exchange)
      int atomicCAS(int* address, int compare, int val);
      Reads the word old located at address address in global or shared memory, and compares old with compare. If they are the same, it sets old to val (stores val at address address), i.e.:
      if (old == compare) old = val; // else old is unchanged
      The three operations (read, compare, and write) are performed in one atomic transaction. The function returns the original value of old.
      Also unsigned and unsigned long long int versions.
  14. Coding Critical Sections with Locks
      __device__ int lock = 0; // unlocked

      __global__ void kernel(...) {
          while (atomicCAS(&lock, 0, 1) != 0); // spin: when lock == 0, set it to 1 and continue
          ...          // critical section
          lock = 0;    // free lock
      }
  15. Memory Fences
      Threads may see the effects of a series of writes to memory executed by another thread in different orders. To enforce ordering:
      void __threadfence_block();
      Waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block.
      Other routines:
      void __threadfence();
      void __threadfence_system();
  16. Critical sections with memory operations
      Writes to device memory are not guaranteed to complete in any particular order, so global writes may not have completed by the time the lock is unlocked:

      __global__ void kernel(...) {
          while (atomicCAS(&lock, 0, 1) != 0); // acquire lock
          ...              // critical section
          __threadfence(); // wait for writes to finish
          lock = 0;        // free lock
      }
  17. Error reporting
      All CUDA calls (except kernel launches) return an error code of type cudaError_t.
      cudaError_t cudaGetLastError(void)
      Returns the code for the last error. Can be used to get an error from a kernel execution.
      char* cudaGetErrorString(cudaError_t code)
      Returns a null-terminated character string describing the error.
      Example:
      printf("%s\n", cudaGetErrorString(cudaGetLastError()));
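In practice, runtime calls are often wrapped in a checking macro so every error is reported with its location. A sketch (the macro name CUDA_CHECK is a convention, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper macro: print the error string with file/line and abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d, bytes));
// Kernel launches return void, so check them afterwards with:
//   kernel<<<grid, block>>>(...);
//   CUDA_CHECK(cudaGetLastError());
```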
  18. 18. Questions