CUDA Architecture


  1. CUDA Architecture Overview
  2. PROGRAMMING ENVIRONMENT
  3. CUDA APIs. An API allows the host to manage the devices: allocate memory, transfer data, and launch kernels. The CUDA C "Runtime" API offers a high level of abstraction: start here! The CUDA C "Driver" API gives more control but is more verbose. (OpenCL is similar to the CUDA C Driver API.) A runtime-API sketch appears after the slide list.
  4. CUDA C and OpenCL. OpenCL is the entry point for developers who want a low-level API; CUDA C is the entry point for developers who prefer high-level C. Both share back-end compiler and optimization technology.
  5. Processing Flow. 1. Copy input data from CPU memory to GPU memory over the PCI bus.
  6. Processing Flow. 1. Copy input data from CPU memory to GPU memory. 2. Load the GPU program and execute, caching data on chip for performance.
  7. Processing Flow. 1. Copy input data from CPU memory to GPU memory. 2. Load the GPU program and execute, caching data on chip for performance. 3. Copy results from GPU memory back to CPU memory.
  8. CUDA Parallel Computing Architecture. A parallel computing architecture and programming model. Includes a CUDA C compiler and support for OpenCL and DirectCompute. Architected to natively support multiple computational interfaces (standard languages and APIs).
  9. C for CUDA: C with a few keywords.
     Standard C code:
         void saxpy_serial(int n, float a, float *x, float *y)
         {
             for (int i = 0; i < n; ++i)
                 y[i] = a*x[i] + y[i];
         }
         // Invoke serial SAXPY kernel
         saxpy_serial(n, 2.0, x, y);
     Parallel C code:
         __global__ void saxpy_parallel(int n, float a, float *x, float *y)
         {
             int i = blockIdx.x*blockDim.x + threadIdx.x;
             if (i < n)
                 y[i] = a*x[i] + y[i];
         }
         // Invoke parallel SAXPY kernel with 256 threads/block
         int nblocks = (n + 255) / 256;
         saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
     (A complete, runnable version of this example appears after the slide list.)
  10. CUDA Parallel Computing Architecture. CUDA defines a programming model, a memory model, and an execution model. CUDA uses the GPU, but is for general-purpose computing, and it facilitates heterogeneous computing: CPU + GPU. CUDA is scalable: it scales to run on hundreds of cores and thousands of parallel threads.
  11. Compiling CUDA C Applications (Runtime API). A C CUDA application (e.g. saxpy_serial and the functions that call it) is split: the key kernels are modified into parallel CUDA code and compiled by NVCC (Open64), while the rest of the C code goes to the CPU compiler. The resulting CUDA object files and CPU object files are linked into a single CPU-GPU executable. (A typical nvcc invocation is noted in the SAXPY sketch after the slide list.)
  12. PROGRAMMING MODEL: CUDA Kernels. The parallel portion of an application executes as a kernel; the entire GPU executes the kernel with many threads. CUDA threads are lightweight, switch quickly, and run by the thousands simultaneously. The CPU (host) executes functions; the GPU (device) executes kernels.
  13. CUDA Kernels: Parallel Threads. A kernel is a function executed on the GPU by an array of threads, in parallel. All threads execute the same code but can take different paths. Each thread has an ID, used to select input/output data and make control decisions, e.g.: float x = input[threadID]; float y = func(x); output[threadID] = y;
  14. CUDA Kernels: Subdivide into Blocks
  15. CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks.
  16. CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks. Blocks are grouped into a grid.
  17. CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks. Blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads.
  18. CUDA Kernels: Subdivide into Blocks (on the GPU). Threads are grouped into blocks. Blocks are grouped into a grid. A kernel is executed as a grid of blocks of threads. (A grid/block indexing sketch appears after the slide list.)
  19. Communication Within a Block. Threads may need to cooperate: on memory accesses and to share results. They cooperate through shared memory, which is accessible by all threads within a block. The restriction to "within a block" is what permits scalability: fast communication between N threads is not feasible when N is large. (A shared-memory reduction sketch appears after the slide list.)
  20. Transparent Scalability – G84 [diagram: twelve thread blocks scheduled two at a time across the available multiprocessors]
  21. Transparent Scalability – G80 [diagram: the same twelve thread blocks scheduled in wider groups on a device with more multiprocessors]
  22. Transparent Scalability – GT200 [diagram: the same twelve thread blocks all running at once, with the remaining multiprocessors idle]
  23. CUDA Programming Model – Summary. A kernel executes as a grid of thread blocks: the host launches kernels, the device runs them. A block is a batch of threads that communicate through shared memory. Each block has a block ID and each thread has a thread ID; grids and blocks can be 1D or 2D (e.g. Kernel 1 with blocks 0–3, Kernel 2 with blocks (0,0)–(1,3)).
  24. MEMORY MODEL: Memory hierarchy. Thread: registers.
  25. Memory hierarchy. Thread: registers. Thread: local memory.
  26. Memory hierarchy. Thread: registers. Thread: local memory. Block of threads: shared memory.
  27. Memory hierarchy. Thread: registers. Thread: local memory. Block of threads: shared memory.
  28. Memory hierarchy. Thread: registers. Thread: local memory. Block of threads: shared memory. All blocks: global memory.
  29. Memory hierarchy. Thread: registers. Thread: local memory. Block of threads: shared memory. All blocks: global memory. (An annotated kernel touching each of these memory spaces appears after the slide list.)
  30. Additional Memories. The host can also allocate textures and arrays of constants. Textures and constants have dedicated caches. (A constant-memory sketch appears after the slide list.)
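The following minimal sketch assembles the SAXPY fragments from slide 9 into a complete program, using the runtime API from slide 3 and tracing the three-step processing flow of slides 5-7 in the numbered comments. The problem size, initial values, and 256-thread block size are illustrative assumptions, and error checking is omitted for brevity.

    // Minimal SAXPY sketch using the CUDA C runtime API (slides 3, 5-7, 9).
    // Typical compilation (exact flags depend on the toolkit and GPU):
    //   nvcc -o saxpy saxpy.cu
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        int n = 1 << 20;                        // illustrative problem size
        size_t bytes = n * sizeof(float);

        float *h_x = (float *)malloc(bytes);    // host (CPU) memory
        float *h_y = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

        float *d_x, *d_y;                       // device (GPU) memory,
        cudaMalloc((void **)&d_x, bytes);       // managed from the host
        cudaMalloc((void **)&d_y, bytes);

        // 1. Copy input data from CPU memory to GPU memory (over the PCI bus).
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

        // 2. Load the GPU program and execute: launch the kernel as a grid of
        //    blocks with 256 threads per block.
        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);

        // 3. Copy results from GPU memory back to CPU memory.
        cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

        printf("y[0] = %f\n", h_y[0]);          // expect 4.0 = 2.0*1.0 + 2.0

        cudaFree(d_x); cudaFree(d_y);
        free(h_x); free(h_y);
        return 0;
    }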
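Next, a small sketch of the grid/block/thread hierarchy from slides 14-18 and 23. The matrix_add kernel, the 16x16 block shape, and the launch_matrix_add helper are illustrative assumptions rather than material from the slides; they show how each thread derives a unique 2D ID from blockIdx, blockDim, and threadIdx.

    // Each thread computes one element of C = A + B, selecting its data
    // from its 2D thread ID (slide 13: "Each thread has an ID").
    #include <cuda_runtime.h>

    __global__ void matrix_add(int width, int height,
                               const float *a, const float *b, float *c)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // 2D global thread ID
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height) {                // guard partial blocks
            int idx = row * width + col;
            c[idx] = a[idx] + b[idx];
        }
    }

    // Host-side launch: a 2D grid of 2D blocks (16x16 threads per block here).
    void launch_matrix_add(int width, int height,
                           const float *d_a, const float *d_b, float *d_c)
    {
        dim3 block(16, 16);
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        matrix_add<<<grid, block>>>(width, height, d_a, d_b, d_c);
    }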
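To make slide 19's point about cooperation through shared memory concrete, here is a sketch of a per-block sum. The block_sum kernel is hypothetical (not from the slides) and assumes a power-of-two block size of at most 256 threads.

    // Threads within one block stage data in shared memory, synchronize,
    // and combine partial results; each block writes one sum to global memory.
    __global__ void block_sum(const float *in, float *block_sums, int n)
    {
        __shared__ float cache[256];              // visible to all threads in the block

        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        cache[tid] = (i < n) ? in[i] : 0.0f;      // each thread loads one element
        __syncthreads();                          // wait until the whole block has loaded

        // Tree reduction inside the block: this only works because all
        // cooperating threads share the same fast on-chip memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            block_sums[blockIdx.x] = cache[0];    // one partial result per block
    }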
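The memory-hierarchy slides (24-29) can be read alongside a kernel that touches each space. The variable names and sizes below are illustrative assumptions, and the comment about the per-thread array is hedged: the compiler, not the programmer, decides between registers and local memory.

    // Per-thread registers and local memory, per-block shared memory,
    // and global memory visible to all blocks (assumes blockDim.x <= 256).
    __global__ void memory_spaces(float *global_out)   // global memory: all blocks
    {
        float r = threadIdx.x * 2.0f;    // a scalar normally lives in a register (per thread)

        float spill[64];                 // a large per-thread array may be placed in
        for (int i = 0; i < 64; ++i)     // local memory (per thread, off-chip)
            spill[i] = r + (float)i;

        __shared__ float tile[256];      // shared memory: one copy per block, on-chip
        tile[threadIdx.x] = spill[63];
        __syncthreads();

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        global_out[gid] = tile[threadIdx.x] + r;   // result written back to global memory
    }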
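Finally, a sketch of the "additional memories" from slide 30 using __constant__ memory. The coefficient table, its size, and the upload_coefficients helper are assumptions for illustration; textures are configured through a separate texture API and are not shown here.

    // A small read-only table in constant memory, served by its dedicated cache.
    #include <cuda_runtime.h>

    __constant__ float coeff[16];                     // read-only, cached on the device

    __global__ void apply_coeff(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * coeff[i % 16];           // read through the constant cache
    }

    void upload_coefficients(const float *host_coeff)
    {
        // The host copies the table from CPU memory into device constant memory.
        cudaMemcpyToSymbol(coeff, host_coeff, 16 * sizeof(float));
    }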
