CUDA Architecture

    1. CUDA Architecture Overview
    2. PROGRAMMING ENVIRONMENT
    3. CUDA APIs: the API allows the host to manage the devices: allocate memory, transfer data, and launch kernels. The CUDA C “Runtime” API gives a high level of abstraction - start here! The CUDA C “Driver” API gives more control but is more verbose. (OpenCL is similar to the CUDA C Driver API.)
    4. CUDA C and OpenCL: OpenCL is the entry point for developers who want a low-level API; CUDA C is the entry point for developers who prefer high-level C. Both share back-end compiler and optimization technology.
    5. Processing Flow (PCI bus): 1. Copy input data from CPU memory to GPU memory.
    6. Processing Flow (PCI bus): 1. Copy input data from CPU memory to GPU memory. 2. Load GPU program and execute, caching data on chip for performance.
    7. Processing Flow (PCI bus): 1. Copy input data from CPU memory to GPU memory. 2. Load GPU program and execute, caching data on chip for performance. 3. Copy results from GPU memory to CPU memory.
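The three-step processing flow above can be sketched with runtime API calls (a minimal sketch; the array size and the `scale` kernel are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;          // example GPU work
}

int main(void) {
    const int n = 1024;
    float h[1024];                    // host (CPU) buffer
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                            // GPU memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // step 1: CPU -> GPU
    scale<<<(n + 255) / 256, 256>>>(d, n);                        // step 2: run kernel
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // step 3: GPU -> CPU
    cudaFree(d);
    return 0;
}
```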
    8. CUDA Parallel Computing Architecture: a parallel computing architecture and programming model. Includes a CUDA C compiler plus support for OpenCL and DirectCompute. Architected to natively support multiple computational interfaces (standard languages and APIs).
    9. C for CUDA: C with a few keywords.

       Standard C code:

           void saxpy_serial(int n, float a, float *x, float *y)
           {
               for (int i = 0; i < n; ++i)
                   y[i] = a*x[i] + y[i];
           }

           // Invoke serial SAXPY kernel
           saxpy_serial(n, 2.0, x, y);

       Parallel CUDA C code:

           __global__ void saxpy_parallel(int n, float a, float *x, float *y)
           {
               int i = blockIdx.x*blockDim.x + threadIdx.x;
               if (i < n)
                   y[i] = a*x[i] + y[i];
           }

           // Invoke parallel SAXPY kernel with 256 threads/block
           int nblocks = (n + 255) / 256;
           saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
    10. CUDA Parallel Computing Architecture: CUDA defines a programming model, a memory model, and an execution model. CUDA uses the GPU, but is for general-purpose computing, and facilitates heterogeneous computing: CPU + GPU. CUDA is scalable: it scales to run on 100s of cores / 1000s of parallel threads.
    11. Compiling CUDA C Applications (Runtime API): NVCC splits a C-with-CUDA source file into the key kernels (e.g. saxpy_serial modified into parallel CUDA code), which it compiles into CUDA object files using Open64, and the rest of the C code, which goes to the host CPU compiler to produce CPU object files; the linker then combines both sets of object files into a single CPU-GPU executable.
    12. PROGRAMMING MODEL. CUDA Kernels: the parallel portion of an application executes as a kernel; the entire GPU executes the kernel with many threads. CUDA threads are lightweight, switch fast, and 1000s execute simultaneously. The CPU (host) executes functions; the GPU (device) executes kernels.
    13. CUDA Kernels: Parallel Threads. A kernel is a function executed on the GPU by an array of threads, in parallel:

            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;

        All threads execute the same code but can take different paths. Each thread has an ID, used to select input/output data and make control decisions.
    14. CUDA Kernels: Subdivide into Blocks
    15. CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks.
    16. CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks; blocks are grouped into a grid.
    17. CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks; blocks are grouped into a grid; a kernel is executed as a grid of blocks of threads.
    18. CUDA Kernels: Subdivide into Blocks (on the GPU). Threads are grouped into blocks; blocks are grouped into a grid; a kernel is executed as a grid of blocks of threads.
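A launch matching the thread/block/grid hierarchy above might look like this (a sketch; the kernel body and names are illustrative):

```cuda
__global__ void kernel(float *data, int n) {
    // global thread index = block offset + thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = (float)i;
}

// Launch a grid of blocks, each block a group of threads.
// blocksPerGrid is rounded up so every element gets a thread.
void launch(float *d_data, int n) {
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    kernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
}
```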
    19. Communication Within a Block: threads may need to cooperate on memory accesses and to share results. They cooperate using shared memory, accessible by all threads within a block. The restriction to “within a block” is what permits scalability: fast communication between N threads is not feasible when N is large.
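Shared-memory cooperation within a block can be sketched with a per-block sum (illustrative, not from the slides; assumes a block size of 256):

```cuda
// Each block sums 256 of its inputs cooperatively through shared
// memory; threads in *other* blocks cannot see this buffer.
__global__ void block_sum(const float *in, float *block_results) {
    __shared__ float buf[256];                 // visible to this block only
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // all loads done before reading

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_results[blockIdx.x] = buf[0];
}
```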
    20. Transparent Scalability – G84: [diagram: the same kernel’s blocks 1–12 scheduled a few at a time across the GPU’s cores]
    21. Transparent Scalability – G80: [diagram: blocks 1–12 scheduled in larger batches on a wider GPU]
    22. Transparent Scalability – GT200: [diagram: all of blocks 1–12 run concurrently; remaining cores sit idle]
    23. CUDA Programming Model - Summary: on the host, kernels are launched to the device. A kernel executes as a grid of thread blocks; a block is a batch of threads that communicate through shared memory. Each block has a block ID and each thread has a thread ID; grids and blocks can be 1D or 2D.
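The 1D/2D block and thread IDs from the summary combine into element coordinates like this (a sketch with illustrative names):

```cuda
__global__ void kernel2d(float *out, int width, int height) {
    // 2D block ID and 2D thread ID combine into 2D element coordinates
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = (float)(x + y);
}

void launch2d(float *d_out, int width, int height) {
    dim3 block(16, 16);                           // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,   // round up in each dimension
              (height + block.y - 1) / block.y);
    kernel2d<<<grid, block>>>(d_out, width, height);
}
```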
    24. MEMORY MODEL. Memory hierarchy - Thread: registers.
    25. Memory hierarchy - Thread: registers; Thread: local memory.
    26. Memory hierarchy - Thread: registers; Thread: local memory; Block of threads: shared memory.
    27. Memory hierarchy - Thread: registers; Thread: local memory; Block of threads: shared memory.
    28. Memory hierarchy - Thread: registers; Thread: local memory; Block of threads: shared memory; All blocks: global memory.
    29. Memory hierarchy - Thread: registers; Thread: local memory; Block of threads: shared memory; All blocks: global memory.
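One kernel touching each level of the hierarchy might look like this (a sketch; whether the per-thread array actually spills from registers to local memory is up to the compiler):

```cuda
__global__ void levels(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float r = in[i];                  // register: private to this thread
    float spill[64];                  // per-thread array; large ones may
    spill[threadIdx.x % 64] = r;      // be placed in local memory

    __shared__ float s[256];          // shared: visible to the whole block
    s[threadIdx.x] = r;
    __syncthreads();

    out[i] = s[threadIdx.x] + spill[threadIdx.x % 64];  // global: all blocks
}
```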
    30. Additional Memories: the host can also allocate textures and arrays of constants; textures and constants have dedicated caches.
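Constant memory, one of the additional memories above, can be sketched like this (illustrative names; the host fills the table before launching):

```cuda
#include <cuda_runtime.h>

// Coefficients placed in constant memory: read-only from kernels,
// served through the dedicated constant cache.
__constant__ float coeffs[4];

__global__ void poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = coeffs[0] + x[i] * (coeffs[1] + x[i] * (coeffs[2] + x[i] * coeffs[3]));
}

// Host side: copy the table into constant memory before launching.
void setup(const float host_coeffs[4]) {
    cudaMemcpyToSymbol(coeffs, host_coeffs, 4 * sizeof(float));
}
```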
