CUDA Architecture
  • Transcript

    • 1. CUDA Architecture Overview
    • 2. PROGRAMMING ENVIRONMENT
    • 3. CUDA APIs. The API allows the host to manage the devices: allocate memory, transfer data, and launch kernels. The CUDA C "Runtime" API offers a high level of abstraction — start here. The CUDA C "Driver" API gives more control but is more verbose. (OpenCL is similar to the CUDA C Driver API.)
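The host-side management the slide lists (enumerate and select devices, allocate memory, check for errors) can be sketched with the Runtime API like this. This is an illustrative sketch, not from the deck; the structure is hypothetical, but the calls (`cudaGetDeviceCount`, `cudaSetDevice`, `cudaMalloc`, `cudaGetErrorString`) are real Runtime API functions.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);          // host enumerates the CUDA devices
    printf("CUDA devices: %d\n", count);
    cudaSetDevice(0);                    // host selects a device to manage

    float *d = NULL;
    cudaError_t err = cudaMalloc(&d, 1024 * sizeof(float)); // allocate device memory
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d);                         // release device memory
    return 0;
}
```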
    • 4. CUDA C and OpenCL. OpenCL is the entry point for developers who want a low-level API; CUDA C is the entry point for developers who prefer high-level C. Both share back-end compiler and optimization technology.
    • 5. Processing Flow (PCI bus): 1. Copy input data from CPU memory to GPU memory.
    • 6. Processing Flow (PCI bus): 1. Copy input data from CPU memory to GPU memory. 2. Load the GPU program and execute, caching data on chip for performance.
    • 7. Processing Flow (PCI bus): 1. Copy input data from CPU memory to GPU memory. 2. Load the GPU program and execute, caching data on chip for performance. 3. Copy results from GPU memory to CPU memory.
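The three-step processing flow maps directly onto Runtime API calls. A minimal sketch, assuming a hypothetical `scale` kernel that doubles each element (the kernel and data are invented for illustration; the flow itself is from the slides):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel: doubles each element in place.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h[1024];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, bytes);
    // Step 1: copy input data from CPU memory to GPU memory (over the PCI bus)
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    // Step 2: load the GPU program and execute
    scale<<<(n + 255) / 256, 256>>>(d, n);
    // Step 3: copy results from GPU memory back to CPU memory
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[3] = %f\n", h[3]);   // 3.0 doubled to 6.0
    return 0;
}
```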
    • 8. CUDA Parallel Computing Architecture. A parallel computing architecture and programming model. Includes a CUDA C compiler and support for OpenCL and DirectCompute. Architected to natively support multiple computational interfaces (standard languages and APIs).
    • 9. C for CUDA: C with a few keywords.
      Standard C code:
        void saxpy_serial(int n, float a, float *x, float *y)
        {
            for (int i = 0; i < n; ++i)
                y[i] = a*x[i] + y[i];
        }
        // Invoke serial SAXPY kernel
        saxpy_serial(n, 2.0, x, y);
      Parallel C code:
        __global__ void saxpy_parallel(int n, float a, float *x, float *y)
        {
            int i = blockIdx.x*blockDim.x + threadIdx.x;
            if (i < n) y[i] = a*x[i] + y[i];
        }
        // Invoke parallel SAXPY kernel with 256 threads/block
        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
    • 10. CUDA Parallel Computing Architecture. CUDA defines a programming model, a memory model, and an execution model. CUDA uses the GPU, but is for general-purpose computing, and facilitates heterogeneous computing: CPU + GPU. CUDA is scalable: it scales to run on 100s of cores and 1000s of parallel threads.
    • 11. Compiling CUDA C Applications (Runtime API). (Diagram.) Source files mix the rest of the C application (e.g. serial_function, other_function, main) with key kernels modified into parallel CUDA code (e.g. saxpy_serial). NVCC compiles the CUDA code into CUDA object files; the CPU compiler (Open64) compiles the C code into CPU object files; the linker combines both into a CPU-GPU executable.
    • 12. PROGRAMMING MODEL: CUDA Kernels. The parallel portion of an application executes as a kernel: the entire GPU executes the kernel, with many threads. CUDA threads are lightweight and fast-switching; 1000s execute simultaneously. The CPU (host) executes functions; the GPU (device) executes kernels.
    • 13. CUDA Kernels: Parallel Threads. A kernel is a function executed on the GPU by an array of threads in parallel. All threads execute the same code, but can take different paths. Each thread has an ID, used to select input/output data and to make control decisions:
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    • 14. CUDA Kernels: Subdivide into Blocks
    • 15. CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks
    • 16. CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks Blocks are grouped into a grid
    • 17. CUDA Kernels: Subdivide into Blocks Threads are grouped into blocks Blocks are grouped into a grid A kernel is executed as a grid of blocks of threads
    • 18. CUDA Kernels: Subdivide into Blocks. (GPU diagram.) Threads are grouped into blocks; blocks are grouped into a grid; a kernel is executed as a grid of blocks of threads.
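The grid/block/thread decomposition above shows up in code as the launch configuration plus the per-thread index computation. A sketch with a hypothetical `fill_index` kernel (invented for illustration): each thread derives a unique global index from its block ID and thread ID.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Each thread computes its global index from its block ID and thread ID.
__global__ void fill_index(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;       // guard: the last block may be partly out of range
}

int main(void) {
    const int n = 1000;
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;  // 4 blocks cover 1000 threads
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    fill_index<<<gridSize, blockSize>>>(d, n);       // a grid of blocks of threads
    int h[1000];
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d %d\n", h[0], h[999]);                 // 0 999
    cudaFree(d);
    return 0;
}
```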
    • 19. Communication Within a Block. Threads may need to cooperate — on memory accesses, and to share results. They cooperate using shared memory, accessible by all threads within a block. Restricting cooperation to "within a block" permits scalability: fast communication between N threads is not feasible when N is large.
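Shared-memory cooperation within a block can be sketched like this: every thread writes into `__shared__` storage, a barrier (`__syncthreads`) guarantees all writes have landed, and then threads read each other's results. The `reverse_block` kernel is a hypothetical example (it assumes blockDim.x == 256), not from the deck.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel: threads in a block cooperate through shared memory
// to reverse the block's portion of the array (assumes blockDim.x == 256).
__global__ void reverse_block(float *d) {
    __shared__ float s[256];              // visible to every thread in this block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    s[t] = d[base + t];
    __syncthreads();                      // all writes finish before any thread reads
    d[base + t] = s[blockDim.x - 1 - t];
}

int main(void) {
    float h[256], *d;
    for (int i = 0; i < 256; ++i) h[i] = (float)i;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverse_block<<<1, 256>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%f\n", h[0]);                 // 255.0
    cudaFree(d);
    return 0;
}
```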
    • 20. Transparent Scalability – G84. (Diagram: blocks 1–12 of the same grid scheduled two at a time.)
    • 21. Transparent Scalability – G80. (Diagram: the same blocks 1–12 scheduled in larger batches across more multiprocessors.)
    • 22. Transparent Scalability – GT200. (Diagram: all blocks 1–12 run at once, leaving some multiprocessors idle.)
    • 23. CUDA Programming Model – Summary. A kernel executes on the device as a grid of thread blocks (e.g. Kernel 1 over a 1D grid, Kernel 2 over a 2D grid). A block is a batch of threads that communicate through shared memory. Each block has a block ID; each thread has a thread ID.
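The 1D/2D grids mentioned in the summary are expressed with `dim3` launch parameters. A sketch of a 2D launch, assuming a hypothetical `add_matrices` kernel: block and thread IDs in both dimensions combine into matrix coordinates.

```cuda
#include <cuda_runtime.h>

// Hypothetical 2D kernel: block ID and thread ID combine into matrix coordinates.
__global__ void add_matrices(const float *a, const float *b, float *c,
                             int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        c[y * width + x] = a[y * width + x] + b[y * width + x];
}

int main(void) {
    int width = 100, height = 60;
    dim3 block(16, 16);                              // 2D block of 256 threads
    dim3 grid((width + block.x - 1) / block.x,       // enough blocks to cover
              (height + block.y - 1) / block.y);     // the whole matrix
    size_t bytes = width * height * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes); cudaMalloc(&b, bytes); cudaMalloc(&c, bytes);
    add_matrices<<<grid, block>>>(a, b, c, width, height);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```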
    • 24. MEMORY MODEL: Memory hierarchy. Thread: registers.
    • 25. Memory hierarchy. Thread: registers. Thread: local memory.
    • 26. Memory hierarchy. Thread: registers. Thread: local memory. Block of threads: shared memory.
    • 27. Memory hierarchy. Thread: registers. Thread: local memory. Block of threads: shared memory.
    • 28. Memory hierarchy. Thread: registers. Thread: local memory. Block of threads: shared memory. All blocks: global memory.
    • 29. Memory hierarchy. Thread: registers. Thread: local memory. Block of threads: shared memory. All blocks: global memory.
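One kernel can touch every level of the hierarchy those slides build up. An illustrative sketch (the kernel is invented; note that whether a per-thread array actually lands in registers or spills to local memory is a compiler decision):

```cuda
// Illustrative kernel touching each level of the memory hierarchy.
__global__ void hierarchy_demo(float *global_out) {
    float r = (float)threadIdx.x;   // scalar: typically a register, private to the thread
    float local_buf[64];            // large per-thread array: may spill to local memory
    __shared__ float s[256];        // shared memory: one copy per block
    s[threadIdx.x] = r;
    __syncthreads();                // block-level barrier before reading shared data
    local_buf[0] = s[0];
    // global memory: visible to all blocks (and to the host)
    global_out[blockIdx.x * blockDim.x + threadIdx.x] = local_buf[0] + r;
}
```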
    • 30. Additional Memories. The host can also allocate textures and arrays of constants. Textures and constants have dedicated caches.
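Host-allocated constants look like this in CUDA C: the host fills a `__constant__` array with `cudaMemcpyToSymbol` (a real Runtime API call), and device reads are served by the dedicated constant cache. The polynomial kernel is a hypothetical example chosen because every thread reads the same coefficients.

```cuda
#include <cuda_runtime.h>

// Constant memory: read-only on the device, served by a dedicated cache.
__constant__ float coeffs[4];

// Hypothetical kernel: evaluate a cubic polynomial; all threads read
// the same coefficients, an ideal access pattern for the constant cache.
__global__ void poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = coeffs[0] + x[i] * (coeffs[1] + x[i] * (coeffs[2] + x[i] * coeffs[3]));
}

int main(void) {
    float h_coeffs[4] = {1.0f, 0.5f, 0.25f, 0.125f};
    // Host fills the constant bank before launching the kernel.
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
    // ... allocate x and y on the device, launch poly<<<...>>>, copy results back ...
    return 0;
}
```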