Gpu archi

1. CUDA Programming Model Review
- Parallel kernels are composed of many threads; all threads execute the same sequential program.
- Use parallel threads rather than sequential loops.
- Threads are grouped into Cooperative Thread Arrays (CTAs); threads in the same CTA cooperate and share memory. A CTA implements a CUDA thread block.
- CTAs (blocks) are grouped into grids.
- Threads and blocks have unique IDs: threadIdx, blockIdx. Blocks and grids have dimensions: blockDim, gridDim.
- A warp in CUDA is a group of 32 threads, the minimum unit of data processed in SIMD fashion by a CUDA multiprocessor.
[Diagram: thread -> CTA/block (t0 t1 ... tB) -> grid (CTA 0, CTA 1, CTA 2, ..., CTA m)]
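Not on the original slide, but a minimal sketch of the indexing scheme described above; the kernel name vec_add, the 256-thread block size, and the buffer names are illustrative:

    // Each thread computes one element; the global index is built from
    // blockIdx, blockDim, and threadIdx as on the slide.
    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard the tail block
            c[i] = a[i] + b[i];
    }

    // Host side: one thread per element, 256 threads per CTA (block)
    // vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);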
2. GPU Architecture: Two Main Components
Global memory
- Analogous to RAM in a CPU server; accessible by both GPU and CPU.
- Currently up to 6 GB per GPU; bandwidth currently up to ~180 GB/s (Tesla products).
- ECC on/off (Quadro and Tesla products).
Streaming Multiprocessors (SMs)
- Perform the actual computations.
- Each SM has its own control units, registers, execution pipelines, and caches.
[Diagram: GPU with host interface, GigaThread scheduler, L2 cache, and multiple DRAM interfaces]
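As an aside (not from the deck), this is roughly how the host and the device both touch global memory through the CUDA runtime API; the function and buffer names are made up for illustration:

    #include <cstdlib>
    #include <cuda_runtime.h>

    // Host allocates a buffer, copies it into GPU global memory, and copies
    // results back after kernels have run. Names/sizes are illustrative.
    void stage_data(size_t N)
    {
        float *h_buf = (float *)malloc(N * sizeof(float));   // CPU (host) memory
        float *d_buf = NULL;                                  // GPU global memory

        cudaMalloc((void **)&d_buf, N * sizeof(float));
        cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
        // ... launch kernels that read/write d_buf here ...
        cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_buf);
        free(h_buf);
    }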
3. GPU Architecture – Fermi: Streaming Multiprocessor (SM)
- 32 CUDA cores per SM: 32 fp32 ops/clock, 16 fp64 ops/clock, 32 int32 ops/clock.
- 2 warp schedulers; up to 1536 threads resident concurrently.
- 4 special-function units.
- 64 KB shared memory + L1 cache (configurable split).
- 32K 32-bit registers.
[Diagram: SM with instruction cache, schedulers and dispatch units, register file, 32 cores, 16 load/store units, 4 SFUs, interconnect network, 64K configurable cache/shared memory, uniform cache]
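Not on the slide: per-device resource limits of the kind listed here can be inspected at runtime with cudaGetDeviceProperties; a small sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("SMs:                  %d\n", prop.multiProcessorCount);
        printf("Registers per block:  %d\n", prop.regsPerBlock);
        printf("Shared mem per block: %zu bytes\n", (size_t)prop.sharedMemPerBlock);
        printf("Max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
        return 0;
    }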
4. GPU Architecture – Fermi: CUDA Core
- Floating-point and integer unit.
- IEEE 754-2008 floating-point standard.
- Fused multiply-add (FMA) instruction for both single and double precision.
- Logic unit; move/compare unit; branch unit.
[Diagram: CUDA core with dispatch port, operand collector, FP unit, INT unit, result queue]
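Not from the slide, but a brief sketch of FMA in device code; the compiler normally contracts a*b + c into FMA on its own, and the intrinsics below merely make that explicit (the helper names are illustrative):

    // FMA computes a * b + c with a single rounding step, in single and
    // double precision.
    __device__ float  fma_f32(float a,  float b,  float c)  { return __fmaf_rn(a, b, c); }
    __device__ double fma_f64(double a, double b, double c) { return __fma_rn(a, b, c);  }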
5. CUDA Execution Model
- A kernel is launched by the host.
- Blocks run on a multiprocessor (SM); an entire block gets scheduled on a single SM.
- Multiple blocks can reside on an SM at the same time.
- Limit: 8 blocks/SM on Fermi, 16 blocks/SM on Kepler.
[Diagram: device processor array of SMs (instruction units, SPs, shared memory) attached to device memory]
6-7. Hardware Multithreading
- Hardware allocates resources to blocks; blocks need thread slots, registers, and shared memory.
- A block does not run until resources are available for all of its threads.
- Hardware schedules threads in units of warps; threads keep their own registers, so context switching is (basically) free, every cycle.
- Hardware picks from warps that have an instruction ready to execute (i.e., all operands ready).
- Hardware relies on threads to hide latency, i.e., parallelism is necessary for performance.
8. SM Schedules Warps & Issues Instructions
- The SM schedules warps and issues instructions.
- Dual-issue pipelines select two warps to issue.
- A SIMT warp executes one instruction for up to 32 threads.
[Diagram: two warp schedulers with instruction dispatch units issuing interleaved instructions from warps 2, 3, 8, 9, 14, and 15 over time]
9. Introducing: Kepler GK110
10. Welcome the Kepler GK110 GPU: Performance, Efficiency, Programmability
11. Kepler GK110 Block Diagram
Architecture:
- 7.1B transistors
- 15 SMX units
- > 1 TFLOP FP64
- 1.5 MB L2 cache
- 384-bit GDDR5
- PCI Express Gen3
12. SMX: Efficient Performance
- Power-aware SMX architecture.
- Clocks & feature size.
- SMX result: performance up, power down.
13. Power vs Clock Speed Example

                        Logic            Clocking
                        Area    Power    Area    Power
  Fermi  (2x clock)     1.0x    1.0x     1.0x    1.0x
  Kepler (1x clock)     1.8x    0.9x     1.0x    0.5x

[Diagram: logic blocks A and B clocked at 2x (Fermi) vs 1x (Kepler)]
14. Fermi SM vs Kepler SMX
[Diagram: side-by-side block diagrams of the Fermi SM and the Kepler SMX, each showing instruction cache, warp schedulers and dispatch units, register file (65,536 x 32-bit on the SMX), CUDA cores, load/store units, special function units, interconnect network, 64K configurable cache/shared memory, and uniform cache; inset of a single CUDA core with dispatch port, operand collector, ALU, and result queue]
15. SMX Balance of Resources

  Resource                     Kepler GK110 vs Fermi
  Floating point throughput    2-3x
  Max blocks per SMX           2x
  Max threads per SMX          1.3x
  Register file bandwidth      2x
  Register file capacity       2x
  Shared memory bandwidth      2x
  Shared memory capacity       1x
16. New ISA Encoding: 255 Registers per Thread
- Fermi limit: 63 registers per thread; a common Fermi performance limiter that leads to excessive spilling.
- Kepler: up to 255 registers per thread.
- Especially helpful for FP64 apps.
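Not part of the slide: one way to observe and steer per-thread register use in practice, using nvcc's ptxas verbose output and the __launch_bounds__ qualifier; the kernel, its body, and the bound values below are illustrative:

    // Ask ptxas to report per-kernel register counts and any spills:
    //   nvcc -arch=sm_35 -Xptxas=-v kernel.cu
    // A hard cap can also be set globally with -maxrregcount=<N>.
    // Per kernel, __launch_bounds__ trades registers for occupancy:
    __global__ void
    __launch_bounds__(256, 4)      // <= 256 threads/block, aim for >= 4 blocks/SM
    heavy_kernel(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;     // placeholder body; real kernels are register-heavy
    }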
17. New High-Performance SMX Instructions
- SHFL (shuffle): intra-warp data exchange.
- ATOM: broader functionality, faster.
- Compiler-generated, high-performance instructions: bit shift, bit rotate, fp32 division, read-only cache.
18. New Instruction: SHFL
- Data exchange between threads within a warp.
- Avoids use of shared memory; one 32-bit value per exchange.
- 4 variants:
  - __shfl()        indexed any-to-any
  - __shfl_up()     shift right to nth neighbour
  - __shfl_down()   shift left to nth neighbour
  - __shfl_xor()    butterfly (XOR) exchange
[Diagram: example lane permutations of the values a b c d e f g h for each variant]
19. SHFL Example: Warp Prefix-Sum

    __global__ void shfl_prefix_sum(int *data)
    {
        int id      = threadIdx.x;
        int lane_id = threadIdx.x & (warpSize - 1);   // lane within the warp
        int value   = data[id];

        // Now accumulate in log2(32) = 5 steps
        for (int i = 1; i < warpSize; i *= 2)
        {
            int n = __shfl_up(value, i);   // CUDA 9+: __shfl_up_sync(0xffffffff, value, i)
            if (lane_id >= i)
                value += n;
        }

        // Write out our result
        data[id] = value;
    }

Worked example from the slide (8 lanes shown), input 3 8 2 6 3 9 1 4:
  after step i=1: 3 11 10 8 9 12 10 5
  after step i=2: 3 11 13 19 19 20 19 17
  after step i=4: 3 11 13 19 21 31 32 36
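A companion sketch (not on the slide) using the __shfl_xor() variant for a warp-wide sum, the usual butterfly-reduction pattern; the helper name warp_sum is illustrative:

    // Butterfly reduction: after 5 exchanges every lane holds the warp-wide sum.
    // Pre-CUDA-9 __shfl_xor() shown to match the slide; newer code uses
    // __shfl_xor_sync(0xffffffff, ...).
    __device__ int warp_sum(int value)
    {
        for (int mask = warpSize / 2; mask > 0; mask /= 2)
            value += __shfl_xor(value, mask);
        return value;
    }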
20. ATOM Instruction Enhancements
- Added int64 functions to match the existing int32 set.
- 2-10x performance gains: shorter processing pipeline, more atomic processors.
- Slowest operations are 10x faster; fastest operations are 2x faster.

  Atom op       int32   int64
  add             x       x
  cas             x       x
  exch            x       x
  min/max         x       x (new in Kepler)
  and/or/xor      x       x (new in Kepler)
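A short sketch (not from the slide) of 64-bit atomics in device code; the kernel and buffer names are illustrative, and the min/and forms assume sm_35 (GK110):

    // 64-bit atomics; atomicAdd on unsigned long long predates Kepler,
    // while the 64-bit min/and variants require sm_35.
    __global__ void atomic64_demo(unsigned long long *sum,
                                  unsigned long long *lo,
                                  unsigned long long *bits,
                                  const unsigned long long *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            atomicAdd(sum,  in[i]);    // 64-bit add
            atomicMin(lo,   in[i]);    // 64-bit min (new on GK110)
            atomicAnd(bits, in[i]);    // 64-bit and (new on GK110)
        }
    }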
21. High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops.
Example: data reduction (sum of all values), without atomics:
  1. Divide the input data array into N sections.
  2. Launch N blocks, each reducing one section.
  3. Output is N values.
  4. A second launch of N threads reduces the outputs to a single value.
22. High Speed Atomics Enable New Uses
Atomics are now fast enough to use within inner loops.
Example: data reduction (sum of all values), with atomics (see the sketch after this slide):
  1. Divide the input data array into N sections.
  2. Launch N blocks, each reducing one section.
  3. Write the output directly via atomic; no need for a second kernel launch.
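Not from the slides, but a minimal sketch of the single-kernel pattern described above: a shared-memory reduction per block followed by one atomicAdd per block (the kernel name and the 256-thread block size are illustrative):

    // Each block reduces its section in shared memory, then one thread per
    // block adds the partial result into the global total with an atomic.
    __global__ void reduce_sum(const int *in, int *total, int n)
    {
        __shared__ int partial[256];                 // assumes blockDim.x == 256
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        partial[tid] = (i < n) ? in[i] : 0;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                partial[tid] += partial[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            atomicAdd(total, partial[0]);            // no second kernel launch needed
    }

    // Launch (d_total must be zeroed first):
    // reduce_sum<<<(n + 255) / 256, 256>>>(d_in, d_total, n);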
23. Textures
Using textures in CUDA 4.0:
  1. Bind a texture to a memory region: cudaBindTexture2D(ptr, width, height).
  2. Launch the kernel.
  3. Use tex1D / tex2D to access the memory from the kernel: int value = tex2D(texture, x, y).
[Diagram: a region of global memory (ptr, width, height) bound to a texture, sampled at coordinates (x, y)]
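For reference (not on the slide), a sketch of the legacy texture-reference flow the slide outlines; texRef and the helper names are illustrative, and this API has since been deprecated in favour of texture objects:

    // Legacy texture reference: bound globally at file scope, read with tex2D.
    texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;

    __global__ void read_through_tex(float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
    }

    void bind_and_launch(float *d_img, size_t pitch, int width, int height, float *d_out)
    {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
        size_t offset = 0;
        cudaBindTexture2D(&offset, texRef, d_img, desc, width, height, pitch);

        dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
        read_through_tex<<<grid, block>>>(d_out, width, height);

        cudaUnbindTexture(texRef);
    }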
24. Texture Pros & Cons
Good stuff:
- Dedicated cache
- Separate memory pipe
- Relaxed coalescing
- Samplers & filters
Bad stuff:
- Explicit global binding
- Limited number of global textures
- No dynamic texture indexing
- No arrays of texture references
- Different read/write instructions
- Separate memory region (uses offsets, not pointers)
25. Bindless Textures
Kepler permits dynamic binding of textures:
- Textures are now referenced by ID; create a new ID when needed, destroy it when done.
- IDs can be passed as parameters.
- Dynamic texture indexing.
- Arrays of texture IDs supported; 1000s of IDs possible.
(This addresses the "bad stuff" from the previous slide: explicit global binding, the limited number of global textures, no dynamic texture indexing, no arrays of texture references.)
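A minimal sketch (not on the slide) of the CUDA texture-object API that exposes this on Kepler; the kernel and helper names are illustrative:

    #include <cstring>
    #include <cuda_runtime.h>

    // Create a texture object (an ID) over a linear device buffer, pass it to
    // a kernel as an ordinary parameter, then destroy it.
    __global__ void read_texobj(cudaTextureObject_t tex, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch<float>(tex, i);
    }

    void run(float *d_buf, float *d_out, int n)
    {
        cudaResourceDesc resDesc;
        memset(&resDesc, 0, sizeof(resDesc));
        resDesc.resType                = cudaResourceTypeLinear;
        resDesc.res.linear.devPtr      = d_buf;
        resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
        resDesc.res.linear.sizeInBytes = n * sizeof(float);

        cudaTextureDesc texDesc;
        memset(&texDesc, 0, sizeof(texDesc));
        texDesc.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

        read_texobj<<<(n + 255) / 256, 256>>>(tex, d_out, n);

        cudaDestroyTextureObject(tex);
    }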
26. Global Load Through Texture
Load from a direct address, through the texture pipeline:
- Eliminates the need for texture setup.
- Access the entire memory space through the texture path.
- Use normal pointers to read via texture.
- Emitted automatically by the compiler where possible.
- Can hint to the compiler with "const __restrict".
27. const __restrict Example
- Annotate eligible kernel parameters with const __restrict.
- The compiler will automatically map such loads to the read-only data cache path.

    __global__ void saxpy(float x, float y,
                          const float * __restrict input,
                          float *output)
    {
        size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);

        // Compiler will automatically use texture (read-only cache) for "input"
        output[offset] = (input[offset] * x) + y;
    }
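As a complementary note (not on the slide), the same read-only data path can be requested explicitly on sm_35 with the __ldg() intrinsic; the kernel name is illustrative:

    // __ldg() forces a load through the read-only data cache on sm_35+,
    // useful when the compiler cannot prove a pointer is read-only.
    __global__ void saxpy_ldg(float x, float y, const float *input, float *output)
    {
        size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
        output[offset] = __ldg(&input[offset]) * x + y;
    }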
28. Thank you