1. The CUDA programming model organizes parallel threads into cooperative thread arrays (CTAs); every thread executes the same program (the kernel), typically operating on different data.
2. CTAs are grouped into grids, and threads within a CTA can cooperate through shared memory. In CUDA source code, each thread block corresponds to a CTA on the hardware.
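The grid/CTA/thread hierarchy and per-CTA shared memory can be sketched as a small CUDA kernel (a hypothetical block-sum example; the kernel name, block size of 256, and launch parameters are illustrative, not from the source):

```cuda
// Each thread block (CTA) cooperates through __shared__ memory to sum
// its portion of the input; every thread executes this same kernel.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float partial[256];            // shared memory, visible to this CTA only
    int tid = threadIdx.x;                    // thread index within the CTA
    int gid = blockIdx.x * blockDim.x + tid;  // global thread index across the grid

    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                          // barrier: all threads in the CTA wait here

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = partial[0];  // one result per CTA
}
```

A launch such as `blockSum<<<numBlocks, 256>>>(d_in, d_out, n)` creates a grid of `numBlocks` CTAs of 256 threads each.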
3. The GPU architecture consists of streaming multiprocessors (SMs) that perform the computation, plus global memory, the GPU's DRAM, analogous to CPU RAM, which is accessible to all GPU threads and to the CPU through the CUDA runtime.
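On the host side, global memory is managed explicitly. A minimal sketch (the kernel, array size, and launch configuration are illustrative assumptions) of allocating global memory and moving data between CPU RAM and the GPU:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel: each thread scales one element in global memory.
// It executes on the GPU's streaming multiprocessors.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float* dev;                                       // pointer into GPU global memory
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);    // grid of 4 CTAs, 256 threads each
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("host[0] = %.1f\n", host[0]);
    cudaFree(dev);
    return 0;
}
```

The explicit `cudaMemcpy` calls are what make global memory visible to both sides: the CPU stages data into GPU DRAM before the kernel runs and reads the results back afterward.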