This document discusses a lecture on GPU architecture given by Mark Kilgard at the University of Texas on March 6, 2012. The lecture covers the architecture of graphics processing units and how they have evolved over the past six years. It also includes an in-class quiz, information about homework and projects, and the professor's office hours.
OpenGL 4.4 provides new features for accelerating scenes with many objects, which are typically found in professional visualization markets. This talk will provide details on the usage of the features and their effect on real-life models. Furthermore we will showcase how more work for rendering a scene can be off-loaded to the GPU, such as efficient occlusion culling or matrix calculations.
Video presentation here: http://on-demand.gputechconf.com/gtc/2014/video/S4379-opengl-44-scene-rendering-techniques.mp4
Taking Killzone Shadow Fall Image Quality Into The Next Generation - Guerrilla
This talk focuses on the technical side of Killzone Shadow Fall, the platform exclusive launch title for PlayStation 4.
We present the details of several new techniques that were developed in the quest for next-generation image quality, using key locations from the game as examples. We discuss interesting aspects of the new content pipeline, the next-gen lighting engine, the usage of indirect lighting, and various shadow rendering optimizations. We also describe the details of volumetric lighting, the real-time reflections system, and the new anti-aliasing solution, and include some details about the image-quality-driven streaming system. A common and very important theme of the talk is temporal coherency and how it was utilized to reduce aliasing and improve rendering quality and image stability above the baseline 1080p resolution seen in other games.
Game engines have long been at the forefront of exploiting the ever-increasing parallel compute power of both CPUs and GPUs. This talk covers how parallel compute is used in practice on multiple platforms today in the Frostbite game engine, and how we think the industry's parallel programming models, hardware, and software should look in the next 5 years to help us make the best games possible.
Introduction to the DAOS Scale-out object store (HLRS Workshop, April 2017) - Johann Lombardi
DAOS is an open-source storage stack designed from the ground up to address many of the problems that arise when scaling out storage. DAOS takes advantage of next-generation non-volatile memory technologies while presenting a rich and scalable storage interface providing features such as transactional non-blocking list I/O, data resiliency on top of commodity hardware, fine-grained data control, and storage tiering to optimize performance and cost. Check out https://github.com/daos-stack for more information.
Embedded Recipes 2019 - PipeWire: a new foundation for embedded multimedia - Anne Nicolas
PipeWire is an open source project that aims to greatly improve audio and video handling under Linux. Utilising a fresh design, it bridges use cases that have previously been addressed by different tools, or not addressed at all, providing the ground for building complex yet secure and efficient multimedia systems.
In this talk, Julien is going to present the PipeWire project and the concepts that make up its design. In addition, he is going to give an update on the current and future work going on around PipeWire, both upstream and in Automotive Grade Linux, an early adopter that Julien is actively working on.
Julian Bouzas
Accelerating Virtual Machine Access with the Storage Performance Development ... - Michelle Holley
Abstract: Although new non-volatile media inherently offers very low latency, remote access using protocols such as NVMe-oF and presenting the data to VMs via virtualized interfaces such as virtio adds considerable software overhead. One way to reduce the overhead is to use the Storage Performance Development Kit (SPDK), an open-source software project that provides building blocks for scalable and efficient storage applications with breakthrough performance. Comparing the software paths for virtualizing block storage I/O illustrates the advantages of the SPDK-based approach. Empirical data shows that using SPDK can improve CPU efficiency by up to 10x and reduce latency by up to 50% over existing methods. Future enhancements for SPDK will make its advantages even greater.
Speaker Bio: Anu Rao is a product line manager for storage software in the Data Center Group. She helps customers ease into and adopt open-source storage software like the Storage Performance Development Kit (SPDK) and the Intelligent Storage Acceleration Library (ISA-L).
Checkerboard Rendering in Dark Souls: Remastered by QLOC - QLOC
This is a talk on checkerboard rendering that Markus & Andreas held at Digital Dragons 2019.
In it, they quickly go through the history of checkerboard rendering before taking a deep dive into how it works and how it is implemented in Dark Souls: Remastered. Lastly, they present the quality and performance improvements they achieved with it and their conclusions.
PS: The PDF file includes useful in-depth notes from both authors.
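The core reconstruction idea can be sketched in a few lines. This is a toy illustration under simplifying assumptions (no motion vectors, occlusion handling, or depth-based rejection, which a shipping implementation needs); it is not QLOC's actual code:

```python
# Toy sketch of checkerboard reconstruction: each frame shades only half
# the pixels in a checkerboard pattern, and the holes are filled from the
# previous frame's result.

def checkerboard_merge(prev, curr, frame_parity, width, height):
    """prev and curr are row-major lists of shaded values; curr holds
    valid values only on the checkerboard half given by frame_parity."""
    out = []
    for y in range(height):
        for x in range(width):
            i = y * width + x
            if (x + y) % 2 == frame_parity:
                out.append(curr[i])   # freshly shaded this frame
            else:
                out.append(prev[i])   # reused from the previous frame
    return out

# 2x2 frame: zeros left over from last frame, ones shaded this frame
print(checkerboard_merge([0] * 4, [1] * 4, 0, 2, 2))   # [1, 0, 0, 1]
```

The parity flips every frame, so over two frames every pixel is shaded once; the hard part in practice is deciding when a reused pixel is still valid.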
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead - Tristan Lorach
This presentation introduces a new NVIDIA extension called Command-List.
It explains the basic concepts of how to use it and shows its benefits.
The sample I used for the talk is here: https://github.com/nvpro-samples/gl_commandlist_bk3d_models
To try it, use the PreRelease 347.09 driver:
http://www.nvidia.com/download/driverResults.aspx/80913/en-us
Epic Games Japan held a meeting named "Lightmass Deep Dive" on July 30, 2016.
Osamu Satio of Square Enix Osaka gave a presentation about their Lightmass Operation for Large Console Games. EGJ translated the slides into English and published them.
The slides contain some movies, so we recommend downloading them.
Tales from the Optimization Trenches - Unite Copenhagen 2019 - Unity Technologies
In this talk, you'll learn about the tools and techniques that Unity's Consulting and Development team uses to identify and fix performance issues. The team travels the world visiting customers and conducting Project Reviews, in-depth engagements to locate and resolve performance bottlenecks. This session is designed to help you apply their knowledge to your Unity projects, so you'll see examples of real-life performance problems, their solutions, and receive up-to-date best practice advice.
Speaker: Ignacio Liverotti – Unity
Watch the session on YouTube: https://youtu.be/GuODu4-cXXQ
SIGGRAPH 2016 - The Devil is in the Details: idTech 666 - Tiago Sousa
A behind-the-scenes look at the latest renderer technology powering the critically acclaimed DOOM. The lecture covers how the technology was designed to balance good visual quality against performance. Numerous topics are covered, among them details of the lighting solution, techniques for decoupling cost and frequency, and GCN-specific approaches.
Optimizing Servers for High-Throughput and Low-Latency at Dropbox - ScyllaDB
I'm going to discuss the efficiency/performance optimizations of different layers of the system. Starting from the lowest levels like hardware and drivers: these tunings can be applied to pretty much any high-load server. Then we’ll move to Linux kernel and its TCP/IP stack: these are the knobs you want to try on any of your TCP-heavy boxes. Finally, we’ll discuss library and application-level tunings, which are mostly applicable to HTTP servers in general and nginx/envoy specifically.
For each potential area of optimization I’ll try to give some background on latency/throughput tradeoffs (if any), monitoring guidelines, and, finally, suggest tunings for different workloads.
Also, I'll cover more theoretical approaches to performance analysis and the newly developed tooling like `bpftrace` and new `perf` features.
SIGGRAPH 2016 - Vulkan and NVIDIA: the essentials - Tristan Lorach
This presentation introduces Vulkan's components: what you must know to start using this new API, and what you must know when using it on NVIDIA hardware.
Optimizing the Graphics Pipeline with Compute, GDC 2016 - Graham Wihlidal
With further advancement in the current console cycle, new tricks are being learned to squeeze the maximum performance out of the hardware. This talk will present how the compute power of the console and PC GPUs can be used to improve the triangle throughput beyond the limits of the fixed function hardware. The discussed method shows a way to perform efficient "just-in-time" optimization of geometry, and opens the way for per-primitive filtering kernels and procedural geometry processing.
Takeaway:
Attendees will learn how to preprocess geometry on-the-fly per frame to improve rendering performance and efficiency.
Intended Audience:
This presentation is targeting seasoned graphics developers. Experience with DirectX 12 and GCN is recommended, but not required.
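One concrete instance of the per-primitive filtering kernels mentioned above is back-face and degenerate-triangle culling in a compute pre-pass, so the fixed-function rasterizer never receives triangles that cannot contribute. Here is a hedged CPU sketch of the test such a kernel applies (illustrative only, not the talk's actual GCN implementation):

```python
# Back-face and zero-area culling via the signed area of the
# screen-space triangle.

def is_backfacing(p0, p1, p2):
    """2D screen-space vertices; non-positive signed area means the
    triangle is back-facing (or degenerate) under a counter-clockwise
    front-face convention."""
    area2 = (p1[0] - p0[0]) * (p2[1] - p0[1]) - \
            (p2[0] - p0[0]) * (p1[1] - p0[1])
    return area2 <= 0   # zero area also catches degenerate triangles

tris = [((0, 0), (1, 0), (0, 1)),   # counter-clockwise: kept
        ((0, 0), (0, 1), (1, 0))]   # clockwise: filtered out
visible = [t for t in tris if not is_backfacing(*t)]
# visible == [((0, 0), (1, 0), (0, 1))]
```

A real compute pass runs this (plus small-triangle and frustum tests) over index buffers in parallel and compacts the survivors before drawing.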
There are many systems that handle heavy UDP transactions, such as DNS and RADIUS servers. Nowadays 10G Ethernet NICs are widely deployed, and even 40G and 100G NICs are available. This makes it difficult for a single server to achieve enough performance to consume the link bandwidth with short-packet transactions. Since Linux is by default not tuned for dedicated UDP servers, we are investigating ways to boost such UDP transaction performance.
This talk will show how we analyzed the bottlenecks and share the tips we found to improve performance. We also discuss the challenges of improving it even further.
This presentation was given at LinuxCon Japan 2016 by Toshiaki Makita
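For intuition on why short-packet workloads are hard, a back-of-the-envelope calculation (using standard Ethernet framing constants, not figures from the talk) gives the packet rate needed to saturate a 10G link with minimum-size frames:

```python
# Back-of-the-envelope packet-rate math using standard Ethernet framing
# constants.

LINK_BPS = 10_000_000_000   # 10 Gbit/s
MIN_FRAME = 64              # minimum Ethernet frame size, bytes
OVERHEAD = 8 + 12           # preamble + SFD, plus inter-frame gap, bytes

bits_per_packet = (MIN_FRAME + OVERHEAD) * 8   # 672 bits on the wire
pps = LINK_BPS / bits_per_packet
print(f"{pps / 1e6:.2f} Mpps")   # 14.88 Mpps per direction
```

At roughly 14.88 million packets per second, a single core would have only about 67 ns of budget per packet, which is why default per-packet kernel costs dominate.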
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types - anet18
The DRF algorithm provides fair resource allocation in a system containing different resource types. Dominant Resource Fairness (DRF) is a generalization of max-min fairness to multiple resource types. The algorithm is used for resource allocation in Hadoop clusters.
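A minimal sketch of DRF's progressive-filling allocation, assuming the whole-task demand model from the original paper (resource names and demand vectors below are invented for illustration):

```python
# Illustrative sketch of Dominant Resource Fairness (DRF): repeatedly
# launch one task for the user with the smallest dominant share.

def drf_allocate(capacity, demands, rounds=100000):
    users = list(demands)
    used = {r: 0.0 for r in capacity}
    tasks = {u: 0 for u in users}

    def dominant_share(u):
        # Share of the user's most-demanded ("dominant") resource.
        return max(tasks[u] * demands[u][r] / capacity[r] for r in capacity)

    for _ in range(rounds):
        # Users whose next task still fits in the remaining capacity.
        fits = [u for u in users
                if all(used[r] + demands[u][r] <= capacity[r] for r in capacity)]
        if not fits:
            break
        u = min(fits, key=dominant_share)   # lowest dominant share goes next
        tasks[u] += 1
        for r in capacity:
            used[r] += demands[u][r]
    return tasks

# Example from the DRF paper: 9 CPUs, 18 GB memory; user A's tasks need
# <1 CPU, 4 GB>, user B's need <3 CPUs, 1 GB>.
tasks = drf_allocate({"cpu": 9, "mem": 18},
                     {"A": {"cpu": 1, "mem": 4},
                      "B": {"cpu": 3, "mem": 1}})
# tasks == {"A": 3, "B": 2}
```

Both users end with a dominant share of 2/3: A is memory-dominant (12 of 18 GB) and B is CPU-dominant (6 of 9 CPUs), which is exactly the equalization DRF aims for.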
Creating a game using C++, OpenGL and Qt - guestd5d4ce
A short presentation I did at work showing a game I am making in my spare time. Most of this presentation is about the tools and techniques. The game itself is located on SourceForge and is still under development.
Regards,
Jostein Topland
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W... - AMD Developer Central
Presentation PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by Wu Feng and Mark Gardner at the AMD Developer Summit (APU13) November 11-13, 2013.
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F... - Storti Mario
In this article we compare the results obtained with an implementation of the Finite Volume method for structured meshes on GPGPUs with experimental results, and also with a Finite Element code using a boundary-fitted strategy. The example is a fully submerged spherical buoy immersed in a cubic water recipient. The recipient undergoes a harmonic linear motion imposed with a shake table. The experiment is recorded with a high-speed camera, and the displacement of the buoy is obtained from the video with a MoCap (Motion Capture) algorithm. The amplitude and phase of the resulting motion allow the added mass and drag of the sphere to be determined indirectly.
NVIDIA CEO Jen-Hsun Huang introduces NVLink and shares a roadmap of the GPU. Primary topics also include an introduction of the GeForce GTX Titan Z, CUDA for machine learning, and Iray VCA.
Video replay: http://nvidia.fullviewmedia.com/siggraph2012/ondemand/SS104.html
Date: Wednesday, August 8, 2012
Time: 11:50 AM - 12:50 PM
Location: SIGGRAPH 2012, Los Angeles
Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Learn about the new features in OpenGL 4.3, particularly Compute Shaders. Other topics include bindless graphics; Linux improvements; and how to best use the modern OpenGL graphics pipeline. Learn how your application can benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
Get OpenGL 4.3 beta drivers for NVIDIA GPUs from http://www.nvidia.com/content/devzone/opengl-driver-4.3.html
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and Beyond - Mark Kilgard
Location: Conference Hall K, Singapore EXPO
Date: Thursday, November 29, 2012
Time: 11:00 AM - 11:50 AM
Presenter: Mark Kilgard (Principal Software Engineer, NVIDIA, Austin, Texas)
Abstract: Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Learn about the new features in OpenGL 4.3, particularly Compute Shaders. Other topics include bindless graphics; Linux improvements; and how to best use the modern OpenGL graphics pipeline. Learn how your application can benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
Topic Areas: Computer Graphics; Development Tools & Libraries; Visualization; Image and Video Processing
Level: Intermediate
Presented September 30, 2009 in San Jose, California at GPU Technology Conference.
Describes the new features of OpenGL 3.2 and NVIDIA's extensions beyond 3.2 such as bindless graphics, direct state access, separate shader objects, copy image, texture barrier, and Cg 2.2.
Presented as a pre-conference tutorial at the GPU Technology Conference in San Jose on September 20, 2010.
Learn about NVIDIA's OpenGL 4.1 functionality available now on Fermi-based GPUs.
The next generation of GPU APIs for Game Engines - Pooya Eimandar
A demonstration of the new pipeline of GPU APIs for developing a real-time game engine.
Developing for DirectX 12, Vulkan, or Metal requires a redesign of the game engine. Developers can achieve key benefits like reduced power consumption, optimized CPU and GPU usage, and multi-threading across multiple GPU devices.
NVIDIA engineers will talk about the latest OpenGL API features for programmable GPUs. This session will cover the “hows, whys, and whens” of programming the fourth generation shader hardware. It will match detailed information on the API with practical examples. Topics covered will include geometry shaders, transformation feedback, instancing, and advanced texture formats.
Improving Shadows and Reflections via the Stencil Buffer - Mark Kilgard
This 1999 tutorial explains how to use the stencil buffer to achieve realistic shadows and reflections. This tutorial was included in the "Advanced OpenGL Development" course presented at the 1999 Computer Game Developer Conference (now GDC).
This tutorial predates subsequent work by Cass Everitt and me to develop a truly robust Z-fail stencil shadow volume algorithm.
Your Game Needs Direct3D 11, So Get Started Now! - Johan Andersson
Direct3D 11 will have tessellation for smoother curves and finer details. The new compute shader will make postprocessing faster and easier. You'll need Direct3D 11 to have the best graphics, and this talk will show you how you can get started using current generation hardware.
D11: a high-performance, protocol-optional, transport-optional, window system... - Mark Kilgard
Consider the dual pressures toward a more tightly integrated workstation window system: 1) the need to efficiently handle high bandwidth services such as video, audio, and three-dimensional graphics; and 2) the desire to achieve the under-realized potential for local window system performance in X11.
This paper proposes a new window system architecture called D11 that seeks higher performance while preserving compatibility with the industry-standard X11 window system. D11 reinvents the X11 client/server architecture using a new operating system facility similar in concept to the Unix kernel's traditional implementation but designed for user-level execution. This new architecture allows local D11 programs to execute within the D11 window system kernel without compromising the window system's integrity. This scheme minimizes context switching, eliminates protocol packing and unpacking, and greatly reduces data copying. D11 programs fall back to the X11 protocol when running remote or connecting to an X11 server. A special D11 program acts as an X11 protocol translator to allow X11 programs to utilize a D11 window system.
[The described system was never implemented.]
NVIDIA OpenGL and Vulkan Support for 2017 - Mark Kilgard
Learn how NVIDIA continues improving both Vulkan and OpenGL for cross-platform graphics and compute development. This high-level talk is intended for anyone wanting to understand the state of Vulkan and OpenGL in 2017 on NVIDIA GPUs. For OpenGL, the latest standard update maintains the compatibility and feature-richness you expect. For Vulkan, NVIDIA has enabled the latest NVIDIA GPU hardware features and now provides explicit support for multiple GPUs. And for either API, NVIDIA's SDKs and Nsight tools help you develop and debug your application faster.
NVIDIA booth theater presentation at SIGGRAPH in Los Angeles, August 1, 2017.
http://www.nvidia.com/object/siggraph2017-schedule.html?id=sig1732
Get your SIGGRAPH driver release with OpenGL 4.6 and the latest Vulkan functionality from
https://developer.nvidia.com/opengl-driver
EXT_window_rectangles extends OpenGL with a new per-fragment test, the "window rectangles test", for use with FBOs; it provides 8 or more inclusive or exclusive rectangles for rasterized fragments. Applications of this functionality include web browsers and virtual reality.
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi... - Mark Kilgard
Slides for SIGGRAPH paper presentation of "Accelerating Vector Graphics Rendering using the Graphics Hardware Pipeline".
Presented by Vineet Batra (Adobe) on Thursday, August 13, 2015 at 2:00 pm - 3:30 pm, Los Angeles Convention Center, Room 150/151.
Accelerating Vector Graphics Rendering using the Graphics Hardware Pipeline - Mark Kilgard
SIGGRAPH 2015 paper.
We describe our successful initiative to accelerate Adobe Illustrator with the graphics hardware pipeline of modern GPUs. Relying on OpenGL 4.4 plus recent OpenGL extensions for advanced blend modes and first-class GPU-accelerated path rendering, we accelerate the Adobe Graphics Model (AGM) layer responsible for rendering sophisticated Illustrator scenes. Illustrator documents render in either an RGB or CMYK color mode. While GPUs are designed and optimized for RGB rendering, we orchestrate OpenGL rendering of vector content in the proper CMYK color space and accommodate the 5+ color components required. We support both non-isolated and isolated transparency groups, knockout, patterns, and arbitrary path clipping. We harness GPU tessellation to shade paths smoothly with gradient meshes. We do all this and render complex Illustrator scenes 2 to 6x faster than CPU rendering at Full HD resolutions; and 5 to 16x faster at Ultra HD resolutions.
NV_path_rendering is an OpenGL extension for GPU-accelerated path rendering. Recent functionality improvements provide better performance, better typography, rounded rectangles, conics, and OpenGL ES support. This functionality is available today with NVIDIA's 337.88 drivers.
The latest NV_path_rendering specification documents these new functional improvements:
https://www.opengl.org/registry/specs/NV/path_rendering.txt
You can find sample code here:
https://github.com/markkilgard/NVprSDK
Presented at SIGGRAPH 2014 in Vancouver during NVIDIA's "Best of GTC" sponsored sessions.
http://www.nvidia.com/object/siggraph2014-best-gtc.html
Watch the replay that includes a demo of GPU-accelerated Illustrator and several OpenGL 4 demos running on NVIDIA's Tegra Shield tablet.
http://www.ustream.tv/recorded/51255959
Find out more about the OpenGL examples for GameWorks:
https://developer.nvidia.com/gameworks-opengl-samples
SIGGRAPH Asia 2012: GPU-accelerated Path Rendering - Mark Kilgard
Presented at SIGGRAPH Asia 2012 in Singapore on Friday, 30 November 14:15 - 16:00 during the "Points and Vectors" session.
Find the paper at http://developer.nvidia.com/game/gpu-accelerated-path-rendering or on Slideshare.
For thirty years, resolution-independent 2D standards (e.g. PostScript, SVG) have relied largely on CPU-based algorithms for the filling and stroking of paths. Learn about our approach to accelerate path rendering with our GPU-based "Stencil, then Cover" programming interface. We've built and productized our OpenGL-based system.
Preprint for SIGGRAPH Asia 2012
Copyright ACM, 2012
For thirty years, resolution-independent 2D standards (e.g. PostScript, SVG) have depended on CPU-based algorithms for the filling and stroking of paths. However, advances in graphics hardware have largely ignored the problem of accelerating resolution-independent 2D graphics rendered from paths.
Our work builds on prior work to re-factor the path rendering task to leverage existing capabilities of modern pipelined and massively parallel GPUs. We introduce a two-step "Stencil, then Cover" (StC) paradigm that explicitly decouples path rendering into one GPU step to determine a path's filled or stenciled coverage and a second step to rasterize conservative geometry intended to test and reset the coverage determinations of the first step while shading color samples within the path. Our goals are completeness, correctness, quality, and performance, but we go further to unify path rendering with OpenGL's established 3D rendering pipeline. We have built and productized our approach to accelerate path rendering as an OpenGL extension.
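The "Stencil" step's coverage determination is, conceptually, a per-sample winding-number computation under the path's fill rule. A minimal CPU analogue of the nonzero-rule test for a polygonal path, as an illustrative sketch only (the GPU version accumulates the same count via stencil buffer increments and decrements, not a loop like this):

```python
# Nonzero winding-number coverage test for a polygonal path.

def winding_number(point, polygon):
    """Signed count of polygon-edge crossings of a +x ray from `point`."""
    px, py = point
    wn = 0
    for i in range(len(polygon)):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % len(polygon)]
        side = (x1 - x0) * (py - y0) - (y1 - y0) * (px - x0)
        if y0 <= py < y1 and side > 0:     # upward crossing, point left of edge
            wn += 1
        elif y1 <= py < y0 and side < 0:   # downward crossing, point right of edge
            wn -= 1
    return wn

square = [(0, 0), (4, 0), (4, 4), (0, 4)]   # counter-clockwise
assert winding_number((2, 2), square) != 0  # inside: sample gets shaded
assert winding_number((5, 2), square) == 0  # outside: sample is discarded
```

The "Cover" step then draws conservative bounding geometry, and the stencil test keeps only the samples whose accumulated count satisfies the fill rule.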
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering - Mark Kilgard
Video replay: http://nvidia.fullviewmedia.com/siggraph2012/ondemand/SS106.html
Location: West Hall Meeting Room 503, Los Angeles Convention Center
Date: Wednesday, August 8, 2012
Time: 2:40 PM – 3:40 PM
The future of GPU-based visual computing integrates the web, resolution-independent 2D graphics, and 3D to maximize interactivity and quality while minimizing consumed power. See what NVIDIA is doing today to accelerate resolution-independent 2D graphics for web content. This presentation explains NVIDIA's unique "stencil, then cover" approach to accelerating path rendering with OpenGL and demonstrates the wide variety of web content that can be accelerated with this approach.
More information: http://developer.nvidia.com/nv-path-rendering
Presented at the GPU Technology Conference 2012 in San Jose, California.
Tuesday, May 15, 2012.
Standards such as Scalable Vector Graphics (SVG), PostScript, TrueType outline fonts, and immersive web content such as Flash depend on a resolution-independent 2D rendering paradigm that GPUs have not traditionally accelerated. This tutorial explains a new opportunity to greatly accelerate vector graphics, path rendering, and immersive web standards using the GPU. By attending, you will learn how to write OpenGL applications that accelerate the full range of path rendering functionality. Not only will you learn how to render sophisticated 2D graphics with OpenGL, you will learn to mix such resolution-independent 2D rendering with 3D rendering and do so at dynamic, real-time rates.
Presented at the GPU Technology Conference 2012 in San Jose, California.
Monday, May 14, 2012.
Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Topics covered include the latest advances available for Cg 3.1, the OpenGL Shading Language (GLSL); programmable tessellation; improved support for Direct3D conventions; integration with Direct3D and CUDA resources; bindless graphics; and more. When you utilize the latest OpenGL innovations from NVIDIA in your graphics applications, you benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
2. CS 354 2
Today’s material
In-class quiz
Lecture topic
Architecture of Graphics Processing Units (GPUs)
Course work
Homework #4 due today
Review textbook reading
Chapters 5, 6, and 7
Project #2 on texturing, shading, & lighting is coming
Remember: Midterm in-class on March 8
3. CS 354 3
My Office Hours
Tuesday, before class
Painter (PAI) 5.35
8:45 a.m. to 9:15 a.m.
Thursday, after class
ACE 6.302
11:00 a.m. to 12:00 p.m.
Randy’s office hours
Monday & Wednesday
11:00 a.m. to 12:00 p.m.
Painter (PAI) 5.33
4. CS 354 4
Last time, this time
Last lecture, we discussed
Programmable shading
Graphics hardware shading languages
This lecture
How do GPUs work?
5. CS 354 5
On a sheet of paper
Daily Quiz
• Write your EID, name, and date
• Write #1, #2, #3, #4 followed by its answer
Pick the best choice: Shade trees are
a) fractal trees with shadows
b) OpenGL commands
c) hierarchical arrangements of shading computations
d) fractal patterns of all sorts
Multiple choice: The GLSL standard has built-in data types for
a) vectors
b) matrices
c) texture samplers
d) floating-point values
e) pointers to malloc’ed memory
f) a through e
g) a through d
Name one general purpose programming language that GLSL borrows from.
6. CS 354 6
Key Trend in OpenGL Evolution
From fixed-function to programmable: simple configurability gave way to complex configurability, and ultimately to shaders written in high-level languages.
Direct3D follows the same trend
Also reflects the trend in GPU architecture
API and hardware co-evolving
7. CS 354 7
Programming Shaders inside GPU
Multiple programmable domains within the GPU, each programmable in high-level languages: Cg, HLSL, or the OpenGL Shading Language (GLSL).
Pipeline (OpenGL 3.3): 3D application or game → OpenGL API → (CPU–GPU boundary) → GPU front end → vertex assembly → vertex shader → primitive assembly → geometry shader → clipping, setup, and rasterization → fragment shader → raster operations. The vertex, geometry, and fragment shaders are programmable; the remaining stages are fixed-function. Attribute fetch, parameter buffer reads, texture fetches, and framebuffer access all go through the memory interface.
9. CS 354 9
Six Years of GPU Architecture
2000, GeForce 256 (OpenGL 1.3, DX7): hardware transform & lighting, configurable fixed-point shading, cube maps, texture compression, anisotropic texture filtering
2001, GeForce3 (OpenGL 1.4, DX8): programmable vertex transformation, 4 texture units, dependent textures, 3D textures, shadow maps, multisampling, occlusion queries
2002, GeForce4 Ti 4600 (OpenGL 1.4, DX8.1): early Z culling, dual-monitor
2003, GeForce FX (OpenGL 1.5, DX9): vertex program branching, floating-point fragment programs, 16 texture units, limited floating-point textures, color and depth compression
2004, GeForce 6800 Ultra (OpenGL 2.0, DX9c): vertex textures, structured fragment branching, non-power-of-two textures, generalized floating-point textures, floating-point texture filtering and blending
2005, GeForce 7800 GTX (OpenGL 2.0, DX9c): transparency antialiasing
10. CS 354 10
GeForce Peak Vertex Processing Trends
[Chart: millions of vertices per second, 0 to 1,400, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: the peak rate for trivial 4x4 vertex transforms exceeds peak setup rates, which allows excess vertex processing.]
Vertex units: 1, 1, 2, 3, 6, 8
11. CS 354 11
GeForce Peak Memory Bandwidth Trends
[Chart: gigabytes per second, 0 to 200, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Plots raw bandwidth versus effective raw bandwidth with compression, each with an exponential trend line. The first four GPUs use a 128-bit memory interface; the last two use a 256-bit interface.]
12. CS 354 12
Effective GPU Memory Bandwidth
Compression schemes
Lossless depth and color compression (when multisampling)
Lossy texture compression (S3TC / DXTC), typically assumed to give 4:1 compression
Avoidance of useless work
Early killing of fragments (Z cull)
Avoiding useless blending and texture fetches
Very clever memory controller designs
Combining memory accesses for improved coherency
Caches for texture fetches
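As a rough illustration of how these schemes stretch a fixed bus rate, the effective bandwidth can be estimated by scaling each class of memory traffic by its compression ratio. The traffic mix below is an assumption for illustration; the 38.4 GB/s raw rate follows from the 7800 GTX's 256-bit interface at 600 MHz DDR.

```python
# Illustrative estimate of effective memory bandwidth gained from
# compression; the traffic mix and ratios below are assumptions,
# not measured figures from the slides.

def effective_bandwidth(raw_gb_s, traffic):
    """traffic: list of (fraction_of_logical_traffic, compression_ratio).

    A compression ratio of 4.0 means 4 bytes of logical data move as
    1 byte over the bus, so that traffic class costs 4x less bandwidth.
    """
    assert abs(sum(f for f, _ in traffic) - 1.0) < 1e-9
    # Bus bytes actually transferred per logical byte of traffic:
    bus_cost = sum(f / ratio for f, ratio in traffic)
    return raw_gb_s / bus_cost

# Hypothetical mix: 50% texture fetches at 4:1 (S3TC/DXTC),
# 30% color/depth at 2:1 lossless, 20% uncompressed.
mix = [(0.5, 4.0), (0.3, 2.0), (0.2, 1.0)]
print(round(effective_bandwidth(38.4, mix), 1))  # effective GB/s
```

With this made-up mix, the 38.4 GB/s raw interface behaves like roughly twice that in logical traffic, which is the kind of win the slide's compression schemes are after.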
13. CS 354 13
GeForce Core and Memory Clock Rates
[Chart: megahertz (MHz), 0 to 1,400, core clock versus memory clock for Riva ZX, Riva TNT2, GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: at the DDR memory transition, memory rates double the physical clock rate.]
14. CS 354 14
GeForce Peak Triangle Setup Trends
[Chart: millions of triangles per second, 0 to 300, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: assumes 50% face culling.]
15. CS 354 15
GeForce Peak Texture Fetch Trends
[Chart: millions of texture fetches per second, 0 to 12,000, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: assumes no texture cache misses.]
Texture units: 2×4, 2×4, 2×4, 2×4, 16, 24
16. CS 354 16
GeForce Peak Depth/Stencil-only Fill
[Chart: millions of depth/stencil pixel updates per second, 0 to 18,000, for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX. Note: assumes double-speed depth/stencil-only read-modify-write updates.]
Raster Op units: 4, 4, 4, 4+4, 16+16, 16+16
17. CS 354 17
GeForce Transistor Count and Semiconductor Process
[Chart: millions of transistors, 0 to 450, for Riva ZX, Riva TNT2, GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX.]
Process (µm): 0.35, 0.22, 0.18, 0.18, 0.15, 0.13, 0.13, 0.11
18. CS 354 18
Hardware units by GPU:
Unit                           GeForce FX 5900   GeForce 6800 Ultra   GeForce 7800 GTX
Vertex                         3                 6                    8
Fragment / 2nd texture fetch   4+4               16                   24
Raster color / raster depth    4+4               16+16                16+16
19. CS 354 19
GeForce 7800 GTX Board Details
SLI connector; single-slot cooling
sVideo TV out; DVI × 2
256 MB / 256-bit DDR3 at 600 MHz (8 pieces of 8M×32)
16x PCI-Express
20. CS 354 20
GeForce 7800 GTX
GPU Details
302 million transistors
430 MHz core clock
256-bit memory interface
Notable Functionality
• Non-power-of-two textures with mipmaps
• Floating-point (fp16) blending and filtering
• sRGB color space texture filtering and
frame buffer blending
• Vertex textures
• 16x anisotropic texture filtering
• Dynamic vertex and fragment branching
• Double-rate depth/stencil-only rendering
• Early depth/stencil culling
• Transparency antialiasing
22. CS 354 22
GeForce Graphics Pipeline
Separate dedicated units
Pipeline: CPU → Vertex Engine → Setup → Raster (with Z Cull) → Fragment Shader (with Texture) → Raster Ops → Frame Buffer
23. CS 354 23
GeForce Graphics Pipeline
Vertex Engine
Vertex pulling
Vector floating-point instructions
Dynamic branching
Vertex texture
Vertex stream frequency
24. CS 354 24
GeForce Graphics Pipeline
Setup
Prepares the triangle for rasterization
215M triangles/sec setup
25. CS 354 25
GeForce Graphics Pipeline
Raster
Compute coverage
Points, lines, and triangles
Rotated grid multisampling
26. CS 354 26
GeForce Graphics Pipeline
Z Cull
Discard fragments early based on Z
Up to 64 pixels/clock
Multisampled: 256 samples/clock
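The slide's per-clock culling rates translate into per-second rates once a core clock is assumed. A quick sanity check, assuming the 430 MHz GeForce 7800 GTX core clock given on the GPU-details slide:

```python
# Convert Z Cull's per-clock discard rates into per-second rates.
# 430 MHz is the GeForce 7800 GTX core clock from the GPU-details
# slide; the per-clock rates come from this slide.

CORE_CLOCK_HZ = 430e6

pixels_per_clock = 64      # pixels culled per clock
samples_per_clock = 256    # multisampled: samples per clock

pixels_per_sec = pixels_per_clock * CORE_CLOCK_HZ
samples_per_sec = samples_per_clock * CORE_CLOCK_HZ

print(f"{pixels_per_sec / 1e9:.2f} Gpixels/s culled")
print(f"{samples_per_sec / 1e9:.2f} Gsamples/s culled (multisampled)")
```

That is roughly 27.5 billion pixels per second rejected before any shading work is done, which is why early Z cull counts as bandwidth avoidance rather than computation.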
27. CS 354 27
GeForce Graphics Pipeline
Fragment Shader
User-programmed fragment coloring
Dynamic branching
Long shaders
Multiple render targets
fp16 and fp32 vectors
28. CS 354 28
GeForce Graphics Pipeline
Texture
fp16 and sRGB filtering
16x anisotropic filtering
Non-power-of-two mipmapping
Shadow maps, cube maps, and 3D
Floating-point textures
29. CS 354 29
GeForce Graphics Pipeline
Raster Operations
2x and 4x multisampling
fp16 and sRGB blending
Multiple render targets
Color and depth compression
Double-speed depth/stencil only
30. CS 354 30
Single GeForce 7800 Vertex Unit
Vertex Processing Engine:
• MIMD architecture
• Dual issue
• Low-penalty branching
• Shader Model 3.0
• 32 vector registers
• 512 static instructions per program
• Indexed input and output registers
Datapath: primitive assembly and attribute processing feed an FP32 scalar unit and an FP32 vector unit, with a branch unit.
Vertex Texture Fetch (backed by a texture cache):
• Non-stalling
• Up to 4 texture units
• Unlimited fetches
• Mipmapping, no filtering
Results pass through viewport processing to Setup.
31. CS 354 31
Vertex Texturing Example
A flat tessellated mesh plus a height-field texture, displaced by a vertex program, yields a displaced mesh.
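A minimal CPU-side sketch of what the vertex program does per vertex: sample the height field and push the vertex along the flat mesh's normal. Plain Python stands in for GLSL/Cg here, and the nearest-neighbor sampler and +Z normal are illustrative simplifications.

```python
# Sketch of vertex-texture displacement: each vertex of a flat,
# tessellated z=0 grid samples a height field and is displaced
# along +Z. A pure-Python stand-in for the GPU vertex shader.

def sample_height(heights, u, v):
    """Nearest-neighbor sample of a 2D height field at (u, v) in [0, 1]."""
    rows, cols = len(heights), len(heights[0])
    r = min(int(v * rows), rows - 1)
    c = min(int(u * cols), cols - 1)
    return heights[r][c]

def displace(vertices, heights, scale=1.0):
    """vertices: list of (x, y, u, v) on a flat z=0 mesh -> (x, y, z)."""
    return [(x, y, scale * sample_height(heights, u, v))
            for x, y, u, v in vertices]

field = [[0.0, 0.5],
         [0.5, 1.0]]
mesh = [(0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.99, 0.0),
        (0.0, 1.0, 0.0, 0.99), (1.0, 1.0, 0.99, 0.99)]
print(displace(mesh, field, scale=2.0))
```

On the GPU this per-vertex fetch is exactly what the non-stalling vertex texture unit of the previous slide provides.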
33. CS 354 33
Vertex Textures to Drive Particle Systems
Render-to-texture: the simulation runs in a floating-point frame buffer that is also usable as a texture.
Vertex textures: each particle's location is determined with a vertex texture fetch.
34. CS 354 34
Single GeForce 7800 Fragment Shader Pipeline
Texture Processor (16 texture units):
• 1 texture fetch at full speed
• Bilinear or tri-linear filtering
• 16x anisotropic filtering
• Floating-point (fp16) texture filtering
Shader Unit 1 (FP32):
• 4 MULs + RCP
• Dual issue
• Texture address calculation
• Fast fp16 normalize
• Free: negate, abs, condition codes
Shader Unit 2 (FP32):
• 4 MADs or DP4
• Dual issue
• Free: negate, abs, condition codes
A branch processor and a fixed-function fog unit round out the pipeline; a texture cache feeds the texture processor, and shaded fragments are output.
35. CS 354 35
Operations Per Fragment Shader Pass
Shader Unit 1: 4 components × 1 op per component = 4 ops per fragment per pass
Texture: 1 texture fetch per fragment at full speed per pass
Shader Unit 2: 4 components × 1 op per component = 4 ops per fragment per pass
Total: 8 operations per fragment per pass
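Combining this slide with the 24 fragment pipelines and the 430 MHz core clock given elsewhere in the deck, peak programmable shading throughput works out as a simple product. This is a back-of-the-envelope estimate that ignores texture fetches and co-issue packing:

```python
# Back-of-the-envelope peak shader throughput for a GeForce
# 7800 GTX-class part: ops/fragment/pass from this slide,
# pipeline count and clock from earlier slides.

OPS_PER_FRAGMENT_PER_PASS = 8   # shader unit 1 + shader unit 2
FRAGMENT_PIPES = 24
CORE_CLOCK_HZ = 430e6

peak_ops_per_sec = OPS_PER_FRAGMENT_PER_PASS * FRAGMENT_PIPES * CORE_CLOCK_HZ
print(f"{peak_ops_per_sec / 1e9:.2f} billion shader ops/s")
```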
36. CS 354 36
Fragment Shader Component Co-issue
Use the 4 components in various ways:
• RGBA all together
• RGB and A
• RG and BA
Both shader units can co-issue, with two operations per shader unit:
• Shader Unit 1: operation 1 on one component group, operation 2 on the other
• Shader Unit 2: operation 3 and operation 4, likewise
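A toy scheduler illustrates the idea: two independent operations can share one shader unit's 4-wide datapath in a single pass when their component masks are disjoint and together form one of the allowed splits. The masks and pairing rules here are a simplification for illustration, not the hardware's exact issue rules.

```python
# Toy model of component co-issue: two ops share one 4-wide shader
# unit per pass if their write masks are disjoint and together form
# an allowed split (RGBA alone, RGB+A, or RG+BA). Simplified.

ALLOWED_SPLITS = [
    (frozenset("rgba"), frozenset()),
    (frozenset("rgb"), frozenset("a")),
    (frozenset("rg"), frozenset("ba")),
]

def can_coissue(mask1, mask2):
    m1, m2 = frozenset(mask1), frozenset(mask2)
    if m1 & m2:
        return False  # both ops want the same component
    return any((m1, m2) in ((a, b), (b, a)) for a, b in ALLOWED_SPLITS)

print(can_coissue("rgb", "a"))   # 3+1 split: co-issues
print(can_coissue("rg", "ba"))   # 2+2 split: co-issues
print(can_coissue("rg", "gb"))   # overlap on g: cannot
```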
37. CS 354 37
Single GeForce 7800 Raster Operations Pipeline
Dataflow: input shaded fragment data → pixel crossbar interconnect → multisample antialiasing → depth compression and color compression → depth raster operations and color raster operations → memory frame buffer partition.
Functionality:
• OpenEXR floating-point blending
• sRGB blending
• 4x rotated-grid multisampling
• Lossless color and depth compression
• Multiple render targets
39. CS 354 39
Scalable Link Interface (SLI)
Gang two GeForce 6600, 6800, or 7800
graphics boards together
Can almost double your performance
[Photo: two GeForce 6800 Ultras joined by the SLI connector]
40. CS 354 40
SLI Rendering Modes
Split Frame Rendering (SFR)
One GPU renders top of screen; other renders the bottom
Scales fragment processing but not vertex processing
Alternate Frame Rendering (AFR)
Scales both vertex and fragment processing
Adds frame latency
Rendering must be free of CPU synchronization
SLI Antialiasing: SLI8x and SLI16x
Better antialiasing quality rather than performance
Each card renders with slightly different sub-pixel offset
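A small model makes the SFR/AFR trade-off concrete. Assume (illustratively) a frame whose time splits into a vertex-bound part and a fragment-bound part: SFR halves only the fragment part, while AFR alternates whole frames across the two GPUs. The 4 ms / 6 ms split below is an assumption, not a measured figure.

```python
# Toy frame-time model for two-GPU SLI scaling.

def sfr_frame_time(vertex_ms, fragment_ms):
    # Split Frame Rendering: each GPU still processes all vertices
    # (the split is in screen space), but fragment work is halved.
    return vertex_ms + fragment_ms / 2

def afr_throughput_frame_time(vertex_ms, fragment_ms):
    # Alternate Frame Rendering: whole frames alternate between
    # GPUs, so throughput doubles (at the cost of a frame of latency).
    return (vertex_ms + fragment_ms) / 2

v, f = 4.0, 6.0   # ms of vertex and fragment work per frame (assumed)
print(sfr_frame_time(v, f))             # 7.0 ms: only fragments scale
print(afr_throughput_frame_time(v, f))  # 5.0 ms per frame in steady state
```

The model shows why AFR scales better (both workloads halve) yet adds latency, exactly as the slide states.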
41. CS 354 41
PC Graphics Hardware Evolution
Viable economics: 650 million GeForce GPUs since 1999
1,000x complexity since 1995; Moore’s Law at work
[Timeline, 1997 to 2010: RIVA 128, 3M transistors; GeForce 256, 23M; GeForce 3, 60M; GeForce FX, 125M; GeForce 8800, 681M; GeForce 580 GTX, 3B transistors]
45. CS 354 45
Streaming Multiprocessor (SM)
Multi-processor execution unit
32 scalar processor cores
A warp is a unit of thread execution of up to 32 threads
Two workloads:
• Graphics: vertex shader, tessellation, geometry shader, fragment shader
• Compute
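Since a warp executes up to 32 threads in lockstep, the number of warps a workload occupies is a ceiling division. A generic sketch, not tied to any particular API:

```python
# Warps are units of up to 32 threads executing together on an SM.
# Launching N threads therefore occupies ceil(N / 32) warps; the
# last warp may be only partially full.

WARP_SIZE = 32

def warps_needed(num_threads):
    return -(-num_threads // WARP_SIZE)   # ceiling division

for n in (32, 33, 1000):
    full, rem = divmod(n, WARP_SIZE)
    print(n, "threads ->", warps_needed(n), "warps",
          f"(last warp has {rem or WARP_SIZE} threads)")
```

A partially full last warp still occupies a whole warp slot, which is one reason thread counts are usually chosen as multiples of 32.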
46. CS 354 46
OpenGL Pipeline Programmable Domains run on Unified Hardware
Unified Streaming Processor Array (SPA) architecture means the same capabilities for all domains, plus tessellation and compute (not shown below).
Pipeline: GPU front end → vertex assembly → vertex program → primitive assembly → primitive program → clipping, setup, and rasterization → fragment program → raster operations. Attribute fetch, parameter buffer reads, texture fetches, and framebuffer access all go through the memory interface. The vertex, primitive, and fragment programs can all be unified hardware!
48. CS 354 48
Shader or CUDA Core,
Same Unit but Two Personalities
Execution unit
Scalar floating-point
Scalar integer
49. CS 354 49
Levels of Caching in Fermi GPU
12 KB L1 texture cache per texture unit
64 KB per-SM cache, split into a dedicated 16 KB or 48 KB load/store cache and 48 KB or 16 KB of shared memory
768 KB L2 cache: unifies the texture cache, raster operation cache, and internal buffering of the prior generation; read/write and fully coherent
50. CS 354 50
Cache Use Strategies
in Fermi GPU
Pipeline stages can communicate efficiently through
GPU’s L1 and L2 caches
Buffering between stages stays all on chip
Only vertex, texel, and pixel read/writes need to go to DRAM
51. CS 354 51
Vertex and Tessellation
Processing Tasks
Fixed-function graphics engines
Pull attributes and assemble vertex
Manage tessellation control and domain shader evaluation
Viewport transform
Attribute setup of plane equations for rasterization
Stream out vertices into buffers
52. CS 354 52
Rasterization Tasks
Turns primitives into fragments
Computes edge equations
Two-stage rasterization
Coarse raster finds tiles the primitive could be in
Fine raster evaluates sample positions within tiles
Zcull efficiently eliminates occluded fragments
56. CS 354 56
GPUs as Compute Nodes
The architecture of the GPU has evolved into a high-performance, high-bandwidth compute node with a small form factor.
Form factors: integrated CPU-GPU, OEM workstations, CPU servers and blades, and 1U compute servers with 2 to 4 Tesla GPUs.
57. CS 354 57
Compute Programming Model
Cooperative Thread Array (CTA)
Single Program, Multiple Data
Organized in 1D, 2D, or 3D
Programming APIs
CUDA, OpenCL, DirectCompute
APIs + language = parallel processing system
OpenGL or Direct3D through shaders
Cg, HLSL, GLSL
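The CTA model maps naturally onto nested index arithmetic: every thread derives a unique global coordinate from its block index and its thread index within the block. A hedged sketch of 2D SPMD indexing in plain Python (generic, not any particular API's built-in variables):

```python
# Sketch of SPMD indexing for a 2D grid of 2D cooperative thread
# arrays (CTAs): each thread computes a unique global coordinate
# as block_index * block_dim + thread_index.

def global_ids(grid_dim, block_dim):
    """grid_dim, block_dim: (x, y) sizes. Yields one global (x, y)
    coordinate per thread in the launch."""
    gx, gy = grid_dim
    bx, by = block_dim
    for block_y in range(gy):
        for block_x in range(gx):
            for ty in range(by):
                for tx in range(bx):
                    yield (block_x * bx + tx, block_y * by + ty)

ids = list(global_ids((2, 2), (4, 4)))
print(len(ids))        # 64 threads total
print(len(set(ids)))   # every coordinate is unique
```

The same single program runs at every one of these coordinates, which is the Single Program, Multiple Data organization the slide names.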
58. CS 354 58
Now in World’s Fastest
Supercomputers
Tianhe-1A
2.507 petaflops
7,168 Tesla M2050 GPUs
National Supercomputing Center in Tianjin
59. CS 354 59
Opposite direction:
Consumer mobile devices
60. CS 354 60
Low-power Mobile
System on a Chip (SoC)
Complete system on a chip
4 ARM cores
Integrated graphics
OpenGL ES 2.0
Power <1W
61. CS 354 61
Mid-term Next Class
Mid-term
Similar in format to the homeworks
15% of your final grade
Arrive on time
Open textbook. Open notes, including lecture slides.
Calculators allowed/encouraged.
No smart phones, no computers, no Internet access.
Show your work to justify your answer and provide a basis for partial
credit.
What to study
All material in lecture slides
Review in-class quiz questions
Study homeworks
Responsible for textbook readings
Have a relaxing spring break
Next lecture: Shadows
Come back to Project 2
Editor's Notes
The technology of graphics processors has evolved amazingly over the last 15 years or so. I’ve been at NVIDIA for more than 10 years and have seen a lot of this first hand. As the hardware increases in performance, the visual quality improves. This is driven by Moore’s law, which says that the number of transistors able to fit on a piece of silicon doubles roughly every 18 months. The great thing about graphics is that it has an insatiable appetite for computation. We’re clearly not at photo-realistic quality yet and still have a long way to go.
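The "1,000x complexity since 1995" figure on the evolution slide is consistent with that doubling rate, as a quick check shows:

```python
# Moore's-law sanity check: doubling transistor count every 18
# months over 15 years gives ten doublings, i.e. about 1,000x,
# matching the "1,000x complexity since 1995" slide.

months_per_doubling = 18
years = 15

doublings = years * 12 / months_per_doubling
growth = 2 ** doublings
print(doublings, growth)   # 10.0 doublings -> 1024.0x
```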
World’s fastest known supercomputer today; the official Top500 list comes out next month. Peta = 10^15 = a thousand trillion floating-point operations per second.