CS 354 GPU Architecture
 


CS 354 Computer Graphics; GPU Architecture; March 6, 2012

  • The technology of graphics processors has evolved amazingly over the last 15 years or so. I’ve been at NVIDIA for more than 10 years and have seen a lot of this first hand. As the hardware increases in performance, the visual quality improves. This is driven by Moore’s law, which says that the number of transistors able to fit on a piece of silicon doubles roughly every 18 months. The great thing about graphics is that it has an insatiable appetite for computation. We’re clearly not at photo-realistic quality yet and still have a long way to go.
  • World’s fastest known supercomputer today; the official Top500 list comes out next month. Peta = 10^15, so a petaflop is a thousand trillion floating-point operations per second.

CS 354 GPU Architecture Presentation Transcript

  • CS 354 GPU Architecture, Mark Kilgard, University of Texas, March 6, 2012
  • CS 354 2 Today’s material
    • In-class quiz
    • Lecture topic
      • Architecture of Graphics Processing Units (GPUs)
    • Course work
      • Homework #4 due today
      • Review textbook reading
        • Chapters 5, 6, and 7
      • Project #2 on texturing, shading, & lighting is coming
    • Remember: Midterm in-class on March 8
  • CS 354 3 My Office Hours
    • Tuesday, before class
      • Painter (PAI) 5.35
      • 8:45 a.m. to 9:15
    • Thursday, after class
      • ACE 6.302
      • 11:00 a.m. to 12
    • Randy’s office hours
      • Monday & Wednesday
      • 11 a.m. to 12:00
      • Painter (PAI) 5.33
  • CS 354 4 Last time, this time
    • Last lecture, we discussed
      • Programmable shading
      • Graphics hardware shading languages
    • This lecture
      • How do GPUs work?
  • CS 354 5 Daily Quiz
    • On a sheet of paper
      • Write your EID, name, and date
      • Write #1, #2, #3, #4 followed by its answer
    • Pick the best choice: Shade trees are
      a) fractal trees with shadows
      b) OpenGL commands
      c) hierarchical arrangements of shading computations
      d) fractal patterns of all sorts
    • Multiple choice: The GLSL standard has built-in data types for
      a) vectors
      b) matrices
      c) texture samplers
      d) floating-point values
      e) pointers to malloc’ed memory
      f) a through e
      g) a through d
    • Name one general purpose programming language that GLSL borrows from.
  • CS 354 6 Key Trend in OpenGL Evolution
    [Diagram: Fixed-function to Programmable, progressing from Simple Configurability to Complex Configurability to Shaders! (High-level languages)]
    • Direct3D follows the same trend
    • Also reflects trend in GPU architecture
      • API and hardware co-evolving
  • CS 354 7 Programming Shaders inside GPU
    • Multiple programmable domains within the GPU
    • Can be programmed in high-level languages
      • Cg, HLSL, or OpenGL Shading Language (GLSL)
    [Diagram, OpenGL 3.3: 3D Application or Game → OpenGL API → CPU-GPU Boundary → GPU Front End → Vertex Assembly → Primitive Assembly → Clipping, Setup, and Rasterization → Raster Operations. Programmable domains: Vertex Shader, Geometry Program, Fragment Shader. Memory Interface paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access. Legend: programmable, fixed-function]
  • CS 354 8 Complex OpenGL Data Flow
  • CS 354 9 Six Years of GPU Architecture
    Year | Product | New Features | OpenGL Version | Direct3D Version
    2000 | GeForce 256 | Hardware transform & lighting, configurable fixed-point shading, cube maps, texture compression, anisotropic texture filtering | 1.3 | DX7
    2001 | GeForce3 | Programmable vertex transformation, 4 texture units, dependent textures, 3D textures, shadow maps, multisampling, occlusion queries | 1.4 | DX8
    2002 | GeForce4 Ti 4600 | Early Z culling, dual-monitor | 1.4 | DX8.1
    2003 | GeForce FX | Vertex program branching, floating-point fragment programs, 16 texture units, limited floating-point textures, color and depth compression | 1.5 | DX9
    2004 | GeForce 6800 Ultra | Vertex textures, structured fragment branching, non-power-of-two textures, generalized floating-point textures, floating-point texture filtering and blending | 2.0 | DX9c
    2005 | GeForce 7800 GTX | Transparency antialiasing | 2.0 | DX9c
  • CS 354 10 GeForce Peak Vertex Processing Trends
    [Chart: millions of vertices per second, GeForce2 GTS through GeForce 7800 GTX]
    • Rate for trivial 4x4 vertex transform exceeds peak setup rates, which allows excess vertex processing
    • Vertex units: GeForce2 GTS 1; GeForce3 1; GeForce4 Ti 4600 2; GeForce FX 3; GeForce 6800 Ultra 6; GeForce 7800 GTX 8
  • CS 354 11 GeForce Peak Memory Bandwidth Trends
    [Chart: gigabytes per second, GeForce2 GTS through GeForce 7800 GTX; regions marked 128-bit interface and 256-bit interface]
    • Series: raw bandwidth; effective raw bandwidth with compression
    • Exponential trend lines shown for both series
  • CS 354 12 Effective GPU Memory Bandwidth
    • Compression schemes
      • Lossless depth and color (when multisampling) compression
      • Lossy texture compression (S3TC / DXTC)
      • Typically assumes 4:1 compression
    • Avoiding useless work
      • Early killing of fragments (Z cull)
      • Avoiding useless blending and texture fetches
    • Very clever memory controller designs
      • Combining memory accesses for improved coherency
      • Caches for texture fetches
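A minimal sketch of how compression raises effective bandwidth, using the 4:1 ratio from the slide. The fraction of traffic that compresses, the raw figure used in the example calls, and the function name are all assumptions chosen for illustration, not values from the lecture.

```python
def effective_bandwidth(raw_gb_s, compressible_fraction, ratio=4.0):
    """Logical GB/s delivered when part of the traffic is stored compressed."""
    # Bytes that actually cross the memory bus per logical byte requested.
    bytes_moved = (1.0 - compressible_fraction) + compressible_fraction / ratio
    return raw_gb_s / bytes_moved

# Hypothetical raw bandwidth of 38.4 GB/s with three assumed traffic mixes.
for fraction in (0.0, 0.5, 1.0):
    print(fraction, effective_bandwidth(38.4, fraction))
```

With nothing compressible the result equals the raw bandwidth, and with everything compressible it is the raw bandwidth times the ratio; real workloads fall somewhere in between.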
  • CS 354 13 GeForce Core and Memory Clock Rates
    [Chart: megahertz (MHz), Riva ZX through GeForce 7800 GTX; series: core clock, memory clock]
    • DDR memory transition: memory rates double physical clock rate
  • CS 354 14 GeForce Peak Triangle Setup Trends
    [Chart: millions of triangles per second, GeForce2 GTS through GeForce 7800 GTX]
    • Assumes 50% face culling
  • CS 354 15 GeForce Peak Texture Fetch Trends
    [Chart: millions of texture fetches per second, GeForce2 GTS through GeForce 7800 GTX]
    • Assuming no texture cache misses
    • Texture units: GeForce2 GTS 2×4; GeForce3 2×4; GeForce4 Ti 4600 2×4; GeForce FX 2×4; GeForce 6800 Ultra 16; GeForce 7800 GTX 24
  • CS 354 16 GeForce Peak Depth/Stencil-only Fill
    [Chart: millions of depth/stencil pixel updates per second, GeForce2 GTS through GeForce 7800 GTX]
    • Assuming no read-modify-write
    • Double speed depth-stencil only
    • Raster Op units: GeForce2 GTS 4; GeForce3 4; GeForce4 Ti 4600 4; GeForce FX 4+4; GeForce 6800 Ultra 16+16; GeForce 7800 GTX 16+16
  • CS 354 17 GeForce Transistor Count and Semiconductor Process
    [Chart: millions of transistors, Riva ZX through GeForce 7800 GTX]
    • Process (µm): Riva ZX 0.35; Riva TNT2 0.22; GeForce2 GTS 0.18; GeForce3 0.18; GeForce4 Ti 4600 0.15; GeForce FX 0.13; GeForce 6800 Ultra 0.13; GeForce 7800 GTX 0.11
  • CS 354 18
    Hardware Unit | GeForce FX 5900 | GeForce 6800 Ultra | GeForce 7800 GTX
    Vertex | 3 | 6 | 8
    Fragment + 2nd Texture Fetch | 4+4 | 16 | 24
    Raster Color + Raster Depth | 4+4 | 16+16 | 16+16
  • CS 354 19 GeForce 7800 GTX Board Details
    • SLI Connector
    • Single slot cooling
    • sVideo TV Out
    • DVI x 2
    • 256MB/256-bit DDR3, 600 MHz, 8 pieces of 8Mx32
    • 16x PCI-Express
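As a rough check of the board figures above, capacity, bus width, and peak bandwidth can be derived from the chip count and clock. This sketch assumes that "8Mx32" means 8 Mi words of 32 bits per chip, that each chip contributes its full 32-bit width to the bus, and that 600 MHz is the physical clock with two transfers per cycle (double data rate). The function name is made up for illustration.

```python
def board_memory(chips=8, words_per_chip=8 * 2**20, bits_per_word=32,
                 clock_hz=600e6, transfers_per_clock=2):
    """Return (capacity in bytes, bus width in bits, peak bytes per second)."""
    capacity_bytes = chips * words_per_chip * bits_per_word // 8
    bus_bits = chips * bits_per_word
    # Assumption: every transfer moves one full bus width of data.
    bandwidth = bus_bits / 8 * clock_hz * transfers_per_clock
    return capacity_bytes, bus_bits, bandwidth

capacity, bus, bw = board_memory()
print(capacity // 2**20, "MB")   # 256 MB
print(bus, "bit")                # 256 bit
print(bw / 1e9, "GB/s")          # 38.4 GB/s
```

Under these assumptions the chip organization reproduces the 256MB/256-bit figures on the slide.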
  • CS 354 20 GeForce 7800 GTX GPU Details
    • 302 million transistors
    • 430 MHz core clock
    • 256-bit memory interface
    • Notable Functionality
      • Non-power-of-two textures with mipmaps
      • Floating-point (fp16) blending and filtering
      • sRGB color space texture filtering and frame buffer blending
      • Vertex textures
      • 16x anisotropic texture filtering
      • Dynamic vertex and fragment branching
      • Double-rate depth/stencil-only rendering
      • Early depth/stencil culling
      • Transparency antialiasing
  • CS 354 21 GeForce 7800 GTX Parallelism
    [Diagram: 8 Vertex Engines → Triangle Setup/Raster with Z-Cull → Shader Instruction Dispatch → 24 Fragment Shaders → Fragment Crossbar → 16 Raster Operation Pipelines → 4 Memory Partitions]
  • CS 354 22 GeForce Graphics Pipeline
    • Separate dedicated units
    [Diagram: CPU → Vertex Engine → Setup → Raster → Fragment Shader → Raster Ops → Frame Buffer, with Z Cull and Texture as side units. The same diagram repeats on slides 23 to 29 with one unit highlighted.]
  • CS 354 23 GeForce Graphics Pipeline: Vertex Engine
    • Vertex pulling
    • Vector floating-point instructions
    • Dynamic branching
    • Vertex texture
    • Vertex stream frequency
  • CS 354 24 GeForce Graphics Pipeline: Setup
    • Prepare triangle for rasterization
    • 215M triangles/sec setup
  • CS 354 25 GeForce Graphics Pipeline: Raster
    • Compute coverage
    • Points, lines, and triangles
    • Rotated grid multisampling
  • CS 354 26 GeForce Graphics Pipeline: Z Cull
    • Discard fragments early based on Z
    • Up to 64 pixels/clock
    • Multisampled: 256 samples/clock
  • CS 354 27 GeForce Graphics Pipeline: Fragment Shader
    • User-programmed fragment coloring
    • Dynamic branching
    • Long shaders
    • Multiple render targets
    • fp16 and fp32 vectors
  • CS 354 28 GeForce Graphics Pipeline: Texture
    • fp16 and sRGB filtering
    • 16x anisotropic filtering
    • Non-power-of-two mipmapping
    • Shadow maps, cube maps, and 3D
    • Floating-point textures
  • CS 354 29 GeForce Graphics Pipeline: Raster Ops
    • 2x and 4x multisampling
    • fp16 and sRGB blending
    • Multiple render targets
    • Color and depth compression
    • Double-speed depth/stencil only
  • CS 354 30 Single GeForce 7800 Vertex Unit
    [Diagram: Primitive Assembly + Attribute Processing → Vertex Processing Engine (FP32 Scalar Unit, FP32 Vector Unit, Branch Unit) with Vertex Texture Fetch and Texture Cache → Viewport Processing → To Setup]
    • Vertex Processing Engine
      • MIMD Architecture
      • Dual Issue
      • Low-penalty branching
      • Shader Model 3.0
      • 32 vector registers
      • 512 static instructions per program
      • Indexed input and output registers
    • Vertex Texture Fetch
      • Non-stalling
      • Up to 4 texture units
      • Unlimited fetches
      • Mipmapping, no filtering
  • CS 354 31 Vertex Texturing Example
    [Diagram: Flat tessellated mesh + Height field texture → Vertex Program → Displaced mesh]
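The flat-mesh-plus-height-field idea can be sketched on the CPU as follows. This is an illustration only, not the lecture's vertex program: the helper names are hypothetical, the displacement is assumed to be along +y, and the lookup is an unfiltered nearest-texel fetch standing in for a vertex texture fetch.

```python
def make_flat_mesh(n):
    """n x n grid of vertices on the y = 0 plane, each with (u, v) in [0, 1]."""
    step = 1.0 / (n - 1)
    return [((i * step, 0.0, j * step), (i * step, j * step))
            for j in range(n) for i in range(n)]

def fetch_nearest(height_field, u, v):
    """Unfiltered nearest-texel lookup into a 2D height field."""
    rows, cols = len(height_field), len(height_field[0])
    x = min(int(u * cols), cols - 1)
    y = min(int(v * rows), rows - 1)
    return height_field[y][x]

def displace(mesh, height_field, scale=1.0):
    """Per-vertex step: move each vertex along +y by the sampled height."""
    return [(px, py + scale * fetch_nearest(height_field, u, v), pz)
            for (px, py, pz), (u, v) in mesh]

height_field = [[0.0, 0.5],
                [0.5, 1.0]]
displaced = displace(make_flat_mesh(3), height_field)
print(displaced[0], displaced[-1])
```

On a GPU the same per-vertex work would run in the vertex units, with the height field bound as a texture rather than passed as a Python list.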
  • CS 354 32 Vertex Textures for Dynamic Displacement Mapping Without Vertex Textures With Vertex Textures Images used with permission from Pacific Fighters. © 2004 Developed by 1C:Maddox Games. All rights reserved. © 2004 Ubi Soft Entertainment.
  • CS 354 33 Vertex Textures to Drive Particle Systems
    • Render-to-texture
      • Simulation runs in floating-point frame buffer, also usable as texture
    • Vertex textures
      • Determines particle location with vertex texture fetch
  • CS 354 34 Single GeForce 7800 Fragment Shader Pipeline
    [Diagram: Input Fragment Data and Texture Data → FP32 Shader Unit 1 with Texture Processor and Texture Cache → FP32 Shader Unit 2 → Branch Processor → Fog Unit → Output Shaded Fragments]
    • Texture Processor
      • 16 texture units
      • 1 texture fetch at full speed
      • Bilinear or tri-linear filtering
      • 16x anisotropic filtering
      • Floating-point (fp16) texture filtering
    • Shader Unit 1
      • 4 MULs + RCP
      • Dual Issue
      • Texture address calculation
      • Fast fp16 normalize
      • Free: negate, abs, condition codes
    • Shader Unit 2
      • 4 MADs or DP4
      • Dual Issue
      • Free: negate, abs, condition codes
    • Fog Unit
      • Fixed-function
  • CS 354 35 Operations Per Fragment Shader Pass
    • Shader Unit 1
      • 4 components, 1 op / component
      • 4 ops / fragment per pass
      • or 1 texture / fragment at full speed per pass
    • Shader Unit 2
      • 4 components, 1 op / component
      • 4 ops / fragment per pass
    • 8 operations / fragment per pass
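The per-pass count above is simple arithmetic, sketched here with a hypothetical chip-wide extension. The chip-wide figure assumes 24 fragment pipelines and a 430 MHz core clock (numbers quoted on other slides for the GeForce 7800 GTX), one pass per clock, and every unit fully busy, so it is an idealized peak rather than a measured rate.

```python
SHADER_UNITS = 2        # Shader Unit 1 and Shader Unit 2
COMPONENTS = 4          # R, G, B, A
OPS_PER_COMPONENT = 1   # one op per component per unit per pass

def ops_per_fragment_pass():
    """Operations one fragment receives in a single pass through the pipeline."""
    return SHADER_UNITS * COMPONENTS * OPS_PER_COMPONENT

def chip_ops_per_second(pipelines=24, core_clock_hz=430e6):
    """Idealized peak: assumes one pass per clock in every pipeline."""
    return pipelines * core_clock_hz * ops_per_fragment_pass()

print(ops_per_fragment_pass())        # 8
print(chip_ops_per_second() / 1e9)    # billions of component ops per second
```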
  • CS 354 36 Fragment Shader Component Co-issue
    • Use the 4 components in various ways
      • RGBA all together
      • RGB and A
      • RG and BA
    • Both shader units
      • Two operations per shader unit
    [Diagram: Shader Unit 1 splits R G B A between Operation 1 and Operation 2; Shader Unit 2 splits R G B A between Operation 3 and Operation 4]
  • CS 354 37 Single GeForce 7800 Raster Operations Pipeline
    [Diagram: Input Shaded Fragment Data → Pixel Crossbar Interconnect → Multisample Antialiasing → Depth Compression and Color Compression → Depth Raster Operations and Color Raster Operations → Memory Partition → Frame Buffer]
    • Functionality
      • OpenEXR floating-point blending
      • sRGB blending
      • 4x rotated grid multisampling
      • Lossless color and depth compression
      • Multiple render targets
  • CS 354 38 GeForce 7800 Transparency Antialiasing
    [Images: conventional 4x antialiasing with alpha tested context; transparency antialiasing with alpha tested context]
  • CS 354 39 Scalable Link Interface (SLI)
    • Gang two GeForce 6600, 6800, or 7800 graphics boards together
    • Can almost double your performance
    [Photo: two 6800 Ultras joined by the SLI Connector]
  • CS 354 40 SLI Rendering Modes
    • Split Frame Rendering (SFR)
      • One GPU renders top of screen; other renders the bottom
      • Scales fragment processing but not vertex processing
    • Alternate Frame Rendering (AFR)
      • Scales both vertex and fragment processing
      • Adds frame latency
      • Rendering must be free of CPU synchronization
    • SLI Antialiasing: SLI8x and SLI16x
      • Better antialiasing quality rather than performance
      • Each card renders with slightly different sub-pixel offset
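The difference between the two performance modes can be sketched as a work-assignment rule. This is a toy model with hypothetical function names: it assumes an even split of scanlines for SFR and strict round-robin for AFR, and says nothing about how a real driver balances load.

```python
def split_frame(height, gpus=2):
    """SFR: each GPU gets a contiguous band of scanlines of the same frame."""
    band = -(-height // gpus)  # ceiling division
    return [(g, range(g * band, min((g + 1) * band, height)))
            for g in range(gpus)]

def alternate_frame(frame_index, gpus=2):
    """AFR: whole frames are dealt out to GPUs round-robin."""
    return frame_index % gpus

print(split_frame(1080))
print([alternate_frame(f) for f in range(4)])
```

In the SFR model every GPU still has to process all of a frame's vertices to know which triangles land in its band, which is why the slide notes that SFR scales fragment work but not vertex work.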
  • CS 354 41 PC Graphics Hardware Evolution
    • Viable economics: 650 million GeForce GPUs since 1999
    • 1,000x complexity since 1995
    • Moore’s Law at work
    [Timeline, 1997 to 2010: RIVA 128, 3M transistors; GeForce 256, 23M transistors; GeForce 3, 60M transistors; GeForce FX, 125M transistors; GeForce 8800, 681M transistors; GeForce 580 GTX, 3B transistors]
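The endpoints of the timeline give a quick estimate of the implied doubling period. This sketch assumes the RIVA 128 sits at 1997 and the GeForce 580 GTX at 2010, as read off the slide's time axis; the function name is made up for illustration.

```python
import math

def doubling_period_years(count_start, count_end, years):
    """Years per doubling for growth from count_start to count_end."""
    doublings = math.log2(count_end / count_start)
    return years / doublings

period = doubling_period_years(3e6, 3e9, 2010 - 1997)
print(round(period, 2), "years per doubling")
print(round(period * 12, 1), "months per doubling")
```

A 1,000x increase is just under ten doublings, so thirteen years works out to roughly 1.3 years per doubling under these assumptions.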
  • CS 354 42 Current High-end “Fermi” GPU
    • Current high-end graphics card
      • 512 graphics “cores”
      • 1.5 GB memory
      • System power: 600W
      • OpenGL 4.2 / DirectX 11 functionality
  • CS 354 43 High-level “Fermi” Architecture
    • GF100
    • Four Graphics Processor Clusters (GPCs)
      • Each is a self-contained graphics pipeline
      • Smaller chips have fewer GPCs
    • Shared L2 cache
    • 6 Memory Controllers
      • 1.5 GB
  • CS 354 44 Inside Each Graphics Processing Cluster
    • Raster engine
    • Four SMs
      • Streaming Multiprocessor
      • Texture fetch resources
      • Tessellation and vertex processing resources
        • Polymorph Engine
  • CS 354 45 Streaming Multiprocessor (SM)
    • Multi-processor execution unit
      • 32 scalar processor cores
    • Warp is a unit of thread execution of up to 32 threads
    • Two workloads
      • Graphics
        • Vertex shader
        • Tessellation
        • Geometry shader
        • Fragment shader
      • Compute
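Since a warp holds up to 32 threads, a batch of work is carved into warps with the last one possibly only partly full. A minimal sketch of that grouping (the function name is hypothetical, and real hardware scheduling involves much more than this):

```python
WARP_SIZE = 32

def split_into_warps(num_threads, warp_size=WARP_SIZE):
    """Return the thread count of each warp; the last may be partially full."""
    full, rest = divmod(num_threads, warp_size)
    return [warp_size] * full + ([rest] if rest else [])

print(split_into_warps(100))   # [32, 32, 32, 4]
```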
  • CS 354 46 OpenGL Pipeline Programmable Domains run on Unified Hardware
    • Unified Streaming Processor Array (SPA) architecture means same capabilities for all domains
      • Plus tessellation + compute (not shown below)
    [Diagram: GPU Front End → Vertex Assembly → Primitive Assembly → Clipping, Setup, and Rasterization → Raster Operations. Vertex Program, Primitive Program, and Fragment Program can be unified hardware! Memory Interface paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access]
  • CS 354 47 Dual Warp Scheduling
    • 32 threads launch!
  • CS 354 48 Shader or CUDA Core, Same Unit but Two Personalities
    • Execution unit
      • Scalar floating-point
      • Scalar integer
  • CS 354 49 Levels of Caching in Fermi GPU
    • 12 KB L1 Texture cache
      • Per texture unit
    • SM 64 KB cache
      • Split into dedicated 16 KB or 48 KB Load/Store cache
      • Shared memory 48 KB or 16 KB
    • L2 unifies texture cache, raster operation cache, and internal buffering in prior generation
      • 768 KB
      • Read / write
      • Fully coherent
  • CS 354 50 Cache Use Strategies in Fermi GPU
    • Pipeline stages can communicate efficiently through GPU’s L1 and L2 caches
      • Buffering between stages stays all on chip
      • Only vertex, texel, and pixel read/writes need to go to DRAM
  • CS 354 51 Vertex and Tessellation Processing Tasks
    • Fixed-function graphics engines
      • Pull attributes and assemble vertex
      • Manage tessellation control and domain shader evaluation
      • Viewport transform
      • Attribute setup of plane equations for rasterization
      • Stream out vertices into buffers
  • CS 354 52 Rasterization Tasks
    • Turns primitives into fragments
      • Computes edge equations
    • Two-stage rasterization
      • Coarse raster finds tiles the primitive could be in
      • Fine raster evaluates sample positions within tiles
    • Zcull efficiently eliminates occluded fragments
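The coarse-then-fine idea can be sketched in software. This is a toy model, not the hardware algorithm: the coarse stage here is just a bounding-box test over 8x8 tiles (an assumed tile size), the fine stage tests one sample at each pixel centre with edge equations, coordinates are assumed non-negative, the triangle is assumed counter-clockwise, and no tie-breaking fill rule is applied on shared edges.

```python
def edge(a, b, p):
    """Edge equation: > 0 when p lies to the left of the directed edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def coarse_raster(tri, tile=8):
    """Conservative stage: every tile touched by the triangle's bounding box."""
    xs, ys = [v[0] for v in tri], [v[1] for v in tri]
    tx0, tx1 = int(min(xs)) // tile, int(max(xs)) // tile
    ty0, ty1 = int(min(ys)) // tile, int(max(ys)) // tile
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]

def fine_raster(tri, tiles, tile=8):
    """Exact stage: test the pixel centre of every pixel in the candidate tiles."""
    a, b, c = tri
    covered = []
    for tx, ty in tiles:
        for y in range(ty * tile, (ty + 1) * tile):
            for x in range(tx * tile, (tx + 1) * tile):
                p = (x + 0.5, y + 0.5)
                if edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0:
                    covered.append((x, y))
    return covered

tri = ((1.0, 1.0), (14.0, 2.0), (3.0, 12.0))  # counter-clockwise vertex order
tiles = coarse_raster(tri)
pixels = fine_raster(tri, tiles)
print(len(tiles), "candidate tiles,", len(pixels), "covered pixels")
```

The point of the split is that the cheap coarse test rejects most of the screen, so the per-sample edge evaluations run only where the primitive could actually be.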
  • CS 354 53 Base Input Mesh From Metro 2033, © THQ and 4A Games
  • CS 354 54 Apply Phong Tessellation From Metro 2033, © THQ and 4A Games
  • CS 354 55 Add Displacement Mapping From Metro 2033, © THQ and 4A Games
  • CS 354 56 GPUs as Compute Nodes
    • Architecture of GPU has evolved into a high-performance, high-bandwidth compute node
    [Figure labels: Small form factor; Compute; Integrated CPU-GPU; OEM CPU Server + Compute 1U; Workstations; Servers & Blades; 2 to 4 Tesla GPUs]
  • CS 354 57 Compute Programming Model
    • Cooperative Thread Array (CTA)
      • Single Program, Multiple Data
      • Organized in 1D, 2D, or 3D
    • Programming APIs
      • CUDA, OpenCL, DirectCompute
      • APIs + language = parallel processing system
    • OpenGL or Direct3D through shaders
      • Cg, HLSL, GLSL
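The single-program, multiple-data model with 2D thread arrays can be imitated serially. This is a conceptual sketch only: `launch` and `make_scale_kernel` are hypothetical helpers, threads run one after another rather than in parallel, and the index arithmetic mirrors the usual block-index times block-size plus thread-index convention.

```python
from itertools import product

def launch(grid_dim, block_dim, kernel):
    """Run `kernel` once per thread of every CTA in a 2D grid (serially here)."""
    for by, bx in product(range(grid_dim[1]), range(grid_dim[0])):
        for ty, tx in product(range(block_dim[1]), range(block_dim[0])):
            kernel((bx, by), (tx, ty), block_dim)

def make_scale_kernel(src, dst, width, height, factor):
    """Build the one program that every thread executes on its own element."""
    def kernel(block_idx, thread_idx, block_dim):
        x = block_idx[0] * block_dim[0] + thread_idx[0]
        y = block_idx[1] * block_dim[1] + thread_idx[1]
        if x < width and y < height:  # guard threads that fall past the data
            dst[y][x] = src[y][x] * factor
    return kernel

width, height = 5, 3
src = [[x + y * width for x in range(width)] for y in range(height)]
dst = [[0] * width for _ in range(height)]
block = (4, 2)
grid = (-(-width // block[0]), -(-height // block[1]))  # ceiling division
launch(grid, block, make_scale_kernel(src, dst, width, height, 2))
print(dst)
```

Every thread runs the same kernel body and differs only in the indices it is handed, which is the essence of the CTA model on the slide.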
  • CS 354 58 Now in World’s Fastest Supercomputers
    • Tianhe-1A
      • 2.507 Petaflop
      • 7168 Tesla M2050 GPUs
      • National Supercomputing Center in Tianjin
  • CS 354 59 Opposite direction: Consumer mobile devices
  • CS 354 60 Low-power Mobile System on a Chip (SoC)
    • Complete system on a chip
      • 4 ARM cores
      • Integrated graphics
      • OpenGL ES 2.0
      • Power <1W
  • CS 354 61 Mid-term Next Class
    • Mid-term
      • Similar in format to the homeworks
      • 15% of your final grade
      • Arrive on time
      • Open textbook. Open notes, including lecture slides.
      • Calculators allowed/encouraged.
      • No smart phones, no computers, no Internet access.
      • Show your work to justify your answer and provide a basis for partial credit.
    • What to study
      • All material in lecture slides
      • Review in-class quiz questions
      • Study homeworks
      • Responsible for textbook readings
    • Have a relaxing spring break
      • Next lecture: Shadows
      • Come back to Project 2