CS 354 3 My Office Hours: Tuesday, before class, 8:45 a.m. to 9:15 a.m., Painter (PAI) 5.35; Thursday, after class, 11:00 a.m. to 12:00, ACE 6.302. Randy's office hours: Monday & Wednesday, 11 a.m. to 12:00, Painter (PAI) 5.33
CS 354 4 Last time, this time. Last lecture we discussed acceleration structures. This lecture: graphics performance analysis. Projects: Project 4, on ray tracing, is on Piazza; due May 2, 2012. Get started!
CS 354 5 Daily Quiz • On a sheet of paper, write your EID, name, and date • Write #1, #2, #3, each followed by its answer
#1 Multiple choice: Which is NOT a bounding volume? a) sphere b) axis-aligned bounding box c) object-aligned bounding box d) bounding graph point e) convex polyhedron
#2 True or False: Volume representation rendering can be accelerated by the GPU by drawing blended slices of the volume.
#3 True or False: Placing objects within a uniform grid is easier than placing objects within a KD tree.
CS 354 6 Graphics Performance Analysis. Generating synthetic images by computer is computationally and bandwidth intensive. Achieving interactive rates is key: 60 frames/second ≈ real-time interactivity. Worth optimizing: entertainment and intuition are tied to interactivity. How do we think about graphics performance analysis?
CS 354 7 Framing Amdahl's Law. Assume a workload with two parts: the first part is A%, the second part is B%, such that A% + B% = 100%. If we have a technique to speed up the second part by N times, but have no speedup for the first part, what overall speedup can we expect?
CS 354 8 Amdahl's Equation. Assume A% + B% = 100%. If the un-optimized effort is 100%, the optimized effort should be smaller:
OptimizedEffort = A% + B%/N
Speedup is the ratio of UnoptimizedEffort to OptimizedEffort:
Speedup = 100% / (A% + B%/N)
Writing the fractions as a and b (with a = 1 − b): Speedup = 1 / ((1 − b) + b/N)
CS 354 9 Who was Amdahl? Gene Amdahl, a CPU architect for IBM in the 1960s. Helped design IBM's System/360 mainframe architecture. Left IBM to found Amdahl Corporation, building IBM-compatible mainframes. Why the law? He was evaluating whether or not to invest in parallel processing.
CS 354 10 Parallelization. Broadly speaking, computer tasks can be broken into two portions. Sequential sub-tasks naturally require steps to be done in a particular order; examples: text layout, entropy decoding. Parallel sub-tasks split into lots of independent chunks of work; the chunks can be done by separate processing units simultaneously (parallelization); examples: tracing rays, shading pixels, transforming vertices.
CS 354 11 Serial Work Sandwiching Parallel Work
CS 354 12 Example of Amdahl's Law. Say a task is 50% serial and 50% parallel. Consider using 4 parallel processors on the parallel portion: speedup is 1.6x. Consider using 40 parallel processors on the parallel portion: speedup is 1.951x. Consider the limit:
lim (n → ∞) 1 / (0.5 + 0.5/n) = 2
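The arithmetic above is easy to check numerically. A minimal C sketch (my illustration, not part of the original slides) that reproduces these speedups:

    #include <stdio.h>

    /* Amdahl's Law: speedup when a fraction b of the work is sped up
     * by a factor n and the remaining fraction (1 - b) is unchanged. */
    static double amdahl_speedup(double b, double n)
    {
        return 1.0 / ((1.0 - b) + b / n);
    }

    int main(void)
    {
        printf("N = 4:   %.3fx\n", amdahl_speedup(0.5, 4.0));   /* 1.600x */
        printf("N = 40:  %.3fx\n", amdahl_speedup(0.5, 40.0));  /* 1.951x */
        printf("N = 1e9: %.3fx\n", amdahl_speedup(0.5, 1e9));   /* -> 2   */
        return 0;
    }

Note how the speedup saturates at 2x no matter how many processors are applied to the parallel half.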
CS 354 14 Pessimism about Parallelism? Amdahl's Law can instill pessimism about parallel processing: if the serial work percentage is high, adding parallel units has low benefit. But it assumes a fixed problem size, so the workload stays the same size even as parallel execution resources are added. So why do GPUs offer hundreds of cores then?
CS 354 15 Gustafson's Law. Observation by John Gustafson: with N parallel units, bigger problems can be attacked. Great example: increasing rendering resolution. Was 640x480 pixels, now 1920x1200. More parallel units mean more pixels can be processed simultaneously, supporting rendering resolutions previously unattainable. Problem size improvement, where A is the serial fraction:
problemScale = N − A(N − 1)
CS 354 16 Example. Say a task is 50% serial and 50% parallel. Consider using 4 parallel processors on the parallel portion: the problem scales up 2.5x. Consider 100 parallel processors: the problem scales up 50.5x. Also consider the heterogeneous nature of graphics processing units.
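Gustafson's formula can be checked the same way; a small C sketch (again my illustration) reproducing the numbers above:

    #include <stdio.h>

    /* Gustafson's Law: with n parallel units and serial fraction a,
     * the problem size that completes in the same wall-clock time
     * scales as n - a*(n - 1). */
    static double gustafson_scale(double a, double n)
    {
        return n - a * (n - 1.0);
    }

    int main(void)
    {
        printf("N = 4:   %.1fx\n", gustafson_scale(0.5, 4.0));    /* 2.5x  */
        printf("N = 100: %.1fx\n", gustafson_scale(0.5, 100.0));  /* 50.5x */
        return 0;
    }

Unlike Amdahl's fixed-size speedup, this scaled problem size grows without bound as N grows.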
CS 354 17 Coherent Work vs. Incoherent Work. Not all parallel work is created equal. Coherent work = "adjacent" chunks of work performing similar operations and memory accesses. Examples: camera rays, pixel shading. Allows sharing control of instruction execution; good for caches. Incoherent work = "adjacent" chunks of work performing dissimilar operations and memory accesses. Examples: reflection, shadow, and refraction rays. Bad for caches.
CS 354 18 Coherent vs. Incoherent Rays [figure]: coherent = camera rays; coherent = light rays; incoherent = reflected rays
CS 354 19 Keeping Work Coherent? How do we keep work coherent? Pipelines (careful: they can introduce latency). Data structures. SPMD (or SIMD) execution: Single Program, Multiple Data, to exploit Single Instruction, Multiple Data (SIMD) units. Bundling "adjacent" work elements helps cache and memory access efficiency.
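The cache side of coherence can be felt even on a CPU. A toy C sketch (my illustration, not from the slides) sums the same data through sequential versus shuffled indices; the shuffled walk defeats the cache much as incoherent rays defeat GPU memory coherence:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)

    int main(void)
    {
        float *data = malloc(N * sizeof *data);
        int   *idx  = malloc(N * sizeof *idx);
        for (int i = 0; i < N; i++) { data[i] = 1.0f; idx[i] = i; }

        /* Fisher-Yates shuffle to create an incoherent access pattern
         * (rand() is crude, but fine for a toy demonstration). */
        srand(0);
        for (int i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }

        clock_t t0 = clock();
        volatile float s1 = 0.0f;
        for (int i = 0; i < N; i++) s1 += data[i];        /* coherent   */
        clock_t t1 = clock();
        volatile float s2 = 0.0f;
        for (int i = 0; i < N; i++) s2 += data[idx[i]];   /* incoherent */
        clock_t t2 = clock();

        printf("coherent:   %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("incoherent: %.3fs\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(data); free(idx);
        return 0;
    }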
CS 354 21 A Simplified Graphics Pipeline [diagram]: Application → Application-OpenGL API boundary → Vertex batching & assembly → Triangle assembly → Triangle clipping → NDC to window space → Triangle rasterization → Fragment shading → Depth testing (depth buffer) → Color update → Framebuffer
CS 354 22 Another View of the Graphics Pipeline [block diagram, OpenGL 3.3]: 3D Application or Game → OpenGL API → (CPU-GPU boundary) → GPU Front End → Vertex Assembly → Primitive Assembly → Clipping, Setup, and Rasterization → Raster Operations. Programmable stages: Vertex Shader, Geometry Shader, and Fragment Shader programs. Memory paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, and Framebuffer Access through the Memory Interface. A legend distinguishes programmable from fixed-function units.
CS 354 23 Modeling Pipeline Efficiency. Rate of processing for sequential tasks: assume three tasks; run time is the sum of each operation's time, A + B + C. Rate of processing in a pipeline: assume three tasks, treated as stages; performance is gated by the slowest operation. Three operations in the pipeline, A, B, C: run time per item = max(A, B, C).
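A quick worked example, with hypothetical stage costs, of how pipelining turns a per-item sum into a per-item max once the pipeline is full:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double A = 2.0, B = 5.0, C = 3.0;   /* hypothetical stage costs */
        int items = 1000;

        double sequential = items * (A + B + C);   /* one item at a time */
        double slowest    = fmax(A, fmax(B, C));   /* max(A, B, C)       */
        /* Pipelined: first item fills the pipe, then one item completes
         * every 'slowest' time units. */
        double pipelined  = (A + B + C) + (items - 1) * slowest;

        printf("sequential: %.0f units\n", sequential);  /* 10000 */
        printf("pipelined:  %.0f units\n", pipelined);   /* 5005  */
        return 0;
    }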
CS 354 24 Hardware Clocks: the heartbeat of hardware. Measured in frequency: Hertz (Hz) = cycles per second; megahertz, gigahertz = million, billion Hz. Faster clocks = faster computation and data transfer. So why not simply raise clocks? Higher clocks consume more power, and circuits are only rated to a maximum clock speed before becoming unreliable.
CS 354 25 Clock Domains. A given chip may have multiple clocks running. Three key domains (GPU-centric): Graphics clock, for fixed-function units; example uses: rasterization, texture filtering, blending; optimized for throughput, not latency; can often instance more units instead of raising clocks. Processor clock, for programmable shader units; example: shader instruction execution; generally higher than the graphics clock because optimized for latency rather than throughput. Memory clock, for talking to external memory; depends on the speed rating of the external memory. Other domains too: display clock, PCI-Express bus clock; generally not crucial to rendering performance.
CS 354 26 3D Pipeline Programmable Domains Run on Unified Hardware [block diagram]. Unified Streaming Processor Array (SPA) architecture means the same capabilities for all domains, plus tessellation and compute (not shown below). Same pipeline as before: GPU Front End → Vertex Assembly → Primitive Assembly → Clipping, Setup, and Rasterization → Raster Operations; the Vertex, Primitive, and Fragment Programs can be unified hardware! Memory paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access through the Memory Interface.
CS 354 27 Memory Bandwidth. Raw memory bandwidth depends on: Physical clock rate; example: 3 GHz. Memory bus width: 64-bit, 128-bit, 192-bit, 256-bit, 384-bit; wider buses are faster, but routing all those wires is more expensive. Signaling rate: double data rate (DDR) means signals are sent on both the rising and falling clock edges; often the logical memory clock rate includes the signaling rate. Computing raw memory bandwidth:
bandwidth = physicalClock × signalsPerClock × busWidth
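Plugging hypothetical numbers into the formula, say a 1.5 GHz physical clock, DDR signaling, and a 256-bit bus:

    #include <stdio.h>

    /* Raw bandwidth = physicalClock * signalsPerClock * busWidth.
     * The numbers below are assumed for illustration. */
    int main(void)
    {
        double physical_clock    = 1.5e9;       /* 1.5 GHz physical clock */
        double signals_per_clock = 2.0;         /* DDR: both clock edges  */
        double bus_width_bytes   = 256.0 / 8.0; /* 256-bit bus            */

        double bytes_per_sec = physical_clock * signals_per_clock
                             * bus_width_bytes;
        printf("raw bandwidth: %.0f GB/s\n", bytes_per_sec / 1e9); /* 96 */
        return 0;
    }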
CS 354 28 Latency vs. Throughput. Effective bandwidth is raw bandwidth reduced by memory utilization; it is unrealistic to expect 100% utilization, though GPUs generally achieve much higher utilization than CPUs. Trade-off: maximizing throughput (utilization) increases latency; minimizing latency reduces utilization.
CS 354 30 GeForce Peak Memory Bandwidth Trends [chart: gigabytes per second, 0 to 200, across GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, GeForce 7800 GTX; raw bandwidth vs. effective raw bandwidth with compression, each with an exponential trend line; 128-bit and 256-bit interface generations marked]
CS 354 31 Effective GPU Memory Bandwidth. Compression schemes: lossless depth and color compression (when multisampling); lossy texture compression (S3TC / DXTC), typically assuming 4:1 compression. Avoidance of useless work: early killing of fragments (Z cull); avoiding useless blending and texture fetches. Very clever memory controller designs: combining memory accesses for improved coherency; caches for texture fetches.
CS 354 32 Other Metrics: host bandwidth; vertex pulling; vertex transformation; triangle setup and rasterization; fragment shading rate; shader instruction rate; raster (blending) operation rate; early Z reject rate
CS 354 33 Kepler GeForce GTX 680 High-level Block Diagram [diagram]: 8 Streaming Multiprocessors (SMX); 1536 CUDA cores; 8 geometry units; 4 raster units; 128 texture units; 32 raster operations; 256-bit GDDR5 memory
CS 354 34 Kepler Streaming Multiprocessor [diagram; the full GPU has 8 more copies of this]
CS 354 35 Prior Generation Streaming Multiprocessor (SM). Multi-processor execution unit (Fermi) with 32 scalar processor cores. A warp is a unit of thread execution of up to 32 threads. Two workloads: graphics (vertex shader, tessellation, geometry shader, fragment shader) and compute.
CS 354 36 Power Gating. Computer architecture has hit the "power wall," so low-power operation is at a premium: battery-powered devices, thermal constraints, economic constraints. Power Management (PM) works to reduce power by lowering clocks when performance isn't required and by disabling hardware units, which avoids leakage.
CS 354 37 Scene Graph Labor. High-level division of scene graph labor into four pipeline stages. App (application): code that manipulates/modifies the scene graph in response to user input or other events. Isect (intersection): geometric queries such as collision detection or picking. Cull: traverse the scene graph to find the nodes to be rendered (best example: eliminate objects out of view) and optimize the ordering of nodes (sort objects to minimize graphics hardware state changes). Draw: communicate drawing commands to the hardware, generally through a graphics API (OpenGL or Direct3D). This division can map well to multi-processor CPU systems.
CS 354 38 App-cull-draw Threading [diagrams]: app-cull-draw processing on one CPU core vs. app-cull-draw processing on multiple CPUs; a threading sketch follows.
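One way such a division of labor might look in code: a minimal pthreads sketch of an app → cull → draw pipeline, with semaphores as hand-off points between stages (my illustration; real scene graph systems additionally double-buffer the per-frame data passed between stages, which the printf placeholders here stand in for):

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define FRAMES 5
    static sem_t app_done, cull_done;

    static void *cull_thread(void *arg)
    {
        (void)arg;
        for (int f = 0; f < FRAMES; f++) {
            sem_wait(&app_done);           /* wait for app stage of frame f  */
            printf("cull frame %d\n", f);  /* view-frustum cull, state sort  */
            sem_post(&cull_done);
        }
        return NULL;
    }

    static void *draw_thread(void *arg)
    {
        (void)arg;
        for (int f = 0; f < FRAMES; f++) {
            sem_wait(&cull_done);          /* wait for cull stage of frame f */
            printf("draw frame %d\n", f);  /* issue graphics API commands    */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t cull, draw;
        sem_init(&app_done, 0, 0);
        sem_init(&cull_done, 0, 0);
        pthread_create(&cull, NULL, cull_thread, NULL);
        pthread_create(&draw, NULL, draw_thread, NULL);

        for (int f = 0; f < FRAMES; f++) {
            printf("app frame %d\n", f);   /* update the scene graph */
            sem_post(&app_done);
        }
        pthread_join(cull, NULL);
        pthread_join(draw, NULL);
        return 0;
    }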
CS 354 39 Scene Graph Profiling. The scene graph should help provide insight into performance. Process statistics: what's going on? Time stamps. Database statistics: how complex is the scene in any given frame?
CS 354 40 Example: Depth Complexity Visualization. How many pixels are being rendered? Pixels can be rasterized by multiple objects; depth complexity is the average number of times a pixel or color sample is updated per frame. [figure: yellow and black indicate higher depth complexity]
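One classic way to measure depth complexity in OpenGL is to let the stencil buffer count how many fragments touch each pixel, then read the counts back. A sketch, assuming a context with a stencil buffer; drawScene is a hypothetical placeholder for the application's normal rendering:

    #include <GL/gl.h>

    extern void drawScene(void);  /* hypothetical: render the frame as usual */

    void measureDepthComplexity(int width, int height, GLubyte *counts)
    {
        glClearStencil(0);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT |
                GL_STENCIL_BUFFER_BIT);

        /* Increment the stencil value for every rasterized fragment,
         * whether it passes or fails the depth test. */
        glEnable(GL_STENCIL_TEST);
        glStencilFunc(GL_ALWAYS, 0, ~0u);
        glStencilOp(GL_KEEP, GL_INCR, GL_INCR);

        drawScene();

        glDisable(GL_STENCIL_TEST);

        /* Each stencil value is now that pixel's depth complexity. */
        glReadPixels(0, 0, width, height, GL_STENCIL_INDEX,
                     GL_UNSIGNED_BYTE, counts);
    }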
CS 354 41 Example: Heads-up Display of Statistics. Process statistics: how long is everything taking? Database statistics: what is being rendered? Overlaying statistics on the active scene is often valuable; dynamic update.
CS 354 42 Benchmarking. Synthetic benchmarks focus on rendering particular operations in isolation, e.g., what is the blended pixel performance? Application benchmarks try to reflect what a real application would do.
CS 354 43 Tips for Interactive Performance Analysis. Vary things you can control: change the window resolution, making it smaller and seeing whether performance improves. Null driver analysis: skip the actual rendering calls; what if the driver were "infinitely" fast? Use occlusion queries to monitor how many samples (pixels) actually get rendered (see the sketch after this slide). Keep data on the GPU: let the GPU do Direct Memory Access (DMA); keep from swapping textures and buffers, which is easier now that multi-gigabyte graphics cards are available.
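A sketch of the occlusion-query tip, assuming OpenGL 1.5+ headers or an extension loader; drawObject is a hypothetical placeholder for the application's draw call:

    #include <GL/gl.h>

    extern void drawObject(void);  /* hypothetical draw call */

    GLuint countSamplesPassed(void)
    {
        GLuint query, samples = 0;

        glGenQueries(1, &query);
        glBeginQuery(GL_SAMPLES_PASSED, query);
        drawObject();
        glEndQuery(GL_SAMPLES_PASSED);

        /* Blocks until the result is ready; in production you would poll
         * GL_QUERY_RESULT_AVAILABLE to avoid stalling the pipeline. */
        glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples);
        glDeleteQueries(1, &query);
        return samples;
    }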
CS 354 44 Next Class. Next lecture: surfaces; programmable tessellation. Reading: none. Project 4 is a simple ray tracer, due Wednesday, May 2, 2012.