Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)


Talk in the Beyond Programmable Shading course at Siggraph 2009

  • Concrete effects on the graphics pipeline. Include a few wishes & predictions of how we would like the GPU programming model to evolve.
  • FB1 started out 5 years ago, targeting the ”next-generation” consoles. While we developed the engine we also worked on BFBC1, which was the pilot project. After that shipped, we have been spending quite a bit of effort on a new version of the engine, where one of the major new things is full PC & DX11 support. So I think we have some interesting experiences on both the consoles and modern PCs. And no, I won’t talk about BF3.
  • These large-scale environments require heavy processing. Let’s go ahead and talk about jobs.
  • Better code structure! Gustafson’s Law. Fixed 33 ms/f.
  • 90% of this is a single job graph for the rendering of a frame. Rely on dynamic load-balancing. Task-parallel examples: terrain culling, particles, building entities. Data-parallel examples: particles. Have example of low-latency GPU job later on. Braided: Aaron introduced the term at this course last year.
  • We have a lot of jobs across the engine, too many to go through, so I chose to focus on some of the rendering jobs. Our intention is to move all of these to the GPU.
  • One of the first optimizations for multi-core that we did was to move all rendering dispatch to a separate thread. This includes all the draw calls and states that we set to D3D. This helps a lot but it doesn’t scale that well as we only utilize a single extra core. Gather
  • Software rasterization of occluders
  • Conservative
  • Want to rasterize on GPU, not CPU. CPU ⇔ GPU job dependencies.
  • Only visible triangles. Jon Olick also talked about Edge at the course last year, so I won’t go into that much detail about this. This is a huge optimization both for us and many other developers.
  • Initially very skeptical. Intrinsics problematic.
  • 280 instruction GS. StreamOut buffer/query management is difficult & buggy. Want to use CS.
  • Esp. heavy with MSAA. Parallel reduction probably faster than atomics.
  • MSAA texture read cost, UAV cost. Good: depth bounds on all HW.
  • Paper presented at Siggraph yesterday. Dynamic irregular workloads. Elegant model, holds a lot of promise.
  • OpenCL. RapidMind, GRAMPS.

    1. 1. Parallel Graphics in Frostbite – Current & Future<br />Johan Andersson<br />DICE<br />
    2. 2. Menu<br />Game engine CPU & GPU parallelism<br />Rendering techniques & systems – old & new<br />Mixed in with some future predictions & wishes<br />
    3. 3. Quick background<br />Frostbite 1.x [1][2][3]<br />Xbox 360, PS3, DX10<br />Battlefield: Bad Company (shipped)<br />Battlefield 1943 (shipped)<br />Battlefield: Bad Company 2<br />Frostbite 2 [4][5]<br />In development<br />Xbox 360, PS3<br />DX11 (10.0, 10.1, 11)<br />Disclaimer: Unless specified, pictures are from engine tests, not actual games<br />
    4. 4.
    5. 5.
    6. 6. Job-based parallelism<br />Must utilize all cores in the engine<br />Xbox 360: 6 HW threads<br />PS3: 2 HW threads + 6 great SPUs<br />PC: 2-8 HW threads <br />And many more coming<br />Divide up systems into Jobs<br />Async function calls with explicit inputs & outputs<br />Typically fully independent stateless functions<br />Makes it easier on PS3 SPU & in general<br />Job dependencies create job graph<br />All cores consume jobs<br />CELL processor – We like<br />
    7. 7. Frostbite CPU job graph<br />Frame job graph <br />from Frostbite 1 (PS3)<br />Build big job graphs<br />Batch, batch, batch<br />Mix CPU- & SPU-jobs <br />Future: Mix in low-latency GPU-jobs<br />Job dependencies determine:<br />Execution order <br />Sync points<br />Load balancing<br />I.e. the effective parallelism<br />Braided Parallelism* [6]<br />Intermixed task- & data-parallelism<br />* Still only 10 hits on google (yet!), but I like Aaron’s term<br />
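The way job dependencies determine execution order can be sketched in a few lines of C++. This is only an illustration, not Frostbite's scheduler: the `Job` struct, the job names and `scheduleJobs` are hypothetical, and a single loop stands in for all the cores that would consume ready jobs.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical job description: 'dependencies' holds the indices of the jobs
// that must finish before this one may start.
struct Job {
    const char* name;
    std::vector<std::size_t> dependencies;
};

// Returns one valid execution order (Kahn's algorithm). In a real engine every
// job in the ready queue could be picked up by whichever core is free.
std::vector<std::size_t> scheduleJobs(const std::vector<Job>& jobs) {
    std::vector<std::size_t> pending(jobs.size(), 0);
    std::vector<std::vector<std::size_t>> dependents(jobs.size());
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        pending[i] = jobs[i].dependencies.size();
        for (std::size_t dep : jobs[i].dependencies)
            dependents[dep].push_back(i);
    }

    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < jobs.size(); ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        const std::size_t job = ready.front();
        ready.pop();
        order.push_back(job);
        // Finishing a job may unblock the jobs that were waiting on it.
        for (std::size_t next : dependents[job])
            if (--pending[next] == 0) ready.push(next);
    }
    return order; // shorter than jobs.size() if the graph contains a cycle
}
```

Jobs that sit in the ready queue at the same time have no dependency on each other, which is where the effective parallelism comes from; a long dependency chain forces a sync point however many cores are available.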
    8. 8. Rendering jobs<br />Rendering systems are heavily divided up into jobs<br />Jobs:<br />Terrain geometry processing<br />Undergrowth generation [2]<br />Decal projection [3]<br />Particle simulation<br />Frustum culling<br />Occlusion culling<br />Occlusion rasterization<br />Command buffer generation<br />PS3: Triangle culling<br />Most will move to GPU<br />Eventually.. A few have already!<br />Mostly one-way data flow<br />I will talk about a couple of these..<br />
    9. 9. Parallel command buffer recording <br />Dispatch draw calls and state to multiple command buffers in parallel<br />Scales linearly with # cores<br />1500-4000 draw calls per frame<br />Reduces latency & improves performance<br />Important for all platforms, used on:<br />Xbox 360<br />PS3 (SPU-based)<br />PC DX11<br />Previously not possible on PC, but now in DX11...<br />
    10. 10. DX11 parallel dispatch<br />First class citizen in DX11<br />Killer feature for reducing CPU overhead & latency<br />~90% of our rendering dispatch job time is in D3D/driver<br />DX11 deferred device context per core <br />Together with dynamic resources (cbuffer/vbuffer) for usage on that deferred context<br />Renderer has list of all draw calls we want to do for each rendering “layer” of the frame<br />Split draw calls for each layer into chunks of ~256 and dispatch in parallel to the deferred contexts<br />Each chunk generates a command list<br />Render to immediate context & execute command lists<br />Profit! <br />Goal: close to linear scaling up to octa-core when we get full DX11 driver support (up to the IHVs now)<br />Future note: This is ”just” a stopgap measure until we evolve the GPU to be able to fully feed itself (hi LRB)<br />
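The chunking step above can be sketched without any graphics API. The `DrawChunk` struct and `splitIntoChunks` are hypothetical names for illustration. In D3D11 each chunk would be recorded on a deferred context, closed with `FinishCommandList`, and then replayed in order on the immediate context with `ExecuteCommandList`; that part is left out here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of draw-call indices within one rendering layer.
struct DrawChunk {
    std::size_t begin;
    std::size_t end;
};

// Splits a layer's draw calls into chunks of at most chunkSize (~256 on the
// slide). Each chunk can then be recorded independently of the others.
std::vector<DrawChunk> splitIntoChunks(std::size_t drawCount,
                                       std::size_t chunkSize = 256) {
    assert(chunkSize > 0);
    std::vector<DrawChunk> chunks;
    for (std::size_t begin = 0; begin < drawCount; begin += chunkSize)
        chunks.push_back({begin, std::min(begin + chunkSize, drawCount)});
    return chunks;
}
```

With the 1500-4000 draw calls per frame quoted earlier, this yields roughly 6 to 16 chunks per frame if everything were one layer, enough to keep several cores busy while preserving submission order.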
    11. 11. Occlusion culling<br />Problem: Buildings & env occlude large amounts of objects<br />Invisible objects still have to:<br />Update logic & animations<br />Generate command buffer<br />Processed on CPU & GPU<br />Difficult to implement full culling<br />Destructible buildings<br />Dynamic occludees<br />Difficult to precompute <br />GPU occlusion queries can be heavy to render<br />From Battlefield: Bad Company PS3<br />
    12. 12. Our solution:<br />Software occlusion rasterization<br />
    13. 13. Software occlusion culling<br />Rasterize coarse zbuffer on SPU/CPU<br />256x114 float<br />Good fit in SPU LS, but could be 16-bit<br />Low-poly occluder meshes<br />Manually conservative<br />100 m view distance<br />Max 10000 vertices/frame<br />Parallel SPU vertex & raster jobs<br />Cost: a few milliseconds<br />Then cull all objects against zbuffer<br /><ul><li>Before passed to all other systems = big savings
    14. 14. Screen-space bounding-box test</li></ul>Pictures & numbers from <br />Battlefield: Bad Company PS3<br />
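A minimal sketch of the screen-space bounding-box test against the coarse z-buffer follows. It assumes smaller depth means closer to the camera and a far-plane clear value of 1.0; `ScreenBox`, `makeClearedZBuffer` and `isPotentiallyVisible` are hypothetical names, and the occluder rasterization itself is not shown.

```cpp
#include <algorithm>
#include <vector>

// Coarse z-buffer resolution quoted on the slide.
constexpr int kWidth = 256;
constexpr int kHeight = 114;

// Hypothetical screen-space bounds of an object: an inclusive pixel rectangle
// plus the depth of its nearest point (smaller depth = closer to the camera).
struct ScreenBox {
    int x0, y0, x1, y1;
    float nearestDepth;
};

// A cleared buffer holds the far-plane depth (assumed to be 1.0) everywhere.
std::vector<float> makeClearedZBuffer() {
    return std::vector<float>(kWidth * kHeight, 1.0f);
}

// Conservative test: the object is culled only if every covered pixel holds an
// occluder that is strictly closer than the object's nearest point.
bool isPotentiallyVisible(const std::vector<float>& zbuffer, const ScreenBox& box) {
    const int x0 = std::max(box.x0, 0);
    const int y0 = std::max(box.y0, 0);
    const int x1 = std::min(box.x1, kWidth - 1);
    const int y1 = std::min(box.y1, kHeight - 1);
    for (int y = y0; y <= y1; ++y)
        for (int x = x0; x <= x1; ++x)
            if (box.nearestDepth <= zbuffer[y * kWidth + x])
                return true;
    return false; // fully hidden, or entirely off screen
}
```

Using only the nearest depth of the box keeps the test cheap and errs on the side of drawing an object that is actually hidden, never the other way around.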
    15. 15. GPU occlusion culling<br />Ideally want GPU rasterization & testing, but:<br />Occlusion queries introduce overhead & latency<br />Can be manageable, but far from ideal<br />Conditional rendering only helps GPU<br />Not CPU, frame memory or draw calls<br />Future 1: Low-latency extra GPU exec. context<br />Rasterization and testing done on GPU where it belongs<br />Lockstep with CPU, need to read back data within a few ms<br />Should be possible on LRB (latency?), want on all HW<br />Future 2: Move entire cull & rendering to ”GPU”<br />World rep., cull, systems, dispatch. End goal.<br />
    16. 16. PS3 geometry processing<br />Problem: Slow GPU triangle & vertex setup on PS3<br />Combined with unique situation with powerful & initially not fully utilized ”free” SPUs!<br />Solution: SPU triangle culling<br />Trade SPU time for GPU time<br />Cull all back faces, micro-triangles, out of frustum<br />Based on Sony’s PS3 EDGE library [7]<br />Also see Jon Olick’s talk from the course last year<br />5 SPU jobs process frame geometry in parallel<br />Output is new index buffer for each draw call<br />
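The idea of emitting a new, shorter index buffer per draw call can be sketched on the CPU as below. This is not the EDGE implementation: it assumes vertices are already in screen space with y up, that front faces wind counter-clockwise, and it leaves out micro-triangle culling. `ScreenVertex` and `cullTriangles` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ScreenVertex {
    float x, y;
};

// Returns a new index buffer containing only triangles that are front-facing
// (counter-clockwise, assumed convention) and not entirely outside the viewport.
std::vector<std::uint16_t> cullTriangles(const std::vector<ScreenVertex>& vertices,
                                         const std::vector<std::uint16_t>& indices,
                                         float viewWidth, float viewHeight) {
    std::vector<std::uint16_t> kept;
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        const ScreenVertex& a = vertices[indices[i]];
        const ScreenVertex& b = vertices[indices[i + 1]];
        const ScreenVertex& c = vertices[indices[i + 2]];

        // Twice the signed area; zero or negative means degenerate or back-facing.
        const float area2 = (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
        if (area2 <= 0.0f) continue;

        // Trivial reject: all three vertices beyond the same viewport edge.
        const bool outside =
            (a.x < 0.0f && b.x < 0.0f && c.x < 0.0f) ||
            (a.y < 0.0f && b.y < 0.0f && c.y < 0.0f) ||
            (a.x > viewWidth && b.x > viewWidth && c.x > viewWidth) ||
            (a.y > viewHeight && b.y > viewHeight && c.y > viewHeight);
        if (outside) continue;

        kept.push_back(indices[i]);
        kept.push_back(indices[i + 1]);
        kept.push_back(indices[i + 2]);
    }
    return kept;
}
```

Because only indices are rewritten, the vertex buffer stays untouched and the GPU simply never sees the culled triangles.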
    17. 17. Custom geometry processing<br />Software control opens up great flexibility and programmability! <br />Simple custom culling/processing that we’ve added:<br />Partition bounding box culling<br />Mesh part culling<br />Clip plane triangle trivial accept & reject<br />Triangle cull volumes (inverse clip planes)<br />Others are doing: Full skinning, morph targets, CLOD, cloth<br />Future wish: No forced/fixed vertex & geometry shaders<br />DIY compute shaders with fixed-func stages (tessellation and rasterization)<br />Software-controlled queuing of data between stages<br />To avoid always spilling out to memory<br />
    18. 18. Decal projection<br />Traditionally a CPU process<br />Relying on identical visual & physics representation <br />Or duplicated mesh data in CPU memory (on PC) <br /><ul><li>Consoles read visual mesh data directly
    19. 19. UMA! 
    20. 20. Project in SPU-jobs
    21. 21. Output VB/IB to GPU</li></li></ul><li>Decals through GS & StreamOut<br />Keep the computation & data on the GPU (DX10)<br />See GDC’09 ”Shadows & Decals – D3D10 techniques in Frostbite”, slides with complete source code online [4]<br />Process all mesh triangles with Geometry Shader<br />Test decal projection against the triangles<br />Setup per-triangle clip planes for intersecting tris<br />Output intersecting triangles using StreamOut<br />Issues:<br />StreamOut management<br />Drivers (not your standard GS usage)<br />Benefits:<br /><ul><li>CPU & GPU worlds separate
    22. 22. No CPU memory or upload
    23. 23. Huge decals + huge meshes</li></li></ul><li>Deferred lighting/shading<br />Traditional deferred shading:<br />Graphics pipeline rasterizes gbuffer for opaque surfaces<br />Normal, albedos, roughness<br />Light sources are rendered & accumulate lighting to a texture<br />Light volume or screen-space tile rendering<br />Combine shading & lighting for final output<br />Also see Wolfgang’s talk “Light Pre-Pass Renderer Mark III” from Monday for a wider description [8]<br />
    24. 24. Screen-space tile classification<br />Divide screen up into tiles and determine how many & which light sources intersect each tile<br />Only apply the visible light sources on pixels in each tile<br />Reduced BW & setup cost with multiple lights in single shader<br />Used in Naughty Dog’s Uncharted [9] and SCEE PhyreEngine [10]<br />Hmm, isn’t light classification per screen-space tile sort of similar to how a compute shader can work with 2D thread groups?<br />Answer: YES, except a CS can do everything in a single pass!<br />From ”The Technology of Uncharted&quot;. GDC’08 [9]<br />
    25. 25. CS-based deferred shading <br />Deferred shading using DX11 CS<br /><ul><li>Experimental implementation in Frostbite 2
    26. 26. Not production tested or optimized
    27. 27. Compute Shader 5.0
    28. 28. Assumption: No shadows (for now)</li></ul>New hybrid Graphics/Compute shading pipeline:<br />Graphics pipeline rasterizes gbuffers for opaque surfaces<br />Compute pipeline uses gbuffers, culls light sources, computes lighting & combines with shading<br />(multiple other variants also possible)<br />
    29. 29. CS requirements & setup<br />Input data is gbuffers, depth buffer & light constants<br />Output is fully composited & lit HDR texture<br />1 thread per pixel, 16x16 thread groups (aka tile)<br />Normal<br />Roughness<br />Texture2D&lt;float4&gt; gbufferTexture1 : register(t0);<br />Texture2D&lt;float4&gt; gbufferTexture2 : register(t1);<br />Texture2D&lt;float4&gt; gbufferTexture3 : register(t2);<br />Texture2D&lt;float4&gt; depthTexture : register(t3);<br />RWTexture2D&lt;float4&gt; outputTexture : register(u0);<br />#define BLOCK_SIZE 16<br />[numthreads(BLOCK_SIZE,BLOCK_SIZE,1)]<br />void csMain(<br /> uint3 groupId : SV_GroupID,<br /> uint3 groupThreadId : SV_GroupThreadID,<br /> uint groupIndex: SV_GroupIndex,<br /> uint3 dispatchThreadId : SV_DispatchThreadID)<br />{<br /> ...<br />}<br />Diffuse Albedo<br />Specular Albedo<br />
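With one thread per pixel and 16x16 thread groups, the number of groups to dispatch follows from the render-target size. The host-side arithmetic below is a sketch under that assumption; `GroupCount`, `groupsForResolution` and `isInsideTarget` are hypothetical helpers, and the resulting counts are what would be passed to `ID3D11DeviceContext::Dispatch`.

```cpp
#include <cstdint>

constexpr std::uint32_t kBlockSize = 16; // BLOCK_SIZE on the slide

struct GroupCount {
    std::uint32_t x, y;
};

// Rounds up so that partially covered tiles at the right and bottom edges
// still get a thread group.
GroupCount groupsForResolution(std::uint32_t width, std::uint32_t height) {
    return {(width + kBlockSize - 1) / kBlockSize,
            (height + kBlockSize - 1) / kBlockSize};
}

// Threads in edge tiles can fall outside the target and must skip their
// load and store; this is the check such a thread would perform.
bool isInsideTarget(std::uint32_t pixelX, std::uint32_t pixelY,
                    std::uint32_t width, std::uint32_t height) {
    return pixelX < width && pixelY < height;
}
```

At 1280x720 this gives 80x45 groups with no partial tiles, while 1920x1080 gives 120x68 groups where the last row of tiles is only half covered.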
    30. 30. CS steps 1-2<br />groupshared uint minDepthInt;<br />groupshared uint maxDepthInt;<br />// --- globals above, function below -------<br />float depth = <br /> depthTexture.Load(uint3(texCoord, 0)).r;<br />uint depthInt = asuint(depth);<br />minDepthInt = 0xFFFFFFFF;<br />maxDepthInt = 0;<br />GroupMemoryBarrierWithGroupSync();<br />InterlockedMin(minDepthInt, depthInt);<br />InterlockedMax(maxDepthInt, depthInt);<br />GroupMemoryBarrierWithGroupSync();<br />float minGroupDepth = asfloat(minDepthInt);<br />float maxGroupDepth = asfloat(maxDepthInt);<br />Load gbuffers & depth<br />Calculate min & max z in threadgroup / tile<br />Using InterlockedMin/Max on groupshared variable<br />Atomics only work on ints <br />But casting works (z is always +)<br />Optimization note: <br />Separate pass using parallel reduction with Gather to a small texture could be faster<br />Note to the future:<br />GPU already has similar values in HiZ/ZCull! <br />Can skip step 2 if we could resolve out min & max z to a texture directly<br />Min z looks just like the occlusion software rendering output<br />
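The casting trick works because the bit pattern of a non-negative IEEE 754 float, read as an unsigned integer, sorts in the same order as the float itself. The C++ sketch below mimics the step on the CPU, with `std::atomic` standing in for the groupshared variables and compare-exchange loops standing in for `InterlockedMin`/`InterlockedMax`; all function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Bit casts, the CPU counterpart of asuint() / asfloat().
inline std::uint32_t floatBits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
inline float bitsToFloat(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Integer atomic min/max built from compare-exchange loops.
inline void atomicMin(std::atomic<std::uint32_t>& target, std::uint32_t value) {
    std::uint32_t current = target.load();
    while (value < current && !target.compare_exchange_weak(current, value)) {}
}
inline void atomicMax(std::atomic<std::uint32_t>& target, std::uint32_t value) {
    std::uint32_t current = target.load();
    while (value > current && !target.compare_exchange_weak(current, value)) {}
}

struct DepthRange {
    float minDepth, maxDepth;
};

// Computes the min and max depth of one tile using integer atomics only.
// Expects at least one depth value.
DepthRange tileDepthRange(const std::vector<float>& depths) {
    std::atomic<std::uint32_t> minBits(0xFFFFFFFFu);
    std::atomic<std::uint32_t> maxBits(0u);
    for (float d : depths) {
        assert(!std::signbit(d)); // the ordering trick needs non-negative z
        atomicMin(minBits, floatBits(d));
        atomicMax(maxBits, floatBits(d));
    }
    return {bitsToFloat(minBits.load()), bitsToFloat(maxBits.load())};
}
```

The non-negativity requirement is the slide's "z is always +": a set sign bit would make the value compare as a very large unsigned integer and break the ordering.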
    31. 31. CS step 3 – Cull idea<br />Determine visible light sources for each tile<br />Cull all light sources against tile ”frustum”<br />Light sources can either naively be all light sources in the scene, or CPU frustum culled potentially visible light sources<br />Output for each tile is:<br /># of visible light sources<br />Index list of visible light sources<br />Example numbers from test scene<br />This is the key part of the algorithm and compute shader, so must try to be rather clever with the implementation!<br />Per-tile visible light count<br />(black = 0 lights, white = 40)<br />
    32. 32. CS step 3 – Cull implementation<br />struct Light<br />{<br /> float3 pos;<br /> float sqrRadius;<br /> float3 color;<br /> float invSqrRadius;<br />};<br />int lightCount;<br />StructuredBuffer&lt;Light&gt; lights;<br />groupshared uint visibleLightCount = 0;<br />groupshared uint visibleLightIndices[1024];<br />// ----- globals above, cont. function below -----------<br />uint threadCount = BLOCK_SIZE*BLOCK_SIZE; <br />uint passCount = (lightCount+threadCount-1) / threadCount;<br />for (uint passIt = 0; passIt &lt; passCount; ++passIt)<br />{<br /> uint lightIndex = passIt*threadCount + groupIndex;<br /> // prevent overrun by clamping to a last ”null” light<br /> lightIndex = min(lightIndex, lightCount); <br /> if (intersects(lights[lightIndex], tile))<br /> {<br /> uint offset;<br /> InterlockedAdd(visibleLightCount, 1, offset);<br /> visibleLightIndices[offset] = lightIndex;<br /> } <br />}<br />GroupMemoryBarrierWithGroupSync();<br />Each thread switches to process light sources instead of a pixel* <br />Wow, parallelism switcheroo!<br />256 light sources in parallel per tile<br />Multiple iterations for &gt;256 lights <br />Intersect light source & tile<br />Many variants dep. on accuracy requirements & performance<br />Tile min & max z is used as a shader ”depth bounds” test<br />For visible lights, append light index to index list<br />Atomic add to threadgroup shared memory. ”inlined stream compaction”<br />Prefix sum + stream compaction should be faster than atomics, but more limiting<br />Synchronize group & switch back to processing pixels<br />We now know which light sources affect the tile<br />*Your grandfather’s pixel shader can’t do that!<br />
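A CPU-side sketch of the cull loop may help show the pass structure and the atomic append. It is simplified on purpose: the `Light` struct here is a hypothetical reduction to view-space depth and radius, the intersection test uses only the tile's depth bounds (no side planes), and a bounds check replaces the slide's clamp to a trailing "null" light.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified light for illustration: view-space depth of its center and radius.
struct Light {
    float viewZ;
    float radius;
};

// Mirrors the groupshared count and index list of one tile.
struct TileLightList {
    std::atomic<std::uint32_t> count{0};
    std::uint32_t indices[1024];
};

// Number of iterations needed when threadCount lights are tested per pass.
std::uint32_t passCount(std::uint32_t lightCount, std::uint32_t threadCount) {
    return (lightCount + threadCount - 1) / threadCount;
}

// Tests every light against the tile's [minZ, maxZ] depth bounds and appends
// the indices of intersecting lights. The inner loop plays the role of the
// 256 threads of a group, which on the GPU run in parallel.
void cullLightsForTile(const std::vector<Light>& lights, float minZ, float maxZ,
                       TileLightList& out, std::uint32_t threadCount = 256) {
    const std::uint32_t lightCount = static_cast<std::uint32_t>(lights.size());
    const std::uint32_t passes = passCount(lightCount, threadCount);
    for (std::uint32_t pass = 0; pass < passes; ++pass) {
        for (std::uint32_t thread = 0; thread < threadCount; ++thread) {
            const std::uint32_t lightIndex = pass * threadCount + thread;
            if (lightIndex >= lightCount) continue; // bounds check, not a null light
            const Light& light = lights[lightIndex];
            const bool intersects = light.viewZ + light.radius >= minZ &&
                                    light.viewZ - light.radius <= maxZ;
            if (intersects) {
                // Atomic add reserves a unique slot, like InterlockedAdd.
                const std::uint32_t offset = out.count.fetch_add(1);
                if (offset < 1024) out.indices[offset] = lightIndex;
            }
        }
    }
}
```

On the GPU the append order depends on thread timing, so the index list is unordered; that is harmless here because lighting is accumulated by summation. If more than 1024 lights hit a tile, this sketch drops the extra indices while `count` keeps growing, so a consumer would need to clamp it.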
    33. 33. CS deferred shading final steps<br />Computed lighting<br />For each pixel, accumulate lighting from visible lights<br />Read from tile visible light index list in threadgroup shared memory<br />Combine lighting & shading albedos / parameters<br />Output is non-MSAA HDR texture<br />Render transparent surfaces on top<br />float3 diffuseLight = 0;<br />float3 specularLight = 0;<br />for (uint lightIt = 0; lightIt &lt; visibleLightCount; ++lightIt)<br />{<br /> uint lightIndex = visibleLightIndices[lightIt];<br /> Light light = lights[lightIndex]; <br /> evaluateAndAccumulateLight(<br /> light, <br /> gbufferParameters,<br /> diffuseLight,<br /> specularLight); <br />}<br />Combined final output <br />(not the best example)<br />
    34. 34. Example results<br />
    35. 35. Example: 25+ analytical specular highlights per pixel<br />
    36. 36.
    37. 37. Compute Shader-based <br />Deferred Shading demo<br />
    38. 38. CS-based deferred shading <br />The Good:<br />Constant & absolute minimal bandwidth<br />Read gbuffers & depth once!<br />Doesn’t need intermediate light buffers<br />Can take a lot of memory with HDR, MSAA & color specular<br />Scales up to huge amount of big overlapping light sources!<br />Fine-grained culling (16x16)<br />Only ALU cost, good future scaling<br />Could be useful for accumulating VPLs<br />The Bad:<br />Requires DX11 HW (duh)<br />CS 4.0/4.1 difficult due to atomics & scattered groupshared writes<br />Culling overhead for small light sources<br />Can accumulate them using standard light volume rendering<br />Or separate CS for tile-classific.<br />Potentially performance<br />MSAA texture loads / UAV writing might be slower than standard PS<br />The Ugly:<br /><ul><li>Can’t output to MSAA texture
    39. 39. DX11 CS UAV limitation. </li></li></ul><li>Future programming model<br />Queues as compute shader streaming in/outs<br />In addition to buffers/textures/UAVs<br />Simple & expressive model supporting irregular workloads<br />Keeps data on chip, supports variable sized caches & cores<br />Build your pipeline of stages with queues between<br />Shader & fixed function stages (sampler, rasterizer, tessellator, Zcull)<br />Developers can make the GPU feed itself!<br />GRAMPS model example [11]<br />
    40. 40. What else do we want to do?<br />WARNING: Overly enthusiastic and non all-knowing game developer ranting<br />Mixed resolution MSAA particle rendering <br />Depth test per sample, shade per quarter pixel, and depth-aware upsample directly in shader<br />Demand-paged procedural texturing / compositing<br />Zero latency “texture shaders”<br />Pre-tessellation coarse rasterization for z-culling of patches<br />Potential optimization in scenes of massive geometric overdraw<br />Can be coupled with recursive schemes<br />Deferred shading w/ many & arbitrary BRDFs/materials<br />Queue up pixels of multiple materials for coherent processing in own shader<br />Instead of incoherent screen-space dynamic flow control<br />Latency-free lens flares <br />Finally! No false/late occlusion<br />Occlusion query results written to CB and used in shader to cull & scale<br />And much much more...<br />
    41. 41. Conclusions<br />A good parallelization model is key for good game engine performance (duh)<br />Job graphs of mixed task- & data-parallel CPU & SPU jobs work well for us<br />SPU-jobs do the heavy lifting<br />Hybrid compute/graphics pipelines look promising<br />Efficient interoperability is super important (DX11 is great)<br />Deferred lighting & shading in CS is just the start<br />Want a user-defined streaming pipeline model<br />Expressive & extensible hybrid pipelines with queues<br />Focus on the data flow & patterns instead of doing sequential memory passes<br />
    42. 42. Acknowledgements<br />DICE & Frostbite team<br />Nicolas Thibieroz, Mark Leather<br />Miguel Sainz, Yury Uralsky<br />Kayvon Fatahalian<br />Matt Swoboda, Pål-Kristian Engstad <br />Timothy Farrar, Jake Cannell<br />
    43. 43. References<br />[1] Johan Andersson. ”Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007.<br />[2] Natalya Tatarchuk & Johan Andersson. ”Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007.<br />[3] Johan Andersson. ”Terrain Rendering in Frostbite using Procedural Shader Splatting”. Siggraph 2007.<br />[4] Daniel Johansson & Johan Andersson. “Shadows & Decals – D3D10 techniques from Frostbite”. GDC 2009.<br />[5] Bill Bilodeau & Johan Andersson. “Your Game Needs Direct3D 11, So Get Started Now!”. GDC 2009.<br />[6] Aaron Lefohn. ”Programming Larrabee: Beyond Data Parallelism” – ”Beyond Programmable Shading” course. Siggraph 2008.<br />[7] Mark Cerny, Jon Olick, Vince Diesi. “PLAYSTATION Edge”. GDC 2007.<br />[8] Wolfgang Engel. “Light Pre-Pass Renderer Mark III” - “Advances in Real-Time Rendering in 3D Graphics and Games” course notes. Siggraph 2009.<br />[9] Pål-Kristian Engstad. &quot;The Technology of Uncharted: Drake’s Fortune&quot;. GDC 2008.<br />[10] Matt Swoboda. “Deferred Lighting and Post Processing on PLAYSTATION®3”. GDC 2009.<br />[11] Kayvon Fatahalian et al. ”GRAMPS: A Programming Model for Graphics Pipelines”. ACM Transactions on Graphics, January 2009.<br />[12] Jared Hoberock et al. ”Stream Compaction for Deferred Shading”<br />
    44. 44. We are hiring senior developers<br />
    45. 45. Questions?<br /><ul><li>Please fill in the course evaluation at:
    46. 46. You could win a Siggraph’09 mug (yey!)
    47. 47. One winner per course, notified by email in the evening</li></ul>Email:<br />Blog:<br />Twitter:<br /><br />
    48. 48. Bonus slides<br />
    49. 49. Timing view<br />Real-time in-game overlay <br />See CPU, SPU & GPU timing events & effective parallelism<br />What we use to reduce sync-points & optimize load balancing between all processors<br />GPU timing through event queries<br />AFR-handling rather shaky, but works!*<br />Example: PC, 4 CPU cores, 2 GPUs in AFR<br />*At least on AMD 4870x2 after some alt-tab action<br />