Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)


    Notes on slide 1

    Concrete effects on the graphics pipeline. Include a few wishes & predictions of how we would like the GPU programming model to evolve.

    FB1 started out 5 years ago, targeting the ”next-generation” consoles. While we developed the engine we also worked on BFBC1, which was the pilot project. After that shipped, we have been spending quite a bit of effort on a new version of the engine, of which one of the major new things is full PC & DX11 support. So I think we have some interesting experiences on both the consoles and modern PCs. And no, I won’t talk about BF3.

    These large-scale environments require heavy processing. Let’s go ahead and talk about jobs.

    Better code structure! Gustafson’s Law. Fixed 33 ms/f.

    90% of this is a single job graph for the rendering of a frame. Rely on dynamic load-balancing. Task-parallel examples: terrain culling, particles, building entities. Data-parallel examples: particles. Have an example of a low-latency GPU job later on. Braided: Aaron introduced the term at this course last year.

    We have a lot of jobs across the engine, too many to go through, so I chose to focus more on some of the rendering. Our intention is to move all of these to the GPU.

    One of the first optimizations for multi-core that we did was to move all rendering dispatch to a separate thread. This includes all the draw calls and states that we set to D3D. This helps a lot, but it doesn’t scale that well as we only utilize a single extra core. Gather

    Software rasterization of occluders

    Conservative

    Want to rasterize on GPU, not CPU. CPU ⇔ GPU job dependencies.

    Only visible triangles. Jon Olick also talked about Edge at the course last year, so I won’t go into that much detail about this. This is a huge optimization both for us and many other developers.

    Initially very skeptical. Intrinsics problematic.

    280 instruction GS. StreamOut buffer/query management is difficult & buggy. Want to use CS.

    Esp. heavy with MSAA. Parallel reduction probably faster than atomics.

    MSAA texture read cost, UAV cost. Good: depth bounds on all HW.

    Paper presented at Siggraph yesterday. Dynamic irregular workloads. Elegant model, holds a lot of promise.

    OpenCL. RapidMind, GRAMPS.


    Parallel Graphics in Frostbite - Current & Future (Siggraph 2009) - Presentation Transcript

    1. Parallel Graphics in Frostbite – Current & Future
      Johan Andersson
      DICE
    2. Menu
      Game engine CPU & GPU parallelism
      Rendering techniques & systems – old & new
      Mixed in with some future predictions & wishes
    3. Quick background
      Frostbite 1.x [1][2][3]
      Xbox 360, PS3, DX10
      Battlefield: Bad Company (shipped)
      Battlefield 1943 (shipped)
      Battlefield: Bad Company 2
      Frostbite 2 [4][5]
      In development
      Xbox 360, PS3
      DX11 (10.0, 10.1, 11)
      Disclaimer: Unless specified, pictures are from engine tests, not actual games
    4. Job-based parallelism
      Must utilize all cores in the engine
      Xbox 360: 6 HW threads
      PS3: 2 HW threads + 6 great SPUs
      PC: 2-8 HW threads
      And many more coming
      Divide up systems into Jobs
      Async function calls with explicit inputs & outputs
      Typically fully independent stateless functions
      Makes it easier on PS3 SPU & in general
      Job dependencies create job graph
      All cores consume jobs
      CELL processor – We like
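The idea of jobs with explicit inputs and outputs forming a graph can be sketched in a few lines of C++. This is only an illustration, not Frostbite's job system: the `JobGraph` class, its method names and the single-threaded scheduler are assumptions made for the example. In an engine the ready queue would be consumed by worker threads on all cores (and SPUs) instead of a single loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical minimal job graph; not Frostbite's actual job system.
class JobGraph {
public:
    // Adds a job and returns its handle.
    std::size_t add(std::function<void()> fn) {
        Job job;
        job.run = std::move(fn);
        jobs_.push_back(std::move(job));
        return jobs_.size() - 1;
    }

    // Declares that job `after` consumes the output of job `before`.
    void depend(std::size_t after, std::size_t before) {
        assert(after < jobs_.size() && before < jobs_.size());
        jobs_[before].dependents.push_back(after);
        ++jobs_[after].inputCount;
    }

    // Single-threaded reference scheduler: repeatedly runs a job whose
    // inputs are all finished. Returns the number of jobs executed,
    // which is smaller than the job count if the graph has a cycle.
    std::size_t runAll() const {
        std::vector<std::size_t> pending(jobs_.size());
        std::queue<std::size_t> ready;
        for (std::size_t i = 0; i < jobs_.size(); ++i) {
            pending[i] = jobs_[i].inputCount;
            if (pending[i] == 0) ready.push(i);
        }
        std::size_t executed = 0;
        while (!ready.empty()) {
            const std::size_t id = ready.front();
            ready.pop();
            jobs_[id].run();
            ++executed;
            for (std::size_t next : jobs_[id].dependents)
                if (--pending[next] == 0) ready.push(next);
        }
        return executed;
    }

private:
    struct Job {
        std::function<void()> run;
        std::vector<std::size_t> dependents;  // jobs unlocked by this one
        std::size_t inputCount = 0;           // number of dependencies
    };
    std::vector<Job> jobs_;
};
```

Two independent jobs (say culling and particle simulation) can be added first, and a third job that depends on both is only run after they have finished. The dependencies, not the order of `add` calls, decide execution order and sync points.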
    5. Frostbite CPU job graph
      Frame job graph
      from Frostbite 1 (PS3)
      Build big job graphs
      Batch, batch, batch
      Mix CPU- & SPU-jobs
      Future: Mix in low-latency GPU-jobs
      Job dependencies determine:
      Execution order
      Sync points
      Load balancing
      I.e. the effective parallelism
      Braided Parallelism* [6]
      Intermixed task- & data-parallelism
      * Still only 10 hits on google (yet!), but I like Aaron’s term
    6. Rendering jobs
      Rendering systems are heavily divided up into jobs
      Jobs:
      Terrain geometry processing
      Undergrowth generation [2]
      Decal projection [3]
      Particle simulation
      Frustum culling
      Occlusion culling
      Occlusion rasterization
      Command buffer generation
      PS3: Triangle culling
      Most will move to GPU
      Eventually.. A few have already!
      Mostly one-way data flow
      I will talk about a couple of these..
    7. Parallel command buffer recording
      Dispatch draw calls and state to multiple command buffers in parallel
      Scales linearly with # cores
      1500-4000 draw calls per frame
      Reduces latency & improves performance
      Important for all platforms, used on:
      Xbox 360
      PS3 (SPU-based)
      PC DX11
      Previously not possible on PC, but now in DX11...
    8. DX11 parallel dispatch
      First class citizen in DX11
      Killer feature for reducing CPU overhead & latency
      ~90% of our rendering dispatch job time is in D3D/driver
      DX11 deferred device context per core
      Together with dynamic resources (cbuffer/vbuffer) for usage on that deferred context
      Renderer has list of all draw calls we want to do for each rendering “layer” of the frame
      Split draw calls for each layer into chunks of ~256 and dispatch in parallel to the deferred contexts
      Each chunk generates a command list
      Render to immediate context & execute command lists
      Profit!
      Goal: close to linear scaling up to octa-core when we get full DX11 driver support (up to the IHVs now)
      Future note: This is ”just” a stopgap measure until we evolve the GPU to be able to fully feed itself (hi LRB)
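The chunking scheme above can be sketched without any D3D calls. In this C++ illustration `DrawCall`, `CommandList`, `recordChunk`, `recordLayer` and `executeLists` are hypothetical stand-ins: a command list is just a vector of draw IDs, and the chunks are recorded in a serial loop. The point is that each chunk touches no shared mutable state, so every `recordChunk` call could be issued as its own job on its own deferred context.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct DrawCall { int id; };
using CommandList = std::vector<int>;  // stand-in for a recorded command list

// Records one chunk of draw calls. Chunks share no mutable state, so each
// call could run as a separate job on a separate deferred context.
CommandList recordChunk(const std::vector<DrawCall>& draws,
                        std::size_t begin, std::size_t end) {
    CommandList list;
    for (std::size_t i = begin; i < end; ++i)
        list.push_back(draws[i].id);  // "record" the draw call
    return list;
}

// Splits a layer's draw calls into chunks and records one list per chunk.
std::vector<CommandList> recordLayer(const std::vector<DrawCall>& draws,
                                     std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::vector<CommandList> lists;
    for (std::size_t begin = 0; begin < draws.size(); begin += chunkSize) {
        const std::size_t end = std::min(begin + chunkSize, draws.size());
        lists.push_back(recordChunk(draws, begin, end));  // serial in this sketch
    }
    return lists;
}

// Replays the command lists in chunk order, as the immediate context would.
std::vector<int> executeLists(const std::vector<CommandList>& lists) {
    std::vector<int> submitted;
    for (const CommandList& list : lists)
        submitted.insert(submitted.end(), list.begin(), list.end());
    return submitted;
}
```

With 1000 draw calls and a chunk size of 256 this yields four command lists (the last holding 232 draws), and replaying them in chunk order preserves the original draw order.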
    9. Occlusion culling
      Problem: Buildings & env occlude large amounts of objects
      Invisible objects still have to:
      Update logic & animations
      Generate command buffer
      Processed on CPU & GPU
      Difficult to implement full culling
      Destructible buildings
      Dynamic occludees
      Difficult to precompute
      GPU occlusion queries can be heavy to render
      From Battlefield: Bad Company PS3
    10. Our solution:
      Software occlusion rasterization
    11. Software occlusion culling
      Rasterize coarse zbuffer on SPU/CPU
      256x114 float
      Good fit in SPU LS, but could be 16-bit
      Low-poly occluder meshes
      Manually conservative
      100 m view distance
      Max 10000 vertices/frame
      Parallel SPU vertex & raster jobs
      Cost: a few milliseconds
      Then cull all objects against zbuffer
      • Before passed to all other systems = big savings
      • Screen-space bounding-box test
      Pictures & numbers from
      Battlefield: Bad Company PS3
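A rough C++ sketch of the cull step against the coarse zbuffer follows. It is simplified on purpose and the names (`CoarseZBuffer`, `drawOccluderRect`, `isVisible`) are made up for the example: occluders are drawn as constant-depth rectangles instead of triangles, depth is assumed to grow with distance, and the object test is the screen-space bounding-box test using the object's nearest depth.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Coarse depth buffer; a smaller depth value is closer to the camera.
struct CoarseZBuffer {
    int width;
    int height;
    std::vector<float> depth;  // cleared to the far plane (1.0)

    CoarseZBuffer(int w, int h)
        : width(w), height(h), depth(static_cast<std::size_t>(w) * h, 1.0f) {
        assert(w > 0 && h > 0);
    }

    // Draws an occluder as an inclusive screen-space rectangle at a constant
    // depth. Pass the farthest depth of the occluder to stay conservative.
    void drawOccluderRect(int x0, int y0, int x1, int y1, float z) {
        x0 = std::max(x0, 0);
        y0 = std::max(y0, 0);
        x1 = std::min(x1, width - 1);
        y1 = std::min(y1, height - 1);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x) {
                float& d = depth[static_cast<std::size_t>(y) * width + x];
                d = std::min(d, z);
            }
    }

    // Bounding-box test: the object is visible if any covered texel is at
    // least as far away as the object's nearest depth. A box that lies
    // entirely off screen covers no texels and is reported as not visible.
    bool isVisible(int x0, int y0, int x1, int y1, float nearestZ) const {
        x0 = std::max(x0, 0);
        y0 = std::max(y0, 0);
        x1 = std::min(x1, width - 1);
        y1 = std::min(y1, height - 1);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                if (nearestZ <= depth[static_cast<std::size_t>(y) * width + x])
                    return true;
        return false;
    }
};
```

An object whose box is fully covered by a nearer occluder is culled before it reaches any other system, while an object that sticks out past the occluder, or sits in front of it, is kept.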
    12. GPU occlusion culling
      Ideally want GPU rasterization & testing, but:
      Occlusion queries introduce overhead & latency
      Can be manageable, but far from ideal
      Conditional rendering only helps GPU
      Not CPU, frame memory or draw calls
      Future 1: Low-latency extra GPU exec. context
      Rasterization and testing done on GPU where it belongs
      Lockstep with CPU, need to read back data within a few ms
      Should be possible on LRB (latency?), want on all HW
      Future 2: Move entire cull & rendering to ”GPU”
      World rep., cull, systems, dispatch. End goal.
    13. PS3 geometry processing
      Problem: Slow GPU triangle & vertex setup on PS3
      Combined with unique situation with powerful & initially not fully utilized ”free” SPUs!
      Solution: SPU triangle culling
      Trade SPU time for GPU time
      Cull all back faces, micro-triangles, out of frustum
      Based on Sony’s PS3 EDGE library [7]
      Also see Jon Olick’s talk from the course last year
      5 SPU jobs process frame geometry in parallel
      Output is new index buffer for each draw call
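As a sketch of the "cull triangles, output a new index buffer" idea, here is a small C++ version. It is not the EDGE implementation: `cullTriangles` and its parameters are hypothetical, vertices are assumed to be already projected to screen space with counter-clockwise front faces, frustum culling is left out, and a simple area threshold stands in for a real micro-triangle test (which would check whether the triangle covers any sample point).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec2 { float x, y; };  // post-projection screen-space position

// Twice the signed area; positive for counter-clockwise winding.
float signedArea2(Vec2 a, Vec2 b, Vec2 c) {
    return (b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y);
}

// Returns a new index buffer containing only front-facing triangles whose
// doubled area is at least minArea2. Back faces have a negative area, so
// the same comparison rejects them as well.
std::vector<std::uint16_t> cullTriangles(const std::vector<Vec2>& verts,
                                         const std::vector<std::uint16_t>& indices,
                                         float minArea2) {
    assert(indices.size() % 3 == 0);
    std::vector<std::uint16_t> out;
    out.reserve(indices.size());
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        const float area2 = signedArea2(verts[indices[i]],
                                        verts[indices[i + 1]],
                                        verts[indices[i + 2]]);
        if (area2 < minArea2) continue;  // back-facing or too small
        out.push_back(indices[i]);
        out.push_back(indices[i + 1]);
        out.push_back(indices[i + 2]);
    }
    return out;
}
```

The GPU then draws with the compacted index buffer and never sees the rejected triangles, which is the trade of SPU time for GPU time described above.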
    14. Custom geometry processing
      Software control opens up great flexibility and programmability!
      Simple custom culling/processing that we’ve added:
      Partition bounding box culling
      Mesh part culling
      Clip plane triangle trivial accept & reject
      Triangle cull volumes (inverse clip planes)
      Others are doing: Full skinning, morph targets, CLOD, cloth
      Future wish: No forced/fixed vertex & geometry shaders
      DIY compute shaders with fixed-func stages (tessellation and rasterization)
      Software-controlled queuing of data between stages
      To avoid always spilling out to memory
    15. Decal projection
      Traditionally a CPU process
      Relying on identical visual & physics representation 
      Or duplicated mesh data in CPU memory (on PC) 
      • Consoles read visual mesh data directly
      • UMA! 
      • Project in SPU-jobs
      • Output VB/IB to GPU
    16. Decals through GS & StreamOut
      Keep the computation & data on the GPU (DX10)
      See GDC’09 ”Shadows & Decals – D3D10 techniques in Frostbite”, slides with complete source code online [4]
      Process all mesh triangles with Geometry Shader
      Test decal projection against the triangles
      Setup per-triangle clip planes for intersecting tris
      Output intersecting triangles using StreamOut
      Issues:
      StreamOut management
      Drivers (not your standard GS usage)
      Benefits:
      • CPU & GPU worlds separate
      • No CPU memory or upload
      • Huge decals + huge meshes
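The per-triangle test and stream-out step can be illustrated on the CPU. The C++ below is a guess at the general shape, not the shader from [4]: it assumes triangles have already been transformed into decal space where the decal volume is the box [-1, 1]^3, uses a conservative "all vertices outside the same plane" rejection, and emulates StreamOut by appending to a vector. `Triangle`, `mayTouchDecal` and `streamOutDecalTriangles` are names invented for the sketch.

```cpp
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v[3]; };  // vertices in decal space (assumed)

// True if all three coordinates lie beyond the same face of [-1, 1].
static bool allOutside(float a, float b, float c) {
    return (a > 1.0f && b > 1.0f && c > 1.0f) ||
           (a < -1.0f && b < -1.0f && c < -1.0f);
}

// Conservative test: rejects a triangle only when it is entirely outside
// one of the six box planes. Some non-intersecting triangles pass, which
// the per-triangle clip planes would then cut away.
bool mayTouchDecal(const Triangle& t) {
    return !allOutside(t.v[0].x, t.v[1].x, t.v[2].x) &&
           !allOutside(t.v[0].y, t.v[1].y, t.v[2].y) &&
           !allOutside(t.v[0].z, t.v[1].z, t.v[2].z);
}

// Emulates the GS + StreamOut pass: keeps only potentially intersecting
// triangles of the mesh.
std::vector<Triangle> streamOutDecalTriangles(const std::vector<Triangle>& mesh) {
    std::vector<Triangle> out;
    for (const Triangle& t : mesh)
        if (mayTouchDecal(t)) out.push_back(t);
    return out;
}
```

Because the output size depends on how many triangles pass the test, the real GPU version needs StreamOut buffer and query management, which is where the issues listed above come from.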
    17. Deferred lighting/shading
      Traditional deferred shading:
      Graphics pipeline rasterizes gbuffer for opaque surfaces
      Normal, albedos, roughness
      Light sources are rendered & accumulate lighting to a texture
      Light volume or screen-space tile rendering
      Combine shading & lighting for final output
      Also see Wolfgang’s talk “Light Pre-Pass Renderer Mark III” from Monday for a wider description [8]
    18. Screen-space tile classification
      Divide screen up into tiles and determine how many & which light sources intersect each tile
      Only apply the visible light sources on pixels in each tile
      Reduced BW & setup cost with multiple lights in single shader
      Used in Naughty Dog’s Uncharted [9] and SCEE PhyreEngine [10]
      Hmm, isn’t light classification per screen-space tile sort of similar to how a compute shader can work with 2D thread groups?
      Answer: YES, except a CS can do everything in a single pass!
      From ”The Technology of Uncharted”. GDC’08 [9]
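A CPU-side sketch of tile classification, assuming each light has already been reduced to an inclusive screen-space bounding rectangle in pixels. The function name `classifyLights` and the `ScreenRect` type are hypothetical, and the implementations referenced above may classify quite differently (for example by rendering light volumes).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ScreenRect { int x0, y0, x1, y1; };  // inclusive pixel bounds of a light

// Returns, for every tile (row-major), the indices of the lights whose
// screen-space rectangle overlaps that tile.
std::vector<std::vector<int>> classifyLights(int width, int height, int tileSize,
                                             const std::vector<ScreenRect>& lights) {
    assert(width > 0 && height > 0 && tileSize > 0);
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> perTile(static_cast<std::size_t>(tilesX) * tilesY);
    for (std::size_t i = 0; i < lights.size(); ++i) {
        const ScreenRect& r = lights[i];
        // Skip lights that are entirely off screen.
        if (r.x1 < 0 || r.y1 < 0 || r.x0 >= width || r.y0 >= height) continue;
        const int tx0 = std::max(r.x0, 0) / tileSize;
        const int ty0 = std::max(r.y0, 0) / tileSize;
        const int tx1 = std::min(r.x1, width - 1) / tileSize;
        const int ty1 = std::min(r.y1, height - 1) / tileSize;
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                perTile[static_cast<std::size_t>(ty) * tilesX + tx]
                    .push_back(static_cast<int>(i));
    }
    return perTile;
}
```

At 1280x720 with 16-pixel tiles this gives 80x45 tiles. Each tile's list then tells a shader how many and which lights to apply to the pixels of that tile.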
    19. CS-based deferred shading
      Deferred shading using DX11 CS
      • Experimental implementation in Frostbite 2
      • Not production tested or optimized
      • Compute Shader 5.0
      • Assumption: No shadows (for now)
      New hybrid Graphics/Compute shading pipeline:
      Graphics pipeline rasterizes gbuffers for opaque surfaces
      Compute pipeline uses gbuffers, culls light sources, computes lighting & combines with shading
      (multiple other variants also possible)
    20. CS requirements & setup
      Input data is gbuffers, depth buffer & light constants
      Output is fully composited & lit HDR texture
      1 thread per pixel, 16x16 thread groups (aka tile)
      Texture2D<float4> gbufferTexture1 : register(t0);
      Texture2D<float4> gbufferTexture2 : register(t1);
      Texture2D<float4> gbufferTexture3 : register(t2);
      Texture2D<float4> depthTexture : register(t3);
      RWTexture2D<float4> outputTexture : register(u0);
      #define BLOCK_SIZE 16
      [numthreads(BLOCK_SIZE,BLOCK_SIZE,1)]
      void csMain(
          uint3 groupId : SV_GroupID,
          uint3 groupThreadId : SV_GroupThreadID,
          uint groupIndex : SV_GroupIndex,
          uint3 dispatchThreadId : SV_DispatchThreadID)
      {
          ...
      }
      Pictures: gbuffer channels (normal, roughness, diffuse albedo, specular albedo)
    21. CS steps 1-2
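      groupshared uint minDepthInt;
      groupshared uint maxDepthInt;
      // --- globals above, function below -------
      // Assumption: one thread per pixel, so the pixel coordinate is the dispatch thread ID
      uint2 texCoord = dispatchThreadId.xy;
      float depth = depthTexture.Load(uint3(texCoord, 0)).r;
      // Non-negative floats keep their ordering when reinterpreted as uints
      uint depthInt = asuint(depth);
      // Reset the shared bounds before any thread updates them
      minDepthInt = 0xFFFFFFFF;
      maxDepthInt = 0;
      GroupMemoryBarrierWithGroupSync();
      InterlockedMin(minDepthInt, depthInt);
      InterlockedMax(maxDepthInt, depthInt);
      // Wait until every thread in the group has contributed its depth
      GroupMemoryBarrierWithGroupSync();
      float minGroupDepth = asfloat(minDepthInt);
      float maxGroupDepth = asfloat(maxDepthInt);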
      Load gbuffers & depth
      Calculate min & max z in threadgroup / tile
      Using InterlockedMin/Max on groupshared variable
      Atomics only work on ints 
      But casting works (z is always +)
      Optimization note:
      Separate pass using parallel reduction with Gather to a small texture could be faster
      Note to the future:
      GPU already has similar values in HiZ/ZCull!
      Can skip step 2 if we could resolve out min & max z to a texture directly
      Min z looks just like the occlusion software rendering output
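The "casting works" remark can be checked on the CPU. The C++ sketch below reinterprets non-negative floats as 32-bit unsigned integers (the equivalent of `asuint`), and uses compare-and-swap loops as stand-ins for `InterlockedMin` / `InterlockedMax`. The helper names are made up for this example, and the loop over depths is serial here where a GPU would run one thread per pixel.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// Bit-level reinterpretation, like HLSL asuint / asfloat.
std::uint32_t asUint(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
float asFloat(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// CAS loops standing in for InterlockedMin / InterlockedMax.
void atomicMin(std::atomic<std::uint32_t>& target, std::uint32_t value) {
    std::uint32_t current = target.load();
    while (value < current && !target.compare_exchange_weak(current, value)) {}
}
void atomicMax(std::atomic<std::uint32_t>& target, std::uint32_t value) {
    std::uint32_t current = target.load();
    while (value > current && !target.compare_exchange_weak(current, value)) {}
}

struct DepthBounds { float minZ, maxZ; };

// Computes the min and max depth of one tile using integer atomics only.
DepthBounds tileDepthBounds(const float* depths, std::size_t count) {
    assert(count > 0);
    std::atomic<std::uint32_t> minBits{0xFFFFFFFFu};
    std::atomic<std::uint32_t> maxBits{0u};
    for (std::size_t i = 0; i < count; ++i) {
        assert(depths[i] >= 0.0f);  // the ordering trick needs z >= 0
        atomicMin(minBits, asUint(depths[i]));
        atomicMax(maxBits, asUint(depths[i]));
    }
    return DepthBounds{asFloat(minBits.load()), asFloat(maxBits.load())};
}
```

The trick relies on the IEEE 754 layout: for non-negative values the exponent sits above the mantissa, so a larger float always has a larger bit pattern. Negative values would compare in reverse order, which is why the slide notes that z is always positive.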
    22. CS step 3 – Cull idea
      Determine visible light sources for each tile
      Cull all light sources against tile ”frustum”
      Light sources can either naively be all light sources in the scene, or CPU frustum culled potentially visible light sources
      Output for each tile is:
      # of visible light sources
      Index list of visible light sources
      Example numbers from test scene
      This is the key part of the algorithm and compute shader, so must try to be rather clever with the implementation!
      Per-tile visible light count
      (black = 0 lights, white = 40)
    23. CS step 3 – Cull implementation
      struct Light
      {
          float3 pos;
          float sqrRadius;
          float3 color;
          float invSqrRadius;
      };
      int lightCount;
      StructuredBuffer<Light> lights;
      // groupshared variables are not initialized by their declaration,
      // so the counter is reset explicitly in the function body
      groupshared uint visibleLightCount;
      groupshared uint visibleLightIndices[1024];
      // ----- globals above, cont. function below -----------
      if (groupIndex == 0)
          visibleLightCount = 0;
      GroupMemoryBarrierWithGroupSync();
      uint threadCount = BLOCK_SIZE*BLOCK_SIZE;
      uint passCount = (lightCount+threadCount-1) / threadCount;
      for (uint passIt = 0; passIt < passCount; ++passIt)
      {
          uint lightIndex = passIt*threadCount + groupIndex;
          // prevent overrun by clamping to a last ”null” light,
          // i.e. the buffer holds lightCount+1 entries
          lightIndex = min(lightIndex, lightCount);
          if (intersects(lights[lightIndex], tile))
          {
              // atomically reserve a slot in the per-tile index list
              uint offset;
              InterlockedAdd(visibleLightCount, 1, offset);
              visibleLightIndices[offset] = lightIndex;
          }
      }
      GroupMemoryBarrierWithGroupSync();
      Each thread switches to process light sources instead of a pixel*
      Wow, parallelism switcheroo!
      256 light sources in parallel per tile
      Multiple iterations for >256 lights
      Intersect light source & tile
      Many variants dep. on accuracy requirements & performance
      Tile min & max z is used as a shader ”depth bounds” test
      For visible lights, append light index to index list
      Atomic add to threadgroup shared memory. ”inlined stream compaction”
      Prefix sum + stream compaction should be faster than atomics, but more limiting
      Synchronize group & switch back to processing pixels
      We now know which light sources affect the tile
      *Your grandfather’s pixel shader can’t do that!
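To make the pass structure and the atomic append concrete, here is a C++ emulation of the cull step for a single tile. Several things are assumptions for the sake of the example: the `TileLight` layout (view-space depth and radius only), the `intersectsDepthRange` test that only checks the tile's min and max z and ignores the side planes of the tile frustum, and the serial loops that stand in for the 256 GPU threads. Out-of-range indices are skipped here instead of being clamped to a null light.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

struct TileLight {
    float viewZ;   // view-space depth of the light center (assumed layout)
    float radius;  // radius of the light's sphere of influence
};

// Depth-bounds part of the light/tile test only; side planes are omitted.
bool intersectsDepthRange(const TileLight& l, float tileMinZ, float tileMaxZ) {
    return l.viewZ + l.radius >= tileMinZ && l.viewZ - l.radius <= tileMaxZ;
}

// Emulates the cull for one tile: each "thread" handles one light per pass
// and appends visible light indices through an atomic counter.
std::vector<std::uint32_t> cullLightsForTile(const std::vector<TileLight>& lights,
                                             float tileMinZ, float tileMaxZ,
                                             std::uint32_t threadCount) {
    assert(threadCount > 0);
    const std::uint32_t lightCount = static_cast<std::uint32_t>(lights.size());
    const std::uint32_t passCount = (lightCount + threadCount - 1) / threadCount;
    std::vector<std::uint32_t> visible(lightCount);
    std::atomic<std::uint32_t> visibleCount{0};
    for (std::uint32_t pass = 0; pass < passCount; ++pass) {
        for (std::uint32_t thread = 0; thread < threadCount; ++thread) {
            const std::uint32_t lightIndex = pass * threadCount + thread;
            if (lightIndex >= lightCount) continue;  // no null light in this sketch
            if (intersectsDepthRange(lights[lightIndex], tileMinZ, tileMaxZ)) {
                // fetch_add returns the old value, i.e. the reserved slot
                const std::uint32_t offset = visibleCount.fetch_add(1);
                visible[offset] = lightIndex;
            }
        }
    }
    visible.resize(visibleCount.load());
    return visible;
}
```

The serial emulation produces the indices in ascending order. On a GPU the threads race for slots, so the order of the index list is not deterministic, only its contents.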
    24. CS deferred shading final steps
      Computed lighting
      For each pixel, accumulate lighting from visible lights
      Read from tile visible light index list in threadgroup shared memory
      Combine lighting & shading albedos / parameters
      Output is non-MSAA HDR texture
      Render transparent surfaces on top
      float3 diffuseLight = 0;
      float3 specularLight = 0;
      for (uint lightIt = 0; lightIt < visibleLightCount; ++lightIt)
      {
          uint lightIndex = visibleLightIndices[lightIt];
          Light light = lights[lightIndex];
          evaluateAndAccumulateLight(
              light,
              gbufferParameters,
              diffuseLight,
              specularLight);
      }
      Combined final output
      (not the best example)
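The slides do not show the body of `evaluateAndAccumulateLight`, so the following C++ is only one plausible shape for the diffuse part. It assumes a Lambert term and a quadratic falloff that reaches zero at the light radius, which is a guess based on the `sqrRadius` and `invSqrRadius` fields of the `Light` struct. The specular term is left out, and all function names are invented for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

Vec3 sub(Vec3 a, Vec3 b) { return Vec3{a.x - b.x, a.y - b.y, a.z - b.z}; }
float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Same fields as the Light struct in the cull step.
struct Light {
    Vec3 pos;
    float sqrRadius;
    Vec3 color;
    float invSqrRadius;
};

// Hypothetical diffuse evaluation: Lambert term times a falloff of
// 1 - d^2 / r^2, which is zero at the light radius.
void accumulateDiffuse(const Light& light, Vec3 position, Vec3 normal, Vec3& diffuse) {
    const Vec3 toLight = sub(light.pos, position);
    const float sqrDist = dot(toLight, toLight);
    if (sqrDist >= light.sqrRadius || sqrDist <= 0.0f) return;  // out of range
    const float attenuation = 1.0f - sqrDist * light.invSqrRadius;
    const float nDotL = std::max(dot(normal, toLight) / std::sqrt(sqrDist), 0.0f);
    const float weight = attenuation * nDotL;
    diffuse.x += light.color.x * weight;
    diffuse.y += light.color.y * weight;
    diffuse.z += light.color.z * weight;
}

// Per-pixel loop over the tile's visible light index list.
Vec3 shadePixelDiffuse(const std::vector<Light>& lights,
                       const std::vector<std::uint32_t>& visibleLightIndices,
                       Vec3 position, Vec3 normal) {
    Vec3 diffuse{0.0f, 0.0f, 0.0f};
    for (std::uint32_t lightIndex : visibleLightIndices)
        accumulateDiffuse(lights[lightIndex], position, normal, diffuse);
    return diffuse;
}
```

Every pixel in the tile walks the same short index list, so the gbuffer is read once per pixel no matter how many lights overlap the tile.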
    25. Example results
    26. Example: 25+ analytical specular highlights per pixel
    27. Compute Shader-based
      Deferred Shading demo
    28. CS-based deferred shading
      The Good:
      Constant & absolute minimal bandwidth
      Read gbuffers & depth once!
      Doesn’t need intermediate light buffers
      Can take a lot of memory with HDR, MSAA & color specular
      Scales up to huge amount of big overlapping light sources!
      Fine-grained culling (16x16)
      Only ALU cost, good future scaling
      Could be useful for accumulating VPLs
      The Bad:
      Requires DX11 HW (duh)
      CS 4.0/4.1 difficult due to atomics & scattered groupshared writes
      Culling overhead for small light sources
      Can accumulate them using standard light volume rendering
      Or separate CS for tile-classific.
      Potentially performance
      MSAA texture loads / UAV writing might be slower than standard PS
      The Ugly:
      • Can’t output to MSAA texture
      • DX11 CS UAV limitation.
    29. Future programming model
      Queues as compute shader streaming in/outs
      In addition to buffers/textures/UAVs
      Simple & expressive model supporting irregular workloads
      Keeps data on chip, supports variable sized caches & cores
      Build your pipeline of stages with queues between
      Shader & fixed function stages (sampler, rasterizer, tessellator, Zcull)
      Developers can make the GPU feed itself!
      GRAMPS model example [11]
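A toy C++ version of "stages with queues between" may help picture the model. It is not GRAMPS and not any real GPU API: `Stage`, `Queue` and `runPipeline` are invented for this sketch, items are plain integers, and the scheduler is a serial loop. What it does show is the irregular part, where a stage may emit zero, one or several items for each item it consumes.

```cpp
#include <deque>
#include <functional>
#include <vector>

using Queue = std::deque<int>;

struct Stage {
    Queue* in;   // queue this stage consumes from
    Queue* out;  // queue this stage produces into
    // Consumes one item and pushes zero or more results to the output queue.
    std::function<void(int, Queue&)> process;
};

// Runs the stages until every input queue is drained. Assumes the stage
// graph is acyclic so that the loop terminates.
void runPipeline(std::vector<Stage>& stages) {
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (Stage& stage : stages) {
            while (!stage.in->empty()) {
                const int item = stage.in->front();
                stage.in->pop_front();
                stage.process(item, *stage.out);
                progressed = true;
            }
        }
    }
}
```

For example, a first stage could drop odd items and amplify each even item into two outputs, and a second stage could transform whatever arrives. Neither stage needs to know in advance how much data the other produces, which is what makes the model a fit for culling, tessellation-like amplification and other variable-rate work.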
    30. What else do we want to do?
      WARNING: Overly enthusiastic and non all-knowing game developer ranting
      Mixed resolution MSAA particle rendering
      Depth test per sample, shade per quarter pixel, and depth-aware upsample directly in shader
      Demand-paged procedural texturing / compositing
      Zero latency “texture shaders”
      Pre-tessellation coarse rasterization for z-culling of patches
      Potential optimization in scenes of massive geometric overdraw
      Can be coupled with recursive schemes
      Deferred shading w/ many & arbitrary BRDFs/materials
      Queue up pixels of multiple materials for coherent processing in own shader
      Instead of incoherent screen-space dynamic flow control
      Latency-free lens flares
      Finally! No false/late occlusion
      Occlusion query results written to CB and used in shader to cull & scale
      And much much more...
    31. Conclusions
      A good parallelization model is key for good game engine performance (duh)
      Job graphs of mixed task- & data-parallel CPU & SPU jobs work well for us
      SPU-jobs do the heavy lifting
      Hybrid compute/graphics pipelines look promising
      Efficient interoperability is super important (DX11 is great)
      Deferred lighting & shading in CS is just the start
      Want a user-defined streaming pipeline model
      Expressive & extensible hybrid pipelines with queues
      Focus on the data flow & patterns instead of doing sequential memory passes
    32. Acknowledgements
      DICE & Frostbite team
      Nicolas Thibieroz, Mark Leather
      Miguel Sainz, Yury Uralsky
      Kayvon Fatahalian
      Matt Swoboda, Pål-Kristian Engstad
      Timothy Farrar, Jake Cannell
    33. References
      [1] Johan Andersson. “Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://repi.blogspot.com/2009/01/conference-slides.html
      [2] Natalya Tatarchuk & Johan Andersson. “Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://developer.amd.com/Assets/Andersson-Tatarchuk-FrostbiteRenderingArchitecture(GDC07_AMD_Session).pdf
      [3] Johan Andersson. “Terrain Rendering in Frostbite using Procedural Shader Splatting”. Siggraph 2007. http://developer.amd.com/media/gpu_assets/Andersson-TerrainRendering(Siggraph07).pdf
      [4] Daniel Johansson & Johan Andersson. “Shadows & Decals – D3D10 techniques from Frostbite”. GDC 2009. http://repi.blogspot.com/2009/03/gdc09-shadows-decals-d3d10-techniques.html
      [5] Bill Bilodeau & Johan Andersson. “Your Game Needs Direct3D 11, So Get Started Now!”. GDC 2009. http://repi.blogspot.com/2009/04/gdc09-your-game-needs-direct3d-11-so.html
      [6] Aaron Lefohn. “Programming Larrabee: Beyond Data Parallelism” – “Beyond Programmable Shading” course. Siggraph 2008. http://s08.idav.ucdavis.edu/lefohn-programming-larrabee.pdf
      [7] Mark Cerny, Jon Olick, Vince Diesi. “PLAYSTATION Edge”. GDC 2007.
      [8] Wolfgang Engel. “Light Pre-Pass Renderer Mark III” – “Advances in Real-Time Rendering in 3D Graphics and Games” course notes. Siggraph 2009.
      [9] Pål-Kristian Engstad. “The Technology of Uncharted: Drake’s Fortune”. GDC 2008. http://www.naughtydog.com/corporate/press/GDC%202008/UnchartedTechGDC2008.pdf
      [10] Matt Swoboda. “Deferred Lighting and Post Processing on PLAYSTATION®3”. GDC 2009. http://www.technology.scee.net/files/presentations/gdc2009/DeferredLightingandPostProcessingonPS3.ppt
      [11] Kayvon Fatahalian et al. “GRAMPS: A Programming Model for Graphics Pipelines”. ACM Transactions on Graphics, January 2009. http://graphics.stanford.edu/papers/gramps-tog/
      [12] Jared Hoberock et al. “Stream Compaction for Deferred Shading”. http://graphics.cs.uiuc.edu/~jch/papers/shadersorting.pdf
    34. We are hiring senior developers
    35. Questions?
      • Please fill in the course evaluation at: http://www.siggraph.org/courses_evaluation
      • You could win a Siggraph’09 mug (yey!)
      • One winner per course, notified by email in the evening
      Email: johan.andersson@dice.se
      Blog: http://repi.se
      Twitter: http://twitter.com/repi
      igetyourfail.com
    36. Bonus slides
    37. Timing view
      Real-time in-game overlay
      See CPU, SPU & GPU timing events & effective parallelism
      What we use to reduce sync-points & optimize load balancing between all processors
      GPU timing through event queries
      AFR-handling rather shaky, but works!*
      Example: PC, 4 CPU cores, 2 GPUs in AFR
      *At least on AMD 4870x2 after some alt-tab action

    Talk in the Beyond Programmable Shading course at S...


    © All Rights Reserved
