Stockholm Game Developer Forum 2010Parallel Futures of a Game Engine (v2.0)Johan AnderssonRendering Architect, DICE
BackgroundDICEStockholm, Sweden~250 employeesPart of Electronic ArtsBattlefield & Mirror’s Edge game seriesFrostbiteProprietary game engine used at DICE & EADeveloped by DICE over the last 5 years2
http://badcompany2.ea.com/3
http://badcompany2.ea.com/4
OutlineCurrent parallelismFuturesQ&ASlides will be available online later todayhttp://publications.dice.se  and http://repi.se5
Current parallelism6
Levels of code in FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU7
Levels of code in FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU8
General ”game code” (1/2)This is the majority of our 1.5 million lines of C++Runs on Win32, Win64, Xbox 360 and PS3We do not use any scripting languageSimilar to general application codeHuge amount of code & logic to maintain + continue to developLow compute density”Glue code”Scattered in memory (pointer chasing)Difficult to efficiently parallelizeOut-of-order execution is a big help, but consoles are in-order Key to be able to quickly iterate & changeThis is the actual game logic & glue that builds the gameC++ not ideal, but has the invested infrastructure9
General ”game code” (2/2)PS3 is one of the main challengesStandard CPU parallelization doesn’t help (much)CELL only has 2 HW threads on the PPUSplit the code in 2: game code & system codeGame logic, policy and glue code only on CPU”If it runs well on the PS3 PPU, it runs well everywhere”Lower-level systems on PS3 SPUsMain goals going forward:Simplify & structure code baseReduce coupling with lower-level systemsIncrease in task parallelism for PCCELL processor10
Levels of code in FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU11
Job-based parallelismEssential to utilize the cores on our target platformsXbox 360: 6 HW threadsPlayStation 3: 2 HW threads + 6 powerful SPUsPC: 2-16 HW threadsDivide up system work into Jobs (a.k.a. Tasks)15-200k C++ code each. 25k is commonCan depend on each other (if needed)Dependencies create job graphAll HW threads consume jobs ~200-300 / frame12
What is a Job for us?An asynchronous function callFunction ptr + 4 uintptr_t parametersCross-platform scheduler: EA JobManagerUses work stealing2 types of Jobs in Frostbite:CPU job (good)General code moved into job instead of threadsSPU job (great!)Stateless pure functions, no side effectsData-oriented, explicit memory DMA to local storeDesigned to run on the PS3 SPUs = also very fast on in-order CPUCan hot-swap quick iterations 13
Frostbite CPU job graphBuild big job graphs:Batch, batch, batch
Mix CPU- & SPU-jobs
Future: Mix in low-latency GPU-jobsJob dependencies determine:Execution order
Sync points
Load balancing
i.e. the effective parallelismIntermixed task- & data-parallelismaka Braided Parallelism
aka Nested Data-Parallelism
aka Tasks and Kernels14
Data-parallel jobs15
Task-parallel algorithms & coordination16
17Timing View: Battlefield Bad Company 2 (PS3)
Rendering jobsRendering systems are heavily divided up into CPU-& SPU-jobsJobs:Terrain geometry [3]Undergrowth generation [2]Decal projection [4]Particle simulationFrustum cullingOcclusion cullingOcclusion rasterizationCommand buffer generation [6]PS3: Triangle culling [6]Most will move to GPUEventually.. A few have already!PC CPU<->GPU latency wall Mostly one-way data flow18
Occlusion culling job exampleProblem: Buildings & env occlude large amounts of objectsObscured objects still have to:Update logic & animationsGenerate command bufferProcessed on CPU & GPU  = expensive & wasteful Difficult to implement full culling:Destructible buildingsDynamic occludeesDifficult to precompute From Battlefield: Bad Company PS319
Solution: Software occlusion cullingRasterize coarse zbuffer on SPU/CPU256x114 floatLow-poly occluder meshes100 m view distanceMax 10000 vertices/frameParallel vertex & raster SPU-jobsCost: a few millisecondsCull all objects against zbufferScreen-space bounding-box testBefore passed to all other systemsBig performance savings!20
GPU occlusion cullingIdeally want to use the GPU, but current APIs are limited:Occlusion queries introduces overhead & latencyConditional rendering only helps GPUCompute Shader impl. possible, but same latency wallFuture 1: Low-latency GPU execution contextRasterization and testing done on GPU where it belongsLockstep with CPU, need to read back within a few msShould be possible on Larrabee & Fusion, want on all PCFuture 2: Move entire cull & rendering to ”GPU”World, cull, systems, dispatch. End goalVery thin GPU graphics driver – GPU feeds itself21
Levels of code in FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU22
Shader typesGenerated shaders [1]Graph-based surface shadersTreated as content, not codeArtist createdGenerates HLSL codeUsed by all meshes and 3d surfacesGraphics / Compute kernelsHand-coded & optimized HLSLStatically linked in with C++Pixel- & compute-shadersLighting, post-processing & special effectsGraph-based surface shader in FrostEd 223
Futures24
3 major challenges/goals going forward:How do we make it easier to develop, maintain & parallelize general game code?What do we need to continue to innovate & scale up real-time computational graphics?How can we move & scale up advanced simulation and non-graphics tasks to data-parallel manycore processors?ChallengesMost likely the same solution(s)!25
Challenge 1“How do we make it easier to develop, maintain & parallelize general game code?”Shared State Concurrency is a killerNot a big believer in Software Transactional Memory either Because of performance and too ”optimistic” flowA more strict & adapted C++ modelSupport for true immutable & r/w-only memory accessPer-thread/task memory access opt-inReduce the possibility for side effects in parallel codeAs much compile-time validation as possibleMicro-threads / coroutines as first class citizensMore?Other languages?26
Challenge 1 - Task parallelismMultiple task librariesEA JobManagerDynamic job graphs, on-demand dependencis & continuationsTargeted to cross-platform SPU-jobs, key requirement for this generationNot geared towards super-simple to use CPU parallelismMS ConcRT, Apple GCD, Intel TBBAll has some good parts!Neither works on all of our current platformsOpenMP Just say no. Parallelism can’t be tacked on, need to be an explicit core partNeed C++ enhancements to simplify usageC++ 0x lambdas / GCD blocks Glacial C++ development & deployment Want on all platforms, so lost on this console generation27
Challenge 2 - Definition”Real-time interactive graphics & simulation at Pixar level of quality”Needed visual features:Complete anti-aliasing + natural motion blur & DOFGlobal indirect lighting & reflectionsSub-pixel geometryOITHuge improvements in character animationThese require massively more compute, BW and improved model!(animation can’t be solved with just more/better compute, so pretend it doesn’t exist for now)28
Challenge 2 - ProblemsProblems & limitations with current GPU model:Fixed rasterization pipelineCompute pipeline not fast enough to replace itGPU is handicapped by being spoon-fed by CPUIrregular workloads are difficult & inefficientCan’t express nested data-parallelismwellCurrent HLSL is a very limited language29
Challenge 2 - SolutionsAbsolutely a job for a high-throughput oriented data-parallel processorWith a highly flexible programming modelThe CPU, as we know it, and APIs/drivers are only in the wayPure software solution not practical as next step after DX11 PC Multi-vendor & multi-architecture marketplaceBut for future consoles, the more flexible the better (perf. permitting)Long term goal: Multi-vendor & multi-platform standard/virtual DP ISAWant a rich programmable compute model as next stepNested data-parallelism & GPU self-feedingLow-latency CPU<->GPU interactionEfficiently target varying HW architectures30
”Pipelined Compute Shaders”Queues as streaming I/O between compute kernelsSimple & expressive model supporting irregular workloadsKeeps data on chip, supports variable sized caches & coresCan target multiple types of HW & architecturesHybrid graphics/compute user-defined pipelinesLanguage/API defining fixed stages inputs & outputsPipelines can feed other pipelines (similar to DrawIndirect)Reyes-style Rendering with Ray TracingShadeSplitRasterSub-DPrimsTessFrame BufferTrace31
”Pipelined Compute Shaders”Wanted for next major DirectX and OpenCL/OpenGLAs a standard, as soon as possibleRun on all: discrete GPU, integrated GPU and CPUModel is also a good fit for many of our CPU/SPU jobsParts of job graph can be seen as queues between stagesEasier to write kernels/jobs with streaming I/O Instead of explicit fixed-buffers and ”memory passes”Or dynamic memory allocation32
Language?Future DP language is a big questionBut the concepts & infrastructure are what is important!Could be an extendedHLSL or ”data-parallel C++”Data-oriented imperative language (i.e. not standard C++)Think HLSL would probably be easier & the most explicitAmount of code is small and written from scratchShader-like implicit SoA vectorization (SIMT)Instead of SSE/LRBni-like explicit vectorizationNeed to be first class language citizen, not tacked on C++Easier to target multiple evolving architectures implicitly33
Future hardware (1/2)Single main memory & address spaceCritical to share resources between graphics, simulation and game in immersive dynamic worldsConfigurable kernel local stores / cacheSimilar to Nvidia Fermi & Intel LarrabeeLocal stores = reliability & good for regular loadsTogether with user-driven async DMA enginesCaches = essential for irregular data structuresCache coherency?Not always important for kernelsBut essential for general code, can partition?34
Future hardware (2/2)2015 = 40 TFLOPS, we would spend it on:80% graphics15% simulation4% misc1% game (wouldn’t use all 400 GFLOPS for game logic & glue!)OOE CPUs more efficient for the majority of our game codeBut for the vast majority of our FLOPS these are fully irrelevantCan evolve to a small dot on a sea of DP coresOr run scalar on general DP cores wasting some computeIn other words: no need for separate CPU and GPU!35-> single heterogeneous processor
ConclusionsFuture is an interleaved mix of task- & data-parallelismOn both the HW and SW levelBut programmable DP is where the massive compute is doneData-parallelism requires data-oriented designDeveloper productivity can’t be limited by model(s)It should enhance productivity & perf on all levelsTools & language constructs play a critical roleWe should welcome our parallel future!36
Thanks toDICE, EA and the Frostbite teamThe graphics/gamedev community on TwitterSteve McCalla, Mike BurrowsChas BoydNicolas Thibieroz, Mark LeatherDan Wexler, Yury UralskyKayvon Fatahalian37
ReferencesPrevious Frostbite-related talks:[1] Johan Andersson. ”Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing Techniques ”. GDC 2007. http://repi.blogspot.com/2009/01/conference-slides.html[2] Natasha Tartarchuk & Johan Andersson. ”Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://developer.amd.com/Assets/Andersson-Tatarchuk-FrostbiteRenderingArchitecture(GDC07_AMD_Session).pdf[3] Johan Andersson. ”Terrain Rendering in Frostbite using Procedural Shader Splatting”. Siggraph 2007. http://developer.amd.com/media/gpu_assets/Andersson-TerrainRendering(Siggraph07).pdf[4] Daniel Johansson & Johan Andersson. “Shadows & Decals – D3D10 techniques from Frostbite”. GDC 2009. http://repi.blogspot.com/2009/03/gdc09-shadows-decals-d3d10-techniques.html[5] Bill Bilodeau & Johan Andersson. “Your Game Needs Direct3D 11, So Get Started Now!”. GDC 2009. http://repi.blogspot.com/2009/04/gdc09-your-game-needs-direct3d-11-so.html[6] Johan Andersson. ”Parallel Graphics in Frostbite”. Siggraph 2009, Beyond Programmable Shading course. http://repi.blogspot.com/2009/08/siggraph09-parallel-graphics-in.html38
Questions?Email:johan.andersson@dice.se   Blog: http://repi.se    Twitter: @repi39For more DICE talks: http://publications.dice.se
Bonus slides40
Game development2 year development cycleNew IP often takes much longer, 3-5 yearsEngine is continuously in development & usedAAA teams of 70-90 people40% artists30% designers20% programmers10% audioBudgets $20-40 millionCross-platform development is market realityXbox 360 and PlayStation 3PC (DX9, DX10 & DX11 for BC2)Current consoles will stay with us for many more years41
Game engine requirements (1/2)Stable real-time performanceFrame-driven updates, 30 fps Few threads, instead per-frame jobs/tasks for everything Predictable memory usageFixed budgets for systems & content, fail if overAvoid runtime allocationsLove unified memory! Cross-platformThe consoles determines our base tech level & focusPS3 is design target, most difficult and good potentialScale up for PC, dual core is min spec (slow!)42

Parallel Futures of a Game Engine (v2.0)

  • 1.
    Stockholm Game DeveloperForum 2010Parallel Futures of a Game Engine (v2.0)Johan AnderssonRendering Architect, DICE
  • 2.
    BackgroundDICEStockholm, Sweden~250 employeesPartof Electronic ArtsBattlefield & Mirror’s Edge game seriesFrostbiteProprietary game engine used at DICE & EADeveloped by DICE over the last 5 years2
  • 3.
  • 4.
  • 5.
    OutlineCurrent parallelismFuturesQ&ASlides willbe available online later todayhttp://publications.dice.se and http://repi.se5
  • 6.
  • 7.
    Levels of codein FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU7
  • 8.
    Levels of codein FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU8
  • 9.
    General ”game code”(1/2)This is the majority of our 1.5 million lines of C++Runs on Win32, Win64, Xbox 360 and PS3We do not use any scripting languageSimilar to general application codeHuge amount of code & logic to maintain + continue to developLow compute density”Glue code”Scattered in memory (pointer chasing)Difficult to efficiently parallelizeOut-of-order execution is a big help, but consoles are in-order Key to be able to quickly iterate & changeThis is the actual game logic & glue that builds the gameC++ not ideal, but has the invested infrastructure9
  • 10.
    General ”game code”(2/2)PS3 is one of the main challengesStandard CPU parallelization doesn’t help (much)CELL only has 2 HW threads on the PPUSplit the code in 2: game code & system codeGame logic, policy and glue code only on CPU”If it runs well on the PS3 PPU, it runs well everywhere”Lower-level systems on PS3 SPUsMain goals going forward:Simplify & structure code baseReduce coupling with lower-level systemsIncrease in task parallelism for PCCELL processor10
  • 11.
    Levels of codein FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU11
  • 12.
    Job-based parallelismEssential toutilize the cores on our target platformsXbox 360: 6 HW threadsPlayStation 3: 2 HW threads + 6 powerful SPUsPC: 2-16 HW threadsDivide up system work into Jobs (a.k.a. Tasks)15-200k C++ code each. 25k is commonCan depend on each other (if needed)Dependencies create job graphAll HW threads consume jobs ~200-300 / frame12
  • 13.
    What is aJob for us?An asynchronous function callFunction ptr + 4 uintptr_t parametersCross-platform scheduler: EA JobManagerUses work stealing2 types of Jobs in Frostbite:CPU job (good)General code moved into job instead of threadsSPU job (great!)Stateless pure functions, no side effectsData-oriented, explicit memory DMA to local storeDesigned to run on the PS3 SPUs = also very fast on in-order CPUCan hot-swap quick iterations 13
  • 14.
    Frostbite CPU jobgraphBuild big job graphs:Batch, batch, batch
  • 15.
    Mix CPU- &SPU-jobs
  • 16.
    Future: Mix inlow-latency GPU-jobsJob dependencies determine:Execution order
  • 17.
  • 18.
  • 19.
    i.e. the effectiveparallelismIntermixed task- & data-parallelismaka Braided Parallelism
  • 20.
  • 21.
    aka Tasks andKernels14
  • 22.
  • 23.
  • 24.
    17Timing View: BattlefieldBad Company 2 (PS3)
  • 25.
    Rendering jobsRendering systemsare heavily divided up into CPU-& SPU-jobsJobs:Terrain geometry [3]Undergrowth generation [2]Decal projection [4]Particle simulationFrustum cullingOcclusion cullingOcclusion rasterizationCommand buffer generation [6]PS3: Triangle culling [6]Most will move to GPUEventually.. A few have already!PC CPU<->GPU latency wall Mostly one-way data flow18
  • 26.
    Occlusion culling jobexampleProblem: Buildings & env occlude large amounts of objectsObscured objects still have to:Update logic & animationsGenerate command bufferProcessed on CPU & GPU = expensive & wasteful Difficult to implement full culling:Destructible buildingsDynamic occludeesDifficult to precompute From Battlefield: Bad Company PS319
  • 27.
    Solution: Software occlusioncullingRasterize coarse zbuffer on SPU/CPU256x114 floatLow-poly occluder meshes100 m view distanceMax 10000 vertices/frameParallel vertex & raster SPU-jobsCost: a few millisecondsCull all objects against zbufferScreen-space bounding-box testBefore passed to all other systemsBig performance savings!20
  • 28.
    GPU occlusion cullingIdeallywant to use the GPU, but current APIs are limited:Occlusion queries introduces overhead & latencyConditional rendering only helps GPUCompute Shader impl. possible, but same latency wallFuture 1: Low-latency GPU execution contextRasterization and testing done on GPU where it belongsLockstep with CPU, need to read back within a few msShould be possible on Larrabee & Fusion, want on all PCFuture 2: Move entire cull & rendering to ”GPU”World, cull, systems, dispatch. End goalVery thin GPU graphics driver – GPU feeds itself21
  • 29.
    Levels of codein FrostbiteEditor (C#)Pipeline (C++)Game code (C++)System CPU-jobs (C++)System SPU-jobs (C++/asm)Generated shaders (HLSL)Compute kernels (HLSL)OfflineCPURuntimeGPU22
  • 30.
    Shader typesGenerated shaders[1]Graph-based surface shadersTreated as content, not codeArtist createdGenerates HLSL codeUsed by all meshes and 3d surfacesGraphics / Compute kernelsHand-coded & optimized HLSLStatically linked in with C++Pixel- & compute-shadersLighting, post-processing & special effectsGraph-based surface shader in FrostEd 223
  • 31.
  • 32.
    3 major challenges/goalsgoing forward:How do we make it easier to develop, maintain & parallelize general game code?What do we need to continue to innovate & scale up real-time computational graphics?How can we move & scale up advanced simulation and non-graphics tasks to data-parallel manycore processors?ChallengesMost likely the same solution(s)!25
  • 33.
    Challenge 1“How dowe make it easier to develop, maintain & parallelize general game code?”Shared State Concurrency is a killerNot a big believer in Software Transactional Memory either Because of performance and too ”optimistic” flowA more strict & adapted C++ modelSupport for true immutable & r/w-only memory accessPer-thread/task memory access opt-inReduce the possibility for side effects in parallel codeAs much compile-time validation as possibleMicro-threads / coroutines as first class citizensMore?Other languages?26
  • 34.
    Challenge 1 -Task parallelismMultiple task librariesEA JobManagerDynamic job graphs, on-demand dependencis & continuationsTargeted to cross-platform SPU-jobs, key requirement for this generationNot geared towards super-simple to use CPU parallelismMS ConcRT, Apple GCD, Intel TBBAll has some good parts!Neither works on all of our current platformsOpenMP Just say no. Parallelism can’t be tacked on, need to be an explicit core partNeed C++ enhancements to simplify usageC++ 0x lambdas / GCD blocks Glacial C++ development & deployment Want on all platforms, so lost on this console generation27
  • 35.
    Challenge 2 -Definition”Real-time interactive graphics & simulation at Pixar level of quality”Needed visual features:Complete anti-aliasing + natural motion blur & DOFGlobal indirect lighting & reflectionsSub-pixel geometryOITHuge improvements in character animationThese require massively more compute, BW and improved model!(animation can’t be solved with just more/better compute, so pretend it doesn’t exist for now)28
  • 36.
    Challenge 2 -ProblemsProblems & limitations with current GPU model:Fixed rasterization pipelineCompute pipeline not fast enough to replace itGPU is handicapped by being spoon-fed by CPUIrregular workloads are difficult & inefficientCan’t express nested data-parallelismwellCurrent HLSL is a very limited language29
  • 37.
    Challenge 2 -SolutionsAbsolutely a job for a high-throughput oriented data-parallel processorWith a highly flexible programming modelThe CPU, as we know it, and APIs/drivers are only in the wayPure software solution not practical as next step after DX11 PC Multi-vendor & multi-architecture marketplaceBut for future consoles, the more flexible the better (perf. permitting)Long term goal: Multi-vendor & multi-platform standard/virtual DP ISAWant a rich programmable compute model as next stepNested data-parallelism & GPU self-feedingLow-latency CPU<->GPU interactionEfficiently target varying HW architectures30
  • 38.
    ”Pipelined Compute Shaders”Queuesas streaming I/O between compute kernelsSimple & expressive model supporting irregular workloadsKeeps data on chip, supports variable sized caches & coresCan target multiple types of HW & architecturesHybrid graphics/compute user-defined pipelinesLanguage/API defining fixed stages inputs & outputsPipelines can feed other pipelines (similar to DrawIndirect)Reyes-style Rendering with Ray TracingShadeSplitRasterSub-DPrimsTessFrame BufferTrace31
  • 39.
    ”Pipelined Compute Shaders”Wantedfor next major DirectX and OpenCL/OpenGLAs a standard, as soon as possibleRun on all: discrete GPU, integrated GPU and CPUModel is also a good fit for many of our CPU/SPU jobsParts of job graph can be seen as queues between stagesEasier to write kernels/jobs with streaming I/O Instead of explicit fixed-buffers and ”memory passes”Or dynamic memory allocation32
  • 40.
    Language?Future DP languageis a big questionBut the concepts & infrastructure are what is important!Could be an extendedHLSL or ”data-parallel C++”Data-oriented imperative language (i.e. not standard C++)Think HLSL would probably be easier & the most explicitAmount of code is small and written from scratchShader-like implicit SoA vectorization (SIMT)Instead of SSE/LRBni-like explicit vectorizationNeed to be first class language citizen, not tacked on C++Easier to target multiple evolving architectures implicitly33
  • 41.
    Future hardware (1/2)Singlemain memory & address spaceCritical to share resources between graphics, simulation and game in immersive dynamic worldsConfigurable kernel local stores / cacheSimilar to Nvidia Fermi & Intel LarrabeeLocal stores = reliability & good for regular loadsTogether with user-driven async DMA enginesCaches = essential for irregular data structuresCache coherency?Not always important for kernelsBut essential for general code, can partition?34
  • 42.
    Future hardware (2/2)2015= 40 TFLOPS, we would spend it on:80% graphics15% simulation4% misc1% game (wouldn’t use all 400 GFLOPS for game logic & glue!)OOE CPUs more efficient for the majority of our game codeBut for the vast majority of our FLOPS these are fully irrelevantCan evolve to a small dot on a sea of DP coresOr run scalar on general DP cores wasting some computeIn other words: no need for separate CPU and GPU!35-> single heterogeneous processor
  • 43.
    ConclusionsFuture is aninterleaved mix of task- & data-parallelismOn both the HW and SW levelBut programmable DP is where the massive compute is doneData-parallelism requires data-oriented designDeveloper productivity can’t be limited by model(s)It should enhance productivity & perf on all levelsTools & language constructs play a critical roleWe should welcome our parallel future!36
  • 44.
    Thanks toDICE, EAand the Frostbite teamThe graphics/gamedev community on TwitterSteve McCalla, Mike BurrowsChas BoydNicolas Thibieroz, Mark LeatherDan Wexler, Yury UralskyKayvon Fatahalian37
  • 45.
    ReferencesPrevious Frostbite-related talks:[1]Johan Andersson. ”Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing Techniques ”. GDC 2007. http://repi.blogspot.com/2009/01/conference-slides.html[2] Natasha Tartarchuk & Johan Andersson. ”Rendering Architecture and Real-time Procedural Shading & Texturing Techniques”. GDC 2007. http://developer.amd.com/Assets/Andersson-Tatarchuk-FrostbiteRenderingArchitecture(GDC07_AMD_Session).pdf[3] Johan Andersson. ”Terrain Rendering in Frostbite using Procedural Shader Splatting”. Siggraph 2007. http://developer.amd.com/media/gpu_assets/Andersson-TerrainRendering(Siggraph07).pdf[4] Daniel Johansson & Johan Andersson. “Shadows & Decals – D3D10 techniques from Frostbite”. GDC 2009. http://repi.blogspot.com/2009/03/gdc09-shadows-decals-d3d10-techniques.html[5] Bill Bilodeau & Johan Andersson. “Your Game Needs Direct3D 11, So Get Started Now!”. GDC 2009. http://repi.blogspot.com/2009/04/gdc09-your-game-needs-direct3d-11-so.html[6] Johan Andersson. ”Parallel Graphics in Frostbite”. Siggraph 2009, Beyond Programmable Shading course. http://repi.blogspot.com/2009/08/siggraph09-parallel-graphics-in.html38
  • 46.
    Questions?Email:johan.andersson@dice.se Blog: http://repi.se Twitter: @repi39For more DICE talks: http://publications.dice.se
  • 47.
  • 48.
    Game development2 yeardevelopment cycleNew IP often takes much longer, 3-5 yearsEngine is continuously in development & usedAAA teams of 70-90 people40% artists30% designers20% programmers10% audioBudgets $20-40 millionCross-platform development is market realityXbox 360 and PlayStation 3PC (DX9, DX10 & DX11 for BC2)Current consoles will stay with us for many more years41
  • 49.
    Game engine requirements(1/2)Stable real-time performanceFrame-driven updates, 30 fps Few threads, instead per-frame jobs/tasks for everything Predictable memory usageFixed budgets for systems & content, fail if overAvoid runtime allocationsLove unified memory! Cross-platformThe consoles determines our base tech level & focusPS3 is design target, most difficult and good potentialScale up for PC, dual core is min spec (slow!)42
  • 50.
    Game engine requirements(2/2)Full system profiling/debuggingEngine is a vertical solution, touches everywherePIX, xbtracedump, SN Tuner, ETW, GPUViewQuick iterationsEssential in order to be creativeFast building & fast loading, hot-swapping resourcesAffects both the tools and the gameMiddlewareUse when it make senses, cross-platform & optimizedParallelism have to go through our systems43
  • 51.
    Editor & PipelineEditor(”FrostEd 2”)WYSIWYG editor for contentC#, Windows onlyBasic threading / tasksPipelineOffline/background data-processing & conversionC++, some MC++, Windows onlyTypically IO-boundA few compute-heavy steps use CPU-jobsTexture compression uses CUDA, prefer OpenCL or CSCPU parallelism models are generally not a problem here44
  • 52.
    struct FB_ALIGN(16) EntityRenderCullJobData{ enum { MaxSphereTreeCount= 2, MaxStaticCullTreeCount = 2 }; uint sphereTreeCount; const SphereNode* sphereTrees[MaxSphereTreeCount]; u8 viewCount; u8 frustumCount; u8 viewIntersectFlags[32]; Frustum frustums[32]; .... (cut out 2/3 of struct for display size) u32 maxOutEntityCount; // Output data, pre-allocated by callee u32 outEntityCount; EntityRenderCullInfo* outEntities;};void entityRenderCullJob(EntityRenderCullJobData* data);void validate(const EntityRenderCullJobData& data);Frustum culling of dynamic entities in sphere treestruct contain all input data neededMax output data pre-allocated by calleeSingle job functionCompile both as CPU & SPU jobOptional struct validation funcEntityRenderCull job example45
  • 53.
    EntityRenderCull SPU setup//local store variablesEntityRenderCullJobData g_jobData;float g_zBuffer[256*114];u16 g_terrainHeightData[64*64];int main(uintptr_t dataEa, uintptr_t, uintptr_t, uintptr_t){ dmaBlockGet("jobData", &g_jobData, dataEa, sizeof(g_jobData)); validate(g_jobData); if (g_jobData.zBufferTestEnable) { dmaAsyncGet("zBuffer", g_zBuffer, g_jobData.zBuffer, g_jobData.zBufferResX*g_jobData.zBufferResY*4); g_jobData.zBuffer = g_zBuffer; if (g_jobData.zBufferShadowTestEnable && g_jobData.terrainHeightData) { dmaAsyncGet("terrainHeight", g_terrainHeightData, g_jobData.terrainHeightData, g_jobData.terrainHeightDataSize); g_jobData.terrainHeightData = g_terrainHeightData; } dmaWaitAll(); // block on both DMAs } // run the actual job, will internally do streaming DMAs to the output entity list entityRenderCullJob(&g_jobData); // put back the data because we changed outEntityCount dmaBlockPut(dataEa, &g_jobData, sizeof(g_jobData)); return 0;}46
  • 54.
    Timing viewExample: PC,4 CPU cores, 2 GPUs in AFR (AMD Radeon 4870x2)Real-time in-game overlay See timing events & effective parallelismOn CPU, SPU & GPU – for all platformsUse to reduce sync-points & optimize load balancingGPU timing through DX event queriesOur main performance tool!47
  • 55.
    Language (cont.)Requirements:Full richdebugging, ideally in Visual StudioAssertsInternal kernel profilingHot-swapping / edit-and-continue of kernelsOpportunity for IHVs and platform providers to innovate here!Goal: Cross-vendor standardThink of the co-development of Nvidia Cg and HLSL48
  • 56.
    Unified development environmentWantto debug/profile task- & data-parallel code seamlesslyOn all processors! CPU, GPU & manycoreFrom any vendor = requires standard APIs or ISAsVisual Studio 2010 looks promising for task-parallel PC codeUsable by our offline tools & hopefully PC runtimeWant to integrate our own JobManagerNvidia Nexus looks great for data-parallel GPU codeEventual must have for all HW, how?Huge step forward!49VS2010 Parallel Tasks

Editor's Notes

  • #5 Xbox 360 version + offline super-high AA and resolution
  • #10 ”Glue code”
  • #13 Better code structure!Gustafson’s LawFixed 33 ms/fWe don’t use threads for anything that can be a bit computationally heavy, jobs instead
  • #14 Data-layout
  • #19 We have a lot of jobs across the engine, too many to go through so I chose to focus more on some of the rendering.Our intention is to move all of these to the GPU
  • #21 Conservative
  • #22 Want to rasterize on GPU, not CPU CPU Û GPU job dependencies
  • #27 Simple keywords as override &amp; sealed have been a help in other areas
  • #30 No low-latency GPU contextsNested data-parallelism
  • #31 Would require multi-vendor open ISA standard
  • #34 With nested data parallelism inside kernels
  • #37 OpenCL
  • #43 Isolate systems/areasWe do not use exceptions, nor much error handling
  • #44 Havok &amp; Kynapse
  • #49 Bindings to C very interesting!