Johan Andersson
Technical Director
Frostbite
LOW-LEVEL GRAPHICS
INTEL VCARB 2014
Email: johan@frostbite.com
Web: http://frostbite.com
Twitter: @repi
RENDERING USE CASES

• Frostbite has 2 very different rendering use cases:
1. Rendering the world with huge amounts of objects
   • Tons of draw calls with lots of different states & pipelines
   • Heavily CPU limited with high-level APIs
   • Read-only view of the world and resources (except for render targets)
2. Setting up rendering and doing lighting, post-fx, virtual texturing, compute, etc.
   • Tons of different types of complex operations, not a lot of draw calls
   • ~50 different rendering passes in Frostbite
   • Managing resource state and memory, and running on different queues (graphics, compute, DMA)
• Both are very important, and the low-level discussion & design targets here are for both!
LOW-LEVEL API TARGETS

• I consider low-level APIs to have 3 important design targets:
1. Solving the many-draw-calls CPU performance problem
   • CPU overhead
   • Parallel command buffer building
   • Binding model & pipeline objects
   • Explicit resource state handling
2. Enabling new GPU programmability
   • Bindless
   • CPU/GPU collaboration
   • GPU work creation
3. Improving GPU performance & memory usage
   • Utilizing multiple hardware engines in parallel
   • Explicit memory management / virtual memory
   • Explicit resource state handling
• I believe one needs a clean-slate API design to target all of these – a new paradigm & model
• Hence Mantle & DX12 – there is too much legacy in the way in DX11, GL4 and GLES
PERFORMANCE

• 1-2 orders of magnitude less CPU overhead for lots of draw calls
   • Thanks to explicit resource barriers, explicit memory management & few bind points
   • The problem for us is not "how to do 1 million similar draw calls with the same state"
   • The problem is "how to do 10-100k draw calls with different state"
   • Should in parallel also try to reduce the amount of state (e.g. via bindless), but be wary of GPU overhead
• Stable & consistent performance
   • A major benefit for users; it will only become more important going forward
   • Explicit submission to the GPU, not at arbitrary commands
   • No runtime late binding or compilation of shaders (see the sketch below)
   • Can be a challenge for engines to know all pipeline state up front, but worth designing for!
• Improved GPU performance
   • The engine has more high-level knowledge of how resources are used
   • Seen examples of advanced GPU driver optimizations that are easier to implement thanks to the low-level model
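To make "no runtime shader compilation" concrete, here is a minimal sketch of the pipeline-object model, using Vulkan (the Mantle-derived API, which postdates this talk) for illustration. PipelineKey and createPipeline are hypothetical engine-side names, not part of any real API:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hashed description of all pipeline state (shaders, blend, raster, ...);
// a real engine key would be a struct - uint64_t keeps the sketch simple.
using PipelineKey = uint64_t;

// Hypothetical helper that calls vkCreateGraphicsPipelines internally.
VkPipeline createPipeline(VkDevice device, PipelineKey key);

std::unordered_map<PipelineKey, VkPipeline> g_pipelines;

// Compile every state combination up front (e.g. at level load) so the
// driver never late-binds or compiles shaders at draw time.
void warmPipelines(VkDevice device, const std::vector<PipelineKey>& keysUsedByLevel)
{
    for (PipelineKey key : keysUsedByLevel)
        g_pipelines[key] = createPipeline(device, key);
}

// At draw time: a pure lookup, no compilation -> stable, consistent performance.
void bindPipeline(VkCommandBuffer cmd, PipelineKey key)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, g_pipelines.at(key));
}
```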
[Performance graph slides: DX11 vs. MANTLE]
RESOURCE TRANSITIONS

• A key design choice to significantly lower driver overhead and complexity
   • Explicit hazard tracking
   • Hides architecture-specific caches
• Can be a challenge in the apps/engines – but worth it
   • Esp. with multiple queues & out-of-order command buffers
   • Requires very clear specifications and a great validation layer!
• In Frostbite we mostly track this per resource, for simplicity & performance
   • Instead of per subresource (see the sketch below)
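A minimal sketch of the explicit-barrier model, in Vulkan terms (again, Vulkan postdates this talk but inherits the Mantle design): the app, not the driver, declares the hazard when a render target goes from being written to being sampled.

```cpp
#include <vulkan/vulkan.h>

// Transition a render target from "color attachment being written" to
// "texture read by fragment shaders". The app declares the hazard; the
// driver no longer has to track it.
void transitionToShaderRead(VkCommandBuffer cmd, VkImage renderTarget)
{
    VkImageMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT; // previous use
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;            // next use
    barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = renderTarget;
    // Per-resource tracking as described above: transition all subresources
    // at once instead of tracking each mip/slice separately.
    barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT,
                                 0, VK_REMAINING_MIP_LEVELS,
                                 0, VK_REMAINING_ARRAY_LAYERS };

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // producing stage
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // consuming stage
        0, 0, nullptr, 0, nullptr, 1, &barrier);
}
```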
RESOURCE TRANSITIONS (CONT.)

• Example complex cases:
   • Read-only depth testing together with stencil writing (different states for depth & stencil)
   • Mipmap generation (subresource-specific states; see the sketch below)
   • Async compute offloading part of the graphics pipeline
• Critical to make sure future low-level APIs have the right states/usages exposed
   • The devil is in the details
   • Does the software/hardware require the transition to happen on the same queue that just used the resource?
   • Look at your concrete use cases early!
• It would help if future hardware didn't need as many different barriers
   • But at what cost?
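Mipmap generation is the classic case where per-subresource state is unavoidable: mip N-1 must be a transfer source while mip N is still a transfer destination. A hedged Vulkan sketch, with blitMip as a hypothetical helper:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical helper: records a vkCmdBlitImage downsample from srcMip to dstMip.
void blitMip(VkCommandBuffer cmd, VkImage image, uint32_t srcMip, uint32_t dstMip);

// Each barrier targets exactly one subresource (one mip level), unlike the
// whole-resource transitions used for simple cases.
void generateMips(VkCommandBuffer cmd, VkImage image, uint32_t mipCount)
{
    for (uint32_t mip = 1; mip < mipCount; ++mip)
    {
        VkImageMemoryBarrier b = {};
        b.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
        b.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
        b.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
        b.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
        b.newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
        b.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        b.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        b.image = image;
        b.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, mip - 1, 1, 0, 1 }; // one mip only
        vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
            VK_PIPELINE_STAGE_TRANSFER_BIT, 0, 0, nullptr, 0, nullptr, 1, &b);

        blitMip(cmd, image, mip - 1, mip);
    }
}
```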
RESOURCE TABLES

• A.k.a. "descriptor sets" in Mantle
• This new model has been working very well for us – even with very basic handling
   • Great to have them as separate objects not connected to a device context
   • We treated all resource references as dynamic and rebuilt them every frame
   • ~15k resource entries in a single large table per frame in BF4 (heavy instancing)
• Lots of opportunity going forward (see the sketch below)
   • Split out static resources into their own persistent tables
   • Split out shared common resources into their own table
   • Connect them together with nested resource tables
   • Bindless for more complex cases
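A sketch of the static/dynamic split in Vulkan terms: Vulkan has no nested tables like Mantle, but binding multiple descriptor sets approximates the same idea – a persistent table of shared resources alongside a per-frame table that is rebuilt every frame.

```cpp
#include <vulkan/vulkan.h>

// Bind a persistent table of static/shared resources (set 0) alongside a
// per-frame table rebuilt each frame (set 1).
void bindResourceTables(VkCommandBuffer cmd, VkPipelineLayout layout,
                        VkDescriptorSet staticSet, VkDescriptorSet perFrameSet)
{
    VkDescriptorSet sets[2] = { staticSet, perFrameSet };
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                            0,           // first set index
                            2, sets,     // both tables in one call
                            0, nullptr); // no dynamic offsets in this sketch
}
```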
MULTIPLE GPU QUEUES

• We have seen really good wins with both async DMAs and async compute (see the sketch below)
   • And it is an important target for us going forward
• Additional opportunities
   • DMA in and out of embedded memory
   • Buffer/image/video compression & decompression
   • More?
• Which engines/queues does Intel have & which of them can be exposed?
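A minimal async-compute sketch, again in Vulkan terms: graphics and compute work are submitted to separate queues and synchronized with a semaphore, so the GPU can overlap them wherever dependencies allow. The queue handles and the G-buffer/lighting split are illustrative assumptions.

```cpp
#include <vulkan/vulkan.h>

// Submit G-buffer rendering on the graphics queue, then run e.g. tiled
// lighting on the compute queue once the G-buffer is ready.
void submitAsyncCompute(VkQueue graphicsQueue, VkQueue computeQueue,
                        VkCommandBuffer gfxCmd, VkCommandBuffer computeCmd,
                        VkSemaphore gbufferDone)
{
    VkSubmitInfo gfx = {};
    gfx.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    gfx.commandBufferCount = 1;
    gfx.pCommandBuffers = &gfxCmd;
    gfx.signalSemaphoreCount = 1;
    gfx.pSignalSemaphores = &gbufferDone;      // signaled when the G-buffer is ready
    vkQueueSubmit(graphicsQueue, 1, &gfx, VK_NULL_HANDLE);

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
    VkSubmitInfo comp = {};
    comp.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    comp.waitSemaphoreCount = 1;
    comp.pWaitSemaphores = &gbufferDone;       // compute starts only when safe
    comp.pWaitDstStageMask = &waitStage;
    comp.commandBufferCount = 1;
    comp.pCommandBuffers = &computeCmd;
    vkQueueSubmit(computeQueue, 1, &comp, VK_NULL_HANDLE);
}
```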
GPU/CPU COLLABORATION

• Kick off a CPU job from the GPU
   • Possible now with explicit fences & events that the CPU can poll/wait on (see the sketch below)
   • Enables filling in resources just in time
• Want async command buffers
   • Kick off a CPU job from the GPU, and the CPU builds & unlocks an already queued command buffer
   • We've been doing this on consoles – it is "just" a software limitation
   • Example use case: Sample Distribution Shadow Maps without stalling the GPU pipeline
• Major opportunity going forward
   • Needs support in OSes & driver models
   • Drive the rendering pipeline based on data from the current GPU frame (such as the z-buffer)
   • Decide where to run code based on power efficiency
   • Important both for discrete & integrated GPUs
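A sketch of the fence-polling flavor of "kick a CPU job from the GPU", in Vulkan terms: the CPU polls a fence the GPU signals mid-frame, then fills in data just in time for work already queued behind that point. runOtherCpuJobs and fillInShadowmapPartitions are hypothetical engine-side names.

```cpp
#include <vulkan/vulkan.h>

void runOtherCpuJobs();              // hypothetical job-system call
void fillInShadowmapPartitions();    // hypothetical: SDSM setup from GPU z-buffer data

// Poll a fence the GPU signals mid-frame, then fill in data just in time.
void cpuJobAfterGpuPoint(VkDevice device, VkFence gpuReachedPoint)
{
    // Poll instead of blocking so other CPU jobs keep running meanwhile.
    while (vkGetFenceStatus(device, gpuReachedPoint) == VK_NOT_READY)
        runOtherCpuJobs();

    fillInShadowmapPartitions();
}
```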
EXPLICIT MEMORY MANAGEMENT

• For us this has been both easier to work with & a way to get good performance
   • It is what we are used to from working on consoles, and we already have the architecture for it
• Update buffers from any thread, not locked to a device context
• Persistently map or pin buffer & image objects for easy reading & writing
• Pool memory to reduce overhead
• Alias objects onto the same memory for a significant reduction in memory use (see the sketch below)
   • Esp. for render targets
• Built-in virtual memory mapping
   • An easier & more flexible way to manage large amounts of memory
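A sketch of render-target aliasing in Vulkan terms: two transient targets are bound into the same allocation, valid as long as their lifetimes within the frame don't overlap. The target names, size, and memory type index are illustrative assumptions.

```cpp
#include <vulkan/vulkan.h>

// Place two transient render targets in the same allocation so only one
// memory footprint is paid instead of two.
void aliasRenderTargets(VkDevice device, VkImage ssaoTarget, VkImage bloomTarget,
                        uint32_t deviceLocalTypeIndex, VkDeviceMemory* outPool)
{
    VkMemoryAllocateInfo alloc = {};
    alloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc.allocationSize = 64ull * 1024 * 1024;   // one pool serves both targets
    alloc.memoryTypeIndex = deviceLocalTypeIndex;
    vkAllocateMemory(device, &alloc, nullptr, outPool);

    // Both images bound at offset 0: they alias the same physical memory.
    // Barriers must separate their uses within the frame.
    vkBindImageMemory(device, ssaoTarget, *outPool, 0);
    vkBindImageMemory(device, bloomTarget, *outPool, 0);
}
```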
OVERCOMMITTING VIDEO MEMORY

• A major issue for us during BF4 was avoiding VidMM stalls
   • VidMM is a black box – difficult to know what is going on & why
• Explicitly track memory references for each command buffer
• Tweak memory pools & chunk sizes
• Force memory to different heaps
• Set memory priorities (see the sketch below)
• Going forward we will redesign to strongly avoid overcommitting
   • Automatically balance streaming pool settings and cap graphics settings
• Are there any other ways the app, OS and GPUs can handle this?
   • Page-faulting GPUs?
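For the memory-priority point, a hedged sketch using Vulkan's VK_EXT_memory_priority extension (an assumption here – it postdates this talk, but expresses the same hint the OS video memory manager can act on under pressure):

```cpp
#include <vulkan/vulkan.h>

// Allocate a render target's memory with maximum priority so the OS evicts
// lower-priority allocations first when video memory is overcommitted.
VkDeviceMemory allocateHighPriority(VkDevice device, VkDeviceSize size,
                                    uint32_t deviceLocalTypeIndex)
{
    VkMemoryPriorityAllocateInfoEXT prio = {};
    prio.sType = VK_STRUCTURE_TYPE_MEMORY_PRIORITY_ALLOCATE_INFO_EXT;
    prio.priority = 1.0f;                 // 0.0 = first to evict, 1.0 = last

    VkMemoryAllocateInfo alloc = {};
    alloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc.pNext = &prio;
    alloc.allocationSize = size;
    alloc.memoryTypeIndex = deviceLocalTypeIndex;

    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &alloc, nullptr, &memory);
    return memory;
}
```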
EXTENSIONS

• Extensions are a great fit for low-level APIs
   • Low-level extensions that expose new hardware functionality
   • Examples: PixelSync and EXT_shader_pixel_local_storage
   • No need for the huge number of extensions that OpenGL has, as OpenGL is mostly a high-level API
   • Mantle has 5 extensions, for AMD-specific hardware functionality & Windows-specific integration
• A potential challenge for DX, which has (officially) not had extensions before
   • Would like to see official DX extensions, including shader code extensions!
   • GL & Mantle have a strong advantage here
   • The other alternative would be rapid iterations on the DX API (small updates quarterly?)
THANKS

• Discuss! :)

Editor's Notes

• Slide 2: I've been at DICE & EA for 13 years