Johan Andersson
Technical Director
Frostbite
LOW-LEVEL GRAPHICS
INTEL VCARB 2014
Email: johan@frostbite.com
Web: http://frostbite.com
Twitter: @repi
RENDERING USE CASES

• Frostbite has 2 very different rendering use cases:
1. Rendering the world with huge amounts of objects
   • Tons of draw calls with lots of different states & pipelines
   • Heavily CPU limited with high-level APIs
   • Read-only view of the world and resources (except for render targets)
2. Setting up rendering and doing lighting, post-fx, virtual texturing, compute, etc.
   • Tons of different types of complex operations, not a lot of draw calls
   • ~50 different rendering passes in Frostbite
   • Managing resource state and memory, and running on different queues (graphics, compute, DMA)
• Both are very important, and the low-level discussion & design targets here are for both!
LOW-LEVEL API TARGETS

• I consider low-level APIs to have 3 important design targets:
1. Solving the many-draw-calls CPU performance problem
   • CPU overhead
   • Parallel command buffer building
   • Binding model & pipeline objects
   • Explicit resource state handling
2. Enabling new GPU programmability
   • Bindless
   • CPU/GPU collaboration
   • GPU work creation
3. Improving GPU performance & memory usage
   • Utilizing multiple hardware engines in parallel
   • Explicit memory management / virtual memory
   • Explicit resource state handling
• I believe one needs a clean-slate API design to target all of these – a new paradigm & model
• Hence Mantle & DX12 – there is too much legacy in the way in DX11, GL4 and GLES
PERFORMANCE

• 1-2 orders of magnitude less CPU overhead for lots of draw calls
   • Thanks to explicit resource barriers, explicit memory management & few bind points
   • The problem for us is not "how to do 1 million similar draw calls with the same state"
   • The problem is "how to do 10-100k draw calls with different state"
   • Should in parallel also try to reduce the amount of state (e.g. via bindless), but be wary of GPU overhead
• Stable & consistent performance
   • A major benefit for users; it will only become more important going forward
   • Explicit submission to the GPU, not at arbitrary commands
   • No runtime late binding or compilation of shaders (see the sketch below)
   • Can be a challenge for engines to know all pipeline state up front, but worth designing for!
• Improved GPU performance
   • The engine has more high-level knowledge of how resources are used
   • Seen examples of advanced GPU driver optimizations that are easier to implement thanks to the low-level model
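To make "no runtime shader compilation" concrete, here is a minimal sketch of the pipeline-object model, using Vulkan (the Mantle-derived API, which postdates this talk) for illustration. PipelineKey and createPipeline are hypothetical engine-side names, not part of any real API:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hashed description of all pipeline state (shaders, blend, raster, ...);
// a real engine key would be a struct - uint64_t keeps the sketch simple.
using PipelineKey = uint64_t;

// Hypothetical helper that calls vkCreateGraphicsPipelines internally.
VkPipeline createPipeline(VkDevice device, PipelineKey key);

std::unordered_map<PipelineKey, VkPipeline> g_pipelines;

// Compile every state combination up front (e.g. at level load) so the
// driver never late-binds or compiles shaders at draw time.
void warmPipelines(VkDevice device, const std::vector<PipelineKey>& keysUsedByLevel)
{
    for (PipelineKey key : keysUsedByLevel)
        g_pipelines[key] = createPipeline(device, key);
}

// At draw time: a pure lookup, no compilation -> stable, consistent performance.
void bindPipeline(VkCommandBuffer cmd, PipelineKey key)
{
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, g_pipelines.at(key));
}
```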
[Performance graph slides: DX11 vs. MANTLE]
RESOURCE TRANSITIONS

• A key design choice to significantly lower driver overhead and complexity
   • Explicit hazard tracking
   • Hides architecture-specific caches
• Can be a challenge in the apps/engines – but worth it
   • Esp. with multiple queues & out-of-order command buffers
   • Requires very clear specifications and a great validation layer!
• In Frostbite we mostly track this per resource, for simplicity & performance
   • Instead of per subresource (see the sketch below)
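A minimal sketch of the explicit-barrier model, in Vulkan terms (again, Vulkan postdates this talk but inherits the Mantle design): the app, not the driver, declares the hazard when a render target goes from being written to being sampled.

```cpp
#include <vulkan/vulkan.h>

// Transition a render target from "color attachment being written" to
// "texture read by fragment shaders". The app declares the hazard; the
// driver no longer has to track it.
void transitionToShaderRead(VkCommandBuffer cmd, VkImage renderTarget)
{
    VkImageMemoryBarrier barrier = {};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT; // previous use
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;            // next use
    barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = renderTarget;
    // Per-resource tracking as described above: transition all subresources
    // at once instead of tracking each mip/slice separately.
    barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT,
                                 0, VK_REMAINING_MIP_LEVELS,
                                 0, VK_REMAINING_ARRAY_LAYERS };

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // producing stage
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,         // consuming stage
        0, 0, nullptr, 0, nullptr, 1, &barrier);
}
```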
RESOURCE TRANSITIONS (CONT.)

• Example complex cases:
   • Read-only depth testing together with stencil writing (different states for depth & stencil)
   • Mipmap generation (subresource-specific states; see the sketch below)
   • Async compute offloading part of the graphics pipeline
• Critical to make sure future low-level APIs have the right states/usages exposed
   • The devil is in the details
   • Does the software/hardware require the transition to happen on the same queue that just used the resource?
   • Look at your concrete use cases early!
• It would help if future hardware didn't need as many different barriers
   • But at what cost?
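Mipmap generation is the classic case where per-subresource state is unavoidable: mip N-1 must be a transfer source while mip N is still a transfer destination. A hedged Vulkan sketch, with blitMip as a hypothetical helper:

```cpp
#include <vulkan/vulkan.h>

// Hypothetical helper: records a vkCmdBlitImage downsample from srcMip to dstMip.
void blitMip(VkCommandBuffer cmd, VkImage image, uint32_t srcMip, uint32_t dstMip);

// Each barrier targets exactly one subresource (one mip level), unlike the
// whole-resource transitions used for simple cases.
void generateMips(VkCommandBuffer cmd, VkImage image, uint32_t mipCount)
{
    for (uint32_t mip = 1; mip < mipCount; ++mip)
    {
        VkImageMemoryBarrier b = {};
        b.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
        b.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
        b.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
        b.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
        b.newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
        b.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        b.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
        b.image = image;
        b.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, mip - 1, 1, 0, 1 }; // one mip only
        vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
            VK_PIPELINE_STAGE_TRANSFER_BIT, 0, 0, nullptr, 0, nullptr, 1, &b);

        blitMip(cmd, image, mip - 1, mip);
    }
}
```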
RESOURCE TABLES

• A.k.a. "descriptor sets" in Mantle
• This new model has been working very well for us – even with very basic handling
   • Great to have them as separate objects not connected to a device context
   • We treated all resource references as dynamic and rebuilt them every frame
   • ~15k resource entries in a single large table per frame in BF4 (heavy instancing)
• Lots of opportunity going forward (see the sketch below)
   • Split out static resources into their own persistent tables
   • Split out shared common resources into their own table
   • Connect them together with nested resource tables
   • Bindless for more complex cases
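A sketch of the static/dynamic split in Vulkan terms: Vulkan has no nested tables like Mantle, but binding multiple descriptor sets approximates the same idea – a persistent table of shared resources alongside a per-frame table that is rebuilt every frame.

```cpp
#include <vulkan/vulkan.h>

// Bind a persistent table of static/shared resources (set 0) alongside a
// per-frame table rebuilt each frame (set 1).
void bindResourceTables(VkCommandBuffer cmd, VkPipelineLayout layout,
                        VkDescriptorSet staticSet, VkDescriptorSet perFrameSet)
{
    VkDescriptorSet sets[2] = { staticSet, perFrameSet };
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, layout,
                            0,           // first set index
                            2, sets,     // both tables in one call
                            0, nullptr); // no dynamic offsets in this sketch
}
```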
MULTIPLE GPU QUEUES

• We have seen really good wins with both async DMAs and async compute (see the sketch below)
   • And it is an important target for us going forward
• Additional opportunities
   • DMA in and out of embedded memory
   • Buffer/image/video compression & decompression
   • More?
• Which engines/queues does Intel have & which of them can be exposed?
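A minimal async-compute sketch, again in Vulkan terms: graphics and compute work are submitted to separate queues and synchronized with a semaphore, so the GPU can overlap them wherever dependencies allow. The queue handles and the G-buffer/lighting split are illustrative assumptions.

```cpp
#include <vulkan/vulkan.h>

// Submit G-buffer rendering on the graphics queue, then run e.g. tiled
// lighting on the compute queue once the G-buffer is ready.
void submitAsyncCompute(VkQueue graphicsQueue, VkQueue computeQueue,
                        VkCommandBuffer gfxCmd, VkCommandBuffer computeCmd,
                        VkSemaphore gbufferDone)
{
    VkSubmitInfo gfx = {};
    gfx.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    gfx.commandBufferCount = 1;
    gfx.pCommandBuffers = &gfxCmd;
    gfx.signalSemaphoreCount = 1;
    gfx.pSignalSemaphores = &gbufferDone;      // signaled when the G-buffer is ready
    vkQueueSubmit(graphicsQueue, 1, &gfx, VK_NULL_HANDLE);

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
    VkSubmitInfo comp = {};
    comp.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    comp.waitSemaphoreCount = 1;
    comp.pWaitSemaphores = &gbufferDone;       // compute starts only when safe
    comp.pWaitDstStageMask = &waitStage;
    comp.commandBufferCount = 1;
    comp.pCommandBuffers = &computeCmd;
    vkQueueSubmit(computeQueue, 1, &comp, VK_NULL_HANDLE);
}
```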
GPU/CPU COLLABORATION

• Kick off a CPU job from the GPU
   • Possible now with explicit fences & events that the CPU can poll/wait on (see the sketch below)
   • Enables filling in resources just in time
• Want async command buffers
   • Kick off a CPU job from the GPU, and the CPU builds & unlocks an already queued command buffer
   • We've been doing this on consoles – it is "just" a software limitation
   • Example use case: Sample Distribution Shadow Maps without stalling the GPU pipeline
• Major opportunity going forward
   • Needs support in OSes & driver models
   • Drive the rendering pipeline based on data from the current GPU frame (such as the z-buffer)
   • Decide where to run code based on power efficiency
   • Important both for discrete & integrated GPUs
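A sketch of the fence-polling flavor of "kick a CPU job from the GPU", in Vulkan terms: the CPU polls a fence the GPU signals mid-frame, then fills in data just in time for work already queued behind that point. runOtherCpuJobs and fillInShadowmapPartitions are hypothetical engine-side names.

```cpp
#include <vulkan/vulkan.h>

void runOtherCpuJobs();              // hypothetical job-system call
void fillInShadowmapPartitions();    // hypothetical: SDSM setup from GPU z-buffer data

// Poll a fence the GPU signals mid-frame, then fill in data just in time.
void cpuJobAfterGpuPoint(VkDevice device, VkFence gpuReachedPoint)
{
    // Poll instead of blocking so other CPU jobs keep running meanwhile.
    while (vkGetFenceStatus(device, gpuReachedPoint) == VK_NOT_READY)
        runOtherCpuJobs();

    fillInShadowmapPartitions();
}
```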
EXPLICIT MEMORY MANAGEMENT

• For us this has been both easier to work with & a way to get good performance
   • It is what we are used to from working on consoles, and we already have the architecture for it
• Update buffers from any thread, not locked to a device context
• Persistently map or pin buffer & image objects for easy reading & writing
• Pool memory to reduce overhead
• Alias objects onto the same memory for a significant reduction in memory use (see the sketch below)
   • Esp. for render targets
• Built-in virtual memory mapping
   • An easier & more flexible way to manage large amounts of memory
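A sketch of render-target aliasing in Vulkan terms: two transient targets are bound into the same allocation, valid as long as their lifetimes within the frame don't overlap. The target names, size, and memory type index are illustrative assumptions.

```cpp
#include <vulkan/vulkan.h>

// Place two transient render targets in the same allocation so only one
// memory footprint is paid instead of two.
void aliasRenderTargets(VkDevice device, VkImage ssaoTarget, VkImage bloomTarget,
                        uint32_t deviceLocalTypeIndex, VkDeviceMemory* outPool)
{
    VkMemoryAllocateInfo alloc = {};
    alloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc.allocationSize = 64ull * 1024 * 1024;   // one pool serves both targets
    alloc.memoryTypeIndex = deviceLocalTypeIndex;
    vkAllocateMemory(device, &alloc, nullptr, outPool);

    // Both images bound at offset 0: they alias the same physical memory.
    // Barriers must separate their uses within the frame.
    vkBindImageMemory(device, ssaoTarget, *outPool, 0);
    vkBindImageMemory(device, bloomTarget, *outPool, 0);
}
```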
OVERCOMMITTING VIDEO MEMORY

• A major issue for us during BF4 was avoiding VidMM stalls
   • VidMM is a black box – difficult to know what is going on & why
• Explicitly track memory references for each command buffer
• Tweak memory pools & chunk sizes
• Force memory to different heaps
• Set memory priorities (see the sketch below)
• Going forward we will redesign to strongly avoid overcommitting
   • Automatically balance streaming pool settings and cap graphics settings
• Are there any other ways the app, OS and GPUs can handle this?
   • Page-faulting GPUs?
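For the memory-priority point, a hedged sketch using Vulkan's VK_EXT_memory_priority extension (an assumption here – it postdates this talk, but expresses the same hint the OS video memory manager can act on under pressure):

```cpp
#include <vulkan/vulkan.h>

// Allocate a render target's memory with maximum priority so the OS evicts
// lower-priority allocations first when video memory is overcommitted.
VkDeviceMemory allocateHighPriority(VkDevice device, VkDeviceSize size,
                                    uint32_t deviceLocalTypeIndex)
{
    VkMemoryPriorityAllocateInfoEXT prio = {};
    prio.sType = VK_STRUCTURE_TYPE_MEMORY_PRIORITY_ALLOCATE_INFO_EXT;
    prio.priority = 1.0f;                 // 0.0 = first to evict, 1.0 = last

    VkMemoryAllocateInfo alloc = {};
    alloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    alloc.pNext = &prio;
    alloc.allocationSize = size;
    alloc.memoryTypeIndex = deviceLocalTypeIndex;

    VkDeviceMemory memory = VK_NULL_HANDLE;
    vkAllocateMemory(device, &alloc, nullptr, &memory);
    return memory;
}
```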
EXTENSIONS

• Extensions are a great fit for low-level APIs
   • Low-level extensions that expose new hardware functionality
   • Examples: PixelSync and EXT_shader_pixel_local_storage
   • No need for the huge number of extensions that OpenGL has, as OpenGL is mostly a high-level API
   • Mantle has 5 extensions, for AMD-specific hardware functionality & Windows-specific integration
• A potential challenge for DX, which has (officially) not had extensions before
   • Would like to see official DX extensions, including shader code extensions!
   • GL & Mantle have a strong advantage here
   • The other alternative would be rapid iterations on the DX API (small updates quarterly?)
THANKS

• Discuss! :)

Editor's Notes

• Slide 2: I've been at DICE & EA for 13 years