Presentation & discussion around low-level graphics APIs. This was a quickly made presentation that I put together for a discussion with Intel and fellow ISVs; I thought it could be worth sharing.
2. Frostbite has 2 very different rendering use cases:
1. Rendering the world with huge amounts of objects
Tons of draw calls and with lots of different states & pipelines
Heavily CPU limited with high-level APIs
Read-only view of the world and resources (except for render targets)
2. Setting up rendering and doing lighting, post-fx, virtual texturing, compute, etc
Tons of different types of complex operations, not a lot of draw calls
~50 different rendering passes in Frostbite
Managing resource state, memory and running on different queues (graphics, compute, DMA)
Both are very important, and the low-level discussion & design targets are for both!
RENDERING USE CASES
3. I consider low-level APIs to have 3 important design targets:
1. Solving the many draw calls CPU performance problem
CPU overhead
Parallel command buffer building
Binding model & pipeline objects
Explicit resource state handling
2. Enabling new GPU programmability
Bindless
CPU/GPU collaboration
GPU work creation
3. Improving GPU performance & memory usage
Utilize multiple hardware engines in parallel
Explicit memory management / virtual memory
Explicit resource state handling
Believe one needs a clean-slate API design to target all of these: a new paradigm model
Hence Mantle & DX12: too much legacy in the way in DX11, GL4 and GLES
LOW-LEVEL API TARGETS
4. 1-2 orders of magnitude less CPU overhead for lots of draw calls
Thanks to explicit resource barriers, explicit memory management & few bind points
Problem for us is not "how to do 1 million similar draw calls with the same state"
Problem is "how to do 10-100k draw calls with different state"
Should in parallel also try to reduce the amount of state (such as with bindless), but be wary of GPU overhead
Stable & consistent performance
Major benefit for users, will only be more important going forward
Explicit submission to the GPU, not at arbitrary points in the command stream
No runtime late binding or compilation of shaders
Can be a challenge for engines to know all pipeline state up front, but worth designing for!
Improved GPU performance
Engine has more high-level knowledge of how resources are used
Seen examples of advanced GPU driver optimizations that are easier to implement thanks to the low-level model
PERFORMANCE
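The draw-call scaling above relies on parallel command buffer building. A minimal sketch (not from the slides; all names hypothetical) of the pattern: each worker thread records into its own command list with no locks or shared device context, and the main thread then submits the lists in a fixed order.

```python
# Conceptual sketch of parallel command buffer building.
# Worker threads record independent command lists; submission order stays deterministic.
from concurrent.futures import ThreadPoolExecutor

def record_commands(draws):
    # Record one command buffer: no locks, no shared device context.
    cmd_list = []
    for draw in draws:
        cmd_list.append(("set_pipeline", draw["pipeline"]))
        cmd_list.append(("draw", draw["mesh"]))
    return cmd_list

def build_frame(draw_batches):
    # Record all command buffers in parallel...
    with ThreadPoolExecutor() as pool:
        lists = list(pool.map(record_commands, draw_batches))
    # ...then "submit" them to the queue in a fixed, deterministic order.
    return [cmd for cmd_list in lists for cmd in cmd_list]
```

Because recording touches no shared state, the CPU cost scales with core count; only the final submission is serialized.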
7. Key design choice to significantly lower driver overhead and complexity
Explicit hazard tracking
Hides architecture-specific caches
Can be a challenge in apps/engines, but worth it
Esp. with multiple queues & out-of-order command buffers
Requires very clear specifications and great validation layer!
In Frostbite we mostly track this per resource for simplicity & performance
Instead of per subresource
RESOURCE TRANSITIONS
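A minimal sketch (not from the slides; names hypothetical) of the per-resource hazard tracking described above: the engine records a barrier only when a resource's required state actually changes.

```python
# Conceptual sketch of explicit, per-resource hazard tracking.
class ResourceStateTracker:
    def __init__(self):
        self.states = {}    # resource name -> current usage state
        self.barriers = []  # transitions recorded into the command buffer

    def require(self, resource, new_state):
        # Emit a transition only when the state actually changes.
        old = self.states.get(resource, "undefined")
        if old != new_state:
            self.barriers.append((resource, old, new_state))
            self.states[resource] = new_state
```

The engine calls `require` before each use; redundant requests for the same state cost nothing, which is why tracking per resource (rather than per subresource) stays cheap.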
8. Example complex cases:
Read-only depth testing together with stencil writing (different state for depth & stencil)
Mipmap generation (sub-resource specific states)
Async compute offloading part of graphics pipeline
Critical to make sure future low-level APIs have the right states/usages exposed
Devil is in the details
Does the software/hardware require the transition to happen on the same queue that just used the resource?
Look at your concrete use cases early!
Would help if future hardware doesn’t need as many different barriers
But at what cost?
RESOURCE TRANSITIONS
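Mipmap generation is the classic case where per-resource tracking is not enough: each pass samples mip N while rendering to mip N+1, so the two subresources need different states at the same time. A hypothetical sketch (not slide content) of per-subresource tracking for that case:

```python
# Conceptual sketch: per-subresource states for mipmap generation.
class SubresourceTracker:
    def __init__(self, mip_count):
        self.state = {m: "undefined" for m in range(mip_count)}
        self.barriers = []

    def require(self, mip, new_state):
        if self.state[mip] != new_state:
            self.barriers.append((mip, self.state[mip], new_state))
            self.state[mip] = new_state

def generate_mips(tracker, mip_count):
    for src in range(mip_count - 1):
        tracker.require(src, "shader_read")        # sample mip N...
        tracker.require(src + 1, "render_target")  # ...while writing mip N+1
        # ... record the downsample draw here ...
```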
9. Aka "descriptor sets" in Mantle
This new model has been working very well for us – even with very basic handling
Great to have as separate objects not connected to device context
Treated all resource references as dynamic and built every frame
~15k resource entries in single large table per frame in BF4 (heavy instancing)
Lots of opportunity going forward
Split out static resources to own persistent tables
Split out shared common resources to own table
Connect together with nested resource tables
Bindless for more complex cases
RESOURCE TABLES
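The nesting opportunity above can be sketched as tables that fall back to other tables: a per-frame dynamic table links to persistent static/shared tables, so only the dynamic entries need rebuilding each frame. A minimal model (all names hypothetical, not from the slides):

```python
# Conceptual sketch of nested resource tables (Mantle-style descriptor sets).
def make_table(entries=None, nested=None):
    # A resource table: name -> resource, plus links to nested tables.
    return {"entries": dict(entries or {}), "nested": list(nested or [])}

def resolve(table, name):
    # Look up a resource, falling back to nested (static/shared) tables.
    if name in table["entries"]:
        return table["entries"][name]
    for inner in table["nested"]:
        found = resolve(inner, name)
        if found is not None:
            return found
    return None
```

With this split, the ~15k dynamic entries mentioned above shrink to only the per-frame resources, while static resources live in persistent tables reached through the nesting link.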
10. Seen really good wins with both async DMAs and async compute
And it is an important target for us going forward
Additional opportunities
DMA in and out of embedded memory
Buffer/Image/Video compression/decompression
More?
What engines/queues does Intel have & would be able to expose?
MULTIPLE GPU QUEUES
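The win from multiple queues comes from overlap: work on the graphics, compute and DMA engines can run in parallel, so frame time tends toward the busiest queue rather than the sum of all queues. A toy model (hypothetical numbers, not from the slides) of that accounting:

```python
# Toy model: frame time with and without multi-queue overlap.
def frame_time(jobs):
    # jobs: list of (queue, milliseconds). Each queue runs its work serially,
    # but different engines (graphics/compute/DMA) can run in parallel.
    per_queue = {}
    for queue, ms in jobs:
        per_queue[queue] = per_queue.get(queue, 0.0) + ms
    serial = sum(per_queue.values())      # everything on one queue
    overlapped = max(per_queue.values())  # ideal overlap across engines
    return serial, overlapped
```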
11. Kick CPU job from GPU
Possible now with explicit fences & events that CPU can poll/wait on
Enables filling in resources just in time
Want async command buffers
Kick CPU job from GPU and CPU builds & unlocks an already queued command buffer
We’ve been doing this on consoles; "just" a software limitation
Example use case: Sample Distributed Shadowmaps without stalling GPU pipeline
Major opportunity going forward
Needs support in OSes & driver models
Drive rendering pipeline based on data from the current GPU frame (such as the zbuffer)
Decide where to run code based on power efficiency
Important both for discrete & integrated GPUs
GPU/CPU COLLABORATION
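The "kick CPU job from GPU" flow above can be modeled with an explicit fence the GPU signals and the CPU polls/waits on, with the CPU job filling in a resource just in time. A sketch using CPU threads to stand in for the two timelines (all names hypothetical):

```python
# Conceptual sketch: GPU signals an explicit fence, a CPU job waits on it
# and fills a resource just in time for a later GPU pass.
import threading

class Fence:
    def __init__(self):
        self._event = threading.Event()

    def signal(self):              # called from the "GPU" timeline
        self._event.set()

    def wait(self, timeout=None):  # called from the CPU job system
        return self._event.wait(timeout)

def run_frame():
    fence = Fence()
    resource = {}

    def gpu_timeline():
        # ... GPU renders e.g. the z-buffer, then signals the fence ...
        fence.signal()

    def cpu_job():
        fence.wait()  # job is kicked once the GPU data is ready
        resource["shadow_data"] = "filled just in time"

    gpu = threading.Thread(target=gpu_timeline)
    cpu = threading.Thread(target=cpu_job)
    gpu.start(); cpu.start()
    gpu.join(); cpu.join()
    return resource
```

The async command buffer idea goes one step further: the CPU job would also build & unlock an already queued command buffer instead of just filling resources.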
12. For us, it has been both easier to work with & to get good performance from
It is what we are used to from working on consoles and have an architecture for
Update buffers from any thread, not locked to a device context
Persistently map or pin buffer & image objects for easy reading & writing
Pool memory to reduce overhead
Alias objects to the same memory for significant reduction in memory
Esp. for render targets
Built-in virtual memory mapping
Easier & more flexible way to manage large amounts of memory
EXPLICIT MEMORY MANAGEMENT
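The aliasing win above follows from lifetimes: render targets that are never live in the same passes can share the same memory, so the footprint is the peak of concurrently-live targets rather than their sum. A toy calculation (hypothetical pass ranges, not from the slides):

```python
# Toy model: memory saved by aliasing render targets with disjoint lifetimes.
def peak_aliased_bytes(targets):
    # targets: list of (first_pass, last_pass, size_in_bytes).
    # With aliasing, footprint = peak of concurrently-live targets, not the sum.
    last = max(t[1] for t in targets)
    peak = 0
    for p in range(last + 1):
        live = sum(size for first, final, size in targets if first <= p <= final)
        peak = max(peak, live)
    return peak
```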
13. Major issue for us during BF4, avoiding VidMM stalls
VidMM is a black box: difficult to know what is going on & why
Explicitly track memory references for each command buffer
Tweak memory pools & chunk sizes
Force memory to different heaps
Setting memory priorities
Going forward will redesign to strongly avoid overcommitting
Automatically balance streaming pool settings and cap graphics settings
Are there any other ways the app, OS and GPUs can handle this?
Page faulting GPUs?
OVERCOMMITTING VIDEO MEMORY
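One piece of the mitigation above, explicitly tracking memory references for each command buffer, can be sketched as a pre-submit residency check against the video-memory budget (all names hypothetical, not from the slides):

```python
# Conceptual sketch: per-command-buffer memory reference tracking vs. budget.
def fits_in_budget(referenced, allocation_sizes, budget_bytes):
    # referenced: allocation names this command buffer touches (may repeat).
    # Check the residency set against the video-memory budget before submit,
    # so the app can evict or cap settings instead of letting VidMM stall it.
    needed = sum(allocation_sizes[name] for name in set(referenced))
    return needed <= budget_bytes, needed
```

When the check fails, the app can rebalance streaming pools or cap graphics settings instead of overcommitting.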
14. Extensions are a great fit for low-level APIs
Low-level extensions that expose new hardware functionality
Examples: PixelSync and EXT_shader_pixel_local_storage
No need for a huge number of extensions like OpenGL, which is mostly a high-level API
Mantle has 5 extensions for the AMD-specific hardware functionality & Windows-specific integration
Potential challenge for DX that has (officially) not had extensions before
Would like to see DX official extensions, including shader code extensions!
GL & Mantle have a strong advantage here
Other alternative would be rapid iterations on the DX API (small updates quarterly?)
EXTENSIONS