Low-level Graphics APIs


Presentation & discussion around low-level graphics APIs. This was a quickly made presentation that I put together for a discussion with Intel and fellow ISVs; I thought it could be worth sharing.


  • I’ve been at DICE & EA for 13 years

    1. LOW-LEVEL GRAPHICS
       Johan Andersson, Technical Director, Frostbite
       Intel VCARB 2014
       Email: johan@frostbite.com | Web: http://frostbite.com | Twitter: @repi
    2. RENDERING USE CASES
       • Frostbite has 2 very different rendering use cases:
         1. Rendering the world, with huge amounts of objects
            - Tons of draw calls, with lots of different states & pipelines
            - Heavily CPU-limited with high-level APIs
            - Read-only view of the world and resources (except for render targets)
         2. Setting up rendering and doing lighting, post-FX, virtual texturing, compute, etc.
            - Tons of different types of complex operations, but not a lot of draw calls
            - ~50 different rendering passes in Frostbite
            - Managing resource state and memory, and running on different queues (graphics, compute, DMA)
       • Both are very important, and the low-level discussion & design should target both!
    3. LOW-LEVEL API TARGETS
       • I consider low-level APIs to have 3 important design targets:
         1. Solving the many-draw-calls CPU performance problem
            - CPU overhead
            - Parallel command buffer building
            - Binding model & pipeline objects
            - Explicit resource state handling
         2. Enabling new GPU programmability
            - Bindless
            - CPU/GPU collaboration
            - GPU work creation
         3. Improving GPU performance & memory usage
            - Utilize multiple hardware engines in parallel
            - Explicit memory management / virtual memory
            - Explicit resource state handling
       • I believe one needs a clean-slate API design to target all of these: a new paradigm model
       • Hence Mantle & DX12; there is too much legacy in the way in DX11, GL4 and GLES
    4. PERFORMANCE
       • 1-2 orders of magnitude less CPU overhead for lots of draw calls
         - Thanks to explicit resource barriers, explicit memory management & few bind points
       • The problem for us is not "how to do 1 million similar draw calls with the same state"
         - The problem is "how to do 10-100k draw calls with different state"
         - Should in parallel also try to reduce the amount of state (such as bindless), but be wary of GPU overhead
       • Stable & consistent performance
         - A major benefit for users; it will only become more important going forward
         - Explicit submission to the GPU, not at random commands
         - No runtime late binding or compilation of shaders
         - Can be a challenge for engines to know all pipeline state up front, but worth designing for!
       • Improved GPU performance
         - The engine has more high-level knowledge of how resources are used
         - Seen examples of advanced GPU driver optimizations that are easier to implement thanks to the low-level model
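The "no runtime late binding or compilation of shaders" point implies creating every pipeline state object up front, at load time. A minimal sketch of what such a load-time pipeline cache could look like; all types and names (`PipelineDesc`, `PipelineCache`) are invented for illustration, not any real API:

```cpp
// Sketch of up-front pipeline state creation: the engine enumerates every
// state combination it will use at load time and caches an immutable
// pipeline object per full state description, so nothing is compiled at
// draw time. Hypothetical types, not Mantle/DX12 calls.
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

struct PipelineDesc {
    std::string vertexShader;   // shader bytecode identifiers
    std::string pixelShader;
    int blendMode;              // fixed-function state baked into the object
    int rasterState;

    bool operator==(const PipelineDesc& o) const {
        return vertexShader == o.vertexShader && pixelShader == o.pixelShader
            && blendMode == o.blendMode && rasterState == o.rasterState;
    }
};

struct PipelineDescHash {
    size_t operator()(const PipelineDesc& d) const {
        return std::hash<std::string>()(d.vertexShader)
             ^ (std::hash<std::string>()(d.pixelShader) << 1)
             ^ (std::hash<int>()(d.blendMode) << 2)
             ^ (std::hash<int>()(d.rasterState) << 3);
    }
};

// Opaque handle standing in for a driver-compiled pipeline object.
using PipelineHandle = int;

class PipelineCache {
public:
    // Called at load time for every state combination the engine will use.
    PipelineHandle createPipeline(const PipelineDesc& desc) {
        auto it = cache_.find(desc);
        if (it != cache_.end()) return it->second;
        PipelineHandle h = nextHandle_++;   // stands in for a driver compile
        cache_.emplace(desc, h);
        return h;
    }
    size_t size() const { return cache_.size(); }
private:
    std::unordered_map<PipelineDesc, PipelineHandle, PipelineDescHash> cache_;
    PipelineHandle nextHandle_ = 1;
};
```

At draw time the engine then only binds prebuilt handles; a duplicate description returns the existing object instead of triggering a recompile.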
    5. DX11
    6. MANTLE
    7. RESOURCE TRANSITIONS
       • A key design point for significantly lowering driver overhead and complexity
         - Explicit hazard tracking
         - Hides architecture-specific caches
       • Can be a challenge in apps/engines, but worth it
         - Especially with multiple queues & out-of-order command buffers
         - Requires very clear specifications and a great validation layer!
       • In Frostbite we mostly track this per resource, for simplicity & performance
         - Instead of per subresource
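The per-resource tracking described on this slide can be sketched as a small table mapping each resource to its single current state, emitting a barrier only when the requested state differs. All names here are illustrative; this is engine-side bookkeeping, not a Mantle or DX12 API:

```cpp
// Minimal sketch of explicit, per-resource hazard tracking: one state per
// resource (not per subresource), with a barrier generated only on change.
#include <unordered_map>
#include <vector>

enum class ResourceState { Undefined, RenderTarget, ShaderRead, CopyDest };

struct Barrier {
    int resourceId;
    ResourceState before, after;
};

class StateTracker {
public:
    // Returns the barriers that must be recorded into the command buffer
    // before the resource can be used in `needed` state.
    std::vector<Barrier> require(int resourceId, ResourceState needed) {
        std::vector<Barrier> barriers;
        // operator[] default-constructs to Undefined for a new resource.
        ResourceState& current = states_[resourceId];
        if (current != needed) {
            barriers.push_back({resourceId, current, needed});
            current = needed;
        }
        return barriers;
    }
private:
    std::unordered_map<int, ResourceState> states_;
};
```

Tracking at this whole-resource granularity trades some false transitions (e.g. the mipmap-generation case on the next slide) for much simpler bookkeeping.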
    8. RESOURCE TRANSITIONS
       • Example complex cases:
         - Read-only depth testing together with stencil writing (different state for depth & stencil)
         - Mipmap generation (subresource-specific states)
         - Async compute offloading part of the graphics pipeline
       • Critical to make sure future low-level APIs have the right states/usages exposed
         - The devil is in the details
         - Does the software/hardware require the transition to happen on the same queue that just used a resource?
         - Look at your concrete use cases early!
       • It would help if future hardware didn't need as many different barriers
         - But at what cost?
    9. RESOURCE TABLES
       • Aka "descriptor sets" in Mantle
       • This new model has been working very well for us, even with very basic handling
         - Great to have as separate objects not connected to a device context
       • We treated all resource references as dynamic and rebuilt the tables every frame
         - ~15k resource entries in a single large table per frame in BF4 (heavy instancing)
       • Lots of opportunity going forward:
         - Split out static resources into their own persistent tables
         - Split out shared common resources into their own table
         - Connect tables together with nested resource tables
         - Bindless for more complex cases
    10. MULTIPLE GPU QUEUES
        • Seen really good wins with both async DMAs and async compute
          - And it is an important target for us going forward
        • Additional opportunities:
          - DMA in and out of embedded memory
          - Buffer/image/video compression & decompression
          - More?
        • What engines/queues does Intel have & would Intel be able to expose?
    11. GPU/CPU COLLABORATION
        • Kick a CPU job from the GPU
          - Possible now with explicit fences & events that the CPU can poll/wait on
          - Enables filling in resources just in time
        • Want async command buffers
          - Kick a CPU job from the GPU; the CPU builds & unlocks an already queued command buffer
          - We've been doing this on consoles; it is "just" a software limitation
          - Example use case: Sample Distribution Shadow Maps without stalling the GPU pipeline
          - A major opportunity going forward; needs support in OSes & driver models
        • Drive the rendering pipeline based on data from the current GPU frame (such as the z-buffer)
        • Decide where to run code based on power efficiency
          - Important for both discrete & integrated GPUs
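The fence-based GPU-to-CPU kick can be illustrated on the CPU side alone: a second thread stands in for the GPU timeline, and a monotonically increasing fence value is what the CPU job waits on before filling a resource just in time. A simulation sketch with invented names, not a real driver fence:

```cpp
// Simulated GPU->CPU kick via a timeline fence: the "GPU" thread signals a
// fence value, a CPU job waits on it and then fills in a resource that a
// later command buffer would consume.
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

class Fence {
public:
    void signal(uint64_t value) {
        { std::lock_guard<std::mutex> lock(m_); value_ = value; }
        cv_.notify_all();
    }
    void waitFor(uint64_t value) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return value_ >= value; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    uint64_t value_ = 0;   // monotonically increasing timeline value
};

// CPU side: wait until the simulated GPU has finished its pass, then
// "fill in" the buffer just in time. Returns the filled value.
int runJustInTimeFill() {
    Fence fence;
    int buffer = 0;
    std::thread gpu([&] { fence.signal(1); });  // GPU pass done, signal fence
    fence.waitFor(1);                           // CPU job kicked by the fence
    buffer = 42;                                // fill resource just in time
    gpu.join();
    return buffer;
}
```

The async-command-buffer idea on the slide is the same pattern one step further: the waiting CPU job finishes building a command buffer that the GPU has already queued behind the fence.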
    12. EXPLICIT MEMORY MANAGEMENT
        • For us, it has been both easier to work with & easier to get good performance from
          - It is what we are used to from working on consoles, and what we have an architecture for
        • Update buffers from any thread, not locked to a device context
        • Persistently map or pin buffer & image objects for easy reading & writing
        • Pool memory to reduce overhead
        • Alias objects to the same memory for a significant reduction in memory use
          - Especially for render targets
        • Built-in virtual memory mapping
          - An easier & more flexible way to manage large amounts of memory
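The render-target aliasing win comes from transient targets with non-overlapping lifetimes sharing one allocation. A sketch of the lifetime check, with an invented `PlacedResource` whose lifetime is measured in frame-pass indices (the pass counts and sizes below are made up for illustration):

```cpp
// Sketch of memory aliasing for transient render targets: two resources may
// share the same heap range if their pass lifetimes within the frame do not
// overlap, so the heap only needs the larger of the two sizes.
#include <algorithm>
#include <cstddef>

struct PlacedResource {
    size_t offset;
    size_t size;
    int firstUsePass, lastUsePass;  // lifetime within the frame
};

// Two resources may alias iff their lifetimes are disjoint.
bool canAlias(const PlacedResource& a, const PlacedResource& b) {
    return a.lastUsePass < b.firstUsePass || b.lastUsePass < a.firstUsePass;
}

// Heap size needed for two resources, simplified to the pairwise case:
// aliasing resources share max(size), otherwise the sizes add up.
size_t heapSizeFor(const PlacedResource& a, const PlacedResource& b) {
    if (canAlias(a, b)) return std::max(a.size, b.size);
    return a.size + b.size;
}
```

A real placement pass would solve this over all transient resources in the frame graph, but the saving per pair is exactly this max-versus-sum difference.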
    13. OVERCOMMITTING VIDEO MEMORY
        • A major issue for us during BF4 was avoiding VidMM stalls
        • VidMM is a black box
          - Difficult to know what is going on & why
        • What we did:
          - Explicitly track memory references for each command buffer
          - Tweak memory pools & chunk sizes
          - Force memory to different heaps
          - Set memory priorities
        • Going forward we will redesign to strongly avoid overcommitting
          - Automatically balance streaming pool settings and cap graphics settings
        • Are there any other ways the app, OS and GPUs can handle this?
          - Page-faulting GPUs?
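The first mitigation listed, explicitly tracking memory references per command buffer, might look like this in outline: while recording, each referenced allocation is noted once, and before submit the total is checked against a budget. The budget check is an assumption about how such a tracker would be used; all names are invented:

```cpp
// Sketch of per-command-buffer memory reference tracking, used to detect
// (and avoid) submits that would overcommit video memory.
#include <cstddef>
#include <set>
#include <unordered_map>

class MemoryReferenceTracker {
public:
    void registerAllocation(int id, size_t bytes) { sizes_[id] = bytes; }

    // Called while recording; duplicate references to the same allocation
    // are collapsed by the set, so each counts once.
    void reference(int id) { referenced_.insert(id); }

    size_t referencedBytes() const {
        size_t total = 0;
        for (int id : referenced_) total += sizes_.at(id);
        return total;
    }

    // True if everything this command buffer touches fits the budget;
    // a failing check would trigger eviction, demotion, or a settings cap.
    bool fitsBudget(size_t budgetBytes) const {
        return referencedBytes() <= budgetBytes;
    }
private:
    std::unordered_map<int, size_t> sizes_;
    std::set<int> referenced_;
};
```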
    14. EXTENSIONS
        • Extensions are a great fit for low-level APIs
          - Low-level extensions that expose new hardware functionality
          - Examples: PixelSync and EXT_shader_pixel_local_storage
        • No need for a huge amount of extensions as in OpenGL, which is mostly a high-level API
          - Mantle has 5 extensions, for AMD-specific hardware functionality & Windows-specific integration
        • Potential challenge for DX, which has (officially) not had extensions before
          - Would like to see official DX extensions, including shader code extensions!
          - GL & Mantle have a strong advantage here
          - The other alternative would be rapid iterations on the DX API (small updates quarterly?)
    15. THANKS
        • Discuss!