RENDERING BATTLEFIELD 4 WITH MANTLE
Johan Andersson – Electronic Arts
In this presentation from the 2014 Game Developers Conference in San Francisco (March 17-21), Johan Andersson shows how the Frostbite 3 game engine uses the low-level graphics API Mantle to deliver significantly improved performance in Battlefield 4 on PC and in future games from Electronic Arts. Also view this and other presentations on our developer website at http://developer.amd.com/resources/documentation-articles/conference-presentations/

1. RENDERING BATTLEFIELD 4 WITH MANTLE
Johan Andersson – Electronic Arts
3. DX11 VS MANTLE
Core i7-3970X, AMD Radeon R9 290X, 1080p ULTRA
– DX11: Avg 78 fps, Min 42 fps
– Mantle: Avg 120 fps, Min 94 fps (+58%!)
4. BF4 MANTLE GOALS
Goals:
– Significantly improve CPU performance
– More consistent & stable performance
– Improve GPU performance where possible
– Add support for a new Mantle rendering backend in a live game
 Minimize changes to engine interfaces
 Compatible with built PC content
– Work on a wide set of hardware
 APU to quad-GPU
 But x64 only (32-bit Windows needs to die)
Non-goals:
– Design a new renderer from scratch for Mantle
– Take advantage of asymmetric MGPU (APU + discrete)
– Optimize video memory consumption
5. BF4 MANTLE STRATEGIC GOALS
 Prove that low-level graphics APIs work outside of consoles
 Push the industry towards low-level graphics APIs everywhere
 Build a foundation for the future that we can build great games on
6. SHADERS
7. SHADERS
 Shader resource bind points replaced with a resource table object – the descriptor set
– This is how the hardware accesses the shader resources
– Flat list of images, buffers and samplers used by any of the shader stages
– Vertex shader streams converted to vertex shader buffer loads
 Engine assigns each shader resource to a specific slot in the descriptor set(s)
– Can share slots between shader stages = smaller descriptor sets
– The mapping takes a while to wrap one's head around
8. SHADER CONVERSION
 DX11 bytecode shaders get converted to AMDIL & the mapping applied using the ILC tool
– Done at load time
– Don't have to change our shaders!
 Have full source & control over the process
 Could write AMDIL directly or use other frontends if wanted
9. DESCRIPTOR SETS
 Very simple usage in BF4: for each draw call, write a flat list of resources
– Essentially a direct replacement of SetTexture/SetConstantBuffer/SetInputStream
 Single dynamic descriptor set object per frame
 Sub-allocate for each draw call and write the list of resources
 ~15000 resource slots written per frame in BF4, still very fast
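A minimal sketch of this pattern: one big per-frame set, with each draw call bump-allocating a contiguous slot range and writing its flat resource list. ResourceHandle, FrameDescriptorSet and writeForDraw are illustrative names, not Frostbite's or Mantle's API.

#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

struct ResourceHandle { uint64_t bits; };        // image/buffer/sampler view

class FrameDescriptorSet {
public:
    explicit FrameDescriptorSet(uint32_t capacity)
        : m_slots(capacity), m_next(0) {}

    // Called once per frame before command buffer generation starts.
    void reset() { m_next.store(0, std::memory_order_relaxed); }

    // Sub-allocate a contiguous slot range for one draw call and write the
    // flat resource list; returns the base slot to bind with the draw.
    uint32_t writeForDraw(const ResourceHandle* resources, uint32_t count) {
        uint32_t base = m_next.fetch_add(count, std::memory_order_relaxed);
        assert(base + count <= m_slots.size() && "frame descriptor set full");
        for (uint32_t i = 0; i < count; ++i)
            m_slots[base + i] = resources[i];    // GPU-visible memory in practice
        return base;
    }

private:
    std::vector<ResourceHandle> m_slots;         // stands in for GPU-visible slots
    std::atomic<uint32_t> m_next;                // bump-allocator cursor
};

The atomic cursor is what makes "~15000 slots per frame" cheap: no locking, just one fetch_add per draw.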
10. DESCRIPTOR SETS
11. DESCRIPTOR SETS – FUTURE OPTIMIZATIONS
 Use static descriptor sets when possible
 Reduce resource duplication by reusing & sharing more across shader stages
 Nested descriptor sets
12. COMPUTE PIPELINES
 1:1 mapping between pipeline & shader
 No state built into the pipeline
 Can execute in parallel with rendering
 ~100 compute pipelines in BF4
13. GRAPHICS PIPELINES
 All graphics shader stages combined into a single pipeline object together with important graphics state
 ~10000 graphics pipelines in BF4 on a single level, ~25 MB of video memory
 Could use a smaller working pool of active state objects to keep a reasonable amount in memory
– Has not been required for us
14. PRE-BUILDING PIPELINES
 Graphics pipeline creation is an expensive operation; do it at load time instead of runtime!
– Creating one of our graphics pipelines takes ~10-60 ms each
– Pre-build using N parallel low-priority jobs
– Avoids 99.9% of runtime stalls caused by pipeline creation!
 Requires knowing the graphics pipeline state that will be used with the shaders
– Primitive type
– Render target formats
– Render target write masks
– Blend modes
 Not fully trivial to know all state; may require engine changes / pre-defining use cases
– Important to design for!
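A sketch of the "N parallel low-priority jobs" idea, assuming the engine already knows every PipelineDesc up front. PipelineDesc, Pipeline and buildPipeline are placeholders for the engine's own description and (expensive) creation call, not the Mantle API.

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct PipelineDesc { /* shaders, primitive type, RT formats, write masks, blend modes */ };
struct Pipeline { /* built pipeline object */ };

// Placeholder for the real creation call, which takes ~10-60 ms per pipeline.
Pipeline buildPipeline(const PipelineDesc&) { return {}; }

void prebuildPipelines(const std::vector<PipelineDesc>& descs,
                       std::vector<Pipeline>& out, unsigned jobCount)
{
    out.resize(descs.size());
    std::atomic<size_t> cursor{0};
    std::vector<std::thread> jobs;
    for (unsigned j = 0; j < jobCount; ++j) {
        jobs.emplace_back([&] {
            // Each job repeatedly grabs the next unbuilt pipeline; a real
            // engine would run these as low-priority jobs in its job system.
            for (size_t i; (i = cursor.fetch_add(1)) < descs.size(); )
                out[i] = buildPipeline(descs[i]);
        });
    }
    for (auto& t : jobs) t.join();
}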
15. PIPELINE CACHE
 Cache built pipelines both in a memory cache and a disk cache
– Improved loading times
– Max 300 MB
– Simple LRU policy
– LZ4 compressed (free)
 Database signature:
– Driver version
– Vendor ID
– Device ID
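A tiny sketch of the database signature check described above: the cache is only trusted if it was produced by the same driver version on the same device, since built pipelines are not portable. The struct and function names are ours, not the actual cache format.

#include <cstdint>
#include <cstring>

struct PipelineCacheSignature {
    uint32_t driverVersion;
    uint32_t vendorId;
    uint32_t deviceId;
};

// A cache file whose signature does not match the running device/driver is
// discarded and the pipelines are rebuilt (and re-cached) at load time.
bool signatureMatches(const PipelineCacheSignature& onDisk,
                      const PipelineCacheSignature& current)
{
    return std::memcmp(&onDisk, &current, sizeof onDisk) == 0;
}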
16. MEMORY
17. MEMORY MANAGEMENT
 Mantle devices expose multiple memory heaps with different characteristics
– Can be different between devices, drivers and OSes
 User explicitly places resources in the wanted heaps
– Driver suggests preferred heaps when creating objects; not a requirement

Type    Size      Page   CPU access                                              GPU Read  GPU Write  CPU Read  CPU Write
Local   256 MB    65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined  130       170        0.0058    2.8
Local   4096 MB   65535  –                                                       130       180        0         0
Remote  16106 MB  65535  CpuVisible|CpuGpuCoherent|CpuUncached|CpuWriteCombined  2.6       2.6        0.1       3.3
Remote  16106 MB  65535  CpuVisible|CpuGpuCoherent                               2.6       2.6        3.2       2.9
18. FROSTBITE MEMORY HEAPS
 System Shared Mapped
– CPU memory that is GPU visible
– Write combined & persistently mapped = easy & fast to write to in parallel at any time
 System Shared Pinned
– CPU cached, for readback
– Not used much
 Video Shared
– GPU memory accessible by the CPU; used for descriptor sets and dynamic buffers
– Max 256 MB (legacy constraint)
– Avoid keeping it persistently mapped, as WDDM doesn't like this and can decide to move it back to CPU memory 
 Video Private
– GPU private memory
– Used for render targets, textures and other resources the CPU does not need to access
19. MEMORY REFERENCES
 WDDM needs to know which memory allocations are referenced by each command buffer
– In order to make sure they are resident and not paged out
– Max ~1700 memory references are supported
– Overhead with having lots of references
 Engine needs to keep track of what memory is referenced while building the command buffers
– Easy & fast to do
– Each reference is either read-only or read/write
– We use a simple global list of references shared by all command buffers
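A minimal sketch of such a global reference list, assuming one shared list for all command buffers as the slide describes. MemoryHandle and RefAccess are placeholders, not the actual Mantle types; a read/write reference upgrades an earlier read-only one.

#include <cstdint>
#include <mutex>
#include <unordered_map>

using MemoryHandle = uint64_t;
enum class RefAccess : uint8_t { ReadOnly, ReadWrite };

class MemoryReferenceList {
public:
    // Track an allocation touched while building a command buffer.
    void add(MemoryHandle mem, RefAccess access) {
        std::lock_guard<std::mutex> lock(m_mutex);
        auto it = m_refs.find(mem);
        if (it == m_refs.end())
            m_refs.emplace(mem, access);
        else if (access == RefAccess::ReadWrite)
            it->second = RefAccess::ReadWrite;   // read-only upgrades to read/write
    }

    // Handed to queue submission; the OS keeps these allocations resident.
    const std::unordered_map<MemoryHandle, RefAccess>& refs() const { return m_refs; }

private:
    std::mutex m_mutex;
    std::unordered_map<MemoryHandle, RefAccess> m_refs;  // stay well under ~1700
};

Marking references read-only where possible matters later: it is one of the recommendations for avoiding overcommit stalls (slide 21).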
20. MEMORY POOLING
 Pooling memory allocations was required for us
– Sub-allocate within larger 1-32 MB chunks
– All resources store memory handle + offset
– Not as elegant as just a void* on consoles
– Fragmentation can be a concern; not too much of an issue for us in practice
 GPU virtual memory mapping is fully supported, can simplify & optimize management
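A minimal sketch of chunked sub-allocation, assuming a single heap, power-of-two alignments, and a simple bump cursor per chunk (no freeing or defragmentation). Every allocation comes back as a (chunk, offset) pair rather than a raw pointer, as the slide notes.

#include <cstdint>
#include <vector>

struct ChunkId { uint32_t index; };
struct SubAllocation { ChunkId chunk; uint64_t offset; uint64_t size; };

class MemoryPool {
public:
    explicit MemoryPool(uint64_t chunkSize = 32ull << 20)   // 1-32 MB chunks
        : m_chunkSize(chunkSize) {}

    SubAllocation allocate(uint64_t size, uint64_t alignment) {
        // First-fit over existing chunks.
        for (uint32_t i = 0; i < m_used.size(); ++i) {
            uint64_t offset = alignUp(m_used[i], alignment);
            if (offset + size <= m_chunkSize) {
                m_used[i] = offset + size;
                return { {i}, offset, size };
            }
        }
        // No chunk fits: allocate a fresh chunk from the driver and place
        // the allocation at its start.
        m_used.push_back(size);
        return { {uint32_t(m_used.size() - 1)}, 0, size };
    }

private:
    static uint64_t alignUp(uint64_t v, uint64_t a) { return (v + a - 1) & ~(a - 1); }
    uint64_t m_chunkSize;
    std::vector<uint64_t> m_used;  // bump cursor per chunk
};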
21. OVERCOMMITTING VIDEO MEMORY
 Avoid overcommitting video memory!
– Will lead to severe stalls as VidMM moves blocks and moves memory back and forth
– VidMM is a black box 
– One of the biggest issues we ran into during development
 Recommendations:
– Balance memory pools
– Make sure to use read-only memory references
– Use memory priorities
22. MEMORY PRIORITIES
 Setting priorities on the memory allocations helps VidMM choose what to page out when it has to
 5 priority levels:
– Very high = render targets with MSAA
– High = render targets and UAVs
– Normal = textures
– Low = shader & constant buffers
– Very low = vertex & index buffers
23. MEMORY RESIDENCY FUTURE
 For best results, manage which resources are in video memory yourself & keep only ~80% used
– Avoids all stalls
– Can async DMA in and out
 We are thinking of redesigning to fully avoid the possibility of overcommitting
 Hoping WDDM's memory residency management can be simplified & improved in the future
24. RESOURCE MANAGEMENT
25. RESOURCE LIFETIMES
 App manages the lifetime of all resources
– Have to make sure the GPU is not using an object or memory while we are freeing it on the CPU
– How we've always worked with GPUs on the consoles
– Multi-GPU adds some additional complexity that consoles do not have
 We keep track of lifetimes at a per-frame granularity
– Queues for object destruction & free-memory operations
– Add to queue at any time on the CPU
– Process queues when the GPU command buffers for the frame are done executing
– Tracked with command buffer fences
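A minimal sketch of the per-frame destruction queue, assuming a monotonically increasing fence value signaled after each frame's command buffers complete. FrameGarbage and LifetimeTracker are illustrative names; thread synchronization around deferDestroy is omitted for brevity.

#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

struct FrameGarbage {
    uint64_t fenceValue = 0;                     // GPU progress marker for the frame
    std::vector<std::function<void()>> destroys; // object frees & memory releases
};

class LifetimeTracker {
public:
    // Queue a destroy at any time during the frame (locking omitted here).
    void deferDestroy(std::function<void()> fn) {
        m_current.destroys.push_back(std::move(fn));
    }

    // At frame submit: remember which fence value covers this frame's work.
    void endFrame(uint64_t fenceValue) {
        m_current.fenceValue = fenceValue;
        m_pending.push_back(std::move(m_current));
        m_current = {};
    }

    // Each frame: run destroys only for frames the GPU has fully completed.
    void collect(uint64_t completedFence) {
        while (!m_pending.empty() && m_pending.front().fenceValue <= completedFence) {
            for (auto& fn : m_pending.front().destroys) fn();
            m_pending.pop_front();
        }
    }

private:
    FrameGarbage m_current;
    std::deque<FrameGarbage> m_pending;
};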
26. LINEAR FRAME ALLOCATOR
 We use multiple linear allocators with Mantle for both transient buffers & images
– Used for huge amounts of small constant data and other GPU frame data that the CPU writes
– Easy to use and very low overhead
– Don't have to care about lifetimes or state
 Fixed memory buffers for each frame
– Super cheap sub-allocation from any thread
– If full, use heap allocation (also fast due to pooling)
 Alternative: ring buffers
– Requires being able to stall & drain the pipeline at any allocation if full; additional complexity for us
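A minimal sketch of one such linear allocator, assuming a fixed persistently-mapped buffer per frame and a lock-free bump cursor so any thread can sub-allocate; the heap fallback on overflow is left to the caller. Names are illustrative.

#include <atomic>
#include <cstdint>

struct FrameAlloc { void* cpuPtr; uint64_t gpuOffset; };

class LinearFrameAllocator {
public:
    LinearFrameAllocator(void* mappedBase, uint64_t size)
        : m_base(static_cast<uint8_t*>(mappedBase)), m_size(size), m_cursor(0) {}

    // Called once per frame; transient data needs no individual lifetimes.
    void reset() { m_cursor.store(0, std::memory_order_relaxed); }

    // Super cheap sub-allocation from any thread. Returns a null pointer when
    // full, in which case the caller falls back to the pooled heap allocator.
    FrameAlloc allocate(uint64_t size, uint64_t alignment) {
        uint64_t padded = size + alignment - 1;  // conservative alignment pad
        uint64_t start = m_cursor.fetch_add(padded, std::memory_order_relaxed);
        uint64_t offset = (start + alignment - 1) & ~(alignment - 1);
        if (offset + size > m_size) return { nullptr, 0 };
        return { m_base + offset, offset };
    }

private:
    uint8_t* m_base;            // write-combined, persistently mapped memory
    uint64_t m_size;
    std::atomic<uint64_t> m_cursor;
};

Compared with a ring buffer, the fixed per-frame buffer never has to stall and drain the pipeline mid-frame; worst case it spills to the pooled heap.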
27. TILING
 Textures should be tiled for performance
– Explicitly handled in Mantle; user selects linear or tiled
– Some formats (BC) can't be accessed as linear by the GPU
 On consoles we handle tiling offline as part of our data processing pipeline
– We know the exact tiling formats and have separate resources per platform
 For Mantle:
– Tiling formats are opaque; can be different between GPU architectures and image types
– Tile textures with a DMA image upload from SystemShared to VideoPrivate
 Linear source, tiled destination
 Free
28. COMMAND BUFFERS
29. COMMAND BUFFERS
 Command buffers are the atomic unit of work dispatched to the GPU
– Separate creation from execution
– No "immediate context" a la DX11 that can execute work at any call
– Makes resource synchronization and setup significantly easier & faster
 Typical BF4 scenes have around ~50 command buffers per frame
– Reasonable tradeoff for us between submission overhead and CPU load-balancing
30. COMMAND BUFFER SOURCES
 Frostbite has 2 separate sources of command buffers
– World rendering
 Rendering the world with tons of objects, lots of draw calls; have all frame data up front
 All resources except for render targets are read-only
 Generated in parallel up front each frame
– Immediate rendering ("the rest")
 Setting up rendering and doing lighting, post-fx, virtual texturing, compute, etc.
 Managing resource state, memory and running on different queues (graphics, compute, DMA)
 Sequentially generated in a single job; simulates an immediate context by splitting the command buffer
 Both are very important and have different requirements
31. RESOURCE TRANSITIONS
 Key design in Mantle to significantly lower driver overhead & complexity
– Explicit hazard tracking by the app/engine
– Drives architecture-specific caches & compression
 AMD: FMASK, CMASK, HTILE
– Enables explicit memory management
 Examples:
– Optimal render target writes → graphics shader read-only
– Compute shader write-only → DrawIndirect arguments
 Mantle has a strong validation layer that tracks transitions, which is a major help
32. MANAGING RESOURCE TRANSITIONS
 Engines need a clear design for how to handle state transitions
 Multiple approaches possible:
– Sequential in-order command buffers
 Generate one command buffer at a time, in order
 Transition resources on demand when doing an operation on them; very simple
 Recommendation: start with this
– Out-of-order multiple command buffers
 Track state per command buffer, fix up transitions when the order of command buffers is known
– Hybrid approaches & more
33. MANAGING RESOURCE TRANSITIONS IN FROSTBITE
 Current approach in Frostbite is quite basic:
– We keep track of a single state for each resource (not subresource)
– The "immediate rendering" path transitions resources as needed depending on the operation
– The out-of-order "world rendering" command buffers don't need to transition states
 They already have write access to MRTs and read access to all resources set up outside them
 Avoids the problem of them not knowing the state during generation
 Works now, but as we do more general parallel rendering it will have to change
– Track resource state for each command buffer & fix up between command buffers
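A minimal sketch of single-state-per-resource tracking with on-demand transitions, as used by the immediate-rendering path. The state enum and Transition record are illustrative, not the Mantle API; per-subresource tracking is deliberately absent, matching the slide.

#include <cstdint>
#include <unordered_map>

enum class ResourceState : uint8_t {
    Undefined, RenderTargetWrite, ShaderReadOnly, UavWrite, IndirectArgs
};

struct Transition { uint64_t resource; ResourceState from, to; };

class StateTracker {
public:
    // Returns true and fills out the transition to record into the command
    // buffer if the resource is not already in the wanted state.
    bool require(uint64_t resource, ResourceState wanted, Transition& out) {
        ResourceState& current = m_states[resource];  // Undefined on first use
        if (current == wanted) return false;
        out = { resource, current, wanted };
        current = wanted;
        return true;
    }

private:
    std::unordered_map<uint64_t, ResourceState> m_states;  // one state per resource
};

Moving to general parallel rendering would turn this into per-command-buffer tracking plus a fix-up pass once submission order is known, as the slide anticipates.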
34. DYNAMIC STATE OBJECTS
 Graphics state is only set with the pipeline object and 5 dynamic state objects
– State objects: color blend, raster, viewport, depth-stencil, MSAA
– No other parameters such as DX11's stencil ref or SetViewport functions
 Frostbite use case:
– Pre-create when possible
– Otherwise on-demand creation (hash map)
– Only ~100 state objects!
 Still possible to end up with lots of state objects
– Esp. with state object float & integer values (depth bounds, depth bias, viewport)
– But no need to store all permutations in memory; objects are fast to create & the app manages lifetimes
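A minimal sketch of the hash-map on-demand creation, assuming a small POD description hashed field by field (so float values like depth bias participate in the key). RasterStateDesc, createRasterState and StateObjectCache are hypothetical names.

#include <cstdint>
#include <cstring>
#include <unordered_map>

struct RasterStateDesc {
    float depthBias = 0.0f;
    bool  wireframe = false;
    // ... cull mode, fill mode, etc.

    uint64_t hash() const {                      // FNV-1a over the fields
        uint64_t h = 1469598103934665603ull;
        auto mix = [&](uint64_t v) { h ^= v; h *= 1099511628211ull; };
        uint32_t bits;
        std::memcpy(&bits, &depthBias, sizeof bits);  // hash float by its bits
        mix(bits);
        mix(wireframe ? 1u : 0u);
        return h;
    }
};

struct RasterState { /* driver object handle */ };
RasterState createRasterState(const RasterStateDesc&) { return {}; }  // placeholder, fast to create

class StateObjectCache {
public:
    const RasterState& get(const RasterStateDesc& desc) {
        auto [it, inserted] = m_cache.try_emplace(desc.hash());
        if (inserted) it->second = createRasterState(desc);  // only on first use
        return it->second;
    }
private:
    std::unordered_map<uint64_t, RasterState> m_cache;  // app manages lifetimes
};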
35. QUEUES
36. QUEUES
 The universal queue can do graphics, compute and presents
 We also use additional queues to parallelize GPU operations:
– DMA queue – improve perf with faster transfers & avoid idling graphics while transferring
– Compute queue – improve perf by utilizing idle ALU and updating resources simultaneously with gfx
 More GPUs = more queues!
37. QUEUES SYNCHRONIZATION
 Order of execution within a queue is sequential
 Synchronize multiple queues with GPU semaphores (signal & wait)
 Also works across multiple GPUs
[Diagram: a compute queue and a graphics queue synchronized with semaphore signal (S) and wait (W) operations]
38. QUEUES SYNCHRONIZATION CONT
 Started out with explicit semaphores
– Error prone to handle when having lots of different semaphores & queues
– Difficult to visualize & debug
 Switched to a representation more similar to a job graph
 Just a model on top of the semaphores
39. GPU JOB GRAPH
 Each GPU job has a list of dependencies (other command buffers)
 Dependencies have to finish before the job can run on its queue
 The dependencies can be from any queue
 Was easier to work with, debug and visualize
 Really extendable going forward
[Diagram: job graph with Graphics 1, Graphics 2, DMA and Compute nodes]
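A minimal sketch of the model: each job carries its dependency list, and a walk over the jobs lowers cross-queue edges to semaphore signal/wait pairs (same-queue edges are free thanks to in-order execution). GpuJob, scheduleJobs and the commented-out submission calls are illustrative, not the actual API.

#include <cstdint>
#include <vector>

enum class Queue : uint8_t { Universal, Compute, Dma };

struct GpuJob {
    Queue queue;
    std::vector<int> dependencies;  // indices of jobs that must finish first
};

// Walk jobs in submission order; a dependency on a job from another queue
// becomes a semaphore signal on the producing queue and a wait on ours.
void scheduleJobs(const std::vector<GpuJob>& jobs)
{
    for (size_t i = 0; i < jobs.size(); ++i) {
        for (int dep : jobs[i].dependencies) {
            if (jobs[dep].queue != jobs[i].queue) {
                // signalSemaphore(jobs[dep].queue, /*sem for (dep -> i)*/ ...);
                // waitSemaphore(jobs[i].queue,    /*sem for (dep -> i)*/ ...);
            }
            // Same-queue dependencies are satisfied by in-order execution.
        }
        // submitCommandBuffer(jobs[i].queue, /*job*/ i);
    }
}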
40. ASYNC DMA
 AMD GPUs have dedicated hardware DMA engines, let's use them!
– Uploading through DMA is faster than on the universal queue, even if blocking
– DMA has alignment restrictions; have to support falling back to copies on the universal queue
 Use case: frame buffer & texture uploads
– Used by resource initial data uploads and our UpdateSubresource
– Guaranteed to be finished before the GPU universal queue starts rendering the frame
 Use case: multi-GPU frame buffer copy
– Peer-to-peer copy of the frame buffer to the GPU that will present it
41. ASYNC COMPUTE
 Frostbite has lots of compute shader passes that could run in parallel with graphics work
– HBAO, blurring, classification, tile-based lighting, etc.
 Running as async compute can improve GPU performance by utilizing "free" ALU
– For example while doing shadowmap rendering (ROP bound)
42. ASYNC COMPUTE – TILE-BASED LIGHTING
 3 sequential compute shaders
– Input: zbuffer & gbuffer
– Output: HDR texture/UAV
 Runs in parallel with the graphics pipeline that renders to other targets
[Diagram: compute queue (TileZ, Cull lights, Lighting) running alongside the graphics queue (Gbuffer, Shadowmaps, Reflection, Distort, Transp) with semaphore signal/wait]
43. ASYNC COMPUTE – TILE-BASED LIGHTING
 We manually prepare the resources for the async compute
– Important not to access the resources on other queues at the same time (unless in a read-only state)
– Have to transition resources on the queue that last used them
 Up to 80% faster in our initial tests, but not fully reliable
– But it is a pretty small part of the frame time
– Not in BF4 yet
44. MULTI-GPU
45. MULTI-GPU
 Multi-GPU alternatives:
– AFR – Alternate Frame Rendering (1-4 GPUs of the same power)
– Heterogeneous AFR – 1 small + 1 big GPU (APU + discrete)
– SFR – Split Frame Rendering
– Multi-GPU job graph – primary strong GPU + slave GPUs helping
 Frostbite supports AFR natively
– No synchronization points within the frame
– For resources that are not rendered every frame: re-render resources for each GPU
 Example: sky envmap update on weather change
 With Mantle, multi-GPU is explicit and we have to build support for it ourselves
46. MULTI-GPU AFR WITH MANTLE
 All resources explicitly duplicated on each GPU with async DMA
– Hidden internally in our rendering abstraction
 Every frame we alternate which GPU we build command buffers for and use resources from
 Our UpdateSubresource has to make sure it updates resources on all GPUs
 Presenting the screen in some modes has to copy the frame buffer to the GPU that owns the display
 Bonus:
– Can simulate multi-GPU mode even with a single GPU!
– Multi-GPU works in windowed mode!
47. MULTI-GPU ISSUES
 GPUs are independently rendering & presenting to the screen – can cause micro-stuttering
– Frames are not presented at regular intervals
– Frame rate can be high, but presentation & gameplay are not smooth
– FCAT is a good tool to analyse this
[Diagram: GPU0/GPU1 frame timelines with an irregular presentation interval]
48. MULTI-GPU ISSUES CONT
 We need to introduce dependency & dampening between the GPUs to alleviate this – frame pacing
[Diagram: the same timelines with the ideal presentation interval]
49. FRAME PACING
 Measure the average frame rate on each GPU
– Short history (10-30 frames)
– Filter out spikes
 Insert a delay on the GPU before each present
– Forces the frame times to become more regular and the GPUs to align
– Delay value is based on the calculated avg frame rate
[Diagram: GPU0/GPU1 timelines with a delay (D) inserted before each present (P)]
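A minimal sketch of the pacing math, assuming a 30-frame history and a simple clamp at 100 ms as the spike filter (the slide does not specify the filter, so that threshold is our assumption). FramePacer is an illustrative name.

#include <algorithm>
#include <deque>

class FramePacer {
public:
    void onFramePresented(double frameMs) {
        m_history.push_back(frameMs);
        if (m_history.size() > 30) m_history.pop_front();  // 10-30 frame window
    }

    // Target present interval: average recent frame time, spikes clamped.
    double targetIntervalMs() const {
        if (m_history.empty()) return 0.0;
        double sum = 0.0;
        for (double ms : m_history)
            sum += std::min(ms, 100.0);          // filter out large spikes
        return sum / m_history.size();
    }

    // Delay to insert on the GPU before presenting this frame, so that
    // presents land at roughly regular intervals across both GPUs.
    double delayMs(double elapsedSinceLastPresentMs) const {
        return std::max(0.0, targetIntervalMs() - elapsedSinceLastPresentMs);
    }

private:
    std::deque<double> m_history;
};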
50. CONCLUSION
51. MANTLE DEV RECOMMENDATIONS
 The validation layer is a critical friend!
 You'll end up with a lot of object & memory management code; try to share it with console code
 Make sure you have control over memory usage and can avoid overcommitting video memory
 Build a robust solution for resource state management early
 Figure out how to pre-create your graphics pipelines; can require engine design changes
 Build for multi-GPU support from the start, it is easier than retrofitting
52. FUTURE
 Second wave of Frostbite Mantle titles
 Adapt the Frostbite core rendering layer based on learnings from Mantle
– Refine binding & buffer updates to further reduce overhead
– Virtual memory management
– More async compute & async DMAs
– Multi-GPU job graph R&D
 Linux
– Would like to see how our Mantle renderer behaves with a different memory management & driver model
53. QUESTIONS?
Email: johan@frostbite.com
Web: http://frostbite.com
Twitter: @repi