MANTLE FOR DEVELOPERS
JOHAN ANDERSSON – TECHNICAL DIRECTOR
FROSTBITE
ELECTRONIC ARTS
Mantle?
 Simplify advanced development
 Improve performance
 Enable developers to innovate
 Challenge the status quo
Developer impact areas
Control

CPU performance
Programmability

GPU performance
Platforms
Control

New model

Traditional Model: Black Box
 Middle-ground abstraction – compromise between performance & “usability”
 Hidden resource memory & state
 Resource CPU access tied to device context
 Driver analyzes & synchronizes implicitly

Explicit Model: Mantle
 Thin low-level abstraction to expose how hardware works
 App explicit memory management
 Resources are globally accessible
 App explicit resource state transitions
Control

App responsibility
 Tell when render target will be used as a texture
‒ And many more resource state transitions

 Don’t destroy resources that GPU is using
‒ Keep track with fences or frames

 Manual dynamic resource renaming
‒ No DISCARD for driver resource renaming

 Resource memory tiling
 Powerful validation layer will help!
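
One way an app might honor "don't destroy resources that GPU is using" is a deferred-deletion queue keyed by frame number. The sketch below is not Mantle API: `DeferredDeleter`, `retire`, and `collect` are hypothetical names, and the completed-frame number is assumed to come from whatever fence mechanism the app polls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical helper (not part of Mantle): delays resource destruction
// until the GPU has finished the frame that last used the resource.
class DeferredDeleter {
public:
    // Queue a destroy callback for a resource last used in `frame`.
    // Frames must be retired in non-decreasing order.
    void retire(std::uint64_t frame, std::function<void()> destroy) {
        assert(pending_.empty() || pending_.back().frame <= frame);
        pending_.push_back({frame, std::move(destroy)});
    }

    // Run every callback whose frame the GPU has completed.
    // `completedFrame` is assumed to come from a fence the app polls.
    // Returns the number of resources destroyed.
    std::size_t collect(std::uint64_t completedFrame) {
        std::size_t destroyed = 0;
        while (!pending_.empty() && pending_.front().frame <= completedFrame) {
            pending_.front().destroy();
            pending_.pop_front();
            ++destroyed;
        }
        return destroyed;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    struct Entry {
        std::uint64_t frame;
        std::function<void()> destroy;
    };
    std::deque<Entry> pending_;
};
```

An app would call `retire` instead of destroying a resource directly, and call `collect` once per frame after reading back the latest completed fence value.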
Control

Explicit control enables
 App high-level decisions & optimizations
‒ Has full scene information
‒ Easier to optimize performance & memory

 Flexible & efficient memory management
‒ Linear frame allocators
‒ Memory pools
‒ Pinned memory

 Reduced development time
‒ For advanced game engines & apps
‒ Easier to get to target performance & robustness
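
The "linear frame allocators" bullet can be illustrated with a minimal bump allocator. This is a sketch under assumptions, not Mantle API: `LinearFrameAllocator` is a hypothetical name, and it hands out offsets into a memory range the app is assumed to have allocated up front.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch (not Mantle API): a linear "bump" allocator over one
// range of memory that is released in a single step at the end of a frame.
class LinearFrameAllocator {
public:
    static constexpr std::uint64_t kInvalid = ~std::uint64_t{0};

    explicit LinearFrameAllocator(std::uint64_t capacity) : capacity_(capacity) {}

    // Returns the offset of a block of `size` bytes aligned to `alignment`
    // (a power of two), or kInvalid when the frame's budget is exhausted.
    std::uint64_t allocate(std::uint64_t size, std::uint64_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t aligned = (head_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || size > capacity_ - aligned) return kInvalid;
        head_ = aligned + size;
        return aligned;
    }

    // Releases every allocation at once; call only when the GPU is done
    // with the frame that used this memory.
    void reset() { head_ = 0; }

    std::uint64_t used() const { return head_; }

private:
    std::uint64_t capacity_;
    std::uint64_t head_ = 0;
};
```

Allocation is a couple of integer operations and there is no per-allocation free, which is why this pattern suits per-frame dynamic data.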
Control

Explicit control enables
 Transient resources
‒ Alias render targets within frame
‒ Major memory savings
‒ No need to pre-allocate everything

 Light-weight driver
‒ Easier to develop & maintain
‒ Reduced CPU draw call overhead
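
Aliasing render targets within a frame amounts to letting targets with non-overlapping lifetimes share memory. The sketch below shows one simple greedy way to do that; `Lifetime` and `assignAliasSlots` are hypothetical names, and it assumes all targets are the same size, which a real engine could not assume.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a transient render target: the first and last
// render pass (inclusive) in which it is alive within a frame.
struct Lifetime {
    int firstPass;
    int lastPass;
};

// Greedy sketch: give each target the lowest-numbered memory slot whose
// previous occupant is no longer alive. Targets must be sorted by firstPass.
// Returns a slot index per target; targets sharing a slot alias memory.
std::vector<int> assignAliasSlots(const std::vector<Lifetime>& targets) {
    std::vector<int> slotOf(targets.size());
    std::vector<int> slotBusyUntil;  // last pass that uses each slot
    for (std::size_t i = 0; i < targets.size(); ++i) {
        assert(i == 0 || targets[i - 1].firstPass <= targets[i].firstPass);
        std::size_t chosen = slotBusyUntil.size();  // default: open a new slot
        for (std::size_t s = 0; s < slotBusyUntil.size(); ++s) {
            if (slotBusyUntil[s] < targets[i].firstPass) {
                chosen = s;
                break;
            }
        }
        if (chosen == slotBusyUntil.size()) {
            slotBusyUntil.push_back(targets[i].lastPass);
        } else {
            slotBusyUntil[chosen] = targets[i].lastPass;
        }
        slotOf[i] = static_cast<int>(chosen);
    }
    return slotOf;
}
```

With four targets whose lifetimes overlap only pairwise, two slots are enough, so half of the memory is saved in this toy case.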
CPU performance
CPU perf

Core concepts
 Descriptor sets
 Monolithic pipelines
 Command buffers
CPU perf

Descriptor sets
 Table with resource references to bind to graphics or compute pipeline
[Diagram legend: Image, Memory, Sampler, Link]

 Replaces traditional resource stage binding
‒ Major performance & flexibility advantage
‒ Closer to how the hardware works

 Example 1: Single simple dynamic descriptor set
‒ Bind everything you need for a single draw call
‒ Close to DX/GL model but share between stages

[Diagram: Dynamic descriptor set]
VertexBuffer (VS)
Texture0 (VS+PS)
Constants (VS)
Texture1 (PS)
Texture2 (PS)
Sampler0 (VS+PS)

 App managed - lots of strategies possible!
‒ Tiny vs huge sets
‒ Single vs multiple
‒ Static vs semi-static vs dynamic
CPU perf

Descriptor sets
 Example 2: Reuse static set with nesting
‒ Reduce update time & memory usage

[Diagram: Dynamic descriptor set]
Constants (VS)
Link

[Diagram: Static descriptor set]
VertexBuffer (VS)
Texture0 (VS+PS)
Texture1 (PS)
Texture2 (PS)
Texture3 (PS)
Texture4 (PS)
Sampler0 (VS+PS)
Sampler1 (PS)
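
The nesting idea can be modeled with a small data structure: a descriptor set is a table of slots, and a Link slot points at another set. This is an illustrative model only, not the Mantle API; `SlotKind`, `Slot`, `DescriptorSet`, and `resolve` are hypothetical names, and strings stand in for real resource handles.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of a descriptor set (not the Mantle API): a table whose
// slots reference an image, memory, or sampler, or link to a nested set.
enum class SlotKind { Image, Memory, Sampler, Link };

struct DescriptorSet;

struct Slot {
    SlotKind kind;
    std::string resource;                 // name stands in for a real handle
    const DescriptorSet* link = nullptr;  // only used when kind == Link
};

struct DescriptorSet {
    std::vector<Slot> slots;
};

// Follows a path of slot indices, descending through Link slots, and returns
// the slot reached, or nullptr if the path is empty or invalid.
const Slot* resolve(const DescriptorSet& set, const std::vector<std::size_t>& path) {
    const DescriptorSet* current = &set;
    const Slot* slot = nullptr;
    for (std::size_t i = 0; i < path.size(); ++i) {
        if (current == nullptr || path[i] >= current->slots.size()) return nullptr;
        slot = &current->slots[path[i]];
        current = (slot->kind == SlotKind::Link) ? slot->link : nullptr;
    }
    return slot;
}
```

In this model a small dynamic set holding per-draw constants plus one Link can be rewritten every draw, while the larger static set it points to is built once and shared.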
CPU perf

Monolithic pipelines
 Shader stages & select graphics state combined into single object
‒ No runtime compilation or patching needed!
‒ Significantly less runtime overhead to use

 Supports parallel building & caching
‒ Fast loading times

 Usage & management up to the app
‒ Static vs dynamic creation
‒ Number of pipelines
‒ State usage

[Diagram: Pipeline state with stages IA, VS, HS, Tessellator, DS, GS, RS, PS, DB, CB]
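
Since pipeline usage and management are up to the app, one common strategy is an app-side cache keyed by the pipeline description. The sketch below is an assumption about how such a cache might look, not Mantle API: `PipelineDesc` and its fields are illustrative, and a counter stands in for the real driver object.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical description of a monolithic pipeline: shader identifiers plus
// the subset of graphics state baked into the object (fields are illustrative).
struct PipelineDesc {
    std::uint32_t vertexShader;
    std::uint32_t pixelShader;
    bool depthTest;
    std::uint32_t blendMode;

    bool operator<(const PipelineDesc& o) const {
        return std::tie(vertexShader, pixelShader, depthTest, blendMode) <
               std::tie(o.vertexShader, o.pixelShader, o.depthTest, o.blendMode);
    }
};

// App-managed cache: builds each distinct pipeline once and reuses it.
class PipelineCache {
public:
    // Returns a handle for `desc`, building only on first use. The handle is
    // a counter standing in for a real driver pipeline object.
    std::uint32_t getOrCreate(const PipelineDesc& desc) {
        auto it = cache_.find(desc);
        if (it != cache_.end()) return it->second;
        const std::uint32_t handle = ++builds_;
        cache_.emplace(desc, handle);
        return handle;
    }

    std::uint32_t buildCount() const { return builds_; }

private:
    std::map<PipelineDesc, std::uint32_t> cache_;
    std::uint32_t builds_ = 0;
};
```

Because every combination of shaders and baked state is a separate object, the number of distinct descriptions an engine generates directly determines build time and memory, which is the "number of pipelines" trade-off.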
CPU perf

Command buffers
 Issue pipelined graphics & compute commands into a command buffer
‒ Bind graphics state, descriptor sets, pipeline
‒ Draw calls
‒ Render targets
‒ Clears
‒ Memory transfers
‒ NOT: resource mapping

 Fully independent objects
‒ Create multiple every frame
‒ Or pre-build up front and reuse
CPU perf

DX/GL parallelism
[Diagram: timeline across CPU 0, CPU 1, and CPU 2 showing Game, Render, and Driver Render work]
 Automatically extracts parallelism out of most apps ☺
 Doesn’t scale beyond 2-3 cores ☹
 Additional latency ☹
 Driver thread often bottleneck – can collide with app threads ☹
CPU perf

Parallel dispatch with Mantle
[Diagram: timeline with Game work on CPU 0 and Render work on each of CPU 1 through CPU 4]
 App can go fully wide with its rendering – minimal latency ☺
 Close to linear scaling with CPU cores ☺
 No driver threads – no overhead – no contention ☺
 Frostbite’s approach on all consoles – and on PC with Mantle! ☺
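
Because command buffers are fully independent objects, each worker thread can record into its own buffer without locks, and the app submits them in order afterwards. The sketch below illustrates that pattern with standard C++ threads; `CommandBuffer` and `recordInParallel` are hypothetical stand-ins, and a list of draw IDs replaces real GPU commands.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a command buffer: an ordered list of draw IDs.
using CommandBuffer = std::vector<int>;

// Splits `drawCount` draws across `workerCount` threads. Each thread records
// into its own command buffer, so no locking is needed while recording.
// The buffers are returned in submission order.
std::vector<CommandBuffer> recordInParallel(int drawCount, int workerCount) {
    assert(workerCount > 0 && drawCount >= 0);
    std::vector<CommandBuffer> buffers(static_cast<std::size_t>(workerCount));
    std::vector<std::thread> workers;
    const int perWorker = (drawCount + workerCount - 1) / workerCount;
    for (int w = 0; w < workerCount; ++w) {
        workers.emplace_back([&buffers, w, perWorker, drawCount] {
            const int begin = w * perWorker;
            const int end = (begin + perWorker < drawCount) ? begin + perWorker : drawCount;
            for (int d = begin; d < end; ++d) {
                buffers[static_cast<std::size_t>(w)].push_back(d);  // "record" a draw
            }
        });
    }
    for (std::thread& t : workers) t.join();
    return buffers;
}
```

Concatenating the returned buffers in index order reproduces the original draw order, which mirrors submitting the recorded command buffers to a queue one after another.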
GPU performance
GPU perf

GPU optimizations
 Thanks to improved CPU performance – CPU will rarely be a bottleneck for the GPU
‒ CPU could help GPU more:
‒ Less brute force rendering
‒ Improve culling

 Resource states
‒ Gives driver a lot more knowledge & flexibility
‒ Apps can avoid expensive/redundant transitions, such as surface decompression

 Shader pipeline object – driver optimizations
‒ Can optimize with pipeline state knowledge
‒ Can optimize across all shader stages

 Expose existing GPU functionality
‒ Quad & Rect-lists
‒ HW-specific MSAA & depth data access
‒ Programmable sample patterns
‒ And more..
GPU perf

Queues
 Modern GPUs are heterogeneous machines with multiple engines
‒ Graphics pipeline
‒ Compute pipeline(s)
‒ DMA transfer
‒ Video encode/decode
‒ More…

 Mantle exposes queues for the engines + synchronization primitives

[Diagram: Graphics, Compute, DMA, ... queues feeding the GPU]
GPU perf

Queue use cases
 Async DMA transfers
‒ Copy resources in parallel with graphics or compute

[Diagram: DMA queue: Copy. Graphics queue: Render, Other render, Use copy.]
GPU perf

Queue use cases
 Async compute together with graphics
‒ ALU heavy compute work at the same time as memory/ROP bound work to utilize idle units

[Diagram: Compute and Graphics queues running GBuffer, Non-shadowed lighting, Shadowmap 0, Shadowmap 1, Final lighting]
GPU perf

Queue use cases
 Multiple compute kernels collaborating
‒ Can be faster than über-kernel
‒ Example: Compute geometry backend & compute rasterizer

[Diagram: Compute 0: Compute Geometry. Compute 1: Compute Rasterizer. Graphics: Ordinary Rendering.]
GPU perf

Queue use cases
 Compute as frontend for graphics pipeline
‒ Compute runs asynchronously ahead and prepares & optimizes geometry for graphics pipeline

[Diagram: Compute and Graphics queues with Process0, Process1 and Draw0, Draw1, Draw2 tasks]
GPU perf

Queue use cases
 Game engines will build large GPU job graphs
‒ Move away from single sequential submission
‒ Just as we already have done on CPU
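
A GPU job graph can be pictured as jobs tagged with a queue plus explicit dependencies. The toy scheduler below is only a sketch of that idea, not Mantle API: `GpuJob` and `scheduleSteps` are hypothetical, every job is assumed to take one time step, and each queue is assumed to run one job at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Queue { Graphics, Compute, Dma };

// Hypothetical GPU job: the queue it is submitted to and the indices of the
// jobs it must wait for. Dependencies must refer to earlier jobs in the list.
struct GpuJob {
    Queue queue;
    std::vector<std::size_t> waitsFor;
};

// Returns, per job, the earliest step at which it can start. Jobs that share
// a step on different queues overlap on the GPU.
std::vector<int> scheduleSteps(const std::vector<GpuJob>& jobs) {
    std::vector<int> step(jobs.size(), 0);
    int queueFreeAt[3] = {0, 0, 0};
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        const std::size_t q = static_cast<std::size_t>(jobs[i].queue);
        int earliest = queueFreeAt[q];
        for (std::size_t dep : jobs[i].waitsFor) {
            assert(dep < i);
            earliest = std::max(earliest, step[dep] + 1);  // wait on a semaphore
        }
        step[i] = earliest;
        queueFreeAt[q] = earliest + 1;
    }
    return step;
}
```

For example, if non-shadowed lighting is placed on the compute queue and waits only for the GBuffer pass, it can share a step with a shadow map pass on the graphics queue; that queue assignment is an assumption made here for illustration.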
Programmability
Programmability

Explicit Multi-GPU
 Explicit control of GPU queues and synchronization, finally!
‒ Implement your own Alternate-Frame-Rendering
‒ Or something more exotic..

 Use case: Workstation rendering with 4-8 GPUs
‒ Super high-quality rendering & simulation
‒ Load balance graphics & compute job graphs across GPUs
‒ 20-40 TFlops in a single machine!

 Use case: Low-latency rendering
‒ Important for VR and competitive games
‒ Latency optimized GPU job graph scheduling
‒ VR: Simultaneously drive 2 GPUs (1 per eye)
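
With explicit multi-GPU control, an app-implemented Alternate-Frame-Rendering scheme reduces, at its simplest, to a round-robin mapping of frames to GPUs. The function below is a hypothetical sketch of that mapping only; real AFR also has to handle cross-frame resource dependencies and presentation order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Alternate-frame rendering sketch: frame N goes to GPU (N mod gpuCount).
// Returns, per GPU, the list of frame numbers it renders.
std::vector<std::vector<int>> assignFramesAfr(int frameCount, int gpuCount) {
    assert(gpuCount > 0 && frameCount >= 0);
    std::vector<std::vector<int>> framesPerGpu(static_cast<std::size_t>(gpuCount));
    for (int frame = 0; frame < frameCount; ++frame) {
        framesPerGpu[static_cast<std::size_t>(frame % gpuCount)].push_back(frame);
    }
    return framesPerGpu;
}
```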
Programmability

New mechanisms
 Command buffer predication & flow control
‒ GPU affecting/skipping submitted commands
‒ Go beyond DrawIndirect / DispatchIndirect
‒ Advanced variable workloads
‒ Advanced culling optimizations

 Write occlusion query results into GPU buffer
‒ No CPU roundtrip needed
‒ Can drive predicated rendering
‒ Or use results directly in shaders (lens flares)
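
The interaction between occlusion results written to a GPU buffer and predicated commands can be simulated on the CPU. This is a conceptual sketch, not Mantle API: `DrawCommand` and `executeWithPredication` are hypothetical, and the loop stands in for the GPU front end reading the buffer itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical command: a draw that is optionally predicated on a value in a
// GPU-visible buffer (for example an occlusion query result).
struct DrawCommand {
    int drawId;
    bool predicated;
    std::size_t predicateIndex;  // index into the results buffer
};

// Simulates how a GPU front end might consume the command buffer: predicated
// draws are skipped when their buffer entry is zero, with no CPU readback.
std::vector<int> executeWithPredication(const std::vector<DrawCommand>& commands,
                                        const std::vector<std::uint32_t>& results) {
    std::vector<int> executed;
    for (const DrawCommand& cmd : commands) {
        if (cmd.predicated) {
            assert(cmd.predicateIndex < results.size());
            if (results[cmd.predicateIndex] == 0) continue;  // skipped on the "GPU"
        }
        executed.push_back(cmd.drawId);
    }
    return executed;
}
```

The key property is that the CPU records all commands up front and never reads the results buffer; the decision to skip happens when the commands are consumed.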
Programmability

Bindless resources
 Mantle supports bindless resources
‒ Shaders can select resources to use instead of static binding from CPU
‒ Extension of the descriptor set support

 Examples
‒ Performance optimizations – less data to update
‒ Logic & data structures that live fully on the GPU
‒ Scene culling & rendering
‒ Material representations
‒ Deferred shading
‒ Raytracing

 Key component that will open up a lot of opportunities!
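
Conceptually, bindless means the shader picks its resources by index from a table at run time rather than having the CPU bind them per draw. The sketch below models that with plain C++; `Material` and `shadeWithBindless` are hypothetical, and strings stand in for texture descriptors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical material record living in a GPU buffer: it stores indices
// into one global texture table instead of relying on CPU-side bindings.
struct Material {
    std::uint32_t albedoIndex;
    std::uint32_t normalIndex;
};

// Stand-in for shader code: looks up the textures for `materialId` at
// "run time". Strings stand in for real texture descriptors.
std::string shadeWithBindless(const std::vector<std::string>& textureTable,
                              const std::vector<Material>& materials,
                              std::size_t materialId) {
    assert(materialId < materials.size());
    const Material& m = materials[materialId];
    assert(m.albedoIndex < textureTable.size() && m.normalIndex < textureTable.size());
    return textureTable[m.albedoIndex] + "+" + textureTable[m.normalIndex];
}
```

Switching materials here changes only an index, so the CPU has less data to update per draw, and the material data itself can live entirely in GPU memory.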
Platforms
Platforms

Today
 Mantle gives us strong benefits on Windows today
‒ Console-like performance & programmability on both Windows 7 and Windows 8
‒ For us, well worth the dev time!

 DX & GL are the industry standards
‒ Needed for platforms that do not support Mantle
‒ Needed by devs who do not want/need more control
‒ Need fallback paths for GL/DX, but should not limit oneself to them

 Mantle and PlayStation 4 will drive our future Frostbite designs & optimizations
‒ PS4 graphics API has great programmability & performance as well
‒ Share concepts, methods & optimization strategies
Platforms

Linux & Mac
 Want to see Mantle on Linux and Mac!
‒ Would enable support for our full engine & rendering
‒ Significantly easier to do efficient renderer with Mantle than with OpenGL

 Use cases:
‒ Workstations
‒ R&D
‒ Not limited by WDDM

‒ Games
‒ Mantle + SteamOS = powerful combination!
Platforms

Mobile
 Mobile architectures are getting closer in capabilities to desktop GPUs
 Want graphics API that allows apps to fully utilize the hardware
‒ Power efficient
‒ High performance
‒ Programmable

 Major opportunity with Mantle – leapfrog GL4, DX11
‒ For mobile SoC vendors
‒ For Google and Apple
Platforms

Multi-vendor?
 Mantle is designed to be a thin hardware abstraction
‒ Not tied to AMD’s GCN architecture
‒ Forward compatible
‒ Extensions for architecture- and platform-specific functionality

 Mantle would be a much more efficient graphics API for other vendors as well
‒ Most Mantle functionality can be supported on today’s modern GPUs

 Want to see future version of Mantle supported on all platforms and on all modern GPUs!
‒ Become an active industry standard with IHVs and ISVs collaborating
‒ Enable us developers to innovate with great performance & programmability everywhere
Frostbite

Battlefield 4
 Mantle support is in development
‒ Core renderer (closer to PS4 than DX11)
‒ Implement all rendering techniques used in BF4 (many!)
‒ CPU optimizations (parallel dispatch, descriptor sets)
‒ GPU optimizations (minimize transitions, MSAA)
‒ R&D for advanced GPU optimizations
‒ Memory management
‒ Multi-GPU support
‒ ~2 months of work

 Update targeting late December
Frostbite

Plants vs Zombies: Garden Warfare
 Very different rendering compared to BF4 ☺
 Frostbite Mantle renderer will work out of the box
 Focus on APU performance
Frostbite

Future
 All Frostbite games designed with Mantle
‒ 15 games in development across all of EA

 Advanced Mantle rendering & use cases
‒ Lots of exciting R&D opportunities!

 Want multi-vendor & multi-platform support!
Email: repi@dice.se
Web: http://frostbite.com
Twitter: @repi

THE END


Editor's Notes

  • #3 So what is Mantle? Mantle is a low-level graphics API, and its goals are to improve performance, make it easier to develop really advanced applications, and give developers a lot of freedom to build innovative graphics solutions. It is also a bit of a challenge to the established order of things, which I think is fun and healthy for the industry.
  • #4 We’ve been working with Mantle for some time now, adding support for it in our engine Frostbite and in Battlefield 4. I wanted to share some of our learnings and what Mantle can mean for developers in general.
  • #12 Can be of any size. Single per-draw-call small dynamic descriptor set. Static + dynamic. Multiple ones nested by update frequency.
  • #13 Can be of any size. Single per-draw-call small dynamic descriptor set. Static + dynamic. Multiple ones nested by update frequency.
  • #15 Not needed for: resource mapping. No implicit pipeline flushing. Much easier to track down stalls in the app itself.
  • #28 Also foveated rendering
  • #30 ? Need to go beyond HLSL for pointer support in shaders
  • #34 What is next after OpenGL ES3?
  • #38 Kaveri