Next-gen Mobile GPUs and Rendering Techniques 
Game Connection, October 29-31, 2014 
Niklas Smedberg 
Senior Engine Programmer, Epic Games
Game Connection, October 29-31, 2014 
Introduction 
• Niklas Smedberg, a.k.a. “Smedis” 
– 17 years in the game industry 
– Graphics programmer since the C64 demo scene
Game Connection, October 29-31, 2014 
Content 
• Next-gen Hardware 
– Tile-based GPU vs direct-rendering GPU 
• Next-gen Rendering Techniques 
– Unreal Engine 4 
– Previous goal: Bringing AAA Console graphics to mobile (DONE!) 
– New goal: Bringing AAA PC graphics to mobile (DONE!??)
Next Generation Mobile Hardware 
Game Connection, October 29-31, 2014 
• Big leap in features and performance 
• Full-featured (e.g. OpenGL ES 3.1+AEP, DirectX 10/11) 
• Peak performance now comparable to consoles (Xbox 360/PS3) 
– About 300+ GFLOPS and 26 GB/s 
• New goal: Bring AAA PC graphics to mobile
Performance Trends (FP16 GFLOPS) 
Game Connection, October 29-31, 2014 
6.4 12.8 
25.5 
154 
300+ 
350 
300 
250 
200 
150 
100 
50 
0 
2010 2011 2012 2013 2014 
2010 SGX 535 
2011 SGX 543MP2 
2012 SGX 543MP3 
2013 G6430 
2014 Adreno, K1, GX6650
Game Connection, October 29-31, 2014 
iOS 8 Metal 
• New rendering API for iOS 
• Better match for the hardware 
– Like a “game console API” 
– Much more efficient on the CPU 
– Exposes hardware features 
• 20x faster on our rendering thread 
– OpenGL ES API: 30% of renderthread 
– Metal API: 1.6% of renderthread 
• All power and perf responsibility is now in the hands of the developer!
Tech Demo: Zen Garden 
Game Connection, October 29-31, 2014
Game Connection, October 29-31, 2014 
Tile-based Mobile GPU 
• Mobile GPUs are usually tile-based (next-gen too) 
– Tile-based: ImgTec, Qualcomm*, ARM 
– Direct: NVIDIA, Intel, Vivante 
* Qualcomm Adreno can render either tile-based or direct to frame buffer 
– Extension: GL_QCOM_binning_control
Game Connection, October 29-31, 2014 
Tile-Based Mobile GPU 
Summary: 
• Split the screen into tiles 
– E.g. 32x32 pixels (ImgTec) or 300x300 (Qualcomm) 
• The whole tile fits within GPU, on chip 
• Process all drawcalls for one tile 
– Write out final tile results to RAM 
• Repeat for each tile to fill the image in RAM
ImgTec Tile-based Rendering Process 
Game Vertex Processing Tile Data (RAM) 
Game Connection, October 29-31, 2014 
Pixel Processing (Top-most 
only) 
Frame Buffer (RAM) 
Cmd Buffer (RAM) 
Hidden Surface 
Removal 
Tile 
Memory 
Per Tile:
Game Connection, October 29-31, 2014 
ImgTec Series 6 G6430 
• One GPU core 
• Four shader units (USC) 
– 16-way scalar 
• FP16: Two SOP per clock 
– (a*b + c*d) 
• FP32: Two MADD per clock 
– (a*b + c) 
• 154 GFLOPS @ 400 MHz 
– 16-bit floating point
Game Connection, October 29-31, 2014 
FP16 Is Faster Than FP32 
• ImgTec Series 6: 50% faster 
– FP16 pipeline: Two SOP per clock 
– FP32 pipeline: Two MADD per clock 
• ImgTec Series 6XT: 100% faster 
– FP16 pipeline: Four MADD per clock 
– FP32 pipeline: Two MADD per clock 
• Qualcomm Snapdragon: 100% faster
Game Connection, October 29-31, 2014 
ImgTec Rendering Tips 
• Hidden Surface Removal 
– For opaque only 
– Don’t keep alpha-test enabled all the time 
– Don’t keep “discard” keyword in shader source, even if it’s not used 
• Group opaque drawcalls together 
• Sort on state, not distance
Game Connection, October 29-31, 2014 
Framebuffer Resolve/Restore 
• Expensive to switch Frame Buffer Object on Tile-based GPUs 
– Saves the current FBO to RAM 
– Reloads the new FBO from RAM 
• Best performance: 
– A single rendertarget for the entire frame 
– No post-processing passes 
• Does not apply to NVIDIA Tegra GPUs! 
– This made it simpler for us to make our “Rivalry” tech demo for K1
Game Connection, October 29-31, 2014 
Framebuffer Resolve/Restore 
• Clear ALL FBO attachments after new frame/rendertarget 
– Clear after eglSwapBuffers / glBindFramebuffer 
– Avoids reloading FBO from RAM 
– NOTE: Do NOT clear unnecessary on non-tile-based GPUs (e.g. NVIDIA) 
• Discard unused attachments before new frame/rendertarget 
– Discard before eglSwapBuffers / glBindFramebuffer 
– Avoids saving unused FBO attachments to RAM 
– glDiscardFramebufferEXT / glInvalidateFramebuffer
Game Connection, October 29-31, 2014 
iOS Performance Profiling 
• Screenshot from Xcode, which shows: 
– How we clear FBO at the beginning of every render pass 
– Other important performance info
Game Connection, October 29-31, 2014 
Programmable Blending 
• GL_EXT_shader_framebuffer_fetch (gl_LastFragData) 
• Reads current pixel background “for free” 
• Potential uses: 
– Custom color blending 
– Blend by background depth value (depth in alpha) 
• E.g. Soft intersection against world geometry for particles 
– Tone-mapping within tile-memory 
– Deferred shading without resolving G-buffer 
• Stay on GPU and avoid expensive round-trip to RAM 
• See also: GL_EXT_shader_pixel_local_storage
Soul: Programmable Blending 
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Core Optimization: Opaque Draw Ordering 
Game Connection, October 29-31, 2014 
• All platforms, 
1. Group draws by material (shader) to reduce state changes 
• Then for all platforms except ImgTec, 
2. Skybox last: 5 ms/frame savings (vs drawing skybox first) 
3. Sort groups nearest first : extra 3 ms/frame savings 
4. Sort inside groups nearest first : extra 7 ms/frame savings 
• (Timings from a UE4 map on a Qualcomm-based device)
Game Connection, October 29-31, 2014 
Optimization: Resolution 
• Resolution does not match GPU performance! 
• Use custom resolution to match perf/pixel 
• Examples of same GPU: 
– iPhone 5S = 0.7 Mpix (1136x640) 
– iPad Air = 3.1 Mpix (2048x1536), more than 4 times slower! 
• Other examples: 
– Prev-Gen Console = 0.9 Mpix (1280x720) 
– Current-Gen Console = 2.1 Mpix (1920x1080)
Single Content, Multiple Platforms 
Game Connection, October 29-31, 2014 
• Core motivating factor in designing UE4 
– Authoring consistency between PC and Mobile 
• Authoring environment for both platforms 
– Physically-based shading model 
– High dynamic range linear color space 
– High quality post processing
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143 
Comparison: PC
Comparison: Mobile 
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Game Connection, October 29-31, 2014 
Consistency: Material Editor 
• One material for many platforms 
– Artist authored feature levels to scale shader 
perf from PC to mobile 
• Using cross compiler tool to retarget 
HLSL into GLSL
Directional Light + SDF Shadows 
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Consistency: HDR Directional Lightmaps 
Game Connection, October 29-31, 2014 
• Two compressed + One uncompressed textures: 
1. HDR color with log luma encoding 
2. World space 1st order spherical harmonic luma directionality 
3. SDF shadow texture 
• Optimized for Mobile and PVRTC compression: 
– PVRTC (ImgTec) lacks separately compressed encoding for alpha 
– PC color: RGB/Luma, LogLuma in Alpha 
– Mobile color: RGB/Luma * LogLuma, no Alpha (LogLuma derived with ALU) 
– Mobile: A single dynamic directional light as the “sun”
Image-Based Lighting 
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Game Connection, October 29-31, 2014 
Image-Based Lighting 
• Selects a mip-level in an IBL cubemap based on per-pixel roughness 
– Same as PC, filtering same as PC 
• PC: FP16 
• Mobile: Decode(RGBM)^2 
• PC: Blend multiple cubemaps per surface and parallax correction 
• Mobile: One infinite-distance cubemap per object
1/2 DOF 
Downsample 
1/2 DOF Filter 
Tonemapping 
Game Connection, October 29-31, 2014 
Mobile Post Processing Pipeline 
FP16 RGB=Color A=Depth 
Anti-Aliasing 
1/4 DOF Near Dilation 
1/4 Lightshaft Filter, Pass 1 
1/4 Final Filter Passes and Merge 
{Light Shafts, Bloom, Vignette} 
Bloom Filter Tree 
1/4 Lightshaft Filter, Pass 2 
1/4 Smart Reduction 
Light Shaft (Sun) Mask 
A=Depth to A=CoC+Sun 
[conversion done on chip if possible] 
1/8 
1/16 
1/32 
1/8 
1/16 
1/32 
1/64
Mobile Depth of Field 
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Packed Circle of Confusion and Sun Intensity 
Game Connection, October 29-31, 2014 
• Optimization done for Depth of Field and Light Shafts 
– Represent sun intensity and circle of confusion (CoC) in one FP16 value 
– Saves needing an extra render target 
Depth (0 to 65504) 
CoC (0 to 1) Sun Intensity (1 to 65504) 
0=Max Near Bokeh 
0.5=In Focus 
1=Max Far Bokeh and No Sun 
65504=Max Sun (still at Max Far Bokeh)
Vignette + Bloom + Light Shafts 
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Vignette + Bloom + Light Shafts: Optimized 
Game Connection, October 29-31, 2014 
• Composited at quarter resolution 
– Using pre-multiplied alpha FP16 
– Also performs the final filtering passes for bloom and light shafts 
• Optimized to minimize rendertarget switches 
• 3 effects applied to scene with one full-res TEX fetch 
– Applied in the tonemap pass
Game Connection, October 29-31, 2014 
Mobile Light Shafts 
• Filtered at quarter resolution 
• Monochromatic with artist controlled tint on final composite 
– Bloom and light shaft down-sample pass shared 
– RGB = color for bloom, A = light shaft intensity 
• Light shaft filter runs in tonemapped space (8-bit/channel) 
– Applied in linear (reverse tonemap before tint and composite) 
• Filtering done by 3 passes of 8 taps
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143 
Mobile Bloom
Game Connection, October 29-31, 2014 
Mobile Bloom 
• Lower quality version to minimize resolves 
– Limited effect radius and less passes 
• Turns out to be faster and higher-quality 
– Standard hierarchical algorithm with some optimizations 
– Down-sample from 1:1/4 res first (shared with light shaft) 
– Then down-sample in 1:1/2 resolution passes 
– Single pass circle-based filter (instead of two-pass Gaussian) 
• 15 taps on circle during down-sampling 
• 7 taps for both circles during up-sample+merge pass
Film Post Tonemapping 
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Film Post Tonemapping: Mobile Shader 
Game Connection, October 29-31, 2014 
Film Post 
RGBA8 LDR sRGB Output 
Quantization Grain 
(optional) 
Shadow Tint 
(optional) 
Color Matrix 
{Saturation, Channel Mixer} 
(optional) 
Tint, Tonemap Curve Linear to sRGB 
RGBA16F HDR Linear Color 
Blend in DOF 
(optional) 1/2 DOF 
Film Grain Jitter 
(optional) 
Blend in {Bloom, Vignette, Light Shafts} 
(optional) 
Film Grain 
(optional) 
1/4 Pre-multiplied Alpha
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143 
Anti-Aliasing
Game Connection, October 29-31, 2014 
Anti-Aliasing on Mobile 
• Using a spatial & temporal anti-aliasing filter 
– 2x temporal super-sampling (higher quality surface shading) 
– Blending two jittered frames after tonemapping (32 bpp, 
perceptual color space) 
– Some special logic to remove ghosting/judder (not using 
motion re-projection) 
• Not currently using MSAA 
– Needed super-sampled shading for HDR physically based shading model 
Pixel
All Techniques Combined 
Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
Game Connection, October 29-31, 2014 
Android Extension Pack 
• Released with Android “L” 
• OpenGL ES 3.1 plus a specific set of extensions 
– Compute shaders 
– Tessellation 
– ASTC 
– Detailed spec on www.khronos.org: GL_ANDROID_extension_pack_es31a 
• Full UE4 desktop rendering pipeline on Android! 
– Deferred rendering with G-buffer 
– Physically-based shading 
– Image-based lighting 
– Solid 30 FPS on NVIDIA K1 mobile GPU
Tech Demo: Rivalry 
Game Connection, October 29-31, 2014
Game Connection, October 29-31, 2014 
Unreal Engine 4 
Full source code available! 
unrealengine.com 
Includes all C++, shaders, tools, content 
$19/mo

Next generation mobile gp us and rendering techniques - niklas smedberg

  • 1.
    Next-gen Mobile GPUsand Rendering Techniques Game Connection, October 29-31, 2014 Niklas Smedberg Senior Engine Programmer, Epic Games
  • 2.
    Game Connection, October29-31, 2014 Introduction • Niklas Smedberg, a.k.a. “Smedis” – 17 years in the game industry – Graphics programmer since the C64 demo scene
  • 3.
    Game Connection, October29-31, 2014 Content • Next-gen Hardware – Tile-based GPU vs direct-rendering GPU • Next-gen Rendering Techniques – Unreal Engine 4 – Previous goal: Bringing AAA Console graphics to mobile (DONE!) – New goal: Bringing AAA PC graphics to mobile (DONE!??)
  • 4.
    Next Generation MobileHardware Game Connection, October 29-31, 2014 • Big leap in features and performance • Full-featured (e.g. OpenGL ES 3.1+AEP, DirectX 10/11) • Peak performance now comparable to consoles (Xbox 360/PS3) – About 300+ GFLOPS and 26 GB/s • New goal: Bring AAA PC graphics to mobile
  • 5.
    Performance Trends (FP16GFLOPS) Game Connection, October 29-31, 2014 6.4 12.8 25.5 154 300+ 350 300 250 200 150 100 50 0 2010 2011 2012 2013 2014 2010 SGX 535 2011 SGX 543MP2 2012 SGX 543MP3 2013 G6430 2014 Adreno, K1, GX6650
  • 6.
    Game Connection, October29-31, 2014 iOS 8 Metal • New rendering API for iOS • Better match for the hardware – Like a “game console API” – Much more efficient on the CPU – Exposes hardware features • 20x faster on our rendering thread – OpenGL ES API: 30% of renderthread – Metal API: 1.6% of renderthread • All power and perf responsibility is now in the hands of the developer!
  • 7.
    Tech Demo: ZenGarden Game Connection, October 29-31, 2014
  • 8.
    Game Connection, October29-31, 2014 Tile-based Mobile GPU • Mobile GPUs are usually tile-based (next-gen too) – Tile-based: ImgTec, Qualcomm*, ARM – Direct: NVIDIA, Intel, Vivante * Qualcomm Adreno can render either tile-based or direct to frame buffer – Extension: GL_QCOM_binning_control
  • 9.
    Game Connection, October29-31, 2014 Tile-Based Mobile GPU Summary: • Split the screen into tiles – E.g. 32x32 pixels (ImgTec) or 300x300 (Qualcomm) • The whole tile fits within GPU, on chip • Process all drawcalls for one tile – Write out final tile results to RAM • Repeat for each tile to fill the image in RAM
  • 10.
    ImgTec Tile-based RenderingProcess Game Vertex Processing Tile Data (RAM) Game Connection, October 29-31, 2014 Pixel Processing (Top-most only) Frame Buffer (RAM) Cmd Buffer (RAM) Hidden Surface Removal Tile Memory Per Tile:
  • 11.
    Game Connection, October29-31, 2014 ImgTec Series 6 G6430 • One GPU core • Four shader units (USC) – 16-way scalar • FP16: Two SOP per clock – (a*b + c*d) • FP32: Two MADD per clock – (a*b + c) • 154 GFLOPS @ 400 MHz – 16-bit floating point
  • 12.
    Game Connection, October29-31, 2014 FP16 Is Faster Than FP32 • ImgTec Series 6: 50% faster – FP16 pipeline: Two SOP per clock – FP32 pipeline: Two MADD per clock • ImgTec Series 6XT: 100% faster – FP16 pipeline: Four MADD per clock – FP32 pipeline: Two MADD per clock • Qualcomm Snapdragon: 100% faster
  • 13.
    Game Connection, October29-31, 2014 ImgTec Rendering Tips • Hidden Surface Removal – For opaque only – Don’t keep alpha-test enabled all the time – Don’t keep “discard” keyword in shader source, even if it’s not used • Group opaque drawcalls together • Sort on state, not distance
  • 14.
    Game Connection, October29-31, 2014 Framebuffer Resolve/Restore • Expensive to switch Frame Buffer Object on Tile-based GPUs – Saves the current FBO to RAM – Reloads the new FBO from RAM • Best performance: – A single rendertarget for the entire frame – No post-processing passes • Does not apply to NVIDIA Tegra GPUs! – This made it simpler for us to make our “Rivalry” tech demo for K1
  • 15.
    Game Connection, October29-31, 2014 Framebuffer Resolve/Restore • Clear ALL FBO attachments after new frame/rendertarget – Clear after eglSwapBuffers / glBindFramebuffer – Avoids reloading FBO from RAM – NOTE: Do NOT clear unnecessary on non-tile-based GPUs (e.g. NVIDIA) • Discard unused attachments before new frame/rendertarget – Discard before eglSwapBuffers / glBindFramebuffer – Avoids saving unused FBO attachments to RAM – glDiscardFramebufferEXT / glInvalidateFramebuffer
  • 16.
    Game Connection, October29-31, 2014 iOS Performance Profiling • Screenshot from Xcode, which shows: – How we clear FBO at the beginning of every render pass – Other important performance info
  • 17.
    Game Connection, October29-31, 2014 Programmable Blending • GL_EXT_shader_framebuffer_fetch (gl_LastFragData) • Reads current pixel background “for free” • Potential uses: – Custom color blending – Blend by background depth value (depth in alpha) • E.g. Soft intersection against world geometry for particles – Tone-mapping within tile-memory – Deferred shading without resolving G-buffer • Stay on GPU and avoid expensive round-trip to RAM • See also: GL_EXT_shader_pixel_local_storage
  • 18.
    Soul: Programmable Blending Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
  • 19.
    Core Optimization: OpaqueDraw Ordering Game Connection, October 29-31, 2014 • All platforms, 1. Group draws by material (shader) to reduce state changes • Then for all platforms except ImgTec, 2. Skybox last: 5 ms/frame savings (vs drawing skybox first) 3. Sort groups nearest first : extra 3 ms/frame savings 4. Sort inside groups nearest first : extra 7 ms/frame savings • (Timings from a UE4 map on a Qualcomm-based device)
  • 20.
    Game Connection, October29-31, 2014 Optimization: Resolution • Resolution does not match GPU performance! • Use custom resolution to match perf/pixel • Examples of same GPU: – iPhone 5S = 0.7 Mpix (1136x640) – iPad Air = 3.1 Mpix (2048x1536), more than 4 times slower! • Other examples: – Prev-Gen Console = 0.9 Mpix (1280x720) – Current-Gen Console = 2.1 Mpix (1920x1080)
  • 21.
    Single Content, MultiplePlatforms Game Connection, October 29-31, 2014 • Core motivating factor in designing UE4 – Authoring consistency between PC and Mobile • Authoring environment for both platforms – Physically-based shading model – High dynamic range linear color space – High quality post processing
  • 22.
    Game DeGvealmopee CrCononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143 Comparison: PC
  • 23.
    Comparison: Mobile GameDeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
  • 24.
    Game Connection, October29-31, 2014 Consistency: Material Editor • One material for many platforms – Artist authored feature levels to scale shader perf from PC to mobile • Using cross compiler tool to retarget HLSL into GLSL
  • 25.
    Directional Light +SDF Shadows Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
  • 26.
    Consistency: HDR DirectionalLightmaps Game Connection, October 29-31, 2014 • Two compressed + One uncompressed textures: 1. HDR color with log luma encoding 2. World space 1st order spherical harmonic luma directionality 3. SDF shadow texture • Optimized for Mobile and PVRTC compression: – PVRTC (ImgTec) lacks separately compressed encoding for alpha – PC color: RGB/Luma, LogLuma in Alpha – Mobile color: RGB/Luma * LogLuma, no Alpha (LogLuma derived with ALU) – Mobile: A single dynamic directional light as the “sun”
  • 27.
    Image-Based Lighting GameDeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
  • 28.
    Game Connection, October29-31, 2014 Image-Based Lighting • Selects a mip-level in an IBL cubemap based on per-pixel roughness – Same as PC, filtering same as PC • PC: FP16 • Mobile: Decode(RGBM)^2 • PC: Blend multiple cubemaps per surface and parallax correction • Mobile: One infinite-distance cubemap per object
  • 29.
    1/2 DOF Downsample 1/2 DOF Filter Tonemapping Game Connection, October 29-31, 2014 Mobile Post Processing Pipeline FP16 RGB=Color A=Depth Anti-Aliasing 1/4 DOF Near Dilation 1/4 Lightshaft Filter, Pass 1 1/4 Final Filter Passes and Merge {Light Shafts, Bloom, Vignette} Bloom Filter Tree 1/4 Lightshaft Filter, Pass 2 1/4 Smart Reduction Light Shaft (Sun) Mask A=Depth to A=CoC+Sun [conversion done on chip if possible] 1/8 1/16 1/32 1/8 1/16 1/32 1/64
  • 30.
    Mobile Depth ofField Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
  • 31.
    Packed Circle ofConfusion and Sun Intensity Game Connection, October 29-31, 2014 • Optimization done for Depth of Field and Light Shafts – Represent sun intensity and circle of confusion (CoC) in one FP16 value – Saves needing an extra render target Depth (0 to 65504) CoC (0 to 1) Sun Intensity (1 to 65504) 0=Max Near Bokeh 0.5=In Focus 1=Max Far Bokeh and No Sun 65504=Max Sun (still at Max Far Bokeh)
  • 32.
    Vignette + Bloom+ Light Shafts Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
  • 33.
    Vignette + Bloom+ Light Shafts: Optimized Game Connection, October 29-31, 2014 • Composited at quarter resolution – Using pre-multiplied alpha FP16 – Also performs the final filtering passes for bloom and light shafts • Optimized to minimize rendertarget switches • 3 effects applied to scene with one full-res TEX fetch – Applied in the tonemap pass
  • 34.
    Game Connection, October29-31, 2014 Mobile Light Shafts • Filtered at quarter resolution • Monochromatic with artist controlled tint on final composite – Bloom and light shaft down-sample pass shared – RGB = color for bloom, A = light shaft intensity • Light shaft filter runs in tonemapped space (8-bit/channel) – Applied in linear (reverse tonemap before tint and composite) • Filtering done by 3 passes of 8 taps
  • 35.
    Game DeGvealmopee CrCononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143 Mobile Bloom
  • 36.
    Game Connection, October29-31, 2014 Mobile Bloom • Lower quality version to minimize resolves – Limited effect radius and less passes • Turns out to be faster and higher-quality – Standard hierarchical algorithm with some optimizations – Down-sample from 1:1/4 res first (shared with light shaft) – Then down-sample in 1:1/2 resolution passes – Single pass circle-based filter (instead of two-pass Gaussian) • 15 taps on circle during down-sampling • 7 taps for both circles during up-sample+merge pass
  • 37.
    Film Post Tonemapping Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
  • 38.
    Film Post Tonemapping:Mobile Shader Game Connection, October 29-31, 2014 Film Post RGBA8 LDR sRGB Output Quantization Grain (optional) Shadow Tint (optional) Color Matrix {Saturation, Channel Mixer} (optional) Tint, Tonemap Curve Linear to sRGB RGBA16F HDR Linear Color Blend in DOF (optional) 1/2 DOF Film Grain Jitter (optional) Blend in {Bloom, Vignette, Light Shafts} (optional) Film Grain (optional) 1/4 Pre-multiplied Alpha
  • 39.
    Game DeGvealmopee CrCononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143 Anti-Aliasing
  • 40.
    Game Connection, October29-31, 2014 Anti-Aliasing on Mobile • Using a spatial & temporal anti-aliasing filter – 2x temporal super-sampling (higher quality surface shading) – Blending two jittered frames after tonemapping (32 bpp, perceptual color space) – Some special logic to remove ghosting/judder (not using motion re-projection) • Not currently using MSAA – Needed super-sampled shading for HDR physically based shading model Pixel
  • 41.
    All Techniques Combined Game DeGvealmopee Cr Cononnefecrteionnc,e O, Mctoabrcehr 295-3219, 20143
  • 42.
    Game Connection, October29-31, 2014 Android Extension Pack • Released with Android “L” • OpenGL ES 3.1 plus a specific set of extensions – Compute shaders – Tessellation – ASTC – Detailed spec on www.khronos.org: GL_ANDROID_extension_pack_es31a • Full UE4 desktop rendering pipeline on Android! – Deferred rendering with G-buffer – Physically-based shading – Image-based lighting – Solid 30 FPS on NVIDIA K1 mobile GPU
  • 43.
    Tech Demo: Rivalry Game Connection, October 29-31, 2014
  • 44.
    Game Connection, October29-31, 2014 Unreal Engine 4 Full source code available! unrealengine.com Includes all C++, shaders, tools, content $19/mo

Editor's Notes

  • #6 Important: Device manufacturers tweak hardware settings! 2010 SGX 535 @ 200 MHz 6.4 GFLOPS 2011 SGX 543MP2 @ 200 MHz 12.8 GFLOPS 2012 SGX 543MP3 @ 266 MHz 25.5 GFLOPS 2012 Adreno 225 @ 400 MHz 28.8 GFLOPS 2013 G6430 @ 400 MHz 154 GFLOPS Adreno 300+ GFLOPS
  • #10 The method of “processing all drawcalls for one tile” differ between GPU architectures. Details in later slides.
  • #11 Hidden Surface Removal: Order-independent per pixel (opaque only)
  • #12 G6430 diagram: Shows one GPU core, 4 SIMD, 16 fp16 pipelines and 16 fp32 pipelines (no dual-issue). Compare with SGX USSE.
  • #19 Example of programmable blending in the “Soul” tech demo: Waterfall on the right
  • #22 Before, we were rendering into gamma-space, but now because of fp16 we can run in linear color space, with lighting in HDR. We need high-quality post-processing, because physically-based shading puts extra challenges on materials, especially when the camera is in motion.
  • #23 Here is the mobile SunTemple map with some post process changes, as rendered using the PC path in UE4. Resolution 1920x1200.
  • #24 Same map rendered with the mobile renderer with DOF on. Compared to PC: Lighting is slightly different - reflections are less accurate, anti-aliasing is not as good. However, the same game content can run on all platforms and still look quite consistent between the two platforms.
  • #25 You can create a single material that renders the same way on both platforms. If you want to make something special on one platform, we support “Feature Level switches” which lets you can special graph paths that only execute on higher-end devices. This makes it possible to create a single material that can scale from very low-end to very high-end devices.
  • #26 The signed distance field shadows enables a clean silhouette. This shot is rendered using the mobile renderer at 1920x1200 with AA, grain, depth of field, bloom, and leveraging film post for the black and white conversion. The B&W conversion has warm tint to the highlights and a cool tint to the shadows. Resolution would be lower on a mobile device.
  • #27 On mobile LogLuma is computed by taking Luminance of the RGB (extra math).
  • #28 This shot is entirely in shadow. There is no direct light from the sun. All light is coming from reflection cubemaps that has been pre-captured. Specular and reflections from the fire is generated from image based lighting. This shot is 1920x1200 rendered in the mobile path with AA, grain, depth of field, bloom, and leveraging film post for the black and white conversion. The B&W conversion has warm tint to the highlights and a cool tint to the shadows. Resolution would be lower on a mobile device.
  • #30 Lots of managed complexity: many passes ended up getting merged to save on resolves. Bloom, DOF, Light shafts, Vignette, Tonemapping, Anti-aliasing
  • #31 We can’t do the large Bokeh that we do on PC, so we changed the mobile DoF to focus on the quality of the blur and the transitions, rather than just large Bokeh. Supporting both near and far depth of field via separate focal planes. This is not physically correct, but important for gameplay purposes. Supporting only a small max circle of confusion size for performance reasons, equivalent to 16 taps.
  • #32 Optimization done for Depth of Field and Light Shafts Represent both sun intensity and CoC in one FP16 value Saves needing an extra render target (saves both an FBO switch and an extra texture lookup) Note CoC is effectively an alternative encoding for depth Encode Alpha = CoC size + sun intensity Where CoC size is {0.0 = maximum near CoC size, 0.5 = in focus (CoC=0.0), 1.0 = maximum far CoC size (the infinite far plane)} Decode CoC size = clamp(Alpha, 0.0, 1.0) sun intensity = max(0.0, Alpha - 1.0) Works because sun is only at CoC = max far size, and light shafts are monochromatic On GPUs with framebuffer fetch this transform is done inline without a resolve
  • #33 All three effects are combined in a quarter-res buffer. Warm light from outside, but cold lighting inside: Monochromatic light shafts with a blue tint + vignette with blue tint + bloom with no tint.
  • #35 Filtering of the light shafts is done in tonemapped space as an optimization for bandwidth.
  • #36 We can now run our high-quality bloom path on mobile, allowing a very wide bloom.
  • #37 Down-sampled to 1/64, then back up to 1/4 resolution. Ring-filter, where no sample is on the same vertical or horizontal axis. Applied in a hierarchical fashion, which gives us very good results. Running in FP16 linear HDR 32-bit RGBM does not filter 32-bit RGBT (RGB ratio, tonemapped luma) too expensive (ALU)
  • #38 The same scene from two different camera positions and two completely different film post settings. Left has film-grain, right does not.
  • #39 Overview of features exposed by permutations of the tonemapping shader. Uber-shader that combines many passes, for optimization. Quantization grain can be used to break up darker shades in sRGB.
  • #40 This shot uses anti-aliasing, depth of field, grain, and the film post controls. Anti-aliasing is running in linear color space.
  • #41 Bilinear fetch to remove jittered sub-pixel offset (small spatial filter) Neighborhood clamp to remove ghosting Thin features may only show in one of the two frame, so need to allow a fraction of prior frame outside neighborhood to remove flickering of these thin features Reduce allowance as camera rotation increases to remove judder Not using full motion re-projection, e.g. using a motion vector field (too expensive)