SlideShare a Scribd company logo
1 of 54
Download to read offline
Monday, August 13, 12
Unity: iOS and Android -
                        Cross-Platform Challenges and Solutions

                                     Renaldas Zioma
                                    Unity Technologies



Monday, August 13, 12
Mobile devices today
                Can render ...




                        Dead Trigger courtesy of MadFingerGames
Monday, August 13, 12
Mobile devices today
                Can render ...




                        Dead Trigger courtesy of MadFingerGames
Monday, August 13, 12
Mobile devices today
                Can render this @ 2048 x 1536




                        CSR Racing courtesy of BossAlien & NaturalMotion
Monday, August 13, 12
Mobile Platform Challenges
                Different GPU architectures
                        • API extensions
                Screen resolutions
                Performance scale
                Drivers
                Texture formats



Monday, August 13, 12
4 (or 5) GPU Architectures
                ImgTech PowerVR SGX - TBDR (TileBasedDeferred)
                        • ImgTech PowerVR MBX - TBDR (Fixed Function)
                ARM Mali - Tiled (small tiles)
                Qualcomm Adreno - Tiled (large tiles)
                        • Adreno3xx - can switch to Traditional
                        • glHint(GL_BINNING_CONTROL_HINT_QCOM, GL_RENDER_DIRECT_TO_FRAMEBUFFER_QCOM)

                NVIDIA Tegra - Traditional



Monday, August 13, 12
Tiled Architecture
                Splits screen into tiles
                        • small (for example: 16x16) - SGX, MALI
                        • relatively large (for example 256K) - Adreno
                Tile memory is on chip - fast!
                Once GPU is done rendering tile
                        • tile is “resolved” - written out to slower RAM




Monday, August 13, 12
Tiled Deferred Architecture
                Per-drawcall: polygons are transformed, assigned to tiles, stored
                   in memory (Parameter Buffer)
                Rasterization starts only after all scene drawcalls were processed
                        • every tile has access to all covering polygons
                Per-tile: Occluded polygons are rejected and only visible parts of
                   polygons are rasterized
                        • for opaque geometry rasterization will touch every pixel only once
                        • saves ALU and texture reads




Monday, August 13, 12
Tiled Deferred Architecture

                        Vertex shader         Tile Accelerator             Parameter Buffer


     Rasterization starts only after all scene drawcalls were processed


                                        Hidden Surface Removal               Tile Rasterizer



                                                                            On Chip Memory

                                                                          “Resolve”

                                                            Frame buffer RAM
Monday, August 13, 12
Not so scary in practice! Just...
                Sort opaque geom differently for Traditional vs Tiled
                        • Tiled: sort by material to reduce CPU drawcall overhead
                        • Traditional: sort roughly front-to-back to maximize ZCull efficiency
                          • then by material
                        • Tiled Deferred: render alpha-tested after opaque
                          • higher chance that expensive alpha-tested pixels will be occluded
                Separate render loop for MBX Fixed Function
                         optimized for low-end devices, can go faster than GLES2.0 loop, no per-pixel lighting, limited postFX possibilities
                         phasing it out

                Be more aggressive with 16bit framebuffers on Tiled
Monday, August 13, 12
Not so scary in practice! Just...
                Use EXT_discard_framebuffer extensions on Tiled
                        • will avoid copying data (color/depth/stencil) you're not planning to use
                Clear RenderTarget before rendering into it
                        • otherwise on Tiled driver will copy color/depth/stencil back from RAM
                        • not clearing is not an optimization!




Monday, August 13, 12
Architectural Benefits
                Benefits
                        • Tiled: MSAA is almost free (5-10% of rendering time)
                        • Tiled: AlphaBlending is significantly cheaper
                        • Tiled: less dithering artifacts for 16bit framebuffers
                Caveats
                        • TBDR: RenderTarget switch might be more expensive
                        • TBDR: Too much geometry will flush whole pipeline (ParameterBuffer overflow)




Monday, August 13, 12
Interesting Tiles
                Reminds recent works
                        •“Tile-based Deferred Shading”, Andrew Lauritzen, SIGGRAPH2010
                        •“Tile-based Forward Rendering”, Takahiro Hirada, GDC2012
                ... suitable for high-end GPUs
                        • different problems
                        • but common solutions




Monday, August 13, 12
Screen Resolutions (Android)
                Most often found resolutions are darker




                                  Image is a courtesy of OpenSignalMaps




Monday, August 13, 12
What is more scary: Drivers!
                Android specific problem!
                Graphics Drivers
                        • bugs
                        • performance variations
                        • chaos of 90ies is back!
                Quality is dramatically improving on IHV side
                        • but in many cases mobile vendors won't provide new drivers for their
                        devices: don't care / security testing / phased out devices...



Monday, August 13, 12
What is more scary: Drivers!
                Establish good relations with IHV
                Send bug-reports
                Automatize testing
                        • more on auto testing later
                Help Google with their open-source testing rig!
                        • http://source.android.com/compatibility/downloads.html




Monday, August 13, 12
Multiple Texture formats
                Android specific problem!
                ETC1 mandatory in GL ES2.0 - but NO Alpha support!
                        • Instead platform specific formats: PVRTC, ATC, DXT5, ETC2
                        • No single format which would be supported on all devices
                Uncompressed 16bit for textures with Alpha
                        • slow, large
                Yay! GL ES3.0 solves Alpha - mandatory ETC2
                        • Plus new formats: EAC, ASTC


Monday, August 13, 12
Textures with Alpha in GL ES2.0
                Pair of ETC textures: RGB in 1st texture + Alpha in 2nd
                Self-downloading application
                        • small bootrstrap app - determines GPU family on 1st run
                        • downloads and stores pack with GPU specific assets
                         • Unity: AssetBundles
                         • GooglePlay: new expansion files (up to 2GB)
                GooglePlay filtering
                         • build multiple versions of the game, each with textures for certain GPU
                         • <supports-gl-texture> tag in AndroidManifest


Monday, August 13, 12
GPU Architecture: Shader cores
                Unified - Vertex & Pixel use the same core
                        • Workload balancing
                        • SGX, Adreno, Mali T6xx
                Traditional - Vertex and Pixel cores are separate
                        • Either stage can be bottleneck at any given moment
                        • Tegra, Mali 4xx, MBX




Monday, August 13, 12
Skinning + Unified Architecture
             Offload work from GPU - skinning on CPU with NEON
                  • Favors Unified architecture - reduces vertex workload on GPU
                  • Tegra non Unified, but has 4 very fast NEON cores - so good too
             Reuse skinning results: shadows, multi-pass lighting
             Reduces code complexity & shader permutations




Monday, August 13, 12
Skinning on CPU
                Results on A9 @ 1Ghz (iPad3), NEON, 1 core:
                        •1 bones, position+normal+tangent - 12.2 Mverts/sec
                        •2 bones, position+normal+tangent - 11 Mverts/sec
                        •4 bones, position+normal+tangent - 6.7 Mverts/sec
                        •Test: 200 characters each 558 vertices




Monday, August 13, 12
Skinning on CPU
                Warning: net result of offloading work to CPU is
                   trickier when power consumption comes to play!
                        • game might run faster
                        • but can drain battery faster too (NEON is power hungry)




Monday, August 13, 12
Balancing CPU vs GPU
                Ideally would use DirectX11 Compute alike shaders
                        • if driver could run same shader on GPU or CPU depending on
                        platform / workload
                        • all reusable geometry transformations and image PostProcessing
                Might be worth trying Transform Feedback in GL ES3.0




Monday, August 13, 12
GPU Architecture: Precision of pixel ops
                Optimal precision for GPU family
                 • 11/12bit per-component (fixed) - SGX pre543, Tegra
                 • 16bit per-component (half) - SGX 543, Mali 4xx
                 • 32bit per-component (float) - Adreno, Mali T6xx
                Watch out for precision conversions
                 • most often will require additional cycles!
                 • (at least) SGX543 can hide conversion overhead sampling from texture




Monday, August 13, 12
Precision mixing examples
                BAD
                        struct Input {
                            float4 color : TEXCOORD0;
                        };
                        fixed4 frag (Input IN) : COLOR
                        {
                            return IN.color; // BAD: float -> fixed conversion
                        }




                BAD
                        fixed4 uniform;
                        ...
                        half4 result = pow (l, 2.2) + uniform; // BAD: fixed -> half conversion




                OK     half4 tex = tex2d (textureRGBA8bit, uv); // OK: conversion for free




Monday, August 13, 12
Cross platform precision considerations

                sRGB reads/writes are not available on mobiles yet
                        • though some hardware supports already
                As a result linear lighting is too expensive
                Arguable fixed point (11bit) can be enough for many pixels
                        • do per-pixel lighting in object space
                        • do fog per-vertex
                        • no depth-shadowmaps
                        • for specular could use texture lookup instead of pow ()
                         • at least 3 cycles (actually 4 to comply with ES standards)
                         • pow () result is in half/float precision, requires conversion to fixed
Monday, August 13, 12
Cross-platform shaders in Unity
                Cg/HLSL snippets wrapped in custom language
                        • helps to defines state, multipass rendering and lighting setup
                Rationale: maximizing cross-platform applicability
                        • abstract from mundane shader details
                        • generate platform specific code in:
                         • HLSL
                         • GLSL / GLSL ES
                         • DirectX or ARB assembly
                         • AGL
                         • etc

Monday, August 13, 12
Cross-platform shaders in Unity
                Artist specifies high-level shader on
                   the Material
                        • ex: "Bumped Specular", "Tree Leaves", "Unlit"
                Run-time picks specific platform
                   shader depending on
                        • supported feature set
                         • via Shader Fallback
                        • state (lights / shadows / lightmaps)
                          •via builtin Shader Keywords
                        • user-defined keys
                         •via Shader LOD + custom Shader Keywords


Monday, August 13, 12
Shader Fallback
                “If this shader can not run on this hardware, then try next one”
                Fallbacks can be chained
                            Shader "Per-pixel Lit" {
                                // shader code here ...
                                Fallback "Per-vertex Lit"
                            }




Monday, August 13, 12
Shader Keywords
                Built-in and custom shader permutations
                Using shader pre-processor macros
                         #pragma multi_compile LIGHTING_PER_PIXEL
                         ...
                         #ifdef LIGHTING_PER_PIXEL
                         // per pixel-lit
                         #else
                         // per vertex-lit
                         #endif

                         #pragma multi_compile PREFER_HALF_PRECISION
                         #ifdef PREFER_HALF_PRECISION
                           // force all operations to higher precision
                           #define scalar half
                           #define vec4 half4
                         #else
                           #define scalar fixed
                           #define vec4 fixed4
                         #endif
Monday, August 13, 12
Shader Keywords
                Example triggers custom shader permutation from script

                         // Devices with lots of muscle per pixel
                         if (iPhone.generation == iPad2Gen ||
                             iPhone.generation == iPhone4S ||
                             iPhone.generation == iPhone3GS)
                             Shader.EnableKeyword (“LIGHTING_PER_PIXEL”);

                         // Devices with SGX543
                         if (iPhone.generation ==   iPad2Gen ||
                             iPhone.generation ==   iPad3Gen ||
                             iPhone.generation ==   iPhone4S)
                             Shader.EnableKeyword   (“PREFER_HALF_PRECISION”);




Monday, August 13, 12
Shader LevelOfDetail
                Shader switch depending on platform performance
                        • LOD - integer value
                                    Shader "Lit" {
                                        SubShader { LOD 200 // per pixel-lit ..
                                        SubShader { LOD 100 // per vertex-lit ..
                                    }




                Example triggers shader LOD from script
                                   // Devices with lots of muscle per pixel
                                   if (iPhone.generation == iPad2Gen ||
                                       iPhone.generation == iPhone4S ||
                                       iPhone.generation == iPhone3GS)
                                       Shader.globalMaximumLOD = 200;
Monday, August 13, 12
Cross-platform shaders
                Surface shading and lighting snippets
                        • Instead of writing full vertex/pixel shader
                        • Just snippets of code
                Generate all “cruft” automagically depending on platform and state
                        • Shader generation is done offline
                                       #pragma surface MySurface Ramp
                                       void MySurface (Input IN, inout SurfaceOutput o) {
                                           o.Albedo = tex2D (_MainTex, ...);
                                           o.Albedo *= tex2D (_Detail, ...) * 2;
                                           o.Normal = UnpackNormal (tex2D (_BumpMap, ...));
                                       }

                                       half4 LightingRamp (SurfaceOutput s, half3 lightDir ...) {
                                           half2 NdotL = dot (s.Normal, lightDir);
                                           half3 ramp = tex2D (_Ramp, NdotL);
                                           half4 l;
                                           l.rgb = s.Albedo * ramp;
                                           ...
                                           return l;
                                       }


Monday, August 13, 12
Automatic code optimization
                cgbatch: Takes Cg/HLSL snippets and generates complete shader code in HLSL
                Preprocessor step
                hlsl2glsl: Converts HLSL to GLSL
                        • resurrected old ATI’s project, fixed & improved. Open source! https://github.com/aras-p/hlsl2glslfork
                glsls-optimizer (1): Offline GPU-independent GLSL optimization
                        • think inlining, dead code removal, copy propagation, arithmetic simplifications etc.
                        • 2 year ago many mobile drivers were bad at optimizations - we had 2-10x improvement
                        • Still very valuable
                glsls-optimizer (2): A fork of Mesa3D GLSL compiler that prints out GLSL ES
                   after all optimizations.
                        • Open source! https://github.com/aras-p/glsl-optimizer




Monday, August 13, 12
Shader generation steps in Unity
 Cg/HLSL snippets          HLSL vertex + pixel shader
 source
                                                        preprocessor

                                                 HLSL shaders



                            Optimized GLSL              GLSL




                                  Final GLSL ES vertex + fragment shaders
Monday, August 13, 12
Shader patching
                No dedicated hardware for blending, write masking,
                   flexible vertex input in (many) Mobile GPUs
                        • instead driver will patch shader code
                        • significant hiccup on the first drawcall /w new shader/state combination
                Prewarming
                        • force driver to do patching during load time
                        • issue drawcalls with dummy geometry for all shader/state combinations
                        • in Unity API: Shader.WarmupAllShaders ()



Monday, August 13, 12
Back to scary: ES2.0 API overhead
                Drawcall overhead on CPU
                        • 0.05ms per average drawcall on CPU (iPad2/iPad3)
                        • 600 drawcalls will max out CPU @ 30FPS
                Sorted by relative cost:
                        • glDrawXXX: draw call itself
                        • glUniformXXX: shader uniform uploads
                        • glVertexAttribPointer: vertex input setup
                        • state change



Monday, August 13, 12
OpenGL ES2.0 API overhead
                It is not just about drawcall counts
                        • important to minimize number of uniforms and state changes
                        • sort by Material
                        • optimize uniforms in shaders
                GL ES2.0 prevents many optimizations
                        • uniforms can not be treated as a sequential memory - drawcall setup
                        requires multiple calls
                        • uniforms are set per shader - calls on every shader change
                        • no means for binding uniform to a specific register - unlike HLSL
                Yay! GL ES3.0 - Uniform Buffer Object!
Monday, August 13, 12
Drawcall batching
                First reduce state changes and uniform uploads
                Reduce overhead by grouping multiple objects with the
                 same state into one drawcall
                Relies on sorting by material first
                        • applicable to opaque geometry mostly
                        • not applicable to multi-pass lighting either
                        • lighting data passed to shaders must be in world or view space




Monday, August 13, 12
Unity "static" batching
                Suitable for static environment
                Static VertexBuffer + Dynamic IndexBuffer
                @ Build-time
                        • objects are combined into a large shared Vertex Buffers
                        • sharing same material
                @ Run-time
                        • indices of visible objects are appended to dynamic Index Buffer




Monday, August 13, 12
Unity "static" batching
                But Dynamic Buffers are tricky on some mobile
                 platforms (see next)
                Instead could:
                        •@ Build-time organize objects into Octree, traverse in Morton order
                        and write to shared Vertex Buffer
                        •@ Run-time traverse in the same (Morton) order, render visible
                        objects with neighboring Vertex Buffer ranges in a single drawcall
                        • like “Segment Buffering”, Jon Olick, GPU Gems2
                        • all buffers are static


Monday, August 13, 12
Unity "dynamic" batching
                Suitable for dynamic objects
                        • with relatively simple geometry (see below)
                Transform object vertices to world space on CPU (NEON)
                        • append to shared dynamic Vertex Buffer and render in one drawcall
                Makes sense only for objects with low vertex count
                        • otherwise transformation cost would outweigh the cost of the drawcall itself
                        • usually 200-800 vertices per object




Monday, August 13, 12
Dynamic geometry
                Never do like this in GL ES2.0!
                                    for (;my_render_loop;)
                                        glBindBuffer (..., myBuffer);
                                        glMapBufferOES (..., GL_WRITE_ONLY_OES);
                                        // write data
                                        glUnmapBufferOES (...);
                                        glDrawElements (...);


                        • will block CPU waiting for GPU to finish rendering from your buffer




Monday, August 13, 12
Dynamic geometry
                Geometry of known size (skinning) is easy
                        • double/triple buffer - render from one buffer, while writing to another
                        • swap buffers only at the end of the frame
                Geometry of arbitrary size (particles, batching) is harder
                        • no fence / discard support in out-of-the-box GL ES2.0
                        • Yay for GL ES3.0, again!




Monday, August 13, 12
Dynamic geometry of arbitrary size
                Buffer renaming/orphaning is supported by some drivers (iOS)
                                      // orphan old buffer, driver should allocate new storage
                                      glBufferData (..., bytes, 0, GL_STREAM_DRAW);
                                      glMapBufferOES (..., GL_WRITE_ONLY_OES);


                Preallocate multiple buffers
                        • write to buffer once, mark it “busy” for 1 (or 2) frames and start rendering
                        • grab next “non-busy” buffer, otherwise allocate more buffers and continue
                        • could use NV_fence / EGL_sync extension to track if GPU is finished with certain buffer


                Do simulation and write all data to buffers before entering render-loop
                        • and don’t forget to double/triple your buffers


Monday, August 13, 12
Automated Testing




Monday, August 13, 12
Automated Testing
                Run same content on different devices
                        • different OS updates
                        • automatic on internal code changes
                Capture screenshots and compare to
                   templates
                        • per-pixel comparison
                Simplified scenes to test specific areas
                        • our test suite - 238 scenes


Monday, August 13, 12
Automated Testing
                Devices we use
                        • Nexus One (Adreno 205)
                        • Samsung Galaxy S 2 (Mali 400)
                        • Nexus S / Galaxy Nexus (SGX 540)
                        • Motorola Xoom (Tegra2)
                Why not more?




Monday, August 13, 12
Automated Testing Challenges
                Some devices simply crash on some tests
                        • drivers, argh!
                Some shader variations will just produce wrong results
                Loosing connection with the host
                        • hooking up more than one device per PC makes connection more likely
                        to fail
                Test results might differ significantly from device to device
                But there are people who manage to workaround this


Monday, August 13, 12
Fun with Automated Testing
                AlphaBlending differences on 2 distinct GPUs




Monday, August 13, 12
Fun with Automated Testing
                BlendMode differences on 2 distinct GPUs




Monday, August 13, 12
Fun with Automated Testing
                ReadPixels




Monday, August 13, 12
Q&A




           @__ReJ__
Monday, August 13, 12

More Related Content

What's hot

Mobile Performance Tuning: Poor Man's Tips And Tricks
Mobile Performance Tuning: Poor Man's Tips And TricksMobile Performance Tuning: Poor Man's Tips And Tricks
Mobile Performance Tuning: Poor Man's Tips And TricksValentin Simonov
 
Unity optimization techniques applied in Catan Universe
Unity optimization techniques applied in Catan UniverseUnity optimization techniques applied in Catan Universe
Unity optimization techniques applied in Catan UniverseExozet Berlin GmbH
 
GS-4145, Oxide discusses how Mantle enables game engine performance, by Dan B...
GS-4145, Oxide discusses how Mantle enables game engine performance, by Dan B...GS-4145, Oxide discusses how Mantle enables game engine performance, by Dan B...
GS-4145, Oxide discusses how Mantle enables game engine performance, by Dan B...AMD Developer Central
 
Qualcomm Vuforia Mobile Vision Platform: Smart Terrain for Depth-Sensing Cameras
Qualcomm Vuforia Mobile Vision Platform: Smart Terrain for Depth-Sensing CamerasQualcomm Vuforia Mobile Vision Platform: Smart Terrain for Depth-Sensing Cameras
Qualcomm Vuforia Mobile Vision Platform: Smart Terrain for Depth-Sensing CamerasQualcomm Developer Network
 
Java me 08-mobile3d
Java me 08-mobile3dJava me 08-mobile3d
Java me 08-mobile3dHemanth Raju
 
GS-4151, Developing Thief with new AMD technology, by Jurjen Katsman
GS-4151, Developing Thief with new AMD technology, by Jurjen KatsmanGS-4151, Developing Thief with new AMD technology, by Jurjen Katsman
GS-4151, Developing Thief with new AMD technology, by Jurjen KatsmanAMD Developer Central
 
Game Engine Architecture
Game Engine ArchitectureGame Engine Architecture
Game Engine ArchitectureAttila Jenei
 
Forts and Fights Scaling Performance on Unreal Engine*
Forts and Fights Scaling Performance on Unreal Engine*Forts and Fights Scaling Performance on Unreal Engine*
Forts and Fights Scaling Performance on Unreal Engine*Intel® Software
 
Gentek Introduce(en)
Gentek Introduce(en)Gentek Introduce(en)
Gentek Introduce(en)cloudmmog
 
Unreal Open Day 2017 UE4 for Mobile: The Future of High Quality Mobile Games
Unreal Open Day 2017 UE4 for Mobile: The Future of High Quality Mobile GamesUnreal Open Day 2017 UE4 for Mobile: The Future of High Quality Mobile Games
Unreal Open Day 2017 UE4 for Mobile: The Future of High Quality Mobile GamesEpic Games China
 
Unity Internals: Memory and Performance
Unity Internals: Memory and PerformanceUnity Internals: Memory and Performance
Unity Internals: Memory and PerformanceDevGAMM Conference
 
StateScriptingInUncharted2
StateScriptingInUncharted2StateScriptingInUncharted2
StateScriptingInUncharted2gcarlton
 
Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...
Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...
Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...Intel® Software
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & FutureOfer Rosenberg
 
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...Intel® Software
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...AMD Developer Central
 
GPU Virtualization on VMware's Hosted I/O Architecture
GPU Virtualization on VMware's Hosted I/O ArchitectureGPU Virtualization on VMware's Hosted I/O Architecture
GPU Virtualization on VMware's Hosted I/O Architectureguestb3fc97
 
Introduce to 3d rendering engine
Introduce to 3d rendering engineIntroduce to 3d rendering engine
Introduce to 3d rendering engineDaosheng Mu
 
Making a game with Molehill: Zombie Tycoon
Making a game with Molehill: Zombie TycoonMaking a game with Molehill: Zombie Tycoon
Making a game with Molehill: Zombie TycoonJean-Philippe Doiron
 
Axceleon Presentation at Siggraph 2009
Axceleon Presentation at Siggraph 2009Axceleon Presentation at Siggraph 2009
Axceleon Presentation at Siggraph 2009Axceleon Inc
 

What's hot (20)

Mobile Performance Tuning: Poor Man's Tips And Tricks
Mobile Performance Tuning: Poor Man's Tips And TricksMobile Performance Tuning: Poor Man's Tips And Tricks
Mobile Performance Tuning: Poor Man's Tips And Tricks
 
Unity optimization techniques applied in Catan Universe
Unity optimization techniques applied in Catan UniverseUnity optimization techniques applied in Catan Universe
Unity optimization techniques applied in Catan Universe
 
GS-4145, Oxide discusses how Mantle enables game engine performance, by Dan B...
GS-4145, Oxide discusses how Mantle enables game engine performance, by Dan B...GS-4145, Oxide discusses how Mantle enables game engine performance, by Dan B...
GS-4145, Oxide discusses how Mantle enables game engine performance, by Dan B...
 
Qualcomm Vuforia Mobile Vision Platform: Smart Terrain for Depth-Sensing Cameras
Qualcomm Vuforia Mobile Vision Platform: Smart Terrain for Depth-Sensing CamerasQualcomm Vuforia Mobile Vision Platform: Smart Terrain for Depth-Sensing Cameras
Qualcomm Vuforia Mobile Vision Platform: Smart Terrain for Depth-Sensing Cameras
 
Java me 08-mobile3d
Java me 08-mobile3dJava me 08-mobile3d
Java me 08-mobile3d
 
GS-4151, Developing Thief with new AMD technology, by Jurjen Katsman
GS-4151, Developing Thief with new AMD technology, by Jurjen KatsmanGS-4151, Developing Thief with new AMD technology, by Jurjen Katsman
GS-4151, Developing Thief with new AMD technology, by Jurjen Katsman
 
Game Engine Architecture
Game Engine ArchitectureGame Engine Architecture
Game Engine Architecture
 
Forts and Fights Scaling Performance on Unreal Engine*
Forts and Fights Scaling Performance on Unreal Engine*Forts and Fights Scaling Performance on Unreal Engine*
Forts and Fights Scaling Performance on Unreal Engine*
 
Gentek Introduce(en)
Gentek Introduce(en)Gentek Introduce(en)
Gentek Introduce(en)
 
Unreal Open Day 2017 UE4 for Mobile: The Future of High Quality Mobile Games
Unreal Open Day 2017 UE4 for Mobile: The Future of High Quality Mobile GamesUnreal Open Day 2017 UE4 for Mobile: The Future of High Quality Mobile Games
Unreal Open Day 2017 UE4 for Mobile: The Future of High Quality Mobile Games
 
Unity Internals: Memory and Performance
Unity Internals: Memory and PerformanceUnity Internals: Memory and Performance
Unity Internals: Memory and Performance
 
StateScriptingInUncharted2
StateScriptingInUncharted2StateScriptingInUncharted2
StateScriptingInUncharted2
 
Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...
Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...
Scale CPU Experiences: Maximize Unity* Performance Using the Entity Component...
 
Compute API –Past & Future
Compute API –Past & FutureCompute API –Past & Future
Compute API –Past & Future
 
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
 
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...Computer Vision Powered by Heterogeneous System Architecture (HSA) by  Dr. Ha...
Computer Vision Powered by Heterogeneous System Architecture (HSA) by Dr. Ha...
 
GPU Virtualization on VMware's Hosted I/O Architecture
GPU Virtualization on VMware's Hosted I/O ArchitectureGPU Virtualization on VMware's Hosted I/O Architecture
GPU Virtualization on VMware's Hosted I/O Architecture
 
Introduce to 3d rendering engine
Introduce to 3d rendering engineIntroduce to 3d rendering engine
Introduce to 3d rendering engine
 
Making a game with Molehill: Zombie Tycoon
Making a game with Molehill: Zombie TycoonMaking a game with Molehill: Zombie Tycoon
Making a game with Molehill: Zombie Tycoon
 
Axceleon Presentation at Siggraph 2009
Axceleon Presentation at Siggraph 2009Axceleon Presentation at Siggraph 2009
Axceleon Presentation at Siggraph 2009
 

Similar to Mobile crossplatformchallenges siggraph

Inside the Atlassian OnDemand Private Cloud
Inside the Atlassian OnDemand Private CloudInside the Atlassian OnDemand Private Cloud
Inside the Atlassian OnDemand Private CloudAtlassian
 
Smedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicsSmedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicschangehee lee
 
Пиксельные шейдеры для Web-разработчиков. Программируем GPU / Денис Радин (Li...
Пиксельные шейдеры для Web-разработчиков. Программируем GPU / Денис Радин (Li...Пиксельные шейдеры для Web-разработчиков. Программируем GPU / Денис Радин (Li...
Пиксельные шейдеры для Web-разработчиков. Программируем GPU / Денис Радин (Li...Ontico
 
Saving a century a day: how the Fresco library works
Saving a century a day: how the Fresco library worksSaving a century a day: how the Fresco library works
Saving a century a day: how the Fresco library worksTyrone Nicholas
 
[UniteKorea2013] Butterfly Effect DX11
[UniteKorea2013] Butterfly Effect DX11[UniteKorea2013] Butterfly Effect DX11
[UniteKorea2013] Butterfly Effect DX11William Hugo Yang
 
Thinking cpu & memory - DroidCon Paris 18 june 2013
Thinking cpu & memory - DroidCon Paris 18 june 2013Thinking cpu & memory - DroidCon Paris 18 june 2013
Thinking cpu & memory - DroidCon Paris 18 june 2013Paris Android User Group
 
GPU Introduction.pptx
 GPU Introduction.pptx GPU Introduction.pptx
GPU Introduction.pptxSherazMunawar5
 
GPU Programming 360iDev
GPU Programming 360iDevGPU Programming 360iDev
GPU Programming 360iDevJanie Clayton
 
Gpu Programming With GPUImage and Metal
Gpu Programming With GPUImage and MetalGpu Programming With GPUImage and Metal
Gpu Programming With GPUImage and MetalJanie Clayton
 
Adding more visuals without affecting performance
Adding more visuals without affecting performanceAdding more visuals without affecting performance
Adding more visuals without affecting performanceSt1X
 
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Fisnik Kraja
 
Pixel shaders based UI components + writing your first pixel shader
Pixel shaders based UI components + writing your first pixel shaderPixel shaders based UI components + writing your first pixel shader
Pixel shaders based UI components + writing your first pixel shaderDenis Radin
 
LCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLinaro
 
OpenGL ES and Mobile GPU
OpenGL ES and Mobile GPUOpenGL ES and Mobile GPU
OpenGL ES and Mobile GPUJiansong Chen
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon Berlin
 
Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...
Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...
Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...Unity Technologies
 

Similar to Mobile crossplatformchallenges siggraph (20)

Inside the Atlassian OnDemand Private Cloud
Inside the Atlassian OnDemand Private CloudInside the Atlassian OnDemand Private Cloud
Inside the Atlassian OnDemand Private Cloud
 
Smedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicsSmedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphics
 
Пиксельные шейдеры для Web-разработчиков. Программируем GPU / Денис Радин (Li...
Пиксельные шейдеры для Web-разработчиков. Программируем GPU / Денис Радин (Li...Пиксельные шейдеры для Web-разработчиков. Программируем GPU / Денис Радин (Li...
Пиксельные шейдеры для Web-разработчиков. Программируем GPU / Денис Радин (Li...
 
Saving a century a day: how the Fresco library works
Saving a century a day: how the Fresco library worksSaving a century a day: how the Fresco library works
Saving a century a day: how the Fresco library works
 
[UniteKorea2013] Butterfly Effect DX11
[UniteKorea2013] Butterfly Effect DX11[UniteKorea2013] Butterfly Effect DX11
[UniteKorea2013] Butterfly Effect DX11
 
Thinking cpu & memory - DroidCon Paris 18 june 2013
Thinking cpu & memory - DroidCon Paris 18 june 2013Thinking cpu & memory - DroidCon Paris 18 june 2013
Thinking cpu & memory - DroidCon Paris 18 june 2013
 
GPU Introduction.pptx
 GPU Introduction.pptx GPU Introduction.pptx
GPU Introduction.pptx
 
PG-Strom
PG-StromPG-Strom
PG-Strom
 
GPU Programming 360iDev
GPU Programming 360iDevGPU Programming 360iDev
GPU Programming 360iDev
 
Gpu microprocessors
Gpu microprocessorsGpu microprocessors
Gpu microprocessors
 
Gpu Programming With GPUImage and Metal
Gpu Programming With GPUImage and MetalGpu Programming With GPUImage and Metal
Gpu Programming With GPUImage and Metal
 
Adding more visuals without affecting performance
Adding more visuals without affecting performanceAdding more visuals without affecting performance
Adding more visuals without affecting performance
 
Adobe: Changing the game
Adobe: Changing the gameAdobe: Changing the game
Adobe: Changing the game
 
UNIT 2 P5
UNIT 2 P5UNIT 2 P5
UNIT 2 P5
 
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...Using Many-Core Processors to Improve the Performance of Space Computing Plat...
Using Many-Core Processors to Improve the Performance of Space Computing Plat...
 
Pixel shaders based UI components + writing your first pixel shader
Pixel shaders based UI components + writing your first pixel shaderPixel shaders based UI components + writing your first pixel shader
Pixel shaders based UI components + writing your first pixel shader
 
LCE13: Android Graphics Upstreaming
LCE13: Android Graphics UpstreamingLCE13: Android Graphics Upstreaming
LCE13: Android Graphics Upstreaming
 
OpenGL ES and Mobile GPU
OpenGL ES and Mobile GPUOpenGL ES and Mobile GPU
OpenGL ES and Mobile GPU
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imagination
 
Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...
Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...
Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...
 

More from changehee lee

Gdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_glGdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_glchangehee lee
 
Fortugno nick design_and_monetization
Fortugno nick design_and_monetizationFortugno nick design_and_monetization
Fortugno nick design_and_monetizationchangehee lee
 
[Kgc2013] 모바일 엔진 개발기
[Kgc2013] 모바일 엔진 개발기[Kgc2013] 모바일 엔진 개발기
[Kgc2013] 모바일 엔진 개발기changehee lee
 
모바일 엔진 개발기
모바일 엔진 개발기모바일 엔진 개발기
모바일 엔진 개발기changehee lee
 
개발 과정 최적화 하기 내부툴로 더욱 강력한 개발하기 Stephen kennedy _(11시40분_103호)
개발 과정 최적화 하기 내부툴로 더욱 강력한 개발하기 Stephen kennedy _(11시40분_103호)개발 과정 최적화 하기 내부툴로 더욱 강력한 개발하기 Stephen kennedy _(11시40분_103호)
개발 과정 최적화 하기 내부툴로 더욱 강력한 개발하기 Stephen kennedy _(11시40분_103호)changehee lee
 
개발자여! 스터디를 하자!
개발자여! 스터디를 하자!개발자여! 스터디를 하자!
개발자여! 스터디를 하자!changehee lee
 
[Kgc2012] deferred forward 이창희
[Kgc2012] deferred forward 이창희[Kgc2012] deferred forward 이창희
[Kgc2012] deferred forward 이창희changehee lee
 
Gamificated game developing
Gamificated game developingGamificated game developing
Gamificated game developingchangehee lee
 
Windows to reality getting the most out of direct3 d 10 graphics in your games
Windows to reality   getting the most out of direct3 d 10 graphics in your gamesWindows to reality   getting the most out of direct3 d 10 graphics in your games
Windows to reality getting the most out of direct3 d 10 graphics in your gameschangehee lee
 
Basic ofreflectance kor
Basic ofreflectance korBasic ofreflectance kor
Basic ofreflectance korchangehee lee
 
Valve handbook low_res
Valve handbook low_resValve handbook low_res
Valve handbook low_reschangehee lee
 
Ndc12 이창희 render_pipeline
Ndc12 이창희 render_pipelineNdc12 이창희 render_pipeline
Ndc12 이창희 render_pipelinechangehee lee
 

More from changehee lee (20)

Visual shock vol.2
Visual shock   vol.2Visual shock   vol.2
Visual shock vol.2
 
Shader compilation
Shader compilationShader compilation
Shader compilation
 
Gdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_glGdc 14 bringing unreal engine 4 to open_gl
Gdc 14 bringing unreal engine 4 to open_gl
 
Fortugno nick design_and_monetization
Fortugno nick design_and_monetizationFortugno nick design_and_monetization
Fortugno nick design_and_monetization
 
카툰 렌더링
카툰 렌더링카툰 렌더링
카툰 렌더링
 
[Kgc2013] 모바일 엔진 개발기
[Kgc2013] 모바일 엔진 개발기[Kgc2013] 모바일 엔진 개발기
[Kgc2013] 모바일 엔진 개발기
 
Paper games 2013
Paper games 2013Paper games 2013
Paper games 2013
 
모바일 엔진 개발기
모바일 엔진 개발기모바일 엔진 개발기
모바일 엔진 개발기
 
V8
V8V8
V8
 
Wecanmakeengine
WecanmakeengineWecanmakeengine
Wecanmakeengine
 
개발 과정 최적화 하기 내부툴로 더욱 강력한 개발하기 Stephen kennedy _(11시40분_103호)
개발 과정 최적화 하기 내부툴로 더욱 강력한 개발하기 Stephen kennedy _(11시40분_103호)개발 과정 최적화 하기 내부툴로 더욱 강력한 개발하기 Stephen kennedy _(11시40분_103호)
개발 과정 최적화 하기 내부툴로 더욱 강력한 개발하기 Stephen kennedy _(11시40분_103호)
 
개발자여! 스터디를 하자!
개발자여! 스터디를 하자!개발자여! 스터디를 하자!
개발자여! 스터디를 하자!
 
[Kgc2012] deferred forward 이창희
[Kgc2012] deferred forward 이창희[Kgc2012] deferred forward 이창희
[Kgc2012] deferred forward 이창희
 
Light prepass
Light prepassLight prepass
Light prepass
 
Gamificated game developing
Gamificated game developingGamificated game developing
Gamificated game developing
 
Windows to reality getting the most out of direct3 d 10 graphics in your games
Windows to reality   getting the most out of direct3 d 10 graphics in your gamesWindows to reality   getting the most out of direct3 d 10 graphics in your games
Windows to reality getting the most out of direct3 d 10 graphics in your games
 
Basic ofreflectance kor
Basic ofreflectance korBasic ofreflectance kor
Basic ofreflectance kor
 
C++11(최지웅)
C++11(최지웅)C++11(최지웅)
C++11(최지웅)
 
Valve handbook low_res
Valve handbook low_resValve handbook low_res
Valve handbook low_res
 
Ndc12 이창희 render_pipeline
Ndc12 이창희 render_pipelineNdc12 이창희 render_pipeline
Ndc12 이창희 render_pipeline
 

Mobile crossplatformchallenges siggraph

  • 2. Unity: iOS and Android - Cross-Platform Challenges and Solutions Renaldas Zioma Unity Technologies Monday, August 13, 12
  • 3. Mobile devices today Can render ... Dead Trigger courtesy of MadFingerGames Monday, August 13, 12
  • 4. Mobile devices today Can render ... Dead Trigger courtesy of MadFingerGames Monday, August 13, 12
  • 5. Mobile devices today Can render this @ 2048 x 1536 CSR Racing courtesy of BossAlien & NaturalMotion Monday, August 13, 12
  • 6. Mobile Platform Challenges Different GPU architectures • API extensions Screen resolutions Performance scale Drivers Texture formats Monday, August 13, 12
  • 7. 4 (or 5) GPU Architectures ImgTech PowerVR SGX - TBDR (TileBasedDeferred) • ImgTech PowerVR MBX - TBDR (Fixed Function) ARM Mali - Tiled (small tiles) Qualcomm Adreno - Tiled (large tiles) • Adreno3xx - can switch to Traditional • glHint(GL_BINNING_CONTROL_HINT_QCOM, GL_RENDER_DIRECT_TO_FRAMEBUFFER_QCOM) NVIDIA Tegra - Traditional Monday, August 13, 12
  • 8. Tiled Architecture Splits screen into tiles • small (for example: 16x16) - SGX, MALI • relatively large (for example 256K) - Adreno Tile memory is on chip - fast! Once GPU is done rendering tile • tile is “resolved” - written out to slower RAM Monday, August 13, 12
  • 9. Tiled Deferred Architecture Per-drawcall: polygons are transformed, assigned to tiles, stored in memory (Parameter Buffer) Rasterization starts only after all scene drawcalls were processed • every tile has access to all covering polygons Per-tile: Occluded polygons are rejected and only visible parts of polygons are rasterized • for opaque geometry rasterization will touch every pixel only once • saves ALU and texture reads Monday, August 13, 12
  • 10. Tiled Deferred Architecture Vertex shader Tile Accelerator Parameter Buffer Rasterization starts only after all scene drawcalls were processed Hidden Surface Removal Tile Rasterizer On Chip Memory “Resolve” Frame buffer RAM Monday, August 13, 12
  • 11. Not so scary in practice! Just... Sort opaque geom differently for Traditional vs Tiled • Tiled: sort by material to reduce CPU drawcall overhead • Traditional: sort roughly front-to-back to maximize ZCull efficiency • then by material • Tiled Deferred: render alpha-tested after opaque • higher chance that expensive alpha-tested pixels will be occluded Separate render loop for MBX Fixed Function  optimized for low-end devices, can go faster than GLES2.0 loop, no per-pixel lighting, limited postFX possibilities  phasing it out Be more aggressive with 16bit framebuffers on Tiled Monday, August 13, 12
  • 12. Not so scary in practice! Just... Use EXT_discard_framebuffer extensions on Tiled • will avoid copying data (color/depth/stencil) you're not planning to use Clear RenderTarget before rendering into it • otherwise on Tiled driver will copy color/depth/stencil back from RAM • not clearing is not an optimization! Monday, August 13, 12
  • 13. Architectural Benefits Benefits • Tiled: MSAA is almost free (5-10% of rendering time) • Tiled: AlphaBlending is significantly cheaper • Tiled: less dithering artifacts for 16bit framebuffers Caveats • TBDR: RenderTarget switch might be more expensive • TBDR: Too much geometry will flush whole pipeline (ParameterBuffer overflow) Monday, August 13, 12
  • 14. Interesting Tiles Reminds recent works •“Tile-based Deferred Shading”, Andrew Lauritzen, SIGGRAPH2010 •“Tile-based Forward Rendering”, Takahiro Hirada, GDC2012 ... suitable for high-end GPUs • different problems • but common solutions Monday, August 13, 12
  • 15. Screen Resolutions (Android) Most often found resolutions are darker Image is a courtesy of OpenSignalMaps Monday, August 13, 12
  • 16. What is more scary: Drivers! Android specific problem! Graphics Drivers • bugs • performance variations • chaos of 90ies is back! Quality is dramatically improving on IHV side • but in many cases mobile vendors won't provide new drivers for their devices: don't care / security testing / phased out devices... Monday, August 13, 12
  • 17. What is more scary: Drivers! Establish good relations with IHV Send bug-reports Automatize testing • more on auto testing later Help Google with their open-source testing rig! • http://source.android.com/compatibility/downloads.html Monday, August 13, 12
  • 18. Multiple Texture formats Android specific problem! ETC1 mandatory in GL ES2.0 - but NO Alpha support! • Instead platform specific formats: PVRTC, ATC, DXT5, ETC2 • No single format which would be supported on all devices Uncompressed 16bit for textures with Alpha • slow, large Yay! GL ES3.0 solves Alpha - mandatory ETC2 • Plus new formats: EAC, ASTC Monday, August 13, 12
  • 19. Textures with Alpha in GL ES2.0 Pair of ETC textures: RGB in 1st texture + Alpha in 2nd Self-downloading application • small bootrstrap app - determines GPU family on 1st run • downloads and stores pack with GPU specific assets • Unity: AssetBundles • GooglePlay: new expansion files (up to 2GB) GooglePlay filtering • build multiple versions of the game, each with textures for certain GPU • <supports-gl-texture> tag in AndroidManifest Monday, August 13, 12
  • 20. GPU Architecture: Shader cores Unified - Vertex & Pixel use the same core • Workload balancing • SGX, Adreno, Mali T6xx Traditional - Vertex and Pixel cores are separate • Either stage can be bottleneck at any given moment • Tegra, Mali 4xx, MBX Monday, August 13, 12
  • 21. Skinning + Unified Architecture Offload work from GPU - skinning on CPU with NEON • Favors Unified architecture - reduces vertex workload on GPU • Tegra non Unified, but has 4 very fast NEON cores - so good too Reuse skinning results: shadows, multi-pass lighting Reduces code complexity & shader permutations Monday, August 13, 12
  • 22. Skinning on CPU Results on A9 @ 1Ghz (iPad3), NEON, 1 core: •1 bones, position+normal+tangent - 12.2 Mverts/sec •2 bones, position+normal+tangent - 11 Mverts/sec •4 bones, position+normal+tangent - 6.7 Mverts/sec •Test: 200 characters each 558 vertices Monday, August 13, 12
  • 23. Skinning on CPU Warning: net result of offloading work to CPU is trickier when power consumption comes to play! • game might run faster • but can drain battery faster too (NEON is power hungry) Monday, August 13, 12
  • 24. Balancing CPU vs GPU Ideally would use DirectX11 Compute alike shaders • if driver could run same shader on GPU or CPU depending on platform / workload • all reusable geometry transformations and image PostProcessing Might be worth trying Transform Feedback in GL ES3.0 Monday, August 13, 12
  • 25. GPU Architecture: Precision of pixel ops Optimal precision for GPU family • 11/12bit per-component (fixed) - SGX pre543, Tegra • 16bit per-component (half) - SGX 543, Mali 4xx • 32bit per-component (float) - Adreno, Mali T6xx Watch out for precision conversions • most often will require additional cycles! • (at least) SGX543 can hide conversion overhead sampling from texture Monday, August 13, 12
  • 26. Precision mixing examples BAD struct Input { float4 color : TEXCOORD0; }; fixed4 frag (Input IN) : COLOR { return IN.color; // BAD: float -> fixed conversion } BAD fixed4 uniform; ... half4 result = pow (l, 2.2) + uniform; // BAD: fixed -> half conversion OK half4 tex = tex2d (textureRGBA8bit, uv); // OK: conversion for free Monday, August 13, 12
  • 27. Cross platform precision considerations sRGB reads/writes are not available on mobiles yet • though some hardware supports already As a result linear lighting is too expensive Arguable fixed point (11bit) can be enough for many pixels • do per-pixel lighting in object space • do fog per-vertex • no depth-shadowmaps • for specular could use texture lookup instead of pow () • at least 3 cycles (actually 4 to comply with ES standards) • pow () result is in half/float precision, requires conversion to fixed Monday, August 13, 12
  • 28. Cross-platform shaders in Unity Cg/HLSL snippets wrapped in custom language • helps to defines state, multipass rendering and lighting setup Rationale: maximizing cross-platform applicability • abstract from mundane shader details • generate platform specific code in: • HLSL • GLSL / GLSL ES • DirectX or ARB assembly • AGL • etc Monday, August 13, 12
  • 29. Cross-platform shaders in Unity Artist specifies high-level shader on the Material • ex: "Bumped Specular", "Tree Leaves", "Unlit" Run-time picks specific platform shader depending on • supported feature set • via Shader Fallback • state (lights / shadows / lightmaps) •via builtin Shader Keywords • user-defined keys •via Shader LOD + custom Shader Keywords Monday, August 13, 12
  • 30. Shader Fallback “If this shader can not run on this hardware, then try next one” Fallbacks can be chained Shader "Per-pixel Lit" { // shader code here ... Fallback "Per-vertex Lit" } Monday, August 13, 12
  • 31. Shader Keywords Built-in and custom shader permutations Using shader pre-processor macros #pragma multi_compile LIGHTING_PER_PIXEL ... #ifdef LIGHTING_PER_PIXEL // per pixel-lit #else // per vertex-lit #endif #pragma multi_compile PREFER_HALF_PRECISION #ifdef PREFER_HALF_PRECISION // force all operations to higher precision #define scalar half #define vec4 half4 #else #define scalar fixed #define vec4 fixed4 #endif Monday, August 13, 12
  • 32. Shader Keywords Example triggers custom shader permutation from script // Devices with lots of muscle per pixel if (iPhone.generation == iPad2Gen || iPhone.generation == iPhone4S || iPhone.generation == iPhone3GS) Shader.EnableKeyword (“LIGHTING_PER_PIXEL”); // Devices with SGX543 if (iPhone.generation == iPad2Gen || iPhone.generation == iPad3Gen || iPhone.generation == iPhone4S) Shader.EnableKeyword (“PREFER_HALF_PRECISION”); Monday, August 13, 12
  • 33. Shader LevelOfDetail Shader switch depending on platform performance • LOD - integer value Shader "Lit" { SubShader { LOD 200 // per pixel-lit .. SubShader { LOD 100 // per vertex-lit .. } Example triggers shader LOD from script // Devices with lots of muscle per pixel if (iPhone.generation == iPad2Gen || iPhone.generation == iPhone4S || iPhone.generation == iPhone3GS) Shader.globalMaximumLOD = 200; Monday, August 13, 12
  • 34. Cross-platform shaders Surface shading and lighting snippets • Instead of writing full vertex/pixel shader • Just snippets of code Generate all “cruft” automagically depending on platform and state • Shader generation is done offline #pragma surface MySurface Ramp void MySurface (Input IN, inout SurfaceOutput o) { o.Albedo = tex2D (_MainTex, ...); o.Albedo *= tex2D (_Detail, ...) * 2; o.Normal = UnpackNormal (tex2D (_BumpMap, ...)); } half4 LightingRamp (SurfaceOutput s, half3 lightDir ...) { half2 NdotL = dot (s.Normal, lightDir); half3 ramp = tex2D (_Ramp, NdotL); half4 l; l.rgb = s.Albedo * ramp; ... return l; } Monday, August 13, 12
  • 35. Automatic code optimization cgbatch: Takes Cg/HLSL snippets and generates complete shader code in HLSL Preprocessor step hlsl2glsl: Converts HLSL to GLSL • resurrected old ATI’s project, fixed & improved. Open source! https://github.com/aras-p/hlsl2glslfork glsls-optimizer (1): Offline GPU-independent GLSL optimization • think inlining, dead code removal, copy propagation, arithmetic simplifications etc. • 2 year ago many mobile drivers were bad at optimizations - we had 2-10x improvement • Still very valuable glsls-optimizer (2): A fork of Mesa3D GLSL compiler that prints out GLSL ES after all optimizations. • Open source! https://github.com/aras-p/glsl-optimizer Monday, August 13, 12
  • 36. Shader generation steps in Unity Cg/HLSL snippets HLSL vertex + pixel shader source preprocessor HLSL shaders Optimized GLSL GLSL Final GLSL ES vertex + fragment shaders Monday, August 13, 12
  • 37. Shader patching No dedicated hardware for blending, write masking, flexible vertex input in (many) Mobile GPUs • instead driver will patch shader code • significant hiccup on the first drawcall /w new shader/state combination Prewarming • force driver to do patching during load time • issue drawcalls with dummy geometry for all shader/state combinations • in Unity API: Shader.WarmupAllShaders () Monday, August 13, 12
  • 38. Back to scary: ES2.0 API overhead Drawcall overhead on CPU • 0.05ms per average drawcall on CPU (iPad2/iPad3) • 600 drawcalls will max out CPU @ 30FPS Sorted by relative cost: • glDrawXXX: draw call itself • glUniformXXX: shader uniform uploads • glVertexAttribPointer: vertex input setup • state change Monday, August 13, 12
  • 39. OpenGL ES2.0 API overhead It is not just about drawcall counts • important to minimize number of uniforms and state changes • sort by Material • optimize uniforms in shaders GL ES2.0 prevents many optimizations • uniforms can not be treated as a sequential memory - drawcall setup requires multiple calls • uniforms are set per shader - calls on every shader change • no means for binding uniform to a specific register - unlike HLSL Yay! GL ES3.0 - Uniform Buffer Object! Monday, August 13, 12
  • 40. Drawcall batching First reduce state changes and uniform uploads Reduce overhead by grouping multiple objects with the same state into one drawcall Relies on sorting by material first • applicable to opaque geometry mostly • not applicable to multi-pass lighting either • lighting data passed to shaders must be in world or view space Monday, August 13, 12
  • 41. Unity "static" batching Suitable for static environment Static VertexBuffer + Dynamic IndexBuffer @ Build-time • objects are combined into a large shared Vertex Buffers • sharing same material @ Run-time • indices of visible objects are appended to dynamic Index Buffer Monday, August 13, 12
  • 42. Unity "static" batching But Dynamic Buffers are tricky on some mobile platforms (see next) Instead could: •@ Build-time organize objects into Octree, traverse in Morton order and write to shared Vertex Buffer •@ Run-time traverse in the same (Morton) order, render visible objects with neighboring Vertex Buffer ranges in a single drawcall • like “Segment Buffering”, Jon Olick, GPU Gems2 • all buffers are static Monday, August 13, 12
  • 43. Unity "dynamic" batching Suitable for dynamic objects • with relatively simple geometry (see below) Transform object vertices to world space on CPU (NEON) • append to shared dynamic Vertex Buffer and render in one drawcall Makes sense only for objects with low vertex count • otherwise transformation cost would outweigh the cost of the drawcall itself • usually 200-800 vertices per object Monday, August 13, 12
  • 44. Dynamic geometry Never do like this in GL ES2.0! for (;my_render_loop;) glBindBuffer (..., myBuffer); glMapBufferOES (..., GL_WRITE_ONLY_OES); // write data glUnmapBufferOES (...); glDrawElements (...); • will block CPU waiting for GPU to finish rendering from your buffer Monday, August 13, 12
  • 45. Dynamic geometry Geometry of known size (skinning) is easy • double/triple buffer - render from one buffer, while writing to another • swap buffers only at the end of the frame Geometry of arbitrary size (particles, batching) is harder • no fence / discard support in out-of-the-box GL ES2.0 • Yay for GL ES3.0, again! Monday, August 13, 12
  • 46. Dynamic geometry of arbitrary size Buffer renaming/orphaning is supported by some drivers (iOS) // orphan old buffer, driver should allocate new storage glBufferData (..., bytes, 0, GL_STREAM_DRAW); glMapBufferOES (..., GL_WRITE_ONLY_OES); Preallocate multiple buffers • write to buffer once, mark it “busy” for 1 (or 2) frames and start rendering • grab next “non-busy” buffer, otherwise allocate more buffers and continue • could use NV_fence / EGL_sync extension to track if GPU is finished with certain buffer Do simulation and write all data to buffers before entering render-loop • and don’t forget to double/triple your buffers Monday, August 13, 12
  • 48. Automated Testing Run same content on different devices • different OS updates • automatic on internal code changes Capture screenshots and compare to templates • per-pixel comparison Simplified scenes to test specific areas • our test suite - 238 scenes Monday, August 13, 12
  • 49. Automated Testing Devices we use • Nexus One (Adreno 205) • Samsung Galaxy S 2 (Mali 400) • Nexus S / Galaxy Nexus (SGX 540) • Motorola Xoom (Tegra2) Why not more? Monday, August 13, 12
  • 50. Automated Testing Challenges Some devices simply crash on some tests • drivers, argh! Some shader variations will just produce wrong results Loosing connection with the host • hooking up more than one device per PC makes connection more likely to fail Test results might differ significantly from device to device But there are people who manage to workaround this Monday, August 13, 12
  • 51. Fun with Automated Testing AlphaBlending differences on 2 distinct GPUs Monday, August 13, 12
  • 52. Fun with Automated Testing BlendMode differences on 2 distinct GPUs Monday, August 13, 12
  • 53. Fun with Automated Testing ReadPixels Monday, August 13, 12
  • 54. Q&A @__ReJ__ Monday, August 13, 12