1
Masked Occlusion Culling
A practical guide to efficient CPU culling
GDC 2018
2
Introduction
• Leigh Davies, Intel
• Leigh.Davies@Intel.com
• Senior Game / Graphics Application Engineer,
DRD / SSG,
UK.
• Marcus Svensson, Avalanche Studios
• marcus.svensson@avalanchestudios.se
• Graphics Application Engineer
Sweden.
3
Agenda
● Part 1: Intro to Software Occlusion Culling
● Part 2: Past and Future of CPU Culling in Apex*
● Part 3: Introduction to Masked Occlusion Culling (MOC)
● Masked Occlusion Culling Overview
● Masked Occlusion performance
● Part 4: Avalanche Studios Case Study
● Integration with Apex*
● Art pipeline considerations
● Results
● Part 5: Conclusion/Q&A
4
What is Software Occlusion Culling?
Problem: Modern games draw a lot of objects!
Environment potentially occludes a large proportion
Culled very late in the GPU pipeline
Wasted CPU time in logic and driver.
Wasted GPU time getting to the culling phase
Precomputed solutions don’t help with
● Destructible buildings
● Dynamic occludees
*Other names and brands may be claimed as the property of others.
5
Traditional software Occlusion Systems
Low Resolution depth buffer: [Col11]
• Requires manually authored inner-conservative occluders
Daniel Collin (DICE, GDC 2011)*
Hierarchical Z Buffer (HiZ) [Greene93]
• Rasterize to full resolution z buffer
• Create HiZ buffer
– Find maximum depth in each NxN tile
• Perform occlusion query with HiZ buffer
– Doesn’t fit with hierarchical scene traversal.
• Intel: Occlusion Culling Sample [CMK16]
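As a rough illustration of the classic HiZ steps listed above, the following C++ sketch reduces a full-resolution depth buffer to per-tile maximum depths and runs a conservative rectangle query against the result. The names `HiZ`, `buildHiZ` and `isOccluded`, the row-major layout, and the convention that depths are non-negative with larger meaning farther away are all assumptions made for this example; they are not taken from any of the cited implementations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative only: names and layout are assumptions, not a library API.
struct HiZ {
    int tilesX = 0, tilesY = 0, tileSize = 0;
    std::vector<float> maxDepth;  // one farthest-depth value per NxN tile
};

// Reduce a row-major full-resolution depth buffer (depth >= 0, larger =
// farther) to the maximum depth found in each tileSize x tileSize tile.
HiZ buildHiZ(const std::vector<float>& depth, int width, int height, int tileSize) {
    assert(tileSize > 0 && static_cast<int>(depth.size()) == width * height);
    HiZ hiz;
    hiz.tileSize = tileSize;
    hiz.tilesX = (width + tileSize - 1) / tileSize;
    hiz.tilesY = (height + tileSize - 1) / tileSize;
    hiz.maxDepth.assign(static_cast<std::size_t>(hiz.tilesX) * hiz.tilesY, 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float& m = hiz.maxDepth[(y / tileSize) * hiz.tilesX + (x / tileSize)];
            m = std::max(m, depth[y * width + x]);
        }
    }
    return hiz;
}

// Conservative query for the inclusive pixel rectangle [x0,x1] x [y0,y1],
// which must lie inside the buffer. The rectangle is reported as occluded
// only if its nearest depth lies behind the farthest depth of every tile
// it touches.
bool isOccluded(const HiZ& hiz, int x0, int y0, int x1, int y1, float nearestDepth) {
    for (int ty = y0 / hiz.tileSize; ty <= y1 / hiz.tileSize; ++ty)
        for (int tx = x0 / hiz.tileSize; tx <= x1 / hiz.tileSize; ++tx)
            if (nearestDepth <= hiz.maxDepth[ty * hiz.tilesX + tx])
                return false;  // could be visible in this tile
    return true;
}
```

The expensive part of this approach is the full-resolution rasterization that has to happen before `buildHiZ` can run.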
6
Masked Occlusion Culling
Based on Masked Depth Culling for Graphics Hardware [AHAM15]
• Created by Andersson M., Hasselgren J.,
Akenine-Möller T (AHAM)*
• Directly update HiZ buffer without
computing a full res z buffer during
rasterization
• Being conservative is the only requirement
– Depth must be >= the full-resolution buffer depth for every pixel
Depth Buffer ( Near == dark, Far == white )
Approximate, conservative HiZ buffer
Conservative representation, picks furthest Z.
Queries will never cull a visible object
7
In a game studio far far away...
8
About Avalanche Studios
● Founded in 2003
● Based in Stockholm and New York
● Focused on open world action games
9
MAD MAX*
2015
10
AVALANCHE STUDIOS
AT A GLANCE
JUNE 2016
JUST CAUSE 3*
2015
11
AVALANCHE STUDIOS
AT A GLANCE
JUNE 2016
theHunter: Call of the Wild*
2017
12
About Avalanche Studios
● Currently working on…
○ One unannounced self published game
○ Two unannounced AAA games
● Apex* — Avalanche Open World Engine
○ Cross platform
● Large open worlds
○ Just Cause 3* covers 1000 km²
13
Occlusion Culling in Apex*
● Cull as early as possible
○ Coarse CPU culling
● Dynamic occluders and occludees
○ Destruction
● Spend as little time as possible on
making occlusion geometry
○ Relatively small game teams
14
Past Occlusion Culling
● BFBC — Brute Force Box Culling
○ Artist placed occluder boxes
○ A handful of occluders each frame
○ Selection heuristic based on size and
distance
○ Frustum from OBB vs AABB
● Used in all released games since
Just Cause 2*
15
Why Change?
● Limited by boxes
○ Creating a terrain representation using only
boxes is challenging
● Culled individually, not by union
○ Problematic in indoor areas
● Constant battle with selection heuristics
○ Lots of time spent on making occluders work
well from all views
16
Future Occlusion Culling
● Software occlusion
○ Mesh occluders
○ Occluder fusion
○ Can use larger number of occluders
● Masked Occlusion Culling
○ Worked with Intel before
■ Optimized Just Cause 3* for Intel GPUs
○ Fast algorithm
○ Cross platform
17
Masked Occlusion Culling 101...
18
Masked Occlusion Culling optimised for SIMD intrinsics
• SSE2/4.1, AVX2 codepaths today
• Easily expanded to AVX-512
• Cross platform: VS/LLVM/Clang
• Removed unions and operator overloading
• 5x improvement between SSE2 and AVX2
MaskedOcclusionCulling: Fillrate test:
• 29 million triangles, 7 buckets (10x10 to 500x500)
• Clipping disabled, single threaded
19
Masked Occlusion Algorithm Overview
[Pipeline diagram: Triangle Setup, Tile Traversal]
• SIMD-wide triangle setup
• SIMD scanlines: 256* pixels, ‘N’ sub-tiles (8x4 pixels)
• Not covered today, see: https://www.slideshare.net/IntelSoftware/masked-software-occlusion-culling
• SIMD optimised, independent of MOC
• Update is the critical difference to other occlusion algorithms
*Number of pixels dependent on SIMD width
20
An Alternative HiZ Representation
Decouple depth and coverage information
• Two depth values per tile
• Per-pixel selection mask (1-bit per pixel)
Example:
• An 8×4 pixel tile is fully covered by a blue polygon, then later partially covered by a yellow triangle.
HiZ-representation
in screen space.
Yellow samples associated with Z1max
Blue samples associated with Z0max
Z1Max
Working Layer
Z0Max
Reference Layer
Coverage Mask
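The decoupled representation above can be pictured as a small struct: two depths plus a 32-bit mask that selects, per pixel, which of the two depths applies. The sketch below is only an illustration; the names `MaskedTile`, `bitIndex`, `pixelDepth` and `tileMaxDepth`, the row-major bit order, and the larger-is-farther depth convention are assumptions and not the layout of any particular implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical layout for one 8x4 tile: two depths plus a 1-bit-per-pixel
// selection mask. Larger depth = farther from the camera.
struct MaskedTile {
    float z0Max;         // reference layer
    float z1Max;         // working layer
    std::uint32_t mask;  // bit set => pixel belongs to the working layer
};

// Bit index of pixel (x, y) inside the tile, assuming row-major bit order.
inline int bitIndex(int x, int y) {
    assert(x >= 0 && x < 8 && y >= 0 && y < 4);
    return y * 8 + x;
}

// Conservative depth stored for one pixel: the layer its mask bit selects.
inline float pixelDepth(const MaskedTile& t, int x, int y) {
    return ((t.mask >> bitIndex(x, y)) & 1u) ? t.z1Max : t.z0Max;
}

// Farthest depth anywhere in the tile, usable for whole-tile queries.
inline float tileMaxDepth(const MaskedTile& t) {
    if (t.mask == 0u) return t.z0Max;           // only the reference layer
    if (t.mask == 0xFFFFFFFFu) return t.z1Max;  // only the working layer
    return std::max(t.z0Max, t.z1Max);
}
```

In the slide's example, the blue polygon would set `z0Max` for the whole tile, and the yellow triangle would set `z1Max` together with the mask bits of the pixels it covers.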
21
HiZ Depth Test
• Depth buffer split into 32xN tiles
• 32x4 for SSE4.1
• 32x8 for AVX2
• Sub-tiles are 8x4 for all SIMD sizes
• Two float depth values per sub-tile
• One 32bit mask per sub-tile
• Interpolate conservative depth (per 8x4 sub-tile)
[Figure: sub-tile layout in an SSE2/4.1 register and an AVX2 register]
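To make the tile arithmetic concrete, the following sketch maps a pixel to its 32xN tile, its 8x4 sub-tile and its mask bit, and computes the buffer size implied by two floats and one 32-bit mask per sub-tile. The function names and the lane and bit ordering are assumptions for illustration; an actual implementation may order sub-tiles and bits differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Constants from the slide: 8x4 sub-tiles inside tiles that are 32 pixels wide.
constexpr int kSubTileW = 8, kSubTileH = 4, kTileW = 32;

struct TileAddress {
    int tileX, tileY;  // which 32xN tile
    int subTile;       // which 8x4 sub-tile inside it (one SIMD lane)
    int bit;           // which bit of that sub-tile's 32-bit mask
};

// Map a pixel to its tile, sub-tile and mask bit. tileH is 4 for the SSE
// path and 8 for the AVX2 path. The ordering used here is an assumption.
TileAddress addressOf(int x, int y, int tileH) {
    assert(x >= 0 && y >= 0 && tileH > 0 && tileH % kSubTileH == 0);
    TileAddress a;
    a.tileX = x / kTileW;
    a.tileY = y / tileH;
    const int lx = x % kTileW, ly = y % tileH;
    const int subPerRow = kTileW / kSubTileW;  // 4 sub-tiles side by side
    a.subTile = (ly / kSubTileH) * subPerRow + lx / kSubTileW;
    a.bit = (ly % kSubTileH) * kSubTileW + lx % kSubTileW;
    return a;
}

// Bytes for the masked buffer: two floats and one 32-bit mask per sub-tile.
std::size_t maskedBufferBytes(int width, int height) {
    const std::size_t subTilesX = static_cast<std::size_t>((width + kSubTileW - 1) / kSubTileW);
    const std::size_t subTilesY = static_cast<std::size_t>((height + kSubTileH - 1) / kSubTileH);
    return subTilesX * subTilesY * (2 * sizeof(float) + sizeof(std::uint32_t));
}
```

With these numbers a 32x4 tile holds 4 sub-tiles and a 32x8 tile holds 8, which lines up with 4 and 8 32-bit lanes per register. Each sub-tile costs 12 bytes for 32 pixels, compared with 128 bytes for a 32-bit depth value per pixel.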
22
Building Hi-Z Buffer [AHAM15]
Simple example:
• Z0Max set to clear value or fully covered by
triangle
23
Building Hi-Z Buffer [AHAM15]
Simple example:
• Z0Max set to clear value or fully covered by
triangle
• Z1Max set to incoming triangle depth,
coverage mask is for 1 triangle.
24
Building Hi-Z Buffer [AHAM15]
Simple example:
• Z0Max set to clear value or fully covered by
triangle
• Z1Max set to incoming triangle depth
• 3 options for the depth after the next triangle:
• Discard Z1 if close to Z0Max
• Merge coverage with Z1Max
• Keep and discard Z1Max and coverage
25
Building Hi-Z Buffer [AHAM15]
Culled
Not Culled
26
Masked Occlusion Simple Scenes
• Merge heuristics work best
with front to back sort
• Improves efficiency due to
early out in tile update
• Sort major occluders by
bounding box screen space
position
• Apply normal frustum culling,
render with appropriate flags
Used in Conqueror's Blade*
*The Blade for All Conquerors: Making the Most of Intel Core for the Best Gaming Experience (GDC 2018)
27
Improving Complex Scenes
Easy to sort buildings front to back.
Ground intersects scene and cannot be sorted:
unsorted triangle data within a single draw call.
Creates temporal instability on silhouettes.
Not present if terrain and foreground objects are rendered separately.
28
Merging Buffers [HAAM16]
Resolution   Terrain          Foreground       Total            Merge    % spent
             Rasterization*   Rasterization*   Rasterization*   Time*    in merge
640x400      0.19             0.652            0.842            0.008    0.9%
1280x800     0.26             0.772            1.068            0.027    2.5%
1920x1080    0.31             0.834            1.144            0.051    4.4%
*Intel® Core™ i7-6950X
Working Layers
Reference Layers
Merging layers has more context about the scene than
updating the buffer on a per-triangle basis.
Can “special case” hard to sort meshes.
• Triangle sort of individual mesh
• BSP tree
29
Today on GitHub
Library can be configured for performance vs
quality trade-offs using compile time #defines
• Quick Mask=0: AHAM15 algorithm.
• Quick Mask=1: HAAM16 algorithm.
• Precise Coverage: Match D3D rules for
triangle
• Clipping Order: Ensure triangle ordering is
preserved after clipping
Masked Occlusion Library
• API Example
• Validation Test
• Fillrate test
Compile option comparison of fillrate test
performance
30
Update Tile, New Method [HAAM16]
Update [HAAM16] is quicker than original [AHAM15] , including Occlusion Queries
• z0max is the reference layer: maximum value for the entire tile
• z1max is the working layer: maximum value for a subset of the tile
– Updated as
– New depth = max(z1max , ztrimax)
– New mask = TriangleMask OR LayerMask
• Whenever working layer mask is full, overwrite reference layer
[AHAM15] [HAAM16]
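The update rules above can be sketched for a single sub-tile as follows. This is a minimal scalar illustration, assuming depths in [0, 1] with larger meaning farther and a clear value of 1.0; the names `TileState` and `updateTile` are hypothetical, and a SIMD implementation would process several sub-tiles per instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-sub-tile state; larger depth = farther from the camera.
struct TileState {
    float z0Max = 1.0f;      // reference layer, starts at the assumed clear value
    float z1Max = 0.0f;      // working layer
    std::uint32_t mask = 0;  // pixels currently owned by the working layer
};

// Sketch of the simplified update: fold a triangle's coverage mask and its
// conservative maximum depth inside this sub-tile into the working layer.
void updateTile(TileState& t, std::uint32_t triMask, float zTriMax) {
    // A triangle that may lie behind the reference layer cannot tighten the
    // tile, so it is skipped. Front-to-back order makes this early out common.
    if (triMask == 0u || zTriMax >= t.z0Max) return;

    t.z1Max = std::max(t.z1Max, zTriMax);  // New depth = max(z1max, ztrimax)
    t.mask |= triMask;                     // New mask = TriangleMask OR LayerMask

    // Working layer covers every pixel: it overwrites the reference layer.
    if (t.mask == 0xFFFFFFFFu) {
        t.z0Max = t.z1Max;
        t.z1Max = 0.0f;
        t.mask = 0u;
    }
}
```

Because only triangles nearer than `z0Max` reach the working layer, `z1Max` never exceeds `z0Max`, so overwriting the reference layer can only move it closer to the camera.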
31
Multi-Core Scaling
• Transform triangles into screen space.
• Bin by screen-space tiles.
• Use ScissorRect to clip contents.
• 1 active tile per thread.
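The binning step can be sketched as below: each transformed triangle is assigned to every screen-space bin its bounding box overlaps, so that one thread can later rasterize one bin with the bin rectangle as its scissor. The names `ScreenBox` and `binTriangles` and the uniform grid of bins are assumptions for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Screen-space bounding box of a transformed triangle, in pixels (inclusive).
struct ScreenBox { int minX, minY, maxX, maxY; };

// Assign each triangle index to every bin (screen-space tile) its bounding
// box overlaps. Each bin can then be rasterized by one thread, using the
// bin's rectangle as a scissor so threads never write the same pixels.
std::vector<std::vector<int>> binTriangles(const std::vector<ScreenBox>& boxes,
                                           int width, int height,
                                           int binsX, int binsY) {
    assert(width > 0 && height > 0 && binsX > 0 && binsY > 0);
    const int binW = (width + binsX - 1) / binsX;
    const int binH = (height + binsY - 1) / binsY;
    std::vector<std::vector<int>> bins(static_cast<std::size_t>(binsX) * binsY);
    for (int i = 0; i < static_cast<int>(boxes.size()); ++i) {
        const ScreenBox& b = boxes[i];
        if (b.maxX < 0 || b.maxY < 0 || b.minX >= width || b.minY >= height)
            continue;  // fully off screen
        const int bx0 = std::max(b.minX, 0) / binW;
        const int by0 = std::max(b.minY, 0) / binH;
        const int bx1 = std::min(b.maxX, width - 1) / binW;
        const int by1 = std::min(b.maxY, height - 1) / binH;
        for (int by = by0; by <= by1; ++by)
            for (int bx = bx0; bx <= bx1; ++bx)
                bins[by * binsX + bx].push_back(i);
    }
    return bins;
}
```

A triangle that straddles a bin boundary appears in several bins, which is one source of the overhead of binned rendering.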
32
Masked Software Occlusion Culling
Benefits of Masked Occlusion Culling [AHAM15, HAAM16]
• Much less memory to read/write than full res z-buffer
• Updates use bitmasks – can process many pixels in parallel (e.g. SSE/AVX)
• No need to compute per-pixel depths
• No need for conservative art assets
Cons:
• Doesn’t cull everything a full-res buffer would (98% accurate)*
• Conservative depth error is sensitive to render order
• Small amount of temporal instability in depth errors
*98% quoted in High Performance Graphics (2016) abstract
Render-order sensitivity and temporal instability: addressed with Hi-Z buffer merging
33
Back in that other galaxy...
34
Masked Occlusion Culling in Apex*
● Smooth integration
○ No major changes to existing systems
● Up and running on all platforms
○ PC uses highest compatible instruction set
○ Xbox One and PlayStation 4 use SSE4.1
● Artist placed occluders
● Terrain occluders
35
Artist Placed Occluders
● Boxes and quads
○ Might consider adding support for custom
meshes in the future
● Modular workflow
○ More efficient than before
○ Occluders placed in entities
● Closely fitted to avoid holes
○ Not overly conservative for lowres
36
Terrain Occluders
● Low poly conservative mesh
● Offline generation per patch
○ Each patch is 512x512m
○ Conservative reduction from base mesh
● Volumetric terrain may take many forms
○ Triangle count dependent on patch complexity
37
Occluder Rendering
● Halfres depth buffer
● Using quick mask update method [HAAM16]
○ Depth discrepancies occur in mostly same places, size makes little difference
● Threaded rendering
○ Merging vs binned rendering
○ High overhead cost for binned rendering
○ Merging proved more suitable for us
○ Improved temporal stability is a nice bonus
38
Frame Setup
Core 0: Occluder render task, Culling task, ...
Core 1: Terrain render task, Culling task, ...
Core 2: Other work, Culling task, ...
Core 3: Other work, Culling task, ...
39
Render Task
● Cull occluders outside the frustum or too small at a certain distance
● Sort by distance to OBB instead of AABB to minimize depth discrepancies
● Render occluders until a fixed limit is reached
● Last task to finish merges buffers
Stages: Clear buffer, Sort front to back, Size cull + frustum cull, Render occluders, Merge buffers
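The selection part of the render task could look roughly like the sketch below: cull by frustum and size, sort front to back, then stop at a fixed limit. The `Occluder` fields, the function name `selectOccluders`, and the choice of a triangle budget as the "fixed limit" are assumptions for illustration; the slides do not say how the limit is expressed.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical occluder record; field names are assumptions for this sketch.
struct Occluder {
    int id;
    float distance;     // distance from the camera to the occluder's OBB
    float screenSize;   // approximate projected size, e.g. in pixels
    bool inFrustum;     // result of a prior frustum test
    int triangleCount;
};

// Size cull + frustum cull, sort front to back, then keep occluders until an
// assumed triangle budget is reached. Returns occluder ids in render order.
std::vector<int> selectOccluders(std::vector<Occluder> occluders,
                                 float minScreenSize, int triangleBudget) {
    occluders.erase(
        std::remove_if(occluders.begin(), occluders.end(),
                       [&](const Occluder& o) {
                           return !o.inFrustum || o.screenSize < minScreenSize;
                       }),
        occluders.end());
    std::sort(occluders.begin(), occluders.end(),
              [](const Occluder& a, const Occluder& b) {
                  return a.distance < b.distance;
              });
    std::vector<int> order;
    int used = 0;
    for (const Occluder& o : occluders) {
        if (used + o.triangleCount > triangleBudget) break;
        used += o.triangleCount;
        order.push_back(o.id);
    }
    return order;
}
```

Sorting before applying the limit means that the occluders dropped by the budget are the most distant ones.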
40
Culling Task
[Diagram: occludee arrays split into batches]
● Occludees
○ Linear arrays of AABBs
○ Coarse single level hierarchy built on top
of some of the larger arrays
● Processed in batches
○ 512 occludees per batch
○ Each task processes one batch at a time
● Distance cull + frustum cull
● Project AABB vertices
● Test screen space rect
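The projection step of the culling task can be sketched as follows: the eight corners of an occludee's AABB are transformed, and the enclosing screen-space rectangle plus the nearest depth are kept for the rectangle test. The matrix convention (row-major, column vectors), the names `ScreenRect` and `projectAABB`, and the policy of treating boxes that cross the camera plane as visible are assumptions for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

struct ScreenRect {
    float minX, minY, maxX, maxY;  // normalized device coordinates
    float wMin;                    // nearest clip-space w of the box
    bool crossesNearPlane;         // true => treat the occludee as visible
};

// Project the 8 corners of an AABB with a row-major 4x4 view-projection
// matrix (clip = M * [x y z 1]^T) and return the enclosing NDC rectangle.
ScreenRect projectAABB(const float m[16], const float bmin[3], const float bmax[3]) {
    const float inf = std::numeric_limits<float>::infinity();
    ScreenRect r{inf, inf, -inf, -inf, inf, false};
    for (int i = 0; i < 8; ++i) {
        const float x = (i & 1) ? bmax[0] : bmin[0];
        const float y = (i & 2) ? bmax[1] : bmin[1];
        const float z = (i & 4) ? bmax[2] : bmin[2];
        const float cx = m[0] * x + m[1] * y + m[2] * z + m[3];
        const float cy = m[4] * x + m[5] * y + m[6] * z + m[7];
        const float cw = m[12] * x + m[13] * y + m[14] * z + m[15];
        if (cw <= 0.0f) {  // corner at or behind the camera: no valid rectangle
            r.crossesNearPlane = true;
            return r;
        }
        r.minX = std::min(r.minX, cx / cw);
        r.maxX = std::max(r.maxX, cx / cw);
        r.minY = std::min(r.minY, cy / cw);
        r.maxY = std::max(r.maxY, cy / cw);
        r.wMin = std::min(r.wMin, cw);
    }
    return r;
}
```

Using the nearest `w` of the box for the whole rectangle keeps the test conservative: the occludee is never assumed to be farther away than its closest corner.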
41
Statistics
● Outdoor settlement
● 31296 static objects
● Queries passed avg: ~2-5%
● Culled by MOC: ~10-20%
              PS4 Pro    i7-4790K
Render time   ~1.0ms     ~0.6ms
Merge time    ~0.1ms     ~0.05ms
Query time    ~2.0ms     ~1.0ms
Measurements show accumulated CPU time
across all threads. ~3000 triangles rasterized in
halfres = 960x540px.
42
Statistics
● Indoor area
● 38549 static objects
● Queries passed avg: ~1-3%
● Culled by MOC: ~30-40%
              PS4 Pro    i7-4790K
Render time   ~1.2ms     ~0.8ms
Merge time    ~0.1ms     ~0.05ms
Query time    ~2.0ms     ~1.0ms
Measurements show accumulated CPU time
across all threads. ~3000 triangles rasterized in
halfres = 960x540px.
43
Results
Pros
● Lower pressure on GPU rendering and CPU systems
● Performance increase, especially noticeable in worst case scenarios
○ Early results showed ~8-10% overall lower frame time on consoles
● Modular workflow is easier and more effective to use for artists
● Conservative terrain mesh
Cons
● Need to manually make sure there are no holes between placed occluders
● Depth discrepancies occur within terrain patches
44
Conclusion
Intel
● Masked Occlusion Culling is fast and scalable
● Simple API making it fast to prototype with
● Rough front to back sort is a requirement
● Will scale with future CPU instruction sets (AVX512)
Avalanche Studios
● MOC can easily be integrated into a game engine
● No need to remodel the entire occlusion pipeline
● Reduces workload on CPU and GPU
45
Legal Disclaimers and Optimization Notices
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF
INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products
are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall
have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
All products, platforms, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect
actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel
Performance Benchmark Limitations.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. § For more information go to www.intel.com/benchmarks.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain
optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision
#20110804.
The Intel Core and Itanium processor families may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
The benchmark results reported may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer
system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.
The code names Arrandale, Bloomfield, Boazman, Boulder Creek, Calpella, Chief River, Clarkdale, Cliffside, Cougar Point, Gulftown, Huron River, Ivy Bridge, Kilmer Peak, King’s Creek, Lewisville, Lynnfield, Maho Bay, Montevina, Montevina Plus, Nehalem, Penryn,
Puma Peak, Rainbow Peak, Sandy Bridge, Sugar Bay, Tylersburg, and Westmere presented in this document are only for use by Intel to identify a product, technology, or service in development, that has not been made commercially available to the public, i.e.,
announced, launched or shipped. It is not a "commercial" name for products or services and is not intended to function as a trademark.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.
Intel, Intel Core, Core Inside, Itanium, and the Intel Logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Intel collects and uses personal information from employees as part of SES, including capturing audio recording of sessions ( both presenters and audience QA ) as well as photographs and video recording of various event activities during the event. By registering and
attending the SES conference, you give your consent for this capture. This includes both speakers and attendees. Intel will not retain your personal information longer than is necessary for the purposes for which it is collected.
46
References
• [AHAM15] ANDERSSON M., HASSELGREN J., AKENINE-MÖLLER T.: Masked Depth Culling for
Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), pp. 188:1–188:9
• [CMK16] CHANDRASEKARAN C., MCNABB D., KUAH K., FAUCONNEAU M., GIESEN F.: Software
Occlusion Culling. Published online at: https://software.intel.com/en-us/articles/software-occlusion-culling, (2013–2016).
http://www.gdcvault.com/play/1017837/Why-Render-Hidden-Objects-Cull
• [Col11] COLLIN D.: Culling the Battlefield. Game Developer’s Conference (presentation), (2011).
http://www.gdcvault.com/play/1014491/Culling-the-Battlefield-Data-Oriented
• [Greene93] GREENE N., KASS M., MILLER G.: Hierarchical Z-Buffer Visibility. In Proceedings of
SIGGRAPH, (1993), pp. 231–238
• [HAAM16] HASSELGREN J., ANDERSSON M., AKENINE-MÖLLER T.: Masked Software Occlusion
Culling. High Performance Graphics, (2016)
https://www.slideshare.net/IntelSoftware/masked-software-occlusion-culling

Masked Occlusion Culling

  • 1.
    1 Masked Occlusion Culling Apractical guide to efficient CPU culling GDC 2018
  • 2.
    2 Introduction • Leigh DaviesIntel • Leigh.Davies@Intel.com • Senior Game / Graphics Application Engineer, DRD / SSG, UK. • Marcus Svensson, Avalanche Studios • marcus.svensson@avalanchestudios.se • Graphics Application Engineer Sweden.
  • 3.
    3 Agenda ● Part 1:Intro to Software Occlusion Culling ● Part 2: Past and Future of CPU Culling in Apex* ● Part 3: Introduction to Masked Occlusion Culling (MOC) ● Masked Occlusion Culling Overview ● Masked Occlusion performance ● Part 4: Avalanche Studios Case Study ● Integration with Apex* ● Art pipeline considerations ● Results ● Part 5: Conclusion/Q&A
  • 4.
    4 What is SoftwareOcclusion Culling ? Problem: Modern games draw a lot of objects! Environment potentially occludes a large proportion Culled very late in the GPU pipeline Wasted CPU time in logic and driver. Wasted GPU time getting to the culling phase Precomputed solutions don’t help with ● Destructible buildings ● Dynamic occludees *Other names and brands may be claimed as the property of others.
  • 5.
    5 Traditional software OcclusionSystems Low Resolution depth buffer: [Col11] • Requires manually inner conservative for Occluders Daniel Collin (DICE, GDC 2011)* Hierarchical Z Buffer (HiZ) [Greene93] • Rasterize to full resolution z buffer • Create HiZ buffer – Find maximum depth in each NxN tile • Perform occlusion query with HiZ buffer – Doesn’t fit with hierarchical scene traversal. • Intel: Occlusion Culling Sample [CMK16] *Other names and brands may be claimed as the property of others.
  • 6.
    6 Masked Occlusion Culling Basedon Masked Occlusion Culling for Graphics Hardware [AHAM15] • Created by Andersson M., Hasselgren J., Akenine-Möller T (AHAM)* • Directly update HiZ buffer without computing a full res z buffer during rasterization • Being conservative is the only requirement -Depth must be >= to full resolution buffer for any pixel Depth Buffer ( Near == dark, Far == white ) Approximate, conservative HiZ buffer Conservative representation, picks furthest Z. Queries will never cull a visible object
  • 7.
    7 In a gamestudio far far away...
  • 8.
    8 About Avalanche Studios ●Founded in 2003 ● Based in Stockholm and New York ● Focused on open world action games
  • 9.
  • 10.
    10 AVALANCHE STUDIOS AT AGLANCE JUNE 2016 JUST CAUSE 3* 2015
  • 11.
    11 AVALANCHE STUDIOS AT AGLANCE JUNE 2016 theHunter: Call of the Wild* 2017
  • 12.
    12 About Avalanche Studios ●Currently working on… ○ One unannounced self published game ○ Two unannounced AAA games ● Apex* — Avalanche Open World Engine ○ Cross platform ● Large open worlds ○ Just Cause 3* covers 1000 km2
  • 13.
    13 Occlusion Culling inApex* ● Cull as early as possible ○ Coarse CPU culling ● Dynamic occluders and occludees ○ Destruction ● Spend as little time as possible on making occlusion geometry ○ Relatively small game teams
  • 14.
    14 Past Occlusion Culling ●BFBC — Brute Force Box Culling ○ Artist placed occluder boxes ○ A handful occluders each frame ○ Selection heuristic based on size and distance ○ Frustum from OBB vs AABB ● Used in all released games since Just Cause 2*
  • 15.
    15 Why Change? ● Limitedby boxes ○ Creating a terrain representation using only boxes is challenging ● Culled individually, not by union ○ Problematic in indoor areas ● Constant battle with selection heuristics ○ Lots of time spent on making occluders work well from all views
  • 16.
    16 Future Occlusion Culling ●Software occlusion ○ Mesh occluders ○ Occluder fusion ○ Can use larger number of occluders ● Masked Occlusion Culling ○ Worked with Intel before ■ Optimized Just Cause 3* for Intel GPUs ○ Fast algorithm ○ Cross platform
  • 17.
  • 18.
    18 Masked Occlusion Cullingoptimised for SIMD intrinsics • SSE2/4.1, AVX2 codepaths today • Easily expanded to AVX-512 • Cross platform:- VS/LLVM/Clang • Removed union’s and operator overloading • 5x improvement between SSE2 and AVX2 MaskedOcclusionCulling: Fillrate test: • 29Million Triangles, 7 buckets of(10x10 to 500x500) • Clipping disabled, single threaded
  • 19.
    19 Masked Occlusion AlgorithmOverviewTriangleSetup Tile Traversal SIMD-Wide triangle setup SIMD - Scanlines 256* Pixels ‘N’ sub-tiles (8x4 Pixels) Not covered today, see: https://www.slideshare.net/IntelSoftware/ masked-software-occlusion-culling SIMD optimised, independent of MOC Update is the critical difference to other occlusion algorithms *Num pixels dependant on SIMD width
  • 20.
    20 An Alternative HiZRepresentation Decouple depth and coverage information • Two depth values per tile • Per-pixel selection mask (1-bit per pixel) Example: • A 8×4 pixel tile is fully covered by a blue polygon, then later partially covered by a yellow triangle. HiZ-representation in screen space. Yellow samples associated with Z1max Blue samples associated with Z0max Z1Max Working Layer Z0Max Reference Layer Coverage Mask
  • 21.
    21 HiZ Depth Test •Depth buffer split into 32xN tiles • 32x4 for SSE4.1 • 32x8 for AVX2 • Sub-tiles are 8x4 for all SIMD sizes • Two float depth values per sub-tile • One 32bit mask per sub-tile • Interpolate conservative depth (per 8x4 sub-tile) SSE2/4.1 AVX Register AVX2
  • 22.
    22 Building Hi-Z Buffer[AHAM15] Simpleexample; • Z0Max set to clear value or fully covered by triangle
  • 23.
    23 Building Hi-Z Buffer[AHAM15] Simple example; • Z0Max set to clear value or fully covered by triangle • Z1Max set to incoming triangle depth, coverage mask is for 1 triangle.
  • 24.
    24 Building Hi-Z Buffer[AHAM15] Simple example; • Z0Max set to clear value or fully covered by triangle • Z1Max set to incoming triangle depth • 3 Options for the depth after next triangle; • Discard Z1 if close to Z0Max • Merge coverage with Z1Max • Keep and discard Z1Max and coverage
  • 25.
    25 Building Hi-Z Buffer[AHAM15] Culled Not Culled
  • 26.
    26 Masked Occlusion SimpleScenes • Merge heuristics work best with front to back sort • Improves efficiency due to early out in tile update • Sort major occluders by bounding box screen space position • Apply normal Fustrum culling, render with appropriate flags Used in Conqueror's Blade* *The Blade for All Conquerors: Making the Most of Intel Core for the Best Gaming Experience (GDC 2018) Other names and brands may be claimed as the property of others.
  • 27.
    27 Easy to sortbuildings front to back. Ground intersects scene and cannot be sorted Creates temporal instability on silhouettes Improving Complex Scenes Not present if terrain and foreground objects are rendered separately unsorted triangle data within a single draw call
  • 28.
    28 Merging Buffers[HAAM16] *Intel® Core™i7-6950X Terrain Rasterization* Foreground Rasterization* Total Rasterization* Merge Time* % spent in merge 640x400 0.19 0.652 0.842 0.008 0.9% 1280x800 0.26 0.772 1.068 0.027 2.5% 1920x1080 0.31 0.834 1.144 0.051 4.4% Working Layers Reference Layers Merge layers has more context to scene than updating buffer on a per triangle basis. Can “special case” hard to sort meshes. • Triangle sort of individual mesh • BSP tree
  • 29.
    29 Today in Github Librarycan be configured for performance vs quality trade-offs using compile time #defines • Quick Mask=0: AHAM15 algorithm. • Quick Mask=1: HAAM16 algorithm. • Precise Coverage: Match D3D rules for triangle • Clipping Order: Ensure triangle ordering is preserved after clipping Masked Occlusion Library • API Example • Validation Test • Fillrate test Compile option comparison of fillrate test performance
  • 30.
    30 Update Tile, NewMethod [HAAM16] Update [HAAM16] is quicker than original [AHAM15] , including Occlusion Queries • z0max is the reference layer: Maximum value for the entire tile z1max is the working layer: Maximum value for a subset of the tile – Updated as – New depth = max(z1max , ztrimax) – New mask = TriangleMask OR LayerMask • Whenever working layer mask is full, overwrite reference layer [AHAM15] [HAAM16]
  • 31.
    31 Multi-Core Scaling • Transformtriangles into screen space. • Bin by screen-space tiles. • Use ScissorRect to clip contents. • 1 active tile per thread.
  • 32.
    32 Masked Software OcclusionCulling Benefits of Masked Occlusion Culling [AHAM15, HAAM16] • Much less memory to read/write than full res z-buffer • Updates use bitmasks – can process many pixels in parallel (i.e. SSE/AVX) • No need to compute per-pixel depths • No need for conservative art assets Cons: • Doesn’t cull everything a full rez buffer would ( 98% accurate)* • Conservative depth error is sensitive to render order • Small amount of temporal instability in depth errors *98% quoted in High Performance Graphics (2016) abstract Addressed with HI-Z Buffer merging
  • 33.
    33 Back in thatother galaxy...
  • 34.
    34 Masked Occlusion Cullingin Apex* ● Smooth integration ○ No major changes to existing systems ● Up and running on all platforms ○ PC uses highest compatible instruction set ○ Xbox One and Playstation 4 use SSE4.1 ● Artist placed occluders ● Terrain occluders
  • 35.
    35 Artist Placed Occluders ●Boxes and quads ○ Might consider adding support for custom meshes in the future ● Modular workflow ○ More efficient than before ○ Occluders placed in entities ● Closely fitted to avoid holes ○ Not overly conservative for lowres
  • 36.
    36 Terrain Occluders ● Lowpoly conservative mesh ● Offline generation per patch ○ Each patch is 512x512m ○ Conservative reduction from base mesh ● Volumetric terrain may take many forms ○ Triangle count dependent on patch complexity
  • 37.
    37 ● Halfres depthbuffer Occluder Rendering ● Using quick mask update method [HAAM16] ○ Depth discrepancies occur in mostly same places, size makes little difference ● Threaded rendering ○ Merging vs binned rendering ○ High overhead cost for binned rendering ○ Merging proved more suitable for us ○ Improved temporal stability is a nice bonus
  • 38.
    38 Frame Setup Occluder rendertask Culling task Culling taskTerrain render task ... ... Other work Culling task Culling task ... ... Other work Core 0 Core 1 Core 2 Core 3
  • 39.
    39 ● Cull occludersoutside the frustum or too small at a certain distance ● Sort by distance to OBB instead of AABB to minimize depth discrepancies ● Render occluders until a fixed limit is reached ● Last task to finish merges buffers Render Task Clear buffer Sort front to back Size cull + frustum cull Render occluders Merge buffers
  • 40.
    40 Culling Task BatchBatch Batch ●Occludees ○ Linear arrays of AABBs ○ Coarse single level hierarchy built on top of some of the larger arrays ● Processed in batches ○ 512 occludees per batch ○ Each task processes one batch at a time ● Distance cull + frustum cull ● Project AABB vertices ● Test screen space rect
  • 41.
    41 Statistics ● Outdoor settlement ●31296 static objects ● Queries passed avg: ~2-5% ● Culled by MOC: ~10-20% PS4 Pro i7-4790K Render time ~1.0ms ~0.6ms Merge time ~0.1ms ~0.05ms Query time ~2.0ms ~1.0ms Measurements show accumulated CPU time across all threads. ~3000 triangles rasterized in halfres = 960x540px.
  • 42.
    42 Statistics ● Indoor area ●38549 static objects ● Queries passed avg: ~1-3% ● Culled by MOC: ~30-40% PS4 Pro i7-4790K Render time ~1.2ms ~0.8ms Merge time ~0.1ms ~0.05ms Query time ~2.0ms ~1.0ms Measurements show accumulated CPU time across all threads. ~3000 triangles rasterized in halfres = 960x540px.
  • 43.
    43 Results Pros ● Lower pressureon GPU rendering and CPU systems ● Performance increase, especially noticeable in worst case scenarios ○ Early results showed ~8-10% overall lower frame time on consoles ● Modular workflow is easier and more effective to use for artists ● Conservative terrain mesh Cons ● Need to manually make sure there are no holes between placed occluders ● Depth discrepancies occur within terrain patches
  • 44.
    44 Conclusion Intel ● Masked OcclusionCulling is fast and scalable ● Simple API making it fast to prototype with ● Rough front to back sort is a requirement ● Will scale with future CPU instruction sets (AVX512) Avalanche Studios ● MOC can easily be integrated into a game engine ● No need to remodel the entire occlusion pipeline ● Reduces workload on CPU and GPU
  • 46.
    46 References
    • [AHAM15] ANDERSSON M., HASSELGREN J., AKENINE-MÖLLER T.: Masked Depth Culling for Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), pp. 188:1–188:9
    • [CMK16] CHANDRASEKARAN C., MCNABB D., KUAH K., FAUCONNEAU M., GIESEN F.: Software Occlusion Culling. Published online at: https://software.intel.com/en-us/articles/software-occlusion-culling, (2013–2016). http://www.gdcvault.com/play/1017837/Why-Render-Hidden-Objects-Cull
    • [Col11] COLLIN D.: Culling the Battlefield. Game Developer’s Conference (presentation), (2011). http://www.gdcvault.com/play/1014491/Culling-the-Battlefield-Data-Oriented
    • [Greene93] GREENE N., KASS M., MILLER G.: Hierarchical Z-Buffer Visibility. In Proceedings of SIGGRAPH, (1993), pp. 231–238
    • [HAAM16] HASSELGREN J., ANDERSSON M., AKENINE-MÖLLER T.: Masked Software Occlusion Culling. High Performance Graphics, (2016). https://www.slideshare.net/IntelSoftware/masked-software-occlusion-culling