Masked Occlusion Culling

1
Masked Occlusion Culling
A practical guide to efficient CPU culling
GDC 2018

2
Introduction
• Leigh Davies, Intel
• Leigh.Davies@Intel.com
• Senior Game / Graphics Application Engineer, DRD / SSG, UK.
• Marcus Svensson, Avalanche Studios
• marcus.svensson@avalanchestudios.se
• Graphics Application Engineer, Sweden.

3
Agenda
● Part 1: Intro to Software Occlusion Culling
● Part 2: Past and Future of CPU Culling in Apex*
● Part 3: Introduction to Masked Occlusion Culling (MOC)
○ Masked Occlusion Culling overview
○ Masked Occlusion Culling performance
● Part 4: Avalanche Studios Case Study
○ Integration with Apex*
○ Art pipeline considerations
○ Results
● Part 5: Conclusion/Q&A

4
What is Software Occlusion Culling?
Problem: Modern games draw a lot of objects!
• The environment potentially occludes a large proportion of them
• Occluded objects are culled very late in the GPU pipeline
• Wasted CPU time in logic and driver
• Wasted GPU time getting to the culling phase
Precomputed solutions don’t help with
● Destructible buildings
● Dynamic occludees
*Other names and brands may be claimed as the property of others.

5
Traditional software Occlusion Systems
Low-resolution depth buffer [Col11]
• Requires manually authored, inner-conservative occluders
• Daniel Collin (DICE, GDC 2011)*
Hierarchical Z Buffer (HiZ) [Greene93]
• Rasterize to full resolution z buffer
• Create HiZ buffer
– Find maximum depth in each NxN tile
• Perform occlusion query with HiZ buffer
– Doesn’t fit with hierarchical scene traversal.
• Intel: Occlusion Culling Sample [CMK16]
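The HiZ steps listed above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited systems: the function names, the list-of-rows buffer layout and the "larger depth = farther" convention are assumptions made for the example.

```python
def build_hiz(depth, tile):
    """Reduce a full-resolution depth buffer (list of rows) to one
    maximum (farthest) depth per tile x tile block."""
    height, width = len(depth), len(depth[0])
    hiz = []
    for ty in range(0, height, tile):
        row = []
        for tx in range(0, width, tile):
            row.append(max(depth[y][x]
                           for y in range(ty, min(ty + tile, height))
                           for x in range(tx, min(tx + tile, width))))
        hiz.append(row)
    return hiz


def is_occluded(hiz, tile, rect, z_near):
    """Conservative query. rect = (x0, y0, x1, y1), inclusive pixel bounds
    inside the buffer; z_near is the occludee's nearest depth."""
    x0, y0, x1, y1 = rect
    for ty in range(y0 // tile, y1 // tile + 1):
        for tx in range(x0 // tile, x1 // tile + 1):
            if z_near <= hiz[ty][tx]:
                return False  # may be in front of the farthest occluder here
    return True


# 4x4 buffer: left half holds a near wall (0.2), right half is empty (1.0).
depth = [[0.2, 0.2, 1.0, 1.0] for _ in range(4)]
hiz = build_hiz(depth, tile=2)
behind_wall = is_occluded(hiz, 2, (0, 0, 1, 3), z_near=0.5)
spans_gap = is_occluded(hiz, 2, (0, 0, 3, 3), z_near=0.5)
```

An object is reported occluded only when it is behind the farthest depth of every tile it touches, which is why the query can miss some hidden objects but does not reject a visible one.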

6
Masked Occlusion Culling
Based on Masked Depth Culling for Graphics Hardware [AHAM15]
• Created by Andersson M., Hasselgren J., Akenine-Möller T. (AHAM)*
• Directly updates the HiZ buffer without computing a full-res z buffer during rasterization
• Being conservative is the only requirement
– Depth must be >= the full-resolution buffer depth for every pixel
• Conservative representation, picks the furthest Z
• Queries will never cull a visible object
[Figures: depth buffer (near == dark, far == white); approximate, conservative HiZ buffer]

7
In a game studio far far away...

8
About Avalanche Studios
● Founded in 2003
● Based in Stockholm and New York
● Focused on open world action games

10
AVALANCHE STUDIOS
AT A GLANCE
JUNE 2016
JUST CAUSE 3*
2015

11
theHunter: Call of the Wild*
2017

12
About Avalanche Studios
● Currently working on…
○ One unannounced self published game
○ Two unannounced AAA games
● Apex* — Avalanche Open World Engine
○ Cross platform
● Large open worlds
○ Just Cause 3* covers 1000 km²

13
Occlusion Culling in Apex*
● Cull as early as possible
○ Coarse CPU culling
● Dynamic occluders and occludees
○ Destruction
● Spend as little time as possible on
making occlusion geometry
○ Relatively small game teams

14
Past Occlusion Culling
● BFBC — Brute Force Box Culling
○ Artist placed occluder boxes
○ A handful of occluders each frame
○ Selection heuristic based on size and
distance
○ Frustum from OBB vs AABB
● Used in all released games since
Just Cause 2*

15
Why Change?
● Limited by boxes
○ Creating a terrain representation using only
boxes is challenging
● Culled individually, not by union
○ Problematic in indoor areas
● Constant battle with selection heuristics
○ Lots of time spent on making occluders work
well from all views

16
Future Occlusion Culling
● Software occlusion
○ Mesh occluders
○ Occluder fusion
○ Can use larger number of occluders
● Masked Occlusion Culling
○ Worked with Intel before
■ Optimized Just Cause 3* for Intel GPUs
○ Fast algorithm
○ Cross platform

17
Masked Occlusion Culling 101...

18
Masked Occlusion Culling optimised for SIMD intrinsics
• SSE2/4.1, AVX2 codepaths today
• Easily expanded to AVX-512
• Cross platform: VS/LLVM/Clang
• Removed unions and operator overloading
• 5x improvement between SSE2 and AVX2
MaskedOcclusionCulling fillrate test:
• 29 million triangles, 7 buckets (10x10 to 500x500)
• Clipping disabled, single threaded

19
Masked Occlusion Algorithm Overview
[Diagram: Triangle Setup (SIMD-wide triangle setup); Tile Traversal (SIMD scanlines, 256* pixels, 'N' sub-tiles of 8x4 pixels)]
• Not covered today, see: https://www.slideshare.net/IntelSoftware/masked-software-occlusion-culling
• SIMD optimised, independent of MOC
• Update is the critical difference to other occlusion algorithms
*Number of pixels dependent on SIMD width

20
An Alternative HiZ Representation
Decouple depth and coverage information
• Two depth values per tile
• Per-pixel selection mask (1-bit per pixel)
Example:
• An 8×4 pixel tile is fully covered by a blue polygon, then later partially covered by a yellow triangle.
• Yellow samples are associated with Z1max (working layer)
• Blue samples are associated with Z0max (reference layer)
[Figure: HiZ representation in screen space, showing Z1Max (working layer), Z0Max (reference layer) and the coverage mask]
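A small sketch of the decoupled representation may help: two depths per tile plus a one-bit-per-pixel selection mask. The class and helper names are hypothetical, the depth values are made up, and the yellow shape is a rectangle rather than a triangle to keep the mask construction short.

```python
from dataclasses import dataclass

TILE_W, TILE_H = 8, 4  # one sub-tile, 32 pixels, one mask bit per pixel


@dataclass
class MaskedTile:
    z0_max: float  # reference layer depth
    z1_max: float  # working layer depth
    mask: int      # bit set => pixel belongs to the working layer

    def pixel_depth(self, x, y):
        """Conservative depth for one pixel, selected by the mask."""
        bit = y * TILE_W + x
        return self.z1_max if (self.mask >> bit) & 1 else self.z0_max


def rect_mask(x0, x1, y0, y1):
    """Coverage mask for pixel columns x0..x1 and rows y0..y1 (inclusive)."""
    mask = 0
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            mask |= 1 << (y * TILE_W + x)
    return mask


# Blue polygon covers the whole tile at depth 0.8 (reference layer);
# a yellow shape at depth 0.3 later covers the right half (working layer).
tile = MaskedTile(z0_max=0.8, z1_max=0.3, mask=rect_mask(4, 7, 0, 3))
```

Here `tile.pixel_depth(0, 0)` reads the blue reference depth and `tile.pixel_depth(7, 3)` reads the yellow working depth, without storing a depth per pixel.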

21
HiZ Depth Test
• Depth buffer split into 32xN tiles
• 32x4 for SSE4.1
• 32x8 for AVX2
• Sub-tiles are 8x4 for all SIMD sizes
• Two float depth values per sub-tile
• One 32bit mask per sub-tile
• Interpolate conservative depth (per 8x4 sub-tile)
[Figure: tile layout in SSE2/4.1 and AVX2 registers]
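The tile sizes above follow from simple arithmetic, sketched here under the assumption that each 32-bit SIMD lane holds one 8x4 sub-tile. The function names are hypothetical, and the storage figure only counts the two floats and one mask per sub-tile listed on the slide.

```python
SUBTILE_W, SUBTILE_H = 8, 4   # pixels per sub-tile, from the slide
TILE_W = 32                   # tiles are 32xN pixels


def tile_layout(simd_bits):
    """Return (tile_width, tile_height, sub_tiles) for a SIMD register
    width in bits, assuming one sub-tile per 32-bit lane."""
    lanes = simd_bits // 32
    pixels = lanes * SUBTILE_W * SUBTILE_H
    return TILE_W, pixels // TILE_W, lanes


def bytes_per_subtile():
    """Two 32-bit float depths plus one 32-bit mask."""
    return 2 * 4 + 4


sse = tile_layout(128)    # SSE2/4.1
avx2 = tile_layout(256)   # AVX2
masked_bits_per_pixel = bytes_per_subtile() * 8 / (SUBTILE_W * SUBTILE_H)
```

With these assumptions a sub-tile costs 12 bytes for 32 pixels, or 3 bits per pixel, compared with 32 bits per pixel if a full-resolution buffer stored one 32-bit float per pixel.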

22
Building Hi-Z Buffer [AHAM15]
Simple example:
• Z0Max set to clear value or fully covered by triangle

23
Building Hi-Z Buffer [AHAM15]
Simple example:
• Z1Max set to incoming triangle depth; the coverage mask is for one triangle.

24
Building Hi-Z Buffer [AHAM15]
Simple example:
• Z1Max set to incoming triangle depth
• Three options for the depth after the next triangle:
• Discard Z1 if close to Z0Max
• Merge coverage with Z1Max
• Keep the incoming triangle and discard the existing Z1Max and coverage

25
[Figure labels: Culled / Not Culled]

26
Masked Occlusion Simple Scenes
• Merge heuristics work best with a front-to-back sort
• Improves efficiency due to early out in tile update
• Sort major occluders by bounding box screen-space position
• Apply normal frustum culling, render with appropriate flags
Used in Conqueror's Blade*
*The Blade for All Conquerors: Making the Most of Intel Core for the Best Gaming Experience (GDC 2018)

27
Improving Complex Scenes
• Easy to sort buildings front to back.
• Ground intersects the scene and cannot be sorted.
• Unsorted triangle data within a single draw call.
• Creates temporal instability on silhouettes.
• Not present if terrain and foreground objects are rendered separately.

28
Merging Buffers [HAAM16]
Resolution   Terrain          Foreground       Total            Merge    % spent
             rasterization*   rasterization*   rasterization*   time*    in merge
640x400      0.19             0.652            0.842            0.008    0.9%
1280x800     0.26             0.772            1.068            0.027    2.5%
1920x1080    0.31             0.834            1.144            0.051    4.4%
*Intel® Core™ i7-6950X
[Figure: working layers and reference layers]
Merging layers has more scene context than updating the buffer on a per-triangle basis.
Can “special case” hard-to-sort meshes:
• Triangle sort of individual mesh
• BSP tree
29
Today on GitHub
Masked Occlusion Library
• API Example
• Validation Test
• Fillrate test
Library can be configured for performance vs quality trade-offs using compile-time #defines
• Quick Mask=0: AHAM15 algorithm.
• Quick Mask=1: HAAM16 algorithm.
• Precise Coverage: Match D3D rules for triangle rasterization
• Clipping Order: Ensure triangle ordering is preserved after clipping
[Figure: compile option comparison of fillrate test performance]

30
Update Tile, New Method [HAAM16]
Update [HAAM16] is quicker than the original [AHAM15], including occlusion queries
• z0max is the reference layer: maximum value for the entire tile
• z1max is the working layer: maximum value for a subset of the tile
– Updated as
– New depth = max(z1max, ztrimax)
– New mask = TriangleMask OR LayerMask
• Whenever the working layer mask is full, overwrite the reference layer
[Figure: update comparison, AHAM15 vs HAAM16]
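The update rule on this slide can be written out as a short sketch for a single 8x4 sub-tile. The function name and the tuple state are hypothetical; the early-out for triangles that are not in front of the reference layer is an added assumption, and an empty working layer is represented by depth 0.0 with an empty mask.

```python
FULL_MASK = 0xFFFFFFFF  # 32 pixels in an 8x4 sub-tile


def update_tile(z0_max, z1_max, mask, tri_mask, z_tri_max):
    """Return the new (z0_max, z1_max, mask) after one triangle."""
    tri_mask &= FULL_MASK
    # Assumption (not on the slide): skip triangles with no coverage or
    # whose farthest depth is not in front of the reference layer.
    # Skipping an occluder is always conservative.
    if tri_mask == 0 or z_tri_max >= z0_max:
        return z0_max, z1_max, mask
    z1_max = max(z1_max, z_tri_max)   # new depth = max(z1max, ztrimax)
    mask |= tri_mask                  # new mask = TriangleMask OR LayerMask
    if mask == FULL_MASK:             # working layer full: overwrite reference
        z0_max, z1_max, mask = z1_max, 0.0, 0
    return z0_max, z1_max, mask


state = (1.0, 0.0, 0)  # cleared tile: far reference, empty working layer
state = update_tile(*state, tri_mask=0x0000FFFF, z_tri_max=0.4)
after_first = state
state = update_tile(*state, tri_mask=0xFFFF0000, z_tri_max=0.6)
after_second = state
```

After the second triangle the mask is full, so the reference layer drops from the clear value to 0.6 and the working layer is reset.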

31
Multi-Core Scaling
• Transform triangles into screen space.
• Bin by screen-space tiles.
• Use ScissorRect to clip contents.
• 1 active tile per thread.
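The binning step can be sketched as follows. This is an illustration under assumptions rather than the library's code: triangles are already in screen space, bins are a regular grid, a triangle goes into every bin its bounding box touches, and `scissor_rect` is a hypothetical helper returning the rectangle a thread would clip to.

```python
import math


def bin_triangles(tris, width, height, bin_w, bin_h):
    """Map (bin_x, bin_y) -> list of triangle indices overlapping that bin.
    tris is a list of three (x, y) screen-space vertices per triangle."""
    cols = (width + bin_w - 1) // bin_w
    rows = (height + bin_h - 1) // bin_h
    bins = {(bx, by): [] for by in range(rows) for bx in range(cols)}
    for index, tri in enumerate(tris):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        x0 = max(0, math.floor(min(xs)))
        x1 = min(width - 1, math.floor(max(xs)))
        y0 = max(0, math.floor(min(ys)))
        y1 = min(height - 1, math.floor(max(ys)))
        if x0 > x1 or y0 > y1:
            continue  # bounding box is entirely off screen
        for by in range(y0 // bin_h, y1 // bin_h + 1):
            for bx in range(x0 // bin_w, x1 // bin_w + 1):
                bins[(bx, by)].append(index)
    return bins


def scissor_rect(bx, by, bin_w, bin_h, width, height):
    """Pixel rectangle (min_x, min_y, max_x, max_y) owned by one bin."""
    return (bx * bin_w, by * bin_h,
            min((bx + 1) * bin_w, width), min((by + 1) * bin_h, height))


tris = [[(1, 1), (10, 1), (1, 10)],       # fits in the top-left bin
        [(20, 5), (40, 5), (30, 20)],     # straddles all four bins
        [(-50, 5), (-10, 5), (-30, 20)]]  # off screen
bins = bin_triangles(tris, width=64, height=32, bin_w=32, bin_h=16)
```

Because each bin is rasterized with its own scissor rectangle, a thread that owns one active bin never writes outside it, so no locking on the depth buffer is needed in this scheme.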

32
Masked Software Occlusion Culling
Benefits of Masked Occlusion Culling [AHAM15, HAAM16]
• Much less memory to read/write than a full-res z-buffer
• Updates use bitmasks, so many pixels can be processed in parallel (i.e. SSE/AVX)
• No need to compute per-pixel depths
• No need for conservative art assets
Cons:
• Doesn’t cull everything a full-res buffer would (98% accurate)*
• Conservative depth error is sensitive to render order
• Small amount of temporal instability in depth errors
Addressed with Hi-Z buffer merging
*98% quoted in High Performance Graphics (2016) abstract

33
Back in that other galaxy...

34
Masked Occlusion Culling in Apex*
● Smooth integration
○ No major changes to existing systems
● Up and running on all platforms
○ PC uses highest compatible instruction set
○ Xbox One and PlayStation 4 use SSE4.1
● Artist placed occluders
● Terrain occluders

35
Artist Placed Occluders
● Boxes and quads
○ Might consider adding support for custom
meshes in the future
● Modular workflow
○ More efficient than before
○ Occluders placed in entities
● Closely fitted to avoid holes
○ Not overly conservative for low res

36
Terrain Occluders
● Low poly conservative mesh
● Offline generation per patch
○ Each patch is 512x512m
○ Conservative reduction from base mesh
● Volumetric terrain may take many forms
○ Triangle count dependent on patch complexity

37
Occluder Rendering
● Halfres depth buffer
● Using quick mask update method [HAAM16]
○ Depth discrepancies occur in mostly the same places; size makes little difference
● Threaded rendering
○ Merging vs binned rendering
○ High overhead cost for binned rendering
○ Merging proved more suitable for us
○ Improved temporal stability is a nice bonus

38
Frame Setup
[Diagram: per-core task timeline]
Core 0: Occluder render task, Culling task, ...
Core 1: Terrain render task, Culling task, ...
Core 2: Other work, Culling task, ...
Core 3: Other work, Culling task, ...

39
Render Task
● Cull occluders outside the frustum or too small at a certain distance
● Sort by distance to OBB instead of AABB to minimize depth discrepancies
● Render occluders until a fixed limit is reached
● Last task to finish merges buffers
[Flow diagram: Clear buffer, Size cull + frustum cull, Sort front to back, Render occluders, Merge buffers]
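The occluder selection in the render task can be sketched as below. Everything here is illustrative: the `Occluder` fields, the radius-over-distance size test and the triangle budget are assumptions, since the slide does not say how "too small" or the "fixed limit" are defined in Apex*.

```python
from dataclasses import dataclass


@dataclass
class Occluder:
    name: str
    distance: float    # camera distance to the occluder's OBB
    radius: float      # bounding radius, used for the size cull
    triangles: int
    in_frustum: bool   # result of a frustum test done elsewhere


def select_occluders(occluders, min_size, triangle_budget):
    """Size cull + frustum cull, sort front to back, stop at the budget."""
    kept = [o for o in occluders
            if o.in_frustum and o.radius / max(o.distance, 1e-6) >= min_size]
    kept.sort(key=lambda o: o.distance)
    chosen, used = [], 0
    for occluder in kept:
        if used + occluder.triangles > triangle_budget:
            break  # fixed limit reached
        chosen.append(occluder)
        used += occluder.triangles
    return chosen


scene = [Occluder("hill", 100.0, 50.0, 2000, True),
         Occluder("wall", 10.0, 5.0, 12, True),
         Occluder("crate", 200.0, 0.5, 12, True),      # too small at distance
         Occluder("behind", 5.0, 5.0, 12, False),      # outside the frustum
         Occluder("tower", 50.0, 10.0, 1500, True)]
chosen = select_occluders(scene, min_size=0.01, triangle_budget=3000)
chosen_names = [o.name for o in chosen]
```

Sorting before applying the limit means the budget is spent on the nearest occluders first, which also matches the front-to-back order the update heuristics prefer.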

40
Culling Task
[Diagram: Batch, Batch, Batch]
● Occludees
○ Linear arrays of AABBs
○ Coarse single level hierarchy built on top
of some of the larger arrays
● Processed in batches
○ 512 occludees per batch
○ Each task processes one batch at a time
● Distance cull + frustum cull
● Project AABB vertices
● Test screen space rect

41
Statistics
● Outdoor settlement
● 31296 static objects
● Queries passed avg: ~2-5%
● Culled by MOC: ~10-20%
               PS4 Pro    i7-4790K
Render time    ~1.0ms     ~0.6ms
Merge time     ~0.1ms     ~0.05ms
Query time     ~2.0ms     ~1.0ms
Measurements show accumulated CPU time across all threads. ~3000 triangles rasterized in halfres = 960x540px.

42
Statistics
● Indoor area
● 38549 static objects
● Queries passed avg: ~1-3%
● Culled by MOC: ~30-40%
               PS4 Pro    i7-4790K
Render time    ~1.2ms     ~0.8ms
Merge time     ~0.1ms     ~0.05ms
Query time     ~2.0ms     ~1.0ms
Measurements show accumulated CPU time across all threads. ~3000 triangles rasterized in halfres = 960x540px.

43
Results
Pros
● Lower pressure on GPU rendering and CPU systems
● Performance increase, especially noticeable in worst case scenarios
○ Early results showed ~8-10% overall lower frame time on consoles
● Modular workflow is easier and more effective to use for artists
● Conservative terrain mesh
Cons
● Need to manually make sure there are no holes between placed occluders
● Depth discrepancies occur within terrain patches

44
Conclusion
Intel
● Masked Occlusion Culling is fast and scalable
● Simple API making it fast to prototype with
● Rough front to back sort is a requirement
● Will scale with future CPU instruction sets (AVX-512)
Avalanche Studios
● MOC can easily be integrated into a game engine
● No need to remodel the entire occlusion pipeline
● Reduces workload on CPU and GPU

45
Legal Disclaimers and Optimization Notices
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT.
EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF
INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products
are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall
have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
All products, platforms, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect
actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel
Performance Benchmark Limitations.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. § For more information go to www.intel.com/benchmarks.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain
optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice Revision
#20110804.
The Intel Core and Itanium processor families may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
The benchmark results reported may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer
system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.
The code names Arrandale, Bloomfield, Boazman, Boulder Creek, Calpella, Chief River, Clarkdale, Cliffside, Cougar Point, Gulftown, Huron River, Ivy Bridge, Kilmer Peak, King’s Creek, Lewisville, Lynnfield, Maho Bay, Montevina, Montevina Plus, Nehalem, Penryn,
Puma Peak, Rainbow Peak, Sandy Bridge, Sugar Bay, Tylersburg, and Westmere presented in this document are only for use by Intel to identify a product, technology, or service in development, that has not been made commercially available to the public, i.e.,
announced, launched or shipped. It is not a "commercial" name for products or services and is not intended to function as a trademark.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel's Web Site.
Intel, Intel Core, Core Inside, Itanium, and the Intel Logo are trademarks of Intel Corporation in the U.S. and other countries.
Intel collects and uses personal information from employees as part of SES, including capturing audio recording of sessions ( both presenters and audience QA ) as well as photographs and video recording of various event activities during the event. By registering and
attending the SES conference, you give your consent for this capture. This includes both speakers and attendees. Intel will not retain your personal information longer than is necessary for the purposes for which it is collected.

46
References
• [AHAM15] ANDERSSON M., HASSELGREN J., AKENINE-MÖLLER T.: Masked Depth Culling for Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), pp. 188:1–188:9.
• [CMK16] CHANDRASEKARAN C., MCNABB D., KUAH K., FAUCONNEAU M., GIESEN F.: Software Occlusion Culling. Published online (2013–2016).
https://software.intel.com/en-us/articles/software-occlusion-culling
http://www.gdcvault.com/play/1017837/Why-Render-Hidden-Objects-Cull
• [Col11] COLLIN D.: Culling the Battlefield. Game Developer’s Conference (presentation), (2011).
http://www.gdcvault.com/play/1014491/Culling-the-Battlefield-Data-Oriented
• [Greene93] GREENE N., KASS M., MILLER G.: Hierarchical Z-Buffer Visibility. In Proceedings of SIGGRAPH, (1993), pp. 231–238.
• [HAAM16] HASSELGREN J., ANDERSSON M., AKENINE-MÖLLER T.: Masked Software Occlusion Culling. High Performance Graphics, (2016).
https://www.slideshare.net/IntelSoftware/masked-software-occlusion-culling
