The document discusses solutions for common problems that arise in deferred rendering engines, such as handling multiple shading models and lighting translucent geometry. It proposes using multiple light rendering passes where the scene is masked in each pass to render only for specific shading models, avoiding expensive branching. It also details using object space light probes to efficiently light alpha objects and particle systems directly on the GPU within the deferred rendering framework.
Develop2012 deferred sanchez_stachowiak
2. Solving Some Common Problems
in a Modern Deferred Rendering
Engine
Jose Luis Sanchez Bonet
Tomasz Stachowiak
/* @h3r2tic */
3. Deferred rendering – pros and cons
• Pros ( some )
– Very scalable
– No shader permutation explosion
– G-Buffer useful in other techniques
• SSAO, SRAA, decals, …
• Cons ( some )
– Difficult to use multiple shading models
– Does not handle translucent geometry
• Some variants do, but may be impractical
4. Reflectance models
• The BRDF defines the look of a surface
– Bidirectional Reflectance Distribution Function
$$L_o = L_e + \int_{\Omega} L_i \cdot f_r \cdot \cos\theta \, \mathrm{d}\omega$$
• Typically games use just one ( Blinn-Phong )
– Simple, but inaccurate
• Very important in physically based rendering
– Want more: Oren-Nayar, Kajiya-Kay, Penner, Cook-Torrance, …
5. BRDFs vs. rendering
• Forward rendering
– Material shader directly evaluates the BRDF
• Trivial
• Deferred rendering
– Light shaders decoupled from materials
– No obvious solution
Material G-Buffer Light
BRDF ???
6. BRDFs vs. deferred – branching?
• Read shading model ID in the lighting shader, branch
• Might be the way to go on next-gen
• Expensive on current consoles
– Tax for branches never taken
• Don’t want to pay it for every light
Three different BRDFs, only one used ( branch always yields the first one )
Platform   1 BRDF    2 BRDFs                 3 BRDFs
360        1.85 ms   2.1 ms ( + 0.25 ms )    2.35 ms ( + 0.5 ms )
PS3        1.9 ms    2.48 ms ( + 0.58 ms )   2.8 ms ( + 0.9 ms )
7. BRDFs vs. deferred – LUTs?
• Pre-calculate BRDF look-up tables
• Might be shippable enough
– See: S.T.A.L.K.E.R.
• Limited control over parameters
– Roughness
– Anisotropy, etc.
• BRDFs highly dimensional
– Isotropic with roughness control → 3D LUT
8. BRDFs vs. deferred – our approach
• One default BRDF
– Others a relatively rare case
• Shading model ID in stencil
• Multi-pass light rendering
• Mask out parts of the scene in each pass
9. Multi-pass – tax avoidance
• For each light
– Find all affected BRDFs
– Render the light volume once for each model
• Analogous to multi-pass forward rendering!
• Store bounding volumes of objects with non-standard BRDFs
– Intersect with light volumes
10. Making it practical
• Needs to work with depth culling of lights
• Hierarchical stencil on 360 and PS3
11. Depth culling of lights
• Assuming viewer is outside the light volume
• Start with stencil = 0
• Render back faces of light volume
– Increment stencil; no color output
• Render front faces
– Only where stencil = 0; write color
• Render back faces
– Clear stencil; no color output
14. Culling with BRDFs
• Pack the culling bit and BRDF together
• Use masks to read/affect required parts
• Assuming 8 supported BRDFs:
Bit:    7 6 5 4 | 3 2 1   | 0
Field:  Unused  | BRDF ID | Culling bit
culling_mask = 0x01
brdf_mask = 0x0E
brdf_shift = 1
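The layout above can be exercised with a few lines of Python. This is only a sketch: the three constants come from the slide, while the helper names (`pack_stencil`, `brdf_id_of`, `is_culled`, `passes_light_test`) are hypothetical and the equality test is one plausible way to combine the two fields in a single stencil comparison.

```python
# Constants taken from the slide; everything else is illustrative.
CULLING_MASK = 0x01
BRDF_MASK = 0x0E
BRDF_SHIFT = 1


def pack_stencil(brdf_id: int, culled: bool = False) -> int:
    """Build a stencil value from a 3-bit BRDF ID and the culling bit."""
    if not 0 <= brdf_id < 8:
        raise ValueError("brdf_id must fit in 3 bits")
    return (brdf_id << BRDF_SHIFT) | (CULLING_MASK if culled else 0)


def brdf_id_of(stencil: int) -> int:
    """Extract the BRDF ID, ignoring every other bit."""
    return (stencil & BRDF_MASK) >> BRDF_SHIFT


def is_culled(stencil: int) -> bool:
    """True when the culling bit is set."""
    return (stencil & CULLING_MASK) != 0


def passes_light_test(stencil: int, brdf_id: int) -> bool:
    """Equality test with a read mask covering both fields:
    passes only for pixels of this BRDF whose culling bit is clear."""
    read_mask = CULLING_MASK | BRDF_MASK
    reference = brdf_id << BRDF_SHIFT  # culling bit left at 0
    return (stencil & read_mask) == (reference & read_mask)
```

With this layout a single masked comparison answers both questions at once: is the pixel unculled, and does it use the shading model of the current pass.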
16. Handling miscellaneous data in stencil
• Stencil value may contain extra data
– Used in earlier / later rendering passes
– Need to ignore it somehow
– Stencil read mask?
• Doesn’t work with the 360’s hi-stencil
Bit:    7 6 5 4 | 3 2 1   | 0
Field:  Garbage | BRDF ID | Culling bit
19. Spanner in the works
Breaks if stencil contains garbage we can’t mask out
20. Handling stencil garbage
• Can’t do it in a non-destructive manner
– Take off and nuke the entire site from orbit
– It’s the only way to be sure
• Extra cleaning pass?
– Don’t want to pay for it!
• Do it as we go!
• Save your stencil if you need it
– Sorry for calling it garbage :`(
– We were already restoring it later on the 360
– Don’t need to destroy it on the PS3, use a read mask!
21. Performance
Platform            1 BRDF    2 BRDFs                 3 BRDFs
360 ( branching )   1.85 ms   2.1 ms ( + 0.25 ms )    2.35 ms ( + 0.5 ms )
360 ( stencil )     1.85 ms   1.99 ms ( + 0.14 ms )   2.13 ms ( + 0.28 ms )
PS3 ( branching )   1.9 ms    2.48 ms ( + 0.58 ms )   2.8 ms ( + 0.9 ms )
PS3 ( stencil )     1.9 ms    2.13 ms ( + 0.23 ms )   2.31 ms ( + 0.41 ms )
For each BRDF
Platform   Initial setup   Mask     Render        Cleanup
360        0.03 ms         0.1 ms   >= 0.036 ms   0.022 ms
PS3        0.03 ms         0.1 ms   >= 0.06 ms    0.14 ms
22. Multi-pass light rendering – final notes
• No change in single-BRDF rendering
– Use your madly optimized routines
• No need for a ‘default’ shading model
– It’s just our use case
– As long as you efficiently find influenced BRDFs
• Flush your hi-stencil
• Tiny lights? Try branching instead.
– Performance figures only from huge lights!
– With tiny lights, hi-stencil juggling becomes inefficient
23. Lighting alpha objects in deferred rendering engines
• Classic solutions:
– Forward rendering.
– CPU based, one light probe per each object.
• Our solution:
– GPU based.
– More than one light probe.
– Calculate a lightmap for each object each frame.
– Used for objects and particle systems.
– Fits perfectly into a deferred rendering pipeline.
24. Our solution for alpha objects
• Object space map:
Every pixel stores the local space position on the object’s surface
Image attribution: Zephyris at en.wikipedia.
25. Our solution for alpha objects
• For each object:
– Use baked positions as light probes
• Transform object space map into world space
– Render lights, reusing deferred shading code
– Accumulate into lightmap
– Render object in alpha pass using lightmap
Image attribution: Zephyris at en.wikipedia.
26. Our solution for particle systems
• Camera oriented quad fitted around and centered in the particle system.
27. Our solution for particle systems
• For each particle system:
– Allocate a texture quad and fill it with interpolated positions as light probes
– Render lights and accumulate into lightmap
– Render particles in alpha pass, converting from clip space to lightmap coordinates.
28. Implementation details
• For performance reasons we pack all position maps to a single texture.
• Every entity that needs alpha lighting will allocate and use a region inside the texture.
World space position
Light maps
29. Integration with deferred rendering
Deferred rendering: Fill G-Buffer (Solid pass) → Render lights → Render alpha
30. Our solution
Fill G-Buffer (Solid pass) → Fill world space light probes position map → Render lights → Render lights using world space light probes map as input and calculate alpha lightmap → Render alpha using alpha lightmap
31. Improvements
• Calculate a second texture with light direction information.
• Other parameterizations for particle systems:
– Dust (one pixel per mote).
– Ribbons (a line of pixels).
• 3D volume slices for particle systems.
– Allocate a region for every slice
– Adds depth to the lighting solution.
36. Questions?
Jose Luis Sanchez Bonet
jose.sanchez@creative-assembly.com
Tomasz Stachowiak
tomasz.stachowiak@creative-assembly.com
twitter: h3r2tic
Editor's Notes
Good news everyone! I'm Tom, this is Jose, and we're going to talk about deferred rendering. The focus is on current generation consoles, but the presented techniques can be used on just about any platform, so we hope anyone can benefit from them.
Deferred rendering has been very popular lately due to its scalability, and because it plays nicely with other techniques, which can reuse the G-Buffer. At the same time, it doesn’t come without downsides. We are going to cover two of them in this presentation, and propose the custom solutions we've developed for our upcoming console title.
The two problems are: handling many shading models, and rendering translucent geometry. I'm going to cover the former in the first half of the presentation, and then Jose will talk about translucency.
In graphics rendering, we use simple mathematical formulas to approximate the look of some classes of surfaces. The most commonly used model, or Bidirectional Reflectance Distribution Function, is Blinn-Phong, which works reasonably well as an approximation of some dielectrics. It is used due to its simplicity, but for the same reason, it cannot reproduce the look of many surfaces accurately. You might want to render your plastics with Blinn, skin with Eric Penner's pre-integrated model, hair with Kajiya-Kay or Marschner, brushed metal with anisotropic Ward, and so on. The visual properties of these surfaces are vastly different, and cannot be covered with just a single, simple mathematical model.
So how do we render with multiple shading models? If you use forward rendering, this is trivial. Because the BRDF is combined with the material in the same shader, it just works.
However, in deferred rendering, we need to evaluate the reflectance model in the light shader, and these don't bear any connection to material shaders that the BRDFs are associated with.
One approach would be to branch in the light shader. That is, the solid pass emits an identifier of the BRDF into the G-Buffer. The light shader reads it and branches upon its value.
This solution might be viable on next-gen hardware, but it doesn't fare well on current consoles. In a small test case we did with a single full-screen light, branching brought the rendering cost from 1.85 to 2.1 milliseconds for just a single extra shading model. This is the tax you pay for not even taking the branch: our test case is synthetic, and only the first BRDF is ever used. And it gets much worse on the PS3, which doesn't even have control flow instructions.
One could also tabulate the BRDF data, and sample it using a combination of an ID and some geometric parameters, such as N dot L and N dot H. One such approach has been used successfully in the game S.T.A.L.K.E.R., so it might be enough for your title as well. The trouble is, BRDFs are highly dimensional functions, so tabulation might be difficult; for example, the data for an isotropic BRDF parameterized by surface roughness is already at least a 3-dimensional function. /* See Michael Ashikhmin’s "Distribution-Based BRDFs". */
We decided to use a single reflectance model for most of our scene geometry, and then special-case rendering in rare instances, such as skin and hair.
The core of the idea is pretty simple: when rendering the solid pass, we store the ID of the shading model in the stencil buffer. Then in the lighting pass, we draw light geometry once for each BRDF, using the ID as a mask.
Implemented like this, the idea would be inefficient. We would be multiplying the number of draw calls and shader switches by the number of supported BRDFs. However, when rendering a light, we can detect which BRDFs it can potentially use, and skip any extra processing. If you think about it, this is a very similar idea to multi-pass forward rendering.
Here's a scene with two objects, each using a different shading model. We have two lights influencing them. The light on the left is interesting, in that it will affect just one object, hence only one BRDF. Therefore it doesn't need to run the multi-BRDF code path at all.
To accomplish this optimization, we store the bounding boxes of all objects which use non-standard shading models. During light rendering, we intersect light volumes with these bounds, and conservatively find a list of all BRDFs which a light may potentially touch.
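A minimal sketch of that conservative classification, assuming spherical light volumes and axis-aligned bounding boxes. The types (`Aabb`, `SphereLight`), the function names, and the choice to always include a default BRDF with ID 0 are illustrative assumptions, not the presenters' code.

```python
from dataclasses import dataclass

DEFAULT_BRDF = 0  # assumption: ID 0 is the scene-wide default shading model


@dataclass(frozen=True)
class Aabb:
    lo: tuple  # (x, y, z) minimum corner
    hi: tuple  # (x, y, z) maximum corner


@dataclass(frozen=True)
class SphereLight:
    center: tuple
    radius: float


def sphere_intersects_aabb(light: SphereLight, box: Aabb) -> bool:
    """Squared distance from the sphere center to the closest point of the box."""
    dist_sq = 0.0
    for c, lo, hi in zip(light.center, box.lo, box.hi):
        nearest = min(max(c, lo), hi)
        dist_sq += (c - nearest) ** 2
    return dist_sq <= light.radius ** 2


def affected_brdfs(light: SphereLight, special_objects) -> list:
    """special_objects: iterable of (brdf_id, Aabb) for non-standard BRDFs.
    Returns a conservative, sorted list of BRDF IDs the light may touch."""
    ids = {DEFAULT_BRDF}  # default geometry is assumed to be everywhere
    for brdf_id, box in special_objects:
        if sphere_intersects_aabb(light, box):
            ids.add(brdf_id)
    return sorted(ids)
```

A light whose list has a single entry can take the ordinary single-BRDF path; only longer lists need the multi-pass route.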
/* We could in theory detect which BRDFs a light may affect and only use dynamic branching there, but then we either always pay a high cost, or we would need to create lots of shader permutations, for example “shading model A and B, A with B and C, A with C, B with C, et cetera.” For this reason we are just going to use multi-pass rendering. */
Now, there are two more bits to the algorithm, needed to make it practical. Firstly, it needs to work with the commonly used stencil and depth-based light culling trick. Secondly, it must play well with the hierarchical stencil buffer.
Let's start with a quick reminder of depth culling for lights. Consider a surface rendered into the G-buffer, and three lights. The left one is completely in front of the surface, so cannot influence it. The right one is behind the surface, so cannot influence it either. Only the middle one contributes to lighting, because its volume intersects the surface in the G-buffer.
So how do we accomplish that using stencil testing? Let's consider the case when the viewer is outside of the light's volume. The stencil is initially clear.
We start by writing the value of one into the stencil by using the back faces of the light volume. This will result in the stencil being set where the light is completely in front of the surface. Therefore we only want to render where the stencil is zero …
… and we do so using the front faces with stencil testing enabled.
Note that this is a vanilla version of the algorithm, and you may be using an optimized one.
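The passes can be mimicked on the CPU to check the logic. The sketch below is a per-pixel simulation under stated assumptions: smaller depth means closer to the viewer, the depth test is "less", the viewer is outside the light volume, and every listed pixel lies inside the light's screen footprint. The function name `cull_light` is hypothetical.

```python
def cull_light(scene_depth, light_front, light_back):
    """Simulate vanilla stencil/depth light culling for one light.

    Each argument is a list with one depth per pixel (smaller = closer).
    Returns (lit, stencil): which pixels get shaded, and the stencil
    contents after the final cleanup pass.
    """
    count = len(scene_depth)
    stencil = [0] * count

    # Pass 1: back faces, no color. Where the back face passes the depth
    # test the whole volume is in front of the surface, so mark the pixel.
    for i in range(count):
        if light_back[i] < scene_depth[i]:
            stencil[i] += 1

    # Pass 2: front faces, color on. Shade only unmarked pixels whose
    # front face passes the depth test (volume not entirely behind).
    lit = [
        stencil[i] == 0 and light_front[i] < scene_depth[i]
        for i in range(count)
    ]

    # Pass 3: back faces again, no color. Clear the stencil so the next
    # light starts from zero.
    for i in range(count):
        stencil[i] = 0

    return lit, stencil
```

For example, with a surface at depth 10 and three lights spanning depths 2 to 4, 8 to 12, and 14 to 18, only the middle one ends up shading its pixel.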
Extending this idea to selectively rendering multiple shading models, we need to pack both the culling bit and the shading model identifier in the stencil buffer.
Because stencil testing supports read and write masks, we can act upon and affect portions of the stencil value.
Here’s a sample layout assuming a maximum of eight supported BRDFs. Note that the BRDF bits can be placed at any offset in the byte.
OK, let's get down to the actual rendering passes. First of all, we will be using the hierarchical stencil buffer, so that the GPU may reject entire rasterization tiles. This is where the bulk of our time savings actually comes from, as the regular stencil test happens after you’ve already paid the pixel shading cost.
We start the same as with just depth-based culling. We draw back faces of the light volume with the stencil set to Increment. Once again, this marks areas we don’t want to render to. At this point, we have determined the list of BRDFs the light can potentially influence. For each of them, we create a hi-stencil mask first, then we render the volume again with the actual shader. Creating the mask is fairly cheap, so even though we render twice, we typically save time by hi-stencil culling the expensive shader.
Finally, the last step restores the affected stencil area, so that the next light can render.
We have been assuming that the stencil values are clear of any unrelated data. Yet in practice, they will carry multiple meanings, and rendering engines will have their own 'magic' stencil encodings. /* One example would be using a single bit of stencil to mask out dynamic objects from being affected by deferred decals. */ Unfortunately, such extra bits turn out to be garbage from the point of view of the proposed algorithm, and we cannot simply ignore them with read masks, at least not on the Xbox 360.
Let's take a look at the stencil operation to figure out why.
The GPU first reads the original value and applies a user-specified mask to it. This value is then compared with a reference constant using one of several predicates, such as Greater, Less, Equal, et cetera. Depending on the result of this comparison as well as the depth test, an operator may be applied to the stencil value, such as incrementing or zeroing it. Finally, the resulting value is written back into the stencil buffer.
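The same sequence can be written out as a small software model. This is a sketch of the regular stencil test as exposed by common graphics APIs, not of any console's hardware: it assumes the read mask is applied to both the stored value and the reference, that the comparison is ordered as `(ref & mask) FUNC (stencil & mask)`, and that a write mask limits which bits are updated. All names are illustrative.

```python
import operator

# Predicates take (masked reference, masked stored value).
COMPARE = {
    "never": lambda ref, value: False,
    "always": lambda ref, value: True,
    "equal": operator.eq,
    "not_equal": operator.ne,
    "less": operator.lt,
    "greater": operator.gt,
}

# Operators take (stored value, reference) and return the new 8-bit value.
STENCIL_OPS = {
    "keep": lambda value, ref: value,
    "zero": lambda value, ref: 0,
    "replace": lambda value, ref: ref,
    "increment": lambda value, ref: min(value + 1, 0xFF),  # saturating
}


def stencil_test(stored, ref, read_mask, func, depth_passed,
                 fail_op, zfail_op, pass_op, write_mask=0xFF):
    """Return (fragment_passes, new_stencil_value) for one pixel."""
    stencil_passed = COMPARE[func](ref & read_mask, stored & read_mask)

    # Pick the operator from the outcome of the stencil and depth tests.
    if not stencil_passed:
        op = fail_op
    elif not depth_passed:
        op = zfail_op
    else:
        op = pass_op

    result = STENCIL_OPS[op](stored, ref) & 0xFF
    # Only bits enabled in the write mask are replaced.
    new_value = (stored & ~write_mask & 0xFF) | (result & write_mask)
    return stencil_passed and depth_passed, new_value
```

With a stored value of 0xAA (high nibble holding unrelated data, BRDF ID 5, culling bit clear), an Equal test against reference 0x0A passes with read mask 0x0F and fails with read mask 0xFF, which is exactly the effect of masking out the extra bits.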
How does the hi-stencil integrate with this pipeline? On the PS3, we get to specify a mask and a comparison function for the hi-stencil test, very much like in the regular one. This means that we can ignore any bits we don’t like. The 360, however, takes its hi-stencil value from the completely opposite end of the pipe, from the final value written back to the stencil buffer. Furthermore, we may only specify a trivial equality or inequality predicate against a reference value.
Unfortunately, this throws a spanner in our hi-stencil mask creation. Since the 360 can only create its mask from the full value, any garbage bits will cause the corresponding tiles to be culled.
Well, if we can’t ignore the extra bits, I say we nuke them from orbit. The easiest way would be to have a separate pass which cleans the stencil buffer, removing any garbage bits. On the other hand, we don't want to add any more fixed cost steps into our rendering, especially at the end of the current hardware generation, when everyone is battling for the last microseconds. Fortunately, we can clear out the garbage bits as we go. When creating the hi-stencil mask, we will set the regular stencil operator to do so, while skipping over the ID of the shading model.
Now, I've been calling these "garbage bits", but you may have good reasons for extra information in your stencil buffer. Chances are that on the 360 you restore them at a later point anyway, due to limited EDRAM resources. On the PS3 we don't need to clobber the bits at all, due to its more flexible hi-stencil buffer creation process.
How’s performance then? Let’s recall the figures from one of the first slides. With the dynamic branching approach, we had to pay a pretty hefty tax, especially on the PS3. How does the proposed algorithm stack up against that? We still pay a slight tax, but only for the lights which render with multiple shading models, and only for the models we actually use.
This is especially important if we support many shading models, but each light affects very few on average. Then we end up paying a considerably smaller cost for the extra shading models.
That's pretty much the whole algorithm. I'd just like to emphasize a few extra points.
First of all, nothing is changed for single-BRDF rendering! If you conservatively figure out that a light only influences geometry with a single reflectance model, you can reuse your old light rendering code!
Secondly, you don't really need to have a 'default' shading model for the whole level. As long as you can quickly classify which BRDFs a light can potentially influence, then you're golden.
Next, remember to flush your hi-stencil when changing the reference value or the comparison function, otherwise you’ll get false culling.
Finally, we’ve only given performance figures for lights taking up a significant portion of the screen. When a light is small and rendered with multiple BRDFs, the cost will be dominated by hi-stencil juggling. It might be worthwhile to use dynamic branching in the light shader below a certain size threshold.
Okay, that’s all for me, now Jose is going to tell you about lighting translucent geometry!
Classic solutions:
Forward rendering. Best quality solution, it calculates lighting for every pixel.
Problems:
Too expensive, especially if a lot of alpha layers are used.
Shader permutation explosion if you want to support a lot of light types and combinations.
Completely different from deferred rendering, so we need to support two pipelines.
We could use Forward+, but we are targeting the Xbox 360 and PS3.
Calculated on the CPU, one light probe (intensity, SH, etc.) for each object.
Problems:
Only one light probe per object, which means the same light configuration across the whole object; this causes a lot of issues with big objects.
It is not easy to support shadow map casting lights.
Our solution:
GPU based.
More than one light probe per object. Quality between the two classic solutions.
It is just a lightmap for every object updated every frame. Lighting is calculated in object space.
It can be used for objects and particle systems.
It fits perfectly into a deferred engine pipeline.
For each alpha object we will create a distribution of light probes on the surface.
Artists will define a UV channel with an unwrapped version of the object (like lightmaps). During export we will create a texture (we call it the object space map; its size will depend on the surface area of the object). Every pixel in the object space map will represent a local space position on the surface of the object.
We convert every probe from the object space map to world space using the world matrix of the object.
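As a rough illustration of that conversion, here is a CPU-side sketch in Python; in the talk this happens on the GPU. The row-major 4x4 matrix with column-vector convention and the function names are assumptions made for the example.

```python
def transform_point(world, point):
    """Apply a 4x4 world matrix (row-major, column-vector convention)
    to a local-space position, treating it as (x, y, z, 1)."""
    x, y, z = point
    v = (x, y, z, 1.0)
    return tuple(
        sum(world[row][col] * v[col] for col in range(4))
        for row in range(3)
    )


def probes_to_world(world, object_space_map):
    """object_space_map: 2D list of local-space positions, one per texel.
    Returns a same-shaped 2D list of world-space probe positions."""
    return [
        [transform_point(world, texel) for texel in row]
        for row in object_space_map
    ]
```

The output has the same layout as the object space map, so the UV channel used to bake the positions can later address the lightmap directly.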
Render lights: We render a pass with a shader very similar to the one used in deferred rendering. The input is a texture with world space light probe positions (calculated from the object space map) and the output is a lightmap with the light that the light probes receive. It can reuse a lot of functions from the deferred rendering code, such as shadow mapping.
Render object in alpha pass using lightmap. We use the UV channel for the object space map to access the lightmap.
For each particle system we need a set of light probes distributed around it. As the particles are camera oriented, we are going to use a camera oriented quad fitted around and centered in the particle system.
It is not a perfect representation, but it is really fast and it is simple, and it works in practice.
If the particle system intersects the camera frustum we can just fit our quad, so we can improve the quality when the particle system fills the screen.
To recover the lighting information we just use a 2D matrix that converts from clip space coordinates (our quad is screen space oriented) to lightmap texture space.
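One way such a matrix could be built is sketched below. It assumes the quad is an axis-aligned rectangle in clip space after the perspective divide, that the lightmap region is an axis-aligned rectangle in texture space, and that both axes point the same way (many APIs flip V, which would negate the Y scale). The function names are hypothetical.

```python
def clip_to_lightmap_matrix(quad_min, quad_max, region_min, region_max):
    """Build a 2x3 affine matrix mapping the quad's clip-space rectangle
    onto the particle system's lightmap region."""
    scale_x = (region_max[0] - region_min[0]) / (quad_max[0] - quad_min[0])
    scale_y = (region_max[1] - region_min[1]) / (quad_max[1] - quad_min[1])
    offset_x = region_min[0] - quad_min[0] * scale_x
    offset_y = region_min[1] - quad_min[1] * scale_y
    return ((scale_x, 0.0, offset_x),
            (0.0, scale_y, offset_y))


def apply_matrix(matrix, xy):
    """Transform a 2D clip-space position into lightmap coordinates."""
    x, y = xy
    return (matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
            matrix[1][0] * x + matrix[1][1] * y + matrix[1][2])
```

Each particle vertex or pixel can then feed its clip-space position through this one matrix to find where to sample the accumulated light.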
The two solutions have a lot in common. For performance reasons we pack all the world space position maps into one single texture, so we can calculate the lighting of all the objects at the same time.
Two GPU textures:
Input: World space position texture, similar to the gbuffer in deferred rendering.
Output: Accumulated light.
Every object that needs to calculate lighting will allocate a region inside the textures and fill it with the positions of the light probes. The size of the region can depend on the screen space size of the object to improve performance and scalability.
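The notes do not say how regions are allocated or sized, so the following is purely an illustrative sketch: a simple shelf packer that is reset every frame, plus a hypothetical rule that picks a power-of-two region edge from the fraction of the screen the object covers.

```python
import math


def region_edge(screen_coverage, min_edge=4, max_edge=64):
    """Pick a power-of-two edge length that grows with screen coverage
    (0.0 to 1.0). The square-root scaling is an arbitrary choice."""
    clamped = min(max(screen_coverage, 0.0), 1.0)
    wanted = max_edge * math.sqrt(clamped)
    edge = min_edge
    while edge < wanted and edge < max_edge:
        edge *= 2
    return edge


class ShelfAtlas:
    """Packs rectangles left to right in rows ("shelves")."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.reset()

    def reset(self):
        """Forget all regions; intended to be called once per frame."""
        self.shelf_x = 0
        self.shelf_y = 0
        self.shelf_height = 0

    def allocate(self, w, h):
        """Return (x, y, w, h) for a new region, or None if it does not fit."""
        if w > self.width or h > self.height:
            return None
        if self.shelf_x + w > self.width:
            # Current shelf is full: start a new one below it.
            self.shelf_y += self.shelf_height
            self.shelf_x = 0
            self.shelf_height = 0
        if self.shelf_y + h > self.height:
            return None
        region = (self.shelf_x, self.shelf_y, w, h)
        self.shelf_x += w
        self.shelf_height = max(self.shelf_height, h)
        return region
```

The same region rectangle would be used in both the position texture and the accumulated-light texture, so one allocation serves as both the input and output address for an object.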
To improve performance, we check every light against every object on the CPU, so we only apply the light shader to the regions that are inside the light.
Deferred rendering engine.
Fill gbuffer
Render lights
Render alpha
Added two extra steps in our deferred engine.
Having light direction information will allow bump mapping, occlusion and scattering effects.
For performance reasons, we can disable 3D volume slices when the particle system is far from the camera.
Thanks to Howard Rayner, our technical artist and vfx magician for preparing these demos!