We drop the view space normal z and assume that it’s always negative. This is technically incorrect, but no artefacts were visible and everything still looked great, so we went with it.
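A rough CPU-side sketch of what dropping z implies when the normal is read back, assuming the x and y components are stored as 8-bit values remapped from [-1, 1]. The names `EncodeSnorm8`, `DecodeSnorm8` and `ReconstructNormal` are hypothetical and not from the original shaders; the only thing taken from the talk is the assumption that z is negative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

// Hypothetical helper: remap a [-1, 1] component to one 8-bit channel.
inline std::uint8_t EncodeSnorm8(float v) {
    float c = std::min(1.0f, std::max(-1.0f, v));
    return static_cast<std::uint8_t>(std::lround((c * 0.5f + 0.5f) * 255.0f));
}

// Hypothetical helper: inverse of EncodeSnorm8, up to quantisation error.
inline float DecodeSnorm8(std::uint8_t b) {
    return (static_cast<float>(b) / 255.0f) * 2.0f - 1.0f;
}

// Rebuild a unit normal from the two stored channels.
// z is recovered from x and y and assumed to be negative, as in the talk.
inline Vec3 ReconstructNormal(std::uint8_t r, std::uint8_t g) {
    float x = DecodeSnorm8(r);
    float y = DecodeSnorm8(g);
    // Quantisation can push x*x + y*y slightly above 1, so clamp before the sqrt.
    float zz = std::max(0.0f, 1.0f - x * x - y * y);
    return { x, y, -std::sqrt(zz) };
}
```

Any normal whose true z is positive comes back mirrored, which is the "technically incorrect" part.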
Here’s some code to pack and unpack a depth value into and from two 8-bit components. We’re assuming the depth value is in between 0 and 1, and it really needs to be linear for this compression method. So just multiply and divide by the far clip value. As you can see, the x component contains the hi-bits, and the y-component the lo.
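The slide code for this is not in the transcript, so here is a minimal integer-based sketch of the same idea on the CPU: quantise linear depth to 16 bits and split it into a hi byte and a lo byte. `PackedDepth`, `PackDepth` and `UnpackDepth` are hypothetical names, and a shader version would work on normalised floats rather than integers.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

struct PackedDepth { std::uint8_t hi, lo; };

// Divide by the far clip to get linear depth in [0, 1], then quantise to 16 bits.
inline PackedDepth PackDepth(float viewDepth, float farClip) {
    float d = std::min(1.0f, std::max(0.0f, viewDepth / farClip));
    std::uint32_t q = static_cast<std::uint32_t>(std::lround(d * 65535.0f));
    return { static_cast<std::uint8_t>(q >> 8),      // hi bits
             static_cast<std::uint8_t>(q & 0xFFu) }; // lo bits
}

// Recombine the two bytes and multiply by the far clip to get view depth back.
inline float UnpackDepth(PackedDepth p, float farClip) {
    std::uint32_t q = (static_cast<std::uint32_t>(p.hi) << 8) | p.lo;
    return (static_cast<float>(q) / 65535.0f) * farClip;
}
```

With 16 bits the round-trip error is at most about FarClip / 65535, which is why the depth needs to be linear for this to hold up.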
Here’s some code to reconstruct the view space position from texture coordinates (vUV) and depth (fDepth, which is now in the [0, FarClip] range). Obviously, this is really fast as it involves a mad, mul and maybe a mov too. The scale factor moves the texture coordinates first into screen space (so in the [-1, 1] range) and then scales by the inverse projection matrix x and y scale values. (This of course assumes you have a simple projection matrix, but you do in 99% of cases). In some circumstances you can even move the scaling to the vertex shader, which means this whole thing is just one mul making it very, very fast indeed.
Especially since on the PLAYSTATION®3 we had some free SPU time, which we’d be best off using for something.
Don’t need a full run-down, everyone’s heard it all before...
Never read from GDDR3... Put things in the correctly mapped memory.
Data is brought into memory through explicit DMA. Limited local store, not enough for the frame buffer.
The frame time using the same techniques as the Xbox 360 was simply too long – lighting was a big number. Image quality could not be compromised: no reduction in the number of dynamic lights or their radii. No weird hacks in the lighting which break the accuracy even more... Lighting is already hacky enough. Deferring as in [Swoboda09] was not an option for us. Blur runs at FULL 720p with MSAA on both platforms.
Before doing anything, look at what you’re actually trying to achieve. Data is really, really important. Stream processing is hugely scalable, and since we’re porting from the GPU, the data has already been thought about.
So it’s pretty obvious what our inputs and outputs are here... We want to produce a buffer containing light. We have: normals, depth and lights.
Originally, normals and depth were in separate buffers. We sacrificed depth precision for a better data layout, which helped the GPU too on all platforms. This means no stencil buffer access, but we can get round that as we’ll see later.
As mentioned, we’re moving from a stream processing model which is already highly suited to parallelisation. Obvious benefits to moving over multiple cores. Begs the obvious questions: What size should the unit of work be? Too small: too much overhead, sub-optimal use of DMA engine. Too large: not enough parallelism...
Frame buffers are not linear. RSX processes pixels in quads, so it makes no sense to store the frame buffer linearly for that reason alone. Sampling from the buffer would present similar problems for cache performance if the buffer were linear. Unit size is based on the format of the RSX-generated buffer.
It should be a goal of a multi-threaded system to avoid sync points, though it’s sometimes tricky to do. In this case it’s easy since the SPU has an ATO. Explain ATO -- lets you atomically modify an ATO cache line, use this to move the index. Fetch the tile corresponding to the index into local store for processing. Small cache: 4 cache lines.
Explain multi-buffering:
+ trade memory for speed
+ no contention for LS (DMA engine trumps SXU)
+ triple buffer means: 1 in, 1 out, 1 being processed
+ 2us of time spent waiting for data...
Our multi-buffering strategy: avoid contention for LS between MFC and SXU.
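To make the "1 in, 1 out, 1 being processed" rotation concrete, here is a small serial simulation. It is only a sketch: plain assignments stand in for the asynchronous DMA transfers, a single `int` stands in for a tile, doubling stands in for the lighting work, and `BufferRoles`, `RolesForStep` and `RunPipeline` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct BufferRoles { int in, work, out; };

// At step s we light buffer s % 3, fetch the next tile into (s + 1) % 3,
// and write out the previous tile from (s + 2) % 3.
inline BufferRoles RolesForStep(int step) {
    return { (step + 1) % 3, step % 3, (step + 2) % 3 };
}

inline std::vector<int> RunPipeline(const std::vector<int>& tiles) {
    std::array<int, 3> localStore{};          // three tile-sized buffers
    std::vector<int> output(tiles.size());
    const int n = static_cast<int>(tiles.size());
    if (n > 0) localStore[0] = tiles[0];      // prime: fetch the first tile

    for (int s = 0; s < n; ++s) {
        const BufferRoles r = RolesForStep(s);
        if (s + 1 < n) localStore[r.in] = tiles[s + 1]; // stand-in for "DMA in"
        if (s > 0) output[s - 1] = localStore[r.out];   // stand-in for "DMA out"
        localStore[r.work] *= 2;                        // stand-in for lighting
    }
    if (n > 0) output[n - 1] = localStore[(n - 1) % 3]; // flush the last tile
    return output;
}
```

Because the three roles never share a buffer within a step, the transfers and the processing never touch the same memory, which is the point of paying for the third buffer.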
Pre-computation is good, costs a little extra memory per light, but we can load anything from this structure in 6 cycles as it’s 16 byte aligned.
Reads and writes to and from LS are always 16 byte aligned and sized. Smaller read-modify-writes are done with a load-shuffle-store pattern, which is less than efficient. Large wins (especially in pixel writing code) from batching in 16 byte chunks... This means writing 4 pixels at once in our case.
Program limited by instructions issued, especially in the inner loop. If we have dual-issue it’s wise to use it: it can increase program speed two-fold by doubling the rate of instruction issue. Needs the right type of instructions and no dependencies. No need for C programmers to worry about alignment, the compiler does a reasonable job of placing nop/lnop to align for dual-issue. More dual-issue can usually be achieved with larger batches if you have a reasonable balance of work (i.e. not all even pipe). Costs register file, but usually worth it assuming no spillage.
Batch size is important; batch means the number of pixels lit in your inner loop. Too large and you get register spilling. Too small and you pay increased setup cost, and have wasted cycles in your inner loop. Perfect balance is no spilling + no stalled cycles. This will be linked to the complexity of your lighting equation; for us this number was ~16 pixels.
We don’t have access to the stencil buffer, so we can’t use that to cull. Instead we use the depth buffer – maintain min and max depth for a tile; if they are both at far clip then we can discard the tile (in practice almost as good as the stencil buffer). We cull more granularly than the tile size which is our “unit of work”. Your mileage may vary.
Need to be careful at which level you branch. Discard a tile rather than a pixel; branching at a low level is a sure-fire performance killer.
Synchronisation can be really painful. We had a few hard-to-find bugs synchronising between RSX and SPU, worth the pain in the end though. Must focus on the bigger picture too!! Micro-optimisation is fun, but we must be aware of our impact on the system at large – very hard to predict this type of thing up front.
A Bizarre Way to do Real-Time Lighting
A Bizarre Way to do Real-Time Lighting<br />Stephen McAuley & Steven Tovey<br />Graphics Programmers, Bizarre Creations Ltd.<br />firstname.lastname@example.org<br />email@example.com<br />http://www.bizarrecreations.com/<br />
“Welcome, I think not!”<br />Let us start by wishing you a good bonfire night!<br />
Agenda<br />A sneak preview of Blur<br />Light Pre-Pass Rendering<br />10 Step Guide to free Lighting on PS3<br />The Future...<br />
Blur<br />Coming 2010 on X360, PS3 and PC.<br />Twenty cars on track for intense wheel-to-wheel racing.<br />Exciting power-ups bring depth and strategy to racing.<br />Real-world cars and locations, set between dusk and dawn.<br />Extensive multiplayer options.<br />
Technical Analysis<br />So, we have twenty cars, racing around a track in the dark…<br />…they all have headlights, rear lights, brake lights…<br />…not to mention any other effects we might have going on around the track…<br />…therefore, we need some sort of real-time lighting solution.<br />
Light Pre-Pass<br />Many people came up with this… so you know it’s good!<br />Given its name by [Engel08].<br />Credits also due to [Balestra08].<br />Half-way between traditional and deferred rendering.<br />
Step #1: Render Pre-Pass<br />Render scene normals and depth.<br />We pack view space normals and depth into one RGBA8 surface:<br />R: normal x<br />G: normal y<br />B: depth hi<br />A: depth lo<br />This means all the info we need is in one texture, not two!<br />It’s also faster to calculate view space position than world space position.<br />
Step #1: Render Pre-Pass<br />Get view space position from texture coordinates and depth:<br />float3 vPosition<br /> = float3(g_vScale.xy * vUV + g_vScale.zw, 1.f)<br /> * fDepth;<br />fDepth is in the [0, FarClip] range.<br />g_vScale moves vUV into the [-1, 1] range and scales by inverse projection matrix values.<br />In some circumstances, possible to move this to the vertex shader.<br />
Step #1: Render Pre-Pass<br />Normal X, Normal Y, Depth Hi, Depth Lo<br />Normals X & Y<br />Depth Hi & Lo<br />
Step #1: Render Pre-Pass<br />Some good advice: at this stage, it’s really best to render only what you need…<br />So don’t render geometry that isn’t affected by real-time lights!<br />Why not also try bringing in the far clip plane?<br />We also don’t render the very, very vertex-heavy cars.<br />They get their real-time lighting from a spherical harmonic. Doesn’t look too bad! <br />
Step #2: The Lighting<br />We render the lighting to an RGBA8 texture.<br />Lighting is in [0, 1] range.<br />We just about got away with range and precision issues.<br />Two types of lights:<br />Point lights<br />Spot lights<br />
Step #2: Point Lights<br />First up, it’s the point lights turn.<br />Let’s copy [Balestra08] and render them tiled.<br />Split the screen into tiles:<br />Big savings!<br />Save on fill rate.<br />Minimise overhead of unpacking view space position and normal.<br />for each tile<br /> gather affecting lights<br /> select shader<br /> render tile<br />end<br />
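The "gather affecting lights" step of the loop above could look something like this on the CPU. It is a minimal sketch that assumes each light has already been reduced to a screen-space bounding rectangle; `ScreenRect`, `Light`, `Overlaps` and `GatherLightsPerTile` are hypothetical names, and the shader selection and tile rendering are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Pixel bounds, inclusive of (x0, y0) and exclusive of (x1, y1).
struct ScreenRect { int x0, y0, x1, y1; };

// Assumption: the light's projected bounds have been computed elsewhere.
struct Light { ScreenRect bounds; };

inline bool Overlaps(const ScreenRect& a, const ScreenRect& b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Returns, for every tile in row-major order, the indices of the lights touching it.
inline std::vector<std::vector<std::size_t>> GatherLightsPerTile(
    const std::vector<Light>& lights, int width, int height, int tileSize) {
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<std::size_t>> perTile(
        static_cast<std::size_t>(tilesX) * static_cast<std::size_t>(tilesY));

    for (int ty = 0; ty < tilesY; ++ty) {
        for (int tx = 0; tx < tilesX; ++tx) {
            // Clip the last row and column of tiles to the screen.
            const ScreenRect tile{ tx * tileSize, ty * tileSize,
                                   std::min((tx + 1) * tileSize, width),
                                   std::min((ty + 1) * tileSize, height) };
            auto& list = perTile[static_cast<std::size_t>(ty) * tilesX + tx];
            for (std::size_t i = 0; i < lights.size(); ++i) {
                if (Overlaps(tile, lights[i].bounds)) list.push_back(i);
            }
        }
    }
    return perTile;
}
```

Tiles with an empty list can be skipped entirely, which is where the fill rate saving comes from.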
Step #2: Point Lights<br />Optimise: mask out the sky in the stencil buffer.<br />
Step #2: Point Lights<br />Real-Time Lighting (Point Lights)<br />
Step #2: Spot Lights<br />Next, it’s the spot lights.<br />Three different types:<br />Bog standard.<br />2D projected texture.<br />Volume texture.<br />Render as volumes.<br />A cone for the bog-standard and projected.<br />A box for the volume textured.<br />If they’re big enough on screen, do a stencil test.<br />
Step #2: Spot Lights<br />Render back faces:<br />Colour write disabled<br />Depth test greater-equal<br />Stencil write enabled<br />
Step #2: Spot Lights<br />Render front faces:<br />Colour write enabled<br />Depth test less-equal<br />Stencil test enabled<br />
Step #2: Spot Lights<br />Hold on a minute… what happens if the camera goes inside the light volume?<br />Rendering the front faces doesn’t work any more…<br />
Step #2: Spot Lights<br />Worst case scenario! Not only does the light fill the whole screen, but…<br />You just have to bite your tongue and only render back faces.<br />You lose your stencil test. <br />And maybe even early-z too. <br />
Step #2: The Lighting<br />Real-Time Lighting<br />
Step #3: Render the Scene<br />Just do everything as you normally would…<br />Except that you now have a texture containing the real-time lighting for each pixel!<br />But remember to composite it properly…<br />
Step #3: Render the Scene<br />From our lightmaps.<br />The real-time lighting from the texture.<br />half3 vDiffuseLighting =<br /> vStaticLighting.rgb + vDynamicLighting.rgb;<br />half3 vFinalColour =<br />vDiffuseLighting * vAlbedoColour.rgb +<br />vSpecularLighting;<br />You’d probably want to do something clever involving a Fresnel term here.<br />
Real-Time Lighting in Blur<br />Point Lights: pick-ups<br />
Real-Time Lighting in Blur<br />Point Lights: power-up effects<br />
Real-Time Lighting in Blur<br />Spot Lights: headlights<br />
Real-Time Lighting in Blur<br />Spot Lights: start line effects<br />
Great, It Works!<br />But can we make it faster?<br />Deferred lighting is image processing – no rasterization required.<br />See how we draw our point lights.<br />Seems like this suits the PLAYSTATION®3’s SPUs…<br />
PLAYSTATION®3: In Brief<br />Time to switch gears a little bit...<br />So you’ve heard this stuff a million times before... Here are the important takeaway facts:<br />PS3 has 6 SPUs.<br />SPUs are fast!<br />(...Given the right data! )<br />
Goals for PLAYSTATION®3<br />Reduce overall frame latency to acceptable level (<33ms).<br />Preserve picture quality (and resolution).<br />Blur runs @ 720p on X360 and PS3.<br />Preserve lighting accuracy.<br />Lighting and main scene must match:<br />Cars move fast... <br />Deferring the lighting simply not an option, works great in [Swoboda09] though.<br />
Step #1: Look At The Data<br />Data is *really* important! <br />Trivially easy in this case as we’re coming from a stream processing model, but never hurts to understand it anyway.<br />Kinda gives us a small glimpse of DX11 compute shaders . <br />
Step #1: Look At The Data<br />xform<br />Lights<br />
Step #2: Parallelism<br />Stream processing highly suited to parallelisation and we have 6 x SPUs.<br />The obvious question arises: What size should a unit of work be?<br />Answer: Look at the data again!<br />
Step #3: Look At The Data<br />Fun fact: Frame buffers are not usually linear!<br />Many reasons for this (Think filtering and RSX™ quads).<br />Our unit size is closely tied to the internal format of frame buffer produced by the RSX™.<br />Not going to get into the exact formats here, it’s dull and it’s all in the Sony SDK Docs – RTFM!<br />Recommend PhyreEngine for good reference examples.<br />
Step #4: Arbitrating Work<br />Synchronisation points are fail. Keep to an absolute minimum.<br />Solution: Atomics are your friend! <br />Target hardware has an ATO, <br /> Use it, <3 it... <br />Move through data in tiles, tile dictated by an index – DMA into the local store for processing.<br />
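The idea of arbitrating work with a shared atomic index can be sketched in portable C++ as below. This is a host-side analogue using `std::atomic`, not the SPU's atomic unit, and `TileArbiter` is a hypothetical name: each worker calls `Claim()` in a loop and processes whichever tile index it gets back.

```cpp
#include <atomic>

class TileArbiter {
public:
    explicit TileArbiter(int tileCount) : next_(0), count_(tileCount) {}

    // Atomically take the next unclaimed tile index.
    // Returns -1 once every tile has been handed out.
    int Claim() {
        const int tile = next_.fetch_add(1, std::memory_order_relaxed);
        return tile < count_ ? tile : -1;
    }

private:
    std::atomic<int> next_;  // shared index moved forward by every worker
    int count_;              // total number of tiles in the frame
};
```

No worker ever waits on another: the only shared state is the index, so a fast worker simply claims more tiles than a slow one.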
Step #6: Lighting (Balance)<br />Lighting SPU program performance limited by number of instructions issued.<br />Pipeline balance is vital!<br />SPU dual issues if:<br />Correctly aligned within single fetch group.<br />No dependencies.<br />Instructions are for correct pipelines.<br />Luckily, compiler maintained balance quite well with nop/lnop insertion and some instruction re-ordering.<br />Lighting larger batches helps out balance at the cost of register file usage<br />Mileage may vary here again, how bad are you hammering the even pipe?<br />
Step #6: Lighting (Batch II)<br />Fixed setup cost for a single line of our sub-tile size (32 pixels wide).<br />Unfortunately, too many to process at once despite SPU’s massive register file. Loop is pipelined and lots of live variables to multiplex onto register file.<br />Settled for 16 pixels, no spilling.<br />Note: First attempt worked on 4 pixel batches like RSX™. Lots of wasted cycles in inner loop – less dual issue.<br />1 x 32 pixels: register spilling...<br />2 x 16 pixels: happy medium<br />8 x 4 pixels: wasted cycles and increased setup overhead<br />
Step #7: Culling<br />Culling works on more granular sub-tiles. Allows us to potentially reject more tiles (of course, YMMV).<br />(Note: diagram below is an example, it’s not our actual sub-tile size).<br />Similar to GPU, basically a tile is culled if...<br />Max and min depth are both at far clip.<br />No lights intersect the frustum constructed for the tile.<br />Sub-tile<br />
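The two culling conditions above can be sketched as a pair of small functions. The assumptions here are mine: depth is held as a 16-bit value where 0xFFFF means "at far clip" (matching the hi/lo packing), and the count of lights intersecting the sub-tile frustum has been worked out elsewhere. `DepthRange`, `SubTileDepthRange` and `CanCullSubTile` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 16-bit packed depth, where the maximum value means "far clip".
constexpr std::uint16_t kFarClipDepth = 0xFFFF;

struct DepthRange { std::uint16_t min, max; };

// Track min and max depth over the pixels of one sub-tile.
inline DepthRange SubTileDepthRange(const std::vector<std::uint16_t>& depths) {
    DepthRange r{ 0xFFFF, 0 };
    for (std::uint16_t d : depths) {
        r.min = std::min(r.min, d);
        r.max = std::max(r.max, d);
    }
    return r;
}

// A sub-tile is skipped if it is all sky, or if no light touches it.
// The decision is made once per sub-tile, never per pixel.
inline bool CanCullSubTile(const DepthRange& r, std::size_t affectingLightCount) {
    const bool allAtFarClip = (r.min == kFarClipDepth) && (r.max == kFarClipDepth);
    return allAtFarClip || affectingLightCount == 0;
}
```

Keeping the branch at sub-tile level means the inner lighting loop stays branch-free.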
Step #7: Culling<br />Remember, SPUs can execute general purpose code.<br />Take advantage of high-level constructs where they are suitable – this means branches, early-outs, etc.<br />Note: Branches generally suck. Not suitable in lighting inner-loop, discard an entire sub-tile at once.<br />
Step #8: Synchronisation<br />Custom SPURS policy module made RSX™ initiated jobs easy. Our jobs can optionally depend on a 128 byte line written by RSX™ (or PPU, whatever).<br />Non-blocking :<br />Freedom to run other scheduler tasks while waiting.<br />Really should investigate using SPE’s mailboxes to stop us from hammering the bus.<br />Physics team happy again!<br />Not pre-emptive.<br />
Step #8: Synchronisation<br />Can be painful!<br />Expect hard to find bugs here.<br />We had a couple, *ahem* both were other Steve’s fault ;-)<br />Worth it in the end though!<br />Keep an eye on overall timings.<br />Originally lighting pushed out physics.<br />Very easy to forget the bigger picture.<br />Impossible to predict up front.<br />
Step #9: Slotting it in...<br />Ended up running the lighting on 3 SPUs, still easily within our timeframe and no longer pushed the physics out.<br />
Step #10: Profit! <br />SPU implementation faster than RSX™ even without parallelism. (~2-3ms on 3 SPUs).<br />Overall frame latency reduced by up to 25%!<br />More benefits:<br />Blending in alternative colour space becomes trivial.<br />Add value by outputting other useful stuff from SPU program – down-sampled Z buffer anyone? <br />Lighting becomes free*.<br />* - In the strictest computer science sense of the word, ;-).<br />
The Future...<br />MSAA -- Big challenge, but solvable...<br />Experiment with different colour spaces?<br />Remove de-coding step...<br />Upsets my OCD as not really needed for the data transformation –<br />But also allows us to overlap input and output buffers.<br />Specular.<br />Better normals:<br />Ideally higher precision for use in main pass.<br />Fix positive z-component sign assumption.<br />Stereographic Projection<br />Lambert Azimuthal Equal-area Projection et al.<br />
References<br />[Engel08] W. Engel, “Light Pre-Pass Renderer”, http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html, accessed on 4th July 2009 <br />[Balestra08] C. Balestra and P. Engstad, “The Technology of Uncharted: Drake’s Fortune”, GDC2008. <br />[Swoboda09] M. Swoboda, “Deferred Lighting and Post Processing on PLAYSTATION®3”, GDC2009.<br />
Special Thanks!<br />Matt Swoboda and Colin Hughes (SCE R&D) <br />and <br />The Bizarre Creations Core Tech Team<br />
Shameless Plug <br /> Steve and I contributed to this book... It’s out March 2010, you should buy it for your desk, studio library, etc.<br />http://gpupro.blogspot.com<br />
Thanks for Listening!<br />Questions?<br />Check out www.blurgame.com<br />