A Bizarre Way to do Real-Time Lighting

Stephen McAuley & Steven Tovey
Graphics Programmers, Bizarre Creations Ltd.
stephen.mcauley@bizarrecreations.com
steven.tovey@bizarrecreations.com
http://www.bizarrecreations.com/

“Welcome, I think not!”
Let us start by wishing you a good bonfire night!

Agenda
- A sneak preview of Blur
- Light Pre-Pass Rendering
- 10 Step Guide to free Lighting on PS3
- The Future...

Blur
- Coming 2010 on X360, PS3 and PC.
- Twenty cars on track for intense wheel-to-wheel racing.
- Exciting power-ups bring depth and strategy to racing.
- Real-world cars and locations, set between dusk and dawn.
- Extensive multiplayer options.

Technical Analysis
- So, we have twenty cars, racing around a track in the dark…
- …they all have headlights, rear lights, brake lights…
- …not to mention any other effects we might have going on around the track…
- …therefore, we need some sort of real-time lighting solution.

Light Pre-Pass
- Many people came up with this… so you know it’s good!
- Given its name by [Engel08].
- Credits also due to [Balestra08].
- Half-way between traditional and deferred rendering.

Light Pre-Pass
[Diagram: the Light Pre-Pass pipeline. Labels: Geometry, Normals, Depth, Real-Time Lighting, Geometry, Final Colour.]

Light Pre-Pass in Blur
[Image: final image.]

Step #1: Render Pre-Pass
- Render scene normals and depth.
- We pack view space normals and depth into one RGBA8 surface:
    R: normal x, G: normal y, B: depth hi, A: depth lo
- This means all the info we need is in one texture, not two!
- It’s also faster to calculate view space position than world space position.

Step #1: Render Pre-Pass
- Pack depth (note: here fDepth is in [0, 1] range):

    half2 vPackedDepth = half2( floor(fDepth * 255.f) / 255.f,
                                frac(fDepth * 255.f) );

- Unpack depth:

    float fDepth = vPackedDepth.x + vPackedDepth.y * (1.f / 255.f);

Step #1: Render Pre-Pass
- Get view space position from texture coordinates and depth:

    float3 vPosition = float3(g_vScale.xy * vUV + g_vScale.zw, 1.f) * fDepth;

- Here fDepth is in [0, FarClip] range.
- g_vScale moves vUV into [-1, 1] range and scales by inverse projection matrix values.
- In some circumstances, it is possible to move this to the vertex shader.
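One way the scale constant could be built is sketched below in plain C. It assumes a symmetric perspective projection, texture coordinates with v pointing down the screen, and view-space depth increasing along +z; `make_scale` and `view_position` are hypothetical helpers, and the actual contents of g_vScale in Blur are not given on the slide.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } Float4;
typedef struct { float x, y, z; } Float3;

/* Builds a g_vScale-like constant: xy scales the UV, zw offsets it.
 * tan_half_fov_y and aspect are the inverse of the projection matrix
 * diagonal terms for a symmetric perspective projection (assumption). */
static Float4 make_scale(float tan_half_fov_y, float aspect)
{
    assert(tan_half_fov_y > 0.0f && aspect > 0.0f);
    float inv_p00 = tan_half_fov_y * aspect;
    float inv_p11 = tan_half_fov_y;
    /* u in [0,1] -> [-1,1]; v in [0,1] -> [1,-1] (v points down). */
    Float4 s = { 2.0f * inv_p00, -2.0f * inv_p11, -inv_p00, inv_p11 };
    return s;
}

/* Same arithmetic as the shader line: float3(scale.xy * uv + scale.zw, 1) * depth. */
static Float3 view_position(Float4 s, float u, float v, float depth)
{
    Float3 p = { (s.x * u + s.z) * depth, (s.y * v + s.w) * depth, depth };
    return p;
}
```

With this layout the per-pixel cost is one multiply-add on the UV and one multiply by depth, which is the saving the slide refers to.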

Step #1: Render Pre-Pass
[Images: the packed surface (Normal X, Normal Y, Depth Hi, Depth Lo); Normals X & Y; Depth Hi & Lo.]

Step #1: Render Pre-Pass
- Some good advice: at this stage, it’s really best to render only what you need…
- So don’t render geometry that isn’t affected by real-time lights!
- Why not also try bringing in the far clip plane?
- We also don’t render the very, very vertex-heavy cars.
  - They get their real-time lighting from a spherical harmonic. Doesn’t look too bad!

Step #2: The Lighting
- We render the lighting to an RGBA8 texture.
- Lighting is in [0, 1] range.
- We just about got away with range and precision issues.
- Two types of lights:
  - Point lights
  - Spot lights

Step #2: Point Lights
- First up, it’s the point lights’ turn.
- Let’s copy [Balestra08] and render them tiled.
- Split the screen into tiles:

    for each tile
        gather affecting lights
        select shader
        render tile
    end

- Big savings!
  - Save on fill rate.
  - Minimise overhead of unpacking view space position and normal.
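The "gather affecting lights" step of the loop above might look like the following plain C sketch. It assumes each light has already been reduced to a screen-space bounding circle; `ScreenLight`, `Tile` and `gather_lights` are hypothetical names, and the slides do not say how Blur performs this test.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float cx, cy, radius; } ScreenLight; /* hypothetical screen-space bound */
typedef struct { float x0, y0, x1, y1; } Tile;        /* tile rectangle in pixels */

static float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Circle vs. rectangle overlap: compare the radius with the distance
 * from the circle centre to the nearest point of the rectangle. */
static int light_touches_tile(const ScreenLight *l, const Tile *t)
{
    float nx = clampf(l->cx, t->x0, t->x1);
    float ny = clampf(l->cy, t->y0, t->y1);
    float dx = l->cx - nx, dy = l->cy - ny;
    return dx * dx + dy * dy <= l->radius * l->radius;
}

/* Writes the indices of the lights affecting the tile and returns how many. */
static size_t gather_lights(const ScreenLight *lights, size_t count,
                            const Tile *tile, size_t *out, size_t max_out)
{
    assert(lights && tile && out);
    size_t n = 0;
    for (size_t i = 0; i < count && n < max_out; ++i)
        if (light_touches_tile(&lights[i], tile))
            out[n++] = i;
    return n;
}
```

The returned count is what a "select shader" step could key on, for example by choosing a shader variant compiled for that many lights; a tile with a count of zero need not be rendered at all.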

Step #2: Point Lights
- Optimise: mask out the sky in the stencil buffer.

Step #2: Point Lights
[Image: real-time lighting (point lights).]

Step #2: Spot Lights
- Next, it’s the spot lights.
- Three different types:
  - Bog standard.
  - 2D projected texture.
  - Volume texture.
- Render as volumes.
  - A cone for the bog-standard and projected.
  - A box for the volume textured.
- If they’re big enough on screen, do a stencil test.

Step #2: Spot Lights
- Render back faces:
  - Colour write disabled
  - Depth test greater-equal
  - Stencil write enabled

Step #2: Spot Lights
- Render front faces:
  - Colour write enabled
  - Depth test less-equal
  - Stencil test enabled

Step #2: Spot Lights
- Hold on a minute… what happens if the camera goes inside the light volume?
- Rendering the front faces doesn’t work any more…

Step #2: Spot Lights
- You just have to bite your tongue and only render back faces.
- Worst case scenario! Not only does the light fill the whole screen, but…
  - You lose your stencil test.
  - And maybe even early-z too.
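Choosing between the two-pass stencil path and the back-face-only fallback comes down to whether the camera is inside the light volume. The sketch below, in plain C, shows one possible test for a cone; `point_in_cone` and `choose_spot_pass` are hypothetical, and a real implementation would probably also pad the cone by the near plane distance, which is omitted here.

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;

typedef enum {
    SPOT_FRONT_FACES_WITH_STENCIL, /* back faces to stencil, then front faces */
    SPOT_BACK_FACES_ONLY           /* camera inside the volume: no stencil test */
} SpotPass;

static float dot3(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

/* axis must be unit length; the cone starts at apex and extends height along axis. */
static int point_in_cone(Vec3 p, Vec3 apex, Vec3 axis,
                         float height, float cos_half_angle)
{
    assert(height > 0.0f);
    Vec3 v = { p.x - apex.x, p.y - apex.y, p.z - apex.z };
    float along = dot3(v, axis);
    if (along < 0.0f || along > height)
        return 0;
    /* Inside when the angle between v and the axis is at most the half angle:
     * along / |v| >= cos(half angle), compared in squared form. */
    return along * along >= cos_half_angle * cos_half_angle * dot3(v, v);
}

static SpotPass choose_spot_pass(Vec3 camera, Vec3 apex, Vec3 axis,
                                 float height, float cos_half_angle)
{
    return point_in_cone(camera, apex, axis, height, cos_half_angle)
               ? SPOT_BACK_FACES_ONLY
               : SPOT_FRONT_FACES_WITH_STENCIL;
}
```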

Step #2: The Lighting
[Image: real-time lighting.]

Step #3: Render the Scene
- Just do everything as you normally would…
- Except that you now have a texture containing the real-time lighting for each pixel!
- But remember to composite it properly…

Step #3: Render the Scene

    // vStaticLighting: from our lightmaps.
    // vDynamicLighting: the real-time lighting from the texture.
    half3 vDiffuseLighting = vStaticLighting.rgb + vDynamicLighting.rgb;

    // vSpecularLighting: you’d probably want to do something clever
    // involving a Fresnel term here.
    half3 vFinalColour = vDiffuseLighting * vAlbedoColour.rgb +
                         vSpecularLighting;

Real-Time Lighting in Blur
[Image: point lights: brake lights, rear lights.]

Real-Time Lighting in Blur
[Image: point lights: pick-ups.]

Real-Time Lighting in Blur
[Image: point lights: power-up effects.]

Real-Time Lighting in Blur
[Image: spot lights: headlights.]

Real-Time Lighting in Blur
[Image: spot lights: start line effects.]

Great, It Works!
- But can we make it faster?
- Deferred lighting is image processing: no rasterization required.
  - See how we draw our point lights.
- Seems like this suits the PLAYSTATION®3’s SPUs…

PLAYSTATION®3: In Brief
- Time to switch gears a little bit...
- So you’ve heard this stuff a million times before... Here are the important takeaway facts:
  - PS3 has 6 SPUs.
  - SPUs are fast! (...given the right data!)

PLAYSTATION®3: In Brief
[Diagram: six SPUs and the RSX™, with Main Memory (XDR, 256MB) and Graphics Memory (GDDR3, 256MB).]

PLAYSTATION®3: In Brief
[Diagram: an SPE, made up of the SPU (SXU and 256KiB Local Store) and the MFC, alongside Main Memory (256MB) and Graphics Memory (256MB).]

Goals for PLAYSTATION®3
- Reduce overall frame latency to an acceptable level (<33ms).
- Preserve picture quality (and resolution).
  - Blur runs @ 720p on X360 and PS3.
- Preserve lighting accuracy.
  - Lighting and main scene must match: cars move fast...
  - Deferring the lighting by a frame is simply not an option for us; it works great in [Swoboda09] though.

Step #1: Look At The Data
- Data is *really* important!
- Trivially easy in this case as we’re coming from a stream processing model, but it never hurts to understand it anyway.
- Kinda gives us a small glimpse of DX11 compute shaders.

Step #1: Look At The Data
[Diagram: data flow. Surviving labels: xform, Lights.]

Step #2: Parallelism
- Stream processing is highly suited to parallelisation and we have 6 x SPUs.
- The obvious question arises: what size should a unit of work be?
- Answer: look at the data again!

Step #3: Look At The Data
- Fun fact: frame buffers are not usually linear!
- Many reasons for this (think filtering and RSX™ quads).
- Our unit size is closely tied to the internal format of the frame buffer produced by the RSX™.
- Not going to get into the exact formats here, it’s dull and it’s all in the Sony SDK docs. RTFM!
- Recommend PhyreEngine for good reference examples.

Step #4: Arbitrating Work
- Synchronisation points are fail. Keep them to an absolute minimum.
- Solution: atomics are your friend! The target hardware has an atomic unit (ATO). Use it, <3 it...
- Move through the data in tiles, with the current tile dictated by a shared index; DMA each tile into the local store for processing.

Step #4: Arbitrating Work
[Diagram: six SPUs all reading from a single shared tile index.]
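The shared-index scheme can be illustrated with portable C11 atomics standing in for the SPU's atomic unit. This is a sketch under that assumption only; `TileQueue`, `tile_queue_init` and `claim_tile` are hypothetical names, and the real SPU code would go through the MFC rather than `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* One shared counter; every worker claims the next unprocessed tile from it. */
typedef struct {
    atomic_uint next_tile;
    uint32_t    tile_count;
} TileQueue;

static void tile_queue_init(TileQueue *q, uint32_t tile_count)
{
    assert(q);
    atomic_init(&q->next_tile, 0u);
    q->tile_count = tile_count;
}

/* Returns 1 and writes the claimed tile index, or 0 when no tiles remain.
 * The fetch-and-add is the only point of contact between workers, so no
 * worker ever waits on another. */
static int claim_tile(TileQueue *q, uint32_t *tile_out)
{
    uint32_t t = atomic_fetch_add(&q->next_tile, 1u);
    if (t >= q->tile_count)
        return 0;
    *tile_out = t;
    return 1;
}
```

Each worker would loop on `claim_tile` until it returns 0, fetching and lighting the claimed tile in between. Workers that draw cheap tiles simply claim more of them, so the load balances without a scheduler.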

Step #5: Multi-Buffering
- Move data and process data at the same time.
- Costs local store, but usually worth it.
- Different tag group for each buffer.

Step #5: Multi-Buffering
- We used triple-buffering, since we’re decoding the normal/depth buffer.
[Diagram: the MFC moves data from the Normal/Depth Buffer (Main) into local store and out to the Lighting Buffer (Main) while the SXU processes.]
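The buffer rotation can be sketched in plain C as follows. The DMA calls here are synchronous `memcpy` stand-ins, so the sketch shows only the ordering of gets, waits and puts across three buffers, not real overlap; `dma_get`, `dma_put`, `dma_wait`, `process_tile` and `light_tiles` are all hypothetical, and the tile size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TILE_BYTES = 64, NUM_BUFFERS = 3 };

/* Stand-ins for asynchronous DMA; on real hardware these return at once
 * and dma_wait blocks on the tag group given to each buffer. */
static void dma_get(void *ls, const void *mem, size_t n, int tag) { (void)tag; memcpy(ls, mem, n); }
static void dma_put(const void *ls, void *mem, size_t n, int tag) { (void)tag; memcpy(mem, ls, n); }
static void dma_wait(int tag) { (void)tag; }

/* Placeholder for decode + lighting, done in place on the local buffer. */
static void process_tile(uint8_t *tile)
{
    for (size_t i = 0; i < TILE_BYTES; ++i)
        tile[i] = (uint8_t)(255 - tile[i]);
}

static void light_tiles(const uint8_t *src, uint8_t *dst, size_t tiles)
{
    static uint8_t ls[NUM_BUFFERS][TILE_BYTES];   /* local store buffers */
    assert(src && dst);
    if (tiles == 0)
        return;
    dma_get(ls[0], src, TILE_BYTES, 0);           /* prime the first buffer */
    for (size_t t = 0; t < tiles; ++t) {
        int cur  = (int)(t % NUM_BUFFERS);
        int next = (int)((t + 1) % NUM_BUFFERS);
        if (t + 1 < tiles) {
            dma_wait(next);                       /* its old put (tile t-2) must be done */
            dma_get(ls[next], src + (t + 1) * TILE_BYTES, TILE_BYTES, next);
        }
        dma_wait(cur);                            /* tile t has arrived */
        process_tile(ls[cur]);
        dma_put(ls[cur], dst + t * TILE_BYTES, TILE_BYTES, cur);
    }
    for (int b = 0; b < NUM_BUFFERS; ++b)
        dma_wait(b);                              /* drain outstanding puts */
}
```

With three buffers, one can be receiving the next tile and one can still be writing out the previous result while the third is being processed, which is why processing in place needs a third buffer rather than two.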

Step #6: Lighting (SOA)
- SOA is basically a transpose of the obvious layout:

    AOS: XYZW XYZW XYZW XYZW
    SOA: XXXX YYYY ZZZZ WWWW

- AOS, 1 x square length (~18 cycles):

    qword dot_xx = si_fm(v, v);
    qword dot_xx_r4 = si_rotqbyi(dot_xx, 4);
    dot_xx = si_fa(dot_xx, dot_xx_r4);
    qword dot_xx_r8 = si_rotqbyi(dot_xx, 8);
    dot_xx = si_fa(dot_xx, dot_xx_r8);
    return si_to_float(dot_xx);

- Vs. SOA, 4 x square lengths (~12 cycles):

    qword dot_x = si_fm(x, x);
    qword dot_y = si_fma(y, y, dot_x);
    qword dot_z = si_fma(z, z, dot_y);
    return dot_z;
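The same SOA computation can be written in portable C to make the data layout explicit. This is only an illustration of the layout, assuming four vectors per batch; `Vec3x4` and `length_sq4` are hypothetical names.

```c
#include <assert.h>

/* Four 3D vectors stored structure-of-arrays: all x together, then y, then z. */
typedef struct {
    float x[4];
    float y[4];
    float z[4];
} Vec3x4;

/* Mirrors the fm / fma / fma sequence: each lane i computes the squared
 * length of vector i, so one pass yields four results with no shuffles. */
static void length_sq4(const Vec3x4 *v, float out[4])
{
    assert(v && out);
    for (int i = 0; i < 4; ++i) {
        float acc = v->x[i] * v->x[i];
        acc = v->y[i] * v->y[i] + acc;
        acc = v->z[i] * v->z[i] + acc;
        out[i] = acc;
    }
}
```

Because every lane does identical work on independent data, there is no need for the rotate-and-add steps that the AOS version spends on summing across a single register.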

Step #6: Lighting (SOA)
- Pre-transpose lighting data, splat values across entire qword.
- 16 byte aligned, single lqd.

    struct light
    {
        float m_x[4];             // 4 copies of world-space X, one in each element of the array
        float m_y[4];
        float m_z[4];
        float m_inv_radius_sq[4]; // never actually used radius, pre-compute (1/radius)^2
        float m_colour_r[4];
        float m_colour_g[4];
        float m_colour_b[4];
    };
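Filling such a structure on the CPU side might look like the plain C sketch below. `SplatLight`, `splat4`, `light_splat` and `falloff` are hypothetical, and the falloff curve (1 - d²/r², clamped at zero) is an assumption chosen only to show why the inverse squared radius is the useful quantity to store; the slides do not give Blur's attenuation function.

```c
#include <assert.h>

/* Same fields as the slide's struct; 16-byte alignment so each array
 * can be fetched with a single quadword load. */
typedef struct {
    _Alignas(16) float m_x[4];
    float m_y[4];
    float m_z[4];
    float m_inv_radius_sq[4];
    float m_colour_r[4];
    float m_colour_g[4];
    float m_colour_b[4];
} SplatLight;

static void splat4(float dst[4], float v)
{
    for (int i = 0; i < 4; ++i)
        dst[i] = v;
}

/* Pre-transposes one light: every scalar is copied into all four lanes. */
static void light_splat(SplatLight *l, float x, float y, float z, float radius,
                        float r, float g, float b)
{
    assert(l && radius > 0.0f);
    splat4(l->m_x, x);
    splat4(l->m_y, y);
    splat4(l->m_z, z);
    splat4(l->m_inv_radius_sq, 1.0f / (radius * radius));
    splat4(l->m_colour_r, r);
    splat4(l->m_colour_g, g);
    splat4(l->m_colour_b, b);
}

/* Assumed falloff: needs only the squared distance, so no square root
 * or division is required per pixel. */
static float falloff(float dist_sq, float inv_radius_sq)
{
    float a = 1.0f - dist_sq * inv_radius_sq;
    return a > 0.0f ? a : 0.0f;
}
```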

Step #6: Lighting (Batch I)
- qword everywhere.
- Batch reads and writes into 16 byte chunks.
  - Read 4 pixels from normal/depth.
  - Write 4 pixels to lighting buffer.

    qword depth_addr = si_from_ptr(depth_buf);
    qword depth0 = si_lqd(depth_addr, 0x00);
    qword depth1 = si_lqd(depth_addr, 0x10);
    qword depth2 = si_lqd(depth_addr, 0x20);
    qword depth3 = si_lqd(depth_addr, 0x30);

    qword clmp0 = si_cfltu(diffuse0, 0x20);
    qword clmp1 = si_cfltu(diffuse1, 0x20);
    qword clmp2 = si_cfltu(diffuse2, 0x20);
    qword clmp3 = si_cfltu(diffuse3, 0x20);

    qword r = si_ila(0x8000);
    qword scl = si_ilh(0xff00);
    dif0 = si_mpyhhau(clmp0, scl, r);
    dif1 = si_mpyhhau(clmp1, scl, r);
    dif2 = si_mpyhhau(clmp2, scl, r);
    dif3 = si_mpyhhau(clmp3, scl, r);

    const vector unsigned char _shuf_uint = {
        0xc0, 0x00, 0x04, 0x08,
        0xc0, 0x10, 0x14, 0x18,
        0xc0, 0x00, 0x04, 0x08,
        0xc0, 0x10, 0x14, 0x18 };
    qword shuf_ = (const qword)_shuf_uint;

    qword base_add = si_from_ptr(pResult);
    qword p0_1 = si_shufb(dif0, dif1, shuf_);
    qword p0_2 = si_shufb(dif2, dif3, shuf_);
    qword pix0 = si_selb(p0_1, p0_2, m_00ff);
    si_stqd(pix0, base_add, 0x0);
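In scalar terms, the conversion and packing stage above clamps each lighting channel to [0, 1], scales it to 8 bits and assembles a 32-bit pixel. The plain C sketch below shows that idea for one pixel. The 0xAARRGGBB byte order with alpha forced to 0xFF is an assumption read off the shuffle mask, round-to-nearest is assumed, and `to_unorm8` and `pack_lighting` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to [0, 1] and convert to an 8-bit value, rounding to nearest. */
static uint8_t to_unorm8(float v)
{
    if (v < 0.0f) v = 0.0f;
    if (v > 1.0f) v = 1.0f;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Packs one lit pixel; alpha is forced to 0xFF (assumed layout). */
static uint32_t pack_lighting(float r, float g, float b)
{
    return 0xFF000000u
         | ((uint32_t)to_unorm8(r) << 16)
         | ((uint32_t)to_unorm8(g) << 8)
         |  (uint32_t)to_unorm8(b);
}
```

The SPU version does the same work for four pixels at a time and finishes with one 16 byte store, which is what "qword everywhere" buys.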

Step #6: Lighting (Balance)
- Lighting SPU program performance is limited by the number of instructions issued.
- Pipeline balance is vital!
- SPU dual issues if:
  - Correctly aligned within a single fetch group.
  - No dependencies.
  - Instructions are for the correct pipelines.
- Luckily, the compiler maintained balance quite well with nop/lnop insertion and some instruction re-ordering.
- Lighting larger batches helps out balance at the cost of register file usage.
- Mileage may vary here again: how badly are you hammering the even pipe?

Step #6: Lighting (Batch II)
- Fixed setup cost for a single line of our sub-tile size (32 pixels wide).
- Unfortunately, that is too many pixels to process at once despite the SPU’s massive register file. The loop is pipelined, with lots of live variables to multiplex onto the register file.
- Settled for 16 pixels, no spilling.
- Note: first attempt worked on 4 pixel batches like RSX™. Lots of wasted cycles in the inner loop, and less dual issue.
[Diagram: one batch of 32 pixels (register spilling...); two batches of 16 pixels (happy medium); eight batches of 4 pixels (wasted cycles and increased setup overhead).]

Step #7: Culling
- Culling works on more granular sub-tiles. This allows us to potentially reject more tiles (of course, YMMV).
- Similar to GPU, basically a tile is culled if...
  - Min and max depth are both at the far clip.
  - No lights intersect the frustum constructed for the tile.
[Diagram: a tile divided into sub-tiles. Note: the diagram is an example, it’s not our actual sub-tile size.]
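The two rejection rules can be sketched in plain C as below. For brevity the frustum test is reduced to the sub-tile's depth extent only; the side planes of the per-tile frustum are omitted, so this is a weaker test than the one described. `SubTile`, `ViewLight` and `subtile_culled` are hypothetical names, and depth is assumed to increase away from the camera.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float min_depth, max_depth; } SubTile;   /* view-space depth range */
typedef struct { float x, y, z, radius; } ViewLight;      /* view-space bounding sphere */

/* Returns 1 when the sub-tile can be skipped entirely. */
static int subtile_culled(const SubTile *t, float far_clip,
                          const ViewLight *lights, size_t count)
{
    assert(t && t->min_depth <= t->max_depth);

    /* Rule 1: nothing but far clip (sky) in this sub-tile.
     * min >= far implies max >= far as well. */
    if (t->min_depth >= far_clip)
        return 1;

    /* Rule 2 (simplified): keep the sub-tile if any light's sphere overlaps
     * the depth slab [min_depth, max_depth]. */
    for (size_t i = 0; i < count; ++i) {
        float near_z = lights[i].z - lights[i].radius;
        float far_z  = lights[i].z + lights[i].radius;
        if (far_z >= t->min_depth && near_z <= t->max_depth)
            return 0;
    }
    return 1;
}
```

The early return on the first overlapping light is the kind of branch that suits general purpose SPU code outside the inner loop: one test discards a whole sub-tile before any per-pixel work starts.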

Step #7: Culling
- Remember, SPUs can execute general purpose code.
- Take advantage of high-level constructs where they are suitable: this means branches, early-outs, etc.
- Note: branches generally suck. They are not suitable in the lighting inner loop; instead, discard an entire sub-tile at once.

Step #8: Synchronisation
- Custom SPURS policy module made RSX™ initiated jobs easy.
- Our jobs can optionally depend on a 128 byte line written by RSX™ (or PPU, whatever).
- Non-blocking:
  - Freedom to run other scheduler tasks while waiting.
  - Really should investigate using the SPE’s mailboxes to stop us from hammering the bus.
  - Physics team happy again!
- Not pre-emptive.

Step #8: Synchronisation
- Can be painful!
  - Expect hard to find bugs here.
  - We had a couple, *ahem* both were other Steve’s fault ;-)
- Worth it in the end though!
- Keep an eye on overall timings.
  - Originally lighting pushed out physics.
  - Very easy to forget the bigger picture.
  - Impossible to predict up front.

Step #9: Slotting it in...
[Diagram: frame timeline. SPU jobs: Audio, Command Buffer, Scene Graph, Physics, Car Damage, Lighting #1, Lighting #2, Lighting #3. GPU: Mirror, Reflection, Main Scene, Pre-Pass.]

Step #9: Slotting it in...
- Ended up running the lighting on 3 SPUs, still easily within our timeframe, and it no longer pushed the physics out.

Step #10: Profit!
- SPU implementation faster than RSX™ even without parallelism (~2-3ms on 3 SPUs).
- Overall frame latency reduced by up to 25%!
- More benefits:
  - Blending in an alternative colour space becomes trivial.
  - Add value by outputting other useful stuff from the SPU program: down-sampled Z buffer anyone?
- Lighting becomes free*.

* In the strictest computer science sense of the word ;-).

The Future...
- MSAA: big challenge, but solvable...
- Experiment with different colour spaces?
- Remove de-coding step...
  - Upsets my OCD as it’s not really needed for the data transformation.
  - But also allows us to overlap input and output buffers.
- Specular.
- Better normals:
  - Ideally higher precision for use in main pass.
  - Fix positive z-component sign assumption.
  - Stereographic Projection
  - Lambert Azimuthal Equal-area Projection et al.

References
[Engel08] W. Engel, “Light Pre-Pass Renderer”, http://diaryofagraphicsprogrammer.blogspot.com/2008/03/light-pre-pass-renderer.html, accessed on 4th July 2009.
[Balestra08] C. Balestra and P. Engstad, “The Technology of Uncharted: Drake’s Fortune”, GDC2008.
[Swoboda09] M. Swoboda, “Deferred Lighting and Post Processing on PLAYSTATION®3”, GDC2009.

Special Thanks!
Matt Swoboda and Colin Hughes (SCE R&D) and The Bizarre Creations Core Tech Team

Shameless Plug
- Steve and I contributed to this book...
- It’s out March 2010, you should buy it for your desk, studio library, etc.
- http://gpupro.blogspot.com

Thanks for Listening!
- Questions?
- Check out www.blurgame.com
