
A Bizarre Way to do Real-Time Lighting


Published in: Technology


  1. 1. A Bizarre Way to do Real-Time Lighting<br />Stephen McAuley & Steven Tovey<br />Graphics Programmers, Bizarre Creations Ltd.<br />
  2. 2. “Welcome, I think not!”<br />Let us start by wishing you a good bonfire night!<br />
  3. 3. Agenda<br />A sneak preview of Blur<br />Light Pre-Pass Rendering<br />10 Step Guide to free Lighting on PS3<br />The Future...<br />
  4. 4. Blur<br />Coming 2010 on X360, PS3 and PC.<br />Twenty cars on track for intense wheel-to-wheel racing.<br />Exciting power-ups bring depth and strategy to racing.<br />Real-world cars and locations, set between dusk and dawn.<br />Extensive multiplayer options.<br />
  5. 5.
  6. 6. Technical Analysis<br />So, we have twenty cars, racing around a track in the dark…<br />…they all have headlights, rear lights, brake lights…<br />…not to mention any other effects we might have going on around the track…<br />…therefore, we need some sort of real-time lighting solution.<br />
  7. 7. Light Pre-Pass<br />Many people came up with this… so you know it’s good!<br />Given its name by [Engel08].<br />Credits also due to [Balestra08].<br />Half-way between traditional and deferred rendering.<br />
  8. 8. Light Pre-Pass<br />[Diagram: Geometry is rendered to Normals and Depth, which feed the Real-Time Lighting; Geometry is then rendered again with the Real-Time Lighting to give the Final Colour.]<br />
  9. 9. Light Pre-Pass in Blur<br />Final Image<br />
  10. 10. Step #1: Render Pre-Pass<br />Render scene normals and depth.<br />We pack view space normals and depth into one RGBA8 surface:<br />R: normal x<br />G: normal y<br />B: depth hi<br />A: depth lo<br />This means all the info we need is in one texture, not two!<br />It’s also faster to calculate view space position than world space position.<br />
  11. 11. Step #1: Render Pre-Pass<br />Pack depth:<br />half2 vPackedDepth =<br /> half2( floor(fDepth * 255.f) / 255.f,<br />frac(fDepth * 255.f) );<br />Unpack depth:<br />float fDepth =<br />vPackedDepth.x + vPackedDepth.y * (1.f / 255.f);<br />(Note: here fDepth is in [0, 1] range)<br />
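The pack and unpack expressions on this slide can be tried on the CPU. The following is a minimal sketch in plain C, not the shader itself: the `packed_depth` type and both function names are hypothetical, and the rounding applied when a value lands in an 8-bit channel is an assumption about the render target.

```c
#include <assert.h>
#include <stdint.h>

/* Two 8-bit channels holding one depth value (hypothetical helper type). */
typedef struct { uint8_t hi, lo; } packed_depth;

/* Pack depth in [0, 1] following the slide:
   hi = floor(depth * 255) / 255, lo = frac(depth * 255).
   Storing to an 8-bit channel is assumed to scale by 255 and round. */
static packed_depth pack_depth(float depth)
{
    assert(depth >= 0.0f && depth <= 1.0f);
    float scaled = depth * 255.0f;
    int   whole  = (int)scaled;              /* floor, valid since scaled >= 0 */
    float frac   = scaled - (float)whole;
    packed_depth p;
    p.hi = (uint8_t)whole;                   /* (whole / 255) as stored in 8 bits */
    p.lo = (uint8_t)(frac * 255.0f + 0.5f);  /* remainder, rounded to 8 bits */
    return p;
}

/* Unpack following the slide: hi + lo * (1 / 255), both read back as [0, 1]. */
static float unpack_depth(packed_depth p)
{
    float hi = (float)p.hi / 255.0f;
    float lo = (float)p.lo / 255.0f;
    return hi + lo * (1.0f / 255.0f);
}
```

Under these assumptions a round trip through `pack_depth` and `unpack_depth` recovers depth to within roughly 1/(255 × 255), which is the step size of the low channel.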
  12. 12. Step #1: Render Pre-Pass<br />Get view space position from texture coordinates and depth:<br />float3 vPosition<br /> = float3(g_vScale.xy * vUV +, 1.f)<br /> * fDepth;<br />Here fDepth is in [0, FarClip] range.<br />g_vScale moves vUV into [-1, 1] range and scales by inverse projection matrix values.<br />In some circumstances, it is possible to move this to the vertex shader.<br />
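The shader expression on this slide is truncated in the transcript, so the sketch below does not try to recover it. It shows one common way to do the same reconstruction in plain C, assuming a symmetric perspective frustum. The `uv_to_view` type, the function names, and the `tan_half_fov` parameters are all hypothetical, and texture-coordinate flips are ignored.

```c
#include <assert.h>

typedef struct { float x, y, z; } vec3;

/* Hypothetical scale/offset pair playing the role the slide gives g_vScale. */
typedef struct { float scale_x, scale_y, offset_x, offset_y; } uv_to_view;

/* uv in [0, 1] maps to [-1, 1] via 2 * uv - 1; multiplying through by
   tan(half field of view) folds in the inverse projection scale. */
static uv_to_view make_uv_to_view(float tan_half_fov_x, float tan_half_fov_y)
{
    uv_to_view m = { 2.0f * tan_half_fov_x, 2.0f * tan_half_fov_y,
                     -tan_half_fov_x, -tan_half_fov_y };
    return m;
}

/* View space position: a ray through the pixel at z = 1, scaled by linear
   depth in [0, FarClip]. */
static vec3 view_position(uv_to_view m, float u, float v, float linear_depth)
{
    assert(linear_depth >= 0.0f);
    vec3 p;
    p.x = (m.scale_x * u + m.offset_x) * linear_depth;
    p.y = (m.scale_y * v + m.offset_y) * linear_depth;
    p.z = linear_depth;
    return p;
}
```

The per-pixel work is one multiply-add per axis plus a multiply by depth, which fits the slide's point that view space position is cheap to compute.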
  13. 13. Step #1: Render Pre-Pass<br />Normal X, Normal Y, Depth Hi, Depth Lo<br />Normals X & Y<br />Depth Hi & Lo<br />
  14. 14. Step #1: Render Pre-Pass<br />Some good advice: at this stage, it’s really best to render only what you need…<br />So don’t render geometry that isn’t affected by real-time lights!<br />Why not also try bringing in the far clip plane?<br />We also don’t render the very, very vertex-heavy cars.<br />They get their real-time lighting from a spherical harmonic. Doesn’t look too bad! <br />
  15. 15. Step #2: The Lighting<br />We render the lighting to an RGBA8 texture.<br />Lighting is in [0, 1] range.<br />We just about got away with range and precision issues.<br />Two types of lights:<br />Point lights<br />Spot lights<br />
  16. 16. Step #2: Point Lights<br />First up, it’s the point lights’ turn.<br />Let’s copy [Balestra08] and render them tiled.<br />Split the screen into tiles:<br />for each tile<br /> gather affecting lights<br /> select shader<br /> render tile<br />end<br />Big savings!<br />Save on fill rate.<br />Minimise overhead of unpacking view space position and normal.<br />
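The "gather affecting lights" step of the loop above might look like the following plain C sketch. The tile size, the per-tile light cap, and the idea of describing each light by a screen-space bounding circle are assumptions made for illustration, not details taken from the talk.

```c
#include <assert.h>

#define TILE_SIZE 64            /* hypothetical tile edge in pixels */
#define MAX_LIGHTS_PER_TILE 8   /* hypothetical cap on lights per tile */

/* Screen-space bounding circle of a point light, in pixels (assumed input). */
typedef struct { float x, y, radius; } screen_light;

static float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Writes the indices of lights overlapping tile (tile_x, tile_y) into
   out_indices (room for MAX_LIGHTS_PER_TILE) and returns how many it found. */
static int gather_tile_lights(int tile_x, int tile_y,
                              const screen_light *lights, int light_count,
                              int *out_indices)
{
    assert(lights && out_indices && light_count >= 0);
    float min_x = (float)(tile_x * TILE_SIZE), max_x = min_x + TILE_SIZE;
    float min_y = (float)(tile_y * TILE_SIZE), max_y = min_y + TILE_SIZE;
    int found = 0;
    for (int i = 0; i < light_count && found < MAX_LIGHTS_PER_TILE; ++i) {
        /* Distance from the light centre to the closest point of the tile. */
        float dx = lights[i].x - clampf(lights[i].x, min_x, max_x);
        float dy = lights[i].y - clampf(lights[i].y, min_y, max_y);
        if (dx * dx + dy * dy <= lights[i].radius * lights[i].radius)
            out_indices[found++] = i;
    }
    return found;
}
```

The returned count is a natural input to the "select shader" step: a tile with no lights can be skipped, and a tile with N lights can pick a shader variant written for N lights.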
  17. 17. Step #2: Point Lights<br />[Figure: the screen split into tiles, each labelled with a number (1 or 2).]<br />
  18. 18. Step #2: Point Lights<br />Optimise: mask out the sky in the stencil buffer.<br />
  19. 19. Step #2: Point Lights<br />Real-Time Lighting (Point Lights)<br />
  20. 20. Step #2: Spot Lights<br />Next, it’s the spot lights.<br />Three different types:<br />Bog standard.<br />2D projected texture.<br />Volume texture.<br />Render as volumes.<br />A cone for the bog-standard and projected.<br />A box for the volume textured.<br />If they’re big enough on screen, do a stencil test.<br />
  21. 21. Step #2: Spot Lights<br />Render back faces:<br />Colour write disabled<br />Depth test greater-equal<br />Stencil write enabled<br />
  22. 22. Step #2: Spot Lights<br />Render front faces:<br />Colour write enabled<br />Depth test less-equal<br />Stencil test enabled<br />
  23. 23. Step #2: Spot Lights<br />Hold on a minute… what happens if the camera goes inside the light volume?<br />Rendering the front faces doesn’t work any more…<br />
  24. 24. Step #2: Spot Lights<br />Worst case scenario! Not only does the light fill the whole screen, but…<br />You just have to bite your tongue and only render back faces.<br />You lose your stencil test. <br />And maybe even early-z too. <br />
  25. 25. Step #2: Spot Lights<br />
  26. 26. Step #2: The Lighting<br />Real-Time Lighting<br />
  27. 27. Step #3: Render the Scene<br />Just do everything as you normally would…<br />Except that you now have a texture containing the real-time lighting for each pixel!<br />But remember to composite it properly…<br />
  28. 28. Step #3: Render the Scene<br />half3 vDiffuseLighting =<br /> vStaticLighting.rgb + vDynamicLighting.rgb;<br />half3 vFinalColour =<br />vDiffuseLighting * vAlbedoColour.rgb +<br />vSpecularLighting;<br />vStaticLighting: from our lightmaps.<br />vDynamicLighting: the real-time lighting from the texture.<br />You’d probably want to do something clever involving a Fresnel term here.<br />
  29. 29. And Finally…<br />
  30. 30. Real-Time Lighting in Blur<br />Point Lights: brake lights, rear lights<br />
  31. 31. Real-Time Lighting in Blur<br />Point Lights: pick-ups<br />
  32. 32. Real-Time Lighting in Blur<br />Point Lights: power-up effects<br />
  33. 33. Real-Time Lighting in Blur<br />Spot Lights: headlights<br />
  34. 34. Real-Time Lighting in Blur<br />Spot Lights: start line effects<br />
  35. 35. Great, It Works!<br />But can we make it faster?<br />Deferred lighting is image processing – no rasterization required.<br />See how we draw our point lights.<br />Seems like this suits the PLAYSTATION®3’s SPUs…<br />
  36. 36. PLAYSTATION®3: In Brief<br />Time to switch gears a little bit...<br />So you’ve heard this stuff a million times before... Here are the important takeaway facts:<br />PS3 has 6 SPUs.<br />SPUs are fast!<br />(...Given the right data! )<br />
  37. 37. PLAYSTATION®3: In Brief<br />[Diagram: six SPUs and the RSX™, with Main Memory (XDR - 256MB) and Graphics Memory (GDDR3 - 256MB).]<br />
  38. 38. PLAYSTATION®3: In Brief<br />[Diagram: an SPE, made up of an SPU (SXU plus 256KiB Local Store) and an MFC, alongside Main Memory (256MB) and Graphics Memory (256MB).]<br />
  39. 39. Goals for PLAYSTATION®3<br />Reduce overall frame latency to acceptable level (<33ms).<br />Preserve picture quality (and resolution).<br />Blur runs @ 720p on X360 and PS3.<br />Preserve lighting accuracy.<br />Lighting and main scene must match:<br />Cars move fast... <br />Deferring the lighting simply not an option, works great in [Swoboda09] though.<br />
  40. 40. Step #1: Look At The Data<br />Data is *really* important!<br />Trivially easy in this case as we’re coming from a stream processing model, but never hurts to understand it anyway.<br />Kinda gives us a small glimpse of DX11 compute shaders.<br />
  41. 41. Step #1: Look At The Data<br />xform<br />Lights<br />
  42. 42. Step #1: Look At The Data<br />xform<br />Lights<br />
  43. 43. Step #2: Parallelism<br />Stream processing is highly suited to parallelisation and we have 6 x SPUs.<br />The obvious question arises: What size should a unit of work be?<br />Answer: Look at the data again!<br />
  44. 44. Step #3: Look At The Data<br />Fun fact: Frame buffers are not usually linear!<br />Many reasons for this (Think filtering and RSX™ quads).<br />Our unit size is closely tied to the internal format of frame buffer produced by the RSX™.<br />Not going to get into the exact formats here, it’s dull and it’s all in the Sony SDK Docs – RTFM!<br />Recommend PhyreEngine for good reference examples.<br />
  45. 45. Step #4: Arbitrating Work<br />Synchronisation points are fail. Keep to an absolute minimum.<br />Solution: Atomics are your friend! <br />Target hardware has an ATO, <br /> Use it, <3 it... <br />Move through data in tiles, tile dictated by an index – DMA into the local store for processing.<br />
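One way to picture the shared index is as a counter that each worker bumps atomically to claim its next tile. The sketch below uses portable C11 atomics as a stand-in for the SPU's atomic hardware. The `tile_queue` type and its functions are hypothetical, and the DMA of the claimed tile into local store is left out.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared work index: the next unclaimed tile and the total tile count. */
typedef struct {
    atomic_uint next_tile;
    unsigned    tile_count;
} tile_queue;

static void tile_queue_init(tile_queue *q, unsigned tile_count)
{
    assert(q);
    atomic_init(&q->next_tile, 0u);
    q->tile_count = tile_count;
}

/* Claims the next tile. Returns 1 and writes its index to out_tile, or
   returns 0 (leaving out_tile untouched) once every tile has been claimed.
   The only synchronisation between workers is the single fetch-and-add. */
static int tile_queue_claim(tile_queue *q, unsigned *out_tile)
{
    assert(q && out_tile);
    unsigned tile = atomic_fetch_add(&q->next_tile, 1u);
    if (tile >= q->tile_count)
        return 0;
    *out_tile = tile;
    return 1;
}
```

Each worker would loop on `tile_queue_claim` until it returns 0, so faster workers simply end up processing more tiles and nobody waits at a barrier.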
  46. 46. Step #4: Arbitrating Work<br />[Diagram: six SPUs sharing a single Index.]<br />
  47. 47. Step #5: Multi-Buffering<br />Move data and process data at the same time.<br />Costs local store, but usually worth it.<br />Different tag group for each buffer.<br />
  48. 48. Step #5: Multi-Buffering<br />We used triple-buffering, since we’re decoding the normal/depth buffer.<br />Normal/Depth Buffer (Main)<br />SXU<br />Lighting Buffer (Main)<br />MFC<br />
  49. 49. Step #6: Lighting (SOA)<br />SOA is basically a transpose of the obvious layout:<br />AOS: XYZW, XYZW, XYZW, XYZW<br />SOA: XXXX, YYYY, ZZZZ, WWWW<br />AOS, 1 x square length (~18 cycles):<br />qword dot_xx = si_fm(v, v);<br />qword dot_xx_r4 = si_rotqbyi(dot_xx, 4);<br />dot_xx = si_fa(dot_xx, dot_xx_r4);<br />qword dot_xx_r8 = si_rotqbyi(dot_xx, 8);<br />dot_xx = si_fa(dot_xx, dot_xx_r8);<br />return si_to_float(dot_xx);<br />Vs.<br />SOA, 4 x square lengths (~12 cycles):<br />qword dot_x = si_fm(x, x);<br />qword dot_y = si_fma(y, y, dot_x);<br />qword dot_z = si_fma(z, z, dot_y);<br />return dot_z;<br />
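For readers without an SPU toolchain, the SOA version can be mirrored in plain C. This is a sketch only: the `vec3x4` type and function name are hypothetical, and the loop over four lanes stands in for what a single `si_fm` or `si_fma` does across a whole qword.

```c
#include <assert.h>

/* Four 3D vectors in SOA form: one array per component. */
typedef struct { float x[4], y[4], z[4]; } vec3x4;

/* Four squared lengths at once. Each lane needs one multiply and two
   multiply-adds, and no lane ever reads data from another lane. */
static void length_sq_x4(const vec3x4 *v, float out[4])
{
    assert(v && out);
    for (int i = 0; i < 4; ++i) {
        float d = v->x[i] * v->x[i];   /* si_fm(x, x)       */
        d = v->y[i] * v->y[i] + d;     /* si_fma(y, y, d)   */
        d = v->z[i] * v->z[i] + d;     /* si_fma(z, z, d)   */
        out[i] = d;
    }
}
```

The AOS version needs rotates and adds to sum across the lanes of one register. The SOA layout avoids that horizontal work, which is where the saving quoted on the slide comes from.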
  50. 50. Step #6: Lighting (SOA)<br />Pre-transpose lighting data, splat values across entire qword.<br />16 byte aligned, single lqd.<br />struct light<br />{<br /> float m_x[4];<br /> float m_y[4];<br /> float m_z[4];<br /> float m_inv_radius_sq[4];<br /> float m_colour_r[4];<br /> float m_colour_g[4];<br /> float m_colour_b[4];<br />};<br />m_x: 4 copies of world-space X, one in each element of the array.<br />m_inv_radius_sq: never actually used radius, so pre-compute (1/radius)^2.<br />
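Filling the splatted structure from ordinary scalar light data could be done as below. This is a sketch under assumptions: `make_light` and its parameter list are hypothetical, and `_Alignas(16)` is simply one C11 way to get the 16 byte alignment the slide asks for.

```c
#include <assert.h>

/* Same fields as the slide's struct; every array holds 4 copies of one value. */
typedef struct light {
    _Alignas(16) float m_x[4];
    float m_y[4];
    float m_z[4];
    float m_inv_radius_sq[4];
    float m_colour_r[4];
    float m_colour_g[4];
    float m_colour_b[4];
} light;

/* Splat scalar light data across all four lanes of each member.
   The radius itself is not stored, only (1 / radius)^2. */
static light make_light(float x, float y, float z, float radius,
                        float r, float g, float b)
{
    assert(radius > 0.0f);
    float inv_radius_sq = 1.0f / (radius * radius);
    light l;
    for (int i = 0; i < 4; ++i) {
        l.m_x[i] = x;
        l.m_y[i] = y;
        l.m_z[i] = z;
        l.m_inv_radius_sq[i] = inv_radius_sq;
        l.m_colour_r[i] = r;
        l.m_colour_g[i] = g;
        l.m_colour_b[i] = b;
    }
    return l;
}
```

Doing the splat once per light, outside the pixel loop, means the inner loop can load each member with a single aligned read and use it directly against four pixels.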
  51. 51. Step #6: Lighting (Batch I)<br />qword everywhere.<br />Batch reads and writes into 16 byte chunks.<br />Read 4 pixels from normal/depth.<br />Write 4 pixels to lighting buffer.<br />qword depth_addr = si_from_ptr(depth_buf);<br />qword depth0 = si_lqd(depth_addr, 0x00);<br />qword depth1 = si_lqd(depth_addr, 0x10);<br />qword depth2 = si_lqd(depth_addr, 0x20);<br />qword depth3 = si_lqd(depth_addr, 0x30);<br />qword clmp0 = si_cfltu(diffuse0, 0x20);<br />qword clmp1 = si_cfltu(diffuse1, 0x20);<br />qword clmp2 = si_cfltu(diffuse2, 0x20);<br />qword clmp3 = si_cfltu(diffuse3, 0x20);<br />qword r = si_ila(0x8000);<br />qword scl = si_ilh(0xff00);<br />dif0 = si_mpyhhau(clmp0, scl, r);<br />dif1 = si_mpyhhau(clmp1, scl, r);<br />dif2 = si_mpyhhau(clmp2, scl, r);<br />dif3 = si_mpyhhau(clmp3, scl, r);<br />const vector unsigned char _shuf_uint =<br /> { 0xc0, 0x00, 0x04, 0x08,<br /> 0xc0, 0x10, 0x14, 0x18,<br /> 0xc0, 0x00, 0x04, 0x08,<br /> 0xc0, 0x10, 0x14, 0x18 };<br />qword shuf_ = (const qword)_shuf_uint;<br />qword base_add = si_from_ptr(pResult);<br />qword p0_1 = si_shufb(dif0, dif1, shuf_);<br />qword p0_2 = si_shufb(dif2, dif3, shuf_);<br />qword pix0 = si_selb(p0_1, p0_2, m_00ff);<br />si_stqd(pix0, base_add, 0x0);<br />
  52. 52. Step #6: Lighting (Balance)<br />Lighting SPU program performance limited by number of instructions issued.<br />Pipeline balance is vital!<br />SPU dual issues if:<br />Correctly aligned within single fetch group.<br />No dependencies.<br />Instructions are for correct pipelines.<br />Luckily, compiler maintained balance quite well with nop/lnop insertion and some instruction re-ordering.<br />Lighting larger batches helps out balance at the cost of register file usage<br />Mileage may vary here again, how bad are you hammering the even pipe?<br />
  53. 53. Step #6: Lighting (Batch II)<br />Fixed setup cost for a single line of our sub-tile size (32 pixels wide).<br />Unfortunately, that is too many pixels to process at once, despite the SPU’s massive register file. The loop is pipelined, so there are lots of live variables to multiplex onto the register file.<br />Settled for 16 pixels, no spilling.<br />Note: First attempt worked on 4 pixel batches like RSX™. Lots of wasted cycles in inner loop – less dual issue.<br />32 Pixels: register spilling...<br />16 Pixels + 16 Pixels: happy medium.<br />8 x 4 Pixels: wasted cycles and increased setup overhead.<br />
  54. 54. Step #7: Culling<br />Culling works on more granular sub-tiles. Allows us to potentially reject more tiles (of course, YMMV).<br />(Note: diagram below is an example, it’s not our actual sub-tile size).<br />Similar to GPU, basically a tile is culled if...<br />Max and min depth are both at the far clip.<br />No lights intersect the frustum constructed for the tile.<br />Sub-tile<br />
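The depth-based cull test can be sketched in plain C as follows. This covers only the far-clip condition and leaves the light-versus-frustum test out. The `depth_range` type, the function names, and the flat array of per-pixel depths are assumptions made for illustration.

```c
#include <assert.h>

typedef struct { float min_depth, max_depth; } depth_range;

/* Scan the depths of one sub-tile (count > 0) for their minimum and maximum. */
static depth_range subtile_depth_range(const float *depth, int count)
{
    assert(depth && count > 0);
    depth_range r = { depth[0], depth[0] };
    for (int i = 1; i < count; ++i) {
        if (depth[i] < r.min_depth) r.min_depth = depth[i];
        if (depth[i] > r.max_depth) r.max_depth = depth[i];
    }
    return r;
}

/* Far-clip cull test: true when the sub-tile holds nothing but far clip
   (for example sky), so no lighting needs to be computed for it. */
static int subtile_is_empty(depth_range r, float far_clip)
{
    return r.min_depth >= far_clip && r.max_depth >= far_clip;
}
```

The same minimum and maximum could also bound the near and far planes of a per-sub-tile frustum for the light intersection test, although the talk does not spell out how that frustum is built.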
  55. 55. Step #7: Culling<br />Remember, SPUs can execute general purpose code.<br />Take advantage of high-level constructs where they are suitable – this means branches, early-outs, etc.<br />Note: Branches generally suck. Not suitable in lighting inner-loop, discard an entire sub-tile at once.<br />
  56. 56. Step #8: Synchronisation<br />Custom SPURS policy module made RSX™ initiated jobs easy. Our jobs can optionally depend on a 128 byte line written by RSX™ (or PPU, whatever).<br />Non-blocking:<br />Freedom to run other scheduler tasks while waiting.<br />Really should investigate using SPE’s mailboxes to stop us from hammering the bus.<br />Physics team happy again!<br />Not pre-emptive.<br />
  57. 57. Step #8: Synchronisation<br />Can be painful!<br />Expect hard to find bugs here.<br />We had a couple, *ahem* both were other Steve’s fault ;-)<br />Worth it in the end though!<br />Keep an eye on overall timings.<br />Originally lighting pushed out physics.<br />Very easy to forget the bigger picture.<br />Impossible to predict up front.<br />
  58. 58. Step #9: Slotting it in...<br />[Timeline diagram. SPU jobs: Audio, Command Buffer, Scene Graph, Physics, Car Damage, Lighting #1, Lighting #2, Lighting #3. GPU: Mirror, Reflection, Main Scene, Pre-Pass.]<br />
  59. 59. Step #9: Slotting it in...<br />Ended up running the lighting on 3 SPUs, still easily within our timeframe and no longer pushed the physics out.<br />
  60. 60. Step #10: Profit! <br />SPU implementation faster than RSX™ even without parallelism. (~2-3ms on 3 SPUs).<br />Overall frame latency reduced by up to 25%!<br />More benefits:<br />Blending in alternative colour space becomes trivial.<br />Add value by outputting other useful stuff from SPU program – down-sampled Z buffer anyone? <br />Lighting becomes free*.<br />* - In the strictest computer science sense of the word, ;-).<br />
  61. 61. The Future...<br />MSAA -- Big challenge, but solvable...<br />Experiment with different colour spaces?<br />Remove de-coding step...<br />Upsets my OCD as not really needed for the data transformation –<br />But also allows us to overlap input and output buffers.<br />Specular.<br />Better normals:<br />Ideally higher precision for use in main pass.<br />Fix positive z-component sign assumption.<br />Stereographic Projection<br />Lambert Azimuthal Equal-area Projection et al.<br />
  62. 62. References<br />[Engel08] W. Engel, “Light Pre-Pass Renderer”, accessed on 4th July 2009.<br />[Balestra08] C. Balestra and P. Engstad, “The Technology of Uncharted: Drake’s Fortune”, GDC2008.<br />[Swoboda09] M. Swoboda, “Deferred Lighting and Post Processing on PLAYSTATION®3”, GDC2009.<br />
  63. 63. Special Thanks!<br />Matt Swoboda and Colin Hughes (SCE R&D) <br />and <br />The Bizarre Creations Core Tech Team<br />
  64. 64. Shameless Plug <br /> Steve and I contributed to this book... It’s out March 2010, you should buy it for your desk, studio library, etc.<br /><br />
  65. 65. Thanks for Listening!<br />Questions?<br />Check out<br />