SPU Assisted Rendering

A talk on SPU-accelerated rendering in Blur, given by Steven Tovey and Stephen McAuley at Develop 2010.

Published in: Technology
  • Local space position of vertex
  • Normal
  • Couple of sets of UVs.
  • Morph-targets.
  • Control point index into an array of curves.
  • Spherical Harmonic
  • Damage data... Position offset, Normal offset, scratch and dent levels.
  • Explain why 16KB chunks, MFC max per transfer.
  • Don’t want lumpiness if parallel read/write.
  • Rim lighting here.
  • Tyres use low-power specular.
  • Brake lights.
  • Used on alloys for low-power specular.
  • Used for the scratch lighting, again for low-power specular.
  • We need to look at the pipeline of the graphics card to work out how we can move more of our GPU work onto the SPUs. Two main areas we can insert data – either through vertices at the top, or textures at the fragment stage. Sadly, we can’t hook into the rasteriser, which would be ace.
  • Of course, these look-up textures end up being screen-space look-up textures, which means some sort of deferred rendering…
  • I have a problem with forward rendering. I think most people traditionally design their engine this way, especially on 360 and PC. But all the work is done in the fragment shader, so when you port to the PS3 with a slower fragment shader unit, your whole game runs slower. Although you can use EDGE to speed up your vertex processing and your post processing, they both only step around the core of the issue that you’re fragment shader bound and there’s no easy way of solving it.
  • We found a light pre-pass renderer suited our goals pretty well. It’s a halfway house between traditional and deferred rendering.
  • We render a rear-view mirror, cube map reflections for the cars and planar reflections for the road and water in addition to the pre-pass and main views. Multi-threaded rendering helps a lot!
  • Deferring by a frame isn’t ideal. Either you just use the previous frame’s lighting buffer for the next frame, with obvious artefacts (especially if you’re doing a racing game like us), or you have to add a frame of latency. I don’t think adding frames of latency is ideal, especially for cross-platform games. If you add a frame of latency on the PS3, are you going to do the same on the 360? If you’re not, then game play could be different between both platforms. I’m not saying this is something I’d never do, I think in lots of circumstances you’ll have to. But avoid it where you can, and this is one instance.
  • If we wanted to take this further for future projects, we could add shadow maps in at the start of our pipeline, then do an exponential blur on the SPUs whilst we’re rendering the pre-pass geometry…
  • This is real multi-threaded graphics processing, with multiple processors doing different jobs at the same time. Therefore, architect your engine accordingly! Having small graphics jobs allows you to spread the workload. Obviously, not everything can be done like this. Some things will most likely have to be deferred a frame, adding a frame of latency, such as post-processing or MLAA. But there are lots of smaller tasks that don’t have to be, from SSAO to blurring exponential shadow maps. You have to find things to parallelise with! Think about the data again! Rendering has lots of stages, each with its own inputs and outputs. What could sync with what?
  • We combine the normals and depth into one 32-bit buffer. This is an optimisation as it halves the inputs into the SPU program, but also allows us to keep the depth buffer in local memory which is good for performance.
  • The first step, but the biggest stumbling block!
  • No blocking! Our jobs are optionally dependent on a label.
  • To be accurate, we have a jump-to-self per SPU.
  • When we load in a tile, we quickly iterate over every pixel and calculate minimum and maximum depth. No need to use a stencil buffer to cull out the sky as depth min and max will do it for us. (Remember, we don’t have the stencil buffer as we’re not using the depth buffer!) This technique is really useful for a variety of things, including depth of field (check out Matt Swoboda’s optimisation in PhyreEngine).
  • This is actually the easiest bit. Just write the lighting equations in intrinsics! However, they really have to be fast, otherwise performance just won’t be good enough. Next are some helpful tips for optimisation.
  • So we triple buffer. It ends up that we have plenty of local store left as it’s a simple job and our job size was relatively small. Another reason to write in si intrinsics though, as it keeps the code size down!
  • Just like Ste said earlier, this is a big win. Probably a good rule of thumb for most SPU jobs!
  • When kicking SPU jobs off on the RSX, you have to be careful as you can interfere with jobs the PPU is running. This is where sync-free systems are a win! We’re lucky as we just avoided the physics, but also, running only on 3 SPUs was a good idea so we had 3 free for other tasks. See how quick the rendering is even though we’re rendering so many views!
  • Apologies for the shameless self-promotion!
  • SPU Assisted Rendering

    1. /* SPU Assisted Rendering. */
       Steven Tovey & Stephen McAuley
       Graphics Programmers, Bizarre Creations Ltd.
       steven.tovey@bizarrecreations.com
       stephen.mcauley@bizarrecreations.com
       http://www.bizarrecreations.com
    2. /* Welcome! */
       - We have some copies of Blur to give away, stick around and fill out your evaluation sheets!
    3. /* Agenda */
       - Part I (w/ Steven Tovey):
         What is SPU Assisted Rendering?
         Case Studies
         - Car Damage
         - Car Lighting
       - Part II (w/ Stephen McAuley):
         Fragment Shading
         Parallelisation
         Case Study
         Pre-pass Lighting on SPUs
       - Questions
    5. /* Part I w/ Steven Tovey */
       SPU Acceleration of Car Rendering in Blur
    6. /* What is SPU AR? I */
       - Assisting RSX™ with the SPUs (der!)
       - Why do this?
         Free up RSX™ to do other things.
         Enable otherwise unfeasible techniques.
         Optimise rendering.
    8. /* What is SPU AR? II */
       - Problems involved?
       - Synchronisation.
       - Optimising SPU modules.
       - Memory considerations:
       - Local store
       - Resource allocation
       - Etc.
    15. /* Case Study: Cars I */
        - Original Xenon implementation:
        - Totally GPU-based.
        - 2xVTF (volume & 2D) for damage.
        - Large amount of work in vertex shader, making cars in Blur heavily vertex-bound.
        - All lighting in pixel shader.
    20. /* Case Study: Cars II */
        - Loose fitting damage volume:
    21. /* Case Study: Cars III */
        - Control points:
    22. /* Case Study: Cars IV */
        - Morph targets:
    23. /* Case Study: Cars V */
        - Scratch/dent textures:
    24. /* Case Study: Cars VI */
        - Challenges:
        - Increase rendering speed of cars.
        - Maintain same quality.
    27. /* Damage: Solution */
        - Our solution:
        - Large parts are SPU based.
        - On demand.
        - Sync-free.
        - Deferred.
        - Work split between GPU/SPU.
    33. /* Damage: Data I */
        - 2 vertex streams:
        - Read-only car vertex data.
        - Shared between similar cars.
        - SPU-modified damage vertex data.
        - Per instance.
        - One-to-one mapping of vertices.
        - Control points:
        - Crude approximation of volume preservation.
        - Dent/scratch blend levels.
    42. /* Damage: Data II */
        Stream0: Position, Normal, UV0, UV1, PosOffset, NormalOffset, ControlPoints, AO
        Stream1: SPU_Position, SPU_Normal
    49. /* Damage: Data III */
        - MFC writes data atomically in 16 byte chunks...
        - If the vertex format is exactly 16 bytes, you can atomically change a vertex from the SPU.
        - If you can live with the odd vertex being wrong for a frame, this could be a huge win!
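
The 16-byte point on the Damage: Data III slide can be made concrete with a compile-time size check. This is a minimal sketch only: the `DamageVertex` field layout and the helper name are assumptions for illustration, not Blur's actual vertex format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical damage vertex: the field choice is an assumption. */
typedef struct DamageVertex {
    uint16_t pos[4];     /* position as four 16-bit values (x, y, z, pad) */
    uint16_t normal[4];  /* normal as four 16-bit values (x, y, z, pad)   */
} DamageVertex;

/* One vertex must fill exactly one 16-byte chunk. */
static_assert(sizeof(DamageVertex) == 16, "vertex must be exactly 16 bytes");

/* Returns non-zero when an address sits on a 16-byte boundary, so that each
   vertex written from there lands in its own 16-byte chunk. */
static int vertices_are_chunk_aligned(const void *base)
{
    return ((uintptr_t)base % 16u) == 0u;
}
```

If the struct ever grows past 16 bytes, the build fails instead of silently losing the single-chunk property the slide relies on.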
    52. /* Damage: Data IV */
        [Diagram: SPU, RSX Local, Main, Write-only Vertices, Read-only Vertices]
    53. /* Damage: Events */
        - Damage events from game-side code are queued.
        - Note: There is no link to the player health, purely superficial.
        [Diagram: Game Code, queue of Impact events]
    55. /* Damage: Data V */
        [Diagram: Impact events, Constants, SPU, GPU, Write-only Vertices*, Read-only Vertices*]
        * w.r.t. the SPU
    56. /* Damage: Data VI */
        [Diagram: SPU, GPU, Write-only Vertices*, Read-only Vertices*]
        * w.r.t. the SPU
    57. /* Damage: Control */
        - Fewer sync points should be the goal of any multi-core code:
        [Diagram, built up over several slides. Labels: Kick off SPU tasks, PPU Damage, Vertex Work (x4), Flag, Other Work(1), Other Work(2), PPU Damage]
    64. /* Damage: SPU I */
        - Pretty easy to go from shaders to SPU intrinsics or asm.
        - We favour si style for simplicity and ease.

        /* de-code into IEEE754-ish 32bit float (meh): */
        qword sign_bit = si_and(result, sign_bit_mask);
        sign_bit = si_shli(sign_bit, 0x10); /* move 16 bits into correct place. */
        qword significand = si_and(result, mant_bit_mask);
        significand = si_shli(significand, 0xd);
        qword is_zero_mask = si_cgti(significand, 0x0); /* all bits set if non-zero. */
        expo_bias = si_and(is_zero_mask, expo_bias);
        qword exponent_bias = si_a(significand, expo_bias); /* move expo up range, 0x07800000=>0x3f800000. */
        exponent_bias = si_or(exponent_bias, sign_bit);
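
For readers without an SPU toolchain, the same decode can be followed one value at a time in plain C. This is a sketch under assumptions: the slide does not show the mask or bias constants, so `0x8000`, `0x7fff` and `0x38000000` are inferred from the `0x07800000=>0x3f800000` comment, and `half_to_float_fast` is a name made up here. Like the snippet, it handles only zero and normalised values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "expects 32-bit floats");

/* Scalar sketch of the half-to-float decode: no denormals, infinities or NaNs. */
static float half_to_float_fast(uint16_t h)
{
    uint32_t sign     = ((uint32_t)h & 0x8000u) << 16;  /* sign bit up to bit 31 */
    uint32_t exp_mant = ((uint32_t)h & 0x7fffu) << 13;  /* exponent and mantissa into float position */
    uint32_t bias     = exp_mant ? 0x38000000u : 0u;    /* re-bias the exponent unless the value is zero */
    uint32_t bits     = (exp_mant + bias) | sign;
    float f;
    memcpy(&f, &bits, sizeof f);                        /* reinterpret the bit pattern as a float */
    return f;
}
```

For example, the half `0x3c00` shifts to `0x07800000`, and adding the bias gives `0x3f800000`, which is 1.0f.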
    66. /* Damage: SPU II */
        - Problems:
        - GPU version relied on bilinear filtering of volume texture to smooth damage.
        - Filtering on SPU is a bit of a pain.
        - Working out which events affect which vertices?
    70. /* Damage: SPU III */
        - Simplest solution:
        - Two-stage x-form:
        - 1. Get data in volume texture-ish format.
        - 2. Apply x-form to all vertices.
    74. /* Damage: SPU IV */
        - Filtering:
        - Software bilinear filtering.
        - Some interesting instructions in ISA will help here.
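
Software bilinear filtering, as named on the slide above, is two horizontal lerps followed by one vertical lerp. The sketch below is scalar C over a single 2D slice of floats; the function name, the float grid and the clamp-to-edge behaviour are assumptions, and the talk's SPU version would be vectorised.

```c
#include <assert.h>

/* Bilinearly sample a w-by-h grid at (x, y), given in texel units.
   Coordinates are clamped to the grid edges. */
static float sample_bilinear(const float *grid, int w, int h, float x, float y)
{
    assert(grid && w > 0 && h > 0);
    if (x < 0.0f) x = 0.0f;
    if (y < 0.0f) y = 0.0f;
    if (x > (float)(w - 1)) x = (float)(w - 1);
    if (y > (float)(h - 1)) y = (float)(h - 1);

    int x0 = (int)x, y0 = (int)y;          /* top-left texel */
    int x1 = x0 + 1 < w ? x0 + 1 : x0;     /* right neighbour, clamped */
    int y1 = y0 + 1 < h ? y0 + 1 : y0;     /* lower neighbour, clamped */
    float fx = x - (float)x0;              /* horizontal blend weight */
    float fy = y - (float)y0;              /* vertical blend weight */

    float top = grid[y0 * w + x0] + fx * (grid[y0 * w + x1] - grid[y0 * w + x0]);
    float bot = grid[y1 * w + x0] + fx * (grid[y1 * w + x1] - grid[y1 * w + x0]);
    return top + fy * (bot - top);
}
```

Sampling a volume would add one more lerp between two such slices.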
    77. /* Damage: Lessons I */
        - Data flow through SPU program is paramount to performance.
          Process in 16KB chunks.
          Multi-buffer input and output.
        - If your system isn’t ‘mission critical’, align and lose double buffer.
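
The 16KB chunking can be sketched on any machine by letting `memcpy` stand in for the DMA transfers. This is an illustration only: `process_in_chunks`, the static `local` buffer and the per-byte "work" are hypothetical, and real SPU code would issue MFC transfers and overlap them with multi-buffering rather than copy synchronously.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_BYTES (16u * 1024u)  /* 16KB: the MFC maximum per transfer */

/* Streams `bytes` from src to dst one chunk at a time through a local buffer
   and returns the number of chunks used. */
static size_t process_in_chunks(unsigned char *dst, const unsigned char *src, size_t bytes)
{
    static unsigned char local[CHUNK_BYTES];   /* plays the role of local store */
    size_t chunks = 0;
    assert(dst && src);
    for (size_t off = 0; off < bytes; off += CHUNK_BYTES) {
        size_t n = bytes - off < CHUNK_BYTES ? bytes - off : CHUNK_BYTES;
        memcpy(local, src + off, n);            /* stand-in for a DMA get */
        for (size_t i = 0; i < n; ++i)          /* hypothetical per-byte work */
            local[i] = (unsigned char)(local[i] + 1u);
        memcpy(dst + off, local, n);            /* stand-in for a DMA put */
        ++chunks;
    }
    return chunks;
}
```

A 40,000-byte buffer goes through in three chunks: two full 16KB transfers and one partial one.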
    78. /* Damage: Lessons II */
        - Make use of SoA mode data layout, liberated from rigidity of GPU programming model!
        AoS: x y z w | x y z w | x y z w | x y z w
        SoA: x x x x | y y y y | z z z z | w w w w
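
The AoS-to-SoA change in the diagram above is a 4x4 transpose. A minimal scalar sketch follows; the `Vec4` and `Vec4x4` types are hypothetical names, and on the SPU the same step would be done with shuffles on quadwords.

```c
#include <assert.h>

typedef struct Vec4   { float x, y, z, w; } Vec4;               /* array-of-structures element */
typedef struct Vec4x4 { float x[4], y[4], z[4], w[4]; } Vec4x4; /* structure-of-arrays block */

/* Transpose four xyzw vectors into one SoA block: all x together, all y together, and so on. */
static void aos_to_soa(Vec4x4 *out, const Vec4 in[4])
{
    assert(out && in);
    for (int i = 0; i < 4; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
        out->w[i] = in[i].w;
    }
}
```

After the transpose, one vector instruction over `x[0..3]` touches the same component of four different vertices.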
    79. /* Damage: Lessons III */
        - Add value to your SPU program for relatively small computational effort:
        - We added some of the per-vertex lighting calculations for brake lights, for example.
    81. /* Damage: Results */
    82. /* Lighting: Solution */
        - Our solution:
        - SPU-generated cube maps.
        - 40 in total (accounting for double buffer).
        - 8x8 per face.
        - Deferred.
        - Work split between GPU/SPU.
        - Cars are lit with a mixture of things:
        - SH (world + dynamic)
        - Cube map lighting
        - Vertex lighting
    92. /* Lighting: Data */
        - Input:
        - Nearest 16 lights.
        - Output:
        - Cube map.
        - Simples!
        [Diagram: six Light inputs, SPU, Cube map]
    97. /* Lighting: Control */
        - Each frame PPU kicks off SPU-tasks to build cube maps.
        - Cube maps are double buffered to avoid artefacts and contention with GPU.
        - Workload scalable.
        - Number of cube maps per task can change dynamically if need be.
    101. /* Lighting: SPU */
         - On SPU we put some ‘Bizarre Creations Secret Sauce™’ into the cube maps:
    102. /* Lighting: GPU */
         - On GPU, we sample with reflected view vector:
           reflect(view_dir, normal);
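
The shader `reflect` intrinsic used on the slide above computes `i - 2 * dot(n, i) * n` for an incident vector `i` and a unit normal `n`. A plain C equivalent is sketched here; `Vec3` and `reflect3` are illustrative names, not part of any shipped API.

```c
typedef struct Vec3 { float x, y, z; } Vec3;

/* Reflect the incident vector i about the unit-length normal n. */
static Vec3 reflect3(Vec3 i, Vec3 n)
{
    float d = 2.0f * (i.x * n.x + i.y * n.y + i.z * n.z);  /* 2 * dot(n, i) */
    Vec3 r = { i.x - d * n.x, i.y - d * n.y, i.z - d * n.z };
    return r;
}
```

A view direction of (1, -1, 0) hitting a surface with normal (0, 1, 0) reflects to (1, 1, 0), which is the direction used for the cube map look-up.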
    103. /* Lighting: Results I */
    106. /* Lighting: Results II */
    108. /* Part II w/ Steve McAuley */
         SPU Acceleration of Fragment Shading
    109. /* The Problem */
         - Problem:
           Our fragment programs are expensive.
         - Solution:
           Let’s use the SPUs to help.
    110. /* The Pipeline */
         [Diagram: Vertices -> Vertex shader -> Triangle setup -> Rasterisation -> Fragment shader (fed by Textures) -> ROP]
    111. /* Look It Up! */
         - Solution:
           Make look-up textures on the SPUs to speed up our fragment programs.
         - What could we look up?
           Lighting
           Shadows
           Ambient occlusion
           Fog
         - Sounds like deferred rendering!
    112. Forward Rendering = FAIL
    113. /* Case Study: Lighting */
         - Goal:
           Move dynamic lighting into a look-up texture.
         - Solution:
           Sounds like deferred rendering!
         - In Blur, we used a light pre-pass renderer.
    114. /* Light Pre-Pass */
         [Diagram labels: Geometry, Normals, Depth, Real-Time Lighting, Geometry, Final Colour]
    115. /* A Frame of Blur */
         [Timeline diagram. GPU labels: Solid, Alpha, Post, Pre-Pass, Lights, Mirror, Cube Map & Reflection]
    116. /* A Frame of Blur */
         - Move the lights onto the SPUs:
           But there’s a gap!
         [Timeline diagram. GPU labels: Solid, Alpha, Post, Pre-Pass, Mirror, Cube Map & Reflection. SPU labels: Lights]
    117. /* A Frame of Blur */
         - Option #1:
           Defer the lighting by a frame.
         [Timeline diagram. GPU labels: Solid, Alpha, Post, Pre-Pass, Mirror, Cube Map & Reflection. SPU labels: Lights]
    118. /* A Frame of Blur */
         - Option #2:
           Parallelise with another part of the rendering.
         [Timeline diagram. GPU labels: Solid, Alpha, Post, Pre-Pass, Mirror, Cube Map & Reflection. SPU labels: Lights]
    119. /* A Frame of Blur */
         - Option #2:
           Taking it further…
         [Timeline diagram. GPU labels: Solid, Alpha, Post, Pre-Pass, Mirror, Cube Map & Reflection, Shadows. SPU labels: Lights, Blur]
    120. /* Parallelism */
         - Key point: you must find something to parallelise with!
           Design your engine accordingly!
           Otherwise you risk a frame of latency.
         - This is true multi-GPU.
           Two graphics processors, working on separate tasks, in parallel.
    121. /* Case Study: Lighting */
         - Goal:
           Move the lighting stage of the light pre-pass onto the SPUs.
         - There are just six easy steps to enlightenment…
         /* Step #1: The Data */
         [Diagram labels: Normals, Depth, Transform, Lights]
    122. /* Step #1: The Data */
         [Diagram labels: Normal X, Normal Y, Depth Hi, Depth Lo, Transform, Lights]
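
The four labels Normal X, Normal Y, Depth Hi and Depth Lo suggest one byte each in a 32-bit texel. The sketch below packs and unpacks such a texel. The byte order, the 8-bit normal quantisation and the 16-bit depth are assumptions read off the diagram labels, and all function names are hypothetical; rebuilding the third normal component is left out because the slides do not say how it was done.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two normal components in [-1, 1] and a 16-bit depth into one texel,
   laid out as [normal x | normal y | depth hi | depth lo]. */
static uint32_t pack_normal_depth(float nx, float ny, uint16_t depth)
{
    assert(nx >= -1.0f && nx <= 1.0f && ny >= -1.0f && ny <= 1.0f);
    uint32_t bx = (uint32_t)((nx * 0.5f + 0.5f) * 255.0f + 0.5f);  /* round to 0..255 */
    uint32_t by = (uint32_t)((ny * 0.5f + 0.5f) * 255.0f + 0.5f);
    return (bx << 24) | (by << 16) | (uint32_t)depth;
}

static uint16_t unpack_depth(uint32_t texel)
{
    return (uint16_t)(texel & 0xffffu);
}

static float unpack_normal_x(uint32_t texel)
{
    return (float)(texel >> 24) / 255.0f * 2.0f - 1.0f;
}

static float unpack_normal_y(uint32_t texel)
{
    return (float)((texel >> 16) & 0xffu) / 255.0f * 2.0f - 1.0f;
}
```

With this layout the SPU job reads a single buffer to get both inputs for the lighting equations.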
    123. /* Step #2: Jobs */
         - We have six SPUs, and each of them wants a lighting job…
         - Divide the frame buffer into tiles.
         - Each tile is a unit of work.
    126. /* Step #2: Jobs */
         [Diagram: six SPUs sharing an Atomic Increment Index]
         - Keep working until they’re all gone!
           (Then hand out the P45s…)
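
The atomic-increment index hands out tiles without a central scheduler: each worker bumps a shared counter and owns whatever index it got back. C11 atomics stand in for the SPU's atomic update of the counter in this sketch, and `TileQueue` and `claim_tile` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct TileQueue {
    atomic_uint next;   /* index of the next tile to hand out */
    unsigned    count;  /* total number of tiles in the frame */
} TileQueue;

/* Claims the next tile. Returns its index, or -1 once every tile is taken. */
static int claim_tile(TileQueue *q)
{
    assert(q);
    unsigned i = atomic_fetch_add(&q->next, 1u);  /* returns the value before the increment */
    return i < q->count ? (int)i : -1;
}
```

Each worker loops on `claim_tile` until it returns -1, so faster workers naturally pick up more tiles.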
    127. /* Step #3: Sync */
         - Can be a time sink if you’re not careful!
           Expect to find your worst bugs here.
           Best to get it right first time!
    128. /* Step #3: Sync */
         [Timeline diagram, built up over several slides. GPU labels: Solid, Alpha, Post, Pre-Pass, Mirror, Cube Map & Reflection, WriteLabel, Jump To Self. SPU labels: Lights, Wait on Label]
    135. /* Step #4: Culling */
         - Build a view frustum for each tile.
           Remember, we have the depth buffer so can calculate the minimum and maximum depth!
         - Gather only the lights that intersect this frustum.
         - Cull an entire tile if:
           Depth min and max are both far clip.
           No lights intersect.
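
The two cull conditions above reduce to a min/max scan over the tile's depth values plus a light count. The sketch below assumes 16-bit depth with the far plane stored as `0xffff`; that encoding and all names are assumptions, and the frustum-versus-light intersection test itself is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAR_CLIP_DEPTH 0xffffu  /* assumption: far plane is the largest 16-bit depth */

typedef struct DepthRange { uint16_t min, max; } DepthRange;

/* Scan every pixel in a tile once and record the depth extents. */
static DepthRange tile_depth_range(const uint16_t *depth, size_t pixel_count)
{
    assert(depth && pixel_count > 0);
    DepthRange r = { depth[0], depth[0] };
    for (size_t i = 1; i < pixel_count; ++i) {
        if (depth[i] < r.min) r.min = depth[i];
        if (depth[i] > r.max) r.max = depth[i];
    }
    return r;
}

/* A tile is skipped when it is entirely at the far plane or no light reaches it. */
static int tile_is_culled(DepthRange r, unsigned lights_intersecting)
{
    int all_far = r.min == FAR_CLIP_DEPTH && r.max == FAR_CLIP_DEPTH;
    return all_far || lights_intersecting == 0;
}
```

The same min and max also give the near and far planes of the per-tile frustum used to gather lights.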
    137. /* Step #5: Light! */
    138. /* Step #6: Optimise! */
         - Multi-buffering:
           Do the following simultaneously:
           Load data for next job.
           Process data for the current job.
           Save data from the previous job.
           Costs local store but is usually worth it.
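
With three buffers, the load, process and save roles simply rotate each job. The sketch below shows only the index bookkeeping; `BufferRoles` and `roles_for_job` are hypothetical names, and the DMA calls that would use these indices are omitted.

```c
enum { NUM_BUFFERS = 3 };

/* Which buffer plays which role while job number `job` is being processed. */
typedef struct BufferRoles { int load, process, store; } BufferRoles;

static BufferRoles roles_for_job(unsigned job)
{
    BufferRoles r;
    r.process = (int)(job % NUM_BUFFERS);                       /* current job's data */
    r.load    = (int)((job + 1u) % NUM_BUFFERS);                /* next job streams in */
    r.store   = (int)((job + NUM_BUFFERS - 1u) % NUM_BUFFERS);  /* previous job streams out */
    return r;
}
```

The three indices are always distinct, so a transfer in flight never touches the buffer being processed.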
    139. /* Step #6: Optimise! */
         - Structure-of-arrays:
           Transpose your data for massive damage!
           e.g.
           AoS: x y z w | x y z w | x y z w | x y z w
           SoA: x x x x | y y y y | z z z z | w w w w
    140. /* Step #6: Optimise! */
         - Array-of-structures:
           1 dot product, 23 cycles
           qword d0 = si_fm(xyz0, abc0);
           qword d1 = si_rotqbyi(d0, 0x4);
           qword d2 = si_rotqbyi(d0, 0x8);
           qword dot = si_fa(d0, d1);
           dot = si_fa(dot, d2);
         - Structure-of-arrays:
           4 dot products, 18 cycles
           qword dot0123 = si_fm(x0123, a0123);
           dot0123 = si_fma(y0123, b0123, dot0123);
           dot0123 = si_fma(z0123, c0123, dot0123);
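
The structure-of-arrays version can be read as plain C: each loop iteration below corresponds to one lane of the quadword, and the three statements mirror the `si_fm` and two `si_fma` calls. This is a scalar sketch for following the data flow, with `dot4_soa` as a made-up name; it says nothing about cycle counts.

```c
#include <assert.h>

/* Four 3-component dot products at once, with the inputs in SoA layout. */
static void dot4_soa(float out[4],
                     const float x[4], const float y[4], const float z[4],
                     const float a[4], const float b[4], const float c[4])
{
    assert(out && x && y && z && a && b && c);
    for (int i = 0; i < 4; ++i) {
        float acc = x[i] * a[i];   /* si_fm: multiply */
        acc = y[i] * b[i] + acc;   /* si_fma: multiply-add */
        acc = z[i] * c[i] + acc;   /* si_fma: multiply-add */
        out[i] = acc;
    }
}
```

No lane ever needs a value from another lane, which is why the SoA form avoids the rotates used in the array-of-structures version.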
    141. /* Step #6: Optimise! */
         - Batching:
           Light 16 pixels at a time.
           Minimises dependent instruction stalls.
           Helps compiler with even/odd pipeline balance.
           Use trial and error to find your ideal batch size!
           A balance between register spilling and setup cost.
    142. /* Case Study: Lighting */
         - Ran on 3 SPUs.
         - Slightly faster than the RSX.
         - An optimisation even if you have nothing to parallelise with!
    145. /* Case Study: Lighting */
    146. /* The Complete Picture */
         [Diagram labels: Lighting, Damage, Rendering, Physics]
    147. /* Conclusion */
         - Use the SPUs to accelerate your rendering!
           Think about the data.
           Design your engine appropriately.
           Avoid frames of latency.
           Keep synchronisation simple.
           Add value.
         - It’s actually really easy, try it!
    148. /* Further Reading */
         - Steven Tovey & Stephen McAuley, “Parallelized Light Pre-Pass Rendering with the Cell Broadband Engine”, GPU Pro
         - Stephen McAuley & Steven Tovey, “A Bizarre Way to do Real-Time Lighting”, Develop in Liverpool 2009
    149. If you’re talented, then we’re hiring ;)
         jobs@bizarrecreations.com
    150. lqd $r1,question_count
         stopd $r0,$r0,0x1
         ; thanks for listening! ;)
         brnz $r1,questions
