Neil Brown – Senior Engineer
Developer Services
Sony Computer Entertainment Europe
Research & Development Division
PlayStation®: Cutting Edge Techniques
Slide 2
• PlayStation® Development Resources
• PSP Case study
– Motorstorm® : Arctic Edge
• PS3 Case Studies
– Maintaining Framerate
– SPU assisted graphics
– Other tips
What I Will Be Covering
Slide 3
SCEE R&D
Developer Services
• Great Marlborough Street in
Central London
• Cover ‘PAL’ Region
Developers
Slide 4
• Support – SCE DevNet
• Field Engineering
• Technical and Strategic Consultancy
• TRC Consultancy
• Technical Training
What We Cover
Slide 5
PlayStation® DevNet
Slide 6
PS3 DevKit
Latest additions to Dev Tools
PS3 Test Kit
Slide 7
Existing PSP DevKit
Latest additions to Dev Tools
PSPgo
Commander Arm
Slide 8
• PS3 DevKit: €1,700
• PS3 TestKit: €900
• PSP DevKit: €1,200
• PSP Commander: €300
Dev Tools Prices
Slide 9
• An illustration of some techniques used in
real games
– Making use of the PSP® and PS3™
Hardware
– How to combine the Cell Broadband Engine™
with RSX™
Case Studies
Slide 10
• Migration from a PS3™ title
• Platform specific game engine, 30Hz
• Render targets are oversized to 512x304, 32-bit
– Soften jaggies
Motorstorm® : Arctic Edge
Slide 11
• Use morph targets for better realism
• 2 LODs
• Real-time lit, with specular and modulated
by track lighting
• Each skin is divided up into sub-skins that
are affected by a maximum of 8 different
bones
• Each skin’s skeleton has 16 bones
Vehicles and Character
Slide 12
• Streamed with 3 blocks (15k polys) in
memory
• Lighting baked into vertices
• Maximum of 8.5MB
• Specular snow rendering uses 2 passes:
– untextured, specularly-lit only
– textured layer with alpha "holes" which
modulate the addition of the lower layer
• Whole elements to be clipped
Landscape
Slide 13
• Custom physics code to mimic PS3™ Motorstorm
• Vehicles are represented by convex hulls.
• The landscape is represented by compressed
vertices of an arbitrary mesh in a grid-based scene.
• Ragdoll physics is a Verlet system.
Physics
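As an illustration of what a Verlet system for ragdolls typically involves, here is a minimal sketch of position Verlet integration plus a distance constraint that acts as a rigid bone. The types and function names (`Particle`, `verletStep`, `satisfyDistance`) are hypothetical and are not taken from the game's code; the actual ragdoll solver is not described on the slide.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

// Hypothetical ragdoll joint: velocity is implicit in (pos - prevPos).
struct Particle {
    Vec3 pos;      // current position
    Vec3 prevPos;  // position one time step ago
};

// Position Verlet: next = pos + (pos - prevPos) + accel * dt^2
void verletStep(Particle& p, Vec3 accel, float dt) {
    assert(dt > 0.0f);
    Vec3 next = p.pos + (p.pos - p.prevPos) + accel * (dt * dt);
    p.prevPos = p.pos;
    p.pos = next;
}

// Move two particles symmetrically so they sit restLength apart (a "bone").
void satisfyDistance(Particle& a, Particle& b, float restLength) {
    Vec3 d = b.pos - a.pos;
    float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
    if (len == 0.0f) return;  // coincident particles: no direction to correct along
    float half = 0.5f * (len - restLength) / len;
    a.pos = a.pos + d * half;
    b.pos = b.pos - d * half;
}
```

In a full solver the constraint pass would usually be repeated a few times per frame over all bones; that loop is omitted here.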
Slide 14
• Pre-built display lists for most objects
• Compressed data for all geometry
• Z-fighting was heavily reduced by increasing the far
clipping plane massively beyond the actual draw distance
required
Graphics Optimisation
Slide 15
Graphics Optimisation
[Figure: two Z-buffer diagrams, each marking the near clip plane, the far clip plane and the most accurate section of the Z-buffer; the first also marks the actual draw distance]
Slide 16
Graphics on PS3™
Slide 17
• Different widths between 1280 and 1920 for a 1080p 60Hz game
– Or 1024 to 1280 for a 720p game
• Feedback loop adjusts render width based on prior frame durations
– Resize is hidden in post processing full screen passes
– Horizontal hardware scaler helps for 1080p modes
Dynamic resolution
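The feedback loop above could be sketched roughly as follows. This is only an illustration under stated assumptions: GPU cost is taken to be proportional to render width, a single previous frame time is used rather than a history, and the 0.5 damping factor and 32-pixel step are invented values. `DynamicWidth` is a hypothetical name, not part of any SDK.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical controller for the horizontal render resolution.
struct DynamicWidth {
    int minWidth;    // narrowest allowed render width, e.g. 1280
    int maxWidth;    // full output width, e.g. 1920
    float budgetMs;  // target GPU time per frame
    int width;       // width used for the frame that was just measured

    // Call once per frame with the measured GPU time of the previous frame.
    int update(float lastFrameMs) {
        assert(lastFrameMs > 0.0f && minWidth <= maxWidth);
        // Assumption: cost scales with pixel count, and so with width.
        const float ratio = budgetMs / lastFrameMs;
        // Move only half way towards the ideal width to avoid oscillation.
        const float target = width * (1.0f + 0.5f * (ratio - 1.0f));
        // Snap to the nearest multiple of an assumed 32-pixel step.
        const int step = 32;
        const int snapped = static_cast<int>(target / step + 0.5f) * step;
        width = std::max(minWidth, std::min(maxWidth, snapped));
        return width;
    }
};
```

For example, with a 16 ms budget and a 1920-wide frame that took 20 ms, this sketch drops the next frame to 1728 pixels wide; the full-screen post-processing pass would then resample back to the output width.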
Slide 18
• Framerate can vary over time
– Particle systems can eat GPU time
– Views affect render time
• Cull objects
– Small screen footprint
– Used on Motorstorm 2
Maintaining frame rate
Slide 19
• Dynamic geometry LOD
– Average per mesh polygon size computed offline
– Viewport polygon size used to pick lowest LOD
– For multiplayer, LODs automatically adjust
• (smaller viewports)
• Shader LOD
– Reduce shader instruction count in distance
• Lose parallax & normal – keep specular map
Reducing GPU load
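One plausible reading of the geometry LOD bullets is sketched below: each LOD stores the average polygon area computed offline, and at run time the coarsest LOD whose polygons still project below a pixel-area threshold is chosen. The names `MeshLod` and `pickLod`, the pinhole projection and the threshold are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-LOD record produced by an offline tool.
struct MeshLod {
    float avgPolyArea;  // average polygon area in world units squared
};

// lods are ordered from most detailed (index 0) to coarsest (last).
// focalLengthPx depends on the viewport size, so a smaller split-screen
// viewport automatically selects coarser LODs.
std::size_t pickLod(const std::vector<MeshLod>& lods, float distance,
                    float focalLengthPx, float maxPolyAreaPx) {
    assert(!lods.empty() && distance > 0.0f);
    const float scale = focalLengthPx / distance;  // world units to pixels
    const float areaScale = scale * scale;         // areas scale quadratically
    // Walk from coarsest to finest and stop at the first LOD whose average
    // polygon still projects small enough on screen.
    for (std::size_t i = lods.size(); i-- > 0;) {
        if (lods[i].avgPolyArea * areaScale <= maxPolyAreaPx) return i;
    }
    return 0;  // nothing qualifies: fall back to the most detailed mesh
}
```

Halving `focalLengthPx` (as a half-size multiplayer viewport would) quarters the projected polygon area, which is how the LODs adjust without any extra logic.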
Slide 20
Scene without LOD
Slide 21
Scene with LOD
Slide 22
• Many games are fragment shader bound
• Rendering Z only ‘primes’ the RSX™ Z-cull unit
– Very fast, 16 pixels/clock rather than 8
– Render entire scene,
– Or ‘large’ meshes only
– Easily save 10% GPU
Z pre-pass
Slide 23
Slide 24
Slide 25
Slide 26
• Processors are becoming increasingly parallel
• Frame time is bound by the max of any of these components
Game Loop
Slide 27
• If any part runs too slowly, offload tasks onto
a helper CPU
Helper CPU
Slide 28
• Problem: Update an array of objects and submit the visible
ones for rendering
• Object.update() method was the bottleneck on PPE
– Determined using SN Tuner
– This function generates the world space matrices for each
object
– Embarrassingly parallel
• No data dependencies between objects
• If we had 1 processor for each object we could do that
Function offload example using a SPURS Task
Slide 29
SPURS Tasks
//--- Conditionally calling SPU code
if (usingSPU == false) //--- generic version: update every object on the PPU
{
    for (int i = 0; i < OBJECT_COUNT; i++)
    {
        testObject[i].update();
    }
}
else //--- SPU version: hand the whole array to a SPURS task
{
    int test = spurs.ThrowTaskRunComplete(taskUpdateObject_elf_start,
                                          (int)testObject,
                                          OBJECT_COUNT, 0, 0);
    //could do something useful here…
    int result = spurs.CatchTaskRunComplete(test);
}
Slide 30
• Getting data in and out of the SPU is the first problem
• Get this working before worrying about actual processing
– Brute force often works just fine
• DMA entire object array into SPE memory
• Run update method
• DMA entire object array back to main memory
• List DMA allows you to get fancy
– Gather and Scatter operation
• If the data set is really big or if you need every last drop of performance
– Streaming model
• Overlap DMA transfers with processing
• Double buffering
SPURS Task
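The streaming model with double buffering might look roughly like the sketch below. It is written as portable C++ so it can run anywhere: `dmaGet`, `dmaPut` and `dmaWait` are hypothetical stand-ins implemented with `memcpy`, whereas on an SPU they would start asynchronous transfers and wait on a tag. `GfxObject` is a placeholder for the object type, and the chunk size is an assumed value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Placeholder for the object type; the real layout is not shown on the slides.
struct GfxObject {
    float x;
    void update() { x += 1.0f; }
};

const std::size_t kChunk = 64;  // objects per local buffer (assumed size)

// Hypothetical DMA helpers. Here the copies complete immediately; on real
// hardware they would be asynchronous and dmaWait would block on a tag.
static void dmaGet(GfxObject* local, const GfxObject* mainMem, std::size_t n) {
    std::memcpy(local, mainMem, n * sizeof(GfxObject));
}
static void dmaPut(GfxObject* mainMem, const GfxObject* local, std::size_t n) {
    std::memcpy(mainMem, local, n * sizeof(GfxObject));
}
static void dmaWait(int /*bufferTag*/) {}

static std::size_t chunkLength(std::size_t chunk, std::size_t count) {
    return std::min(kChunk, count - chunk * kChunk);
}

void streamUpdate(GfxObject* mainMem, std::size_t count) {
    assert(mainMem != nullptr || count == 0);
    if (count == 0) return;
    GfxObject buf[2][kChunk];
    const std::size_t chunks = (count + kChunk - 1) / kChunk;
    int cur = 0;
    dmaGet(buf[cur], mainMem, chunkLength(0, count));  // prime the first buffer
    for (std::size_t c = 0; c < chunks; ++c) {
        const int next = cur ^ 1;
        if (c + 1 < chunks) {
            dmaWait(next);  // the earlier put from this buffer must be done
            dmaGet(buf[next], mainMem + (c + 1) * kChunk,
                   chunkLength(c + 1, count));  // prefetch the next chunk
        }
        dmaWait(cur);  // data for the current chunk has arrived
        const std::size_t n = chunkLength(c, count);
        for (std::size_t i = 0; i < n; ++i) buf[cur][i].update();
        dmaPut(mainMem + c * kChunk, buf[cur], n);  // write back while next loads
        cur = next;
    }
    dmaWait(0);
    dmaWait(1);  // drain outstanding puts before returning
}
```

The point of the structure is that the get for chunk c+1 is issued before chunk c is processed, so with truly asynchronous DMA the transfer time hides behind the update work.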
Slide 31
SPURS Task: Getting Data in and out
int cellSpuMain(.....)
{
    int sourceAddr = ....;                    //source address
    int count = ....;                         //number of data elements
    int dataSize = count * sizeof(gfxObject); //amount of memory to get
    gfxObject *buf = (gfxObject*)memalign(128, dataSize); //alloc mem
    DmaGet((void*)buf, sourceAddr, dataSize);
    DmaWait(....);
    //---data is loaded at this point so do something interesting
    DmaPut((void*)buf, sourceAddr, dataSize);
    DmaWait(....);
    return(0);
}
Slide 32
• Keep code the same as PPU Version
– Conditional compilation based on processor type
– Might have to split code into a separate file
SPURS Task: Executing the Code
//--- SPU code to call the update method
//--- data is loaded at this point so do something interesting
//--- step through array of objects and update
gfxObject *tp;
for (int i = 0; i < count; i++)
{
    tp = &buf[i];
    tp->update(); //same method as on PPE compiled for SPU
}
Slide 33
• 5x speed-up in this instance
– Brute force solution
– Could add more SPUs
– Problem is embarrassingly parallel
SPURS Task: Results
Slide 34
• Simple pipeline
– No stalls like LHS
• No Operating System, very few system calls
– Performance is high and deterministic
• Memory is very fast, like L1 cache
– Memory transfers are asynchronous so can be hidden
SPUs are faster than the PPU
Slide 35
• PPU Stalled while waiting for SPU to Process
• SPU Data get and put were synchronous
– Not using hardware to its fullest extent
– Easy to get more if we need it
SPURS Task: Time Waster
[Timeline: PPU work either side of the SPU's GET, Exec and PUT stages, plotted against time]
Slide 36
SPU Migration
• Look for large blocks of data with simple processing and
minimal synchronization requirements
– Vertex transform
– Animation
– Audio
– Image processing
• Entire subsystems that can be moved over
Slide 37
• Graphics API
• Multistream – Audio
• Bullet – Physics
• PlayStation®Edge
– Geometry/Compression/Animation/Post FX
• FIOS – Optimised disc access
Our Libraries that run on SPUs
Slide 38
• Animation
• AI Threat prediction
• AI Line of fire
• AI Obstacle avoidance
• Collision detection
• Physics
• Particle simulation
• Particle rendering
• Scene graph
Killzone®2 SPU Usage
• Display list building
• IBL Light probes
• Graphics post-processing
• Dynamic music system
• Skinning
• MP3 Decompression
• Edge Zlib
• etc.
44 Job types
Slide 39
• Accelerate both PPU and RSX™
• Allows the overall frame time to be shortened
SPU can do more than CPU tasks
Slide 40
• Visibility and
Geometry Culling
• Vertex and lighting
processing
• Post processing
Offloading the GPU
Slide 41
• ‘Edge’ SPU library to offload vertex work from RSX™
– Trade SPU time for GPU performance
• Animation, skinning, culling and transformation
– Pre-culled, so all RSX™ work generates pixels
Geometry processing
Slide 42
• Engine supports up to 256 vertex lights
• Light accumulation
– RGBE vertex colour
– Runs on SPU
• Zero impact on RSX™
Wipeout® weapon lighting
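To illustrate light accumulation into an RGBE vertex colour, the sketch below sums light contributions in floating point and packs the result into a shared-exponent RGBE value, following the widely used Radiance-style encoding. Whether the game uses exactly this bit layout is an assumption, and the names `Rgb`, `Rgbe`, `accumulateLights`, `encodeRgbe` and `decodeRgbe` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct Rgb  { float r, g, b; };
struct Rgbe { std::uint8_t r, g, b, e; };  // 8-bit mantissas, shared exponent

// Sum the per-light contributions for one vertex.
Rgb accumulateLights(const Rgb* contributions, int count) {
    Rgb sum = {0.0f, 0.0f, 0.0f};
    for (int i = 0; i < count; ++i) {
        sum.r += contributions[i].r;
        sum.g += contributions[i].g;
        sum.b += contributions[i].b;
    }
    return sum;
}

// Pack an HDR colour: the largest channel sets the shared exponent.
Rgbe encodeRgbe(Rgb c) {
    assert(c.r >= 0.0f && c.g >= 0.0f && c.b >= 0.0f);
    const float maxC = std::max(c.r, std::max(c.g, c.b));
    if (maxC < 1e-32f) return Rgbe{0, 0, 0, 0};
    int exponent = 0;
    const float mantissa = std::frexp(maxC, &exponent);  // maxC = mantissa * 2^exponent
    const float scale = mantissa * 256.0f / maxC;
    return Rgbe{static_cast<std::uint8_t>(c.r * scale),
                static_cast<std::uint8_t>(c.g * scale),
                static_cast<std::uint8_t>(c.b * scale),
                static_cast<std::uint8_t>(exponent + 128)};
}

// Unpack back to floating point.
Rgb decodeRgbe(Rgbe v) {
    if (v.e == 0) return Rgb{0.0f, 0.0f, 0.0f};
    const float f = std::ldexp(1.0f, static_cast<int>(v.e) - (128 + 8));
    return Rgb{v.r * f, v.g * f, v.b * f};
}
```

The shared exponent lets many bright lights add up beyond 1.0 without clamping, while the vertex still only carries four bytes of colour.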
Slide 43
• Geometric occluders
– Tested using halfplanes
– Fast checks on SPU
• Low res software occlusion
– Coarse Zbuffer on SPU
– 256x114 float Z
– Low-poly conservative occluders
Occlusion Culling
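The low-resolution software occlusion idea can be sketched as a small float depth buffer that occluders write into and object bounds are tested against. This is a simplified illustration: occluders and objects are reduced to screen-space rectangles with a single depth, whereas a real implementation would rasterise the low-poly occluder triangles. `CoarseZBuffer` and its methods are hypothetical names; only the 256x114 float resolution comes from the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

const int kZWidth = 256;   // resolution quoted on the slide
const int kZHeight = 114;

class CoarseZBuffer {
public:
    CoarseZBuffer() : depth_(kZWidth * kZHeight, 1.0f) {}  // 1.0 = far plane

    // Write an occluder covering texels [x0,x1) x [y0,y1). Using the
    // occluder's farthest depth keeps the later test conservative.
    void addOccluderRect(int x0, int y0, int x1, int y1, float farthestZ) {
        clampRect(x0, y0, x1, y1);
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x) {
                float& d = depth_[y * kZWidth + x];
                d = std::min(d, farthestZ);
            }
    }

    // An object is potentially visible if its nearest depth is not behind
    // the stored depth at any texel covered by its screen bounds.
    bool isVisible(int x0, int y0, int x1, int y1, float nearestZ) const {
        clampRect(x0, y0, x1, y1);
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x)
                if (nearestZ <= depth_[y * kZWidth + x]) return true;
        return false;
    }

private:
    static void clampRect(int& x0, int& y0, int& x1, int& y1) {
        assert(x0 <= x1 && y0 <= y1);
        x0 = std::max(x0, 0);
        y0 = std::max(y0, 0);
        x1 = std::min(x1, kZWidth);
        y1 = std::min(y1, kZHeight);
    }

    std::vector<float> depth_;
};
```

At four bytes per texel the whole buffer is about 114 KB, which is why a coarse resolution matters when it has to fit in SPU local store alongside code and other data.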
Slide 44
• SPU/PPU
– Object culling – Offscreen or very small objects
– Occlusion Culling – Occluded objects
– Edge Geometry – Vertex culling
• RSX
– Early Z-cull – Coarse pixel culling
• Before fragment shader
• Remember to do a z pre-pass
– Depth test – Pixel culling
• After fragment shader
Culling
Slide 45
• SPU geometry code for landscape
– SPUs can generate or remove graphic primitives (procedural, sub-division
surfaces…)
– Allows sending only the triangles that contribute to the final scene
Continuous LOD
Slide 46
PhyreEngine™
• Free game engine including
– Modular run time
– Samples, docs and whitepapers
– Full source and art work
• Optimized for multi-core especially PS3™
• Extractable and reusable components
Slide 47
Procedural Foliage
• Final SPU program generates RSX™ geometry with movement
• LOD calculated and transition to billboards
Original models are from the Xfrog Plant Library and used with permission of Greenworks Organic Software.
Slide 48
Offloading the Scene Traversal
Slide 49
• Motion Blur, Bloom, Depth of field processed on SPU
– Quarter-res image
– Merged into final scene
Killzone® post processing
Slide 50
• SPUs assist RSX™
1. RSX™ prepares low-res image buffers in XDR
2. RSX™ triggers interrupt to start SPUs
3. SPUs perform image operations
4. RSX™ already starts on next frame
5. Result of SPUs processed by RSX™ early in next frame
Post Processing
Slide 51
A Few Samples
SSAO, Depth of Field
Slide 52
A Few Samples
Exponential shadow map, Motion Blur
Slide 53
Post-Effect Impl. Difference
RSX™ | SPUs
Bilinear filtering is nearly free | Bilinear filtering is expensive
Nearest filtering is nearly free | SPU only supports truncate mode
Sequential texture accesses can benefit from texture cache | Sequential texture accesses need to be DMA transferred into LS
Random accesses are handled nicely at a cost of trashing the texture cache | Random accesses means handling multiple small DMA transfers
Slide 54
• HD resolution is still expensive
– run post-processing at lower resolution!
Performance
Effect | Single SPU at 720p | Five SPUs at 720p
Depth of Field | ~22.76 ms | ~4.74 ms
ROP | ~14.36 ms | ~2.99 ms
Motion Blur (Intrinsics) | ~27.5 ms | ~5.73 ms
Motion Blur (hand tuned) | ~17.6 ms | ~3.67 ms
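The timings above imply close to linear scaling across SPUs. As a small worked check, the sketch below divides the single-SPU time by the five-SPU time for each effect; every row comes out at roughly 4.8x, or about 96% parallel efficiency. The `Timing` struct and helper names are only for this illustration; the numbers are copied from the table.

```cpp
#include <cassert>

struct Timing {
    const char* effect;
    float oneSpuMs;   // single SPU at 720p
    float fiveSpuMs;  // five SPUs at 720p
};

const Timing kTimings[] = {
    {"Depth of Field",           22.76f, 4.74f},
    {"ROP",                      14.36f, 2.99f},
    {"Motion Blur (Intrinsics)", 27.5f,  5.73f},
    {"Motion Blur (hand tuned)", 17.6f,  3.67f},
};

// How many times faster the five-SPU run is.
float speedup(const Timing& t) {
    assert(t.fiveSpuMs > 0.0f);
    return t.oneSpuMs / t.fiveSpuMs;
}

// Fraction of ideal linear scaling achieved (1.0 = perfect).
float efficiency(const Timing& t, int spuCount) {
    assert(spuCount > 0);
    return speedup(t) / static_cast<float>(spuCount);
}
```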
Slide 55
• Every PS3™ comes with an HDD
• Can access both HDD and Blu-ray simultaneously
– Helps when continuously streaming data: levels,
actors, sounds, music, textures…
– 2GB of System Cache
• Allows prefetching data in advance
Data Streaming
Slide 56
• Brute force method
– High quality images
– 128 FP16 frames
• Depth of field
• Motion Blur
– Use sub-pixel jitter to give 128x super-sampled AA
Photo-Mode
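The brute-force accumulation could be sketched as below: generate one sub-pixel offset per frame, render the scene with that offset, and average the results. The Halton sequence used for the offsets is an assumption (the slide does not say how the jitter pattern is chosen), and `renderSample` is a hypothetical callback standing in for a full frame render.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Radical inverse of index in the given base, in [0, 1).
float halton(unsigned index, unsigned base) {
    assert(base >= 2);
    float f = 1.0f, r = 0.0f;
    while (index > 0) {
        f /= static_cast<float>(base);
        r += f * static_cast<float>(index % base);
        index /= base;
    }
    return r;
}

struct Jitter { float dx, dy; };  // sub-pixel offset in [-0.5, 0.5)

std::vector<Jitter> makeJitter(unsigned frameCount) {
    std::vector<Jitter> out;
    out.reserve(frameCount);
    for (unsigned i = 1; i <= frameCount; ++i)
        out.push_back(Jitter{halton(i, 2) - 0.5f, halton(i, 3) - 0.5f});
    return out;
}

// renderSample(pixelIndex, dx, dy) is a hypothetical callback that returns
// the value of one pixel when the camera is shifted by (dx, dy) pixels.
template <typename RenderFn>
std::vector<float> accumulateFrames(std::size_t pixelCount,
                                    const std::vector<Jitter>& jitter,
                                    RenderFn renderSample) {
    assert(!jitter.empty());
    std::vector<float> sum(pixelCount, 0.0f);
    for (const Jitter& j : jitter)
        for (std::size_t p = 0; p < pixelCount; ++p)
            sum[p] += renderSample(p, j.dx, j.dy);
    for (float& v : sum) v /= static_cast<float>(jitter.size());
    return sum;
}
```

With 128 jittered frames each output pixel is the average of 128 differently placed samples, which is where the 128x super-sampling figure comes from; varying the lens position or time per frame in the same loop would give depth of field and motion blur.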
Slide 57
Slide 58
• Think of the system and your game as a whole
– Rearrange your rendering if necessary
– Rendering is not just a GPU problem
• Unlock the potential of multi-core programming
– PS3™ = PPU + SPUs + RSX™ working together in parallel
– Minimize contention
Summary
Slide 59
https://www.tpr.scee.net/
Slide 60
http://www.worldwidestudios.net/xdev
games_development@scee.net
Slide 61
• Thanks to all the studios that let me talk about their games
ANY QUESTIONS?
www.scee.com
ps3.scedev.net
research.scee.net
tpr.scee.net
www.worldwidestudios.net/xdev
Finally
