PlayStation: Cutting Edge Techniques

Neil Brown – Senior Engineer
Developer Services
Sony Computer Entertainment Europe
Research & Development Division
PlayStation®: Cutting Edge Techniques

• PlayStation® Development Resources
• PSP Case Study
– Motorstorm® : Arctic Edge
• PS3 Case Studies
– Maintaining Framerate
– SPU assisted graphics
– Other tips
What I Will Be Covering

SCEE R&D
Developer Services
• Great Marlborough Street in Central London
• Covers ‘PAL’ region developers

• Support – SCE DevNet
• Field Engineering
• Technical and Strategic Consultancy
• TRC Consultancy
• Technical Training
What We Cover

PS3 DevKit
Latest additions to Dev Tools
PS3 Test Kit

Existing PSP DevKit
Latest additions to Dev Tools
PSPgo
Commander Arm

• PS3 DevKit: €1,700
• PS3 TestKit: €900
• PSP DevKit: €1,200
• PSP Commander: €300
Dev Tools Prices

• An illustration of some techniques used in
real games
– Making use of the PSP® and PS3™
Hardware
– How to combine the Cell Broadband Engine™
with RSX™
Case Studies

• Migration from a PS3™ title
• Platform specific game engine, 30Hz
• Render targets are oversized to 512x304, 32-bit
– Soften jaggies
Motorstorm® : Arctic Edge

• Use morph targets for better realism
• 2 LODs
• Real-time lit, with specular and modulated
by track lighting
• Each skin is divided up into sub-skins that
are affected by a maximum of 8 different
bones
• Each skin’s skeleton has 16 bones
Vehicles and Character

• Streamed with 3 blocks (15k polys) in memory
• Lighting baked into vertices
• Maximum of 8.5MB
• Specular snow rendering uses two passes:
– an untextured, specularly-lit-only layer
– a textured layer with alpha "holes" which modulate the addition of the lower layer
• Whole elements to be clipped
Landscape

• Custom physics code to mimic PS3™ Motorstorm.
• Vehicles are represented by convex hulls.
• The landscape is represented by compressed vertices of an arbitrary mesh in a grid-based scene.
• Ragdoll physics is a Verlet system.
Physics
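The slide only states that the ragdolls use a Verlet system. As a rough illustration of what that usually means, here is a minimal sketch of position Verlet integration with a distance constraint. The `Particle` struct and both function names are hypothetical and are not taken from the game's code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical particle layout for illustration; not taken from the game.
struct Particle {
    float x, y, z;     // current position
    float px, py, pz;  // position at the previous step
};

// Position Verlet: next = 2 * current - previous + acceleration * dt^2.
// Velocity is implicit in the difference between the two stored positions.
inline void verletStep(Particle& p, float ax, float ay, float az, float dt) {
    assert(dt > 0.0f);
    const float dt2 = dt * dt;
    const float nx = 2.0f * p.x - p.px + ax * dt2;
    const float ny = 2.0f * p.y - p.py + ay * dt2;
    const float nz = 2.0f * p.z - p.pz + az * dt2;
    p.px = p.x; p.py = p.y; p.pz = p.z;
    p.x = nx;   p.y = ny;   p.z = nz;
}

// Distance constraint between two particles (one "bone" of a ragdoll):
// move both ends by equal amounts so their separation returns to restLength.
inline void satisfyDistance(Particle& a, Particle& b, float restLength) {
    const float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    const float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (dist <= 0.0f) return;  // coincident particles: direction undefined
    const float k = 0.5f * (dist - restLength) / dist;
    a.x += dx * k; a.y += dy * k; a.z += dz * k;
    b.x -= dx * k; b.y -= dy * k; b.z -= dz * k;
}
```

A typical frame would call `verletStep` once per particle and then run `satisfyDistance` over all constraints a few times until the skeleton settles.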

• Pre-built display lists for most objects
• Compressed data for all geometry
• Z-fighting was heavily reduced by increasing the far
clipping plane massively beyond the actual draw distance
required
Graphics Optimisation

[Figure, Graphics Optimisation: two diagrams of the Z-buffer between the near clip plane and the far clip plane, each marking the most accurate section; the first also marks the actual draw distance.]

• Different widths between 1280 and 1920 for a 1080p 60Hz game
– Or 1024 to 1280 for a 720p game
• Feedback loop adjusts render width based on prior frame durations
– Resize is hidden in post processing full screen passes
– Horizontal hardware scaler helps for 1080p modes
Dynamic resolution
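The feedback loop could look something like the sketch below. It assumes render cost scales roughly linearly with width at a fixed height, damps the correction by half to avoid oscillation, and rounds the width down to a multiple of 16. The `DynamicWidth` type and all of those constants are assumptions made for illustration; the slide does not describe the actual controller.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical controller; the names, damping factor and alignment are illustrative.
struct DynamicWidth {
    int minWidth;    // e.g. 1280 for a 1080p game
    int maxWidth;    // e.g. 1920 for a 1080p game
    int width;       // width used for the frame that was just measured
    float targetMs;  // frame time budget

    // Feed in the render time of the previous frame; returns the width for the next one.
    int update(float lastFrameMs) {
        assert(lastFrameMs > 0.0f && minWidth <= maxWidth);
        // Assumption: cost is proportional to pixel count, so scale width by target/actual.
        const float ratio = targetMs / lastFrameMs;
        // Apply only half of the correction each frame to damp oscillation.
        const float damped = 1.0f + 0.5f * (ratio - 1.0f);
        int next = static_cast<int>(width * damped);
        next &= ~15;  // round down to a multiple of 16 pixels (assumed alignment)
        width = std::min(std::max(next, minWidth), maxWidth);
        return width;
    }
};
```

For example, with a 16 ms budget and a current width of 1920, a 32 ms frame drops the next width to 1440, and a later 8 ms frame raises it back to the 1920 cap.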

• Framerate can vary over time
– Particle systems can eat GPU time
– Views affect render time
• Cull objects
– Small screen footprint
– Used on Motorstorm 2
Maintaining frame rate
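One common way to cull by screen footprint is to project an object's bounding sphere and compare its radius in pixels against a threshold. The sketch below assumes a symmetric perspective projection; the function names and the pixel threshold are hypothetical, and the slide does not say how Motorstorm 2 measured footprint.

```cpp
#include <cassert>
#include <cmath>

// Approximate on-screen radius, in pixels, of a bounding sphere.
// fovY is the vertical field of view in radians; distance is the
// view-space depth of the sphere centre.
inline float projectedRadiusPixels(float radius, float distance,
                                   float fovY, float viewportHeight) {
    assert(distance > 0.0f && fovY > 0.0f);
    // At this depth the view frustum is 2 * d * tan(fovY / 2) world units tall.
    const float worldHeight = 2.0f * distance * std::tan(0.5f * fovY);
    return radius / worldHeight * viewportHeight;
}

// True when the object is too small on screen to be worth drawing.
inline bool shouldCullSmall(float radius, float distance, float fovY,
                            float viewportHeight, float minPixels) {
    if (distance <= radius) return false;  // camera inside or touching the sphere: keep it
    return projectedRadiusPixels(radius, distance, fovY, viewportHeight) < minPixels;
}
```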

• Dynamic geometry LOD
– Average per mesh polygon size computed offline
– Viewport polygon size used to pick lowest LOD
– For multiplayer, LODs automatically adjust
• (smaller viewports)
• Shader LOD
– Reduce shader instruction count in distance
• Lose parallax & normal – keep specular map
Reducing GPU load
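A possible reading of the geometry LOD bullets: store an average polygon size per LOD offline, project it into the current viewport, and take the coarsest LOD whose polygons still stay under a pixel threshold. The sketch below uses average edge length as the size measure; `MeshLod`, `pickLod` and `maxEdgePixels` are illustrative names, not the engine's.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-LOD record; the average is computed offline.
struct MeshLod {
    float avgPolyEdge;  // average polygon edge length in world units
};

// lods are ordered from most detailed (index 0) to least detailed.
// Returns the coarsest LOD whose average polygon edge stays at or below
// maxEdgePixels on screen, or 0 if none qualifies.
inline std::size_t pickLod(const std::vector<MeshLod>& lods, float distance,
                           float fovY, float viewportHeight, float maxEdgePixels) {
    assert(!lods.empty() && distance > 0.0f);
    // Pixels covered by one world unit at this depth.
    const float pixelsPerUnit =
        viewportHeight / (2.0f * distance * std::tan(0.5f * fovY));
    std::size_t chosen = 0;
    for (std::size_t i = 0; i < lods.size(); ++i) {
        if (lods[i].avgPolyEdge * pixelsPerUnit <= maxEdgePixels) chosen = i;
    }
    return chosen;
}
```

Because the viewport height is an input, a split-screen viewport half as tall selects a coarser LOD at the same distance, which matches the multiplayer bullet.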

• Many games are fragment shader bound
• Rendering Z only ‘primes’ the RSX™ Z-cull unit
– Very fast, 16 pixels/clock rather than 8
– Render the entire scene,
– Or ‘large’ meshes only
– Can easily save 10% of GPU time
Z pre-pass

• Processors are becoming increasingly parallel
• Frame time is bound by the slowest of these components
Game Loop

• If any part runs too slowly, offload tasks onto a helper CPU
Helper CPU

• Problem: Update an array of objects and submit the visible
ones for rendering
• Object.update() method was the bottleneck on the PPE
– Determined using SN Tuner
– This function generates the world space matrices for each
object
– Embarrassingly parallel
• No data dependencies between objects
• If we had one processor per object, we could update them all at once
Function off load example using a SPURS Task

SPURS Tasks
//---Conditionally calling SPU Code
if (usingSPU==false) //---------------------------generic version
{
    for(int i=0; i<OBJECT_COUNT; i++)
    {
        testObject[i].update();
    }
}
else //-------------------------------------------SPU version
{
    int test=spurs.ThrowTaskRunComplete(taskUpdateObject_elf_start,
                                        (int)testObject,
                                        OBJECT_COUNT,0,0);
    //could do something useful here…
    int result = spurs.CatchTaskRunComplete(test);
}

• Getting data in and out of the SPU is the first problem
• Get this working before worrying about actual processing
– Brute force often works just fine
• DMA entire object array into SPE memory
• Run update method
• DMA entire object array back to main memory
• List DMA allows you to get fancy
– Gather and Scatter operation
• If the data set is really big or if you need every last drop of performance
– Streaming model
• Overlap DMA transfers with processing
• Double buffering
SPURS Task

SPURS Task: Getting Data in and out
int cellSpuMain(.....)
{
    int sourceAddr = ....; //source address
    int count = ....; //number of data elements
    int dataSize = count * sizeof(gfxObject); //amount of memory to get
    gfxObject *buf = (gfxObject*)memalign(128,dataSize); //alloc mem
    DmaGet((void*)buf,sourceAddr,dataSize);
    DmaWait(....);
    //---data is loaded at this point so do something interesting
    DmaPut((void*)buf,sourceAddr,dataSize);
    DmaWait(....);
    return(0);
}
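The brute-force version waits for each transfer before doing anything else. The streaming model with double buffering might be structured as in the sketch below, where the next chunk is requested before the current one is processed. To keep it runnable anywhere, `dmaGet`, `dmaPut` and `dmaWait` are hypothetical stand-ins built on `memcpy`; on an SPU they would be asynchronous tagged transfers, and only then does the overlap actually happen. The chunk size and the "double every value" workload are also placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-ins for asynchronous, tagged DMA. Here they are synchronous
// memcpy calls, so the code shows the control flow only, not real overlap.
inline void dmaGet(void* local, const void* mainMem, std::size_t bytes, int /*tag*/) {
    std::memcpy(local, mainMem, bytes);
}
inline void dmaPut(const void* local, void* mainMem, std::size_t bytes, int /*tag*/) {
    std::memcpy(mainMem, local, bytes);
}
inline void dmaWait(int /*tag*/) {}

constexpr std::size_t kChunk = 64;  // elements per local buffer (illustrative)

// Streams count floats through two local buffers, doubling each value in place.
inline void processStreamed(float* mainMem, std::size_t count) {
    assert(mainMem != nullptr);
    float local[2][kChunk];
    const std::size_t chunks = (count + kChunk - 1) / kChunk;
    if (chunks == 0) return;
    auto chunkLen = [&](std::size_t c) {
        return (c + 1 < chunks) ? kChunk : count - c * kChunk;
    };

    dmaGet(local[0], mainMem, chunkLen(0) * sizeof(float), 0);  // prime buffer 0
    for (std::size_t c = 0; c < chunks; ++c) {
        const int cur = static_cast<int>(c & 1);
        const int nxt = cur ^ 1;
        if (c + 1 < chunks) {
            dmaWait(nxt);  // the earlier put from that buffer must have finished
            dmaGet(local[nxt], mainMem + (c + 1) * kChunk,
                   chunkLen(c + 1) * sizeof(float), nxt);  // prefetch next chunk
        }
        dmaWait(cur);  // current chunk has arrived
        for (std::size_t i = 0; i < chunkLen(c); ++i) local[cur][i] *= 2.0f;
        dmaPut(local[cur], mainMem + c * kChunk, chunkLen(c) * sizeof(float), cur);
    }
    dmaWait(0);  // drain outstanding puts on both tags
    dmaWait(1);
}
```

Using one tag per buffer means a single wait covers both the put that last used the buffer and the get that refills it.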

• Keep code the same as PPU Version
– Conditional compilation based on processor type
– Might have to split code into a separate file
SPURS Task: Executing the Code
//--- SPU code to call the update method
//--- data is loaded at this point so do something interesting
//--- step through array of objects and update
gfxObject *tp;
for(int i=0; i<count; i++)
{
    tp = &buf[i];
    tp->update(); //same method as on PPE compiled for SPU
}

• 5x speed-up in this instance
– Brute force solution
– Could add more SPUs
– Problem is embarrassingly parallel
SPURS Task: Results

• Simple pipeline
– No stalls such as load-hit-store (LHS)
• No Operating System, very few system calls
– Performance is high and deterministic
• Memory is very fast, like L1 cache
– Memory transfers are asynchronous, so their latency can be hidden
SPUs are faster than the PPU

• PPU stalled while waiting for the SPU to process
• SPU data get and put were synchronous
– Not using hardware to its fullest extent
– Easy to get more if we need it
SPURS Task: Time Waster
[Figure: timeline with a PPU row (two PPU blocks) and an SPU row (GET, Exec, PUT), plotted against TIME]

SPU Migration
• Look for large blocks of data with simple processing and
minimal synchronization requirements
– Vertex transform
– Animation
– Audio
– Image processing
• Entire subsystems that can be moved over

• Graphics API
• Multistream – Audio
• Bullet – Physics
• PlayStation®Edge
– Geometry/Compression/Animation/Post FX
• FIOS – Optimised disc access
Our Libraries that run on SPUs

• Animation
• AI Threat prediction
• AI Line of fire
• AI Obstacle avoidance
• Collision detection
• Physics
• Particle simulation
• Particle rendering
• Scene graph
• Display list building
• IBL Light probes
• Graphics post-processing
• Dynamic music system
• Skinning
• MP3 Decompression
• Edge Zlib
• etc.
44 job types
Killzone®2 SPU Usage

• Accelerate both PPU and RSX™
• Allows the overall frame time to be shortened
SPU can do more than CPU tasks

• Visibility and geometry culling
• Vertex and lighting processing
• Post processing
Offloading the GPU

• ‘Edge’ SPU library to offload vertex work from RSX™
– Trade SPU time for GPU performance
• Animation, skinning, culling and transformation
– Pre-culled, so all RSX™ work generates pixels
Geometry processing

• Engine supports up to 256 vertex lights
• Light accumulation
– RGBE vertex colour
– Runs on SPU
• Zero impact on RSX™
Wipeout® weapon lighting
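The slide does not say which RGBE layout the engine uses. Assuming the familiar shared-exponent format (three 8-bit mantissas plus one 8-bit exponent, as in Radiance HDR images), accumulated light could be packed into a vertex colour roughly as follows. `Rgbe`, `encodeRgbe` and `decodeRgbe` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Shared-exponent colour: value = mantissa * 2^(e - 136). Assumed layout.
struct Rgbe {
    std::uint8_t r, g, b, e;
};

// Packs a non-negative HDR colour. Assumes components are well below 2^127.
inline Rgbe encodeRgbe(float r, float g, float b) {
    assert(r >= 0.0f && g >= 0.0f && b >= 0.0f);
    const float maxc = std::max(r, std::max(g, b));
    if (maxc < 1e-32f) return Rgbe{0, 0, 0, 0};  // treat as black
    int exponent = 0;
    const float mantissa = std::frexp(maxc, &exponent);  // maxc = mantissa * 2^exponent
    assert(exponent + 128 <= 255);
    const float scale = mantissa * 256.0f / maxc;
    // Clamp guards against rounding pushing the largest channel up to 256.
    return Rgbe{static_cast<std::uint8_t>(std::min(255.0f, r * scale)),
                static_cast<std::uint8_t>(std::min(255.0f, g * scale)),
                static_cast<std::uint8_t>(std::min(255.0f, b * scale)),
                static_cast<std::uint8_t>(exponent + 128)};
}

// Unpacks back to floats; precision is limited to 8 bits relative to the largest channel.
inline void decodeRgbe(const Rgbe& c, float& r, float& g, float& b) {
    if (c.e == 0) { r = g = b = 0.0f; return; }
    const float f = std::ldexp(1.0f, static_cast<int>(c.e) - (128 + 8));
    r = c.r * f;
    g = c.g * f;
    b = c.b * f;
}
```

The appeal for light accumulation is range: the sum of many lights can exceed 1.0 by a large factor and still fit in four bytes per vertex.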

• Geometric occluders
– Tested using half-planes
– Fast checks on SPU
• Low-res software occlusion
– Coarse Z-buffer on SPU
– 256x114 float Z
– Low-poly conservative occluders
Occlusion Culling
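To illustrate the coarse Z-buffer idea, the sketch below keeps a 256x114 float depth buffer (the size from the slide) and stays conservative in both directions: occluders write their farthest depth, and objects are tested with their nearest depth. It works on screen-space rectangles for brevity, where a real implementation would rasterise the low-poly occluder triangles. The `CoarseZ` class and its methods are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Coarse software depth buffer. Depth runs from 0 (near) to 1 (far).
class CoarseZ {
public:
    static constexpr int kWidth = 256;
    static constexpr int kHeight = 114;

    CoarseZ() : depth_(static_cast<std::size_t>(kWidth * kHeight), 1.0f) {}

    // Conservative occluder: pass a rectangle the occluder fully covers
    // and the FARTHEST depth of the occluder.
    void addOccluderRect(int x0, int y0, int x1, int y1, float farthestZ) {
        assert(x0 <= x1 && y0 <= y1);
        clip(x0, y0, x1, y1);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x) {
                float& d = depth_[index(x, y)];
                d = std::min(d, farthestZ);
            }
    }

    // Conservative test: pass the object's screen bounds and its NEAREST depth.
    // Hidden only if every covered pixel already holds something closer.
    bool isOccluded(int x0, int y0, int x1, int y1, float nearestZ) const {
        assert(x0 <= x1 && y0 <= y1);
        clip(x0, y0, x1, y1);
        if (x0 > x1 || y0 > y1) return false;  // off-screen: leave to frustum culling
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                if (nearestZ <= depth_[index(x, y)]) return false;
        return true;
    }

private:
    static void clip(int& x0, int& y0, int& x1, int& y1) {
        x0 = std::max(x0, 0);
        y0 = std::max(y0, 0);
        x1 = std::min(x1, kWidth - 1);
        y1 = std::min(y1, kHeight - 1);
    }
    static std::size_t index(int x, int y) {
        return static_cast<std::size_t>(y * kWidth + x);
    }
    std::vector<float> depth_;
};
```

At four bytes per texel the buffer is about 114 KB, which is small enough to be a plausible fit for an SPU's 256 KB local store alongside code.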

• SPU/PPU
– Object culling – Offscreen or very small objects
– Occlusion Culling – Occluded objects
– Edge Geometry – Vertex culling
• RSX
– Early Z-cull – Coarse pixel culling
• Before fragment shader
• Remember to do a z pre-pass
– Depth test – Pixel culling
• After fragment shader
Culling

• SPU geometry code for landscape
– SPUs can generate or remove graphics primitives (procedural geometry, subdivision surfaces…)
– Allows sending only the triangles that contribute to the final scene
Continuous LOD

PhyreEngine™
• Free game engine including
– Modular run time
– Samples, docs and whitepapers
– Full source and art work
• Optimized for multi-core, especially PS3™
• Extractable and reusable components

Procedural Foliage
• Final SPU program generates RSX™ geometry with movement
• LOD is calculated, with a transition to billboards
Original models are from the Xfrog Plant Library and used with permission of Greenworks Organic Software.

Offloading the Scene Traversal

• Motion Blur, Bloom, Depth of field processed on SPU
– Quarter-res image
– Merged into final scene
Killzone® post processing

• SPUs assist RSX™
1. RSX™ prepares low-res image buffers in XDR
2. RSX™ triggers interrupt to start SPUs
3. SPUs perform image operations
4. RSX™ starts on the next frame in the meantime
5. Result of SPUs processed by RSX™ early in next frame
Post Processing

A Few Samples
[Images: SSAO; Depth of Field; Exponential shadow map; Motion Blur]

Post-Effect Implementation Differences
RSX™ | SPUs
Bilinear filtering is nearly free | Bilinear filtering is expensive
Nearest filtering is nearly free | SPU only supports truncate mode
Sequential texture accesses can benefit from texture cache | Sequential texture accesses need to be DMA transferred into LS
Random accesses are handled nicely at a cost of trashing the texture cache | Random accesses mean handling multiple small DMA transfers
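To make the filtering cost concrete: a software bilinear fetch needs four texel reads and three interpolations per sample, all of which come for free on the RSX™ texture units. The sketch below is a plain single-channel version with clamped addressing; the `Image` type and the texel-centre convention are assumptions, and an SPU version would be vectorised and would read from a tile already transferred into local store.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-channel image, row-major.
struct Image {
    int width;
    int height;
    std::vector<float> texels;

    // Texel fetch with clamp-to-edge addressing.
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return texels[static_cast<std::size_t>(y * width + x)];
    }
};

// u and v are in texel units, with texel centres at integer + 0.5.
inline float sampleBilinear(const Image& img, float u, float v) {
    assert(img.width > 0 && img.height > 0);
    const float fx = u - 0.5f;
    const float fy = v - 0.5f;
    const float flx = std::floor(fx);
    const float fly = std::floor(fy);
    const int x0 = static_cast<int>(flx);
    const int y0 = static_cast<int>(fly);
    const float tx = fx - flx;  // horizontal blend weight in [0, 1)
    const float ty = fy - fly;  // vertical blend weight in [0, 1)
    // Four fetches, two horizontal lerps, one vertical lerp.
    const float top = img.at(x0, y0) * (1.0f - tx) + img.at(x0 + 1, y0) * tx;
    const float bottom = img.at(x0, y0 + 1) * (1.0f - tx) + img.at(x0 + 1, y0 + 1) * tx;
    return top * (1.0f - ty) + bottom * ty;
}
```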

• HD resolution is still expensive
– run post-processing at lower resolution!
Performance
Effect | Single SPU at 720p | Five SPUs at 720p
Depth of Field | ~22.76 ms | ~4.74 ms
ROP | ~14.36 ms | ~2.99 ms
Motion Blur (Intrinsics) | ~27.5 ms | ~5.73 ms
Motion Blur (hand tuned) | ~17.6 ms | ~3.67 ms
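The timings above can be turned into a speed-up and a parallel efficiency figure with a couple of one-line helpers. These are illustrative functions, not part of any SDK. Applied to the Depth of Field row they give a speed-up of about 4.8x on five SPUs, or roughly 96% efficiency, and the other rows come out very close to the same ratio.

```cpp
#include <cassert>

// Speed-up of a parallel run over a single-worker run.
inline double speedup(double singleMs, double parallelMs) {
    assert(singleMs > 0.0 && parallelMs > 0.0);
    return singleMs / parallelMs;
}

// Fraction of ideal linear scaling achieved with the given number of workers.
inline double efficiency(double singleMs, double parallelMs, int workers) {
    assert(workers > 0);
    return speedup(singleMs, parallelMs) / workers;
}
```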

• Every PS3™ comes with an HDD
• Can access both HDD and Blu-ray simultaneously
– Helps when continuously streaming data: levels,
actors, sounds, music, textures…
– 2GB of System Cache
• Allows prefetching data in advance
Data Streaming

• Brute force method
– High quality images
– 128 FP16 frames
• Depth of field
• Motion Blur
– Use sub-pixel jitter to give 128x super-sampled AA
Photo-Mode
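A brute-force accumulation of this kind needs a set of sub-pixel offsets and a running sum. The sketch below uses a Halton (2, 3) sequence for the offsets, which is an assumption since the slide does not name the jitter pattern, and accumulates in 32-bit floats rather than the FP16 frames mentioned above. `halton`, `jitterForFrame` and `Accumulator` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Radical inverse of index in the given base (Halton sequence), in [0, 1).
inline float halton(unsigned index, unsigned base) {
    assert(base >= 2);
    float result = 0.0f;
    float fraction = 1.0f / base;
    while (index > 0) {
        result += (index % base) * fraction;
        index /= base;
        fraction /= base;
    }
    return result;
}

// Sub-pixel offset, each component in [-0.5, 0.5), to add to the projection.
struct Jitter {
    float dx, dy;
};

inline Jitter jitterForFrame(unsigned frame) {
    // Start at index 1 so the first sample is not at the pixel corner.
    return Jitter{halton(frame + 1, 2) - 0.5f, halton(frame + 1, 3) - 0.5f};
}

// Sums jittered frames and resolves to their average.
class Accumulator {
public:
    explicit Accumulator(std::size_t pixelCount) : sum_(pixelCount, 0.0f), frames_(0) {}

    void add(const std::vector<float>& frame) {
        assert(frame.size() == sum_.size());
        for (std::size_t i = 0; i < sum_.size(); ++i) sum_[i] += frame[i];
        ++frames_;
    }

    float resolve(std::size_t i) const {
        assert(frames_ > 0 && i < sum_.size());
        return sum_[i] / frames_;
    }

private:
    std::vector<float> sum_;
    unsigned frames_;
};
```

Rendering 128 frames with `jitterForFrame(0)` through `jitterForFrame(127)` and resolving the accumulator gives each pixel the average of 128 differently placed samples.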

• Think of the system and your game as a whole
– Rearrange your rendering if necessary
– Rendering is not just a GPU problem
• Unlock the potential of multi-core programming
– PS3™ = PPU + SPUs + RSX™ working together in parallel
– Minimize contention
Summary

https://www.tpr.scee.net/

http://www.worldwidestudios.net/xdev
games_development@scee.net

• Thanks to all the studios that let me talk about their games
ANY QUESTIONS?
www.scee.com
ps3.scedev.net
research.scee.net
tpr.scee.net
www.worldwidestudios.net/xdev
Finally
