SlideShare a Scribd company logo
1 of 153
Download to read offline
RSX™ Best Practices
Mark Cerny, Cerny Games
David Simpson, Naughty Dog
Jon Olick, Naughty Dog
RSX™ Best Practices
 About libgcm
 Using the SPUs with the RSX™
 Brief overview of GCM Replay
December 7th, 2004
Sony Computer Entertainment and NVidia
announce joint development of RSX™
Vertex & Index
Buffers
Textures
Fragment Programs
PLAYSTATION®2 Command List
Command
List
Vertex Table
Index Table
PLAYSTATION®2 Command
List Construction
PLAYSTATION®3 Command Buffer
Render Targets
Textures
Vertex & Fragment
Programs
Vertex & Index
Buffers
Command
Buffer
State in OpenGL
glStencilOp(GL_KEEP,GL_KEEP,GL_DECR_WRAP_EXT);
glActiveStencilFaceEXT(GL_BACK);
glActiveStencilFaceEXT(GL_FRONT);
glStencilOp(GL_KEEP,GL_KEEP,GL_INCR_WRAP_EXT);
glStencilOp(GL_KEEP,GL_KEEP,GL_DECR_WRAP_EXT);
More State in OpenGL
glClearDepth(depth);
glClearStencil(s);
glClear(GL_DEPTH | GL_STENCIL)
glClearDepth(depth);
glClearStencil(s);
glClear(GL_DEPTH | GL_STENCIL)
Goals with libgcm
Goals with libgcm
 Support multiple buckets
Goals with libgcm
 Support multiple buckets
 Remove state
Goals with libgcm
 Support multiple buckets
 Remove state
 glClearDepth and glClearStencil become a
single function
libgcm Context Structure
libgcm Context Structure
Start
End
libgcm Context Structure
Write Location
Start
End
libgcm Context Structure
Write Location
Start
End
Callback generated when out of space
Jump
Jump
Jump
Multiple Buffers
Jump
Jump
Single Large Buffer
Jump
Heap
Virtual Buckets
Write Location, Bucket 2
Write Location, Bucket 1
Heap
Virtual Buckets
Write Location, Bucket 1
Write Location, Bucket 2
Heap
Virtual Buckets
New allocation
for bucket 1,
connected by
jump
Write Location, Bucket 2
Write Location, Bucket 1
Heap
Virtual Buckets
New allocation
for bucket 2,
connected by
jump
Write Location, Bucket 2
Write Location, Bucket 1
Context Types
 Space Checking
 No Space Checking
Set Alpha Blend
 Space Checking 5.0x
 No Space Checking 1.1x
Setup Texture for Shader
 Space Checking 1.8x
 No Space Checking 1.8x
Space Checking
Set up Alpha Blend
1.1x
Set up Texture for Shader
1.05x
Patch Static Command Buffer
Disable Draw
Set Shader Constants
Set Shader Constants
Disable Draw
Set Shader Constants
Disable Draw
Concatenate Static Command
Buffers
Set Shader Constants
Set Shader Constants
Set Shader Constants
Set Shader Constants
Set Shader Constants
Using the RSX™ with the
SPUs
SPU 0
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
RSX™
Using the RSX™ with the
SPUs
 SPUs can be used to supercharge vertex
processing on the RSX™
 SPUs can perform triangle and mesh
operations that cannot be performed on
the RSX™
Geometry Processing Pipeline
 Runs on SPUs
 Modular
 Need only to use some pieces
 Outputs index and vertex data
which is directly read by the
RSX™
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
Geometry Processing Pipeline
 SPU processes one vertex set
at a time
 One or more vertex sets are
generated per mesh in an offline
tools processing step called
partitioning
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
The RSX™ can process vertices in
large chunks
But a 50,000 vertex object won’t fit in
an SPU!
SPU
The object needs to be partitioned into
smaller pieces, called vertex sets
SPU
Culling is much better with smaller
pieces too
SPU
On the PLAYSTATION®2 we used
vertex sets with about 64 vertices
 Many repeated
vertices
 Data increase of about
30%
An SPU can handle vertex sets with
between 500-1500 vertices
 Still some repeated
vertices
 Data increase of about
7-10%
 Vertex data is
ultimately smaller due
to increased
compression
Decompression
 SPUs free to use any type of
compressed data – not
restricted to 8, 16, 32 bit or the
like
 Vertex data is decompressed
into full floats, as they are
easiest for the SPU to use
 Triangle index data can also
be decompressed at this time
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
N-Bit Stream Decompression
 Each vertex attribute is an N-bit stream
 Each component of that attribute has its own
number of bits, integer offset, scale, and bias
 Each component is decompressed with
the following equation:
 out = float(in + intOffset) * scale + bias
 Scale and bias need to be constant across an
entire object to prevent cracks
 The number of bits and integer offset need not be
Integer Offset Example
 The total range of this object is
21 units
 Requires 5 bits
 The range of the first list is
only 15 units
 Requires only 4 bits
 The range of the second list is
8 units
 When intOffset is set to 13,
entries in the second list require
only 3 bits
Object
List 1
1
5
12
0
8
3
14
9
List 2
17
14
20
16
13
19
18
Blend Shapes
 Not really possible to do on a
GPU
 SPU can blend any number of
shapes and any number of
vertex attributes
 Large data savings
 Store only deltas
 Use highly compressed data
formats, like N-bit compression
 Only store data for changing
vertices
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
Blend Shape Use in
MLB 07: The Show
Skinning
 Huge offload of vertex
processing from the RSX™
 No need to set large number
of vertex program constants
with matrix data
 SPU can handle vertices with
an arbitrary number of
influences
 Number of influences can vary
across the mesh, resulting in
data and computational savings
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
LOD Systems
 Reduces processing time as
an object moves into the
distance
 Many LOD systems are not a
good match for a GPU
 Often operate upon an entire
mesh at a time
 Good match for the SPUs
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
Discrete Progressive Mesh
 Smoothly reduces the triangle
count as a model moves into
the distance
 With discrete progressive
mesh, the LOD calculation is
done once for an entire object
 An entire object is processed
at once by the tools to avoid
cracks between vertex sets
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
At an LOD there are two types of
vertices
LOD = 0.0
Parent Vertex
Child Vertex
As the LOD level decreases, the
children “slide” towards their parents
LOD = 0.2
Parent Vertex
Child Vertex
The children continue to move
towards their parents
LOD = 0.7
Parent Vertex
Child Vertex
At the next integral LOD, all child
vertices disappear as do the triangles
LOD = 1.0
Parent Vertex
Child Vertex
Vertices are arranged from lowest
LOD to highest LOD
 At LOD 0 all vertices are
needed.
 At LOD 1, child vertices from
LOD 0 are no longer needed
 At LOD 2, child vertices from
LOD 1 are also removed
 This saves bandwidth to the
SPUsLOD 0
LOD 1
LOD 2
Vertex Table
Every LOD has its own index table of
triangles and parent table
 The parent table contains an index to the parent
for every child vertex
Indexes for
LOD 0
Indexes for
LOD 1
Indexes for
LOD 2
Parent Table for
LOD 2
Parent Table for
LOD 1
Parent Table for
LOD 0
LODs are arranged in LOD groups to
avoid small vertex sets
LOD 0
LOD 1
LOD 2
LOD 0
LOD 1
LOD 2
LOD Group 0
LOD 0
LOD 1
LOD 2
LOD 3
LOD 4
LOD 5
LOD Group 1
Continuous Progressive Mesh
 Like discrete progressive
mesh, child vertices move
smoothly toward their parents
 However, the LOD is
calculated for each vertex
instead of just once for the
object
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
Vertex set about to undergo
continuous progressive mesh
Parent Vertex
Child Vertex, LOD 1
Child Vertex, LOD 0
A single vertex set can straddle
several LOD ranges
LOD = 1.0
Parent Vertex
Child Vertex, LOD 1
Child Vertex, LOD 0
LOD = 0.0
Vertices move depending on their
distance
Parent Vertex
Child Vertex, LOD 1
Child Vertex, LOD 0
LOD = 1.0
LOD = 0.0
Stencil Shadows
Edge
Table
T14 T15 V1
T6 T23 V0
T10 T18 V2
V0T10T2
V0T8T0
V1T6T5
V2T26T5
V0T13T4
V2T17T15
V1T25T11
T14 T15 V1
Also need adjoining triangles and
vertices from neighboring vertex sets
Edge
Table
T14 T15 V1
T6 T23 V0
T10 T18 V2
V0T10T2
V0T8T0
V1T6T5
V2T26T5
V0T13T4
V2T17T15
V1T25T11
Find the profile edges and generate a
new vertex table of extruded edges
Extruded Edges
T14 T15 V1
T6 T23 V0
T10 T18 V2
V0T10T2
V0T8T0
V1T6T5
V2T26T5
V0T13T4
V2T17T15
V1T25T11
Edge
Table
Output the new vertex data and draw
commands to a shadow context
Command
Context for
Light 0
Shadow
Draw 13
Shadow
Draw 19
Shadow
Draw 18
SPU
May as well do multiple lights at the
same time
Command
Context for
Light 0
Shadow
Draw 13
Shadow
Draw 19
Shadow
Draw 18
SPU
Shadow
Draw 11
Shadow
Draw 19
Shadow
Draw 12
Shadow
Draw 13
Shadow
Draw 19
Shadow
Draw 18
Shadow
Draw 15
Shadow
Draw 19
Shadow
Draw 16
Command
Context for
Light 1
Command
Context for
Light 2
Command
Context for
Light 3
Normal and Tangent
Calculation
 Typically having normals and tangents
included in the vertex data is a good
thing
 However, some operations can move the
positions so much that the included
normals and tangents are no longer
correct
 Solution: Recalculate the normals and
tangents on the SPU!
Blend shapes can move the positions
quite a lot!
Recalculate the normals!
 Like stencil shadows,
calculating normals and
tangents requires information
about adjoining triangles and
vertices from neighboring
vertex sets
 Only worth the cost in limited
situations
Triangle Culling
 Many triangles in a scene will
ultimately have no renderable
area
 Culling these triangles on the
SPU removes the burden of
the RSX™ processing
triangles which do not
contribute to the final image
 This leaves the RSX™ with more
time to process relevant triangles
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
What types of triangles can be culled?
Off screen triangles can be culled
Back facing triangles can be culled,
but be sure to use error bars
Degenerate triangles can be culled
Some triangles are very small
Some triangles are so small that they
do not cover a pixel center
These triangles can be culled
Multisampling adds some
complications…
But these triangles can still be culled
The SPU starts with the input triangle
index table
Tri 0
Tri 1
Tri 2
Tri 3
Tri 4
Tri 5
Tri 6
Tri 7
Tri 8
Tri 9
Tri 10
Tri 11
Tri 12
Tri 13
Tri 14
Original
Index Table
The culling algorithm determines
which triangles are to be kept
Tri 0
Tri 1
Tri 2
Tri 3
Tri 4
Tri 5
Tri 6
Tri 7
Tri 8
Tri 9
Tri 10
Tri 11
Tri 12
Tri 13
Tri 14
Original
Index Table
And a new index table is created from
these triangles
Tri 0
Tri 1
Tri 2
Tri 3
Tri 4
Tri 5
Tri 6
Tri 7
Tri 8
Tri 9
Tri 10
Tri 11
Tri 12
Tri 13
Tri 14
Original
Index Table
Tri 1
Tri 4
Tri 6
Tri 7
Tri 11
Tri 14
Culled
Index Table
Culling can
remove 60-70%
of all triangles!
Vertex Culling
 After triangles are culled,
some vertices are no longer
used in any triangle
 These vertices can be
removed from the vertex table
 This is done by first building a
vertex renaming table which
contains the new vertex index for
each vertex
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
Start with an empty vertex renaming
table
4
5
6
7
8
9
10
11
12
13
14
15
3
2
1
0
Index Table Vertex Table
Renaming
Table
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
0 2 5
1 4 10
1 8 13
2 5 7
5 7 8
7 8 9
8 10 13
10 13 14
Add a new index for each used vertex
in the index table
4
5
6
7
8
9
10
11
12
13
14
15
3
2
1
00 2 5
1 4 10
1 8 13
2 5 7
5 7 8
7 8 9
8 10 13
10 13 14
Index Table Vertex Table
Renaming
Table
0
3
1
-1
4
2
-1
8
6
9
5
-1
-1
7
10
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
0
3
1
4
2
8
6
9
5
7
10
Using the renaming table, build a new
vertex table with only used vertices
4
5
6
7
8
9
10
11
12
13
14
15
3
2
1
0
Vertex Table
Renaming
Table
0
3
1
-1
4
2
-1
8
6
9
5
-1
-1
7
10
-1
4
10
8
13
7
9
14
1
5
2
0
New Vertex TableIndex Table
0 2 5
1 4 10
1 8 13
2 5 7
5 7 8
7 8 9
8 10 13
10 13 14
Finally, replace the old indices in the
index table with the new indices
New Vertex TableIndex Table
Renaming
Table
0
3
1
-1
4
2
-1
8
6
9
5
-1
-1
7
10
-1
4
10
8
13
7
9
14
1
5
2
0
4
5
6
7
8
9
10
11
12
13
14
15
3
2
1
0
Vertex Table
0 2 5
1 4 10
1 8 13
2 5 7
5 7 8
7 8 9
8 10 13
10 13 14
0 1 2
3 4 5
3 6 7
1 2 8
5 7 8
8 6 9
6 5 7
5 7 10
Only minor performance gains on the
RSX™, if any
 Removes about 30% of the vertex data
 Better use of the pre-transform cache,
but not much else
Vertex Stream Combining
 A vertex stream is an
interleaved set of vertex
attributes, which is used
natively by the RSX™
 Fewer vertex streams results
in better performance
 Easy to combine streams
while compressing vertex
attributes into RSX™ formats
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
Vertex attributes can be input into the
SPUs in multiple streams
Input Vertex
Stream 0
Input Vertex
Stream 1
The vertex streams are
decompressed into tables of floats
Input Vertex
Stream 0
Input Vertex
Stream 1
Float Tables
When done, the vertex attributes are
compressed into one output stream
Input Vertex
Stream 0
Output
Vertex Stream
Float Tables
Input Vertex
Stream 1
Output Buffering Schemes
 Vertex and index data
constructed by the SPUs is
output from SPU local store
 Holes in the command buffer
are patched with pointers to
the vertex and index data as
well as the draw commands
Decompression
Blend Shapes
Skinning
Progressive Mesh
Culling
Compression
Output
Double Buffer
 Each buffer stores vertex and
index data for an entire frame
 SPUs atomically access a
mutex which is used to
allocate memory from a buffer
 Easy synchronization with the
RSX™ once a frame
 Uses lots of memory
Vertex
and
Index
Data
for
Frame 0
Vertex
and
Index
Data
for
Frame 1
It is possible to completely fill a buffer
 Can use a callback to allocate
new memory (which you may
not have)
 Don’t draw geometry that
doesn’t fit (difficult to pick
which geometry not to draw)
SPUData
Vertex
and
Index
Data
Double buffer requires an extra frame
of lag in the rendering pipeline
Build Jobs
on PPU
Process Jobs
on SPU
Render on
RSX™
Scan Out
Build Jobs
on PPU
Process Jobs
on SPU
Render on
RSX™
Scan Out
Build Jobs
on PPU
Process Jobs
on SPU
Render on
RSX™
Scan Out
Single Buffer
 Uses only half the memory!
 Still possible to completely fill
the buffer
Vertex
and
Index
Data
for
Single
Frame
Single buffer uses a shorter rendering
pipeline
 Vertex and index data is created just-in-time for
the RSX™
 Draw commands are inserted into the command
buffer while the RSX™ is rendering
 Requires tight SPU↔RSX™ synchronization
Build Jobs on
PPU
SPU Processing/
RSX™ Rendering
Scan Out
Build Jobs on
PPU
SPU Processing/
RSX™ Rendering
Scan Out
Build Jobs on
PPU
SPU Processing/
RSX™ Rendering
Scan Out
Command Buffer Holes
 SPU processing requires
some setup by the PPU
 Some job data is required for
each vertex set
 Static portions of the command
buffer are built on the PPU
 Static vertex attribute pointers
 Index table pointer and draw
commands when not performing
triangle culling on the SPU
 “Holes” are left for the dynamic
portion built by the SPU
Static 18
Other
State
State
Hole 18
Hole 19
Hole 20
Hole 21
Hole 22
Static 19
Static 20
Static 21
Static 22
Command
Buffer
Filling the Holes
 Dynamic portions are built on
the SPU
 Vertex attribute pointers for any
attributes output by the SPU
 Draw commands when
performing triangle culling
 Commands necessary for ring
buffer synchronization
 For brevity, we will not show
the PPU generated commands
going forward
Static 18
Other
State
State
Static 19
Static 20
Static 21
Static 22
Command
Buffer
Hole 18
Hole 19
Hole 20
Hole 21
Hole 22
Draw 18
Draw 19
Draw 20
Draw 21
Draw 22
RSX™ Put Pointer
 RSX™ has an internal “Put
Pointer”
 Commands in the command
buffer are executed up to the
Put Pointer
 Commands after the Put
Pointer are not executed
Command
Buffer
Put
Pointer
Put
Pointer
SPU↔RSX™ Synchronization
Using the Put Pointer
Command
Buffer
Output
Buffer
SPU
Draw 17 Data 17
Put
Pointer
SPU outputs vertex and index data
Command
Buffer
Output
Buffer
SPU
Draw 17 Data 17
Data 18
Put
Pointer
SPU outputs vertex attribute pointers
and draw commands
Command
Buffer
Output
Buffer
SPU
Draw 17 Data 17
Data 18Draw 18
Put
Pointer
SPU updates the Put Pointer
Command
Buffer
Output
Buffer
SPU
Draw 17 Data 17
Data 18Draw 18
Put
Pointer
But there are six SPUs, so
who updates the Put Pointer?
Command
Buffer
Output
Buffer
SPU 0Draw 17 Data 17
Data 21Draw 18
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
Draw 19
Draw 20
Draw 21
Draw 22
Draw 23
Data 19
Data 18
Data 20
Data 23
Data 22
Other
Put
Pointer
SPUs are asynchronous, so they can
finish in any order!
Command
Buffer
Output
Buffer
SPU 0Draw 17 Data 17
Data 21
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
Draw 21
Other
Put
Pointer
So, the SPUs must synchronize with
each other!
Command
Buffer
Output
Buffer
SPU 0Draw 17 Data 17
Data 19
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
Draw 21
Draw 19
Data 21
Other
The Put Pointer is updated only when
ALL previous jobs are done…
Command
Buffer
Output
Buffer
SPU 0Draw 17 Data 17
Data 18
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
Draw 21
Draw 18 Data 21
Draw 19 Data 19
Put
Pointer
Other
But can only be moved to the end of
this job’s draw commands
Command
Buffer
Output
Buffer
SPU 0Draw 17 Data 17
Data 20
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
Draw 21
Draw 20
Data 21
Draw 19
Data 18
Put
Pointer
Draw 18
Data 19
Other
Remember to guarantee progress!
Command
Buffer
Output
Buffer
SPU 0Draw 17 Data 17
Data 23
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
Draw 21
Draw 23
Data 21
Draw 19
Data 18
Put
Pointer
Draw 18
Data 19
Draw 20
Data 20
Other
The last job finished moves the Put
Pointer to the end of the buffer
Command
Buffer
Output
Buffer
SPU 0Draw 17 Data 17
Data 22
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
Draw 21
Draw 22
Data 21
Draw 19
Data 18
Put
Pointer
Draw 18
Data 19
Draw 20
Data 20
Other
Draw 23
Data 23
SPU↔RSX™ Synchronization
Using Local Stalls
Command
Buffer
Draw 17
Put
Pointer
Local Stall
Other
Local Stall
Local Stall
Local Stall
Local Stall
Local Stall
 Easier and faster than Put
Pointer synchronization
 Place local stalls in the
command buffer where
necessary
 RSX™ will stop processing at
a local stall until it is
overwritten by new commands
 SPUs will generally stay
ahead of the RSX™, so stalls
rarely occur
Local Stall
SPU will overwrite local stalls when it
outputs a set of new commands
 No SPU↔SPU
synchronization required!
 Please see the document
regarding this technique on
the PS3 Developer’s Support
website for crucial details
SPU
Command
Buffer
Draw 17
Put
Pointer
Local Stall
Other
Local Stall
Local Stall
Local Stall
Local Stall
Local Stall
Local Stall
New
Commands
Ring Buffers
 Small memory footprint
 Will not run out of memory
 Can stall the SPUs if buffers
become full
 Objects need to be processed
in the same order the RSX™
renders them to prevent
deadlock
Vertex
and
Index
Data
End of
Free Area
Start of
Free Area
Each SPU has its own buffer to
prevent deadlock
Data 23
Data 17
Data 10
Data 22
Data 11
Data 16
Data 19
Data 6
Data 14 Data 20
Data 7
Data 15
Data 21
Data 12
Data 9
Data 18
Data 8
Data 13
SPU 0
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
Buffer 0 Buffer 1
Buffer 3
Buffer 2
Buffer 4 Buffer 5
RSX™ writes a semaphore once a
chunk of data has been consumed
 A command to write a semaphore needs to be
added to the command buffer after all
commands that use the data
 The value of the semaphore to be written is the new
end of free area pointer
Data 19
Data 6
Data 14
Start of
Free Area
Draw 6
Draw 5
Draw 7
Semaphore
Semaphore
Semaphore
Current
RSX™
Execution
New End
of Free
Area
SPU 0
SPU 1
SPU 2
SPU 3
SPU 4
SPU 5
RSX™
Future Work
 Cg compiler for SPUs
 Complicated vertex programs could be run
on the SPUs instead of the RSX™
 Can’t have too many outputs otherwise the
RSX™ will take longer loading them than it
would have to run the program
Future Work
 Shadow map generation on SPUs
 Large load removed from RSX™
 Very doable
 Much more complicated if you have alpha cutouts
in your textures
GCM Replay
GCM Replay
 New tool for use with the RSX™
 Analysis
 Debugging
 Profiling
 Will be released soon to all licensed
developers as part of PLAYSTATION®Edge
 Main tool runs on the PC
 Integration into your title is simple and easy
Capture a Snapshot
Render Targets
Textures
Vertex & Fragment
Programs
Vertex & Index Buffers
Snapshot Contents
Command
Buffer
Performance Analysis
Find Bottlenecks
“What Ifs”
Q: What If I Had Efficient
Triangle Culling?
What would the performance gain be?
 GCM Replay can remove all draw calls to
triangles which never write a pixel
 Once this is done, GCM Replay can reprofile the
snapshot and compute the speed increase
A: Cull Triangles Using an
SPU!
 Triangle culling techniques shown earlier
can dramatically increase performance
Q: What If My Setting of Fragment
Program Constants Was Done
Externally to the Command Buffer?
 One copy of each fragment
program is kept in memory
 Individual fragment program
constants are patched by placing
draw commands in the command
buffer in the appropriate locations
Fragment
Program
Patched
Constants
Conventional Patch Technique
A: Patch Using the PPU or
SPU!
 Multiple copies of fragment
programs can be patched with
the appropriate constants
either on the PPU or an SPU
 Removes 100% of the RSX™
load for patching fragment
programs
 If done as part of SPU processing
of a vertex set, synchronization
will be already be taken care of
Patched
Copy 3
Patched
Copy 2
Patched
Copy 4
Patched
Copy 1
Q: What If I Had More Optimal
Indexed Triangle Lists?
A: Optimize for the Four Vertex
Mini-cache!
 GCM Replay contains an
optimizer for indexed triangle list
ordering
 Corresponding offline indexed
triangle list optimizer available as
part of PLAYSTATION®Edge
Vertex 1
Vertex 0
Vertex 2
Vertex 3
Four Vertex
Mini-cache
Strip Example
3 new vertices
3
Strip Example
3
1
1 new vertex
Strip Example
3
1
1
1 new vertex
Strip Example
3
1
1
1
1 new vertex…
2 vertices + 1 per triangle in total
Free Form Example
3 new vertices
3
Free Form Example
1 new vertex
3
1
Free Form Example
1 new vertex
3
1
1
Free Form Example
1 new vertex
3
1
1
1
Free Form Example
1 new vertex
3
1
1
1
1
Free Form Example
0 new vertices!
3
1
1
1
1
0
Free Form Example
1 new vertex
3
1
1
1
1
0
1
Free Form Example
1 new vertex
3
1
1
1
1
0
1
1
Free Form Example
1 new vertex
3
1
1
1
1
0
1
1
1
Free Form Example
0 new vertices…
3
1
1
1
1
0
1
1
1
0
2 vertices + 3 per 4 triangles in total
Q: What If I Had Perfect
Object Z-Culling?
 Some objects will not contribute to the
final scene because they are entirely
blocked by other objects
 GCM Replay will soon be able to show
the performance difference if good object
Z-culling was performed
A: Object Z-Culling on SPUs
 Write an SPU rasterizer
 Render the depth values of a low polygon
version of the environment
 Rasterize and check bounding volumes
of objects
Warhawk
RSX™ Best Practices

More Related Content

Similar to RSX™ Best Practices

Logic Design - Chapter 5: Part1 Combinattional Logic
Logic Design - Chapter 5: Part1 Combinattional LogicLogic Design - Chapter 5: Part1 Combinattional Logic
Logic Design - Chapter 5: Part1 Combinattional Logic
Gouda Mando
 
Project_PPT_Presentation.ppt
Project_PPT_Presentation.pptProject_PPT_Presentation.ppt
Project_PPT_Presentation.ppt
BIPLABNAYAK10
 

Similar to RSX™ Best Practices (20)

GRAPHICAL STRUCTURES in our lives
GRAPHICAL STRUCTURES in our livesGRAPHICAL STRUCTURES in our lives
GRAPHICAL STRUCTURES in our lives
 
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
SPU-Based Deferred Shading in BATTLEFIELD 3 for Playstation 3
 
Summing up 33_s
Summing up 33_sSumming up 33_s
Summing up 33_s
 
Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
 
MariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAsMariaDB 10.11 key features overview for DBAs
MariaDB 10.11 key features overview for DBAs
 
Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory option
Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory optionStar Transformation, 12c Adaptive Bitmap Pruning and In-Memory option
Star Transformation, 12c Adaptive Bitmap Pruning and In-Memory option
 
Foss4 g topology_july_16_2015
Foss4 g topology_july_16_2015Foss4 g topology_july_16_2015
Foss4 g topology_july_16_2015
 
Logic Design - Chapter 5: Part1 Combinattional Logic
Logic Design - Chapter 5: Part1 Combinattional LogicLogic Design - Chapter 5: Part1 Combinattional Logic
Logic Design - Chapter 5: Part1 Combinattional Logic
 
Beginning direct3d gameprogramming09_shaderprogramming_20160505_jintaeks
Beginning direct3d gameprogramming09_shaderprogramming_20160505_jintaeksBeginning direct3d gameprogramming09_shaderprogramming_20160505_jintaeks
Beginning direct3d gameprogramming09_shaderprogramming_20160505_jintaeks
 
Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!
 
Opengl basics
Opengl basicsOpengl basics
Opengl basics
 
Intro to OpenGL ES 2.0
Intro to OpenGL ES 2.0Intro to OpenGL ES 2.0
Intro to OpenGL ES 2.0
 
Advanced Non-Relational Schemas For Big Data
Advanced Non-Relational Schemas For Big DataAdvanced Non-Relational Schemas For Big Data
Advanced Non-Relational Schemas For Big Data
 
Hadoop in Data Warehousing
Hadoop in Data WarehousingHadoop in Data Warehousing
Hadoop in Data Warehousing
 
lectureAll-OpenGL-complete-Guide-Tutorial.pdf
lectureAll-OpenGL-complete-Guide-Tutorial.pdflectureAll-OpenGL-complete-Guide-Tutorial.pdf
lectureAll-OpenGL-complete-Guide-Tutorial.pdf
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
 
Project_PPT_Presentation.ppt
Project_PPT_Presentation.pptProject_PPT_Presentation.ppt
Project_PPT_Presentation.ppt
 
L02.pdf
L02.pdfL02.pdf
L02.pdf
 
FractalTreeIndex
FractalTreeIndexFractalTreeIndex
FractalTreeIndex
 
Using spider for sharding in production
Using spider for sharding in productionUsing spider for sharding in production
Using spider for sharding in production
 

More from Slide_N

Filtering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-AliasingFiltering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-Aliasing
Slide_N
 
Cell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupCell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology Group
Slide_N
 

More from Slide_N (20)

SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
SpursEngine A High-performance Stream Processor Derived from Cell/B.E. for Me...
 
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdfParallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
Parallel Vector Tile-Optimized Library (PVTOL) Architecture-v3.pdf
 
Experiences with PlayStation VR - Sony Interactive Entertainment
Experiences with PlayStation VR  - Sony Interactive EntertainmentExperiences with PlayStation VR  - Sony Interactive Entertainment
Experiences with PlayStation VR - Sony Interactive Entertainment
 
SPU-based Deferred Shading for Battlefield 3 on Playstation 3
SPU-based Deferred Shading for Battlefield 3 on Playstation 3SPU-based Deferred Shading for Battlefield 3 on Playstation 3
SPU-based Deferred Shading for Battlefield 3 on Playstation 3
 
Filtering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-AliasingFiltering Approaches for Real-Time Anti-Aliasing
Filtering Approaches for Real-Time Anti-Aliasing
 
Chip Multiprocessing and the Cell Broadband Engine.pdf
Chip Multiprocessing and the Cell Broadband Engine.pdfChip Multiprocessing and the Cell Broadband Engine.pdf
Chip Multiprocessing and the Cell Broadband Engine.pdf
 
Cell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology GroupCell Today and Tomorrow - IBM Systems and Technology Group
Cell Today and Tomorrow - IBM Systems and Technology Group
 
New Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - KutaragiNew Millennium for Computer Entertainment - Kutaragi
New Millennium for Computer Entertainment - Kutaragi
 
Sony Transformation 60 - Kutaragi
Sony Transformation 60 - KutaragiSony Transformation 60 - Kutaragi
Sony Transformation 60 - Kutaragi
 
Sony Transformation 60
Sony Transformation 60 Sony Transformation 60
Sony Transformation 60
 
Moving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living RoomMoving Innovative Game Technology from the Lab to the Living Room
Moving Innovative Game Technology from the Lab to the Living Room
 
The Technology behind PlayStation 2
The Technology behind PlayStation 2The Technology behind PlayStation 2
The Technology behind PlayStation 2
 
Cell Technology for Graphics and Visualization
Cell Technology for Graphics and VisualizationCell Technology for Graphics and Visualization
Cell Technology for Graphics and Visualization
 
Industry Trends in Microprocessor Design
Industry Trends in Microprocessor DesignIndustry Trends in Microprocessor Design
Industry Trends in Microprocessor Design
 
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with OcelotTranslating GPU Binaries to Tiered SIMD Architectures with Ocelot
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
 
Cellular Neural Networks: Theory
Cellular Neural Networks: TheoryCellular Neural Networks: Theory
Cellular Neural Networks: Theory
 
Network Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTMNetwork Processing on an SPE Core in Cell Broadband EngineTM
Network Processing on an SPE Core in Cell Broadband EngineTM
 
Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3Deferred Pixel Shading on the PLAYSTATION®3
Deferred Pixel Shading on the PLAYSTATION®3
 
Developing Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of DestructionDeveloping Technology for Ratchet and Clank Future: Tools of Destruction
Developing Technology for Ratchet and Clank Future: Tools of Destruction
 
NVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM PowerNVIDIA Tesla Accelerated Computing Platform for IBM Power
NVIDIA Tesla Accelerated Computing Platform for IBM Power
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 

RSX™ Best Practices

  • 1.
  • 2. RSX™ Best Practices Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog
  • 3. RSX™ Best Practices  About libgcm  Using the SPUs with the RSX™  Brief overview of GCM Replay
  • 4. December 7th, 2004 Sony Computer Entertainment and NVidia announce joint development of RSX™
  • 8. PLAYSTATION®3 Command Buffer Render Targets Textures Vertex & Fragment Programs Vertex & Index Buffers Command Buffer
  • 10. More State in OpenGL glClearDepth(depth); glClearStencil(s); glClear(GL_DEPTH | GL_STENCIL) glClearDepth(depth); glClearStencil(s); glClear(GL_DEPTH | GL_STENCIL)
  • 12. Goals with libgcm  Support multiple buckets
  • 13. Goals with libgcm  Support multiple buckets  Remove state
  • 14. Goals with libgcm  Support multiple buckets  Remove state  glClearDepth and glClearStencil become a single function
  • 17. libgcm Context Structure Write Location Start End
  • 18. libgcm Context Structure Write Location Start End Callback generated when out of space
  • 21. Heap Virtual Buckets Write Location, Bucket 2 Write Location, Bucket 1
  • 22. Heap Virtual Buckets Write Location, Bucket 1 Write Location, Bucket 2
  • 23. Heap Virtual Buckets New allocation for bucket 1, connected by jump Write Location, Bucket 2 Write Location, Bucket 1
  • 24. Heap Virtual Buckets New allocation for bucket 2, connected by jump Write Location, Bucket 2 Write Location, Bucket 1
  • 25. Context Types  Space Checking  No Space Checking
  • 26. Set Alpha Blend  Space Checking 5.0x  No Space Checking 1.1x
  • 27. Setup Texture for Shader  Space Checking 1.8x  No Space Checking 1.8x
  • 28. Space Checking Set up Alpha Blend 1.1x Set up Texture for Shader 1.05x
  • 29. Patch Static Command Buffer Disable Draw Set Shader Constants Set Shader Constants Disable Draw Set Shader Constants Disable Draw
  • 30. Concatenate Static Command Buffers Set Shader Constants Set Shader Constants Set Shader Constants Set Shader Constants Set Shader Constants
  • 31. Using the RSX™ with the SPUs SPU 0 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 RSX™
  • 32. Using the RSX™ with the SPUs  SPUs can be used to supercharge vertex processing on the RSX™  SPUs can perform triangle and mesh operations that cannot be performed on the RSX™
  • 33. Geometry Processing Pipeline  Runs on SPUs  Modular  Need only to use some pieces  Outputs index and vertex data which is directly read by the RSX™ Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 34. Geometry Processing Pipeline  SPU processes one vertex set at a time  One or more vertex sets are generated per mesh in an offline tools processing step called partitioning Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 35. The RSX™ can process vertices in large chunks
  • 36. But a 50,000 vertex object won’t fit in an SPU! SPU
  • 37. The object needs to be partitioned into smaller pieces, called vertex sets SPU
  • 38. Culling is much better with smaller pieces too SPU
  • 39. On the PLAYSTATION®2 we used vertex sets with about 64 vertices  Many repeated vertices  Data increase of about 30%
  • 40. An SPU can handle vertex sets with between 500-1500 vertices  Still some repeated vertices  Data increase of about 7-10%  Vertex data is ultimately smaller due to increased compression
  • 41. Decompression  SPUs free to use any type of compressed data – not restricted to 8, 16, 32 bit or the like  Vertex data is decompressed into full floats, as they are easiest for the SPU to use  Triangle index data can also be decompressed at this time Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 42. N-Bit Stream Decompression  Each vertex attribute is an N-bit stream  Each component of that attribute has its own number of bits, integer offset, scale, and bias  Each component is decompressed with the following equation:  out = float(in + intOffset) * scale + bias  Scale and bias need to be constant across an entire object to prevent cracks  The number of bits and integer offset need not be
  • 43. Integer Offset Example  The total range of this object is 21 units  Requires 5 bits  The range of the first list is only 15 units  Requires only 4 bits  The range of the second list is 8 units  When intOffset is set to 13, entries in the second list require only 3 bits Object List 1 1 5 12 0 8 3 14 9 List 2 17 14 20 16 13 19 18
  • 44. Blend Shapes  Not really possible to do on a GPU  SPU can blend any number of shapes and any number of vertex attributes  Large data savings  Store only deltas  Use highly compressed data formats, like N-bit compression  Only store data for changing vertices Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 45. Blend Shape Use in MLB 07: The Show
  • 46. Skinning  Huge offload of vertex processing from the RSX™  No need to set large number of vertex program constants with matrix data  SPU can handle vertices with an arbitrary number of influences  Number of influences can vary across the mesh, resulting in data and computational savings Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 47. LOD Systems  Reduces processing time as an object moves into the distance  Many LOD systems are not a good match for a GPU  Often operate upon an entire mesh at a time  Good match for the SPUs Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 48. Discrete Progressive Mesh  Smoothly reduces the triangle count as a model moves into the distance  With discrete progressive mesh, the LOD calculation is done once for an entire object  An entire object is processed at once by the tools to avoid cracks between vertex sets Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 49. At an LOD there are two types of vertices LOD = 0.0 Parent Vertex Child Vertex
  • 50. As the LOD level decreases, the children “slide” towards their parents LOD = 0.2 Parent Vertex Child Vertex
  • 51. The children continue to move towards their parents LOD = 0.7 Parent Vertex Child Vertex
  • 52. At the next integral LOD, all child vertices disappear as do the triangles LOD = 1.0 Parent Vertex Child Vertex
  • 53. Vertices are arranged from lowest LOD to highest LOD  At LOD 0 all vertices are needed.  At LOD 1, child vertices from LOD 0 are no longer needed  At LOD 2, child vertices from LOD 1 are also removed  This saves bandwidth to the SPUsLOD 0 LOD 1 LOD 2 Vertex Table
  • 54. Every LOD has its own index table of triangles and parent table  The parent table contains an index to the parent for every child vertex Indexes for LOD 0 Indexes for LOD 1 Indexes for LOD 2 Parent Table for LOD 2 Parent Table for LOD 1 Parent Table for LOD 0
  • 55. LODs are arranged in LOD groups to avoid small vertex sets LOD 0 LOD 1 LOD 2 LOD 0 LOD 1 LOD 2 LOD Group 0 LOD 0 LOD 1 LOD 2 LOD 3 LOD 4 LOD 5 LOD Group 1
  • 56.
  • 57. Continuous Progressive Mesh  Like discrete progressive mesh, child vertices move smoothly toward their parents  However, the LOD is calculated for each vertex instead of just once for the object Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 58. Vertex set about to undergo continuous progressive mesh Parent Vertex Child Vertex, LOD 1 Child Vertex, LOD 0
  • 59. A single vertex set can straddle several LOD ranges LOD = 1.0 Parent Vertex Child Vertex, LOD 1 Child Vertex, LOD 0 LOD = 0.0
  • 60. Vertices move depending on their distance Parent Vertex Child Vertex, LOD 1 Child Vertex, LOD 0 LOD = 1.0 LOD = 0.0
  • 61.
  • 62. Stencil Shadows Edge Table T14 T15 V1 T6 T23 V0 T10 T18 V2 V0T10T2 V0T8T0 V1T6T5 V2T26T5 V0T13T4 V2T17T15 V1T25T11 T14 T15 V1
  • 63. Also need adjoining triangles and vertices from neighboring vertex sets Edge Table T14 T15 V1 T6 T23 V0 T10 T18 V2 V0T10T2 V0T8T0 V1T6T5 V2T26T5 V0T13T4 V2T17T15 V1T25T11
  • 64. Find the profile edges and generate a new vertex table of extruded edges Extruded Edges T14 T15 V1 T6 T23 V0 T10 T18 V2 V0T10T2 V0T8T0 V1T6T5 V2T26T5 V0T13T4 V2T17T15 V1T25T11 Edge Table
  • 65. Output the new vertex data and draw commands to a shadow context Command Context for Light 0 Shadow Draw 13 Shadow Draw 19 Shadow Draw 18 SPU
  • 66. May as well do multiple lights at the same time Command Context for Light 0 Shadow Draw 13 Shadow Draw 19 Shadow Draw 18 SPU Shadow Draw 11 Shadow Draw 19 Shadow Draw 12 Shadow Draw 13 Shadow Draw 19 Shadow Draw 18 Shadow Draw 15 Shadow Draw 19 Shadow Draw 16 Command Context for Light 1 Command Context for Light 2 Command Context for Light 3
  • 67. Normal and Tangent Calculation  Typically having normals and tangents included in the vertex data is a good thing  However, some operations can move the positions so much that the included normals and tangents are no longer correct  Solution: Recalculate the normals and tangents on the SPU!
  • 68. Blend shapes can move the positions quite a lot!
  • 69. Recalculate the normals!  Like stencil shadows, calculating normals and tangents requires information about adjoining triangles and vertices from neighboring vertex sets  Only worth the cost in limited situations
  • 70. Triangle Culling  Many triangles in a scene will ultimately have no renderable area  Culling these triangles on the SPU removes the burden of the RSX™ processing triangles which do not contribute to the final image  This leaves the RSX™ with more time to process relevant triangles Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 71. What types of triangles can be culled?
  • 72. Off screen triangles can be culled
  • 73. Back facing triangles can be culled, but be sure to use error bars
  • 75. Some triangles are very small
  • 76. Some triangles are so small that they do not cover a pixel center
  • 77. These triangles can be culled
  • 79. But these triangles can still be culled
  • 80. The SPU starts with the input triangle index table Tri 0 Tri 1 Tri 2 Tri 3 Tri 4 Tri 5 Tri 6 Tri 7 Tri 8 Tri 9 Tri 10 Tri 11 Tri 12 Tri 13 Tri 14 Original Index Table
  • 81. The culling algorithm determines which triangles are to be kept Tri 0 Tri 1 Tri 2 Tri 3 Tri 4 Tri 5 Tri 6 Tri 7 Tri 8 Tri 9 Tri 10 Tri 11 Tri 12 Tri 13 Tri 14 Original Index Table
  • 82. And a new index table is created from these triangles Tri 0 Tri 1 Tri 2 Tri 3 Tri 4 Tri 5 Tri 6 Tri 7 Tri 8 Tri 9 Tri 10 Tri 11 Tri 12 Tri 13 Tri 14 Original Index Table Tri 1 Tri 4 Tri 6 Tri 7 Tri 11 Tri 14 Culled Index Table Culling can remove 60-70% of all triangles!
  • 83. Vertex Culling  After triangles are culled, some vertices are no longer used in any triangle  These vertices can be removed from the vertex table  This is done by first building a vertex renaming table which contains the new vertex index for each vertex Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 84. Start with an empty vertex renaming table 4 5 6 7 8 9 10 11 12 13 14 15 3 2 1 0 Index Table Vertex Table Renaming Table -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 2 5 1 4 10 1 8 13 2 5 7 5 7 8 7 8 9 8 10 13 10 13 14
  • 85. Add a new index for each used vertex in the index table 4 5 6 7 8 9 10 11 12 13 14 15 3 2 1 00 2 5 1 4 10 1 8 13 2 5 7 5 7 8 7 8 9 8 10 13 10 13 14 Index Table Vertex Table Renaming Table 0 3 1 -1 4 2 -1 8 6 9 5 -1 -1 7 10 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 3 1 4 2 8 6 9 5 7 10
  • 86. Using the renaming table, build a new vertex table with only used vertices 4 5 6 7 8 9 10 11 12 13 14 15 3 2 1 0 Vertex Table Renaming Table 0 3 1 -1 4 2 -1 8 6 9 5 -1 -1 7 10 -1 4 10 8 13 7 9 14 1 5 2 0 New Vertex TableIndex Table 0 2 5 1 4 10 1 8 13 2 5 7 5 7 8 7 8 9 8 10 13 10 13 14
  • 87. Finally, replace the old indices in the index table with the new indices New Vertex TableIndex Table Renaming Table 0 3 1 -1 4 2 -1 8 6 9 5 -1 -1 7 10 -1 4 10 8 13 7 9 14 1 5 2 0 4 5 6 7 8 9 10 11 12 13 14 15 3 2 1 0 Vertex Table 0 2 5 1 4 10 1 8 13 2 5 7 5 7 8 7 8 9 8 10 13 10 13 14 0 1 2 3 4 5 3 6 7 1 2 8 5 7 8 8 6 9 6 5 7 5 7 10
  • 88. Only minor performance gains on the RSX™, if any  Removes about 30% of the vertex data  Better use of the pre-transform cache, but not much else
  • 89. Vertex Stream Combining  A vertex stream is an interleaved set of vertex attributes, which is used natively by the RSX™  Fewer vertex streams results in better performance  Easy to combine streams while compressing vertex attributes into RSX™ formats Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 90. Vertex attributes can be input into the SPUs in multiple streams Input Vertex Stream 0 Input Vertex Stream 1
  • 91. The vertex streams are decompressed into tables of floats Input Vertex Stream 0 Input Vertex Stream 1 Float Tables
  • 92. When done, the vertex attributes are compressed into one output stream Input Vertex Stream 0 Output Vertex Stream Float Tables Input Vertex Stream 1
  • 93. Output Buffering Schemes  Vertex and index data constructed by the SPUs is output from SPU local store  Holes in the command buffer are patched with pointers to the vertex and index data as well as the draw commands Decompression Blend Shapes Skinning Progressive Mesh Culling Compression Output
  • 94. Double Buffer  Each buffer stores vertex and index data for an entire frame  SPUs atomically access a mutex which is used to allocate memory from a buffer  Easy synchronization with the RSX™ once a frame  Uses lots of memory Vertex and Index Data for Frame 0 Vertex and Index Data for Frame 1
  • 95. It is possible to completely fill a buffer  Can use a callback to allocate new memory (which you may not have)  Don’t draw geometry that doesn’t fit (difficult to pick which geometry not to draw) SPUData Vertex and Index Data
  • 96. Double buffer requires an extra frame of lag in the rendering pipeline Build Jobs on PPU Process Jobs on SPU Render on RSX™ Scan Out Build Jobs on PPU Process Jobs on SPU Render on RSX™ Scan Out Build Jobs on PPU Process Jobs on SPU Render on RSX™ Scan Out
  • 97. Single Buffer  Uses only half the memory!  Still possible to completely fill the buffer Vertex and Index Data for Single Frame
  • 98. Single buffer uses a shorter rendering pipeline  Vertex and index data is created just-in-time for the RSX™  Draw commands are inserted into the command buffer while the RSX™ is rendering  Requires tight SPU↔RSX™ synchronization Build Jobs on PPU SPU Processing/ RSX™ Rendering Scan Out Build Jobs on PPU SPU Processing/ RSX™ Rendering Scan Out Build Jobs on PPU SPU Processing/ RSX™ Rendering Scan Out
  • 99. Command Buffer Holes  SPU processing requires some setup by the PPU  Some job data is required for each vertex set  Static portions of the command buffer are built on the PPU  Static vertex attribute pointers  Index table pointer and draw commands when not performing triangle culling on the SPU  “Holes” are left for the dynamic portion built by the SPU Static 18 Other State State Hole 18 Hole 19 Hole 20 Hole 21 Hole 22 Static 19 Static 20 Static 21 Static 22 Command Buffer
  • 100. Filling the Holes  Dynamic portions are built on the SPU  Vertex attribute pointers for any attributes output by the SPU  Draw commands when performing triangle culling  Commands necessary for ring buffer synchronization  For brevity, we will not show the PPU generated commands going forward Static 18 Other State State Static 19 Static 20 Static 21 Static 22 Command Buffer Hole 18 Hole 19 Hole 20 Hole 21 Hole 22 Draw 18 Draw 19 Draw 20 Draw 21 Draw 22
  • 101. RSX™ Put Pointer  RSX™ has an internal “Put Pointer”  Commands in the command buffer are executed up to the Put Pointer  Commands after the Put Pointer are not executed Command Buffer Put Pointer
  • 102. Put Pointer SPU↔RSX™ Synchronization Using the Put Pointer Command Buffer Output Buffer SPU Draw 17 Data 17
  • 103. Put Pointer SPU outputs vertex and index data Command Buffer Output Buffer SPU Draw 17 Data 17 Data 18
  • 104. Put Pointer SPU outputs vertex attribute pointers and draw commands Command Buffer Output Buffer SPU Draw 17 Data 17 Data 18Draw 18
  • 105. Put Pointer SPU updates the Put Pointer Command Buffer Output Buffer SPU Draw 17 Data 17 Data 18Draw 18
  • 106. Put Pointer But there are six SPUs, so who updates the Put Pointer? Command Buffer Output Buffer SPU 0Draw 17 Data 17 Data 21Draw 18 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 Draw 19 Draw 20 Draw 21 Draw 22 Draw 23 Data 19 Data 18 Data 20 Data 23 Data 22 Other
  • 107. Put Pointer SPUs are asynchronous, so they can finish in any order! Command Buffer Output Buffer SPU 0Draw 17 Data 17 Data 21 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 Draw 21 Other
  • 108. Put Pointer So, the SPUs must synchronize with each other! Command Buffer Output Buffer SPU 0Draw 17 Data 17 Data 19 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 Draw 21 Draw 19 Data 21 Other
  • 109. The Put Pointer is updated only when ALL previous jobs are done… Command Buffer Output Buffer SPU 0Draw 17 Data 17 Data 18 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 Draw 21 Draw 18 Data 21 Draw 19 Data 19 Put Pointer Other
  • 110. But can only be moved to the end of this job’s draw commands Command Buffer Output Buffer SPU 0Draw 17 Data 17 Data 20 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 Draw 21 Draw 20 Data 21 Draw 19 Data 18 Put Pointer Draw 18 Data 19 Other
  • 111. Remember to guarantee progress! Command Buffer Output Buffer SPU 0Draw 17 Data 17 Data 23 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 Draw 21 Draw 23 Data 21 Draw 19 Data 18 Put Pointer Draw 18 Data 19 Draw 20 Data 20 Other
  • 112. The last job finished moves the Put Pointer to the end of the buffer Command Buffer Output Buffer SPU 0Draw 17 Data 17 Data 22 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 Draw 21 Draw 22 Data 21 Draw 19 Data 18 Put Pointer Draw 18 Data 19 Draw 20 Data 20 Other Draw 23 Data 23
  • 113. SPU↔RSX™ Synchronization Using Local Stalls Command Buffer Draw 17 Put Pointer Local Stall Other Local Stall Local Stall Local Stall Local Stall Local Stall  Easier and faster than Put Pointer synchronization  Place local stalls in the command buffer where necessary  RSX™ will stop processing at a local stall until it is overwritten by new commands  SPUs will generally stay ahead of the RSX™, so stalls rarely occur Local Stall
  • 114. SPU will overwrite local stalls when it outputs a set of new commands  No SPU↔SPU synchronization required!  Please see the document regarding this technique on the PS3 Developer’s Support website for crucial details SPU Command Buffer Draw 17 Put Pointer Local Stall Other Local Stall Local Stall Local Stall Local Stall Local Stall Local Stall New Commands
  • 115. Ring Buffers  Small memory footprint  Will not run out of memory  Can stall the SPUs if buffers become full  Objects need to be processed in the same order the RSX™ renders them to prevent deadlock Vertex and Index Data End of Free Area Start of Free Area
  • 116. Each SPU has its own buffer to prevent deadlock Data 23 Data 17 Data 10 Data 22 Data 11 Data 16 Data 19 Data 6 Data 14 Data 20 Data 7 Data 15 Data 21 Data 12 Data 9 Data 18 Data 8 Data 13 SPU 0 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 Buffer 0 Buffer 1 Buffer 3 Buffer 2 Buffer 4 Buffer 5
  • 117. RSX™ writes a semaphore once a chunk of data has been consumed  A command to write a semaphore needs to be added to the command buffer after all commands that use the data  The value of the semaphore to be written is the new end of free area pointer Data 19 Data 6 Data 14 Start of Free Area Draw 6 Draw 5 Draw 7 Semaphore Semaphore Semaphore Current RSX™ Execution New End of Free Area
  • 118. SPU 0 SPU 1 SPU 2 SPU 3 SPU 4 SPU 5 RSX™
  • 119. Future Work  Cg compiler for SPUs  Complicated vertex programs could be run on the SPUs instead of the RSX™  Can’t have too many outputs otherwise the RSX™ will take longer loading them than it would have to run the program
  • 120. Future Work  Shadow map generation on SPUs  Large load removed from RSX™  Very doable  Much more complicated if you have alpha cutouts in your textures
  • 122. GCM Replay  New tool for use with the RSX™  Analysis  Debugging  Profiling  Will be released soon to all licensed developers as part of PLAYSTATION®Edge  Main tool runs on the PC  Integration into your title is simple and easy
  • 123.
  • 125. Render Targets Textures Vertex & Fragment Programs Vertex & Index Buffers Snapshot Contents Command Buffer
  • 126.
  • 130. Q: What If I Had Efficient Triangle Culling? What would the performance gain be?  GCM Replay can remove all draw calls to triangles which never write a pixel  Once this is done, GCM Replay can reprofile the snapshot and compute the speed increase
  • 131. A: Cull Triangles Using an SPU!  Triangle culling techniques shown earlier can dramatically increase performance
  • 132. Q: What If My Setting of Fragment Program Constants Was Done Externally to the Command Buffer?  One copy of each fragment program is kept in memory  Individual fragment program constants are patched by placing draw commands in the command buffer in the appropriate locations Fragment Program Patched Constants Conventional Patch Technique
  • 133. A: Patch Using the PPU or SPU!  Multiple copies of fragment programs can be patched with the appropriate constants either on the PPU or an SPU  Removes 100% of the RSX™ load for patching fragment programs  If done as part of SPU processing of a vertex set, synchronization will be already be taken care of Patched Copy 3 Patched Copy 2 Patched Copy 4 Patched Copy 1
  • 134. Q: What If I Had More Optimal Indexed Triangle Lists?
  • 135. A: Optimize for the Four Vertex Mini-cache!  GCM Replay contains an optimizer for indexed triangle list ordering  Corresponding offline indexed triangle list optimizer available as part of PLAYSTATION®Edge Vertex 1 Vertex 0 Vertex 2 Vertex 3 Four Vertex Mini-cache
  • 136. Strip Example 3 new vertices 3
  • 139. Strip Example 3 1 1 1 1 new vertex… 2 vertices + 1 per triangle in total
  • 140. Free Form Example 3 new vertices 3
  • 141. Free Form Example 1 new vertex 3 1
  • 142. Free Form Example 1 new vertex 3 1 1
  • 143. Free Form Example 1 new vertex 3 1 1 1
  • 144. Free Form Example 1 new vertex 3 1 1 1 1
  • 145. Free Form Example 0 new vertices! 3 1 1 1 1 0
  • 146. Free Form Example 1 new vertex 3 1 1 1 1 0 1
  • 147. Free Form Example 1 new vertex 3 1 1 1 1 0 1 1
  • 148. Free Form Example 1 new vertex 3 1 1 1 1 0 1 1 1
  • 149. Free Form Example 0 new vertices… 3 1 1 1 1 0 1 1 1 0 2 vertices + 3 per 4 triangles in total
  • 150. Q: What If I Had Perfect Object Z-Culling?  Some objects will not contribute to the final scene because they are entirely blocked by other objects  GCM Replay will soon be able to show the performance difference if good object Z-culling was performed
  • 151. A: Object Z-Culling on SPUs  Write an SPU rasterizer  Render the depth values of a low polygon version of the environment  Rasterize and check bounding volumes of objects