-
1.
Vertex Shader Tricks
New Ways to Use the Vertex Shader to Improve
Performance
Bill Bilodeau
Developer Technology Engineer, AMD
-
2.
Topics Covered
● Overview of the DX11 front-end pipeline
● Common bottlenecks
● Advanced Vertex Shader Features
● Vertex Shader Techniques
● Samples and Results
-
3.
Graphics Hardware
DX11 Front-End Pipeline
● VS – vertex data
● HS – control points
● Tessellator
● DS – generated vertices
● GS – primitives
● Write to UAV at all stages
● Starting with DX11.1
[Diagram: DX11 front-end pipeline. Stages: Input Assembler, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Out. Resources: CB, SRV, or UAV. Graphics hardware blocks (repeated): vector GPRs (256 2048-bit registers), vector ALU (one 64-way single-precision operation every 4 clocks), scalar ALU (one operation every 4 clocks), scalar GPRs (256 64-bit registers), vector/scalar cross-communication bus.]
-
4.
Bottlenecks - VS
● VS Attributes
● Limit outputs to 4 attributes (AMD)
●This applies to all shader stages (except PS)
● VS Texture Fetches
● Too many texture fetches can add latency
●Especially dependent texture fetches
●Group fetches together for better performance
●Hide latency with ALU instructions
-
5.
Bottlenecks - VS
● Use the caches wisely
● Avoid large vertex formats
that waste pre-VS cache
space
● DrawIndexed() allows for
reuse of processed vertices
saved in the post-VS cache
●Vertices with the same index
only need to get processed once
[Diagram: Input Assembler, Pre-VS Cache (hides latency), Vertex Shader, Post-VS Cache (vertex reuse)]
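The vertex reuse described above can be illustrated on the CPU. The following is a minimal sketch, assuming a simple FIFO post-VS cache; real hardware cache sizes and replacement policies differ and are not stated in these slides. The function name `countVertexShaderRuns` and the `cacheSize` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts how many times a vertex shader would run for an indexed draw,
// assuming a hypothetical FIFO post-VS cache that holds `cacheSize` vertices.
std::size_t countVertexShaderRuns(const std::vector<std::uint32_t>& indices,
                                  std::size_t cacheSize)
{
    assert(cacheSize > 0);
    std::deque<std::uint32_t> cache;  // indices of recently shaded vertices
    std::size_t runs = 0;
    for (std::uint32_t index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end())
            continue;                 // cache hit: the shaded vertex is reused
        ++runs;                       // cache miss: the vertex is shaded
        cache.push_back(index);
        if (cache.size() > cacheSize)
            cache.pop_front();        // evict the oldest entry
    }
    return runs;
}
```

For a quad drawn as two indexed triangles (indices 0, 1, 2, 1, 3, 2), this model shades 4 vertices when the cache is large enough, compared with 6 when nothing can be reused.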
-
6.
Bottlenecks - GS
● GS
● Can add or remove primitives
● Adding new primitives requires storing new
vertices
●Going off chip to store data can be a bandwidth issue
● Using the GS means another shader stage
●This means more competition for shader resources
●Better if you can do everything in the VS
-
7.
Advanced Vertex Shader Features
● SV_VertexID, SV_InstanceID
● UAV output (DX11.1)
● NULL vertex buffer
● VS can create its own vertex data
-
8.
SV_VertexID
● Can use the vertex id to decide what
vertex data to fetch
● Fetch from SRV, or procedurally create a
vertex
// g_VertexBuffer is assumed to be an SRV bound to the VS (e.g. a StructuredBuffer<float3>)
VSOut VertexShader(uint id : SV_VertexID)
{
float3 vertex = g_VertexBuffer[id];
…
}
-
9.
UAV buffers
● Write to UAVs from a Vertex Shader
● New feature in DX11.1 (UAV at any stage)
● Can be used instead of stream-out for
writing vertex data
● Triangle output not limited to strips
●You can use whatever format you want
● Can output anything useful to a UAV
-
10.
NULL Vertex Buffer
● DX10 and DX11 allow this
● Just set the number of vertices in Draw()
● VS will execute without a vertex buffer bound
● Can be used for instancing
● Call Draw() with the total number of vertices
● Bind mesh and instance data as SRVs
-
11.
Vertex Shader Techniques
● Full Screen Triangle
● Vertex Shader Instancing
● Merged Instancing
● Vertex Shader UAVs
-
12.
Full Screen Triangle
● For post-processing effects
● Triangle has better performance
than quad
● Fast and easy with VS
generated coordinates
● No IB or VB is necessary
● Something you should be
using for full screen effects
Clip Space Coordinates: (-1, -1, 0), (-1, 3, 0), (3, -1, 0)
-
13.
Full Screen Triangle: C++ code
// Null VB, IB
pd3dImmediateContext->IASetVertexBuffers( 0, 0, NULL, NULL, NULL );
pd3dImmediateContext->IASetIndexBuffer( NULL, (DXGI_FORMAT)0, 0 );
pd3dImmediateContext->IASetInputLayout( NULL );
// Set Shaders
pd3dImmediateContext->VSSetShader( g_pFullScreenVS, NULL, 0 );
pd3dImmediateContext->PSSetShader( … );
pd3dImmediateContext->PSSetShaderResources( … );
pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );
// Render 3 vertices for the triangle
pd3dImmediateContext->Draw(3, 0);
-
14.
Full Screen Triangle: HLSL Code
VSOutput VSFullScreenTest(uint id:SV_VERTEXID)
{
VSOutput output;
// generate clip space position
output.pos.x = (float)(id / 2) * 4.0 - 1.0;
output.pos.y = (float)(id % 2) * 4.0 - 1.0;
output.pos.z = 0.0;
output.pos.w = 1.0;
// texture coordinates
output.tex.x = (float)(id / 2) * 2.0;
output.tex.y = 1.0 - (float)(id % 2) * 2.0;
// color
output.color = float4(1, 1, 1, 1);
return output;
}
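To check the vertex-ID arithmetic in the shader above, the same expressions can be mirrored on the CPU. This is only a sketch for verification: the struct and function names are hypothetical, and only the position and texture coordinate math is reproduced.

```cpp
#include <cassert>
#include <cstdint>

struct FullScreenVertex {
    float x, y;   // clip-space position (z = 0, w = 1 in the shader)
    float u, v;   // texture coordinates
};

// CPU mirror of the per-vertex arithmetic used in the full-screen triangle shader.
FullScreenVertex fullScreenVertex(std::uint32_t id)
{
    assert(id < 3);  // the triangle is drawn with Draw(3, 0)
    FullScreenVertex out;
    out.x = static_cast<float>(id / 2) * 4.0f - 1.0f;
    out.y = static_cast<float>(id % 2) * 4.0f - 1.0f;
    out.u = static_cast<float>(id / 2) * 2.0f;
    out.v = 1.0f - static_cast<float>(id % 2) * 2.0f;
    return out;
}
```

Vertex IDs 0, 1, and 2 produce the positions (-1, -1), (-1, 3), and (3, -1). The texture coordinates at those corners are (0, 1), (0, -1), and (2, 1), so the visible part of the triangle covers the [0, 1] texture range.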
-
15.
VS Instancing: Point Sprites
● Often done in the GS, but can be faster in the VS
● Create an SRV point buffer and bind to VS
● Call Draw or DrawIndexed to render the full
triangle list.
● Read the location from the point buffer and
expand to vertex location in quad
● Can be used for particles or Bokeh DOF sprites
● Don’t use DrawInstanced for a small mesh
-
16.
Point Sprites: C++ Code
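// Index buffer holds 6 indices per particle (two triangles) referencing 4 vertex IDs per quad
pd3dImmediateContext->IASetIndexBuffer( g_pParticleIndexBuffer, DXGI_FORMAT_R32_UINT, 0 );
pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );
// Render 6 indices for each particle quad
pd3dImmediateContext->DrawIndexed( g_particleCount * 6, 0, 0 );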
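The slides do not show how the particle index buffer is filled. A minimal sketch follows, assuming 4 vertex IDs per particle (so that `id / 4` and `id % 4` in the vertex shader recover the particle and corner) and two triangles per quad. The function name `buildParticleIndices` and the corner order are assumptions; the winding may need to be changed to match the cull mode in use.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds 32-bit indices for `particleCount` quads: 6 indices per particle,
// referencing vertex IDs 4*p .. 4*p+3 for particle p.
std::vector<std::uint32_t> buildParticleIndices(std::uint32_t particleCount)
{
    // Keep the largest vertex ID (4 * particleCount - 1) inside 32 bits.
    assert(particleCount <= 0xFFFFFFFFu / 4u);
    // Assumed corner order for the two triangles of one quad.
    static const std::uint32_t corners[6] = {0, 1, 2, 1, 3, 2};
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(particleCount) * 6);
    for (std::uint32_t p = 0; p < particleCount; ++p)
        for (std::uint32_t corner : corners)
            indices.push_back(p * 4 + corner);
    return indices;
}
```

The resulting vector would be uploaded as a `DXGI_FORMAT_R32_UINT` index buffer and drawn with `particleCount * 6` indices, matching the draw call above.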
-
17.
Point Sprites: HLSL Code
VSInstancedParticleDrawOut VSIndexBuffer(uint id:SV_VERTEXID)
{
VSInstancedParticleDrawOut output;
uint particleIndex = id / 4;
uint vertexInQuad = id % 4;
// calculate the position of the vertex
float3 position;
position.x = (vertexInQuad % 2) ? 1.0 : -1.0;
position.y = (vertexInQuad & 2) ? -1.0 : 1.0;
position.z = 0.0;
position.xy *= PARTICLE_RADIUS;
position = mul( position, (float3x3)g_mInvView ) +
g_bufPosColor[particleIndex].pos.xyz;
output.pos = mul( float4(position,1.0), g_mWorldViewProj );
output.color = g_bufPosColor[particleIndex].color;
// texture coordinate
output.tex.x = (vertexInQuad % 2) ? 1.0 : 0.0;
output.tex.y = (vertexInQuad & 2) ? 1.0 : 0.0;
return output;
}
-
18.
Point Sprite Performance
Time (ms)
Method          Sprites   AMD Radeon R9 290X   Nvidia Titan
Indexed         500K      0.52                 0.52
Non-Indexed     500K      0.77                 0.87
GS              500K      1.38                 0.83
DrawInstanced   500K      1.77                 5.1
Indexed         1M        1.02                 1.5
Non-Indexed     1M        1.53                 1.92
GS              1M        2.7                  1.6
DrawInstanced   1M        3.54                 10.3
-
19.
Point Sprite Performance
● DrawIndexed() is the fastest method
● Draw() is slower but doesn’t need an IB
● Don’t use DrawInstanced() for creating
sprites on either AMD or NVIDIA hardware
● DrawInstanced() is not recommended for meshes
with a small number of vertices
-
20.
Merge Instancing
● Combine multiple meshes that can be
instanced many times
● Better than normal instancing which renders
only one mesh
● Instance nearby meshes for smaller bounding box
● Each mesh is a page in the vertex data
● Fixed vertex count for each mesh
●Meshes smaller than page size use degenerate triangles
-
21.
Merge Instancing
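[Diagram: Mesh vertex data is stored as fixed-length pages (Mesh Data 0, Mesh Data 1, Mesh Data 2, ...). Each page holds Vertex 0, Vertex 1, Vertex 2, Vertex 3, ..., padded with zeros (degenerate triangles) up to the page length. Mesh instance data maps each instance to a mesh index (Instance 0: Mesh Index 2, Instance 1: Mesh Index 0, ...).]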
-
22.
Merged Instancing using VS
● Use the vertex ID to look up the mesh to
instance
● All mesh pages are the same size, so (id / SIZE)
gives the instance index used to look up the mesh,
and (id % SIZE) gives the vertex within the page
● Faster than using DrawInstanced()
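The lookup described above can be mirrored on the CPU as follows. This is a sketch, assuming a non-indexed draw where vertex IDs run from 0 upward, a per-instance table of mesh indices, and a paged mesh buffer with `pageSize` vertices per mesh. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct MergedVertexRef {
    std::uint32_t instanceIndex;  // instance this vertex belongs to
    std::uint32_t meshIndex;      // mesh page used by that instance
    std::uint32_t vertexOffset;   // location of the vertex in the paged mesh buffer
};

// Resolves a vertex ID to the instance, mesh page, and vertex it refers to.
MergedVertexRef resolveVertex(std::uint32_t vertexId,
                              std::uint32_t pageSize,
                              const std::vector<std::uint32_t>& instanceMeshIndex)
{
    assert(pageSize > 0);
    MergedVertexRef ref;
    ref.instanceIndex = vertexId / pageSize;
    assert(ref.instanceIndex < instanceMeshIndex.size());
    ref.meshIndex = instanceMeshIndex[ref.instanceIndex];
    ref.vertexOffset = ref.meshIndex * pageSize + vertexId % pageSize;
    return ref;
}
```

In a vertex shader the same arithmetic would index SRVs holding the instance data and the mesh pages; per-instance transforms would be fetched with `instanceIndex`.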
-
23.
Merge Instancing Performance
[Chart: time in ms for DrawInstanced vs. Soft Instancing on AMD Radeon R9 290X and Nvidia GTX 780]
● Instancing performance test by
Cloud Imperium Games for Star
Citizen
● Renders 13.5M triangles (~40M
verts)
● DrawInstanced version calls
DrawInstanced() and uses instance
data in a vertex buffer
● Soft Instancing version uses
vertex instancing with Draw() calls
and fetches instance data from
SRV
-
24.
Vertex Shader UAVs
● Random access Read/Write in a VS
● Can be used to store transformed vertex
data for use in multi-pass algorithms
● Can be used for passing constant
attributes between any shader stage (not
just from VS)
-
25.
Skinning to UAV
● Skin vertex data then output to UAV
● Instance the skinned UAV data multiple times
● Can also be used for non-instanced data
● Multiple passes can reuse the transformed
vertex data, for example shadow map rendering
● Performance is about the same as
stream-out, but you can do more …
-
26.
Bounding Box to UAV
● Can calculate and store Bbox in the VS
● Use a UAV to store the min/max values (6)
● InterlockedMin/InterlockedMax determine min
and max of the bbox
●Need to use integer values with atomics
● Use the stored bbox in later passes
● GPU physics (collision)
● Tile based processing
-
27.
Bounding Box: HLSL Code
void UAVBBoxSkinVS(VSSkinnedIn input, uint id:SV_VERTEXID )
{
// skin the vertex
. . .
// output the max and min for the bounding box
int x = (int) (vSkinned.Pos.x * FLOAT_SCALE); // convert to integer
int y = (int) (vSkinned.Pos.y * FLOAT_SCALE);
int z = (int) (vSkinned.Pos.z * FLOAT_SCALE);
InterlockedMin(g_BBoxUAV[0], x);
InterlockedMin(g_BBoxUAV[1], y);
InterlockedMin(g_BBoxUAV[2], z);
InterlockedMax(g_BBoxUAV[3], x);
InterlockedMax(g_BBoxUAV[4], y);
InterlockedMax(g_BBoxUAV[5], z);
. . .
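The fixed-point atomic min/max pattern in the shader above can be emulated on the CPU with `std::atomic`. This is a sketch under assumptions: `kFloatScale` stands in for `FLOAT_SCALE` with an arbitrary value, the buffer layout (min in slots 0 to 2, max in slots 3 to 5) follows the shader, and the buffer is initialized to `INT_MAX`/`INT_MIN` before use, a step the slides do not show. The helper names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <climits>

constexpr float kFloatScale = 1024.0f;  // assumed stand-in for FLOAT_SCALE

// Same layout as g_BBoxUAV in the shader: [0..2] = min xyz, [3..5] = max xyz.
struct BBoxBuffer {
    std::atomic<int> v[6];
    BBoxBuffer() {
        for (int i = 0; i < 3; ++i) {
            v[i].store(INT_MAX);      // min slots start at the largest value
            v[i + 3].store(INT_MIN);  // max slots start at the smallest value
        }
    }
};

// Emulates InterlockedMin with a compare-and-swap loop.
void atomicMin(std::atomic<int>& target, int value)
{
    int current = target.load();
    while (value < current && !target.compare_exchange_weak(current, value)) {}
}

// Emulates InterlockedMax with a compare-and-swap loop.
void atomicMax(std::atomic<int>& target, int value)
{
    int current = target.load();
    while (value > current && !target.compare_exchange_weak(current, value)) {}
}

// Converts one position to fixed point and merges it into the bounding box.
void accumulatePoint(BBoxBuffer& box, float x, float y, float z)
{
    const int p[3] = { static_cast<int>(x * kFloatScale),
                       static_cast<int>(y * kFloatScale),
                       static_cast<int>(z * kFloatScale) };
    for (int i = 0; i < 3; ++i) {
        atomicMin(box.v[i], p[i]);
        atomicMax(box.v[i + 3], p[i]);
    }
}
```

Later passes would divide the stored integers by the same scale factor to recover floating-point bounds, with a precision of one over the scale.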
-
28.
Particle System UAV
● Single pass GPU-only particle system
● In the VS:
● Generate sprites for rendering
● Do Euler integration and update the particle
system state to a UAV
-
29.
Particle System: HLSL Code
uint particleIndex = id / 4;
uint vertexInQuad = id % 4;
// calculate the new position of the vertex
float3 oldPosition = g_bufPosColor[particleIndex].pos.xyz;
float3 oldVelocity = g_bufPosColor[particleIndex].velocity.xyz;
// Euler integration to find new position and velocity
float3 acceleration = normalize(oldVelocity) * ACCELLERATION;
float3 newVelocity = acceleration * g_deltaT + oldVelocity;
float3 newPosition = newVelocity * g_deltaT + oldPosition;
g_particleUAV[particleIndex].pos = float4(newPosition, 1.0);
g_particleUAV[particleIndex].velocity = float4(newVelocity, 0.0);
// Generate sprite vertices
. . .
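The integration step in the shader above can be checked with a CPU mirror. This is a sketch, assuming a plain `Vec3` type and passing the acceleration magnitude as a parameter; the names are hypothetical. As in the shader, the position update uses the new velocity, and the velocity must be non-zero because it is normalized.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Particle { Vec3 pos; Vec3 velocity; };

// One integration step: accelerate along the current direction of travel,
// then advance the position using the updated velocity.
Particle integrate(const Particle& p, float accelMagnitude, float deltaT)
{
    const Vec3& v = p.velocity;
    float speed = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    assert(speed > 0.0f);  // normalizing a zero velocity is undefined

    // acceleration = normalize(oldVelocity) * accelMagnitude
    Vec3 accel = { v.x / speed * accelMagnitude,
                   v.y / speed * accelMagnitude,
                   v.z / speed * accelMagnitude };

    Particle out;
    out.velocity = { accel.x * deltaT + v.x,
                     accel.y * deltaT + v.y,
                     accel.z * deltaT + v.z };
    out.pos = { out.velocity.x * deltaT + p.pos.x,
                out.velocity.y * deltaT + p.pos.y,
                out.velocity.z * deltaT + p.pos.z };
    return out;
}
```

For a particle at the origin moving at (2, 0, 0) with an acceleration magnitude of 4 and a time step of 0.5, the new velocity is (4, 0, 0) and the new position is (2, 0, 0).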
-
30.
Conclusion
● Vertex shader “tricks” can be more
efficient than more commonly used methods
● Use SV_VertexID for smarter instancing
●Sprites
●Merge Instancing
● UAVs add lots of freedom to vertex shaders
●Bounding box calculation
●Single pass VS particle system
-
31.
Demos
● Particle System
● UAV Skinning
● Bbox
-
32.
Acknowledgements
● Merge Instancing
● Emil Persson, “Graphics Gems for Games”
SIGGRAPH 2011
● Brendan Jackson, Cloud Imperium
● Thanks to
● Nick Thibieroz, AMD
● Raul Aguaviva (particle system UAV), AMD
● Alex Kharlamov, AMD
-
33.
Questions
● bill.bilodeau@amd.com
The value of SV_VertexID depends on the draw call. For non-indexed Draw, the vertex ID starts with 0 and increments by 1 for every vertex processed by the shader. For DrawIndexed(), the vertexID is the value of the index in the index buffer for that vertex.
For indexed Draw calls, create an index buffer which contains (index location + index number). That way you can calculate (vertexID/vertsPerMesh) to get the instance index, and (vertexID % vertsPerMesh) to get the index value which you can use to look up the vertex.
If the mesh is being reused many times, then calculating the bounding box has little overhead. The bounding box can be used for collision detection.
The shader could read from and write to the UAV instead of binding an input SRV.