Approaching zero driver overhead

Cass Everitt
NVIDIA
Tim Foley
Intel
Graham Sellers
AMD
John McDonald
NVIDIA
Cass Everitt
● NVIDIA
Assertion
● OpenGL already has paths with very low
driver overhead
● You just need to know
● What they are, and
● How to use them
But first, who are we?
● Graham Sellers @GrahamSellers
● AMD OpenGL driver manager, OpenGL SuperBible author
● Tim Foley @TangentVector
● Graphics researcher, GPU language/compiler nerd
● John McDonald @basisspace
● Graphics engineer, chip architect, game developer
● Cass Everitt @casseveritt
● GL zealot, chip architect, mobile enthusiast
Many kinds of bottlenecks
● Focus here is "driver limited"
● App could render more, and
● GPU could render more, but
● Driver is at its limit…
● Because of expensive API calls
Some causes of driver overhead
● The CPU cost of fulfilling the
API contract
● Validation
● Hazard avoidance
Costs that add up…
● Major Categories:
● synchronization, allocation,
validation, and compilation
● Buffer updates (synchronization, allocation)
● Mapping, in-band updates
● Binding objects (validation, compilation)
● FBOs, programs, textures, buffers
Remedy? – Efficient APIs!
● Buffer storage
● Texture arrays
● Multi-Draw Indirect
(covered by Tim Foley)
● Texture arrays, bindless,
sparse, indirect parameters
(covered by Graham Sellers)
Results
● apitest
● Framework for testing
different "solutions"
● Source on GitHub
(covered by John McDonald)
Remember, these OpenGL APIs
● Exist TODAY – already on your PC
● Are at least multi-vendor (EXT), and
mostly core (GL 4.2+)
● Coexist with existing
OpenGL
On with the show…
next speaker
Tim Foley
● Intel
Challenge: More Stuff per Frame
● Varied
● Not 1000s of same instanced mesh
● Unique geometry, textures, etc.
● Dynamic
● Not just pretty skinned meshes
● Generate new geometry each frame
Want an Order of Magnitude
● Increase in unique objects per frame
● Can over-simplify as draws per frame, but
● Misses importance of variety
● Do we need a new API to achieve this?
● How far can we get with what we have today?
Three Techniques in This Talk
● Persistent-mapped buffers
● Faster streaming of dynamic geometry
● MultiDrawIndirect (MDI)
● Faster submission of many draw calls
● Packing 2D textures into arrays
● Texture changes no longer break batches
Naïve Draw Loop
foreach( object )
{
// bind framebuffer
// set depth, blending, etc. states
// bind shaders
// bind textures
// bind vertex/index buffers
WriteUniformData( object );
glDrawElements(
GL_TRIANGLES,
object->indexCount,
GL_UNSIGNED_SHORT,
0 );
}
Typical Draw Loop
// sort or bucket visible objects
foreach( render target ) // framebuffer
foreach( pass ) // depth, blending, etc. states
foreach( material ) // shaders
foreach( material instance ) // textures
foreach( vertex format ) // vertex buffers
foreach( object )
{
WriteUniformData( object );
glDrawElementsBaseVertex(
GL_TRIANGLES,
object->indexCount,
GL_UNSIGNED_SHORT,
object->indexDataOffset,
object->baseVertex );
}
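The "sort or bucket visible objects" step can be sketched on the CPU as a plain sort over a state key. This is a minimal illustration, not code from the talk: the DrawItem record and both function names are hypothetical, and the key order simply mirrors the loop nesting above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-object record; one field per loop level in the slide.
struct DrawItem {
    uint32_t renderTarget;
    uint32_t pass;
    uint32_t material;
    uint32_t materialInstance;
    uint32_t vertexFormat;
    uint32_t objectId;
};

// Sort so that the outermost (most expensive) state varies slowest.
inline void sortByState(std::vector<DrawItem>& items) {
    std::sort(items.begin(), items.end(),
              [](const DrawItem& a, const DrawItem& b) {
                  return std::tie(a.renderTarget, a.pass, a.material,
                                  a.materialInstance, a.vertexFormat, a.objectId)
                       < std::tie(b.renderTarget, b.pass, b.material,
                                  b.materialInstance, b.vertexFormat, b.objectId);
              });
}

// Number of shader binds needed when walking the list in order
// (the first item always needs one).
inline int countMaterialChanges(const std::vector<DrawItem>& items) {
    int changes = 0;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i == 0 || items[i].material != items[i - 1].material) ++changes;
    }
    return changes;
}
```

Comparing countMaterialChanges before and after sortByState gives a rough feel for how many binds the bucketing removes.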
Two Ways to Improve Overhead
// sort or bucket visible objects
foreach( render target ) // framebuffer
foreach( pass ) // depth, blending, etc. states
foreach( material ) // shaders
foreach( material instance ) // textures
foreach( vertex format ) // vertex buffers
foreach( object )
{
WriteUniformData( object );
glDrawElementsBaseVertex(
GL_TRIANGLES,
object->indexCount,
GL_UNSIGNED_SHORT,
object->indexDataOffset,
object->baseVertex );
}
● Submit each batch faster
● Fewer, bigger batches
Pack Multiple Objects per Buffer
// sort or bucket visible objects
foreach( render target ) // framebuffer
foreach( pass ) // depth, blending, etc. states
foreach( material ) // shaders
foreach( material instance ) // textures
foreach( vertex format ) // vertex buffers
foreach( object )
{
WriteUniformData( object );
glDrawElementsBaseVertex(
GL_TRIANGLES,
object->indexCount,
GL_UNSIGNED_SHORT,
object->indexDataOffset,
object->baseVertex );
}
● Pack multiple objects into the same (dynamic or static) vertex/index buffer
● Take advantage of glDraw*() params to index into the buffer without changing bindings
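A minimal sketch of the bookkeeping behind packing several meshes into one vertex/index buffer pair. SharedMeshBuffer and MeshSlot are hypothetical names; the sketch only tracks offsets (the actual upload is omitted) and assumes 16-bit indices as in the draw calls above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Where one mesh lives inside the shared buffers.
struct MeshSlot {
    int32_t     baseVertex;       // maps to object->baseVertex
    std::size_t indexByteOffset;  // maps to object->indexDataOffset
    uint32_t    indexCount;       // maps to object->indexCount
};

// Hypothetical bump allocator over one vertex buffer and one index buffer.
class SharedMeshBuffer {
public:
    SharedMeshBuffer(uint32_t maxVertices, uint32_t maxIndices)
        : maxVertices_(maxVertices), maxIndices_(maxIndices) {}

    MeshSlot append(uint32_t vertexCount, uint32_t indexCount) {
        // GL_UNSIGNED_SHORT indices are relative to baseVertex.
        assert(vertexCount <= 65536u);
        assert(usedVertices_ + vertexCount <= maxVertices_);
        assert(usedIndices_ + indexCount <= maxIndices_);
        MeshSlot slot{static_cast<int32_t>(usedVertices_),
                      usedIndices_ * sizeof(uint16_t),
                      indexCount};
        usedVertices_ += vertexCount;
        usedIndices_ += indexCount;
        return slot;
    }

private:
    uint32_t maxVertices_;
    uint32_t maxIndices_;
    uint32_t usedVertices_ = 0;
    uint32_t usedIndices_ = 0;
};
```

Each returned slot supplies the three per-object values that glDrawElementsBaseVertex consumes in the loop above, so no buffer rebinding is needed between objects.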
Dynamic Streaming of Geometry
● Typical dynamic vertex ring buffer
void* data = glMapBufferRange(GL_ARRAY_BUFFER,
ringOffset,
dataSize,
GL_MAP_UNSYNCHRONIZED_BIT
| GL_MAP_WRITE_BIT );
WriteGeometry( data, ... );
glUnmapBuffer(GL_ARRAY_BUFFER);
ringOffset += dataSize;
// deal with wrap-around in ring, etc.
● Frequent mapping = overhead
● GL_MAP_UNSYNCHRONIZED_BIT: no sync with GPU, but forces sync in multi-threaded drivers
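The "deal with wrap-around in ring" comment can be sketched as a small offset tracker. RingCursor is a hypothetical helper; it restarts at offset 0 when a request does not fit before the end of the buffer, and it deliberately ignores GPU synchronization, which still has to be handled separately.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical offset tracker for a ring buffer of ringSize bytes.
class RingCursor {
public:
    explicit RingCursor(std::size_t ringSize) : ringSize_(ringSize) {}

    // Returns the byte offset at which dataSize contiguous bytes can be written.
    std::size_t allocate(std::size_t dataSize) {
        assert(dataSize <= ringSize_);
        if (head_ + dataSize > ringSize_) {
            head_ = 0;  // wrap-around: the request would run past the end
        }
        std::size_t offset = head_;
        head_ += dataSize;
        return offset;
    }

private:
    std::size_t ringSize_;
    std::size_t head_ = 0;
};
```

The returned offset plays the role of ringOffset in the map call above.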
BufferStorage and Persistent Map
● Allocate buffer with glBufferStorage()
● Use flags to enable persistent mapping
GLbitfield flags = GL_MAP_WRITE_BIT
| GL_MAP_PERSISTENT_BIT   // keep mapped while drawing
| GL_MAP_COHERENT_BIT;    // writes automatically visible to GPU
glBufferStorage(GL_ARRAY_BUFFER, ringSize, NULL, flags);
Dynamic Streaming of Geometry
● Map once at creation time
● No more Map/Unmap in your draw loop
● But need to do synchronization yourself
data = glMapBufferRange(GL_ARRAY_BUFFER, 0, ringSize, flags);
WriteGeometry( data, ... );
data += dataSize;
upcoming talks will cover
glFenceSync() and glClientWaitSync()
Performance
● BufferSubData vs Map(UNSYNCHRONIZED)
● Intel: avoid frequent BufferSubData()
● NV: Map(UNSYNCH) bad for threaded drivers
● Persistent mapping best where supported
● Overhead 2-20x better than next best option
That Inner Loop Again
foreach( object )
{
WriteUniformData( object, &uniformData );
glDrawElementsBaseVertex(
GL_TRIANGLES,
object->indexCount,
GL_UNSIGNED_SHORT,
object->indexDataOffset,
object->baseVertex );
}
Using an Indirect Draw
DrawElementsIndirectCommand command;
foreach( object )
{
WriteUniformData( object, &uniformData );
WriteDrawCommand( object, &command );
glDrawElementsIndirect(
GL_TRIANGLES,
GL_UNSIGNED_SHORT,
&command );
}
typedef struct {
uint count;
uint instanceCount;
uint firstIndex;
uint baseVertex;
uint baseInstance;
} DrawElementsIndirectCommand;
● Per-object parameters are now sourced from memory
One Multi-Draw Submits it All
DrawElementsIndirectCommand* commands = ...;
foreach( object )
{
WriteUniformData( object, &uniformData[i] );
WriteDrawCommand( object, &commands[i] );
}
glMultiDrawElementsIndirect(
GL_TRIANGLES,
GL_UNSIGNED_SHORT,
commands,
commandCount,
0 );
● Fill in per-object data (use parallelism, GPU compute if you like)
● Kick buffered-up objects to be rendered
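A minimal CPU-side sketch of filling the command array that a single multi-draw consumes. The command layout is copied from the slide; ObjectMesh and buildCommands are hypothetical names, and firstIndex is assumed to be in units of indices.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Layout as shown on the slide: five tightly packed 32-bit fields.
struct DrawElementsIndirectCommand {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    uint32_t baseVertex;
    uint32_t baseInstance;
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "commands are expected to be tightly packed");

// Hypothetical record of where an object's mesh sits in the shared buffers.
struct ObjectMesh {
    uint32_t indexCount;
    uint32_t firstIndex;
    uint32_t baseVertex;
};

// One command per object, one instance each.
std::vector<DrawElementsIndirectCommand>
buildCommands(const std::vector<ObjectMesh>& objects) {
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(objects.size());
    for (const ObjectMesh& o : objects) {
        commands.push_back({o.indexCount, 1u, o.firstIndex, o.baseVertex, 0u});
    }
    return commands;
}
```

The resulting vector's data pointer and size correspond to the commands and commandCount arguments in the call above.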
What if I don't know the count?
● Doing GPU culling, etc.
● Use ARB_indirect_parameters
● Caveat: not all HW/drivers support it
glBindBuffer( GL_DRAW_INDIRECT_BUFFER, commandBuffer );
glBindBuffer( GL_PARAMETER_BUFFER, countBuffer );
// …
glMultiDrawElementsIndirectCount(
GL_TRIANGLES, GL_UNSIGNED_SHORT,
commandOffset,
countOffset,
maxCommandCount,
0 );
Per-Draw Parameters/Data
● If shader used to take struct of uniforms
● Now take an array of such structs
● Or use SSBO (Shader Storage Buffer Object) to go bigger
uniform ShaderParams params;                     // before: one struct
uniform ShaderParams params[MAX_BATCH_SIZE];     // now: an array of structs
buffer AllTheParams { ShaderParams params[]; };  // or: an SSBO
How to find your draw's data?
● Ideally, just index it using gl_DrawID
● Provided by ARB_shader_draw_parameters
● Not supported everywhere
● But relatively simple to implement your own
mat4 mvp = params[gl_DrawIDARB].mvp;
Implement Your Own Draw ID
● Use baseInstance field of draw struct
● Increment base instance for each command
● Shader can't see base instance
● gl_InstanceID always counts from zero
http://www.g-truc.net/post-0518.html
cmd->baseInstance = drawCounter++;
Implement Your Own Draw ID
● Use a vertex attribute
● Set as per-instance with glVertexAttribDivisor
● Fill buffer with your own IDs
● Or arbitrary other per-draw parameters
● On some HW, faster than using gl_DrawID
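A small model of why the two slides above combine into a working draw ID. For an instanced attribute (divisor greater than zero), GL fetches element floor(instance / divisor) + baseInstance, so a buffer holding 0, 1, 2, ... read with divisor 1 yields exactly the baseInstance written into each command. The function names are hypothetical and nothing here calls GL.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Element index fetched for an instanced attribute with divisor > 0.
uint32_t instancedElementIndex(uint32_t instance, uint32_t divisor,
                               uint32_t baseInstance) {
    assert(divisor > 0);
    return instance / divisor + baseInstance;
}

// Contents of the per-instance ID buffer: simply 0, 1, 2, ...
std::vector<uint32_t> makeDrawIdBuffer(uint32_t maxDraws) {
    std::vector<uint32_t> ids(maxDraws);
    std::iota(ids.begin(), ids.end(), 0u);
    return ids;
}

// Value the vertex shader would read for a command with instanceCount == 1
// whose baseInstance was set to the running draw counter.
uint32_t drawIdSeenByShader(const std::vector<uint32_t>& ids,
                            uint32_t baseInstance) {
    return ids.at(instancedElementIndex(0, 1, baseInstance));
}
```

The same buffer could carry arbitrary per-draw parameters instead of plain IDs, as the slide notes.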
More MultiDrawIndirect Caveats
● If generating draws on GPU
● Use a GL buffer (obviously)
● If generating on CPU
● Intel: (Compat) faster to use ordinary host pointer
● NV: persistent-mapped buffer slightly faster
● GPU or CPU
● AMD: Array must be tightly packed for best perf
Can Be 6-10x Less Overhead
[Chart: Normalized Objects per Second for Dynamic Buffer, Persistent-Mapped, and Multi-Draw; axis from 0% to 700%]
Batching Across Texture Changes
● Bindless, sparse can help
● As you will hear
● Not all hardware supports these
● Packing 2D textures into arrays
● Works on all current hardware/drivers
Packing Textures Into Arrays
● Array groups textures with same shape
● Dimensions, format, mips, MSAA
● Texture views may allow further grouping
● Put some same-size formats together
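Grouping textures by shape can be sketched as a map from a shape key to an array index plus a running slice counter. ArrayPacker, TextureShape and PackedTexture are hypothetical names; the format is treated as an opaque number, MSAA is left out for brevity, and no GL objects are created.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Textures can share an array only if all of these match.
struct TextureShape {
    uint32_t width;
    uint32_t height;
    uint32_t mipCount;
    uint32_t format;  // opaque format id in this sketch
    bool operator<(const TextureShape& o) const {
        return std::tie(width, height, mipCount, format)
             < std::tie(o.width, o.height, o.mipCount, o.format);
    }
};

struct PackedTexture {
    uint32_t arrayIndex;  // which array texture
    uint32_t slice;       // which layer inside it
};

class ArrayPacker {
public:
    PackedTexture add(const TextureShape& shape) {
        auto it = arrayOfShape_.find(shape);
        if (it == arrayOfShape_.end()) {
            uint32_t newIndex = static_cast<uint32_t>(sliceCounts_.size());
            it = arrayOfShape_.emplace(shape, newIndex).first;
            sliceCounts_.push_back(0);
        }
        uint32_t arrayIndex = it->second;
        return {arrayIndex, sliceCounts_[arrayIndex]++};
    }
    std::size_t arrayCount() const { return sliceCounts_.size(); }

private:
    std::map<TextureShape, uint32_t> arrayOfShape_;
    std::vector<uint32_t> sliceCounts_;
};
```

Running all content through such a pass up front gives the slice count needed when allocating each array texture.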
Packing Textures Into Arrays
● Bind all arrays to pipeline at once
● Need to allocate carefully
● Based on your content requirements
● Don't allocate more than fits in GPU memory
uniform sampler2DArray allSamplers[MAX_ARRAY_TEXTURES];
Options for Sampler Parameters
● Pair array with different sampler objs
● Create views of array with different state
● Be careful about max texture limits
● Each combination needs a new binding slot
Accessing Packed 2D Textures
● Texture "handle" is pair of indices
● Index into array of sampler2DArray
● Slice index into particular array texture
● Can store as 64 bits {int;float;}: no int→float convert in shader
● Or pack into 32 bits (hi/lo): fewer bytes to read, but more math
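A minimal sketch of the 32-bit hi/lo packing. The 16/16 bit split is an assumption made for illustration (the slide does not specify one), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed split: array index in the high 16 bits, slice in the low 16 bits.
inline uint32_t packTextureHandle(uint32_t arrayIndex, uint32_t slice) {
    assert(arrayIndex < 65536u && slice < 65536u);
    return (arrayIndex << 16) | slice;
}

inline uint32_t handleArrayIndex(uint32_t packed) { return packed >> 16; }

inline uint32_t handleSlice(uint32_t packed) { return packed & 0xFFFFu; }
```

The shader would perform the same shift and mask, then convert the slice to float for the texture coordinate, which is the extra math the slide refers to.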
Texture Array ~5x Less Overhead
[Chart: Normalized Objects per Second for glBindTexture per Object, Texture Arrays, and No Texture; axis from 0% to 600%]
Dramatically Reduced Overhead
● Possible with current GL API and HW
● Persistent-mapped buffers
● Indirect and Multi-Draws
● Packing 2D textures into arrays
● Overhead is priority for all of us on GL
Graham Sellers
● AMD
Section Overview
● Bindless textures
● Recap of traditional texture binding
● Remove texture units with bindless
● Sparse textures
● Manage virtual and physical memory
● Streaming, sparse data sets, etc.
Texture Units - Recap
● Traditional texture binding
● Create textures
● Bind to texture units
● Declare samplers in shaders
● Draw
Texture Units - Recap
● Textures bound to numbered units
● Limited number of texture units
● State changes between draws
● Driver controls residency
Texture Units - Recap
● Binding textures - API
● Very hard to coalesce draws
glGenTextures(10, &tex[0]);
glBindTexture(GL_TEXTURE_2D, tex[n]);
glTexStorage2D(GL_TEXTURE_2D, ...);
foreach (draw in draws) {
foreach (texture in draw->textures) {
glBindTexture(GL_TEXTURE_2D, tex[texture]);
}
// Other stuff
glDrawElements(...);
}
Texture Units - Recap
● Binding textures - shader
● Limited textures per shader
● All declared at global scope
layout (binding = 0) uniform sampler2D uTexture1;
layout (binding = 1) uniform sampler3D uTexture2;
out vec4 oColor;
void main(void){
oColor = texture(uTexture1, ...) +
texture(uTexture2, ...);
}
Bindless Textures
● Remove texture bindings!
● Unlimited* virtual texture bindings
● Application controls residency
● Shader accesses textures by handle
* Virtually unlimited
Bindless Textures
● Bindless textures - API
● No texture binds between draws
// Create textures as normal, get handles from textures
GLuint64 handle = glGetTextureHandleARB(tex);
// Make resident
glMakeTextureHandleResidentARB(handle);
// Communicate ‘handle’ to shader... somehow
foreach (draw) {
glDrawElements(...);
}
Bindless Textures
● Bindless textures - shader
● Shader accesses textures by handle
● Must communicate handles to shader
uniform Samplers {
sampler2D tex[500]; // Limited only by storage
};
out vec4 oColor;
void main(void) {
oColor = texture(tex[123], ...) + texture(tex[456], ...);
}
Bindless Textures
● Handles are 64-bit integers
● Stick them in uniform buffers
● Switch set of textures – glBindBufferRange
● Number of accessible textures limited by buffer size
● Put them in structures (AoS)
● Index with gl_DrawIDARB, gl_InstanceID
Bindless Textures – DANGER!!!
● Some caveats with bindless textures
● Divergence rules apply
● Just like indexing arrays of textures
● Bindless handle must be constant across instance
● Divergence might work
● On some implementations, it Just Works
● On others, it Just Doesn't
● Even when it works, it could be expensive
Sparse Textures
● Very large virtual textures
● Separate virtual and physical allocation
● Partially populated arrays, mips, cubes, etc.
● Stream data on demand
Sparse Textures
● Textures arranged as tiles
● Each tile may be resident or not
Sparse Textures
● Sparse textures – API
● That's it – now you have a virtual texture
// Tell OpenGL you want a sparse texture
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SPARSE_ARB, GL_TRUE);
// Allocate storage
glTexStorage2D(GL_TEXTURE_2D, 10, GL_RGBA8, 1024, 1024);
Sparse Textures
● Sparse textures – page sizes
// Query number of available page sizes
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8,
GL_NUM_VIRTUAL_PAGE_SIZES_ARB, 1, &num_sizes);
// Get actual page sizes (bufSize is a count of GLints, not bytes)
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8,
GL_VIRTUAL_PAGE_SIZE_X_ARB, num_sizes,
&page_sizes_x[0]);
glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8,
GL_VIRTUAL_PAGE_SIZE_Y_ARB, num_sizes,
&page_sizes_y[0]);
// Choose a page size
glTexParameteri(GL_TEXTURE_2D, GL_VIRTUAL_PAGE_SIZE_INDEX_ARB, n);
Sparse Textures
● Reserve and commit
● In "Operating System" terms
● Reserve – virtual allocation without physical store
● Commit – back virtual allocation with real memory
Sparse Textures
● Sparse textures – commitment
● Commitment is controlled by a single function
● Uncommitted pages use no memory
● Committed pages may contain data
void glTexPageCommitmentARB(GLenum target, GLint level,
GLint xoffset, GLint yoffset,
GLint zoffset, GLsizei width,
GLsizei height, GLsizei depth,
GLboolean commit);
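Commitment works in whole pages, so a texel rectangle usually has to be grown to page boundaries before it is passed to the function above. A minimal sketch, assuming a 2D region and a page size obtained from the queries shown earlier; Region and alignToPages are hypothetical names, and clamping to the texture edge is left out.

```cpp
#include <cassert>
#include <cstdint>

struct Region {
    int32_t x;
    int32_t y;
    int32_t width;
    int32_t height;
};

// Expand a texel rectangle outward so it starts and ends on page boundaries.
Region alignToPages(Region r, int32_t pageW, int32_t pageH) {
    assert(pageW > 0 && pageH > 0 && r.x >= 0 && r.y >= 0);
    int32_t x0 = (r.x / pageW) * pageW;
    int32_t y0 = (r.y / pageH) * pageH;
    int32_t x1 = ((r.x + r.width + pageW - 1) / pageW) * pageW;
    int32_t y1 = ((r.y + r.height + pageH - 1) / pageH) * pageH;
    return {x0, y0, x1 - x0, y1 - y0};
}
```

The aligned x, y, width and height would then be used as xoffset, yoffset, width and height, with zoffset 0 and depth 1 for a plain 2D texture.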
Sparse Textures
● Sparse textures – data storage
● Put data into sparse textures as normal
● glTexSubImage, glCopyTexSubImage, etc.
● Use a (persistent mapped) PBO for this!
● Attach to framebuffer object + draw
● Read from sparse textures
● glReadPixels, glGetTexImage*, etc.
Sparse Textures
● Sparse textures – in-shader use
● No changes to shaders
● Reads from committed regions behave normally
● Reads from uncommitted regions return junk
● Probably not junk – most likely zeros
● The spec doesn't mandate this, however
Sparse Texture Arrays
● Combine sparse textures and arrays
● Create very long (sparse) array textures
● Some layers are resident, some are not
● Allocate new layers on demand
● New layer = glTexPageCommitmentARB
Sparse Texture Arrays
● Manage your own texture memory
● Create a huge virtual array texture
● Need a new texture?
● Allocate a new layer
● Don't need it any more?
● Recycle or make non-resident
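Managing layers of a huge virtual array texture comes down to a small allocator with a free list. A minimal sketch; LayerAllocator is a hypothetical name, and the commit/uncommit calls that would accompany acquire and release are omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hands out layer indices of one array texture, recycling released ones.
class LayerAllocator {
public:
    explicit LayerAllocator(uint32_t maxLayers) : maxLayers_(maxLayers) {}

    // Returns a layer index, preferring recycled layers; -1 if the array is full.
    int32_t acquire() {
        if (!freeLayers_.empty()) {
            int32_t layer = freeLayers_.back();
            freeLayers_.pop_back();
            return layer;
        }
        if (nextUnused_ < maxLayers_) {
            return static_cast<int32_t>(nextUnused_++);
        }
        return -1;
    }

    // Marks a previously acquired layer as reusable.
    void release(int32_t layer) {
        assert(layer >= 0 && static_cast<uint32_t>(layer) < nextUnused_);
        freeLayers_.push_back(layer);
    }

private:
    uint32_t maxLayers_;
    uint32_t nextUnused_ = 0;
    std::vector<int32_t> freeLayers_;
};
```

Whether a released layer is uncommitted right away or kept resident for quick reuse is a policy choice the sketch leaves open.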
Sparse Bindless Texture Arrays
● Use all the features!
● Create a sparse array per texture size
● As textures become needed, commit pages
● Run out of pages? Make another texture...
● Get texture bindless handles
● Use as many handles as you like
Sparse Bindless Texture Arrays
● Indexing sparse bindless arrays requires:
● 64-bit texture handle
● N-bit layer index
● Remember...
● Index can diverge, handle cannot
● Need one array per-size
Building Data Structures
● Okay, so how do we use these things?
● Option 1 – Build on the CPU
● It's just memory writes
● Use a bunch of threads
● Persistent maps
● Option 2 – Use the GPU
● Much fun. Wow.
Building Data Structures
● Using the GPU to set the scene (1)
● Create SSBO with AoS for draw parameters
struct DrawParams {
uint count;
uint instanceCount;
uint firstIndex;
uint baseIndex;
uint baseInstance;
};
layout (std430, binding = 0) buffer DrawParamsBuffer { // block name is illustrative
DrawParams draw_params[];
};
Building Data Structures
● Using the GPU to set the scene (2)
● Create another SSBO for draw metadata
struct DrawMeta {
uint material_index;
// More per-draw meta-stuff goes here...
};
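layout (std430, binding = 1) buffer DrawMetaBuffer { // block name is illustrative; binding kept distinct from draw_params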
DrawMeta draw_meta[];
};
Building Data Structures
● Using the GPU to set the scene (3)
● Use atomic counter to append to buffers
layout (binding = 0, offset = 0) uniform atomic_uint draw_count;
void append_draw(DrawParams params, DrawMeta meta)
{
uint index = atomicCounterIncrement(draw_count);
draw_params[index] = params;
draw_meta[index] = meta;
}
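The same append pattern can be modeled on the CPU with std::atomic, which may help when building the lists from several threads (Option 1 above). This is an analogy rather than the shader itself: DrawList is a hypothetical class, and it adds a capacity check that the GLSL snippet does not have.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawParams {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    uint32_t baseIndex;
    uint32_t baseInstance;
};

struct DrawMeta {
    uint32_t material_index;
};

class DrawList {
public:
    explicit DrawList(std::size_t capacity) : params_(capacity), meta_(capacity) {}

    // Reserve a unique slot with an atomic increment, then fill both arrays.
    // Returns false when the buffers are full.
    bool append(const DrawParams& p, const DrawMeta& m) {
        uint32_t index = drawCount_.fetch_add(1, std::memory_order_relaxed);
        if (index >= params_.size()) return false;
        params_[index] = p;
        meta_[index] = m;
        return true;
    }

    // Number of valid entries (the counter may overshoot on failed appends).
    uint32_t count() const {
        uint32_t n = drawCount_.load(std::memory_order_relaxed);
        uint32_t cap = static_cast<uint32_t>(params_.size());
        return n < cap ? n : cap;
    }

    const DrawMeta& meta(uint32_t i) const { return meta_.at(i); }

private:
    std::vector<DrawParams> params_;
    std::vector<DrawMeta> meta_;
    std::atomic<uint32_t> drawCount_{0};
};
```

Because each caller gets a distinct index from the increment, params and meta for one draw always land in the same slot.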
Building Data Structures
● Using the GPU to set the scene (4)
● Dump counter, do MultiDraw*IndirectCount
glCopyBufferSubData(GL_ATOMIC_COUNTER_BUFFER,
GL_PARAMETER_BUFFER_ARB,
0, 0, sizeof(GLuint));
glMultiDrawElementsIndirectCountARB(GL_TRIANGLES,
GL_UNSIGNED_SHORT,
nullptr,    // offset of the commands in the indirect buffer
0,          // offset of the count in the parameter buffer
MAX_DRAWS,
0);
Building Data Structures
● Using the GPU to set the scene (5)
● In draw, use meta with gl_DrawIDARB
struct Material {
sampler2D tex1;
};
layout (binding = 0) uniform MaterialData {
Material material[];
};
...
oColor = texture(material[draw_meta[gl_DrawIDARB].material_index].tex1,
...);
John McDonald
● NVIDIA
Putting it all into practice
● Introducing apitest
● Results
● Code review
apitest
● https://github.com/nvMcJohn/apitest
● Extensible OSS Framework (Public Domain)
● Uses SDL 2.0 (Thanks SDL!)
● Initially developed by Patrick Doane
OS       OpenGL  D3D11
Windows  Yes     Yes
Linux    Yes     No
OSX      Sorta   No
The Framework
● Code is segmented into Problems and
Solutions
● A Problem is a dataset to render
● A Solution is one targeted approach to
rendering that dataset (Problem)
● Support code to create shaders, load
textures, etc.
The Problems So Far
● DynamicStreaming
● Render 160,000 "particles" that are
dynamically generated each frame
● UntexturedObjects
● Render 64^3 (262,144) different, untextured objects
● Different matrices per object
● No instancing allowed!
The Problems So Far - Continued
● Textured Quads
● 10,000 quads using different textures
● Texture is changed between every object
● Null
● Clear and SwapBuffer
● Not going to discuss today—included as a
sanity startup.
Result discussion
● Results gathered on a GTX 680, using
public driver 335.23.
● But are shown normalized.
● AMD and Intel have very similar
performance ratios between solutions.
Decoder Ring
● SBTA = Sparse Bindless Texture Array
● SDP = Shader Draw Parameters
DynamicStreaming
● Demo!
● Problem: Render 160,000 "particles" that
are dynamically generated each frame
[Chart: DynamicStreaming - Normalized Obj/s; axis from 0% to 250%. Solutions listed: GLMapPersistent, D3D11MapNoOverwrite, GLBufferSubData, D3D11UpdateSubresource, GLMapUnsynchronized]
GLMapPersistent
● Map the buffer at the beginning of time
● Keep it mapped forever.
● You are responsible for safety (proper
fencing)
● Do not stomp on data in flight
● src/solutions/dynamicstreaming/gl/mappersistent.*
Required Extensions
● ARB_buffer_storage
● ARB_map_buffer_range
● ARB_sync
Buffer Creation
GLbitfield mapFlags = GL_MAP_WRITE_BIT
| GL_MAP_PERSISTENT_BIT
| GL_MAP_COHERENT_BIT;
GLbitfield createFlags = mapFlags | GL_DYNAMIC_STORAGE_BIT;
mDstHead = 0;
mBuffSize = 3 * maxVerts * kVertexSizeBytes;
glBindBuffer(GL_ARRAY_BUFFER, mVertexBuffer);
glBufferStorage(GL_ARRAY_BUFFER, mBuffSize, nullptr, createFlags);
mVertexDataPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0,
mBuffSize, mapFlags);
Buffer Creation, step by step
● Dem Flags: mapFlags requests a writable, persistent, coherent mapping; createFlags adds dynamic storage
● Set circular buffer head: the write head starts at offset 0
● Triple Buffering ftw: the buffer holds 3x the per-frame vertex data
● Buffer Create: glBufferStorage allocates the storage with createFlags
● Map me… forever: glMapBufferRange maps the whole buffer once with mapFlags
Buffer Update / Render
mBufferLockManager.WaitForLockedRange(mDstHead, vertSizeBytes);
for (int i = 0; i < particleCount; ++i) {
const int vertexOffset = i * kVertsPerParticle;
const int thisDstOffset = mDstHead + (i * kParticleSizeBytes);
void* dst = (unsigned char*) mVertexDataPtr + thisDstOffset;
memcpy(dst, &_vertices[vertexOffset], kParticleSizeBytes);
glDrawArrays(GL_TRIANGLES, kStartIndex + vertexOffset, kVertsPerParticle);
}
mBufferLockManager.LockRange(mDstHead, vertSizeBytes);
mDstHead = (mDstHead + vertSizeBytes) % mBuffSize;
Buffer Update / Render, step by step
● Safety Third!: WaitForLockedRange blocks until the GPU is done with the range about to be written
● Write those particles: memcpy each particle's vertices into the mapped buffer
● Now draw (inefficiently): one draw call per particle
● Update circular buffer head: LockRange fences the range just used, then the head advances and wraps at mBuffSize
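The lock manager used in the snippet above boils down to range-overlap bookkeeping around fences. The sketch below models only that bookkeeping and is not apitest's implementation: RangeLockManager is a hypothetical class, the fence is a plain integer, and "waiting" is a callback. A real version would store the GLsync returned by glFenceSync and call glClientWaitSync instead.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct BufferRange {
    std::size_t offset;
    std::size_t length;
    bool overlaps(const BufferRange& o) const {
        return offset < o.offset + o.length && o.offset < offset + length;
    }
};

class RangeLockManager {
    struct Lock {
        BufferRange range;
        int fence;  // stand-in for a GLsync object
    };

public:
    using WaitFn = std::function<void(int fence)>;
    explicit RangeLockManager(WaitFn wait) : wait_(std::move(wait)) {}

    // Record that the GPU may still be reading this range.
    void lockRange(std::size_t offset, std::size_t length) {
        locks_.push_back({{offset, length}, nextFence_++});
    }

    // Wait on every pending lock that overlaps the range, then drop those locks.
    void waitForLockedRange(std::size_t offset, std::size_t length) {
        BufferRange want{offset, length};
        std::vector<Lock> remaining;
        for (const Lock& l : locks_) {
            if (l.range.overlaps(want)) wait_(l.fence);
            else remaining.push_back(l);
        }
        locks_.swap(remaining);
    }

    std::size_t pendingLocks() const { return locks_.size(); }

private:
    WaitFn wait_;
    std::vector<Lock> locks_;
    int nextFence_ = 0;
};
```

With a buffer three frames deep, the range about to be written was locked two frames earlier, so in the common case the wait returns immediately.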
UntexturedObjects
● Demo!
● Problem: Render 64^3 (262,144) unique, untextured
objects
[Chart: Untextured Object - Normalized Obj/s; axis from 0% to 900%. Solutions listed: GLBufferStorage-NoSDP, GLMultiDrawBuffer-NoSDP, GLMultiDraw-NoSDP, GLBufferStorage-SDP, GLMultiDrawBuffer-SDP, GLMultiDraw-SDP, GLMapPersistent, GLDrawLoop, GLBindlessIndirect, GLTexCoord, GLUniform, D3D11Naive, GLBindless, GLDynamicBuffer, GLBufferRange, GLMapUnsynchronized]
GLBufferStorage-(ε|No)SDP
● Set up a giant uniform or storage buffer
with data for all objects for a frame.
● Use MDI to render many objects at once
● And PMB for dynamic data (matrix
transforms, MDI entries)
● Need a way to index data in shader (SDP)
Required Extensions
● ARB_buffer_storage
● ARB_map_buffer_range
● ARB_multi_draw_indirect
● ARB_shader_draw_parameters
● ARB_shader_storage_buffer_object
● ARB_sync
NoSDP
● Can be used when instancing isn't needed
● Very simple improvement to SDP
approach
● Not going to cover today
● So check the source code!
DrawElementsIndirectCommand
struct DrawElementsIndirectCommand
{
uint count;
uint instanceCount;
uint firstIndex;
uint baseVertex;
uint baseInstance;
};
typedef DrawElementsIndirectCommand DEICmd;
Cmd Buffer Creation
GLbitfield mapFlags = GL_MAP_WRITE_BIT
| GL_MAP_PERSISTENT_BIT
| GL_MAP_COHERENT_BIT;
GLbitfield createFlags = mapFlags | GL_DYNAMIC_STORAGE_BIT;
mCmdHead = 0;
mCmdSize = 3 * objCount * sizeof(DEICmd);
glBindBuffer(GL_DRAW_INDIRECT_BUFFER, mCmdBuffer);
glBufferStorage(GL_DRAW_INDIRECT_BUFFER, mCmdSize, 0, createFlags);
mCmdPtr = glMapBufferRange(GL_DRAW_INDIRECT_BUFFER, 0,
mCmdSize, mapFlags);
Obj Buffer Creation
GLbitfield mapFlags = GL_MAP_WRITE_BIT
| GL_MAP_PERSISTENT_BIT
| GL_MAP_COHERENT_BIT;
GLbitfield createFlags = mapFlags | GL_DYNAMIC_STORAGE_BIT;
mObjHead = 0;
mObjSize = 3 * objCount * sizeof(Matrix);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, mObjBuffer);
glBufferStorage(GL_SHADER_STORAGE_BUFFER, mObjSize, 0, createFlags);
mObjPtr = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0,
mObjSize, mapFlags);
Cmd Buffer Update
mCmdLock.WaitForLockedRange(mCmdHead, sizeof(DEICmd) * objCount);
for (size_t u = 0; u < objCount; ++u) {
DEICmd *cmd = (DEICmd*)((unsigned char*)mCmdPtr + mCmdHead) + u; // mCmdHead is a byte offset
cmd->count = mIndexCount;
cmd->instanceCount = 1;
cmd->firstIndex = 0;
cmd->baseVertex = 0;
cmd->baseInstance = 0;
}
oldCmdHead = mCmdHead;
mCmdHead = (mCmdHead + sizeof(DEICmd) * objCount) % mCmdSize;
// Next, update the per-Object Data
Cmd Buffer Update, step by step
● Fencing for fun and profit: WaitForLockedRange guards the command range before it is overwritten
● Someone Set Up Us The Draws: fill one DEICmd per object
● Manage the Head: remember the old head for locking later, then advance and wrap at mCmdSize
Obj Buffer Update / Render
// Next, update the per-Object Data
mObjLock.WaitForLockedRange(mObjHead, sizeof(Matrix) * objCount);
for (size_t u = 0; u < objCount; ++u) {
Matrix *obj = (Matrix*)((unsigned char*)mObjPtr + mObjHead) + u; // mObjHead is a byte offset
(*obj) = (inObjParameters)[u];
}
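// Source the commands written this frame: pass their byte offset in the indirect buffer
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT,
(const void*)oldCmdHead, objCount, 0);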
mCmdLock.LockRange(oldCmdHead, sizeof(DEICmd) * objCount);
mObjLock.LockRange(mObjHead, sizeof(Matrix) * objCount);
mObjHead = (mObjHead + sizeof(Matrix) * objCount) % mObjSize;
Obj Buffer Update / Render, step by step
● Seriously though, be safe: WaitForLockedRange guards the object range before it is overwritten
● Updates to object parameters: copy each object's Matrix into the mapped buffer
● Draw all the things: a single glMultiDrawElementsIndirect submits every object
● Head management: lock the command and object ranges just used, then advance the object head and wrap at mObjSize
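The heads in these snippets advance in bytes, while the mapped pointers are typed, so the byte offset has to be applied before any element indexing. A minimal sketch of that conversion; elementAtByteOffset is a hypothetical helper and the DEICmd layout is repeated here only to keep the block self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct DEICmd {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    uint32_t baseVertex;
    uint32_t baseInstance;
};

// Returns the element located byteOffset bytes past base.
// The offset must be a whole number of elements.
template <typename T>
T* elementAtByteOffset(T* base, std::size_t byteOffset) {
    assert(byteOffset % sizeof(T) == 0);
    return reinterpret_cast<T*>(
        reinterpret_cast<unsigned char*>(base) + byteOffset);
}
```

Adding the byte head directly to a typed pointer would instead step sizeof(T) bytes per unit and run far past the intended region.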
TexturedQuads
● Demo!
● 10,000 quads using different textures
● Texture is changed between every object
[Chart: TexturedQuads - Normalized Obj/s; axis from 0% to 2000%. Solutions listed: GLSBTAMultiDraw-NoSDP, GLTextureArrayMultiDraw-NoSDP, GLBindlessMultiDraw, GLSBTAMultiDraw-SDP, GLTextureArrayMultiDraw-SDP, GLNoTex, GLTextureArray, GLNoTexUniform, GLTextureArrayUniform, GLSBTA, GLBindless, GLNaive, GLNaiveUniform, D3D11Naive]
TexturedQuads notes
● SBTA was covered at Steam Dev Days
● Non-Sparse, Non-Bindless TextureArray is
the fallback
● Should use BufferStorage improvements
● SBTA = Sparse Bindless Texture Array
GLTextureArrayMultiDraw-(ε|No)SDP
● Instead of loose textures, use arrays of Texture
Arrays
● Container contains <=2048 same-shape textures
● Shape is height, width, mipmapcount, format
● Use MDI for kickoffs
● Address is passed as {int; float} pair
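One way to picture the container scheme in the bullets above is a small allocator that groups textures by shape and hands out {container, page} addresses. This is only a sketch under stated assumptions: `Shape`, `TexAddress`, and `ContainerPool` are hypothetical names, and a real version would create a GL_TEXTURE_2D_ARRAY wherever a new container is opened.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// Textures may share a container only if all of these match.
struct Shape {
    int width, height, mipCount;
    std::uint32_t format;  // e.g. a GL internal format enum value
    bool operator<(const Shape& o) const {
        return std::tie(width, height, mipCount, format) <
               std::tie(o.width, o.height, o.mipCount, o.format);
    }
};

struct TexAddress {
    std::uint32_t container;  // index into the array of sampler2DArray
    float page;               // slice within that array texture
};

class ContainerPool {
public:
    static constexpr int kMaxSlices = 2048;  // per-container limit from the slide

    // Hands out the next free slice for this shape, opening a new container when full.
    TexAddress Allocate(const Shape& shape) {
        std::vector<std::uint32_t>& ids = mByShape[shape];
        if (ids.empty() || mUsed[ids.back()] == kMaxSlices) {
            ids.push_back(static_cast<std::uint32_t>(mUsed.size()));
            mUsed.push_back(0);  // a real version would create the array texture here
        }
        const std::uint32_t id = ids.back();
        return TexAddress{id, static_cast<float>(mUsed[id]++)};
    }

    std::size_t ContainerCount() const { return mUsed.size(); }

private:
    std::map<Shape, std::vector<std::uint32_t>> mByShape;
    std::vector<int> mUsed;  // slices consumed per container
};
```

The sketch does not cap the number of containers; an application would have to keep that count within the size of its sampler array.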
struct Tex2DAddress {
uint Container;
float Page;
};
layout (std140, binding=1) readonly buffer CB1 {
Tex2DAddress texAddress[];
};
uniform sampler2DArray TexContainer[16];
// Elsewhere (in a func, whatever)
int drawID = int(In.iDrawID);
Tex2DAddress addr = texAddress[drawID];
vec3 texCoord = vec3(In.v2TexCoord.xy, addr.Page);
vec4 texel = texture(TexContainer[addr.Container], texCoord);
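Because the buffer block is declared std140, each Tex2DAddress array element occupies 16 bytes rather than 8: std140 rounds a struct's alignment up to that of a vec4. A host-side mirror therefore needs explicit padding. The sketch below illustrates this; `Tex2DAddressStd140` and `PackAddresses` are hypothetical names, not taken from apitest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side mirror of one std140 array element of Tex2DAddress.
struct Tex2DAddressStd140 {
    std::uint32_t container;  // matches `uint Container`
    float page;               // matches `float Page`
    std::uint32_t pad[2];     // std140 pads the struct to a 16-byte stride
};
static_assert(sizeof(Tex2DAddressStd140) == 16, "std140 array stride");
static_assert(offsetof(Tex2DAddressStd140, page) == 4, "Page follows Container");

// Serializes addresses into bytes ready for upload to the shader storage buffer.
inline std::vector<unsigned char> PackAddresses(
        const std::vector<Tex2DAddressStd140>& addrs) {
    std::vector<unsigned char> bytes(addrs.size() * sizeof(Tex2DAddressStd140));
    if (!addrs.empty()) {
        std::memcpy(bytes.data(), addrs.data(), bytes.size());
    }
    return bytes;
}
```

Declaring the block std430 instead would give an 8-byte stride, at the cost of changing the shader declaration shown above.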
Questions?
● graham dot sellers at amd dot com
@GrahamSellers
● tim dot foley at intel dot com
@TangentVector
● cass at nvidia dot com
@casseveritt
● jmcdonald at nvidia dot com
@basisspace
Approaching zero driver overhead

  • 1. Approaching Zero Driver Overhead Cass Everitt NVIDIA Tim Foley Intel Graham Sellers AMD John McDonald NVIDIA
  • 3. Assertion ● OpenGL already has paths with very low driver overhead ● You just need to know ● What they are, and ● How to use them
  • 4. But first, who are we? ● Graham Sellers @GrahamSellers ● AMD OpenGL driver manager, OpenGL SuperBible author ● Tim Foley @TangentVector ● Graphics researcher, GPU language/compiler nerd ● John McDonald @basisspace ● Graphics engineer, chip architect, game developer ● Cass Everitt @casseveritt ● GL zealot, chip architect, mobile enthusiast
  • 5. Many kinds of bottlenecks ● Focus here is "driver limited" ● App could render more, and ● GPU could render more, but ● Driver is at its limit… ● Because of expensive API calls
  • 6. Some causes of driver overhead ● The CPU cost of fulfilling the API contract ● Validation ● Hazard avoidance
  • 7. Costs that add up… ● Major Categories: ● synchronization, allocation, validation, and compilation ● Buffer updates (synchronization, allocation) ● Mapping, in-band updates ● Binding objects (validation, compilation) ● FBOs, programs, textures, buffers
  • 8. Remedy? – Efficient APIs! ● Tim Foley: buffer storage, texture arrays, Multi-Draw Indirect ● Graham Sellers: texture arrays, bindless, sparse, indirect parameters
  • 9. Results (John McDonald) ● apitest ● Framework for testing different "solutions" ● Source on github
  • 10. Remember, these OpenGL APIs ● Exist TODAY – already on your PC ● Are at least multi-vendor (EXT), and mostly core (GL 4.2+) ● Coexist with existing OpenGL
  • 13. On with the show… next speaker
  • 15. Challenge: More Stuff per Frame ● Varied ● Not 1000s of same instanced mesh ● Unique geometry, textures, etc. ● Dynamic ● Not just pretty skinned meshes ● Generate new geometry each frame
  • 16. Want an Order of Magnitude ● Increase in unique objects per frame ● Can over-simplify as draws per frame, but ● Misses importance of variety ● Do we need a new API to achieve this? ● How far can we get with what we have today?
  • 17. Three Techniques in This Talk ● Persistent-mapped buffers ● Faster streaming of dynamic geometry ● MultiDrawIndirect (MDI) ● Faster submission of many draw calls ● Packing 2D textures into arrays ● Texture changes no longer break batches
  • 18. Naïve Draw Loop foreach( object ) { // bind framebuffer // set depth, blending, etc. states // bind shaders // bind textures // bind vertex/index buffers WriteUniformData( object ); glDrawElements( GL_TRIANGLES, object->indexCount, GL_UNSIGNED_SHORT, 0 ); }
  • 19. Typical Draw Loop // sort or bucket visible objects foreach( render target ) // framebuffer foreach( pass ) // depth, blending, etc. states foreach( material ) // shaders foreach( material instance ) // textures foreach( vertex format ) // vertex buffers foreach( object ) { WriteUniformData( object ); glDrawElementsBaseVertex( GL_TRIANGLES, object->indexCount, GL_UNSIGNED_SHORT, object->indexDataOffset, object->baseVertex ); }
  • 20. Two Ways to Improve Overhead // sort or bucket visible objects foreach( render target ) // framebuffer foreach( pass ) // depth, blending, etc. states foreach( material ) // shaders foreach( material instance ) // textures foreach( vertex format ) // vertex buffers foreach( object ) { WriteUniformData( object ); glDrawElementsBaseVertex( GL_TRIANGLES, object->indexCount, GL_UNSIGNED_SHORT, object->indexDataOffset, object->baseVertex ); } submit each batch faster fewer, bigger batches
  • 21. Pack Multiple Objects per Buffer // sort or bucket visible objects foreach( render target ) // framebuffer foreach( pass ) // depth, blending, etc. states foreach( material ) // shaders foreach( material instance ) // textures foreach( vertex format ) // vertex buffers foreach( object ) { WriteUniformData( object ); glDrawElementsBaseVertex( GL_TRIANGLES, object->indexCount, GL_UNSIGNED_SHORT, object->indexDataOffset, object->baseVertex ); } pack multiple objects into the same (dynamic or static) vertex/index buffer take advantage of glDraw*() params to index into buffer without changing bindings
  • 22. Dynamic Streaming of Geometry ● Typical dynamic vertex ring buffer ● Frequent mapping = overhead ● GL_MAP_UNSYNCHRONIZED_BIT: no sync with GPU, but forces sync in multi-threaded drivers void* data = glMapBufferRange(GL_ARRAY_BUFFER, ringOffset, dataSize, GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_WRITE_BIT ); WriteGeometry( data, ... ); glUnmapBuffer(GL_ARRAY_BUFFER); ringOffset += dataSize; // deal with wrap-around in ring, etc.
  • 23. BufferStorage and Persistent Map ● Allocate buffer with glBufferStorage() ● Use flags to enable persistent mapping ● GL_MAP_PERSISTENT_BIT: keep mapped while drawing ● GL_MAP_COHERENT_BIT: writes automatically visible to GPU GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT; glBufferStorage(GL_ARRAY_BUFFER, ringSize, NULL, flags);
  • 24. Dynamic Streaming of Geometry ● Map once at creation time ● No more Map/Unmap in your draw loop ● But need to do synchronization yourself (upcoming talks will cover glFenceSync() and glClientWaitSync()) data = glMapBufferRange(GL_ARRAY_BUFFER, 0, ringSize, flags); WriteGeometry( data, ... ); data += dataSize;
  • 25. Performance ● BufferSubData vs Map(UNSYNCHRONIZED) ● Intel: avoid frequent BufferSubData() ● NV: Map(UNSYNCH) bad for threaded drivers ● Persistent mapping best where supported ● Overhead 2-20x better than next best option
  • 26. That Inner Loop Again foreach( object ) { WriteUniformData( object, &uniformData ); glDrawElementsBaseVertex( GL_TRIANGLES, object->indexCount, GL_UNSIGNED_SHORT, object->indexDataOffset, object->baseVertex ); }
  • 27. Using an Indirect Draw ● Per-object parameters are now sourced from memory typedef struct { uint count; uint instanceCount; uint firstIndex; int baseVertex; uint baseInstance; } DrawElementsIndirectCommand; DrawElementsIndirectCommand command; foreach( object ) { WriteUniformData( object, &uniformData ); WriteDrawCommand( object, &command ); glDrawElementsIndirect( GL_TRIANGLES, GL_UNSIGNED_SHORT, &command ); }
  • 28. One Multi-Draw Submits it All DrawElementsIndirectCommand* commands = ...; foreach( object ) { WriteUniformData( object, &uniformData[i] ); WriteDrawCommand( object, &commands[i] ); } glMultiDrawElementsIndirect( GL_TRIANGLES, GL_UNSIGNED_SHORT, commands, commandCount, 0 ); fill in per-object data (use parallelism, GPU compute if you like) kick buffered-up objects to be rendered
  • 29. What if I don't know the count? ● Doing GPU culling, etc. ● Use ARB_indirect_parameters ● Caveat: not all HW/drivers support it glBindBuffer( GL_DRAW_INDIRECT_BUFFER, commandBuffer ); glBindBuffer( GL_PARAMETER_BUFFER, countBuffer ); // … glMultiDrawElementsIndirectCount( GL_TRIANGLES, GL_UNSIGNED_SHORT, commandOffset, countOffset, maxCommandCount, 0 );
  • 30. Per-Draw Parameters/Data ● If shader used to take struct of uniforms ● Now take an array of such structs ● Or use SSBO (Shader Storage Buffer Object) to go bigger uniform ShaderParams params; uniform ShaderParams params[MAX_BATCH_SIZE]; buffer AllTheParams { ShaderParams params[]; };
  • 31. How to find your draw's data? ● Ideally, just index it using gl_DrawID ● Provided by ARB_shader_draw_parameters ● Not supported everywhere ● But relatively simple to implement your own mat4 mvp = params[gl_DrawIDARB].mvp;
  • 32. Implement Your Own Draw ID ● Use baseInstance field of draw struct ● Increment base instance for each command ● Shader can't see base instance ● gl_InstanceID always counts from zero http://www.g-truc.net/post-0518.html cmd->baseInstance = drawCounter++;
  • 33. Implement Your Own Draw ID ● Use a vertex attribute ● Set as per-instance with glVertexAttribDivisor ● Fill buffer with your own IDs ● Or arbitrary other per-draw parameters ● On some HW, faster than using gl_DrawID
  • 34. More MultiDrawIndirect Caveats ● If generating draws on GPU ● Use a GL buffer (obviously) ● If generating on CPU ● Intel: (Compat) faster to use ordinary host pointer ● NV: persistent-mapped buffer slightly faster ● GPU or CPU ● AMD: Array must be tightly packed for best perf
  • 35. Can Be 6-10x Less Overhead [Bar chart: Normalized Objects per Second, axis 0% to 700%, for Dynamic Buffer, Persistent-Mapped, Multi-Draw]
  • 36. Batching Across Texture Changes ● Bindless, sparse can help ● As you will hear ● Not all hardware supports these ● Packing 2D textures into arrays ● Works on all current hardware/drivers
  • 37. Packing Textures Into Arrays ● Array groups textures with same shape ● Dimensions, format, mips, MSAA ● Texture views may allow further grouping ● Put some same-size formats together
  • 38. Packing Textures Into Arrays ● Bind all arrays to pipeline at once ● Need to allocate carefully ● Based on your content requirements ● Don't allocate more than fits in GPU memory uniform sampler2DArray allSamplers[MAX_ARRAY_TEXTURES];
  • 39. Options for Sampler Parameters ● Pair array with different sampler objs ● Create views of array with different state ● Be careful about max texture limits ● Each combination needs a new binding slot
  • 40. Accessing Packed 2D Textures ● Texture "handle" is pair of indices ● Index into array of sampler2DArray ● Slice index into particular array texture ● Can store as 64 bits {int;float;}: no int→float convert in shader ● Or pack into 32 bits (hi/lo): fewer bytes to read, but more math
  • 41. Texture Array ~5x Less Overhead [Bar chart: Normalized Objects per Second, axis 0% to 600%, for glBindTexture per Object, Texture Arrays, No Texture]
  • 42. Dramatically Reduced Overhead ● Possible with current GL API and HW ● Persistent-mapped buffers ● Indirect and Multi-Draws ● Packing 2D textures into arrays ● Overhead is priority for all of us on GL
  • 44. Section Overview ● Bindless textures ● Recap of traditional texture binding ● Remove texture units with bindless ● Sparse textures ● Manage virtual and physical memory ● Streaming, sparse data sets, etc.
  • 45. Texture Units - Recap ● Traditional texture binding ● Create textures ● Bind to texture units ● Declare samplers in shaders ● Draw
  • 46. Texture Units - Recap ● Textures bound to numbered units ● Limited number of texture units ● State changes between draws ● Driver controls residency
  • 47. Texture Units - Recap ● Binding textures - API ● Very hard to coalesce draws glGenTextures(10, &tex[0]); glBindTexture(GL_TEXTURE_2D, tex[n]); glTexStorage2D(GL_TEXTURE_2D, ...); foreach (draw in draws) { foreach (texture in draw->textures) { glBindTexture(GL_TEXTURE_2D, tex[texture]); } // Other stuff glDrawElements(...); }
  • 48. Texture Units - Recap ● Binding textures - shader ● Limited textures per shader ● All declared at global scope layout (binding = 0) uniform sampler2D uTexture1; layout (binding = 1) uniform sampler3D uTexture2; out vec4 oColor; void main(void){ oColor = texture(uTexture1, ...) + texture(uTexture2, ...); }
  • 49. Bindless Textures ● Remove texture bindings! ● Unlimited* virtual texture bindings ● Application controls residency ● Shader accesses textures by handle * Virtually unlimited
  • 50. Bindless Textures ● Bindless textures - API ● No texture binds between draws // Create textures as normal, get handles from textures GLuint64 handle = glGetTextureHandleARB(tex); // Make resident glMakeTextureHandleResidentARB(handle); // Communicate ‘handle’ to shader... somehow foreach (draw) { glDrawElements(...); }
  • 51. Bindless Textures ● Bindless textures - shader ● Shader accesses textures by handle ● Must communicate handles to shader uniform Samplers { sampler2D tex[500]; // Limited only by storage }; out vec4 oColor; void main(void) { oColor = texture(tex[123], ...) + texture(tex[456], ...); }
  • 52. Bindless Textures ● Handles are 64-bit integers ● Stick them in uniform buffers ● Switch set of textures – glBindBufferRange ● Number of accessible textures limited by buffer size ● Put them in structures (AoS) ● Index with gl_DrawIDARB, gl_InstanceID
  • 53. Bindless Textures – DANGER!!! ● Some caveats with bindless textures ● Divergence rules apply ● Just like indexing arrays of textures ● Bindless handle must be constant across instance ● Divergence might work ● On some implementations, it Just Works ● On others, it Just Doesn't ● Even when it works, it could be expensive
  • 54. Sparse Textures ● Very large virtual textures ● Separate virtual and physical allocation ● Partially populated arrays, mips, cubes, etc. ● Stream data on demand
  • 55. Sparse Textures ● Textures arranged as tiles ● Each tile may be resident or not
  • 56. Sparse Textures ● Sparse textures – API ● That's it – now you have a virtual texture // Tell OpenGL you want a sparse texture glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SPARSE_ARB, GL_TRUE); // Allocate storage glTexStorage2D(GL_TEXTURE_2D, 10, GL_RGBA8, 1024, 1024);
  • 57. Sparse Textures ● Sparse textures – page sizes ● Argument order is target, internalformat, pname; bufSize counts GLint values, not bytes // Query number of available page sizes glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_NUM_VIRTUAL_PAGE_SIZES_ARB, 1, &num_sizes); // Get actual page sizes glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_VIRTUAL_PAGE_SIZE_X_ARB, sizeof(page_sizes_x) / sizeof(page_sizes_x[0]), &page_sizes_x[0]); glGetInternalformativ(GL_TEXTURE_2D, GL_RGBA8, GL_VIRTUAL_PAGE_SIZE_Y_ARB, sizeof(page_sizes_y) / sizeof(page_sizes_y[0]), &page_sizes_y[0]); // Choose a page size glTexParameteri(GL_TEXTURE_2D, GL_VIRTUAL_PAGE_SIZE_INDEX_ARB, n);
  • 58. Sparse Textures ● Reserve and commit ● In 'Operating System' terms ● Reserve – virtual allocation without physical store ● Commit – back virtual allocation with real memory
  • 59. Sparse Textures ● Sparse textures – commitment ● Commitment is controlled by a single function ● Uncommitted pages use no memory ● Committed pages may contain data void glTexPageCommitmentARB(GLenum target, GLint level, GLint xoffset, GLint yoffset, GLint zoffset, GLsizei width, GLsizei height, GLsizei depth, GLboolean commit);
  • 60. Sparse Textures ● Sparse textures – data storage ● Put data into sparse textures as normal ● glTexSubImage, glCopyTexSubImage, etc. ● Use a (persistent mapped) PBO for this! ● Attach to framebuffer object + draw ● Read from sparse textures ● glReadPixels, glGetTexImage*, etc.
  • 61. Sparse Textures ● Sparse textures – in-shader use ● No changes to shaders ● Reads from committed regions behave normally ● Reads from uncommitted regions return junk ● Probably not junk – most likely zeros ● The spec doesn't mandate this, however
  • 62. Sparse Texture Arrays ● Combine sparse textures and arrays ● Create very long (sparse) array textures ● Some layers are resident, some are not ● Allocate new layers on demand ● New layer = glTexPageCommitmentARB
  • 63. Sparse Texture Arrays ● Manage your own texture memory ● Create a huge virtual array texture ● Need a new texture? ● Allocate a new layer ● Don't need it any more? ● Recycle or make non-resident
  • 64. Sparse Bindless Texture Arrays ● Use all the features! ● Create a sparse array per texture size ● As textures become needed, commit pages ● Run out of pages? Make another texture... ● Get texture bindless handles ● Use as many handles as you like
  • 65. Sparse Bindless Texture Arrays ● Indexing sparse bindless arrays requires: ● 64-bit texture handle ● N-bit layer index ● Remember... ● Index can diverge, handle cannot ● Need one array per-size
  • 66. Building Data Structures ● Okay, so how do we use these things? ● Option 1 – Build on the CPU ● It's just memory writes ● Use a bunch of threads ● Persistent maps ● Option 2 – Use the GPU ● Much fun. Wow.
  • 67. Building Data Structures ● Using the GPU to set the scene (1) ● Create SSBO with AoS for draw parameters (block name is illustrative; std430 keeps the 20-byte stride of the indirect command) struct DrawParams { uint count; uint instanceCount; uint firstIndex; int baseVertex; uint baseInstance; }; layout (std430, binding = 0) buffer DrawParamsBuffer { DrawParams draw_params[]; };
  • 68. Building Data Structures ● Using the GPU to set the scene (2) ● Create another SSBO for draw metadata (block name is illustrative; a separate binding point so both buffers can be bound at once) struct DrawMeta { uint material_index; // More per-draw meta-stuff goes here... }; layout (std430, binding = 1) buffer DrawMetaBuffer { DrawMeta draw_meta[]; };
  • 69. Building Data Structures ● Using the GPU to set the scene (3) ● Use atomic counter to append to buffers layout (binding = 0, offset = 0) uniform atomic_uint draw_count; void append_draw(DrawParams params, DrawMeta meta) { uint index = atomicCounterIncrement(draw_count); draw_params[index] = params; draw_meta[index] = meta; }
  • 70. Building Data Structures ● Using the GPU to set the scene (4) ● Dump counter, do MultiDraw*IndirectCount glCopyBufferSubData(GL_ATOMIC_COUNTER_BUFFER, GL_PARAMETER_BUFFER_ARB, 0, 0, sizeof(GLuint)); glMultiDrawElementsIndirectCountARB(GL_TRIANGLES, GL_UNSIGNED_SHORT, nullptr, 0, MAX_DRAWS, 0);
  • 71. Building Data Structures ● Using the GPU to set the scene (5) ● In draw, use meta with gl_DrawIDARB struct Material { sampler2D tex1; }; layout (binding = 0) uniform MaterialData { Material material[]; }; ... oColor = texture(material[draw_meta[gl_DrawIDARB].material_index].tex1, ...);
  • 73. Putting it all into practice ● Introducing apitest ● Results ● Code review
  • 74. apitest ● https://github.com/nvMcJohn/apitest ● Extensible OSS Framework (Public Domain) ● Uses SDL 2.0 (Thanks SDL!) ● Initially developed by Patrick Doane ● Support by OS (OpenGL / D3D11): Windows (Yes / Yes), Linux (Yes / No), OSX (Sorta / No)
  • 75. The Framework ● Code is segmented into Problems and Solutions ● A Problem is a dataset to render ● A Solution is one targeted approach to rendering that dataset (Problem) ● Support code to create shaders, load textures, etc.
  • 76. The Problems So Far ● DynamicStreaming ● Render 160,000 "particles" that are dynamically generated each frame ● UntexturedObjects ● Render 64³ different, untextured objects ● Different matrices per object ● No instancing allowed!
  • 77. The Problems So Far - Continued ● TexturedQuads ● 10,000 quads using different textures ● Texture is changed between every object ● Null ● Clear and SwapBuffers ● Not going to discuss today; included as a startup sanity check.
  • 78. Result discussion ● Results gathered on a GTX 680, using public driver 335.23. ● But are shown normalized. ● AMD and Intel have very similar performance ratios between solutions.
  • 79. Decoder Ring ● SBTA = Sparse Bindless Texture Array ● SDP = Shader Draw Parameters
  • 80. DynamicStreaming ● Demo! ● Problem: Render 160,000 "particles" that are dynamically generated each frame
  • 82. [Chart: DynamicStreaming - Normalized Obj/s, axis 0% to 250%. Solutions shown: GLMapPersistent, D3D11MapNoOverwrite, GLBufferSubData, D3D11UpdateSubresource, GLMapUnsynchronized]
  • 83. GLMapPersistent ● Map the buffer at the beginning of time ● Keep it mapped forever. ● You are responsible for safety (proper fencing) ● Do not stomp on data in flight ● src/solutions/dynamicstreaming/gl/mappersistent.*
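"Proper fencing" here means remembering which byte ranges the GPU may still be reading and waiting on them before overwriting. Below is a minimal CPU-only sketch of such a range lock manager. The names `Fence`, `LockedRange` and `RangeLockManager` are hypothetical and are not taken from apitest; in a real renderer the fence would be a `GLsync` created with `glFenceSync` and waited on with `glClientWaitSync`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GLsync object: wait() blocks until the GPU
// has finished the commands issued before the fence was created.
struct Fence {
    std::function<void()> wait;
};

struct LockedRange {
    size_t offset, length;
    Fence fence;
};

class RangeLockManager {
public:
    // Call after issuing the draws that read [offset, offset + length).
    void lockRange(size_t offset, size_t length, Fence fence) {
        assert(length > 0);
        locks_.push_back({offset, length, std::move(fence)});
    }

    // Call before writing [offset, offset + length). Waits on, then drops,
    // every lock overlapping the range. Returns the number of waits.
    size_t waitForLockedRange(size_t offset, size_t length) {
        size_t waits = 0;
        std::vector<LockedRange> kept;
        for (auto& lock : locks_) {
            const bool overlaps = offset < lock.offset + lock.length &&
                                  lock.offset < offset + length;
            if (overlaps) { lock.fence.wait(); ++waits; }
            else          { kept.push_back(std::move(lock)); }
        }
        locks_.swap(kept);
        return waits;
    }

    size_t pending() const { return locks_.size(); }

private:
    std::vector<LockedRange> locks_;
};
```

With a buffer large enough for several frames, the wait normally returns immediately because the GPU finished with the oldest region long ago; the fence only costs time when the CPU gets too far ahead.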
  • 84. Required Extensions ● ARB_buffer_storage ● ARB_map_buffer_range ● ARB_sync
  • 85. Buffer Creation GLbitfield mapFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT; GLbitfield createFlags = mapFlags | GL_DYNAMIC_STORAGE_BIT; mDstHead = 0; mBuffSize = 3 * maxVerts * kVertexSizeBytes; glBindBuffer(GL_ARRAY_BUFFER, mVertexBuffer); glBufferStorage(GL_ARRAY_BUFFER, mBuffSize, nullptr, createFlags); mVertexDataPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, mBuffSize, mapFlags);
  • 86. Dem Flags ● Same code as slide 85, highlighting the mapFlags and createFlags declarations
  • 87. Set circular buffer head ● Same code as slide 85, highlighting the head being reset to 0
  • 88. Triple Buffering ftw ● Same code as slide 85, highlighting mBuffSize = 3 * maxVerts * kVertexSizeBytes
  • 89. Buffer Create ● Same code as slide 85, highlighting the glBindBuffer and glBufferStorage calls
  • 90. Map me… forever. ● Same code as slide 85, highlighting the glMapBufferRange call
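The triple-buffering arithmetic is easy to check in isolation. The sketch below assumes one frame's worth of data is written per frame and that the buffer holds three such regions; `RingHead` is a hypothetical helper name, not part of apitest.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: tracks the write head of a persistently mapped
// buffer sized to hold `frames` copies of one frame's data.
class RingHead {
public:
    explicit RingHead(size_t frameBytes, size_t frames = 3)
        : frameBytes_(frameBytes), totalBytes_(frameBytes * frames) {
        assert(frameBytes > 0 && frames > 0);
    }

    // Size to pass to glBufferStorage and glMapBufferRange.
    size_t totalBytes() const { return totalBytes_; }

    // Byte offset into the mapping where this frame's data is written.
    size_t head() const { return head_; }

    // Index of the first vertex stored at the current head.
    size_t firstVertex(size_t vertexSizeBytes) const {
        return head_ / vertexSizeBytes;
    }

    // Move to the next frame's region, wrapping back to 0 at the end.
    void advance() { head_ = (head_ + frameBytes_) % totalBytes_; }

private:
    size_t frameBytes_, totalBytes_, head_ = 0;
};
```

The factor of three is a tuning choice: a smaller multiple saves memory but makes it more likely that the CPU has to wait on a fence before reusing a region, so it is worth measuring.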
  • 91. Buffer Update / Render mBufferLockManager.WaitForLockedRange(mDstHead, vertSizeBytes); for (int i = 0; i < particleCount; ++i) { const int vertexOffset = i * kVertsPerParticle; const int thisDstOffset = mDstHead + (i * kParticleSizeBytes); void* dst = (unsigned char*) mVertexDataPtr + thisDstOffset; memcpy(dst, &_vertices[vertexOffset], kParticleSizeBytes); DrawArrays(TRIANGLES, kStartIndex + vertexOffset, kVertsPerParticle); } mBufferLockManager.LockRange(mDstHead, vertSizeBytes); mDstHead = (mDstHead + vertSizeBytes) % mBuffSize;
  • 92. Safety Third! ● Same code as slide 91, highlighting the WaitForLockedRange and LockRange calls
  • 93. Write those particles ● Same code as slide 91, highlighting the memcpy into the mapped pointer
  • 94. Now draw (inefficiently) ● Same code as slide 91, highlighting the per-particle DrawArrays call
  • 95. Update circular buffer head ● Same code as slide 91, highlighting mDstHead = (mDstHead + vertSizeBytes) % mBuffSize
  • 96. UntexturedObjects ● Demo! ● Problem: Render 64^3 (262,144) unique, untextured objects
  • 98-101. [Chart, shown in four builds: Untextured Object - Normalized Obj/s, axis 0% to 900%. Solutions shown: GLBufferStorage-NoSDP, GLMultiDrawBuffer-NoSDP, GLMultiDraw-NoSDP, GLBufferStorage-SDP, GLMultiDrawBuffer-SDP, GLMultiDraw-SDP, GLMapPersistent, GLDrawLoop, GLBindlessIndirect, GLTexCoord, GLUniform, D3D11Naive, GLBindless, GLDynamicBuffer, GLBufferRange, GLMapUnsynchronized]
  • 102. GLBufferStorage-(ε|No)SDP ● Set up a giant uniform or storage buffer with data for all objects for a frame. ● Use MDI to render many objects at once ● And PMB for dynamic data (matrix transforms, MDI entries) ● Need a way to index data in shader (SDP)
  • 103. Required Extensions ● ARB_buffer_storage ● ARB_map_buffer_range ● ARB_multi_draw_indirect ● ARB_shader_draw_parameters ● ARB_shader_storage_buffer_object ● ARB_sync
  • 104. NoSDP ● Can be used when instancing isn't needed ● Very simple improvement to SDP approach ● Not going to cover today ● So check the source code!
  • 105. DrawElementsIndirectCommand struct DrawElementsIndirectCommand { uint count; uint instanceCount; uint firstIndex; uint baseVertex; uint baseInstance; }; typedef DrawElementsIndirectCommand DEICmd;
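On the C++ side the command struct has to match what the GPU reads: five 32-bit fields with no padding, so that a stride of 0 ("tightly packed") in the multi-draw call means 20 bytes per command. The sketch below mirrors the slide's struct and adds a hypothetical `buildCommands` helper. Storing the object index in `baseInstance` is a commonly used trick for locating per-object data without `gl_DrawIDARB`; whether apitest's NoSDP path does exactly this is an assumption, not something the slides state.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// CPU-side mirror of the indirect command from the slide.
struct DEICmd {
    uint32_t count;          // number of indices to draw
    uint32_t instanceCount;  // 1 when no instancing is used
    uint32_t firstIndex;     // offset into the index buffer, in indices
    uint32_t baseVertex;     // value added to each fetched index
    uint32_t baseInstance;   // first instance ID seen by instanced attributes
};
static_assert(sizeof(DEICmd) == 20, "DEICmd must be tightly packed");

// Hypothetical helper: one command per object, all sharing one mesh.
inline std::vector<DEICmd> buildCommands(uint32_t objCount, uint32_t indexCount) {
    std::vector<DEICmd> cmds(objCount);
    for (uint32_t i = 0; i < objCount; ++i) {
        // Assumption: the object index rides along in baseInstance.
        cmds[i] = DEICmd{indexCount, 1u, 0u, 0u, i};
    }
    assert(cmds.size() == objCount);
    return cmds;
}
```

The `static_assert` is cheap insurance: if a field is ever added or reordered, the build fails instead of the GPU silently reading garbage commands.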
  • 106. Cmd Buffer Creation GLbitfield mapFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT; GLbitfield createFlags = mapFlags | GL_DYNAMIC_STORAGE_BIT; mCmdHead = 0; mCmdSize = 3 * objCount * sizeof(DEICmd); glBindBuffer(GL_DRAW_INDIRECT_BUFFER, mCmdBuffer); glBufferStorage(GL_DRAW_INDIRECT_BUFFER, mCmdSize, 0, createFlags); mCmdPtr = glMapBufferRange(GL_DRAW_INDIRECT_BUFFER, 0, mCmdSize, mapFlags);
  • 107. Obj Buffer Creation GLbitfield mapFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT; GLbitfield createFlags = mapFlags | GL_DYNAMIC_STORAGE_BIT; mObjHead = 0; mObjSize = 3 * objCount * sizeof(Matrix); glBindBuffer(GL_SHADER_STORAGE_BUFFER, mObjBuffer); glBufferStorage(GL_SHADER_STORAGE_BUFFER, mObjSize, 0, createFlags); mObjPtr = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, mObjSize, mapFlags);
  • 108. Cmd Buffer Update mCmdLock.WaitForLockedRange(mCmdHead, sizeof(DEICmd) * objCount); for (size_t u = 0; u < objCount; ++u) { DEICmd *cmd = (DEICmd*) ((unsigned char*) mCmdPtr + mCmdHead) + u; cmd->count = mIndexCount; cmd->instanceCount = 1; cmd->firstIndex = 0; cmd->baseVertex = 0; cmd->baseInstance = 0; } oldCmdHead = mCmdHead; mCmdHead = (mCmdHead + sizeof(DEICmd) * objCount) % mCmdSize; // Next, update the per-Object Data
  • 109. Fencing for fun and profit ● Same code as slide 108, highlighting the WaitForLockedRange call
  • 110. Someone Set Up Us The Draws ● Same code as slide 108, highlighting the loop that fills in the draw commands
  • 111. Manage the Head ● Same code as slide 108, highlighting the oldCmdHead and mCmdHead updates
  • 112. Obj Buffer Update // Next, update the per-Object Data
  • 113. Obj Buffer Update / Render // Next, update the per-Object Data mObjLock.WaitForLockedRange(mObjHead, sizeof(Matrix) * objCount); for (size_t u = 0; u < objCount; ++u) { Matrix *obj = (Matrix*) ((unsigned char*) mObjPtr + mObjHead) + u; (*obj) = (inObjParameters)[u]; } glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_SHORT, (const void*) oldCmdHead, objCount, 0); mCmdLock.LockRange(oldCmdHead, sizeof(DEICmd) * objCount); mObjLock.LockRange(mObjHead, sizeof(Matrix) * objCount); mObjHead = (mObjHead + sizeof(Matrix) * objCount) % mObjSize;
  • 114. Seriously though, be safe ● Same code as slide 113, highlighting the WaitForLockedRange and LockRange calls
  • 115. Updates to object parameters ● Same code as slide 113, highlighting the loop that copies the per-object matrices
  • 116. Draw all the things ● Same code as slide 113, highlighting the glMultiDrawElementsIndirect call
  • 117. Head management ● Same code as slide 113, highlighting the mObjHead update
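Because the heads in the update code are byte counts, it helps to keep the conversion from a byte offset to a typed pointer explicit, so the offset is never scaled by the element size by accident. The two helpers below are a sketch with hypothetical names. The second reflects how `glMultiDrawElementsIndirect` interprets its `indirect` argument when a buffer is bound to `GL_DRAW_INDIRECT_BUFFER`: as a byte offset passed in pointer form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns a typed pointer `byteOffset` bytes past `base`. The offset is
// applied to a byte pointer first, so it is not multiplied by sizeof(T).
template <typename T>
T* atByteOffset(void* base, size_t byteOffset) {
    assert(byteOffset % alignof(T) == 0);
    return reinterpret_cast<T*>(static_cast<unsigned char*>(base) + byteOffset);
}

// Packs a byte offset into the pointer-typed `indirect` parameter.
inline const void* indirectOffset(size_t byteOffset) {
    return reinterpret_cast<const void*>(static_cast<uintptr_t>(byteOffset));
}
```

A call site would then read `atByteOffset<DEICmd>(mCmdPtr, mCmdHead) + u` for writes and pass `indirectOffset(oldCmdHead)` to the draw, assuming the member names from the slides.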
  • 118. TexturedQuads ● Demo! ● 10,000 quads using different textures ● Texture is changed between every object
  • 120-122. [Chart, shown in three builds: TexturedQuads – Normalized Obj/s, axis 0% to 2000%. Solutions shown: GLSBTAMultiDraw-NoSDP, GLTextureArrayMultiDraw-NoSDP, GLBindlessMultiDraw, GLSBTAMultiDraw-SDP, GLTextureArrayMultiDraw-SDP, GLNoTex, GLTextureArray, GLNoTexUniform, GLTextureArrayUniform, GLSBTA, GLBindless, GLNaive, GLNaiveUniform, D3D11Naive]
  • 123. TexturedQuads notes ● SBTA was covered at Steam Dev Days ● Non-Sparse, Non-Bindless TextureArray is the fallback ● Should use BufferStorage improvements ● SBTA = Sparse Bindless Texture Array
  • 124. GLTextureArrayMultiDraw-(ε|No)SDP ● Instead of loose textures, use arrays of Texture Arrays ● Container contains <=2048 same-shape textures ● Shape is height, width, mipmapcount, format ● Use MDI for kickoffs ● Address is passed as {int; float} pair
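The container scheme amounts to a small allocator: group textures by shape, give each one a layer ("page") in an array texture, and open a new container when the current one fills up. The sketch below covers only that bookkeeping on the CPU. `TextureContainers` and `Shape` are hypothetical names, the format is represented as a plain integer, and no GL objects are created.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

// A "shape" as listed on the slide: width, height, mip count, format.
using Shape = std::tuple<uint32_t, uint32_t, uint32_t, uint32_t>;

// The {int; float} pair from the slide: container index plus array layer.
struct Tex2DAddress {
    uint32_t container;
    float page;
};

// Hypothetical allocator: hands out addresses, opening a new container
// when none exists for a shape or the current one is full.
class TextureContainers {
public:
    explicit TextureContainers(uint32_t maxPages = 2048) : maxPages_(maxPages) {}

    Tex2DAddress allocate(const Shape& shape) {
        auto it = open_.find(shape);
        if (it == open_.end() || used_[it->second] == maxPages_) {
            used_.push_back(0);
            open_[shape] = static_cast<uint32_t>(used_.size() - 1);
            it = open_.find(shape);
        }
        const uint32_t page = used_[it->second]++;
        return {it->second, static_cast<float>(page)};
    }

    size_t containerCount() const { return used_.size(); }

private:
    uint32_t maxPages_;
    std::map<Shape, uint32_t> open_;  // shape -> container accepting pages
    std::vector<uint32_t> used_;      // pages used per container
};
```

The page is stored as a float because it is used directly as the third texture coordinate when sampling the array. A real implementation would also cap the number of containers at the size of the sampler array declared in the shader.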
  • 125. struct Tex2DAddress { uint Container; float Page; }; layout (std140, binding=1) readonly buffer CB1 { Tex2DAddress texAddress[]; }; uniform sampler2DArray TexContainer[16]; // Elsewhere (in a func, whatever) int drawID = int(In.iDrawID); Tex2DAddress addr = texAddress[drawID]; vec3 texCoord = vec3(In.v2TexCoord.xy, addr.Page); vec4 texel = texture(TexContainer[addr.Container], texCoord);
  • 130. Questions? ● graham dot sellers at amd dot com @GrahamSellers ● tim dot foley at intel dot com @TangentVector ● cass at nvidia dot com @casseveritt ● jmcdonald at nvidia dot com @basisspace

Editor's Notes

  1. Where tightly packed == sizeof(struct) with no additional data
  2. * OSX is supported, but it currently really only runs the NULL solution.
  3. 64^3 = 262,144
  4. mVertexBuffer was previously gen'd into with glGenBuffers(1, &mVertexBuffer); We set up for triple buffering. You can often get away with a smaller buffer (like 2x). You need to measure. Our flags are the WRITE, PERSISTENT and COHERENT bits. Then we persistently map the whole buffer.
  5. BufferStorage improvements are probably worth another ~15%, bringing the total speedup to ~22x over D3D11.