SlideShare a Scribd company logo
Siggraph Asia 2014
Tristan Lorach
Manager of Devtech for Professional Visualization group
OPENGL NVIDIA "COMMAND-LIST":
"APPROACHING ZERO DRIVER OVERHEAD"
2
Siggraph Asia 2014
GPU Scalability
GPUs are extremely powerful
But only if used correctly
Mix of Application and Graphic API (Driver) responsibility
Increase amount of work per batch (Job)
Minimize CPUGPU interactions
Lower Memory traffic
Lower API calls: Batch things together
Factorize Data
Re-use data uploaded to Video memory (instancing…)
3
Siggraph Asia 2014
Use of the GPU
Games
~1500 to 3000 Drawcalls for a scene
Intensive use of Multi-layer image processing over the scene
Heavy shaders
CAD/FCC/professional applications
Hard to batch user’s works (CAD applications)
~10,000 to… 300,000 Drawcalls for a scene: heavy CPU workload for our driver
Shaders simpler than games (but catching-up these days...)
Post-Processing more and more used (SSAO)
4
Siggraph Asia 2014
Use of the GPU
New Graphic APIs are trying to address these concerns
Khronos (Announced it at Siggraph Vancouver 2014)
Metal
Mantle
DX12
All propose better ways to issue commands and render-states
But can we improve OpenGL to go the same path ?
Yes: NV_command_list
5
Siggraph Asia 2014
OpenGL
OpenGL is ~25 years old (!!)
1.x: fixed function
2.x: Shading language (GLSL)
3.x/4.x: more as a massive compute system
More graphic stages (Geom.Shader; Tessellation)
High flexibility & efficiency on memory accesses
High programability
Design "flaws": still “State Machine”;
ServerClient; poor multi-threading management
1992
2012
6
Siggraph Asia 2014
Challenge of Issuing Commands
Issuing drawcalls and state changes can be a real bottleneck
 650,000 Triangles
 68,000 Parts
 ~ 10 Triangles per part
 3,700,000 Triangles
 98 000 Parts
 ~ 37 Triangles per part
 14,338,275 Triangles/lines
 300,528 drawcalls (parts)
 ~ 48 Triangles per part
App+driver
GPU
GPU
idle
CPU
Excessive
Work from
App & Driver
On CPU
!
!
courtesy of PTC
7
Siggraph Asia 2014
Big Picture – Typical Case
Vertex Puller (IA)
Vertex Shader
TCS (Tessellation)
TES (Tessellation)
Tessellator
Geometry Shader
Transform Feedback
Rasterization
Fragment Shader
Per-Fragment Ops
Framebuffer
Tr. Feedback buffer
Uniform Block
Texture Fetch
Image Load/Store
Atomic Counter
Shader Storage
Element buffer (EBO)
Draw Indirect Buffer
Vertex Buffer (VBO)
Front-End
(decoder)
Cmd bundles
OpenGL
Driver
Application
Push-Buffer
(FIFO)
cmds
FBO resources
(Textures / RB)
64 bits
pointersHandles
(IDs)
Id  64 bits
Addr.
OpenGL
Commands
OpenGL
resources
8
Siggraph Asia 2014
Application
Token-buffer
OpenGL
Driver
Big Picture
Vertex Puller (IA)
Vertex Shader
TCS (Tessellation)
TES (Tessellation)
Tessellator
Geometry Shader
Transform Feedback
Rasterization
Fragment Shader
Per-Fragment Ops
Framebuffer
Tr. Feedback buffer
Uniform Block
Texture Fetch
Image Load/Store
Atomic Counter
Shader Storage
Element buffer (EBO)
Draw Indirect Buffer
Vertex Buffer (VBO)
Front-End
(decoder)
Cmd bundles
Push-Buffer
(FIFO)
cmds
FBO resources
(Textures / RB)
OpenGL
Commands
resources
64 bits
Pointers
(bindless)
64 bits
GPU Address
Offload cmd bundles
creation to the App.
9
Siggraph Asia 2014
Application
Command-list
Token-buffer
OpenGL
Driver
Big Picture
Vertex Puller (IA)
Vertex Shader
TCS (Tessellation)
TES (Tessellation)
Tessellator
Geometry Shader
Transform Feedback
Rasterization
Fragment Shader
Per-Fragment Ops
Framebuffer
Tr. Feedback buffer
Uniform Block
Texture Fetch
Image Load/Store
Atomic Counter
Shader Storage
Element buffer (EBO)
Draw Indirect Buffer
Vertex Buffer (VBO)
Front-End
(decoder)
Cmd bundles
Push-Buffer
(FIFO)
cmds
FBO resources
(Textures / RB)
OpenGL
Commands
Token-buffers
+ state objects
resources
64 bits
Pointers
(bindless)
State Object
More work for
FE – but fast !
10
Siggraph Asia 2014
Demo – Quadro K5000
CAD Car model
14,338,275 Primitives
300,528 drawcalls
348,862 attribute update
12,004 uniform update
Set of CAD models together
29,344,075 Primitives
26,144 drawcalls
18,371 attribute update
13,632 uniform update
11
Siggraph Asia 2014
DemoS – Quadro K5000
CAD Car model
K5000:
920,363,200 primitives/S
19,290,668 drawcalls/S
12
Siggraph Asia 2014
DemoS – Quadro K5000
Set of CAD models together
K5000
1,211,684,096 primitives/S
1,079,545 drawcalls/S
13
Siggraph Asia 2014
DemoS – Maxwell
CAD Car model
K5000:
920,363,200 primitives/S
19,290,668 drawcalls/S
Maxwell GM204 (GeForce):
1,782,257,920 primitives/S
37,355,848 drawcalls/S
Drawcall time : 8 Micro-
seconds on CPU (!)
Set of CAD models together
K5000
1,211,684,096 primitives/S
1,079,545 drawcalls/S
Maxwell GM204 (GeForce)
3,012,795,392 primitives/S
2,684,239 drawcalls/S
Drawcall time: 42 Micro-
seconds on CPU
!
14
Siggraph Asia 2014
More Performances
Another test: always using bindless UBO and VBO
CPU-Emulation: Token stream executed as CPU "switch-case"
Strategies
Un-grouped: ~ 1 MB, 79,000 Tokens, ~68,000 drawcalls
Material-grouped: ~300 KB, 22,000 Tokens, ~11,000 drawcalls
Strategy Draw time Material-grouped
0.73
HW TOKEN-Buffers 0.56
Technique GPU (ms)
1.4 1.14 x
~0 BIG x
CPU (ms)
Draw time un-grouped
2.4
1.1
GPU (ms)
4.5 1.6 x
~0 BIG x
CPU (ms)
CPU-emulated TOKEN-buffers
glBindBufferRange + drawcalls 0.73 1.6 3.5 7.2
Preliminary results K6000
15
Siggraph Asia 2014
More Performances
5 000 shader changes: toggling between two shaders in „shaded & edges“
5 000 fbo changes: similar as above but with fbo toggle instead of shader
Almost no additional cost compared to rendering without fbo changes
Timing GPU (ms) CPU (ms)
CPU-emulated TOKEN-buffers 12.7 15.1
2.9 4.3 x 0.4 37 xHW TOKEN-buffers
2.8 4.5 x 0.005 BIG xCommand-Lists
Timer GPU (ms) CPU (ms)
CPU-emulated TOKEN-buffers 60.0 60.0
1.8 33 x 0.9 66 xHW TOKEN-buffers
1.7 35 x 0.022 BIG xCommand-Lists
Preliminary results on K5000
16
Siggraph Asia 2014
GPU
Virtual
Memory
Bindless Technology
What is it about?
Work from native GPU pointers/handles (NVIDIA
pioneered this technology)
A lot less CPU work (memory hopping, validation...)
Allow GPU to use flexible data structures
Bindless Buffers
Vertex & Global memory since Tesla Generation (CUDA
capable)
Bindless Textures
Since Kepler
Bindless Constants (UBO)
New driver feature, support for Fermi and above
Bindless plays a central role for Command-List
Vertex Puller (IA)
Vertex Shader
TCS (Tessellation)
TES (Tessellation)
Tessellator
Geometry Shader
Transform Feedback
Rasterization
Fragment Shader
Per-Fragment Ops
Framebuffer
Uniform Block
Texture Fetch
Element buffer (EBO)
Vertex Buffer (VBO)
Front-End
(decoder)
Push buffer
64bitsaddress
Attr.s
Idx
Uniform Block
Send Ptrs
17
Siggraph Asia 2014
#define UBA UNIFORM_BUFFER_ADDRESS_NV
UpdateBuffers();
glEnableClientState(UNIFORM_BUFFER_UNIFIED_NV);
glBufferAddressRangeNV (UBA, 0, addrView, viewSize);
foreach (obj in scene) {
...
// glBindBufferRange (UBO, 1, uboMatrices, obj.matrixOffset, maSize);
glBufferAddressRangeNV(UBA, 1, addrMatrices + obj.matrixOffset, maSize);
foreach ( batch in obj.primitive_group_material) {
// glBindBufferRange (UBO, 2, uboMaterial, batch.materialOffset, mtlSize);
glBufferAddressRangeNV(UBA, 2, addrMaterial + batch.materialOffset, mtlSize);
...
}
}
Example On Using Bindless UBO
New pointer for UBO#1 updated per Object
New pointer for UBO#2 updated per
Primitive group
pointer for UBO#0 updated once for all objects
regular UBO binds are now
ignored!
18
Siggraph Asia 2014
Pointers must be aligned
So do Ptr Offsets
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT,
&offsetAlignment);
Normally: 256 bytes
 gaps between each item
Try to fit as much data as possible…
Not the case if passing an index for
array access (MDI + “Base-instance”)
But requires special GLSL code (indexing)
Bindless And Memory Alignement
GPU
Virtual
Memory
64bitsaddress
Uniform Material 0
Uniform Material 1
Uniform Material 2
…
glBufferAddressRangeNV
(UBA, 2, addrMaterial
+ batch.materialOffset
, mtlSize);
256b
Uniform array materials[]
Material 0
Material 1
Material 2
…
Uniform …
19
Siggraph Asia 2014
NV_command_list Key Concepts
1. Tokenized Rendering:
some state changes and draw commands are encoded into binary data stream
Depends on bindless technology
2. State Objects
Whole OpenGL States (program, blending...) captured as an object
Allows pre-validation of state combinations, later reuse of objects is very fast
3. Command List Object
„Display-list“ paradigm but more flexible: buffer are outside (referenced by
address), so content can still be modified (matrics, vertices...)
20
Siggraph Asia 2014
UpdateBuffers();
glBindBufferBase (UBO, 0, uboView);
foreach (obj in scene) {
// redundancy filter for these (if (used != last)...
glBindVertexBuffer (0, obj.geometry->vbo, 0, vtxSize);
glBindBuffer (ELEMENT, obj.geometry->ibo);
glBindBufferRange (UBO, 1, uboMatrices, obj.matrixOffset, maSize);
// iterate over cached material groups
foreach ( batch in obj.materialGroups) {
glBindBufferRange (UBO, 2, uboMaterial, batch.materialOffset, mtlSize);
glMultiDrawElements (...);
}
}
Typical Scene Drawing Example
~ 2 500 api drawcalls
~11 000 drawcalls
~55 triangles per call
21
Siggraph Asia 2014
UpdateBuffers();
glBindBufferBase (UBO, 0, uboView);
foreach (obj in scene) {
// redundancy filter for these (if (used != last)...
glBindVertexBuffer (0, obj.geometry->vbo, 0, vtxSize);
glBindBuffer (ELEMENT, obj.geometry->ibo);
glBindBufferRange (UBO, 1, uboMatrices, obj.matrixOffset, maSize);
// iterate over all parts individually
foreach ( part in obj.parts) {
if (part.material != lastMaterial){
glBindBufferRange (UBO, 2, uboMaterial, part.materialOffset, mtlSize);
}
glDrawElements (...);
}
}
Typical Scene Drawing Example
~68 000 drawcalls
~10 triangles per call
22
Siggraph Asia 2014
Scene Drawing Converted To Token Buffer
glDrawCommandsNV (GL_TRIANGLES, tokenBuffer
, offsets[], sizes[], count);
// {0}, {bufferSize}, 1
foreach (obj in scene) {
glBufferAddressRangeNV(VERTEX.., 0, obj.geometry->addrVBO, ...);
glBufferAddressRangeNV(ELEMENT..., 0, obj.geometry->addrIBO, ...);
glBufferAddressRangeNV(UNIFORM..., 1, addrMatrices ...);
foreach ( batch in obj.materialGroups) {
glBufferAddressRangeNV(UNIFORM, 2, addrMaterial ...);
glMultiDrawElements(...)
}
}
becomes a single drawcall
replaces 80k calls to GL
(for graphic-card model)
Token buffer
Set Attr#0 on
VBO address …
Set Attr#1 on
VBO address …
Set Elements on
EBO address …
Uniform Matrix
on UBO address …
Uniform Material
on UBO address …
DrawElements
Uniform Material
on UBO address …
DrawElements
Obj#1
Obj#2
…
For all
objects
…
23
Siggraph Asia 2014
Tokens-buffers are tightly packed structs in linear memory
Token Buffer Structures
TokenUbo
{
GLuint header;
UniformAddressCommandNV
{
GLushort index;
GLushort stage;
GLuint64 address;
} cmd;
}
TokenVbo
{
GLuint header;
AttributeAddressCommandNV
{
GLuint index;
GLuint64 address;
} cmd;
}
TokenIbo
{
GLuint header;
ElementAddressCommandNV
{
GLuint64 address;
GLuint typeSizeInByte;
} cmd;
}
TokenDrawElements
{
Gluint header;
DrawElementsCommandNV
{
GLuint count;
GLuint firstIndex;
GLuint baseVertex;
} cmd;
}
Tokenbuffer
SetAttr#0on
VBOaddress…
SetAttr#1on
VBOaddress…
SetElementson
EBOaddress…
UniformMatrix
onUBOaddress…
UniformMaterial
onUBOaddress…
DrawElements
UniformMaterial
onUBOaddress…
DrawElements
…
UniformMaterial
onUBOaddress…
DrawElements
UniformMaterial
onUBOaddress…
DrawElements
UniformMaterial
onUBOaddress…
DrawElements
UniformMaterial
onUBOaddress…
DrawElements
UniformMaterial
onUBOaddress…
DrawElements
24
Siggraph Asia 2014
Tokenized Rendering
Token Headers are GPU commands, followed by arguments
FRONTFACE_COMMAND_NV
BLEND_COLOR_COMMAND_NV
STENCIL_REF_COMMAND_NV
LINE_WIDTH_COMMAND_NV
POLYGON_OFFSET_COMMAND_NV
ALPHA_REF_COMMAND_NV
VIEWPORT_COMMAND_NV
SCISSOR_COMMAND_NV
ELEMENT_ADDRESS_COMMAND_NV
ATTRIBUTE_ADDRESS_COMMAND_NV
UNIFORM_ADDRESS_COMMAND_NV
DRAW_ELEMENTS_COMMAND_NV
DRAW_ARRAYS_COMMAND_NV
DRAW_ELEMENTS_STRIP_COMMAND_NV
DRAW_ARRAYS_STRIP_COMMAND_NV
DRAW_ELEMENTS_INSTANCED_COMMAND_NV
DRAW_ARRAYS_INSTANCED_COMMAND_NV
TERMINATE_SEQUENCE_COMMAND_NV
NOP_COMMAND_NV
Final list might change
Render States
Index Buffer;
Position Attribute; Normals; Texcoords...
Matrices; Material data...
Draw Commands !
Special Commands (see later)
25
Siggraph Asia 2014
Tokenized Rendering
Example: UBO Address assignment structure
FRONTFACE_COMMAND_NV
BLEND_COLOR_COMMAND_NV
STENCIL_REF_COMMAND_NV
LINE_WIDTH_COMMAND_NV
POLYGON_OFFSET_COMMAND_NV
ALPHA_REF_COMMAND_NV
VIEWPORT_COMMAND_NV
SCISSOR_COMMAND_NV
ELEMENT_ADDRESS_COMMAND_NV
ATTRIBUTE_ADDRESS_COMMAND_NV
UNIFORM_ADDRESS_COMMAND_NV
DRAW_ELEMENTS_COMMAND_NV
DRAW_ARRAYS_COMMAND_NV
DRAW_ELEMENTS_STRIP_COMMAND_NV
DRAW_ARRAYS_STRIP_COMMAND_NV
DRAW_ELEMENTS_INSTANCED_COMMAND_NV
DRAW_ARRAYS_INSTANCED_COMMAND_NV
TERMINATE_SEQUENCE_COMMAND_NV
NOP_COMMAND_NV
TokenUbo
{
GLuint header;
UniformAddressCommandNV
{
GLushort index;
GLushort stage;
GLuint64 address;
} cmd;
}
Example
Real Token to be queried
header = glGetCommandHeaderNV(
UNIFORM_ADDRESS_COMMAND_NV, Sizeof(TokenUbo) )
One Shading stage to target (not ‘program’)
Stage = glGetStageIndexNV(S)
(S: STAGE_VERTEX|GEOMETRY|FRAGMENT|TESS_CONTROL/EVAL)
Hardware Index
*not* Uniform Program index
26
Siggraph Asia 2014
TOKENIZED RENDERING
Allows rendering of a mix of LIST, STRIP, FAN, LOOP modes in one go
Pass „basic type“ as Token, use „mode“ for more in "Instanced" drawing
TokenDrawElements
{
Gluint header;
DrawElementsCommandNV
{
GLuint count;
GLuint firstIndex;
GLuint baseVertex;
} cmd;
}
TokenDrawElementsInstanced
{
Gluint header;
DrawElementsInstancedCommandNV
{
GLuint mode;
GLuint count;
GLuint instanceCount;
GLuint firstIndex;
GLuint baseVertex;
GLuint baseInstance;
} cmd;
}
LIST, STRIP,
FAN, LOOP types
(GL_LINE_LOOP...)
DRAW_ELEMENTS_COMMAND_NV
DRAW_ARRAYS_COMMAND_NV
DRAW_ELEMENTS_STRIP_COMMAND_NV
DRAW_ARRAYS_STRIP_COMMAND_NV
glGetCommandHeaderNV()
DRAW_ELEMENTS_INSTANCED_COMMAND_NV
DRAW_ARRAYS_INSTANCED_COMMAND_NV
glGetCommandHeaderNV()
27
Siggraph Asia 2014
TERMINATE_SEQUENCE: Occlusion Culling
Now overcomes deficit of previous methods
http://on-demand.gputechconf.com/gtc/2014/presentations/S4379-opengl-44-scene-rendering-
techniques.pdf
Enqueue all tokens for entire object
Shader to Check Visibility; Shader for SCAN operation;
Add TERMINATE_SEQUENCE_COMMAND_NV to end
Technique Draw time GROUPED
Timing GPU (ms) CPU (ms)
6.3 2.6 x 0.9 6 xCULL Old Bindless MDI (*)
CULL NEW TOKEN native 0.2 27 x
glBindBufferRange + drawcalls 16.8 5.4
6.2 2.7 x 6.3 0.85 xCULL Old Readback (stalls pipe) (*)
K5000, Preliminary results, (*) taken from slightly differnt framework
HW TOKEN-buffers ~0 BIG x
No instancing!
(synthetic test, data replication)
SCENE:
materials: 138
objects: 13,032
geometries: 3,312
parts: 789,464
triangles: 29,527,840
vertices: 27,584,376
16.8
5.3 3.1 x
28
Siggraph Asia 2014
Just a little...
“commandBindableNV” turns binding IDs
to Hardware IDs
Texturing must go through bindless
ARB_bindless_texture
Regular Texture binding Turned Off
Changing UBO Ptr Addresses
== texture changes (here ‘myBuf’)
Does Shaders Need Special Code ?
#extension GL_ARB_bindless_texture : require
#extension GL_NV_command_list : enable
#if GL_NV_command_list
layout(commandBindableNV) uniform;
#endif
layout(std140,binding=2) uniform myBuf {
sampler2D mySampler;
//or:
in int whichSampler;
};
layout(std140,binding=1) uniform samplers {
sampler2D allSamplers[NUM_TEXTURES];
};
…
c = texture(mySampler, tc);
//or
c = texture(allSamplers [ whichSampler ] , tc);
…
Example:
29
Siggraph Asia 2014
State Objects
StateObject
Encapsulates majority of state (fbo format, active shader, blend,
depth ...)
State Object immutable: gives more control over validation cost
no bindings captured: bindless used instead; textures passed via UBO
Primitive type is a state: participates to important validation work
Cannot be put in the token buffer
glCaptureStateNV(stateobject, GL_TRIANGLES );
Primitive type as argument:
because it is part of states
30
Siggraph Asia 2014
Render entire scenes with different
shaders/fbos... in one go
Driver caches state transitions !
Draw With State Objects
glDrawCommandsStatesNV(
tokenBuffer,
offsets[], sizes[],
states[], fbos[],
count);
Token buffer
VBO address …
VBO address …
EBO address …
Uniform Matrix
Uniform Material
DrawElements
Uniform Material
DrawElements
…
EBO address …
Uniform Matrix
Uniform Material
DrawElements
Uniform Material
DrawElements
FBO 1
FBO 2
fbos[]
FBO 1
FBO 1
State Obj 1
State Obj 2
State Obj 1
State Obj 4
Count = 4
States[]
offsets[]
sizes[0]sizes[1]sizes[2]sizes[3]
31
Siggraph Asia 2014
Driver bakes „Diff“ between states and apply only the difference
Pseudo Code of glDrawCommandsStatesNV
for i < count {
if (i == 0) set state from states[i];
else set state transition states[i-1] to states[i]
if (fbo[i]) {
// fbo[i] must be compatible with states[i].fbo
glBindFramebuffer( fbo[i] )
} else glBindFramebuffer( states[i].fbo )
ProcessCommandSequence(...tokenBuffer, offsets[i], sizes[i])
}
32
Siggraph Asia 2014
Rendering must always happen in a FBO
Backbuffer Ptr. Address is un-reliable (dynamic re-alloc when resizing…)
Use glBlitFramebuffer() for FBO  final Backbuffer result
FBO Resources Must be Resident (glMakeTextureHandleResidentARB())
Resources can be replaced/swapped if they keep the same base
configuration (attachment formats, # drawbuffers...)
I.e. Same configuration as captured FBO settings from stateobject
This is what happens when resizing:
Detach & Destroy resource  create new ones  attach them to FBO(s)
Frame-Buffer Objects And State Objects
33
Siggraph Asia 2014
Combine multiple token buffers &
states into a CommandList object
glCreateCommandListsNV(N, &list)
glCommandListSegmentsNV(list, segs)
Convenient for concatenation
glListDrawCommandsStatesClientNV(list,
segment, token-buffer-ptrs, token-buffer-
sizes, states, FBOs, count)
glCompileCommandListNV(list);
glCallCommandListNV(list)
Command-List Object
System mem.
VBO address …
VBO address …
EBO address …
Uniform Matrix
Uniform Material
DrawElements
…
Uniform Material
DrawElements
FBO
State Object
System mem.
VBO address …
VBO address …
EBO address …
Uniform Matrix
Uniform Material
DrawElements
…
FBO
State Object
System mem.
VBO address …
VBO address …
Uniform Matrix
Uniform Material
DrawArrays
…
Uniform Material
DrawArrays
FBO
State Object
…
Command-List Segment #0 Segment #1
34
Siggraph Asia 2014
Command-List Object
Less flexibilty compared to token buffer
Token buffers taken from client pointers (CPU memory)
Driver will optimize these token buffers and related state-objects
Optimize State transitions; token cmds for the GPU Front-End
Resulting command-list is immutable
If any resource pointer changed, the command-list must be recompiled
But still rather flexible
possible to change content of referenced buffers! (matrices, materials,
vertices...)
Animation; skinning... No need for command-list recompilation
35
Siggraph Asia 2014
Command-List doesn’t pretend to solve any OpenGL multi-threading
Multi-threading can help if complex CPU work going-on
update of Vertex / Element buffers: transform. Matrices; skinning…
Update of Token Buffers: adding/removing characters in a game scene…
OpenGL & Multithreading issues:
OpenGL is bad at sharing its access from various threads
Make-Current or Multiple Contexts aren’t good solutions
Prefer to keep context ownership to one main thread
OpenGL And Multi-threading ?
36
Siggraph Asia 2014
Token Buffers and Vertex/Elements/Uniform Buffers (VBO/UBO)
Token Buffer Objects; VBOs and UBOs owned by OpenGL
Multi-threading can be used for data update if:
Work on CPU system memory; then push data back to OpenGL (glBufferSubData)
Or Request for a pointer (glMapBuffers); then work with it from the thread
State Objects:
State Object and their 'Capture' must be handled in OpenGL context
No way to do it on a separate threads that don’t have OpenGL context
Multi-threading And Command-List
37
Siggraph Asia 2014
Multi-threading Use-Case Example #1
Thread #0
OpenGL Ctxt
Thread #1
Thread #1
Submit
States
PushAttrib()
set states & and capture
PopAttribs()
Get State ID
PushAttrib()
set states & and capture
PopAttribs()
Submit
States
unmap
Token buf.
ask for
Token buf.
Ptr
Build state
list
ask for
Token buf.
Ptr
unmap
Token buf.
Build token cmds
Get State ID
Build token
cmds
Build token
cmds
ask for
Token buf.
Ptr unmap
Token
buf.
ask for
Token buf.
Ptr ...
ask for
Token buf.
Ptr
Build token
cmds
unmap
Token
buf.
Build token cmds
Build state
list
38
Siggraph Asia 2014
Multi-threading Use-Case Example #2
Split state capture from Token-buffer creation
Thread 1 Thread 2 Thread 3
Build Token buffers
in CPU memory
Thread 0
OpenGL
Ctxt
Walk through whole scene
and build State-Objects
Receive Token-Buffers
and issue them
as Token Buffer Objects
to OpenGL
39
Siggraph Asia 2014
Migration Steps
All rendering via FBO (statecapture doesn‘t support default backbuffer)
No legacy states in GLSL
use custom uniforms no built-in uniforms (gl_ModelView...)
use glVertexAttribPointer (no glTexCoordPointer etc.)
No classic uniforms, pack them all in UBO
ARB_bindless_texture for texturing
Bindless Buffers
ARB_vertex_attrib_binding combined with NV_vertex_buffer_unified_memory (VBUM)...
Organize for StateObject reuse
state capture expensive: Capture them at init. time; Factorize their use in the scene
40
Siggraph Asia 2014
Migration Tips
Vertex Attributes and bindless VBO
http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-
OpenGL-Scene-Rendering-Techniques.pdf (slide 11-16)
GLSL
// classic attributes
...
normal = gl_Normal;
gl_Position = gl_Vertex;
// generic attributes
// ideally share this definition across C and GLSL
#define VERTEX_POS 0
#define VERTEX_NORMAL 1
in layout(location= VERTEX_POS) vec4 attr_Pos;
in layout(location= VERTEX_NORMAL) vec3 attr_Normal;
...
normal = attr_Normal;
gl_Position = attr_Pos;
41
Siggraph Asia 2014
Migration Tips
UBO Parameter management
http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-
OpenGL-Scene-Rendering-Techniques.pdf ( 18-27, 44-47)
Ideally group by frequency of change
// classic uniforms
uniform samplerCube viewEnvTex;
uniform vec4 materialColor;
uniform sampler2D materialTex;
...
// UBO usage, bindless texture inside UBO, grouped by change
layout(std140, binding=0, commandBindableNV) uniform view
{
samplerCube texViewEnv;
};
layout(std140, binding=1, commandBindableNV) uniform material
{
vec4 materialColor;
sampler2D texMaterialColor;
};
...
42
Siggraph Asia 2014
OpenGL Devtech 'Proviz' samples soon available on GitHub !
check https://github.com/nvpro-samples on a regular basis
working on ramping-up https://developer.nvidia.com/samples &
https://developer.nvidia.com/professional-graphics
command-lists examples to come here this Dec. 2014
Many other samples will be pushed here as standalone repositories
Beta Driver for Command-Lists: Release 346 for Dec.2014
Special Thanks to Pierre Boudier and Christoph Kubisch
Questions/Feedback: tlorach@nvidia.com / ckubisch@nvidia.com
Thanks!

More Related Content

What's hot

NVIDIA's OpenGL Functionality
NVIDIA's OpenGL FunctionalityNVIDIA's OpenGL Functionality
NVIDIA's OpenGL Functionality
Mark Kilgard
 
Oit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked ListsOit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked ListsHolger Gruen
 
OpenGL for 2015
OpenGL for 2015OpenGL for 2015
OpenGL for 2015
Mark Kilgard
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
AMD Developer Central
 
Borderless Per Face Texture Mapping
Borderless Per Face Texture MappingBorderless Per Face Texture Mapping
Borderless Per Face Texture Mapping
basisspace
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
AMD Developer Central
 
Approaching zero driver overhead
Approaching zero driver overheadApproaching zero driver overhead
Approaching zero driver overheadCass Everitt
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareDaniel Blezek
 
OTOY Presentation - 2016 NVIDIA GPU Technology Conference - April 5 2016
OTOY Presentation - 2016 NVIDIA GPU Technology Conference - April 5 2016 OTOY Presentation - 2016 NVIDIA GPU Technology Conference - April 5 2016
OTOY Presentation - 2016 NVIDIA GPU Technology Conference - April 5 2016
otoyinc
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
Narann29
 
Realtime Per Face Texture Mapping (PTEX)
Realtime Per Face Texture Mapping (PTEX)Realtime Per Face Texture Mapping (PTEX)
Realtime Per Face Texture Mapping (PTEX)
basisspace
 
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...Takahiro Harada
 
Parallel Futures of a Game Engine
Parallel Futures of a Game EngineParallel Futures of a Game Engine
Parallel Futures of a Game Engine
Johan Andersson
 
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
AMD Developer Central
 
Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!
Johan Andersson
 
計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?
Shinnosuke Furuya
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
Muhaza Liebenlito
 
Optimizing Games for Mobiles
Optimizing Games for MobilesOptimizing Games for Mobiles
Optimizing Games for Mobiles
St1X
 
GFX Part 4 - Introduction to Texturing in OpenGL ES
GFX Part 4 - Introduction to Texturing in OpenGL ESGFX Part 4 - Introduction to Texturing in OpenGL ES
GFX Part 4 - Introduction to Texturing in OpenGL ES
Prabindh Sundareson
 

What's hot (20)

NVIDIA's OpenGL Functionality
NVIDIA's OpenGL FunctionalityNVIDIA's OpenGL Functionality
NVIDIA's OpenGL Functionality
 
Oit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked ListsOit And Indirect Illumination Using Dx11 Linked Lists
Oit And Indirect Illumination Using Dx11 Linked Lists
 
OpenGL for 2015
OpenGL for 2015OpenGL for 2015
OpenGL for 2015
 
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth ThomasHoly smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
Holy smoke! Faster Particle Rendering using Direct Compute by Gareth Thomas
 
Borderless Per Face Texture Mapping
Borderless Per Face Texture MappingBorderless Per Face Texture Mapping
Borderless Per Face Texture Mapping
 
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John MelonakosPT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos
 
Approaching zero driver overhead
Approaching zero driver overheadApproaching zero driver overhead
Approaching zero driver overhead
 
General Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics HardwareGeneral Purpose Computing using Graphics Hardware
General Purpose Computing using Graphics Hardware
 
OTOY Presentation - 2016 NVIDIA GPU Technology Conference - April 5 2016
OTOY Presentation - 2016 NVIDIA GPU Technology Conference - April 5 2016 OTOY Presentation - 2016 NVIDIA GPU Technology Conference - April 5 2016
OTOY Presentation - 2016 NVIDIA GPU Technology Conference - April 5 2016
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
 
Hair in Tomb Raider
Hair in Tomb RaiderHair in Tomb Raider
Hair in Tomb Raider
 
Realtime Per Face Texture Mapping (PTEX)
Realtime Per Face Texture Mapping (PTEX)Realtime Per Face Texture Mapping (PTEX)
Realtime Per Face Texture Mapping (PTEX)
 
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
Using GPUs for Collision detection, Recent Advances in Real-Time Collision an...
 
Parallel Futures of a Game Engine
Parallel Futures of a Game EngineParallel Futures of a Game Engine
Parallel Futures of a Game Engine
 
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by  ...
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by ...
 
Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!Your Game Needs Direct3D 11, So Get Started Now!
Your Game Needs Direct3D 11, So Get Started Now!
 
計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?計算力学シミュレーションに GPU は役立つのか?
計算力学シミュレーションに GPU は役立つのか?
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 
Optimizing Games for Mobiles
Optimizing Games for MobilesOptimizing Games for Mobiles
Optimizing Games for Mobiles
 
GFX Part 4 - Introduction to Texturing in OpenGL ES
GFX Part 4 - Introduction to Texturing in OpenGL ESGFX Part 4 - Introduction to Texturing in OpenGL ES
GFX Part 4 - Introduction to Texturing in OpenGL ES
 

Similar to Commandlistsiggraphasia2014 141204005310-conversion-gate02

OpenGL 4 for 2010
OpenGL 4 for 2010OpenGL 4 for 2010
OpenGL 4 for 2010
Mark Kilgard
 
Siggraph 2016 - Vulkan and nvidia : the essentials
Siggraph 2016 - Vulkan and nvidia : the essentialsSiggraph 2016 - Vulkan and nvidia : the essentials
Siggraph 2016 - Vulkan and nvidia : the essentials
Tristan Lorach
 
NVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and TransparencyNVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and Transparency
Mark Kilgard
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in Frostbite
Electronic Arts / DICE
 
Advanced Graphics Workshop - GFX2011
Advanced Graphics Workshop - GFX2011Advanced Graphics Workshop - GFX2011
Advanced Graphics Workshop - GFX2011
Prabindh Sundareson
 
GTC 2009 OpenGL Barthold
GTC 2009 OpenGL BartholdGTC 2009 OpenGL Barthold
GTC 2009 OpenGL Barthold
Mark Kilgard
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Johan Andersson
 
graphics processing unit ppt
graphics processing unit pptgraphics processing unit ppt
graphics processing unit ppt
Nitesh Dubey
 
GFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
GFX Part 1 - Introduction to GPU HW and OpenGL ES specificationsGFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
GFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
Prabindh Sundareson
 
DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3
Electronic Arts / DICE
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...Johan Andersson
 
Unite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platformsUnite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platforms
ナム-Nam Nguyễn
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Akihiro Hayashi
 
Gpu
GpuGpu
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
Owen Wu
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
Steve Severance
 
Lecture 15 ryuzo okada - vision processors for embedded computer vision
Lecture 15   ryuzo okada - vision processors for embedded computer visionLecture 15   ryuzo okada - vision processors for embedded computer vision
Lecture 15 ryuzo okada - vision processors for embedded computer vision
mustafa sarac
 
OpenGL 3.2 and More
OpenGL 3.2 and MoreOpenGL 3.2 and More
OpenGL 3.2 and More
Mark Kilgard
 
D3 D10 Unleashed New Features And Effects
D3 D10 Unleashed   New Features And EffectsD3 D10 Unleashed   New Features And Effects
D3 D10 Unleashed New Features And EffectsThomas Goddard
 

Similar to Commandlistsiggraphasia2014 141204005310-conversion-gate02 (20)

OpenGL 4 for 2010
OpenGL 4 for 2010OpenGL 4 for 2010
OpenGL 4 for 2010
 
Siggraph 2016 - Vulkan and nvidia : the essentials
Siggraph 2016 - Vulkan and nvidia : the essentialsSiggraph 2016 - Vulkan and nvidia : the essentials
Siggraph 2016 - Vulkan and nvidia : the essentials
 
NVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and TransparencyNVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and Transparency
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in Frostbite
 
Advanced Graphics Workshop - GFX2011
Advanced Graphics Workshop - GFX2011Advanced Graphics Workshop - GFX2011
Advanced Graphics Workshop - GFX2011
 
GTC 2009 OpenGL Barthold
GTC 2009 OpenGL BartholdGTC 2009 OpenGL Barthold
GTC 2009 OpenGL Barthold
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
 
graphics processing unit ppt
graphics processing unit pptgraphics processing unit ppt
graphics processing unit ppt
 
GFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
GFX Part 1 - Introduction to GPU HW and OpenGL ES specificationsGFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
GFX Part 1 - Introduction to GPU HW and OpenGL ES specifications
 
DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
 
Unite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platformsUnite 2013 optimizing unity games for mobile platforms
Unite 2013 optimizing unity games for mobile platforms
 
FIR filter on GPU
FIR filter on GPUFIR filter on GPU
FIR filter on GPU
 
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator...
 
Gpu
GpuGpu
Gpu
 
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
 
Haskell Accelerate
Haskell  AccelerateHaskell  Accelerate
Haskell Accelerate
 
Lecture 15 ryuzo okada - vision processors for embedded computer vision
Lecture 15   ryuzo okada - vision processors for embedded computer visionLecture 15   ryuzo okada - vision processors for embedded computer vision
Lecture 15 ryuzo okada - vision processors for embedded computer vision
 
OpenGL 3.2 and More
OpenGL 3.2 and MoreOpenGL 3.2 and More
OpenGL 3.2 and More
 
D3 D10 Unleashed New Features And Effects
D3 D10 Unleashed   New Features And EffectsD3 D10 Unleashed   New Features And Effects
D3 D10 Unleashed New Features And Effects
 

Recently uploaded

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
nkrafacyberclub
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 

Recently uploaded (20)

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptxSecstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 

Commandlistsiggraphasia2014 141204005310-conversion-gate02

  • 1. Siggraph Asia 2014 Tristan Lorach Manager of Devtech for Professional Visualization group OPENGL NVIDIA "COMMAND-LIST": "APPROACHING ZERO DRIVER OVERHEAD"
  • 2. 2 Siggraph Asia 2014 GPU Scalability GPUs are extremely powerful But only if used correctly Mix of Application and Graphic API (Driver) responsibility Increase amount of work per batch (Job) Minimize CPUGPU interactions Lower Memory traffic Lower API calls: Batch things together Factorize Data Re-use data uploaded to Video memory (instancing…)
  • 3. 3 Siggraph Asia 2014 Use of the GPU Games ~1500 to 3000 Drawcalls for a scene Intensive use of Multi-layer image processing over the scene Heavy shaders CAD/FCC/professional applications Hard to batch user’s works (CAD applications) ~10,000 to… 300,000 Drawcalls for a scene: heavy CPU workload for our driver Shaders simpler than games (but catching-up these days...) Post-Processing more and more used (SSAO)
  • 4. 4 Siggraph Asia 2014 Use of the GPU New Graphic APIs are trying to address these concerns Khronos (Announced it at Siggraph Vancouver 2014) Metal Mantle DX12 All propose better ways to issue commands and render-states But can we improve OpenGL to go the same path ? Yes: NV_command_list
  • 5. 5 Siggraph Asia 2014 OpenGL OpenGL is ~25 years old (!!) 1.x: fixed function 2.x: Shading language (GLSL) 3.x/4.x: more as a massive compute system More graphic stages (Geom.Shader; Tessellation) High flexibility & efficiency on memory accesses High programability Design "flaws": still “State Machine”; ServerClient; poor multi-threading management 1992 2012
  • 6. 6 Siggraph Asia 2014 Challenge of Issuing Commands Issuing drawcalls and state changes can be a real bottleneck  650,000 Triangles  68,000 Parts  ~ 10 Triangles per part  3,700,000 Triangles  98 000 Parts  ~ 37 Triangles per part  14,338,275 Triangles/lines  300,528 drawcalls (parts)  ~ 48 Triangles per part App+driver GPU GPU idle CPU Excessive Work from App & Driver On CPU ! ! courtesy of PTC
  • 7. 7 Siggraph Asia 2014 Big Picture – Typical Case Vertex Puller (IA) Vertex Shader TCS (Tessellation) TES (Tessellation) Tessellator Geometry Shader Transform Feedback Rasterization Fragment Shader Per-Fragment Ops Framebuffer Tr. Feedback buffer Uniform Block Texture Fetch Image Load/Store Atomic Counter Shader Storage Element buffer (EBO) Draw Indirect Buffer Vertex Buffer (VBO) Front-End (decoder) Cmd bundles OpenGL Driver Application Push-Buffer (FIFO) cmds FBO resources (Textures / RB) 64 bits pointersHandles (IDs) Id  64 bits Addr. OpenGL Commands OpenGL resources
  • 8. 8 Siggraph Asia 2014 Application Token-buffer OpenGL Driver Big Picture Vertex Puller (IA) Vertex Shader TCS (Tessellation) TES (Tessellation) Tessellator Geometry Shader Transform Feedback Rasterization Fragment Shader Per-Fragment Ops Framebuffer Tr. Feedback buffer Uniform Block Texture Fetch Image Load/Store Atomic Counter Shader Storage Element buffer (EBO) Draw Indirect Buffer Vertex Buffer (VBO) Front-End (decoder) Cmd bundles Push-Buffer (FIFO) cmds FBO resources (Textures / RB) OpenGL Commands resources 64 bits Pointers (bindless) 64 bits GPU Address Offload cmd bundles creation to the App.
  • 9. 9 Siggraph Asia 2014 Application Command-list Token-buffer OpenGL Driver Big Picture Vertex Puller (IA) Vertex Shader TCS (Tessellation) TES (Tessellation) Tessellator Geometry Shader Transform Feedback Rasterization Fragment Shader Per-Fragment Ops Framebuffer Tr. Feedback buffer Uniform Block Texture Fetch Image Load/Store Atomic Counter Shader Storage Element buffer (EBO) Draw Indirect Buffer Vertex Buffer (VBO) Front-End (decoder) Cmd bundles Push-Buffer (FIFO) cmds FBO resources (Textures / RB) OpenGL Commands Token-buffers + state objects resources 64 bits Pointers (bindless) State Object More work for FE – but fast !
  • 10. 10 Siggraph Asia 2014 Demo – Quadro K5000 CAD Car model 14,338,275 Primitives 300,528 drawcalls 348,862 attribute update 12,004 uniform update Set of CAD models together 29,344,075 Primitives 26,144 drawcalls 18,371 attribute update 13,632 uniform update
  • 11. 11 Siggraph Asia 2014 DemoS – Quadro K5000 CAD Car model K5000: 920,363,200 primitives/S 19,290,668 drawcalls/S
  • 12. 12 Siggraph Asia 2014 DemoS – Quadro K5000 Set of CAD models together K5000 1,211,684,096 primitives/S 1,079,545 drawcalls/S
  • 13. 13 Siggraph Asia 2014 DemoS – Maxwell CAD Car model K5000: 920,363,200 primitives/S 19,290,668 drawcalls/S Maxwell GM204 (GeForce): 1,782,257,920 primitives/S 37,355,848 drawcalls/S Drawcall time : 8 Micro- seconds on CPU (!) Set of CAD models together K5000 1,211,684,096 primitives/S 1,079,545 drawcalls/S Maxwell GM204 (GeForce) 3,012,795,392 primitives/S 2,684,239 drawcalls/S Drawcall time: 42 Micro- seconds on CPU !
  • 14. 14 Siggraph Asia 2014 More Performances Another test: always using bindless UBO and VBO CPU-Emulation: Token stream executed as CPU "switch-case" Strategies Un-grouped: ~ 1 MB, 79,000 Tokens, ~68,000 drawcalls Material-grouped: ~300 KB, 22,000 Tokens, ~11,000 drawcalls Strategy Draw time Material-grouped 0.73 HW TOKEN-Buffers 0.56 Technique GPU (ms) 1.4 1.14 x ~0 BIG x CPU (ms) Draw time un-grouped 2.4 1.1 GPU (ms) 4.5 1.6 x ~0 BIG x CPU (ms) CPU-emulated TOKEN-buffers glBindBufferRange + drawcalls 0.73 1.6 3.5 7.2 Preliminary results K6000
  • 15. 15 Siggraph Asia 2014 More Performances 5 000 shader changes: toggling between two shaders in „shaded & edges“ 5 000 fbo changes: similar as above but with fbo toggle instead of shader Almost no additional cost compared to rendering without fbo changes Timing GPU (ms) CPU (ms) CPU-emulated TOKEN-buffers 12.7 15.1 2.9 4.3 x 0.4 37 xHW TOKEN-buffers 2.8 4.5 x 0.005 BIG xCommand-Lists Timer GPU (ms) CPU (ms) CPU-emulated TOKEN-buffers 60.0 60.0 1.8 33 x 0.9 66 xHW TOKEN-buffers 1.7 35 x 0.022 BIG xCommand-Lists Preliminary results on K5000
  • 16. 16 Siggraph Asia 2014 GPU Virtual Memory Bindless Technology What is it about? Work from native GPU pointers/handles (NVIDIA pioneered this technology) A lot less CPU work (memory hopping, validation...) Allow GPU to use flexible data structures Bindless Buffers Vertex & Global memory since Tesla Generation (CUDA capable) Bindless Textures Since Kepler Bindless Constants (UBO) New driver feature, support for Fermi and above Bindless plays a central role for Command-List Vertex Puller (IA) Vertex Shader TCS (Tessellation) TES (Tessellation) Tessellator Geometry Shader Transform Feedback Rasterization Fragment Shader Per-Fragment Ops Framebuffer Uniform Block Texture Fetch Element buffer (EBO) Vertex Buffer (VBO) Front-End (decoder) Push buffer 64bitsaddress Attr.s Idx Uniform Block Send Ptrs
  • 17. 17 Siggraph Asia 2014 #define UBA UNIFORM_BUFFER_ADDRESS_NV UpdateBuffers(); glEnableClientState(UNIFORM_BUFFER_UNIFIED_NV); glBufferAddressRangeNV (UBA, 0, addrView, viewSize); foreach (obj in scene) { ... // glBindBufferRange (UBO, 1, uboMatrices, obj.matrixOffset, maSize); glBufferAddressRangeNV(UBA, 1, addrMatrices + obj.matrixOffset, maSize); foreach ( batch in obj.primitive_group_material) { // glBindBufferRange (UBO, 2, uboMaterial, batch.materialOffset, mtlSize); glBufferAddressRangeNV(UBA, 2, addrMaterial + batch.materialOffset, mtlSize); ... } } Example On Using Bindless UBO New pointer for UBO#1 updated per Object New pointer for UBO#2 updated per Primitive group pointer for UBO#0 updated once for all objects regular UBO binds are now ignored!
  • 18. 18 Siggraph Asia 2014 Pointers must be aligned So do Ptr Offsets glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &offsetAlignment); Normally: 256 bytes  gaps between each item Try to fit as much data as possible… Not the case if passing an index for array access (MDI + “Base-instance”) But requires special GLSL code (indexing) Bindless And Memory Alignement GPU Virtual Memory 64bitsaddress Uniform Material 0 Uniform Material 1 Uniform Material 2 … glBufferAddressRangeNV (UBA, 2, addrMaterial + batch.materialOffset , mtlSize); 256b Uniform array materials[] Material 0 Material 1 Material 2 … Uniform …
  • 19. 19 Siggraph Asia 2014 NV_command_list Key Concepts 1. Tokenized Rendering: some state changes and draw commands are encoded into binary data stream Depends on bindless technology 2. State Objects Whole OpenGL States (program, blending...) captured as an object Allows pre-validation of state combinations, later reuse of objects is very fast 3. Command List Object „Display-list“ paradigm but more flexible: buffer are outside (referenced by address), so content can still be modified (matrics, vertices...)
  • 20. 20 Siggraph Asia 2014 UpdateBuffers(); glBindBufferBase (UBO, 0, uboView); foreach (obj in scene) { // redundancy filter for these (if (used != last)... glBindVertexBuffer (0, obj.geometry->vbo, 0, vtxSize); glBindBuffer (ELEMENT, obj.geometry->ibo); glBindBufferRange (UBO, 1, uboMatrices, obj.matrixOffset, maSize); // iterate over cached material groups foreach ( batch in obj.materialGroups) { glBindBufferRange (UBO, 2, uboMaterial, batch.materialOffset, mtlSize); glMultiDrawElements (...); } } Typical Scene Drawing Example ~ 2 500 api drawcalls ~11 000 drawcalls ~55 triangles per call
  • 21. 21 Siggraph Asia 2014 UpdateBuffers(); glBindBufferBase (UBO, 0, uboView); foreach (obj in scene) { // redundancy filter for these (if (used != last)... glBindVertexBuffer (0, obj.geometry->vbo, 0, vtxSize); glBindBuffer (ELEMENT, obj.geometry->ibo); glBindBufferRange (UBO, 1, uboMatrices, obj.matrixOffset, maSize); // iterate over all parts individually foreach ( part in obj.parts) { if (part.material != lastMaterial){ glBindBufferRange (UBO, 2, uboMaterial, part.materialOffset, mtlSize); } glDrawElements (...); } } Typical Scene Drawing Example ~68 000 drawcalls ~10 triangles per call
  • 22. 22 Siggraph Asia 2014 Scene Drawing Converted To Token Buffer glDrawCommandsNV (GL_TRIANGLES, tokenBuffer , offsets[], sizes[], count); // {0}, {bufferSize}, 1 foreach (obj in scene) { glBufferAddressRangeNV(VERTEX.., 0, obj.geometry->addrVBO, ...); glBufferAddressRangeNV(ELEMENT..., 0, obj.geometry->addrIBO, ...); glBufferAddressRangeNV(UNIFORM..., 1, addrMatrices ...); foreach ( batch in obj.materialGroups) { glBufferAddressRangeNV(UNIFORM, 2, addrMaterial ...); glMultiDrawElements(...) } } becomes a single drawcall replaces 80k calls to GL (for graphic-card model) Token buffer Set Attr#0 on VBO address … Set Attr#1 on VBO address … Set Elements on EBO address … Uniform Matrix on UBO address … Uniform Material on UBO address … DrawElements Uniform Material on UBO address … DrawElements Obj#1 Obj#2 … For all objects …
  • 23. 23 Siggraph Asia 2014 Tokens-buffers are tightly packed structs in linear memory Token Buffer Structures TokenUbo { GLuint header; UniformAddressCommandNV { GLushort index; GLushort stage; GLuint64 address; } cmd; } TokenVbo { GLuint header; AttributeAddressCommandNV { GLuint index; GLuint64 address; } cmd; } TokenIbo { GLuint header; ElementAddressCommandNV { GLuint64 address; GLuint typeSizeInByte; } cmd; } TokenDrawElements { Gluint header; DrawElementsCommandNV { GLuint count; GLuint firstIndex; GLuint baseVertex; } cmd; } Tokenbuffer SetAttr#0on VBOaddress… SetAttr#1on VBOaddress… SetElementson EBOaddress… UniformMatrix onUBOaddress… UniformMaterial onUBOaddress… DrawElements UniformMaterial onUBOaddress… DrawElements … UniformMaterial onUBOaddress… DrawElements UniformMaterial onUBOaddress… DrawElements UniformMaterial onUBOaddress… DrawElements UniformMaterial onUBOaddress… DrawElements UniformMaterial onUBOaddress… DrawElements
  • 24. 24 Siggraph Asia 2014 Tokenized Rendering Token Headers are GPU commands, followed by arguments FRONTFACE_COMMAND_NV BLEND_COLOR_COMMAND_NV STENCIL_REF_COMMAND_NV LINE_WIDTH_COMMAND_NV POLYGON_OFFSET_COMMAND_NV ALPHA_REF_COMMAND_NV VIEWPORT_COMMAND_NV SCISSOR_COMMAND_NV ELEMENT_ADDRESS_COMMAND_NV ATTRIBUTE_ADDRESS_COMMAND_NV UNIFORM_ADDRESS_COMMAND_NV DRAW_ELEMENTS_COMMAND_NV DRAW_ARRAYS_COMMAND_NV DRAW_ELEMENTS_STRIP_COMMAND_NV DRAW_ARRAYS_STRIP_COMMAND_NV DRAW_ELEMENTS_INSTANCED_COMMAND_NV DRAW_ARRAYS_INSTANCED_COMMAND_NV TERMINATE_SEQUENCE_COMMAND_NV NOP_COMMAND_NV Final list might change Render States Index Buffer; Position Attribute; Normals; Texcoords... Matrices; Material data... Draw Commands ! Special Commands (see later)
  • 25. 25 Siggraph Asia 2014 Tokenized Rendering Example: UBO Address assignment structure FRONTFACE_COMMAND_NV BLEND_COLOR_COMMAND_NV STENCIL_REF_COMMAND_NV LINE_WIDTH_COMMAND_NV POLYGON_OFFSET_COMMAND_NV ALPHA_REF_COMMAND_NV VIEWPORT_COMMAND_NV SCISSOR_COMMAND_NV ELEMENT_ADDRESS_COMMAND_NV ATTRIBUTE_ADDRESS_COMMAND_NV UNIFORM_ADDRESS_COMMAND_NV DRAW_ELEMENTS_COMMAND_NV DRAW_ARRAYS_COMMAND_NV DRAW_ELEMENTS_STRIP_COMMAND_NV DRAW_ARRAYS_STRIP_COMMAND_NV DRAW_ELEMENTS_INSTANCED_COMMAND_NV DRAW_ARRAYS_INSTANCED_COMMAND_NV TERMINATE_SEQUENCE_COMMAND_NV NOP_COMMAND_NV TokenUbo { GLuint header; UniformAddressCommandNV { GLushort index; GLushort stage; GLuint64 address; } cmd; } Example Real Token to be queried header = glGetCommandHeaderNV( UNIFORM_ADDRESS_COMMAND_NV, Sizeof(TokenUbo) ) One Shading stage to target (not ‘program’) Stage = glGetStageIndexNV(S) (S: STAGE_VERTEX|GEOMETRY|FRAGMENT|TESS_CONTROL/EVAL) Hardware Index *not* Uniform Program index
  • 26. 26 Siggraph Asia 2014 TOKENIZED RENDERING Allows rendering of a mix of LIST, STRIP, FAN, LOOP modes in one go Pass „basic type“ as Token, use „mode“ for more in "Instanced" drawing TokenDrawElements { Gluint header; DrawElementsCommandNV { GLuint count; GLuint firstIndex; GLuint baseVertex; } cmd; } TokenDrawElementsInstanced { Gluint header; DrawElementsInstancedCommandNV { GLuint mode; GLuint count; GLuint instanceCount; GLuint firstIndex; GLuint baseVertex; GLuint baseInstance; } cmd; } LIST, STRIP, FAN, LOOP types (GL_LINE_LOOP...) DRAW_ELEMENTS_COMMAND_NV DRAW_ARRAYS_COMMAND_NV DRAW_ELEMENTS_STRIP_COMMAND_NV DRAW_ARRAYS_STRIP_COMMAND_NV glGetCommandHeaderNV() DRAW_ELEMENTS_INSTANCED_COMMAND_NV DRAW_ARRAYS_INSTANCED_COMMAND_NV glGetCommandHeaderNV()
  • 27. 27 Siggraph Asia 2014 TERMINATE_SEQUENCE: Occlusion Culling Now overcomes deficit of previous methods http://on-demand.gputechconf.com/gtc/2014/presentations/S4379-opengl-44-scene-rendering- techniques.pdf Enqueue all tokens for entire object Shader to Check Visibility; Shader for SCAN operation; Add TERMINATE_SEQUENCE_COMMAND_NV to end Technique Draw time GROUPED Timing GPU (ms) CPU (ms) 6.3 2.6 x 0.9 6 xCULL Old Bindless MDI (*) CULL NEW TOKEN native 0.2 27 x glBindBufferRange + drawcalls 16.8 5.4 6.2 2.7 x 6.3 0.85 xCULL Old Readback (stalls pipe) (*) K5000, Preliminary results, (*) taken from slightly differnt framework HW TOKEN-buffers ~0 BIG x No instancing! (synthetic test, data replication) SCENE: materials: 138 objects: 13,032 geometries: 3,312 parts: 789,464 triangles: 29,527,840 vertices: 27,584,376 16.8 5.3 3.1 x
  • 28. 28 Siggraph Asia 2014 Just a little... “commandBindableNV” turns binding IDs to Hardware IDs Texturing must go through bindless ARB_bindless_texture Regular Texture binding Turned Off Changing UBO Ptr Addresses == texture changes (here ‘myBuf’) Does Shaders Need Special Code ? #extension GL_ARB_bindless_texture : require #extension GL_NV_command_list : enable #if GL_NV_command_list layout(commandBindableNV) uniform; #endif layout(std140,binding=2) uniform myBuf { sampler2D mySampler; //or: in int whichSampler; }; layout(std140,binding=1) uniform samplers { sampler2D allSamplers[NUM_TEXTURES]; }; … c = texture(mySampler, tc); //or c = texture(allSamplers [ whichSampler ] , tc); … Example:
  • 29. 29 Siggraph Asia 2014 State Objects StateObject Encapsulates majority of state (fbo format, active shader, blend, depth ...) State Object immutable: gives more control over validation cost no bindings captured: bindless used instead; textures passed via UBO Primitive type is a state: participates to important validation work Cannot be put in the token buffer glCaptureStateNV(stateobject, GL_TRIANGLES ); Primitive type as argument: because it is part of states
  • 30. 30 Siggraph Asia 2014 Render entire scenes with different shaders/fbos... in one go Driver caches state transitions ! Draw With State Objects glDrawCommandsStatesNV( tokenBuffer, offsets[], sizes[], states[], fbos[], count); Token buffer VBO address … VBO address … EBO address … Uniform Matrix Uniform Material DrawElements Uniform Material DrawElements … EBO address … Uniform Matrix Uniform Material DrawElements Uniform Material DrawElements FBO 1 FBO 2 fbos[] FBO 1 FBO 1 State Obj 1 State Obj 2 State Obj 1 State Obj 4 Count = 4 States[] offsets[] sizes[0]sizes[1]sizes[2]sizes[3]
  • 31. 31 Siggraph Asia 2014 Driver bakes „Diff“ between states and apply only the difference Pseudo Code of glDrawCommandsStatesNV for i < count { if (i == 0) set state from states[i]; else set state transition states[i-1] to states[i] if (fbo[i]) { // fbo[i] must be compatible with states[i].fbo glBindFramebuffer( fbo[i] ) } else glBindFramebuffer( states[i].fbo ) ProcessCommandSequence(...tokenBuffer, offsets[i], sizes[i]) }
  • 32. 32 Siggraph Asia 2014 Rendering must always happen in a FBO Backbuffer Ptr. Address is un-reliable (dynamic re-alloc when resizing…) Use glBlitFramebuffer() for FBO  final Backbuffer result FBO Resources Must be Resident (glMakeTextureHandleResidentARB()) Resources can be replaced/swapped if they keep the same base configuration (attachment formats, # drawbuffers...) I.e. Same configuration as captured FBO settings from stateobject This is what happens when resizing: Detach & Destroy resource  create new ones  attach them to FBO(s) Frame-Buffer Objects And State Objects
  • 33. 33 Siggraph Asia 2014 Combine multiple token buffers & states into a CommandList object glCreateCommandListsNV(N, &list) glCommandListSegmentsNV(list, segs) Convenient for concatenation glListDrawCommandsStatesClientNV(list, segment, token-buffer-ptrs, token-buffer- sizes, states, FBOs, count) glCompileCommandListNV(list); glCallCommandListNV(list) Command-List Object System mem. VBO address … VBO address … EBO address … Uniform Matrix Uniform Material DrawElements … Uniform Material DrawElements FBO State Object System mem. VBO address … VBO address … EBO address … Uniform Matrix Uniform Material DrawElements … FBO State Object System mem. VBO address … VBO address … Uniform Matrix Uniform Material DrawArrays … Uniform Material DrawArrays FBO State Object … Command-List Segment #0 Segment #1
  • 34. 34 Siggraph Asia 2014 Command-List Object Less flexibilty compared to token buffer Token buffers taken from client pointers (CPU memory) Driver will optimize these token buffers and related state-objects Optimize State transitions; token cmds for the GPU Front-End Resulting command-list is immutable If any resource pointer changed, the command-list must be recompiled But still rather flexible possible to change content of referenced buffers! (matrices, materials, vertices...) Animation; skinning... No need for command-list recompilation
  • 35. 35 Siggraph Asia 2014 Command-List doesn’t pretend to solve any OpenGL multi-threading Multi-threading can help if complex CPU work going-on update of Vertex / Element buffers: transform. Matrices; skinning… Update of Token Buffers: adding/removing characters in a game scene… OpenGL & Multithreading issues: OpenGL is bad at sharing its access from various threads Make-Current or Multiple Contexts aren’t good solutions Prefer to keep context ownership to one main thread OpenGL And Multi-threading ?
  • 36. 36 Siggraph Asia 2014 Token Buffers and Vertex/Elements/Uniform Buffers (VBO/UBO) Token Buffer Objects; VBOs and UBOs owned by OpenGL Multi-threading can be used for data update if: Work on CPU system memory; then push data back to OpenGL (glBufferSubData) Or Request for a pointer (glMapBuffers); then work with it from the thread State Objects: State Object and their 'Capture' must be handled in OpenGL context No way to do it on a separate threads that don’t have OpenGL context Multi-threading And Command-List
  • 37. 37 Siggraph Asia 2014 Multi-threading Use-Case Example #1 Thread #0 OpenGL Ctxt Thread #1 Thread #1 Submit States PushAttrib() set states & and capture PopAttribs() Get State ID PushAttrib() set states & and capture PopAttribs() Submit States unmap Token buf. ask for Token buf. Ptr Build state list ask for Token buf. Ptr unmap Token buf. Build token cmds Get State ID Build token cmds Build token cmds ask for Token buf. Ptr unmap Token buf. ask for Token buf. Ptr ... ask for Token buf. Ptr Build token cmds unmap Token buf. Build token cmds Build state list
  • 38. 38 Siggraph Asia 2014 Multi-threading Use-Case Example #2 Split state capture from Token-buffer creation Thread 1 Thread 2 Thread 3 Build Token buffers in CPU memory Thread 0 OpenGL Ctxt Walk through whole scene and build State-Objects Receive Token-Buffers and issue them as Token Buffer Objects to OpenGL
  • 39. 39 Siggraph Asia 2014 Migration Steps All rendering via FBO (statecapture doesn‘t support default backbuffer) No legacy states in GLSL use custom uniforms no built-in uniforms (gl_ModelView...) use glVertexAttribPointer (no glTexCoordPointer etc.) No classic uniforms, pack them all in UBO ARB_bindless_texture for texturing Bindless Buffers ARB_vertex_attrib_binding combined with NV_vertex_buffer_unified_memory (VBUM)... Organize for StateObject reuse state capture expensive: Capture them at init. time; Factorize their use in the scene
  • 40. 40 Siggraph Asia 2014 Migration Tips Vertex Attributes and bindless VBO http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117- OpenGL-Scene-Rendering-Techniques.pdf (slide 11-16) GLSL // classic attributes ... normal = gl_Normal; gl_Position = gl_Vertex; // generic attributes // ideally share this definition across C and GLSL #define VERTEX_POS 0 #define VERTEX_NORMAL 1 in layout(location= VERTEX_POS) vec4 attr_Pos; in layout(location= VERTEX_NORMAL) vec3 attr_Normal; ... normal = attr_Normal; gl_Position = attr_Pos;
  • 41. 41 Siggraph Asia 2014 Migration Tips UBO Parameter management http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117- OpenGL-Scene-Rendering-Techniques.pdf ( 18-27, 44-47) Ideally group by frequency of change // classic uniforms uniform samplerCube viewEnvTex; uniform vec4 materialColor; uniform sampler2D materialTex; ... // UBO usage, bindless texture inside UBO, grouped by change layout(std140, binding=0, commandBindableNV) uniform view { samplerCube texViewEnv; }; layout(std140, binding=1, commandBindableNV) uniform material { vec4 materialColor; sampler2D texMaterialColor; }; ...
  • 42. 42 Siggraph Asia 2014 OpenGL Devtech 'Proviz' samples soon available on GitHub ! check https://github.com/nvpro-samples on a regular basis working on ramping-up https://developer.nvidia.com/samples & https://developer.nvidia.com/professional-graphics command-lists examples to come here this Dec. 2014 Many other samples will be pushed here as standalone repositories Beta Driver for Command-Lists: Release 346 for Dec.2014 Special Thanks to Pierre Boudier and Christoph Kubisch Questions/Feedback: tlorach@nvidia.com / ckubisch@nvidia.com Thanks!