Don’t Throw it all Away: Efficient
Buffer Management
John McDonald
Developer Technology, NVIDIA Corporation
What are we talking about?
● General Performance/Functional Guidance
● CPU-GPU Sync Points
● Buffer Usage Patterns
● Contention-Free Buffers
● Constant Buffers
● Performance Investigation
“Buffers” is really generic…
● Vertex Buffers
● Index Buffers
● Constant Buffers
General Guidance
● D3D11 >> D3D9 (generally)
● It’s much harder to hit the ultra-slow path (aka CPU-GPU Sync Points)
● Reduce your API calls where possible
● Batch up buffer updates
● Alignment matters! (16-byte, please)
● Aligned copies can be ~30x faster
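As a small illustration of the alignment advice, one way to keep offsets and copy sizes on 16-byte boundaries is to round them up before use. This is only a sketch: the helper names `AlignUp16` and `IsAligned16` are hypothetical and not part of D3D11.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round n up to the next multiple of 16.
inline std::uint32_t AlignUp16(std::uint32_t n)
{
    assert(n <= 0xFFFFFFF0u);  // guard against wrap-around
    return (n + 15u) & ~15u;
}

// Hypothetical helper: true if the pointer sits on a 16-byte boundary.
inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}
```

A sub-allocator would apply `AlignUp16` to every requested size so that each chunk it hands out starts aligned, provided the buffer base itself is aligned.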
More General Guidance
● ID3D11Device is thread-safe (it will grab a mutex for you), but each ID3D11DeviceContext can only be called from one thread at a time
● This is the source of many crashes blamed on the driver
● UpdateSubresource requires more CPU time
● When possible, prefer Map/Unmap
● D3D11 Debug Runtime is awesome!
● Please use it, ensure you are running clean
CPU-GPU Sync Points
● CPU-GPU Sync Points are caused when the CPU
needs the GPU to complete work before an API
call can return
● These make us sad
CPU-GPU sync point examples
● Explicit
● Spin-lock waiting for query results
● Readback of Framebuffer you just rendered to
● Implicit (potential sync points)
● GPU Memory Allocation after Deallocation
● Buffer Rename operation (MAP_DISCARD) after
deallocation
● Immediate update of a buffer still in use
Why are they bad?
● Ideal frame time should be max(CPU time, GPU
time)
● CPU-GPU Sync point turns this into CPU Time +
GPU Time.
[Figure: CPU and GPU timelines between Presents, shown for the ideal case and with a sync point]
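The two frame-time cases can be written out as arithmetic. This is a simplified model, assuming the CPU and GPU either overlap completely or are fully serialized by the sync point; the function names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>

// Ideal case: CPU and GPU work overlap, so the slower of the two sets the frame time.
inline double IdealFrameMs(double cpuMs, double gpuMs)
{
    assert(cpuMs >= 0.0 && gpuMs >= 0.0);
    return std::max(cpuMs, gpuMs);
}

// Worst case with a sync point: one processor waits for the other, so the times add.
inline double SyncedFrameMs(double cpuMs, double gpuMs)
{
    assert(cpuMs >= 0.0 && gpuMs >= 0.0);
    return cpuMs + gpuMs;
}
```

With 10 ms of CPU work and 10 ms of GPU work, the ideal frame takes 10 ms and the fully serialized frame takes 20 ms, which is the "halve your frame rate" case.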
Really? That bad?
● One bad sync point can halve your frame rate
● Even worse: the more sync points you have, the
harder they are to find.
● Performance will just seem generally slow
● The badness depends, in part, on where in the
frame the sync-point occurs
● Generally, the later the sync point, the worse it is
● Early sync-points are also bad if your workload is very
lopsided towards either the CPU or the GPU
Check your middleware
● Middleware is generally written in a vacuum
● What works best in the small might not scale well
● Especially check for CPU-GPU sync points
A quick D3D9 interlude
● CPU-GPU sync points are trivial to introduce in
D3D9
● Locking any buffer in D3D9 with flags=0 is a
virtually guaranteed CPU-GPU Sync point if that
buffer is still in use. 
Buffer Usage Patterns
[Figure: buffer types ordered by how often they are updated, least to most]
● “Forever”: Level BSPs
● Long Lived: Character Geometry
● Transient: UI, Text (New!)
● Temporary: Particle Systems (Streaming)
● Constants: Shader Parameters
“Forever” Buffers
● Useful for geometry that is loaded
once
● Ex: Level BSPs, loaded behind a load
screen
● Don’t use this for streaming data
● Hitching during allocation is possible/likely
● IMMUTABLE flag at creation time
● Cannot update these!
Long Lived Buffers
● Data that is streamed in from disk,
but is expected to last for “a while”
● Ex: Character geometry
● Reuse these; stream into them
● DEFAULT flag at creation time
● UpdateSubresource to update
Temporary buffers
● Fire-and-forget data
● E.g. Particle systems
● Almost certainly lives in system
RAM
● DYNAMIC flag at create time
● Prefer Map/Unmap to update these
● UpdateSubresource involves an extra copy
Constant Buffers
● These are different from other
buffers in D3D11.
● The GPU can deal with many of
them in flight at once
● Create with DYNAMIC
● Map/DISCARD to Update
● More on these in a bit
We skipped one…
● Transient Buffers
● New informal class of Buffer
● Used for (e.g.) UI/Text
● Things that are dynamic, but few vertices each—and
may need to be updated on odd schedules
● DYNAMIC flag at creation time
● Transient Buffers are part of a new class of
buffer…
Contention-Free Buffers
Transient Buffer Overview
● Treat Buffer as a Memory Heap,
with a twist
● On CPU, Freed memory available now
● On GPU, Freed memory is available
when GPU is finished with it
● Assume memory is in use until told otherwise
● Determine when GPU must be finished with Freed
memory, then return to the “really free” list
CTransientBuffer
● On Alloc, walk a Free list
looking for best fit
● Data is updated using
Map/NO_OVERWRITE
● Return opaque, immutable
handle
● On Free, record that chunk
was freed—into
RetiredFrames.back()
● Just after present, an
“OnPresent” function is
called
class CTransientBuffer
{
    ID3D11Buffer* mBuffer;
    UINT mLengthBytes;
    ID3D11Device* mOwner;
    vector<CSubAlloc> mFreeList;
    list<RetiredFrame> mRetiredFrames;
public:
    CSubAlloc* Alloc(UINT, void*, ID3D11DeviceContext*);
    void Free(CSubAlloc*);
    void OnPresent(ID3D11DeviceContext*);
    ...
};
CTransientBuffer Guts
struct RetiredFrame
{
    list<CSubAlloc*> mPendingFrees;    // chunks freed during this frame
    ID3D11Query* mFrameCompleteQuery;  // event query issued at Present
};
class CSubAlloc
{
    UINT mOffset;
    UINT mLength;
    ...
};
CTransientBuffer::OnPresent
void CTransientBuffer::OnPresent(ID3D11DeviceContext* _dc)
{
    // First, deal with deletes from this frame
    RetiredFrame& retFrame = mRetiredFrames.back();
    if (!retFrame.mPendingFrees.empty()) {
        retFrame.mFrameCompleteQuery = CreateAndIssueEventQuery(_dc);
        // Append a new (empty) RetiredFrame to mRetiredFrames
        mRetiredFrames.push_back(RetiredFrame());
    }
    // Second, return pending frees to mFreeList, oldest frame first
    while (!mRetiredFrames.empty()) {
        RetiredFrame& frame = mRetiredFrames.front();
        ID3D11Query* query = frame.mFrameCompleteQuery;
        // Stop at the first frame the GPU has not finished
        // (the open frame at the back has no query yet)
        if (!(query && IsQueryComplete(query)))
            break;
        for (CSubAlloc* subAlloc : frame.mPendingFrees) {
            ReallyFree(subAlloc);
        }
        // Drop the frame so its chunks are not freed again next Present
        mRetiredFrames.pop_front();
    }
}
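The deferred-free idea can be tried without a GPU. The following is a minimal, self-contained sketch and not the CTransientBuffer from the talk: `FakeGpu` stands in for event queries, allocation is first fit rather than best fit, and freed chunks are not coalesced. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Chunk { std::uint32_t offset; std::uint32_t length; };

// Stand-in for event queries: the "GPU" has finished every frame <= completedFrame.
struct FakeGpu { std::uint64_t completedFrame = 0; };

class DeferredFreeList
{
public:
    explicit DeferredFreeList(std::uint32_t sizeBytes)
    {
        mFree.push_back(Chunk{0u, sizeBytes});
        mRetired.emplace_back();  // the open frame that collects this frame's frees
    }

    // First-fit allocation; returns false when nothing on the free list is big enough.
    bool Alloc(std::uint32_t length, Chunk& out)
    {
        for (std::size_t i = 0; i < mFree.size(); ++i) {
            if (mFree[i].length >= length) {
                out = Chunk{mFree[i].offset, length};
                mFree[i].offset += length;
                mFree[i].length -= length;
                if (mFree[i].length == 0)
                    mFree.erase(mFree.begin() + static_cast<std::ptrdiff_t>(i));
                return true;
            }
        }
        return false;
    }

    // Only record the free; the GPU may still be reading the chunk.
    void Free(const Chunk& c) { mRetired.back().pending.push_back(c); }

    void OnPresent(const FakeGpu& gpu)
    {
        ++mFrameNumber;
        if (!mRetired.back().pending.empty()) {
            mRetired.back().fence = mFrameNumber;  // "issue the event query"
            mRetired.back().issued = true;
            mRetired.emplace_back();
        }
        // Return chunks from frames the GPU has finished, oldest first.
        while (mRetired.front().issued && gpu.completedFrame >= mRetired.front().fence) {
            for (const Chunk& c : mRetired.front().pending)
                mFree.push_back(c);
            mRetired.pop_front();
        }
    }

    std::uint32_t FreeBytes() const
    {
        std::uint32_t total = 0;
        for (const Chunk& c : mFree) total += c.length;
        return total;
    }

private:
    struct Retired { std::vector<Chunk> pending; std::uint64_t fence = 0; bool issued = false; };
    std::vector<Chunk> mFree;
    std::deque<Retired> mRetired;
    std::uint64_t mFrameNumber = 0;
};
```

The key property is that memory passed to `Free` does not reappear on the free list until a later `OnPresent` observes that the GPU has passed the frame in which it was freed.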
CTransientBuffer Visualized
[Figure sequence: the Free List and Retired Frames at each step]
● Allocating four Buffers
● Nothing
● Deallocating Yellow and Green (an event query, EQ, is issued)
● EQ Returns for Retired Frame
CTransientBuffer: Handling OOM
● Ways to handle Out of Memory on Alloc:
● Spin-lock waiting for RetiredFrame Queries to return
● Allocate a new, larger buffer
● Release current buffer
● Requires a system memory copy to initially fill new buffer
● These will (probably) stall
● But the stall is in your code, so it can be easily logged and/or recorded to adjust and avoid on subsequent runs
Transient Buffer Pattern
● Works in D3D9 as well
● Can be extended and simplified to contention-free Temporary Buffers, too!
● Let’s take a quick look at that.
Discard-Free Temporary Buffers
● Allocate out of Buffer as a circular buffer
● No opaque handle needed
● Remember ending address of the last allocation
● Per frame: Assuming any allocations, issue query
● Later: When query returns, move the end pointer
to indicate additional available space
● Credit: Blizzard’s StarCraft 2 Team (thanks!)
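The circular-buffer bookkeeping above can be sketched without any D3D calls. This is one possible reading of the scheme, not the StarCraft 2 implementation: positions are kept as ever-increasing byte counts, `mEnd` plays the role of the end pointer, and `OnFrameComplete` stands in for a query returning. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

class RingAllocator
{
public:
    explicit RingAllocator(std::uint32_t sizeBytes) : mSize(sizeBytes)
    {
        assert(sizeBytes > 0);
    }

    // Returns false if the space is still (potentially) in use by the GPU.
    bool Alloc(std::uint32_t length, std::uint32_t& outOffset)
    {
        const std::uint64_t offset = mHead % mSize;
        // Skip the tail of the buffer if the allocation would straddle the wrap point.
        const std::uint64_t pad = (offset + length > mSize) ? (mSize - offset) : 0;
        if (mHead + pad + length - mEnd > mSize)
            return false;
        mHead += pad;
        outOffset = static_cast<std::uint32_t>(mHead % mSize);
        mHead += length;
        return true;
    }

    // Per frame: if anything was allocated, remember where the allocations ended.
    void OnPresent(std::uint64_t frameNumber)
    {
        if (mHead != mLastMarked) {
            mMarks.push_back(Mark{frameNumber, mHead});
            mLastMarked = mHead;
        }
    }

    // Later: the query for frameNumber returned, so everything up to its mark is free.
    void OnFrameComplete(std::uint64_t frameNumber)
    {
        while (!mMarks.empty() && mMarks.front().frame <= frameNumber) {
            mEnd = mMarks.front().head;
            mMarks.pop_front();
        }
    }

private:
    struct Mark { std::uint64_t frame; std::uint64_t head; };
    std::uint64_t mSize;
    std::uint64_t mHead = 0;        // total bytes handed out, including wrap padding
    std::uint64_t mEnd = 0;         // total bytes the GPU is known to be finished with
    std::uint64_t mLastMarked = 0;
    std::deque<Mark> mMarks;
};
```

Because nothing is ever written into a region the GPU might still be reading, every update can use a no-overwrite map and no discard is needed.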
Discard-Free Temp Buffer Visualized
[Figure sequence: a circular buffer with Start, End and NextEnd markers, alongside the Retired Frames]
● Start State
● Allocate some stuff
● Go on…
● Queries start to return…
● etc…
Constant Buffers
Constant Buffer Organization
● Group by frequency of update
● The cheapest buffers are the ones you never
update
● You can bind multiple buffers in one call (Reduce
those API calls!)
Proposed Buffer Grouping
● Assuming you are not vertex shading limited
● Don’t solve the travelling salesman in your VS
● Seriously: this isn’t common
Multiple Constant Buffers
● One for per-frame constants (GI values, lights)
● One for per-camera constants (ViewProj matrix,
camera position in world, RT dimensions)
Old HLSL:
    oPos = in.Position * cWorldViewPos;
New HLSL:
    oPos = in.Position * cWorld * cViewPos;
One extra 3x3 matrix multiply in the VS. No biggie.
Multiple Constant Buffers cont’d
● One for per-object constants (World matrix,
dynamic material properties, etc)
● One for per-material constants (if these are
shared—if not then drop them
in with per-object constants)
● Splitting constants this way
eliminates constant updates
for static objects.
Constant Buffer Tricks
● Use shared structs to update when possible
● Struct can be included from both hlsl and C++
● Makes buffer updates trivial!
● Assign them to slots by convention:
● b0: Per-Frame, b1: Per-Camera, etc
● Slot assignment can live in shared header, too.
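A shared header along these lines is one way to realize the shared-struct trick. It is a sketch under assumptions: the macro names, the field names and the C++ stand-in types are all hypothetical, only the C++ branch has been exercised, and the HLSL branch relies on the shader compiler not defining `__cplusplus`.

```cpp
// PerCameraConstants.h (hypothetical): included from both C++ and HLSL.
#ifdef __cplusplus
#include <cassert>
#include <cstddef>
// C++ stand-ins for the HLSL types (assumed to match the shader-side layout).
struct float4   { float x, y, z, w; };
struct float4x4 { float m[4][4]; };
#define CBUFFER_BEGIN(name, reg) struct name {
#define CBUFFER_END };
#else
// HLSL side: the same field list becomes a cbuffer bound to the agreed register.
#define CBUFFER_BEGIN(name, reg) cbuffer name : register(reg) {
#define CBUFFER_END };
#endif

// Slot convention: b0 = per-frame, b1 = per-camera.
#define SLOT_PER_FRAME  0
#define SLOT_PER_CAMERA 1

CBUFFER_BEGIN(PerCameraConstants, b1)
    float4x4 cViewProj;          // ViewProj matrix
    float4   cCameraPosWorld;    // camera position in world space
    float4   cRenderTargetDims;  // RT dimensions
CBUFFER_END

#ifdef __cplusplus
// Keeping every member a multiple of 16 bytes avoids packing surprises.
static_assert(sizeof(PerCameraConstants) % 16 == 0, "constant buffer size must be 16-byte aligned");
static_assert(offsetof(PerCameraConstants, cCameraPosWorld) == 64, "unexpected layout");
#endif
```

On the C++ side, an update then reduces to filling a `PerCameraConstants` value and copying `sizeof(PerCameraConstants)` bytes into the mapped buffer.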
Performance Investigations
Performance Investigation
● Scene from a Typical D3D11 Application
(unreleased)
● 115 Dynamic Vertex Buffer Updates (particles) per
frame
● Total Time: 4.36ms / frame
Operation | Per-Call | Per-Frame
Map/Unmap | 0.036 ms | 3.79 ms
Memcpy | ~0.004 ms | 0.4 ms
Let’s buffer the updates
● All Dynamic Updates during one update
● 1 Map per frame (using MAP_DISCARD)
● Still 115 memcpys (I’m lazy)
● Total Time: 0.267ms / frame (savings: 4.1ms!)
Operation | Per-Call | Per-Frame
Map/Unmap | 0.036 ms | 0.036 ms
Memcpy | ~0.002 ms | 0.231 ms
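The "one Map, many memcpys" pattern can be sketched as a small writer over the mapped pointer. This is an illustration only: `MappedWriter` is a hypothetical name, and `mapped` stands in for the pointer a single Map call would return; no D3D calls are made here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

class MappedWriter
{
public:
    MappedWriter(void* mapped, std::uint32_t capacityBytes)
        : mBase(static_cast<unsigned char*>(mapped)), mCapacity(capacityBytes)
    {
        assert(mapped != nullptr);
    }

    // Copy one update into the mapped range; returns false if it does not fit.
    // outOffset is where the draw call should later read this data from.
    bool Write(const void* src, std::uint32_t sizeBytes, std::uint32_t& outOffset)
    {
        if (mCursor > mCapacity || sizeBytes > mCapacity - mCursor)
            return false;
        std::memcpy(mBase + mCursor, src, sizeBytes);
        outOffset = mCursor;
        mCursor += (sizeBytes + 15u) & ~15u;  // keep the next write 16-byte aligned
        return true;
    }

    std::uint32_t BytesUsed() const { return mCursor < mCapacity ? mCursor : mCapacity; }

private:
    unsigned char* mBase;
    std::uint32_t mCapacity;
    std::uint32_t mCursor = 0;
};
```

Each particle system would call `Write` once per frame and keep the returned offset, so the per-frame API cost is a single Map/Unmap pair regardless of how many systems update.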
Buffered update, no discards
● One update into a triple buffer
● 1 Map per frame (using MAP_NO_OVERWRITE)
● Still 115 memcpys (I’m still lazy)
● Total Time: 0.217ms / frame (savings: 4.15ms)
● Bonus: No hitching ever
● Downside: 3x the memory
Operation | Per-Call | Per-Frame
Map/Unmap | 0.031 ms | 0.031 ms
Memcpy | ~0.002 ms | 0.231 ms
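One plausible layout for the triple buffer is a single allocation split into three equal regions, with each frame writing into the region for `frame % 3`. This is an assumption about how the experiment was set up, and it is only safe if the GPU never falls more than two frames behind the CPU. The struct name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct TripleBufferLayout
{
    std::uint32_t regionBytes;  // bytes reserved for one frame's updates

    // Downside noted on the slide: 3x the memory.
    std::uint32_t TotalBytes() const { return 3u * regionBytes; }

    // Byte offset of the region this frame may write without overwriting in-flight data.
    std::uint32_t RegionOffset(std::uint64_t frameNumber) const
    {
        return static_cast<std::uint32_t>(frameNumber % 3u) * regionBytes;
    }
};
```

Writes for a frame then start at `RegionOffset(frame)` and must stay within `regionBytes`.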
Performance Results
● Reducing API usage was a huge CPU-side savings
(4.09 ms). GPU Perf Unaffected
● Discard-Free updates were marginally faster still—
but would never hitch.
Approach | Total Frame Time
Original | 4.360 ms
Buffered Updates | 0.267 ms
Discard-Free | 0.217 ms
GPUView
● Covered by Jon Story earlier today
● Hopefully you caught it!
● Great for finding CPU-GPU sync points
Questions?
● jmcdonald at nvidia dot com
Nifty Buffer Summary Table
Type | Usage (e.g.) | Create Flag | Update Method
“Forever” | Level BSPs | IMMUTABLE | Cannot Update
Long-Lived | Characters | DEFAULT | UpdateSubresource
Transient | UI/Text | DYNAMIC | CTransientBuffer
Temporary | Particles | DYNAMIC | Map/NO_OVERWRITE
Constant | Material Props | DYNAMIC | Map/DISCARD

Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 

Efficient Buffer Management

  • 1. Don’t Throw it all Away: Efficient Buffer Management John McDonald Developer Technology, NVIDIA Corporation
  • 2. What are we talking about? ● General Performance/Functional Guidance ● CPU-GPU Sync Points ● Buffer Usage Patterns ● Contention-Free Buffers ● Constant Buffers ● Performance Investigation
  • 3. “Buffers” is really generic… ● Vertex Buffers ● Index Buffers ● Constant Buffers
  • 5. General Guidance ● D3D11 >> D3D9 (generally) ● It’s much harder to hit the ultra-slow path (aka CPU-GPU Sync Points) ● Reduce your API calls where possible ● Batch up buffer updates ● Alignment matters! (16-byte, please) ● Aligned copies can be ~30x faster
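As a small illustration of the 16-byte alignment advice, the helper below rounds a size or offset up to the next 16-byte boundary before sub-allocating from a larger buffer. This is only a sketch: `kCopyAlignment`, `AlignUp`, and `IsAligned` are hypothetical names, not part of D3D11.

```cpp
#include <cstddef>

// Hypothetical constant: the 16-byte alignment recommended on the slide.
constexpr std::size_t kCopyAlignment = 16;

// Round `value` up to the next multiple of `alignment`.
// `alignment` must be a power of two.
constexpr std::size_t AlignUp(std::size_t value,
                              std::size_t alignment = kCopyAlignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// True when `value` already sits on an `alignment` boundary.
constexpr bool IsAligned(std::size_t value,
                         std::size_t alignment = kCopyAlignment) {
    return (value & (alignment - 1)) == 0;
}
```

Rounding every sub-allocation offset with `AlignUp` keeps each copy destination on a 16-byte boundary, at the cost of a few padding bytes per allocation.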
  • 6. More General Guidance ● D3D11Device will grab a mutex for you, but each DeviceContext can only be called from one thread at a time ● This is the source of many crashes blamed on the driver ● UpdateSubresource requires more CPU time ● When possible, prefer Map/Unmap ● D3D11 Debug Runtime is awesome! ● Please use it, ensure you are running clean
  • 8. CPU-GPU Sync Points ● CPU-GPU Sync Points are caused when the CPU needs the GPU to complete work before an API call can return ● These make us sad
  • 9. CPU-GPU sync point examples ● Explicit ● Spin-lock waiting for query results ● Readback of Framebuffer you just rendered to ● Implicit (potential sync points) ● GPU Memory Allocation after Deallocation ● Buffer Rename operation (MAP_DISCARD) after deallocation ● Immediate update of a buffer still in use
  • 10. Why are they bad? ● Ideal frame time should be max(CPU time, GPU time) ● CPU-GPU Sync point turns this into CPU Time + GPU Time. [Figure: CPU and GPU timelines between Presents, ideal vs. with sync point]
  • 11. Really? That bad? ● One bad sync point can halve your frame rate ● Even worse: the more sync points you have, the harder they are to find. ● Performance will just seem generally slow ● The badness depends, in part, on where in the frame the sync-point occurs ● Generally, the later the sync point, the worse it is ● Early sync-points are also bad if your workload is very lopsided towards either the CPU or the GPU
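A worked example of the two frame-time formulas shows where the "halve your frame rate" claim comes from. The 8 ms timings are assumed for illustration only and do not come from the talk.

```latex
\begin{align*}
T_{\text{ideal}} &= \max(T_{\text{CPU}}, T_{\text{GPU}}) = \max(8, 8) = 8\ \text{ms} \quad (125\ \text{fps}) \\
T_{\text{sync}}  &= T_{\text{CPU}} + T_{\text{GPU}} = 8 + 8 = 16\ \text{ms} \quad (62.5\ \text{fps})
\end{align*}
```

With a balanced workload the sync point doubles the frame time. With a lopsided workload (say 2 ms CPU, 14 ms GPU) the same formulas give 14 ms versus 16 ms, so the relative cost of one fully serializing sync point is smaller.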
  • 12. Check your middleware ● Middleware is generally written in a vacuum ● What works best in the small might not scale well ● Especially check for CPU-GPU sync points
  • 13. A quick D3D9 interlude ● CPU-GPU sync points are trivial to introduce in D3D9 ● Locking any buffer in D3D9 with flags=0 is a virtually guaranteed CPU-GPU Sync point if that buffer is still in use. 
  • 15. Buffer Usage Patterns (ordered from least to most frequently updated) ● “Forever”: Level BSPs ● Long Lived: Character Geometry ● Transient: UI, Text (New!) ● Temporary: Particle Systems (Streaming) ● Constants: Shader Parameters
  • 16. “Forever” Buffers ● Useful for geometry that is loaded once ● Ex: Level BSPs, loaded behind a load screen ● Don’t use this for streaming data ● Hitching during allocation is possible/likely ● IMMUTABLE flag at creation time ● Cannot update these!
  • 17. Long Lived Buffers ● Data that is streamed in from disk, but is expected to last for “awhile” ● Ex: Character geometry ● Reuse these; stream into them ● DEFAULT flag at creation time ● UpdateSubresource to update
  • 18. Temporary buffers ● Fire-and-forget data ● E.g. Particle systems ● Almost certainly lives in system RAM ● DYNAMIC flag at create time ● Prefer Map/Unmap to update these ● UpdateSubresource involves an extra copy
  • 19. Constant Buffers ● These are different than other buffers in D3D11. ● The GPU can deal with many of them in flight at once ● Create with DYNAMIC ● Map/DISCARD to Update ● More on these in a bit
  • 20. We skipped one… ● Transient Buffers ● New informal class of Buffer ● Used for (e.g.) UI/Text ● Things that are dynamic, but few vertices each—and may need to be updated on odd schedules ● DYNAMIC flag at creation time ● Transient Buffers are part of a new class of buffer…
  • 22. Transient Buffer Overview ● Treat Buffer as a Memory Heap, with a twist ● On CPU, Freed memory available now ● On GPU, Freed memory is available when GPU is finished with it ● Assume memory is in use until told otherwise ● Determine when GPU must be finished with Freed memory, then return to the “really free” list
  • 23. CTransientBuffer ● On Alloc, walk a Free list looking for best fit ● Data is updated using Map/NO_OVERWRITE ● Return opaque, immutable handle ● On Free, record that the chunk was freed in RetiredFrames.back() ● Just after present, an “OnPresent” function is called
        class CTransientBuffer {
            ID3D11Buffer* mBuffer;
            UINT mLengthBytes;
            ID3D11Device* mOwner;
            vector<CSubAlloc> mFreeList;
            list<RetiredFrame> mRetiredFrames;
        public:
            CSubAlloc* Alloc(UINT, void*, ID3D11DeviceContext*);
            void Free(CSubAlloc*);
            void OnPresent(ID3D11DeviceContext*);
            ...
        };
  • 24. CTransientBuffer Guts
        class CTransientBuffer {
            ID3D11Buffer* mBuffer;
            UINT mLengthBytes;
            ID3D11Device* mOwner;
            vector<CSubAlloc> mFreeList;
            list<RetiredFrame> mRetiredFrames;
        public:
            CSubAlloc* Alloc(UINT, void*, ID3D11DeviceContext*);
            void Free(CSubAlloc*);
            void OnPresent(ID3D11DeviceContext*);
            ...
        };
        struct RetiredFrame {
            list<CSubAlloc*> mPendingFrees;
            ID3D11Query* mFrameCompleteQuery;
        };
        class CSubAlloc {
            UINT mOffset;
            UINT mLength;
            ...
        };
  • 25. CTransientBuffer::OnPresent
        void CTransientBuffer::OnPresent(ID3D11DeviceContext* _dc) {
            // First, deal with deletes from this frame
            RetiredFrame& retFrame = mRetiredFrames.back();
            if (!retFrame.mPendingFrees.empty()) {
                retFrame.mFrameCompleteQuery = CreateAndIssueEventQuery(_dc);
                // Append a new (empty) RetiredFrame to mRetiredFrames
                mRetiredFrames.push_back(RetiredFrame());
            }
            // Second, return pending frees to mFreeList (continued on the next slide)
  • 26. CTransientBuffer::OnPresent
            // Second, return pending frees to mFreeList
            FOREACH(frameIt, mRetiredFrames) {
                auto query = frameIt->mFrameCompleteQuery;
                // Stop at the first frame the GPU has not finished yet
                if (!(query && IsQueryComplete(query))) break;
                FOREACH(subAllocIt, frameIt->mPendingFrees) {
                    ReallyFree(*subAllocIt);
                }
            }
        }
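The deferred-free bookkeeping described on the CTransientBuffer slides can be sketched without any D3D11 calls. This is a CPU-only illustration under stated assumptions, not the talk's implementation: `TransientHeapSketch` and `FreeBytes` are hypothetical names, a frame number passed to `OnPresent` stands in for polling an event query, and adjacent free chunks are not merged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

struct SubAlloc {
    std::uint32_t offset;
    std::uint32_t length;
};

class TransientHeapSketch {
public:
    explicit TransientHeapSketch(std::uint32_t sizeBytes) {
        assert(sizeBytes > 0);
        mFreeList.push_back({0, sizeBytes});
    }

    // Best-fit walk over the free list; returns false when nothing fits.
    bool Alloc(std::uint32_t length, SubAlloc& out) {
        std::size_t best = mFreeList.size();
        for (std::size_t i = 0; i < mFreeList.size(); ++i) {
            const bool fits = mFreeList[i].length >= length;
            if (fits && (best == mFreeList.size() ||
                         mFreeList[i].length < mFreeList[best].length)) {
                best = i;
            }
        }
        if (best == mFreeList.size()) return false;
        out = {mFreeList[best].offset, length};
        mFreeList[best].offset += length;
        mFreeList[best].length -= length;
        if (mFreeList[best].length == 0) {
            mFreeList.erase(mFreeList.begin() + static_cast<std::ptrdiff_t>(best));
        }
        return true;
    }

    // Only records the free; the GPU may still be reading this memory.
    void Free(const SubAlloc& a) { mPending.push_back(a); }

    // Call once per frame, just after Present. `gpuCompletedFrame` is the
    // newest frame the GPU has finished (assumption: replaces event queries).
    void OnPresent(std::uint64_t cpuFrame, std::uint64_t gpuCompletedFrame) {
        if (!mPending.empty()) {
            mRetired.push_back({cpuFrame, std::move(mPending)});
            mPending.clear();
        }
        while (!mRetired.empty() && mRetired.front().frame <= gpuCompletedFrame) {
            for (const SubAlloc& a : mRetired.front().frees) mFreeList.push_back(a);
            mRetired.pop_front();
        }
    }

    // Total bytes currently available to Alloc.
    std::uint32_t FreeBytes() const {
        std::uint32_t total = 0;
        for (const SubAlloc& a : mFreeList) total += a.length;
        return total;
    }

private:
    struct RetiredFrame {
        std::uint64_t frame;
        std::vector<SubAlloc> frees;
    };
    std::vector<SubAlloc> mFreeList;
    std::vector<SubAlloc> mPending;
    std::deque<RetiredFrame> mRetired;
};
```

The point of the sketch is the two-stage free: memory passed to `Free` stays unavailable until a later `OnPresent` reports that the GPU has finished the frame in which it was freed.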
  • 28. CTransientBuffer Visualized [Figure: Free List and Retired Frames] Allocating four Buffers
  • 29. CTransientBuffer Visualized [Figure: Free List and Retired Frames] Nothing
  • 30. CTransientBuffer Visualized [Figure: Free List and Retired Frames, with EQ] Deallocating Yellow and Green
  • 31. CTransientBuffer Visualized [Figure: Free List and Retired Frames, with Event and EQ] Deallocating Yellow and Green
  • 32. CTransientBuffer Visualized [Figure: Free List and Retired Frames] EQ Returns for Retired Frame
  • 33. CTransientBuffer: Handling OOM ● Ways to handle Out of Memory on Alloc: ● Spin-lock waiting for RetiredFrame Queries to return ● Allocate a new, larger buffer and release the current buffer (requires a system memory copy to initially fill the new buffer) ● These will (probably) stall ● But the stall is in your code, so it can be easily logged and/or recorded to adjust and avoid for subsequent runs
  • 34. Transient Buffer Pattern ● Works in D3D9 as well ● Can be extended and simplified to contention-free Temporary Buffers, too! ● Let’s take a quick look at that.
  • 35. Discard-Free Temporary Buffers ● Allocate out of Buffer as a circular buffer ● No opaque handle needed ● Remember ending address of the last allocation ● Per frame: Assuming any allocations, issue query ● Later: When query returns, move the end pointer to indicate additional available space ● Credit: Blizzard’s StarCraft 2 Team (thanks!)
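  • 36. Discard-Free Temp Buffer Visualized [Figure: circular buffer with Start and End pointers, plus Retired Frames] Start State
  • 37. Discard-Free Temp Buffer Visualized [Figure: circular buffer with Start, End and NextEnd pointers, plus Retired Frames] Allocate some stuff
  • 38. Discard-Free Temp Buffer Visualized [Figure: circular buffer with Start, End and two NextEnd pointers, plus Retired Frames] Go on…
  • 39. Discard-Free Temp Buffer Visualized [Figure: circular buffer with Start, End and two NextEnd pointers, plus Retired Frames] Queries start to return…
  • 40. Discard-Free Temp Buffer Visualized [Figure: circular buffer with Start, End and two NextEnd pointers, plus Retired Frames] etc…
  • 41. Discard-Free Temp Buffer Visualized [Figure: circular buffer with Start, End and two NextEnd pointers, plus Retired Frames] etc…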
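The discard-free temporary buffer can be sketched as a ring allocator whose tail only advances once the GPU is done with a frame. This is a CPU-only sketch under assumptions: `RingAllocSketch`, `EndFrame`, and `Retire` are hypothetical names, a completed-frame number stands in for event queries, and the head and tail are kept as ever-increasing virtual offsets so that wrap-around reduces to a modulo.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

class RingAllocSketch {
public:
    static constexpr std::uint64_t kNoSpace = ~std::uint64_t{0};

    explicit RingAllocSketch(std::uint64_t sizeBytes) : mSize(sizeBytes) {
        assert(sizeBytes > 0);
    }

    // Returns the physical byte offset to write `length` bytes at, or
    // kNoSpace if the GPU has not yet released enough of the ring.
    std::uint64_t Alloc(std::uint64_t length) {
        std::uint64_t start = mHead;
        const std::uint64_t physical = start % mSize;
        // An allocation may not straddle the end of the buffer: skip the
        // remaining bytes and start again at physical offset 0.
        if (physical + length > mSize) start += mSize - physical;
        if (start + length - mTail > mSize) return kNoSpace;
        mHead = start + length;
        return start % mSize;
    }

    // Call once per frame after Present: remember where this frame's
    // allocations ended (a real version would also issue an event query).
    void EndFrame(std::uint64_t frame) {
        if (mHead != mLastMarked) {
            mRetired.push_back({frame, mHead});
            mLastMarked = mHead;
        }
    }

    // When the GPU has finished `gpuCompletedFrame`, move the tail up to
    // the end position recorded for that frame.
    void Retire(std::uint64_t gpuCompletedFrame) {
        while (!mRetired.empty() && mRetired.front().frame <= gpuCompletedFrame) {
            mTail = mRetired.front().head;
            mRetired.pop_front();
        }
    }

private:
    struct Mark {
        std::uint64_t frame;
        std::uint64_t head;
    };
    std::uint64_t mSize;
    std::uint64_t mHead = 0;        // next virtual offset to hand out
    std::uint64_t mTail = 0;        // everything before this is reusable
    std::uint64_t mLastMarked = 0;  // head value at the last recorded frame
    std::deque<Mark> mRetired;
};
```

Because every write lands in memory the GPU has already released, the data can be written with Map/NO_OVERWRITE and no DISCARD is needed. Running out of space shows up as `kNoSpace` in your own code, where it can be logged or used to size the buffer.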
  • 43. Constant Buffer Organization ● Group by frequency of update ● The cheapest buffers are the ones you never update ● You can bind multiple buffers in one call (Reduce those API calls!)
  • 44. Proposed Buffer Grouping ● Assuming you are not vertex shading limited ● Don’t solve the travelling salesman in your VS ● Seriously: this isn’t common
  • 45. Multiple Constant Buffers ● One for per-frame constants (GI values, lights) ● One for per-camera constants (ViewProj matrix, camera position in world, RT dimensions) ● Old HLSL: oPos = in.Position * cWorldViewPos; ● New HLSL: oPos = in.Position * cWorld * cViewPos; ● One extra 3x3 matrix multiply in the VS. No biggie.
  • 46. Multiple Constant Buffers cont’d ● One for per-object constants (World matrix, dynamic material properties, etc) ● One for per-material constants (if these are shared—if not then drop them in with per-object constants) ● Splitting constants this way eliminates constant updates for static objects.
  • 47. Constant Buffer Tricks ● Use shared structs to update when possible ● Struct can be included from both hlsl and C++ ● Makes buffer updates trivial! ● Assign them to slots by convention: ● b0: Per-Frame, b1: Per-Camera, etc ● Slot assignment can live in shared header, too.
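The C++ side of such a shared header might look like the sketch below. All names are hypothetical, the slot numbers past b1 are an assumption (the slide only gives b0 and b1), and the HLSL half of the header is not shown. The `static_assert`s reflect the D3D11 requirement that a constant buffer's byte width be a multiple of 16.

```cpp
#include <cstdint>

// Slot convention: b0 and b1 follow the slide; b2 and b3 are assumed.
enum ConstantBufferSlot : std::uint32_t {
    kSlotPerFrame = 0,
    kSlotPerCamera = 1,
    kSlotPerObject = 2,
    kSlotPerMaterial = 3,
};

// Minimal stand-ins for HLSL float4 and float4x4.
struct Float4 {
    float x, y, z, w;
};
struct Float4x4 {
    Float4 rows[4];
};

// Updated once per camera per frame.
struct PerCameraConstants {
    Float4x4 viewProj;
    Float4 cameraPosWorld;
    Float4 renderTargetDims;
};

// Updated only when an object moves; static objects never touch it.
struct PerObjectConstants {
    Float4x4 world;
};

static_assert(sizeof(PerCameraConstants) % 16 == 0,
              "constant buffer size must be a multiple of 16 bytes");
static_assert(sizeof(PerObjectConstants) % 16 == 0,
              "constant buffer size must be a multiple of 16 bytes");
```

Building every struct out of 16-byte `Float4` units keeps the C++ layout in step with HLSL's 16-byte register packing, so a whole struct can be copied into a mapped buffer in one go.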
  • 49. Performance Investigation ● Scene from a Typical D3D11 Application (unreleased) ● 115 Dynamic Vertex Buffer Updates (particles) per frame ● Total Time: 4.36ms / frame
                      Per-Call     Frame
        Map/Unmap     0.036 ms     3.79 ms
        Memcpy        ~0.004 ms    0.4 ms
  • 50. Let’s buffer the updates ● All Dynamic Updates during one update ● 1 Map per frame (using MAP_DISCARD) ● Still 115 memcpys (I’m lazy) ● Total Time: 0.267ms / frame (savings: 4.1ms!)
                      Per-Call     Frame
        Map/Unmap     0.036 ms     0.036 ms
        Memcpy        ~0.002 ms    0.231 ms
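One way to get from 115 Maps to a single Map is to queue the updates on the CPU, give each one an aligned offset in a shared buffer, and do all the memcpys under one mapping. The sketch below is CPU-only and hypothetical (`UpdateBatchSketch` and its members are not from the talk): the `mapped` pointer stands in for what a single Map call would return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct PendingUpdate {
    const void* src;
    std::size_t size;
    std::size_t dstOffset;
};

class UpdateBatchSketch {
public:
    // Queue an update and return the offset it will occupy in the shared
    // buffer. The source memory must stay valid until CopyAll runs.
    std::size_t Add(const void* src, std::size_t size) {
        assert(src != nullptr);
        const std::size_t offset = mTotal;
        mUpdates.push_back({src, size, offset});
        // Round the next offset up to a 16-byte boundary.
        mTotal = (offset + size + 15) & ~std::size_t{15};
        return offset;
    }

    // Bytes the shared buffer must provide for this batch.
    std::size_t TotalBytes() const { return mTotal; }

    // Perform every queued copy into the single mapped region.
    void CopyAll(void* mapped) const {
        unsigned char* dst = static_cast<unsigned char*>(mapped);
        for (const PendingUpdate& u : mUpdates) {
            std::memcpy(dst + u.dstOffset, u.src, u.size);
        }
    }

    // Reset for the next frame.
    void Clear() {
        mUpdates.clear();
        mTotal = 0;
    }

private:
    std::vector<PendingUpdate> mUpdates;
    std::size_t mTotal = 0;
};
```

Each draw then uses the offset returned by `Add` to locate its data inside the shared buffer, so the number of memcpys is unchanged while the Map/Unmap cost is paid once per frame.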
  • 51. Buffered update, no discards ● One update into a triple buffer ● 1 Map per frame (using MAP_NOOVERWRITE) ● Still 115 memcpys (I’m still lazy) ● Total Time: 0.217ms / frame (savings: 4.15ms) ● Bonus: No hitching ever ● Downside: 3x the memory
                      Per-Call     Frame
        Map/Unmap     0.031 ms     0.031 ms
        Memcpy        ~0.002 ms    0.231 ms
  • 52. Performance Results ● Reducing API usage was a huge CPU-side savings (4.09 ms). GPU Perf Unaffected ● Discard-Free updates were marginally faster still, and would never hitch.
                            Total Frame Time
        Original            4.360 ms
        Buffered Updates    0.267 ms
        Discard-Free        0.217 ms
  • 53. GPUView ● Covered by Jon Story earlier today ● Hopefully you caught it! ● Great for finding CPU-GPU sync points
  • 54. Questions? ● jmcdonald at nvidia dot com
  • 55. Nifty Buffer Summary Table
        Type         Usage (e.g.)     Create Flag   Update Method
        “Forever”    Level BSPs       IMMUTABLE     Cannot Update
        Long-Lived   Characters       DEFAULT       UpdateSubresource
        Transient    UI/Text          DYNAMIC       CTransientBuffer
        Temporary    Particles        DYNAMIC       Map/NO_OVERWRITE
        Constant     Material Props   DYNAMIC       Map/DISCARD