Your Game Needs Direct3D 11, So Get Started Now!

Direct3D 11 will have tessellation for smoother curves and finer details. The new compute shader will make postprocessing faster and easier. You'll need Direct3D 11 to have the best graphics, and this talk will show you how you can get started using current generation hardware.

    1. Your Game Needs Direct3D 11, So Get Started Now!
       • Bill Bilodeau
       • ISV Relations
       • AMD Graphics Products Group
       • [email_address]
       V1.0
    2. Topics covered in this session
       • Why your game needs Direct3D 11
       • Porting to Direct3D 11 in the real world
         • A view from the “Battlefield” trenches with Johan Andersson from DICE
       • Important Direct3D 11 features for your game
         • How you can use these features on current hardware
       • Strategies for moving to Direct3D 11
    3. Why your game needs Direct3D 11
       • Faster Rendering -> More Rendering -> Better Graphics
         • Direct3D 11 can make rendering more efficient
         • Tessellation
           • Faster rendering using less space
         • Compute Shaders
           • More programming freedom
           • Efficient reuse of sampled data
         • Multithreading
           • Takes advantage of modern multi-core CPUs
    4. More reasons to switch to Direct3D 11
       • Superset of Direct3D 10.1
         • Gather() function speeds up texture fetches
         • Standard API access to MSAA depth buffers
         • MSAA sample patterns/mask, cube map arrays, etc.
       • Supports multiple “device feature levels”
         • 11.0
         • 10.1, 10.0
         • 9.3, 9.2, 9.1
         • One API for all of your supported hardware
       • Runs on both Windows 7 and Vista
         • Not tied to one operating system
    5. Sometimes you can teach an old dog new tricks.
       • You can run Direct3D 11 on downlevel hardware
         • If you stay within the feature level of the device, you can use the Direct3D 11 API
         • Even some new Direct3D 11 features will run on old hardware:
           • Multithreading
           • Compute shaders
             • On some Direct3D 10 hardware with new drivers
           • Some restrictions may apply
    6. Porting to Direct3D 11 in the real world
       • Frostbite Engine
       • Johan Andersson
       • Rendering Architect
       • DICE
    7. Frostbite DX11 port
       • Starting point
         • Cross-platform engine (PC, Xenon, PS3)
         • Engine PC path used DX10 exclusively
         • 10.0 and 10.1 feature levels
       • Ported entire engine from DX10 to DX11 API in 3 hours!
         • Mostly search’n’replace
         • 70% of the time was spent changing Map/Unmap calls, which have moved to the immediate context instead of the resource object
       • Compile-time switchable DX10 or DX11 API usage
         • As it will take a (short) while for the entire ecosystem to support DX11 (PIX, NvPerfHud, IHV APIs, etc.)
         • #define DICE_D3D11_ENABLE, currently ~100 usages
         • Will be removed later when everything DX11 works
    8. Temporary switchable DX10/DX11 wrappers

           #ifdef DICE_D3D11_ENABLE
           #include <External/DirectX/Include/d3d11.h>
           #else
           #include <External/DirectX/Include/d3d10_1.h>
           #endif

           #ifdef DICE_D3D11_ENABLE
           #define ID3DALLDevice               ID3D11Device
           #define ID3DALLDeviceContext        ID3D11DeviceContext
           #define ID3DALLBuffer               ID3D11Buffer
           #define ID3DALLRenderTargetView     ID3D11RenderTargetView
           #define ID3DALLPixelShader          ID3D11PixelShader
           #define ID3DALLTexture1D            ID3D11Texture1D
           // D3D11_BLEND_DESC is the D3D11 counterpart of D3D10_BLEND_DESC1
           #define D3DALL_BLEND_DESC           D3D11_BLEND_DESC
           #define D3DALL_BIND_SHADER_RESOURCE D3D11_BIND_SHADER_RESOURCE
           #define D3DALL_RASTERIZER_DESC      D3D11_RASTERIZER_DESC
           #define D3DALL_USAGE_IMMUTABLE      D3D11_USAGE_IMMUTABLE
           #else
           // On DX10 the device also plays the role of the device context
           #define ID3DALLDevice               ID3D10Device1
           #define ID3DALLDeviceContext        ID3D10Device1
           #define ID3DALLBuffer               ID3D10Buffer
           #define ID3DALLRenderTargetView     ID3D10RenderTargetView
           #define ID3DALLPixelShader          ID3D10PixelShader
           #define ID3DALLTexture1D            ID3D10Texture1D
           #define D3DALL_BLEND_DESC           D3D10_BLEND_DESC1
           #define D3DALL_BIND_SHADER_RESOURCE D3D10_BIND_SHADER_RESOURCE
           #define D3DALL_RASTERIZER_DESC      D3D10_RASTERIZER_DESC
           #define D3DALL_USAGE_IMMUTABLE      D3D10_USAGE_IMMUTABLE
           #endif

       • Want the full header file to save on typing? Drop me an email
    9. Switchable DX10/DX11 support examples

           // using D3D10 requires dxgi.lib and D3D11 beta requires dxgi_beta.lib and if we
           // link with only one through the common method then it crashes when creating
           // the D3D device. so instead conditionally link with the
           // correct dxgi library here for now --johan
           #ifdef DICE_D3D11_ENABLE
           #pragma comment(lib, "dxgi_beta.lib")
           #else
           #pragma comment(lib, "dxgi.lib")
           #endif

           // Setting a shader takes an extra parameter on D3D11: ID3D11ClassLinkage
           // which is used for the D3D11 subroutine support (which we don’t use)
           #ifdef DICE_D3D11_ENABLE
           m_deviceContext->PSSetShader(solution.pixelPermutation->shader, nullptr, 0);
           #else
           m_deviceContext->PSSetShader(solution.pixelPermutation->shader);
           #endif
    10. Mapping buffers on DX10 vs DX11

           #ifdef DICE_D3D11_ENABLE
           D3D11_MAPPED_SUBRESOURCE mappedResource;
           DICE_SAFE_DX(m_deviceContext->Map(
               m_functionConstantBuffers[type],  // cbuffer
               0,                                // subresource
               D3D11_MAP_WRITE_DISCARD,          // map type
               0,                                // map flags
               &mappedResource));                // map resource
           data = reinterpret_cast<Vec*>(mappedResource.pData);
           // fill in data
           m_deviceContext->Unmap(m_functionConstantBuffers[type], 0);
           #else
           DICE_SAFE_DX(m_functionConstantBuffers[type]->Map(
               D3D10_MAP_WRITE_DISCARD,          // map type
               0,                                // map flags
               (void**)&data));                  // data
           // fill in data
           m_functionConstantBuffers[type]->Unmap();
           #endif
    11. Frostbite DX11 parallel dispatch
       • The Killer Feature for reducing CPU rendering overhead!
         • ~90% of our rendering dispatch job is in D3D/driver
         • Have a DX11 deferred device context per core
           • Together with dynamic resources (cbuffer/vbuffer) for usage on that deferred context
         • Renderer has a list of all draw calls we want to do for each rendering “layer” of the frame
         • Split draw calls for each layer into chunks of ~256 and dispatch in parallel to the deferred contexts
           • Each chunk generates a command list
         • Render to immediate context & execute command lists
         • Profit!
           • Goal: close to linear perf. scaling up to octa-core when we get DX11 driver support (hint hint to the IHVs)
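
The chunk-and-dispatch scheme above can be sketched on the CPU without any Direct3D calls. This is a minimal illustration, not Frostbite code: `DrawCall`, `CommandList`, `recordInParallel` and `executeInOrder` are hypothetical names, a "command list" is just a vector of draw-call ids, and one thread is spawned per chunk for brevity where a real engine would more likely feed chunks to one worker (and one deferred context) per core.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins: a draw call is an id, and a "command list" is the
// ordered list of ids recorded by one worker.
using DrawCall = int;
using CommandList = std::vector<DrawCall>;

// Split `draws` into chunks of `chunkSize`, record every chunk on its own
// worker thread, and return the command lists in submission order.
std::vector<CommandList> recordInParallel(const std::vector<DrawCall>& draws,
                                          std::size_t chunkSize = 256) {
    const std::size_t chunkCount = (draws.size() + chunkSize - 1) / chunkSize;
    std::vector<CommandList> lists(chunkCount);
    std::vector<std::thread> workers;
    workers.reserve(chunkCount);
    for (std::size_t c = 0; c < chunkCount; ++c) {
        workers.emplace_back([&draws, &lists, c, chunkSize] {
            const std::size_t first = c * chunkSize;
            const std::size_t last = std::min(first + chunkSize, draws.size());
            // Each worker writes only to its own slot, so no locking is needed.
            lists[c].assign(draws.begin() + static_cast<std::ptrdiff_t>(first),
                            draws.begin() + static_cast<std::ptrdiff_t>(last));
        });
    }
    for (std::thread& w : workers) w.join();
    return lists;
}

// "Execute" the lists in order on the calling (main render) thread.
std::vector<DrawCall> executeInOrder(const std::vector<CommandList>& lists) {
    std::vector<DrawCall> submitted;
    for (const CommandList& list : lists)
        submitted.insert(submitted.end(), list.begin(), list.end());
    return submitted;
}
```

Recording happens concurrently, but the lists are executed in chunk order, so the submitted draw order matches the single-threaded order.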
    12. Frostbite DX11 - Other HW features of interest
       • Short term / easy:
         • Read-only depth buffers. Saves copy & memory.
         • BC6H compression for static HDR envmaps or lightmaps
         • BC7 compression for high-quality RGB[A] textures
         • Per-resource fractional MinLod. Properly fade in streamed textures.
       • Longer term / more complex:
         • Compute shaders! (post fx, OIT, particle collision)
         • DrawIndirect (procedural generation, keep on GPU)
         • Tessellation (characters, terrain, smooth objects)
    13. Frostbite DX11 port: Questions?
       • [email_address]
       • Image from: igetyourfail.com
    14. New Direct3D 11 Feature: The Tessellator
       • Advantages of Hardware Tessellation
         • An extremely compact representation of a surface
           • Each primitive in the low-res input mesh represents up to 64 levels of tessellation
         • A faster way to render high resolution meshes
           • Vertices are generated by dedicated hardware
         • Levels of detail can be changed without needing to be uploaded to the GPU
           • No need to wait for uploads
           • LODs don’t need to be stored in system or GPU memory
           • LOD algorithm can run entirely on the GPU
    15. Direct3D 11 Tessellator Stages
       • 3 Tessellation Stages
         • 2 Programmable Stages
           • Hull Shader
           • Domain Shader
         • 1 Fixed Function Stage
           • Tessellator
       [Figure: Tessellator in the D3D 11 Pipeline]
    16. Direct3D 11 Tessellator Stages
       • Hull Shader
       • Operates in 2 phases
         • “Control point phase” allows conversion from one surface type to another
           • Example: sub-division surface to Bezier patches
           • Runs once per control point
         • “Patch constant phase” sets tessellation factors and other per-patch constants
           • Runs once per input primitive
       [Figure: Tessellator in the D3D 11 Pipeline]
    17. Direct3D 11 Tessellator Stages
       • Tessellator Stage
       • Fixed Function Stage
         • Generates new vertices within each of the input primitives
         • The number of new vertices is based on the tessellation factors computed by the hull shader
       [Figure: Tessellator in the D3D 11 Pipeline, with example tessellations at Level 1.0, Level 1.5 and Level 3.0]
    18. Direct3D 11 Tessellator Stages
       • Domain Shader
       • Evaluates the surface at each vertex
         • Uses the control points generated by the hull shader
           • Can implement various types of surfaces, for example Beziers
         • Displacement Mapping
           • Fetch displacements from a displacement map texture
           • Translate the vertex position along the normal
       [Figure: Tessellator in the D3D 11 Pipeline]
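
As a rough CPU-side picture of what a domain shader computes for each generated vertex, the sketch below evaluates a bicubic Bezier patch at a parametric coordinate and then displaces the result along a normal. It is plain C++ rather than HLSL; `Vec3`, `bernstein`, `evalPatch` and `displace` are illustrative names, the 4x4 row-major control-point layout is an assumption, and the height would come from a displacement-map fetch in a real shader.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }

// Cubic Bernstein weights for a parameter t in [0, 1].
std::array<float, 4> bernstein(float t) {
    const float u = 1.0f - t;
    return {u * u * u, 3.0f * u * u * t, 3.0f * u * t * t, t * t * t};
}

// Evaluate a bicubic Bezier patch (4x4 control points, row-major) at (u, v).
Vec3 evalPatch(const std::array<Vec3, 16>& cp, float u, float v) {
    const std::array<float, 4> bu = bernstein(u);
    const std::array<float, 4> bv = bernstein(v);
    Vec3 p{0.0f, 0.0f, 0.0f};
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 4; ++i)
            p = p + cp[j * 4 + i] * (bu[i] * bv[j]);
    return p;
}

// Displacement mapping: move the surface point along its unit normal.
Vec3 displace(Vec3 p, Vec3 unitNormal, float height) {
    return p + unitNormal * height;
}
```

For a flat patch with evenly spaced control points, `evalPatch` reproduces the plane itself, which makes the displacement step easy to check in isolation.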
    19. You can do tessellation on today’s hardware.
       • ATI Tessellator
         • A new fixed function stage
         • Can be used for prototyping D3D 11 algorithms
         • Available on all ATI Direct3D 10 capable hardware and Xbox 360
         • Tessellation SDK now available for Direct3D 9
           • http://developer.amd.com/gpu/radeon/Tessellation
       [Figure: ATI Tessellator in the D3D 9 Pipeline]
    20. Comparison: D3D 9 vs D3D 11 Tessellator
       • Various algorithms can be implemented on both
       • D3D11 Tessellator algorithms can usually be done in one pass.
       • Even with extra passes, hardware tessellation is still faster than rendering high polygon count geometry without a tessellator.
         • 3 times faster with less than 1/100th the size!
       • D3D11 Tessellator has more tessellation levels: 64 vs 15
         • More polygons per mesh
       • D3D11 Tessellator has a cleaner API
         • Control points are passed to hull and domain shader
         • ATI tessellator relies on vertex texture fetch
    21. Alternate Tessellation Method
       • Instanced Tessellation (Gruen 2005)
         • Does not require dedicated tessellation hardware
         • Uses hardware instancing to render tessellated surfaces
           • Create a vertex buffer that contains a tessellation of a generic triangle
           • Use instancing to instance that vertex buffer for every triangle in the mesh
           • The vertex shader can be used to transform the instanced triangles according to patch control points and/or a displacement map
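
One way to picture the "tessellation of a generic triangle" is as a grid of barycentric coordinates that is built once and then reused for every mesh triangle. The sketch below is an assumption about how such a buffer could be generated, not the method from the cited work; `Bary`, `tessellateGenericTriangle` and `placeInTriangle` are hypothetical names, and `placeInTriangle` stands in for the per-instance vertex shader work.

```cpp
#include <cassert>
#include <vector>

struct Bary { float u, v, w; };   // barycentric coordinates, u + v + w = 1
struct Vec3 { float x, y, z; };

// Tessellate a generic triangle with `n` segments per edge (n >= 1).
// Produces (n + 1)(n + 2) / 2 vertices, stored as barycentric coordinates.
std::vector<Bary> tessellateGenericTriangle(int n) {
    std::vector<Bary> verts;
    for (int row = 0; row <= n; ++row) {
        for (int col = 0; col <= n - row; ++col) {
            const float u = static_cast<float>(col) / static_cast<float>(n);
            const float v = static_cast<float>(row) / static_cast<float>(n);
            verts.push_back({u, v, 1.0f - u - v});
        }
    }
    return verts;
}

// Per-instance work: place a generic vertex inside the mesh triangle (a, b, c).
// Weight w belongs to corner a, u to corner b, and v to corner c.
Vec3 placeInTriangle(Bary t, Vec3 a, Vec3 b, Vec3 c) {
    return {a.x * t.w + b.x * t.u + c.x * t.v,
            a.y * t.w + b.y * t.u + c.y * t.v,
            a.z * t.w + b.z * t.u + c.z * t.v};
}
```

A patch evaluation or displacement lookup could replace the flat interpolation in `placeInTriangle` without changing the shared vertex buffer.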
    22. New Direct3D 11 feature: Compute Shaders
       • Allows you to bypass the entire graphics pipeline for GPGPU programming
         • Post-processing, OIT, AI, Physics, and more
         • Avoid too many context switches
       • Application has control over dispatching and synchronization of threads
       • Shared memory between Compute Shader threads
         • Thread Group Shared Memory (TGSM)
         • Avoids redundant calculations and fetches
       • Random access to output buffer
         • “Unordered Access View” (UAV)
         • Scatter writes: multiple random access writes per shader
    23. Compute Shader: Threads
       • A thread is the basic CS processing element
       • A “thread group” is a 3-dimensional array of threads
         • CS declares the number of threads in a group
           • e.g. [numthreads(X, Y, Z)]
         • Each thread in the group executes the same code
       • Thread groups are also organized as 3D arrays
       • Execution of threads is started by calling the device context’s Dispatch( nX, nY, nZ ) function
         • Where nX, nY, nZ are the number of thread groups to execute
    24. Compute Shader: Threads and Thread Groups
       • pDev11->Dispatch(3, 2, 1); // D3D API call
       • [numthreads(4, 4, 1)] // CS 5.0 HLSL
       • Total threads = 3*2*4*4 = 96
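
The arithmetic on this slide, and the way the HLSL system values relate to each other, can be checked with a few lines of plain C++. The struct and function names below are illustrative only; the formulas follow the documented definitions of SV_DispatchThreadID (group ID times group size plus the thread's ID within the group) and SV_GroupIndex (the flattened ID within the group).

```cpp
#include <cassert>

struct UInt3 { unsigned x, y, z; };

// Total threads launched by Dispatch(groups) with [numthreads(threadsPerGroup)].
unsigned totalThreads(UInt3 groups, UInt3 threadsPerGroup) {
    return groups.x * groups.y * groups.z *
           threadsPerGroup.x * threadsPerGroup.y * threadsPerGroup.z;
}

// SV_DispatchThreadID = SV_GroupID * numthreads + SV_GroupThreadID
UInt3 dispatchThreadId(UInt3 groupId, UInt3 threadsPerGroup, UInt3 groupThreadId) {
    return {groupId.x * threadsPerGroup.x + groupThreadId.x,
            groupId.y * threadsPerGroup.y + groupThreadId.y,
            groupId.z * threadsPerGroup.z + groupThreadId.z};
}

// SV_GroupIndex = z * X * Y + y * X + x, with (X, Y, Z) from numthreads.
unsigned groupIndex(UInt3 groupThreadId, UInt3 threadsPerGroup) {
    return groupThreadId.z * threadsPerGroup.x * threadsPerGroup.y +
           groupThreadId.y * threadsPerGroup.x +
           groupThreadId.x;
}
```

With `Dispatch(3, 2, 1)` and `[numthreads(4, 4, 1)]`, the last thread of the last group has dispatch thread ID (11, 7, 0) and group index 15.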
    25. Compute Shader: Thread Group Shared Memory
       • Shared between threads
         • Think of it as fast local memory reserved for threads
       • Read/write access at any location
       • Declared in the shader
         • groupshared float4 vCacheMemory[1024];
       • Limited to 32 KB
       • Need synchronization before reading back data written by other threads
         • To ensure all threads have finished writing
         • GroupMemoryBarrier();
         • GroupMemoryBarrierWithGroupSync();
    26. You can use compute shaders on today’s hardware.
       • Compute Shaders are available on some D3D 10 hardware
         • CS 4.x is a subset of CS 5.0
           • Includes CS 4.0 and CS 4.1
           • CS 4.1 includes instructions from SM 4.1 (D3D 10.1)
         • Requires support in the driver
           • Use CheckFeatureSupport()
             • D3D11_FEATURE enum: D3D11_FEATURE_D3D10_X_HARDWARE_OPTIONS
             • Boolean value: ComputeShaders_Plus_RawAndStructuredBuffers_Via_Shader_4_x
           • Drivers are now available!
             • Contact us for details
    27. CS 4.x Limitations
       • Limitations
         • Max number of threads per group is 768 total
         • Dispatch Z dimension must be 1 (nZ == 1), and there is no DispatchIndirect() support
         • Thread Group Shared Memory (TGSM) limitations
           • Max size is 16 KB vs 32 KB in CS 5.0
           • Threads can only write to their own offsets in TGSM
             • But they can still read from anywhere in the TGSM
         • No atomic operations or append/consume
         • Only one UAV can be bound
           • Must be Raw or Structured, not Typed (no textures)
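
The numeric limits above are easy to check up front when deciding whether a shader configuration can target CS 4.x. The helper below is a hypothetical sketch that encodes only the figures quoted on these slides (768 threads per group, Dispatch Z of 1, 16 KB of TGSM versus 32 KB in CS 5.0); it does not query the driver, and the names are not part of any Direct3D API.

```cpp
#include <cassert>

// Hypothetical description of one compute shader thread group.
struct GroupConfig {
    unsigned x, y, z;        // [numthreads(x, y, z)]
    unsigned sharedBytes;    // total groupshared storage declared
};

const unsigned kCs4xMaxThreadsPerGroup = 768;
const unsigned kCs4xMaxSharedBytes = 16 * 1024;
const unsigned kCs50MaxSharedBytes = 32 * 1024;

// True if the group and dispatch stay within the CS 4.x limits listed above.
bool fitsCs4x(GroupConfig g, unsigned dispatchZ) {
    return g.x * g.y * g.z <= kCs4xMaxThreadsPerGroup &&
           dispatchZ == 1 &&
           g.sharedBytes <= kCs4xMaxSharedBytes;
}

// True if the groupshared storage fits the CS 5.0 limit quoted above.
bool fitsCs50SharedMemory(GroupConfig g) {
    return g.sharedBytes <= kCs50MaxSharedBytes;
}
```

For example, `groupshared float4 vCacheMemory[1024]` occupies 1024 * 16 = 16384 bytes, which sits exactly at the CS 4.x limit.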
    28. CS 4.0 Example: HDR Tone Map Reduction
       [Figure: the rendered HDR image is reduced in 8 x 8 blocks to a 1D buffer, then to a second 1D buffer, then to the final result]
    29. CS 4.0 Example: HDR Tone Map Reduction
       • C++ Code:

           CompileShaderFromFile( L"ReduceTo1DCS.hlsl", "CSMain", "cs_4_0", &pBlob );

       • HLSL Code (reduction from 2D to 1D):

           Texture2D Input : register( t0 );
           RWStructuredBuffer<float> Result : register( u0 );

           cbuffer cbCS : register( b0 )
           {
               uint4 g_param;  // (g_param.x, g_param.y) is the x and y dimensions of
                               // the Dispatch call.
                               // (g_param.z, g_param.w) is the size of the above
                               // Input Texture2D
           };
    30. CS 4.0 Example: HDR Tone Map Reduction

           #define blocksize 8
           #define blocksizeY 8
           #define groupthreads (blocksize*blocksizeY)

           groupshared float accum[groupthreads];

           static const float4 LUM_VECTOR = float4(.299, .587, .114, 0);

           [numthreads(blocksize,blocksizeY,1)]
           void CSMain( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID,
                        uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
           {
               float4 s = Input.Load( uint3((float)DTid.x/81.0f*g_param.z,
                                            (float)DTid.y/81.0f*g_param.w, 0) );
               accum[GI] = dot( s, LUM_VECTOR );
               uint stride = groupthreads/2;
               GroupMemoryBarrierWithGroupSync();
    31. CS 4.0 Example: HDR Tone Map Reduction

               // Each thread writes only to its own slot, as CS 4.x requires.
               // There is no barrier between the steps below, so this code relies
               // on the group's threads executing in lockstep; add
               // GroupMemoryBarrierWithGroupSync() between steps if that cannot
               // be assumed.
               if ( GI < stride )
                   accum[GI] += accum[stride+GI];
               if ( GI < 16 )
               {
                   accum[GI] += accum[16+GI];
                   accum[GI] += accum[8+GI];
                   accum[GI] += accum[4+GI];
                   accum[GI] += accum[2+GI];
                   accum[GI] += accum[1+GI];
               }
               if ( GI == 0 )
               {
                   Result[Gid.y*g_param.x+Gid.x] = accum[0];
               }
           }
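
The reduction pattern in the shader can be followed more easily as a CPU emulation in which each halving step is run to completion before the next one starts, which is the ordering a barrier would enforce on the GPU. This is an illustrative sketch, not the SDK sample: `luminance` and `reduceGroup` are hypothetical names, and the input is assumed to hold one value per thread with a power-of-two count (64 in the shader).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Luminance of an RGB sample, using the same weights as LUM_VECTOR.
float luminance(float r, float g, float b) {
    return 0.299f * r + 0.587f * g + 0.114f * b;
}

// Sum one thread group's values the way the shader does: each step halves
// the number of active "threads", and thread gi adds the element `stride`
// slots away into its own slot.
float reduceGroup(std::vector<float> accum) {   // by value: scratch copy
    const std::size_t n = accum.size();
    assert(n > 0 && (n & (n - 1)) == 0);        // power-of-two thread count
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t gi = 0; gi < stride; ++gi)
            accum[gi] += accum[gi + stride];
        // On the GPU, a group barrier at this point would make every write
        // from this step visible before the next step reads it.
    }
    return accum[0];                            // what thread 0 writes to Result
}
```

Six steps (strides 32, 16, 8, 4, 2, 1) reduce 64 values to one, matching the unrolled additions in the shader.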
    32. Comparison: CS 4.x vs CS 5.0
       • CS 4.x is great to have but CS 5.0 will be better
         • Better performance: D3D 11 hardware will be faster
         • Better Thread Group Shared Memory
           • More storage: 32 KB vs 16 KB
           • Better access: threads can write anywhere in TGSM, not just to their own offsets
         • Better interaction with graphics pipeline
           • Can output to textures (typed UAVs)
             • No need to draw a full screen quad
         • Better precision: double precision (optional)
         • Better synchronization: atomics
       • CS 4.x is still your best alternative on downlevel hardware
    33. New Direct3D 11 feature: Multithreading
       • Multithreaded Rendering
         • Render calls are now part of the “Immediate” context or the “Deferred” context
         • Immediate context calls get executed right away, just like D3D 9 and D3D 10 rendering
         • Deferred context calls are used for building “command lists”, i.e. display lists.
           • Draw calls and other rendering calls are recorded by the deferred context and stored in the command list
           • When the command list is finished, it can then be placed in the queue on the immediate context using the ExecuteCommandList() function
    34. New Direct3D 11 feature: Multithreading
       [Figure: Thread 1 runs the immediate context; Threads 2 and 3 run deferred contexts. Each context records DrawPrim calls, and the immediate context issues an Execute for each deferred command list.]
    35. Deferred Contexts
       • Deferred contexts are intended to run in separate threads
         • One immediate context on the main render thread
         • Multiple deferred contexts on worker threads
         • Running each deferred context in its own thread takes advantage of modern multi-core CPUs
         • Replaying command lists, like the traditional use of display lists in OpenGL, may not be the best use of this feature
         • There is some overhead with multiple contexts, so make sure you’re doing enough work in each context
         • Scale the number of deferred contexts (in threads) with the number of CPU cores.
    36. New Direct3D 11 feature: Multithreading
       • Multithreaded Resources
         • Resources can be created with the device interface in a separate thread, concurrent to a device context.
           • D3D 11 Device interface creation methods are free threaded
           • Create VBs, Textures, CBs, State, and Shaders while rendering in another thread.
         • Resources can be uploaded asynchronously as well
           • Concurrent with shader compilation
    37. You can do multithreading with today’s hardware.
       • Multithreading is implemented in the Direct3D 11 runtime
         • Independent of driver or hardware
         • Runtime will emulate features not supported by driver
         • Easy for testing and backwards compatibility!
       • Limitations on downlevel hardware
         • Concurrency is limited by driver support
           • Check for driver support using ID3D11Device::CheckFeatureSupport()
             • D3D11_FEATURE_DATA_THREADING
               • DriverConcurrentCreates, DriverCommandLists
    38. Comparison of Multithreading on D3D 11 vs Downlevel Hardware
       • This is primarily a performance issue
         • You may get some improvement with multithreading, even without driver/hardware support
           • See the latest Microsoft DirectX SDK sample
         • Multithreaded driver support will allow more concurrency = better performance
         • Direct3D 11 hardware will be faster
    39. New Direct3D 11 SM 5.0 Feature: Gather()
       • Fetches 4 point-sampled values in a single texture instruction
         • Better/faster shadow kernels
         • Optimized SSAO implementations
         • Can also select which components to sample:
           • GatherRed(), GatherGreen(), GatherBlue(), GatherAlpha()
         • Compare version which can be used for shadow mapping:
           • GatherCmp(), GatherCmpRed(), GatherCmpGreen(), GatherCmpBlue(), GatherCmpAlpha()
       [Figure: the four gathered texels, labeled X, Y, Z, W]
    40. You can use Gather() on today’s hardware
       • Gather() is part of Direct3D 11 SM 4.1
         • Works on all Direct3D 10.1 hardware
       • Limitations with SM 4.1 Gather()
         • Only works with single component formats
           • No ability to select which component to gather
         • Comparison form, GatherCmp(), is not supported
       • Still works great for custom shadow map kernels and SSAO, since the depth buffer is a single component.
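
Since GatherCmp() is not available with SM 4.1, a custom shadow kernel has to gather four depths and compare them itself. The sketch below emulates that on the CPU for a single-component depth texture. `DepthTexture`, `gather4` and `shadowFactor` are hypothetical names; the integer base texel, the clamp addressing, and the component order (x = lower-left, y = lower-right, z = upper-right, w = upper-left of the 2x2 footprint, with y increasing downward) are assumptions that should be checked against the HLSL documentation before relying on them.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Hypothetical single-component depth texture, row-major.
struct DepthTexture {
    int width, height;
    std::vector<float> texels;

    // Clamp addressing (an assumption made for this sketch).
    float at(int x, int y) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return texels[static_cast<std::size_t>(y * width + x)];
    }
};

// Emulated gather: the four texels of the 2x2 footprint whose upper-left
// texel is (x, y), in the assumed order
// { (x, y+1), (x+1, y+1), (x+1, y), (x, y) }.
std::array<float, 4> gather4(const DepthTexture& t, int x, int y) {
    return {t.at(x, y + 1), t.at(x + 1, y + 1), t.at(x + 1, y), t.at(x, y)};
}

// Manual 2x2 percentage-closer filter: the fraction of gathered depths that
// do not occlude a receiver at `receiverDepth`.
float shadowFactor(const DepthTexture& t, int x, int y, float receiverDepth) {
    const std::array<float, 4> d = gather4(t, x, y);
    int lit = 0;
    for (float depth : d)
        if (receiverDepth <= depth) ++lit;
    return static_cast<float>(lit) / 4.0f;
}
```

One gather plus four comparisons replaces four separate point samples, which is where the saving for shadow kernels comes from.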
    41. Strategies for Transitioning to Direct3D 11
       • Consider doing the port in stages
         • Use the HAL when you can
           • Software rendering isn’t fun
         • If you’re starting with D3D 9, the D3D 10 feature level should be your first target
         • First, get the engine working with the D3D 10.1 feature level before adding Direct3D 11 specific features
           • 10.1 is the highest level that will work with the HAL
         • Next, add new features on downlevel hardware where available
         • Finally, some new features will need to use the reference rasterizer without D3D 11 hardware
    42. 42. <ul><li>A simple port from D3D 9 to D3D 11 will not perform well </li></ul><ul><ul><li>Hopefully we’ve all learned this lesson from D3D 10 </li></ul></ul><ul><ul><li>Going from D3D 9 to device feature level 10 will be a big chunk of the work </li></ul></ul><ul><ul><ul><li>Very similar to the Direct3D 10 API </li></ul></ul></ul><ul><ul><li>Direct3D 10 fundamentals are still important </li></ul></ul><ul><ul><li>You can still use SM 3.0 for this stage </li></ul></ul>Starting with Direct3D 9
    43. 43. <ul><li>Constant Buffers </li></ul><ul><ul><li>Group constants into buffers by frequency of update </li></ul></ul><ul><ul><li>Remember: when one constant is updated, the whole buffer needs to get uploaded </li></ul></ul><ul><li>State Changes </li></ul><ul><ul><li>State objects are immutable for better performance </li></ul></ul><ul><ul><li>Initialize the state you need before you need it </li></ul></ul><ul><ul><li>Avoid creating lots of state objects on the fly </li></ul></ul><ul><li>Resources </li></ul><ul><ul><li>Resource creation and deletion is slow </li></ul></ul><ul><ul><li>Create most of your resources at the beginning </li></ul></ul>Direct3D 10 programming review
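The constant buffer advice can be illustrated with a small CPU-side model. This is a sketch under stated assumptions, not engine code: `PerFrameConstants`, `PerObjectConstants`, `ConstantBuffer` and `simulateFrame` are hypothetical names, and "upload" is just a byte count standing in for the real buffer update. It shows why splitting by update frequency helps: touching one value marks the whole buffer dirty, so per-object changes should not drag per-frame data along with them.

```cpp
#include <cassert>
#include <cstddef>

// Constants grouped by how often they change (layouts are illustrative).
struct PerFrameConstants {
    float viewProj[16];
    float sunDirection[4];
};
struct PerObjectConstants {
    float world[16];
};

// Hypothetical CPU-side shadow copy of one GPU constant buffer.
template <typename T>
struct ConstantBuffer {
    T data{};
    bool dirty = true;

    // Any edit marks the whole buffer dirty.
    T& edit() {
        dirty = true;
        return data;
    }

    // Returns the number of bytes "uploaded". The whole buffer is sent even
    // if only one constant changed.
    std::size_t flush() {
        if (!dirty) return 0;
        dirty = false;
        return sizeof(T);
    }
};

// Simulates one frame: per-frame data is uploaded once, per-object data once
// per draw. Returns the total bytes uploaded.
std::size_t simulateFrame(ConstantBuffer<PerFrameConstants>& perFrame,
                          ConstantBuffer<PerObjectConstants>& perObject,
                          int objectCount) {
    std::size_t uploaded = 0;
    perFrame.edit().sunDirection[1] = 1.0f;
    uploaded += perFrame.flush();
    for (int i = 0; i < objectCount; ++i) {
        perObject.edit().world[12] = static_cast<float>(i);
        uploaded += perObject.flush();
    }
    return uploaded;
}
```

Had all constants lived in one buffer, every draw in this model would re-send the per-frame matrix and light direction along with the world matrix.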
    44. 44. Direct3D 10 programming review <ul><li>Texture Updates </li></ul><ul><ul><li>Call Map() with the DO_NOT_WAIT flag to update staging textures, then CopyResource() to update the video memory texture </li></ul></ul><ul><ul><li>Do not use UpdateSubresource() – slow </li></ul></ul><ul><li>Batch Counts </li></ul><ul><ul><li>Keep batch counts low with instancing </li></ul></ul><ul><li>Alpha test is now done with clip()/discard() </li></ul><ul><ul><li>Don’t put this in every shader – it may disable early z! </li></ul></ul><ul><ul><li>Try to do the clip early to avoid unnecessary shader instructions </li></ul></ul>
    45. 45. <ul><li>Fairly easy port from Direct3D 10 to D3D 11 with 10 or 10.1 device feature level </li></ul><ul><ul><li>You can still use the HAL </li></ul></ul><ul><ul><li>Modify the existing Direct3D 10 code to use a Rendering Context </li></ul></ul><ul><ul><ul><li>You should only need the Immediate Context for now </li></ul></ul></ul><ul><ul><ul><li>Essentially just replacing API calls </li></ul></ul></ul><ul><ul><li>Get the simple port working first </li></ul></ul><ul><ul><li>You can still use your SM 4.0 or 4.1 shaders at this point in the process </li></ul></ul>Going from Direct3D 10 to Direct3D 11
    46. 46. <ul><li>Multithreading </li></ul><ul><ul><li>Requires changes to your rendering code </li></ul></ul><ul><ul><li>Add Windows multithreading support </li></ul></ul><ul><ul><li>Run deferred contexts in separate threads </li></ul></ul><ul><ul><ul><li>Need to break up your rendering workload into logical chunks </li></ul></ul></ul><ul><ul><ul><li>Parallelize the command list building to improve performance </li></ul></ul></ul><ul><ul><li>Fortunately the runtime will emulate this feature </li></ul></ul><ul><ul><ul><li>Performance improvements may not be fully realized until new drivers and new hardware are released. </li></ul></ul></ul>Adding in new Direct3D 11 features
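The chunking pattern can be sketched without any graphics API. In the portable C++ model below, each worker thread records its chunk into its own command list (the role a deferred context plays), and the calling thread then submits the lists in order (the role of the immediate context). `CommandList`, `recordChunk` and `renderFrame` are hypothetical names for this illustration; in Direct3D 11 the corresponding calls are `ID3D11Device::CreateDeferredContext`, `ID3D11DeviceContext::FinishCommandList` and `ID3D11DeviceContext::ExecuteCommandList`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in: a command list is a recorded sequence of draw names.
using CommandList = std::vector<std::string>;

// Records the draws for one chunk of the scene. This plays the role of a
// deferred context building a command list on a worker thread.
CommandList recordChunk(const std::vector<std::string>& objects,
                        std::size_t begin, std::size_t end) {
    CommandList list;
    for (std::size_t i = begin; i < end; ++i) {
        list.push_back("draw " + objects[i]);
    }
    return list;
}

// Builds one command list per worker in parallel, then submits them in chunk
// order on the calling thread. workerCount must be greater than zero.
std::vector<std::string> renderFrame(const std::vector<std::string>& objects,
                                     std::size_t workerCount) {
    std::vector<CommandList> lists(workerCount);
    std::vector<std::thread> workers;
    const std::size_t perChunk =
        (objects.size() + workerCount - 1) / workerCount;

    for (std::size_t w = 0; w < workerCount; ++w) {
        const std::size_t begin = std::min(w * perChunk, objects.size());
        const std::size_t end = std::min(begin + perChunk, objects.size());
        // Each thread writes only to its own slot, so no locking is needed.
        workers.emplace_back([&objects, &lists, w, begin, end] {
            lists[w] = recordChunk(objects, begin, end);
        });
    }
    for (std::thread& worker : workers) worker.join();

    // Submission stays single-threaded and ordered.
    std::vector<std::string> submitted;
    for (const CommandList& list : lists) {
        submitted.insert(submitted.end(), list.begin(), list.end());
    }
    return submitted;
}
```

The design point is that only recording is parallel; submission order is fixed by chunk index, so the frame renders the same regardless of which worker finishes first.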
    47. 47. Adding in new Direct3D 11 features <ul><li>Compute Shader </li></ul><ul><ul><li>Post Processing </li></ul></ul><ul><ul><ul><li>Replace your old pixel shader implementations with faster compute shader versions </li></ul></ul></ul><ul><ul><li>Use CS 4.x on current hardware </li></ul></ul><ul><ul><ul><li>Good for testing and backwards compatibility </li></ul></ul></ul><ul><li>Tessellation </li></ul><ul><ul><li>Prototype tessellation algorithms using the ATI tessellator on Direct3D 9 </li></ul></ul><ul><ul><li>Use instanced tessellation for Direct3D 11 on downlevel hardware </li></ul></ul><ul><ul><li>Consider how Tessellation will affect your art pipeline – better to prepare early </li></ul></ul>
    48. 48. <ul><li>Add new features that require Direct3D 11 hardware </li></ul><ul><ul><li>Not too difficult, since you’ve already done most of the work! </li></ul></ul><ul><ul><li>Tessellation </li></ul></ul><ul><ul><ul><li>Simplify your algorithms by using the hull shader </li></ul></ul></ul><ul><ul><li>Compute Shader </li></ul></ul><ul><ul><ul><li>Start using CS 5.0 </li></ul></ul></ul><ul><ul><ul><ul><li>More local storage, write anywhere, can output to textures </li></ul></ul></ul></ul><ul><ul><li>Multithreading </li></ul></ul><ul><ul><ul><li>Should automatically see improvements with new hardware and drivers </li></ul></ul></ul>Full Direct3D 11 Implementation
    49. 49. <ul><li>Direct3D 11 features will improve your game </li></ul><ul><ul><li>Multithreading, Compute Shader, Tessellation and more </li></ul></ul><ul><li>Current hardware will take you close to a full Direct3D 11 implementation </li></ul><ul><ul><li>Downlevel support is good for prototyping and for backwards compatibility </li></ul></ul><ul><li>Have your game ready to ship when Direct3D 11 ships </li></ul><ul><ul><li>Windows 7 and powerful new hardware will help spotlight your game! </li></ul></ul>There’s nothing stopping you from starting now
    50. 50. <ul><li>Johan Andersson, DICE – advice on porting to D3D 11 </li></ul><ul><li>Nicholas Thibieroz, AMD – Compute Shader </li></ul><ul><li>Holger Gruen, Efficient Tessellation on the GPU through Instancing, Journal of Game Development Volume 1, Issue 3, December 2005 </li></ul><ul><li>Tatarchuk, Barczak, Bilodeau, Programming for Real-Time Tessellation on GPU, 2009 AMD whitepaper on tessellation </li></ul><ul><li>Microsoft Corporation, DirectX 11 Software Development Kit, November, 2008 </li></ul>Acknowledgements
    51. 51. <ul><li>Trademark Attribution </li></ul><ul><li>AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. </li></ul><ul><li>©2008 Advanced Micro Devices, Inc. All rights reserved. </li></ul>Questions ? [email_address]