Windows to Reality: Getting the Most out of Direct3D 10 Graphics in Your Games


Published in: Technology, Business

  1. Windows to Reality: Getting the Most out of Direct3D 10 Graphics in Your Games
     Shanon Drone, Software Development Engineer, XNA Developer Connection, Microsoft
  2. Key areas
     • Debug Layer
     • Draw Calls
     • Constant Updates
     • State Management
     • Shader Linkage
     • Resource Updates
     • Dynamic Geometry
     • Porting Tips
  3. Debug Layer: Use it!
     • The D3D10 debug layer can help find performance issues
       • The app enables it by passing D3D10_CREATE_DEVICE_DEBUG into D3D10CreateDevice
     • Use the D3DX10 debug runtime
       • Link against D3DX10d.lib
     • Only do this for debug builds!
     • Look for performance warnings in the debug output
  4. Draw Calls
     • Draw calls are still “not free”
     • Draw overhead is reduced in D3D10
       • But not enough that you can be lazy
     • Efficiency in the number of draw calls will still give a performance win
  5. Draw Calls: Excess baggage
     • An increase in the number of draw calls generally increases the number of API calls associated with those draws
       • Constant Buffer updates
       • Resource changes (VBs, IBs, Textures)
       • Input Layout changes
     • These all have effects on performance that vary with draw call count
  6. Constant Updates
     • Updating shader constants was often a bottleneck in D3D9
     • It can still be a bottleneck in D3D10
     • The main difference between the two is the new Constant Buffer object in D3D10
     • This is the largest section of this talk
  7. Constant Updates: Constant Buffer Recap
     • Constant Buffers are buffer objects that hold shader constant data
     • They are updated by calling Map with D3D10_MAP_WRITE_DISCARD or by calling UpdateSubresource
     • There are 16 Constant Buffer slots available to each shader in the pipeline
       • Try not to use all 16, to leave some headroom
  8. Constant Updates: Porting Issues
     • D3D9 constants were updated individually by calling SetXXXXXShaderConstantX
     • In D3D10, you have to update the entire constant buffer all at once
     • A naïve port from D3D9 to D3D10 can have crippling performance implications if Constant Buffers are not handled correctly!
     • Rule of thumb: Do not update more data than you need to
  9. Constant Updates: Naïve Port, AKA how to cripple perf
     • Each shader uses one big constant buffer
     • Submitting one value submits them all!
     • If you have one 4096-byte Constant Buffer and you only need to update your World matrix, you will still have to update 4096 bytes of data and send it across the bus
     • Don’t do this!
  10. Constant Updates: Naïve Port, AKA how to cripple perf
      Scene: 100 skinned meshes (100 materials), 900 static meshes (400 materials), 1 shadow pass + 1 lighting pass

      One constant buffer holds everything (6560 bytes):

          cbuffer VSGlobalsCB
          {
              matrix ViewProj;
              matrix Bones[100];
              matrix World;
              float  SpecPower;
              float4 BDRFCoefficients;
              float  AppTime;
              uint2  RenderTargetSize;
          };

      Shadow Pass
        Update VSGlobalCB: 6560 bytes x 100 = 656,000 bytes
        Update VSGlobalCB: 6560 bytes x 900 = 5,904,000 bytes
      Light Pass
        Update VSGlobalCB: 6560 bytes x 100 = 656,000 bytes
        Update VSGlobalCB: 6560 bytes x 900 = 5,904,000 bytes
      Total = 13,120,000 bytes
  11. Constant Updates: Organize Constants
      • The first step is to organize constants by frequency of update
      • One shader will generally be used to draw several objects
      • Some data in this shader doesn’t need to be set for every draw
        • For example: Time, ViewProj matrices
      • Split these out into their own buffers
  12. The same constants, split into buffers by update frequency:

          cbuffer VSGlobalPerFrameCB      // 4 bytes
          {
              float AppTime;
          };
          cbuffer VSPerSkinnedCB          // 6400 bytes
          {
              matrix Bones[100];
          };
          cbuffer VSPerStaticCB           // 64 bytes
          {
              matrix World;
          };
          cbuffer VSPerPassCB             // 72 bytes
          {
              matrix ViewProj;
              uint2  RenderTargetSize;
          };
          cbuffer VSPerMaterialCB         // 20 bytes
          {
              float  SpecPower;
              float4 BDRFCoefficients;
          };

      Begin Frame
        Update VSGlobalPerFrameCB: 4 bytes x 1 = 4 bytes
        Update VSPerSkinnedCBs: 6400 bytes x 100 = 640,000 bytes
        Update VSPerStaticCBs: 64 bytes x 900 = 57,600 bytes
      Shadow Pass
        Update VSPerPassCB: 72 bytes x 1 = 72 bytes
      Light Pass
        Update VSPerPassCB: 72 bytes x 1 = 72 bytes
        Update VSPerMaterialCBs: 20 bytes x 500 = 10,000 bytes
      Total = 707,748 bytes
  13. Constant Updates: 13,120,000 bytes / 707,748 bytes ≈ 18x less constant data uploaded per frame
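
The byte counts on the two preceding slides can be reproduced with a few lines of arithmetic. The sketch below is only a check of those totals, using the scene and buffer sizes quoted on the slides; the function and constant names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Scene from the slides: 100 skinned + 900 static meshes, 500 materials,
// one shadow pass and one lighting pass.
constexpr std::uint64_t kSkinned = 100;
constexpr std::uint64_t kStatic = 900;
constexpr std::uint64_t kMaterials = 500;
constexpr std::uint64_t kPasses = 2;

// Naive port: the single 6560-byte buffer is re-sent for every mesh in every pass.
constexpr std::uint64_t naiveBytes() {
    return kPasses * (kSkinned + kStatic) * 6560;
}

// Organized: each buffer is sent once per element of its frequency group.
constexpr std::uint64_t organizedBytes() {
    return 4 * 1                // VSGlobalPerFrameCB, once per frame
         + 6400 * kSkinned      // VSPerSkinnedCB, once per skinned mesh
         + 64 * kStatic         // VSPerStaticCB, once per static mesh
         + 72 * kPasses         // VSPerPassCB, once per pass
         + 20 * kMaterials;     // VSPerMaterialCB, once per material
}

static_assert(naiveBytes() == 13120000, "matches the naive-port slide");
static_assert(organizedBytes() == 707748, "matches the organized slide");
```

Dividing the first total by the second gives a ratio of a little over 18, which is where the 18x figure comes from.
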
  14. Constant Updates: Managing Buffers
      • Constant buffers need to be managed in the application
      • Creating a few buffers that are used for all shader constants just won’t work
        • We update more data than necessary due to large buffers
  15. Constant Updates: Managing Buffers, Solution 1 (Fastest)
      • Create Constant Buffers that line up exactly with the number of elements of each frequency group
        • Global CBs
        • CBs per Mesh
        • CBs per Material
        • CBs per Pass
      • This ensures that EVERY constant buffer is no larger than it absolutely needs to be
      • This also ensures the most efficient update of CBs based upon frequency
  16. Constant Updates: Managing Buffers, Solution 2 (Second Best)
      • If you cannot create CBs that line up exactly with elements, you can create a tiered constant buffer system
      • Create arrays of 32-byte, 64-byte, 128-byte, 256-byte, etc. constant buffers
      • Keep a shadow copy of the constant data in system memory
      • When it comes time to render, select the smallest CB from the array that will hold the necessary constant data
      • May have to resubmit redundant data for separate passes
      • Hybrid approach?
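
The tier selection in Solution 2 can be sketched as below. This is a minimal illustration, assuming power-of-two tiers that start at 32 bytes as on the slide; the 4096-byte upper bound and the names `selectTierBytes` and `tierIndex` are assumptions made for the example, not part of the talk.

```cpp
#include <cassert>
#include <cstddef>

// Tier sizes follow the slide's 32/64/128/256... progression.
constexpr std::size_t kMinTierBytes = 32;
constexpr std::size_t kMaxTierBytes = 4096;  // assumed upper bound for this sketch

// Returns the size of the smallest tier that can hold `bytes` of constant data.
std::size_t selectTierBytes(std::size_t bytes) {
    assert(bytes > 0 && bytes <= kMaxTierBytes);
    std::size_t tier = kMinTierBytes;
    while (tier < bytes) {
        tier *= 2;
    }
    return tier;
}

// Index into an array of per-tier buffer pools (0 -> 32 bytes, 1 -> 64 bytes, ...).
std::size_t tierIndex(std::size_t bytes) {
    std::size_t index = 0;
    for (std::size_t tier = kMinTierBytes; tier < bytes; tier *= 2) {
        ++index;
    }
    return index;
}
```

With these tiers a 72-byte per-pass block lands in the 128-byte tier, so some padding is uploaded; that is the cost of this approach compared with Solution 1.
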
  17. Constant Updates: Case Study, Skinning using Solution 1
      • Skinning in D3D9 (or a bad D3D10 port)
        • Multiple passes cause redundant bone data uploads to the GPU
      • Skinning in D3D10
        • Using Constant Buffers we only need to upload it once
  18. Constant Updates: D3D9 Version / or Naïve D3D10 Version
      Pass1
        Set Mesh1 Bones
        Draw Mesh1
        Set Mesh2 Bones
        Draw Mesh2
      Pass2
        Set Mesh1 Bones
        Draw Mesh1
        Set Mesh2 Bones
        Draw Mesh2
      Constant Data: a single block (Bone0, Bone1, Bone2, Bone3, Bone4, … BoneN) that holds Mesh1's bones, then Mesh2's bones, and is rewritten by every Set call
  19. Constant Updates: Preferred D3D10 Version
      Frame Start
        Update Mesh1 CB
        Update Mesh2 CB
      Pass1
        Bind Mesh1 CB
        Draw Mesh1
        Bind Mesh2 CB
        Draw Mesh2
      Pass2
        Bind Mesh1 CB
        Draw Mesh1
        Bind Mesh2 CB
        Draw Mesh2
      Mesh1 CB: Mesh1 Bone0, Bone1, Bone2, Bone3, Bone4, … BoneN
      Mesh2 CB: Mesh2 Bone0, Bone1, Bone2, Bone3, Bone4, … BoneN
      CB Slot 0: bound to Mesh1 CB or Mesh2 CB as needed
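
To make the difference between the two skinning diagrams concrete, here is a small counting sketch. `CountingDevice` is a stand-in invented for this example (it is not a D3D10 type); it only tallies how many bytes would be uploaded and how many binds would be issued.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a device that only counts what is sent to it (hypothetical type).
struct CountingDevice {
    std::size_t bytesUploaded = 0;
    std::size_t binds = 0;
    void upload(std::size_t bytes) { bytesUploaded += bytes; }
    void bind() { ++binds; }
};

constexpr std::size_t kBoneBytes = 100 * 64;  // matrix Bones[100] from the slides

// D3D9 style / naive port: bone data is re-sent before every draw in every pass.
void drawNaive(CountingDevice& dev, std::size_t meshes, std::size_t passes) {
    for (std::size_t p = 0; p < passes; ++p) {
        for (std::size_t m = 0; m < meshes; ++m) {
            dev.upload(kBoneBytes);  // Set MeshN Bones, then draw
        }
    }
}

// Preferred: one CB per mesh, filled once at frame start, then only bound per pass.
void drawPreferred(CountingDevice& dev, std::size_t meshes, std::size_t passes) {
    for (std::size_t m = 0; m < meshes; ++m) {
        dev.upload(kBoneBytes);      // Update MeshN CB
    }
    for (std::size_t p = 0; p < passes; ++p) {
        for (std::size_t m = 0; m < meshes; ++m) {
            dev.bind();              // Bind MeshN CB, then draw
        }
    }
}
```

For two meshes and two passes the naive path uploads the bone data four times and the preferred path twice; the gap grows with every extra pass.
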
  20. Constant Updates: Advanced D3D10 Version
      • Why not store all of our characters’ bones in a 128-bit FP texture?
      • We can upload bones for all visible characters at the start of a frame
      • We can draw similar characters using instancing instead of individual draws
        • Use SV_InstanceID to select the start of the character’s bone data in the texture
      • Stream the skinned meshes to memory using Stream Output and render all subsequent passes from the post-skinned buffer
  21. State Management
      • Individual state setting is no longer possible in D3D10
      • State in D3D10 is stored in state objects
      • These state objects are immutable
      • Changing even one aspect of a state object requires that you create an entirely new state object with that one change
  22. State Management: Managing State Objects, Solution 1 (Fastest)
      • If you have a known set of materials and required states, you can create all state objects at load time
      • State objects are small and there is a finite set of permutations
      • With all state objects created up front, all that needs to be done during rendering is to bind the object
  23. State Management: Managing State Objects, Solution 2 (Second Best)
      • If your content is not finalized, or if you CANNOT get your engine to lump state together
      • Create a state object hash table
        • Hash off of the setting that has the most unique states
      • Grab pre-created states from the hash table
      • Why not give your tools pipeline the ability to do this for a level and save out the results?
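
A state object hash table along the lines of Solution 2 might look like the sketch below. It uses stand-in types only: `RasterDesc` is a trimmed-down, hypothetical description and `StateObject` stands in for an immutable D3D10 state object; in a real engine the miss path would call the device's state creation function instead of `new`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>

// Hypothetical, trimmed-down state description used as the hash key.
struct RasterDesc {
    int fillMode;
    int cullMode;
    bool scissorEnable;
    bool operator==(const RasterDesc& o) const {
        return fillMode == o.fillMode && cullMode == o.cullMode &&
               scissorEnable == o.scissorEnable;
    }
};

struct RasterDescHash {
    std::size_t operator()(const RasterDesc& d) const {
        std::size_t h = std::hash<int>()(d.fillMode);
        h = h * 31 + std::hash<int>()(d.cullMode);
        h = h * 31 + std::hash<bool>()(d.scissorEnable);
        return h;
    }
};

// Stand-in for an immutable state object.
struct StateObject {
    RasterDesc desc;
};

class StateCache {
public:
    // Returns the cached object for `desc`, creating it only on the first request.
    const StateObject* get(const RasterDesc& desc) {
        auto it = cache_.find(desc);
        if (it != cache_.end()) {
            return it->second.get();
        }
        ++created_;
        auto inserted = cache_.emplace(
            desc, std::unique_ptr<StateObject>(new StateObject{desc}));
        return inserted.first->second.get();
    }
    std::size_t created() const { return created_; }

private:
    std::unordered_map<RasterDesc, std::unique_ptr<StateObject>, RasterDescHash> cache_;
    std::size_t created_ = 0;
};
```

Requesting the same description twice returns the same pointer and creates only one object, which is the property the renderer relies on. The contents of such a cache could also be written out by a tools pipeline and pre-populated at load time, as the slide suggests.
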
  24. Shader Linkage
      • D3D9 shader linkage was based off of semantics (POSITION, NORMAL, TEXCOORDN)
      • D3D10 linkage is based off of offsets and sizes
      • This means stricter linkage rules
      • This also means that the driver doesn’t have to link shaders together at every draw call!
  25. Shader Linkage: No Holes Allowed!
      • Elements must be read in the order they are output from the previous stage
      • Cannot have “holes” between linkages
      • Holes at the end are OK

          struct VS_OUTPUT
          {
              float3 Norm : NORMAL;
              float2 Tex  : TEXCOORD0;
              float2 Tex2 : TEXCOORD1;
              float4 Pos  : SV_POSITION;
          };

          struct PS_INPUT
          {
              float3 Norm : NORMAL;
              float2 Tex  : TEXCOORD0;
              float2 Tex2 : TEXCOORD1;
          };
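
The "no holes" rule can be modeled as a prefix check: the consuming stage must read the leading elements of the producing stage's output, in order, and may only drop elements from the end. The sketch below is a deliberately simplified model of that rule as the slide states it, not the runtime's actual validation; `Element` and `linksWithoutHoles` are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One element of a stage signature: semantic name plus number of components.
struct Element {
    std::string semantic;
    int components;
};

// True if `input` matches the leading elements of `output`, in the same order.
// Elements left unread at the end of `output` are allowed.
bool linksWithoutHoles(const std::vector<Element>& output,
                       const std::vector<Element>& input) {
    if (input.size() > output.size()) {
        return false;
    }
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (input[i].semantic != output[i].semantic ||
            input[i].components != output[i].components) {
            return false;
        }
    }
    return true;
}
```

A pixel shader input of NORMAL, TEXCOORD0, TEXCOORD1 links against the VS_OUTPUT layout shown on the slide, while an input that skips TEXCOORD0 does not.
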
  26. Shader Linkage: Input Assembler to Vertex Shader
      • Input Layouts define the signature of the vertex stream data
      • Input Layouts are similar to Vertex Declarations in D3D9
        • Strict linkage rules are a big difference
      • Creating Input Layouts on the fly is not recommended
      • CreateInputLayout requires a shader signature to validate against
  27. Shader Linkage: Input Assembler to Vertex Shader, Solution 1 (Fastest)
      • Create an Input Layout for each unique Vertex Stream / Vertex Shader combination up front
      • Input Layouts are small
      • This assumes that the shader input signature is available when you call CreateInputLayout
      • Try to normalize Input Layouts across a level, or make them art directed
  28. Shader Linkage: Input Assembler to Vertex Shader, Solution 2 (Second Best)
      • If you load meshes and create input layouts before loading shaders, you might have a problem
      • You can use a hashing scheme similar to the one used for State Objects
      • When the Input Layout is needed, search the hash for an Input Layout that matches the Vertex Stream and Vertex Shader signature
      • Why not store this data to a file and pre-populate the Input Layouts after your content is tuned?
  29. Shader Linkage: Aside, Instancing
      • Instancing is a first-class citizen on D3D10!
      • Stream source frequency is now part of the Input Layout
      • Multiple frequencies will mean multiple Input Layouts
  30. Resource Updates
      • Updating resources is different in D3D10
      • The Create / Lock / Fill / Unlock paradigm is no longer necessary (although you can still do it)
      • Texture data can be passed into the texture at create time
  31. Resource Updates: Resource Usage Types
      • D3D10_USAGE_DEFAULT
      • D3D10_USAGE_IMMUTABLE
      • D3D10_USAGE_DYNAMIC
      • D3D10_USAGE_STAGING
  32. Resource Updates: D3D10_USAGE_DEFAULT
      • Use for resources that need fast GPU read and write access
      • Can only be updated using UpdateSubresource
      • Render targets are good candidates
      • Textures that are updated infrequently (less than once per frame) are good candidates
  33. Resource Updates: D3D10_USAGE_IMMUTABLE
      • Use for resources that need fast GPU read access only
      • Once they are created, they cannot be updated... ever
      • Initial data must be passed in during the creation call
      • Resources that will never change (static textures, VBs / IBs) are good candidates
      • Don’t bend over backwards trying to make everything D3D10_USAGE_IMMUTABLE
  34. Resource Updates: D3D10_USAGE_DYNAMIC
      • Use for resources that need fast CPU write access (at the expense of slower GPU read access)
      • No CPU read access
      • Can only be updated using Map with:
        • D3D10_MAP_WRITE_DISCARD
        • D3D10_MAP_WRITE_NO_OVERWRITE
      • Dynamic Vertex Buffers are good candidates
      • Dynamic (> once per frame) textures are good candidates
  35. Resource Updates: D3D10_USAGE_STAGING
      • This is the only way to read data back from the GPU
      • Can only be updated using Map
        • Cannot map with D3D10_MAP_WRITE_DISCARD or D3D10_MAP_WRITE_NO_OVERWRITE
      • Might want to double buffer to keep from stalling the GPU
      • The GPU cannot directly use these
  36. Resource Updates: Summary
      • CPU updates the resource frequently (more than once per frame)
        • Use D3D10_USAGE_DYNAMIC
      • CPU updates the resource infrequently (once per frame or less)
        • Use D3D10_USAGE_DEFAULT
      • CPU doesn’t update the resource
        • Use D3D10_USAGE_IMMUTABLE
      • CPU needs to read the resource
        • Use D3D10_USAGE_STAGING
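
The summary above amounts to a small decision function. The sketch below encodes it with a local `Usage` enum that mirrors the D3D10_USAGE_* names (it is not the real enumeration), and `chooseUsage` is a hypothetical helper. Constant buffers are a separate case and are not covered by this rule.

```cpp
#include <cassert>

// Local stand-in mirroring D3D10_USAGE_DEFAULT / IMMUTABLE / DYNAMIC / STAGING.
enum class Usage { Default, Immutable, Dynamic, Staging };

// Picks a usage from the CPU access pattern, following the summary slide.
// `cpuUpdatesPerFrame` is an average; 0 means the CPU never updates the resource.
Usage chooseUsage(bool cpuReads, double cpuUpdatesPerFrame) {
    if (cpuReads) {
        return Usage::Staging;    // CPU needs to read the resource
    }
    if (cpuUpdatesPerFrame <= 0.0) {
        return Usage::Immutable;  // CPU doesn't update the resource
    }
    if (cpuUpdatesPerFrame > 1.0) {
        return Usage::Dynamic;    // more than once per frame
    }
    return Usage::Default;        // once per frame or less
}
```
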
  37. Resource Updates: Example, Vertex Buffer
      • The vertex buffer is touched by the CPU less than once per frame
        • Create it with D3D10_USAGE_DEFAULT
        • Update it with UpdateSubresource
      • The vertex buffer is used for dynamic geometry and the CPU needs to update it multiple times per frame
        • Create it with D3D10_USAGE_DYNAMIC
        • Update it with Map
  38. Resource Updates: The Exception, Constant Buffers
      • CBs are always expected to be updated frequently
      • Select CB usage based upon which one causes the least amount of system memory to be transferred
        • Not just to the GPU, but system-to-system memory copies as well
  39. Resource Updates: UpdateSubresource
      • UpdateSubresource requires a system memory buffer and incurs an extra copy
      • Use it if you have system copies of your constant data already in one place
  40. Resource Updates: Map
      • Map requires no extra system memory but may hit driver renaming limits if abused
      • Use it if compositing values on the fly or collecting values from other places
  41. Resource Updates: A note on overusing discard
      • Use D3D10_MAP_WRITE_DISCARD carefully with buffers!
      • D3D10_MAP_WRITE_DISCARD tells the driver to give us a new memory buffer if the current one is busy
      • There is a LIMITED set of temporary buffers
      • If these run out, then your app will stall until another buffer can be freed
      • This can happen if you do dynamic geometry using one VB and D3D10_MAP_WRITE_DISCARD
  42. Dynamic Geometry
      • DrawIndexedPrimitiveUP is gone!
      • DrawPrimitiveUP is gone!
      • Your well-behaved D3D9 app isn’t using these anyway, right?
  43. Dynamic Geometry: Solution, Same as in D3D9
      • Use one large buffer, and map it with D3D10_MAP_WRITE_NO_OVERWRITE
      • Advance the write position with every draw
        • Wrap to the beginning
      • Make sure your buffer is large enough that you’re not overwriting data that the GPU is reading
      • This is what happens under the covers for D3D9 when using DIPUP or DUP in Windows Vista
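
The bookkeeping for the large-buffer approach can be sketched as a simple ring allocator. Only the offset arithmetic is shown; `DynamicBufferRing` and `Allocation` are names made up for this example, and the actual Map call on the vertex buffer is left to the caller. The `wrapped` flag marks the point where the slide's warning applies: the caller must be sure the GPU is done with the start of the buffer. Mapping with discard on wrap instead is a common variant, but that is an assumption of this note rather than something the talk prescribes.

```cpp
#include <cassert>
#include <cstddef>

// Where the next batch of dynamic geometry should be written.
struct Allocation {
    std::size_t offset;  // byte offset into the large buffer
    bool wrapped;        // true when the write position returned to the start
};

class DynamicBufferRing {
public:
    explicit DynamicBufferRing(std::size_t capacityBytes)
        : capacity_(capacityBytes), writePos_(0) {}

    // Reserves `bytes` at the current write position, wrapping when they don't fit.
    Allocation allocate(std::size_t bytes) {
        assert(bytes <= capacity_);
        if (writePos_ + bytes > capacity_) {
            writePos_ = bytes;          // restart at the beginning
            return Allocation{0, true};
        }
        Allocation result{writePos_, false};
        writePos_ += bytes;             // advance the write position with every draw
        return result;
    }

private:
    std::size_t capacity_;
    std::size_t writePos_;
};
```

A caller would map the buffer with D3D10_MAP_WRITE_NO_OVERWRITE, copy the vertices at the returned offset, and draw starting from that offset.
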
  44. Porting Tips
      • StretchRect is gone
        • Work around using render-to-texture
      • A8R8G8B8 formats have been replaced with R8G8B8A8 formats
        • Swizzle on texture load or swizzle in the shader
      • Fixed-function AlphaTest is gone
        • Add logic to the shader and call discard
      • Fixed-function Fog is gone
        • Add it to the shader
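
Swizzling on texture load comes down to swapping the red and blue bytes of every pixel. The sketch below assumes tightly packed 32-bit pixels with no row padding: an A8R8G8B8 pixel is stored in memory as B, G, R, A on little-endian machines, while an R8G8B8A8 pixel is stored as R, G, B, A. The function name is invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Converts pixels in place from A8R8G8B8 memory order (B, G, R, A)
// to R8G8B8A8 memory order (R, G, B, A) by swapping bytes 0 and 2.
void swizzleBgraToRgba(std::uint8_t* pixels, std::size_t pixelCount) {
    for (std::size_t i = 0; i < pixelCount; ++i) {
        std::uint8_t* p = pixels + i * 4;
        std::swap(p[0], p[2]);  // green and alpha stay where they are
    }
}
```

Doing this once at load time (or in the content pipeline) avoids paying for a swizzle in every shader that samples the texture.
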
  45. Porting Tips (continued)
      • User Clip Planes usage has changed
        • They’ve moved to the shader
        • Experiment with the SV_ClipDistance semantic vs. discard in the PS to determine which is faster for your shader
      • Query data sizes might have changed
        • Occlusion queries are UINT64 vs. DWORD
      • No Triangle Fan support
        • Work around in the content pipeline or on load
      • SetCursorProperties and ShowCursor are gone
        • Use Win32 APIs to handle cursors now
  46. Porting Tips (continued)
      • No offsets on Map calls
        • This was basically API clutter in D3D9
        • Calculate the offset from the returned pointer
      • Clears are no longer bound to pipeline state
        • If you want a clear call to respect scissor, stencil, or other state, draw a full-screen quad
        • This is closer to the HW
        • The Driver/HW has been doing this for you for years
      • OMSetBlendState
        • Never set the SampleMask to 0 in OMSetBlendState
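
Since Map no longer takes an offset, the application does the pointer arithmetic itself. In the sketch below, `mapped` stands in for the pointer returned by Map and `writeAtOffset` is a hypothetical helper; keeping the offset inside the mapped range is the caller's responsibility.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `bytes` from `src` into the mapped memory, starting `offsetBytes` in.
void writeAtOffset(void* mapped, std::size_t offsetBytes,
                   const void* src, std::size_t bytes) {
    unsigned char* dst = static_cast<unsigned char*>(mapped) + offsetBytes;
    std::memcpy(dst, src, bytes);
}
```
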
  47. Porting Tips (continued)
      • Input Layout conversions tightened up
        • D3DDECLTYPE_UBYTE4 in the vertex stream could be converted to a float4 in the VS in D3D9
        • i.e., 255u in the stream would show up as 255.0 in the VS
        • In D3D10 you get either a normalized [0..1] value or a 255 (u)int
      • Register keyword
        • It doesn’t mean the same thing in D3D10
        • Use register to determine which CB slot a CB binds to
        • Use packoffset to place a variable inside a CB
  48. Porting Tips (continued)
      • Sampler and Texture bindings
        • Samplers can be bound independently of textures
        • This is very flexible!
        • Sampler and Texture slots are not always the same
      • Register Packing
        • In D3D9 all variables took up at least one float4 register (even if you only used a single float!)
        • In D3D10 variables are packed together
        • This saves a lot of space
        • Make sure your engine doesn’t do everything based upon register offsets or your variables might alias
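
To see why register-offset assumptions break, the sketch below computes byte offsets for a run of float scalars and vectors under the basic packing rule: 4-byte components are packed into 16-byte registers, and a variable may not straddle a register boundary. It models only that rule (no arrays, matrices, or structs), and `packOffsets` is a name invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the byte offset of each variable, given its component count (1..4).
std::vector<std::size_t> packOffsets(const std::vector<int>& componentCounts) {
    std::vector<std::size_t> offsets;
    std::size_t cursor = 0;
    for (int count : componentCounts) {
        assert(count >= 1 && count <= 4);
        const std::size_t bytes = 4 * static_cast<std::size_t>(count);
        const std::size_t usedInRegister = cursor % 16;
        if (usedInRegister + bytes > 16) {
            cursor += 16 - usedInRegister;  // start the variable in the next register
        }
        offsets.push_back(cursor);
        cursor += bytes;
    }
    return offsets;
}
```

For a float, a float2, a float3, and a float declared in that order, this gives offsets 0, 4, 16, and 28, where one float4 register per variable would have put them at 0, 16, 32, and 48. Code that addresses constants by register index would read the wrong data after such a port.
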
  49. Porting Tips (continued)
      • D3DSAMP_SRGBTEXTURE
        • This sampler state setting does not exist on D3D10
        • Instead it’s included in the texture format
        • This is more like the Xbox 360
      • Consider re-optimizing resource usage and upload for better D3D10 performance
        • But use D3D10_USAGE_DEFAULT resources and UpdateSubresource as a baseline
  50. Summary
      • Use the debug runtime!
      • More draw calls usually means more constant updating and state changing calls
      • Be frugal with constant updates
        • Avoid resubmitting redundant data!
      • Create as much state and input layout information up front as possible
      • Select D3D10_USAGE for resources based upon the CPU access patterns needed
      • Use D3D10_MAP_WRITE_NO_OVERWRITE and a big buffer as a replacement for DIPUP and DUP
  51. Call to Action
      • Actually exploit D3D10!
      • This talk tells you how to get performance gains from a straight port
      • You can get a whole lot more by using D3D10’s advanced features!
        • StreamOut to minimize skinning costs
        • First-class instancing support
        • Store some vertex data in textures
        • Move some systems to the GPU (Particles?)
        • Aggressive use of Constant Buffers
  52. © 2007 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.